Darkassassin07@lemmy.ca to

Selfhosted@lemmy.worldEnglish · 1 day ago

Searching through a bulk of pdf files

31

Searching through a bulk of pdf files

Darkassassin07@lemmy.ca to

Selfhosted@lemmy.worldEnglish · 1 day ago

I have a pile of part lists for tools I’m maintaining, in pdf format; and I’m looking for a good way to take a part number, search through the collection of pdfs, and output which files contain that number. Essentially letting me match random unknown part numbers to a tool in our fleet.

I’m pretty sure the majority of them are actual text you can select and copy+paste, so searching those shouldn’t be too difficult; but I do know there’s at least a couple in there that are just a string of jpgs packed in a pdf file. They will probably need OCR, but tbh I can probably live with skipping over those altogether.

I’ve been thinking of spinning up an instance of paperless-ngx and stuffing them all in there so I can let it index the contents including using OCR, then use it’s search feature; but that also seems a tad overkill.

I’m wondering if you fine folks have any better ideas. What do you think?

Chat

MysteriousSophon21@lemmy.world
link
fedilink
English
arrow-up
3·
23 hours ago
You might want to check out Docspell - it’s lighter than paperless-ngx but still handles PDF indexing and searching realy well, plus it can do basic OCR on those image-based PDFs without much setup.

Selfhosted@lemmy.world

selfhosted@lemmy.world

You are not logged in. However you can subscribe from another Fediverse account, for example Lemmy or Mastodon. To do this, paste the following into the search field of your instance: !selfhosted@lemmy.world

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don’t control.

Rules:

Be civil: we’re here to support and learn from one another. Insults won’t be tolerated. Flame wars are frowned upon.
No spam posting.
Posts have to be centered around self-hosting. There are other communities for discussing hardware or home computing. If it’s not obvious why your post topic revolves around selfhosting, please include details to make it clear.
Don’t duplicate the full text of your blog or github here. Just post the link for folks to click.
Submission headline should match the article title (don’t cherry-pick information from the title to fit your agenda).
No trolling.

Resources:

selfh.st Newsletter and index of selfhosted software and apps
awesome-selfhosted software
awesome-sysadmin resources
Self-Hosted Podcast from Jupiter Broadcasting

Any issues on the community? Report it using the report flag.

Questions? DM the mods!

Visibility: Public

This community can be federated to other instances and be posted/commented in by their users.

247 users / day
718 users / week
747 users / month
768 users / 6 months
1 local subscriber
50.6K subscribers
70 Posts
395 Comments
Modlog