

In case you are already using ripgrep (rg) instead of grep, there is also ripgrep-all (rga), which lets you search through a whole bunch of non-plain-text files like PDFs. It caches the extracted text, so the first pass over your files takes a moment, but any further search is lightning fast.
It supports a whole truckload of file types (pdf, odt, xlsx, tar.gz, mp4, and so on), but I mostly used it to quickly search through thousands of research papers. The first run over my 4000 PDFs takes around 5 minutes, then it’s smooth sailing for any further searches.
For the OCR process you can probably wrangle up a simple bash pipeline with ocrmypdf and just let it run in the background once until all your PDFs have a text layer.
With that tool it should be doable with something like a simple while loop:
    find . -type f -name '*.pdf' -print0 | while IFS= read -r -d '' file; do
        echo "Processing $file ..."
        ocrmypdf "$file" "$file"
        # ocrmypdf "$file" "${file%.pdf}_ocr.pdf"  # if you want a new file instead of overwriting the old
    done
If you need additional languages or other options you’ll have to delve a little deeper into the ocrmypdf documentation but this should be enough duct tape to just whip up a full OCR cycle.
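If you expect to interrupt the run and pick it up again later, a slightly longer variant might look like the sketch below. It is only a sketch: `ocr_tree` and the `OCR_CMD` override are names I made up for illustration (they are not part of ocrmypdf), it assumes bash, and it assumes you want a separate `*_ocr.pdf` next to each original rather than overwriting in place. Anything after the directory is handed to ocrmypdf unchanged.

```shell
#!/usr/bin/env bash
# Sketch: OCR every PDF under a directory into a sibling "*_ocr.pdf" file.
# Files that already have such a sibling are skipped, so the run can be resumed.

# OCR_CMD is a hypothetical override hook (handy for dry runs); defaults to ocrmypdf.
OCR_CMD="${OCR_CMD:-ocrmypdf}"

# Usage: ocr_tree [DIR] [extra options passed through to the OCR command...]
ocr_tree() {
    local root="${1:-.}" file out
    if [ "$#" -gt 0 ]; then shift; fi

    # Exclude earlier outputs so they are not fed back in on a second run.
    while IFS= read -r -d '' file; do
        out="${file%.pdf}_ocr.pdf"
        if [ -e "$out" ]; then
            echo "Skipping $file (found $out)"
            continue
        fi
        echo "Processing $file ..."
        # stdin is redirected so the command cannot swallow the file list;
        # a failure on one file is reported and the loop moves on.
        "$OCR_CMD" "$@" "$file" "$out" < /dev/null || echo "Failed: $file" >&2
    done < <(find "$root" -type f -name '*.pdf' ! -name '*_ocr.pdf' -print0)
}
```

You would call it as something like `ocr_tree ~/papers -l eng+deu` if you need more than one language (as far as I know `-l` is ocrmypdf's language option, but double-check with `ocrmypdf --help` for your version), and then `nohup` or `tmux` it if you want it chugging along in the background.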