Cake day: November 23rd, 2024

  • For the OCR process you can probably wrangle up a simple bash pipeline with ocrmypdf and just let it run in the background once until all your PDFs have a text layer.

    With that tool it should be doable with something like a simple while loop:

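    # NUL-delimited names keep paths with spaces or newlines intact
    find . -type f -name '*.pdf' -print0 |
        while IFS= read -r -d '' file; do
            echo "Processing $file ..."
            # Overwrites in place; add --skip-text to skip pages that already have a text layer
            ocrmypdf "$file" "$file"
            # ocrmypdf "$file" "${file%.pdf}_ocr.pdf"   # if you want a new file instead of overwriting the old
        done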
    

    If you need additional languages or other options, you’ll have to delve a little deeper into the ocrmypdf documentation, but this should be enough duct tape to whip up a full OCR cycle.
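
    As a rough sketch of the "additional languages" case: ocrmypdf takes Tesseract language codes via `-l`, joined with `+`, and the matching Tesseract language packs have to be installed first. The function name `ocr_tree` and the `OCR_CMD` override are made up for this example (they are not part of ocrmypdf); `OCR_CMD=echo` just gives you a dry run that prints the arguments instead of running OCR.

    ```shell
    #!/usr/bin/env bash
    # Hypothetical helper: OCR every PDF under a directory with explicit languages.
    # Usage: ocr_tree <dir> <langs>, e.g. ocr_tree ~/scans eng+deu
    ocr_tree() {
        local dir="${1:-.}" langs="${2:-eng}"
        find "$dir" -type f -name '*.pdf' -print0 |
            while IFS= read -r -d '' file; do
                # OCR_CMD is an assumption for dry runs; it defaults to the real tool.
                # --skip-text leaves pages that already contain text untouched.
                # < /dev/null keeps the command from consuming find's output.
                ${OCR_CMD:-ocrmypdf} -l "$langs" --skip-text "$file" "$file" < /dev/null
            done
    }
    ```

    Try `OCR_CMD=echo ocr_tree . eng+deu` first to see which files it would touch, then drop the override for the real run.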


  • In case you are already using ripgrep (rg) instead of grep, there is also ripgrep-all (rga), which lets you search through a whole bunch of file types like PDFs quickly. And it’s cached, so while the first indexing takes a moment, any further search is lightning fast.

    It supports a whole truckload of file types (pdf, odt, xlsx, tar.gz, mp4, and so on), but I mostly used it to quickly search through thousands of research papers. It takes around 5 minutes to index everything for my 4000 PDFs on the first run, then it’s smooth sailing for any further searches from there.
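
    A minimal sketch of what a search could look like, assuming rga hands ripgrep's usual flags through to rg (that is how I understand it to work, but check `rga --help` on your version). The wrapper name `search_papers` and the `RGA_CMD` override are invented for illustration only.

    ```shell
    #!/usr/bin/env bash
    # Hypothetical wrapper: list files in a paper collection that mention a phrase.
    # Usage: search_papers <pattern> [dir]
    search_papers() {
        local pattern="$1" dir="${2:-.}"
        # --ignore-case and --files-with-matches are ripgrep flags; the
        # assumption here is that rga forwards them unchanged.
        # RGA_CMD is an assumption for dry runs; it defaults to the real tool.
        ${RGA_CMD:-rga} --ignore-case --files-with-matches "$pattern" "$dir"
    }
    ```

    Something like `search_papers 'gradient descent' ~/papers` would then print just the names of the matching files; the first call is the slow one, later calls hit the cache.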