Fast Tip Friday – Identify Unsearchable PDFs Within Folders

This fast tip demonstrates how to use a free tool called Count Anything to identify which PDF files are unsearchable.

Sean O'Shea shared this tip in his article entitled Where Are My Unsearchable PDFs.

Source: Sean O'Shea

7 Comments

  1. Good one. It has a command-line facility as well, which means you could set up a kind of automated process (identifying and then OCRing) if you wanted to.

    1. This command-line idea has instant utility! I would be interested in seeing it work. Would it depend on a command prompt script? First, a script calls the Count Anything app to identify documents in a selection, then saves a delimited text version in a specified location. Next, we set up the Excel spreadsheet and identify the ones we want to OCR.

      Another command prompt script would then take the selected filenames and OCR those documents in the background. In theory I could also use VBA to do this once the filenames are in Excel. This solves a major issue I’m having with OCRing PDFs: the problem of OCRing each one individually.

      I’d like to discover a way to OCR all the PDFs in the background. Right now, my database identifies the documents that need OCR, but it occupies my machine’s memory/processes to do so.

      Thanks for sharing Sean’s find.

      1. What I was thinking was:
        1) you have a tranche of PDFs, and you make a file/folder listing
        2) manipulate the file/folder listing so that you call up Count Anything for each file – say, for example, you have 2GB or 1,000 files; I would sort this into, say, 4x 500MB chunks (your file listing would need to include the file size, and you’d want an Excel formula (or something) to work out a cumulative size)
        3) initiate each instance of the 4 command-line scripts (i.e. batch files) simultaneously – where each outputs to a uniquely named file
        4) parse the output – I’m not a VBA wiz so this would be a manual step; however, I think there is a way to parse a text file for a specific string of text where you want to know which files don’t contain text – i.e. the count is zero, or whatever the output is
        5) copy those [naughty] files that require OCRing to a ‘hot folder’ or somewhere
        6) if you have something like ABBYY or an equivalent which monitors a hot folder then it would automatically OCR the contents of the hot folder
        7) alternatively, use your OCR weapon of choice – I’d also be looking at load balancing by cumulative file size, as otherwise one machine finishes its share well before the others
        8) go get a coffee and reflect on the bad old days of how miserable you were with OCRing manually
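Steps 4 and 5 above could be sketched in Python rather than VBA. This is only a rough sketch under stated assumptions: it assumes the report has been saved as a tab-delimited text file with a header row, and the column names `File` and `Words` are hypothetical placeholders, so adjust them to whatever the actual output contains. The function names are also made up for illustration.

```python
import csv
import shutil
from pathlib import Path


def files_without_text(report_path, path_column="File", count_column="Words",
                       delimiter="\t"):
    """Return the paths whose count is zero (or blank) in a delimited report.

    The column names are assumptions; change them to match the real report.
    """
    unsearchable = []
    with open(report_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter=delimiter):
            raw = (row.get(count_column) or "").strip().replace(",", "")
            # A blank or non-numeric count is treated as zero so the file gets reviewed.
            count = int(raw) if raw.isdigit() else 0
            if count == 0:
                unsearchable.append(Path(row[path_column]))
    return unsearchable


def copy_to_hot_folder(paths, hot_folder):
    """Copy each existing file into the hot folder and return the new paths.

    Files that share a name will overwrite one another in this simple version.
    """
    hot = Path(hot_folder)
    hot.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in paths:
        if source.is_file():
            copied.append(Path(shutil.copy2(source, hot / source.name)))
    return copied
```

Something like `copy_to_hot_folder(files_without_text("report.txt"), "hot_folder")` would then queue the textless PDFs for whichever OCR tool watches that folder; both names here are placeholders.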

        We have a fleet of virtual machines that are our workers to do things like this, not yet optimised for load balancing and farming out jobs, but still pretty good as we get to save significant time by doing lots of things in parallel.
        Good luck.
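The load balancing by cumulative file size mentioned above could be sketched as follows, as an alternative to an Excel cumulative-size formula. This is a minimal sketch, not anyone's production method: it uses a simple greedy approach (largest file first, always into the currently lightest chunk), and the function names and the default of 4 workers are assumptions for illustration.

```python
import heapq
from pathlib import Path


def list_pdfs_with_sizes(folder):
    """Return (path, size_in_bytes) pairs for every .pdf under folder, recursively."""
    return [(p, p.stat().st_size) for p in Path(folder).rglob("*.pdf") if p.is_file()]


def balance_by_size(sized_files, workers=4):
    """Split (path, size) pairs into `workers` chunks of similar total size."""
    chunks = [[] for _ in range(workers)]
    # Heap of (running total in bytes, chunk index); the lightest chunk is on top.
    heap = [(0, index) for index in range(workers)]
    heapq.heapify(heap)
    # Placing the largest files first keeps the final totals closer together.
    for path, size in sorted(sized_files, key=lambda item: item[1], reverse=True):
        total, index = heapq.heappop(heap)
        chunks[index].append(path)
        heapq.heappush(heap, (total + size, index))
    return chunks
```

Each chunk returned by `balance_by_size(list_pdfs_with_sizes("some_folder"))` could then be written out as its own file listing and handed to a separate batch file or worker machine.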
