It has been more than 20 years since I last played seriously with OCR, and back then neither the hardware nor the software was up to the task.
Back then, OCR was done letter by letter. "Let's see... that's a circle with a bar sticking out of it. Is the bar on the left or the right? On the right, hmm, I'll bet it's a lower case d. Now let's try the next letter. It looks kinda like another circle. Maybe a lower case o. Or an e. Or a c. One chance in three. Go for it."
Now OCR does it word by word, with built-in dictionaries for hundreds of different languages. And it's f a s t! Three or four seconds a page.
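To make the difference concrete, here is a toy Python sketch of the word-by-word idea. It is only my illustration, not how ABBYY or any real OCR engine works internally: the function name `resolve_word`, the three-word dictionary, and the candidate letters are all made up for the example. Each position in the word carries every letter the recognizer thinks it might be, and the dictionary throws out the combinations that aren't words.

```python
from itertools import product


def resolve_word(candidates, dictionary):
    """Return every dictionary word that fits the per-letter guesses.

    candidates: one string per letter position, holding all the letters
                the recognizer considers plausible at that position.
    dictionary: a set of known words (hypothetical, tiny, for illustration).
    """
    matches = []
    for letters in product(*candidates):
        word = "".join(letters)
        if word in dictionary:
            matches.append(word)
    return matches


# Hypothetical example: first letter is surely "d", the second could be
# "o", "e", or "c", and the third is surely "g".
DICTIONARY = {"dog", "dig", "day"}
print(resolve_word(["d", "oec", "g"], DICTIONARY))  # ['dog']
```

The one-chance-in-three guess from the letter-by-letter days goes away, because "deg" and "dcg" aren't in the dictionary.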
I just spent $30 on the ABBYY program for OCR (half-price sale), and compared to my previous experiences the results are staggering. On clean text, it is nearly 100% perfect. On dirty text... well, see for yourself. The sample below is a screenshot from a PDF file that was fed to ABBYY. (As bad as it is, it is still better than the original paper document, which I massaged intensely with my Fujitsu scanner!) I ran the PDF through ABBYY as a joke, just to see if it could parse out any of the text, and it came back roughly 99% accurate. [OK, in the interests of accuracy, there were about 30 OCR errors out of 2,603 characters, so it was only 98.8% accurate.]
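For anyone who wants to check the arithmetic, here is a short Python sketch. The function name `character_accuracy` is just something I made up for this post; the 30 and 2,603 are the counts from my test, and the 30 is approximate.

```python
def character_accuracy(errors: int, total_chars: int) -> float:
    """Percentage of characters recognized correctly."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return 100.0 * (total_chars - errors) / total_chars


# About 30 errors out of 2,603 characters.
accuracy = character_accuracy(30, 2603)
print(f"{accuracy:.1f}% accurate")  # 98.8% accurate
```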
Color me impressed,
tanstaafl.
_________________________
"There Ain't No Such Thing As A Free Lunch"