Here’s a little background before I get into the tutorial. I’m studying abroad in Germany this fall and was sent a bunch of documents in German to fill out and return. My German is pretty bad, and these are documents I can’t afford to misinterpret, or I’d probably mess things up and arrive without housing because I mailed something to the wrong address. My friend was in the same boat, so I did some thinking and came up with a solution:
- Scan hard copies to tif files
- Turn the images into text using OCR software
- Translate the German text into English. It didn’t have to be perfect because I know a small amount of German.
I had some requirements too. It had to run on OS X, and the software I used had to be free. I had access to my roommate’s cheap CanoScan scanner, which fortunately had functioning drivers for OS X (albeit with a terrible interface).
The first piece of software you need is scanning software. Hopefully you can set that up on your own. Make tiffs of all the documents you want to translate. Black and white, no compression.
Next, we need some OCR software. I tried GOCR first, but I had no luck. Then I came across Tesseract OCR, an open source engine whose development is backed by Google. Its documentation said it might work on OS X but made no promises, so I gave it a shot. Here’s how to install it on OS X. It’s super easy.
Download the source from here. I used 2.0. Extract it, and fire up a terminal.
Navigate to the source directory
cd Desktop/tesseract-2.00
Configure it and make it
./configure
make
sudo make install
Hopefully that went smoothly. Now you’ll need language support for the language you’re scanning. Download the tar.gz language pack from the download page and extract it. Then copy the contents of the extracted tessdata folder to /usr/local/share/tessdata.
Run /usr/local/bin/tesseract (or tesseract if /usr/local/bin is in your path) in the terminal just to see if it installed properly. It should spit out some usage info.
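If you want to double-check that the language pack landed in the right place, here is a minimal sketch. It assumes the language data files are named with the language code as a prefix (deu.something), which is how the packs I have seen are laid out. The function name has_lang is just something I made up for illustration.

```shell
# has_lang LANG [DIR]: succeeds if DIR contains at least one file
# whose name starts with "LANG." (assumed naming of the language data).
# DIR defaults to the tessdata location used in this tutorial.
has_lang() {
  lang=$1
  dir=${2:-/usr/local/share/tessdata}
  for p in "$dir/$lang".*; do
    # If the glob matched nothing, $p is the literal pattern and -e fails
    [ -e "$p" ] && return 0
  done
  return 1
}

if has_lang deu; then
  echo "deu data found"
else
  echo "deu data missing; copy the language pack into tessdata"
fi
```

Swap deu for your own language code, and pass a second argument if your tessdata folder lives somewhere else.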
Let’s try running it on a file now. Since I used German, my lang keyword is deu (so change this to the code for your language). My files are also sequentially labeled OCRXX.tif. So run
/usr/local/bin/tesseract OCR01.tif 1 -l deu
And if that succeeds, it will output to 1.txt, so you can make sure everything is okay by running
cat 1.txt
TextEdit.app doesn’t display these files correctly, so cutting and pasting from it isn’t advisable unless you have a better editor that supports UTF-8 (which is what I think the output files are encoded in). Since I have some webspace, I uploaded all the text files to my server and passed the URL of each .txt file to Google Language Tools. It displayed perfectly in my browser, translated and all.
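Since I’m only guessing that the output is UTF-8, here is a rough way to check. This sketch assumes iconv is installed (it ships with OS X as far as I know); is_utf8 is a hypothetical helper name. Converting a file from UTF-8 to UTF-8 fails if the bytes aren’t valid UTF-8, so the exit status tells you what you need.

```shell
# is_utf8 FILE: succeeds if FILE decodes cleanly as UTF-8.
# Note that a plain ASCII file also counts as valid UTF-8.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Report on every text file in the current directory
for t in *.txt; do
  [ -e "$t" ] || continue   # nothing to do if no .txt files exist
  if is_utf8 "$t"; then
    echo "$t: valid UTF-8"
  else
    echo "$t: not valid UTF-8"
  fi
done
```

If a file fails the check, the umlauts are probably in some other encoding and your editor or browser needs to be told which one.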
Tesseract is good stuff, especially for the price (free).
Even lazier? Here’s some bash code for you! Just save all your tifs in a single folder and run this from inside it. The part you should change to suit your needs is the language code (deu).
for f in *.[tT][iI][fF]*; do
  # Strip the .tif/.tiff extension to get the output base name
  out=$(echo "$f" | sed 's/\.[tT][iI][fF].*//')
  /usr/local/bin/tesseract "$f" "$out" -l deu
done
Now after running that you should have a bunch of text files. If you’re even lazier and don’t want to upload each one and send them through Google Language Tools, you can use cat or something to join them into one file and just send that.
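Here’s roughly what I mean by the cat approach, as a small sketch. The function name combine_pages and the output name combined.out are made up for this example. The output deliberately doesn’t end in .txt so the *.txt glob never picks up the combined file itself.

```shell
# combine_pages [DIR]: print every .txt file in DIR (default: current
# directory) in filename order, each preceded by a header line so you
# can tell the pages apart after translation.
combine_pages() {
  dir=${1:-.}
  for t in "$dir"/*.txt; do
    [ -e "$t" ] || continue   # skip if there are no .txt files
    printf '===== %s =====\n' "$(basename "$t")"
    cat "$t"
  done
}

# Write all pages from the current directory into one file
combine_pages . > combined.out
```

The shell sorts the glob by filename, so sequential names like the ones the loop produces come out in page order.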
I’ve had some issues with multiple columns in documents. You can use a simple image editor, even Preview.app, to crop and rearrange the columns into a single linear column to help the OCR software along.
I think that’s all. The same methodology should work with other operating systems as well. You might need some MacPorts utilities installed. I didn’t test without them. Sorry!