Update: I’ve turned off commenting on this article because it was just a bunch of people asking for help and never getting any. If you need help with these instructions, go to Stack Overflow and ask there. If you have corrections to the article, please send them directly to me using the Contact form.
Tesseract is a great and powerful OCR engine, but their instructions for adding a new font are incredibly long and complicated. At CourtListener we have to handle several unusual blackletter fonts, so we had to go through this process a few times. Below I’ve explained the process so others may more easily add fonts to their system.
The process has a few major steps:
Create training documents
To create training documents, open up MS Word or LibreOffice, paste in the contents of the attached file named ‘standard-training-text.txt’. This file contains the training text that is used by Tesseract for the included fonts.
Set your line spacing to at least 1.5, and space out the letters by about 1pt. using character spacing. I’ve attached a sample doc too, if that helps. Set the text …more ...