The Design and Construction of eBooks, by Steve Thomas


When OCR goes bad

OCR does a remarkable job of converting page images into text, provided your image is of good quality. Nothing will save you when the original pages are of dismal quality: you’re going to have lots of errors, and it may not be worth attempting to create a text version from them.

More often, though, I find that most pages are good and the OCR result is therefore acceptable, with just a few pages where the text of the entire page is corrupted. This is usually because the page was skewed in the original scan beyond what the OCR software can tolerate. In such a case, you may be able to rescue the text by de-skewing the image of that page and re-running the OCR on it.

  1. Download the original page scan (e.g. from
  2. Straighten the page in an image editing program
  3. Run these commands (Linux command line):
    convert page.jpg page.tif
    tesseract page.tif page
  4. Insert the resulting page.txt into the ebook, replacing the corrupted text.
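
If the skew is only slight, it may be possible to skip the manual straightening and let ImageMagick attempt it during conversion. The following is a minimal sketch, not a tested recipe: it assumes ImageMagick's convert and Tesseract are both installed, reocr_page is a hypothetical helper name, and the 40% threshold is simply the starting value suggested in the ImageMagick documentation. Pages that are badly skewed will probably still need straightening by hand.

```shell
#!/bin/sh
# Sketch: deskew one page scan and re-run OCR on it.
# Assumes ImageMagick (convert) and Tesseract are on the PATH.
# reocr_page is a hypothetical helper, not part of either tool.
reocr_page() {
    src=$1              # input scan, e.g. page.jpg
    base=${src%.*}      # same name without the extension, e.g. page

    # Convert to TIFF, asking ImageMagick to straighten the page.
    # 40% is only a starting threshold; adjust it if the result is poor.
    convert "$src" -deskew 40% "$base.tif" || return 1

    # Tesseract appends .txt to the output base name, giving page.txt.
    tesseract "$base.tif" "$base" || return 1

    # Print the name of the text file that was written.
    printf '%s\n' "$base.txt"
}
```

Calling reocr_page page.jpg should leave page.tif and page.txt beside the original scan. Check the new text against the page image before pasting it into the ebook.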

Last updated Tuesday, January 26, 2016 at 23:27