OCR

OCR stands for Optical Character Recognition: the conversion of scanned image data into text that can be selected, copied or modified.

If you enable the “OCR after scan” feature in the scanner settings, or select the “OCR” action in the Edit menu, PDFScanner turns the scanned image into a searchable PDF that contains the text of the document. The document can then be found using Spotlight search, and its text can be copied to the pasteboard for use in other applications.

When performing OCR, PDFScanner adds an invisible text layer to the PDF file. The text can be selected and copied, but you still see only the scanned image. You can then paste the text into other applications such as MS Word. Be sure to use “Paste and Match Style” to adjust the font style and size.

PDFScanner can also perform OCR on an imported PDF document or image. Use the menu option or drop a file onto the main window to import or open it. OCR and deskew can then be performed from the Edit menu. The language used for manual OCR operations can be set in the Preferences of PDFScanner. Please note that OCR only works if the imported PDF does not already contain text.

OCR works best if the language of the document is known, so the scanner settings dialog lets you select an OCR language. PDFScanner currently supports English, German, French, Spanish, Italian, Dutch, Portuguese, Swedish, Danish, Norwegian and Finnish.

Unfortunately, text recognition is a very hard problem, so the results are not always accurate. Most of the time this is not a problem, because the main reason to run OCR on scanned documents is to make them searchable, and search works very well even if some words are not recognized correctly.
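To illustrate why a few recognition errors rarely hurt searchability, here is a minimal Python sketch. It is not how PDFScanner or Spotlight search works internally; the sample text, its errors and the function names are made up for illustration. It only shows that a word search still succeeds as long as the words you search for were recognized correctly.

```python
import re

def words(text):
    """Lower-case a text and return the set of words it contains."""
    return set(re.findall(r"\w+", text.lower()))

def matches(ocr_text, query):
    """Return True if every word of the query occurs in the OCR text."""
    return words(query) <= words(ocr_text)

# Hypothetical OCR output with two recognition errors:
# "lnvoice" instead of "Invoice" and "arnount" instead of "amount".
ocr_text = "lnvoice 2024-17: total arnount due for the annual insurance premium"

print(matches(ocr_text, "insurance premium"))  # True
print(matches(ocr_text, "invoice"))            # False
```

The document is still found by most of its words. Only a search for one of the misrecognized words fails.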

PDFScanner uses the free Tesseract OCR engine, an open-source program whose development has been sponsored by Google. See https://github.com/tesseract-ocr/ for more information.
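For readers who want to try Tesseract directly, the following Python sketch builds a command line for the standalone tesseract tool that produces a searchable PDF in a chosen language. How PDFScanner calls Tesseract internally is not documented here, so this is only an assumption-based illustration. The file names and the function name are hypothetical. The language codes are the standard Tesseract codes for the languages listed on this page.

```python
# Tesseract language codes for the languages listed on this page.
TESSERACT_LANG = {
    "English": "eng", "German": "deu", "French": "fra", "Spanish": "spa",
    "Italian": "ita", "Dutch": "nld", "Portuguese": "por", "Swedish": "swe",
    "Danish": "dan", "Norwegian": "nor", "Finnish": "fin",
}

def tesseract_command(image_path, output_base, language):
    """Build the argument list for a searchable-PDF run of the tesseract tool."""
    code = TESSERACT_LANG[language]
    # The trailing "pdf" selects PDF output; tesseract writes <output_base>.pdf.
    return ["tesseract", image_path, output_base, "-l", code, "pdf"]

# Hypothetical file names, for illustration only.
print(" ".join(tesseract_command("scan.png", "scan_ocr", "German")))
# prints: tesseract scan.png scan_ocr -l deu pdf
```

Running the resulting command requires a separate Tesseract installation with the language data for the selected language.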