Much of this guide is reused, under a Creative Commons license, from "Optical Character Recognition (OCR) @ Pitt" by the University of Pittsburgh Library System, with much gratitude.
Optical Character Recognition (OCR) is the electronic conversion of images of text into digitally encoded text using specialized software. OCR software enables a computer to convert a scanned document, a digital photo of text, or any other digital image of text into machine-readable, searchable, retrievable, and editable data. OCR data can then be used for a variety of applications, including data extraction, data/text mining, and text-to-speech technology.
The OCR process typically involves at least three steps:
Depending on the quality of your document, you may also have to edit or "preprocess" the image to improve the quality and, thus, enable the OCR software to recognize the text more accurately. If you're working with text that the OCR software isn't equipped to recognize (handwritten or atypical typography), you might need to use language packages, patterns, and training data to supplement the software's default pattern recognition settings. And, finally, depending on the accuracy of the OCR, you may have to verify and correct ("post-process") the OCR-generated text. These steps could require a considerable amount of time and effort, depending on the quality and extent of your documents, so you will want to account for this in your process.
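To judge how much post-processing a batch of documents will need, one common approach is to hand-correct a small sample and measure the character error rate (CER) of the OCR output against it. The following is a minimal sketch using only the Python standard library; the function names and the sample strings are illustrative assumptions, not part of any particular OCR tool.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    # previous[j] is the distance between the reference prefix processed
    # so far and hypothesis[:j].
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the corrected reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative sample: a hand-corrected line and a made-up OCR reading of it.
ground_truth = "The quick brown fox"
ocr_output = "Tbe qu1ck brown f0x"

cer = character_error_rate(ground_truth, ocr_output)
print(f"Character error rate: {cer:.1%}")  # prints "Character error rate: 15.8%"
```

A low CER on the sample suggests light proofreading may be enough; a high CER suggests revisiting the scan quality or preprocessing before correcting text by hand.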
OCR software is generally less useful for handwritten texts. Handwritten text recognition (HTR) is similar to OCR in that machine learning is used to generate transcriptions of documents; the difference is that HTR transcribes handwritten rather than printed documents. There are different HTR tools, programs, and programming packages available for different types of HTR projects. They generally fall into three categories: business, personal, and archival.
Transkribus is a software program that allows users to load documents and create HTR models that generate transcriptions using the PyLaia and HTR+ engines. Transkribus Expert Client is the version available for download to run on your desktop. Transkribus Lite is the online version, with the same abilities to load documents, run line segmentation, transcribe ground truth, create models, and use premade models, plus the added capability for multiple users to collaborate on the same document collection. A model generally becomes more accurate as you give it more transcribed ground truth to learn from. Each different hand requires starting anew, though.
Live Text is a feature introduced in Apple iOS 15, available on iPhones and iPads. Live Text is HTR/OCR that can be used directly in the camera app and in Photos. If you take a picture of a sign with text on it, Live Text can detect and transcribe the text so that you can copy and paste it, either in the camera app before the picture is taken or from a photo in your camera roll. Live Text has some success with handwritten text as well, depending on how clearly the text is written and how closely the letters match standardized characters.
Google Lens is a feature of the Google app available for Apple and Android devices. Google Lens can be used with your device's camera or in photos to extract typed or handwritten text. That extracted text can be Google-searched from the Google Lens app or copy-and-pasted.
Evernote offers a capable handwriting-to-text recognition engine, but it works best on modern handwriting rather than historical documents.
Adobe Acrobat Pro DC works as a text converter, automatically extracting text from any scanned paper document or image file and converting it to editable text in a PDF. Acrobat can recognize text and its formatting. Your new PDF will match your original printout thanks to automatic custom font generation. You can work with converted PDF files in other applications, preserve the exact look and feel of your documents, and restrict editing capabilities by saving them as smart PDFs that include text you can search and copy.
Free Online OCR is a free online OCR service, based on the Tesseract OCR engine, that can analyze the text in any image file that you upload and then convert it into text that you can easily edit on your computer. Free Online OCR allows unlimited uploads and the following input files: image files (JPEG, JFIF, PNG, GIF, BMP, PBM, PGM, PPM, PCX); multi-page documents (TIFF, PDF, DjVu); compressed files (Unix compress, bzip2, bzip, gzip), including multiple images in a ZIP archive; and DOCX and ODT files with images. Free Online OCR supports 122 recognition languages and fonts, multi-language recognition, mathematical equation recognition, page layout analysis (multi-column text recognition), selection of an area on the page for OCR, page rotation, poorly scanned and photographed pages, and low-resolution images.
Google Lens is an image recognition technology that uses visual analysis based on a neural network to extract text from images and bring up relevant information related to objects it identifies. Users can copy text once it has been recognized. Google Lens can be used as a standalone app or as an integrated feature in the Google Photos, Google Assistant, Google Image Search, and Chrome mobile apps. The mobile apps also enable translation of recognized text using Google Translate.
Copyfish is free OCR software that allows you to copy, paste, and translate text from image, video, and PDF files. The web browser extension (Chrome, Firefox, Microsoft Edge) works with every website, including videos and PDF documents. The desktop capture OCR feature, which you can install in addition to the browser extension, allows you to extract text from opened documents (e.g., text and tables from brochures and leaflets that are only available as graphics), file menus, browser extensions, web pages, presentations, games, and PDF files.
Programmatic tools require at least some programming knowledge. Depending on the tool and your coding proficiency, you may be able to customize the OCR functionality more than with out-of-the-box tools. The following recommended tools vary by type (e.g., JavaScript scripts, Python module, Python scripts, Python wrapper) and may or may not be compatible with your platform (operating system). All tools are freely available.
Tesseract is open source OCR software that can be used directly from the command line, or (for programmers) through an API, to extract printed text from images. Tesseract doesn’t have a built-in GUI (Graphical User Interface), but there are several available from the 3rdParty page. The engines include a neural net (LSTM) based OCR engine, which is focused on line recognition, as well as an engine that works by recognizing character patterns. Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box". Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, and TSV.
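As a rough illustration of the command-line usage and the TSV output format described above, the sketch below builds a Tesseract command (without running it) and then filters words by confidence from TSV text. The file names, the 60-point threshold, the helper function names, and the sample rows are all illustrative assumptions; the column layout follows Tesseract's TSV output, where rows that describe a page, block, paragraph, or line carry a confidence of -1.

```python
import csv
import io


def build_tesseract_command(image_path, output_base, lang="eng", output_format="tsv"):
    """Return the argument list for a Tesseract CLI call (not executed here).

    Passing this list to subprocess.run() would require Tesseract to be installed.
    """
    return ["tesseract", image_path, output_base, "-l", lang, output_format]


def confident_words(tsv_text, min_conf=60.0):
    """Yield (word, confidence) for word rows at or above the threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Structural rows (page, block, paragraph, line) have conf == -1
        # and no text, so they are skipped here.
        if text and conf >= min_conf:
            yield text, conf


# Made-up sample in the shape of Tesseract's TSV output.
HEADER = ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
          "left", "top", "width", "height", "conf", "text"]
SAMPLE_ROWS = [
    ["4", "1", "1", "1", "1", "0", "36", "92", "582", "30", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "36", "92", "60", "24", "96.1", "Optical"],
    ["5", "1", "1", "1", "1", "2", "109", "92", "96", "30", "41.5", "Charaoter"],
    ["5", "1", "1", "1", "1", "3", "215", "92", "130", "30", "93.0", "Recognition"],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in [HEADER] + SAMPLE_ROWS)

print(" ".join(build_tesseract_command("page.png", "page")))
# prints "tesseract page.png page -l eng tsv"
print(list(confident_words(SAMPLE_TSV)))
# prints [('Optical', 96.1), ('Recognition', 93.0)]
```

Words that fall below the threshold (here the misread "Charaoter") are natural candidates for manual review during post-processing.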
Rescribe is a research collective with a focus on Optical Character Recognition (OCR) software and training for historical texts. Their OCR training packages are designed for the Tesseract and OCRopus engines and can be downloaded and used for free from latinocr.org and github.com. The software and tools they create are all released as free and open source software.
Kraken OCR is a command-line Python package that generates transcriptions for historical documents in a variety of languages. Kraken can train models to generate transcriptions for Latin and non-Latin scripts (e.g., Naskh, Aramaic, Devanagari), as well as texts written right-to-left and top-to-bottom. The Kraken package provides free public models that users can run on their documents.
Nautilus-OCR is an open-source, Python-based OCR engine developed at the National Library of Luxembourg. Nautilus-OCR works with the METS/ALTO schemas, with the ability to take in a METS/ALTO dataset and produce an improved METS/ALTO dataset. The National Library of Luxembourg used Nautilus-OCR on their historical newspaper collection and published the OCR models they produced on that project for public use with Nautilus-OCR.
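For readers unfamiliar with ALTO, the sketch below shows roughly how recognized text and per-word confidence are stored in an ALTO file and how they can be read with Python's standard library. It does not use Nautilus-OCR itself; the sample XML, the function name `read_alto`, and the 0.8 confidence threshold are illustrative assumptions. The sketch also assumes the file declares an ALTO namespace on its root element.

```python
import xml.etree.ElementTree as ET

# Made-up ALTO fragment: each String element holds one word in CONTENT
# and an optional word confidence between 0 and 1 in WC.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Luxemburger" WC="0.97"/>
            <SP/>
            <String CONTENT="Wort" WC="0.58"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Zeitung" WC="0.97"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_alto(xml_text, min_wc=0.8):
    """Return (lines of text, words whose confidence is below min_wc)."""
    root = ET.fromstring(xml_text)
    # ElementTree stores the root tag as "{namespace-uri}alto"; reuse the URI
    # so the same code works across ALTO schema versions.
    namespace = {"alto": root.tag[1:].partition("}")[0]}
    lines, doubtful = [], []
    for text_line in root.findall(".//alto:TextLine", namespace):
        words = []
        for string in text_line.findall("alto:String", namespace):
            word = string.get("CONTENT", "")
            words.append(word)
            wc = string.get("WC")
            if wc is not None and float(wc) < min_wc:
                doubtful.append(word)
        lines.append(" ".join(words))
    return lines, doubtful


lines, doubtful = read_alto(SAMPLE_ALTO)
print(lines)     # prints ['Luxemburger Wort', 'Zeitung']
print(doubtful)  # prints ['Wort']
```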
Neural Network OCR trains a multi-layer perceptron (MLP) neural network to perform OCR. The training set is automatically generated using a heavily modified version of the captcha-generator node-captcha. It also supports the MNIST handwritten digit database.
CMU Linguistic Annotation Backend (LAB) has released an extension that allows for OCR handwriting correction. "This tool allows you to train a model that fixes the recognition errors made by a first pass OCR system. In the first step, the user uploads a set of images of documents and gets back the transcribed output from an off-the-shelf OCR engine. Once a few of these documents have been manually corrected, they can be used in step 2 to train a new post-correction model."
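To make the two-step idea concrete, here is a deliberately simplified stand-in: instead of training a model, it builds a word-level lookup table from a few (first-pass OCR, manually corrected) line pairs and applies it to new OCR output. This is not the method used by the CMU tool; the function names and sample lines are hypothetical, and the sketch assumes each OCR line and its correction contain the same number of words.

```python
from collections import Counter, defaultdict


def learn_corrections(pairs):
    """Build {ocr_word: corrected_word} from (ocr_line, corrected_line) pairs."""
    counts = defaultdict(Counter)
    for ocr_line, corrected_line in pairs:
        ocr_words, fixed_words = ocr_line.split(), corrected_line.split()
        if len(ocr_words) != len(fixed_words):
            continue  # this sketch cannot align lines of different lengths
        for ocr_word, fixed_word in zip(ocr_words, fixed_words):
            counts[ocr_word][fixed_word] += 1
    table = {}
    for ocr_word, fixes in counts.items():
        best, _ = fixes.most_common(1)[0]
        # Store an entry only when the most frequent correction changes the word.
        if best != ocr_word:
            table[ocr_word] = best
    return table


def apply_corrections(text, table):
    """Replace every word that has a learned correction."""
    return " ".join(table.get(word, word) for word in text.split())


# Step 1 (assumed): first-pass OCR lines paired with manual corrections.
corrected_pairs = [
    ("tbe old rnill", "the old mill"),
    ("tbe rnill pond", "the mill pond"),
]

# Step 2: learn from the corrections, then fix new first-pass output.
table = learn_corrections(corrected_pairs)
print(apply_corrections("by tbe rnill", table))  # prints "by the mill"
```

A lookup table like this only fixes errors it has already seen; a trained post-correction model is meant to generalize to unseen errors.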