CMU LibGuides: History: Digital History

Digital Humanities Resources at CMU

Data 101
by Alfredo Gonzalez-Espinoza, last updated Nov 5, 2024
GIS & Spatial Data
by Jessica Benner, last updated Feb 17, 2025
Text & Data Mining
by Nicky Agate, last updated Feb 20, 2024

Digital History Resources Online

DHQ: Digital Humanities Quarterly
Open-access, peer-reviewed, digital journal covering all aspects of digital media in the humanities. Published by the Alliance of Digital Humanities Organizations (ADHO)
Digital History: A Guide to Gathering, Preserving and Presenting the Past on the Web
E-book that provides a plainspoken and thorough introduction to the web for historians—teachers and students, archivists and museum curators, professors as well as amateur enthusiasts—who wish to produce online historical work, or to build upon and improve the projects they have already started in this important new medium.
Digital History Resources-American Historical Association
Resources for getting started, professional guidelines, etc.
H-Digital-History Resources
Serves as a venue for discussion of issues related to historical computing for a wide audience of scholars in the humanities and social sciences.
Programming Historian
Novice-friendly, peer-reviewed tutorials that help humanists learn a wide range of digital tools, techniques, and workflows to facilitate their research.
Index of Digital Humanities Conferences
1960s-present. Browse 7,113 presentations from 494 digital humanities conferences spanning 61 years, featuring 8,420 different authors hailing from 1,830 institutions and 86 countries.

Optical Character Recognition (OCR)

What is OCR?

Optical Character Recognition (OCR) is the electronic conversion of images of text into digitally encoded text using specialized software. OCR software enables a computer to convert a scanned document, a digital photo of text, or any other digital image of text into machine-readable, searchable, retrievable, and editable data. OCR data can then be used for a variety of applications, including data extraction, data/text mining, and text-to-speech technology.

How to use OCR

The OCR process typically involves at least three steps:

Scanning and/or opening a document in the OCR software,
Recognizing the text in the document using the OCR software, and
Saving the new OCR-processed document in the file format of your choosing.

Depending on the quality of your document, you may also have to edit or "preprocess" the image to improve the quality and, thus, enable the OCR software to recognize the text more accurately. If you're working with text that the OCR software isn't equipped to recognize (handwritten or atypical typography), you might need to use language packages, patterns, and training data to supplement the software's default pattern recognition settings. And, finally, depending on the accuracy of the OCR, you may have to verify and correct ("post-process") the OCR-generated text. These steps could require a considerable amount of time and effort, depending on the quality and extent of your documents, so you will want to account for this in your process.
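
As an illustration of the post-processing step described above, here is a minimal Python sketch that applies a few conservative cleanup rules to OCR output. The function name `clean_ocr_text` and the specific rules are assumptions chosen for this example, not part of any particular OCR tool; real projects usually need rules tuned to their own documents.

```python
import re
import unicodedata


def clean_ocr_text(raw: str) -> str:
    """Apply a few conservative cleanup rules to OCR output (illustrative only)."""
    # Normalize Unicode so that ligatures such as "ﬁ" become plain "fi".
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks into spaces, keeping blank lines
    # (paragraph breaks) intact.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Note that the hyphen rule will also join words that are genuinely hyphenated at a line end (for example, "well-" followed by "known"), so the result still needs to be checked by a person.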

Out-of-the-Box OCR Tools

Adobe Acrobat Pro DC

Adobe Acrobat Pro DC works as a text converter, automatically extracting text from any scanned paper document or image file and converting it to editable text in a PDF. Acrobat can recognize text and its formatting. Your new PDF will match your original printout thanks to automatic custom font generation. You can work with converted PDF files in other applications, preserve the exact look and feel of your documents, and restrict editing capabilities by saving them as smart PDFs that include text you can search and copy.

Type: Desktop application
Access: Individual subscription or from Virtual Andrew or on Windows at CMU Computer Labs
Batch Processing: Yes
Helpful Resource(s):

Free Online OCR

Free Online OCR is a free online OCR service, based on the Tesseract OCR engine, that analyzes the text in any image file you upload and converts it into text that you can easily edit on your computer. Free Online OCR allows unlimited uploads and accepts the following input files: image files (JPEG, JFIF, PNG, GIF, BMP, PBM, PGM, PPM, PCX); multi-page documents (TIFF, PDF, DjVu); compressed files (Unix compress, bzip2, bzip, gzip), including multiple images in a ZIP archive; and DOCX and ODT files with images. Free Online OCR supports 122 recognition languages and fonts, multi-language recognition, mathematical equation recognition, page layout analysis (multi-column text recognition), selection of an area on the page for OCR, page rotation, poorly scanned and photographed pages, and low-resolution images.

Type: Web application
Batch Processing: No
Helpful Resource(s): N/A

Google Lens

Google Lens is an image recognition technology that uses visual analysis based on a neural network to extract text from images and bring up relevant information related to objects it identifies. Users can copy text once it has been recognized. Google Lens can be used as a standalone app or as an integrated feature in the Google Photos, Google Assistant, Google Image Search, and Chrome mobile apps. The mobile apps also enable translation of recognized text using Google Translate.

Type: Mobile application, Mobile application integrated feature, Web application integrated feature
Batch Processing: No
Helpful Resource(s):
- Krishnan, Amal. “How to Perform OCR Scanning with Google Lens.” MashTips, March 14, 2019.

Copyfish

Copyfish is free OCR software that allows you to copy, paste, and translate text from image, video, and PDF files. The web browser extension (Chrome, Firefox, Microsoft Edge) works with every website, including videos and PDF documents. The desktop capture OCR feature, which you can install in addition to the browser extension, allows you to extract text from opened documents (e.g., text and tables from brochures and leaflets that are only available as graphics), file menus, browser extensions, web pages, presentations, games, and PDF files.

Type: Web browser extension
Batch Processing: No
Helpful Resource(s):
- Copyfish. “How to Use Copyfish.”

Programmatic OCR Tools

Programmatic tools require at least some programming knowledge. Depending on the tool and your coding proficiency, you may be able to customize the OCR functionality more than with out-of-the-box tools. The following recommended tools vary by type (e.g., JavaScript scripts, Python module, Python scripts, Python wrapper) and may or may not be compatible with your platform (operating system). All tools are freely available.

Tesseract

Tesseract is open source OCR software that can be used directly via the command line, or (for programmers) through an API, to extract printed text from images. Tesseract doesn’t have a built-in GUI (Graphical User Interface), but there are several available from the 3rdParty page. The engines include a neural net (LSTM) based OCR engine, which is focused on line recognition, as well as an engine that works by recognizing character patterns. Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box". Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, and TSV.

Type: Command-line program
Batch Processing: Yes
Helpful Resource(s):
- tessdoc. “Tesseract User Manual.”
- Wolf, Nick. “Research Guides: Tesseract OCR Software Tutorial: Home.”
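
To show what command-line and batch use of Tesseract can look like, here is a minimal Python sketch that builds a `tesseract` command and runs it over a folder of images. It assumes Tesseract is installed and available on your PATH and that your pages are PNG files; the function names `build_tesseract_command` and `ocr_folder` are hypothetical helpers written for this example.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="eng", formats=("txt",)):
    """Return the argument list for one Tesseract run."""
    # CLI shape: tesseract IMAGE OUTPUTBASE [-l LANG] [CONFIGFILE...]
    # Config names such as "txt", "hocr", "pdf", and "tsv" select output formats.
    return ["tesseract", str(image), str(output_base), "-l", lang, *formats]


def ocr_folder(folder, lang="eng", formats=("txt", "hocr")):
    """Run Tesseract on every PNG in a folder (assumes tesseract is on PATH)."""
    for image in sorted(Path(folder).glob("*.png")):
        # Output is written next to the image, e.g. page1.txt and page1.hocr.
        command = build_tesseract_command(image, image.with_suffix(""), lang, formats)
        subprocess.run(command, check=True)
```

For example, `ocr_folder("scans", lang="eng")` would process every PNG in a folder named `scans`. The language code must match a language data file installed with your copy of Tesseract.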

Rescribe

Rescribe is a research collective with a focus on Optical Character Recognition (OCR) software and training for historical texts. Their OCR training packages are designed for the Tesseract and OCRopus engines and can be downloaded and used for free from latinocr.org and github.com. All of the software and tools they create are released as free and open source software.

Type: Desktop tool
Batch Processing: Unsure
Helpful Resource(s):
- A blog containing various guides and articles, written to be useful for humanists, librarians, and technologists.
- A Scholar's Guide to using Optical Character Recognition, a five-lesson course on Academia.edu, 2021.
- A Pipeline for the Ages: Medieval Manuscript OCR from the comfort of your own home, Lightning Talk for the Schoenberg Symposium 2020.

Kraken OCR

Kraken OCR is a command-line Python package that generates transcriptions for historical documents in a variety of languages. Kraken can train models to generate transcriptions for Latin scripts and non-Latin scripts (e.g., Naskh, Aramaic, Devanagari), as well as texts written right-to-left and top-to-bottom. The Kraken package provides free public models that users can run on their documents.

Type: Python package
Tested/Compatible Platform(s): Linux, macOS

Nautilus-OCR

Nautilus-OCR is an open-source, Python-based OCR engine developed at the National Library of Luxembourg. Nautilus-OCR works with the METS/ALTO schemas, with the ability to take in a METS/ALTO dataset and produce an improved METS/ALTO dataset. The National Library of Luxembourg used Nautilus-OCR on their historical newspaper collection and published the OCR models they produced on that project for public use with Nautilus-OCR.

Type: Python package
Tested/Compatible Platform(s): Linux, macOS
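
If you have not worked with ALTO files before, the following Python sketch shows one way to read the recognized text out of an ALTO XML document using only the standard library. It is not part of Nautilus-OCR; the function names `local_name` and `alto_lines` are assumptions for this example, and it relies only on the ALTO convention that each recognized word is a `String` element with a `CONTENT` attribute inside a `TextLine`.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the XML namespace, which differs between ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_xml: str) -> list[str]:
    """Return the text of each TextLine in an ALTO document."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each recognized word is a String element with a CONTENT attribute.
        words = [child.get("CONTENT", "") for child in element
                 if local_name(child.tag) == "String"]
        lines.append(" ".join(words))
    return lines
```

Ignoring the namespace keeps the sketch usable across ALTO schema versions, at the cost of not validating the file against any one of them.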

Neural Network OCR

Neural Network OCR trains a multi-layer perceptron (MLP) neural network to perform OCR. The training set is automatically generated using a heavily modified version of the captcha-generator node-captcha. It also supports the MNIST handwritten digit database.

Type: JavaScript scripts
Tested/Compatible Platform(s): macOS

CMULAB OCR Post Correction

CMU Linguistic Annotation Backend (LAB) has released an extension that allows for OCR handwriting correction. "This tool allows you to train a model that fixes the recognition errors made by a first pass OCR system. In the first step, the user uploads a set of images of documents and gets back the transcribed output from an off-the-shelf OCR engine. Once a few of these documents have been manually corrected, they can be used in step 2 to train a new post-correction model."
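
To get a sense of how much a first-pass OCR transcription differs from a manually corrected one, a common measure is the character error rate (CER). The Python sketch below is a generic illustration and is not taken from the CMULAB tool; the function names `edit_distance` and `character_error_rate` are assumptions for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, corrected_text: str) -> float:
    """Edits needed to reach the corrected text, per corrected character."""
    if not corrected_text:
        raise ValueError("corrected_text must not be empty")
    return edit_distance(ocr_text, corrected_text) / len(corrected_text)
```

A lower value means the OCR output is closer to the corrected text; comparing the rate before and after post-correction gives a rough sense of whether the correction step is helping.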

Handwritten Text Recognition (HTR)

OCR software is generally less useful for handwritten texts. Handwritten text recognition (HTR) is similar to OCR in that it uses machine learning to generate transcriptions of documents, but it transcribes handwritten rather than printed documents. Different HTR tools, programs, and programming packages are available for different types of HTR projects. They generally fall into three categories: archival, business, and personal.

Archival HTR tools are used by libraries, archives, museums, and government institutions to make their digitized collections of handwritten documents searchable, as well as by researchers studying handwritten manuscripts.
Business HTR tools are built for businesses that work with handwritten information on physical documents and need it transcribed into digital text, to increase access to that information and/or store it in databases. Businesses that use HTR include insurance companies, banks, and healthcare companies.
Personal HTR tools are available via apps on smartphones and computers. Individuals use personal HTR to generate transcriptions of handwritten documents, usually ones they wrote themselves. For example, students may use HTR to turn their handwritten class notes into text to study or share online. Personal HTR tools are also useful for community archiving and family archiving.

Archival Tools

Transkribus is a software program that lets users load documents and create HTR models to generate transcriptions using the PyLaia and HTR+ engines. Transkribus Expert Client is the downloadable desktop software, and Transkribus Lite is the online version, with the same abilities to load documents, run line segmentation, transcribe ground truth, create models, and use premade models, plus the ability for multiple users to collaborate on the same document collection. The more you use Transkribus, the more accurate it becomes. Each different hand requires starting anew, though.

Type: software program
Cost: Transkribus uses a credit system for text transcriptions. Everything in Transkribus is free except generating text transcriptions, whether with premade models or models you create. Users get 500 free credits upon signing up. Cost is calculated from the number of pages to transcribe and the engine used to transcribe them (PyLaia or HTR+). A single page of handwritten material costs 1 credit to transcribe using the PyLaia engine and 1.25 credits using the HTR+ engine.
Helpful Resource(s):
- Transkribus How-To Guides
Example Projects:
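
As a rough budgeting aid, the credit arithmetic described under Cost can be sketched in Python as follows. The rates and the free sign-up balance are the figures quoted in this guide and may since have changed; the names `credits_needed` and `credits_to_buy` are hypothetical helpers for this example, not part of Transkribus.

```python
# Credits per handwritten page, as quoted in this guide (may be out of date).
CREDITS_PER_PAGE = {"PyLaia": 1.0, "HTR+": 1.25}
# Free credits granted at sign-up, as quoted in this guide.
FREE_CREDITS = 500


def credits_needed(pages: int, engine: str) -> float:
    """Credits required to transcribe a number of handwritten pages."""
    return pages * CREDITS_PER_PAGE[engine]


def credits_to_buy(pages: int, engine: str, balance: float = FREE_CREDITS) -> float:
    """Credits still needed after using up the current balance."""
    return max(0.0, credits_needed(pages, engine) - balance)
```

For example, under these figures the 500 free credits cover 500 handwritten pages with PyLaia but only 400 pages with HTR+.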

Personal Tools

Live Text is a feature introduced in Apple iOS 15, available on Apple iPhones and iPads. Live Text is an HTR/OCR feature that can be used directly in the camera app and in photos. If you point the camera at a sign with text on it, Live Text can detect and transcribe the text so that it can be copied and pasted, either directly from the camera app before the picture is taken or from the photo in your camera roll. Live Text has some success with handwritten text as well, depending on how clearly the text is written and how closely the letters resemble standardized characters.

Type: software
Cost: software is free with Apple products
Helpful Resource(s):
- The Complete Guide to Using Live Text on iOS 15
- How to use Live Text with iOS 15

Google Lens is a feature of the Google app available for Apple and Android devices. Google Lens can be used with your device's camera or in photos to extract typed or handwritten text. That extracted text can be Google-searched from the Google Lens app or copy-and-pasted.