Text & Data Mining: CMU Sources

Designed to introduce you to text & data mining (TDM) at Carnegie Mellon University

Although most of the libraries' databases do not allow text or data mining due to license agreements, we are continually exploring licensing agreements for TDM rights from content providers as we negotiate database purchases. If you want TDM rights to particular content, please contact your liaison librarian.  

Below is a list of TDM sources available for CMU affiliates:

The EEBO Text Creation Partnership (TCP) is creating textual transcriptions of the digitized page images in Early English Books Online (EEBO). The work is being done in phases, and different instructions apply to each phase:

EEBO-TCP Phase I – 25,000 texts freely available for anyone to use without restriction, for example, to text mine, modify or share with others.  You can download the files from box.com here. The readme.txt file describes the file formats.

EEBO-TCP Phase II (ongoing) – More than 30,000 texts created.  Access to and use of these texts are subject to restrictions specified in CMU Libraries’ license. You may download these files for local use, but you may not share or redistribute them to users at non-TCP partner institutions without permission. To download these files, send an email to tcp-info@umich.edu (please cc dn22@andrew.cmu.edu), requesting access to download the EEBO-TCP Phase II files under the terms of Carnegie Mellon’s local management agreement.  When your request is received, you will be sent instructions and given access to the files.

See EEBO-TCP for more information about EEBO content and the project to transcribe first editions in EEBO.

Authenticated CMU users may download and mine files from the following Gale databases for non-commercial purposes:

 

  • 17th & 18th Century Burney Collection - 1,000 British pamphlets, proclamations, newsbooks and newspapers
  • 18th Century Collection Online - 150,000 18th century books
  • 19th Century British Library Newspapers - Forty-eight 19th century British newspapers
  • 19th Century U.S. Newspapers - 500 19th century U.S. newspapers
  • De-Classified Documents Reference System - Formerly classified U.S. government documents on international relations since WWII
  • The Making of the Modern World - Original literature of economics from 1460-1914
  • Sabin Americana, History & Culture - Full-text works about North, Central & South America, the Arctic & Antarctica and the West Indies, 1500-1926

Files for these collections can be accessed and downloaded here.

Users may distribute snippets of the text and data mining outputs provided that they are accompanied by a DOI link to the full-text article or chapter and the following proprietary notice: "Some rights reserved. This work permits non-commercial use, distribution, and reproduction in any medium, provided the original author and source are credited."

Users may not do any of the following:

  • Substantially or systematically reproduce, retain, or redistribute the files
  • Use snippets of text exceeding 200 characters
  • Extract, develop, or use the data in any direct or indirect commercial activity
  • Modify, abridge, translate or create derivative works, or remove, obscure, or modify copyright or other notices that appear in the files 
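The snippet rules above (at most 200 characters, accompanied by a DOI link and the proprietary notice) can be checked before sharing any output. The following is a minimal sketch, not an official tool: the function name `format_snippet` is hypothetical, it assumes the 200-character limit counts every character of the snippet including spaces, and it assumes a DOI link in the standard `https://doi.org/` form is acceptable.

```python
# Hypothetical helper; names and counting rules are assumptions, not part of the license.
MAX_SNIPPET_CHARS = 200  # limit stated in the usage terms above

NOTICE = (
    "Some rights reserved. This work permits non-commercial use, "
    "distribution, and reproduction in any medium, provided the original "
    "author and source are credited."
)


def format_snippet(text, doi):
    """Return the snippet followed by its DOI link and the required notice.

    Raises ValueError if the snippet is longer than MAX_SNIPPET_CHARS.
    Assumption: every character, including whitespace, counts toward the limit.
    """
    snippet = text.strip()
    if len(snippet) > MAX_SNIPPET_CHARS:
        raise ValueError(
            f"snippet has {len(snippet)} characters; limit is {MAX_SNIPPET_CHARS}"
        )
    return f"{snippet}\nhttps://doi.org/{doi}\n{NOTICE}"
```

Calling `format_snippet(some_text, "10.1000/xyz")` (the DOI here is a placeholder) returns three lines: the snippet, the DOI link, and the notice. Raising an error rather than silently truncating leaves the choice of which 200 characters to quote with the researcher.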

HathiTrust is a research university collaboration to archive and share digitized collections. HathiTrust makes collections of works available for research purposes, including the public domain works digitized by Google in the Google Books project. See HathiTrust datasets for more information about the process of establishing research access.  

The HathiTrust Research Center develops cutting-edge tools and infrastructure to support researchers who apply TDM computation to the HathiTrust collection. To learn more about their services, support, and community, visit their website.

Oxford English Dictionary Online (OED) – Oxford University Press grants research access to the Corpus for academic projects that can demonstrate a strong practical need for this data.  For additional information, see their documentation.  To apply for research access to the Corpus, fill out and email this application form.

CMU Libraries has licensed TDM rights to Elsevier’s ScienceDirect database. The database content may be mined for noncommercial purposes using the ScienceDirect APIs.  See Elsevier’s policy for terms and conditions and how to gain access. 

You can download subscribed and open access content for TDM purposes directly from the SpringerLink platform.  TDM rights, for non-commercial research, are now included in new and renewed subscription agreements (see SpringerLink Policy). TDM researchers are requested to be considerate and limit their downloading speed to a reasonable rate.

Content can be downloaded via a web browser or with an HTTP GET request using a scripting tool such as curl, wget, or Python’s urllib. Content can also be accessed via friendly URLs as PDF or HTML (when available). 

(Note that the tool should be configured to follow HTTP 301, 302, and 303 redirects.  No API key or other authentication is required.)
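As an illustration of the approach described above, here is a minimal sketch using Python’s urllib. It issues plain HTTP GET requests, relies on `urllib.request.urlopen` following 301, 302, and 303 redirects by default, and pauses between requests to keep the download rate modest. The function names, the `User-Agent` string, and the one-second default delay are assumptions made for this example; the page does not specify what counts as a reasonable rate.

```python
import time
import urllib.request


def fetch(url, timeout=30):
    """Issue an HTTP GET and return the response body as bytes.

    urllib.request.urlopen follows 301, 302 and 303 redirects by default,
    so no extra redirect handling is needed here.
    """
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "tdm-example/0.1"},  # hypothetical identifier
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def download_all(urls, delay_seconds=1.0, fetch_one=fetch, sleep=time.sleep):
    """Download each URL in turn, pausing between consecutive requests.

    delay_seconds is an assumed value, not a documented limit.
    fetch_one and sleep are injectable so the loop can be tested offline.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # throttle before every request after the first
        results[url] = fetch_one(url)
    return results
```

Calling `download_all(list_of_urls)` with the article URLs you have collected returns a dictionary mapping each URL to its downloaded bytes. The requests run sequentially on purpose: parallel downloads would work against the request to keep the rate reasonable.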