CMU LibGuides: Text & Data Mining: CMU Sources

Constellate is a service for teaching, learning, and performing text analysis with scholarly and primary source content from JSTOR, Portico, and other partners. The platform combines the educational materials, data, and tools that this work requires, helping faculty advance their own scholarship and develop the data skills their students need for success in their education and employment.

Apply natural language processing tools to raw text data (OCR) from Gale Primary Sources in a single research platform. By integrating an unmatched depth and breadth of digital primary source matter with the most popular Digital Humanities (DH) tools, Gale Digital Scholar Lab provides a new lens to explore history and empowers researchers to generate world-altering conclusions and outcomes. The Digital Scholar Lab offers advanced humanities computing tools that make natural language processing (NLP) for historical texts accessible, more efficient, and impactful, thus expanding the footprint of digital humanities across campus.

ProQuest TDM Studio is a web-based platform that allows you to access and analyze large amounts of text data while collaborating with colleagues in real time on one platform. Using content retrieved from ProQuest databases, you can build your corpus and conduct data analysis, text mining, and visualization to uncover relationships, patterns, and connections within and between datasets. You can either apply your preferred data analysis methods in a coding workbench in a Jupyter Notebook environment, or use a pre-defined data visualization module that requires no coding experience. Results can be shared within your team or exported for further use.

Anyone with an active CMU email address can access TDM Studio. To set up an account:

Go to https://tdmstudio.proquest.com
Click the “Create an account” button.
Use your Andrew (@andrew.cmu.edu) email address to create your account.

For more information about getting started, creating a data set and exploring your data, see the TDM Studio Quick Start Guide. To add collaborators to your workbench, contact TDM Studio at TDMStudio@clarivate.com.

EEBO-TCP is a partnership between the Universities of Michigan and Oxford and the publisher ProQuest to create accurately transcribed and encoded texts based on the image sets published by ProQuest via their Early English Books Online (EEBO) database (https://eebo.chadwyck.com). The general aim of EEBO-TCP was to encode one copy (usually the first edition) of every monographic English-language title published between 1473 and 1700 available in EEBO. Textual transcriptions of the EEBO digitized images can be downloaded from the Oxford Text Archive. The EEBO database, to which CMU subscribes, also contains about 50% of these encoded transcriptions.

EEBO-TCP aimed to produce large quantities of textual data within the usual project constraints of time and funding, and therefore chose to create diplomatic transcriptions (as opposed to critical editions) with light-touch, mainly structural encoding based on the Text Encoding Initiative (http://www.tei-c.org).
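Because the transcriptions carry mainly structural TEI markup, plain text for mining can often be recovered with a standard XML parser. The following is a minimal sketch, assuming a file in the TEI P5 namespace; the sample document, its title, and the helper names `extract_title` and `extract_paragraphs` are hypothetical illustrations, and actual EEBO-TCP files may use different elements or an older schema.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace (assumption: the file being parsed declares it).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature document, not an actual EEBO-TCP transcription.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Sample Pamphlet</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="chapter">
        <head>The First Chapter</head>
        <p>Of the <hi>causes</hi> of things.</p>
        <p>Of their effects.</p>
      </div>
    </body>
  </text>
</TEI>"""


def extract_title(root):
    """Return the title recorded in the TEI header, or '' if absent."""
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    return "".join(title.itertext()).strip() if title is not None else ""


def extract_paragraphs(root):
    """Return the text of each <p> in the body, with inline markup removed."""
    paragraphs = []
    for p in root.iterfind(".//tei:body//tei:p", TEI_NS):
        # itertext() flattens nested elements such as <hi>;
        # split/join collapses runs of whitespace.
        paragraphs.append(" ".join("".join(p.itertext()).split()))
    return paragraphs


root = ET.fromstring(SAMPLE_TEI)
print(extract_title(root))
print(extract_paragraphs(root))
```

For a downloaded transcription, `ET.parse(path).getroot()` would take the place of `ET.fromstring(SAMPLE_TEI)`.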

The EEBO-TCP project was divided into two phases. The 25,363 texts created during Phase 1 of the project were released into the public domain as of 1 January 2015. The 28,462 texts of Phase 2 were released into the public domain as of July 2020.

HathiTrust is a research university collaboration to archive and share digitized collections. HathiTrust makes collections of works available for research purposes, including the public domain works digitized by Google in the Google Books project. See HathiTrust datasets for more information about the process of establishing research access.

The HathiTrust Research Center supports researchers using TDM computation to plumb the HathiTrust collection by developing cutting edge tools and infrastructure. To learn more about their services, support, and community, visit their website.

Researchers interested in text mining the OED can test out the OED Text Visualizer, which creates annotated visualizations of historical texts using OED data, and use the OED Researcher API to quickly and easily access and manipulate the OED's data.

All Elsevier journals and books enable text and data mining (TDM), although full access is limited to the products to which CMU has access. To apply for an API key, go to https://dev.elsevier.com/. Read more about Elsevier's TDM policies and procedures.
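Once you have a key, a full-text request might be prepared as in the sketch below. It assumes Elsevier's Article Retrieval endpoint (`https://api.elsevier.com/content/article/doi/`) and the `X-ELS-APIKey` request header; confirm both against the current documentation at https://dev.elsevier.com/ before relying on them. The DOI and key are placeholders, the helper name `build_article_request` is hypothetical, and the code only builds the request without sending it.

```python
import urllib.parse
import urllib.request

# Assumed endpoint; verify against the documentation at dev.elsevier.com.
ARTICLE_ENDPOINT = "https://api.elsevier.com/content/article/doi/"


def build_article_request(doi, api_key):
    """Build (but do not send) a request for one article's full text."""
    url = ARTICLE_ENDPOINT + urllib.parse.quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # assumed header name for the API key
        "Accept": "text/xml",     # ask for an XML response
    }
    return urllib.request.Request(url, headers=headers)


# Placeholder DOI and key, for illustration only.
req = build_article_request("10.1016/j.example.2020.01.001", "YOUR_API_KEY")
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` will generally need to happen from the campus network or VPN so that CMU's entitlements are recognized, and any bulk retrieval should stay within Elsevier's TDM policy.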

You can download CMU-licensed and open-access content for TDM purposes directly from the SpringerLink platform; no registration or API key is required. Downloading may be automated for that purpose, and Springer APIs may be used to identify the content to download.

Limitations: Non-commercial use only. Users should adhere to the Springer TDM policy.

Authenticated CMU users may download and mine files from the following Gale databases for non-commercial purposes:

17th & 18th Century Burney Collection - 1,000 British pamphlets, proclamations, newsbooks and newspapers
18th Century Collection Online - 150,000 18th century books
19th Century British Library Newspapers - Forty-eight 19th century British newspapers
19th Century U.S. Newspapers - 500 19th century U.S. newspapers
De-Classified Documents Reference System - Formerly classified U.S. government documents on international relations since WWII
The Making of the Modern World - Original literature of economics from 1460-1914
Sabin Americana, History & Culture - Full-text works about North, Central & South America, the Arctic & Antarctica and the West Indies, 1500-1926

Files for these collections can be accessed and downloaded here.

Users may distribute snippets of the text and data mining outputs provided that they are accompanied by a DOI link to the full-text article or chapter and the following proprietary notice: "Some rights reserved. This work permits non-commercial use, distribution, and reproduction in any medium, provided the original author and source are credited."

Users may not do any of the following:

Substantially or systematically reproduce, retain, or redistribute the files
Use snippets of text exceeding 200 characters
Extract, develop, or use the data in any direct or indirect commercial activity
Modify, abridge, translate or create derivative works, or remove, obscure, or modify copyright or other notices that appear in the files
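The snippet rules above (a 200-character limit, a DOI link, and the proprietary notice) can be enforced in a small helper before any output is shared. This is a minimal sketch; the function name `format_snippet`, the output layout, and the choice to collapse whitespace before counting characters are assumptions, not part of the license terms.

```python
MAX_SNIPPET_CHARS = 200  # limit stated in the terms above

NOTICE = (
    "Some rights reserved. This work permits non-commercial use, "
    "distribution, and reproduction in any medium, provided the "
    "original author and source are credited."
)


def format_snippet(text, doi):
    """Return a snippet with its DOI link and notice, or raise if too long."""
    # Collapse whitespace so the count reflects the text as distributed.
    snippet = " ".join(text.split())
    if len(snippet) > MAX_SNIPPET_CHARS:
        raise ValueError(
            f"snippet is {len(snippet)} characters; "
            f"the limit is {MAX_SNIPPET_CHARS}"
        )
    return f'"{snippet}"\nhttps://doi.org/{doi}\n{NOTICE}'


# Placeholder text and DOI, for illustration only.
print(format_snippet("A short quoted passage.", "10.1000/example"))
```

Raising an error, rather than silently truncating, leaves the decision about which 200 characters to quote with the researcher.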