CMU LibGuides: Text & Data Mining Resources Guide: Home

Word of caution

Avoid web scraping or downloading large amounts of content from databases to which the library subscribes. Publishers that allow text and data mining will instead provide an API or another means of access. This delivers the data in a stable and secure manner that complies with copyright law and helps the library comply with its subscription license terms. Failure to comply may lock you out of the content and may jeopardize other library users' access.

Acknowledgments

This guide was largely reproduced from Washington University Library's Text and Data Mining Guide with the kind permission of Sarah Swanz, Digital Humanities Librarian & Data Curator.

TDM Resources

Text & Data Mining (TDM) Resources Guide

Text and Data Mining (TDM) resources vary in their accessibility and usage terms. This guide provides information about both library-licensed content and freely available resources for TDM projects.

Important: Always verify current terms of use before beginning any TDM project.

Library-Licensed Content

Commercial Publishers and Databases

Publisher	Content Available	Access Information	Policy/Notes
Elsevier	ScienceDirect	Elsevier developers portal	See Elsevier text and data mining policy
Springer Nature	Licensed & Open Access content	Springer Nature API Portal	See Springer Nature text and data mining policy
JSTOR	JSTOR and Portico content	Constellate text analytics service	-
Wiley	Licensed Content	-	See Wiley Text and Data Mining Guide
Taylor & Francis	Licensed Content	Email support@tandfonline.com	See T&F text and data mining policy
SAGE Journals	Licensed Content	-	See SAGE Text and Data Mining policy
Clarivate Analytics	Web of Science	Web of Science APIs	Access via Clarivate Developers Portal
Adam Matthew Digital	AM Explorer	-	See AM Text and Data Mining Information and Permission

Publishers with Restricted Access

Publisher	Access Status	Notes
ProQuest	Additional Fee Required	Available through ProQuest TDM Studio
Factiva	Restricted	No TDM access available
EBSCO	Restricted	No TDM access available

ProQuest TDM Studio
A web-based text mining platform for accessing and analyzing large amounts of text data. Using content retrieved from ProQuest databases, you can build a corpus and conduct data analysis, text mining, and visualization with your preferred methods to uncover relationships, patterns, and connections within and between datasets, while collaborating with colleagues in real time on one platform. To access a workbench, you must submit a request. To find information on how to do this, visit this guide.
Digital Scholar Lab
Apply natural language processing tools to raw text data (OCR) from Gale Primary Sources in a single research platform. By integrating an unmatched depth and breadth of digital primary source matter with the most popular Digital Humanities (DH) tools, Gale Digital Scholar Lab provides a new lens to explore history and empowers researchers to generate world-altering conclusions and outcomes. The Digital Scholar Lab offers advanced humanities computing tools that make natural language processing (NLP) for historical texts accessible, more efficient, and impactful, thus expanding the footprint of digital humanities across campus.

Index of Open Access disciplinary repositories
In addition to the specific resources listed below, check out this list of Open Access disciplinary repositories if you are looking for scholarly publications.

Freely Available Content

Academic and Research Resources

Resource	Content Available	Access Method	Notes
arXiv	Scholarly articles in physics, math, CS, biology, finance, statistics	arXiv API	Non-peer-reviewed content
PubMed Central	Biomedical and life sciences articles	Text Mining Tools API	Check license status
PLOS	Article corpus and metadata	PLOS API	See PLOS Text and Data Mining home
Crossref	Metadata records	Crossref API	DOI-based access
ORCID	Researcher profiles	ORCID API	Public API available
OpenAlex	A bibliographic catalogue of scientific papers, authors and institutions with over 250 million scholarly works.	OpenAlex API	Public API available
Semantic Scholar	Scientific publication data about authors, papers, citations, venues, and more as well as academic datasets.	API Documentation	API Tutorial
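
As an illustration of API-based access to one of the resources above, the following minimal Python sketch builds a query URL for the public `/works` endpoint of the Crossref REST API and extracts DOIs and titles from a response. The helper names (`build_crossref_url`, `extract_records`), the contact address, and the sample response are illustrative assumptions, not part of this guide or of Crossref's own tooling. The live request is left commented out; check the Crossref documentation for current parameters and usage etiquette before running it.

```python
import json
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_crossref_url(query, rows=5, mailto=None):
    """Build a search URL for the Crossref /works endpoint."""
    params = {"query": query, "rows": rows}
    if mailto:
        # A contact email lets the service identify well-behaved clients.
        params["mailto"] = mailto
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def extract_records(payload):
    """Return (DOI, first title) pairs from a decoded /works response."""
    items = payload.get("message", {}).get("items", [])
    records = []
    for item in items:
        titles = item.get("title") or [""]  # title is a list and may be absent
        records.append((item.get("DOI", ""), titles[0]))
    return records


# Hypothetical, hand-written response so the sketch runs offline.
SAMPLE = json.loads("""
{"status": "ok",
 "message": {"items": [
   {"DOI": "10.0000/example.1", "title": ["An example article"]},
   {"DOI": "10.0000/example.2"}
 ]}}
""")

if __name__ == "__main__":
    print(build_crossref_url("text mining", rows=2, mailto="you@example.edu"))
    print(extract_records(SAMPLE))
    # To query the live service (subject to Crossref's terms of use):
    # from urllib.request import urlopen
    # with urlopen(build_crossref_url("text mining")) as resp:
    #     print(extract_records(json.load(resp)))
```

The same pattern (build a documented query URL, then parse the structured response) applies to most of the APIs in this table, although parameter names and response formats differ for each service.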

Government and Legal Resources

Resource	Content Available	Access Method	Additional Information
Library of Congress	Historical newspapers	Chronicling America API	-
LC for Robots	Digital collections, laws, bibliographic info	API	Multiple collection access
Caselaw Access Project	U.S. federal and state case law	API	See access policy
Congress.gov	Legislative data	Congress.gov API	Includes bills, amendments, reports
National Library of Medicine	Biomedical databases	NLM APIs	Multiple tools available
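
For the Library of Congress resources above, a request to loc.gov can return JSON instead of HTML when the `fo=json` parameter is added. The Python sketch below assumes the `q`, `c` (results per page), and `sp` (page number) parameters and the `results` and `pagination` response keys; verify these against the current loc.gov API documentation. The function names and sample payload are hypothetical and exist only so the example runs without network access.

```python
from urllib.parse import urlencode

LOC_BASE = "https://www.loc.gov"


def build_loc_search_url(query, per_page=25, page=1):
    """Build a loc.gov search URL that asks for a JSON response."""
    params = {"q": query, "fo": "json", "c": per_page, "sp": page}
    return f"{LOC_BASE}/search/?{urlencode(params)}"


def summarize_results(payload):
    """Return (list of item titles, URL of the next page or None)."""
    titles = [item.get("title", "") for item in payload.get("results", [])]
    next_url = payload.get("pagination", {}).get("next")
    return titles, next_url


# Hypothetical, hand-written payload standing in for a real response.
SAMPLE_PAGE = {
    "results": [{"title": "Example item A"}, {"title": "Example item B"}],
    "pagination": {"next": "https://www.loc.gov/search/?q=dust+bowl&fo=json&c=2&sp=2"},
}

if __name__ == "__main__":
    print(build_loc_search_url("dust bowl", per_page=2))
    titles, next_url = summarize_results(SAMPLE_PAGE)
    print(titles)
    print(next_url)
```

Following the `next` link page by page, rather than requesting everything at once, keeps each request small.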

Digital Collections and Archives

Resource	Content Available	Access Method	Usage Notes
HathiTrust	17+ million digitized items	HathiTrust APIs	Includes metadata, images, OCR
Internet Archive	Wayback Machine, Open Library	Developer Portal	Multiple collection access
Project Gutenberg	60,000+ books	Mirror sites	No direct API; use the mirror sites for bulk downloads rather than scraping the main website
Text Creation Partnership	Early English Books Online (EEBO), Eighteenth Century Collections Online (ECCO-TCP), and Evans Early American Imprints (Evans-TCP)	TCP Documentation	See documentation
World Bank	Development data, World Bank operations and financial data, and climate data	Multiple APIs	-
World Digital Library	Primary sources, multiple languages	Multiple access options	-
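
As one example from the table above, the Internet Archive offers a Wayback Machine availability endpoint that reports the archived snapshot closest to a given date. The Python sketch below assumes the `https://archive.org/wayback/available` endpoint with `url` and `timestamp` parameters and an `archived_snapshots.closest` object in the response; the function names and sample payload are hypothetical. Consult the Internet Archive developer documentation for the authoritative details.

```python
from urllib.parse import urlencode

WAYBACK_AVAILABLE = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp=None):
    """Build a Wayback availability query for page_url."""
    params = {"url": page_url}
    if timestamp:
        # Timestamp format is YYYYMMDD, optionally followed by hhmmss.
        params["timestamp"] = timestamp
    return f"{WAYBACK_AVAILABLE}?{urlencode(params)}"


def closest_snapshot(payload):
    """Return (snapshot URL, timestamp) or None if nothing is archived."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url"), closest.get("timestamp")
    return None


# Hypothetical, hand-written payload standing in for a real response.
SAMPLE_RESPONSE = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20200101000000/http://example.com/",
            "timestamp": "20200101000000",
            "status": "200",
        }
    }
}

if __name__ == "__main__":
    print(build_availability_url("example.com", "20200101"))
    print(closest_snapshot(SAMPLE_RESPONSE))
    print(closest_snapshot({"archived_snapshots": {}}))
```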

Important Notes

Always check the current terms and conditions of use before starting a TDM project
Most providers limit use to non-commercial, research purposes
Publishers may impose restrictions, such as rate limits, to prevent server strain
If you do not see the resource you are looking for, please contact your subject librarian about obtaining access or where to find corpora for your research needs.
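
Because providers restrict usage to avoid server strain, scripts that call an API repeatedly should pace their requests. The following is a minimal Python sketch of a client-side throttle that enforces a minimum interval between calls. The `Throttle` class and the 0.1-second interval are illustrative assumptions; the appropriate limit is whatever each provider's terms of use specify.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to honor min_interval; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


if __name__ == "__main__":
    throttle = Throttle(min_interval=0.1)  # placeholder interval
    for page in range(3):
        throttle.wait()
        # A real script would issue one API request here.
        print(f"requesting page {page}")
```

Calling `wait()` immediately before each request keeps the request rate at or below one per `min_interval` seconds, regardless of how quickly the rest of the loop runs.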

Last Updated: 2/20/2025