Carnegie Mellon University Libraries

Text & Data Mining Resources Guide: Home

This guide provides an overview of resources, both freely available and library-licensed, to support your text and data mining (TDM) projects.

Word of caution

Avoid web scraping or bulk downloading content from databases to which the library subscribes. Publishers that allow text and data mining will provide an API or another means of access. Using that channel delivers the data in a stable and secure manner, respects copyright law, and helps the library comply with its subscription license terms. Failure to comply may lock you out of the content and may jeopardize other library users' access.

Acknowledgments


This guide was largely reproduced from Washington University Library's Text and Data Mining Guide with the kind permission of Sarah Swanz, Digital Humanities Librarian & Data Curator.

TDM Resources

Text & Data Mining (TDM) Resources Guide

Text and Data Mining (TDM) resources vary in their accessibility and usage terms. This guide provides information about both library-licensed content and freely available resources for TDM projects.

Important: Always verify current terms of use before beginning any TDM project.

Library-Licensed Content

Commercial Publishers and Databases

Publisher | Content Available | Access Information | Policy/Notes
Elsevier | ScienceDirect | Elsevier developers portal | See Elsevier text and data mining policy
Springer Nature | Licensed & Open Access content | Springer Nature API Portal | See Springer Nature text and data mining policy
JSTOR | JSTOR and Portico content | Constellate text analytics service | -
Wiley | Licensed Content | - | See Wiley Text and Data Mining Guide
Taylor & Francis | Licensed Content | Email support@tandfonline.com | See T&F text and data mining policy
SAGE Journals | Licensed Content | - | See SAGE Text and Data Mining policy
Clarivate Analytics | Web of Science | Web of Science APIs | Access via Clarivate Developers Portal
Adam Matthew Digital | AM Explorer | - | See AM Text and Data Mining Information and Permission

Publishers with Restricted Access

Publisher | Access Status | Notes
ProQuest | Additional Fee Required | Available through ProQuest TDM Studio
Factiva | Restricted | No TDM access available
EBSCO | Restricted | No TDM access available

Freely Available Content

Academic and Research Resources

Resource | Content Available | Access Method | Notes
arXiv | Scholarly articles in physics, math, CS, biology, finance, statistics | arXiv API | Non-peer-reviewed content
PubMed Central | Biomedical and life sciences articles | Text Mining Tools; API | Check license status
PLOS | Article corpus and metadata | PLOS API | See PLOS Text and Data Mining home
CrossRef | Metadata records | CrossRef API | DOI-based access
ORCID | Researcher profiles | ORCID API | Public API available
OpenAlex | A bibliographic catalogue of scientific papers, authors, and institutions with over 250 million scholarly works | OpenAlex API | Public API available
Semantic Scholar | Scientific publication data about authors, papers, citations, venues, and more, as well as academic datasets | API Documentation | API Tutorial
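As an illustration of what API-based access looks like for one of the resources above, the following is a minimal sketch of querying the arXiv API and reading its Atom response. It uses only the Python standard library. The helper names (build_arxiv_url, parse_arxiv_feed, fetch_arxiv) and the sample feed are hypothetical and written for this example; the endpoint and the search_query, start, and max_results parameters follow the public arXiv API documentation, which should be checked for current usage terms before running real queries.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}  # Atom XML namespace


def build_arxiv_url(search_query, start=0, max_results=10):
    """Build a query URL for the arXiv API (hypothetical helper)."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_arxiv_feed(xml_data):
    """Extract id, title, and summary from an Atom feed (str or bytes)."""
    root = ET.fromstring(xml_data)
    records = []
    for entry in root.findall("atom:entry", ATOM):
        record = {}
        for field in ("id", "title", "summary"):
            text = entry.findtext("atom:" + field, default="", namespaces=ATOM)
            record[field] = " ".join(text.split())  # collapse line breaks
        records.append(record)
    return records


def fetch_arxiv(search_query, max_results=10):
    """Perform one live request. Not called here; needs network access."""
    url = build_arxiv_url(search_query, max_results=max_results)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_arxiv_feed(response.read())


# A made-up feed used only to show the parsing step offline.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>A made-up
      title</title>
    <summary>Placeholder abstract.</summary>
  </entry>
</feed>"""

if __name__ == "__main__":
    print(build_arxiv_url("all:text mining", max_results=5))
    print(parse_arxiv_feed(SAMPLE_FEED))
```

Keeping URL construction and parsing separate from the network call makes it possible to test the pipeline on saved responses before sending any requests to the provider.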

Government and Legal Resources

Resource | Content Available | Access Method | Additional Information
Library of Congress | Historical newspapers | Chronicling America API | -
LC for Robots | Digital collections, laws, bibliographic info | API | Multiple collection access
CaseLaw Access Project | U.S. federal and state case law | API | See access policy
Congress.gov | Legislative data | Congress.gov API | Includes bills, amendments, reports
National Library of Medicine | Biomedical databases | NLM APIs | Multiple tools available
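For the Library of Congress entries above, a request might be sketched as follows. This is an assumption-laden illustration, not the library's reference client: the helper names are hypothetical, and the query parameters (q for the search text, fo=json for JSON output, c for results per page, sp for the page number) and the "results" key in the response reflect our reading of the loc.gov JSON API documentation and should be verified there before use.

```python
import urllib.parse

LOC_BASE = "https://www.loc.gov"


def build_loc_url(endpoint, query, per_page=25, page=1):
    """Build a loc.gov JSON request URL (hypothetical helper).

    Assumed parameters: q = search text, fo = output format,
    c = results per page, sp = page number.
    """
    params = {"q": query, "fo": "json", "c": per_page, "sp": page}
    path = endpoint.strip("/")
    return f"{LOC_BASE}/{path}/?" + urllib.parse.urlencode(params)


def titles_from_response(payload):
    """Pull item titles out of a decoded JSON response.

    Assumes the response is a dict with a "results" list whose items
    may carry a "title" key; missing titles become empty strings.
    """
    return [item.get("title", "") for item in payload.get("results", [])]


if __name__ == "__main__":
    print(build_loc_url("search", "civil war letters"))
    # A made-up payload standing in for a real response.
    print(titles_from_response({"results": [{"title": "Example item"}]}))
```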

Digital Collections and Archives

Resource | Content Available | Access Method | Usage Notes
HathiTrust | 17+ million digitized items | HathiTrust APIs | Includes metadata, images, OCR
Internet Archive | Wayback Machine, Open Library | Developer Portal | Multiple collection access
Project Gutenberg | 60,000+ books | Mirror sites | No direct API; scraping allowed
Text Creation Partnership | Early English Books Online (EEBO), Eighteenth Century Collections Online (ECCO-TCP), and Evans Early American Imprints (Evans-TCP) | TCP Documentation | See documentation
World Bank | Development data, World Bank operations and financial data, and climate data | Multiple APIs | -
World Digital Library | Primary sources, multiple languages | Multiple access options | -

Important Notes

  • Always check the current terms and conditions before starting a TDM project.
  • Most providers limit use to non-commercial research purposes.
  • Publishers may limit download volume or request rates to prevent server strain.
  • If you do not see the resource you are looking for, please contact your subject librarian about obtaining access or where to find corpora for your research needs.
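
Because providers may restrict usage to prevent server strain, a TDM script will usually want to pace its own requests. The following is a minimal sketch of one way to do that. The Throttle class is hypothetical and written for this guide, and the 3-second interval is only a placeholder; use whatever limit each provider's terms specify.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable for testing
        self._sleep = sleep               # injectable for testing
        self._last = None                 # time of the previous request

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(min_interval=3.0)  # placeholder interval
    for page in range(3):
        throttle.wait()
        # A real script would send one API request here.
        print("requesting page", page)
```

Calling wait() immediately before each request keeps the pacing logic in one place, so the interval can be adjusted per provider without touching the request code.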

Last Updated: 2/20/2025