Avoid web scraping or downloading large amounts of content from databases to which the library subscribes. Publishers that allow text and data mining typically provide an API or another approved means of access instead. Using these channels delivers the data in a stable and secure manner, complies with copyright law, and keeps the library within its subscription license terms. Failure to comply may lock you out of the content and may jeopardize other library users' access.
This guide was largely reproduced from Washington University Library's Text and Data Mining Guide with the kind permission of Sarah Swanz, Digital Humanities Librarian & Data Curator.
Text and Data Mining (TDM) resources vary in their accessibility and usage terms. This guide provides information about both library-licensed content and freely available resources for TDM projects.
Publisher | Content Available | Access Information | Policy/Notes |
---|---|---|---|
Elsevier | ScienceDirect | Elsevier developers portal | See Elsevier text and data mining policy |
Springer Nature | Licensed & Open Access content | Springer Nature API Portal | See Springer Nature text and data mining policy |
JSTOR | JSTOR and Portico content | Constellate text analytics service | - |
Wiley | Licensed Content | - | See Wiley Text and Data Mining Guide |
Taylor & Francis | Licensed Content | Email support@tandfonline.com | See T&F text and data mining policy |
SAGE Journals | Licensed Content | - | See SAGE Text and Data Mining policy |
Clarivate Analytics | Web of Science | Web of Science APIs | Access via Clarivate Developers Portal |
Adam Matthew Digital | AM Explorer | - | See AM Text and Data Mining Information and Permission |
Publisher | Access Status | Notes |
---|---|---|
ProQuest | Additional Fee Required | Available through ProQuest TDM Studio |
Factiva | Restricted | No TDM access available |
EBSCO | Restricted | No TDM access available |
ProQuest TDM Studio is a web-based text mining platform that allows you to access and analyze large amounts of text data. Using content retrieved from ProQuest databases, you can build your corpus and conduct data analysis, text mining, and visualization using your preferred methods to uncover relationships, patterns, and connections within and between datasets while collaborating with colleagues in real time on one platform. To access a workbench, you must submit a request. To find information on how to do this, visit this guide.
Apply natural language processing tools to raw text data (OCR) from Gale Primary Sources in a single research platform. By integrating an unmatched depth and breadth of digital primary source matter with the most popular Digital Humanities (DH) tools, Gale Digital Scholar Lab provides a new lens to explore history and empowers researchers to generate world-altering conclusions and outcomes. The Digital Scholar Lab offers advanced humanities computing tools that make natural language processing (NLP) for historical texts accessible, more efficient, and impactful, thus expanding the footprint of digital humanities across campus.
In addition to the specific resources listed below, check out this list of Open Access disciplinary repositories if you are looking for scholarly publications.
Resource | Content Available | Access Method | Notes |
---|---|---|---|
arXiv | Scholarly articles in physics, math, CS, biology, finance, statistics | arXiv API | Non-peer-reviewed content |
PubMed Central | Biomedical and life sciences articles | Text Mining Tools API | Check license status |
PLOS | Article corpus and metadata | PLOS API | See PLOS Text and Data Mining home |
CrossRef | Metadata records | CrossRef API | DOI-based access |
ORCID | Researcher profiles | ORCID API | Public API available |
OpenAlex | A bibliographic catalogue of scientific papers, authors and institutions with over 250 million scholarly works. | OpenAlex API | Public API available |
Semantic Scholar | Scientific publication data about authors, papers, citations, venues, and more as well as academic datasets. | API Documentation | API Tutorial |
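
As a rough illustration of API-based access to one of the resources above, the sketch below builds a Crossref REST API query and pulls DOIs and titles out of a JSON response. The helper names (`build_crossref_url`, `extract_titles`, `fetch_json`), the example query, and the contact address are assumptions made for illustration; check the Crossref API documentation for current parameters and usage etiquette before requesting data at scale.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_crossref_url(query, rows=5, mailto="you@example.org"):
    """Build a Crossref works query URL.

    The mailto value is a placeholder; replace it with your own contact address.
    """
    params = {"query": query, "rows": rows, "mailto": mailto}
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def extract_titles(payload):
    """Return (DOI, first title) pairs from a decoded Crossref works response."""
    items = payload.get("message", {}).get("items", [])
    # Crossref stores titles as a list; fall back to None when it is absent or empty.
    return [(item.get("DOI"), (item.get("title") or [None])[0]) for item in items]


def fetch_json(url):
    """Fetch a URL and decode the JSON body (this performs a network request)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)
```

Calling `extract_titles(fetch_json(build_crossref_url("text mining")))` would issue one live request. Keeping URL construction and response parsing separate from the network call makes it easier to test a pipeline offline before touching the service.
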
Resource | Content Available | Access Method | Additional Information |
---|---|---|---|
Library of Congress | Historical newspapers | Chronicling America API | - |
LC for Robots | Digital collections, laws, bibliographic info | API | Multiple collection access |
CaseLaw Access Project | U.S. federal and state case law | API | See access policy |
Congress.gov | Legislative data | Congress.gov API | Includes bills, amendments, reports |
National Library of Medicine | Biomedical databases | NLM APIs | Multiple tools available |
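
For the Congress.gov API listed above, a paged request might be assembled along the following lines. This is a minimal sketch: the `/bill/{congress}` path, the `limit` and `offset` parameters, and the page size of 250 reflect one reading of the public API documentation and should be verified there. The function names and the `DEMO_KEY` value are hypothetical, and you would need to register for your own API key.

```python
import urllib.parse

CONGRESS_API = "https://api.congress.gov/v3"


def build_bill_list_url(congress, api_key, limit=250, offset=0):
    """Build a URL listing bills for one Congress (path and parameters are assumptions)."""
    params = {"format": "json", "limit": limit, "offset": offset, "api_key": api_key}
    return f"{CONGRESS_API}/bill/{congress}?" + urllib.parse.urlencode(params)


def page_offsets(total, limit=250):
    """Return the offsets needed to page through `total` records, `limit` at a time."""
    return list(range(0, total, limit))


# Example: URLs for the first three pages of bills from the 118th Congress.
urls = [build_bill_list_url(118, "DEMO_KEY", offset=o) for o in page_offsets(750)]
```

Requesting pages one at a time, with a pause between requests, keeps usage within the kind of rate limits these services commonly apply.
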
Resource | Content Available | Access Method | Usage Notes |
---|---|---|---|
HathiTrust | 17+ million digitized items | HathiTrust APIs | Includes metadata, images, OCR |
Internet Archive | Wayback Machine, Open Library | Developer Portal | Multiple collection access |
Project Gutenberg | 60,000+ books | Mirror sites | No direct API; use the mirror sites for bulk downloads rather than crawling the main website |
Text Creation Partnership | Early English Books Online (EEBO), Eighteenth Century Collections Online (ECCO-TCP), and Evans Early American Imprints (Evans-TCP) | TCP Documentation | See documentation |
World Bank | Development data, World Bank operations and financial data, and climate data | Multiple APIs | - |
World Digital Library | Primary sources, multiple languages | Multiple access options | - |
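
As one example of the Internet Archive access listed above, the sketch below targets the Wayback Machine availability endpoint, which reports the archived snapshot closest to a given date. The function names are hypothetical, and the response shape shown in the parsing helper is an assumption based on the public documentation, so confirm it against the Developer Portal before relying on it.

```python
import urllib.parse

WAYBACK_AVAILABLE = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp=None):
    """Build an availability query; timestamp is an optional YYYYMMDD string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_AVAILABLE + "?" + urllib.parse.urlencode(params)


def closest_snapshot(payload):
    """Return (snapshot URL, timestamp) from a decoded response, or None if unavailable."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url"), closest.get("timestamp")
    return None
```

The parsing helper returns `None` rather than raising when no snapshot exists, which is convenient when checking a long list of URLs where many may never have been archived.
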
Last Updated: 2/20/2025