Carnegie Mellon University Libraries

Text & Data Mining: Free Data Sources

Designed to introduce you to text & data mining (TDM) at Carnegie Mellon University

Many open sources of text and data are available on the web. The list below is a selection of sources that have come to our attention and/or may not already be included in online directories such as the Open Access Directory's list of data repositories.

Formerly xxx.lanl.gov, arXiv.org started in August 1991 and is now a highly automated electronic archive and distribution server for research articles. Covered areas include physics, mathematics, computer science, nonlinear sciences, quantitative biology, and statistics.

As of August 13, 2015, over 250,000 full-text, peer-reviewed articles included in BioMed Central, Chemistry Central, and SpringerOpen are available for TDM. Instructions and more information are available here.

This online collection offers bulk downloads of OCR text from the Library of Congress's digitized historical newspapers, covering 1836-1922.

The corpora at this site were created by Mark Davies, Professor of Linguistics at Brigham Young University. These are probably the most widely-used corpora 

ECCO-TCP offers freely available, fully searchable, SGML/XML-encoded texts from among the 150,000 titles included in Eighteenth Century Collections Online. ECCO-TCP texts are available in various formats and may be used and shared (read more).

MSU Libraries Humanities Data includes, but is not limited to, digitized and born-digital text, audio, images, moving images, and the metadata that describes them. Current collection strengths are in text and audio data. The collections have been prepared with an eye toward enabling computational analysis at both the micro and macro scale.

The Internet Archive and Open Library offer over 8,000,000 fully accessible texts. Please be sure to read the bulk-download instructions.

The JSTOR Data for Research (DfR) service, freely available to the public, provides text-and-data-mining tools for selecting and interacting with the content in JSTOR. The tools include faceted searching, topic modeling, and data visualization. Researchers can contact JSTOR directly at support@ithaka.org to obtain, view and bulk download document-level datasets, including word frequencies, citations, key terms and ngrams. For more information, see the Data for Research FAQ.
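As a rough illustration of what document-level features such as word frequencies and ngrams look like, the sketch below computes both from a plain string. It does not reflect how JSTOR builds its datasets; the tokenizer and the helper names `tokenize` and `ngram_counts` are assumptions made for this example only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens; returns an empty Counter if n exceeds the token count."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


sample_tokens = tokenize("Text mining and data mining")
word_frequencies = Counter(sample_tokens)        # single-word counts
bigram_frequencies = ngram_counts(sample_tokens, 2)  # two-word sequences
```

Real corpora usually need more careful tokenization (hyphenation, diacritics, stop words), so treat this only as a starting point.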

The New York Times now offers API access to its newspaper content, which can be searched as a whole or by section (see available APIs).

Project Gutenberg was the first producer of free electronic books (ebooks). Its catalog includes nearly 30,000 free books and a grand total of over 100,000 titles. See the Project's Terms of Use.

The Public Library of Science (PLOS) provides two Application Programming Interfaces (APIs):

  • The PLOS Search API enables developers to query the content of PLOS journals and integrate the data into applications for the web, desktop or mobile devices. For more information, see the Search API FAQ.

  • The PLOS Article-Level Metrics (ALM) API gives developers access to data collected by the PLOS Article-Level Metrics application for every article published in a PLOS journal, including usage statistics (e.g., page views, downloads), citation counts, mentions in Wikipedia, activity on social networks and blog coverage.  For more information, see the ALM API FAQ.  The PLOS API Display Policy specifies how data extracted using the PLOS APIs may be displayed.
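A minimal sketch of querying the Search API from Python might look like the following. It assumes the endpoint `https://api.plos.org/search` and Solr-style parameters (`q`, `fl`, `rows`, `wt`), and a JSON response shaped like `{"response": {"docs": [...]}}`; confirm these against the Search API FAQ before relying on them. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify against the current PLOS Search API documentation.
PLOS_SEARCH_URL = "https://api.plos.org/search"


def build_plos_query(term, fields=("id", "title"), rows=10):
    """Build a search URL using Solr-style parameters (assumed, not confirmed by this guide)."""
    params = {"q": term, "fl": ",".join(fields), "rows": rows, "wt": "json"}
    return f"{PLOS_SEARCH_URL}?{urlencode(params)}"


def extract_docs(payload):
    """Pull the list of matching documents out of a decoded JSON response."""
    return payload.get("response", {}).get("docs", [])


def search_plos(term, rows=10):
    """Fetch and decode one page of results (performs a network request)."""
    with urlopen(build_plos_query(term, rows=rows), timeout=30) as response:
        return extract_docs(json.load(response))


example_url = build_plos_query("title:DNA", rows=5)
```

Calling `search_plos("title:DNA")` would then return a list of dictionaries, one per article. Keep request rates modest, and follow the PLOS API Display Policy when showing results.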

PubMed Central offers access to its texts via various freely available mining tools with a focus on the automatic extraction of biological entities (genes, diseases, chemicals, mutations, species) and their relations from free text. In addition, there are "large-scale" literature indexing and text simplification tools and several biomedical corpora with manual annotation (e.g. NCBI Disease Corpus).

The University of Oxford Text Archive (OTA) provides access to electronic literary and linguistic resources, is involved in the development of standards and infrastructure for them, and gives advice on their creation and use. Visit their site to learn more, and read their FAQ and the OTA User Agreement.

The Online Books Page is a website that facilitates access to books that are freely readable over the Internet. It also aims to encourage the development of such online books, for the benefit and edification of all.  Their collections include various repositories, including non-English collections (read more).