CMU LibGuides: Text & Data Mining: Free Data Sources

Formerly xxx.lanl.gov, arXiv.org started in August 1991 and is now a highly automated electronic archive and distribution server for research articles. Covered areas include physics, mathematics, computer science, nonlinear sciences, quantitative biology, and statistics.

As of August 13, 2015, over 250,000 full-text, peer-reviewed articles published by BioMed Central, Chemistry Central, and SpringerOpen are available for TDM. Instructions and more information are available here.

This online collection offers bulk downloads of OCR text from the Library of Congress's digitized historical newspapers, covering 1836 to 1922.

The corpora at this site were created by Mark Davies, Professor of Linguistics at Brigham Young University. These are probably the most widely-used corpora

Freely available, fully searchable SGML/XML-encoded texts drawn from the 150,000 titles included in Eighteenth Century Collections Online. ECCO-TCP texts are available in various formats and may be used and shared (read more).

MSU Libraries Humanities Data includes, but is not limited to, digitized and born-digital text, audio, images, moving images, and the metadata that describes them. Current collection strengths lie in text and audio data. The collections have been prepared with an eye toward enabling computational analysis at both micro and macro scales.

The Internet Archive and Open Library offer over 8,000,000 fully accessible texts. Please be sure to read the bulk-download instructions.

The JSTOR Data for Research (DfR) service, freely available to the public, provides text-and-data-mining tools for selecting and interacting with the content in JSTOR. The tools include faceted searching, topic modeling, and data visualization. Researchers can contact JSTOR directly at support@ithaka.org to obtain, view and bulk download document-level datasets, including word frequencies, citations, key terms and ngrams. For more information, see the Data for Research FAQ.
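For readers unfamiliar with the dataset types named above, the following is a minimal sketch of what word-frequency and n-gram counts look like when computed from raw text. It illustrates the concept only and is not JSTOR's method or file format; the function names `tokenize` and `ngram_counts` and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(text, n=2):
    """Count every run of n consecutive tokens in the text."""
    tokens = tokenize(text)
    # Zip n staggered copies of the token list to form the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


if __name__ == "__main__":
    sample = "Text mining finds patterns in text, and text mining scales."
    print(ngram_counts(sample, n=1).most_common(3))  # word frequencies
    print(ngram_counts(sample, n=2).most_common(3))  # bigram frequencies
```

Setting `n=1` gives plain word frequencies; larger values of `n` give longer phrases at the cost of sparser counts.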

The New York Times now offers API access to its newspaper content, which can be searched as a whole or by section (see available APIs).

Project Gutenberg was the first producer of free electronic books (ebooks). Its catalog includes nearly 30,000 free books and a grand total of over 100,000 titles. See the Project's Terms of Use.

PLOS provides two Application Programming Interfaces (APIs):

The PLOS Search API enables developers to query the content of PLOS journals and integrate the data into applications for the web, desktop or mobile devices. For more information, see the Search API FAQ.

The PLOS Article-Level Metrics (ALM) API gives developers access to data collected by the PLOS Article-Level Metrics application for every article published in a PLOS journal, including usage statistics (e.g., page views, downloads), citation counts, mentions in Wikipedia, activity on social networks and blog coverage. For more information, see the ALM API FAQ. The PLOS API Display Policy specifies how data extracted using the PLOS APIs may be displayed.
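As a rough illustration of querying the Search API, the sketch below builds a request URL without sending it. The endpoint `https://api.plos.org/search` and the Solr-style parameter names (`q`, `fl`, `rows`, `wt`) are assumptions that should be checked against the Search API FAQ; the helper name `build_plos_query` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the PLOS Search API documentation.
PLOS_SEARCH_URL = "https://api.plos.org/search"


def build_plos_query(terms, fields=("id", "title"), rows=10):
    """Return a search URL for the given query terms.

    The parameter names follow Solr conventions and are assumptions:
    q = query, fl = fields to return, rows = result count, wt = format.
    """
    params = {
        "q": terms,
        "fl": ",".join(fields),
        "rows": rows,
        "wt": "json",
    }
    return f"{PLOS_SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URL only; fetching it is left to the caller.
    print(build_plos_query('title:"text mining"', rows=5))
```

Keeping URL construction separate from the network call makes it easy to inspect a query before running it, and to throttle requests in line with whatever usage limits PLOS publishes.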

PubMed Central offers access to its texts via various freely available mining tools that focus on the automatic extraction of biological entities (genes, diseases, chemicals, mutations, species) and their relations from free text. In addition, there are large-scale literature indexing and text simplification tools, as well as several manually annotated biomedical corpora (e.g., the NCBI Disease Corpus).

OTA provides access to electronic literary and linguistic resources, is involved in the development of standards and infrastructure for them, and gives advice on their creation and use. Visit their site to learn more, and read their FAQ and the OTA User Agreement.

The Online Books Page is a website that facilitates access to books that are freely readable over the Internet. It also aims to encourage the development of such online books, for the benefit and edification of all. Its listings draw on various repositories, including non-English collections (read more).