Many open sources of text and data are available on the web. The list below is a selection of sources that have come to our attention and/or may not already be included in online directories such as the Open Access Directory's data repositories.
As of August 13, 2015, over 250,000 full-text, peer-reviewed articles in BioMed Central, Chemistry Central, and SpringerOpen are available for text and data mining (TDM). Instructions and more information are available here.
ECCO-TCP provides freely available, fully searchable, SGML/XML-encoded texts from among the 150,000 titles included in Eighteenth Century Collections Online (ECCO). The texts are available in various formats and may be used and shared (read more).
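As an illustration of working with XML-encoded texts, the following is a minimal Python sketch that pulls plain paragraph text out of a small XML document using the standard library. The sample markup, the element names, and the `extract_paragraphs` helper are hypothetical and heavily simplified; they are not the actual ECCO-TCP schema, which typically uses XML namespaces and a much richer structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample. Real encoded texts are usually
# namespaced and far more detailed than this.
SAMPLE_XML = """<TEI>
  <teiHeader><title>A Hypothetical Title</title></teiHeader>
  <text>
    <body>
      <p>First paragraph of the <hi>encoded</hi> text.</p>
      <p>Second paragraph.</p>
    </body>
  </text>
</TEI>"""


def extract_paragraphs(xml_string):
    """Return the plain text of each <p> element inside <text>/<body>."""
    root = ET.fromstring(xml_string)
    body = root.find("./text/body")
    if body is None:
        return []
    paragraphs = []
    for p in body.iter("p"):
        # itertext() also yields text nested in child elements such as <hi>;
        # split/join collapses runs of whitespace into single spaces.
        paragraphs.append(" ".join("".join(p.itertext()).split()))
    return paragraphs


paragraphs = extract_paragraphs(SAMPLE_XML)
for paragraph in paragraphs:
    print(paragraph)
```

For namespaced files, the element paths would need the namespace prefix or a namespace map passed to `find` and `iter`.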
MSU Libraries Humanities Data includes but is not limited to digitized and born-digital text, audio, images, moving images, and the metadata that describes them. Current collection strengths reside in text and audio data. Their collections have been prepared with an eye toward enabling computational analysis at both the micro and macro scales.
The Internet Archive and Open Library offer over 8,000,000 fully accessible texts. Please be sure to read the bulk-download instructions.
The JSTOR Data for Research (DfR) service, freely available to the public, provides text-and-data-mining tools for selecting and interacting with the content in JSTOR. The tools include faceted searching, topic modeling, and data visualization. Researchers can contact JSTOR directly at support@ithaka.org to obtain, view and bulk download document-level datasets, including word frequencies, citations, key terms and ngrams. For more information, see the Data for Research FAQ.
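To make the dataset types mentioned above more concrete, here is a minimal Python sketch of how word frequencies and n-grams can be computed from raw text. It is an illustration only: the function names and sample sentence are hypothetical, the tokenizer is deliberately naive, and none of this reflects how JSTOR actually produces or formats its datasets.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def ngram_frequencies(tokens, n=2):
    """Count every run of n consecutive tokens (bigrams when n=2)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


sample = "The archive holds the text. The text holds the data."
tokens = tokenize(sample)
words = word_frequencies(tokens)
bigrams = ngram_frequencies(tokens, n=2)

print(words.most_common(3))
print(bigrams.most_common(3))
```

Note that this simple tokenizer ignores sentence boundaries, so a bigram such as "text the" can span two sentences; real pipelines usually handle punctuation, stop words, and non-English characters more carefully.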
The New York Times now offers API access to its newspaper content, which can be searched as a whole or by section (see the available APIs).
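As a rough sketch of what a query against such an API can look like, the Python snippet below builds a request URL for the NYT Article Search API without sending it. The endpoint and the `q`, `page`, and `api-key` parameters follow the publicly documented version 2 interface as we understand it, but they should be checked against the current developer documentation; `build_search_url` and the placeholder key are hypothetical, and filtering by section is not shown.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed endpoint of the Article Search API (version 2); confirm it in the
# current NYT developer documentation before relying on it.
ARTICLE_SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(query, api_key, page=0):
    """Build (but do not send) a search request URL for the given query."""
    params = {"q": query, "page": page, "api-key": api_key}
    return ARTICLE_SEARCH_URL + "?" + urlencode(params)


# "YOUR_API_KEY" is a placeholder; a real key is issued on registration.
url = build_search_url("text mining", "YOUR_API_KEY")
print(url)

# Parse the query string back to confirm the parameters were encoded properly.
print(parse_qs(urlparse(url).query))
```

Separating URL construction from the network call keeps the example testable offline; the resulting URL could then be fetched with any HTTP client, subject to the API's terms and rate limits.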
Project Gutenberg was the first producer of free electronic books (ebooks). Their catalog includes nearly 30,000 free books and a grand total of over 100,000 titles. Here is the Project's Terms of Use.
PLOS provides two Application Programming Interfaces (APIs):
PubMed Central offers access to its texts via various freely available mining tools with a focus on the automatic extraction of biological entities (genes, diseases, chemicals, mutations, species) and their relations from free text. In addition, there are "large-scale" literature indexing and text simplification tools and several biomedical corpora with manual annotation (e.g. NCBI Disease Corpus).
The Oxford Text Archive (OTA) provides access to electronic literary and linguistic resources, is involved in the development of standards and infrastructure for them, and gives advice on their creation and use. Visit their site to learn more, and read their FAQ and the OTA User Agreement.
The Online Books Page is a website that facilitates access to books that are freely readable over the Internet. It also aims to encourage the development of such online books, for the benefit and edification of all. Their collections draw on various repositories, among them non-English collections (read more).