Many open sources of text and data are available on the web. The list below is a selection of sources that have come to our attention or that may not already be included in online directories such as the Open Access Directory's data repositories.
MSU Libraries Humanities Data includes, but is not limited to, digitized and born-digital text, audio, images, moving images, and the metadata that describes them. Current collection strengths reside in text and audio data. The collections have been prepared with an eye toward enabling computational analysis at both the micro and macro scales.
The Internet Archive and Open Library offer over 8,000,000 fully accessible texts. Please be sure to read the bulk-download instructions.
The JSTOR Data for Research (DfR) service, freely available to the public, provides text-and-data-mining tools for selecting and interacting with the content in JSTOR. The tools include faceted searching, topic modeling, and data visualization. Researchers can contact JSTOR directly at email@example.com to obtain, view and bulk download document-level datasets, including word frequencies, citations, key terms and ngrams. For more information, see the Data for Research FAQ.
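For readers unfamiliar with the terms, the short Python sketch below illustrates what word frequencies and ngrams are by computing them from a made-up sentence. It is only an illustration: it does not use or reproduce the DfR dataset format, and the function names (`tokenize`, `ngram_counts`) and the sample text are assumptions introduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (n=1 gives word frequencies)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical sample text, not taken from any JSTOR document.
sample = "the cat sat on the mat and the cat slept"
tokens = tokenize(sample)

word_freq = ngram_counts(tokens, 1)  # single-word counts
bigrams = ngram_counts(tokens, 2)    # two-word sequence counts

print(word_freq.most_common(3))
print(bigrams.most_common(2))
```

A document-level dataset of this kind typically stores such counts per document, so that analysis can proceed without access to the full text.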
The New York Times now offers API access to its newspaper, which can be searched as a whole or by section (see the available APIs).
PubMed Central offers access to its texts via various freely available mining tools with a focus on the automatic extraction of biological entities (genes, diseases, chemicals, mutations, species) and their relations from free text. In addition, there are "large-scale" literature indexing and text simplification tools and several biomedical corpora with manual annotation (e.g. NCBI Disease Corpus).
The Online Books Page is a website that facilitates access to books that are freely readable over the Internet. It also aims to encourage the development of such online books, for the benefit and edification of all. Its collections draw on various repositories, including non-English collections (read more).