
Text & Data Mining: Overview

Designed to introduce you to text & data mining (TDM) at Carnegie Mellon University
The Text & Data Mining Process

Text and data mining (TDM) are becoming increasingly popular ways to conduct research. They entail using automated tools to process large volumes of digital content to identify and select relevant information and discover previously unknown patterns or connections. Text mining extracts information from natural language (textual) sources. Data mining extracts information from structured databases of facts. The extracted information is assembled to reveal new facts or to formulate hypotheses that can be further explored using conventional methods. TDM is useful in many disciplines, from the humanities, where it is used by digital humanities scholars, to the sciences, where useful data can be mined from large non-text datasets and textual databases of published literature.

Note: You may also want to look at Research Data Management for information on dealing with collected data.

Application Programming Interface (API) The technical interface through which users and programs can access and obtain vast quantities of information (text/data/objects) in a machine-readable format.
Corpus A collection of documents such as webpages or journal articles.
Crawling A method that automatically finds links within a website and "scrapes" the information from them (see scraping) so that it can then be "cleaned up" and made machine-readable.
Document Type Definition A set of declarations, written for a mark-up language such as HTML, SGML, or XML, that defines the structure and permitted tags of a document and so shows how the document should be understood by computers.
Entity Refers to a real world thing (e.g. a name).
Extensible Mark-up Language (XML) A web standard for document mark-up, designed to simplify and provide flexibility to Web and other digital media authorship and design. Unlike HTML, it is not a fixed-format language.
Hypertext Mark-up Language A text-based coding language interpreted by web browsers and used to construct web pages.
Information Extraction Automatically isolating specific data (e.g. identity) from unstructured text.
Lemma & Lexeme A lexeme is a unit of meaning that can appear as several word forms; a lemma is the canonical (dictionary) form chosen to represent it. For example, in English, read, reads, and reading are forms of the same lexeme, whose lemma is read.
Machine Learning A mathematical or statistical algorithm that automatically identifies (learns) patterns in data.
Natural Language Processing Software or services facilitating the automatic analysis of text.
Ontology The organization of a specific domain with the entities that belong in it and their relationships.
Web Ontology Language (OWL) A representation of relationships between entities in a way that computers can process them.
Parsing (Linguistic) parsing refers to the process of (syntactic) analysis of text and breaking down a sentence into its component parts (in machine terms, a file can be "parsed" into its component parts).
Relationship Extraction The process of automatically finding "semantic relationships" between two (or more) entities.
Scraping The process of identifying, copying, and pasting information into files that can be later "cleaned up" or made machine-readable.
Semantic Relationship A linguistic relationship between two or more entities expressed in a way that can be understood by a computer.
Sentiment Analysis The extraction of words or phrases that convey opinion or emotion (for example, a positive or negative attitude).
Standard Generalized Mark-up Language The most comprehensive of the mark-up languages; XML and HTML, for example, are derived from it.
Stop List (or stoplist) A set of words automatically omitted from a computer search, concordance, or index because they slow down processing of text or produce false results.
Taxonomy A specific vocabulary that expresses relationships and organizes information in a hierarchical or linear manner.
Text and Data Mining The extraction of natural language works (books or articles, for example) or numeric data (e.g. files or reports) and the use of software that reads and digests digital information to identify relationships and patterns far more quickly than a human can.
Token A token is an individual occurrence of a word (or other unit, such as a punctuation mark) in a text, as opposed to a word type, which is a distinct word counted once. Token counts are used to measure lexical density (the ratio of lexical, or content, words to the total number of tokens). In terms of writing, lexical density measures how informative a text is. Tokenization is the process of splitting text into tokens.
Treebank A corpus of syntactically parsed documents used to train TDM models.
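
To make a few of the glossary terms concrete (token, stop list, lexical density), here is a minimal Python sketch rather than a prescribed method. It assumes a naive regular-expression tokenizer and a tiny hand-picked stop list, and it treats every token that is not on the stop list as a lexical word, which only approximates lexical density. The names tokenize, STOP_LIST, remove_stop_words, and lexical_density are hypothetical and chosen for illustration.

```python
import re

# Hypothetical, deliberately tiny stop list; real stop lists are much longer.
STOP_LIST = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}


def tokenize(text):
    """Split text into lowercase word tokens (naive: letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears on the stop list."""
    return [t for t in tokens if t not in STOP_LIST]


def lexical_density(tokens):
    """Approximate lexical density: share of tokens that are not stop words."""
    if not tokens:
        return 0.0
    return len(remove_stop_words(tokens)) / len(tokens)


sample = "The cat sat on the mat and the dog sat in the sun"
tokens = tokenize(sample)   # every occurrence counts as one token
types = set(tokens)         # each distinct word counts once

print(len(tokens), "tokens,", len(types), "types")
print("content words:", remove_stop_words(tokens))
print("approximate lexical density:", round(lexical_density(tokens), 2))
```

A real project would more likely rely on an established natural language processing library for tokenization and on part-of-speech tags, rather than a stop list, to decide which words count as lexical.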


Since the mid-1980s, technology has propelled text and data mining to prominence across disciplines. Increased interest in the field has surfaced multiple issues, such as copyright, fair use, and commercial viability. In the US, Israel, Taiwan, and other countries, flexible copyright laws allow TDM to be deemed transformative and, thus, lawful under fair use (see, for example, Authors Guild v. Google). Here, we introduce a few links to sources that will shed light on major issues in TDM:

Six Degrees of Francis Bacon

Carnegie Mellon University and Georgetown University have created Six Degrees of Francis Bacon, a groundbreaking digital humanities project that recreates the British early modern social network to trace the personal relationships among figures like Bacon, Shakespeare, Isaac Newton and many others.

From Culturomics, Bookworm is a simple and powerful way to visualize trends in repositories of digitized texts. Users must register for an account to create their own "bookworm."

Another example that uses early modern text is Northeastern University's Women Writers Project. This project allows researchers to study the texts of pre-Victorian women writers in a new way and to trace relationships among them that would be far harder to find through close reading alone.

The Google Books Ngram Viewer displays a graph showing how often the phrases you enter have occurred in a corpus of books (e.g., "British English," "English Fiction," "French") over the selected years.
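
As a rough illustration of the idea behind such a graph (not Google's actual implementation), the Python sketch below counts how often a phrase occurs per year in a tiny made-up corpus. The corpus, the function names ngrams and phrase_frequency_by_year, the whitespace tokenization, and the choice to report each count as a share of all n-grams of the same length in that year are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_frequency_by_year(corpus, phrase):
    """Map each year to the phrase's share of all same-length n-grams in that year."""
    target = tuple(phrase.lower().split())
    n = len(target)
    result = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            # Naive whitespace tokenization; punctuation is not handled.
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        result[year] = counts[target] / total if total else 0.0
    return result


# Made-up corpus: year -> list of texts published that year.
corpus = {
    1900: ["the steam engine changed travel", "a steam engine is loud"],
    1950: ["the jet engine changed travel"],
}

freq = phrase_frequency_by_year(corpus, "steam engine")
for year, share in freq.items():
    print(year, share)
```

Plotting the resulting values against the years would give a trend line of the kind the viewer shows, here for a two-word phrase (a bigram).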

There are also projects that analyze visual data. For example, Yale's Robots Reading Vogue analyzes text and visual images to explore questions in gender studies and other perspectives.

Need additional help?

If you still need additional help, please visit the CMU Libraries' Research Data Services team or contact them directly.