Text & Data Mining: Tools

Designed to introduce you to text & data mining (TDM) at Carnegie Mellon University

There are many text mining tools available. This page provides you with a select list of sources to get you started. Please feel free to suggest your favorite tools to add to this list.

In addition, the Directory of Digital Research Tools (DiRT) aggregates information about digital research tools for scholarly use and makes it easy to find and compare available TDM and visualization resources.

AntConc: A freeware corpus analysis toolkit for concordancing and text analysis (works with Mac OS & Windows).

CasualConc is a concordance program for Mac OS that lets you analyze your own collection of text files (primarily English, though users have reported success with other European languages). It also comes with additional tools that allow you to tag, transcribe, and extract text.
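
To illustrate what a concordancer produces, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not how either tool above is implemented; the function name kwic, the width parameter, and the sample sentence are all assumptions made for this example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword.

    Hypothetical helper for illustration; real concordancers add sorting,
    lemmatization, and corpus-wide statistics.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Made-up sample text, not taken from any corpus.
sample = "The frog sat on a log. A toad watched the frog from the pond."

for line in kwic(sample, "frog", width=20):
    print(line)
```

Each printed line shows one occurrence of the search term with its surrounding words, which is the basic view a concordance tool provides for a whole collection of files.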

Crossref Text and Data Mining Services: Allows researchers to easily harvest full-text documents from all participating publishers, regardless of their business model (e.g., open access, subscription), by maintaining a database of DOIs for its 4000+ publisher members and the bibliographic metadata associated with these DOIs.
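
As a rough sketch of how DOI metadata can lead to full text, the Python below looks up one DOI through the public Crossref REST API and keeps only the links marked for text mining. The field names ("link", "URL", "content-type", "intended-application") reflect the API as we understand it, so check the current Crossref documentation before relying on them. The helper names and the sample record are hypothetical.

```python
import json
import urllib.parse
import urllib.request

def fetch_work(doi):
    """Fetch the metadata record for one DOI (needs network access)."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url) as response:
        return json.load(response)["message"]

def tdm_links(record):
    """Return (content_type, url) pairs whose link is flagged for text mining."""
    return [
        (link.get("content-type", "unspecified"), link["URL"])
        for link in record.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]

# Hypothetical record shaped like an API response; the DOI and URLs are made up.
sample_record = {
    "DOI": "10.5555/example",
    "link": [
        {"URL": "https://example.org/article.xml",
         "content-type": "application/xml",
         "intended-application": "text-mining"},
        {"URL": "https://example.org/article.html",
         "content-type": "text/html",
         "intended-application": "similarity-checking"},
    ],
}

print(tdm_links(sample_record))
```

With a real DOI you would call tdm_links(fetch_work(doi)). Whether the returned links can actually be downloaded still depends on the publisher's license and your institution's subscriptions.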

Gephi: A visualization and exploration software for all kinds of graphs and networks. Gephi is open-source and free. Its applications include exploratory data analysis, link analysis, social and biological network analysis, and poster creation.

GloVe is an unsupervised learning algorithm for obtaining vector representations of words. It is trained on global word-word co-occurrence statistics from a text corpus, and the resulting representations show interesting linear substructures of the word vector space. This tool provides an effective method for measuring the linguistic or semantic similarity of words and reveals relevant words that may lie outside of an average person's vocabulary. For example, the nearest neighbors of the word frog include frogs, toad, litoria, leptodactylidae, rana, lizard, and eleutherodactylus.
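
The similarity lookup described above can be sketched with cosine similarity over word vectors. The three-dimensional vectors below are invented for illustration and are not real GloVe embeddings (pre-trained GloVe vectors typically have 50 to 300 dimensions); the function names are also assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(word, vectors, k=3):
    """Return the k words whose vectors are most similar to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors made up for this example; NOT trained embeddings.
toy_vectors = {
    "frog": [0.9, 0.8, 0.1],
    "toad": [0.85, 0.75, 0.2],
    "lizard": [0.7, 0.4, 0.3],
    "piano": [0.1, 0.2, 0.9],
}

for neighbor, score in nearest("frog", toy_vectors, k=2):
    print(f"{neighbor}: {score:.3f}")
```

With real embeddings, the same ranking step is what surfaces neighbors such as toad or litoria for the query frog.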

Icy is an open community platform for bioimage informatics. It provides the software resources to visualize, annotate, and quantify bioimaging data.

ImageJ (FIJI)

ImageJ is an open source image processing program designed for scientific multidimensional images.
It is highly extensible, with thousands of plugins and macros for performing a wide variety of tasks, and a strong, established user base.  It includes FIJI and related software.

ImagePlot is a free software tool from the Software Studies Initiative that visualizes collections of images and video of any size.

Import.io: Lets you extract data (prices, images, names, addresses, etc.) from a web page by entering the URL for that page into a search box; it transforms the web page into data in seconds.

Juxta: This open-source tool compares and collates multiple witnesses to a single textual work. Originally designed to help scholars and editors examine the history of a text from manuscript to print versions, Juxta offers a number of possibilities for humanities computing and textual scholarship.
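
To give a feel for what collating witnesses involves, the sketch below compares two versions of a sentence word by word using Python's standard difflib module. This is a simplified stand-in and not Juxta's own collation algorithm; the function name collate and the two sample "witnesses" are made up for the example.

```python
import difflib

def collate(base, witness):
    """List the word-level differences between a base text and a witness.

    Each result is (operation, base_words, witness_words), where operation
    is 'replace', 'delete', or 'insert'.
    """
    a, b = base.split(), witness.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            variants.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return variants

# Invented sample readings, not drawn from any real edition.
manuscript = "the quick brown fox jumps over the lazy dog"
print_edition = "the quick red fox leaps over the lazy dog"

for operation, old, new in collate(manuscript, print_edition):
    print(f"{operation}: '{old}' -> '{new}'")
```

A full collation tool goes further by aligning more than two witnesses at once and presenting the variants visually, but the core step is this kind of aligned comparison.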

KNIME: This tool handles all three stages of data processing: extraction, transformation, and loading. Users can create nodes for visualization, and the tool also serves as a platform for analysis, reporting, and integration for machine learning. It is easy to extend with plugins and other functionality. Plenty of data integration modules are already included in the core version.

MALLET (MAchine Learning for LanguagE Toolkit): A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

Paper Machines: This tool is a plugin for Zotero that makes cutting-edge topic-modeling analysis accessible to humanities researchers without requiring extensive computational resources or technical knowledge. It synthesizes several approaches to visualization within a highly accessible user interface.

RapidMiner: Offers advanced analytics through template-based frameworks, so users rarely have to write any code. In addition to data mining, it provides functionality like data preprocessing and visualization, predictive analytics and statistical modeling, evaluation, and deployment.

Voyant: A web-based text reading and analysis environment designed to make it easy to work with your own text or collection of texts in a variety of formats, including plain text, HTML, XML, PDF, RTF, and MS Word. It also allows you to work with pre-defined text collections like Shakespeare.

WordHoard: From Northwestern University, this is an application for the close reading and scholarly analysis of deeply tagged texts. WordHoard contains the entire canon of Early Greek epic in the original and in translation, as well as all of Chaucer, Shakespeare, and Spenser.

WordSeer is a text analysis environment that combines visualization, information retrieval, sensemaking, and natural language processing to make the contents of text navigable, accessible, and useful.

Qualitative Data Analysis Tools

Listed below are some good examples of high-performance Computer Assisted Qualitative Data Analysis Software (CAQDAS) platforms that are free of charge. Some have graphical user interfaces (GUI) and others do not. Some preliminary investigation will be required to determine what will best suit your needs. Note that two of these free programs, Aquad and RQDA, make use of the powerful statistical analysis package R. For Aquad, R is available only as a plug-in, whereas RQDA requires you to install R. R is free to download and supported by a vast user community.

If you are not familiar with R, you may want to reach out to our digital humanities team at dSHARP for assistance.