CMU LibGuides: Artificial Intelligence Research: Find Datasets

Specialized Repositories

Machine Learning Data Repositories

UCI Machine Learning Repository: A collection of databases, domain theories, and data generators used by the machine learning community to empirically analyze machine learning algorithms. It has been widely used by students, educators, and researchers worldwide as a primary source of machine learning data sets.

Hugging Face: An open-source, community-owned collection of AI datasets, models, and applications.

WordNet: A large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
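
As a rough illustration of the synset idea described above, the following Python sketch groups words into concept sets and looks up a word's senses and synonyms. It does not read WordNet itself: the `TOY_SYNSETS` data, the identifiers such as `car.n.01`, and the helper names `build_word_index` and `synonyms` are assumptions made up for this example.

```python
from collections import defaultdict

# Toy data for illustration only; this is NOT loaded from WordNet.
# Each synset is a set of words that express one concept.
TOY_SYNSETS = {
    "car.n.01": {"car", "auto", "automobile", "motorcar"},
    "bank.n.01": {"bank", "riverbank"},
    "bank.n.02": {"bank", "depository"},
}


def build_word_index(synsets):
    """Map each word to the ids of every synset that contains it."""
    index = defaultdict(set)
    for synset_id, words in synsets.items():
        for word in words:
            index[word].add(synset_id)
    return index


def synonyms(word, synsets, index):
    """Return all words that share at least one synset with `word`."""
    result = set()
    for synset_id in index.get(word, ()):
        result |= synsets[synset_id]
    result.discard(word)  # a word is not its own synonym
    return result


WORD_INDEX = build_word_index(TOY_SYNSETS)

# A word with several meanings belongs to several synsets.
print(sorted(WORD_INDEX["bank"]))
print(sorted(synonyms("car", TOY_SYNSETS, WORD_INDEX)))
```

In this toy data, "bank" appears in two synsets because it has two distinct senses, which is the property that lets a synset-based resource separate meanings rather than just spellings.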

Open Data for Deep Learning: A collection of open datasets maintained by Skymind, a model deployment platform.

StateOfTheArt.ai: An entirely community-driven website for tasks, datasets, metrics, or results.

Papers With Code: The mission of Papers With Code is to create a free and open resource with machine learning papers, code, and evaluation tables.

NLP-progress: Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Datasets for Computer Vision

COVE: COVE is an online repository for computer vision datasets sponsored by the Computer Vision Foundation. It is intended to aid the computer vision research community and serve as a centralized reference for all datasets in the field.

Hugging Face: An open-source, community-owned collection of AI datasets, models, and applications. Includes datasets for multimodal, computer vision, natural language processing, and audio tasks.

ImageNet: An image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images.

OpenCV: OpenCV is the world's biggest computer vision library. It's open source, contains over 2500 algorithms, and is operated by the non-profit Open Source Vision Foundation.

Text Mining Collections

CMU-Licensed Data Platforms for Text Mining

Constellate
Note: This database will retire on July 1, 2025. Read full details and learn about service continuation.
Constellate is a comprehensive service that combines the educational materials, data, and tools needed to teach, learn, and perform text analysis, helping faculty advance their own scholarship and develop the data skills their students need for success in their education and employment.
Digital Scholar Lab
Apply natural language processing tools to raw text data (OCR) from Gale Primary Sources in a single research platform. By integrating an unmatched depth and breadth of digital primary source matter with the most popular Digital Humanities (DH) tools, Gale Digital Scholar Lab provides a new lens to explore history and empowers researchers to generate world-altering conclusions and outcomes. The Digital Scholar Lab offers advanced humanities computing tools that make natural language processing (NLP) for historical texts accessible, more efficient, and impactful, thus expanding the footprint of digital humanities across campus.
ProQuest TDM Studio
A web-based text mining platform that allows you to access and analyze large amounts of text data. Using content retrieved from ProQuest databases, you can build your corpus and conduct data analysis, text mining, and visualization using your preferred methods to uncover relationships, patterns, and connections within and between datasets while collaborating with colleagues in real time on one platform. To access a workbench, you must submit a request. To find information on how to do this, visit this guide.

Generalist Data Repositories

Data Repository Collections

re3data.org (Registry of Research Data Repositories): A global registry of research data repositories from all academic disciplines. It provides an overview of existing research data repositories to help researchers identify a suitable repository for their data.

FAIRsharing.org: A curated, informative, and educational resource on data and metadata standards, databases, policies, and collections. Contains many collections in the biomedical field.

CMU-Supported Generalist Data Repositories

KiltHub: CMU's institutional repository. Contains a collection of manuscripts, datasets, presentations, and theses from CMU authors. See this guide for more information.

Open Science Framework: An open-source web platform for researchers to manage their projects and share data. See this guide for more information.

Other Generalist Data Repositories

The following repositories are commonly used open repositories for research data. They contain large amounts of curated data from many disciplines, and include many data types:

Mendeley Data

Figshare

Zenodo

Dryad

Dataverse

IEEE Dataport

Other Specialized Repositories

Nature Scientific Data has an excellent list of recommended subject-specific repositories.

LearnSphere: Integrates existing and new educational data and analysis repositories to offer the world's largest learning analytics infrastructure with methods, linked data, and portal access to relevant resources.

DataShop: A data repository and web application for learning science researchers. It provides secure data storage as well as an array of analysis and visualization tools available through a web-based interface.

United Nations Data Catalog: A comprehensive and representative overview of UN system open data assets.

Data.gov: US government's open data, tools, and resources.

OpenEI: Maintained by CKAN. Includes industry open data.

WorldData.AI: A searchable digital platform that provides access to 3.3 billion curated datasets across macroeconomics, trade, labour statistics, financial markets, weather, health, and demographics. Free for academics. Here is a short article that teaches you how to use it.

COVID-19 Datasets

COVID-19 Text Dataset Collection

During the COVID-19 outbreak, many researchers and healthcare professionals are rapidly publishing findings that help explain the mechanism and epidemiology of SARS-CoV-2 and offer insights and solutions for the COVID-19 pandemic. Below are a few high-quality collections.

COVID-19 Open Research Dataset (CORD-19): A machine-readable, free resource developed by the Allen Institute for AI. Contains over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community. This dataset is intended to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease. The corpus will be updated weekly as new research is published in peer-reviewed publications and archival services like bioRxiv, medRxiv, and others.

WHO global research and publications database on COVID-19: Latest scientific findings and knowledge on coronavirus disease (COVID-19), together with a searchable WHO database of publications.

LitCovid: A curated literature hub for tracking up-to-date scientific information about the 2019 novel coronavirus. Currently contains a collection of more than 1,200 journal articles hosted by the National Library of Medicine.

COVID-19 Case Dataset Collection

Johns Hopkins University COVID-19 data: The data repository for the COVID-19 Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL). Aggregated from many data sources, including WHO, CDC, Worldometers, the New York Times, and many more.