Carnegie Mellon University Libraries

Machine Learning and AI: Find Datasets

Let me know what you need!

I am in the process of collecting high-quality databases and datasets. Please contact me at huajinw@cmu.edu to suggest new resources. My hope is that these lists will grow every day!

Generalist Data Repositories

Data Repository Collections

re3data.org (Registry of Research Data Repositories): A global registry of research data repositories from all academic disciplines. It provides an overview of existing research data repositories to help researchers identify a suitable repository for their data.

FAIRsharing.org: A curated, informative, and educational resource on data and metadata standards, databases, policies, and collections. Contains many collections in the biomedical field.

Generalist Data Repositories

CMU-supported platforms: 

KiltHub: CMU's institutional repository. Contains a collection of manuscripts, datasets, presentations, and theses from CMU authors. See this guide for more information.

Open Science Framework: An open source web platform for researchers to manage their projects and share data. See this guide for more information. 

 

The following open repositories are commonly used for research data. They contain large amounts of curated data from many disciplines and include many data types:

FigShare

Zenodo

Dryad Digital Repository

Harvard Dataverse

NYU Data Catalog: An open repository maintained by the NYU medical school. It includes datasets generated by NYU researchers as well as publicly available and licensed datasets generated at external organizations, e.g., the Bureau of Labor Statistics.

 

Specialized Repositories

Machine Learning Data Repositories

UCI Machine Learning Repository: A collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. It has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets.

WordNet: A large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

ImageNet: An image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images.

Open Data for Deep Learning: Maintained by Skymind, a model deployment platform. Has a collection of open datasets.
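To make the WordNet and ImageNet entries above more concrete, here is a minimal Python sketch of the idea behind them: words grouped into synsets, synsets linked into a hierarchy, and images attached to nodes of that hierarchy. All identifiers, words, and file names below are invented for illustration and are not taken from either database; the function names `path_to_root` and `images_under` are hypothetical and do not correspond to any WordNet or ImageNet API.

```python
# Toy data, invented for illustration only (not real WordNet/ImageNet content).
# Each synset groups synonymous words and points to a more general parent.
SYNSETS = {
    "dog.n.01": {"words": ["dog", "domestic dog"], "parent": "canine.n.01"},
    "canine.n.01": {"words": ["canine", "canid"], "parent": "animal.n.01"},
    "cat.n.01": {"words": ["cat", "true cat"], "parent": "feline.n.01"},
    "feline.n.01": {"words": ["feline", "felid"], "parent": "animal.n.01"},
    "animal.n.01": {"words": ["animal", "creature"], "parent": None},
}

# Images are attached to nodes of the hierarchy (here, only to leaf nodes).
IMAGES = {
    "dog.n.01": ["dog_001.jpg", "dog_002.jpg"],
    "cat.n.01": ["cat_001.jpg"],
}


def path_to_root(synset_id):
    """Return the chain of synsets from the given node up to the root."""
    path = []
    while synset_id is not None:
        path.append(synset_id)
        synset_id = SYNSETS[synset_id]["parent"]
    return path


def images_under(synset_id):
    """Return all images attached to a node or to any of its descendants."""
    return sorted(
        image
        for node, files in IMAGES.items()
        if synset_id in path_to_root(node)
        for image in files
    )


if __name__ == "__main__":
    print(path_to_root("dog.n.01"))
    print(images_under("animal.n.01"))
```

Asking for the images under a general concept such as the toy "animal.n.01" node collects the images of every more specific concept beneath it, which is roughly why a hierarchy-organized image database is convenient for training classifiers at different levels of granularity.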

 

Other Specialized Repositories

Nature Scientific Data has a very good list of recommended subject-specific repositories. 

 
Below is a list of additional resources:

United Nations Data Catalog: A comprehensive and representative overview of UN system open data assets. 

Data.gov: The US government's open data, tools, and resources.

OpenEI: Built on the CKAN open-source data portal software. Includes industry open data.

Namara: A data discovery platform. Has some collections of open data.