
Machine Learning and AI: Research Reproducibility


Questions? Suggestions?

Email me, or schedule a one-on-one research consultation.

Huajin Wang
Mellon Institute Library
4th Floor Mellon Institute


JupyterLab Is Now Ready for Users!


JupyterLab is an interactive development environment for working with notebooks, code and data. It enables you to use text editors, terminals, data file viewers, and other custom components side by side with notebooks in a tabbed work area.

Main features: 

  • Drag-and-drop to reorder notebook cells and copy them between notebooks.
  • Run code blocks interactively from text files (.py, .R, .md, .tex, etc.).
  • Link a code console to a notebook kernel to explore code interactively without cluttering up the notebook with temporary scratch work.
  • Edit popular file formats with live preview, including Markdown, JSON, CSV, Vega, and VegaLite.

Essential Readings on Research Reproducibility

What is Research Reproducibility?


The "Reproducibility Crisis" and What Can We Do? 


Simple Rules to Enhance Reproducibility

Reproducible Workflow for Biomedical Research

"Reproducibility: automated.": use a "continuous analysis" workflow to automate and containerize data analysis steps, allowing others to easily reproduce and build on the results.

Reproducibility in the Research Life Cycle

Stage 1: Designing and Planning

Organizing literature with reference managers: Mendeley or Zotero

  • Collect articles and PDFs from web browsers as you discover them

  • Organize, read and annotate in one place

  • Share with your group

  • Prepare references for Microsoft Word or LaTeX (BibTeX) with ease

Use an Electronic Lab Notebook (ELN)

  • Document study design, reagents, procedures, data analysis, images, and other results in one platform
  • Searchable and discoverable
  • Great tool for note-keeping, lab management, and collaboration
  • Which one to pick?
    • Assess your experimental needs

    • Institutional support (We are in the process of evaluating ELNs and purchasing a license. Feedback welcomed!)

Use a project management platform: Open Science Framework

Plan ahead and write the Data Management Plan (DMP)

  • Required by many funders and publishers
  • Before starting a project, think about the research question, sample collection methods, statistical power, software and hardware tools, project management and documentation, and result sharing and dissemination
  • Find more about DMP basics here

Stage 2: Collecting and Analyzing Data

Follow good data management practices to avoid losing your work

  • Follow good file naming conventions: use meaningful names, and avoid spaces and special characters
  • Document metadata
  • Consider file security
  • Back up following the 3-2-1 rule
    • 3 copies of your data - 2 copies are not enough
    • 2 different formats - e.g. hard drive + tape backup, or DVD (short term) + flash drive
    • 1 off-site backup - have 2 physical backups and one in the cloud
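
The file naming and backup advice above can be partly automated. The following is a minimal Python sketch, not a prescribed tool: the function names `safe_name`, `sha256_of`, and `copies_match` are hypothetical, and the naming rule used here (ASCII letters, digits, hyphens, and underscores only) is one assumed convention among many. It shows how a file name might be cleaned and how backup copies might be checked against the original with a checksum.

```python
import hashlib
import re
import unicodedata
from pathlib import Path


def safe_name(name: str) -> str:
    """Return a file name with spaces and special characters replaced."""
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = name, ""
    # Drop accents, then keep only letters, digits, hyphens, and underscores.
    stem = unicodedata.normalize("NFKD", stem).encode("ascii", "ignore").decode()
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")
    return f"{stem}.{ext.lower()}" if ext else stem


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large data files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original: Path, copies: list[Path]) -> bool:
    """True when every backup copy has the same checksum as the original."""
    expected = sha256_of(original)
    return all(sha256_of(copy) == expected for copy in copies)
```

For example, `safe_name("Final results (v2)!.CSV")` returns `Final_results_v2.csv`. Running `copies_match` after each backup gives a quick check that the copies are byte-identical to the original; it does not confirm that a copy is stored off-site or on a different medium.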

Use an ELN for note-taking and OSF for project management (see above)

Use Literate Programming to weave together text, code, and visualization

Use Version Control

  • Git and GitHub

Reproducible computational environment

  • Docker
  • Free research computing allocations at PSC Bridges
    • A data- and memory-intensive system designed to integrate HPC with Big Data
    • Supports a high degree of interactivity, science gateways, and a very flexible user environment
    • Many popular applications for simulation, machine learning and data analytics already installed and running
    • Available at no charge to the open research community
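
Even without a container, recording the software environment alongside results helps others rebuild it. The sketch below is one possible approach, using only the Python standard library; the function name `environment_snapshot` and the layout of its output are assumptions made for illustration. It complements a tool such as Docker rather than replacing it, since it records versions but does not recreate the environment.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot() -> dict:
    """Collect interpreter, OS, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items(), key=lambda kv: kv[0].lower())),
    }


if __name__ == "__main__":
    # Print the snapshot as JSON so it can be saved next to the analysis output.
    json.dump(environment_snapshot(), sys.stdout, indent=2)
```

Saving this output with each set of results, and committing it to version control, makes it easier to tell later which package versions produced a given figure or table.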

Stage 3+4: Publishing, Archiving, and Sharing your work

Collaborative writing tool for LaTeX: Overleaf (CMU license)

  • Version-controlled, web-based platform that allows multiple authors to work simultaneously
  • Many tutorials available
  • Many style templates for specific journals, presentations, reports
  • Format and insert citations with ease using .bib files

Publish in open access journals


CMU’s institutional repository: KiltHub

  • Repository for many forms of research products, including papers, posters, datasets, videos, etc.
  • Every item gets assigned a DOI
  • Powered by Figshare, indexed by Google Search, and usually ranks high in search results

Other Generalist or Subject-specific Open Repositories

Repositories for Computational Reproducibility

  • Code Ocean
    • A cloud-based computational reproducibility platform
    • Preserve your code, data, and computational environment in a capsule and get a DOI
    • Let others easily run your code in the cloud and share it privately or publicly
    • Use a widget to embed a working copy of your code directly into any webpage, including your personal site
    • Free version available, with limited features 
  • Software and Data Artifacts in the ACM Digital Library
    • A repository put together by ACM’s Reproducibility Task Force
    • Encourages authors to submit software and data sets with their papers.