Carnegie Mellon University Libraries

Data Management for Research

FAIR Principles for Computational Workflows

What Are FAIR Principles?

The FAIR principles stand for Findable, Accessible, Interoperable, and Reusable. Originally developed for research data, these principles have been adapted specifically for computational workflows to maximize their value as research assets and facilitate their adoption by the wider research community.

Key Definitions

Computational Workflow

Software with two main characteristics:

  1. Composition of multiple components (software, workflows, code snippets, tools, services)
  2. Explicit abstraction from the mechanics of execution, expressed in a high-level workflow language that specifies the data flow between components

Workflow Specification

The formal specification of data flow and execution control between executable components, expected datasets, and parameter files.

Workflow Run

The instantiation of the workflow with inputs (parameters, input datasets) and outputs (output data, provenance execution log, lineage of data products).

Workflow Management System (WMS)

Software that handles data flow and/or execution control, abstracting the workflow from underlying digital infrastructure (examples: Nextflow, Galaxy, Snakemake, Parsl).

The FAIR Principles for Computational Workflows

🔍F - Findable

F1. Workflow assigned globally unique and persistent identifier

What it means: Your workflow needs a permanent, unique "address" on the internet

How to implement:

  • Use DOI (Digital Object Identifier) through repositories like Zenodo, WorkflowHub
  • Register workflows in workflow registries (WorkflowHub, Dockstore)
  • Ensure identifiers persist even if hosting changes

F1.1. Components assigned distinct identifiers

What it means: Each part of your workflow (scripts, tools, sub-workflows) needs its own identifier

How to implement:

  • Version control individual components
  • Use container registries for Docker/Singularity containers
  • Reference specific versions of external tools and datasets

F1.2. Different versions assigned distinct identifiers

What it means: Each version of your workflow gets a unique identifier

How to implement:

  • Use semantic versioning (v1.0.0, v1.1.0, etc.)
  • Tag releases in Git repositories
  • Create new DOIs for major versions
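
The version-tagging steps above can be sketched in a few lines of Python: semantic version tags compare correctly once they are parsed into integer tuples (naive string comparison would rank "v1.10.0" below "v1.2.0"). The tags below are illustrative.

```python
def parse_semver(tag: str) -> tuple[int, int, int]:
    """Parse a tag like 'v1.2.0' into a (major, minor, patch) tuple."""
    major, minor, patch = tag.lstrip("v").split(".")
    return (int(major), int(minor), int(patch))

# Tuples compare element-wise, so version ordering falls out naturally.
releases = ["v1.0.0", "v1.10.0", "v1.2.0"]
latest = max(releases, key=parse_semver)
print(latest)  # v1.10.0
```

A Git tag created per release (and a Zenodo DOI minted for major versions) gives each of these identifiers a persistent home.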

F2. Workflow described with rich metadata

What it means: Comprehensive information about your workflow's purpose, requirements, and usage

How to implement:

  • Document workflow purpose and scientific application
  • List computational requirements and dependencies
  • Provide example input/output data
  • Include author information and creation date

F3. Metadata explicitly includes workflow identifier

What it means: The description clearly states which workflow it describes

How to implement:

  • Include DOI/identifier in README files
  • Reference identifier in documentation
  • Use structured metadata formats (schema.org, Bioschemas)
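
As a minimal sketch of the structured-metadata approach, the snippet below builds a schema.org-style JSON-LD description whose `@id` field carries the workflow's identifier, so the metadata explicitly states which workflow it describes. Every field value here is a placeholder for your own workflow's details, including the DOI.

```python
import json

# Minimal schema.org/Bioschemas-style description; all field values are
# placeholders (including the DOI) for your own workflow's details.
metadata = {
    "@context": "https://schema.org",
    "@type": "SoftwareSourceCode",
    "@id": "https://doi.org/10.5281/zenodo.0000000",  # placeholder DOI
    "name": "example-workflow",
    "codeRepository": "https://github.com/your-lab/example-workflow",
    "programmingLanguage": "Python",
    "license": "https://spdx.org/licenses/MIT",
    "version": "1.0.0",
}

# Ship this alongside the code (e.g. in the repository or an RO-Crate)
# so the identifier travels with the description.
print(json.dumps(metadata, indent=2))
```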

F4. Registered in searchable FAIR resource

What it means: Your workflow can be found through search engines and registries

How to implement:

  • Submit to WorkflowHub
  • Register in Dockstore
  • Use institutional repositories with metadata harvesting

🔓A - Accessible

A1. Retrievable by identifier using standardized protocol

What it means: Anyone can download your workflow using standard web protocols

How to implement:

  • Host on platforms using HTTPS
  • Provide direct download links
  • Ensure stable URLs that don't break

A1.1. Protocol is open, free, and universally implementable

What it means: No special software needed to access your workflow

How to implement:

  • Use HTTPS (not proprietary protocols)
  • Avoid platform-specific access methods
  • Provide standard file downloads

A1.2. Authentication/authorization when necessary

What it means: If access restrictions are needed, use standard authentication

How to implement:

  • Use institutional single sign-on (SSO)
  • Implement standard OAuth protocols
  • Document access requirements clearly

A2. Metadata accessible even when workflow unavailable

What it means: Description remains available even if workflow can't be run

How to implement:

  • Store metadata separately from workflow code
  • Use long-term preservation repositories
  • Maintain documentation in multiple locations

🔗I - Interoperable

I1. Use formal, accessible language for knowledge representation

What it means: Use standard formats that both humans and computers can understand

How to implement:

  • Write workflows in standard languages (CWL, WDL, Nextflow DSL)
  • Use structured metadata formats (JSON-LD, RDF)
  • Follow established workflow description standards

I2. Use vocabularies following FAIR principles

What it means: Use standardized terms and classifications

How to implement:

  • Use domain-specific ontologies (EDAM for bioinformatics)
  • Apply Bioschemas markup for life sciences
  • Reference standard vocabulary resources

I3. Components read/write data meeting domain standards

What it means: Your workflow uses standard file formats and data structures

How to implement:

  • Use standard file formats (CSV, JSON, standard domain formats)
  • Document input/output specifications
  • Ensure compatibility with common tools
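
A small illustration of standard-format I/O: the sketch below reads CSV rows and re-emits them as JSON records, so either representation can feed a downstream tool. The tiny in-memory table stands in for a real input file.

```python
import csv
import io
import json

# A tiny in-memory CSV standing in for a real input file.
raw = "sample,count\nA,10\nB,7\n"

# Read CSV rows into dictionaries, then emit them as JSON records so
# downstream tools can consume either standard format.
rows = list(csv.DictReader(io.StringIO(raw)))
print(json.dumps(rows))
```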

I4. Include qualified references to other objects/components

What it means: Clear links to external tools, datasets, and related workflows

How to implement:

  • Cite tool versions with DOIs when available
  • Reference datasets with persistent identifiers
  • Link to related workflows and publications

♻️R - Reusable

R1. Described with accurate and relevant attributes

What it means: Complete documentation enabling others to understand and use your workflow

How to implement:

  • Write comprehensive README files
  • Document installation and execution instructions
  • Provide example usage scenarios
  • Include troubleshooting guides

R1.1. Released with clear and accessible license

What it means: Legal terms for using and modifying your workflow are explicit

How to implement:

  • Choose appropriate open source license (MIT, Apache 2.0, GPL)
  • Include LICENSE file in repository
  • Clearly state licensing terms in documentation

R1.2. Components have clear licenses

What it means: All parts of your workflow have explicit licensing

How to implement:

  • Document licenses of all dependencies
  • Ensure license compatibility
  • Include license information for containers and external tools

R1.3. Associated with detailed provenance

What it means: Clear history of workflow development and data lineage

How to implement:

  • Maintain version history in Git
  • Document workflow development process
  • Include provenance tracking in workflow runs
  • Link to source publications and datasets
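
A minimal sketch of what per-step provenance capture can look like, assuming you record it yourself rather than relying on a WMS: each entry ties a step name to a timestamp, a checksum of its input, its parameters, and the software environment. The step name and parameters below are illustrative.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone

def provenance_record(step: str, input_bytes: bytes, params: dict) -> dict:
    """Build a minimal provenance entry for one workflow step."""
    return {
        "step": step,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "parameters": params,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }

# Illustrative step: appending one record per step yields a run log.
record = provenance_record("normalize", b"raw data", {"method": "zscore"})
print(json.dumps(record, indent=2))
```

Writing these records to a JSON log per run gives later users (and your future self) the lineage of every data product.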

R2. Include qualified references to other workflows

What it means: Clear connections to related workflows and dependencies

How to implement:

  • Reference parent workflows or templates
  • Link to sub-workflows with specific versions
  • Document workflow ecosystem relationships

R3. Meet domain-relevant community standards

What it means: Follow best practices specific to your research field

How to implement:

  • Use field-specific workflow languages and tools
  • Follow community guidelines (e.g., Galaxy Tool Shed guidelines)
  • Implement domain-specific quality checks

Workflow Complexity Spectrum

Computational workflows exist on a spectrum of complexity, and the implementation of FAIR principles can vary depending on the scale and sophistication of your workflow.

Simple Workflows (Without WMS)

Simple Workflow Characteristics
  • Scale: Small datasets, few processing steps (2-10 steps)
  • Implementation: Scripts (Bash, Python, R), Jupyter notebooks, simple pipelines
  • Infrastructure: Single machine, minimal computational requirements
  • Examples: Data cleaning scripts, basic analysis pipelines, small-scale data transformations

FAIR Implementation for Simple Workflows:

  • Findability: Focus on clear documentation and version control (Git repositories)
  • Accessibility: Share via GitHub/GitLab with direct download links
  • Interoperability: Use standard file formats (CSV, JSON), document dependencies clearly
  • Reusability: Provide clear README files, example data, and licensing

Practical Steps:

  1. Version Control: Use Git from day one, even for single scripts
  2. Documentation: Write clear README with usage examples
  3. Dependencies: Use requirements.txt (Python) or similar dependency files
  4. Testing: Include sample input/output data
  5. Licensing: Add a simple license file (MIT, Apache 2.0)
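
The steps above can be sketched as a single-file script whose logic lives in a small, testable function, with an argparse interface that doubles as usage documentation. All names (the cleaning rule, the CLI arguments) are illustrative.

```python
import argparse
import csv
import io

def clean_rows(rows):
    """Drop rows with any empty field -- the script's testable core."""
    return [r for r in rows if all(v.strip() for v in r.values())]

def build_parser() -> argparse.ArgumentParser:
    """A documented CLI doubles as usage documentation for the README."""
    parser = argparse.ArgumentParser(description="Remove incomplete rows from a CSV file.")
    parser.add_argument("infile", help="input CSV with a header row")
    parser.add_argument("outfile", help="destination for the cleaned CSV")
    return parser

# Demonstration on a small in-memory sample (step 4: sample data).
sample = "id,value\n1,ok\n2,\n3,fine\n"
rows = list(csv.DictReader(io.StringIO(sample)))
print(clean_rows(rows))  # keeps rows 1 and 3
```

Because the core logic is a plain function, the sample data above can serve directly as the test case you commit alongside the script.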

Complex Workflows (With WMS)

Complex Workflow Characteristics
  • Scale: Large datasets, many processing steps (10+ steps), parallel processing
  • Implementation: Workflow Management Systems (Nextflow, Snakemake, Galaxy, CWL)
  • Infrastructure: Multi-node clusters, cloud computing, HPC environments
  • Examples: Genomics pipelines, climate modeling, machine learning pipelines, multi-omics analysis

FAIR Implementation for Complex Workflows:

  • Findability: Use workflow registries (WorkflowHub, Dockstore), structured metadata
  • Accessibility: Containerization (Docker/Singularity), cloud deployment
  • Interoperability: Standard workflow languages (CWL, WDL), formal metadata schemas
  • Reusability: Comprehensive documentation, provenance tracking, modular design

Practical Steps:

  1. Workflow Language: Choose established WMS (Nextflow, Snakemake, etc.)
  2. Containerization: Package all dependencies in containers
  3. Registry Publication: Submit to WorkflowHub or Dockstore
  4. Metadata Standards: Use Bioschemas, schema.org, or domain-specific ontologies
  5. Provenance Tracking: Implement automatic execution logging
  6. Testing: Include continuous integration, multiple test datasets
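
The testing step can be approximated with a checksum-based regression check, a hedged sketch of what a CI job might run: compare each pipeline output against known-good checksums recorded from a reference run on a small test dataset. File names and contents here are illustrative.

```python
import hashlib

def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def check_outputs(outputs: dict[str, bytes], expected: dict[str, str]) -> list[str]:
    """Compare produced outputs against recorded checksums; return failures."""
    failures = []
    for name, checksum in expected.items():
        if sha256_bytes(outputs.get(name, b"")) != checksum:
            failures.append(name)
    return failures

# In CI, 'outputs' would come from a pipeline run on a test dataset and
# 'expected' from a checked-in file of known-good checksums.
outputs = {"counts.tsv": b"gene\tcount\nTP53\t42\n"}
expected = {"counts.tsv": sha256_bytes(b"gene\tcount\nTP53\t42\n")}
print(check_outputs(outputs, expected))  # an empty list means the run matches
```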

Comparison Table

Below is a table that illustrates common features for simple and complex workflows:

Aspect          | Simple Workflows        | Complex Workflows
Scale           | < 1 GB data, < 10 steps | > 1 GB data, 10+ steps
Time Investment | Hours to days           | Weeks to months
FAIR Complexity | Basic implementation    | Full implementation
Primary Tools   | Git, GitHub, basic docs | WMS, registries, containers
Metadata        | README files, comments  | Structured schemas, ontologies
Testing         | Manual testing          | Automated CI/CD
Deployment      | Local execution         | Multi-platform deployment
Maintenance     | Occasional updates      | Ongoing maintenance

Consider a WMS When:
  • Your workflow has more than 5-10 interconnected steps
  • You need to process multiple datasets with the same pipeline
  • You require parallel processing or cluster computing
  • You collaborate with multiple researchers or institutions
  • You need detailed provenance tracking
  • Your workflow takes more than a few hours to run
  • You plan to publish your methodology

Quick Implementation Checklist

For Simple Workflows (Scripts/Notebooks)

Before You Start

  • Choose appropriate scripting language and libraries
  • Plan your data processing steps
  • Identify input/output file formats

During Development

  • Use version control (Git) from the beginning
  • Write clear comments in your code
  • Use meaningful variable and file names
  • Test with small sample datasets

For Sharing

  • Create clear README with usage instructions
  • Add LICENSE file (recommend MIT or Apache 2.0)
  • Include requirements/dependencies file
  • Provide example input and expected output
  • Upload to GitHub/GitLab public repository
  • Consider Zenodo integration for DOI

For Complex Workflows (WMS-based)

Before You Start

  • Choose appropriate workflow management system
  • Plan your workflow architecture and components
  • Identify required licenses and dependencies
  • Design containerization strategy

During Development

  • Use version control (Git) from the beginning
  • Document as you develop
  • Use standard file formats and naming conventions
  • Test with example data
  • Implement modular, reusable components
  • Set up continuous integration testing

For Publication

  • Create comprehensive README and documentation
  • Add LICENSE file
  • Test installation instructions on clean systems
  • Prepare example datasets and test cases
  • Register in appropriate workflow registry
  • Obtain DOI through registry or Zenodo
  • Create structured metadata (Bioschemas, schema.org)

After Publication

  • Monitor for issues and provide support
  • Update documentation as needed
  • Create new versions with distinct identifiers
  • Maintain long-term accessibility

Practical Tools and Resources

Workflow Registries

  • WorkflowHub (https://workflowhub.eu/): Multi-domain workflow registry
  • Dockstore (https://dockstore.org/): Container and workflow sharing platform
  • Galaxy ToolShed: For Galaxy workflows and tools

Repository Services

  • Zenodo (https://zenodo.org/): General-purpose research repository with DOIs
  • GitHub/GitLab: Version control with release management
  • Institutional repositories: KiltHub, CMU's institutional repository, is an instance of Figshare; both follow FAIR principles.

Metadata Standards

  • schema.org: General structured metadata
  • Bioschemas: Life sciences extension of schema.org
  • CodeMeta: Software metadata standard
  • CWL: Common Workflow Language for portable workflows

Containerization

  • Docker Hub: Container registry
  • Singularity Hub (now archived): legacy scientific container registry; Singularity/Apptainer images are now typically hosted on OCI registries or the Sylabs Cloud Library
  • GitHub Container Registry: Integrated with GitHub

Common Challenges and Solutions

Challenge: Complex Dependencies

Solution: Use containerization (Docker, Singularity) to package dependencies

Challenge: Large Data Files

Solution: Use data repositories (Zenodo, domain-specific archives) and reference by DOI

Challenge: Platform-Specific Code

Solution: Use workflow management systems that abstract execution environment

Challenge: Evolving Software Dependencies

Solution: Pin specific versions and use containers for reproducibility
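
One lightweight way to act on this, sketched below: compare the versions actually installed in an environment against your pins using the standard-library `importlib.metadata`. The pinned package name is illustrative.

```python
from importlib import metadata

# Pins as they might appear in a requirements.txt with '=='; the package
# name and version here are illustrative.
pins = {"numpy": "1.26.4"}

def check_pins(pins: dict[str, str]) -> dict[str, str]:
    """Report packages whose installed version differs from the pin."""
    mismatches = {}
    for package, wanted in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = "not installed"
        if installed != wanted:
            mismatches[package] = installed
    return mismatches

print(check_pins(pins))  # empty dict when the environment matches the pins
```

Running a check like this at workflow start (or in CI) catches environment drift before it silently changes results; containers then freeze the verified environment.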

Challenge: Limited Documentation Time

Solution: Start with minimal documentation and improve iteratively

Examples of FAIR Workflow Implementation

Simple Workflow Example: Data Processing Script

Example: Gene Expression Analysis Script
  • Type: Python script for differential expression analysis
  • Scale: Single-file script, processes < 100MB data
  • Repository: GitHub repository with direct download
  • FAIR Implementation:
    • F1: GitHub release with version tags (v1.0.0, v1.1.0)
    • F2: Detailed README with purpose, requirements, usage
    • A1: Direct download via HTTPS from GitHub
    • I1: Standard CSV input/output formats
    • R1: MIT license, clear documentation, example datasets
    • Example (illustrative URL): https://github.com/researcher/gene-expression-analysis

Key FAIR Features for Simple Workflows:

  • Clear repository structure with README, LICENSE, requirements.txt
  • Example data files and expected outputs
  • Simple installation instructions
  • Version tags for releases
  • DOI through Zenodo-GitHub integration

Complex Workflow Example: Protein MD Setup Workflow

  • Repository: WorkflowHub (DOI: 10.48546/workflowhub.workflow.29.3)
  • Language: Common Workflow Language (CWL)
  • License: Apache License 2.0
  • Components: BioBB building blocks with individual identifiers
  • Metadata: Structured using RO-Crate format
  • Scale: Multi-step molecular dynamics simulation pipeline
  • Infrastructure: Requires HPC or cloud computing resources

Complex Workflow Example: Digital Pathology Workflow

  • Repository: Zenodo with workflow run provenance
  • Language: CWL with CWLProv provenance tracking
  • Format: RO-Crate with Provenance Run Crate profile
  • Standards: OpenSlide for digital pathology images
  • License: MIT License for workflow and components
  • Scale: Processes whole slide images (GBs of data)
  • Infrastructure: GPU-accelerated computing for deep learning

Comparison of Implementation Approaches

FAIR Aspect        | Simple Workflow Approach             | Complex Workflow Approach
F1 - Identifiers   | GitHub releases, optional Zenodo DOI | Workflow registry DOI, component DOIs
F2 - Metadata      | README files, inline comments        | Structured metadata, ontologies
A1 - Access        | Direct GitHub download               | Container images, registry access
I1 - Standards     | Standard file formats                | Workflow languages (CWL, WDL)
R1 - Documentation | README + examples                    | Comprehensive docs + tutorials
R1.3 - Provenance  | Git history, manual logs             | Automated provenance tracking

Getting Started Today

For Simple Workflows (Start Here)

  1. Pick one script or notebook you use regularly for analysis
  2. Create a Git repository and upload your code
  3. Write a basic README explaining what the script does and how to run it
  4. Add a LICENSE file (MIT is simple and permissive)
  5. Create a requirements file listing dependencies
  6. Test your instructions by asking a colleague to run your code

Quick Start for Simple Workflows

Time investment: 2-4 hours

Immediate benefits: Easier sharing, version control, basic reproducibility

Next steps: Add example data, create GitHub release, consider Zenodo DOI

For Complex Workflows (Advanced)

  1. Choose one complex analysis pipeline to make FAIR as a pilot project
  2. Start simple: Focus on F1 (getting a DOI) and R1.1 (adding a license)
  3. Use existing tools: Don't reinvent the wheel - use established registries
  4. Document iteratively: Improve documentation over time
  5. Engage with community: Join relevant working groups and forums

Migration Path

Start with simple workflow practices, then gradually adopt WMS features:

  1. Basic FAIR (Git + README + License): 1-2 hours
  2. Enhanced documentation (Examples + tests): 4-8 hours
  3. Registry publication (DOI + metadata): 8-16 hours
  4. Full WMS implementation (Containers + provenance): 40+ hours

Progressive Implementation Strategy

Phase 1: Foundation (All Workflows)

  • Set up version control
  • Write basic documentation
  • Add licensing information
  • Include example data

Phase 2: Enhancement (Growing Complexity)

  • Obtain persistent identifiers (DOIs)
  • Improve metadata quality
  • Add containerization for dependencies
  • Implement basic testing

Phase 3: Full FAIR (Complex Workflows)

  • Use workflow management systems
  • Publish in specialized registries
  • Implement provenance tracking
  • Follow domain-specific standards

Benefits of FAIR Workflows

  • Increased citations and research impact
  • Easier collaboration with other researchers
  • Reduced duplication of effort
  • Better reproducibility of research results
  • Compliance with funder and publisher requirements
  • Future-proofing your research outputs

Remember

FAIR is a journey, not a destination. Start with what you can implement today and improve over time!

References

  1. Wilkinson, S.R., Aloqalaa, M., Belhajjame, K., et al. (2025). Applying the FAIR Principles to computational workflows. Scientific Data, 12:328. https://doi.org/10.1038/s41597-025-04451-9
  2. Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3:160018. https://doi.org/10.1038/sdata.2016.18
  3. Barker, M., Chue Hong, N.P., Katz, D.S., et al. (2022). Introducing the FAIR principles for research software. Scientific Data, 9:622. https://doi.org/10.1038/s41597-022-01710-x