Carnegie Mellon University Libraries

Data Management for Research

Guide to FAIR Principles for Chemistry Research Data

Introduction

With the growing volume and complexity of chemical research data, there is an urgent need to improve how we manage, share, and reuse this information. The FAIR principles provide a framework for making data Findable, Accessible, Interoperable, and Reusable, for both humans and machines. This guide explains how to apply these principles specifically to chemistry research data.

Compared with even 50 years ago, today's chemistry lab is very different. More researchers are carrying out more experiments than ever before, using increasingly sophisticated and automated tools and generating a deluge of data. Research output (including articles, books, and data sets) is growing by 8–9% a year, but the way data from experiments are shared and reused hasn't kept pace.

What are the FAIR Principles?

The FAIR Guiding Principles describe distinct considerations for contemporary data publishing environments with respect to supporting both manual and automated deposition, exploration, sharing, and reuse. FAIR differs from other guidelines in that it describes concise, domain-independent, high-level principles that can be applied to a wide range of scholarly outputs.

Core FAIR Principles Defined

| Principle | Technical Definition | Chemistry Context |
| --- | --- | --- |
| Findable | Data, as well as the metadata describing them, should have globally unique and persistent machine-readable identifiers. | Chemical structures should have unique identifiers (InChIs); datasets should have DOIs |
| Accessible | Data and their metadata should be retrievable from their identifiers, with a standardized protocol that incorporates an authentication and authorization procedure, as necessary. | Data repositories with standard web protocols; metadata remains accessible even if data is restricted |
| Interoperable | Data and their metadata should be formatted in a formal, shared, and broadly applicable language that includes cross-references to other metadata. These cross-references should include the relationships between the data. | Chemical data using standard formats that other systems can interpret (CIF files, standardized NMR data) |
| Reusable | Data and their metadata should be described thoroughly enough so that they can be replicated and combined in different settings. | Detailed experimental procedures, properly documented spectra with metadata on acquisition parameters |

Why FAIR Matters for Chemistry

Some researchers and policy makers want to change how data are shared and reused, and are pushing the chemistry community to implement the FAIR principles of data management. These efforts are bolstered by funders such as the European Research Council (ERC) and the US National Institutes of Health, which mandate that the science they fund be made open access and have data management plans in place.

Ultimately, the FAIR principles are about making sure that the work that chemists and other scientists are doing can be found, extracted, and then applied elsewhere. But despite this seemingly sensible goal, many scientists have not kept up with ensuring that their raw data are preserved and accessible.

Benefits of FAIR Data in Chemistry

  1. Improved Reproducibility: Well-documented data allows others to validate findings
  2. Enhanced Collaboration: Easier data sharing across research groups and disciplines
  3. Increased Efficiency: By one estimate, "80% of all the effort regarding data goes into data wrangling and data preparation. Only 20% is actually effective research and analytics." The imbalance arises largely because data aren't yet FAIR.
  4. Greater Impact: FAIR data is more likely to be cited and reused
  5. Funding Compliance: Many agencies now require FAIR data management plans

Making Chemistry Data FAIR

Findable

  1. Use Persistent Identifiers:
    • Obtain DOIs for datasets through repositories like Dataverse, Figshare, or Dryad
    • Use International Chemical Identifiers (InChIs), which are a machine-readable way of describing chemical structures
  2. Create Rich Metadata:
    • Include detailed information about experimental conditions
    • Raw spectra files could be uploaded at the same time a journal article is submitted or accepted. The files could include experimental metadata (essentially, data about the data) to describe how the spectra were obtained.
  3. Register in Searchable Resources:
    • Deposit in chemistry-specific repositories
    • Include in general scientific databases
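
The Findable steps above can be sketched as a small metadata record that ties a dataset DOI to the InChIs of the compounds it describes. This is a minimal illustration, not a repository schema: the field names, the function `build_dataset_record`, the example title, and the DOI `10.1234/example.dataset.1` are all assumptions made for this sketch, and the checks are purely syntactic.

```python
import json
import re

# Loose syntactic checks only; they do not prove that an identifier resolves.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
INCHI_PREFIX = "InChI=1S/"  # prefix of a standard InChI, version 1


def build_dataset_record(title, doi, compounds, conditions):
    """Return a metadata dict linking a dataset DOI to compound InChIs."""
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI-shaped string: {doi!r}")
    for name, inchi in compounds.items():
        if not inchi.startswith(INCHI_PREFIX):
            raise ValueError(f"{name}: expected a standard InChI, got {inchi!r}")
    return {
        "title": title,
        "identifier": {"type": "DOI", "value": doi},
        "compounds": [
            {"name": name, "inchi": inchi} for name, inchi in compounds.items()
        ],
        "experimental_conditions": conditions,
    }


# Hypothetical example values; the DOI is a placeholder, not a registered one.
record = build_dataset_record(
    title="1H NMR spectra of ethanol in CDCl3",
    doi="10.1234/example.dataset.1",
    compounds={"ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    conditions={"solvent": "CDCl3", "temperature_K": 298, "field_MHz": 400},
)
print(json.dumps(record, indent=2))
```

In practice the repository you deposit in will define its own metadata fields; the point of the sketch is only that identifiers and experimental conditions travel together in one machine-readable record.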

Accessible

  1. Use Standard Communication Protocols:
    • Make data accessible via HTTP/HTTPS
  2. Clarify Access Conditions:
    • Document any authentication requirements
    • FAIR does not mean open and free. FAIR just means it's technically possible for data to "talk to" each other. Even data with privacy issues that cannot be made open can be accessed through the proper channels.
  3. Preserve Metadata:
    • Ensure metadata remains accessible even when data is unavailable
    • Chemists will need to describe and deposit their data as they are being created.
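
One way to picture "retrievable by identifier over a standard protocol" is a plain HTTPS request to the DOI resolver that asks for metadata rather than the landing page. The sketch below assumes DOI content negotiation with the CSL JSON media type, which registration agencies such as Crossref and DataCite support for DOIs they register; the helper names `metadata_request` and `fetch_metadata` are made up for this example, and the network call is defined but not executed here.

```python
import json
import urllib.parse
import urllib.request

DOI_RESOLVER = "https://doi.org/"
CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi):
    """Build an HTTPS request that asks the DOI resolver for metadata."""
    url = DOI_RESOLVER + urllib.parse.quote(doi)
    return urllib.request.Request(url, headers={"Accept": CSL_JSON})


def fetch_metadata(doi, timeout=10):
    """Send the request. Needs network access and can fail for
    unregistered DOIs or resolvers that do not offer this media type."""
    with urllib.request.urlopen(metadata_request(doi), timeout=timeout) as response:
        return json.load(response)


# Build (but do not send) a request for the FAIR Guiding Principles paper.
request = metadata_request("10.1038/sdata.2016.18")
print(request.full_url)
print(request.get_header("Accept"))
```

Because the metadata request is separate from the data download, a repository can keep answering it even when the data files themselves sit behind authentication.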

Interoperable

  1. Use Formal Knowledge Representation:
    • Synthesis routes could be formatted and structured in a machine-readable way so that researchers anywhere could extract the protocols and reproduce them with automated scripts or programs.
  2. Adopt Community Standards:
    • The crystallography community developed crystallographic information files (CIFs) that are now standard for reporting crystal structures in a machine-readable way.
    • For NMR data: Use standard data formats with acquisition parameters
    • For mass spectrometry: Follow established reporting guidelines
  3. Link Related Data:
    • Connect data to publications using DOIs
    • Cross-reference related datasets
    • Chemistry is often described as the central science, underpinning many other disciplines. So its data should be accessible and interoperable across those other disciplines.
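
To show why a community format such as CIF helps other systems interpret data, here is a minimal sketch that reads simple `_tag value` items from CIF text. It is deliberately not a full CIF parser: `loop_` blocks and multi-line values are skipped, the function names are invented for this example, and the cell values in `CIF_TEXT` are illustrative rather than taken from a deposited structure.

```python
import shlex

# Illustrative fragment only, not a deposited crystal structure.
CIF_TEXT = """\
data_example
_chemical_formula_sum 'Na Cl'
_cell_length_a 5.6402(2)
_cell_length_b 5.6402(2)
_cell_length_c 5.6402(2)
_cell_angle_alpha 90
"""


def read_cif_items(text):
    """Collect single-line `_tag value` pairs; loops and multi-line values are skipped."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue
        parts = shlex.split(line)  # keeps quoted values such as 'Na Cl' together
        if len(parts) == 2:
            items[parts[0]] = parts[1]
    return items


def numeric_value(raw):
    """Drop a trailing standard uncertainty such as '(2)' and convert to float."""
    return float(raw.split("(")[0])


items = read_cif_items(CIF_TEXT)
print(items["_chemical_formula_sum"])
print(numeric_value(items["_cell_length_a"]))
```

For real work a dedicated CIF library is the safer choice; the sketch only illustrates that standard tag names let any program find the same quantity without reading the paper.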

Reusable

  1. Document Data with Detailed Attributes:
    • Include full experimental conditions
    • Document instrument settings and calibration
    • If a synthesis route section is formatted properly, chemists could use the data or reproduce the protocol even if it was separated from the context of the paper.
  2. Specify Clear Licenses:
    • Use standard licenses (CC-BY, CC0)
    • Even companies should see the value in FAIR data. If firms don't put a FAIR data infrastructure in place, they "will not be able to have all these data talk to each other" or extract the implicit knowledge the data contain.
  3. Include Detailed Provenance:
    • Document the complete data generation workflow
    • Track data processing steps
    • The more that chemists can annotate their data and make them searchable and available, "that's a win for everyone".
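
The provenance and licensing points above can be sketched as a record that stores a checksum of the raw file, the license, and an ordered list of processing steps. Everything named here is an assumption for illustration: the record layout, the functions `start_provenance` and `add_step`, the instrument description, and the software name are hypothetical, and the license strings follow SPDX identifiers (`CC-BY-4.0`, `CC0-1.0`).

```python
import hashlib
from datetime import datetime, timezone


def start_provenance(raw_bytes, instrument, license_id):
    """Begin a provenance record anchored to a checksum of the raw data."""
    return {
        "license": license_id,
        "instrument": instrument,
        "raw_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "steps": [],
    }


def add_step(record, description, software, parameters):
    """Append one processing step, numbered in the order it was applied."""
    record["steps"].append({
        "order": len(record["steps"]) + 1,
        "description": description,
        "software": software,
        "parameters": parameters,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })
    return record


# Placeholder bytes stand in for a raw instrument file.
raw = b"placeholder raw FID bytes"
prov = start_provenance(
    raw,
    instrument={"model": "hypothetical 400 MHz spectrometer"},
    license_id="CC-BY-4.0",
)
add_step(prov, "Fourier transform", "hypothetical-nmr-tool 1.0", {"zero_filling": 2})
add_step(prov, "Baseline correction", "hypothetical-nmr-tool 1.0", {"order": 3})
print([step["description"] for step in prov["steps"]])
```

Recording the checksum before any processing lets a later user confirm that the file they downloaded is the one the steps were applied to.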

Infrastructure and Tools for FAIR Chemistry Data

Repositories and Platforms

  1. Chemistry-Specific Repositories:
    • Cambridge Structural Database (for crystal structures)
    • NMRShiftDB (for NMR data)
    • The GO FAIR Chemistry Implementation Network (ChIN) has been working with organizations such as the International Union of Pure and Applied Chemistry to establish data standards and protocols.
  2. General Scientific Repositories:
    • Dataverse: Generates a formal citation for each deposit, following a standard. It makes the Digital Object Identifier (DOI) or other persistent identifiers public when the dataset is published.
    • Zenodo, Figshare, Dryad
    • The NFDI chemistry consortium (NFDI4Chem) is tasked with building tools and infrastructures for FAIR data.
  3. Curated Data Services:
    • CAS, a division of the American Chemical Society (ACS), provides "content and chemical information" to researchers and organizations. CAS acquires data from sources such as publishers and patent offices, indexes them, standardizes them, and makes them easily searchable.

Standards and Tools

  1. Chemical Structure Representation:
    • International Chemical Identifier (InChI)
    • SMILES notation
    • Making data usable across disciplines requires that they be described unambiguously, so chemists need to apply precise metadata standards to their data.
  2. Spectroscopic Data Formats:
    • JCAMP-DX for spectral data
    • nmrML for NMR data
    • Nuclear magnetic resonance data could be treated in much the same way as CIFs. Raw spectra files could be uploaded at the same time a journal article is submitted or accepted.
  3. Metadata Frameworks:
    • The NFDI4Chem infrastructure will include repositories into which researchers will have to deposit data themselves, following a minimum set of metadata standards.
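
As an illustration of how a spectral format carries its own metadata, the sketch below reads the `##LABEL= value` records of a JCAMP-DX file into a dictionary and checks for a handful of labels. The fragment in `JCAMP_TEXT` is trimmed down for readability and is not a complete valid file, the function name is invented, and the `REQUIRED` tuple is an assumption for this example rather than the format's official list of mandatory labels.

```python
# Trimmed, illustrative fragment; a real file carries more header records.
JCAMP_TEXT = """\
##TITLE= Illustrative IR spectrum
##JCAMP-DX= 4.24
##DATA TYPE= INFRARED SPECTRUM
##XUNITS= 1/CM
##YUNITS= TRANSMITTANCE
##NPOINTS= 3
##XYDATA= (X++(Y..Y))
4000 0.95 0.94 0.96
##END=
"""

# Labels this sketch insists on; an assumption, not the official required set.
REQUIRED = ("TITLE", "JCAMP-DX", "DATA TYPE", "XUNITS", "YUNITS")


def read_jcamp_header(text):
    """Map each `##LABEL= value` record to its value; data lines are ignored."""
    header = {}
    for line in text.splitlines():
        if line.startswith("##") and "=" in line:
            label, value = line[2:].split("=", 1)
            header[label.strip().upper()] = value.strip()
    return header


header = read_jcamp_header(JCAMP_TEXT)
missing = [label for label in REQUIRED if label not in header]
print(header["DATA TYPE"], header["XUNITS"], header["YUNITS"])
print(missing)
```

A repository could run a check of this kind at deposit time to enforce a minimum set of metadata before a spectrum is accepted.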

Implementation Strategy for Chemistry Research Groups

Starting Points

  1. Assess Current Practices:
    • Review how your lab currently manages data
    • Identify gaps between current practice and FAIR principles
  2. Develop a Data Management Plan:
    • Create templates for different experiment types
    • Define workflows for data collection, processing, and archiving
  3. Choose Appropriate Tools:
    • Select repositories that support your data types
    • Adopt electronic lab notebooks with FAIR support

Practical Checklist for Making Chemistry Research Data FAIR

✓ Findable

  • Assign DOIs or other persistent identifiers to all datasets
  • Use InChIs for all chemical structures
  • Create comprehensive metadata describing your experiments
  • Deposit data in searchable repositories (discipline-specific when possible)
  • Link datasets to related publications and other datasets

✓ Accessible

  • Ensure data is retrievable using standard web protocols (HTTP/HTTPS)
  • Clearly document any access restrictions or authentication requirements
  • Separate metadata from data to ensure metadata remains accessible
  • Provide contact information for data access inquiries
  • Consider long-term accessibility and choose stable repositories

✓ Interoperable

  • Use established chemistry data formats (CIF, JCAMP-DX, etc.)
  • Apply community-agreed metadata standards
  • Include structured experimental procedures in machine-readable formats
  • Ensure analytical data includes standardized acquisition parameters
  • Use controlled vocabularies when describing chemical processes

✓ Reusable

  • Document complete experimental conditions and instrument settings
  • Apply clear, machine-readable licenses to all datasets
  • Include detailed information on sample preparation and handling
  • Provide complete provenance of data transformation and processing steps
  • Meet domain-relevant community standards for your specific chemistry subfield
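
Parts of the checklist above can be turned into an automated gap report for a draft dataset record. The sketch below is one possible way to do that, under stated assumptions: the record keys (`doi`, `structures`, `access_url`, and so on), the `CHECKS` table, the accepted format names, and the example URL are all hypothetical, and passing these checks shows only that a field is present, not that the data are actually FAIR.

```python
# Each entry maps a checklist item to a test on a draft record (a plain dict).
CHECKS = {
    "Findable: persistent identifier": lambda r: bool(r.get("doi")),
    "Findable: InChI for every structure": lambda r: bool(r.get("structures"))
    and all(s.get("inchi", "").startswith("InChI=") for s in r["structures"]),
    "Accessible: HTTPS access URL": lambda r: r.get("access_url", "").startswith("https://"),
    "Accessible: access conditions documented": lambda r: bool(r.get("access_conditions")),
    "Interoperable: community data format": lambda r: r.get("format")
    in {"CIF", "JCAMP-DX", "nmrML"},
    "Reusable: license": lambda r: bool(r.get("license")),
    "Reusable: provenance": lambda r: bool(r.get("processing_steps")),
}


def fair_gaps(record):
    """Return the checklist items the record does not yet satisfy, in checklist order."""
    return [name for name, check in CHECKS.items() if not check(record)]


# Hypothetical draft record that still lacks a license and processing steps.
draft = {
    "doi": "10.1234/example.dataset.1",
    "structures": [{"inchi": "InChI=1S/CH4/h1H4"}],
    "access_url": "https://repository.example.org/datasets/1",
    "access_conditions": "open",
    "format": "JCAMP-DX",
}
print(fair_gaps(draft))
```

A lab could run such a report before each deposit and extend `CHECKS` with items specific to its own subfield.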

References

  1. Howes, L. (2019). Making chemistry FAIR. Chemical & Engineering News, 97(35), 22-25.
  2. Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18
  3. Sansone, S.-A., McQuilton, P., Rocca-Serra, P., Gonzalez-Beltran, A., Izzo, M., Lister, A. L., Thurston, M., & the FAIRsharing Community. (2019). FAIRsharing as a community approach to standards, repositories and policies. Nature Biotechnology, 37(4), 358-367.
  4. Draxl, C., & Scheffler, M. (2018). NOMAD: The FAIR concept for big data-driven materials science. MRS Bulletin, 43(9), 676-682.
  5. Moreau, L., & Groth, P. (2013). Provenance: An Introduction to PROV. Synthesis Lectures on the Semantic Web: Theory and Technology, 3(4), 1-129.