Carnegie Mellon University Libraries

Data Management for Research

How to back up Federal data

Creating Federal data Backups

The United States (US) federal government collects, aggregates, and disseminates a large volume of information and data. This content is used by researchers, policymakers, and many others for various purposes.

Protecting access to US federal government data between and during presidential administrations is important. Data can potentially disappear because of government shutdowns, broken links, and policy shifts.

This libguide recommends steps you can take to ensure that the government data you use in your research remains accessible to you and others.

Identify the data you are working with

  • Identify and document the US federal government data you are using and that you want to safeguard. 
    • Consider if you are using a model or data source that is based on US federal government-produced data. 
    • It may be useful to make a table as you document how you accessed the data. Include the dataset title, URL, which specific agency and program produced it, the date you accessed it, and any additional access method information. 
  • Document what the dataset contains and what data you are using.
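One way to keep the table described above is a plain CSV file that you append to each time you start using a new dataset. The sketch below is only an illustration: the column names and the `append_inventory_row` helper are hypothetical choices, not part of any standard, and you may prefer a spreadsheet instead.

```python
import csv
from datetime import date
from pathlib import Path

# Illustrative column names that mirror the fields suggested in the list above.
FIELDS = [
    "title",
    "url",
    "agency",
    "program",
    "date_accessed",
    "access_method",
    "contents_used",
]


def append_inventory_row(csv_path, row):
    """Append one dataset record to the inventory CSV.

    A header row is written first if the file is new or empty.
    If no access date is supplied, today's date is recorded.
    """
    path = Path(csv_path)
    is_new = not path.exists() or path.stat().st_size == 0
    record = {"date_accessed": date.today().isoformat(), **row}
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(record)
```

Calling `append_inventory_row("inventory.csv", {"title": "...", "url": "...", "agency": "..."})` adds one line per dataset. Fields you leave out are written as empty cells, so the table can be filled in gradually.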

Confirm data availability

  • Check if the data has already been deposited in a non-governmental data repository. If the data is already preserved in a reliable place, making a backup of the data may not be necessary. Places to check:
    • Non-interactive datasets hosted on government websites may already be backed up in the Internet Archive’s Wayback Machine, which captures webpages. 
    • Some large data products are duplicated by non-profits or research projects. You should check any that are common for your community. 
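To check whether a government webpage already has a capture in the Wayback Machine, you can query the Internet Archive's availability endpoint. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the response layout (an `archived_snapshots` object with a `closest` entry) is assumed from the Internet Archive's published description of that endpoint. A captured webpage does not guarantee that the data files linked from it were captured as well, so it is still worth opening the snapshot and checking.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"


def availability_query(page_url):
    """Build the availability-check URL for a given webpage address."""
    return f"{WAYBACK_API}?{urlencode({'url': page_url})}"


def closest_snapshot(response):
    """Return (snapshot_url, timestamp) from a parsed response, or None.

    Assumes the documented layout:
    {"archived_snapshots": {"closest": {"available": true, "url": ..., "timestamp": ...}}}
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"], closest["timestamp"]
    return None


def check_wayback(page_url, timeout=30):
    """Ask the Wayback Machine for the closest capture of page_url (needs network access)."""
    with urlopen(availability_query(page_url), timeout=timeout) as reply:
        return closest_snapshot(json.load(reply))
```

`check_wayback(...)` returns `None` when no capture is reported, which is a signal to save the page yourself as described under "Making backups".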

Making backups

  • Back up governmental webpages and non-interactive datasets that are hosted on them in the Internet Archive’s Wayback Machine, or in projects such as the End of Term Archive.
  • For code that is in a version control system and on the web, use the Software Heritage project to back it up. 
  • If the data are not complex, very large (>1 TB), or restricted, you can make a local copy. For the data to be useful to you, your team, or your community, include as much information about the data as possible so that they remain findable and reusable in the future. For any data you copy, include: 
    • Actual, complete title of dataset
    • Agency name that produced the data
    • Program or office name
    • Website URLs, including both the data.gov URL, if applicable, and the URL where the data are hosted
    • Date downloaded
    • Method of access (you may already have recorded this when identifying the data)
    • File names for data downloaded
    • Identifiers associated with the dataset, e.g.,
      • DOI
      • If you are using data.gov, open the Data.json Metadata and look for the “identifier” value
      • Anything else that looks like an identifier and might help you locate the data in the future
  • Note the license and any access and sharing restrictions. Do not share restricted data or data containing personally identifiable information (PII).
  • Additionally, save a current copy of the federal webpage that points to the raw data in the Internet Archive.
  • Additional things to document:
    • Coverage dates
    • Size
    • Format
    • Version
    • Description
    • GeoLocation/Spatial coverage
    • Related Items
    • Checksum
  • Consider putting a copy of the raw data that you just backed up in a data repository. 
  • For larger or interactive projects with field-wide importance you may wish to consult with colleagues about how your field is preserving this data.
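The checksum, size, and file name in the lists above can be recorded automatically when you make a local copy. The sketch below computes a SHA-256 checksum and writes a small JSON file next to the downloaded data. It is one possible approach, not a prescribed format: the helper names, the `.metadata.json` suffix, and the JSON keys are all hypothetical, and SHA-256 is simply a common choice of checksum algorithm.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(data_path, metadata):
    """Write <file name>.metadata.json beside a downloaded file and return its path.

    `metadata` holds the descriptive fields you collected by hand (title, agency,
    program, URLs, identifiers, license, and so on). File name, size, and checksum
    are filled in from the file itself; the download date defaults to today.
    """
    data_path = Path(data_path)
    record = dict(metadata)
    record["file_name"] = data_path.name
    record["size_bytes"] = data_path.stat().st_size
    record["sha256"] = sha256_of(data_path)
    record.setdefault("date_downloaded", date.today().isoformat())
    sidecar = data_path.with_name(data_path.name + ".metadata.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

Recomputing `sha256_of(...)` later and comparing it with the stored value is a simple way to confirm that a backup has not been altered or corrupted.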

 

Don't forget to create appropriate documentation for your backup. Here is a README template you can use:

Rescued Dataset README Template

This guide was created by reusing MIT Libraries' libguide.[1]

[1] Checklist for USA Federal Data Backups by Data Management Services. Copyright © 2024-12-05 MASSACHUSETTS INSTITUTE OF TECHNOLOGY is licensed under a Creative Commons Attribution 4.0 International License except where otherwise noted. [https://creativecommons.org/licenses/by/4.0/]. Access at https://libraries.mit.edu/data-management/store/backups/checklist-usa/  


Federal data backup resources

A list of resources for accessing preserved federal data is being curated and will be provided here.

  • Archive of data.gov: A regularly updated mirror of all data files linked from data.gov. The repository is maintained by the Harvard Law School Library Innovation Lab.
  • IPUMS: IPUMS provides census and survey data from around the world, integrated across time and space.
  • Internet Archive CDC Datasets: Backup access to CDC datasets, made available through the Internet Archive.
  • The Data Rescue Project: A clearinghouse for data rescue-related efforts and data access points for public US governmental data that are currently at risk.
  • Global Biodata Coalition: Mirrors for various bioinformatics data sources.

For more general guidance on finding data, please see our finding data guide.