Carnegie Mellon University Libraries

Data Management for Research

FAIR Principles for Research Data in Social Media and Health Studies

1. Introduction

In an era of data-intensive research, effective data management and stewardship have become critical to scientific advancement. The scientific community increasingly recognizes that data used for research purposes, including those extracted from social media, should be made accessible whenever possible, and their reuse should be encouraged, provided ethical and legal concerns are addressed.

This guide focuses on applying FAIR principles to research data in social media and health-related studies. As researchers increasingly utilize social media data (SMD) to understand health behaviors, perceptions, and outcomes, ensuring these data are managed according to FAIR principles becomes essential for scientific rigor, reproducibility, and ethical research conduct.

2. Understanding FAIR Principles

2.1 Definition and Origin

FAIR stands for Findable, Accessible, Interoperable, and Reusable.

The FAIR Guiding Principles emerged from a workshop titled "Jointly Designing a Data Fairport" held in Leiden, Netherlands, in 2014. They were subsequently refined by a FAIR working group established by members of the FORCE11 community.

The principles were developed to address the critical need for improved infrastructure supporting the reuse of scholarly data. They put specific emphasis on enhancing the ability of machines to automatically find and use data, in addition to supporting reuse by individuals.

2.2 The Four FAIR Pillars

The FAIR principles encompass the following core components:

| Principle | Description | Requirements |
| --- | --- | --- |
| Findable | Data should be easy to find for both humans and computers | F1. (Meta)data are assigned a globally unique and persistent identifier |
| | | F2. Data are described with rich metadata |
| | | F3. Metadata clearly and explicitly include the identifier of the data they describe |
| | | F4. (Meta)data are registered or indexed in a searchable resource |
| Accessible | Once found, users need to know how data can be accessed | A1. (Meta)data are retrievable by their identifier using a standardized communications protocol |
| | | A1.1. The protocol is open, free, and universally implementable |
| | | A1.2. The protocol allows for an authentication and authorization procedure, where necessary |
| | | A2. Metadata are accessible, even when the data are no longer available |
| Interoperable | Data should be able to be integrated with other data and work with applications for analysis, storage, and processing | I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation |
| | | I2. (Meta)data use vocabularies that follow FAIR principles |
| | | I3. (Meta)data include qualified references to other (meta)data |
| Reusable | Data should be well-described so they can be replicated and/or combined in different settings | R1. (Meta)data are richly described with a plurality of accurate and relevant attributes |
| | | R1.1. (Meta)data are released with a clear and accessible data usage license |
| | | R1.2. (Meta)data are associated with detailed provenance |
| | | R1.3. (Meta)data meet domain-relevant community standards |
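
To make a few of these requirements concrete, the following is a minimal sketch of a completeness check on a dataset's metadata record. The field names, the mapping of fields to F1, F2, R1.1 and R1.2, and the example record (including the placeholder DOI) are assumptions chosen for illustration. A real repository defines its own metadata schema.

```python
import json

# Hypothetical minimal field set; a real repository schema defines its own.
REQUIRED_FIELDS = {
    "identifier": "F1/F3: globally unique, persistent identifier",
    "title": "F2: rich descriptive metadata",
    "description": "F2: rich descriptive metadata",
    "license": "R1.1: clear data usage license",
    "provenance": "R1.2: how the data were produced",
}


def missing_fair_fields(record: dict) -> list:
    """Return the names of required fields that are absent or empty."""
    return sorted(name for name in REQUIRED_FIELDS if not record.get(name))


example_record = {
    "identifier": "doi:10.1234/example-dataset",  # placeholder, not a real DOI
    "title": "Post IDs for a study of vaccine discussions",
    "description": "Identifiers of public posts collected by keyword search.",
    "license": "CC-BY-4.0",
    "provenance": "",  # left empty on purpose to show the check
}

print(missing_fair_fields(example_record))  # ['provenance']
print(json.dumps(example_record, indent=2))
```

A check like this only confirms that fields are present. Whether the metadata are rich and accurate enough for reuse still needs human review.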

The FAIR principles apply to any digital research object, including the algorithms, tools, and workflows that produce data, not just to the data themselves. All components of the research process must be available to ensure transparency, reproducibility, and reusability.

3. From FAIR Data to Fair Data Use

3.1 Methodological Data Fairness

While the FAIR principles focus on technical aspects of data management (making data findable, accessible, interoperable, and reusable), they do not explicitly address the social and ethical implications of data use. This is where the concept of "methodological data fairness" becomes relevant, particularly when dealing with social media data in health research.

Methodological data fairness is a form of fairness that complements data management principles such as FAIR by enhancing the actionability of social media data for future research. It focuses on how the treatment of data during the research process affects the credibility and justice with which the outputs of the study portray and affect all sections of society.

Key components of methodological data fairness include:

  1. Equitable representation: Ensuring that data collection methods do not systematically exclude or underrepresent certain populations
  2. Contextual understanding: Preserving the context in which data were originally produced
  3. Transparency in methods: Clearly documenting and reporting data collection, processing, and analysis decisions
  4. Responsible interpretation: Acknowledging the limitations of the data and their potential biases
  5. Ethical consideration: Respecting the rights and dignity of data subjects

3.2 Epistemic Injustice in Data Practices

When using social media data for health research, researchers must be aware of the potential for epistemic injustice. Philosopher Miranda Fricker distinguishes between two forms of epistemic injustice that can affect data analysis:

  1. Testimonial injustice: Prejudice in the existing economy of credibility
    • Example: Privileging Twitter as a data source due to ease of access, despite its demographic biases
  2. Hermeneutical injustice: Prejudice in the economy of collective resources used to make sense of the world
    • Example: Assuming data collected from young urban dwellers represents the views of older people living in rural areas

Implementing methodological data fairness can help mitigate these forms of injustice by:

  • Countering testimonial injustice through data practices that critically engage with existing norms around what counts as appropriate evidence
  • Addressing hermeneutical injustice by leveraging diverse sources of knowledge to counter existing prejudice

4. Implementing FAIR Principles in Social Media Health Research

4.1 Research Planning and Design

The foundation of FAIR data practices begins with thoughtful research planning and design. Researchers should:

  • Clearly articulate research goals, questions, and methods
  • Identify conceptual assumptions being made about the data
  • Consider potential biases and limitations in social media data sources
  • Develop a robust Data Management Plan (DMP) that incorporates FAIR principles
  • Engage with stakeholders, including potential data subjects and communities
  • Obtain appropriate ethical review and approvals
  • Anticipate changes in social media platforms' Application Programming Interfaces (APIs)

4.2 Data Choices, Sampling, and Acquisition

When selecting social media data sources for health research, consider:

  • Platform selection: Evaluate which platforms contain relevant data while acknowledging selection biases
  • Data representativeness: Understand demographic characteristics of users on selected platforms
  • Context preservation: Consider how extraction methods may strip data of crucial context
  • Terms of Service: Ensure compliance with platform terms while balancing research needs
  • Consent considerations: Determine whether platform users' consent to Terms of Service constitutes meaningful consent for research
  • Geographic and temporal factors: Account for location-based and time-based variations in data
  • Data authenticity: Evaluate whether social media accounts are genuine, automated, or fake

4.3 Data Processing and Analysis

During data processing and analysis, researchers should:

  • Document all data cleaning and filtering decisions
  • Preserve contextual information about data sources
  • Be aware of biases in machine learning and computational methodologies
  • Ensure that processing steps don't introduce new forms of discrimination
  • Consider using frameworks like Discrimination-aware Data Mining (DADM) and Fairness, Accountability, and Transparency in Machine Learning (FAT/ML)
  • Evaluate the trade-offs between privacy and utility in data processing
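
One way to document cleaning and filtering decisions is to record each step as it is applied. The sketch below is illustrative only: the `FilterLog` class, the step names, and the sample posts are hypothetical and not part of any particular library.

```python
from dataclasses import dataclass, field


@dataclass
class FilterLog:
    """Records how many records each cleaning step received and kept."""
    steps: list = field(default_factory=list)

    def apply(self, records, name, reason, keep):
        """Keep records for which keep(record) is true and log the counts."""
        kept = [r for r in records if keep(r)]
        self.steps.append(
            {"step": name, "reason": reason, "n_in": len(records), "n_out": len(kept)}
        )
        return kept


# Made-up sample posts for illustration.
posts = [
    {"id": "1", "lang": "en", "text": "Got my flu shot today"},
    {"id": "2", "lang": "fr", "text": "Vaccin contre la grippe"},
    {"id": "3", "lang": "en", "text": ""},
]

log = FilterLog()
posts = log.apply(posts, "language", "analysis limited to English",
                  lambda p: p["lang"] == "en")
posts = log.apply(posts, "non_empty", "empty posts carry no text",
                  lambda p: bool(p["text"].strip()))

for step in log.steps:
    print(step)
```

Saving the resulting log alongside the processed data lets later users see which records were excluded, and why, without access to the raw collection.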

4.4 Data Storage, Stewardship, and Management

Effective data stewardship involves:

  • Implementing secure storage systems with appropriate access controls
  • Creating and maintaining comprehensive metadata
  • Addressing challenges related to consent, confidentiality, and the right to be forgotten
  • Developing protocols for handling sensitive or identifying information
  • Creating systems that enable data verification and study replication
  • Planning for long-term data preservation and access
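
As one possible protocol for handling identifying information, account identifiers can be replaced with keyed pseudonyms before storage. The sketch below uses HMAC-SHA256 from the Python standard library. The function name, the key value, and the 16-character truncation are assumptions for illustration. Pseudonymization of this kind reduces but does not eliminate re-identification risk.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Map a platform user ID to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability (an assumption)


# Placeholder key: in practice, generate it randomly and store it
# separately from the data, under stricter access controls.
key = b"replace-with-a-randomly-generated-secret"

a = pseudonymize("user_12345", key)
b = pseudonymize("user_12345", key)
c = pseudonymize("user_67890", key)

print(a == b)  # True: the same account always maps to the same pseudonym
print(a != c)  # True: different accounts map to different pseudonyms
```

Because the mapping depends on the key, restricting access to the key limits who can link pseudonyms back to accounts.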

4.5 Reporting and Publishing

FAIR principles extend to how research using social media data is reported and published:

  • Provide detailed methodological reporting, including data sources, sampling methods, and analytical approaches
  • Acknowledge limitations of the data and potential biases
  • Share data and/or code when possible, or explain why sharing is limited
  • Use persistent identifiers for data and publications
  • Include rich metadata with publications
  • Follow community standards for data citation

5. Practical Examples of FAIR Implementation

Several platforms and initiatives demonstrate effective implementation of FAIR principles:

  1. Dataverse: Open-source data repository software that generates formal citations with DOIs, provides multilevel metadata, and offers machine-accessible interfaces for data access.
  2. FAIRDOM: Integrates the SEEK and openBIS platforms for FAIR data and model management, providing unique URLs, web-accessible formats, and standardized annotations.
  3. ISA framework: A community-driven metadata tracking framework that facilitates standards-compliant collection, curation, management, and reuse of life science datasets.
  4. Open PHACTS: A data integration platform that provides multiple representations (human and machine-readable), canonical URLs, and standardized dataset descriptions.
  5. UniProt: A comprehensive resource for protein sequence and annotation data that offers stable URLs, rich metadata in multiple formats, extensive links to other databases, and explicit typing of records.

6. Challenges and Solutions

Research using social media data for health studies faces several challenges:

| Challenge | Description | Potential Solutions |
| --- | --- | --- |
| Platform limitations | Major platforms like Facebook and Instagram restrict research access | Develop partnerships with platforms; use alternative data sources; advocate for research access |
| Representativeness | Social media users may not reflect the general population | Acknowledge biases; combine with other data sources; avoid overgeneralizing findings |
| Contextual information | Data extraction can strip posts of broader conversational context | Preserve threading information; document extraction methods; consider qualitative analysis |
| Consent issues | Users didn't consent to research when posting | Develop transparent consent processes; anonymize data; focus on public health benefits |
| Dynamic APIs | Platform APIs change, affecting data consistency | Document API versions; preserve raw data; develop flexible processing pipelines |
| Reproducibility challenges | Restrictions on data sharing limit study reproduction | Share code and processing methods; use tweet IDs instead of content; develop synthetic datasets |
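
Sharing post IDs instead of content can be sketched as a "dehydration" step that keeps only the identifiers from a collected dataset. The `dehydrate` function, the `id` field name, and the sample records are hypothetical. Field names in real platform exports will differ.

```python
import csv
import io


def dehydrate(posts, id_field="id"):
    """Return unique post IDs in first-seen order, dropping all content."""
    seen = set()
    ids = []
    for post in posts:
        post_id = str(post[id_field])
        if post_id not in seen:
            seen.add(post_id)
            ids.append(post_id)
    return ids


# Made-up records standing in for a collected dataset.
collected = [
    {"id": 101, "text": "first post", "author": "account_a"},
    {"id": 102, "text": "second post", "author": "account_b"},
    {"id": 101, "text": "first post", "author": "account_a"},  # duplicate
]

# Write the shareable ID list to an in-memory CSV (a file would work the same way).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["post_id"])
writer.writerows([post_id] for post_id in dehydrate(collected))
print(buffer.getvalue())
```

Other researchers can then retrieve ("hydrate") the content themselves, subject to platform access rules. Posts deleted in the meantime will no longer be retrievable.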

7. FAIR Principles in Social Media Health Research: Essential Checklist

✓ Planning & Design

Findable

  • Assign persistent identifiers (DOIs) to datasets and document collection parameters (hashtags, keywords)
  • Create rich metadata describing platform source, APIs used, and collection timeframes

Accessible

  • Document data access methods that comply with platform Terms of Service
  • Plan for sharing tweet/post IDs rather than content when redistribution is restricted

Interoperable

  • Select standard vocabularies for health terms and social media features
  • Plan conversion from platform-specific formats to standardized structures

Reusable

  • Determine appropriate licensing respecting both researcher and platform rights
  • Document platform-specific collection methods, rate limits, and sampling approaches

Methodological Fairness

  • Research and document demographics of selected social media platforms
  • Establish protocols to assess representation biases in health discussions

✓ Collection & Processing

Findable & Accessible

  • Preserve original data format alongside processed versions with clear version control
  • Document API version, rate limits, and any sampling that occurred
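
A collection manifest stored next to the raw file is one way to record these details. The sketch below is an assumption-laden example: the manifest fields, the platform name, the API version string, and the query are all placeholders. The SHA-256 checksum lets later users confirm that the raw file has not changed.

```python
import hashlib
import json


def build_manifest(raw_bytes, *, platform, api_version, query,
                   collected_from, collected_to, sampling_note):
    """Describe a raw collection file, including a checksum for verification."""
    return {
        "platform": platform,
        "api_version": api_version,
        "query": query,
        "collected_from": collected_from,
        "collected_to": collected_to,
        "sampling_note": sampling_note,
        "raw_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "raw_size_bytes": len(raw_bytes),
    }


# Stand-in for the contents of a raw export file.
raw = b'{"id": "101"}\n{"id": "102"}\n'

manifest = build_manifest(
    raw,
    platform="example-platform",      # placeholder
    api_version="v2",                 # placeholder
    query="#flushot OR #fluvaccine",  # placeholder
    collected_from="2021-01-01",
    collected_to="2021-01-31",
    sampling_note="API returned a sample of matching posts; rate unknown",
)
print(json.dumps(manifest, indent=2))
```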

Interoperable & Reusable

  • Normalize data across platforms (timestamps, engagement metrics, geographic information)
  • Document all platform-specific cleaning steps (bot removal, deduplication, handling of multimedia)
  • Record filtering decisions (language settings, verification status, content removal)

Methodological Fairness

  • Preserve contextual information including conversation threads and responses
  • Assess demographic biases specific to platforms and health topics studied
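
A simple starting point for assessing demographic bias is to compare the composition of the sample with a reference population. The sketch below assumes that group labels are available for the sample (for example from platform audience reports or a linked survey), which is often not the case. All numbers are made up for illustration.

```python
def representation_gaps(sample_counts, reference_shares):
    """Return sample share minus reference share for each reference group."""
    total = sum(sample_counts.values())
    if total == 0:
        raise ValueError("sample is empty")
    return {
        group: sample_counts.get(group, 0) / total - share
        for group, share in reference_shares.items()
    }


# Illustrative, invented figures: counts by age group in the sample and
# the corresponding shares in a reference population.
sample_counts = {"18-29": 50, "30-49": 30, "50+": 20}
reference_shares = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}

gaps = representation_gaps(sample_counts, reference_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
# 18-29: +0.30
# 30-49: -0.05
# 50+: -0.25
```

Positive values indicate groups that are overrepresented in the sample relative to the reference population, and negative values indicate underrepresented groups.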

✓ Storage & Management

Findable & Accessible

  • Store data with appropriate access controls for sensitive health topics
  • Maintain records of platform-specific identifiers (tweet/post IDs) for hydration
  • Document any platform restrictions affecting data redistribution

Interoperable & Reusable

  • Use standardized formats (JSON, CSV) with consistent structure
  • Implement protocols for removing personally identifiable information
  • Create appropriate safeguards for sensitive health information shared on social media
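
As a partial illustration of removing personally identifiable information, the sketch below replaces URLs, email addresses, and @-mentions in post text with placeholders. The patterns and placeholder labels are assumptions and deliberately simple. They will not catch names, locations, or health details written in free text, so pattern-based redaction should be treated as a first pass only.

```python
import re

# Order matters: URLs and emails are replaced before bare @-mentions.
PATTERNS = [
    (re.compile(r"https?://\S+"), "[URL]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
    (re.compile(r"(?<!\w)@\w+"), "[USER]"),
]


def redact(text):
    """Replace URLs, email addresses, and @-mentions with placeholders."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(redact("@jane_doe emailed me at jane@example.org, see https://example.org/p/1"))
# [USER] emailed me at [EMAIL], see [URL]
```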

✓ Analysis & Reporting

Findable & Accessible

  • Share platform-specific data collection and analysis code
  • Document any platform-specific tools or limitations that affect replication

Interoperable & Reusable

  • Use standard methods for social media text analysis and health term extraction
  • Document approaches to handling platform-specific features (hashtags, mentions, emoji)
  • Detail methods for filtering inauthentic health content

Methodological Fairness

  • Acknowledge platform-specific demographic limitations in findings
  • Address how platform algorithms and features may influence health discussions
  • Discuss differences between online health discussions and offline health experiences

✓ Publication & Long-term Stewardship

Findable & Accessible

  • Include platform sources and collection timeframes in publication metadata
  • Provide clear instructions for accessing or hydrating datasets
  • Plan for maintaining accessibility despite potential platform changes

Interoperable & Reusable

  • Use standard reporting guidelines for social media research
  • Document how platform-specific features were handled consistently
  • Provide contingency plans for platform API changes or shutdowns

Methodological Fairness

  • Explicitly state limitations in generalizing social media health findings
  • Address ethical considerations specific to the health topics and platforms studied
  • Plan for periodic reviews as platform features and policies evolve

8. Institutional Review Board Considerations

IRB Challenges Specific to Social Media Health Research

Based on the literature, IRBs face several challenges when reviewing social media health research:

  1. Participant definition ambiguity: Determining who constitutes a "research participant" when analyzing publicly available social media posts discussing health conditions.
  2. Distance from data providers: Unlike traditional studies, researchers typically have no direct contact with the individuals whose data they analyze.
  3. Evolving data types: The broadening of what constitutes "data" in social media research creates challenges for existing IRB frameworks.
  4. Risk-benefit assessment: Balancing potential research benefits against uncertain harms to social media users who did not explicitly consent to research participation.
  5. Rapidly changing landscape: Social media platforms frequently change their APIs, Terms of Service, and data accessibility with little notice.

Guidance for IRB Submissions

When preparing IRB submissions for social media health research, consider the following:

  • Consent considerations: Clearly articulate whether and why obtaining explicit consent is impractical, and how you will respect user privacy in its absence.
  • Terms of Service compliance: Document how your research adheres to platform Terms of Service, or provide justification for any potential deviations.
  • Risk mitigation strategies: Detail how you will protect potentially vulnerable individuals discussing sensitive health topics on social media.
  • Data security and anonymization: Describe robust procedures for securing, de-identifying, and protecting sensitive health information.
  • Platform-specific vulnerabilities: Address how your research accounts for platform-specific issues (e.g., how Twitter's public nature differs from more private health forums).
  • Contextual integrity: Explain how you will maintain the contextual integrity of health discussions rather than extracting posts in isolation.

Evolving IRB Approaches

The literature suggests IRBs are adapting to social media research through:

  • "Agile ethics" approaches that can respond to the dynamic nature of social media research
  • Ethics ecosystems involving information sharing between ethics boards
  • Regular reviews of the research process throughout the project lifecycle
  • More explicit consideration of fairness in addition to traditional ethical principles

When submitting social media health research for IRB review, researchers should anticipate providing more detailed information about their methodological approaches to fairness, as traditional IRB templates may not fully address these concerns.

Practical Steps for IRB Approval

  1. Early consultation: Engage with your IRB early in the planning process, as many boards are still developing guidelines for social media research.
  2. Educational materials: Provide educational resources about social media research methods to IRB members who may be unfamiliar with these approaches.
  3. Platform documentation: Include detailed information about the social media platforms you'll study, including their privacy policies and research guidelines.
  4. Transparency commitments: Detail how you will be transparent with findings, including acknowledgment of platform-specific biases.
  5. Ongoing ethical oversight: Propose a plan for continuous ethical assessment throughout the research as platform policies and features evolve.
  6. Disciplinary standards: Reference any emerging disciplinary standards for social media health research ethics in your field.
  7. Data management plan: Submit a comprehensive data management plan addressing FAIR principles alongside traditional ethical considerations.

9. References

  1. Leonelli, S., Lovell, R., Wheeler, B. W., Fleming, L., & Williams, H. (2021). From FAIR data to fair data use: Methodological data fairness in health-related social media research. Big Data & Society, 8(1), 1-14.
  2. Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., ... & Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 1-9.
  3. Mons, B., Neylon, C., Velterop, J., Dumontier, M., da Silva Santos, L. O. B., & Wilkinson, M. D. (2017). Cloudy, increasingly FAIR; revisiting the FAIR data guiding principles for the European open science cloud. Information Services & Use, 37(1), 49-56.
  4. Taylor, L. (2017). What is data justice? The case for connecting digital rights and freedoms globally. Big Data & Society, 4(2), 2053951717736335.
  5. Fricker, M. (2009). Epistemic injustice: Power and the ethics of knowing. Oxford University Press.
  6. Halford, S., Weal, M., Tinati, R., Carr, L., & Pope, C. (2017). Digital data infrastructures: Interrogating the social media data pipeline. AoIR 2016: The 17th annual conference of the association of internet researchers.
  7. Zook, M., Barocas, S., Boyd, D., Crawford, K., Keller, E., Gangadharan, S. P., ... & Pasquale, F. (2017). Ten simple rules for responsible big data research. PLoS Computational Biology, 13(3), e1005399.
  8. Kennedy, H., Elgesem, D., & Miguel, C. (2017). On fairness. Convergence: The International Journal of Research into New Media Technologies, 23(3), 270-288.