FAIR Principles for Research Data in Social Media and Health Studies

Table of Contents

1. Introduction
2. Understanding FAIR Principles
2.1 Definition and Origin
2.2 The Four FAIR Pillars
3. From FAIR Data to Fair Data Use
3.1 Methodological Data Fairness
3.2 Epistemic Injustice in Data Practices
4. Implementing FAIR Principles in Social Media Health Research
4.1 Research Planning and Design
4.2 Data Choices, Sampling, and Acquisition
4.3 Data Processing and Analysis
4.4 Data Storage, Stewardship, and Management
4.5 Reporting and Publishing
5. Practical Examples of FAIR Implementation
6. Challenges and Solutions
7. FAIR Principles in Social Media Health Research: Essential Checklist
8. Institutional Review Board Considerations
9. References

1. Introduction

In an era of data-intensive research, effective data management and stewardship have become critical to scientific advancement. The scientific community increasingly recognizes that data used for research purposes, including data extracted from social media, should be made accessible whenever possible and that their reuse should be encouraged, provided that ethical and legal concerns are addressed.

This guide focuses on applying FAIR principles to research data in social media and health-related studies. As researchers increasingly use social media data (SMD) to understand health behaviors, perceptions, and outcomes, managing these data according to FAIR principles becomes essential for scientific rigor, reproducibility, and ethical research conduct.

2. Understanding FAIR Principles

2.1 Definition and Origin

FAIR stands for Findable, Accessible, Interoperable, and Reusable.

The FAIR Guiding Principles emerged from a workshop titled "Jointly Designing a Data Fairport" held in Leiden, Netherlands, in 2014. They were subsequently refined by a FAIR working group established by members of the FORCE11 community.

The principles were developed to address the need for better infrastructure supporting the reuse of scholarly data. They put specific emphasis on enhancing the ability of machines to automatically find and use data, in addition to supporting reuse by individuals.

2.2 The Four FAIR Pillars

The FAIR principles encompass the following core components:

Principle	Description	Requirements
Findable	Data should be easy to find for both humans and computers	F1. (Meta)data are assigned a globally unique and persistent identifier; F2. Data are described with rich metadata; F3. Metadata clearly and explicitly include the identifier of the data they describe; F4. (Meta)data are registered or indexed in a searchable resource
Accessible	Once found, users need to know how data can be accessed	A1. (Meta)data are retrievable by their identifier using a standardized communications protocol; A1.1. The protocol is open, free, and universally implementable; A1.2. The protocol allows for an authentication and authorization procedure, where necessary; A2. Metadata are accessible, even when the data are no longer available
Interoperable	Data should be able to be integrated with other data and work with applications for analysis, storage, and processing	I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation; I2. (Meta)data use vocabularies that follow FAIR principles; I3. (Meta)data include qualified references to other (meta)data
Reusable	Data should be well-described so they can be replicated and/or combined in different settings	R1. (Meta)data are richly described with a plurality of accurate and relevant attributes; R1.1. (Meta)data are released with a clear and accessible data usage license; R1.2. (Meta)data are associated with detailed provenance; R1.3. (Meta)data meet domain-relevant community standards

The FAIR principles apply to any digital research object, including the algorithms, tools, and workflows that produce data, not just the data themselves. All components of the research process must be available to ensure transparency, reproducibility, and reusability.
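
As an illustration, the four pillars can be read as groups of fields that a dataset's metadata record should carry. The Python sketch below is one hypothetical way to check a record for gaps. The field names, their grouping under each pillar, and the example values (including the placeholder DOI) are assumptions made for illustration, not a published schema.

```python
# Hypothetical mapping from FAIR pillars to metadata fields (illustrative only).
REQUIRED_FIELDS = {
    "findable": ["identifier", "title", "description", "keywords"],
    "accessible": ["access_protocol", "access_conditions"],
    "interoperable": ["format", "vocabularies"],
    "reusable": ["license", "provenance"],
}


def missing_fair_fields(record: dict) -> dict:
    """Return, per pillar, the fields that are absent or empty in the record."""
    gaps = {}
    for pillar, fields in REQUIRED_FIELDS.items():
        missing = [f for f in fields if not record.get(f)]
        if missing:
            gaps[pillar] = missing
    return gaps


example_record = {
    "identifier": "doi:10.5555/example-smd-dataset",  # placeholder DOI
    "title": "Post IDs on vaccination discussions",
    "description": "Identifiers of public posts matching vaccination keywords.",
    "keywords": ["vaccination", "social media"],
    "access_protocol": "https",
    "access_conditions": "IDs only; content must be re-retrieved from the platform",
    "format": "text/csv",
    "vocabularies": [],  # left empty on purpose to show a gap
    "license": "CC-BY-4.0",
    "provenance": "Collected via platform API; see collection log",
}

print(missing_fair_fields(example_record))
```

A check like this only confirms that fields are present. Whether the metadata are rich and accurate enough still needs human review.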

3. From FAIR Data to Fair Data Use

3.1 Methodological Data Fairness

While the FAIR principles focus on technical aspects of data management (making data findable, accessible, interoperable, and reusable), they do not explicitly address the social and ethical implications of data use. This is where the concept of "methodological data fairness" becomes relevant, particularly when dealing with social media data in health research.

Methodological data fairness is a form of fairness that complements data management principles such as FAIR by enhancing the actionability of social media data for future research. It focuses on how the treatment of data during the research process affects the credibility and justice with which the outputs of the study portray and affect all sections of society.

Key components of methodological data fairness include:

Equitable representation: Ensuring that data collection methods do not systematically exclude or underrepresent certain populations
Contextual understanding: Preserving the context in which data was originally produced
Transparency in methods: Clearly documenting and reporting data collection, processing, and analysis decisions
Responsible interpretation: Acknowledging the limitations of the data and its potential biases
Ethical consideration: Respecting the rights and dignity of data subjects

3.2 Epistemic Injustice in Data Practices

When using social media data for health research, researchers must be aware of the potential for epistemic injustice. Philosopher Miranda Fricker distinguishes between two forms of epistemic injustice that can affect data analysis:

Testimonial injustice: Prejudice in the existing economy of credibility
- Example: Privileging Twitter as a data source due to ease of access, despite its demographic biases
Hermeneutical injustice: Prejudice in the economy of collective resources used to make sense of the world
- Example: Assuming data collected from young urban dwellers represents the views of older people living in rural areas

Implementing methodological data fairness can help mitigate these forms of injustice by:

Countering testimonial injustice through data practices that critically engage with existing norms around what counts as appropriate evidence
Addressing hermeneutical injustice by leveraging diverse sources of knowledge to counter existing prejudice

4. Implementing FAIR Principles in Social Media Health Research

4.1 Research Planning and Design

The foundation of FAIR data practices begins with thoughtful research planning and design. Researchers should:

Clearly articulate research goals, questions, and methods
Identify conceptual assumptions being made about the data
Consider potential biases and limitations in social media data sources
Develop a robust Data Management Plan (DMP) that incorporates FAIR principles
Engage with stakeholders, including potential data subjects and communities
Obtain appropriate ethical review and approvals
Anticipate changes in social media platforms' Application Programming Interfaces (APIs)

4.2 Data Choices, Sampling, and Acquisition

When selecting social media data sources for health research, consider:

Platform selection: Evaluate which platforms contain relevant data while acknowledging selection biases
Data representativeness: Understand demographic characteristics of users on selected platforms
Context preservation: Consider how extraction methods may strip data of crucial context
Terms of Service: Ensure compliance with platform terms while balancing research needs
Consent considerations: Determine whether platform users' consent to Terms of Service constitutes meaningful consent for research
Geographic and temporal factors: Account for location-based and time-based variations in data
Data authenticity: Evaluate whether social media accounts are genuine, automated, or fake
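
One way to make the representativeness consideration concrete is to compare the demographic shares in a sample with the shares in a reference population. The sketch below assumes that age-group estimates are available for both. The group labels and all numbers are invented for illustration.

```python
def representation_ratios(sample_shares: dict, population_shares: dict) -> dict:
    """Ratio of sample share to population share for each group.

    A ratio below 1 suggests under-representation in the sample,
    a ratio above 1 suggests over-representation.
    """
    ratios = {}
    for group, pop_share in population_shares.items():
        if pop_share <= 0:
            raise ValueError(f"population share for {group!r} must be positive")
        ratios[group] = round(sample_shares.get(group, 0.0) / pop_share, 2)
    return ratios


# Invented shares for illustration only.
sample = {"18-29": 0.50, "30-49": 0.35, "50-64": 0.12, "65+": 0.03}
population = {"18-29": 0.20, "30-49": 0.35, "50-64": 0.25, "65+": 0.20}

print(representation_ratios(sample, population))
```

Demographic attributes of social media users are often inferred rather than observed, so ratios like these are themselves estimates and should be reported with that caveat.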

4.3 Data Processing and Analysis

During data processing and analysis, researchers should:

Document all data cleaning and filtering decisions
Preserve contextual information about data sources
Be aware of biases in machine learning and computational methodologies
Ensure that processing steps don't introduce new forms of discrimination
Consider using frameworks like Discrimination-aware Data Mining (DADM) and Fairness, Accountability, and Transparency in Machine Learning (FAT/ML)
Evaluate the trade-offs between privacy and utility in data processing
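
Documenting cleaning and filtering decisions is easier when the log is produced by the same code that applies the filters. The following is a minimal sketch of that idea. The record fields (`lang`, `is_repost`) and the step names are hypothetical.

```python
import json


def apply_logged_filters(records: list, steps: list) -> tuple:
    """Apply named filter steps in order and log how many records each keeps."""
    log = []
    current = list(records)
    for name, reason, keep in steps:
        kept = [r for r in current if keep(r)]
        log.append({
            "step": name,
            "reason": reason,
            "records_in": len(current),
            "records_out": len(kept),
        })
        current = kept
    return current, log


# Invented posts; "lang" and "is_repost" are hypothetical field names.
posts = [
    {"id": "1", "lang": "en", "is_repost": False},
    {"id": "2", "lang": "fr", "is_repost": False},
    {"id": "3", "lang": "en", "is_repost": True},
]
steps = [
    ("language_filter", "analysis limited to English", lambda p: p["lang"] == "en"),
    ("drop_reposts", "avoid counting duplicated content", lambda p: not p["is_repost"]),
]

cleaned, decisions = apply_logged_filters(posts, steps)
print(json.dumps(decisions, indent=2))
```

The resulting log can be stored next to the processed data so that later users can see what was removed at each step and why.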

4.4 Data Storage, Stewardship, and Management

Effective data stewardship involves:

Implementing secure storage systems with appropriate access controls
Creating and maintaining comprehensive metadata
Addressing challenges related to consent, confidentiality, and the right to be forgotten
Developing protocols for handling sensitive or identifying information
Creating systems that enable data verification and study replication
Planning for long-term data preservation and access
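
As one possible protocol for handling identifying information, account identifiers can be replaced with keyed hashes before storage. The sketch below uses HMAC-SHA256 from the Python standard library. The key value and the example user ID are placeholders, and the approach is an illustration rather than a complete de-identification procedure.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Replace a platform user ID with a keyed hash (HMAC-SHA256).

    The same ID and key always give the same pseudonym, so posts by one
    account stay linkable, while the pseudonym cannot be recomputed
    without the key.
    """
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability


# The key below is a placeholder; a real key would be generated randomly
# and stored separately from the data.
key = b"replace-with-a-randomly-generated-secret"
print(pseudonymize("user_12345", key))
```

Pseudonymized data can still be re-identified through post content or other attributes, so this step complements access controls rather than replacing them.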

4.5 Reporting and Publishing

FAIR principles extend to how research using social media data is reported and published:

Provide detailed methodological reporting, including data sources, sampling methods, and analytical approaches
Acknowledge limitations of the data and potential biases
Share data and/or code when possible, or explain why sharing is limited
Use persistent identifiers for data and publications
Include rich metadata with publications
Follow community standards for data citation

5. Practical Examples of FAIR Implementation

Several platforms and initiatives demonstrate effective implementation of FAIR principles:

Dataverse: An open-source data repository software that generates formal citations with DOIs, provides multilevel metadata, and offers machine-accessible interfaces for data access.
FAIRDOM: Integrates the SEEK and openBIS platforms for FAIR data and model management, providing unique URLs, web-accessible formats, and standardized annotations.
ISA framework: A community-driven metadata tracking framework that facilitates standards-compliant collection, curation, management, and reuse of life science datasets.
Open PHACTS: A data integration platform that provides multiple representations (human- and machine-readable), canonical URLs, and standardized dataset descriptions.
UniProt: A comprehensive resource for protein sequence and annotation data that offers stable URLs, rich metadata in multiple formats, extensive links to other databases, and explicit typing of records.

6. Challenges and Solutions

Research using social media data for health studies faces several challenges:

Challenge	Description	Potential Solutions
Platform limitations	Major platforms like Facebook and Instagram restrict research access	Develop partnerships with platforms; use alternative data sources; advocate for research access
Representativeness	Social media users may not reflect the general population	Acknowledge biases; combine with other data sources; avoid overgeneralizing findings
Contextual information	Data extraction can strip posts of broader conversational context	Preserve threading information; document extraction methods; consider qualitative analysis
Consent issues	Users did not consent to research when posting	Develop transparent consent processes; anonymize data; focus on public health benefits
Dynamic APIs	Platform APIs change, affecting data consistency	Document API versions; preserve raw data; develop flexible processing pipelines
Reproducibility challenges	Restrictions on data sharing limit study reproduction	Share code and processing methods; share tweet/post IDs instead of content; develop synthetic datasets
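
The ID-sharing approach in the last row can be sketched as follows: before a dataset is shared, everything except the post identifier is dropped, and later users re-retrieve the content from the platform under its current terms. The record structure and field names here are invented for illustration.

```python
import csv
import io


def dehydrate(posts: list, id_field: str = "id") -> str:
    """Return CSV text containing only post identifiers, without content."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["post_id"])
    for post in posts:
        writer.writerow([post[id_field]])
    return buffer.getvalue()


# Invented records; field names are hypothetical.
collected = [
    {"id": "1001", "text": "Got my flu shot today", "author": "user_a"},
    {"id": "1002", "text": "Clinic queue was long", "author": "user_b"},
]
print(dehydrate(collected))
```

Posts that have been deleted or made private since collection cannot be re-retrieved, which also gives users some continued control over their content.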

7. FAIR Principles in Social Media Health Research: Essential Checklist

✓ Planning & Design

Findable

Assign persistent identifiers (DOIs) to datasets and document collection parameters (hashtags, keywords)
Create rich metadata describing platform source, APIs used, and collection timeframes

Accessible

Document data access methods that comply with platform Terms of Service
Plan for sharing tweet/post IDs rather than content when redistribution is restricted

Interoperable

Select standard vocabularies for health terms and social media features
Plan conversion from platform-specific formats to standardized structures

Reusable

Determine appropriate licensing respecting both researcher and platform rights
Document platform-specific collection methods, rate limits, and sampling approaches

Methodological Fairness

Research and document demographics of selected social media platforms
Establish protocols to assess representation biases in health discussions

✓ Collection & Processing

Findable & Accessible

Preserve original data format alongside processed versions with clear version control
Document API version, rate limits, and any sampling that occurred

Interoperable & Reusable

Normalize data across platforms (timestamps, engagement metrics, geographic information)
Document all platform-specific cleaning steps (bot removal, deduplication, handling of multimedia)
Record filtering decisions (language settings, verification status, content removal)
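
Timestamp normalization is one of the simpler cross-platform steps to illustrate. The sketch below assumes that one source delivers epoch seconds and another delivers ISO 8601 strings with a UTC offset, and converts both to UTC ISO 8601. These input formats are assumptions and do not describe any particular platform's API.

```python
from datetime import datetime, timezone


def to_utc_iso(value) -> str:
    """Convert epoch seconds or an ISO 8601 string with offset to UTC ISO 8601."""
    if isinstance(value, (int, float)):
        moment = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        moment = datetime.fromisoformat(value)
        if moment.tzinfo is None:
            # Refuse to guess the time zone of an offset-free timestamp.
            raise ValueError("timestamp has no UTC offset; cannot normalize safely")
        moment = moment.astimezone(timezone.utc)
    return moment.isoformat(timespec="seconds")


print(to_utc_iso(1700000000))
print(to_utc_iso("2023-11-14T17:13:20-05:00"))
```

Both calls describe the same instant, so they yield the same normalized string. Rejecting timestamps without an offset is a design choice that surfaces ambiguous data rather than silently assuming a time zone.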

Methodological Fairness

Preserve contextual information including conversation threads and responses
Assess demographic biases specific to platforms and health topics studied
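
Preserving conversation threads usually means keeping each post's reply relationship and being able to regroup posts by conversation. The sketch below assumes each record carries a hypothetical `reply_to` field holding the ID of the post it answers, or `None` for a top-level post.

```python
from collections import defaultdict


def group_by_thread(posts: list) -> dict:
    """Group post IDs under the ID of the earliest collected post in their thread."""
    parent = {p["id"]: p.get("reply_to") for p in posts}

    def root_of(post_id: str) -> str:
        seen = set()
        # Walk up the reply chain; stop at a missing parent or a cycle.
        while parent.get(post_id) in parent and post_id not in seen:
            seen.add(post_id)
            post_id = parent[post_id]
        return post_id

    threads = defaultdict(list)
    for post in posts:
        threads[root_of(post["id"])].append(post["id"])
    return dict(threads)


# Invented records; "z" was never collected, so "d" forms its own thread.
posts = [
    {"id": "a", "reply_to": None},
    {"id": "b", "reply_to": "a"},
    {"id": "c", "reply_to": "b"},
    {"id": "d", "reply_to": "z"},
]
print(group_by_thread(posts))
```

Replies whose parent was not collected end up as separate threads, which is itself worth documenting as a limit on the context that was preserved.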

✓ Storage & Management

Findable & Accessible

Store data with appropriate access controls for sensitive health topics
Maintain records of platform-specific identifiers (tweet/post IDs) for hydration
Document any platform restrictions affecting data redistribution

Interoperable & Reusable

Use standardized formats (JSON, CSV) with consistent structure
Implement protocols for removing personally identifiable information
Create appropriate safeguards for sensitive health information shared on social media
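
A protocol for removing personally identifiable information from post text might start with pattern-based redaction. The sketch below is deliberately simple: the three regular expressions are illustrative assumptions and will miss many real cases (names, phone numbers, locations), so it should be read as a starting point rather than a sufficient safeguard.

```python
import re

# Simple illustrative patterns; they will miss many real-world cases.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"https?://\S+"), "[URL]"),
    (re.compile(r"(?<!\w)@\w+"), "[USER]"),
]


def redact(text: str) -> str:
    """Replace e-mail addresses, URLs, and @mentions with placeholders."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(redact("Thanks @nurse_jo, email me at sam.lee@example.org or see https://example.org/x"))
```

E-mail addresses are replaced before @mentions so that the domain part of an address is not mistaken for a user handle.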

✓ Analysis & Reporting

Findable & Accessible

Share platform-specific data collection and analysis code
Document any platform-specific tools or limitations that affect replication

Interoperable & Reusable

Use standard methods for social media text analysis and health term extraction
Document approaches to handling platform-specific features (hashtags, mentions, emoji)
Detail methods for filtering inauthentic health content

Methodological Fairness

Acknowledge platform-specific demographic limitations in findings
Address how platform algorithms and features may influence health discussions
Discuss differences between online health discussions and offline health experiences

✓ Publication & Long-term Stewardship

Findable & Accessible

Include platform sources and collection timeframes in publication metadata
Provide clear instructions for accessing or hydrating datasets
Plan for maintaining accessibility despite potential platform changes

Interoperable & Reusable

Use standard reporting guidelines for social media research
Document how platform-specific features were handled consistently
Provide contingency plans for platform API changes or shutdowns

Methodological Fairness

Explicitly state limitations in generalizing social media health findings
Address ethical considerations specific to the health topics and platforms studied
Plan for periodic reviews as platform features and policies evolve

8. Institutional Review Board Considerations

IRB Challenges Specific to Social Media Health Research

Based on the literature, IRBs face several challenges when reviewing social media health research:

Participant definition ambiguity: Determining who constitutes a "research participant" when analyzing publicly available social media posts discussing health conditions.
Distance from data providers: Unlike traditional studies, researchers typically have no direct contact with the individuals whose data they analyze.
Evolving data types: The broadening of what constitutes "data" in social media research creates challenges for existing IRB frameworks.
Risk-benefit assessment: Balancing potential research benefits against uncertain harms to social media users who did not explicitly consent to research participation.
Rapidly changing landscape: Social media platforms frequently change their APIs, Terms of Service, and data accessibility with little notice.

Guidance for IRB Submissions

When preparing IRB submissions for social media health research, consider the following:

Consent considerations: Clearly articulate whether and why obtaining explicit consent is impractical, and how you will respect user privacy in its absence.
Terms of Service compliance: Document how your research adheres to platform Terms of Service, or provide justification for any potential deviations.
Risk mitigation strategies: Detail how you will protect potentially vulnerable individuals discussing sensitive health topics on social media.
Data security and anonymization: Describe robust procedures for securing, de-identifying, and protecting sensitive health information.
Platform-specific vulnerabilities: Address how your research accounts for platform-specific issues (e.g., how Twitter's public nature differs from more private health forums).
Contextual integrity: Explain how you will maintain the contextual integrity of health discussions rather than extracting posts in isolation.

Evolving IRB Approaches

The literature suggests IRBs are adapting to social media research through:

"Agile ethics" approaches that can respond to the dynamic nature of social media research
Ethics ecosystems involving information sharing between ethics boards
Regular reviews of the research process throughout the project lifecycle
More explicit consideration of fairness in addition to traditional ethical principles

When submitting social media health research for IRB review, researchers should anticipate providing more detailed information about their methodological approaches to fairness, as traditional IRB templates may not fully address these concerns.

Practical Steps for IRB Approval

Early consultation: Engage with your IRB early in the planning process, as many boards are still developing guidelines for social media research.
Educational materials: Provide educational resources about social media research methods to IRB members who may be unfamiliar with these approaches.
Platform documentation: Include detailed information about the social media platforms you'll study, including their privacy policies and research guidelines.
Transparency commitments: Detail how you will be transparent with findings, including acknowledgment of platform-specific biases.
Ongoing ethical oversight: Propose a plan for continuous ethical assessment throughout the research as platform policies and features evolve.
Disciplinary standards: Reference any emerging disciplinary standards for social media health research ethics in your field.
Data management plan: Submit a comprehensive data management plan addressing FAIR principles alongside traditional ethical considerations.

9. References

Leonelli, S., Lovell, R., Wheeler, B. W., Fleming, L., & Williams, H. (2021). From FAIR data to fair data use: Methodological data fairness in health-related social media research. Big Data & Society, 8(1), 1-14.
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., ... & Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 1-9.
Mons, B., Neylon, C., Velterop, J., Dumontier, M., da Silva Santos, L. O. B., & Wilkinson, M. D. (2017). Cloudy, increasingly FAIR; revisiting the FAIR data guiding principles for the European open science cloud. Information Services & Use, 37(1), 49-56.
Taylor, L. (2017). What is data justice? The case for connecting digital rights and freedoms globally. Big Data & Society, 4(2), 2053951717736335.
Fricker, M. (2009). Epistemic injustice: Power and the ethics of knowing. Oxford University Press.
Halford, S., Weal, M., Tinati, R., Carr, L., & Pope, C. (2017). Digital data infrastructures: Interrogating the social media data pipeline. AoIR 2016: The 17th annual conference of the association of internet researchers.
Zook, M., Barocas, S., Boyd, D., Crawford, K., Keller, E., Gangadharan, S. P., ... & Pasquale, F. (2017). Ten simple rules for responsible big data research. PLoS Computational Biology, 13(3), e1005399.
Kennedy, H., Elgesem, D., & Miguel, C. (2017). On fairness. Convergence: The International Journal of Research into New Media Technologies, 23(3), 270-288.
