Keywords
Epidemics - public health surveillance - social media
1 Introduction
Detecting the spread of infection in an epidemic allows health systems and governments
to implement timely public health interventions. The term ‘epidemic intelligence’
refers to “all the activities related to early identification of potential health
threats, their verification, assessment and investigation in order to recommend public
health measures to control them” [[1]]. Traditional epidemic intelligence systems mainly use clinical epidemiological
data, such as reports from hospitals and healthcare providers, which often lead to
time delays. In the last 20 years, the use of journalistic and other unofficial online
information sources for epidemic intelligence has gained interest. News aggregators
such as HealthMap [[2]] or MediSys [[3]] collect online news and information from websites or blogs to provide an overview
of the worldwide disease situation for the purpose of real-time surveillance. Data
are also collected from search engines or messages on social media, such as Twitter.
These data, which are now being utilized as a form of participatory health informatics (PHI), may provide a complementary source of information to traditional sources such as health systems, thereby helping to detect and predict the volume and spread of infection in epidemics [[4]].
PHI is a multidisciplinary field that uses information technology, as provided through the web, smartphones, or wearables, to increase the participation of individuals in their care process and to enable them to practice self-care and shared decision-making [[5]]. PHI deals with the resources, devices, and methods, such as social media, required to support active participation and engagement of stakeholders [[5]]. Goals to be achieved through PHI include improving and maintaining health and
well-being; improving the healthcare system and health outcomes; sharing experiences;
achieving life goals; and gaining self-education [[6]]. Beyond eliciting epidemic intelligence, participatory health is used to engage
or inform citizens of disease outbreaks or governmental activities related to an outbreak.
The COVID-19 pandemic has highlighted the potential role of PHI in pandemics. For
example, during the COVID-19 crisis Chinese central government agencies used social
media to promote citizen engagement [[4], [7]]. Other studies during the COVID-19 pandemic have used online data to assess citizens’ risk perceptions, attitudes, and opinions related to the pandemic [[8], [9]].
Given these research developments, we aim to examine which methods and features of PHI are considered, and which roles PHI plays in assessing, managing, and controlling pandemics. Furthermore, we aim to identify and summarize research on the roles citizens play in this process. Specifically, we use a literature search and synthesis to address the following research questions:
- Which epidemics have been studied by means of a PHI approach to disease surveillance?
- Which tools of PHI and which methods are used to analyze citizens’ contributions?
- Does citizen input correspond with epidemic data?
- What are the barriers to the use of social media for pandemic detection and management?
2 Methods
We undertook a literature review to identify studies that help answer the questions listed above. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses
(PRISMA) criteria guided the conduct and reporting of the review [[10]].
2.1 Search Strategy
The full search was carried out on June 10th, 2020 (see [Appendix 1]). The search covered PubMed, ACM Digital Library, IEEE Xplore, CINAHL, and the Cochrane Library using the following keyword groups (an illustrative sketch of how the groups combine into a single query follows the list):
- Keywords related to epidemics or pandemics: Epidemics (MeSH) OR Pandemics (MeSH) OR Disease Outbreaks (MeSH) OR Chikungunya OR Cholera OR Crimean-Congo haemorrhagic fever OR Ebola virus disease OR Hendra virus infection OR Influenza OR Lassa fever OR Marburg virus disease OR Meningitis OR MERS-CoV OR Monkeypox OR Nipah virus infection OR coronavirus OR 2019-nCoV OR Covid OR Plague OR Rift Valley fever OR SARS OR Smallpox OR Tularaemia OR Yellow fever OR Zika virus disease.
- Keywords related to participatory health: Social media OR social network site OR online social network OR online community OR Facebook OR Twitter OR YouTube OR Instagram OR WhatsApp OR mHealth OR mobile health OR e-health OR ehealth OR mobile applications OR apps.
- Keywords related to treatment/interventions: Management OR Detection OR surveillance OR Infoveillance OR infodemic.
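Terms within each group were combined with OR; the groups themselves were presumably joined with AND, as is usual for this kind of block search (the exact, database-specific query strings, including MeSH tags and field qualifiers, are given in [Appendix 1]). The following sketch is purely illustrative, with truncated term lists, and is not part of the review protocol.

```python
# Illustrative only: assembling a Boolean query from the three keyword groups.
# The term lists are truncated; the exact per-database syntax is in Appendix 1.
disease_terms = ["Epidemics", "Pandemics", "Disease Outbreaks", "Influenza",
                 "coronavirus", "2019-nCoV", "Covid", "SARS", "Zika virus disease"]
phi_terms = ["Social media", "online social network", "Facebook", "Twitter",
             "mHealth", "mobile health", "ehealth", "mobile applications", "apps"]
focus_terms = ["Management", "Detection", "surveillance", "Infoveillance", "infodemic"]

def or_group(terms):
    """Join one keyword group with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"

# Combine the three OR-groups with AND into a single query string.
query = " AND ".join(or_group(group) for group in [disease_terms, phi_terms, focus_terms])
print(query)
```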
2.2 Eligibility
We uploaded all retrieved references to Rayyan (https://rayyan.qcri.org) and removed duplicates. To assess the eligibility of the articles, in a first step, all titles and abstracts were divided between two reviewers (EG, OR), each screening half of the papers. In a second step, the full text of all potentially eligible articles was obtained, and the articles were reviewed independently by the two reviewers (EG, OR) to confirm their eligibility. Conflicts were discussed with a third reviewer (KD) until consensus was reached.
2.3 Inclusion and Exclusion Criteria
Articles were included if they: a) focused on epidemics, disease outbreaks or any
of the 20 pandemic diseases recognized by WHO (https://www.who.int/emergencies/diseases/en/);
b) addressed the role or features of social media, mobile health, or other PHI; c)
were primary studies reporting results; and d) were in the English language. Articles
were excluded if they: did not deal with epidemics, outbreak diseases or pandemics;
did not deal with PHI; were not primary studies or did not report results (e.g., study protocols, opinion pieces, frameworks, or review papers); or were published in languages other than English.
The selected articles were divided among three authors (EG, KD, OR) for data extraction.
We extracted: 1) Disease/epidemic, settings and country; 2) Objective, data source,
type of information provided, user group, epidemic and considered region; 3) Data
preprocessing, analysis techniques and features; 4) Outcome and reasons that limit
the outcome. Data were abstracted into a Microsoft Excel spreadsheet standardized
for this review, piloted and refined with 10 preliminary papers. The selected articles
were included in the narrative synthesis.
3 Results
3.1 Sample
The database search retrieved 461 records, with 417 records remaining after duplicate
removal; 53 papers met the inclusion criteria after full text review and were included
in the qualitative synthesis ([Figure 1]). A summary of the included studies can be found in [Appendix 2].
Fig. 1 PRISMA Flowchart of the paper selection procedure.
3.2 Which Epidemics Have Been Studied?
The 53 papers considered various epidemics; influenza-like illnesses (27/53; 51%)
and COVID-19 (11/53; 21%) were the most frequently studied epidemics ([Appendix 3]). Almost all studies analyzed retrospectively collected social media data where
contributors were unaware of the usage of their content for the purpose of epidemic
surveillance. In all of these papers, the data sets were created on the basis of predefined keywords; for example, tweets were collected using terms such as “flu” or “influenza” (e.g., [[11],[12]]). Only one study was conducted prospectively [[13]], with data actively generated by users.
About a quarter (14/53; 26%) of the papers had a global scope, with 10/53 (19%) focused on the USA and Canada, and 7/53 (13%) on China. Australia and Malaysia were each the studied region in three papers (6%). Japan, Madagascar, the Netherlands, and Italy were each the focus of two papers (4%), and Greece and the UK were each the target location of one paper (2%). Just over one in ten papers (6/53; 11%) did not specify a region.
The objectives of the 53 studies were classified into six main categories: analyzing
contents related to epidemics (i.e., reactions, opinion, attitudes, quality of the
information, sentiment analysis, distribution patterns, etc.); detecting disease;
disease monitoring or disease surveillance; comparing number of posts with official
disease numbers; predicting future epidemics; and tracing contacts ([Appendix 4]). The three most common objectives were analyzing content related to the epidemic
(20/53; 38%), detecting disease (12/53; 23%), and monitoring or surveillance (10/53;
19%).
3.3 Which Tools of PHI Were Examined, and Which Methods Were Used?
Social media was the data source in the majority of the included studies (49/53; 92%).
Among those, Twitter was the most commonly used channel (34/53; 64%), followed by
Sina Weibo (10/53, 19%) (Table 1). Other data sources were official disease outbreak
data (15/53; 28%), Google Trends (6/53; 11%), and Internet search engines (6/53; 11%).
Table 1
Data sources used by the studies included in the review (n=53)

Data source | References using this data source | Number (%) of papers citing this data source
Social media (all sources combined) | [[7],[11],[12],[14]–[58]] | 49 (92%)
Twitter | [[11],[12],[15],[16],[20],[23],[25]–[30],[32]–[38],[40],[42]–[52],[55],[57],[59]] | 34 (64%)
Sina Weibo | [[7],[17]–[19],[22],[24],[31],[39],[53],[54]] | 10 (19%)
Baidu | [[17],[18],[31],[39]] | 4 (7.5%)
Social media (in general) | [[17],[21],[56],[59]] | 4 (7.5%)
Coosto (social media monitoring tool) | [[32],[41]] | 1 (2%)
Facebook | [[32]] | 1 (2%)
WeChat | [[14]] | 1 (2%)
Wikipedia | [[28]] | 1 (2%)
YouTube | [[58]] | 1 (2%)
Official disease outbreak data (CDC, WHO, laboratory, etc.) | [[12],[17],[18],[20],[21],[24],[29],[31],[38],[39],[41],[46],[59]–[61]] | 15 (28%)
Google Trends | [[11],[12],[20],[39],[41],[55]] | 6 (11%)
Internet search/search engines | [[17],[23],[26],[28],[61],[62]] | 6 (11%)
News, online news | [[18],[20],[26],[56],[59]] | 5 (9%)
Surveys (any type) | [[13],[21]] | 2 (4%)
Spinn3r (Web and social media indexing service) | [[60]] | 1 (2%)

* Note: some studies included more than one data source
More than two thirds of the included studies provided information about the content of social media posts (37/53; 70%). The next most common types of information provided were spatiotemporal data from social media, reported in 14 papers (26%), and Internet search queries, reported in 12 papers (23%) ([Appendix 5]).
A preprocessing stage that prepares the data for analysis is required when automated analysis techniques are used. Preprocessing techniques clean the data and transform them into the predictable, analyzable format required by the analysis algorithms. Common examples in natural language processing are lowercasing, stemming, lemmatization, stop word removal, and normalization. Preprocessing is crucial when data sources use the dynamic, domain-specific language typical of social networks. Reporting the data preprocessing techniques used in automated analysis is relevant for research reproducibility. Thirteen studies (13/53; 25%) did not require a data preprocessing stage, for two main reasons [[11],[13],[18],[25],[26],[28],[30],[39],[41],[58],[59],[61],[62]]: they were based on quantitative data such as web search indexes, or the authors analyzed the collected data manually. Although a preprocessing stage was required in the remaining included studies, four of them (4/53; 7.5%) did not report any information regarding this stage [[14],[17],[29],[60]].
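To make the preprocessing steps named above concrete, the following minimal Python sketch applies lowercasing, simple normalization, whitespace tokenization, and stop word removal to a tweet-like text. The stop word list and regular expressions are simplified stand-ins and do not correspond to the pipeline of any reviewed study; stemming and lemmatization would require an additional library such as NLTK or spaCy.

```python
import re

# Simplified stand-in stop word list; the reviewed studies used their own resources.
STOP_WORDS = {"the", "a", "an", "is", "are", "i", "my", "and", "or", "of", "to", "in"}

def preprocess(text: str) -> list:
    """Minimal cleaning of a social media post: lowercase, strip URLs,
    mentions/hashtags and punctuation, tokenize, and remove stop words."""
    text = text.lower()                                # lowercasing
    text = re.sub(r"https?://\S+", " ", text)          # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)               # remove mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)              # keep alphabetic characters only
    tokens = text.split()                              # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop word removal

print(preprocess("Feeling awful, I think I have the flu :( @friend https://example.org"))
# -> ['feeling', 'awful', 'think', 'have', 'flu']
```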
A total of 36 studies reported information on data preprocessing. Data filtering was reported in 19 of the 40 studies that required preprocessing (19/40; 48%), with several different filtering approaches described. A data preparation stage was likewise reported in 19 studies (19/40; 48%); information on data cleaning was reported most frequently, although many of these studies described data preparation only vaguely, without details on the specific techniques and tools used. Data aggregation was implemented in 14 studies (14/40; 35%), in which data from several sources, such as Google Trends, Twitter, web search indexes, or official Centers for Disease Control and Prevention (CDC) data, were combined on a weekly or daily basis. See [Appendix 6] for further details.
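As an illustration of this kind of aggregation, the pandas sketch below resamples keyword-matched posts into weekly counts and aligns them with an official weekly case series. The column names and values are invented; the reviewed studies each used their own sources and granularity.

```python
import pandas as pd

# Hypothetical inputs: one timestamp per keyword-matched post, and an official
# weekly case series; both are made up for illustration.
posts = pd.DataFrame({
    "created_at": pd.to_datetime(["2020-03-02", "2020-03-03", "2020-03-10",
                                  "2020-03-11", "2020-03-12", "2020-03-18"])
})
official = pd.Series([120, 340, 510],
                     index=pd.to_datetime(["2020-03-08", "2020-03-15", "2020-03-22"]),
                     name="reported_cases")

# Aggregate posts into weekly counts (weeks ending on Sunday) and align them
# with the official series on the same weekly index.
weekly_posts = (posts.set_index("created_at")
                     .resample("W-SUN").size()
                     .rename("post_count"))
combined = pd.concat([weekly_posts, official], axis=1)
print(combined)
```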
Regarding the analysis techniques, the most common analyses in the included studies were correlation analysis (23/53; 43%), followed by spatiotemporal analysis using various techniques (22/53; 42%) and classification (14/53; 26%) (Table 2).
Table 2
Analysis techniques used in included papers (n=53)

Analysis | Algorithm or technique (references using this technique) | Number (%) of papers citing this technique
Correlation analysis | Spearman ([[11],[18],[31],[33],[39],[54]]); Pearson ([[19],[20],[37],[38],[43],[44],[48],[49],[55],[60]]); unspecified ([[12],[22],[25],[30],[50],[51],[61]]) | 23 (43%)
Spatiotemporal analysis:
  Time series analysis | FDR ([[23]]); DBNM ([[24]]); KLD ([[47]]); JI ([[51]]); GCt ([[53]]); DFt ([[53]]); ARIMAX ([[55]]); ESA ([[57]]); unspecified ([[13],[14],[30],[35],[38],[43],[49],[50],[61],[62]]) | 17 (32%)
  Seasonal analysis | SARIMA ([[17]]); SDA ([[17]]); LiR ([[17]]); BCP ([[28]]); SH-ESD ([[42]]); STL ([[54]]); unspecified ([[48]]) | 5 (9%)
Classification:
  Relevance classification | SVM ([[15],[16],[27],[35],[36],[38],[44],[49],[53]]); LR ([[35],[46],[55]]); NB ([[36],[42]]); LiR ([[46]]); ME ([[48]]); RF, DT, ET, KNN, MP ([[53]]) | 13 (25%)
  Emotions classification | NB ([[47]]) | 1 (2%)
Detection:
  Topic identification | LDA ([[11],[27],[44],[54]]); BTM ([[33]]); RF ([[54]]); Km ([[57]]); unspecified ([[40],[51]]) | 8 (15%)
  Symptom recognition | LDA ([[21]]); LR ([[21]]) | 1 (2%)
  Language detection | unspecified ([[32]]) | 1 (2%)
Prediction/Estimation | LiR ([[12],[21],[22],[37],[38],[46]]); MRM ([[30],[46],[61]]) | 8 (15%)
Sentiment analysis | NB ([[34]]); ME ([[34]]); DLMC ([[34]]); SSA ([[47]]); unspecified ([[7],[47],[50]]) | 5 (9%)
Other analysis:
  Qualitative analysis | Thematic analysis ([[59]]); Classification ([[31]]) | 2 (4%)
  Link analysis, influence analysis, and/or communities identification | GNA ([[60]]) | 1 (2%)

* Note: some studies applied more than one analysis technique
FDR: False Discovery Rate; DBNM: Dynamic Bayesian Network Model; KLD: Kullback-Leibler
Divergence; JI: Jaccard Index; GCt: Granger Causality Test; DFt: Dickey-Fuller tests;
ARIMAX: AutoRegressive Integrated Moving Average model with eXogenous covariates;
ESA: Exponential Smoothing Algorithm; SARIMA: Seasonal AutoRegressive Integrated Moving
Average; SDA: Seasonal Decomposition Analysis; LiR: Linear Regression; BCP: Bayesian
Change Point; SH-ESD: Seasonal-Hybrid Extreme Studentized Deviate; STL: Seasonal-Trend
decomposition based on Loess; SVM: Support Vector Machine; LR: Logistic Regression;
NB: Naïve Bayes; ME: Maximum Entropy; RF: Random Forest; DT: Decision Tree; ET: Extra
Tree; KNN: K nearest neighbors; MP: Multilayer Perceptron; LDA: Latent Dirichlet Allocation;
BTM: Biterm Topic Model; Km: K-means; MRM: Multivariable Regression Model; DLMC: Dynamic
Language Model Classifier; SSA: Stanford Sentiment Analyzer; GNA: Girvan-Newman Algorithm.
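As an illustration of the most frequently used analysis, the following sketch computes Spearman and Pearson correlation coefficients between a weekly post-count series and an official case series using scipy. The series are invented and do not reproduce results from any reviewed study.

```python
from scipy.stats import pearsonr, spearmanr

# Invented, already aligned weekly series (illustration only).
weekly_post_counts = [14, 25, 60, 180, 240, 200, 150, 90]
weekly_case_counts = [3, 9, 31, 120, 160, 170, 110, 70]

# Rank-based (Spearman) and linear (Pearson) correlation with p-values.
rho, rho_p = spearmanr(weekly_post_counts, weekly_case_counts)
r, r_p = pearsonr(weekly_post_counts, weekly_case_counts)

print(f"Spearman rho = {rho:.2f} (p = {rho_p:.4f})")
print(f"Pearson r    = {r:.2f} (p = {r_p:.4f})")
```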
Regarding the features used in the analyses, spatiotemporal features were the most common. Thirty-four studies used time stamps or time series in their analysis (34/53; 64%). Most studies used location information to filter data at the collection stage, but only 16 of them compared results across different geolocations (areas, cities, or countries) (16/53; 30%).
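A minimal sketch of how such cross-region comparisons can be prepared is shown below: post counts are grouped by region and week with pandas. The region labels and timestamps are invented for illustration.

```python
import pandas as pd

# Hypothetical post-level data with a coarse, invented location field.
posts = pd.DataFrame({
    "created_at": pd.to_datetime(["2020-03-02", "2020-03-03", "2020-03-10", "2020-03-11"]),
    "region": ["US-NY", "US-CA", "US-NY", "US-CA"],
})

# Weekly post counts per region, ready for comparison across geolocations.
by_region_week = (posts.groupby(["region", pd.Grouper(key="created_at", freq="W-SUN")])
                       .size()
                       .rename("post_count"))
print(by_region_week)
```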
Twenty-three of the included studies analyzed features extracted from post contents.
Seven of those studies analyzed words or keywords (7/53; 13%). Six studies used word
frequencies or the Term Frequency-Inverse Document Frequency (TF-IDF) values (6/53;
11%). Five studies used “bag of words” in their analysis (5/53; 9%). Other features
such as linguistic characteristics were used in a single study. Further details about
features used in the analyses are reported in [Appendix 7].
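To make these content features concrete, the sketch below derives bag-of-words and TF-IDF representations from a few invented posts using scikit-learn; the reviewed studies worked with far larger corpora and their own vectorization settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented example posts (real studies vectorized thousands to millions of posts).
docs = [
    "think i caught the flu, fever and cough all night",
    "flu shots available at the clinic this week",
    "great weather today, going for a run",
]

bow = CountVectorizer()        # bag-of-words: raw term counts
tfidf = TfidfVectorizer()      # TF-IDF: counts reweighted by how rare a term is

X_bow = bow.fit_transform(docs)
X_tfidf = tfidf.fit_transform(docs)

print(bow.get_feature_names_out())   # learned vocabulary
print(X_bow.toarray())               # document-term count matrix
print(X_tfidf.toarray().round(2))    # TF-IDF weighted matrix
```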
3.4 Does Citizen Input Correspond With Epidemic Data?
Table 3 summarizes the reported outcomes by disease. Several papers reported correlations between official case numbers and social media data related to COVID-19, influenza-like illnesses, measles, MERS-CoV, MRSA, and the plague [[11],[30],[41],[53],[61],[62]]. Results regarding the timeliness of the detected events, i.e., whether PHI can help detect outbreaks ahead of official statistics, were rare and contradictory. For conjunctivitis and COVID-19, two papers (3.8%) reported that social media data can detect outbreaks as early as, or earlier than, the official reporting mechanisms [[53], [62]]. For Ebola, one paper concluded that it is unlikely that online surveillance provides an alert more than a week before the official announcement [[47]].
Table 3
Reported outcomes related to PHI per epidemic in the reviewed papers

Epidemic | Outcome
Conjunctivitis | PHI (Google search data) enabled earlier outbreak detection [[62]].
COVID-19 | PHI (number of tweets) correlated positively with daily case numbers [[19],[22],[39],[54]]. PHI (reports of symptoms and diagnoses of COVID-19 on social media) enabled prediction of daily case counts up to 14 days ahead of official statistics [[53]]. Value for communication and information provision: media richness negatively predicts citizen engagement through government social media, but a dialogic loop facilitates engagement; information relating to the latest news about the crisis and the government’s handling of the event positively affects citizen engagement through government social media [[7]]. Social media users trust information shared by health professionals and others in their online social networks [[40]].
Dengue | PHI did not enable earlier outbreak detection [[26]].
Ebola | Analysis of emotions in social media microblogging data (Twitter) may be utilized as a source of evidence for disease outbreak detection and monitoring [[47]].
Influenza, seasonal flu, influenza-like illnesses, avian influenza | Outbreak detection: • Partially positive correlation between local influenza-like illness percentages and tweet rates [[16]]. • Positive correlation between the number of tweets, search volume, or frequent daily discussions and daily case numbers [[20],[24],[25],[27],[31],[35],[36],[38],[43],[44],[49],[50],[56],[60]]. • PHI (Twitter) enabled earlier detection of outbreaks [[42]]. Differences in the degree of sensitivity exist between social media sources: a high sensitivity of 92% was found for Google, a low sensitivity of 50% for Twitter, and the lowest sensitivity of 33% for Wikipedia [[28]]. • PHI led to false alarms: Twitter flu surveillance erroneously indicated a typical flu season during 2011-2012.
Measles | Value for communication and information provision: social media provides insight into the opinions regarding the epidemic that are salient among the public at a certain moment [[59]]. There is a positive correlation between the weekly number of social media messages and the weekly number of online news articles [[59]].
MERS-CoV | High correlations between social media data and the number of confirmed MERS cases, and between social media data and the number of quarantined cases [[11]].
Methicillin-resistant Staphylococcus aureus (MRSA) | PHI enabled rapid identification of potential MRSA outbreaks [[41]].
Plague (bubonic plague, pneumonic plague) | Statistically significant positive correlations were found between Google Trends search data and confirmed, suspected, and probable cases [[30],[61]].
Zika | Value for communication and information provision: social media is unlikely to be useful or effective for communication on an epidemic; there are discrepancies between what the general public was most interested in, or concerned about, and what public health authorities provided [[29],[32],[52]].
Some papers reported a positive effect of using social media for crisis communication. Posting the latest news and information on how the government is handling the pandemic can affect citizens’ engagement [[7], [59]], demonstrating that such posting can be valuable for citizen education [[18]]. Other papers question the usefulness or effectiveness of social media for communication on an epidemic [[29],[32],[52]], since there are discrepancies between the interests or concerns of the general population and the information provided by public health authorities. Misinformation and false alarms were mentioned in two papers [[48], [58]] (Table 3).
3.5 Barriers to the Use of Social Media for Pandemic Detection and Management
In the included papers, we identified four groups of issues that affect the usefulness of PHI for detecting or retrieving knowledge on pandemics. These are the use of app data, data collection, the behavior of individuals, and the analysis and interpretation of social media data.
When apps are used for disease surveillance, privacy issues and concerns about personal confidentiality can hamper data usage. Authorizing social media apps to use personal data, including personal information, activity status data, and spatiotemporal data, is still not widely accepted [[14]]. Another barrier relates to data collection: not all data generated on social media are available for analysis. Several research papers used the Twitter API, which only allows collection of a subset of the data posted on Twitter, so relevant tweets may have been left unconsidered [[12],[33],[34]]. Only one paper considered popular tweets, i.e., those that were re-tweeted many times [[52]]; such tweets are not representative of all tweets.
Another bias may arise from censorship in countries like China, which limits the completeness of the data [[22]]. Data collection from Google Trends provides only relative, not absolute, values, which hinders further refinement and processing [[61]]. Finally, data relevant to disease surveillance, such as geolocation or other user data, might be unavailable or imprecise (e.g., when using search logs from Google or access rates of Wikipedia pages) [[28], [62]].
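For context on the sampling limitation described above, the sketch below shows a keyword-based collection call against the Twitter v2 recent-search endpoint using the tweepy client. The bearer token is a placeholder; under standard access, this endpoint only covers roughly the last seven days and caps the number of results per request, so any collection built this way is a subset of all matching tweets.

```python
import tweepy

# Placeholder credential; a real bearer token is required for the v2 API.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Keyword-based collection of recent English-language tweets, excluding retweets.
response = client.search_recent_tweets(
    query="flu OR influenza -is:retweet lang:en",
    tweet_fields=["created_at", "geo"],
    max_results=100,   # per-request cap under standard access
)

for tweet in response.data or []:
    print(tweet.created_at, tweet.text[:80])
```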
The third group of problems is related to the behavior of individuals, or citizens’ input. When a disease (e.g., influenza) becomes a hot topic, people do not post about it [[15]]. In contrast, when celebrities are concerned about a disease, more people post about it [[62]], which may generate false concern. Public awareness of disease surveillance methods using social media could also influence behavior and consequently lead to false reporting [[16]].
The fourth group of issues concerns the analysis and interpretation of data, i.e.,
data processing for the purpose of disease surveillance. First of all, when considering
free text as a data source, misspellings, abbreviations, and use of slang hamper processing
[[27]]. Second, social media data is dynamic: new words can appear (e.g., new slang referring
to a disease). This requires retraining classifiers so that new vocabulary and new
anomalies in social signals can be learned [[16], [27]].
4 Discussion
4.1 Summary of Findings
This literature review and synthesis confirms that PHI has been used to address a
wide variety of public health issues relating to pandemics. Most literature has focused
on influenza or influenza-like illness or, in 2020, COVID-19. The vast majority of
studies have used data from social media posts or web search patterns with a wide
variety of data analysis techniques. For most diseases, the small number of studies
identified means that firm conclusions about the utility of PHI in detecting and monitoring
these disease outbreaks cannot be reached. In comparison, the extensive literature on influenza and COVID-19 (despite the literature search ending in June 2020) provides valuable insights into the potential of PHI to provide additional, more timely, or more efficient pandemic monitoring.
4.2 Epidemics and PHI
Although most of the articles included in our review focused on influenza-like illnesses
and COVID-19, other epidemics have also been considered in PHI research (i.e., Ebola,
Zika, Dengue, etc.). PHI research on previous pandemics has probably facilitated fast
developments in relation to the COVID-19 emergency. Likewise, PHI research on COVID-19
may be applicable for detecting and managing future epidemics.
PHI is an imperfect source of data with its own biases: its content and the frequency of posting are not equally distributed across the population. For example, a recent secondary analysis of survey data revealed that women were more likely to post on COVID-19 than men; that black, Latino, and other non-white males were more likely to post on the topic than whites; and that people aged 65 and above were more likely to post than younger people [[68]]. However, existing analysis tools take this into account and use the frequency of posting as a variable in and of itself. Research has yet to study the association between differential frequency of posting and outbreak detection.
It should also be noted that PHI is varied and, as such, some types of PHI are better at answering specific questions than others. For example, search data on the CDC website were better and faster at detecting influenza trends at the national, but not the state, level [[69]]. There is a need to clarify which questions are best answered by which source of PHI. The same holds true for analysis techniques. As individuals generate increasingly more PHI and its use for detecting and managing pandemics persists, newer and more refined tools and analyses are required to assess how PHI can best assist in promoting health, and what its characteristics are.
Finally, recent work analyzing tweets to capture public sentiment about COVID-19 identified five dominant themes: health care environment, emotional support, business economy, social change, and psychological stress [[70]]. These themes are not captured in electronic health records, yet they provide invaluable insights into population needs and concerns that should receive public health attention. Since some papers claim that early alerts cannot be achieved from social media, more information needs to be collected on various diseases to understand to what degree patterns generalize.
4.3 PHI Tools, Methods and Citizens’ Input
Analysis of social media and web searches shows that posting and search frequencies have consistent positive correlations with official disease incidence numbers in the cases of influenza, COVID-19, and the related MERS-CoV. Recent studies on COVID-19 suggest that analysis of these data may even predict an increase in case numbers ahead of official, health system generated data [[53]]. Previous literature syntheses have also concluded that online social networks generate data that can track pandemic development [[63]] and other disease outbreaks [[64]]. As these data are created by citizens in their daily lives and do not incur additional data collection costs, they represent an attractive additional source of information to complement traditional disease surveillance data. The timeliness of PHI provides other significant benefits, such as indicating the level of awareness of and concern about COVID-19, which epidemiological records do not convey. Further studies validating these observations are a high priority, particularly as the COVID-19 pandemic unfolds.
Since citizen behavior and input may change as a pandemic evolves, these correlations between infection incidence and secondary data sources may not remain stable. Social media activity might also reflect psychological responses to the pandemic, which are related to, but not the same as, the actual prevalence of COVID-19. Furthermore, a lack of correlation between disease incidence and media trends might result from adaptation and familiarity, leading, for example, to fewer information searches (e.g., on Wikipedia).
The requirements for data cleaning and analytic methods, which may need to evolve rapidly, may present additional barriers to this approach to disease surveillance being widely adopted across geographic settings. Standardizing surveillance approaches and efficiently developing effective algorithms have previously been identified as challenges for the use of social media in the surveillance of illicit drug use [[65]]. Furthermore, uncertainties about how representative the data are and whether sufficient population coverage is reached remain unresolved.
Our synthesis of the outcomes of using PHI in pandemics suggests that analysis of social media posts is useful for assessing disease-related informational needs, for example in efforts to reduce vaccination hesitancy [[71]].
The analysis of social media posts can also be useful for assessing the effectiveness of government or health authority communication with their populations. Again, the question of how representative social media users are of the wider population remains unresolved. At present, each study requires a bespoke approach to data collection and analysis. It seems likely that the use of social media and web browsing data will complement traditional research approaches, with the advantage of greater immediacy of data.
4.4 Barriers to Using Social Media During Pandemics
We found several barriers to using social media for detecting or retrieving knowledge on pandemics: data privacy and concerns about personal confidentiality; data collection issues (technical limitations such as the Twitter API data sample, lack of data completeness, censorship, or potential inaccuracies); behavior trends; and the complexity of data analysis and interpretation.
Data privacy and personal confidentiality are two of the most relevant issues in using social media for participatory health purposes [[72], [73]]. The included studies reported privacy and confidentiality barriers, showing that there are still unresolved ethical, legal, and technological questions. Therefore, new models of responsible and transparent data collection and treatment that address these questions are needed, especially in public health emergencies [[74]]. Limitations in data collection from social media sources are commonly reported in the scientific literature [[75], [76]]. The social influence that an individual’s posts on social media may have on others’ behaviors is also reported as a relevant aspect to be considered in digital surveillance systems using social media [[77]]. Simple methods are commonly used to collect data from social media sources, resulting in datasets that include noise (data not related to the specific pandemic). A filtering stage is then required to select the relevant data efficiently. Both manual and automated filtering are commonly used to classify collected data, and most automated methods are based on artificial intelligence. Due to the characteristics of social media data, several processing stages are required to prepare the data for use by analysis models. However, artificial intelligence supporting participatory health is still in its infancy [[78]]. Although the most common application of artificial intelligence in participatory health is the secondary analysis of social media data [[78]], several challenges must still be addressed [[78]]. Additionally, a combination of epidemiologic expertise, analytical expertise, and advanced computational skills is required to interpret data for pandemic surveillance [[77]].
4.5 Limitations
One limitation of our work is that data collection ended on June 10th, 2020, while the COVID-19 pandemic continued to evolve. Thus, for example, we did not include a study published in August 2020 showing that media coverage influences Google search trends, which therefore cannot be assumed to reflect only people’s health [[66]]. Likewise, we did not include a study using natural language processing that revealed changes in large mental health groups (e.g., SuicideWatch and Depression) on Reddit during the COVID-19 pandemic [[67]]. This novel, illuminating work was published after our inclusion date. Nevertheless, we believe it was important to end the search at that date so that COVID-19 was included in the review and so that the results are released while COVID-19 is still of relevance and can be utilized by health officials and researchers.
In order to move the field of PHI forward for detecting and managing future pandemics, we recommend:
✓ Finding the best way to deal with the current barriers to the fuller impact of PHI data (i.e., privacy issues, commercial practices, governmental practices, etc.);
✓ Clarifying which questions are best suited to be answered by which source of PHI;
✓ Creating more refined tools and analyses to assess how PHI best assists in promoting health during pandemics.
5 Conclusions and Recommendations for Future Research
This paper explored the role of PHI in managing and detecting pandemics. We conclude
that PHI provides an unmediated, authentic, and readily available source of information
that can be highly useful in the detection and management of pandemics. Our findings
highlight the ways in which social media can be used as a form of participatory health to manage and detect pandemics. They also illuminate the barriers to a fuller impact of such data: some of these barriers stem from privacy issues, some from commercial practices such as providing relative but not absolute rankings of trends, while others are rooted in governmental practices such as censorship. A series of questions remains for future studies to answer: What issues hamper citizens’ contributions and their value, and what can facilitate those contributions? To what extent are citizens invited to contribute to outbreak detection and crisis communication using PHI? Given that citizen input is instrumental in the early detection of diseases and crucial in detecting the mental distress resulting from them, governments should strive to invite such input in a standardized, anonymized manner.