Keywords
Data mining - secondary use - informed consent - evidence-based practice - program
evaluation
Introduction
In this contribution, the IMIA working group on Technology Assessment & Quality Development
in Health Informatics offers some evaluation considerations for secondary uses of
clinical data. In setting out important principles for an evidence-based approach
to policy and implementation, we identify two quite distinct conceptual categories
of concern: first, evaluation of the secondary uses of data from philosophical, methodological, and ethical perspectives;
and second, evaluation of health systems and services based upon secondary uses of data.
‘Secondary use’ needs definition. Simply put, ‘Secondary use of health data applies
personal health information for uses outside of direct health care delivery’ [[2]]. Data is recorded for a particular purpose within a healthcare encounter, such
as recording presenting problems, tentative diagnosis, and treatment action initiated.
Few would argue against using that data to track an antigen batch, to audit quality
of delivery, or to deal with a subsequent complaint, even though these were not anticipated
at the time of recording. Secondary use is when data recorded for an operational care
purpose is used to create new intelligence or knowledge away from its context of origin
and without the originators necessarily being aware. Examples include: clinical research,
population health, epidemiology, and pharmaceutical effectiveness. Specific lines
of research may include policy effectiveness (achieving objectives); treatment outcomes
against intent; multi-morbidity patterns; polypharmacy outcomes; changing illness
patterns. Though potential sources of data may appear homogeneous, such as hospital
records, the components and resultant analyses are very varied and focussed. The potential
scope may include laboratory data, pharmacy, radiology, immunisation, emergency/elective
attendances, primary care, mental health, social care, payers, public health, bio-surveillance,
pharmacovigilance, and incident reporting, while external linkages to enable greater
depth of interpretation in a big data modality may include census data, meteorological
data, law enforcement data, education data, and housing data.
To set the scene, we highlight that there are different aggregations and resultant
analyses of big data sets. In the commercial world, ‘big data’ may be seen
as the aggregation of data from disparate unrelated sources. For instance, data might
be combined from consumer spending recorded on loyalty cards, weather forecasts, small
area socio-economic profiling, and television schedules to forecast product demand
for supermarket branches to drive the supply chain. By contrast, most large health
data sets are of similar data set types, such as pooled anonymised primary care consultation
and prescribing data to look at long term outcomes. Increasingly, however, the health
sector focus is on what might be called ‘hybrid’ large data sets, where different
health data sets are analysed together. Far more than commercial ‘big data’, ‘hybrid’
health data faces ethical and governance issues and questions of public trust and
acceptability. There are both practical and public-perception issues about re-use
that brings together very different and operationally unlinked data in analyses seeking
new and unanticipated patterns of personal behaviour, illness trajectories, and likely
outcomes.
Timely and accurate health data spanning the continuum of care, linked at patient level
and safely shared as necessary for care delivery purposes, have been recognized globally
as an essential tool. Secondary use is seen as critical not just for the optimal delivery
of individual health care interventions, but also for improving performance of health
care systems and health outcomes of patients, for obtaining longer-term and real world
evaluations of existing treatments including in a multi-morbidity context, for supporting
the re-design and evaluation of new models of health care service delivery and for
contributing to the discovery and evaluation of new treatments [[1]]. This is the foundation of both Smarter Healthcare [[3], [4]] and Learning Health Systems [[5]].
This means there is a need for an evidence-based approach: balancing innovation and
evaluation with trust and governance. This results in the need for an evaluation lifecycle
of secondary analysis, with both formative and summative elements [[6], [7]]. Such a lifecycle would build trust; support accountability, transparency, and regulation; and
require, among other things, transparent reporting of the evaluation of secondary-use analyses.
This paper starts with the first conceptual category introduced above, evaluation
of secondary uses of data. Sections two and three address evaluation issues from the
perspectives of governance, trust, theoretical considerations, semantics, and context.
Then we consider, in section four, how national and regional policies are framing
evaluation factors relevant to secondary uses (both of secondary uses and based upon secondary uses). Section five presents examples of how the analysis based upon secondary uses has informed enhancements in clinical performance and health IT design.
Finally, section six synthesises the two categories and demonstrates why a multi-level
evaluation approach is needed. We conclude with summary recommendations.
2 Governance and Trust
Anxiety about health information confidentiality is an issue for many patients and
care professionals, particularly when data are collected digitally and held virtually.
Managing safe use of health data is a major concern across the Organisation for Economic
Co-operation and Development (OECD) countries, having a direct impact on the sharing
of personal health data, and even causing patients to engage in “privacy protective
behaviours” (avoiding screening tests or treatment, or declining recruitment into research protocols).
The development and publication of suitable policies or guidelines greatly increases
public transparency [[1], [5]].
The possibility of wide secondary use, for reasons and by agencies not known at the
time of data collection, and without individual permission at the time of analysis,
raises a multiplicity of new concerns about personal confidentiality and about the
exploitation of knowledge for unknown purposes. Uncontrolled use for secondary purposes
may thus lead to greater anxiety, greater potential for protective behaviours, and
thereby also for incomplete and biased data sets [[8]]. Conversely, however, undue restriction on controlled secondary use closes down
important research options without society having opportunity to debate this potential
non-discovery of new knowledge.
These concerns could be magnified by the push towards “open data” [[9], [10]], even though the data are intended to be aggregate and non-identifiable. However, a
recent study of 13,000 US biobank participants reported that although 51% expressed
worry about privacy, results did not suggest that open data sharing would adversely
affect participant recruitment [[11]]. Given that biobank participation is based on explicit consent of some kind, this
finding is not necessarily transferable to routine secondary use of “open data” from
the general population. Indeed, a recent survey of over 20,000 citizens from across
the European Union (EU) found a strong preference for not sharing anonymised health
data with academic researchers [[12]].
A particular privacy concern relates to re-identification of “anonymised” data. Privacy
legislation allows the disclosure of health data for secondary purposes without patient
consent if the data are de-identified. De-identification is the act of reducing the
information content in data to decrease the probability of discovering an individual’s
identity. It has been argued that de-identification methods do not provide sufficient
protection because they are easy to reverse and thus data can be easily re-identified.
However, a systematic review [[13]] showed that only a few attacks have involved health data and more importantly,
most re-identified data has not been de-identified according to existing standards.
To manage the risks of re-identification, future research should focus on
re-identification attacks on large databases that have been de-identified following
existing standards, and success rates should be correlated with how well de-identification
was performed. It is important to collect an evidence-based understanding of the extent
to which de-identification standards and practices protect against real re-identification
attacks and how the standards and practices should be developed to cover the future
challenges. Given that “it can be impossible to assess re-identification risk with
absolute certainty” [[14]], this and other citizen concerns demand open and transparent debate. Scotland is
an example of a country which has sought open debate on local approaches, and made
clear its policies [[15]].
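To make the notion of de-identification concrete, the sketch below illustrates one common approach: generalisation of quasi-identifiers followed by a k-anonymity check. This is a minimal illustration under stated assumptions, not a complete de-identification standard; the field names and the threshold k=5 are hypothetical.

```python
# Minimal sketch: de-identification by generalisation with a k-anonymity
# check. Field names and k are illustrative assumptions, not a standard.
from collections import Counter

def generalise(record):
    """Reduce information content by coarsening quasi-identifiers."""
    return (
        record["birth_year"] // 10 * 10,  # decade instead of exact year
        record["postcode"][:3],           # partial postcode
        record["sex"],
    )

def satisfies_k_anonymity(records, k=5):
    """True if every combination of generalised quasi-identifiers is shared
    by at least k records, lowering the probability of re-identification."""
    counts = Counter(generalise(r) for r in records)
    return all(n >= k for n in counts.values())
```

Real de-identification standards go well beyond such checks, which is precisely why their effectiveness against real attacks merits the evidence-based evaluation argued for above.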
Good governance is therefore an essential prerequisite for ensuring effective primary
and secondary use of health IT systems. It provides a framework to create the necessary
trust to enable full and willing secondary, or value-adding, use [[1]]. A definition of governance [[16]] specifies monitoring and evaluation as an integral part of any policy, and e-health
is no exception. Developing and implementing national e-health strategies calls for
monitoring and assessing their progress towards availability, usability, quality,
and integrity of data, and towards safe and transparent data sharing – in short, data governance
[[17]]. Evaluation is essential to identify good practices from which others could learn
to support the movement toward common best practices [[1]].
The OECD has a long-standing interest and expertise in this area and has published
eight key data governance mechanisms to support strengthening national health information
systems and enabling multi-country projects to improve population health ([Box 1]). Each of these calls for some kind of assessment and evaluation.
Provision of good quality personal health data is a prerequisite for extracting good
quality statistics for secondary use purposes, but it is merely a beginning. There
is still much work needed to develop criteria and assess maturity of individual countries’
data governance systems related to secondary use of health data. Ensuring trust through
conspicuous and transparent governance frameworks is an essential prerequisite. Sound
examples exist, and continued evaluation is necessary to refine best and most effective
practice.
Box 1 OECD key data governance mechanisms
-
The health information system supports the monitoring and improvement of health care
quality and system performance, as well as research innovations for better health
care and outcomes. (There are indicators in the OECD e-health model survey for organisations
to measure attainment of this principle.)
-
The processing and the secondary use of data for public health, research, and statistical
purposes are permitted, subject to safeguards specified in the legislative framework
for data protection. (This principle calls for evaluation of national policies and
legislation.)
-
The public is consulted upon and informed about the collection and processing of personal
health data. (Existence of this mechanism calls for policy analysis; public awareness
can be monitored by citizen surveys.)
-
A certification/accreditation process for the processing of health data for research
and statistics is implemented.
-
The project approval process is fair and transparent, and decision-making is supported
by an independent, multidisciplinary, project review body.
-
Best practices in data de-identification are applied to protect patient data privacy.
-
Best practices in data security and management are applied to reduce re-identification
and breach risks.
-
Governance mechanisms are periodically reviewed at an international level to maximise
societal benefits and minimise societal risks as new data sources and new technologies
are introduced (see [[1]] section 5.1).
3 Theoretical Considerations, Semantics, and Context
Secondary use of clinical data carries several fairly obvious assumptions, all of
which are fundamental to inferential statistics, but the limitations of which are
not always acknowledged in the way health data are used or abused. Firstly, it is
frequently believed that it is theoretically valid to re-use data outside their context
of origination and that meaning can be safely asserted independently of that context.
Secondly, it must be assumed that operational clinical data are of sufficient minimum
quality to be reliable and usable (albeit with various “data cleansing” procedures
required). Thirdly, it is held that sound population-level inferences can be drawn
from such secondary use of data. In this section, we examine these assumptions.
Contextual Validity of Data
As Ingenerf nicely summarises, in health informatics the problem of providing meaning
to data that are communicated and then processed is the issue of semantic interoperability
[[18]]: when communicating, healthcare professionals are used to interacting dynamically
at the syntactic and semantic levels until they reach a common understanding. When dealing
with a patient case, a physician creates and tests a mental image while interpreting
data into information, based on his or her entire professional context, and it is within
this context that he or she communicates. The risk of electronic data transfer is to
lose the context by ‘lifting the ink off the paper’. Thus, the challenge is to ensure
that context is faithfully carried with the data and information transferred. The
physical reality, the clinician’s mental model, and the information model embodied
in the electronic health record (EHR) or data exchange may be three or even four quite
different things [[19], [20]].
In 1991, Johan van der Lei spoke and wrote adamantly against the misuse of data in
computer-stored medical records and formulated the First Law of Medical Informatics:
“Data shall be used only for the purpose for which they were collected” [[21]] with the explicit consequence that “If no purpose was defined prior to the collection
of the data, then the data should not be used.” Van der Lei gave two major reasons
for this: a) the quality of the data, and b) the context of the data. Given advances
in technology, the challenges to elicit and process such data are now quite tractable,
at least in Western countries, and there is an increasing demand to exploit this data.
So, we have to look carefully at and beyond the two barriers mentioned by van der
Lei, with a focus on the fundamental principles applying to re-use of data and information.
Perhaps it is time to re-formulate the First Law with words such as “Usage of data
for purposes other than those for which they were generated is acceptable only when
this has been validated stringently according to both ethical and scientific principles,
including faithful reflection of context.” The ethical validation should include consideration
of the potential socio-economic benefit, but also that patient concerns about data
misuse can contribute to “censored” EHR content [[8]].
Assumptions about Data Quality and Provenance
There is a general organisational context and a specific clinical context of stored
clinical data. Transferability is a very real and serious issue, for at least the
following two reasons: analytical variation and biological variation. For example,
even if a laboratory test has the same name in two clinical locations, the analytical
methods may not be identical, and hence the data generated may vary significantly.
The problem of transferability was investigated by many research groups in the 1980s
and 1990s. For instance, the impact of various factors, considered individually,
on the validity of the output of decision support systems (e.g., technology, methodology,
and terminology factors) was clearly demonstrated [[22]]. That work showed that even for international standard clinical protocols
there are differences in the local interpretation of the meaning of individual clinical
signs and symptoms. There can be variation in the nosology: the state of knowledge
with regard to the investigation or classification of the clinical problem(s) under
examination, co-morbidities, previous clinical history, interventions, and drugs taken.
There may even be cultural differences in clinical practice and technologies applied
or differences in the common language [[22]]. Our conceptual understanding and interpretation of ‘disease’ changes over time,
as does diagnostic capability, treatment and care regimes, technical and pharmaceutical
abilities, and political governance. Therefore, a technical solution to the problem
of interoperability at a semantic level, for instance in terms of standardised terminologies,
is merely a partial solution.
Justification of Inferences: Scientific and Technological Advances
Uncritical secondary use of data from medical records based on blind trust in the
semantic interoperability of such data is irresponsible. Such unconstrained secondary
usage of data could, for instance, erroneously extrapolate a pharmaceutical drug trial
based on a cohort of single-illness, young to middle-aged men to a very different context
such as prescribing the drug to small children, postmenopausal or aged women, or people
with co-morbidities. The suitability of the knowledge drawn into the new setting needs
to be assessed as to its context and origins to decide if it is applicable to the
setting, including what verification, adjustments, or safety parameters are needed.
Epidemiological differences may to some extent be compensated for by normalisation
procedures. Terminological differences may be coped with by standardization and mapping,
which has required decades of sustained efforts and funding. A recent example was
the Observational Medical Outcomes Partnership (OMOP), which developed a common data
model to support analysis of heterogeneous data from operational EHRs, adverse incident
reports, and financial claims [[23]]. Methodological differences and differences with respect to analytical quality
may also sometimes be compensated for by normalisation procedures. Such calculations
are feasible when the exact correlations and the valid context for interpretation are
known, or they may be performed against the local reference intervals at the point
of clinical intervention.
Scientific efforts have been made in large European Union Research and Development
(EU R&D) projects as well as in smaller national projects to combine and exploit the
merge or comparison of clinical data from several databases from various countries.
The EU-ADR project used normalisation to score clinical events and the PSIP project
used a crude normalisation of laboratory data (relative to a population mean) but
did not merge the various databases [[24]].
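As a concrete illustration of such normalisation, the sketch below expresses a laboratory result relative to a population mean and standard deviation, in the spirit of the crude normalisation reported for PSIP. The values and reference population are hypothetical, and this is a sketch of the general idea rather than the projects' actual methods.

```python
# Minimal sketch: normalising a laboratory result relative to a population
# mean, in the spirit of the crude approach described for PSIP. Values are
# illustrative only.
import statistics

def normalise(value, population_values):
    """Express the result as standard deviations from the population mean,
    making results from sites with different analytical methods comparable
    only insofar as the populations themselves are comparable."""
    mu = statistics.mean(population_values)
    sigma = statistics.stdev(population_values)
    return (value - mu) / sigma

# e.g. a potassium result of 5.1 mmol/L against a local reference population
z = normalise(5.1, [3.8, 4.1, 4.4, 4.0, 4.6, 4.2, 3.9])
```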
A feasible solution is a structured definition of the necessary level of detail
for each element of context, in order to enable valid usage
of data. The means are meta-data, meta-information, and meta-knowledge for each individual
datum, item of information, and piece of knowledge applied [[25], [26]]. Such data, information, and knowledge exist, but they are distributed
and unfortunately constitute a large, though necessary, overhead for secondary-use processing.
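As an illustration of what carrying such meta-data might look like, the sketch below attaches a minimal provenance structure to each datum. The fields shown are hypothetical and far from exhaustive; they are not a standard information model.

```python
# Minimal sketch: a datum that carries its own context. The fields are
# illustrative assumptions, not a standard meta-data model.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ProvenancedDatum:
    value: float
    unit: str            # e.g. "mmol/mol"
    code: str            # terminology binding, e.g. a SNOMED CT concept id
    recorded_on: date    # temporal context
    setting: str         # originating organisation / care setting
    method: str          # analytical method, for transferability checks
    purpose: str         # purpose of the original collection

hba1c = ProvenancedDatum(49.0, "mmol/mol", "43396009",
                         date(2017, 3, 1), "primary care clinic",
                         "HPLC", "diabetes annual review")
```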
The state of the art in the technological aspects of secondary use of data is progressing
rapidly. The astonishing potential of digital technology to offer high-quality, high-volume
routine data, generating a virtuous circle of data-driven quality improvement for
direct patient care as well as for secondary uses that support operational management, public
health, and research, has stimulated massive investment [[27]]. A promising example is the “Green Button” project, which aims eventually at offering
real time EHR cohort analysis to provide decision support for the many cases where
gold standard randomized control trial (RCT) evidence is lacking [[28], [29]]. The Patient Centred Outcomes Research Institute (PCORI) promotes the development
of methodological standards for research that can enhance the development of evidence-based,
patient-centred health care [[30], [31]]. The approach is founded on a systematic process involving public comment, engagement,
and revision. The aim is to promote research that is scientifically sound, meaningful,
and patient-centred. This approach parallels many of the developments in precision
medicine, which can be defined as prevention and treatment strategies that account
for individual variability [[32]]. The applicability of precision medicine has increased dramatically with the development
of large-scale biologic databases involving genomics, proteomics, and metabolomics,
along with the computational tools for dealing with this data. Major developments
in patient centred outcomes research and precision medicine are in turn underpinned
by major work to ensure patient consent (e.g., the Data Segmentation for Privacy
(DS4P) initiative) [[33]] and patient safety monitoring (e.g., work on establishing common formats to allow
for the uniform collection and reporting of patient safety data by patient safety
organisations) [[34]].
It may be that current developments with machine learning using previously unimaginable
levels of computational power and quantities of diverse but linkable data will leapfrog
traditional approaches to the issues described here (at least the technical ones)
[[35]], but it seems premature to regard this as a foregone conclusion or a comprehensive
solution.
4 Evaluation Aspects in National and Regional Policy
OECD Findings
The overall ambition of OECD member states is to better integrate e-health into their
health policies and to better align e-health investments with health needs [[11]]. Already in 2012, most OECD countries participating in an OECD study [[36]] reported a national plan or policy to implement EHRs (22 of 25 countries). Most
had also begun to implement that plan (n=20) and a majority (n=18) had included some
form of secondary use of EHRs within their national plan. The most commonly included
secondary uses were public health monitoring and health system performance monitoring
(n=15). Half of the countries also indicated that their plans included that physicians
could query the data to support treatment decisions. The least commonly reported planned
data use was for facilitating or contributing to clinical trials (n= 10). Regular
use of EHR data for secondary analyses was already underway, mainly for public health
monitoring (n=13) and general research (n=11) [[36]].
Key Elements in Evaluation from the Viewpoint of Secondary Use of Health Data
As noted above, an important prerequisite for secondary use of personal health data
is the transferability of data, which requires organizational, technical, semantic,
and legal interoperability, as well as quality and protection of personal data [[37]–[39]]. As countries develop and implement their e-health strategies, they will need to
monitor progress to ensure that these requirements are met and that the e-health efforts
are indeed contributing to health policy goals. For example, the EU e-health action
plan section on global collaboration [[40]] stated that from 2013 the Commission should enhance its work on data collection
and benchmarking activities in health care with relevant national and international
bodies to include more specific e-health indicators and assess the impact and economic
value of e-health implementation. Close collaboration with the OECD and other actors
is required to harmonize e-health indicators, including the OECD work on indicators
for availability and usage of e-health [[41]] and the Nordic e-health indicator work [[42]], which has defined common Nordic indicators covering, among other things, interoperability, protection,
and quality of personal health data – key elements in evaluation from the secondary-use
viewpoint. Methodologically, triangulation of methods is needed to cover all the aspects
raised by the use of personal health data.
5 Examples: Using Routine Clinical Data for HIT Evaluation and Quality Indicators
We now turn to practical examples of evaluation based upon secondary uses of data. The increasing uptake of EHRs and other health information
systems has made routine collection and analysis of clinical data to evaluate and
improve clinical performance an easier and faster undertaking. Furthermore, this provides
opportunities to create a fine-grained picture of systems’ effects on quality of care
by analysing interaction data that are a by-product of their use [[43]]. This section discusses two examples of how routinely collected data can be used
to evaluate clinical performance, and how routine clinical and interaction data can
be synergized to study the mechanisms of health information systems in detail and
optimize their efficacy.
Re-use of Routinely Collected Clinical Data to Systematically Evaluate Clinical Performance
Health professionals need measures to judge the quality of care they provide in order
to identify areas for improvement. Further, due to societal pressure for transparency
and accountability, governments, accreditation organizations, patient associations,
and insurance companies have greatly increased the number of quality indicators
to be measured. The current number of quality indicators makes their manual calculation
impracticable. Besides being time-intensive, adding to the registration burden, and lacking
timeliness, manual calculation is error-prone and can jeopardize the reproducibility,
validity, and comparability of quality measure results [[44]]. Therefore, quality indicators should be automatically calculated from routinely
collected data from EHRs.
Quality indicators are often compared over time and among health care institutions
or care providers to identify outliers, which require quality improvement activities.
Results of these benchmarking activities can have large negative consequences for
those who underperform in terms of financial restrictions imposed by insurance companies,
loss of faith by patients, and loss of motivation by care providers. Aspects such
as reproducibility, validity, and comparability of quality indicators are hence of
utmost importance. However, these aspects are hampered by the fact that quality indicators
are often ambiguously defined in natural language, which impedes their automated computability.
Therefore, quality indicators should be formalized before their release and application
on routinely collected data from EHRs. The CLIF method developed by Dentler et al.
[[44]] transforms quality indicators—which are typically described in unstructured text—into
precise queries that can be computed on the basis of routinely collected clinical
data. The method includes eight steps to formalize the numerator and denominator of
a quality indicator and ensures that the formalizations obtained faithfully represent
the meaning of the indicator. During the first step, the clinical concepts such as
diagnoses and procedures are extracted from the text describing the quality indicator.
These concepts need to be coded by standard terminologies such as SNOMED CT or ICD-9/10
depending on the national coding system used. During the second step, these concepts
are bound to concepts in the EHR’s underlying information model. In step three, the
temporal aspects (e.g. a procedure should be performed before another procedure) of
the indicator are formalized. Step four formalizes numeric criteria (e.g. HbA1c value
must be below 53 mmol/mol). In steps five and six, the Boolean criteria (e.g. three
codes for Diabetes are combined with OR) are formalized and grouped. Step seven formalizes
the exclusion criteria and negations, and in step eight, criteria that apply only to
the numerator and not to the denominator are identified. The generalizability and
reproducibility of CLIF has been positively evaluated [[44], [45]]. Whilst CLIF may not directly solve re-use challenges such as missing data and
poor data quality, it can guide implementation of local EHRs with respect to how clinical
data items should be collected to increase data quality.
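To give a flavour of what a formalised indicator looks like once steps of this kind have been applied, the sketch below computes a hypothetical indicator (proportion of diabetic patients with latest HbA1c below 53 mmol/mol) over a simplified record layout. The codes, exclusion rule, and record structure are illustrative assumptions, not CLIF's actual output format.

```python
# Minimal sketch: a quality indicator formalised as an executable query,
# loosely following the CLIF steps described above. Codes, the exclusion
# rule, and the record layout are illustrative assumptions.

DIABETES_CODES = {"73211009", "44054006", "46635009"}  # OR-combined (step 5)

def in_denominator(patient):
    """Denominator: patients with any coded diabetes diagnosis (steps 1-2)."""
    return any(dx in DIABETES_CODES for dx in patient["diagnoses"])

def in_numerator(patient):
    """Numerator: latest HbA1c below 53 mmol/mol (numeric criterion, step 4),
    excluding e.g. palliative-care patients (exclusions, step 7)."""
    if patient.get("palliative_care"):
        return False
    hba1c = patient.get("latest_hba1c")
    return hba1c is not None and hba1c < 53.0

def indicator(patients):
    denominator = [p for p in patients if in_denominator(p)]
    numerator = [p for p in denominator if in_numerator(p)]
    return len(numerator) / len(denominator) if denominator else None
```

The value of such a formalisation is that the same query can be re-run over time and across institutions, supporting the reproducibility and comparability discussed above.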
Unobtrusive Quantitative Process Evaluations to Optimize Health Information Systems
Formalised quality indicators and guidelines are presented in electronic health information
systems such as clinical decision support (CDS) and audit and feedback (A&F) systems.
These systems have been moderately successful at ensuring that patients receive improved
care, but their effectiveness is highly variable [[46], [47]]. CDS provides clinicians with case-specific advice at the point of care (e.g.,
alerts or reminders) [[48]], whereas A&F provides population-level performance feedback on quality indicators
over a period of time [[46]]. The reasons for their variable effectiveness are unclear because the mechanisms
behind A&F’s success or failure are poorly understood [[49]]. This limits the ability to design better interventions [[50]]. The electronic nature of modern A&F systems allows for new possibilities to study
the mechanisms of A&F quantitatively and unobtrusively by harnessing data that are
routinely captured as a by-product of using the systems in real-life [[43]].
Exploring the mechanism through which interventions bring about change is crucial
to understanding both how the effects of the specific intervention occurred and how
these effects might be replicated by similar future interventions [[51]]. Coiera [[19]] describes this mechanism as an information value chain that connects the use of
a system to health outcomes. The chain begins with a user interacting with a system,
and some of these interactions will provide information. Some of this information
may cause the user to change her decision, which in turn can change the process of
care. Finally, only some process changes affect health outcomes. For example, suppose
that a general practitioner prescribing non-selective beta-blockers in a patient with
asthma is alerted by a CDS system that this may cause exacerbations (“interaction”).
When the general practitioner notices the alert (“information received”) and decides
to cancel the prescription (“decision changed”) this will affect the patient’s medication
regimen (“care process altered”) and can ultimately reduce the risk of asthma exacerbations
and unscheduled hospital admissions (“outcome changed”). Whereas most A&F studies
only investigate the relationship between exposure (i.e., inviting health professionals
to interact with the system) and care processes or outcomes (stage 4 and 5), electronic
health information systems can produce usage logs that allow us to evaluate the relationships
between all other stages in the information value chain, often with high fidelity
[[43]]. Using measurements from all those stages can provide a more comprehensive picture
of the intervention process to help explain the observed variability in its effectiveness.
In fact, analysing the number and types of events in each stage may help to identify
obstructions in the chain that prevent value from progressing to the subsequent stage,
and reveal the determinants for a successful progression. However, we would like to
emphasize that we are not arguing that analysing the information value chain makes
qualitative process evaluations obsolete. Whereas a quantitative approach will reveal
that certain events occurred (e.g., users declining an alert), a qualitative approach
is more suitable to explore why these events occurred (e.g., the alert conflicted
with patient preferences). Our vision is that quantitative evaluations may discover
gaps in the intervention process, which may then be filled in by qualitative work,
making them complementary.
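As a sketch of how usage logs could quantify this chain, the example below counts events per stage and reports stage-to-stage conversion, so that obstructions become visible as sharp drops. The log format and stage labels are assumptions for illustration, not a prescribed A&F log schema.

```python
# Minimal sketch: a funnel analysis over usage-log events tagged with the
# information value chain stage they evidence. The log format is an
# illustrative assumption.
from collections import Counter

STAGES = ["interaction", "information received", "decision changed",
          "care process altered", "outcome changed"]

def funnel(events):
    """Return (stage, next_stage, n, n_next, conversion) tuples; a sharp
    drop marks where value stops progressing along the chain."""
    counts = Counter(e["stage"] for e in events)
    rows = []
    for prev, nxt in zip(STAGES, STAGES[1:]):
        a, b = counts[prev], counts[nxt]
        rows.append((prev, nxt, a, b, b / a if a else 0.0))
    return rows
```

A low conversion between, say, "information received" and "decision changed" would then be a candidate for the qualitative follow-up work described above.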
6 The Need for Multi-level Evaluation – Key Evaluative Criteria
Health systems, and their supporting technologies, should continuously learn and improve,
as postulated by the Learning Health System approach [[3]], and thus evaluation of the means and processes of secondary use of health data
is vital as essential good practice. Particular foci of evaluation should be:
i) the consumers of secondary analysis of health data (e.g., health care managers,
policy makers, clinicians, researchers, therapeutics developers, and society as ultimate
beneficiary of better services); ii) considerations related to the utilisation of
the secondary use of data; and iii) ensuring the validity and quality of the secondary
use of clinical data [[52]].
Consumers of the Secondary Analysis of Health Data
Health care is increasing in its complexity – not only is there a growing prevalence
of multi-morbidity (neonates surviving with ongoing health conditions; ageing populations
with greater hazard of health events) but also increased specialisation of service
delivery which can lead to fragmentation. Secondary use of data based on robust data
linkage techniques has the potential to improve our understanding of the breadth and
course of health care delivery [[53]]. But while the secondary use of data continues to expand into a fast-growing industry,
there are important concerns about whether consumers are sufficiently aware of what
is going on. For instance, is there a sufficient public awareness [[2]] of the benefits and challenges associated with secondary use of data?
The utilisation of health system data is a sensitive community topic. Any mistrust
or lack of confidence about the way that data is handled could inhibit its application
and severely affect its utilisation [[54]]. Major questions about the use of secondary data in health [[2]] continue to revolve around whether patients have the right to audit or place constraints
on the use of their data. How does society ensure that the use of secondary data is
transparent and is safeguarded? Several countries (e.g., Australia and the United
Kingdom) are considering “opt-out” models of data consent, which provide patients with
the right to opt out of their personal information being used for purposes beyond their
direct care, but this may well lead to bias, for instance by social group or by health
condition. This right is also reversible [[54]]. These issues relate very strongly to public trust, which we described in section
2 of this contribution.
The secondary use of health system data relies upon some key principles including
transparency and coordination with all stakeholders [[55]]. It also involves the establishment of mechanisms that can monitor, detect, and
report on the application of knowledge derived from secondary use of health data (including
any adverse incidents) and help to enhance its impact [[56]].
Considerations Related to the Utilisation of Secondary Use of Data
The increasing availability and accessibility of large volumes of data from clinical
and non-clinical sources have helped to broaden the scope and utilisation of secondary
health system data [[57]]. The technological ability to merge, link, re-use, and exchange data has outpaced
the establishment of policies, procedures, and processes that monitor the ethics and
legality of secondary use of data [[2]]. Types of data brought into integrational secondary analysis may include:
-
Web and social media data (Twitter, Facebook etc.)
-
Machine to machine data (sensors, vital signs etc.)
-
Biometric data (genetics, medical images, etc.)
-
Human-generated data (e.g., EMRs) [[58]]
These data can be clinical or non-clinical [[57]]. Common clinical repositories may include data from EHRs and disease registries
which are used to monitor patient care. These may be linked to administrative records
and other non-clinical sources such as data from over-the-counter medications, finance,
and other consumer data sources. These various sources and types of data each come
with their own nomenclature and definitions, e.g., de-identified data, anonymised data,
reversibly anonymised data, etc. [[2]].
The conceptual framework for secondary use of health system data analytics is similar
to, and can be based in part on, traditional health informatics processes such as
the de-identification and anonymisation of data [[2]]. But there are also some important additional conceptual (architectural) considerations.
In most cases, the user interfaces of traditional analytical tools differ from those
used for “big data”, which demand different informatics skills and often require
open-source tools to address complex issues in the retrieval, pooling,
processing, and warehousing of data. These tools currently lack the support and
user-friendliness of traditional analytical packages [[58]].
Ensuring the Validity and Quality of Secondary Use of Clinical Data
In the past, large silos of traditional paper records remained dormant and were seldom
analysed, which meant they played little to no role in enhancing the effectiveness and
safety of health care [[57]]. Important methodological considerations to ensure that the product of the secondary
use of health data is valid, reliable, and applicable, must involve:
-
Consideration of the quality of data
-
Understanding context to ensure that meanings inferred from the data are not distorted
-
Promoting transparency and governance [[59]]
The discipline of health informatics has been built in large part on optimising key
standards and considerations for data quality and data metrics [[60]–[62]]. These include consideration of the following (a minimal computational sketch follows the list):
-
Accuracy of data
-
Data comparability
-
Data completeness
-
Data consistency
-
Data relevance
-
Data usability
-
Data validity
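As the minimal sketch promised above, the functions below compute two of these dimensions, completeness and consistency, over a tabular extract. The field names and the consistency rule are hypothetical; real data quality assessment frameworks define such measures far more precisely.

```python
# Minimal sketch: two data quality dimensions computed over a tabular
# extract. Field names and the consistency rule are illustrative assumptions.

def completeness(records, field):
    """Proportion of records in which the field is present and non-empty."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records) if records else 0.0

def consistency(records):
    """Proportion of records satisfying a cross-field rule, here: the
    discharge date must not precede the admission date."""
    ok = sum(1 for r in records
             if r["discharge_date"] >= r["admission_date"])
    return ok / len(records) if records else 0.0
```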
The translation of data from secondary analysis into reliable and applicable knowledge
that can be used to enhance the quality of care also relies on the proper and effective
choice of study design. Large data sources may enhance the potential for evaluation
but they are still dependent on the formulation of robust evaluation questions and
topics, as well as the proper study design and the use of appropriate tools to support
rigorous measurement and assessment [[63]–[66]].
Methodological Frameworks for Secondary Uses
One framework for secondary use is SPIRIT (Systematic Planning of Intelligent Reuse
of Integrated Clinical Routine Data), a best-practice framework and procedure model
for the systematic planning of intelligent reuse of integrated clinical routine data
[[67]]. Unlike other methods that concentrate on the analysis part, such as the KDD process
(Knowledge Discovery in Databases) as proposed by Fayyad et al. in 1996 [[68]] or OLAP (OnLine Analytical Processing) as proposed by Codd in 1993 [[69]], SPIRIT allows a holistic view of secondary use and supports the structured, stepwise
planning and conduct of secondary use of clinical data in heterogeneous environments,
with a special focus on the objectives of data analysis and on supporting its
reproducibility. Its application can and should be evaluated in various ways.
First, after secondary data analysis, project management should evaluate whether the
defined goals of secondary data reuse have been fulfilled. Often we find scope
creep, i.e., a shift from the originally intended goals to other or additional goals. This
is not bad in itself, but it should be made transparent. How can this evaluation be done?
One approach is to evaluate whether the generated reports respond to the originally
defined goals.
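One way to operationalise this goal-versus-report check is a simple traceability comparison, sketched below under the assumption that each report records which defined goals it addresses; the data structures are hypothetical and not part of SPIRIT itself.

```python
# Minimal sketch: checking that generated reports respond to the originally
# defined goals. The report structure is an illustrative assumption.

def unanswered_goals(defined_goals, reports):
    """Return defined goals not addressed by any generated report; a
    non-empty result flags unmet goals or undocumented scope creep."""
    addressed = {g for r in reports for g in r["goals_addressed"]}
    return [g for g in defined_goals if g not in addressed]
```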
Second, the acceptance by stakeholders should also be evaluated: how do various stakeholders
see the information and reports that are derived from secondary data analysis? Do
they find them helpful? Do they use them regularly? Do they report continuously?
Are there unexpected or adverse effects of secondary use of clinical data, e.g. changes
in processes with the sole aim to optimize reported indicators? This evaluation assesses
whether the chosen indicators of secondary data analysis respond to stakeholders’
needs and fulfil defined goals.
7 Conclusion
In conclusion, while analysis of “Big Data” is politically sexy and attracts funding,
it nonetheless needs serious evaluation and evidence-based
thinking. As with all health informatics activities and innovations, to do less –
and thus condone imperfect or erroneous outcomes – would be unethical. This should
be based not just on data science, but also on broader evidence-based health informatics
considerations that are needed to build and underpin trust and ensure feasibility
and policy effectiveness [[70]].