Keywords
Open data - FLOSS - Open source - healthcare - privacy - security
1 Introduction
The International Medical Informatics Association (IMIA) Open Source Working Group
(OSWG) is a voluntary group supported by IMIA that brings together researchers and
practitioners from multiple countries with a diverse range of informatics experience
but common interest in the adoption of open approaches to advancing the use of informatics
to improve healthcare. Although not engaged on a common research project, working
group members foster international discussions and debates on current issues related
to the adoption of open source approaches and have supported the development of an
open access database of Free, Libre, and Open Source Software (FLOSS), called MedFLOSS
(www.medfloss.org), to be used in the medical domain.
In response to the theme of the 2018 issue of the IMIA yearbook, “Sharing data: balancing
access and privacy to advance healthcare”, the IMIA OSWG members have collaborated
on a working group discussion to discuss the interplay and commonalities between the
open source and open data movements in the healthcare domain.
The openness of software has accelerated innovation in recent decades[1]
[2]
[3]. FLOSS has been widely used as an infrastructure for powering in the Internet and
has also been widely adopted in the medical domain[4]
[5]
[6]. This trend towards openness in software development is now being applied to the
publication of datasets. In recent years, datasets have been opened to the public
under FLOSS-like licences on the Internet as “open data”[7]. There is still some debate about the definition of “open data” but, in general,
“open data” is the phrase that has come to mean available “for anyone to use, for
any purpose, at no cost”[8].
The Open Knowledge Foundation (OKF) has defined “open data” as datasets released with
these conditions:
-
Availability and access: the data must be available as a whole and at no more than a reasonable reproduction
cost, preferably by downloading over the Internet. The data must also be available
in a convenient and modifiable form;
-
Reuse and redistribution: the data must be provided under terms that permit reuse and redistribution including
the intermixing with other datasets. The data must be machine-readable;
-
Universal participation: everyone must be able to use, reuse, and redistribute—there should be no discrimination
against fields of endeavour or against persons or groups. For example, ‘non-commercial’
restrictions that would prevent ‘commercial’ use, or restrictions of use for certain
purposes (e.g. only in education), are not allowed.
In this article, the term “open data” follows the OKF definition.
The open data movement has rapidly spread across the world since Barack Obama issued
his memorandum on transparency and open government in 2009, which emphasized how openness
would strengthen democracy and promote government efficiency and effectiveness[9]. Following this memorandum, the US government implemented a Web-based platform to
open their data called Data.Gov
[10]. Other countries followed this activity and there are now many open data websites
operated by governments around the world. The social coding website, GitHub, now hosts
a wide range of open source tools to manipulate open data and the site is frequently
used by the open data community to discuss implementation guides, data requirements,
and other processes needed to use open data effectively[11]. There are various FLOSS tools to utilize “open data”, such as geographical mapping
tools[12] and tools for genetic analysis[13], and there are now a wide range of internet companies using open data to create
online services such as Citymappers and Transport for London who use these open data
sources to provide visualised information[14]
[15]. The beneficial economic effect of open data has been estimated to be between 3
to 5 trillion USD per year, in the global economy, and the use of open data in health
care has been claimed to save 450-550 billion USD per year in US[16]. The European commission estimated the total market size of “open data” to between
193 and 209 billion EUR for 2016, and expected this figure to grow by 36.9% by 2020
in the EU 28+[17].
In the scientific domain, genomics data have been made open through internet repositories
and these have accelerated the completion of the human genome project[18] and cultivated a wide range of bioinformatics and otheromics research through the
availability of data[19] computer science, mathematics, and statistics. Data intensive, large-scale biological
problems are addressed from a computational point of view. The most common problems
are modeling biological processes at the molecular level and making inferences from
collected data. A bioinformatics solution usually involves the following steps: Collect
statistics from biological data. Build a computational model. Solve a computational
modeling problem. Test and evaluate a computational algorithm. This chapter gives
a brief introduction to bioinformatics by first providing an introduction to biological
terminology and then discussing some classical bioinformatics problems organized by
the types of data sources. Sequence analysis is the analysis of DNA and protein sequences
for clues regarding function and includes subproblems such as identification of homologs,
multiple sequence alignment, searching sequence patterns, and evolutionary analyses.
Protein structures are three-dimensional data and the associated problems are structure
prediction (secondary and tertiary. Free access to scientific databases has supported
a wide range of large-scale scientific projects globally and is now largely regarded
as part of the information technology (IT) infrastructure for many projects. Organisation
for Economic Co-operation and Development (OECD) guidelines have described these beneicial
effects of open data and now promote the disclosure of research data from public-funded
projects to maximise the effects of government investment[20].
Healthcare data usually contains sensitive information that needs to be protected
for privacy reasons, which, at first thought, may be considered as a barrier to openness.
However, many non-healthcare open data repositories already contain human subject
data and open data availability is increasingly being adopted in healthcare despite
this concern. There are now rising pressures to adopt open data for research and healthcare
operational transparency but the use of open data will still require a careful balancing
of free access and privacy in order to protect healthcare clients’ confidentiality[8]
[21]. This is because although we would like the patient to have access to his or her
own data, we also want to preserve the right for any patient to keep his or her health
data private if (s)he so wishes. This is highlighted in the “second pillar” of the
Estonian e-Society[22] which states that citizens can have access to their data records but that other
government agencies have access to the data only if the citizen authorises access
via his or her personalised digital signature.
Citizens’ right to personal privacy has been extended to patients by physicians from
the time of Hippocrates of Cos (400 B.C.), who established a school of medicine and
expected medical practitioners to subscribe to a medical oath, now known as the Hippocratic
oath[23], essential variants ofwhich are still in use today. New clinicians promise that
“whatsoever I shall see or hear in the course of my profession, as well as outside
my profession in my intercourse with men, if it be what should not be published abroad,
I will never divulge, holding such things to be holy secrets”.
The oath's respect for privacy and human dignity also has a medical dimension: “I
will use treatment to help the sick according to my ability and judgment, but never
with a view to injury and wrong-doing.” This can be relevant in situations where a
patient's health may be adversely affected by having little or no control over his
or her own medical privacy[24], or where societal forces allow employers to pursue an expectation of having access
to personal medical data by presuming that employee medical data will be made available
to them because they can buy it, or because they can expect all medical data of their
employees to be made open to them. The new General Data Protection Regulation coming
into force to the state members of European Union in 2018[25] will threaten worldwide companies with large ines if they misuse personal data from
Europe. Alongside data protection issues, there arise questions about the use of medical
health data that categorise individuals[26] by commercial organisations. And yet, at the same time, perhaps there are medical
details patients will want to share, for purposes of medical research[27]. Perhaps, too, patients would like to have access to all of their own medical data.
With the topic of this year's Yearbook of Medical Informatics issue being how to balance
the various requirements of the various stakeholders in medical health data, an important
medico-socio-technical issue can be seen as one of seeking to allow persons comprehensive
access to their own medical data, where their own interests are served, and using
healthcare data to improve medical research and healthcare delivery while protecting
their medical data privacy, where it is medically justifiable to do so.
2 Methods
Through a combination of Open Source Working Group discussions and desk research,
we selected a number of existing open healthcare data repositories to serve as examples
of the types of standards and approaches to open data in healthcare. Working group
members also independently reviewed recent issues and controversies surrounding the
open data movement to produce a series of mini-case studies.
We evaluated whether health data repositories meet the OKF open definition[28] in the following categories:
-
Open licence;
-
Free accessibility;
-
Machine readability;
-
Open format.
Licences must be public domain or compatible with open licences and should be downloadable
via the Internet at reasonable cost. The data should be in a machine-readable form
that can be easily processed by a computer using an open data format[28].
Through a snowball sampling technique, we aimed to discover open data repositories
in use in healthcare by examining grey and white literature retrieved from internet
searching by keywords, such as “open data health”, journal database searches, and
the Open Health Data Journal recommendations[29]. While the “data portal” site[30] shows 524 open data repositories, we sampled 13 data repositories that include healthcare
data shown below ([Table 1]) as examples.
Table 1
Health Data Repositories and their profiles
No.
|
Data repository name
|
Issuer
|
URL
|
Accessibility
|
License
|
Machine readability
|
Available formats
|
1
|
HealthData.gov
|
U.S. Department of Health & Human Services
|
https://www.healthdata.gov/
|
open
|
open
|
3531/3542
|
JSON, CSV, XML, RDF
|
2
|
DATA.GOV.UK
|
United Kingdom, government digital service
|
https://data.gov.uk/
|
open
|
open
|
2121/2132
|
CSV, XLS, HTML
|
3
|
The human mortality database
|
The Human Mortality Database Project
|
http://www.mortality.org/
|
registration required
|
not open
|
12103
|
CSV
|
4
|
Global health observatory data
|
World Health Organisation
|
http://www.who.int/gho/database/en/
|
open
|
not open
|
>1000
|
JSON, CSV, XML, XLS
|
5
|
Big cities health coalition
|
A forum for the leaders of America's largest metropolitan health departments
|
https://bchi.bigcitieshealth.org/
|
open
|
not explicit
|
53/53
|
CSV
|
6
|
DATA GO JP
|
Cabinet secretariat of Japan
|
http://www.data.go.jp
|
open
|
open
|
112/624
|
CLS, HTML, PDF
|
7
|
Dryad
|
Non-profit organisation
|
http://datadryad.org
|
open
|
open
|
974/974
|
XLS, MATLAB, RMD, SOLR
|
8
|
UKDA
|
Academic group
|
http://www.data-archive.ac.uk
|
open
|
open
|
>1502
|
PDF, XLSX, CSV
|
9
|
Physionet
|
NIH granted project
|
http://www.physionet.org
|
open but partial registration required
|
open
|
>111
|
R, MATLAB, Database, API
|
10
|
Open Health Data dataverse
|
Harvard university
|
https://dataverse.harvard.edu/dataverse/openhealthdata
|
open
|
not open
|
About 62900
|
Text, CSV, XSLC, R, SPSS
|
11
|
Figshare
|
Academic group
|
https://figshare.com/
|
open
|
open
|
(more than 40)
|
CSV, XSLX
|
12
|
SND
|
Swedish national data service
|
https://snd.gu.se/sv
|
permission required
|
not explicit
|
0
|
-
|
13
|
eResearch South Australia
|
Joint venture of universities
|
https://data.sa.edu.au/
|
open
|
open
|
1
|
Text
|
3 Results
[Table 1] shows an overview of each of the example healthcare data repositories and how they
align with the openness principles defined by the OKF which we summarise here:
Organisation
The maintenance of the majority of the data repositories identified was performed
by local or national governments, or both working together. Repositories #3, #8, #9,
#10, #11, and #13 were maintained by academic groups. The remaining repositories #1,
#2, and #6 are “mash-up” sites that collect relevant data repositories of their nations
to provide each dataset.
Repository #11 is a data repository site for researchers that enables them to freely
share data within a size limit in a private repository and provides unlimited storage
with open conditions. If users need more storage over the limit, a premium service
is also available.
Data Categorisation
Every data repository reviewed has various categories of data, such as public health,
ep-idemiological data, health services, disease distribution (by age groups, areas,
or other risk factors), profiles of healthcare providers, and geolocation of hospitals.
Data Accessibility
All data sets were accessible via the Internet without any charge. Most of the data
sets could be opened and freely downloaded, but the repositories #3, #9, and #12 requested
registration or permission in order to download data.
Repositories #1, #2, #9, #10, and #11 also provide data via web application programming
interface (API) for developers to query data programmatically. One example of open
data API use is the rHealthDataGov project on GitHub which enables API access to HealthData.gov for users of the R statistical analysis package[31].
License
Repositories #1, #2, #6, #8, #9, #10, #11, and #13 have open licenses that satisfy
the open data definition, such as Creative Commons CC-BY[32], Open Data Commons Open Database Licence[33], or other open licences. Other sites did not show explicit information on the licences
used for data distribution or modification.
Even though most of the datasets contained in the repositories described were released
under open licences on “open data” repositories, some of the datasets were not published
under open licences, because of legal restrictions related to the governance of health
data.
Machine Readability, Available Formats
In most of the sites, common data formats such as CSV (Comma Separated Values), or
exchange data formats such as JSON (JavaScript Object Notation), and XML (extensible
Markup Language) are available. The number of formats were categorized in #1, #2,
and #6, but not shown in other sites. However, some of the datasets provide only PDF
(Portable Document Format). For example, repository #6 provides 624 datasets, but
512 of them are in PDF format. Therefore, only 112 of 624 datasets were available
as machine-readable formats. Even though PDF is an ISO 32000-2 standardized format,
it is not a preferred option for processing data for analysis by most scientists.
There are also many datasets published in spreadsheets using the XLS format that is
widely used in Microsoft Excel 2003 or earlier. Since the specification of XLS format
is openly published by Microsoft Corporation[34], it might be categorized as an open format. The Office Open XML format, that was
adopted in Microsoft Excel 2007 or later, is one of the ISO standards[35] but it was less frequently adopted than XLS in published datasets.
Data Protection
There are more than 80,000 open datasets in all the repositories we identified but
personally identifiable information was not found in the example database we selected
although more detailed analysis is needed to confirm this. Repositories #1, #2, #4,
#6, #7, #10, #11, and #12 addressed their privacy policy to their datasets, and we
could not identify any fault against their policies in their datasets.
4 Case Studies
The Working Group discussion and individual working group member desk reviews identified
a number of Case Studies that serve to highlight common themes related to the adoption
of open data in healthcare, and ethical issues about how medical data sets are shared,
used, and reused by various agencies.
4.1 Case 1 - Care Providers and Global Businesses
Accenture PLC[36] is one of the largest consulting companies in the world. When analysing the medical
domain, they found various forms of disruption entering the industry over the next
decade[37]. For example, Accenture found examples of device companies (such as hearing-aid
suppliers) and pharmaceutical companies turning towards service support. They also
found that new partnerships developed between digital companies, healthcare providers,
and service support industries. There seems to be a particular interest for advanced
cognitive systems such as IBM Watson[38], in areas such as genomics, oncology, care management, and drug discovery. Google's
DeepMind Health systems are being employed to serve patients, nurses, and doctors[39]. The DeepMind “Streams” app, for example, allows clinicians to be informed when
patient vital signs deteriorate using data from patient-monitoring technology that
deliver in real-time significant patient life-sign indicators to the clinician's mobile
device. The potential for such support, particularly in developing countries, has
been highlighted as a major medical benefit. Many of these big data and cognitive
solutions are proprietary technologies which are engaged under contract. Ethical questions
regarding policy, data ownership, and data usage by national governments and private
solution providers arise. Deployment of free and open source eHealth solutions may
take a greater national importance when considered as an option to national implementations
in this environment.
4.2 Case 2 - Medical Data Aggregated with Data from Other Sources
In 2003, NHS England launched care.data[40] in order to combine all healthcare records stored by general practitioners with
all information stored by social services and hospitals, the data being loaded into
national Health and Social Care Information Centre (HSCIC) databases. Another aspect
of HSCIC databases is the Hospital Episode Statistics dataset which collects and collates
data from 125 million individual inpatients, outpatients, and Accident and Emergency
records yearly in England. Questions naturally arise regarding how this data will
be combined, how it will be used, under what circumstances it needs to be anonymised,
and whether it can be anonymised. National implementers face numerous eHealth medical,
ethical, and governance challenges[41]
[42], and a range of tools for framing questions is emerging[43]
[44]. These difficult issues aside, national healthcare providers who have made significant
movements in healthcare include Estonia, Sweden, Norway, Denmark, Japan, Canada, and
England. More research is needed to compare and contrast the merits and demerits of
various approaches taken by national governments.
4.3 Case 3 - Selling Medical Data to Commercial Organisations
Care.data has been a valuable research resource for such tasks as resource allocation
and monitoring of treatment effectiveness. Controversially, the combined data was
made available to pharmaceutical companies, insurance companies, health charities,
hospital trusts, think tanks, and other private companies. In 2014, as part of an
audit of sales, it was disclosed that anonymous, pseudonymous, and identifiable data
was sold to 160 organisations[40]. In response to a Freedom of Information request[45], HSCIC stated: “We recognise that there will however remain a latent risk that when
combined with other sources of data, the identity of the individual may be ascertained”.
Questions arise regarding the importance of anonymisation, and the suitability of
pseudonymisation in cases where pseudonymisation does not protect the identity of
patients from commercial organisations.
Regarding the responsibility of organisations such as HSCIC in their provision of
anonymous and pseudonymous data, the Article 29 working party[46] has specified that “to identify if a person is identifiable, account should be taken
of all the means likely reasonably to be used either by the controller or by any other
person to identify the said person.” Questions regarding commitment to patient data
security arise.
4.4 Case 4 - Patient, Treatment, Research, and Citizen's Rights
In a case from 2008, a patient who had suffered a life-threatening cancer as a youth[24] later became a clinician. Details of the patient's condition were passed on to researchers,
and researchers subjected the adult clinician to intrusive and upsetting telephone
calls in the name of research. The clinician complained to the hospital where the
researchers were based, but no action was taken to remove the patient from their register.
Eventually, through recourse to the law, the situation was resolved. The hospital
apologised for its treatment of the clinician and removed the clinician's name from
the register. The hospital, in its apology, noted that the concept of medical confidentiality
is paramount. Questions arise regarding the steps patients need to take to ensure
their wishes and data security are respected so that their previous illnesses don't
cause them further distress after treatment has ended.
4.5 Case 5 - Responding to Concerns, Learning Lessons
Because of numerous concerns, opt-outs (over 1 million people, including 40% of GPs)
and governmental criticisms, and following a government review ofcare.data[26], care.data was closed down in July 2016. However, bringing together social records,
hospital records, and general practitioner records is ongoing. One of the outcomes
from the DeepMind work, the Streams app, which allows clinicians to remain informed
of a patient's life signs, has been so successful that Taunton and Somerset NHS Foundation
Trust and DeepMind Healthcare signed a 5-year contract in June 2017 to develop and
evaluate a system able to detect early signs of kidney-failure[47]. Taunton and Somerset have been pioneers in the use of open source software, with
their contract with IMS Maxim. That said, there are many ethical questions still outstanding.
Over 1.6 million live NHS data records were given to Google, via DeepMind, inappropriately,
and it is neither clear what other such mistakes will be made, nor how the business
relationship will develop over the years between DeepMind, Google, and NHS England.
In this regard, it is encouraging to note the emergence of an ethics unit within DeepMind
Health Systems[47]. Important questions arise over how to learn lessons from past failures as technical
rollouts proceed and whether sophisticated business information system solutions,
as they become the norm, will be at odds with ethical considerations and, in particular,
the Hippocratic oath.
These questions are pertinent to a new age of digitised medical health provision.
Where service providers need to find a balance between data security, data openness,
and citizen data rights, such cases can help us to focus attention on important ethical
questions. Strong opinions need to be traded one against the other in a dialectical
process in which all data stakeholders are suficiently represented.
5 Discussion
We outlined some of the key “open data” repositories in the healthcare domain and
reviewed recent cases that highlight the privacy and security issues associated with
open data. The main concerns around “open data” are how to manage the benefits and
transparency effects that we discuss below.
The open data movement has been productive for a positive cycle of creation for both
datasets and tools[7]. Even though the spectrum of datasets for open data is different from person identifiable
data sets ([Figure 1]), there is frequent scepticism about open health data initiatives. Questions that
arise here are not only those about mixing open data with private medical information,
there are also questions about providing commercial organisations with live patient
data that could be used in a business space, for business advantage.
Fig. 1 Data spectum of data sets. Reproduced from the Open Data Institute website
To protect individual privacy, open data repositories set their own privacy rules
and usually oblige data issuers to perform de-identification, pseudonymisation, and
anonymisation. From the open government memorandum of Barack Obama, legislation was
advanced for US, EU, and other countries as safeguards[48]. For example, opt-out rules from private data collections were obligated in many
countries. However, there are differences between the privacy regulation of the EU[25] and the US[49]
[50]. Since data sources are distributed worldwide, privacy legislation needs to be harmonized
over countries or regions.
Even though health data sets are freely available, five out of 13 of the repositories
did not provide datasets under open licences. Some of them addressed publication guides
for research but there were no explicit licences for secondary use of their datasets.
While these data repositories are primarily designed to use data for research which
is published in the academic arena, open data have also been used to develop software
to visualize the datasets with or without analysis. In 2011, open data were used to
facilitate emergency responses to disasters. After the east Japan earthquake, OpenStreetMap[12] was used to share the disaster information for logistics with victims[51]. In the same year, Hurricane Irene approached New York city and nursing home capability
data were used to build a plan to optimize the evacuates[52]
Because health data contains sensitive private information, the level of openness
of health datasets needs to be restricted. Most of the data for clinical research
cannot be opened because the data needs to be person identifiable to be clinically
useful. Another issue that has been problematic recently is data fraud in clinical
research[53]. The International Committee of Medical Journal Editors (ICMJE) proposed a data
sharing framework for clinical trials to improve the transparency for research[54]. An open data approach might help to build up such frameworks for transparency of
clinical trials.
One of the most widely discussed controversies is about who owns the data in healthcare
data repositories. Consensus is building that personal data should belong to each
person, but population data could be considered as a resource of communities or societies.
Since the open data movement is such a social movement, sharing de-identified or anonymised
health data as open data could be an approach to advocate eHealth for new generations.
6 Conclusion
Many aspects of the open data movement originated from the open source software movement
with the aims of improving transparency and fostering innovation. The benefits of
open data have been established in many cases and, as health data are increasingly
included in open data repositories, issues of anonymisation and de-identification
need to be addressed and appropriately managed. The open data and open source movements
share many common philosophies and approaches including using developing international
collaborations across multiple organisations and domains of expertise. Both movements
aim to reduce the costs of advancing scientific research and improving healthcare
provision for people around the world by developing intellectual property licence
agreements and codes of practice that are increasingly being adopted in the software
development and scientific communities.