Keywords
Catalogs as topic - data collection - metadata
1 Introduction
Existing individual-level human data cover large groups of individuals, populations
and environments through high-value, often longitudinal, data capturing multiple dimensions,
such as lifestyle, demography, laboratory measures, omics, clinical parameters, as
well as data accounting for the use of healthcare. Many data have been collected primarily
for the purpose of research, such as population-based and clinical trial cohorts.
In addition, even greater volumes of data have been collected for surveillance purposes,
for example in congenital anomaly, pathology, birth or infectious disease registries.
Another example is data collected primarily for healthcare delivery and/or
administration that can also be used for secondary purposes such as research.
This broad range of pre-existing medical/administrative data that can be reused for
research is often referred to as ‘secondary use’ or real-world (RW) data [[1]].
However, certain challenges must be overcome to fully benefit from the content, temporal
and geographic diversity of this information [[2], [3], [4]] and to support co-analysis of data
across studies, reaching the statistical power needed to elucidate the complex relationships
between genetic traits, environment and disease, and enhancing capacity to undertake
comparison, cross-validation or replication analysis. In each situation and for each
type of data collection, effective and efficient use and reuse of data critically depend
on the availability of detailed and specific metadata, as again recently acknowledged
in the context of the COVID-19 pandemic [[5]].
Recent years have therefore seen large investments in data catalogues within the health
research domain. This development is due in part to the focus on data reuse
and the FAIR principles movement, i.e., that data should be Findable, Accessible,
Interoperable and Reusable [[6]], and in part to the fact that much health-related research
needs larger datasets than any single data source can provide.
In this context, an increasing number of data catalogues are being developed to support:
- discovering data, to facilitate access to information and optimise usage of available data;
- understanding data, to assess their fitness-for-purpose to address specific research questions;
- leveraging integration of data, to address a broad range of research questions, increase statistical power, and enable comparison, cross-validation or replication analysis.
While these catalogues are of great value, their heterogeneity makes it challenging
to exploit their richness optimally. This opinion review focuses on catalogues that describe
data sources for human subject research such as cohorts, biobanks, registries, administrative
and healthcare records. We propose a pathway towards a unified framework to enable
interoperability and reuse of existing catalogues.
The review is organised as follows:
- In Section 2, we review examples of catalogues to understand their concepts and use cases;
- In Section 3, we analyse the conceptual overlap between existing catalogues as the basis for a unified conceptual framework for metadata catalogues for health research;
- In Section 4, we analyse requirements and pre-conditions to promote catalogue collaborations and reuse of catalogue contents;
- In Section 5, we make recommendations towards interoperable representation of data collections to enable sharing and cross-querying.
2 Existing Catalogues for Human Data
The subset of harmonisation catalogues is of particular interest in this review. Data
collection may be prospectively harmonised, i.e., made compatible with certified standards
so that only straightforward transformation to the standard is required, although these
standards might only apply to a given area, province, or country. However, a prospective
approach is not always possible or suitable for research data, for example due to
novel research questions or technical limitations [[7]], especially where data are generated for purposes other than research (e.g., as
part of routine healthcare). Retrospective harmonisation (i.e., harmonisation after
data collection) is thus often the only option to permit data integration [[8]]. Below, we describe examples of catalogues based on the experiences of the co-authors,
first reviewing cohort data catalogues and then RW data catalogues, which may also
include cohorts.
2.1 Cohort Data Catalogues
One of the largest catalogues of human studies is the Directory of Biobanks of BBMRI-ERIC
[[9]], with 1,829 collections from 619 organisations (https://directory.bbmri-eric.eu/#/), which is based on the MIABIS (Minimum Information About BIobank data Sharing)
standard [[10], [11]]. This catalogue does not provide information on the individual variables collected,
nor does it emphasise data harmonisation. Another example is the International HundredK+
Cohorts Consortium catalogue (https://atlas.ihccglobal.org/), which likewise does not provide information on variables. Examples of
domain-specific catalogues are Birthcohorts (http://birthcohorts.net), describing cohorts specifically from the perspective of data collected before or during pregnancy
or, at the latest, at birth, including data on mother-child pairs, and ImmPort (https://www.immport.org/home) for open-access immunological assay data for translational and clinical research
[[12]].
An example of a catalogue that provides sufficient information to inform multi-center
data analysis is the catalogue of the Maelstrom initiative. Maelstrom is an international
collaboration of epidemiologists, statisticians, and computer scientists with a focus
on enabling multi-cohort data analysis using data harmonisation for which they provide
rigorous guidelines on what metadata to collect [[7]]. Their catalogue, https://www.maelstrom-research.org/, documents cohorts related to a particular research domain, e.g. paediatrics/parent-child,
and networks of cohorts and researchers, such as international research consortia/projects,
that mostly retrospectively harmonise data following the Maelstrom data harmonisation
principles [[13]]. This catalogue provides detailed listings of Networks (23), Individual
studies (177 with variables) and Harmonisation projects (7; numbers checked January 2022). The Networks’
list contains detailed information on collaborations between multiple cohort studies.
The Individual Studies’ list contains detailed descriptions of cohort studies, including
the definition of their data collection timeline (e.g., baseline, follow-up), (sub)populations
sampled (e.g., children from northern England), and most notably, in many cases, detailed
listing of the variables collected (i.e., variable name, type, description, code list)
(see [Figure 1]). Finally, the Harmonisation projects’ list documents objectives of the harmonisation
project, selection criteria for cohort studies and participants to be included, and
detailed descriptions to map collected variables from cohort studies onto harmonised
variables using Maelstrom harmonisation guidelines [[7]]. This includes a detailed listing of the harmonised variables defined and then,
for each cohort study, a mapping detailing whether variables from this cohort study
could be harmonised, and if so, what algorithm was used. Of particular use is the
‘Areas of information’ classification that enables users to quickly find cohorts,
networks, or variables on a particular topic (topics available in 18 domains and 135
sub-domains), for example: ‘tobacco use’ or ‘cognitive functioning’. The Maelstrom
catalogue software is available as open source via the Mica software in OBiBa (http://www.obiba.org) [[14]].
Fig. 1 Concept of collection events, populations and variables. Adapted from [[13]].
The large boxes define areas of metadata for multi-center cohort study catalogues:
the population(s) being sampled, e.g., ‘Mothers in Northern Netherlands’; rounds/events
of data collection, e.g., ‘baseline’ and ‘follow-up’; and data dictionaries of variables
collected, e.g., on ‘cigarettes smoked per week’. In addition, these catalogues describe
what harmonised research variables were defined for pooled analysis across multiple
cohorts, e.g., ‘pack years smoked’; and, for each cohort, if and how these harmonised
variables can be constructed from the collected variables, e.g., a match including the algorithm,
or ‘no match’. The small boxes are examples.
In Europe, several catalogues have emerged following the convention of Maelstrom,
typically as consortia funded by the European Commission, starting with BioSHaRE [[15], [16]]. In the terminology of Maelstrom, each such consortium is a Network, and its catalogue
a Network catalogue. A recent example is the LifeCycle catalogue (https://catalogue.lifecycle-project.eu/) [[17]]. This catalogue provides two entry points. By default, users are led to browse
all harmonised variables, based on a keyword classification similar to (but not the
same as) Maelstrom’s. For each variable, users can see if and how variables from partner
cohorts in LifeCycle have been mapped. Secondly, users can browse the cohorts to receive
general descriptions of the cohort. Currently, the LifeCycle catalogue is being expanded
to include metadata from other harmonisation consortia, notably LongITools [[18]] and ATHLETE [[19]], and will be further expanded to other consortia in the European Human Exposome Network (https://www.humanexposome.eu/). These catalogues are available as open source via the MOLGENIS software suite [[20]]. Another recent example is the RECAP Preterm catalogue (https://platform.recap-preterm.eu), containing information about pregnancy cohorts, which also makes use of OBiBa’s
Mica software to extensively catalogue the studies and data available on the project’s
platform. A variable classification system (in the vein of Maelstrom’s ‘Areas of information’)
was developed [[21]] and the project’s variables were mapped to it in order to improve variable searching
and harmonisation on the platform.
2.2 Real World Data Catalogues
Consortia using RW data have also been establishing catalogues to document ‘secondary
use’ data sources. In 2021, the European Medicines Agency (EMA) funded the MINERVA
project to define a set of metadata to describe RW data sources. As an introductory
step, the MINERVA project executed a search for relevant catalogues describing RW
data sources, and we report here a selection of that list [[22], [23]].
The catalogue of the IMI-EHDEN project includes a dashboard displaying characteristics
of the data sources participating in the project, all converted to the same common data
model, together with graphical and aggregated information about their content. It also allows
comparison of the harmonisation potential between different data sources [[24]]. The EMIF catalogue is an example of a catalogue representing both RW and cohort-derived
data sources [[25]]. It is organised as a set of data research communities, each covering distinct
data types, diseases, and users. It does not require pre-specified metadata to be
catalogued: each community defines its own data model, to describe the catalogue entries.
Despite this customisation facility, it lacks a common schema to integrate easily
with external catalogue systems. The ENCePP Resource Database is an initiative of
the EMA-sponsored European Network of Centres for Pharmacoepidemiology and Pharmacovigilance,
which is freely accessible (https://www.encepp.eu/encepp/resourcesDatabase.jsp).
However, it does not provide a description of the data model of the data sources,
and EMA funded the above-mentioned MINERVA project as part of the effort to improve
the ENCePP catalogue.
With this mission, the MINERVA consortium has developed common data elements for RW
data catalogues [[26]]. This catalogue was based on a conceptual framework from the IMI-ConcePTION Project
[[27]], which is also embedded in the IMI-ConcePTION catalogue [[28]], and that we briefly summarise here: a data source is a collection of data banks, e.g. the National Health Registries of Denmark is a data source including the Danish
National Patient Registry, the Danish Medical Birth Registry, and others. Each data
bank is a collection of data that is mandated and sustained by an organisation (in
the example, both data banks are mandated by the Danish Health Data Authority) and
where data generation is prompted by a class of events called prompts: in the example, data in the Danish National Patient Registry are prompted by discharges
from Danish hospitals, and data in the Danish Medical Birth Registry are prompted
by births happening in Denmark. Each data bank has a specified underlying population, i.e. the population whose events prompt records in the data bank. In a data source,
all data banks must have the same underlying population (in the example, the inhabitants
of Denmark), or populations that partially overlap, and must have the potential to
be linked to one another at an individual level.
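To make this conceptual model concrete, the following sketch expresses it in Python; the class and field names are our own illustration, not the ConcePTION schema, and the example values are the Danish case described above.

```python
from dataclasses import dataclass, field

@dataclass
class DataBank:
    """A collection of data mandated and sustained by an organisation."""
    name: str                   # e.g., 'Danish National Patient Registry'
    organisation: str           # the mandating organisation
    prompt: str                 # the class of events generating records
    underlying_population: str  # the population whose events prompt records

@dataclass
class DataSource:
    """A set of data banks with (partially) overlapping underlying
    populations, linkable to one another at an individual level."""
    name: str
    data_banks: list[DataBank] = field(default_factory=list)

# The worked example from the text: the National Health Registries of Denmark.
denmark = DataSource(
    name="National Health Registries of Denmark",
    data_banks=[
        DataBank("Danish National Patient Registry", "Danish Health Data Authority",
                 prompt="discharge from a Danish hospital",
                 underlying_population="inhabitants of Denmark"),
        DataBank("Danish Medical Birth Registry", "Danish Health Data Authority",
                 prompt="birth happening in Denmark",
                 underlying_population="inhabitants of Denmark"),
    ],
)
```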
3 Comparison of Metadata Collected and How They Map onto Each Other, as the Basis
for a Unified Conceptual Framework
In this section, we analyse the potential to harmonise and/or standardise catalogue
contents between catalogues. We therefore analysed existing catalogues from Maelstrom,
MINERVA, IMI-ConcePTION, LifeCycle, BBMRI/MIABIS, EMIF, ENCePP and IMI-EHDEN (described
above). Subsequently, we analysed how similarities could be harnessed towards a harmonised
framework for both research cohort and RW data catalogues.
3.1 Concepts Commonly Collected
We analysed the user interfaces of existing catalogues and, where available, we also
read their publications and associated literature, and downloaded their data models,
i.e. definitions of tables/columns/properties. We created a large spreadsheet in which
metadata items collected in the various catalogues were compared. For this review,
we did not aim for a complete mapping of all metadata items collected, but instead
aimed to assess commonalities and differences in terms of main topics, entities and
features collected. This spreadsheet is included as [Supplementary Table 1] (available at: https://docs.google.com/spreadsheets/d/1PvLpZgJ0jTwc9d8jxAt2Ex3nepUH3-Z4ImgH4-UUwnI/edit#gid=0). [Table 1] below summarises the results.
Table 1 Overview of concepts collected. Detailed mapping on which catalogue is associated
to each term can be found in [Supplementary Table 1].
3.2 Analysis Towards Common Conceptual Framework
In this section, we discuss three areas where we find correspondences between cataloguing
cohort and RW data sources: the data collection itself ([Table 2]); the steps to address a research question in one single database ([Table 3]); the steps to enact interoperability across data sources to address a research
question ([Table 4]).
Table 2 Representation of data collections.
Table 3 Concepts involved in the process of addressing a research question in one single
data collection.
Table 4 Concepts involved to enact interoperability across multiple data collections to address
a research question.
From the point of view of a catalogue, the most important difference between research
cohorts and RW data is the unit of the data collection: in the cohort domain, the unit is the cohort itself, which is also the data collection
needed to conduct a study; in the RW domain, the unit is the data bank (e.g., registry,
primary care records, pharmaceutical database), but a second-level unit, the data
source, possibly including multiple data banks, is what is used to conduct a study
[[27]]. For this reason, both the MINERVA and the ConcePTION catalogues describe both levels
of data collection separately. Another important difference is the population. In the case of a cohort, the population is the set of persons whose data are collected,
while the population of a data bank may not be included in the data bank itself: for
example, in the case of a death registry, the population is not the deceased, but
the persons whose death, were it to happen, would be recorded in the data bank. In
general, the population is not the set of persons who have experienced the prompt,
but the set of persons whose prompts, were they to occur, would generate a record in
the data bank.
In [Table 3], five dimensions are listed. In a catalogue of cohorts, most of the information
relative to a study can be described independently of the research question: with
the exception of large general cohorts, the institution using the data is very close
to the process of data collection, the research question belongs to a set of pre-specified
questions incorporated in the design of the cohort, and both selection of the study
population and study variables are embedded in the cohort data model. On the other
hand, in the RW domain, the selection criteria and study variables applied to the data source
population to conduct a study are not observed directly, but rather are derived from the collected data, a process also referred to in the literature as measuring [[29]] or phenotyping [[30]]: if a study must select persons with diabetes, an algorithm to identify diabetes
from the data banks of each data source needs to be specified [[31]]. Since every study has different selection criteria and study variables (associated
with the specific research question), to enumerate such information in a catalogue
it is convenient to include study-specific sections. Indeed, the choice of phenotype
may depend on the research question: for a surveillance study, a more sensitive phenotype
is preferable; to select a population with a disease, a more specific phenotype may
be advisable. The MINERVA catalogue includes a section describing studies, also listing
the study variables derived in each participating data source [[26]].
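As an illustration of such study-specific phenotyping, the following sketch derives a diabetes study variable under both a sensitive and a specific definition. The ICD-10 ‘E10-E14’ and ATC ‘A10’ code families are real, but the algorithm itself is a toy example of ours, not the MINERVA or ConcePTION method.

```python
def identify_diabetes(records, phenotype="specific"):
    """Derive a diabetes study variable from collected records.

    Each record is assumed to be a dict with 'person_id', 'icd10' and
    'atc' keys (a simplification: signals are evaluated per record,
    not combined across a person's full history).
    """
    person_ids = set()
    for r in records:
        has_diagnosis = (r.get("icd10") or "").startswith(
            ("E10", "E11", "E12", "E13", "E14"))
        has_drug = (r.get("atc") or "").startswith("A10")  # antidiabetic drugs
        if phenotype == "sensitive":
            # surveillance: flag on any signal (diagnosis OR medication)
            if has_diagnosis or has_drug:
                person_ids.add(r["person_id"])
        else:
            # disease-population selection: require both signals
            if has_diagnosis and has_drug:
                person_ids.add(r["person_id"])
    return person_ids
```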
[Table 4] describes the steps that need to be taken by a network of institutions accessing
data collections to jointly address the same research question.
In the cohort domain, this typically involves a process where first a research consortium
is formed, then variables for the research are defined, and subsequently all participating
cohorts are asked to retrospectively harmonise their data onto these research variables.
In the RW domain, the first step towards interoperability is the conversion to a common
data model (CDM), such as the Sentinel CDM (https://www.sentinelinitiative.org/methods-data-tools/sentinel-common-data-model), the OMOP CDM [[32]], or the ConcePTION CDM [[27]]. On top of this, the derivation of study variables must be enacted based on the
CDM [[33], [34]], possibly using prompt-tailored algorithms [[35], [27]].
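To illustrate what retrospective harmonisation onto a research variable looks like, the sketch below maps cohort-specific smoking variables onto the harmonised ‘pack years’ variable of Figure 1; the cohort names, collected variables and algorithms are hypothetical.

```python
# Per-cohort algorithms mapping collected variables onto the harmonised
# variable 'pack_years' (cf. Figure 1); 20 cigarettes = 1 pack.
HARMONISATION_RULES = {
    # cohort A collected cigarettes smoked per week
    "cohort_a": lambda row: (row["cig_per_week"] / 7 / 20) * row["years_smoked"],
    # cohort B collected packs smoked per day directly
    "cohort_b": lambda row: row["packs_per_day"] * row["years_smoked"],
    # cohort C collected no smoking quantity: 'no match'
    "cohort_c": None,
}

def harmonise(cohort, row):
    """Apply the cohort-specific algorithm, or report 'no match'."""
    rule = HARMONISATION_RULES.get(cohort)
    if rule is None:
        return None  # variable cannot be constructed for this cohort
    return rule(row)

print(harmonise("cohort_a", {"cig_per_week": 70, "years_smoked": 10}))  # 5.0
print(harmonise("cohort_c", {}))  # None: 'no match'
```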
In summary, cohort catalogues can be nested in RW catalogues where:
- Cohorts are a special case of a data source with a single data bank, having a set of research questions embedded in the data bank design, and having as population the set of persons whose data are included in the cohort;
- Collection events can be considered a special case of prompts;
- Harmonisation of items is a special case of conversion to a common data model, since in RW additional information about the data sources must be stored, i.e., how to derive study variables to address research questions.
4 Towards a Sustainable Catalogue Ecosystem
Creating high quality catalogues takes considerable time, energy, expertise, and motivation
[[36]]. This section analyses requirements for catalogue collaboration in an open catalogue
ecosystem.
The main actors in this endeavour are:
- the data partners: the individuals who collect and/or provide access to the human/health-related data sources, such as owners/hosts of cohorts, registries and biobanks;
- the data consumers: researchers, data scientists, funders of research, policy makers, etc. who aim to find, access and reuse these data sources;
- the catalogue developers: the individuals responsible for the development of the technical infrastructure of catalogues, specifically its metadata structure (i.e., the metadata elements documented), and for the supply and curation of its contents, such as catalogue data managers/curators, scientific staff involved and software developers/administrators.
Based on the experience of the co-authors, we observe opposing forces favouring both
central and distributed cataloguing efforts. Some would prefer one single
catalogue. The motivation from the data partners' perspective is that they would ideally
catalogue their resources only once, instead of receiving repeated cataloguing
requests and thus having to describe their data sources many times.
The motivation from data consumers is that they would only need to search one catalogue,
instead of finding and searching multiple catalogues. However, there are also many
good reasons to develop multiple catalogues. Depending on the research domain
or the purpose of a catalogue, there can be quite different requirements on what and
how data should be catalogued, i.e., the metadata items which should be described,
and to what level of detail, depend on the use cases of the catalogue and the perspective
of its curators. Moreover, these descriptions can be highly dependent on the context
of use, i.e., data sources are often catalogued to serve particular types of studies
or research questions, e.g., a catalogue that describes cohorts and harmonised variables
for ‘environmental factors influencing child development’.
Also, from a practical perspective, we observe benefits for distributing the cataloguing
efforts. Cataloguing scientific resources currently requires a great effort from the
data partners and typically a data management/curator team that assists data partners
when cataloguing. We observe that such teams are highly dependent on domain knowledge
and regional proximity to enable development of relationships with data partners and
user communities. More importantly, we observe that cataloguing efforts for both data
partners and the user community are more easily focused on a particular research area,
as compared to a general all-purpose cataloguing effort. We speculate that it is much
easier to catalogue information of interest that is required by a specific research
group than for a diverse and more distant user community that is hard to engage. Finally,
we have observed in other scientific databasing efforts that having one central effort
creates potential single points of failure when funders' and institutions' priorities
shift, while networks of such database efforts are easier to sustain [http://www.insdc.org/]. Meanwhile, small catalogues developed for a particular research project are also
hard to sustain over the long term, potentially wasting cataloguing efforts.
Moving forward, we believe we should gratefully embrace existing data curation in
distributed cataloguing efforts. However, we also believe it is timely to promote the
sharing and reuse of catalogue contents: to enable data consumers to navigate catalogue
contents more easily, to enable integrated or cross-catalogue searches, to reduce duplicated
efforts, to improve quality and completeness, and to ease sustainability of catalogue contents
in case the original catalogue hosting is discontinued. We identify the following
requirements for successful sharing and reuse of catalogue contents:
- Ensure autonomy of metadata collection;
- Have clear licensing conditions in place for catalogue contents;
- Enable symmetric data sharing, syndication and cross indexing;
- Add provenance and attribution records (metadata on the metadata).
4.1 Autonomy of Metadata Collectors
In the next section, we will recommend standardisation and harmonisation at many levels.
However, while centralisation and standardisation have many advantages, research is
developing quickly, creating new research applications for existing and newly collected
data. Consequently, the metadata that are desirable are different for different research
use cases. Therefore, we recommend as a principle to maintain the full autonomy of
catalogue developers to expand and refine the catalogue structure (i.e., metadata
elements collected) and contents. Standardisation for interoperability is not a goal
in itself but serves a purpose. We recommend the use of micro-standards, i.e., instead
of forcing catalogue developers to implement 100+ metadata items immediately, allow
for a modular approach such that catalogues keep the autonomy to adhere to relevant
standard elements where useful, without large standardisation requirements creating
a barrier to connecting a catalogue.
4.2 Clear Data Licence
Without a licence that explicitly allows metadata sharing across catalogues, copyright
laws and database laws will prohibit it, and it might even be unclear who would be
the legal entity to approach for permission. General copyright law will reserve all
rights to the creator, which depending on the local legislation, might be the creator
of the catalogue or the original content provider whose data is being catalogued or
a combination thereof. Similarly, database laws will reserve rights to a structured
database to the creator of that database. A licence should clarify how to handle data
sharing and contributions. Aspects such as whether records are allowed to be copied
and modified and to what extent changes should be given back, should be covered, as
well as whether catalogue contents can be included in other catalogues or can be used
in commercial software applications, and, importantly, whether references or citations
must be made to the data partner and/or catalogue provider. While it is possible
to draft your own licence, the OpenAIRE guide “How do I licence my research
data” [https://www.openaire.eu/how-do-i-license-my-research-data] advises using the CC BY 4.0 licence [https://creativecommons.org/licenses/by/4.0/] for works that fall under the definition of a creative work under copyright law
and the CC0 licence [https://creativecommons.org/publicdomain/zero/1.0/] for databases or datasets. Using these licences will ensure compatibility with the
definition of Open Access data.
4.3 Symmetric Data Sharing and/or Aggregation
Catalogues aim to improve findability and reuse, and therefore a key performance indicator
is to attract as many (relevant) users as possible, and ideally to convert these users
into successful research projects. For this, indexing into general search engines
such as Google is the primary aim. Aggregation of a catalogue into more general
catalogues, i.e., holding a partial copy of each catalogue record with a link to the
original catalogue so that it can also be found in other catalogue instances, can also
serve this goal.
However, catalogues also have the existential challenge that in order to keep securing
their funding they must be recognizable/visible. There can also be a concern that
copies of the metadata are added, changed or removed in ways not approved by the original
catalogue developer, raising concerns about quality and/or completeness of the records.
Moreover, there is the concern that a catalogue record is fully copied and/or no link
back to their catalogue is provided, thus taking away users from the original catalogue.
All these trust issues can make catalogues hesitant to allow cross-catalogue indexing.
To overcome these issues, we recommend that the sharing of catalogue contents should
always be reciprocal. Potential ways could be by synchronising contents between two
similar catalogues, or by aggregating metadata into an ‘umbrella’ catalogue (for example,
a catalogue for ‘all child cohorts’). This kind of ‘cross indexing’ would then enable
the source catalogue to show aggregated catalogue contents from other sources and
thus also provide a limited form of the integrated search experience instead of fearing
loss of their users to the aggregated catalogue. Obviously, links back to the aggregator
must then be shown so that the aggregator is also rewarded for its efforts, with the
potential for more users and greater impact. An example of this is being developed
in EUCAN-connect, where cross-search between the Maelstrom, LifeCycle, Birth cohorts
and RECAP Preterm catalogues is being enabled [https://catalogue.eucanconnect.eu/]. Another example of this is being piloted in the European Joint Programme for Rare
Disease [https://github.com/ejp-rd-vp], where a ‘search widget’ has been developed to enable database users to do ‘live’
searches that also include the other databases in their network, via a single user
interface.
4.4 Clear Procedures for Changes and Updates
Finally, when catalogue records are indeed exchanged and there is clarity on the licence,
there should also be clarity on how changes to records should be processed. Ideally,
it should be clear which metadata items are expected to be universally the same and
which items can be different when cataloguing records in multiple catalogues. We would
expect that administrative information, e.g., name, host institute and certain contact
details, would be independent from the purpose of the catalogue. Meanwhile, study-specific
or application domain-specific details, such as inclusion criteria, data collected
and design of the study, could vary. Ideally, these context-specific metadata items
should also be marked as such.
When additional details and improved descriptions have a universal added value, one
would want to merge these improvements with the original record. Therefore, each catalogue
should also provide a clear procedure for submitting and processing requests for updates
of catalogue records, and for downstream users of the catalogue to then retrieve such
updates. This procedure should minimally include technical submission procedures (webform
and/or programmatic) and an indication of the last changed date. Ideally, it should
also include review/acceptance workflow (e.g., involving the original author of the
record, often associated with the data source being catalogued), versioning, tracking
of changes, and clarity of ways to attribute/name the individuals involved in the
cataloguing effort who should be acknowledged.
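As a sketch of such a procedure, the following (entirely hypothetical) record layout combines a version counter, a last-changed date and a change history with a simple review/acceptance step; real catalogues would of course add richer review workflows.

```python
import datetime

def apply_update(record, changes, reviewer_accepts):
    """Apply an update request to a versioned catalogue record.

    Hypothetical layout: 'fields' holds the metadata, while 'version',
    'last_changed' and 'history' support versioning and change tracking.
    """
    if not reviewer_accepts:  # e.g., rejected by the record's original author
        return record
    record["history"].append({"version": record["version"],
                              "fields": dict(record["fields"])})
    record["fields"].update(changes)
    record["version"] += 1
    record["last_changed"] = datetime.date.today().isoformat()
    return record

record = {"version": 1, "last_changed": "2022-01-15", "history": [],
          "fields": {"name": "Example Cohort", "contact": "old@example.org"}}
apply_update(record, {"contact": "new@example.org"}, reviewer_accepts=True)
```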
5 Interoperability Recommendations
To implement interoperability across catalogues, technical standardisation is also
desirable. Practically, while catalogues are usually proposed as a method to implement
FAIR principles for datasets, the contents of the catalogue itself must also adhere
to the FAIR principles [[6]]. In the past decade, converging recommendations have emerged to improve metadata
catalogues [FAIRsFAIR on metadata catalogue interoperability, https://zenodo.org/record/5744913#.YeEB3VjMLzc]. Recent projects in the context of the European Open Science Cloud (EOSC),
the European Health Data Space (EHDS), and the Research Data Alliance (RDA) also produce relevant
recommendations, such as the TEHDAS (Towards European Health Data Space) project
[https://tehdas.eu] and the SHAring Rewards and Credit Interest Group [[36]], which offer practical recommendations, FAIR assessment templates, and lessons learned,
e.g., that FAIR criteria should be implemented gradually and that training, support
and rewards should be considered. Based on these elements and the practical experience
of the co-authors, the following main requirements emerge:
- Persistent identifiers for cohorts, data banks, data sources, institutions, contributors;
- Use of common data elements and ontologies to identify content overlap between catalogues;
- Easy-to-use syntaxes and architecture to pull/push catalogue contents;
- Addition of provenance records to track how catalogue contents have been copied and changed.
5.1 Persistent Identifiers
When data records are copied across catalogues, it is essential that data curators
can check where the original records reside. In addition, the same resource (cohort,
data bank) might be catalogued multiple times with different names/acronyms. Persistent
identifiers such as the DOIs or Handle Identifiers could be used to unambiguously
identify all these different types of records. These systems provide a separation
between the identifier and the location of the (digital) object that they refer to
and have agreements to maintain the service in the long term.
The identifiers should be opaque and unique, i.e., the structure and content of
the identifier should carry no meaning about the object it identifies.
This allows the identifier to stay the same, while the object that it describes can
change. For instance, if we use the name of a data bank as an identifier, this causes
problems when the data bank changes its name and website, or if another organisation
creates a new data bank with the same name. With an opaque identifier, any catalogue
can retrieve the current name and website link and maintain all references
to the data bank. At the same time, a catalogue system can easily tell the two data
banks apart based on their different identifiers.
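The following toy resolver illustrates why opacity matters: the identifier stays fixed while the name and website it resolves to can change. The handle-style prefix and all values are invented for illustration.

```python
# A toy persistent identifier registry: the opaque identifier never
# changes, while the metadata it resolves to (name, website) can.
REGISTRY = {
    "21.12345/abcd-0001": {"name": "National Perinatal Registry",
                           "url": "https://registry.example.org"},
}

def resolve(pid):
    """Return the current name/landing page for a persistent identifier."""
    return REGISTRY[pid]

# When the data bank is renamed, only the registry entry changes;
# every catalogue holding '21.12345/abcd-0001' keeps working.
REGISTRY["21.12345/abcd-0001"]["name"] = "National Birth Registry"
print(resolve("21.12345/abcd-0001")["name"])  # 'National Birth Registry'
```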
There are many persistent identifier services available, starting at a few hundred
dollars a year; the real costs, however, lie in identifier maintenance. We envision that large
organisations, such as the European Medicines Agency via their ENCePP system [https://www.encepp.eu/] and BBMRI-ERIC via their biobank directory [http://directory.bbmri-eric.eu], would be willing to assign and maintain persistent identifiers for a large community
of catalogue maintainers to use.
5.2 Common Data Elements and Ontologies
As described above, we can easily observe overlap between the catalogues. Also, we
believe that a combination of Maelstrom, LifeCycle, BBMRI-ERIC and MINERVA provides
coverage that can serve as a basis to standardise these concepts. However, it would
be very labour-intensive to search, compare and combine the contents from multiple
catalogues manually today. Standardisation in using the same metadata elements and
code systems/ontologies where possible can greatly alleviate this challenge. Ideally,
metadata items should be coded inputs, using values from an ontology, instead of free text
that allows non-standard metadata values (although in our experience, there is added
value for free-text in descriptive notes, and to chart candidate codes before a code
system has been established).
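As a minimal illustration, the sketch below validates a metadata element against a code list instead of accepting free text; the element name and code list are hypothetical, and a real implementation would draw its terms from a published ontology.

```python
# Hypothetical code list for an 'areas of information' metadata element;
# a real catalogue would source these terms from a published ontology.
AREAS_OF_INFORMATION = {"tobacco_use", "cognitive_functioning", "lifestyle"}

def validate_metadata(item):
    """Accept coded values only; free text lives in a separate notes field."""
    errors = []
    for value in item.get("areas_of_information", []):
        if value not in AREAS_OF_INFORMATION:
            errors.append(f"unknown code: {value!r}")
    return errors

print(validate_metadata({"areas_of_information": ["tobacco_use", "smokes a lot"],
                         "notes": "free text remains useful here"}))
# ["unknown code: 'smokes a lot'"]
```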
In recent years, general domain-agnostic catalogue standards have emerged, but they lack
details specific to the domain (in our case, cohorts and RW data). Recommended standards include
DCAT [Data Catalog Vocabulary, https://www.w3.org/TR/vocab-dcat-2/], and a specific implementation thereof called FDP [FAIR Data Point, https://www.fairdatapoint.org/]. DCAT defines: (1) the ‘catalog’, a dataset in which each individual item
is a metadata record describing some resource; (2) the ‘resource’ which represents
a dataset, a data service or any other resource; (3) the ‘dataset’, a collection of
data published or curated by a single agent; (4) the ‘distribution’ and ‘data service’
that provide, e.g., a downloadable file and/or operations via programming interfaces, and
(5) the ‘catalogue record’ primarily concerning the registration information, such
as who added the item and when. For example, Maelstrom could use DCAT to describe
each cohort and network as a ‘resource’ (including details such as the id, title,
landing page, issuer) and use ‘datasets’ to detail datasets within the cohorts (including
details such as the title, keywords, contact point), including links on how to access
additional information such as lists of variables (as ‘distribution’). In addition,
we want to mention BioSchemas [https://bioschemas.org/] [[40]], an extension to schema.org used by Google and other search engines that provides
concepts such as DataCatalog with properties such as description, keywords, provider,
citation, licence, variableMeasured.
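As an illustration of the Maelstrom example above, the following sketch describes a fictitious cohort using DCAT vocabulary in JSON-LD; the DCAT and Dublin Core property names are real terms from those vocabularies, while the cohort, titles and URLs are invented.

```python
import json

# A DCAT-flavoured description of one cohort, expressed as JSON-LD.
cohort_record = {
    "@context": {"dcat": "http://www.w3.org/ns/dcat#",
                 "dct": "http://purl.org/dc/terms/"},
    "@type": "dcat:Dataset",
    "dct:title": "Example Birth Cohort",
    "dct:description": "Mother-child pairs recruited at birth.",
    "dcat:keyword": ["birth cohort", "mother-child"],
    "dcat:landingPage": "https://cohort.example.org",
    "dcat:distribution": {
        # the 'distribution' points to, e.g., the variable list
        "@type": "dcat:Distribution",
        "dct:title": "Variable list",
        "dcat:downloadURL": "https://cohort.example.org/variables.csv",
    },
}
print(json.dumps(cohort_record, indent=2))
```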
However, to be of real value more specific details would be needed as described in
section 3. Therefore, ideally the catalogue community would come together to define
common data elements and code systems for concepts. For example, we know of multiple
consortia working on harmonisation of catalogue metadata (e.g., the EUCAN-connect
project tries to harmonise Maelstrom and LifeCycle catalogues, in collaboration with
the CINECA project). However, these efforts are not linked to existing standards.
Instead of creating these common data elements in isolation, it is recommended to
reuse existing metadata item definitions. A good starting point is the standards registry
FAIRsharing [[37]] and ontology publishing websites like BioPortal [https://bioportal.bioontology.org/] and the Ontology Lookup Service [OLS, http://www.ebi.ac.uk/ols]. Ideally, the catalogue community will go through a rigorous ontology process to
define all metadata types and instances while reusing existing ontologies, for example
following examples from other domains [[38]], towards an ontology with high standards for quality control, for example using
the standards of OBO foundry [[39]].
5.3 Syntaxes and Architecture for Metadata Interoperability
Standardisation of catalogue contents using common data elements is not sufficient
to make data flow. In addition, we recommend standardising file formats to ease the
download/upload of data between catalogues, and application programming
interfaces (APIs) to automate such interactions using scripts and programs.
The gold standard for FAIR data is linked data or ‘semantic web’. This for example
includes RDF [Resource Description Framework, https://www.w3.org/RDF/], a data model and framework, serialisable in XML-like file formats, that
describes all data as ‘triples’ of {subject,
predicate, object} using hyperlinks/URIs to denote cross-references. However, in our
experience, most catalogue developers have very limited information technology (IT)
capacity, and therefore prefer using CSV files, Excel spreadsheets or JSON text files.
Notably, JSON has a linked data extension called JSON-LD, which would still enable
integration with the aforementioned linked data communities where desired.
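The triple model itself requires no special tooling; as a sketch, the snippet below represents catalogue metadata as plain {subject, predicate, object} tuples in Python (the catalogue URIs are hypothetical, the predicate URIs are real Dublin Core/DCAT terms) and queries them.

```python
# Catalogue metadata expressed as {subject, predicate, object} triples,
# here as plain Python tuples with URIs as strings.
triples = [
    ("https://catalogue.example.org/cohort/42",
     "http://purl.org/dc/terms/title",
     "Example Birth Cohort"),
    ("https://catalogue.example.org/cohort/42",
     "http://www.w3.org/ns/dcat#landingPage",
     "https://cohort.example.org"),
]

def objects_of(subject, predicate):
    """Query the toy triple store for the values of one property."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of("https://catalogue.example.org/cohort/42",
                 "http://purl.org/dc/terms/title"))  # ['Example Birth Cohort']
```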
In addition, we must consider the architectures for data exchange. In recent years,
we have implemented both push interfaces, where data submitters would push metadata
into the catalogue, and pull interfaces, where the catalogue developer would retrieve
data from a web location. Push interfaces are typically used for submitting/updating
catalogue records (e.g., from individual cohort into LifeCycle catalogue) while pull
interfaces are more often used for moving data between catalogues (e.g. from the Dutch
national catalogue, https://catalogue.bbmri.nl, to the BBMRI-ERIC EU biobank catalogue, https://directory.bbmri-eric.eu/). In most cases, simple interfaces using the HTTP protocol were preferred, using
operations such as GET, POST, PUT/PATCH, DELETE to retrieve, add, update or remove
contents respectively, ideally enabling ‘batch’ updates, i.e. affecting multiple catalogue
records in one transaction.
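A minimal sketch of such pull and push interactions over HTTP, assuming hypothetical /records and /records/batch endpoints on an imaginary catalogue API, could look as follows.

```python
import requests  # widely used Python HTTP client

BASE = "https://catalogue.example.org/api"  # hypothetical endpoint

def pull_records(since):
    """Pull records changed since a given date from a source catalogue."""
    resp = requests.get(f"{BASE}/records", params={"changed_since": since})
    resp.raise_for_status()
    return resp.json()

def push_records(records):
    """Push a batch of records to a target catalogue in one transaction."""
    resp = requests.post(f"{BASE}/records/batch", json=records)
    resp.raise_for_status()
    return resp.json()

# Example sync: pull everything changed this year, push it downstream.
# push_records(pull_records("2022-01-01"))
```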
With this in mind, we recommend embracing current practices: first, standardise
on very simple CSV/JSON file formats for metadata exchange and a minimal set of ‘batch’
programmatic interfaces to retrieve/submit metadata in these formats. Only when
desired by multiple catalogue developers, advanced query interfaces with proper (semantic)
API documentation could be defined as standard to enable ‘federated queries’ (i.e.,
so that users could search multiple catalogues at once), for which an interesting
emerging standard, ‘Beacon’, is being developed in the Global Alliance for Genomics
and Health [GA4GH, https://beacon-project.io/].
5.4 Add Provenance and Attribution Records
Finally, it becomes essential that catalogues also start to describe and show the
provenance of their records, i.e., where the original record originated, whether intermediate
changes were made to the record, and what the procedure is should the record
need updating. Which information is necessary
has been extensively worked out within the scientific library community, notably in
the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and OpenAIRE
project [http://www.openarchives.org/OAI/openarchivesprotocol.html]. For each copy made, a provenance record should be added that contains for example
[https://openaire-guidelines-for-literature-repository-managers.readthedocs.io/en/v4.0.0/use_of_oai_pmh.html]: baseURL, identifier, datestamp, metadataNamespace, origin, and for each description,
harvest date and whether data was altered. In addition, there should be information
that explains how changes to the record should be submitted. For example, there might
be a link to a website, programmer interface or simply email address to share updates.
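As a sketch, a provenance record following the OAI-PMH fields cited above could be constructed as follows; the repository URL and identifier are invented for illustration.

```python
import datetime

def make_provenance(base_url, identifier, datestamp, namespace, origin, altered):
    """Provenance attached to a copied record, using the OAI-PMH
    provenance fields cited in the text."""
    return {
        "baseURL": base_url,              # repository the copy was harvested from
        "identifier": identifier,         # record identifier in that repository
        "datestamp": datestamp,           # record datestamp at harvest time
        "metadataNamespace": namespace,   # metadata format of the harvested record
        "origin": origin,                 # earlier provenance, if copies are chained
        "harvestDate": datetime.date.today().isoformat(),
        "altered": altered,               # was the copy modified after harvesting?
    }

prov = make_provenance("https://catalogue.example.org/oai",
                       "oai:catalogue.example.org:cohort-42",
                       "2022-01-15",
                       "http://www.openarchives.org/OAI/2.0/oai_dc/",
                       origin=None, altered=False)
```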
6 Conclusion
We have summarised the landscape of data catalogues cataloguing, on the one hand,
cohorts and on the other, RW data sources composed of multiple data banks, such as
surveillance registries or administrative records.
While these catalogues are heterogeneous, of various scope, and use different terminologies,
the underlying concepts seem potentially harmonisable. Our conclusion is that cohort
catalogues can be nested in RW catalogues where: (1) cohorts are a special case of
a data source with a single data bank, having a set of research questions embedded
in the data bank design, and having as population the set of persons whose data are
included in the cohort, (2) collection events for cohorts are a special case of prompt
for RW data, and (3) harmonisation of items is a special case of the conversion to
a common data model, because in RW data sources additional information must be stored,
that is, how to derive study variables to address research questions.
We recommend embracing the autonomy of current catalogue teams and investing in their
collaboration via minimal standardisation efforts such as clear data licensing, minimal
metadata ‘common data elements’, and symmetric architectures for data sharing (push/pull)
with clear provenance tracks to process updates and acknowledge original contributors.
Obviously, the implementation of such an ecosystem is a massive effort. Fortunately,
many FAIRification efforts exist, such as the Research Data Alliance (RDA), ELIXIR
and BBMRI. However, we still observe a gap between these generic efforts and the domain-specific
cataloguing groups, which typically have vast domain knowledge. This gap can be bridged by domain-specific
efforts such as EU EUCAN-connect [http://www.eucanconnect.org], the metadata working group of the European Human Exposome Network [https://ehp.niehs.nih.gov/doi/abs/10.1289/isee.2021.O-SY-127], and recently funded EU projects such as BeYond COVID [BY-COVID, https://by-covid.org/] and the IMI European Partnership for Neurodegenerative Diseases such as Alzheimer’s and
Parkinson’s (EPND) [https://www.imi.europa.eu/projects-results/project-factsheets/epnd].
Even while many collaborations exist between these initiatives, these efforts can
lead to siloed solutions if we do not invest in cross-domain and cross-consortium
interactions. We therefore recommend creating practical training and documentation at
entry and intermediate levels to ensure new catalogue efforts can converge on
existing standards, including: (a) common data elements for catalogue developers to
pick from, (b) exchange formats and APIs that are easiest to implement, and (c) free-to-use
systems for requesting persistent identifiers. Most importantly, environments for
collaboration and resource sharing between catalogue developers should be facilitated.
Key messages:
- Catalogues are heterogeneous, of various scope, and use different terminologies; however, the underlying concepts seem potentially harmonisable;
- The autonomy of current catalogue teams is valuable, and cross-collaboration should be sustained via minimal standardisation efforts such as clear data licensing, minimal metadata ‘common data elements’, and symmetric architectures for data sharing (push/pull) with clear provenance tracks;
- We recommend creating practical training and documentation at entry and intermediate levels to ensure new catalogue efforts can converge on existing standards.