Keywords
Circulating tumor DNA - proteomics - metabolomics - decision support techniques -
algorithms - omics integration
The Continued Promise of Cancer Informatics
The future of cancer informatics is predicated on the continued development of methodologies
that can identify key altered pathways that are susceptible to molecular targeted
or immunologic therapies[1]. The increasing customization of medical treatment to specific patient characteristics
has been possible through continued advances in a) our understanding of the physiologic
mechanisms of disease through the proliferation of omics data (e.g., proteomics, metabolomics),
and b) computing systems (e.g., patient matching algorithms) that facilitate matching
and development of targeted agents[2]. These advancements allow for improved outcomes and for reduced exposure to the
adverse effects of unnecessary treatment. They can help us better decipher the inter-
(between patients) and intra-(different tumors within the same patient) tumor heterogeneity
that is often a hurdle to treatment success and can contribute to both treatment failure
and drug resistance[3]. Importantly, omics-based cancer medicine is here. In 2017, nearly 50% of the early-stage
pipeline assets and 30% of late-stage molecular entities of pharmaceutical companies
involved the use of biomarker tests[4]. Further, over a third of recent drug approvals have had DNA-based biomarkers included
in their original US Food and Drug Administration (FDA) submissions[5]. We are also thinking about cancer informatics differently. Algorithmically, there
has been a shift from gene signatures to nonlinear approaches such as neural networks
and advanced aggregative techniques to model complex relationships among patients[6]. Importantly, these approaches are the root of cohort matching algorithms that aim to find “patients like my patient.” Results of these algorithms are simpler
to understand and have propelled the growth of clinical trials matching algorithms.
National trials such as NCI-MATCH[7], which pairs patients whose tumors harbor specific alterations with targeted medications,
are a simple first step in this paradigm shift. The ability to perform complex
matching, and to define matching rules, has relied on the growth of aggregated patient datasets
and the ability to quickly assess tumor omics data.
This brief review focuses on three cancer omics data growth areas - proteomics, metabolomics,
and circulating tumor and cell-free DNA. These omics approaches all try to enhance
our current complex model of relationships among genes. We will also touch upon the
paradigm shift from singular omics signatures to patient cohort matching - a shift
that may potentially more readily take advantage of the large repositories of omics
data that are being created. [Figure 1] underscores the foci of this review.
Fig. 1 Selected data and algorithm growth areas in cancer medicine.
Within the past several years, tumor omics technologies have been integrated into
clinical practice. Concurrently, we have increased our understanding of the underlying
pathophysiology of not only the tumor, but also the patient/tumor interaction through
this omics data. Acquisition of this omics data, which is a focus of this review,
has required improvements in detection techniques and data analysis. For example,
assaying proteins using immunohistochemistry (IHC), in which single antibodies bind
to individual proteins of interest in cancer tissue, is now being supplanted
by mass spectrometry - which allows massively parallel identification of hundreds
of proteins simultaneously. However, it has taken improved computer performance (and
supercomputer clusters) to accurately identify this large number of proteins in a
reasonable amount of time. This advancing field, proteomics, provides a far more accurate readout as compared to IHC - which is often subjective
and difficult to parallelize.
Similarly, metabolite biomarkers have traditionally been singular molecules detected
by immunoassay in the clinic. The chemotherapeutic drug methotrexate, for example,
has levels that are detected via immunoassay for quantification purposes[8]. However, immunoassays only measure singular known metabolites and it is well known
that combinations of metabolites are more clinically relevant than singular metabolites.
With this in mind, metabolomics has emerged as a new omics field of study that aims to measure abundances of all
small molecules detectable in biospecimens including blood, tissue, urine, and breath,
among others. Typically, mass spectrometry (MS) and nuclear magnetic resonance (NMR)
techniques are applied for measuring hundreds to thousands of metabolites in a given
sample.
Advanced DNA sequencing, which ushered in the genomic revolution, has also improved
greatly. Our ability to perform DNA sequencing with trace amounts of starting material
(low-passage reads) with improved fidelity and detection is allowing us to detect circulating
tumor DNA from the blood. Circulating tumor DNA (ctDNA) consists of tumor-derived DNA fragments, typically around 150 bp in length, that circulate in the blood alongside cell-free DNA (cfDNA) from other sources, including normal cells. ctDNA
provides an overview of the genomic reservoir of different tumor clones and genomic
diversity. ctDNA may finally provide a means of assaying intra-patient tumor heterogeneity
- allowing us to get a sense of the relative abundance of genomic alterations across
metastatic deposits within a patient. In the following sections, we will delve into
each of the areas introduced above.
Proteomics
Description of Technology: Molecular therapeutics is a relatively novel approach that targets abnormalities
in signaling pathways that play critical roles in tumor development and progression.
While the genetic abnormalities of many conditions have been studied intensively,
they do not always correlate with the phenotype of the disease. One possible explanation
of this phenomenon is the lack of predictable changes in protein expression and function
based solely on genetic information. One gene can encode multiple proteins; protein
concentration is temporally dynamic and protein compartmentalization is paramount
to function; proteins are post-translationally modified. All of this complexity leads
to the importance of studying the proteome. Proteomics is fundamentally the study
of proteins and their structure, functional status, localization, and interactions. This has only
become possible as our understanding of proteins and their post-translational landscape
has deepened. Kinases and phosphatases, which control the reversible process
of phosphorylation and are dysregulated in many diseases including cancer, have been
studied individually for many years. However, only with the application of larger
scale technologies can we begin to understand the networks that control cellular phenotypes.
Protein and lipid phosphorylation regulates cell survival signaling. Targeting kinases
and phosphatases has proven to be paramount for improving therapeutic intervention
in some diseases. In this regard, it is critical to define qualified cellular targets
for cancer diagnosis and prognosis, as well as accurately predict and monitor responsiveness
to therapies. Mutation profiling of selected genes or the whole exome can provide insights
into possible activated pathways; however, to look at specific end points that can
be targetable, one must examine the functional units of these mutational events, i.e.,
the protein.
Recent Advancements: There are multiple ways to examine events at the level of the protein. These range
from Western blot level technologies which can examine a few proteins at a time, to
mass spectrometry-based (MS-based) shotgun proteomics which can theoretically measure
a very large subset, if not the entire, proteome. Broadly, most proteomic studies
can be broken down into two categories: array-based and direct measurement. Array-based
proteomic measurements are typically dependent on an antibody or substrate for a specific
protein. Antibody-based proteomics platforms have been examined for the last 40 years
and are still yielding exciting results. The most commonly used techniques for multiplexed
assays are reverse phase protein arrays (RPPA), multiplexed immunofluorescence, and
antibody-based chips/beads. These techniques provide a quantitative assay that analyzes
nanoliter amounts of sample for potentially hundreds of proteins simultaneously. These
antibody-based assays determine quantitative levels of protein expression, as well
as protein modifications such as phosphorylation, cleavage, and fatty acid modification[9]
[10]
[11]. Most techniques either array complex protein samples and then probe with specific
antibodies (e.g., RPPA), or array antibodies or specific ligands and then probe with
a protein mixture. In essence, these assays have major strengths in identifying and
validating cellular targets, characterizing signaling pathways and networks, as well
as determining on- and off-target activity of novel drugs. One downside to array-based
systems is the inherent reliance on quality antibodies or known substrates, which
may or may not exist for all proteins of interest in a particular study. However,
recent work has demonstrated a tissue-based map of the human proteome utilizing transcriptomic
and multiplexed IHC-based techniques[12]. Similarly, The Cancer Proteome Atlas (TCPA, http://tcpaportal.org/tcpa/) has examined samples collected during The Cancer Genome Atlas (TCGA) project and
annotated selected samples with RPPA results. These initiatives provide a rich source
of data at multiple levels from genes to transcripts to proteins[13].
Direct measurement techniques are based on the identification and quantification of
the protein itself without utilizing analyte-based technologies that are solely dependent
on the quality of the antibody. Most direct measurement techniques are based on MS
approaches. MS-based proteomics techniques can be organized as bottom-up shotgun approaches
which are able to accurately identify multiple proteins from complex mixtures. Complementary
methods including stable isotope labeling (SILAC), tandem mass tags (TMT), and isobaric
tags (iTRAQ) can be used in tandem with bottom-up approaches to measure relative or
absolute concentrations of some or all proteins in complex mixtures. One of the inherent
weaknesses of early MS-based approaches was the limited ability for absolute quantification
of protein amounts. Given that many signaling events within cells are based upon changes
in post-translational modifications with very small changes in total protein concentration,
it was of particular interest to develop techniques that allowed for quantification
of these changes. Perhaps one of the most exciting recent advances in MS-based proteomics
techniques is the application of selected/multiple reaction monitoring (SRM/MRM) to
quantify certain peptides of interest[14]. In contrast to array-based techniques described above, SRM-based methods can accurately
measure multiple peptides from a single protein and theoretically measure multiple
post-translational modifications simultaneously, independent of reliance on antibodies.
In a study of multiple MS-based platforms, strong quantitative correlation to an immunoassay-based
platform was observed for SRM using a synthetic peptide internal standard[15]. This contrasted with poor correlation for spectral counting, extracted ion chromatograms
(XIC), and non-normalized SRM. The inherent flexibility of the various sectors associated
with MS-based assay systems also allows for multiple questions to be asked that may
not be feasible with antibody-based systems. For example, MS imaging can provide a
molecular resolution of 2D tissue specimens[16]. This will allow for not only identification but also spatial relationships between
biomarkers within samples. This enhanced level of information may be critical for
defining pathway interactions or even more accurate molecular diagnostics.
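To make the quantification concrete, the sketch below illustrates the arithmetic behind SRM/MRM quantification against a spiked-in, isotope-labeled internal standard; the peak areas, spike amount, and number of transitions are illustrative assumptions rather than values from the cited studies.

```python
import numpy as np

# Minimal sketch of SRM/MRM absolute quantification against a heavy-labeled
# internal standard (illustrative values only; real workflows integrate peak
# areas with instrument or analysis software and average over transitions).

# Integrated peak areas for three transitions of the endogenous ("light")
# peptide and the spiked-in, isotope-labeled ("heavy") synthetic standard.
light_areas = np.array([152_300.0, 98_750.0, 61_200.0])
heavy_areas = np.array([310_500.0, 201_900.0, 125_400.0])

spiked_fmol = 50.0  # known amount of heavy standard added to the sample

# Per-transition light/heavy ratios; the co-eluting heavy standard corrects
# for ionization efficiency and matrix effects.
ratios = light_areas / heavy_areas

# Absolute amount of the endogenous peptide, averaged over transitions,
# with a coefficient of variation as a rough precision estimate.
endogenous_fmol = ratios.mean() * spiked_fmol
cv_percent = 100 * ratios.std(ddof=1) / ratios.mean()

print(f"Estimated endogenous peptide: {endogenous_fmol:.1f} fmol "
      f"(transition CV {cv_percent:.1f}%)")
```

Averaging over several transitions and reporting their variability mirrors how assay precision is typically assessed in targeted proteomics.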
Clinical Utility: Proteomics has a significant role to play in the translational analysis of patient
tissue samples. The US National Institutes of Health (NIH) has recognized the importance
of clinical proteomics with the establishment of the Office of Clinical Cancer Proteomics
Research (OCCPR). One of the largest working groups established through the OCCPR
is the Clinical Proteomic Tumor Analysis Consortium (CPTAC). Vasaikar et al., with
support from the CPTAC, recently published a new data portal that links TCGA clinical
genomics data with MS-derived proteomics data in a similar fashion to the work performed
by the TCPA utilizing RPPA arrays[13]
[17]. Of note, the CPTAC initiative has produced a number of publications[18]
[19]
[20]
[21]
[22] and freely available datasets for use by the broader omics community. As an example,
Mundt et al. have recently published an MS-based proteomics study on patient-derived
xenografts to identify potential mechanisms of intrinsic and adaptive resistance to
phosphoinositide 3-kinase inhibitors that will likely have clinical impact in the
near future[21]. Much of the clinical utility of proteomics research will be driven by sample availability
and quantity with accurate links to clinical data. While RPPA analysis requires very
small amounts of sample, there is also a limited proteome sample space to test. MS-based
techniques can test for a wide sample space of the proteome, but this requires a sample
size that can hinder research. More recent MS techniques that are focused on quantitative
analysis of a subset of the proteome can be done on smaller sample sizes and likely
represent the future of MS-based clinical tests.
Data Challenges: Proteomics is a relatively new field in the world of large scale omics datasets.
Older, array-based technologies have relatively straightforward datasets. Most
RPPA datasets are reported on normalized linear or log2 median-centered scales based on the detection ranges of the specific equipment being
utilized. There is minimal data manipulation outside of identifying the linear range
of each sample being measured through the use of internal standards and then extrapolating
absolute quantification through interpolation of a standard curve[23]. However, with the explosion of MS-based proteomic techniques, cross platform data
analysis and sharing have been associated with their fair share of growing pains.
The Human Proteome Organization (HUPO) has taken the lead in defining requirements
for proteomic submission and repositories. Such tasks that our colleagues in the genomics
world have taken for granted over the past 15 years are now being reinvented for the
proteomics field. The inherent complexity of proteomics data and the multiple platforms
utilized make sharing data a non-trivial affair. There is also an issue of technology
outpacing our reporting ability. While peptide and spectral libraries have been and
continue to be important for most MS proteomic analysis and deposition (PRIDE and
PeptideAtlas being the major resources)[24]
[25]
[26], there is also a need for a common library of molecular transitions with the explosion
of SRM/MRM techniques. PASSEL has been and continues to be the leading resource for
SRM transition datasets[27]. Probably the most important advance in dataset submission and dissemination has
been the continued development of the ProteomeXchange (PX) Consortium. Beginning
in 2011, the PX consortium has continually added members to allow common data formats
for all proteomic datasets. This will open remarkable opportunities for data reanalysis
and reinterpretation that our genomics colleagues have been enjoying for more than
10 years.
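As noted at the start of this paragraph, RPPA data are commonly reported on log2 median-centered scales, with absolute quantities extrapolated from a standard curve. A minimal sketch of those two steps, using made-up intensities and an assumed dilution series, is shown below.

```python
import numpy as np

# Illustrative RPPA-style processing: log2-transform raw spot intensities,
# median-center across samples, and interpolate a concentration from a
# standard curve measured on the same slide. All numbers are made up.

raw_intensities = np.array([1200.0, 3400.0, 560.0, 8100.0, 2300.0])  # one antibody, five samples

log2_int = np.log2(raw_intensities)
median_centered = log2_int - np.median(log2_int)   # median centering on the log2 scale

# Standard curve: known concentrations (arbitrary units) vs. measured log2 intensity.
std_conc = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
std_log2 = np.array([8.1, 9.0, 10.1, 11.0, 12.0])

# Interpolate each sample's log2 intensity onto the standard curve; values
# outside the curve are clamped to its endpoints by np.interp.
estimated_conc = np.interp(log2_int, std_log2, std_conc)

print("median-centered log2:", np.round(median_centered, 2))
print("interpolated concentration:", np.round(estimated_conc, 2))
```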
Future Directions: As more data from multiple tumor types become available, the ability to link genome
to proteome, and ultimately phenotype and treatment choices, becomes less of a holy
grail and more a clinical reality. The integrated information will display the potential
therapeutic targets or biomarkers to accurately predict or rapidly define intra-cellular
signaling networks and functional outcomes affected by therapeutics. Clinically, we
are starting to see this through large basket-type trials incorporating genomic data
matched to targeted drugs, independent of tumor type. The understanding of the proteomic
context of a genomic alteration will be key to expanding the repertoire of successful
biomarker-driven clinical trials. RPPA- and MS-based phosphoproteome investigation
is already being explored in the context of pathway activation and targeted therapies.
Similarly, utilizing targeted genomic mutation panels identifies a subset of ovarian
cancer patients that may be sensitive to poly ADP ribose polymerase (PARP) inhibition,
but incorporating proteomic analysis can also help identify possible responders in
genomically unselected populations treated with cytotoxic chemotherapy and/or PARP
inhibitors[28]
[29]
[30]
[31].
Metabolomics
Description of Technology: The overall aim in metabolomics studies is to measure levels of small molecules,
less than 1,500 Daltons, in a given biospecimen (e.g., blood, tissue, urine, breath).
The combination of various extraction (e.g., enrichment of lipids or protein-bound
metabolites) and analytical techniques generates metabolic profiles that span many
known and unknown metabolic pathways. Such metabolic profiles are a rich resource
for defining phenotypes of distinct diseases such as cancer, and they reflect alterations
in the genome, epigenome, proteome, and environment (exposures and lifestyle). For
this reason, metabolomics is increasingly applied to complement other omics characterization
of cells and clinical samples[32]
[33], and is invaluable for uncovering putative clinical biomarkers, therapeutic targets,
and aberrant biological mechanisms and pathways that are associated with cancer[34]
[35]
[36]
[37]
[38]
[39]
[40].
Metabolites can be categorized as endogenous, naturally produced by the host or cells
under study, or exogenous, including drugs, foods, and cosmetics among others. While
the goal is to measure all metabolites in a given biospecimen, analogous to measuring
all gene levels in transcriptomic studies, current analytical acquisition techniques
can only capture a fraction of metabolites given one assay or platform[36]
[41]. For example, as of April 2018, the Human Metabolome Database[41]
[42]
[43] contains information on 114,100 metabolites, yet only 22,287 (19.5%) have been detected
in human biospecimens. Also, unlike genomics and transcriptomics where one can measure
genome-wide features (e.g., expression, variants) with one assay, metabolomics requires
multiple analytical techniques and instrumentation for a broad coverage of metabolites
(e.g., polar and nonpolar metabolites). In practice, a specific combination of sample
preparation (e.g., enrichment of nonpolar metabolites) and analytical technique is
often optimized for a certain class of compounds (e.g., lipids)[36].
The two main analytical approaches for measuring metabolites are NMR and MS[44]
[45]
[46]
[47]. Abundance detection by MS is typically preceded by a molecule separation technique
such as liquid (LC) or gas (GC) chromatography. While NMR is considered the gold standard
for compound identification (when analyzing singular compounds in pure form) and produces
quantitative measures, MS-based methods are more sensitive (e.g., able to detect low
abundance metabolites) and detect more (e.g., several hundreds to thousands) metabolites[48].
Of note, metabolomics studies can be classified as targeted[49] or untargeted[50]. In targeted studies, a small (∼1-150) panel of metabolites with known chemical
characteristics and annotations are measured and the sample preparation and analytical
platform used are optimized to minimize experimental artifacts. Examples of artifacts
are fragmentation and adduct formation (e.g., addition of sodium or hydrogen ions)
in electrospray ionization[51]. Measurements can be performed against standards and thus yield quantitative or
semi-quantitative values. In contrast, untargeted metabolomics aims to detect
all possible metabolites given a biospecimen. Untargeted approaches yield relative
measurements of thousands of signals that represent known metabolites, experimental
artifacts (e.g., adducts), or unidentified metabolites[52]. While many more metabolites can be captured with untargeted approaches, it is very
challenging to annotate signals and identify metabolites[51]. Verification of metabolite identity requires prediction of elemental composition
from accurate masses, and eventually, further experimentation (NMR being the gold
standard) that requires the use of a purified standard for the metabolite of interest[45]
[53]
[54]
[55]
[56]. If a purified standard is not commercially available, one must be synthesized in-house
and thus this validation process could take several years. Ultimately, a targeted
approach is favorable if there is a priori knowledge on the biological system or disease under study because measurements are
quantitative and the data quality is high[52]. However, despite the high level of noise and the increased complexity in data analysis,
untargeted approaches are favorable for discovering novel biomarkers or generating
data-driven hypotheses[52].
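As an illustration of why untargeted annotation is difficult, the sketch below matches an observed LC-MS feature m/z to candidate metabolites within a parts-per-million tolerance while considering common positive-mode adducts; the candidate list, tolerance, and feature masses are illustrative, and a real pipeline would also weigh isotope patterns, retention time, and MS/MS spectra.

```python
# Sketch of annotating untargeted LC-MS features: match an observed m/z to
# candidate metabolites within a ppm tolerance, considering common
# positive-mode adducts. Feature m/z values and the tolerance are illustrative.

PROTON = 1.007276      # mass shift for [M+H]+
SODIUM = 22.989218     # mass shift for [M+Na]+

candidates = {            # metabolite: neutral monoisotopic mass (Da)
    "glucose": 180.06339,
    "glutamine": 146.06914,
    "citrate": 192.02700,
}

adducts = {"[M+H]+": PROTON, "[M+Na]+": SODIUM}

def annotate(observed_mz, ppm_tol=10.0):
    """Return (metabolite, adduct, ppm error) hits within the tolerance."""
    hits = []
    for name, neutral_mass in candidates.items():
        for adduct, shift in adducts.items():
            expected_mz = neutral_mass + shift
            ppm_error = 1e6 * (observed_mz - expected_mz) / expected_mz
            if abs(ppm_error) <= ppm_tol:
                hits.append((name, adduct, round(ppm_error, 2)))
    return hits

print(annotate(181.0706))   # close to [M+H]+ of glucose
print(annotate(203.0526))   # close to [M+Na]+ of glucose
```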
Recent Advancements: As metabolomics strategies are increasingly being applied in biomedical research,
advances in automation and improved quantification of NMR- and MS-based methods are
producing high throughput, reproducible data[44]
[57]
[58]. Integration of NMR and LC-MS techniques is increasingly applied to enhance reproducibility,
metabolite identification, and to ensure measurement integrity[59]. Such improvements in data acquisition techniques are critical for expanding the
coverage of metabolites that can be reliably measured. At the same time, these advances
are producing larger data, requiring the construction of databases and the development
of data analysis methods, tools, and pipelines. Currently, the two major sources of
publicly available data are the Metabolomics Workbench[60] and MetaboLights[61]. The Metabolomics Workbench, sponsored by the NIH Common Fund, also provides access
to analytical protocols (e.g., sample preparation and analysis), metabolite standards,
computational tools, and training. While metabolomics data is very informative, for
example, to uncover putative clinical biomarkers, understanding how metabolites are
produced and their function further deepens our understanding of disease phenotypes
and mechanisms. In turn, this mechanistic understanding can guide the search for putative
drug targets. With this in mind, integration of metabolomics data with other omics
datasets, including genome, proteome, and microbiome, is increasingly performed[62]
[63]. Integration of omics datasets includes numerical integration techniques such as
canonical correlations or multivariate modeling, and network/pathway based approaches[64]
[65]
[66]
[67]
[68]
[69]. Furthermore, open-source user-friendly software for metabolomics analysis and interpretation
through pathway analysis has been critical for guiding analysis and interpretation
of the data. Examples include XCMS[70]
[71], MetaboAnalyst[72]
[73], and Metabox[74].
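As a small example of the numerical integration strategies mentioned above, the sketch below applies canonical correlation analysis (via scikit-learn) to find shared axes of variation between a metabolomics matrix and a second omics matrix measured on the same samples; the data are simulated and the layer names are placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Sketch of numerical omics integration via canonical correlation analysis:
# find paired latent components that maximize correlation between a
# metabolomics matrix and, e.g., a proteomics matrix from the same samples.
# Random data stand in for real measurements.

rng = np.random.default_rng(0)
n_samples = 40
metabolites = rng.normal(size=(n_samples, 25))   # samples x metabolites
proteins = rng.normal(size=(n_samples, 30))      # samples x proteins

# Inject a shared signal so the first canonical pair is interpretable.
shared = rng.normal(size=n_samples)
metabolites[:, 0] += shared
proteins[:, 0] += shared

cca = CCA(n_components=2)
met_scores, prot_scores = cca.fit_transform(metabolites, proteins)

# Correlation of the first pair of canonical variates.
r = np.corrcoef(met_scores[:, 0], prot_scores[:, 0])[0, 1]
print(f"First canonical correlation: {r:.2f}")

# The loadings (cca.x_loadings_) indicate which metabolites drive the shared
# axis and provide a starting point for pathway-level interpretation.
```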
Data Challenges: Unlike genomics, a reference metabolome does not exist and it is currently impossible
to measure all metabolites in a given biospecimen. This lack of reference causes many
data analysis issues, particularly for untargeted metabolomics studies where the
identification of metabolites is difficult to pin down[75]. The field also suffers from multiple metabolite naming conventions. In fact, different
naming conventions are more appropriate for certain types of data acquisition techniques.
For example, while untargeted metabolomics approaches cannot resolve differing stereochemistry
or double bond position/geometry, other approaches can identify metabolites with
more or less granularity[60]. Translation services, including RefMet[60] and the Chemical Translation Service[76], help in that regard. Also, the multitude of data acquisition techniques makes it
difficult to organize the data in a standardized fashion[77]; instrumentation vendors have specific data formats that are tied to proprietary
software and conversion of these file formats to open source formats can require specific
operating systems or software licenses. Differences in how the data was generated
also make it difficult to compare results across studies. With many missing identities
and different resolution of identification, it is difficult to map a metabolite from
one study to another. Standardization is thus critical to handle such challenges but
is in its nascence[60]
[77]. Standard protocols for downstream data analyses, including quality control, transformation/
normalization, and differential analysis, are also difficult to establish, namely
due to differences in experimental study design and data acquisition. Although publicly
available tools and software aim to provide standard approaches[70]
[71]
[72]
[73]
[74]
[78], detailed descriptions of parameters (e.g., mass divided by charge number [m/z]
range allowed for binning features) and cutoffs used are often lacking in published
work, making reproducibility of results difficult.
Clinical Utility: Metabolomics plays an increasing role in clinical and translational research as large
initiatives such as the Consortium of Metabolomics Studies (https://epi.grants.cancer.gov/comets/) and the NIH Common Fund's Metabolomics Program (https://commonfund.nih.gov/metabolomics) are generating large-cohort metabolomics datasets (>1,000 participants).
Because metabolomics profiles help define disease phenotypes and reflect alterations
in the genome, epigenome, proteome, and environment (exposures and lifestyle), metabolites
are ideal candidates for biomarker discovery in many diseases including cancer[37]
[38]
[39]
[79]
[80]
[81]. With this in mind, metabolomics is playing a larger role in precision medicine,
requiring continued efforts in data acquisition and analysis[82]. Metabolomics is also increasingly integrated with other omics information and is
analyzed in the context of biological pathways and networks, with the aim of identifying
mechanisms that underlie diseases and finding novel therapeutic targets[34].
Future Directions: In October 2017, the NIH Common Fund released funding opportunities to promote
efforts in public accessibility and reuse of metabolomics data, development of computational
tools for analysis (including omics integration) and interpretation of metabolomics
data, and development of approaches to identify unknown metabolites. Thus, we anticipate
further development in open-source, publicly available computational tools and infrastructures
to facilitate metabolomics analysis. Since metabolomics is increasingly applied to
biospecimens from large (>1000) cohorts and consortia, it is now possible to integrate
other omics data, as well as clinical and environmental contexts in the analyses.
The complexity of harmonizing data across cohorts and incorporating clinical and environmental
data necessitates further standardization and computational infrastructure. Of special
interest, the impact of alterations in the microbiome (dysbiosis) on metabolic pathways
is particularly relevant since these dysbiosis-metabolome relationships can be causative
or indicative of a myriad of human diseases[83]
[84] including obesity and diabetes[85]
[86]
[87], cardiovascular diseases[88]
[89]
[90], inflammatory diseases[91], and cancer[64]
[92]
[93]. We thus suspect an increase in multi-tiered studies that apply a holistic approach
to understanding diseases, including integration of molecular information from host
and environment. Concurrently, as pathway information and identification of metabolites
increases, strategies that take into account the kinetics of metabolites (e.g., metabolic
flux and networks) will become more and more applicable for clinical metabolomics studies.
Lastly, while the classical view of the molecular dogma is that metabolite levels
are modulated by the epigenome, genome, and proteome, there have also been examples
where metabolites regulate epigenetic events (i.e., going against the grain of the
molecular dogma direction)[94]
[95]
[96]. The future of metabolomics and its potential in uncovering biomarkers and deciphering
mechanisms will surely necessitate modeling of complex bi-directional relationships
within omics and environmental context information.
Cell-free DNA
Description of Technology: Both normal and malignant cells shed DNA into the circulation and next-generation
sequencing technologies are capable of detecting small amounts of cfDNA, making the
blood a potential repository for tumor genomic profiling. ‘Liquid biopsy’, once validated,
could enable the detection of cancer as a screening tool, track evidence for residual
disease after cancer treatment, monitor patients for response to therapy, and discover
meaningful mechanisms of resistance to cancer therapies. With this wealth of previously
unavailable information, liquid biopsy could lead to the development of new assays,
biomarkers, and targeted treatments to help cancer patients live longer, better lives.
It is important to note that cell-free/circulating tumor DNA is only one aspect of
‘liquid biopsies,’ and there are multiple advances with other assays outside of the
scope of this review, including circulating tumor cells[97]
[98]
[99]
[100], other nucleic acids[101], exosomes and other extracellular vesicles[102]
[103]
[104], and integrated biomarkers[105].
The presence of cell-free nucleic acids in the blood was first described in 1947 by
Mandel and Metais[106] and three decades later, Leon et al. demonstrated that cancer patients had greater
amounts of cfDNA relative to healthy controls[107]. Stroun et al. demonstrated both that tumor DNA was detectable specifically in plasma[108]
[109] and that specific genomic alterations could be identified[110]. Of note, cfDNA is distinct and not derived from circulating tumor cells, although
they are correlated and both increased in patients with advanced cancer[111]. Major advancements in cfDNA were first made in the field of perinatology, leading
to the early minimally invasive detection of fetal chromosomal anomalies from maternal
plasma in widespread clinical use today[112]. The remarkable advances in sequencing technology over the past two decades, from
Sanger sequencing to allele-specific PCR to the advent of massively parallel sequencing
('next generation sequencing')[113]
[114], along with advances in bioinformatics analysis[115] and rapid reduction in cost have facilitated increasing ability to interrogate cfDNA
to profile tumors.
Recent Advancements: To date, most clinical applications of cfDNA sequencing have focused on tracking
specific mutations[111]
[116]
[117]
[118]
[119]
[120]
[121]
[122] or sequencing targeted panels of cancer-related genes[123]
[124]
[125]
[126]
[127], particularly in metastatic cancer. In general, cfDNA is present in a greater proportion
of patients and in larger amounts in metastatic cancers relative to primary tumors.
In the metastatic setting, particularly in cancer types that are in many cases inaccessible
(e.g., lung primary or metastases) or are higher risk lesions to sample in terms of
potential complications, cfDNA genomic approaches may offer potential benefits relative
to tumor biopsy. Tumors are known to be heterogeneous and biopsies inherently only
sample a small localized region of a single metastatic site[128], introducing potential bias that may be overcome by cfDNA as a ‘sink’ of all metastatic
sites in a patient[129]. Taking a patient-centered approach in the metastatic setting is critical - avoiding
painful and inconvenient biopsies has the potential to improve quality of life. In
one study, 34% of breast cancer patients undergoing metastatic biopsy described anxiety
pre-biopsy and 59% described post-biopsy pain[130].
Clinical Utility: The only FDA-approved ‘liquid biopsy’ companion diagnostic to date is the cobas®
EGFR Mutation Test v2 for the detection of exon 19 deletions or exon 21 (p.L858R)
substitution mutations in the epidermal growth factor receptor (EGFR) gene to identify patients with metastatic non-small cell lung cancer eligible for
treatment with erlotinib[131]. However, in cancers harboring mutations that are known to be prognostic or predictive,
plasma-based cfDNA assays have demonstrated utility in disease management and are
increasingly used clinically[132]
[133]
[134]. In addition, cfDNA targeted panel sequencing assays of cancer-related genes are
used in lieu of metastatic tumor biopsy sequencing in clinical practice, including commercial
tests such as Guardant360® and FoundationACT®[126]
[135]. In the clinical setting, genomic profiling via cfDNA has been associated with more
rapid turnaround of genomic results than tissue biopsies, frequently due to delays
in accessing or obtaining tissue[136]. In the non-metastatic setting, there is great interest and excitement around the
potential to develop patient tumor-specific panels of mutations for the highly sensitive
detection of minimal residual disease after initial cancer treatment[137]
[138]. In addition, multiple groups and commercial ventures are pursuing whether cfDNA
could be used as a novel screening approach for cancer diagnosis[139], including the STRIVE Breast Cancer Screening Cohort (NCT03085888) supported by
Grail, Inc. However, the optimal technical approach for cfDNA as a detection methodology
remains unclear and large studies to assess sensitivity and specificity are only recently
underway. Another approach is to incorporate cfDNA into a multi-analyte assay for
cancer screening, such as CancerSEEK[105]. The CancerSEEK assay integrates a cfDNA PCR-based assay for a panel of common cancer
mutations with established circulating protein biomarkers.
The promise of cfDNA is immense, yet there remain several key unresolved challenges,
including how well tumor-derived cfDNA mirrors tissue-derived tumor DNA, how to analyze
tumor-normal DNA admixture present in circulation, how to better assess tumor-derived
fraction of cfDNA, and how to account for clonal hematopoiesis of indeterminate potential
(CHIP). While cfDNA appears to demonstrate overall high concordance with tumor biopsies[140]
[141]
[142], it is unclear whether cfDNA can serve as a comprehensive proxy for tumor biopsy
in all contexts. Further, assays vary in their detection and reporting of genomic
alterations from plasma[143].
Data Challenges: Circulating DNA in plasma is an admixture of both normal DNA shed primarily from
leukocytes and tumor DNA, which presents challenges for analysis and interpretation
of sequencing data. In the context of a large amount of tumor-derived DNA in the circulation
(high ‘tumor fraction’), for example tumor fraction greater than 10%, standard next-generation
sequencing approaches may be applied. However, in many contexts, tumor fraction is
incredibly low, particularly at diagnosis, in the setting of minimal residual disease,
or in some ‘low cfDNA shedding’ cancer types and patient tumors. Highly specific assays
may detect tumor fractions as low as 0.02% for panel sequencing and as low as 0.00025% using approaches
such as digital droplet PCR for specific known alterations[138]. A major remaining challenge is to understand the sensitivity of assays for mutation
detection to ensure that a negative test truly reflects the absence of tumor-derived
DNA and not a limitation of assay or bioinformatic approaches.
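To make the sensitivity question concrete, the sketch below computes a variant allele fraction from illustrative read counts and asks whether the observed alternate reads exceed what an assumed background error rate alone would explain; the 2 x VAF tumor-fraction approximation further assumes a clonal heterozygous mutation with no copy-number change at the locus.

```python
from scipy.stats import binom

# Sketch: is a low-level variant in cfDNA distinguishable from background
# sequencing error? Read counts and the error rate are illustrative.

depth = 25_000          # unique (deduplicated) read depth at the locus
alt_reads = 20          # reads supporting the known tumor mutation
error_rate = 2e-4       # assumed per-base background error after error suppression

vaf = alt_reads / depth
# Rough tumor fraction assuming a clonal heterozygous mutation and no
# copy-number change: each tumor genome contributes one mutant of two
# alleles, so VAF ~ tumor_fraction / 2.
tumor_fraction = 2 * vaf

# Probability of seeing >= alt_reads from error alone (one-sided binomial test).
p_value = binom.sf(alt_reads - 1, depth, error_rate)

print(f"VAF = {vaf:.4%}, approx. tumor fraction = {tumor_fraction:.4%}")
print(f"P(>= {alt_reads} alt reads from error alone) = {p_value:.2e}")
```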
As cfDNA assays seek to expand the breadth of sequencing (e.g., whole genome sequencing),
efficient and cost-effective methods to screen blood samples for adequate amounts
of tumor-derived DNA will be critical. Although sequencing costs continue to decline,
identifying samples unlikely to provide usable sequence data should improve efficiencies.
Most assays that determine tumor fraction depend on prior knowledge of tumor-specific
mutations. Recent efforts suggest that low-coverage (approximately 0.1X) whole genome
sequencing of cfDNA may offer the ability to quantify tumor fraction without the need
for prior knowledge of tumor mutations[140].
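One intuition behind such mutation-independent estimates is that the observed copy ratio at a clonal single-copy alteration is diluted by normal cfDNA, so its deviation from 1 scales with tumor fraction. The sketch below illustrates that arithmetic with assumed segment-level copy ratios; published tools fit this relationship jointly across the genome rather than from a single segment.

```python
# Sketch: infer tumor fraction from the observed copy ratio of a clonal
# single-copy alteration in low-coverage cfDNA WGS. With tumor fraction tf,
# the mixture's average copies at a one-copy deletion are 2*(1 - tf) + 1*tf
# = 2 - tf, so the copy ratio relative to diploid normal is r = 1 - tf/2.

def tumor_fraction_from_loss(copy_ratio):
    """tf = 2 * (1 - r) for a clonal one-copy deletion (no subclonality)."""
    return 2.0 * (1.0 - copy_ratio)

def tumor_fraction_from_gain(copy_ratio):
    """tf = 2 * (r - 1) for a clonal one-copy gain."""
    return 2.0 * (copy_ratio - 1.0)

# Illustrative segment copy ratios from binned, GC-corrected read counts.
loss_ratio = 0.93    # segment harboring a known one-copy deletion
gain_ratio = 1.05    # segment harboring a known one-copy gain

print(f"tf from loss segment: {tumor_fraction_from_loss(loss_ratio):.1%}")
print(f"tf from gain segment: {tumor_fraction_from_gain(gain_ratio):.1%}")
```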
Another challenge involves deconvolution of genomic alterations present in leukocytes
as a consequence of CHIP from tumor-specific alterations[144]
[145]. CHIP is the expansion of a clonal hematopoietic progenitor identified through common
genomic alterations, present at increasing frequency as individuals age. Typically,
‘normal’ DNA to distinguish germline from somatic tumor mutations is derived from
peripheral blood cells and the frequency of CHIP - potentially more than 10% of patients
over the age of 65 - suggests that methods to identify and account for CHIP will be
critical.
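A common mitigation, sketched below with illustrative variant calls, is to sequence matched white blood cells and flag plasma variants that also appear in the leukocyte compartment as likely CHIP rather than tumor-derived; real pipelines additionally compare allele fractions and recurrent CHIP genes.

```python
# Sketch: flag plasma cfDNA variants better explained by clonal hematopoiesis
# (CHIP) than by the tumor, by comparing against variants detected in matched
# white blood cell (buffy coat) sequencing. Variant tuples of
# (gene, chrom, pos, ref, alt) below are illustrative, not real calls.

plasma_variants = {
    ("TP53", "17", 7674220, "G", "A"),
    ("DNMT3A", "2", 25234373, "C", "T"),
    ("PIK3CA", "3", 179234297, "A", "G"),
}

wbc_variants = {
    ("DNMT3A", "2", 25234373, "C", "T"),   # also present in leukocytes -> likely CHIP
}

likely_chip = plasma_variants & wbc_variants
likely_tumor = plasma_variants - wbc_variants

print("Likely CHIP-derived:", sorted(v[0] for v in likely_chip))
print("Candidate tumor-derived:", sorted(v[0] for v in likely_tumor))
```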
Although most efforts to date have focused on tracking specific alterations known
to be present in a tumor biopsy or sequencing targeted panels of cancer-related genes,
there is growing evidence that cfDNA offers the potential to obtain exome- and genome-level
tumor sequencing data. Work from several groups has demonstrated the feasibility
of genome-wide copy number analysis in cancer patients from plasma via shallow or
low-coverage sequencing of cfDNA[140]
[146]
[147]
[148]
[149]. Further efforts in this regard demonstrate feasibility of exome sequencing of cfDNA
in the context of adequate tumor fraction[140]
[141]
[142]
[146]. Comprehensive profiling is useful, particularly as blood can readily be collected
serially, enabling tracking of the evolution of resistance as patients are on therapy.
As we gain a greater understanding of the importance of non-driver mutations and regulatory
elements in carcinogenesis and cancer progression, more comprehensive tumor genomic
profiling from blood offers the potential for discovery in addition to detection, response
tracking, or biomarker identification. In addition, more sensitive methods of detecting
and isolating tumor-derived DNA or alterations from plasma may improve assay sensitivity[150]
[151].
Future Directions: cfDNA is increasingly prevalent in oncology practice, from the first FDA-approved
cfDNA biomarker to commercial cfDNA targeted panel sequencing assays. However, a recent
American Society of Clinical Oncology (ASCO) and College of American Pathologists
joint review reinforced that widespread use in clinical practice is not yet recommended
until there is evidence of clinical validity and utility[152]. Despite this, there is growing evidence that personalized, highly sensitive mutation-based
assays will be feasible for assessment of minimal residual disease and potentially
tracking for early recurrence detection. These advances may translate to cfDNA assays
that could be used for screening and early primary detection as well, though these require clinical
validation first. Finally, technological and computational advances are facilitating
comprehensive genomic profiling exclusively from plasma. There remains the hope that
new minimally invasive ‘liquid biopsy’ assays could improve outcomes by identifying
cancer earlier and more specifically while also facilitating a greater understanding
of novel susceptibilities and targets.
Cohort Matching Algorithms
Description of Technology: Traditional biomarker analysis focuses on identifying what distinguishes
one patient from another. Broadly speaking, cohort-matching algorithms are
either centered around similar features, or on similar outcomes. Using feature selection
methods, biomarkers with the strongest association to the feature of interest are
identified and then validated in an independent test set. These biomarker selection
processes universally assume that there is a global ground truth regarding the biomarker-phenotype
relationship that is stable across multiple settings[153]. Unfortunately, this biomarker selection paradigm results in a tendency to divide
patients into increasingly small subsets that may have no clinical relevance. Moreover,
this fragmentation of previously “common” diseases results in a collection of “rare”
subtypes that are then progressively challenging to study[154]
[155] as there are an endless number of biomarker-subtype-therapy combinations. An alternative
to this biomarker proliferation is the idea of trying to bin patients together based
on potential outcome similarity – pattern-matching at a patient level. In other words, rather than focus on how patients are dissimilar,
focus on how sets of patients respond similarly to a medication. In other words, one
can leverage omics/ phenomics comparisons at a patient level through more holistic
pattern matching. This allows any number of omics technologies to define a patient-patient
similarity strength.
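A minimal sketch of this idea follows: per-omics patient-patient similarities are computed and fused (here with equal weights) into a single similarity used to rank "patients like my patient." The data, layer names, and equal weighting are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Sketch: define a patient-patient similarity by combining per-omics
# similarities (here simply averaged with equal weights). Random matrices
# stand in for real omics layers measured on the same patients.

rng = np.random.default_rng(1)
n_patients = 6
omics_layers = {
    "proteomics":   rng.normal(size=(n_patients, 40)),
    "metabolomics": rng.normal(size=(n_patients, 30)),
}

similarities = []
for name, X in omics_layers.items():
    Xz = StandardScaler().fit_transform(X)        # put features on a common scale
    similarities.append(cosine_similarity(Xz))    # patients x patients

combined = np.mean(similarities, axis=0)          # equal-weight fusion across layers

query_patient = 0
ranked = np.argsort(-combined[query_patient])     # most similar first
print("Patients most similar to patient 0:",
      [int(i) for i in ranked if i != query_patient])
```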
Recent Advances: There currently is not a standard means of patient matching using omics data. There
is an assortment of varied heuristics and cohort matching metrics[156]
[157]
[158]. Feature matching algorithms assume that retained features are critical determinants
of outcomes such as survival and are optimal for situations where the biomarker is
directly linked to the outcome. A straightforward approach to feature matching is
to assign matches based on exact feature overlap: for two patients to be a match
they must share all features. Foundation Medicine's PatientMatch tool[159] is an example of this exact matching approach. More complex feature matching schemes
have also been developed using Bayesian approaches[160]. Other feature matching algorithms include the PHenotypic Interpretation of Variants
in Exomes (PHIVE), which matches human phenotypic profiles against mouse model phenotypes to prioritize variants found
in whole exome sequencing[157], and DECIPHER[161], which enables international querying of karyotype, genetic, and phenotypic information
for matches. In contrast to feature matching, the outcome-matching approach allows
features to be weighted based on their discriminatory power. Frequently used algorithms
are weighted K-nearest neighbor, random forest techniques, and deep learning (e.g.,
artificial neural networks)[162]
[163]
[164]. Outcomes matching attempts to match patients with other patients who may have a
similar outcome to the same therapy based on phenotypic and omic predictors. A patient's
features could potentially be compared not just from patient to patient (e.g., Patients
Like Me) to infer outcomes but also from patient to cell lines (e.g., the Connectivity
Map project[165]) and from a patient's electronic health record (EHR) to other patients' EHRs[166].
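As a simple sketch of outcome matching, the example below uses a distance-weighted k-nearest-neighbor classifier over simulated omic/phenotypic features: a new patient's likely response is inferred from the responses of the most similar previously treated patients, and the matched neighbors themselves form the "patients like my patient" cohort. All data and parameters are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Sketch of outcome matching with a distance-weighted k-nearest-neighbor model:
# a new patient's likely treatment response is inferred from the responses of
# the most similar previously treated patients. Data are simulated.

rng = np.random.default_rng(7)
n_patients, n_features = 200, 50
X = rng.normal(size=(n_patients, n_features))          # omic/phenotypic features
# Simulated response driven by two features (1 = responder, 0 = non-responder).
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_patients) > 0).astype(int)

scaler = StandardScaler().fit(X)
model = KNeighborsClassifier(n_neighbors=10, weights="distance")
model.fit(scaler.transform(X), y)

new_patient = rng.normal(size=(1, n_features))
prob_response = model.predict_proba(scaler.transform(new_patient))[0, 1]
neighbor_idx = model.kneighbors(scaler.transform(new_patient),
                                return_distance=False)[0]

print(f"Predicted response probability: {prob_response:.2f}")
print("Matched cohort (indices of 10 most similar patients):", neighbor_idx.tolist())
```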
Clinical Utility: The landscape today is dominated by feature matching strategies. These have been
applied to clinical trial recruitment most notably for such national endeavors as
the NCI-MATCH trial. Much of the work on feature matching algorithms today has focused on
improving clinical trials accrual by prompting physicians to generate referrals[167]. Although seemingly simple, clinical trials matching algorithms have shown up to
a 90% reduction in time spent identifying trial candidates[168]. GeneYenta matches phenotypically similar patients with rare diseases[169] by weighting predictive features. Algorithms have been written to evaluate single
nucleotide variant (SNV) frequencies between patients and non-small cell lung cancer
cell lines to predict chemotherapeutic response[156]. Startup efforts such as MatchTX (http://match-tx.com) are attempting to reimagine social networking tools to help clinicians find best
patient matches. Although the data sources, data types, and methods are heterogeneous,
matching techniques at their core employ heuristic approaches to discover and vet
the best profiles from large clinical databases.
Data Challenges: Cohort matching algorithms need to be capable of subsuming disparate data types and
methods of comparison. Unfortunately, the data types used in the matching process are
varied and can include both subjective and objective phenotypic measurements. Definitions of
pathogenicity[170] remain a huge problem, as do incomplete datasets and datasets lacking standardized
ontologies. Preprocessing steps will need to be developed to organize the data into
viable features to be used by matching algorithms. A further complication is the possibility
that predictive models may require subsuming disparate unstandardized data-types simultaneously[63]
[171]. EHR and omics interoperability remain a primary impediment to more robust algorithm
generation. This will require a concerted standardization among data sets including
vocabulary mapping and normalization.
Future Directions: Interoperability is a key impediment to the omics revolution, and this has spurred
efforts such as the Genomic Data Commons[172], which aims to "provide a unified data repository that enables data sharing across
cancer genomic studies in support of precision medicine.” Other consortia efforts
such as the Global Alliance for Genomics and Health (GA4GH)[173] and Health Level Seven International's Fast Healthcare Interoperability Resources
(FHIR)[174] are enabling the development of application programming interfaces (APIs) and standards
convergence. For example, the GA4GH Beacon Project allows federated queries to detect
the existence of specific genomic variants across a variety of genome resources. Coalescing
large datasets such that meaningful matching can occur has also been a thrust of recent
developments. ASCO has also built a learning system called CancerLinQ[175] to help facilitate integration of data from multiple participating community oncology
practice sites in an attempt to standardize data, facilitate research, and provide
personalized cancer care through patient matching. Academic and selected larger oncology
groups are participating in consortia such as ORIEN[173], GENIE[176], and the International Cancer Genome Consortium (ICGC)[177] and are building their respective frameworks for identifying patient cohorts. The
“Sync for Science”[178] endeavor sponsored by the NIH and the Office of the National Coordinator for Health
Information Technology will permit patients to directly donate their data to
support innovative match-based algorithms for predictive purposes, thus
contributing to precision medicine research. Sync for Science is also an integral
part of the patient engagement portion of the NIH ‘All of Us’ initiative (https://allofus.nih.gov). Enhancing and perhaps complicating the field further, individual hospital systems
such as the Swedish Cancer Institute and the Henry Ford Hospital system are developing
their own precision medicine repositories. Commercial pathology laboratories – such
as Caris and Foundation Medicine – have their internal datasets to mine. Other efforts
like Syapse's Open Precision Oncology Network[179] allow aggregated cancer genomics data to be pulled from all participating health
systems. These consortia and businesses all rely on patient matching as part of their
core strengths.
Conclusion
The sequencing of the genome has ushered in a new era of personalized cancer informatics.
But the DNA genome is simply a first layer in a complex biological environment onto
which many omics data can be overlaid. We are in a time of growth. Metabolomics and
proteomics are driving us closer to the tumor phenotype, and importantly, its response
to treatment in real-time. ctDNA/cfDNA may help us understand clonal tumor evolution
in the patient using non-invasive methods. These new omics datatypes will almost
certainly help us tailor and adjust therapy for oncology patients. With these new
datatypes and the understanding that data must be centralized, we are witnessing,
too, an explosion of clinical/omics datasets aggregated by consortia and industry
partners. As these datasets grow, so too will the need for more sophisticated
cohort matching algorithms to bring clarity and useful actionable insights. These
are exciting times. The cancer omics revolution continues to march forth rapidly and
hopefully continues to improve our ability to practice precision oncology.