Key words breast cancer - SNPs - triple-negative - subtype prediction - prediction model - variable
selection
Introduction
Knowledge about targeted therapies for breast cancer has improved immensely over the
last two decades. These therapies have mainly been developed for – although they are
not restricted to – intrinsic molecular subtypes: triple-negative breast cancer (TNBC),
hormone receptor-positive breast cancer, and HER2-positive breast cancer.
As TNBC lacks hormone receptors and HER2 receptors, treatment for triple-negative
breast cancer is primarily restricted to conventional chemotherapy. At the molecular
level, however, TNBC is a heterogeneous disease that has different histological and
molecular features. Recently, studies of TNBC have been extending the inclusion criteria
and now include additional molecular markers in the selection criteria, opening up
scope for targeted therapies in this subtype of breast cancer.
One example is a study of the targeted antibody–drug conjugate glembatumumab vedotin
[1 ]. The study includes not only a requirement for the tumor to be triple-negative,
but also for it to show overexpression of glycoprotein nonmetastatic melanoma protein
B (gpNMB). Other examples are poly(ADP-ribose) polymerase (PARP) inhibitor studies
in patients with BRCA1/2 mutations, in which testing for a BRCA1/2 mutation initially required the tumor to be triple-negative
[2], [3]. The requirement for triple negativity was later changed to HER2 negativity.
The screening phases for studies of this type are often extended, since the process
of determining the molecular subtype and carrying out additional biomarker assessment
is time-consuming. This can often be a challenge to the patience of both patients
and physicians. Parameters capable of predicting the molecular subtype before it becomes
available via pathology might be helpful for treatment planning and for optimizing
the timing and cost of screening phases for clinical trials. Biomarker assessment
could be carried out at an early stage in the work-up for patients with an increased
likelihood of the specific molecular subtype.
Clinical and epidemiological risk factors, such as reproductive factors and body mass
index (BMI), are associated with the molecular subtype of the tumor. They appear to
have an effect on the risk of developing hormone receptor-positive tumors [4 ], [5 ], [6 ]. In a case–case analysis, our group previously reported that age and BMI are the
most important parameters associated with molecular subtypes [7 ]. High mammographic density was also associated with hormone receptor-negative tumors
[8 ], [9 ].
Since rapid and low-cost genotyping is becoming increasingly widely available [10 ], single nucleotide polymorphisms (SNPs) might be useful as predictors for molecular
subtypes. Genetic factors have been shown to increase the risk for specific breast
cancer subtypes. For example, it is known that patients with BRCA1 mutations are mainly diagnosed with triple-negative breast cancer, and mutation rates
in this population are over 10% [11 ]. In addition, approximately 100 validated SNPs for breast cancer risk are known
[12 ], [13 ], [14 ]. Some of these SNPs have been specifically linked to a risk for hormone receptor-positive,
hormone receptor-negative, or triple-negative breast cancer [15 ], [16 ], [17 ], [18 ], [19 ], [20 ], [21 ].
It was hypothesized that a combination of multiple breast cancer risk SNPs in addition
to clinical predictors of molecular subtypes may improve the prediction of molecular
subtypes. Specifically, predicting TNBC – a breast cancer subtype in which the patients
affected have many unmet medical needs – would be helpful. The aim of this study was
therefore to identify breast cancer risk SNPs capable of predicting TNBC in addition
to clinical predictors in women with invasive breast cancer. The prediction performance
of various methods of selecting SNPs was compared.
Methods
Patients
The patients selected for this retrospectively designed cross-sectional observational
study are included in the Bavarian Breast Cancer Cases and Cohorts (BBCC) study. The
BBCC has been ongoing since 2002 and includes consecutively recruited patients with
invasive breast cancer at the University Breast Center for Franconia. The study was
designed to identify and validate genetic and nongenetic risk factors, and it has
been involved in several validation studies for SNPs [13 ], [14 ], [19 ], [20 ], [21 ], [22 ], [23 ], [24 ], [25 ], [26 ], [27 ], [28 ], [29 ], [30 ], [31 ], [32 ], [33 ]. For the present study, all women who were recruited into the BBCC from 2002 to
2010 were selected. Among them, patients were excluded for the following reasons:
no participation in any genetic BBCC research projects; insufficient remaining DNA
available due to participation in previous research projects; and no data on hormone
receptor status or HER2 status available from the central pathology department at
the breast cancer center. After SNPs had been selected for analysis (see below), patients
with incomplete genetic information were also excluded. All of the patients provided
written informed consent, and the Ethics Committee of the Medical Faculty at Friedrich
Alexander University of Erlangen–Nuremberg approved the study.
Data collection
All treatment-related patient data and tumor characteristics were documented as part
of the certification processes required by the German Cancer Society (Deutsche Krebsgesellschaft) and by the German Society for Breast Diseases (Deutsche Gesellschaft für Senologie)
[34 ]. The data are recorded prospectively in a database and audited annually as part
of the breast cancer center certification process. Epidemiological data and risk factors
for breast cancer were obtained using a structured questionnaire, which was completed
by the patients and reviewed together with trained study personnel and supplemented
if necessary.
SNP selection
A total of 102 SNPs were selected for genotyping. Of these, 98 are validated breast
cancer risk SNPs. Most of these breast cancer risk SNPs have been confirmed in large
international validation studies, mainly by the Breast Cancer Association Consortium
(BCAC). The BCAC initially published a validation of a few SNPs [35 ] and then, after increasing the sample size and analyzing more SNPs, published a
series of papers as a result of the Collaborative Oncological Gene–environment Study
(COGS; www.cogseu.org and www.nature.com/icogs/ ) [13 ], [14 ]. Four SNPs that were shown to have an influence on the prognosis in breast cancer
patients were also selected [36 ], [37 ], [38 ], [39 ]. A complete list of the SNPs, including references, is provided in Supplementary
Table S1
[13 ], [14 ], [18 ], [19 ], [20 ], [25 ], [29 ], [30 ], [32 ], [33 ], [35 ], [37 ], [38 ], [39 ], [67 ], [79 ], [80 ], [81 ], [82 ], [83 ], [84 ], [85 ], [86 ], [87 ], [88 ], [89 ].
DNA extraction, genotyping, and quality control
Whole-blood samples were collected in citrate-phosphate-dextrose-adenine (CPDA) tubes
(Sarstedt AG, Nümbrecht, Germany) from patients who had consented to participate in
the biomarker substudy. Germline DNA was extracted using the automated magnetic bead-based
chemagic MSM I technique (PerkinElmer chemagen, Baesweiler, Germany) in accordance
with the manufacturer's instructions. Genotyping was done at the Dr. Margarete Fischer-Bosch
Institute of Clinical Pharmacology, using MassARRAY iPLEX Gold (Sequenom, San Diego,
California, USA). SNPs were excluded if MALDI spectra were unreliable, based on raw
genotype data. Exact tests for Hardy–Weinberg equilibrium (HWE) were performed and
SNPs with an unexpectedly small p value, assessed using a quantile–quantile plot,
were also excluded.
Pathology assessment
All of the histopathological information used in the analysis was directly documented
from the original pathology reports, which were reviewed by two investigators. Estrogen
receptor status, progesterone receptor status, and HER2 status were assessed as follows.
Monoclonal mouse antibodies against estrogen receptor-alpha (clone 1D5; 1 : 200 dilution,
DAKO, Denmark) and monoclonal mouse antibody against the progesterone receptor (clone
pgR636, 1 : 200 dilution, DAKO, Denmark) were used to stain the pretreatment core
biopsies. The percentage of positively stained cells was included in the pathology
reports. The tumors were considered to be positive for the estrogen and progesterone
receptors if 10% or more of the cells showed positive staining, in accordance with
recommendations applying at the time when the study was conducted [40 ], [41 ], [42 ], [43 ]. A polyclonal antibody against HER2 (1 : 200 dilution, DAKO, Denmark) was used,
and HER2 status was stated in the pathology reports as negative, 0, 1+, 2+, or 3+
in accordance with the guidelines published by Sauter et al. [44 ]. Tumors with a score of 0 or 1+ were regarded as HER2-negative, and those with a
score of 3+ were regarded as HER2-positive. Tumors with 2+ staining were tested for
gene copy numbers of HER2 by chromogenic in situ hybridization. Using a kit with two
probes of different colors (ZytoDot, 2C SPEC HER2/CEN17, ZytoVision Ltd., Bremerhaven,
Germany), the gene copy numbers of HER2 and centromeres of the corresponding chromosome
17 were retrieved. A HER2/CEN17 ratio of ≥ 2.2 was considered as amplification of
HER2. Scoring was carried out in a standardized way by a group of dedicated pathologists
in routine surgical pathology. A tumor was regarded as being triple-negative if the
estrogen receptor (ER) status was negative, progesterone receptor (PR) status was
negative, and HER2 status was negative. In the present study, “triple-negative” refers
to one subgroup of molecular subtypes of breast cancer, although comprehensive gene
expression profiling was not performed.
Statistical methods
To investigate the predictive value of each single SNP relative to the occurrence
of a TNBC in addition to clinical predictors, a multiple logistic regression model
was fitted for each SNP with TNBC status (yes versus no) as the outcome, and with
the specific SNP (ordinal; 0, 1, or 2 minor alleles) and the clinical predictors age
at diagnosis (continuous) and BMI (continuous) as predictors [7 ]. Patients with missing genetic data or missing outcome data were excluded. Missing
clinical predictors were imputed, as done in [45 ]. Continuous predictors were used as natural cubic spline functions to describe nonlinear
effects [46]. The degrees of freedom (between 1 and 3) of each predictor were calculated as done
in [45 ]. The odds ratio (OR) per minor allele with confidence interval was calculated using
the logistic regression model. For each SNP, a likelihood ratio test comparing the
clinical-genetic logistic regression model with a clinical logistic regression model
containing only the clinical predictors was performed. The p values (one per SNP)
were corrected for multiple testing using the Bonferroni–Holm method.
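The Bonferroni–Holm step-down correction can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the study; the function name `holm_adjust` and the example p values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Bonferroni-Holm adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p values from four likelihood ratio tests.
raw = [0.0003, 0.04, 0.01, 0.20]
adj = holm_adjust(raw)
```

The smallest p value is multiplied by the number of tests, the second smallest by one less, and so on, which is why many corrected values end up capped at 1.00.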
The primary study aim was to identify a set of SNPs that together would improve the
prediction of TNBCs in addition to clinical predictors (age, BMI). Identifying relevant
SNPs among the relatively large number of candidate SNPs was a challenging process,
which can be summed up as follows. The complete dataset was randomly divided into
two parts: one training set with about two-thirds of the patients, and one validation
set with about one-third of the patients. Several SNP selection methods and regression
techniques were applied to the training data to obtain regression models
with selected SNPs and clinical predictors. The models were then compared
with regard to their prediction errors on the validation data.
All but one of the regression techniques considered comprise a bundle of candidate
models characterized by a tuning parameter λ. The optimal λ has to be determined before
a specific prediction model representing the regression technique can be fitted to
predict TNBC. After the degrees of freedom of the continuous clinical predictors had
been determined again by using training data, the following regression techniques
were applied to the training data.
Univariate selection. For each SNP, a logistic regression model with the clinical predictors and the specific
SNP was compared with a logistic regression model with clinical predictors alone,
using a likelihood ratio test. The SNPs were ordered according to increasing p values
for these likelihood ratio tests. The λ top-ranked SNPs were selected and included
in a logistic regression model that also contained the clinical predictors. Here λ,
ranging from 0 to 30, is a tuning parameter representing the number of selected SNPs
[47 ]. When a specific model was applied to the validation data afterwards, generalized
shrinkage after coefficient estimation toward the clinical regression model was used
to improve predictions [48]. The shrinkage factor was obtained from the maximal genetic model with 30 SNPs.
Stepwise selection as described in [49 ]. All of the SNPs were ordered as above. The top 30 ranked SNPs were preselected,
in order to keep the number of SNPs to be analyzed easy to handle. One hundred bootstrap
samples of the same size as the original dataset were drawn with replacement. On each
bootstrap sample, a logistic regression model with the clinical predictors and the
preselected SNPs was set up. A backward stepwise variable selection procedure that
kept all the clinical predictors was carried out to obtain the best model in accordance
with the Akaike information criterion. The retained variables from each bootstrap
sample were recorded, and a final variable selection was made. The most frequently
selected SNPs (> 70%) and – to address correlation among SNPs – representatives of
highly frequent SNP pairs (> 90%) were chosen. Again, generalized shrinkage was incorporated
when the final model was applied.
The least absolute shrinkage and selection operator (lasso) is a regression technique in which the regression coefficients are shrunk towards
zero during estimation [50 ]. The amount of shrinkage is controlled by a tuning parameter λ. Depending on the
value of λ, a number of coefficients reach exactly zero, which means that lasso also
leads to variable selection. In the present study, a regression model was set up with
the clinical predictors and all SNPs. The coefficients of the SNPs, but not the coefficients
of the clinical predictors, were shrunk by variation of λ. A regression model with
maximal shrinkage that has all coefficients of the SNPs equal to 0 corresponds to
the clinical logistic regression model. In contrast to the usual regression models,
lasso can deal with large numbers of predictors.
Component-wise gradient boosting fits a regression model iteratively [51 ], [52 ]. It starts with an empty model without any predictors. In each iteration, the best-performing
predictor is added to the model with a small step size, or its coefficient is updated
if it was included before. More relevant predictors are included earlier than less
relevant ones. The number of boosting iterations, λ, is a tuning parameter that controls
both the variable selection properties of the algorithm and the implied shrinkage
of the coefficients. The incorporation of clinical predictors is less straightforward
than for lasso. A logistic regression model with clinical predictors is fitted. This
fit is taken as the offset for the boosting procedure described above with SNPs as
predictors [53 ].
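The boosting idea described above can be illustrated with a small pure-Python sketch. It is a generic component-wise gradient boosting loop for a logistic model, written under simplifying assumptions: the clinical fit enters as a fixed offset, SNP columns are centered, and each base learner is a one-variable least-squares fit to the negative gradient. The function name, the step size of 0.1, and the toy data are hypothetical and do not reproduce the software used in the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def componentwise_boost(X, y, offset, n_iter, step=0.1):
    """Component-wise gradient boosting with a fixed offset.

    X: list of rows with SNP allele counts, y: 0/1 outcomes,
    offset: linear predictor of the clinical model (one value per patient).
    Returns one coefficient per (centered) column; unselected columns stay 0.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    coef = [0.0] * p
    f = list(offset)  # current linear predictor, starts at the clinical fit
    for _ in range(n_iter):
        # Negative gradient of the binomial log-likelihood loss.
        resid = [y[i] - sigmoid(f[i]) for i in range(n)]
        best_j, best_beta, best_gain = None, 0.0, 0.0
        for j in range(p):
            sxx = sum(Xc[i][j] ** 2 for i in range(n))
            if sxx == 0.0:
                continue
            sxr = sum(Xc[i][j] * resid[i] for i in range(n))
            gain = sxr * sxr / sxx  # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_beta, best_gain = j, sxr / sxx, gain
        if best_j is None:
            break
        # Only the best-fitting predictor is updated, by a small step.
        coef[best_j] += step * best_beta
        for i in range(n):
            f[i] += step * best_beta * Xc[i][best_j]
    return coef


# Toy data: column 0 tracks the outcome, column 1 does not.
X = [[0, 1], [0, 0], [1, 1], [1, 0], [2, 1], [2, 0], [0, 1], [2, 0]]
y = [0, 0, 0, 1, 1, 1, 0, 1]
coef = componentwise_boost(X, y, offset=[0.0] * 8, n_iter=3)
```

Stopping after few iterations leaves the coefficient of the uninformative column at exactly zero, which is how the number of iterations acts as both a selection and a shrinkage parameter.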
The optimal λ for each method was found by 10-fold cross-validation on the training
dataset. For a given value of λ, the prediction model was estimated on nine folds
and then applied on the tenth fold. The mean squared error (MSE) was taken as the
evaluation measure. The MSE is a summary measure of the differences between the observed
TNBC status (either 0 for “no” or 1 for “yes”) of patients in the tenth fold, which
was not used for model building, and the expected probability obtained from the model
(between 0 and 1) for these patients having a TNBC. This procedure was done 10 times,
leaving one fold out at a time, and the average MSE was calculated. The λ value with
the smallest average MSE was regarded as the optimal λ. The whole training set was
finally used to fit a regression model with the optimal λ.
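The tuning procedure can be sketched as follows. This is a minimal illustration assuming a hypothetical callback `fit_predict(train_idx, test_idx, lam)` that fits a model for a given λ on the training indices and returns predicted probabilities for the held-out indices; the toy "model" below simply predicts the constant λ and stands in for a real regression technique.

```python
import random


def mse(observed, predicted):
    """Mean squared error between 0/1 outcomes and predicted probabilities."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_mse(y, fit_predict, lam, k=10, seed=0):
    """Average held-out MSE over k folds for one value of the tuning parameter."""
    errors = []
    for test_idx in kfold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        probs = fit_predict(train_idx, test_idx, lam)
        errors.append(mse([y[i] for i in test_idx], probs))
    return sum(errors) / len(errors)


def choose_lambda(y, fit_predict, candidates, k=10, seed=0):
    """Return the candidate with the smallest cross-validated MSE."""
    return min(candidates, key=lambda lam: cv_mse(y, fit_predict, lam, k, seed))


# Toy example: 13 events among 100 patients, constant predictions.
y = [1] * 13 + [0] * 87


def constant_model(train_idx, test_idx, lam):
    return [lam] * len(test_idx)


best = choose_lambda(y, constant_model, [0.05, 0.13, 0.50])
```

For a constant prediction equal to a prevalence of 13%, the MSE is 0.13 × 0.87 = 0.1131, which is close to the value reported for the null model in Table 3.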
The procedures described above resulted in four clinical-genetic regression models
for predicting TNBC. In addition, two benchmark models – a logistic null model without
any predictors and a clinical logistic regression model with clinical predictors but
without any SNPs – were fitted on the training data. A useful clinical model should
perform better than the null model, whereas a useful prediction model with clinical
and genetic predictors should perform better than the clinical model without further
predictors. These six models were evaluated on the validation dataset to measure their
performance in new patients. Again, the MSE was taken as a performance criterion.
To obtain further insight into the accuracy of the prediction, the performance improvement
of the four clinical-genetic models in comparison with the clinical model was assessed
on validation data using the continuous net reclassification improvement (NRI). Roughly
speaking, the continuous NRI is the proportion of patients with TNBC or without TNBC,
respectively, who are correctly given a higher or lower predicted probability of TNBC
by the clinical-genetic regression model than by the clinical model, corrected by
wrongly assigned lower or higher probabilities [54 ].
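The continuous NRI can be written down compactly. The sketch below follows the usual definition (upward minus downward moves among events, plus downward minus upward moves among non-events); the function name and the example probabilities are hypothetical.

```python
def continuous_nri(y, p_old, p_new):
    """Continuous net reclassification improvement of p_new over p_old."""
    events = [i for i, yi in enumerate(y) if yi == 1]
    nonevents = [i for i, yi in enumerate(y) if yi == 0]
    # Proportions of events moved up or down by the new model.
    up_e = sum(p_new[i] > p_old[i] for i in events) / len(events)
    down_e = sum(p_new[i] < p_old[i] for i in events) / len(events)
    # Proportions of non-events moved down or up by the new model.
    down_n = sum(p_new[i] < p_old[i] for i in nonevents) / len(nonevents)
    up_n = sum(p_new[i] > p_old[i] for i in nonevents) / len(nonevents)
    return {
        "correct_up": up_e,
        "correct_down": down_n,
        "nri": (up_e - down_e) + (down_n - up_n),
    }


# Hypothetical example: two TNBC patients, four non-TNBC patients.
y = [1, 1, 0, 0, 0, 0]
p_clinical = [0.2, 0.2, 0.2, 0.2, 0.2, 0.2]
p_genetic = [0.3, 0.1, 0.1, 0.1, 0.3, 0.2]
result = continuous_nri(y, p_clinical, p_genetic)
```

If no predicted probabilities are tied, the boosting row of Table 3 gives (55.4 − 44.6) + (53.3 − 46.7) = 17.4 percentage points under this definition, close to the reported NRI of 17.3.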
In clinical practice, a prediction model for TNBC might support treatment decision-making
based on a threshold for the predicted probability of TNBC that classifies a patient
as a “high-risk” patient or “low-risk” patient. The ability to distinguish between
patients with and without TNBC was measured on validation data using the receiver
operating characteristic (ROC) curve and the area under the ROC curve (AUC). The AUC estimates
the probability that, given two patients, one with TNBC and the other without TNBC,
the prediction model will assign the higher predicted probability of TNBC to the patient who actually has TNBC.
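The pairwise interpretation of the AUC can be made concrete with a short sketch. It compares every patient with TNBC against every patient without TNBC and counts ties as one half; the function name and the example values are hypothetical.

```python
def auc(y, p):
    """AUC as the proportion of correctly ordered event/non-event pairs."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = 0.0
    for a in pos:
        for b in neg:
            if a > b:
                wins += 1.0   # event ranked above non-event
            elif a == b:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities for two events and three non-events.
y = [1, 1, 0, 0, 0]
p = [0.4, 0.1, 0.2, 0.1, 0.05]
value = auc(y, p)
```

A model that assigns the same probability to every patient produces only ties and therefore an AUC of 0.5, which matches the null model in Table 3.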
To overcome the drawbacks of only splitting the data into training and validation
sets once, the dataset was divided several times into training and validation sets
and the procedure was repeated as described above each time [47 ], [55 ]. More precisely, 3-fold cross-validation with 100 repetitions was done. For each
regression technique for predicting TNBC, the average value of the 300 MSEs of the
corresponding regression models was taken as a final evaluation criterion, and the
average AUC and average NRI were used as further criteria. The regression technique
with the smallest average MSE is regarded as the best method (the “winner” method)
for predicting TNBC.
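The repeated splitting scheme can be sketched as a generator of training and validation index sets. The function name and the seed are hypothetical; the sample size of 1027 is the final sample size reported in the Results.

```python
import random


def repeated_kfold(n, k=3, repeats=100, seed=1):
    """Yield (train, valid) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for j in range(k):
            valid = folds[j]
            # All folds except the j-th form the training set (about two-thirds).
            train = [i for m, fold in enumerate(folds) if m != j for i in fold]
            yield train, valid


splits = list(repeated_kfold(1027))
```

With three folds and 100 repetitions this produces 300 training/validation pairs, one per MSE value that enters the average.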
The best prediction method was applied to the whole dataset to obtain the final prediction
model for TNBC. This was done by repeating all of the model-building steps, this time
not on the training data, but on the complete dataset. That is, cubic spline functions
and the tuning parameter λ were determined as described above and a corresponding
regression model was fitted on the complete dataset. A TNBC prediction score on a
scale from 0 to 100, representing the probability of a TNBC, was derived from the
final prediction model by taking the inverse logit of the linear combination of predictor
values and regression coefficients. The performance of the final model on the complete
dataset in terms of discrimination and calibration was measured using the AUC and
the Hosmer–Lemeshow statistic (scatterplot and χ² test) comparing predicted and observed TNBC events, as done recently in [56]. A large p value indicates satisfactory calibration.
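The Hosmer–Lemeshow comparison of observed and predicted events can be sketched as follows. This illustration computes only the test statistic and the per-group table; the p value would be taken from a χ² distribution (conventionally with the number of groups minus two degrees of freedom), which is omitted here to keep the sketch free of external libraries. The function name and the example data are hypothetical.

```python
def hosmer_lemeshow(y, p, groups=10):
    """Return the Hosmer-Lemeshow statistic and (observed, expected) per group."""
    order = sorted(range(len(y)), key=lambda i: p[i])
    n = len(order)
    stat = 0.0
    table = []
    for g in range(groups):
        # Consecutive slices of the patients sorted by predicted probability.
        members = order[g * n // groups:(g + 1) * n // groups]
        if not members:
            continue
        observed = sum(y[i] for i in members)
        expected = sum(p[i] for i in members)
        mean_p = expected / len(members)
        if 0.0 < mean_p < 1.0:
            stat += (observed - expected) ** 2 / (len(members) * mean_p * (1.0 - mean_p))
        table.append((observed, expected))
    return stat, table


# Hypothetical example with four patients in two groups.
stat, table = hosmer_lemeshow([0, 0, 1, 0], [0.1, 0.1, 0.5, 0.5], groups=2)
```

The per-group pairs of observed and expected events are the quantities that a calibration scatterplot displays.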
All of the tests were two-sided, and a p value < 0.05 was considered statistically
significant. Calculations were carried out using the R system for statistical computing
(version 3.0.1; R Core Team, Vienna, Austria, 2013).
Results
Patients and SNPs
A total of 2234 patients were recruited into the BBCC during the specified period.
A subset of 1868 patients took part in genetic BBCC research projects. Of these, sufficient
DNA was available from 1743 patients. A further 472 patients with incomplete hormone
receptor and HER2 status information were excluded, resulting in 1271 remaining patients.
Twenty-seven out of 102 SNPs were excluded after genotype quality control: 24 SNPs
because of unreliable MALDI spectra and three SNPs because of departure from HWE (Supplementary
Table S1 ). Due to missing values, the following SNPs were excluded: rs1550623 (17.0% missing
values out of 1271), rs3903072 (9.9%), rs2380205 (9.6%), rs17817449 (7.4%), rs2236007
(7.2%), rs3803662 (5.0%), rs9790517 (5.0%), and rs2046210 (5.0%). All patients had
age information, and missing BMI values (4.2%) were imputed. The final sample size
was 1027 patients, after 244 patients with incomplete genetic information had been
excluded. Patient characteristics are shown in [Table 1 ].
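Table 1 Patient characteristics.

| Characteristic | Unit or category | Mean or count | SD or % |
|---|---|---|---|
| Age at diagnosis | Years | 57.2 | 11.7 |
| BMI | kg/m² | 26.2 | 4.8 |
| ER | Negative | 231 | 22.5 |
| | Positive | 796 | 77.5 |
| PR | Negative | 288 | 28.0 |
| | Positive | 739 | 72.0 |
| HER2 | Negative | 877 | 85.4 |
| | Positive | 150 | 14.6 |
| Triple-negative | No | 893 | 87.0 |
| | Yes | 134 | 13.0 |

BMI, body mass index; ER, estrogen receptor; PR, progesterone receptor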
Univariate SNP and TNBC association
The clinical predictors age at diagnosis and BMI, used as adjustment variables, fitted
best as cubic spline functions with 2 and 1 degrees of freedom, respectively – i.e.,
age was used nonlinearly and BMI was used linearly. Twenty SNPs with the smallest
p values in the univariate analyses are shown in [Table 2]. rs10069690 (TERT, CLPTM1L) was the only significant SNP (corrected p = 0.02) after correction of p values for
multiple testing. The corrected p values for rs2981579 (FGFR2), rs7726159 (TERT), rs2588809 (RAD51B), and rs78540526 (CCND1) were 0.18, 0.36, 0.81, and 0.93, respectively; the other corrected p values were 1.00.
Table 2 Univariate associations with triple-negative breast cancer (TNBC) for the 20 SNPs
with the lowest p values.

| SNP | Chromosome | Nearest genes | MAF | OR (95% CI)¹ | p value² |
|---|---|---|---|---|---|
| rs10069690 | 5 | TERT, CLPTM1L | 0.249 | 1.66 (1.27, 2.18) | < 0.001 |
| rs2981579 | 10 | FGFR2 | 0.484 | 0.66 (0.51, 0.87) | < 0.01 |
| rs7726159 | 5 | TERT | 0.358 | 1.46 (1.12, 1.91) | < 0.01 |
| rs2588809 | 14 | RAD51B | 0.174 | 0.62 (0.42, 0.92) | 0.02 |
| rs78540526 | 11 | CCND1 | 0.104 | 0.55 (0.32, 0.92) | 0.02 |
| rs11820646 | 11 | – | 0.389 | 1.38 (1.06, 1.80) | 0.02 |
| rs2981582 | 10 | FGFR2 | 0.451 | 0.73 (0.55, 0.95) | 0.02 |
| rs3760982 | 19 | KCNN4, ZNF283 | 0.485 | 0.77 (0.59, 1.01) | 0.06 |
| rs2363956 | 19 | MERIT40 | 0.488 | 0.78 (0.60, 1.01) | 0.06 |
| rs1436904 | 18 | CHST9 | 0.383 | 1.27 (0.98, 1.65) | 0.07 |
| rs6001930 | 22 | MKL1 | 0.127 | 0.68 (0.43, 1.06) | 0.09 |
| rs12422552 | 12 | ATF7IP, GRIN2B | 0.295 | 0.77 (0.57, 1.04) | 0.09 |
| rs8170 | 19 | MERIT40 | 0.191 | 1.31 (0.97, 1.78) | 0.08 |
| rs941764 | 14 | CCDC88C | 0.354 | 0.79 (0.59, 1.04) | 0.09 |
| rs11075995 | 16 | FTO | 0.264 | 1.29 (0.96, 1.72) | 0.09 |
| rs12710696 | 2 | – | 0.357 | 1.26 (0.96, 1.66) | 0.09 |
| rs11365234 | 7 | AKAP9 | 0.392 | 1.24 (0.96, 1.61) | 0.10 |
| rs2823093 | 21 | NRIP1 | 0.275 | 1.26 (0.96, 1.67) | 0.10 |
| rs4666275 | 2 | ALK | 0.060 | 1.50 (0.92, 2.46) | 0.11 |
| rs75915166 | 11 | CCND1 | 0.082 | 0.67 (0.40, 1.14) | 0.14 |

SNP, single nucleotide polymorphism; MAF, minor allele frequency
¹ Odds ratio (OR) per minor allele, adjusted for age and body mass index, with 95%
confidence interval (CI) and corresponding p value, obtained from the multiple logistic
regression model.
² Uncorrected p values. The corrected p values for the top five SNPs were 0.02, 0.18,
0.36, 0.81, and 0.93. All other corrected p values were 1.00.
Clinical-genetic TNBC prediction
Boosting turned out to be the most accurate prediction method, and had a slightly
smaller cross-validated prediction error MSE than the lasso ([Table 3 ]). Lasso and boosting performed better than the clinical prediction model without
genetic predictors, whereas univariate selection performed similarly and stepwise
selection performed less well. These results were confirmed by AUC statistics: Boosting
was also superior with regard to distinguishing between TNBC patients and non-TNBC
patients. Lasso and univariate selection performed better than the clinical model,
and stepwise selection less well.
Table 3 Prediction of triple-negative tumor¹.

| Model | MSE | Reclassification (%): NRI | Reclassification (%): correctly up | Reclassification (%): correctly down | AUC | Selected SNPs |
|---|---|---|---|---|---|---|
| Null² | 0.1137 (0.0109) | – | – | – | 0.500 (0.000) | – |
| Clinical³ | 0.1098 (0.0104) | – | – | – | 0.618 (0.036) | – |
| Univariate selection⁴ | 0.1098 (0.0107) | 9.0 (12.2) | 29.9 (25.2) | 35.3 (29.1) | 0.620 (0.038) | 2.2 (2.9) |
| Stepwise selection⁴ | 0.1108 (0.0108) | 13.8 (13.9) | 46.0 (7.5) | 60.6 (5.4) | 0.614 (0.037) | 8.1 (2.5) |
| Lasso⁴ | 0.1096 (0.0103) | 12.5 (16.7) | 49.1 (19.9) | 57.1 (16.9) | 0.622 (0.039) | 9.1 (7.5) |
| Boosting⁴ | 0.1095 (0.0103) | 17.3 (13.8) | 55.4 (9.4) | 53.3 (8.5) | 0.625 (0.037) | 8.2 (7.2) |

AUC, area under the curve; MSE, mean squared error; NRI, net reclassification improvement;
SNP, single nucleotide polymorphism
¹ Summary statistics (mean and standard deviation) for MSE, NRI, and AUC, obtained
from (logistic) regression models, as well as the number of selected SNPs are shown.
All measures were obtained by 3-fold cross-validation with 100 repetitions.
² Logistic regression model without any predictors.
³ Logistic regression model with clinical predictors (age and body mass index), but
without any genetic predictors.
⁴ Regression model with clinical predictors and selected SNPs.
Boosting correctly increased the predicted probabilities of TNBC for the majority
of patients with a TNBC (“correct reclassification upwards” in [Table 3 ]) and correctly decreased the predicted probabilities of TNBC for the majority of
patients without TNBC (“correct reclassification downwards” in [Table 3 ]). Lasso did these correct increases and decreases for about half of the TNBC patients
and the majority of the non-TNBC patients. Univariate selection correctly increased
and decreased prediction probabilities only for a minority of patients. With regard
to correct reclassifications, stepwise selection performed much better than univariate
selection. In total, the reclassification improvement of the boosting model was superior
to all other methods (“NRI” in [Table 3 ]).
The average number of selected SNPs across the 300 training samples was similar for boosting,
lasso, and stepwise selection and smaller for univariate selection. The number of SNPs
varied relatively strongly for lasso and boosting and only weakly for stepwise selection
and univariate selection ([Table 3]).
During cross-validation, univariate tests were performed on each training set and
SNPs were ordered according to their p values. The SNP that most often ranked first was rs10069690,
which took the top position 158 times (52.7%). The SNPs that ranked first next most often were rs2981579
(17.7%), rs78540526 (5.7%), rs2588809 (4.3%), and rs7726159 (4.3%). In total, 24 SNPs
were ranked first at least once.
A boosting prediction model, the “winner” in the method comparison, was fitted on
the complete dataset. Four SNPs were selected: rs10069690 (TERT, CLPTM1L), rs2981579 (FGFR2), rs2588809 (RAD51B), and rs78540526 (CCND1). All of these were among the top five SNPs in the univariate analysis. Age was the
strongest predictor, stronger than any genetic predictors. The predicted probability
for TNBC as a continuous function of age is shown in [Fig. 1 ]. The likelihood of TNBC decreases with increasing age up to about 60 years and remains
constantly low thereafter. All regression coefficients are shown in [Table 4 ]. The coefficients of the predictor age were approximated using a cubic polynomial,
as cubic spline functions are difficult to use. Apart from age, positive coefficients
are associated with an increased likelihood of TNBC. An ideal “genetically high-risk
patient” can thus be defined as a patient with two minor rs10069690 alleles and
two common alleles at each of the other SNPs, while an ideal “genetically low-risk patient”
is a patient with two common rs10069690 alleles and two minor alleles at each of the other SNPs.
The footnote in [Table 4 ] states how the predicted probability of TNBC can be calculated using the predictor
values given.
Table 4 The final clinical-genetic prediction model for triple-negative breast cancer¹.

| Predictor | Unit | Coefficient |
|---|---|---|
| Intercept | | 3.5589 |
| Age at diagnosis | Year | −0.1624 |
| | Year² | 0.0009372 |
| | Year³ | 0.000001951 |
| BMI | Per kg/m² | 0.005691 |
| rs10069690 | Minor allele | 0.19926 |
| rs2981579 | Minor allele | −0.09108 |
| rs2588809 | Minor allele | −0.02625 |
| rs78540526 | Minor allele | −0.03166 |

¹ For example, the predicted probability for a 50-year-old patient with a body mass
index of 26 and 1, 2, 1, 0 minor alleles of rs10069690, rs2981579, rs2588809, and
rs78540526, respectively, is exp(z)/(1 + exp(z)) at z = 3.5589 + 50 × (−0.1624) + 50² × 0.0009372 + 50³ × 0.000001951 + 26 × 0.005691 + 1 × 0.19926 + 2 × (−0.09108) + 1 × (−0.02625) + 0 × (−0.03166) = −1.8354.
That is, 13.8%.
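The worked example in the footnote of Table 4 can be reproduced with a few lines of Python. The coefficients are copied from Table 4; the dictionary keys and the function name `tnbc_score` are hypothetical labels chosen for this sketch.

```python
import math

# Coefficients as reported in Table 4.
COEF = {
    "intercept": 3.5589,
    "age": -0.1624,
    "age2": 0.0009372,
    "age3": 0.000001951,
    "bmi": 0.005691,
    "rs10069690": 0.19926,
    "rs2981579": -0.09108,
    "rs2588809": -0.02625,
    "rs78540526": -0.03166,
}
SNPS = ["rs10069690", "rs2981579", "rs2588809", "rs78540526"]


def tnbc_score(age, bmi, minor_alleles):
    """Return (z, score) where score is the predicted TNBC probability in %."""
    z = (COEF["intercept"]
         + COEF["age"] * age
         + COEF["age2"] * age ** 2
         + COEF["age3"] * age ** 3
         + COEF["bmi"] * bmi)
    z += sum(COEF[snp] * minor_alleles[snp] for snp in SNPS)
    # Inverse logit, rescaled to 0-100.
    return z, 100.0 / (1.0 + math.exp(-z))


# The patient from the footnote: age 50, BMI 26, allele counts 1, 2, 1, 0.
z, score = tnbc_score(50, 26, {"rs10069690": 1, "rs2981579": 2,
                               "rs2588809": 1, "rs78540526": 0})
```

For this patient the sketch gives z ≈ −1.8354 and a score of about 13.8, in agreement with the footnote.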
Fig. 1 The predicted probability for triple-negative breast cancer (TNBC) as a continuous
function of age at diagnosis. The curves were generated using the boosting model fitted
on the complete dataset. The black curve predicts the TNBC risk of a genetically “average”
woman with a median body mass index. The blue and the orange curves show the predicted
risk for patients with genetically maximally increased and maximally decreased risks.
The boosting model was well calibrated. The difference between actual and predicted
events was quite low ([Fig. 2 ]; p = 0.73, Hosmer–Lemeshow test). The apparent AUC – i.e., the AUC on the complete
dataset – was 0.668, which is 0.043 units larger than the cross-validated AUC value.
This indicates that the prediction model was slightly overfitted. For comparison,
the apparent AUC of the clinical model was 0.632 – i.e., 0.014 units larger than its
cross-validated value.
Fig. 2 The observed and predicted frequencies of triple-negative breast cancer (TNBC). The
patients were sorted according to the predicted probability for TNBC using the boosting
prediction model and grouped into ten categories based on percentiles. The number
of actually observed TNBCs (“observed events”) in each category and the sum of predicted
probabilities for TNBC (“predicted events”) in each category are shown. Points below
the gray line indicate when the model is overestimating the likelihood of TNBC; points
above the gray line indicate when the model is underestimating the likelihood. A perfect
prediction model would show all of the points on the gray line.
To demonstrate a possible future application of the final prediction model, various
cut-off points for the TNBC risk between 0 and 100% were defined – e.g., 12%. Patients
were classified as “low-risk” if the prediction model assigned a TNBC risk below 12%.
Otherwise they were classified as “high-risk.” The sensitivity (i.e., the proportion
of patients classified as “high-risk” among true TNBC patients) and the specificity
(i.e., the proportion of patients classified as “low-risk” among true non-TNBC patients)
are presented in [Table 5 ], and compared with the clinical model. The sensitivities were almost equal for cut-off
points up to 12%. Thereafter, the sensitivities of the boosting model were larger.
For instance, if a physician decides to screen patients with a risk of TNBC of more
than 15% for biomarkers that are important for TNBC patients, without yet knowing
their receptor status, then 43% of all TNBCs will be detected with the boosting model,
in comparison with 38% with the clinical model. The rate of false-positive classifications
would be 24%, two percentage points more than when using the clinical prediction model.
The ROC curves for all possible cut-off points are shown in [Fig. 3 ].
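The classification rule can be sketched as follows, assuming that a patient is “high-risk” when the predicted probability is at or above the cut-off point. The function name and the example data are hypothetical.

```python
def sens_spec(y, p, cutoff):
    """Sensitivity and specificity of the rule 'high-risk if p >= cutoff'."""
    tp = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi >= cutoff)
    fn = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi < cutoff)
    tn = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi < cutoff)
    fp = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predictions for two TNBC and three non-TNBC patients.
y = [1, 1, 0, 0, 0]
p = [0.20, 0.10, 0.16, 0.05, 0.12]
sensitivity, specificity = sens_spec(y, p, cutoff=0.15)
```

The false-positive rate quoted in the text is one minus the specificity, for example 1 − 0.76 = 0.24 for the clinical-genetic model at the 15% cut-off point.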
Table 5 Sensitivity and specificity for the clinical prediction model and clinical-genetic
boosting prediction model¹.

| Cut-off point for predicting triple-negative tumor (%)² | Frequency above cut-off point (%)³ | Sensitivity: clinical model | Sensitivity: clinical-genetic model | Specificity: clinical model | Specificity: clinical-genetic model |
|---|---|---|---|---|---|
| 10 | 56.3 | 0.68 | 0.69 | 0.47 | 0.44 |
| 12 | 39.0 | 0.53 | 0.57 | 0.65 | 0.61 |
| 15 | 23.8 | 0.38 | 0.43 | 0.78 | 0.76 |
| 20 | 13.0 | 0.26 | 0.29 | 0.89 | 0.87 |
| 25 | 7.3 | 0.17 | 0.20 | 0.95 | 0.93 |

¹ All measurements were obtained by 3-fold cross-validation with 100 repetitions.
² Patients were classified into a “high-risk” group if the prediction model assigned
a triple-negative tumor probability above the cut-off point. Sensitivity (between
0 and 1) is defined as the proportion of “high-risk” patients among TNBC patients.
Specificity (between 0 and 1) is defined as the proportion of “low-risk” patients
among non-TNBC patients.
³ The proportion of patients classified as “high-risk” in the total study population,
using the clinical-genetic prediction model.
Fig. 3 Cross-validated receiver operating characteristic (ROC) curve for the clinical and
clinical-genetic boosting prediction models.
Discussion
The study shows that prediction of TNBC can be improved if breast cancer risk SNPs
are added to a prediction rule based on age at diagnosis and BMI. Age at diagnosis
turned out to be the strongest predictor, stronger than any genetic influencing factors.
The final prediction model included four SNPs from the genes RAD51B, TERT, CCND1, and FGFR2. Only one of these was statistically significant in the univariate SNP and TNBC association
tests, but all of them belong to the top five SNPs with the lowest p values. Although
the selection procedure did not consider any external biological information, there
might be biological reasons why these SNPs taken together improve prediction.
rs10069690 (TERT) has been described as being associated with estrogen receptor-negative and triple-negative
breast cancer, serous ovarian cancer, breast and ovarian cancer risk in BRCA1 mutation carriers, as well as prostate cancer – implying that there are similar pathways
of pathogenesis in these different types of cancer [13 ], [15 ], [30 ], [33 ], [57 ]. Fine mapping analyses of this region revealed a function for telomere stability
[30 ], [57 ]. rs2981579 (FGFR2) has been clearly described as an SNP that specifically increases the risk for hormone
receptor-positive breast cancer [21 ], [58 ], [59 ]. Its role in hormone receptor signaling has been linked to FOXA1.
rs2588809 (RAD51B) is associated with triple-negative breast cancer [13 ], [15 ]. RAD51B, RAD51C, and RAD51D are RAD51 paralogues that form complexes with one another [60 ], [61 ] and have a function in homologous recombination. SNPs in RAD51B are associated with breast cancer in men [62 ], prostate cancer risk [63 ], and an increased risk of breast and ovarian cancer in BRCA1 mutation carriers [64 ]. In vitro experiments have shown that reducing RAD51B by RNA silencing increases chemosensitivity and reduces the efficiency of homologous
recombination in breast cancer cells, with differences depending on subtype [65 ].
rs78540526 (CCND1) is located in a gene region that maps to a putative enhancer of CCND1. It is clearly associated only with hormone receptor-positive breast cancer risk
[25 ], [66 ], [67 ] and is therefore a reasonable marker for predicting hormone-receptor negativity
and triple negativity. Functionally, CCND1 expression levels have been shown to differ between haplotypes in this
enhancer region [25 ]. This is of special interest, as CCND1 expression and/or amplification have been under discussion as a biomarker for the
efficacy of CDK4/6 inhibitors [68 ].
In genetic prediction studies, it can be expected that the ranking of the SNPs will
differ, and the set of SNPs selected for prediction will also differ, if the experiment
is repeated on a different group of patients with the same clinical characteristics.
This also holds if analyses are performed on subsets of patients within one study
[69 ]. In the present study, for instance, the top-ranked SNP in the complete dataset
was not ranked top in about 50% of all subsamples, and the sets of selected SNPs varied strongly.
Correlations among SNPs, and SNPs with weak individual associations with the outcome
but stronger power as a group, may contribute to fluctuation in SNP selection. To obtain
stable, reliable results that are independent of a randomly chosen patient subset,
all decisions (e.g., the choice of tuning parameter for model specification and comparison
of model performances) were based on repeated sampling.
Double cross-validation was carried out, with an inner loop to specify the prediction
model and an outer loop to compute model performance measures, in order to ensure
that all model-building steps were performed completely independently of the validation
step [55 ], [70 ]. That is, all reported measures were based on data that were not used for model
building. Otherwise, the measures would have been overoptimistic. Schild et al. [71 ] provide an example of double (cross-)validation being applied in a gynecological
study.
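The structure of double cross-validation can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the toy model (group-specific event rates shrunk toward the overall rate by a tuning parameter `lam`), the function names, and the fold counts are all hypothetical stand-ins for the boosting and lasso models. The point is only that the tuning parameter is chosen in the inner loop using training data, and performance is measured in the outer loop on data not used for model building.

```python
import random


def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit(x, y, lam):
    """Toy model (assumption): event rate per group of x, shrunk toward the overall rate."""
    overall = sum(y) / len(y)
    rates = {}
    for g in set(x):
        ys = [yi for xi, yi in zip(x, y) if xi == g]
        rates[g] = (1 - lam) * (sum(ys) / len(ys)) + lam * overall
    return rates, overall


def mse(model, x, y):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    rates, overall = model
    return sum((rates.get(xi, overall) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)


def split(x, y, test):
    """Return training and test parts of (x, y) for one held-out fold."""
    held_out = set(test)
    train = [i for i in range(len(y)) if i not in held_out]
    return ([x[i] for i in train], [y[i] for i in train],
            [x[i] for i in test], [y[i] for i in test])


def cv_mse(x, y, lam, k, rng):
    """Plain k-fold cross-validated MSE for a single tuning value."""
    errors = []
    for test in k_fold_indices(len(y), k, rng):
        x_tr, y_tr, x_te, y_te = split(x, y, test)
        errors.append(mse(fit(x_tr, y_tr, lam), x_te, y_te))
    return sum(errors) / k


def double_cv(x, y, lambdas, k_outer=3, k_inner=3, seed=0):
    """Outer loop estimates performance; inner loop selects the tuning parameter."""
    rng = random.Random(seed)
    outer_errors = []
    for test in k_fold_indices(len(y), k_outer, rng):
        x_tr, y_tr, x_te, y_te = split(x, y, test)
        # Inner loop: the held-out outer fold is never seen during tuning.
        best = min(lambdas, key=lambda lam: cv_mse(x_tr, y_tr, lam, k_inner, rng))
        outer_errors.append(mse(fit(x_tr, y_tr, best), x_te, y_te))
    return sum(outer_errors) / k_outer


# Hypothetical simulated data (illustration only)
_rng = random.Random(1)
demo_x = [_rng.randrange(3) for _ in range(90)]
demo_y = [1 if _rng.random() < 0.1 + 0.2 * xi else 0 for xi in demo_x]
demo_error = double_cv(demo_x, demo_y, lambdas=[0.0, 0.5, 1.0])
```

Repeating the whole procedure with different seeds and averaging the results would correspond to the repeated sampling mentioned above.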
The SNP selection process was carried out following a prespecified plan. Univariate
selection is a simple method that does not take correlations among SNPs into account.
It is known to perform less well in general than more sophisticated methods such as
lasso and boosting [47 ], a result that was confirmed in this study. Lasso and boosting performed similarly,
although the model fitting was rather different. However, the two methods share the
common feature that variable selection is a continuous process that leads to “weakly”
selected SNPs in addition to strong predictors. The result in the present study showing
that boosting had a slightly better prediction accuracy is consistent with a recently
published methodological study comparing boosting and lasso on simulated datasets
[72 ]. Bootstrap-based stepwise selection, a method that our group has previously applied
successfully to nongenetic data (e.g., [45 ], [73 ], [74 ]), performed less well than lasso and boosting. This might be because the parameters
for variable selection were kept fixed, in contrast to the varying tuning parameters
of the other methods. Since repetitive stepwise selection is itself relatively elaborate,
it would have been computationally demanding if the number of selection processes
had been further increased.
The added value provided by breast cancer SNPs to a clinical prediction model was
assessed using the overall performance measures MSE and AUC. The advantage of such
overall measures is that prediction models can easily be compared. The disadvantage
is that they may be insensitive to improvements in the model performance
when new predictors are added to a model that has already included important predictors
[75 ], [76 ]. For example, in [77 ], the addition of a significant biomarker score to a set of standard risk factors
increased the AUC only from 0.76 to 0.77, an increase that is similar to that in the
present study. Because of this, alternative methods of quantifying the improvement,
such as the NRI, have been developed [78 ].
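For illustration, the two overall measures can be computed as in the following minimal sketch. The function names and the example predictions are hypothetical and are not study results. The AUC is computed here as the proportion of (TNBC, non-TNBC) patient pairs in which the TNBC patient received the higher predicted risk, with ties counted as one half.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_squared_error(scores, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((s - (1 if l else 0)) ** 2
               for s, l in zip(scores, labels)) / len(scores)


# Hypothetical predictions for five patients (illustration only)
labels = [1, 1, 0, 0, 0]
clinical = [0.30, 0.10, 0.20, 0.10, 0.05]
clinical_genetic = [0.30, 0.15, 0.20, 0.10, 0.05]

auc_clinical = auc(clinical, labels)                  # 4.5 of 6 pairs ordered correctly
auc_clinical_genetic = auc(clinical_genetic, labels)  # 5 of 6 pairs ordered correctly
```

In this toy example, a small shift in one predicted risk changes only a single pairwise ranking, which shows why the AUC can move little even when an added predictor carries real information.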
In the future, germline genetic testing of SNPs from blood could be carried out in
clinical routine work on the same day and at reasonable cost [10 ], particularly if only a few SNPs are involved that can therefore be genotyped using
polymerase chain reaction. This would mean that the data would be available long before
the processing of tissue, which has to be embedded, cut, and examined by a pathologist
along with the relevant molecular tests. Using this genetic information, screening
for specific TNBC studies with elaborate biomarker assessment could be initiated
at an early time point for patients with an increased likelihood of TNBC, particularly
when biomarker assessment for all patients would be too expensive and waiting for
results from pathology would delay biomarker assessment and the patient's
entry into a study.
The present study also aimed to demonstrate ways of managing the abundance of data
available in the era of “big data” and easy access to a variety of data, in order
to make it feasible to use large data volumes for clinical purposes. It can be anticipated
that it will also become possible to add the analysis of other markers, such as circulating
tumor DNA, in order to increase the accuracy of molecular subgroup prediction. However,
that will be a task for future research.
This study has some limitations. First of all, it needs to be borne in mind that the
study was conducted in a population consisting only of breast cancer patients. It
did not serve to identify SNPs capable of predicting the risk for triple-negative
breast cancer in healthy women – e.g., using a case–control study design. As the study
was intended to differentiate between triple-negative patients and non–triple-negative
ones, it might have been more useful to examine SNPs differentiating between molecular
subtypes rather than SNPs for breast cancer risk. Another limitation is the sample
size: with just over 1000 patients, it was rather small, and the
findings will require validation in other independent populations.
In conclusion, the ability to predict triple-negative tumors can be improved for breast
cancer patients if breast cancer risk SNPs are added to a prediction rule based on
age at diagnosis and BMI. This finding could be used for prescreening purposes in
complicated molecular therapy studies for triple-negative breast cancer. The advanced
statistical procedures used in this study follow a prespecified, systematic plan and
are described with sufficient generality to be easily adaptable for other research
purposes.
Acknowledgement
The authors are grateful to Michael Robertson for professional medical editing services.