Abstract
Background Temporal dataset shift can degrade model performance as discrepancies between training and deployment data grow over time. Our primary objective was to
determine whether parsimonious models produced by specific feature selection methods
are more robust to temporal dataset shift as measured by out-of-distribution (OOD)
performance, while maintaining in-distribution (ID) performance.
Methods Our dataset consisted of intensive care unit patients from MIMIC-IV categorized by
year groups (2008–2010, 2011–2013, 2014–2016, and 2017–2019). We trained baseline models with L2-regularized logistic regression on 2008–2010 data to predict in-hospital mortality, long length of stay (LOS), sepsis, and invasive ventilation, and evaluated them in all year groups. We evaluated three feature selection methods: L1-regularized logistic regression
(L1), Remove and Retrain (ROAR), and causal feature selection. We assessed whether
a feature selection method could maintain ID performance (2008–2010) and improve OOD
performance (2017–2019). We also assessed whether parsimonious models retrained on
OOD data performed as well as oracle models trained on all features in the OOD year
group.
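The paper does not specify an implementation; the following is a minimal sketch of this workflow for a single task, assuming scikit-learn and using synthetic stand-in data in place of MIMIC-IV (which requires credentialed access). Only the L1 method is shown; ROAR and causal feature selection would slot in at the feature selection step.

```python
# Minimal sketch of the abstract's modeling workflow. Assumptions: scikit-learn
# (the paper does not name an implementation); synthetic data stands in for the
# MIMIC-IV year groups; one binary task (e.g., in-hospital mortality) is shown.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-ins for the ID (2008-2010) and OOD (2017-2019) year groups.
X_id, y_id = make_classification(n_samples=2000, n_features=100, random_state=0)
X_ood, y_ood = make_classification(n_samples=2000, n_features=100, random_state=1)

# Baseline: L2-regularized logistic regression trained on the ID year group.
baseline = LogisticRegression(penalty="l2", max_iter=1000).fit(X_id, y_id)

# One of the three feature selection methods (L1): keep features with nonzero
# coefficients, then refit a parsimonious L2 model on that subset.
l1 = LogisticRegression(penalty="l1", solver="liblinear").fit(X_id, y_id)
selected = np.flatnonzero(l1.coef_.ravel())
sparse = LogisticRegression(penalty="l2", max_iter=1000).fit(X_id[:, selected], y_id)

# Compare ID vs. OOD discrimination (the study also assesses calibration;
# ID evaluation would use a held-out split in practice).
for name, model, feats in [("baseline", baseline, slice(None)),
                           ("L1-sparse", sparse, selected)]:
    auc_id = roc_auc_score(y_id, model.predict_proba(X_id[:, feats])[:, 1])
    auc_ood = roc_auc_score(y_ood, model.predict_proba(X_ood[:, feats])[:, 1])
    print(f"{name}: ID AUROC={auc_id:.3f}  OOD AUROC={auc_ood:.3f}")

# Retraining check: refit the parsimonious model on OOD data using the
# ID-selected features, and compare to an all-feature "oracle" on OOD data.
retrained = LogisticRegression(penalty="l2", max_iter=1000).fit(X_ood[:, selected], y_ood)
oracle = LogisticRegression(penalty="l2", max_iter=1000).fit(X_ood, y_ood)
print("retrained OOD AUROC:",
      roc_auc_score(y_ood, retrained.predict_proba(X_ood[:, selected])[:, 1]))
print("oracle OOD AUROC   :",
      roc_auc_score(y_ood, oracle.predict_proba(X_ood)[:, 1]))
```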
Results The baseline model showed significantly worse OOD performance on the long LOS and sepsis tasks compared with its ID performance. L1 and ROAR retained 3.7% to 12.6% of all features, whereas causal feature selection generally retained fewer.
Models produced by L1 and ROAR exhibited ID and OOD performance similar to that of the baseline models. Retraining these models on 2017–2019 data, using the features selected on 2008–2010 data, generally reached parity with oracle models trained on 2017–2019 data with all available features. Causal feature selection led to heterogeneous results: the causal feature superset maintained ID performance while improving OOD calibration only on the long LOS task.
Conclusions While model retraining can mitigate the impact of temporal dataset shift on parsimonious
models produced by L1 and ROAR, new methods are required to proactively improve temporal
robustness.
Keywords
dataset shift - machine learning - clinical outcomes - feature selection