DOI: 10.1055/s-0045-1803304
Deep Learning Consensus-Based Framework for the Annotation of a Routine Clinical Vestibular Schwannoma MRI Dataset
Introduction: Data annotation is critical for developing machine learning models in medical imaging, where annotation accuracy directly affects model performance. However, obtaining high-quality annotations is costly and requires clinical expertise. Delineating vestibular schwannoma (VS) in magnetic resonance imaging (MRI) is particularly challenging due to tumor size variability, patient anatomy, and the heterogeneity of retrospective data, especially when VS coexists with other pathologies like meningioma. Accurate labeling is essential to avoid confounding factors that could hinder model performance.
Methodology: Previously, we used a labor-intensive and costly iterative pipeline to manually annotate heterogeneous scans from multiple institutions, referred to as the multi-center routine clinical (MC-RC) VS dataset (UCLH-MC-RC). In this study, using the UCLH-MC-RC and two additional single-center gamma knife (SC-GK) datasets (LDN-SC-GK, ETZ-SC-GK), we annotated a new MC-RC dataset (KCH-MC-RC). To achieve this, we introduced an iterative pipeline with deep learning-based segmentation to reduce both the annotators' workload and inter-rater variability ([Fig. 1]).


We utilized the default 3D full-resolution UNet from the nnU-Net framework for segmentation. The initial training dataset, comprising expert-annotated images from the UCLH-MC-RC, LDN-SC-GK, and ETZ-SC-GK datasets, was used to train the model ([Table 1]). In each round, the model was bootstrapped by incorporating additional cases from the KCH-MC-RC dataset.
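As an illustration of how the pooled training data could be organized, the sketch below assembles a minimal nnU-Net v2 `dataset.json`. The field names follow the public nnU-Net v2 dataset format; the case count and channel label are placeholders, not the study's actual numbers:

```python
import json

def build_dataset_json(num_training, file_ending=".nii.gz"):
    """Assemble a minimal nnU-Net v2 dataset.json for a single-channel
    MRI, binary VS-segmentation task. Field names follow the public
    nnU-Net v2 conventions; the values here are illustrative."""
    return {
        "channel_names": {"0": "MRI"},          # one imaging channel
        "labels": {"background": 0, "VS": 1},   # binary tumor label
        "numTraining": num_training,
        "file_ending": file_ending,
    }

# Pooled initial training set: UCLH-MC-RC + LDN-SC-GK + ETZ-SC-GK
# (the count below is a placeholder, not the study's actual total)
pooled = build_dataset_json(num_training=400)
print(json.dumps(pooled, indent=2))
```

A file like this, together with the image/label folders, is what nnU-Net's preprocessing and training entry points consume.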


In Round 1 of model training, 427 scans were processed and quality-assessed by three independent experts, as shown in [Fig. 2]. A consensus meeting involving a consultant neurosurgeon (J.S.) was subsequently convened to review complex scans.
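The three-rater quality check can be thought of as a simple triage rule: unanimous decisions stand, while anything else is escalated. This is one plausible reading of the workflow, not the study's stated rule, and the function name and threshold are our own:

```python
def triage(ratings):
    """Triage one scan from three independent expert ratings.

    ratings: list of booleans (True = annotation accepted).
    Unanimous accept/reject decisions stand; any disagreement sends
    the scan to the consensus meeting for joint review.
    """
    accepts = sum(ratings)
    if accepts == len(ratings):
        return "accept"
    if accepts == 0:
        return "reject"
    return "consensus_meeting"   # split decision -> review with the neurosurgeon

print(triage([True, True, True]))    # accept
print(triage([True, False, True]))   # consensus_meeting
```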


After the consensus meeting, accepted KCH-MC-RC cases were combined with the initial training data to enhance the segmentation model through bootstrapping ([Table 1]). Rejected sessions were then reprocessed with this bootstrapped model. An expert-trained radiologist manually assessed the Round 2 annotations and accepted or corrected them using the ITK-SNAP annotation tool.
In Round 3, accepted and corrected cases from Round 2 were added to the previously accepted cases from Round 1 and combined with the initial training dataset to further refine the model through bootstrapping.
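The round structure described above amounts to a generic bootstrapping loop: train on everything accepted so far, predict the still-unreviewed cases, fold the newly accepted cases back into the training set, and repeat. The sketch below is our own schematic of that loop; `train`, `predict`, and `review` are toy stand-ins, not the study's code:

```python
def bootstrap_annotation(initial_training, unlabeled, rounds, train, predict, review):
    """Iterative deep-learning-assisted annotation.

    Each round trains on all accepted cases, predicts the pending
    cases, and lets expert review split them into accepted (added to
    the training set) and rejected (carried into the next round).
    """
    training = list(initial_training)
    pending = list(unlabeled)
    for _ in range(rounds):
        model = train(training)
        accepted, pending = review(predict(model, pending))
        training.extend(accepted)        # bootstrap: fold accepted cases back in
        if not pending:
            break
    return training, pending

# Toy stand-ins: the "model" is just the training-set size, and
# "review" accepts even-numbered case ids.
train = lambda data: len(data)
predict = lambda model, cases: cases
review = lambda preds: ([c for c in preds if c % 2 == 0],
                        [c for c in preds if c % 2 != 0])

done, still_pending = bootstrap_annotation([0, 1], [2, 3, 4, 5], 3,
                                           train, predict, review)
print(done, still_pending)   # [0, 1, 2, 4] [3, 5]
```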
Two independent unseen test datasets were used to evaluate model performance of the bootstrapped models: (1) 50 cases drawn from the UCLH-MC-RC, ETZ-SC-GK, LDN-SC-GK datasets and (2) 30 cases drawn from the KCH-MC-RC dataset.
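Segmentation performance on such held-out sets is conventionally scored with the Dice similarity coefficient (the abstract does not name its metric, so this is an assumption). A minimal implementation over foreground-voxel coordinate sets:

```python
def dice(pred, truth):
    """Dice similarity coefficient between two binary masks,
    each given as a set of foreground voxel coordinates."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0                       # both empty: perfect agreement
    return 2 * len(pred & truth) / (len(pred) + len(truth))

# Toy 1-D example: 3 of 4 predicted voxels overlap the 4 true voxels
print(dice({1, 2, 3, 9}, {1, 2, 3, 4}))   # 0.75
```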
Results: The bootstrapped models did not improve segmentation results on the first unseen test set, but performance on the KCH-MC-RC set improved with each round ([Fig. 3]).


Conclusion: This work demonstrated that iterative bootstrapping was effective in refining the model for the specific characteristics of the KCH-MC-RC dataset. This approach could improve a deep learning segmentation model’s accuracy and adaptability when dealing with complex, heterogeneous medical data.
Publication History
Article published online:
07 February 2025
© 2025. Thieme. All rights reserved.
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany