J Neurol Surg B Skull Base 2025; 86(S 01): S1-S576
DOI: 10.1055/s-0045-1803715
Presentation Abstracts
Podium Presentations
Poster Presentations

Evaluation of Named Entity Recognition for Automated Extraction of Present Tumor Size and Personal Names from Radiology Reports Using spaCy

Lorena Garcia-Foncillas Macias
1   School of Biomedical Engineering and Imaging Sciences, King's College London, London, United Kingdom
,
Theodore Barfoot
1   School of Biomedical Engineering and Imaging Sciences, King's College London, London, United Kingdom
,
Tom Vercauteren
1   School of Biomedical Engineering and Imaging Sciences, King's College London, London, United Kingdom
,
Jonathan Shapey
2   Department of Neurosurgery, King's College Hospital NHS Foundation Trust, London, United Kingdom
 

Analyzing tumor growth rates from MRI scans is crucial in diagnosing and treating meningiomas. Tumor sizes are typically documented in radiology reports, which consist of unstructured text. Natural language processing (NLP) can be used to extract structured information from such text. De-identifying clinical records by locating and processing personal names is also essential for collaborative research. This study explored training a named entity recognition (NER) model from the spaCy library[1] to extract present tumor sizes and personal names from radiology reports.
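The training setup described above can be sketched with spaCy's v3 training API. This is a minimal illustration only: the entity labels (TUMOR_SIZE, PERSON), the toy report text, and its annotation offsets are hypothetical stand-ins for the study's actual annotation scheme and data.

```python
import random

import spacy
from spacy.training import Example

# Toy annotated report: (text, {"entities": [(start, end, label), ...]}).
# Labels and offsets are illustrative, not the study's real scheme.
TRAIN_DATA = [
    (
        "MRI shows a 2.3 x 1.8 cm meningioma. Reported by Dr. Smith.",
        {"entities": [(12, 24, "TUMOR_SIZE"), (53, 58, "PERSON")]},
    ),
]

# Start from a blank English pipeline and add an NER component.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

# Wrap the annotated text as spaCy Example objects and initialize weights.
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]
nlp.initialize(lambda: examples)

# A few gradient-update passes over the toy data.
for _ in range(20):
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, losses=losses)
```

After training, calling `nlp("…report text…")` yields a `Doc` whose `.ents` carry the predicted spans and labels; in practice the model would be trained on the annotated subset of reports rather than a single toy example.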

A total of 16,856 radiology reports from patients diagnosed with meningioma at King’s College Hospital between 2011 and 2020 were obtained. For this study, we excluded duplicates, resulting in 9,175 radiology reports. From this dataset, a random subset of 400 reports was manually annotated to identify present tumor size and personal names of radiologists and patients. Cross-validation was conducted on 85% of the annotated reports, while the remaining 15% were reserved for testing. Additionally, for the task of identifying personal names, we compared a specifically trained NER model for extracting such entities with a rule-based algorithm, our baseline. The performance of both methods was evaluated at a token level (i.e., word and punctuation level) using precision (ratio of true positive predictions over all positive predictions), recall (ratio of true positive predictions over all positive ground truths), and F1-score (harmonic mean of precision and recall). We also assessed the NER models and the rule-based algorithm at the report level, measuring the proportion of reports whose annotations were predicted perfectly.
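The token- and report-level evaluation described above can be written out in a few lines. This is a self-contained sketch assuming per-token gold and predicted labels (here the illustrative tags "PERSON", "SIZE", and "O" for non-entity tokens); the study's own tokenization and label set may differ.

```python
def token_metrics(gold, pred, label):
    """Token-level precision, recall, and F1 for one entity label."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def report_accuracy(gold_reports, pred_reports):
    """Fraction of reports whose token labels are all predicted correctly."""
    exact = sum(g == p for g, p in zip(gold_reports, pred_reports))
    return exact / len(gold_reports)


# Toy example: one five-token report with one missed PERSON token.
gold = ["O", "PERSON", "PERSON", "O", "SIZE"]
pred = ["O", "PERSON", "O", "O", "SIZE"]

p, r, f = token_metrics(gold, pred, "PERSON")  # p=1.0, r=0.5, f≈0.667
acc = report_accuracy([gold], [pred])          # 0.0: report not perfect
```

Macro-averaged scores, as reported in the results, would be obtained by computing these per-label metrics for each entity type and averaging them with equal weight.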

When comparing the performance between the rule-based approach and the specifically trained NER model for name extraction, the rule-based method achieved a macro-averaged token-level precision of 0.96, recall of 0.99, and F1-score of 0.975. In contrast, the NER model attained 0.998 ± 0.004 for each token-level metric. The report-level accuracy of perfectly predicted annotated reports was 0.836 for the rule-based method, compared with 0.971 ± 0.032 for the NER model. Moreover, when a model was trained for both tasks—identifying present tumor size and name extraction—the performance slightly decreased compared with the name-extraction-only model. In this case, the NER model exhibited a macro-averaged token-level precision of 0.948 ± 0.029, recall of 0.97 ± 0.021, and F1-score of 0.958 ± 0.011. At the report level, 0.817 ± 0.029 of the reports were perfectly predicted against our ground truth of manual annotations.

These findings illustrate the potential for training models on highly heterogeneous data to tackle the complex task of extracting tumor sizes from radiology reports and develop automated pipelines for analysis. This automation could identify patients with increased tumor growth or changes in diagnosis, facilitate follow-up assessments, and contribute to creating more structured data. Additionally, this project is a significant step toward the automated generation of radiology reports from MRI scans.


A), B) Labeled radiology reports with ground truth (GT) and predictions (PRED) from the spaCy model for present tumor size and personal name extraction. Identifying information is replaced with “[PERSON]” and “[DATE]” placeholders for privacy.



Publication History

Article published online:
07 February 2025

© 2025. Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany