Open Access
CC BY 4.0 · Brazilian Journal of Oncology 2025; 21
DOI: 10.1055/s-0045-1807874
INNOVATION IN HEALTHCARE
1831
POSTER PRESENTATION

Robust machine learning model for classifying lung biopsy samples using general and tissue-specific feature extractors

Authors

  • Viviane Teixeira Loiola de Alencar

  • Felipe Navarro Balbino Alves

  • Guilherme de Sousa Velozo

  • Luiz Edmundo Lopes Mizutani

  • Vladmir Cláudio Cordeiro de Lima

  • Fábio Távora

 

    Introduction: Lung cancer is the leading cause of cancer-related deaths and the most commonly diagnosed malignancy worldwide. Accurate histological subtyping is crucial for diagnosis and treatment planning, yet interobserver variability among pathologists can affect up to 25% of cases. Diagnoses often rely on small biopsy samples, which are challenging to interpret and must be conserved for subsequent molecular analyses. New diagnostic tools that improve accuracy without requiring additional tissue sampling would therefore be highly beneficial.

    Objective: This study aimed to evaluate the potential of a machine learning tool to accurately classify hematoxylin and eosin (H&E) stained lung biopsy samples from a real-world dataset into four categories: adenocarcinoma, squamous cell carcinoma, small cell carcinoma, and benign tissue.

    Methods: The training dataset included 412 adenocarcinomas, 323 squamous cell carcinomas, 41 small cell carcinomas, and 532 benign tissue samples, sourced from The Cancer Genome Atlas (TCGA) and a private dataset. To address class imbalance, oversampling techniques were applied to the minority classes. We developed a proprietary model architecture and trained a lung-specific foundation model (LungDine) for feature extraction using DINOv2 on a 1.2-terabyte dataset of 1,935,106 patches from 3,215 H&E lung images from TCGA; its features were compared with ResNet50 features. A second, general-purpose feature extractor (OncoDino) was also trained using DINOv2, on a 6.5-terabyte dataset of 10,212,976 patches from 21,479 histology images. The test dataset consisted of 79 biopsy images from a private real-world dataset, with diagnoses validated by immunohistochemistry.
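
The abstract does not say which oversampling technique was used. As a minimal sketch, assuming plain random oversampling with replacement up to the size of the largest class, the balancing step could look like the following. The function name `random_oversample`, the class label strings, and the seed are hypothetical; only the four class counts come from the Methods.

```python
import random
from collections import Counter

# Training-set class counts as reported in the Methods.
TRAIN_COUNTS = {
    "adenocarcinoma": 412,
    "squamous_cell_carcinoma": 323,
    "small_cell_carcinoma": 41,
    "benign": 532,
}


def random_oversample(labels, seed=0):
    """Return sample indices in which every class is topped up, by random
    duplication, to the size of the largest class.

    This is an assumed technique for illustration, not the authors' method.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    target = max(len(idxs) for idxs in by_class.values())

    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)  # keep every original sample once
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))

    rng.shuffle(balanced)
    return balanced


# Build a label list that mirrors the reported class distribution.
labels = [name for name, n in TRAIN_COUNTS.items() for _ in range(n)]
balanced_indices = random_oversample(labels)
balanced_counts = Counter(labels[i] for i in balanced_indices)
print(balanced_counts)
```

Under this assumption the 41 small cell carcinoma samples are each repeated roughly 13 times on average, which is one reason oversampling should be applied to the training split only and never to the held-out test images.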

    Results: The LungDine model achieved area under the curve (AUC) values of 97% for adenocarcinoma (LUAD), 96% for squamous cell carcinoma (LUSC), 94% for benign tissue, and 96% for small cell carcinoma (SCC), an average AUC improvement of 13.5 percentage points over ResNet50 features. The OncoDino model achieved AUC values of 94% for LUAD, 92% for LUSC, 96% for benign tissue, and 99% for SCC.
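
The per-class figures above can be summarized with a short calculation. The sketch below takes the unweighted mean of the four reported AUC values for each extractor. The ResNet50 figure it derives is an assumption: it is only correct if the reported 13.5-point improvement is the difference between unweighted mean AUCs, which the abstract does not state.

```python
from statistics import mean

# Per-class AUC values (in %) as reported in the Results.
LUNGDINE_AUC = {"LUAD": 97, "LUSC": 96, "benign": 94, "SCC": 96}
ONCODINO_AUC = {"LUAD": 94, "LUSC": 92, "benign": 96, "SCC": 99}

# Unweighted (macro) average over the four classes.
lungdine_mean = mean(LUNGDINE_AUC.values())
oncodino_mean = mean(ONCODINO_AUC.values())

# Per-class difference, LungDine minus OncoDino, in percentage points.
per_class_delta = {
    name: LUNGDINE_AUC[name] - ONCODINO_AUC[name] for name in LUNGDINE_AUC
}

# ASSUMPTION: the reported 13.5-point gain is a difference of macro averages.
REPORTED_GAIN_OVER_RESNET50 = 13.5
implied_resnet50_mean = lungdine_mean - REPORTED_GAIN_OVER_RESNET50

print(f"LungDine mean AUC: {lungdine_mean:.2f}%")
print(f"OncoDino mean AUC: {oncodino_mean:.2f}%")
print(f"Per-class delta (LungDine - OncoDino): {per_class_delta}")
print(f"Implied ResNet50 mean AUC (assumption): {implied_resnet50_mean:.2f}%")
```

On these numbers the two extractors differ by half a percentage point on average, with LungDine ahead on LUAD and LUSC and OncoDino ahead on benign tissue and SCC.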

    Conclusion: These findings demonstrate the efficacy of both models in accurately classifying lung tissue samples. The OncoDino results suggest that effective classification can be achieved without tissue-specific feature extractors, indicating potential for broader and scalable applications in histopathological image analysis. The next step will be to validate these findings in a larger real-world dataset.

    Funding: FAPESP 2023/11600-0.

    Corresponding author: Viviane Teixeira Loiola de Alencar (e-mail: vivianetlalencar@gmail.com).


    No conflict of interest has been declared by the author(s).

    Publication History

    Article published online:
    06 May 2025

    © 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution 4.0 International License, permitting copying and reproduction so long as the original work is given appropriate credit (https://creativecommons.org/licenses/by/4.0/)

    Thieme Revinter Publicações Ltda.
    Rua Rego Freitas, 175, loja 1, República, São Paulo, SP, CEP 01220-010, Brazil
