DOI: 10.1055/s-0045-1809785
SurgiMind: Next-Generation Surgical Image Segmentation Leveraging Transformers for Lung Cancer Surgery
Background To develop and evaluate a transformer-based deep learning model for real-time segmentation of anatomical structures during video-assisted thoracoscopic (VATS) right upper lobe lobectomy in lung cancer patients.
Methods & Materials A retrospective cohort study was conducted using thoracoscopic video recordings from 81 patients who underwent anatomical VATS right upper lobe resection between 2009 and 2024. A total of 1539 frames were extracted and manually annotated for eight anatomical classes: right upper pulmonary vein, azygos vein, right upper lobe bronchus, phrenic nerve, middle lobe vein, A2 segmental artery, truncus anterior, and pulmonary main artery. Three deep learning architectures (U-Net, Fully Convolutional Transformer [FCT], and the novel Surgi-FCT) were trained and evaluated. Surgi-FCT was optimized by removing the Wide Focus layer and increasing the network depth to improve feature extraction and reduce computational overhead. Evaluation metrics included Dice coefficient, Intersection over Union (IoU), and precision, with separate analyses for class-present (CP) and class-absent (CA) scenarios.
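The abstract reports Dice, IoU, and precision with separate class-present (CP) and class-absent (CA) analyses, but does not specify the scoring convention. The following is a minimal sketch of one common approach, in which a class missing from both prediction and ground truth scores 1.0 (a CA case) and a class present in the ground truth is scored normally (a CP case). The function name and the smoothing term `eps` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dice_iou_per_class(pred, gt, num_classes=8, eps=1e-7):
    """Per-class Dice and IoU for integer label maps pred/gt of shape (H, W).

    Assumption (not specified in the abstract): a class absent from both
    prediction and ground truth scores ~1.0 via eps smoothing (a CA case);
    a class present in the ground truth is a CP case.
    """
    results = {}
    for c in range(num_classes):
        p = pred == c
        g = gt == c
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        dice = (2 * inter + eps) / (p.sum() + g.sum() + eps)
        iou = (inter + eps) / (union + eps)
        results[c] = {
            "dice": float(dice),
            "iou": float(iou),
            "scenario": "CP" if g.any() else "CA",
        }
    return results
```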
Results The Surgi-FCT model with seven encoder-decoder layers (Surgi-FCT 7), trained on 640×640 images, achieved the best segmentation performance, with an average Dice coefficient of 0.69 (CP) and 0.88 (CA), for an overall Dice of 0.82. This outperformed U-Net (Dice: 0.56 CP, 0.79 CA) and FCT (Dice: 0.68 CP, 0.84 CA). Surgi-FCT 7 was particularly effective at segmenting frequently occurring classes such as the pulmonary main artery and phrenic nerve. Classes with fewer examples, such as the A2 artery and middle lobe vein, had lower Dice scores (0.40 and 0.62, respectively) but still performed better under multi-class training than with single-class models. Class co-occurrence, as observed in correlation matrices, improved segmentation accuracy (e.g., co-detection of the azygos vein and main artery). Higher image resolution and deeper model architectures also yielded performance gains, though at increased computational cost.
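The co-occurrence analysis referenced above can be illustrated with a short sketch: from a per-frame class-presence matrix derived from the annotations, a Pearson correlation matrix between class-presence indicators shows which structures tend to appear together. The helper name and the random presence data below are hypothetical, included only to make the example self-contained.

```python
import numpy as np

def class_cooccurrence(presence):
    """Pearson correlation between class-presence indicators.

    presence: (n_frames, n_classes) boolean array, True where a class
    is annotated in a frame. Returns an (n_classes, n_classes) matrix.
    """
    return np.corrcoef(presence.astype(float), rowvar=False)

# Hypothetical data: presence flags for 1539 frames and 8 classes.
rng = np.random.default_rng(0)
presence = rng.random((1539, 8)) < 0.3
corr = class_cooccurrence(presence)  # corr[i, j] near 1 => classes i and j tend to co-occur
```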
Conclusion The Surgi-FCT 7 model enables accurate segmentation of complex anatomical structures in thoracic surgery videos. Leveraging transformer attention and class co-occurrence, it outperforms conventional CNN-based architectures and provides a scalable foundation for AI-powered visual assistance tools in minimally invasive thoracic surgery.
Publication History
Article published online: 25 August 2025
© 2025. Thieme. All rights reserved.