Endoscopy
DOI: 10.1055/a-2780-0664
Innovations and brief communications

Generative artificial intelligence for patient education material on gastric cancer prevention

Authors

  • Tommy Rizkala

    1   Department of Gastroenterology, IRCCS Humanitas Research Hospital, Rozzano, Italy
  • Natasha Muench

    2   DiCE: Digestive Cancers Europe, Brussels, Belgium
  • Cesare Hassan*

    1   Department of Gastroenterology, IRCCS Humanitas Research Hospital, Rozzano, Italy
    3   Department of Biomedical Sciences, Humanitas University, Pieve Emanuele, Italy
  • Mario Dinis-Ribeiro*

    4   Precancerous Lesions and Early Cancer Management Group, Research Center of IPO Porto (CI‐IPOP)/CI‐IPOP@RISE (Health Research Group), Portuguese Institute of Oncology of Porto (IPO Porto), Porto, Portugal
    5   Gastroenterology Department, Portuguese Institute of Oncology of Porto, Porto, Portugal
  • Generative AI Working Group


Abstract

Background This study assessed the effectiveness of large language models (LLMs) in generating lay summaries for patient education on the management of precancerous lesions and early neoplasia in the stomach.

Methods In this pilot study, we used a two-period, crossover, blinded design to compare a ChatGPT-4o-generated summary with a Digestive Cancers Europe (DiCE) summary. Two panels rated the materials: expert physicians and members of the DiCE Patient Advisory Committee. Experts scored accuracy, completeness, comprehensibility, and satisfaction across five sections; patients rated overall completeness, comprehensibility, and satisfaction. Paired comparisons used mixed-effects estimates. Readability was assessed with the Flesch–Kincaid grade level (FKGL) and the SMOG index.
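Neither readability formula is given in the abstract; the following is a minimal sketch of how the two grade-level scores are computed. It is not the authors' tooling, and the syllable counter is a naive vowel-group heuristic, whereas a validated implementation would use a pronunciation dictionary or an established library such as textstat.

```python
import math
import re

def count_syllables(word: str) -> int:
    # Naive heuristic for illustration only: count vowel groups and drop
    # a common silent final "e". Real scoring tools use pronunciation
    # dictionaries (e.g. CMUdict) or validated libraries.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (FKGL, SMOG) for a non-empty English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    n_sent, n_words = len(sentences), len(words)
    n_syll = sum(syllables)
    n_poly = sum(1 for s in syllables if s >= 3)  # words with >= 3 syllables

    # Flesch-Kincaid grade level:
    #   0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    fkgl = 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59

    # SMOG index, normalized to a 30-sentence sample:
    #   1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291
    smog = 1.0430 * math.sqrt(n_poly * 30 / n_sent) + 3.1291
    return fkgl, smog
```

Both scores approximate a US school-grade reading level; patient education guidance commonly targets roughly a sixth-grade level, the benchmark that, per the Results below, neither summary met.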

Results Median expert ratings were similar between materials across all metrics. For the overall summary, median (range; IQR) scores for ChatGPT-4o vs. DiCE were: accuracy 5 (4–6; 1) vs. 5 (3–6; 1), P = 0.10; completeness 4 (3–5; 1) vs. 4 (2–5; 1), P = 0.27; comprehensibility 4 (3–5; 1) vs. 4 (2–5; 1), P = 0.33; and satisfaction 4 (2–5; 1) vs. 3 (1–5; 2), P = 0.53. Patient ratings mirrored those of the experts. Neither summary met guideline readability recommendations on either the FKGL or the SMOG score.

Conclusion ChatGPT-4o produced patient education material comparable to the DiCE summary, but both require readability optimization; a human-in-the-loop workflow and further testing across prompts and models are warranted.

joint first authors.

* joint senior authors.

Publication History

Received: 14 March 2025

Accepted after revision: 29 December 2025

Article published online: 13 February 2026

© 2026. Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany