
DOI: 10.1055/a-2565-9155
Automating Responses to Patient Portal Messages Using Generative AI
Funding: None.
- Abstract
- Background and Significance
- Methods
- Results
- Discussion
- Limitations
- Future Work
- Conclusion
- Clinical Relevance Statement
- Multiple-Choice Questions
- References
Abstract
Background
Patient portals bridge patient–provider communication but also exacerbate physician and nursing burnout. Large language models (LLMs) can generate message responses that health care professionals/providers (HCPs) view favorably; however, studies to date have not included diverse message types or newer prompt-engineering strategies.
Objectives
Our goal is to compare the quality and precision of GPT-generated message responses with real provider responses across the spectrum of message types within a patient portal.
Methods
We used prompt-engineering techniques to craft synthetic provider responses tailored to adult primary care patients. We then enrolled a sample of primary care providers in a cross-sectional study comparing authentic patient portal message responses with synthetic responses generated by GPT-3.5-turbo, July 2023 version (GPT). The survey assessed each response's empathy, relevance, medical accuracy, and readability on a scale from 0 to 5. Respondents were also asked to identify which responses were GPT-generated and which were provider-generated. Mean scores for all metrics were computed for subsequent analysis.
Results
A total of 49 HCPs participated in the survey (58% completion rate), comprising 16 physicians and 32 advanced practice providers (APPs). Compared with responses written by real providers, GPT-generated responses scored statistically significantly higher on two of the four parameters: empathy (p < 0.05) and readability (p < 0.05). No statistically significant difference was observed for relevance or accuracy (p > 0.05). Although the difference in readability was statistically significant, its absolute magnitude was small, and the clinical significance of this finding remains uncertain.
Conclusion
Our findings affirm the potential of GPT-generated message responses to achieve levels of empathy, relevance, and readability comparable to those of typical responses crafted by HCPs. Additional studies should be conducted within provider workflows, with careful evaluation of patient attitudes and concerns related to both the ethics and the quality of generated responses across settings.
Keywords
patient web portal - physician–patient interaction - artificial intelligence - communication - health - medical informatics - large language models

Background and Significance
Patient portals have become an integral component of modern health care, giving patients secure online access to vital health information and serving as a communication bridge between health care professionals/providers (HCPs) and patients. In doing so, they foster stronger connections between providers and patients and facilitate the delivery of personalized care.[1] However, with the rapid growth of patient portals, especially during the COVID-19 pandemic,[2] this convenience has produced an overwhelming surge in the number of in-basket messages that HCPs must manage daily. The growing volume of these messages, many of which are administrative or routine, contributes to clinical burnout, reduces efficiency, and diverts valuable time from direct patient care.[1] [2] [3] [4] First documented in 1974, physician burnout has been linked to the demands of electronic health record (EHR) documentation, which consumes substantial clinical time.[2] [5] Among HCPs, primary care providers face particularly heightened burnout risk, underscoring the pressing need for interventions that alleviate EHR-related burdens and support clinician well-being.[2]
Generative AI (GenAI) tools such as OpenAI's GPT-4 have emerged as promising aids in the health care sector, particularly for mitigating documentation-related burnout among clinicians. GenAI has gained widespread attention in the medical community for its capacity to generate patient clinic letters, radiology reports, medical notes, discharge summaries, and clinical decision support.[6] [7] [8] [9] [10] Researchers have begun to explore patient-facing roles for GenAI in areas such as chatbots,[11] [12] wearable technologies,[13] [14] [15] and automated patient portal message responses.[9] [16] In particular, a recent study by Tai-Seale et al. with 52 participants showed that GenAI drafts took longer to read and were approximately 18% longer than human drafts; although the drafts were less detailed, they were generally more empathetic. Most respondents edited messages to remove recommendations for appointments or inaccuracies. That study did not divide messages into types, such as those described by Sulieman and colleagues,[17] and therefore did not address whether specific types of messages may require less scrutiny or human input. It is critical to note that while GenAI has shown promise in automating routine or administrative messages and easing the load on HCPs, not all patient communications are suitable for automation. Understanding which types of patient messages require less scrutiny or human input is critical to ensuring both efficiency and safety in health care delivery.
Currently, there is limited research and guidance on how to triage patient messages for potential automation, leaving a gap in knowledge that can hinder the effective integration of artificial intelligence (AI) tools. This study aims to address this gap by evaluating the appropriateness of automating responses to specific message types while also exploring HCP acceptance of GenAI in daily clinical practice. By identifying which messages can be safely handled by AI and which require human attention, this research seeks to streamline communication workflows. The ultimate goals are to alleviate clinician workload, improve response times for patients, and ensure that high-quality, personalized care remains a priority, offering a practical solution to a growing issue in modern health care.
Methods
Study Setting
This cross-sectional, single-group study was conducted at a major urban integrated academic medical center in the Northeast. We focused primarily on physicians and advanced practice providers (APPs) in internal medicine and family medicine. HCPs in this setting practice at over 50 locations and function as a patient-centered medical home for millions of patients. The study was deemed non-human-participant research by the University of Pennsylvania Institutional Review Board; answering the survey served as implied consent to participate in the project.
Initial Data Collection and Creation of Synthetic Patient Portal Messages
Considering the sensitive nature of real patient portal messages, we first retrieved a set of 85 patient portal message–response pairs from a repository at Vanderbilt University Medical Center. This set contained messages from both the provider and the patient, for 170 messages in total. All identifying language was removed from each message, with identifying nouns replaced by semantically similar language, allowing us to use the messages on commercial cloud platforms. No changes were made to tone, urgency, or vernacular. Two investigators (B.D.S. and K.B.J.) then independently categorized each patient message into the categories described by Sulieman et al.[17] (management, interventions, problems, referrals, test results, and clinical intervention preparation). Interrater agreement for this categorization was nearly perfect (kappa = 0.94).
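For reference, interrater agreement of this kind can be checked in a few lines. The sketch below uses scikit-learn's cohen_kappa_score with illustrative labels only; the actual study categorized all 85 patient messages.

```python
# A minimal sketch of the interrater-agreement check; the label lists below
# are illustrative placeholders, not the study's data.
from sklearn.metrics import cohen_kappa_score

reviewer_1 = ["management", "problems", "test results", "referrals", "problems"]
reviewer_2 = ["management", "problems", "test results", "interventions", "problems"]

kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Cohen's kappa: {kappa:.2f}")  # the study reports kappa = 0.94
```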
Using the authentic provider responses to these rephrased patient messages, we engineered a prompt within GPT-3.5-turbo, July 2023 version (GPT)[18] to generate additional message responses similar in tone, length, and word choice. We used a strategy based on “chain-of-thought” reasoning, which facilitates explanatory, iterative changes in text generated by large language models (LLMs).[19] [20] We used this strategy to tailor an initial GPT response to the tone and language of the original message; for example, the length of a response can be modified without compromising accuracy by iteratively adjusting a component of the prompt pipeline, as shown in [Fig. 1] and sketched below. A convenience sample of eight experienced primary care physicians reviewed these GPT-generated responses and was unable to distinguish them from the original provider responses.
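A draft-then-refine loop in the spirit of this strategy might look like the following sketch. The model name, prompts, and helper functions here are our assumptions for illustration, not the authors' exact pipeline (their code is on GitHub[21]).

```python
# Hypothetical sketch of an iterative refinement loop: draft a reply, then
# adjust one attribute at a time with an explained edit.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def complete(prompt: str) -> str:
    """Send a single-turn chat completion request and return the text."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def draft_then_refine(patient_message: str) -> str:
    # Step 1: generate an initial draft reply.
    draft = complete(
        "Draft a reply to this patient portal message:\n" + patient_message
    )
    # Step 2: iteratively adjust one attribute (here, length), asking the
    # model to explain its edit so each change stays interpretable.
    return complete(
        "Shorten the reply below to under 150 characters without changing its "
        "medical content. Briefly note what you removed, then give the final "
        "reply.\n\nReply:\n" + draft
    )
```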


Given these results, we combined our pool of messages into one set to develop our synthetic patient portal message responses for the second phase of the study.
Pipeline Development
We used iterative prompt engineering, similar to chain-of-thought prompting, to create tailored provider responses to patient messages. These prompts contained explicit instructions to mimic the tone, word choices (slang rather than medical terminology), and brevity of each rephrased patient message, but they did not include any content from the patient messages used in the test set for GPT-generated responses. Once we were satisfied with the face validity of the responses, we generated synthetic patient portal message responses across the range of categories described above, using this prompt:
“You're a health care provider whom the patient will consult. Your responsibility is to formulate concise (preferably under 150 characters), compassionate, and medically accurate responses to patient messages in the ‘{category}’ category with an ‘{urgency}’ urgency level, and to direct them to ‘my’ office, offer ‘my’ help if necessary. Ask follow-up questions if not enough information is provided. If the situation is very urgent or requires on-site evaluation, you should ask the patient to come in.”
The final pipeline is summarized in [Fig. 1], and all code used to generate the messages is available on GitHub.[21]
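To make the templating concrete, the following sketch shows how a prompt of this form might be instantiated per message; the category and urgency values are hypothetical, and the study's actual pipeline is in the linked repository.

```python
# Illustrative instantiation of the templated system prompt quoted above;
# category/urgency values below are examples, not the study's full set.
SYSTEM_TEMPLATE = (
    "You're a health care provider whom the patient will consult. Your "
    "responsibility is to formulate concise (preferably under 150 characters), "
    "compassionate, and medically accurate responses to patient messages in "
    "the '{category}' category with an '{urgency}' urgency level, and to "
    "direct them to 'my' office, offer 'my' help if necessary. Ask follow-up "
    "questions if not enough information is provided. If the situation is "
    "very urgent or requires on-site evaluation, you should ask the patient "
    "to come in."
)

def build_system_prompt(category: str, urgency: str) -> str:
    """Fill the template for one message's category and urgency."""
    return SYSTEM_TEMPLATE.format(category=category, urgency=urgency)

# Example: a test-results message judged to be of low urgency.
print(build_system_prompt(category="test results", urgency="low"))
```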
Evaluation of Message–Response Pairs
To evaluate the quality and authenticity of messages generated by our pipeline, we conducted a cross-sectional survey study of HCPs across the University of Pennsylvania. The survey consisted of 20 questions, each presenting a unique patient message paired with a provider response. Of these 20 message–response pairs, 10 responses were generated de novo using GPT, while the remaining 10 were written by real doctors. Participants were not told which responses were GPT-generated and which were written by real doctors, ensuring that their evaluations were based solely on the quality of the responses, independent of source. For each question, participants rated the message–response pair on four key quality dimensions of communication described by Ayers and colleagues[22]: Empathy, reflecting the degree of consideration for the patient's emotions in the response; Relevance, assessing how closely the response addressed the patient's expressed needs; Medical Accuracy, gauging the alignment of the response with established medical practices and guidelines; and Readability, evaluating the clarity, coherence, and simplicity of the language employed. Each quality dimension was rated on a Likert scale from 0 to 5. Additionally, participants were asked to discern whether each message response was GPT-generated or written by a real provider.
We recruited HCPs who identified as primary care MDs, DOs, or APPs through an email distribution list; this sampling frame covered most primary care providers at our institution. Eighty-four potential participants responded to the initial email request, and a survey link was sent to each interested participant. Upon completion of the survey, participants received a $10 Starbucks gift card as a token of appreciation. The survey was distributed using both Google Forms and REDCap, the latter because firewall restrictions prevented the use of Google Forms on some computers. The survey was administered between November 28, 2023, and January 5, 2024. We used JMP (version 17.2.0) to complete univariate analyses and to determine whether participants who could discern which messages were generated by GPT differed in their assessment of message quality.
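Although the authors ran their analyses in JMP, the summary statistics themselves are straightforward; assuming the ratings were exported in a long format with hypothetical column names, the per-dimension means and medians could be reproduced as follows.

```python
# A sketch of the summary computation under an assumed long-format export;
# column names and scores are illustrative, not the study's data.
import pandas as pd

ratings = pd.DataFrame({
    "source": ["GPT", "GPT", "Real", "Real"],
    "dimension": ["empathy", "readability", "empathy", "readability"],
    "score": [4.2, 4.5, 3.6, 4.1],  # illustrative values only
})

summary = ratings.groupby(["dimension", "source"])["score"].agg(["mean", "median"])
print(summary)
```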
Results
Of the 84 HCPs who showed interest in participating in the survey, 49 completed the survey, resulting in a 58% completion rate. [Table 1] provides an overview of various demographic and professional variables among the 49 respondents. Most participants identified as female (77.6%), with 69% between the ages of 31 and 40. A total of 67% of respondents identified as APPs, while 33% held a medical degree (MD or DO). Years of experience seeing patients varied, with the largest group having less than 5 years of experience (31%), followed by experience between 10 and 15 years (18%). Most respondents worked in clinics (69%), in urban settings (63%), and reported receiving 25 to 75 in-basket messages from patients during a typical work week (55%). Most respondents (76%) indicated no or unknown experience with AI tools in medical practice.
Abbreviation: AI, artificial intelligence.
[Table 2] and [Fig. 2] summarize the overall assessment of message–response quality. Notably, GPT-generated responses generally outperformed real responses across all key characteristics, reaching statistical significance for empathy (p < 0.001) and readability (p < 0.001); relevance trended toward significance (p = 0.08). Participants correctly identified GPT-generated messages 73% of the time and correctly identified authentic messages 50% of the time ([Table 3]). There was no statistically significant difference in assessed message–response quality between participants who reliably identified GPT messages (“good guessers”) and other participants, as determined by one-way analysis of variance (ANOVA; F(1,47) = 2.27, p = 0.13).
Abbreviation: SD, standard deviation.
Note: The table above provides a breakdown of the means and medians for the four key characteristics, comparing GPT-generated message–response pairs with real ones. Both empathy and readability were rated statistically significantly higher for GPT-generated responses.
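The good-guesser comparison above was run in JMP; for readers who prefer code, an equivalent one-way ANOVA in Python might look like the following, with hypothetical per-participant mean ratings standing in for the survey data.

```python
# Sketch of the reported one-way ANOVA; group values are illustrative.
import numpy as np
from scipy.stats import f_oneway

# Mean quality ratings per participant (hypothetical values).
good_guessers = np.array([3.8, 4.1, 3.5, 4.0])
other_raters = np.array([3.6, 3.9, 4.2, 3.7])

f_stat, p_value = f_oneway(good_guessers, other_raters)
print(f"F = {f_stat:.2f}, p = {p_value:.2f}")  # study: F(1,47) = 2.27, p = 0.13
```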


Discussion
In this study, HCPs evaluated the quality of synthetic (GPT-generated) versus authentic (provider-generated) message responses. The results revealed that responses generated by GPT achieved statistically significantly higher ratings in empathy and readability, with a trend toward statistical difference in relevance and medical accuracy, compared with authentic patient portal message responses. The AI's capacity to generate empathetic and easily readable messages may stem from the technology's ability to invest unlimited time in crafting responses, unlike real doctors, who often face time constraints and stress when responding to patient messages. These findings build upon and validate previous research by Ayers and colleagues,[22] in which a small team of HCPs rated online chatbot responses as more empathetic than verified physician responses. Through a more tailored approach to prompt engineering, these results complement those of a recent study by Tai-Seale and colleagues, in which participants found the technology favorable but suggested less “robotic” responses that required less editing.[16] However, it is important to note that our study did not find significant differences between AI- and provider-generated responses on two critical measures: relevance and accuracy. This indicates that while AI may enhance certain aspects of communication, such as tone and readability, it may not yet surpass human input in areas requiring nuanced judgment, such as the relevance of a response to a patient's medical issue or the accuracy of clinical information. Moreover, while the readability difference was statistically significant, its absolute magnitude was relatively small, raising questions about the clinical significance of these results. This emphasizes the need for careful interpretation of these findings, particularly when considering the practical impact on patient care.
Our study extends these findings by including more response types and utilizing chain-of-thought reasoning to create more nuanced and explainable output. Another study, by Garcia and colleagues,[23] assessed utilization of AI-generated draft replies and found an overall utilization rate of 20% across 162 clinicians in primary and specialty care, even though draft replies were available for significantly more messages. Utilization was limited by technical constraints and internal exclusion criteria that prevented the generation of a draft for some messages, and themes affecting adoption included tone, content relevance, and accuracy. Based on our findings, although the complexity of the prompt engineering needed to create these messages may be high, that cost may yield improved adoption and is likely to become easier to bear over time.
Additionally, it is critical to recognize that this study was not designed as a non-inferiority trial, meaning we cannot definitively conclude that AI-generated responses are comparable to human responses across all quality parameters. Further research is needed to evaluate whether AI tools can consistently maintain high standards of clinical relevance and safety, especially in more complex patient interactions.
As AI-enabled messaging systems continue to mature and advance, with attention to message tailoring and the specific needs of patients from diverse backgrounds, chatbots and similar tools are likely to become more commonplace in medicine. Already, several studies are exploring the feasibility of integrating systems such as GPT to generate high-quality responses to patient inquiries and aid clinical decision-making across various medical specialties.[7] [24] [25] While AI has shown promise in alleviating the communication burden on HCPs, its use should be considered a supportive tool rather than a replacement for human expertise. AI may be best suited to handle certain types of patient messages that require less medical nuance but should not yet be fully relied upon for more complex decision-making without human oversight.
Of note, crafting effective prompts in our study entailed iterative trial and error. The potential for performance variation underscores the importance of understanding the model's reliance on training data patterns and ensuring the relevance and quality of examples provided. Our resulting strategy and prompts are available for reference, providing valuable insights for future research and implementation endeavors in this rapidly evolving field.
Limitations
This study is subject to several limitations that may impact its generalizability. First, the sample size of both generated messages (8) and participating survey respondents (49) is small, potentially limiting the breadth of perspectives represented. All participants were drawn from a single health care system, which may not fully capture the diversity of opinions regarding the value proposition for patient portal message responses or the preferred format and comprehensiveness of these responses across different health care settings. Furthermore, the study relied on a convenience sample of providers who may have had more time and interest to participate in the survey, introducing a potential bias in the results. As such, caution should be exercised when generalizing the findings of this study to broader populations.
The patient portal messages used to generate these synthetic provider responses were de-identified, with identifying nouns replaced. At the time of this study, we were not permitted to use Health Insurance Portability and Accountability Act (HIPAA) safe harbor-compliant messages outside the health system firewall. We anticipate that health systems will relax this constraint in the near future, which will facilitate larger studies within a health system. Finally, our use of prompt engineering to generate responses is currently a trial-and-error process, with message features proposed by our research team. Our study was conducted using GPT-3.5-turbo, an older and less capable version of GPT than would be used in any clinical trial of this innovation, and it tested the capabilities of GenAI only in isolation. Clinical trials of this technology should combine chart content, message threads, and patient preferences, possibly leveraging retrieval-augmented generation or expanded context windows to improve the personalization of responses, as sketched below.
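As one way to make the retrieval-augmented suggestion concrete, the minimal sketch below retrieves the chart snippets most similar to a patient message and prepends them to the generation prompt. This is not the study's pipeline; the data, retrieval method (TF-IDF rather than a learned embedding), and prompt wording are all hypothetical.

```python
# A minimal retrieval-augmented prompting sketch with toy data; a real
# system would use EHR chart content and a stronger retriever.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chart_snippets = [
    "Problem list: heart failure with reduced ejection fraction.",
    "Active medication: furosemide 40 mg daily.",
    "Last visit: stable weight, no edema.",
]
patient_message = "His stomach is swollen and the fluid medicine isn't helping."

# Rank chart snippets by similarity to the incoming message.
vectorizer = TfidfVectorizer().fit(chart_snippets + [patient_message])
similarity = cosine_similarity(
    vectorizer.transform([patient_message]), vectorizer.transform(chart_snippets)
)[0]
context = "\n".join(chart_snippets[i] for i in similarity.argsort()[::-1][:2])

prompt = (
    f"Relevant chart context:\n{context}\n\n"
    f"Patient message:\n{patient_message}\n\n"
    "Draft a concise, compassionate, medically accurate reply."
)
print(prompt)  # in a full pipeline, this prompt would be sent to the LLM
```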
It will be important to better understand the desirable characteristics of patient portal message responses from the perspective of HCPs and patients.
Potential Biases
The survey was administered to HCPs via Google Forms and REDCap. However, no specific measures were put in place to prevent participants from collaborating or discussing the survey or their responses with one another. As a result, there is a potential for response bias or social desirability bias to have been introduced. HCPs may have been influenced by side conversations or peer expectations, leading to responses that were shaped by group discussions rather than independent, unbiased opinions, which could affect the validity of the survey results.
Furthermore, while the survey aimed to capture a broad perspective from HCPs, we did not employ a purposive sampling approach to specifically target participants with varying levels of experience with AI tools in medical practice. This limitation may have introduced sampling bias and influenced the diversity of responses, as HCPs with more experience using AI might have responded differently compared with those with less or no experience. Future studies could benefit from purposive sampling to ensure a more representative sample in terms of AI familiarity, which may provide deeper insights into the technology's impact on clinical practice.
Future Work
Considering the limitations of our pipeline, several areas for future research and improvement emerge. Quantitative assessments are crucial to validate the significance of each step in the pipeline, offering empirical evidence to support the theoretical justifications for the architecture's structure. The grammar editing phase requires refinement to prevent overcorrection or unintended alterations of colloquial or non-standard language, thus preserving contextual appropriateness.
Continued exploration and adaptation of the underlying model will be necessary to align with evolving understandings of response coherence and relevance. Addressing biases and inaccuracies originating in the training data is imperative to improve system performance and mitigate data-driven biases in generated responses. Enhancing the system's capacity to retain context throughout extended or complex conversations is challenging and must be monitored. Finally, the rapidly evolving nature of LLMs is likely to improve the ability to tailor responses, so systems employing these tools must be periodically reevaluated and refined to better align with the needs of patients and HCPs. For example, we anticipate multistep pipelines that do more at each step (using recursive approaches and context expansion) without introducing hallucinations, and that include validation layers checking many more aspects of readability, bias, and context specificity. These and other advances have the potential to yield significant improvements over current methods.
The study had inadequate power to assess the importance of some covariates that might matter for implementing this functionality at scale, including patient and primary care provider characteristics. It will be critical to ensure the efficacy of such systems in light of patient preferences, health care settings, and regulatory requirements. Further research should characterize these covariates, understand and address potential ethical and liability considerations related to automating message responses, and assess the need for additional studies, particularly non-inferiority trials, to explore the risks and benefits of integrating AI into patient communication workflows. Given these limitations, while the pipeline offers a promising approach to generating human-like responses, ongoing research and iterative refinement are crucial to enhance its efficacy and applicability in diverse real-world scenarios. Moreover, assessing AI performance across more diverse and clinically complex message types will be critical to determining its broader applicability in health care. By tackling these difficulties and leveraging advances in AI, health care communication can evolve to meet the ever-changing needs and expectations of patients and clinicians.
Conclusion
The findings of this study suggest that GPT-generated provider responses using new prompt-engineering approaches are acceptable to primary care providers. The study provides promising insights into the potential of AI-driven messaging systems to alleviate clinician burnout and enhance patient communication. As with all technological endeavors, continual evolution is paramount for addressing challenges and leveraging emerging insights from both the technological and cognitive domains.
Clinical Relevance Statement
This study describes a strategy to create provider responses to patient portal messages using GPT. HCPs found these generated responses to have a comparable level of accuracy, relevance, empathy, and readability to authentic provider responses.
Multiple-Choice Questions
1. Which of the following best describes the authors' approach to prompt engineering?

   a. Asking HCPs to describe how they think about creating a patient response.
   b. Developing a response using a component of the original thinking of the patient.
   c. Applying a sequence of defined tasks in successive prompts to create a tailored and interpretable response. (correct answer)
   d. Creating a prompt that gives the model engineers the flexibility to interpret the user's intention.

   Correct Answer: The correct answer is option c. Prompt engineering is the process of designing inputs for AI tools that will produce optimal outputs. In this study, the authors applied an iterative prompt-engineering approach, summarized in [Fig. 1].

2. What are the four dimensions on which the authors compared GenAI responses to authentic HCP responses?

   a. Relevance, readability, medical accuracy, and empathy (correct answer)
   b. Readability, length, timeliness, and accuracy
   c. Relevance, length, actionability, and empathy
   d. Relevance, readability, urgency, and length

   Correct Answer: The correct answer is option a. As is becoming a standard evaluation strategy for LLM output in patient care settings, the authors examined the output's relevance, readability, medical accuracy, and empathy, with results summarized in [Fig. 2].
| Message–response pair | Message category[a] | Response type | Percentage identified as GPT (n = 49) | Percentage identified as real (n = 49) |
| --- | --- | --- | --- | --- |
| Message: Ben is having a few problems. His stomach is extended with possible fluid retention. He took his fluid medicine yesterday but it doesn't seem to have helped just yet. He is also becoming very short of breath when doing anything. I wasn't sure if he could be seen in clinic soon or if he could have some blood work done to check levels. Response: I'm really sorry to hear about Ben's symptoms. It's really important to get him checked out as soon as possible. Can you please bring him to the clinic so the doctors can give him a thorough check-up and do the right tests? | Medical—management | GPT | 36 (73%) | 13 (27%) |
| Message: My R elbow is hurting. If I schedule an apt, may I receive another injection? My elbow surgery is scheduled in Oct. Response: I'm sorry to hear that your elbow hurts. It would be a good idea to make an appointment to have a doctor check your symptoms and talk about the possibility of getting another shot. | Medical—interventions | GPT | 33 (67%) | 16 (33%) |
| Message: Do I need to prepare in any way or the upcoming fluid deprivation test? such as quit drinking fluids at a certain time or avoid caffeine and alcohol in the time leading up to it? Response: The test usually finishes faster if you minimize fluid intake after midnight. However, you do not need to strictly avoid drinking anything after midnight and should not do so if you are noticeably thirsty. | Informational—preparation | Real | 21 (43%) | 28 (57%) |
| Message: Good morning, I have been dealing with this crud for 8 days now. I still have this terrible cough that I cannot get rid of and when I breathe it doesn't feel like I am getting any air. It is keeping me up at night. I am coughing up/blowing out of my nose this bright yellow clumpy nasty stuff and I am just really tired of the cough. Any suggestions on what else I can do to help this clear up? Response: So sorry to hear you've been dealing with this for such a long time. I'd recommend you come in for an acute visit or go to one of our walk-in clinics if closer to your home. We can be sure there isn't something that needs more than simple over-the-counter treatment. It's difficult to fully assess your symptoms and severity of illness without seeing you in person. | Medical—problems | Real | 22 (45%) | 27 (55%) |
| Message: I need to talk to someone about possibly being referred to see a therapist. I've began to feel very unlike myself. This happened with my previous pregnancy, and it led me to make some hasty not very well planned decisions. I feel very defeated physically and mentally. If there is someone you can refer me to I'd appreciate it. Response: Thank you so much for reaching out. Yes, lets get you to see a counselor so that we can get you feeling more like yourself. We have a counselor that works at our office. She is funded by a grant so your appointments with her would be free and we could schedule them the same day as your visit so that you don't have to make extra trips. Are you open to meeting with her? | Medical—referrals | Real | 12 (24%) | 37 (76%) |
| Message: Please let me know when you get my results. Response: I will inform you once I receive your laboratory results. If you have any worries or inquiries, please don't hesitate to contact me. | Logistical—tests | GPT | 41 (84%) | 8 (16%) |
| Message: I am feeling much better but still have sinus pressure and congestion. I am not sure if I should start the antibiotics or not. Response: I'm happy to hear that you're feeling better. Sometimes, even after the infection is gone, sinus pressure and congestion can still stick around. You could try using decongestants or saline nasal rinses that you can buy without a prescription. If your symptoms get worse or don't go away, please contact our office so we can check it out. | Medical—management | GPT | 21 (44%) | 27 (56%) |
| Message: Good morning! It looks like everything is stable and normal. So why do I feel so crummy? Will switching meds help knowing my results are normal? Response: Your inflammation markers have not been significant elevated, so we cannot necessarily rely on them to assess your disease. Treatment change is recommended based on your report of increased symptoms of joint stiffness, swelling and improvement after taking steroids. | Medical—management | Real | 29 (59%) | 20 (41%) |
| Message: Ben is having a few problems. His stomach is extended with possible fluid retention. He took his fluid medicine yesterday but it doesn't seem to have helped just yet. He is also becoming very short of breath when doing anything. I wasn't sure if he could be seen in clinic soon or if he could have some blood work done to check levels. Response: I spoke with the doctor about this. Please continue to monitor for the next 24 hrs. If his condition continues to worsen, he should go to the ER. The doctor will call you this afternoon to check in. | Medical—management | Real | 23 (47%) | 26 (53%) |
| Message: Have you received results from biopsy from July 19? Response: Your biopsy was negative. Please call our office to schedule a consult to discuss your plan of care. | Logistical—tests | Real | 36 (73%) | 13 (27%) |
| Message: I've had my first physical seizure this morning. It was just a facial muscle twitch and pulse no passing out. Response: I'm sorry to hear that you had a seizure. It's really important to figure out why it happened. Let's make an appointment so we can talk more about it and decide what to do next. | Medical—problems | GPT | 35 (71%) | 14 (29%) |
| Message: I just read the X-ray report from my visit. I'm concerned with the Cardiomegaly. Is the enlarged heart due to the stomach being in my chest? Response: Cardiomegaly means that the heart is bigger than it should be, and there can be different reasons for this. It's not likely to be connected to the stomach being in the chest. I suggest talking to your doctor about the results to get more information and to understand what it means. | Medical—tests | GPT | 32 (65%) | 17 (35%) |
| Message: Jacob's mouth sore is really bothering him to talk and eat, and kinda swollen. He is going to get the mouthwash in the morning. How will we know if he needs to see a doc or do we just treat the symptoms? Response: If the sore in your mouth keeps bothering you or gets worse even after using mouthwash, or if you notice any signs of infection like a fever or redness that is spreading, it's a good idea to see a doctor. They can check it out and give you the right treatment. | Medical—problems | GPT | 28 (57%) | 21 (43%) |
| Message: This thing in my throat is continuing to progress. I now have a steady dull pain along with a few random stabbing pains in that area. Swallowing has become a bigger issue also. I have an appointment with you this Friday, and with speech pathology afterwards. I am ready to get this thing out! Response: I tried to call you several times today. I am sorry to have not gotten in touch. I hope that your symptoms do not worsen rapidly. If they do, you can come to ER here at any time day or night and we will take care of you | Medical—problems | Real | 11 (22%) | 38 (78%) |
| Message: I have an appointment with [DOCTOR] today at 1 and I will have him talk to Jane too, but I'm worried. Since late November, I have been on a mania-like rollercoaster of anxiety that is affecting my sleep. I'm just worried it's going to start physically affecting me. Is there anything that I can do? Response: It would be best to discuss this in the office. My staff will contact you to make an appointment. | Medical—problems | Real | 34 (69%) | 15 (31%) |
| Message: Can you tell me what type of test this is: Natriuretic Peptide, as it appears that I am well over the range. Thanks. Response: The Natriuretic Peptide test is used to measure the levels of a hormone that is released by the heart when it is under stress. If the levels are higher than normal, it could be a sign of heart failure or other heart-related conditions. We kindly request that you schedule an appointment to discuss your test results in more detail. | Informational—tests | GPT | 45 (92%) | 4 (8%) |
| Message: We have questions re: test results. Please call when possible. Response: I'm sorry to hear that you're confused about your test results. Don't worry, I'm here to assist you. Can you please give me more information about the test and what's bothering you? | Medical—tests | GPT | 46 (94%) | 3 (6%) |
| Message: Good morning. I have noticed since Sunday that I has ptosis on the right side it is not worse but it is not better. I am asymptomatic otherwise, but I get tired easy. What do you think? Response: I'm sorry to hear about your ptosis, which is a drooping of the eyelid. It's important to look into this more. I suggest making an appointment with me so we can talk about your symptoms and do a physical exam. | Medical—problems | GPT | 42 (86%) | 7 (14%) |
| Message: Please let me know when you get my results Response: [DOCTOR] believes that this probably due to recent steroids for bronchitis, but you also had some elevated white blood cell count in the past also. It is not uncommon in our obese patient population to have elevated white blood cell (WBC) counts. Typically this elevation is not due to underlying marrow pathology but rather reflects low-grade inflammation. Please have your PCP send us some older CBC results for comparison and establish your normal range? | Medical—tests | Real | 24 (49%) | 25 (51%) |
| Message: I discovered that I have latent TB from an occupational screening . The exposure was during some construction work years ago but they will not treat me. My daughter is immune compromised due to medications she takes. My concern is that my insurance will not treat latent TB. However if I wait until I am active it is already too late and my family has been exposed. More importantly my daughter who's immune system is suppressed. Response: [DOCTOR] would like to meet with you in clinic to discuss in more detail. Are there days and times that work well for you? | Medical—problems | Real | 33 (67%) | 16 (33%) |

a Message category interrater agreement 95% (Cohen's kappa = 0.94, near-perfect agreement).
Conflict of Interest
None declared.
Protection of Human Subjects
The University of Pennsylvania Human Research Protection Program, under study No. 854147, granted approval for this research project. Participant consent was not deemed necessary because the study involved secondary analysis of patient portal messages processed through the pipeline described above. The same protocol covered retrieval of the initial set of patient portal messages from a repository at Vanderbilt University Medical Center (VUMC), which were later used to create the synthetic patient portal messages used in the study. The use of patient portal messages from VUMC was conducted in compliance with ethical guidelines, and patient consent was not required because the data underwent a rigorous de-identification process, rendering it impossible to trace any information back to individual patients. Our research thus upholds the principles of confidentiality and anonymity, protecting participants' privacy rights in accordance with established ethical standards.
References
- 1 Carini E, Villani L, Pezzullo AM, et al. The impact of digital patient portals on health outcomes, system efficiency, and patient attitudes: Updated systematic literature review. J Med Internet Res 2021; 23 (09) e26189
- 2 Tai-Seale M, Baxter S, Millen M, et al. Association of physician burnout with perceived EHR work stress and potentially actionable factors. J Am Med Inform Assoc 2023; 30 (10) 1665-1672
- 3 Johnson KB, Neuss MJ, Detmer DE. Electronic health records and clinician burnout: A story of three eras. J Am Med Inform Assoc 2021; 28 (05) 967-973
- 4 Johnson KB, Ibrahim SA, Rosenbloom ST. Ensuring equitable access to patient portals-closing the “techquity” gap. JAMA Health Forum 2023; 4 (11) e233406
- 5 Kruse CS, Mileski M, Dray G, Johnson Z, Shaw C, Shirodkar H. Physician burnout and the electronic health record leading up to and during the first year of COVID-19: Systematic review. J Med Internet Res 2022; 24 (03) e36200
- 6 Liu S, Wright AP, Patterson BL, et al. Using AI-generated suggestions from ChatGPT to optimize clinical decision support. J Am Med Inform Assoc 2023; 30 (07) 1237-1245
- 7 Rajjoub R, Arroyave JS, Zaidat B, et al. ChatGPT and its role in the decision-making for the diagnosis and treatment of lumbar spinal stenosis: A comparative analysis and narrative review. Global Spine J 2024; 14 (03) 998-1017
- 8 Kao HJ, Chien TW, Wang WC, Chou W, Chow JC. Assessing ChatGPT's capacity for clinical decision support in pediatrics: A comparative study with pediatricians using KIDMAP of Rasch analysis. Medicine (Baltimore) 2023; 102 (25) e34068
- 9 Liu S, McCoy AB, Wright AP, et al. Leveraging large language models for generating responses to patient messages. medRxiv 2023 (e-pub ahead of print)
- 10 Liu J, Wang C, Liu S. Utility of ChatGPT in clinical practice. J Med Internet Res 2023; 25: e48568
- 11 Kurniawan MH, Handiyani H, Nuraini T, Hariyati RTS, Sutrisno S. A systematic review of artificial intelligence-powered (AI-powered) chatbot intervention for managing chronic illness. Ann Med 2024; 56 (01) 2302980
- 12 Mohan RMR, Joy M, Natt D, et al. Digital Therapeutics and Chatbots: Assessing the efficacy of ChatGPT and Google BARD in IBS treatment plans. Am J Gastroenterol 2024; 118: S25
- 13 Chheng C, Wilson D. Abnormal gait detection using wearable Hall-Effect sensors. Sensors (Basel) 2021; 21 (04) 1206
- 14 Ram Mohan RM, Joy M, Natt D, et al. The Future of Digestive Health: Personalized treatments through wearable AI technologies. Endoscopy 2024; 56: S316
- 15 Soley N, Speed TJ, Xie A, Taylor CO. Predicting postoperative pain and opioid use with machine learning applied to longitudinal electronic health record and wearable data. Appl Clin Inform 2024; 15 (03) 569-582
- 16 Tai-Seale M, Baxter SL, Vaida F, et al. AI-generated draft replies integrated into health records and physicians' electronic communication. JAMA Netw Open 2024; 7 (04) e246565
- 17 Sulieman L, Robinson JR, Jackson GP. Automating the classification of complexity of medical decision-making in patient-provider messaging in a patient portal. J Surg Res 2020; 255: 224-232
- 18 Achiam J, Adler S, Agarwal S, et al.; OpenAI. GPT-4 technical report. Published online 2023. arXiv:2303.08774
- 19 Narang S, Raffel C, Lee K, Roberts A, Fiedel N, Malkan K. WT5?! Training Text-to-text models to explain their predictions. Published online April 29, 2020. arXiv:2004.14546. Accessed August 5, 2024 at: http://arxiv.org/abs/2004.14546
- 20 Wiegreffe S, Hessel J, Swayamdipta S, Riedl M, Choi Y. Reframing human-AI collaboration for generating free-text explanations. Published online May 4, 2022. arXiv:2112.08674. Accessed August 5, 2024 at: http://arxiv.org/abs/2112.08674
- 21 Budko A. Observer Project GitHub site. Published online November 18, 2024. Accessed April 4, 2025 at: https://github.com/alex-budko/OBSERVER-Project
- 22 Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med 2023; 183 (06) 589-596
- 23 Garcia P, Ma SP, Shah S, et al. Artificial intelligence-generated draft replies to patient inbox messages. JAMA Netw Open 2024; 7 (03) e243201
- 24 Riedel M, Kaefinger K, Stuehrenberg A, et al. ChatGPT's performance in German OB/GYN exams - paving the way for AI-enhanced medical education and clinical practice. Front Med (Lausanne) 2023; 10: 1296615
- 25 Reynolds K, Tejasvi T. Potential use of ChatGPT in responding to patient questions and creating patient resources. JMIR Dermatol 2024; 7: e48451
Publication History
Received: 12 August 2024
Accepted: 11 February 2025
Accepted Manuscript online: 25 March 2025
Article published online: 30 July 2025
© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany