Generative AI for Patient Portal Messages: Safety Risks and Governance

The Inbox Crisis: Why Patient Portal Messages Are Driving Clinician Burnout

The patient portal was designed to empower patients with convenient access to their health information and care teams. It succeeded — perhaps too well. The volume of incoming messages has surged to the point where managing the inbox has become a primary driver of the documented 50% burnout rate among primary care providers (PCPs). The problem is not just the volume; it is the timing. Studies cited in recent research indicate that PCPs spend an average of 1.4 hours each day on after-hours EHR work, a significant portion of which is dedicated to triaging and responding to portal messages. This after-hours burden erodes personal time, contributes to moral injury, and makes primary care an increasingly unsustainable career path.

For health IT leaders and administrators, this is not merely a wellness issue; it is a workforce sustainability and operational efficiency crisis. When clinicians spend over half their workday on EHR-related tasks — as noted in a recent University of Pennsylvania study — the capacity for direct patient care shrinks. The inbox has become a critical workflow pain point that demands a technological intervention. This context explains why over 100 health systems have already moved to deploy generative AI tools specifically designed to draft responses to patient portal messages, seeking relief from a burden that threatens the viability of primary care itself.

How Generative AI Is Being Deployed for Portal Message Drafting

The deployment model for generative AI in portal messaging is deceptively simple in concept but complex in execution. The typical workflow involves an AI model — often a large language model (LLM) like GPT-3.5-turbo — generating a draft response to a patient's incoming message. This draft is then presented to a clinician who is expected to review, edit, and approve it before sending. The tool does not send messages autonomously; the human remains legally and ethically responsible for the final communication.

The scope of what these systems handle is broad but typically limited to common, low-acuity inquiries. Based on current deployments, the most common use cases include:

Medication-related questions (refill requests, dosage clarifications, side effect inquiries)
Appointment scheduling and rescheduling requests
Test result follow-ups and explanations of normal findings
Referral coordination and specialist visit summaries
General administrative inquiries (billing questions, form requests)

The appeal is obvious: a well-crafted draft can save a clinician 30 to 60 seconds per message. Across an inbox of 50 messages, that is roughly 25 to 50 minutes a day, which adds up to hours each week. The npj Digital Medicine study notes, citing published reports, that over 100 health systems have adopted this technology. This rapid uptake, however, has occurred without a complete understanding of the safety implications, a gap that recent evidence has begun to expose.

The Evidence for Benefit: Empathy, Readability, and Reduced Cognitive Load

The case for deploying generative AI in this context is supported by concrete, measurable benefits. A cross-sectional study from the University of Pennsylvania, published in Applied Clinical Informatics, evaluated GPT-3.5-turbo-generated responses against actual physician responses across 20 patient portal message-response pairs. Forty-nine healthcare professionals (67% advanced practice providers, 33% physicians) blindly rated the responses. The results were striking: AI-generated responses scored significantly higher on empathy (mean 3.57 vs. 3.07, p<0.001) and readability (mean 4.50 vs. 4.13, p<0.001). No statistically significant difference was found for relevance or medical accuracy. On this sample of 20 pairs, the AI drafts were rated better than the physicians' responses on two dimensions and not detectably different on the other two.

Beyond quality metrics, the impact on clinician workload is substantial. The npj Digital Medicine study reported that 80% of participating PCPs agreed that AI-generated drafts reduced their cognitive workload. This aligns with findings from other research cited in the same study, which reported a 40% reduction in inbox burden. For a clinician facing 1.4 hours of after-hours EHR work daily, a 40% reduction would reclaim up to about 34 minutes each day if all of that time were inbox work, and proportionally less otherwise. That time can be redirected to patient care or personal rest.
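The time-savings arithmetic can be made explicit. The sketch below is illustrative only: the function name and the `inbox_share` parameter are assumptions introduced here, since the source says only that "a significant portion" of after-hours EHR time goes to portal messages. It shows that a figure above 30 minutes is an upper bound that holds when all after-hours time is inbox work.

```python
def minutes_reclaimed(after_hours_per_day: float,
                      reduction: float,
                      inbox_share: float = 1.0) -> float:
    """Estimate minutes reclaimed per day from a reduction in inbox burden.

    after_hours_per_day: after-hours EHR work in hours (1.4 in the cited research).
    reduction: fractional reduction in inbox burden (0.40 in the cited research).
    inbox_share: fraction of after-hours time spent on portal messages.
                 ASSUMPTION: not reported in the article; 1.0 gives the upper bound.
    """
    for name, value in (("reduction", reduction), ("inbox_share", inbox_share)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return after_hours_per_day * 60 * inbox_share * reduction


# Upper bound: every after-hours minute is inbox work.
print(round(minutes_reclaimed(1.4, 0.40), 1))        # 33.6

# Hypothetical case: half of after-hours time is inbox work.
print(round(minutes_reclaimed(1.4, 0.40, 0.5), 1))   # 16.8
```

A local time-motion estimate of `inbox_share` would turn this upper bound into a realistic planning number.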

Comparison of AI-generated vs. human physician responses to patient portal messages (Kaur et al., Applied Clinical Informatics, 2025).
Metric	AI-Generated (GPT-3.5)	Human Physician	Statistical Significance
Empathy Score (mean)	3.57	3.07	p < 0.001
Readability Score (mean)	4.50	4.13	p < 0.001
Relevance Score (mean)	Not significantly different	Not significantly different	p = 0.08
Medical Accuracy	Not significantly different	Not significantly different	p = 0.12

80% of participants agreed AI drafts reduced cognitive workload; 75% found them safe; yet almost all physicians 'sent' at least one erroneous response.

This quote from the npj Digital Medicine study captures the central paradox of this technology: clinicians perceive the tool as both helpful and safe, but their behavior reveals a dangerous gap between perception and reality.

The Evidence for Risk: 66.6% Error Miss Rate and the Unedited Draft Problem

The benefits are real, but they come with a risk profile that demands serious attention. The npj Digital Medicine study, funded by the Agency for Healthcare Research and Quality (AHRQ), conducted a cross-sectional simulation with 20 practicing PCPs from 13 clinical sites (mean experience: 14.75 years). Each physician reviewed 18 patient portal messages, four of which contained AI-generated drafts with embedded errors — two objective inaccuracies and two potentially harmful omissions.

The results are sobering. Each erroneous draft was missed by 65-75% of participants. On average, physicians missed 2.67 out of 4 errors, a miss rate of roughly two-thirds (reported as 66.6%). Even more alarming, 35-45% of these erroneous AI-generated drafts were submitted entirely unedited. In other words, in a simulated environment, between roughly a third and just under half of the erroneous drafts would have gone to patients without any human correction.

Summary of key risk findings from recent studies on AI-generated patient portal message drafts.
Risk Metric	Finding	Source
Error miss rate (PCPs)	66.6% (2.67 of 4 errors missed on average)	npj Digital Medicine (Biro et al., 2025)
Range of PCPs missing each error	65-75%	npj Digital Medicine (Biro et al., 2025)
Erroneous drafts sent unedited	35-45%	npj Digital Medicine (Biro et al., 2025)
Hallucination rate in AI drafts	~6%	JAMA Network Open (cited in npj study)
Severe harm risk from hallucinations	7.1%	JAMA Network Open (cited in npj study)

The disconnect between clinician perception and performance is stark. While 75% of participants believed the AI drafts were safe, their actual error detection rate tells a different story. This gap is not a failure of individual clinicians — it is a predictable consequence of how human cognition interacts with AI systems in high-stakes, high-volume environments.

Why Clinicians Miss Errors: Automation Complacency, Confirmation Bias, and Functional Fixedness

The npj Digital Medicine study identifies three cognitive biases that explain why experienced, well-intentioned clinicians fail to catch AI-generated errors. These are not unique to healthcare — they are well-documented phenomena in human-AI interaction across aviation, manufacturing, and finance — but their consequences in clinical communication are uniquely severe.

Figure: Three cognitive biases that undermine human review of AI-generated clinical content (automation complacency, confirmation bias, and functional fixedness).

Automation complacency: When a system performs reliably most of the time, humans naturally reduce their vigilance. With a hallucination rate of about 6%, roughly 94% of drafts contain no hallucination, so the clinician's brain learns to trust the output. This is not laziness; it is a cognitive efficiency mechanism that backfires when the rare error occurs.
Confirmation bias: Once a clinician reads a plausible draft, they tend to look for evidence that confirms its correctness rather than searching for errors. The draft sets an expectation, and the reviewer's attention is directed toward verifying that expectation rather than challenging it.
Functional fixedness: The AI draft presents a complete response path. The clinician's cognitive load is reduced because they do not have to construct a response from scratch — but this also means they are less likely to consider alternative, potentially more accurate responses. The provided path becomes the only path.

These biases are amplified by the very feature that makes AI drafts attractive: they reduce cognitive load. When a clinician is already exhausted from a full day of patient visits and facing an inbox of 50 messages, the temptation to quickly review and approve a well-written draft is immense. The system is designed to make the easy path the default path — and that is precisely where the danger lies.
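A back-of-envelope calculation shows why this matters at inbox scale. The sketch below multiplies three figures quoted in this article: an inbox of 50 messages, a draft hallucination rate of about 6%, and a 66.6% reviewer miss rate. Treating these as independent, and assuming the simulated miss rate carries over to real-world hallucinations, are assumptions made here for illustration and are not findings of either study. The function name is hypothetical.

```python
def expected_missed_errors(messages: int,
                           draft_error_rate: float,
                           reviewer_miss_rate: float) -> float:
    """Expected number of flawed drafts that pass review undetected.

    ASSUMPTION: every message gets an AI draft, errors occur independently,
    and the reviewer miss rate is the same for every error.
    """
    if messages < 0:
        raise ValueError("messages must be non-negative")
    for rate in (draft_error_rate, reviewer_miss_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must be between 0 and 1")
    return messages * draft_error_rate * reviewer_miss_rate


# 50 messages, ~6% hallucination rate, 66.6% miss rate (figures from this article).
per_inbox = expected_missed_errors(50, 0.06, 0.666)
print(round(per_inbox, 1))  # 2.0
```

Under these assumptions, about two flawed drafts per 50-message inbox would pass review unnoticed. The point is the order of magnitude, not the exact value, which local audit data should replace.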

A Governance Framework for Safe Deployment

The evidence does not argue against deploying generative AI for portal messaging. It argues for deploying it with a governance framework that acknowledges and mitigates the documented risks. A human-in-the-loop is not a checkbox — it is a system that must be designed, trained, and audited to function effectively.

Figure: Four essential components of a governance framework for safe AI deployment in clinical messaging (human review, error detection, clinician training, and organizational policy).

A structured governance framework for generative AI in patient portal messaging.
Governance Component	Key Actions	Rationale
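Mandatory Human Review	Require clinician review and approval for every AI-generated draft before sending. Prohibit auto-send functionality.	Guarantees a human checkpoint for every message. Review alone is not sufficient: the 35-45% of erroneous drafts sent unedited occurred with a clinician in the loop, which is why the remaining components are needed.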
Error Detection Design	Implement forcing functions (e.g., require clinicians to explicitly confirm they have reviewed for errors). Use structured review checklists.	Counteracts automation complacency by making the review process deliberate rather than passive.
Clinician Training	Train clinicians on the specific types of errors AI makes (hallucinations, omissions). Educate on cognitive biases.	Prepares clinicians to look for the right kinds of errors rather than reviewing with generic attention.
Organizational Monitoring	Conduct regular audits of AI-generated responses. Track error rates and near-misses. Establish escalation pathways for identified errors.	Creates a feedback loop for continuous improvement and accountability.
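The forcing-function idea in the table can be sketched as a send gate. This is a minimal illustration under stated assumptions, not any vendor's or EHR's API: the checklist items, class, and function names are all hypothetical. The gate refuses to release a message until every checklist item is explicitly confirmed, and it records whether the draft went out unedited so that audits can track that rate.

```python
from dataclasses import dataclass, field

# Hypothetical checklist items; an organization would define its own.
CHECKLIST = (
    "medication_names_and_doses_verified",
    "patient_question_addressed",
    "no_critical_omissions",
    "no_overly_definitive_language",
)


@dataclass
class DraftReview:
    """Review state for one AI-generated draft (hypothetical structure)."""
    draft_text: str
    final_text: str
    confirmed: set[str] = field(default_factory=set)

    def confirm(self, item: str) -> None:
        """Record an explicit confirmation of one checklist item."""
        if item not in CHECKLIST:
            raise ValueError(f"unknown checklist item: {item}")
        self.confirmed.add(item)

    def missing_items(self) -> list[str]:
        """Checklist items the clinician has not yet confirmed."""
        return [item for item in CHECKLIST if item not in self.confirmed]

    def sent_unedited(self) -> bool:
        """True when the final message is identical to the AI draft."""
        return self.final_text.strip() == self.draft_text.strip()


def release(review: DraftReview) -> dict:
    """Allow sending only when every checklist item has been confirmed."""
    missing = review.missing_items()
    if missing:
        raise PermissionError(f"review incomplete: {missing}")
    # The unedited flag is logged for audit, not used to block sending.
    return {"released": True, "sent_unedited": review.sent_unedited()}
```

Calling `release` before all items are confirmed raises `PermissionError`; after confirmation it returns an audit record. A checklist can itself become a reflexive click-through, so the logged `sent_unedited` flag matters as much as the gate.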

Recommendations for Health Systems Deploying or Evaluating These Tools

For health IT leaders, primary care administrators, and clinical informaticists who are evaluating or already deploying generative AI for portal message drafting, the evidence supports a clear set of actionable recommendations. These are not theoretical — they are directly derived from the documented failure modes identified in the npj Digital Medicine and UPenn studies.

Conduct local error-rate audits before full deployment. The 66.6% miss rate and 35-45% unedited draft rate come from a simulation study with 20 PCPs. Your organization's clinicians may perform differently — but you will not know unless you measure it. Run a controlled pilot with embedded errors to establish your baseline error detection rate.
Implement structured review protocols. Do not rely on unstructured "please review carefully" instructions. Provide clinicians with specific checklists: verify medication names and dosages, confirm that the response addresses the patient's specific question, check for omitted critical information, and flag any language that sounds overly definitive or diagnostic.
Train clinicians on cognitive bias risks. Most clinicians have never been taught about automation complacency, confirmation bias, or functional fixedness. A short training module (for example, 30 minutes) that explains these biases and provides concrete strategies for countering them is a low-cost starting point; whether it improves error detection should be checked against the audit baseline.
Establish clear escalation pathways for AI errors. When a clinician identifies an error in an AI-generated draft, there must be a clear process for reporting it. This creates a feedback loop that allows the organization to track error patterns, adjust prompts, and improve the system over time.
Continuously monitor real-world performance. The 6% hallucination rate and 7.1% severe harm risk are averages. Your organization's actual rates may differ based on patient population, message complexity, and the specific AI model used. Ongoing monitoring is essential for maintaining safety.
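The audit and monitoring recommendations above come down to computing a few rates from seeded-error reviews and reporting their uncertainty. The sketch below is a minimal illustration: the record structure, function names, and pilot data are all hypothetical. It uses a Wilson score interval, a standard method for proportions from small samples, which matters because a pilot of a few dozen reviews yields a wide interval. It treats reviews as independent, which is a simplification when one clinician reviews several seeded drafts.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SeededReview:
    """One clinician reviewing one draft with a deliberately embedded error."""
    clinician_id: str
    error_caught: bool
    sent_unedited: bool


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def summarize(reviews: list[SeededReview]) -> dict:
    """Miss rate, its interval, and unedited-send rate for a pilot."""
    n = len(reviews)
    if n == 0:
        raise ValueError("no reviews to summarize")
    missed = sum(not r.error_caught for r in reviews)
    unedited = sum(r.sent_unedited for r in reviews)
    return {
        "n": n,
        "miss_rate": missed / n,
        "miss_rate_ci95": wilson_interval(missed, n),
        "unedited_rate": unedited / n,
    }


# HYPOTHETICAL pilot data: 60 seeded reviews, 36 errors missed,
# 24 of those sent with no edits at all.
reviews = (
    [SeededReview(f"c{i}", False, True) for i in range(24)]
    + [SeededReview(f"c{i}", False, False) for i in range(24, 36)]
    + [SeededReview(f"c{i}", True, False) for i in range(36, 60)]
)
print(summarize(reviews))
```

Re-running the same summary after a training module or a workflow change gives a like-for-like comparison against the baseline, provided the seeded errors are of comparable difficulty.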

The core thesis of this evidence review is not that generative AI should be avoided in clinical messaging. The benefits — improved empathy, better readability, reduced cognitive load, and significant time savings — are too substantial to ignore. The thesis is that safe deployment requires governance, not trust. The 66.6% error miss rate is not an argument against the technology; it is an argument for designing systems that account for the predictable ways human cognition interacts with AI. When health systems treat human-in-the-loop as a governance system to be engineered rather than a box to be checked, they can capture the benefits of generative AI while protecting patients from its documented risks.
