AI Patient Engagement Chatbots: Clinical Trial Evidence and Outcomes

The Therabot trial: what it showed and what it didn’t

In March 2025, the first randomized controlled trial of a fully generative AI therapy chatbot was published in NEJM AI. The trial randomized 210 adults who had clinically significant symptoms of major depressive disorder (MDD) or generalized anxiety disorder (GAD), or who were at clinically high risk for feeding and eating disorders. About half were given access to Therabot; the rest were assigned to a waitlist. At 4 and 8 weeks, the chatbot group showed significantly larger symptom reductions across all three conditions. Effect sizes were large: d = 0.845–0.903 for MDD, d = 0.794–0.840 for GAD, and d = 0.627–0.819 for clinically high-risk feeding and eating disorders. Average use exceeded six hours, and participants rated the therapeutic alliance with the chatbot as comparable to what they would expect from a human therapist.

Let me pause here. Those numbers impress me, but I need to say what they don’t mean. The outcomes were self-reported on the PHQ-9 and GAD-7 scales, not clinician-rated. The control was a waitlist, not active therapy, so the trial cannot say how the chatbot compares with a human therapist or any other active treatment. The sample was 210 carefully selected adults. None of this establishes effectiveness across diverse populations in a busy clinic.
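For readers less familiar with the d values quoted above, here is a minimal sketch of how a Cohen's d is typically computed: the difference in group means divided by the pooled standard deviation. The function name and the score lists are hypothetical and are not trial data, and the trial's own estimation method may well differ from this textbook form.

```python
import math
import statistics


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    var1 = statistics.variance(treatment)  # sample variance (n - 1 denominator)
    var2 = statistics.variance(control)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd


# Hypothetical symptom-score reductions (baseline minus follow-up).
# Illustrative values only, NOT data from the Therabot trial.
chatbot_change = [6, 8, 5, 7, 9, 4, 6, 7]
waitlist_change = [2, 3, 1, 4, 2, 3, 0, 2]

d = cohens_d(chatbot_change, waitlist_change)
print(f"Cohen's d = {d:.2f}")
```

By the usual rule of thumb, d around 0.2 is small, 0.5 is medium, and 0.8 is large, which is why the trial's values read as large. A large d against a waitlist still says nothing about performance against an active comparator.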

Why 90% engagement is not enough

Outside the controlled trial, vendor data paint a rosier picture. Commure Memora, a conversational AI system deployed in health systems, reports patient engagement rates over 90% and care plan adherence as high as 97% for some clients. I treat these numbers with caution. The data are vendor-reported, and I don’t know what “engagement” meant: whether a patient opened a message, replied, completed an action, or actually improved clinically. Engagement is a proxy, not an endpoint.

A focused review of AI-based medication adherence tools found that they improved adherence by 6.7% to 32.7% compared with controls. That sounds promising until you see that only seven studies were eligible, and they carried moderate to high risk of bias. The evidence is still weak.

A narrative review of AI in chronic disease self-management found that conversational AI personalization improved the mean engagement ratio for diabetes education from 0.26 to 0.31, a modest gain. The review was not systematic, and no formal quality assessment was conducted. These studies agree on direction (positive but modest), but none proves causation.
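To make "modest" concrete, the short calculation below separates the absolute and relative change in the engagement ratio reported above. The variable names are mine; the two ratios are the only inputs taken from the review.

```python
baseline_ratio = 0.26      # mean engagement ratio without personalization
personalized_ratio = 0.31  # mean engagement ratio with personalization

absolute_gain = personalized_ratio - baseline_ratio
relative_gain = absolute_gain / baseline_ratio

print(f"absolute gain: {absolute_gain:.2f}")  # prints "absolute gain: 0.05"
print(f"relative gain: {relative_gain:.1%}")  # prints "relative gain: 19.2%"
```

A relative gain near 19% can sound substantial in a headline, but in absolute terms the ratio moved by 0.05 and still sits below one third.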

What we don’t know about safety

The Therabot trial did not test what happens when the chatbot gives harmful advice, and its waitlist design did not test escalation failures. None of the sources I reviewed reports systematic monitoring of adverse events from conversational AI in patient engagement. That silence is a signal.

The recurrent challenges identified across reviews include:

Data privacy and security risks — patient conversations are health data, and few platforms publish their privacy architecture.
Algorithmic bias — training data that underrepresents certain demographics can produce systematically worse responses for those groups.
Limited generalizability — most studies are single-site or use convenience samples.
Lack of clear escalation pathways — when the chatbot cannot help or makes a dangerous suggestion, who steps in, and how quickly?

My judgment: promise with oversight

The Therabot RCT shows that a fully generative AI chatbot can produce clinically meaningful symptom reductions in a controlled setting. Real-world engagement numbers from Commure Memora suggest patients will use these tools. The safety concerns — hallucination, bias, missing escalation paths — are real but not insurmountable.

I am not ready to accept that these results generalize to routine care. I would not deploy any of these tools without clear oversight: a human in the loop, adverse event monitoring, and an explicit plan for what the chatbot does when it does not know the answer. That is not skepticism for its own sake; it is what the evidence demands. The moment independent replication studies in diverse populations show consistent safety and efficacy, I will change my judgment.
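As a rough illustration of what "an explicit plan" might look like in software, the sketch below routes each drafted reply to one of three outcomes. Everything in it is hypothetical: the phrase list, the confidence score, the threshold, and the names are placeholders of my own, not any vendor's design, and a real deployment would need clinically validated triggers rather than keyword matching.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    REPLY = "reply"                          # send the chatbot's draft as-is
    HUMAN_REVIEW = "human_review"            # hold the draft for a clinician
    URGENT_ESCALATION = "urgent_escalation"  # alert on-call staff immediately


# Hypothetical trigger phrases, for illustration only.
URGENT_PHRASES = ("suicide", "kill myself", "overdose")


@dataclass
class Draft:
    patient_message: str
    reply_text: str
    confidence: float  # hypothetical model confidence score in [0, 1]


def route(draft: Draft, min_confidence: float = 0.8) -> Route:
    """Decide what happens to a drafted reply before the patient sees it."""
    text = draft.patient_message.lower()
    if any(phrase in text for phrase in URGENT_PHRASES):
        return Route.URGENT_ESCALATION
    if draft.confidence < min_confidence:
        return Route.HUMAN_REVIEW
    return Route.REPLY


# Usage: a low-confidence draft is held for human review instead of being sent.
uncertain = Draft("Can I double my insulin dose?", "(draft reply)", confidence=0.4)
print(route(uncertain).value)  # prints "human_review"
```

The point of the sketch is the ordering: urgent signals are checked first, and uncertainty defaults to a human rather than to the chatbot's best guess.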

For a deeper look at the financial and operational side of AI patient engagement, see our separate analysis on the business case. The clinical evidence and the ROI story are complementary — but they should not be confused.
