5.6 Minutes, 18 Seconds, or Nothing
The pitch is two to five minutes saved per encounter. The data, once you get past the slide decks, tells a different story. A 2026 study in npj Digital Medicine tracked three large health systems using the same product category. Mass General Brigham: a median reduction of 5.6 minutes per appointment. The Permanente Medical Group: 18 seconds. Intermountain Health: no statistically significant gain. That is a 20× spread for a single claimed benefit. That variance is not a measurement error. It is the signal the industry would prefer you miss.

The Variance Isn't Noise
The three sites did not use the same methodology. MGB used a matched-cohort design comparing heavy EHR users to themselves before and after deployment, and found savings concentrated in specialty practices. Permanente compared ambient-scribe users to non-users within the same medical group, with minimal adjustment for selection bias. Intermountain used a rigorous matched cohort and found no productivity gain. The Cleveland Clinic, in a health-system publication, reported 2 minutes saved per appointment and 14 minutes per day among 4,000 providers who voluntarily adopted Ambience Healthcare’s AI Scribe. That is a large, enthusiastic deployment, but not a controlled study.
| Site | Time savings | Study design | Key caveat |
|---|---|---|---|
| Mass General Brigham | 5.6 min/appt | Matched cohort (heavy EHR users) | Savings concentrated in specialty practices; may not generalize to all physicians |
| Permanente Medical Group | 18 s/appt | User vs. non-user comparison | No randomization; minimal confounder adjustment |
| Intermountain Health | No significant gain | Matched cohort | Most rigorous design; null result |
| Cleveland Clinic | 2 min/appt, 14 min/day | Health-system publication (no control) | High adoption (76% of visits) but self-selected users |

What Actually Replicates
While time savings are inconsistent, the evidence for improved clinician well-being is much more coherent. The same npj Digital Medicine study reported that 84% of clinicians at Permanente felt ambient scribes had a positive impact on visit interactions, and 56% of patients reported a positive impact on visit quality. Microsoft’s own survey of 879 clinicians using DAX Copilot found 70% reported improved work-life balance and 80% reduced cognitive burden. Geisinger, when it deployed ambient scribes via Twofold Health, measured a 55% reduction in self-reported burnout. These numbers come from different measurement approaches – vendor surveys, health-system internal assessments, and peer-reviewed studies – but they point in the same direction. The primary value of ambient AI scribes appears to be cognitive load reduction, not time recovery.
The Hidden Cost: Note Bloat
Time supposedly saved at the point of care can be lost downstream. AI-generated notes tend to be longer, include irrelevant detail, and require manual editing. The Cleveland Clinic, despite high adoption (76% of scheduled visits by active users), did not report whether the 2 minutes saved per appointment was net of review time. When clinicians have to correct or trim notes, the claimed savings erode. Vendors rarely mention this. The cost is real, hard to measure, and systematically omitted from ROI calculations.
The same npj Digital Medicine paper introduced the SCRIBE framework (Safety, Clinical accuracy, Reliability, Integration, Bias, Efficiency) and CRAFT-MD scenario-based testing. These are constructive steps toward standardized evaluation. They focus on technical accuracy and usability, not on site-specific workflow impact or clinical outcomes. No international consensus exists yet. A framework that does not require matched cohorts or external validation of time savings across settings is incomplete. It is a start, not a solution.
The Necessary Step: Silent Testing with Matched Controls
Until the evidence base includes multisite replication with consistent methodology, any health system considering an ambient AI scribe should treat vendor benchmarks as marketing, not data. The practical step is a silent testing phase: deploy the tool with parallel manual entry, measure baseline and post-implementation time, burnout, and note length with matched controls, and set stopping rules for adoption. Without this, a system that works at MGB may save no time at your institution. The most actionable recommendation from the current evidence is not a product pick – it is a method.
For a deeper look at the operational factors that determine success or failure, see Barriers and Success Factors for Deploying Conversational AI in Clinical Workflows.
Comments
Join the discussion with an anonymous comment.