
The Paradox: High Interest, Thin Evidence
Primary care is the front door of most health systems, accounting for roughly half of all healthcare visits. It is also the clinical domain where the gap between AI's technical promise and its proven impact is widest. Walk into any primary care conference or read the vendor press releases, and you will encounter a landscape of ambient scribes, AI-assisted ECG interpretation, diabetic retinopathy screening platforms, and clinical decision support tools. The enthusiasm is palpable. The evidence, however, is not.
This article is a critical appraisal of that evidence gap. It does not survey individual tools or grade their performance; that work is done in our companion piece, AI in Primary Care: An Evidence-Graded Guide to What Works, What Doesn't, and Where the Data Stands. Instead, it asks a more fundamental question: how much rigorous evidence actually exists to support the deployment of AI in primary care, and why is the gap between technical performance and clinical impact so persistent?
What the Evidence Actually Shows: A 2025 Scoping Review
The most comprehensive map of the primary care AI evidence landscape comes from a 2025 scoping review by Katonai et al., published in BJGP Open. The review searched PubMed, Web of Science, and Scopus up to April 13, 2024, and identified 5,224 records. After screening, only 73 empirical studies met the inclusion criteria — a yield of just 1.4%. This number alone signals the field's immaturity.
The 73 studies clustered into four thematic groups, revealing where the field's attention is concentrated and where it is not.
| Theme | Number of Studies | Percentage | Focus Area |
|---|---|---|---|
| Acceptance and implementation experiences | 24 | 33% | Clinician attitudes, workflow barriers, trust, usability |
| Early intervention and decision support | 21 | 29% | Diagnostic AI, risk prediction, clinical decision support |
| Chronic disease management | 16 | 22% | Diabetes, hypertension, cardiovascular risk, COPD |
| Operations and patient management | 12 | 16% | Appointment scheduling, triage, documentation, population health |
The largest single cluster — 33% of all studies — focused on acceptance and implementation experiences, not clinical outcomes. This is a revealing signal: the field is spending more energy studying whether clinicians will use AI tools than measuring whether those tools improve patient health. The remaining studies were split across early intervention, chronic disease, and operations, with most being single-center pilots of limited duration.
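The percentages above follow directly from the reported counts. As a minimal sketch for readers who want to check the arithmetic (the variable and function names are illustrative, and the counts are simply those quoted in this article):

```python
# Counts as quoted in this article for the Katonai et al. scoping review.
records_identified = 5224
themes = {
    "Acceptance and implementation experiences": 24,
    "Early intervention and decision support": 21,
    "Chronic disease management": 16,
    "Operations and patient management": 12,
}


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# The four theme counts should add up to the number of included studies.
studies_included = sum(themes.values())
screening_yield = percent(studies_included, records_identified)

print(f"Included studies: {studies_included}")
print(f"Screening yield: {screening_yield:.1f}%")
for name, count in themes.items():
    print(f"{name}: {count} ({percent(count, studies_included):.0f}%)")
```

The theme counts sum to the 73 included studies, and the rounded shares match the table.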
Within these studies, AI tools frequently demonstrated strong technical accuracy. The review documented several illustrative examples: an AI-interpreted ECG program that raised low-ejection-fraction heart failure diagnoses from 1.6% to 2.1%, with frequent users twice as likely to detect the condition; a combined AI ECG-stethoscope that identified reduced ejection fraction with 92% sensitivity and 80% specificity; an AI-assisted telemedicine platform that detected urgent retinal disease with 97% sensitivity and 99% specificity while cutting workload by 96%; ambient voice technology that decreased documentation time by 28.8%; and diabetic retinopathy screening systems achieving sensitivities of 87%–100% and specificities of 89%–98%.
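One reason strong accuracy figures do not translate automatically into clinical impact is that the usefulness of a positive result depends on how common the condition is in the population being tested. The sketch below applies Bayes' rule to the sensitivity and specificity quoted above for the AI ECG-stethoscope. The prevalence values are hypothetical, chosen only for illustration, and do not come from the scoping review.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Sensitivity and specificity as quoted in the article (92% / 80%).
SENSITIVITY = 0.92
SPECIFICITY = 0.80

# Hypothetical prevalences (assumptions, not figures from the review).
for prevalence in (0.01, 0.02, 0.05, 0.10):
    ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

Under these assumptions, at a 2% prevalence fewer than one in ten positive results would be a true positive, which is why the same tool can look very different in an unselected primary care population than in a validation cohort.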
The scoping review's most important finding, however, was not about performance — it was about the gap between performance and implementation. Across all 73 studies, the review consistently identified five barriers that prevented AI tools from moving from technical validation into routine primary care workflows: usability challenges, workflow misalignment, trust deficits among clinicians and patients, equity gaps, and financial constraints. These barriers are not peripheral concerns; they are the central obstacle to realizing AI's potential in primary care.
The RCT Evidence Gap: 39 Trials Across All Medicine
If the scoping review reveals the breadth of the evidence base, the systematic review of AI randomized controlled trials reveals its depth, or lack thereof. A 2022 systematic review by Lam et al., published in the Journal of Medical Internet Research, searched for AI RCTs published up to July 2021. From 11,839 retrieved articles, only 39 RCTs (0.33%) met the inclusion criteria. These 39 trials spanned 13 clinical specialties, but the distribution was heavily skewed: the largest single group was gastroenterology and endoscopy (15 studies on AI-assisted polyp detection), not primary care.
| Metric | Finding |
|---|---|
| Total AI RCTs identified | 39 out of 11,839 articles (0.33%) |
| Specialties represented | 13, with gastroenterology/endoscopy dominant (15 studies) |
| AI outperformed control | 77% (30/39) of RCTs |
| Improved clinically relevant outcomes | 70% (21/30) of those showing AI benefit |
| Improved long-term patient outcomes | 0 studies |
| Single-center studies | 59% (23/39) |
| Sample size < 1,000 | 90% (35/39) |
| High risk of bias | 49% (19/39) |
The headline figure (77% of AI RCTs showed AI-assisted interventions outperforming usual care) is often cited as evidence of AI's effectiveness. But the details matter. None of the 39 trials demonstrated improved long-term patient outcomes. Most were single-center (59%), had sample sizes under 1,000 (90%), and nearly half (49%) had a high risk of bias. The studies were concentrated in China and the United States, raising questions about generalizability to other healthcare systems. And the search cutoff (July 2021) means this review does not capture AI RCTs published since then.
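With only 39 trials, each of these proportions carries substantial statistical uncertainty on top of the design concerns. As a rough sketch, the code below computes an approximate 95% Wilson score interval for several rows of the table. Treating the trials as a simple binomial sample is an assumption made here for illustration; it is not an analysis performed by Lam et al.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# (count, total) pairs as quoted in the table above.
rct_findings = {
    "AI outperformed control": (30, 39),
    "Single-center studies": (23, 39),
    "Sample size < 1,000": (35, 39),
    "High risk of bias": (19, 39),
}

for label, (count, total) in rct_findings.items():
    low, high = wilson_interval(count, total)
    print(f"{label}: {count}/{total} = {count / total:.0%} "
          f"(approx. 95% interval {low:.0%} to {high:.0%})")
```

Under this simplification, the interval around the 77% headline figure runs from roughly 62% to 87%, a reminder that a single percentage from 39 trials is a coarse estimate.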
For primary care specifically, the RCT evidence is even thinner. The Lam review did not break out primary care as a separate category, but the dominance of gastroenterology studies suggests that primary care AI RCTs represent a small fraction of an already small total. This is not surprising: primary care's broad, undifferentiated clinical scope makes it harder to design the kind of focused, single-disease RCT that dominates the AI trial landscape.
The FDA Approval Mismatch: 3% for 50% of Care
The evidence gap is mirrored in the regulatory landscape. According to the Stanford Healthcare AI Applied Research Team (HEA₃RT), only 3% of FDA-approved AI/ML medical devices are designed for primary care, despite primary care accounting for approximately half of all healthcare delivery. This figure is based on 2023–2024 FDA data and may have shifted since, but the structural mismatch remains stark.
The contrast with radiology is instructive. Radiology AI devices dominate FDA clearances, accounting for roughly 70–80% of all AI/ML-enabled medical device authorizations. This makes sense: radiology deals with standardized image formats, well-defined diagnostic tasks, and established ground truth — conditions that are ideal for AI development and regulatory approval. Primary care, by contrast, deals with undifferentiated symptoms, multimorbidity, and longitudinal relationships — conditions that are far harder to model, validate, and clear.
This mismatch reflects both market incentives and clinical complexity. From a market perspective, radiology and cardiology offer clearer return on investment: a single AI tool can be deployed across hundreds of imaging centers with a standardized workflow. Primary care AI tools, by contrast, must integrate into diverse EHR systems, accommodate variable workflows, and address a heterogeneous patient population. From a clinical perspective, primary care's scope — prevention, acute care, chronic disease management, mental health, social determinants — resists the narrow, single-task AI models that dominate FDA clearances.
Why Primary Care Is Different: Undifferentiated Symptoms, Multimorbidity, and Relational Continuity
The evidence gap is not simply a matter of insufficient investment or research attention. There are structural reasons why primary care AI evidence lags behind specialty AI, and understanding these reasons is essential for interpreting the evidence that does exist and for designing better studies in the future.
- Undifferentiated symptoms. A patient presenting with chest pain, fatigue, or dizziness could have any of dozens of conditions spanning multiple organ systems. Most AI models are trained for single, well-defined diagnostic tasks (e.g., "is this chest X-ray abnormal?"). Primary care requires a differential diagnosis approach that current AI architectures are not designed to support.
- Multimorbidity. Over 40% of primary care patients have two or more chronic conditions. Single-disease AI tools — a diabetes management algorithm, a hypertension risk predictor — cannot account for the interactions between conditions, polypharmacy effects, and competing treatment priorities that define real primary care decision-making.
- Relational continuity. Primary care is built on longitudinal relationships and shared decision-making. Many clinical decisions — whether to start a statin, how aggressively to treat blood pressure, when to refer to a specialist — depend on patient preferences, values, and life context that are invisible to AI models trained on structured data alone.
These three factors — undifferentiated symptoms, multimorbidity, and relational continuity — are not obstacles that better algorithms will solve. They are fundamental features of primary care that require fundamentally different approaches to AI development, validation, and implementation. For a deeper exploration of how AI clinical decision support tools are being adapted to these realities, see our article AI Clinical Decision Support in Primary Care: Evidence, Applications, and Deployment Realities.
Barriers to Real-World Implementation: Usability, Trust, Equity, and Cost

The Katonai et al. scoping review identified implementation barriers as the most studied theme in the primary care AI literature — 33% of all studies focused on acceptance and implementation experiences. This is not accidental. The barriers are real, persistent, and well-documented across multiple studies and settings.
- Usability and workflow misalignment. Primary care visits are fast-paced, multi-problem encounters. An AI tool that requires additional clicks, extra screens, or workflow interruptions will be abandoned regardless of its technical accuracy. The scoping review consistently found that tools designed without frontline clinician input failed to integrate into real workflows.
- Trust deficits. Clinicians and patients alike express concerns about AI transparency, accountability, and the potential for errors. A survey cited in the scoping review found that 85.7% of primary care professionals understood AI, and 91.4% expressed interest in training — but understanding and trust are not the same thing. Trust requires evidence of reliability in the specific context where the tool will be used.
- Equity gaps. AI tools trained on datasets that underrepresent minority populations, non-English speakers, and patients with low health literacy risk widening existing disparities. This is particularly acute in primary care, which serves as the medical home for the most vulnerable populations. For a detailed case study of equity challenges in a specific AI deployment, see our article AI Diabetic Retinopathy Screening at Community Health Centers: Implementation Realities, Equity Evidence, and Deployment Guidance.
- Financial constraints. Primary care operates on thin margins. AI tools require upfront investment in software, hardware, integration, and training, with uncertain return on investment. The lack of dedicated reimbursement codes for AI-assisted primary care services — unlike the emerging reimbursement pathways for AI in radiology — creates a structural disincentive for adoption.
The Path Forward: Pragmatic Trials, Co-Design, and Anticipatory Planning
The evidence gap is real, but it is not inevitable. The path forward requires a deliberate shift from technology-push to evidence-pull — from building AI tools and then looking for problems to solve, to identifying primary care's most pressing challenges and designing AI solutions that address them within the constraints of real-world workflows.
- Pragmatic randomized trials. The next generation of AI evidence must come from pragmatic trials embedded in real primary care workflows, not from lab-validated pilots. These trials should measure patient-centered outcomes (symptom resolution, quality of life, care coordination) rather than just technical accuracy metrics. They should be multi-center, adequately powered, and designed to detect differential effects across patient subgroups.
- Co-design with frontline clinicians and patients. The Stanford HEA₃RT model, in which implementation scientists act as the bridge between data scientists and frontline clinicians, offers a replicable framework. AI tools should be designed from the outset with input from the clinicians who will use them and the patients who will be affected by them.
- Anticipatory planning for equity, reimbursement, and regulation. Equity considerations should be built into AI development, not added as an afterthought. Reimbursement models need to evolve to cover AI-assisted primary care services. Regulatory pathways need to accommodate the iterative, adaptive nature of AI models in primary care settings.
- Low-risk operational entry points. Not every AI application needs to be a diagnostic breakthrough. Operational AI — such as appointment no-show prediction — offers a lower-risk entry point for building institutional AI capability. An NHS pilot of AI-based appointment no-show prediction (Deep Medical) reduced missed appointments by 30% over six months, enabling nearly 2,000 additional appointments. With unattended GP appointments costing the NHS an estimated £216 million annually and approximately 7.2 million appointments missed each year as of 2019, even modest operational improvements can generate significant value.
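The no-show figures in the last point lend themselves to a quick back-of-envelope calculation. The sketch below derives the implied cost per missed GP appointment from the two NHS figures quoted above and applies it to the pilot's additional appointments. The national scenario at the end is purely hypothetical: it assumes the pilot's 30% reduction would hold everywhere, which the pilot itself does not establish.

```python
# Figures as quoted in the article.
annual_cost_gbp = 216_000_000      # estimated annual cost of unattended GP appointments
missed_per_year = 7_200_000        # approximate missed appointments per year (2019)
pilot_extra_appointments = 2_000   # "nearly 2,000" additional appointments in the pilot
pilot_reduction = 0.30             # reduction in missed appointments reported by the pilot

# Implied average cost of one missed appointment.
cost_per_missed_gbp = annual_cost_gbp / missed_per_year


def avoided_cost(missed, reduction, unit_cost):
    """Cost avoided if `reduction` (a fraction) of `missed` appointments are recovered."""
    return missed * reduction * unit_cost


pilot_value_gbp = pilot_extra_appointments * cost_per_missed_gbp

# Hypothetical scenario (an assumption, not a forecast): the pilot's reduction applied nationally.
national_scenario_gbp = avoided_cost(missed_per_year, pilot_reduction, cost_per_missed_gbp)

print(f"Implied cost per missed appointment: £{cost_per_missed_gbp:.0f}")
print(f"Approximate value of pilot appointments: £{pilot_value_gbp:,.0f}")
print(f"Hypothetical national scenario: £{national_scenario_gbp:,.0f}")
```

The point of the exercise is scale rather than precision: the unit cost is an average, and recovered appointments do not convert one-for-one into savings.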
The field also needs to learn from what has not worked. The scoping review's finding that 33% of studies focused on implementation barriers is itself a form of evidence: it tells us that the primary care AI community has recognized that technical accuracy is not the bottleneck. The bottleneck is implementation. Future research should prioritize understanding the conditions under which AI tools are adopted, sustained, and scaled — not just whether they work in controlled settings.
Conclusion: Closing the Gap Between Promise and Proof

The evidence base for AI in primary care is thinner than the deployment enthusiasm suggests. Three headline figures capture the gap: 73 empirical studies from a 2025 scoping review, 39 AI RCTs across all medicine as of mid-2021, and 3% of FDA-approved AI/ML devices designed for primary care. These numbers are not arguments against AI in primary care. They are arguments for rigor.
The gap is measurable and addressable. It does not exist because AI cannot work in primary care: the technical performance data from the scoping review shows that AI can achieve impressive accuracy for specific tasks. It exists because the conditions for successful AI deployment in primary care are fundamentally different from those in radiology, cardiology, or gastroenterology. Undifferentiated symptoms, multimorbidity, relational continuity, and the implementation barriers of usability and workflow fit, trust, equity, and cost create a context where technical accuracy is necessary but far from sufficient.
The next phase of primary care AI must prioritize evidence generation over technology push. Pragmatic trials, co-design with clinicians and patients, anticipatory planning for equity and reimbursement, and low-risk operational entry points offer a path forward. The field has the tools and the talent to close the gap. What it needs now is the discipline to measure what matters — not just what works in a lab, but what improves care in the exam room.
