AI in Primary Care: Navigating the Evidence Gap Between ML and LLM Tools

Figure: A primary care physician consults with a patient while an ambient AI scribe captures the conversation. AI in primary care is evolving rapidly, but the strength of the evidence behind different types of tools varies dramatically.

Introduction: Two Eras of AI Clinical Decision Support in Primary Care

Primary care clinicians evaluating AI tools today face a confusing landscape. One day a new study shows a machine learning model improving heart failure diagnoses; the next, a large language model (LLM) claims to answer complex clinical questions at the level of a board-certified physician. The natural instinct is to ask which tool is "better," but that question misses the most clinically relevant distinction: the evidence gap between the two eras of AI development.

This article organizes AI clinical decision support system (CDSS) tools in primary care by their evidence maturity, not by their application class. The pre-2024 era produced machine learning models for ECG interpretation, diabetic retinopathy screening, and infection management that were tested in pragmatic randomized controlled trials (RCTs) and multi-center studies. The post-2024 era has introduced LLM-based systems, such as OpenEvidence and ChatGPT for Health, that demonstrate impressive performance on simulated tasks but have undergone minimal real-world clinical validation.

This framing is distinct from the existing comprehensive survey of AI CDSS applications in primary care, which organizes tools by disease type and clinical function. Here, the focus is on what the evidence actually supports—and where it falls short—so that adopters can make informed decisions about which tools to pilot, which to trust, and which to approach with structured evaluation protocols.

Figure: Two-era comparison of pre-2024 tools (AI-ECG, diabetic retinopathy screening, UTI management; labeled 'RCT') and post-2024 LLM tools (labeled 'Simulation'). The evidence gap between pre-2024 ML tools (RCT-backed) and post-2024 LLM tools (simulation-backed) is the most clinically relevant distinction for primary care adopters.

Part 1: Evidence-Validated ML Tools — What the RCTs Show

The strongest evidence for AI in primary care comes from machine learning models developed and tested before the generative AI boom. These tools target specific, well-defined clinical problems and have been evaluated in study designs that meet the evidence standards clinicians expect.

AI-Interpreted ECG for Low Ejection Fraction

A pragmatic RCT published in Nature Medicine (Yao et al. 2021) evaluated an AI algorithm applied to routine 12-lead ECGs to identify patients with low ejection fraction (EF) in primary care settings. The intervention arm used the AI-ECG tool during routine care; the control arm received usual care without AI interpretation. The result: new diagnoses of low EF were significantly more frequent with the AI-ECG program, at 2.1% in the intervention arm versus 1.6% under usual care. The absolute increase of 0.5 percentage points appears modest, but it represents a 31% relative improvement in detection rate for a condition that is frequently underdiagnosed in primary care.
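The difference between the absolute and relative change in this result can be made concrete with a short calculation. The sketch below uses only the two proportions quoted above (1.6% and 2.1%). The function name `effect_summary` is hypothetical, and the "extra diagnoses per 1,000 patients" figure is derived here for illustration; it is not a statistic reported by the trial.

```python
def effect_summary(control_rate: float, intervention_rate: float) -> dict:
    """Summarize the difference between two proportions.

    Rates are fractions (0.016 means 1.6%). Illustrative helper only;
    it is not taken from the trial's own analysis.
    """
    absolute = intervention_rate - control_rate
    relative = absolute / control_rate
    return {
        # Difference expressed in percentage points.
        "absolute_pct_points": round(absolute * 100, 2),
        # Difference expressed relative to the control rate.
        "relative_pct": round(relative * 100, 2),
        # Additional diagnoses expected per 1,000 patients screened.
        "extra_per_1000": round(absolute * 1000, 1),
    }


# Rates quoted in the text for the AI-ECG trial (Yao et al. 2021).
summary = effect_summary(control_rate=0.016, intervention_rate=0.021)
print(summary)
```

Reporting both figures side by side guards against reading a relative gain of roughly 31% as if it were a large absolute effect: under these rates, the tool would be expected to surface about five additional diagnoses per 1,000 patients.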

Diabetic Retinopathy Screening

AI-based diabetic retinopathy screening is one of the most extensively validated applications in primary care. Across multiple studies, AI systems have achieved sensitivities ranging from 87% to 100% and specificities from 89% to 98% for detecting referable diabetic retinopathy. The U.S. FDA has cleared several such devices for autonomous use, meaning they can provide a screening result without a specialist overread. This evidence base includes prospective multi-center trials and real-world deployment studies, making it one of the few AI applications with sufficient data to support broad primary care adoption.

AI for Urinary Tract Infection Management

A less widely known but methodologically strong example comes from an AI tool designed to support UTI management decisions in primary care. In a study spanning 36 practices, the AI-based CDSS improved treatment success rates from 75% to 84%, a gain of 9 percentage points and a 12% relative improvement. The tool provided antibiotic selection guidance based on local resistance patterns and patient-specific factors, addressing a common source of prescribing uncertainty in primary care.

Summary of pre-2024 ML tools with the strongest primary care evidence.
Application	Study Design	Key Finding	Evidence Quality
AI-ECG for low EF	Pragmatic RCT (Yao et al. 2021)	Diagnoses increased from 1.6% to 2.1%	High — RCT with real-world primary care population
Diabetic retinopathy screening	Multiple prospective multi-center studies	Sensitivity 87–100%, Specificity 89–98%	High — FDA-cleared, multiple independent validations
AI UTI management tool	Cluster study across 36 practices	Treatment success improved from 75% to 84%	Moderate-High — multi-site but not randomized at patient level

Part 2: LLM-Based CDSS — Promise on Simulated Tasks, Gaps in Real-World Validation

The post-2024 wave of AI tools for primary care is dominated by large language models. Products like OpenEvidence, ChatGPT for Health, and various EHR-integrated LLM copilots promise on-demand evidence retrieval, differential diagnosis generation, and clinical reasoning support. The performance claims are striking: some models score at or above the passing threshold for U.S. Medical Licensing Exam (USMLE) questions and can generate differential diagnoses that overlap substantially with those of attending physicians.

However, the Stanford-Harvard State of Clinical AI 2026 report (ARISE network) provides a sobering counterpoint. The report found that nearly half of more than 500 medical AI studies tested models using exam-style questions, and only 5% used real patient data. Very few studies measured whether models recognized their own uncertainty or examined bias and fairness. AI systems performed well on narrow benchmarks but declined when tested in settings resembling real clinical work—where follow-up questions are needed, information is incomplete, and the presentation is undifferentiated.

A notable exception is the collaboration between Penda Health and OpenAI in Kenya, where an AI system was used to review urgent care visits. According to the Stanford-Harvard report, this deployment reduced diagnostic and treatment errors across tens of thousands of patients. This real-world implementation stands out precisely because it is rare: most LLM-based CDSS tools have not been evaluated in actual clinical workflows with real patient outcomes as endpoints.

For a broader examination of the evidence base for generative AI in health, see the comprehensive evidence overview of generative AI in healthcare.

Part 3: Performance Gaps — Where the Evidence Falls Short for Primary Care

The evidence gap between pre-2024 ML tools and post-2024 LLM systems is not merely a matter of study design. It reflects fundamental differences in how these tools handle the clinical realities of primary care: undifferentiated presentations, low disease prevalence, and the need for probabilistic reasoning rather than pattern matching.

AI Triage Tools: Low Concordance with Physician Assessments

The Katonai et al. scoping review reported that AI triage tools matched physician assessments in only 17% of cases overall. This finding underscores a critical limitation: triage decisions in primary care require contextual understanding—knowing which patients can wait, which need same-day attention, and which require emergency referral—that current AI systems struggle to replicate. A tool that disagrees with clinician judgment 83% of the time is not ready for independent deployment, regardless of how well it performs on standardized test questions.

Skin cancer detection AI is a telling case of the mismatch between training data and primary care populations. The Lancet Primary Care review found that only 2 of 272 studies on AI skin cancer detection used training data from low-prevalence settings resembling primary care. The vast majority of models were trained on dermatology clinic datasets where disease prevalence is high and lesion types are preselected. In primary care, where most skin lesions are benign and malignant cases are rare, these models may perform very differently: even if sensitivity and specificity held steady, a positive result would be far less likely to indicate a true malignancy. This phenomenon is known as prevalence bias. For a detailed analysis of this specific application, see the evidence appraisal of AI-enabled skin lesion detection.
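The effect of prevalence on a positive result can be illustrated with Bayes' rule. The sketch below is a minimal example under stated assumptions: the sensitivity, specificity, and both prevalence values are hypothetical numbers chosen for illustration, not figures from the Lancet Primary Care review, and the function name `positive_predictive_value` is introduced here. It also assumes sensitivity and specificity stay fixed across settings, which may not hold when the case mix changes.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Probability that a positive result is a true positive (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical values chosen for illustration only.
SENSITIVITY = 0.90
SPECIFICITY = 0.90
settings = {
    "specialty clinic (assumed 20% prevalence)": 0.20,
    "primary care (assumed 1% prevalence)": 0.01,
}

for label, prevalence in settings.items():
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"{label}: PPV = {ppv:.1%}")
```

Under these assumed numbers, the same model goes from about 69% of positives being true positives in the high-prevalence setting to about 8% in the low-prevalence one, which is why accuracy figures from specialty datasets cannot be carried over to primary care without local checks.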

Generative AI Adoption Outpacing Validation

The Lancet Primary Care review also found that one in five UK general practitioners (GPs) used generative AI in clinical practice by 2024. This rapid adoption has occurred despite the limited validation evidence described above. The Doximity 2026 State of AI in Medicine survey of 3,151 U.S. physicians confirms the trend: 54% of physicians currently use AI in clinical practice, with adoption rising from 47% in April 2025 to 63% by January 2026. Among family medicine physicians, 58% report using AI, and 88% of those users employ it daily.

Key evidence gaps that primary care adopters must navigate.
Evidence Gap	Specific Finding	Source
AI triage concordance	AI matched physician assessments in only 17% of cases	Katonai et al. 2025 scoping review
Skin cancer AI training data	Only 2 of 272 studies used primary-care-relevant low-prevalence data	Lancet Primary Care review (Laranjo et al. 2025)
Generative AI adoption vs. validation	1 in 5 UK GPs used generative AI by 2024 despite limited validation	Lancet Primary Care review
Real patient data in AI studies	Only 5% of 500+ studies used real patient data; ~50% used exam-style questions	Stanford-Harvard State of Clinical AI 2026

Part 4: Key Concerns — Hallucination, Automation Bias, and Data Representativeness

Three concerns are particularly salient for primary care adoption of AI, especially LLM-based systems.

Hallucination Risk in Undifferentiated Presentations

LLMs are known to generate confident-sounding but factually incorrect outputs—a phenomenon called hallucination. In primary care, where patients often present with vague, undifferentiated symptoms (fatigue, dizziness, abdominal pain), the risk is amplified. A model trained on textbook cases may generate a plausible but incorrect differential diagnosis that omits the actual condition. Unlike a specialist who can say "I'm not sure, let's watch and wait," an LLM may produce a definitive-sounding answer that leads clinicians down the wrong diagnostic path.

Automation Bias and Over-Reliance

Automation bias—the tendency to over-rely on automated recommendations—is a well-documented phenomenon in clinical decision support. When an AI tool produces a recommendation that conflicts with a clinician's judgment, the asymmetry of influence can be problematic. The analysis of asymmetric CDSS influence on primary care physicians explores this dynamic in detail. The Doximity survey underscores the concern: 71% of all physicians surveyed cited accuracy and reliability as their top concern about AI, and 47% said their institution's AI policies are "still evolving."

Data Representativeness and Algorithmic Bias

The data representativeness problem is not limited to skin cancer AI. The finding that only 2 of 272 studies used primary-care-relevant training data reflects a broader issue: most medical AI models are trained on datasets from academic medical centers and specialty clinics, which do not reflect the demographic and clinical diversity of primary care populations. Models that perform well on curated datasets may fail when deployed in community health centers, rural practices, or settings serving marginalized populations. The glossary entry on algorithmic bias in healthcare AI provides definitions and mitigation frameworks for this concern.

Figure: Simulated exam-style data contrasted with messy real-world patient data. The gap between simulated and real-world data is the central challenge for LLM-based CDSS validation in primary care.

Future Directions and Practical Recommendations for Primary Care Adopters

The two-era framework suggests a practical path forward for primary care clinicians and health systems evaluating AI tools.

Prioritize tools with RCT-level evidence for specific use cases. AI-ECG for low EF detection, diabetic retinopathy screening, and AI-guided UTI management have demonstrated clinically meaningful improvements in real-world settings. These tools are ready for pilot deployment with appropriate monitoring.
Approach LLM-based CDSS with structured evaluation protocols. Before integrating tools like OpenEvidence or ChatGPT for Health into clinical workflows, health systems should define specific use cases, establish performance benchmarks using local data, and implement monitoring for hallucination and bias. The Stanford-Harvard report's finding that only 5% of studies used real patient data should give any adopter pause.
Address the data representativeness gap. When evaluating any AI tool, ask: What population was this model trained on? Does it include patients from primary care settings? Does it reflect the demographic diversity of my patient panel? If the answer to any of these questions is unclear or negative, the tool requires local validation before deployment.
Invest in ambient AI scribes as a lower-risk entry point. The evidence for ambient documentation AI is stronger than for diagnostic LLMs. The Permanente Medical Group's 63-week evaluation of 7,260 physicians using AI scribes across 2.5 million encounters found an estimated 15,791 hours of documentation time saved, with 84% of physicians reporting improved communication and 82% reporting improved work satisfaction. These tools address a well-defined operational problem with measurable outcomes.
Monitor the evidence base actively. The field is evolving rapidly. The Katonai et al. scoping review found that most AI tools remain at the proof-of-concept stage, but the Stanford-Harvard report and the Lancet review both document accelerating adoption. Health systems should designate a clinical informatics lead to track new evidence and update deployment decisions accordingly.
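The recommendation above to establish performance benchmarks using local data can be sketched as a small calculation over chart-review counts. This is a minimal illustration, not a validation protocol: the function names `wilson_interval` and `local_validation` are hypothetical, the counts are invented for the example, and the Wilson score interval is one reasonable choice of confidence interval for a proportion, chosen here as an assumption.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion (z=1.96 gives ~95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1.0 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    )
    return (center - half_width, center + half_width)


def local_validation(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity and specificity, each with an interval, from 2x2 counts."""
    return {
        "sensitivity": (tp / (tp + fn), wilson_interval(tp, tp + fn)),
        "specificity": (tn / (tn + fp), wilson_interval(tn, tn + fp)),
    }


# Hypothetical counts from a local chart review; not real data.
results = local_validation(tp=18, fp=30, fn=2, tn=450)
for metric, (estimate, (low, high)) in results.items():
    print(f"{metric}: {estimate:.2f} (95% CI {low:.2f} to {high:.2f})")
```

With only 20 true cases in this invented sample, the interval around sensitivity is wide even though the point estimate looks strong. In low-prevalence primary care settings, a local validation therefore needs enough positive cases to say anything reliable about missed diagnoses.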

The two eras of AI in primary care are not in competition. Pre-2024 ML tools offer narrow, validated solutions for specific clinical problems. Post-2024 LLM systems offer broad, unvalidated capabilities that may transform primary care—but only if the evidence base catches up to the promise. For now, the most clinically responsible path is to adopt the former with confidence and approach the latter with rigor.
