The Evidence Gap in Healthcare AI

The healthcare AI market has reached an inflection point. Spending hit $1.4 billion in 2025, nearly tripling the prior year, according to a Menlo Ventures survey of more than 700 healthcare executives. Twenty-two percent of organizations have now implemented domain-specific AI tools — a 7x increase over 2024. Clinician adoption has followed: 67% of clinicians now use AI tools daily, and over 90% use them weekly, per the Bessemer Venture Partners 2026 State of Health AI report.

Yet beneath these impressive adoption figures lies a persistent problem that the American Hospital Association has flagged as a key concern for health system leaders: clinical validation gaps in AI-enabled medical devices. The gap between regulatory clearance and published clinical evidence is wide, and it matters directly to patient outcomes.

Consider what rigorous evidence looks like. A RadNet study of 747,604 women across 10 healthcare practices found that 36% opted to pay $40 out of pocket for AI-enhanced mammography screening. The overall cancer detection rate was 43% higher for women who chose AI-enhanced screening. Women in the program were 21% more likely to have their cancer detected, and the positive predictive value for cancer was 15% higher. Radiologist accuracy rose from 84–89% to approximately 93% with AI assistance. That is the kind of evidence — large sample, real-world setting, independently verifiable — that should inform procurement decisions.

This article evaluates which healthcare AI companies have published peer-reviewed evidence meeting high evidentiary standards: randomized controlled trials, prospective multi-center studies, and real-world validation across diverse populations. The thesis is straightforward: while hundreds of healthcare AI companies exist, only a minority have this level of proof, and that evidence gap is the single most important factor separating responsible AI tools from overhyped products.

A Framework for Evaluating AI Clinical Evidence

Before reviewing individual companies, it is essential to establish a common framework for assessing evidence quality. Not all studies are created equal, and vendors often present retrospective analyses or internal validations as though they carry the same weight as prospective trials.

Evidence quality hierarchy for AI clinical validation studies. Adapted from standard evidence-based medicine frameworks.
Evidence Tier | Study Design | Key Characteristics | Typical Sample Size
Tier 1 | Randomized Controlled Trial (RCT) | Random assignment, control group, blinded endpoints, pre-registered protocol | Hundreds to thousands
Tier 2 | Prospective Multi-Center Study | Pre-specified endpoints, multiple sites, consecutive or defined enrollment | Hundreds to thousands
Tier 3 | Retrospective Multi-Center Study | Analysis of existing data from multiple institutions, pre-specified analysis plan | Thousands to tens of thousands
Tier 4 | Retrospective Single-Center Study | Single institution, historical data, higher risk of selection bias | Hundreds to thousands
Tier 5 | White Paper / Expert Opinion | No peer review, vendor-authored, no independent validation | Variable or not reported
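The hierarchy above can be expressed as a small lookup so that studies are triaged consistently. The following is a minimal sketch, not a validated grading instrument: the class name, the function name, and the four yes/no attributes are illustrative assumptions. The table has no row for prospective single-center studies, so the sketch places them in Tier 4.

```python
from enum import IntEnum


class EvidenceTier(IntEnum):
    """Tiers from the hierarchy table; a lower number means a stronger design."""
    RCT = 1
    PROSPECTIVE_MULTI_CENTER = 2
    RETROSPECTIVE_MULTI_CENTER = 3
    RETROSPECTIVE_SINGLE_CENTER = 4
    WHITE_PAPER = 5


def classify_study(randomized: bool, prospective: bool,
                   multi_center: bool, peer_reviewed: bool) -> EvidenceTier:
    """Map a few yes/no study attributes to a tier (hypothetical helper)."""
    if not peer_reviewed:
        return EvidenceTier.WHITE_PAPER
    if randomized:
        return EvidenceTier.RCT
    if prospective and multi_center:
        return EvidenceTier.PROSPECTIVE_MULTI_CENTER
    if multi_center:
        return EvidenceTier.RETROSPECTIVE_MULTI_CENTER
    # Assumption: any remaining single-center study is treated as Tier 4,
    # because the table does not list prospective single-center designs.
    return EvidenceTier.RETROSPECTIVE_SINGLE_CENTER


# Example: a peer-reviewed, non-randomized, prospective, multi-site study.
tier = classify_study(randomized=False, prospective=True,
                      multi_center=True, peer_reviewed=True)
print(tier.name, int(tier))
```

A real assessment would also weigh blinding, pre-registration, and sample size, which this sketch deliberately leaves out.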

Beyond study design, several additional criteria determine whether evidence is actionable for procurement decisions:

  • Sample size and statistical power: Small studies may detect large effects but cannot reliably estimate real-world performance.
  • Population diversity: Does the study population reflect the demographics of the health system's patient base? Studies that do not report demographic composition cannot be assessed for generalizability.
  • External validation: Was the model tested on an independent dataset from a different institution, time period, or patient population? Internal validation alone is insufficient.
  • Performance metrics: AUC, sensitivity, specificity, positive predictive value, and calibration should all be reported. A single metric can be misleading.
  • Reproducibility: Have independent research groups replicated the findings? Single-study evidence is weaker than a body of consistent results.
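To make the points about sample size and performance metrics concrete, the sketch below computes sensitivity, specificity, and predictive values from a confusion matrix, plus a Wilson score interval for a proportion. All counts are hypothetical and chosen only for illustration, and the function names are assumptions. The code assumes every denominator is nonzero.

```python
import math


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard diagnostic accuracy metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that are flagged
        "specificity": tn / (tn + fp),  # share of non-cases correctly cleared
        "ppv": tp / (tp + fp),          # share of positive flags that are real
        "npv": tn / (tn + fn),          # share of negative results that are real
    }


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical screening cohort: 10,000 patients, 1% disease prevalence.
m = diagnostic_metrics(tp=90, fp=190, fn=10, tn=9710)
for name, value in m.items():
    print(f"{name}: {value:.3f}")

# Same observed sensitivity (90%), very different certainty.
print("n=10:  ", wilson_interval(9, 10))
print("n=1000:", wilson_interval(900, 1000))
```

In this made-up cohort, sensitivity and specificity both look strong while PPV stays low because the disease is rare, which is why a single metric can mislead. The interval for 9 of 10 is far wider than the interval for 900 of 1,000, even though both report 90% sensitivity.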

For readers who want a deeper technical reference on model evaluation metrics, including AUROC, calibration, sensitivity, specificity, and net benefit, the linked reference provides a five-domain framework that complements this article's evidence-quality focus.

Aidoc: Multi-Center Real-World Validation in Stroke and Radiology

Aidoc is one of the most widely deployed radiology AI platforms globally, used in more than 1,000 hospitals and processing millions of scans annually. The company holds multiple FDA clearances across different clinical indications, including intracranial hemorrhage, pulmonary embolism, and cervical spine fractures.

The strongest evidence for Aidoc comes from its stroke detection studies. Multiple sources report that Aidoc's AI reduces door-to-treatment time from approximately 87 minutes to approximately 43 minutes — a clinically meaningful reduction in a time-critical condition where every minute affects outcomes. This evidence comes from real-world deployment data across multiple institutions, placing it in Tier 2 (prospective multi-center) or Tier 3 (retrospective multi-center) depending on the specific study design.
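As a quick check on what the reported figures imply, the sketch below converts the approximate before and after times into an absolute and a relative reduction. The function name is an assumption, and because the source figures are approximate, the result should be read as approximate as well.

```python
def time_reduction(before_min: float, after_min: float) -> tuple:
    """Return (absolute minutes saved, relative reduction as a fraction)."""
    absolute = before_min - after_min
    relative = absolute / before_min
    return absolute, relative


# Approximate door-to-treatment times reported in the text above.
absolute, relative = time_reduction(87, 43)
print(f"{absolute:.0f} min saved, {relative:.1%} relative reduction")
# prints "44 min saved, 50.6% relative reduction"
```

Reporting both numbers matters when comparing vendors: a relative reduction alone hides the baseline, and an absolute reduction alone hides how slow the starting workflow was.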

The RadNet mammography study cited earlier also involved Aidoc's AI, demonstrating a 43% higher cancer detection rate in a real-world screening population of 747,604 women. This is among the largest real-world validation studies for any radiology AI tool and provides strong evidence for generalizability across diverse practice settings.

Aidoc's evidence base is further strengthened by its deployment at Advocate Health, one of the largest health systems in the U.S., as noted in the Menlo Ventures report. For readers who want a device-by-device breakdown of Aidoc's FDA clearances and regulatory history, the linked company profile provides details for each cleared indication.

PathAI and Paige.AI: Pathology Evidence for Diagnostic Reproducibility

Pathology has long suffered from significant inter-observer variability — different pathologists reviewing the same slide can reach different diagnoses. AI tools in this space aim to reduce that variability and improve diagnostic reproducibility.

PathAI has published peer-reviewed studies demonstrating improvements in diagnostic reproducibility and reduced inter-observer variability in pathology. These studies address a well-documented clinical problem: variation between pathologists can affect treatment decisions and patient outcomes. The evidence is particularly strong for breast cancer and prostate cancer pathology, where PathAI's algorithms have been tested against expert pathologist consensus.

Paige.AI holds the distinction of offering the first FDA-authorized AI for pathology. Its Paige Prostate product received FDA marketing authorization through the De Novo pathway for assisting pathologists in detecting prostate cancer on digital slides. The supporting evidence includes a prospective clinical study that met its pre-specified endpoints, placing it in Tier 2 of the evidence hierarchy.

The Menlo Ventures report notes that PathAI's peer-reviewed studies showing improved diagnostic reproducibility and reduced inter-observer variability are a key differentiator in the pathology AI market. Paige.AI's first-mover FDA approval status adds regulatory credibility, though the evidence base for both companies remains narrower in scope than the large-scale real-world studies available for radiology AI tools like Aidoc.

Tempus: Evidence Scale in Precision Oncology

Tempus AI has built one of the largest proprietary data platforms in healthcare AI, with more than 45 million de-identified patient records, over 400 petabytes of clinical data, and over 7 billion clinical notes. Its network extends across approximately 4,000 to 4,500 hospitals. The company reported $1.27 billion in total revenue in 2025 with annual revenue growth above 83% and net revenue retention of 126%, indicating strong expansion within its existing client base.

Tempus works with 95% of the top oncology pharmaceutical companies and has a contract value pipeline above $1.1 billion. The company's evidence approach relies heavily on real-world evidence generation from its massive data platform, producing retrospective analyses that correlate genomic profiles with treatment outcomes.

  • Scale: 45M+ de-identified records, 400+ petabytes of clinical data, 7B+ clinical notes
  • Revenue: $1.27B in 2025, 83% annual growth, 126% net revenue retention
  • Pharma partnerships: 95% of top oncology pharmaceutical companies
  • Hospital network: ~4,000–4,500 hospitals

For a comprehensive review of Tempus's regulatory history, FDA clearances, and evidence base across its product lines, the linked company profile provides details for each of its clinical applications.

Viz.ai, Recursion, IDx-DR, and Zebra Medical Vision: Diverse Evidence Tiers

Beyond the companies with the strongest evidence bases, several other notable healthcare AI companies occupy different tiers of the evidence hierarchy. Understanding where each sits helps procurement teams calibrate their expectations.

Viz.ai: Stroke AI with Real-World Time Savings

Viz.ai's stroke detection AI, cleared under FDA 510(k) K192872, has been studied for its impact on door-to-needle time in acute ischemic stroke. The evidence includes retrospective and prospective studies showing reduced treatment times, though the sample sizes are smaller than Aidoc's large-scale radiology studies. Viz.ai's evidence tier is Tier 2–3, depending on the specific study. For detailed FDA clearance information on Viz.ai's stroke AI, the linked company profile provides the full submission record.

Recursion: Drug Discovery with Preclinical and Early Clinical Evidence

Recursion applies AI to drug discovery, using phenotypic screening to identify potential therapeutic candidates. The company's evidence base is primarily preclinical and early-phase clinical (Phase I and II). While Recursion has published peer-reviewed studies demonstrating its platform's ability to identify novel drug-target interactions, the evidence for clinical efficacy in humans remains limited. This places Recursion in Tier 4–5 for clinical validation, though its preclinical evidence is stronger.

IDx-DR: First Autonomous AI with Pivotal Prospective Study

IDx-DR (now part of Digital Diagnostics) holds the distinction of being the first FDA-authorized autonomous AI system for diabetic retinopathy screening. Its pivotal prospective study, conducted across multiple primary care sites, met pre-specified sensitivity and specificity endpoints. This places IDx-DR in Tier 2 — one of the few autonomous AI systems with prospective multi-center validation. The study demonstrated that the AI could be deployed in primary care settings without an ophthalmologist on site, addressing a significant access-to-care problem.

Zebra Medical Vision (Nanox): Retrospective Radiology Studies

Zebra Medical Vision, now part of Nanox, has published multiple retrospective studies across radiology indications including coronary artery calcium scoring, bone age estimation, and liver fat quantification. The evidence is primarily retrospective single-center or multi-center (Tier 3–4), with limited prospective validation. The company's strength lies in the breadth of its cleared indications rather than the depth of its evidence for any single application.

Comparative evidence summary for selected healthcare AI companies. Evidence tiers follow the framework defined in "A Framework for Evaluating AI Clinical Evidence" above. All data sourced from published peer-reviewed studies, FDA authorization records, and industry reports cited in this article.
Company | Primary Evidence Tier | Study Types | Key Findings | Known Limitations
Aidoc | Tier 2–3 | Prospective multi-center, retrospective multi-center, real-world deployment | Door-to-treatment time reduced ~87 to ~43 min; 43% higher cancer detection in mammography (747,604 women) | Most evidence is observational; limited RCT data for individual indications
PathAI | Tier 2–3 | Prospective diagnostic concordance studies, retrospective validation | Improved diagnostic reproducibility; reduced inter-observer variability in breast and prostate pathology | Studies focus on diagnostic agreement, not patient outcomes; limited external validation across diverse populations
Paige.AI | Tier 2 | Prospective clinical study for FDA clearance | First FDA-approved AI for pathology; met pre-specified endpoints for prostate cancer detection | Single indication (prostate); limited real-world deployment data
Tempus | Tier 3–4 | Retrospective real-world evidence, correlational studies | 45M+ records; 95% top oncology pharma partnerships; $1.27B revenue | Limited prospective interventional studies; evidence is primarily correlational rather than causal
Viz.ai | Tier 2–3 | Retrospective and prospective stroke time studies | Reduced door-to-needle time in acute ischemic stroke | Smaller sample sizes than Aidoc; limited multi-indication evidence
IDx-DR | Tier 2 | Prospective multi-center pivotal study | Met pre-specified sensitivity/specificity endpoints for autonomous diabetic retinopathy screening | Single indication; limited to diabetic retinopathy; requires specific camera hardware
Zebra Medical Vision (Nanox) | Tier 3–4 | Retrospective single-center and multi-center studies | Multiple radiology indications with FDA clearances | Limited prospective validation; breadth over depth in evidence
Recursion | Tier 4–5 | Preclinical studies, early-phase clinical trials | Novel drug-target identification via phenotypic screening | Limited human clinical efficacy data; early-stage company
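The tier ranges in the summary table can be encoded as data so that a procurement team can filter vendors against a minimum evidence bar. This is a sketch under stated assumptions: the `VendorEvidence` type and `meets_threshold` function are hypothetical names, the tier values are copied from the table, and the filter takes the conservative view that a vendor qualifies only if its weakest listed tier meets the bar.

```python
from typing import List, NamedTuple


class VendorEvidence(NamedTuple):
    company: str
    best_tier: int   # strongest tier listed in the summary table
    worst_tier: int  # weakest tier listed in the summary table


SUMMARY = [
    VendorEvidence("Aidoc", 2, 3),
    VendorEvidence("PathAI", 2, 3),
    VendorEvidence("Paige.AI", 2, 2),
    VendorEvidence("Tempus", 3, 4),
    VendorEvidence("Viz.ai", 2, 3),
    VendorEvidence("IDx-DR", 2, 2),
    VendorEvidence("Zebra Medical Vision (Nanox)", 3, 4),
    VendorEvidence("Recursion", 4, 5),
]


def meets_threshold(vendors: List[VendorEvidence], max_tier: int) -> List[str]:
    """Companies whose entire tier range is at tier `max_tier` or stronger."""
    return [v.company for v in vendors if v.worst_tier <= max_tier]


print(meets_threshold(SUMMARY, max_tier=2))
# prints ['Paige.AI', 'IDx-DR']
print(meets_threshold(SUMMARY, max_tier=3))
# prints ['Aidoc', 'PathAI', 'Paige.AI', 'Viz.ai', 'IDx-DR']
```

Filtering on `best_tier` instead would be more permissive; which choice is appropriate depends on whether the strongest study covers the specific indication being procured.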

Regulatory Context: FDA Clearance vs. Evidence Quality

A critical distinction that procurement teams must understand: FDA clearance does not equate to clinical efficacy. The majority of AI medical devices are cleared through the 510(k) pathway, which requires the device to be substantially equivalent to a predicate device already on the market. It does not require the device to have been tested in a randomized controlled trial or even a prospective study.

De Novo classification, which is used for novel devices without a predicate, involves a higher evidentiary bar. Paige.AI's De Novo authorization for pathology AI required a prospective clinical study. Similarly, IDx-DR's De Novo authorization for autonomous diabetic retinopathy screening was supported by a pivotal prospective trial. These examples illustrate that the regulatory pathway itself provides information about the likely evidence quality.

CE marking under the EU Medical Device Regulation (MDR) has historically been less stringent than FDA review, though the EU AI Act is expected to raise standards for high-risk AI systems. Procurement teams evaluating CE-marked devices should request the same level of clinical evidence they would expect from FDA-cleared devices.

Guidance for Procurement and Clinical Adoption Decisions

For health system CMOs, CMIOs, and IT evaluators, the following steps can help separate evidence-backed AI tools from those that rely on marketing claims:

  • Request the specific study protocol, not just the results summary. Look for pre-registration on ClinicalTrials.gov, pre-specified endpoints, and a statistical analysis plan.
  • Ask about the study population: Was it drawn from a single institution or multiple sites? Does the demographic composition match your patient population? Studies that do not report demographics cannot be assessed for generalizability.
  • Verify external validation: Was the model tested on an independent dataset from a different institution, time period, or geographic region? Internal validation alone is insufficient for clinical deployment.
  • Check for peer-reviewed publication in a reputable journal. White papers, conference abstracts, and vendor-authored reports are not equivalent to independent peer-reviewed research.
  • Distinguish between diagnostic accuracy studies (does the AI find what it is looking for?) and patient outcome studies (does using the AI improve patient health?). Both are valuable, but they answer different questions.
  • Ask about model drift monitoring: How does the vendor ensure that performance remains stable over time as clinical practice and patient populations change?
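The checklist above can be captured as a simple record so that gaps are tracked per vendor rather than remembered informally. The sketch below is illustrative only: the `EvidenceChecklist` fields are shorthand assumptions for the six questions, and the example vendor answers are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class EvidenceChecklist:
    """One yes/no answer per procurement question in the list above."""
    protocol_preregistered: bool    # protocol, endpoints, analysis plan available
    demographics_reported: bool     # study population described and comparable
    externally_validated: bool      # tested on an independent dataset
    peer_reviewed: bool             # published in a reputable journal
    patient_outcomes_studied: bool  # outcome study, not only diagnostic accuracy
    drift_monitoring_plan: bool     # vendor monitors performance over time


def unmet_items(checklist: EvidenceChecklist) -> List[str]:
    """Names of checklist items the vendor has not satisfied."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


# Hypothetical vendor response used only to show the output shape.
example = EvidenceChecklist(
    protocol_preregistered=True,
    demographics_reported=True,
    externally_validated=False,
    peer_reviewed=True,
    patient_outcomes_studied=False,
    drift_monitoring_plan=True,
)
print(unmet_items(example))
# prints ['externally_validated', 'patient_outcomes_studied']
```

The list of unmet items is more useful than a single score, since a missing external validation and a missing drift plan call for different follow-up questions.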

For broader context on how AI is being deployed in clinical workflows, including emergency medicine triage, sepsis prediction, and stroke decision support, the linked resource provides detailed use-case analyses that complement this article's evidence-focused evaluation.

Conclusion: Prioritizing Evidence Over Hype

The healthcare AI market is maturing rapidly, with $1.4 billion in spending, 22% organizational adoption, and 67% daily clinician use. But maturity in market size does not automatically translate to maturity in clinical evidence. The companies reviewed in this article — Aidoc, PathAI, Paige.AI, Tempus, Viz.ai, IDx-DR, Zebra Medical Vision, and Recursion — represent a spectrum of evidence quality, from Tier 2 prospective studies to Tier 4–5 preclinical and retrospective work.

The evidence gap is not a static problem. Companies that invest in rigorous clinical validation today will be the ones that earn the trust of health systems and clinicians tomorrow. Those that rely on regulatory clearance alone, without supporting peer-reviewed evidence, will face increasing scrutiny as the AHA and other organizations push for higher standards.

This article provides a framework that procurement teams can apply to any vendor, not just those covered here. For deeper dives into individual companies, the linked company profiles provide detailed regulatory and evidence histories. The evidence framework and the company-specific reviews in this article are tools for making informed, evidence-based procurement decisions, not a substitute for independent verification of vendor claims.