The Evidence Gap in Machine Learning for Healthcare

Fewer Than 1%: The Central Fact

Ten thousand four hundred sixty-two. That is the number of machine learning algorithms cataloged across 220 systematic reviews in healthcare, published through early 2023 (Kolasa et al., 2023). The number might suggest a maturing evidence base. It does not. I look first at validation status, not the AUC. What I find: fewer than 1% of these algorithms ever underwent external validation on independent data, and only 53% reported any form of internal validation. Fewer than 1%: that is the story, not a minor caveat. It means more than 99% of the cataloged algorithms have no externally validated performance estimate, so they cannot support a pooled estimate of clinical performance.

Infographic split into two sections: left side shows a tall deep blue bar labeled 'Published ML Prediction Models' towering over a tiny muted orange bar labeled 'Externally Validated (<1%)' with a downward red arrow annotated '−0.20 AUC optimism gap'; right side shows a six-panel icon grid representing ethical, technological, regulatory, workforce, safety, and social barriers. Caption: The volume of published ML models contrasted with the fraction that have ever been validated externally. The 0.20 AUC optimism gap shown here is the size the drop in performance can reach when models are tested on independent data.

A 0.20 AUC Drop – From Clinical Use to Below Threshold

You might believe internal validation is sufficient, that cross-validation on a single dataset gives you a reliable performance estimate. I have watched models report 0.90 AUC on development data and drop to 0.70 when tested on a different population. That is not a statistical footnote. It is the difference between a tool that helps and one that misleads. The optimism gap between internal and external validation can reach 0.20 in AUC (Wynants et al., 2020, as cited in Yankam, 2026). It is driven by overfitting, narrow training populations, and data leakage, a separate problem we will get to. The point: the <1% external validation rate is not a marginal oversight. It undermines nearly every headline performance claim in the field.
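To make the arithmetic concrete, here is a minimal sketch of how an optimism gap could be computed from two sets of predictions. The rank-based AUC below is the standard definition; the function names and the toy scores are illustrative assumptions, not data from Wynants et al. or any other cited study.

```python
def auc(labels, scores):
    """Rank-based AUC: probability that a random positive case scores
    higher than a random negative case (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def optimism_gap(internal_auc, external_auc):
    """Internal minus external AUC; a positive value means the
    internal estimate was optimistic."""
    return internal_auc - external_auc


# Hypothetical toy predictions, invented for illustration only.
internal_labels = [1, 1, 1, 0, 0, 0]
internal_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]   # development population
external_labels = [1, 1, 1, 0, 0, 0]
external_scores = [0.7, 0.5, 0.2, 0.6, 0.4, 0.3]   # independent population

internal_auc = auc(internal_labels, internal_scores)
external_auc = auc(external_labels, external_scores)
print(f"internal AUC = {internal_auc:.2f}")
print(f"external AUC = {external_auc:.2f}")
print(f"optimism gap = {optimism_gap(internal_auc, external_auc):.2f}")
```

With the figures quoted in the text, optimism_gap(0.90, 0.70) gives a gap of 0.20 (up to floating-point rounding). The gap is only computable when an external estimate exists, which is exactly what most published models lack.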

Missing Metrics: What You Cannot Pool

Even if every model had external validation, you still could not pool most of them. Look at the numbers from the same systematic review: 44% of studies lacked any reported accuracy metric, 72% omitted sensitivity, 75% omitted specificity (Kolasa et al., 2023). I cannot meta-analyze what was never written. These are not minor omissions — they systematically block quantitative synthesis. The evidence gap is multi-layered.

Percentage of reviewed machine learning studies in healthcare that did not report each standard performance metric. Data from Kolasa et al. (2023).
Metric	Studies Omitting It
Accuracy	44%
Sensitivity	72%
Specificity	75%
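A reviewer could screen for this before attempting any pooling. The sketch below assumes a hypothetical extraction format (one dictionary per study, with None for a metric that was never reported) and treats a study as poolable only if it reports both sensitivity and specificity; that eligibility rule is an illustrative assumption, not a criterion taken from Kolasa et al.

```python
# Metrics a study must report before it can enter a pooled
# diagnostic-accuracy estimate (an assumed, illustrative rule).
REQUIRED_METRICS = ("sensitivity", "specificity")


def poolable(study, required=REQUIRED_METRICS):
    """True if every required metric was actually reported."""
    return all(study.get(metric) is not None for metric in required)


def missing_rates(studies, metrics=("accuracy", "sensitivity", "specificity")):
    """Share of studies that omit each metric."""
    return {
        metric: sum(s.get(metric) is None for s in studies) / len(studies)
        for metric in metrics
    }


# Hypothetical extraction sheet, invented for illustration only.
studies = [
    {"id": "A", "accuracy": 0.88, "sensitivity": 0.81, "specificity": 0.90},
    {"id": "B", "accuracy": 0.91, "sensitivity": None, "specificity": None},
    {"id": "C", "accuracy": None, "sensitivity": 0.75, "specificity": None},
]

eligible = [s["id"] for s in studies if poolable(s)]
missing_rate = missing_rates(studies)
print("poolable studies:", eligible)
for metric, rate in missing_rate.items():
    print(f"{metric}: omitted by {rate:.0%} of studies")
```

Running the same tally on a real extraction sheet is how omission rates like those in the table above would be produced.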

Data Leakage: Internal Validity Undermined

You might now think at least the internally validated studies are sound. Not necessarily. Data leakage, where information from the test set contaminates the training process, affects up to 40% of imaging-based machine learning studies (Cacciamani et al., 2023, as cited in Yankam, 2026). The most common form: patients appearing in both training and test sets, which inflates performance. This goes beyond the optimism gap described above, because it undermines even the internal performance estimate itself. For a large fraction of imaging studies, the published AUC is inflated by the study design.
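The patient-overlap form of leakage is easy to illustrate. The sketch below splits at the patient level rather than the image level, and includes a check for overlap. The record layout (a "patient_id" key per image) and the function names are assumptions made for this example.

```python
import random


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Assign whole patients to train or test so that no patient
    contributes images to both sides."""
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test


def leaked_patients(train, test):
    """Patients that appear on both sides of the split."""
    return {r["patient_id"] for r in train} & {r["patient_id"] for r in test}


# Hypothetical dataset: 5 patients with 2 images each.
records = [{"patient_id": f"P{i // 2}", "image": f"img{i}"} for i in range(10)]

# Naive image-level split: alternating images go to train and test,
# so every patient ends up on both sides.
naive_train, naive_test = records[::2], records[1::2]
print("naive split leaks:", sorted(leaked_patients(naive_train, naive_test)))

# Patient-level split: no overlap by construction.
train, test = split_by_patient(records)
print("patient-level split leaks:", sorted(leaked_patients(train, test)))
```

An overlap check like leaked_patients is cheap to run on any published split where patient identifiers are available, which is one reason code and data sharing matter here.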

Why ML Studies Don’t Pool – Three Dimensions

Even if all the above problems were fixed — full external validation, complete reporting, no leakage — meta-analysis of ML studies would still face a fundamental challenge: heterogeneity. Drug trials compare a fixed molecule against placebo across similar populations. Machine learning studies vary along at least three dimensions that make simple pooling indefensible.

Diagram with three horizontal lanes: top lane labeled 'Structural Heterogeneity' with icons of neural network, decision tree, and other model architectures; middle lane 'Data Heterogeneity' with icons of diverse patient populations, hospitals, and imaging modalities; bottom lane 'Outcome Heterogeneity' with icons of different clinical endpoints and follow-up periods. Bottom label reads 'Why ML Findings Don't Pool Like Drug Trials'. — Three sources of heterogeneity that prevent direct meta-analysis of machine learning studies: differences in algorithm architectures, datasets, and outcome definitions.

Structural heterogeneity: a neural network, a support vector machine, and a random forest are different tools. Kolasa et al. (2023) found neural networks used in 2,454 algorithms, SVMs in 1,578, and random forests in 1,522. You cannot treat a convolutional neural network for chest X-rays and a gradient-boosted model for lab values as comparable studies.

Data heterogeneity: populations, sample sizes, and imaging protocols differ wildly. A model trained on 50,000 patients from a single academic center is not the same as one trained on 500 patients from a community hospital.

Outcome heterogeneity: the definition of the endpoint — what constitutes a positive case, what follow-up window is used — varies across studies. In oncology, one study might use progression-free survival; another uses overall survival. These are not interchangeable.

The standard meta-analytic approach, which assumes studies estimate the same underlying effect, is simply inappropriate for the current ML literature.
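One conventional way to quantify how badly the "same underlying effect" assumption fails is Cochran's Q with the derived I² statistic. This is a standard meta-analytic diagnostic brought in here for illustration; the source reviews are not stated to have used it. The sketch assumes each study supplies an effect estimate and its variance, and the AUC values below are invented and pooled on the raw scale purely for simplicity.

```python
def heterogeneity(effects, variances):
    """Fixed-effect pooled estimate, Cochran's Q, and I^2.

    I^2 = max(0, (Q - df) / Q) is the share of total variation
    attributed to between-study differences rather than chance.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need matching effects and variances for >= 2 studies")
    weights = [1.0 / v for v in variances]          # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical AUCs from three dissimilar studies, equal variances.
aucs = [0.70, 0.80, 0.92]
variances = [0.0004, 0.0004, 0.0004]

pooled, q, i2 = heterogeneity(aucs, variances)
print(f"pooled = {pooled:.3f}, Q = {q:.1f}, I^2 = {i2:.0%}")
```

When I² is very high, as it is for these toy values, the pooled number describes no actual study population, which is the practical meaning of "simple pooling is indefensible."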

The 25% Rule for Systematic Reviewers

So what is a systematic reviewer to do? The Yankam review proposes a clear threshold: when fewer than 25% of included studies have external validation, only narrative synthesis is methodologically defensible (Yankam, 2026). I agree. It is a rule, not a suggestion. If this rule had been applied to the 220 systematic reviews in the Kolasa dataset, over half of them — the 53% that did not even conduct a quality assessment — would have been forced to reconsider their conclusions.

Flowchart for systematic review of ML studies: starts with 'ML Study Identified', then diamond 'Has external validation?' Yes leads to 'Include in meta-analysis' with green checkmark; No leads to 'Reports internal validation only?' Yes leads to 'Include in stratified analysis — flagged exploratory' with yellow warning; No leads to 'Exclude from quantitative synthesis — narrative only' with red X. Bottom summary: 'Stratify by validation status before pooling estimates.' — Practical decision tree for systematic reviewers: stratify by validation status before deciding whether to pool estimates.
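The decision tree and the 25% threshold can be combined into a few lines. This is a minimal sketch, assuming each included study has been coded with two hypothetical boolean fields, "external_validation" and "internal_validation". The stratum labels follow the flowchart, and the threshold is the one proposed in the Yankam review.

```python
from collections import Counter

META = "meta-analysis"
EXPLORATORY = "stratified analysis (flagged exploratory)"
NARRATIVE = "narrative only"


def classify(study):
    """Route one study through the validation-status decision tree."""
    if study.get("external_validation"):
        return META
    if study.get("internal_validation"):
        return EXPLORATORY
    return NARRATIVE


def synthesis_plan(studies, threshold=0.25):
    """Apply the 25% rule: if fewer than `threshold` of the studies are
    externally validated, only narrative synthesis is defensible."""
    if not studies:
        raise ValueError("no studies to assess")
    strata = Counter(classify(s) for s in studies)
    external_share = strata[META] / len(studies)
    if external_share < threshold:
        mode = "narrative synthesis only"
    else:
        mode = "quantitative synthesis, stratified by validation status"
    return {"external_share": external_share, "mode": mode, "strata": dict(strata)}


# Hypothetical review with 5 included studies, 1 externally validated.
included = [
    {"id": 1, "external_validation": True, "internal_validation": True},
    {"id": 2, "external_validation": False, "internal_validation": True},
    {"id": 3, "external_validation": False, "internal_validation": True},
    {"id": 4, "external_validation": False, "internal_validation": False},
    {"id": 5, "external_validation": False, "internal_validation": True},
]

plan = synthesis_plan(included)
print(f"externally validated: {plan['external_share']:.0%}")
print("recommended:", plan["mode"])
```

Here 1 of 5 studies (20%) is externally validated, which falls under the threshold, so the sketch recommends narrative synthesis. Classifying before pooling keeps the validation status visible in whatever synthesis follows.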

PROBAST+AI: Hope Depends on Enforcement

Three recent frameworks aim to impose the discipline that has been missing. PROBAST+AI (Moons et al., 2025) provides a structured bias assessment tool for ML prediction models. TRIPOD+AI (Collins et al., 2024) updates reporting guidelines for regression and machine learning models. PRISMA-AI (Cacciamani et al., 2023) extends PRISMA to systematic reviews of AI in healthcare (Yankam, 2026). I am hopeful these can enforce discipline. But hope is not evidence. The field has had reporting guidelines before — TRIPOD was published in 2015 — and adherence remains suboptimal. Their impact depends entirely on journal and funder mandates.

PROBAST+AI: systematic bias assessment for ML prediction model studies.
TRIPOD+AI: reporting guidelines for clinical prediction models using regression or machine learning.
PRISMA-AI: reporting guidelines for systematic reviews and meta-analyses of AI in healthcare.

This Inference Does Not Hold

Here is where I stand. The machine learning in healthcare literature contains over ten thousand algorithms, of which fewer than one percent have been externally validated. The optimism gap can reach 0.20 AUC. Reporting gaps mean most studies could not be used for meta-analysis even if they were validated. Data leakage inflates internal performance. Heterogeneity prevents simple pooling. And most systematic reviews compound these problems by treating all models as comparable.

The tools to change this exist: stratify by validation status, apply the 25% threshold, mandate PROBAST+AI and TRIPOD+AI, require code and data sharing. But change will not happen automatically. It requires systematic reviewers to enforce methodological standards, journal editors to reject papers that omit basic metrics, and funders to demand independent validation before any claim of clinical utility is made.

Until then, the most honest thing any reviewer can write is: 'This inference does not hold.'
