Foundation Models in Healthcare: Clinical Applications and Limitations

What Are Foundation Models? A Working Definition for Clinical Readers

The term foundation model was formalized by Bommasani and colleagues at the Stanford Center for Research on Foundation Models (CRFM) in 2021. Their definition has since become the reference point across clinical AI literature: a foundation model is any model pre-trained on broad data at scale that can be adapted — with minimal additional training — to a wide range of downstream tasks.

The clinical significance of this definition lies in what it replaces. Traditional narrow AI in healthcare is purpose-built: a retinal screening model processes fundus photographs and outputs a diabetic retinopathy risk score; a fracture detection model identifies specific bone abnormalities on plain radiographs; an ICU readmission model ingests structured billing codes and produces a risk probability. Each model is trained for one task, evaluated on one task, and deployed for one task. It does not transfer to tasks outside that training objective.

A foundation model inverts this architecture. A single large pre-trained base — having learned general-purpose representations from a massive and diverse dataset — can be adapted to multiple clinical tasks with relatively little additional supervision. This adaptability is the defining property, not the model's size or its specific architecture.

In clinical healthcare, foundation models are not a single category. They divide into four distinct families based on the data type they process and the clinical problems they address. Understanding this taxonomy is the prerequisite for evaluating any specific FM's evidence base, limitations, and deployment readiness.

The Four-Family Taxonomy: CLaMs, FEMRs, Vision FMs, and Multimodal FMs

[Figure: narrow single-task AI models on the left, contrasted with a single broad pre-trained foundation branching into four clinical FM families on the right: CLaMs, FEMRs, Vision FMs, and Multimodal FMs.] The paradigm shift from narrow, task-specific clinical AI to a single pre-trained foundation adaptable across multiple clinical domains. Each branch represents a distinct FM family suited to a different clinical data type.

Four primary families of clinical foundation models have emerged, each defined by the data type it processes and the clinical tasks it can address.

CLaMs — Clinical Language Models

Clinical language models (CLaMs) are pre-trained on clinical text: physician notes, discharge summaries, radiology reports, operative records, and related unstructured documentation. Their primary clinical roles include NLP-based information extraction, ambient documentation support, clinical coding assistance, patient communication drafting, and text-based clinical decision support. Representative models include BioGPT, ClinicalBERT, GatorTron, and the Med-PaLM family.

CLaMs process text as their sole or primary input. Their outputs are also textual — structured extractions, generated summaries, or natural language responses. For applied clinical documentation and coding contexts, see NLP in Clinical Documentation: AI Scribes, Coding, and Clinical Documentation Improvement.

FEMRs — Foundation Models for Electronic Medical Records

FEMRs (Foundation models for Electronic Medical Records) model the structured timeline of a patient's EHR — sequences of diagnosis codes, procedure codes, medication orders, and lab results ordered across time. Rather than processing text, FEMRs learn patient-level embeddings from structured event sequences. These embeddings can then be applied to downstream prediction tasks: hospital readmission, mortality risk, disease phenotyping, and clinical trial eligibility screening.

Most FEMRs are unimodal — they process structured codes only, not accompanying clinical notes or imaging. This design choice simplifies training but also constrains generalizability across health systems that use different EHR platforms and coding conventions.
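To make the idea of a structured patient timeline concrete, the following is a minimal sketch of how coded events might be ordered and flattened into a token sequence before being fed to a sequence model. The `ClinicalEvent` class, the `vocabulary:code` token format, and the example patient are all hypothetical and chosen for illustration; real FEMRs use their own tokenization schemes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ClinicalEvent:
    """One structured EHR event: when it happened and which code was recorded."""
    when: date
    vocabulary: str  # coding system label, e.g. "ICD10", "RxNorm", "LOINC"
    code: str


def to_token_sequence(events):
    """Sort events chronologically and flatten each into a 'vocabulary:code' token."""
    ordered = sorted(events, key=lambda e: e.when)
    return [f"{e.vocabulary}:{e.code}" for e in ordered]


# Hypothetical timeline, deliberately listed out of chronological order.
timeline = [
    ClinicalEvent(date(2024, 3, 2), "LOINC", "4548-4"),   # HbA1c lab result
    ClinicalEvent(date(2024, 1, 15), "ICD10", "E11.9"),   # type 2 diabetes diagnosis
    ClinicalEvent(date(2024, 1, 20), "RxNorm", "6809"),   # metformin order
]

tokens = to_token_sequence(timeline)
print(tokens)
```

Because the tokens are built directly from local codes, two institutions that record the same clinical picture with different coding conventions would yield different sequences, which is the generalizability constraint described above.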

Vision FMs — Medical Imaging Foundation Models

Vision foundation models apply self-supervised learning to medical images — whole-slide pathology images, chest radiographs, retinal fundus photographs, and other imaging modalities. Rather than being trained on labeled examples of specific diseases, these models learn general visual representations from large collections of unlabeled images, then adapt to specific diagnostic tasks with minimal fine-tuning.

Named examples include UNI and Virchow in computational pathology, RETFound in ophthalmology, and CheXAgent and MAIRA-2 in chest radiology. Each is discussed further in the clinical applications section below.

Multimodal FMs — Integrating Multiple Data Types

Multimodal foundation models integrate two or more input data types within a single model — for example, combining whole-slide imaging with pathology report text, or pairing retinal images with structured clinical variables. TITAN, a whole-slide vision-language model pre-trained on over 335,000 whole-slide images alongside pathology reports and synthetic captions, is a prominent example. Multimodal FMs represent the most architecturally complex family and, as of mid-2026, also the one with the least prospective clinical validation.

The four primary clinical foundation model families, differentiated by input modality, representative named models, and primary clinical role. Procurement and evaluation decisions should begin with identifying which family addresses the relevant clinical data type.
Family	Input Modality	Representative Models	Primary Clinical Role	Training Data Type
CLaMs	Clinical text (unstructured)	BioGPT, ClinicalBERT, GatorTron, Med-PaLM	NLP extraction, documentation, coding, decision support	Physician notes, discharge summaries, radiology reports
FEMRs	Structured EHR timelines	ETHOS, CLMBR	Patient embeddings for readmission, mortality, phenotyping	Diagnosis codes, procedure codes, medication orders, labs
Vision FMs	Medical images	UNI, Virchow (pathology); RETFound (ophthalmology); CheXAgent, MAIRA-2 (radiology)	Disease detection, subtyping, report generation	Whole-slide images, fundus photographs, chest radiographs
Multimodal FMs	Two or more data types combined	TITAN (vision + language, pathology)	Cross-modal reasoning, report generation, rare disease retrieval	Whole-slide images + pathology reports + synthetic captions

Clinical Applications and Evidence Quality by Domain

Mapping the four FM families to clinical application domains reveals sharply uneven evidence quality. Benchmark performance on curated datasets is abundant across all families. Prospective, multi-site clinical validation is rare in every domain.

Pathology

Computational pathology is currently the most mature domain for vision FMs. UNI, a ViT-L/16 model trained using DINOv2 self-supervised learning on approximately 100,000 whole-slide images, demonstrates strong benchmark performance across cancer subtyping, biomarker prediction, and survival analysis tasks. Virchow, a ViT-H/14 model trained on 1.5 million whole-slide images from Memorial Sloan Kettering Cancer Center, shows particularly strong performance on rare cancer retrieval tasks where labeled data is scarce.

TITAN extends the pathology FM paradigm into the multimodal domain. Pre-trained on over 335,000 whole-slide images alongside pathology reports and 423,000 synthetic captions, TITAN can generate pathology reports and retrieve rare disease cases without fine-tuning. Its zero-shot and few-shot classification capabilities are documented across multiple task types.

Chest Radiology

CheXAgent, instruction-tuned on approximately 6 million instruction-image-answer triplets, demonstrates strong performance on chest radiograph report generation and visual question answering tasks. MAIRA-2 similarly shows competent automated radiology report generation. Both models represent meaningful advances over earlier narrow AI approaches in terms of report completeness and clinical language quality.

Neither CheXAgent nor MAIRA-2 holds FDA clearance as of this writing. External validation on patient populations outside the training distribution has not been published for either model. Integration with clinical PACS and RIS systems — a prerequisite for practical deployment — remains a documented barrier across U.S.-based radiology FMs generally.

Ophthalmology

RETFound uses masked autoencoder (MAE) pre-training on 1.6 million unlabeled retinal fundus images. With minimal fine-tuning, it has demonstrated performance on diabetic retinopathy detection, age-related macular degeneration classification, and the prediction of systemic conditions — including cardiovascular risk indicators — from retinal images. The breadth of tasks achievable with minimal labeled data is the defining characteristic of its clinical interest.
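In masked autoencoder pre-training, a large share of image patches is hidden and the model learns to reconstruct them from the visible remainder, so no disease labels are needed. The sketch below shows only the masking step. The image size (224), patch size (16), and mask ratio (0.75) are assumed values for illustration and are not taken from the RETFound description above.

```python
import random


def split_into_patches(height, width, patch):
    """Return the (row, col) grid index of every non-overlapping patch."""
    return [(r, c) for r in range(height // patch) for c in range(width // patch)]


def random_mask(patches, mask_ratio, seed=0):
    """Randomly hide a fraction of patches; return (visible, masked) lists."""
    rng = random.Random(seed)
    shuffled = patches[:]
    rng.shuffle(shuffled)
    n_masked = int(len(shuffled) * mask_ratio)
    return shuffled[n_masked:], shuffled[:n_masked]


patches = split_into_patches(224, 224, 16)  # 14 x 14 grid of patches
visible, masked = random_mask(patches, mask_ratio=0.75)

# The encoder would see only `visible`; the training target is the content of `masked`.
print(len(patches), len(visible), len(masked))  # 196 49 147
```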

As with pathology and radiology FMs, prospective multi-site trials for RETFound have not been published. The evidence base is benchmark-level, not deployment-level.

EHR and Clinical NLP

CLaMs and FEMRs both face a distinctive evidence quality problem in the EHR domain. The majority of clinical language models have been pre-trained and evaluated primarily on MIMIC-III — a dataset containing approximately 2 million notes written between 2001 and 2012 in the ICU of a single U.S. academic medical center. Benchmark NLP performance on MIMIC-III tasks does not generalize predictably to other institutions, EHR systems, patient populations, or contemporary clinical documentation styles.

FEMRs face a parallel problem. Most are trained on structured billing code sequences from single institutions. Coding practices vary substantially across health systems and EHR platforms, so a FEMR trained at one institution may produce meaningfully different patient embeddings for clinically identical patients at another.

The evidence supporting CLaMs and FEMRs is almost entirely limited to a single FM value proposition: improved predictive accuracy on specific benchmark tasks. Evidence for the other claimed benefits (simplified deployment, reduced need for labeled data, improved clinical outcomes) is largely absent from the published literature.

Clinical Decision Support

The most comprehensive assessment of clinical LLM evidence quality to date comes from a 2026 systematic review published in Nature Medicine (Chen et al.), which identified 4,609 peer-reviewed studies of LLMs in clinical medicine published between January 2022 and September 2025. The findings are structurally important for any evaluation of FM clinical decision support claims.

Of those 4,609 studies, only 1,048 — approximately 23% — used real patient data. The remaining studies evaluated LLMs on simulated clinical scenarios (1,857 studies) or board examination and knowledge retrieval tasks (1,704 studies). Among the studies using real patient data, only 19 were prospective randomized trials.
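The counts in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the figures reported above; the variable names are illustrative.

```python
# Study counts as reported in the text.
total = 4609
real_patient = 1048
simulated = 1857
exam_or_retrieval = 1704
prospective_rcts = 19

# The three categories should account for every study.
assert real_patient + simulated + exam_or_retrieval == total


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


shares = {
    "real patient data": pct(real_patient, total),
    "simulated scenarios": pct(simulated, total),
    "exam / knowledge retrieval": pct(exam_or_retrieval, total),
    "prospective RCTs (of all studies)": pct(prospective_rcts, total),
    "prospective RCTs (of real-patient studies)": pct(prospective_rcts, real_patient),
}

for label, value in shares.items():
    print(f"{label}: {value}%")
```

Run as written, this gives 22.7% for real patient data, 40.3% for simulated scenarios, and 37.0% for examination and retrieval tasks. Prospective randomized trials make up about 0.4% of all studies and about 1.8% of those using real patient data.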

Across 1,046 head-to-head comparisons between LLMs and human clinicians, LLMs outperformed humans in only 33% of comparisons. Performance declined sharply as clinician seniority increased and as task realism increased. LLMs performed comparatively better against medical students and on board-style questions; they performed worse against experienced clinicians on realistic patient care tasks.

Clinical application domains mapped to FM families, representative models, and current evidence level. Evidence tiers follow the framework from the 2026 Nature Medicine systematic review (Chen et al.). Absence of prospective multi-site validation is a consistent finding across all domains.
Domain	FM Family	Key Named Models	Evidence Level	Prospective Multi-Site Validation
Pathology	Vision FM, Multimodal FM	UNI, Virchow, TITAN	Benchmark (Tier II–III)	Absent as of mid-2026
Chest Radiology	Vision FM	CheXAgent, MAIRA-2	Benchmark (Tier II–III)	Absent; no FDA clearance
Ophthalmology	Vision FM	RETFound	Benchmark (Tier II–III)	Absent
EHR / NLP	CLaMs, FEMRs	ClinicalBERT, GatorTron, CLMBR	Benchmark on MIMIC-III (Tier III)	Absent; single-institution training data
Clinical Decision Support	CLaMs	Med-PaLM, GPT-4 variants	Mixed; 23% real patient data (Tier I–III)	19 prospective RCTs identified across all clinical LLM studies

Evidence Quality Framework: Tiers S, I, II, and III

The 2026 Chen et al. systematic review introduced a four-tier evidence quality framework that provides a structured vocabulary for evaluating clinical FM studies. The framework distinguishes studies not only by design type but by the realism of the clinical environment in which the FM was evaluated.

[Figure: stacked horizontal tier diagram showing four evidence quality levels, with Tier S at the top as the narrowest band and progressively wider bands for Tier I, Tier II, and Tier III at the base, illustrating that the vast majority of clinical AI studies occupy the lower tiers.] Distribution of clinical foundation model studies across the four evidence tiers. Tier S (prospective randomized trials in live clinical settings) represents a small fraction of the total evidence base. The pyramid shape reflects the distribution across the 4,609 studies identified in the 2026 Nature Medicine systematic review.

Tier S: Prospective randomized trials conducted in live clinical settings, with real patients and real clinical workflows. The highest evidence tier. Only 19 studies of this type were identified across 4,609 total studies.
Tier I: Retrospective or prospective studies using real patient data, but not randomized or conducted in live clinical environments. Includes retrospective cohort studies, prospective observational studies, and implementation evaluations. In total, 1,048 studies (about 23%) used real patient data and so reached at least Tier I.
Tier II: Evaluations using simulated or synthetic clinical scenarios: case vignettes, constructed patient cases, or structured clinical simulations. Realistic enough to test clinical reasoning but not using actual patient data. The largest single category, with 1,857 studies.
Tier III: Board examination and knowledge retrieval tasks: USMLE-style questions, specialty board questions, and factual recall challenges. The second-largest category, with 1,704 studies. Tests knowledge representation but not clinical reasoning in realistic patient care contexts.
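The four tiers above can be expressed as a small decision rule, which may help when sorting a vendor's cited studies during an evidence review. This is a sketch under stated assumptions: the `Study` fields and the `data_source` labels are hypothetical names introduced here, not part of the published framework.

```python
from dataclasses import dataclass


@dataclass
class Study:
    # One of "real_patients", "simulated", or "exam" (labels assumed for this sketch).
    data_source: str
    # True only for prospective randomized trials run in a live clinical setting.
    randomized_live_trial: bool = False


def evidence_tier(study):
    """Map a study description to tier S, I, II, or III as described in the text."""
    if study.data_source == "real_patients":
        return "S" if study.randomized_live_trial else "I"
    if study.data_source == "simulated":
        return "II"
    if study.data_source == "exam":
        return "III"
    raise ValueError(f"unknown data source: {study.data_source!r}")


cited = [
    Study("exam"),
    Study("simulated"),
    Study("real_patients"),
    Study("real_patients", randomized_live_trial=True),
]
print([evidence_tier(s) for s in cited])  # ['III', 'II', 'I', 'S']
```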

The tier distribution has direct implications for procurement and deployment decisions. A model demonstrating strong Tier III performance — passing a specialty board examination at or above average physician rates — provides no reliable evidence about performance in live patient care. The 33% LLM outperformance rate in head-to-head comparisons is itself drawn primarily from Tier I and Tier II studies; the 19 Tier S trials represent too small a sample to generalize.

Systematic Limitations Across All FM Families

Five structural limitations apply across all four FM families. These are not edge-case failure modes — they are consistent findings documented across the literature as of mid-2026.

1. Evaluation Misalignment

Most CLaMs have been benchmarked on NLP tasks drawn from MIMIC-III — a dataset that is temporally outdated (2001–2012), institutionally narrow (a single U.S. ICU), and stylistically unrepresentative of contemporary clinical documentation across diverse care settings. High performance on MIMIC-III NLP benchmarks does not validate a model's ability to extract clinically meaningful information from notes written in a different EHR system, by clinicians trained in different documentation conventions, or about patient populations with different demographic and comorbidity profiles.

FEMRs face a related but distinct misalignment. Most are evaluated on binary classification tasks — will this patient be readmitted within 30 days? — rather than on the broader value propositions that motivate FM development: simplified deployment across institutions, reduced need for labeled data, or improved generalization. The evaluation tasks do not test the properties that differentiate FEMRs from conventional narrow predictive models.

2. Hallucination

Hallucination in clinical foundation models refers to model-generated outputs that are factually incorrect, logically inconsistent, or unsupported by clinical evidence — in ways that could plausibly alter a clinical decision. The FDA has characterized these as "plausible errors": outputs that are coherent and confident in form but incorrect in substance, making them harder for clinicians to detect than obvious nonsense.

A global survey of clinicians found that 91.8% had encountered medical hallucinations from AI systems, and 84.7% considered them capable of causing patient harm. Physician audits have identified that the majority of residual hallucinations in clinical LLMs stem from causal or temporal reasoning failures rather than simple knowledge gaps — meaning the model fails not because it lacks a fact, but because it cannot correctly apply causal or sequential logic to a clinical scenario.

3. Bias and Equity

Training data for clinical FMs overrepresents certain institutions, populations, imaging equipment vendors, and coding practices. This overrepresentation can manifest as systematic underperformance on patient subgroups that are underrepresented in training data — including patients from community hospitals, rural settings, or demographic groups underrepresented in large academic medical center datasets.

A more subtle bias risk applies to vision FMs specifically: models can encode confounding features — scanner manufacturer, staining protocol, image acquisition settings — as strongly as or more strongly than the biological signals they are intended to detect. A pathology FM trained predominantly on images from a single scanner vendor may underperform on slides prepared with different staining protocols or scanned on different equipment, even if the underlying pathology is identical.
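One way to surface this kind of confounding is to report performance separately for each acquisition subgroup instead of as a single pooled number. The sketch below computes accuracy per scanner vendor. The records are toy data invented purely to show the mechanics and do not represent any real model or study.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Return {group: accuracy} for predictions grouped by the given metadata field."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        correct[group] += int(rec["prediction"] == rec["label"])
    return {group: correct[group] / totals[group] for group in totals}


# Toy records (fabricated for illustration only).
records = [
    {"scanner": "vendor_A", "label": 1, "prediction": 1},
    {"scanner": "vendor_A", "label": 0, "prediction": 0},
    {"scanner": "vendor_A", "label": 1, "prediction": 1},
    {"scanner": "vendor_A", "label": 0, "prediction": 0},
    {"scanner": "vendor_B", "label": 1, "prediction": 0},
    {"scanner": "vendor_B", "label": 0, "prediction": 0},
    {"scanner": "vendor_B", "label": 1, "prediction": 0},
    {"scanner": "vendor_B", "label": 0, "prediction": 0},
]

pooled = sum(r["prediction"] == r["label"] for r in records) / len(records)
per_scanner = accuracy_by_group(records, "scanner")
print(pooled, per_scanner)
```

In this toy case the pooled accuracy of 0.75 hides a split of 1.0 on one vendor and 0.5 on the other, which is the pattern a stratified report is meant to expose.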

4. Training Data Access Asymmetry

The best-performing clinical FMs — including Virchow (1.5M WSIs from MSKCC) and several leading CLaMs — are pre-trained on private or single-institution datasets that are not publicly accessible. This creates a structural asymmetry: academic groups and smaller health systems cannot reproduce, audit, or build on the training data that underlies the highest-performing models. Independent replication of reported benchmark results is therefore not possible in most cases.

This asymmetry also has equity implications. Health systems that cannot access or contribute to FM pretraining data are likely to receive models that perform less well on their patient populations. Privacy-preserving approaches such as federated learning offer a partial pathway toward broadening FM training data across institutions without centralizing sensitive patient records — see Federated Learning in Healthcare AI: Definition, Privacy Mechanisms, and Clinical Evidence for a detailed treatment.

5. Regulatory Immaturity

Clinical foundation models that function as software as a medical device (SaMD) fall within the FDA's SaMD regulatory framework. However, the application of that framework to generative and multimodal FMs remains fragmented. Generative models that produce clinical outputs — radiology reports, diagnostic suggestions, treatment recommendations — present regulatory classification challenges that the existing 510(k), De Novo, and PMA pathways were not designed to address.

The FDA's predetermined change control plan (PCCP) mechanism allows manufacturers to pre-specify planned model updates, but its application to continuously learning or periodically retrained FMs is not yet settled in guidance. Regulatory frameworks across different jurisdictions — the EU AI Act, MHRA guidance in the UK, and others — are developing in parallel with different classification logic, creating compliance complexity for multinational deployments.

Clinical Adoption Considerations

The evidence quality framework and limitation structure have direct implications for how health systems should approach FM adoption decisions. The following considerations apply across FM families, though their relative weight varies by family and clinical context.

Evidence Tier Requirements Before Clinical Use

Clinical applications where FM errors carry direct patient safety consequences — diagnostic decision support, treatment recommendation, medication review — require at minimum Tier I evidence: retrospective or prospective validation on real patient data from a population and clinical setting comparable to the intended deployment context. Tier III evidence (board examination performance) is not a substitute and should not be treated as such in procurement evaluation.

Applications with lower direct safety stakes — administrative coding assistance, documentation drafting support, patient communication drafting — may be appropriate for deployment with Tier I evidence and strong human oversight, provided the FM's output is reviewed before clinical action is taken.

Adaptation Strategies and Transfer Learning

Foundation models are rarely deployed in their pre-trained state. Fine-tuning on institution-specific data, domain adaptation, and parameter-efficient methods such as LoRA adapters are standard steps in moving from a pre-trained FM base to a deployed clinical application. The governance implications of these adaptation steps — including how fine-tuning affects regulatory status, how it interacts with the original training data license, and how it should be documented for post-market surveillance — are covered in detail in Transfer Learning and Fine-Tuning in Clinical AI: Definitions, Strategies, and Governance Implications.
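As a rough illustration of why parameter-efficient methods are attractive, LoRA freezes a pre-trained weight matrix of shape d x k and trains two low-rank factors of shapes d x r and r x k instead. The sketch below compares trainable parameter counts for a single layer. The dimensions and rank are assumed values, not figures from any model named in this article.

```python
def full_finetune_params(d, k):
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters for LoRA factors B (d x r) and A (r x k)."""
    return r * (d + k)


# Assumed layer shape and rank, for illustration only.
d, k, r = 1024, 1024, 8

full = full_finetune_params(d, k)
lora = lora_params(d, k, r)
fraction = lora / full

print(full, lora, fraction)  # 1048576 16384 0.015625
```

Under these assumptions the adapter trains about 1.6% of the layer's weights, which is part of why the adapted weights, the base model, and their respective licenses need to be documented separately.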

Human-in-the-Loop Requirements

Given the hallucination risk documented across clinical LLMs, the evaluation misalignment documented for CLaMs and FEMRs, and the absence of prospective multi-site validation for imaging FMs, fully autonomous clinical decision-making by any current FM family is not supported by the evidence base. Human review of FM outputs before clinical action is a minimum requirement across all high-stakes applications.

The specific form of human oversight — whether a clinician reviews every output, a sample of outputs, or only flagged outputs — should be calibrated to the error consequences of the specific application. Radiology report generation reviewed by a radiologist before sign-off represents a different oversight structure than a diagnostic suggestion presented directly in a clinical decision support alert.
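The three oversight structures mentioned above (review every output, a sample, or only flagged outputs) can be sketched as a simple routing rule. The policy names, the `flagged` input, and the default sampling rate are hypothetical; an actual policy would be set by clinical governance, not by code defaults.

```python
import random


def needs_clinician_review(policy, flagged=False, sample_rate=0.1, rng=None):
    """Decide whether one FM output is routed to a clinician before any action."""
    rng = rng or random
    if policy == "every_output":
        return True
    if policy == "flagged_only":
        return flagged
    if policy == "sample":
        # Each output is independently selected with probability `sample_rate`.
        return rng.random() < sample_rate
    raise ValueError(f"unknown policy: {policy!r}")


# A seeded generator keeps the sampling audit reproducible.
rng = random.Random(42)
reviewed = sum(
    needs_clinician_review("sample", sample_rate=0.2, rng=rng) for _ in range(1000)
)
print(f"{reviewed} of 1000 outputs routed for review")
```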

Addressing Data Access Asymmetry

Health systems concerned about FM performance on their specific patient populations — and unable to access or audit the pretraining data of commercially available FMs — have two primary options. First, fine-tuning on institution-specific labeled data can partially correct for training distribution mismatch, though this requires sufficient labeled data and introduces its own governance considerations. Second, federated learning approaches allow multiple institutions to jointly contribute to FM training or fine-tuning without centralizing patient data, potentially producing models that are better calibrated across diverse populations.
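The second option can be illustrated with federated averaging (FedAvg), one common aggregation scheme. The text does not specify an algorithm, so this choice is an assumption made for the sketch. Each site trains locally and shares only its model weights, which a coordinator combines in proportion to local sample counts. The weights and site sizes below are made-up numbers.

```python
def federated_average(site_weights, site_sizes):
    """Combine per-site weight vectors, weighting each site by its sample count."""
    if len(site_weights) != len(site_sizes):
        raise ValueError("one sample count is required per site")
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical sites; only weights and counts leave each site, not patient records.
site_a = [1.0, 2.0]  # trained on 100 local examples
site_b = [3.0, 4.0]  # trained on 300 local examples

global_weights = federated_average([site_a, site_b], [100, 300])
print(global_weights)  # [2.5, 3.5]
```

This sketch omits everything that makes federated learning hard in practice, including secure aggregation, repeated communication rounds, and handling of sites with very different data distributions.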

Before adopting a clinical FM, verify which evidence tier the supporting studies represent — benchmark studies (Tier II–III) do not establish clinical readiness.
Confirm whether the FM's training data includes patient populations, EHR systems, imaging equipment, and care settings comparable to your deployment context.
Check the FM's regulatory status directly in the FDA 510(k) database, and in the De Novo and PMA databases where relevant. Vendor claims about regulatory readiness should be independently verified.
Define human-in-the-loop oversight requirements before deployment, calibrated to the clinical consequences of FM errors in the specific application.
Document the adaptation steps applied to the base FM (fine-tuning, domain adaptation, prompt engineering) and assess whether those steps affect the model's regulatory classification.
Plan for post-deployment performance monitoring — FM performance can degrade over time as clinical documentation styles, coding practices, and patient population characteristics shift.
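For the last item in this checklist, a minimal monitoring sketch is a rolling accuracy check against the accuracy measured at validation. The class name, window size, baseline, and tolerance below are all assumed for illustration. A production monitor would also need a reliable source of ground-truth labels and subgroup-level tracking.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Flag when accuracy over the most recent outputs drops below a threshold."""

    def __init__(self, baseline, tolerance, window):
        self.threshold = baseline - tolerance
        self.outcomes = deque(maxlen=window)  # keeps only the latest `window` outcomes

    def record(self, was_correct):
        """Store one reviewed outcome; return True if an alert should be raised."""
        self.outcomes.append(bool(was_correct))
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data yet to judge
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.threshold


monitor = RollingAccuracyMonitor(baseline=0.90, tolerance=0.05, window=10)

# Ten correct outputs followed by two errors.
alerts = [monitor.record(ok) for ok in [True] * 10 + [False, False]]
print(alerts[-2:])  # [False, True]
```

The first error leaves rolling accuracy at 0.9, which is within tolerance. The second drops it to 0.8 and triggers the alert.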

Key Takeaways

Foundation models are defined by pre-training on broad data at scale and adaptability across downstream tasks — a structural departure from narrow, single-task clinical AI. The clinical FM landscape divides into four distinct families: CLaMs (clinical text), FEMRs (structured EHR timelines), vision FMs (medical imaging), and multimodal FMs (integrated data types). Evaluation frameworks must be matched to the correct family.
Evidence quality is sharply uneven by domain. Imaging FMs in pathology, radiology, and ophthalmology show strong benchmark performance (Tier II–III) but lack prospective multi-site validation. Clinical decision support LLMs have a larger evidence base, but 77% of studies do not use real patient data, and only 19 prospective randomized trials have been published across all clinical LLM research through September 2025.
Across 1,046 head-to-head comparisons between LLMs and human clinicians, LLMs outperformed humans in 33% of cases. Performance declined with clinician seniority and task realism. This figure does not support the characterization of current clinical LLMs as generally superior to experienced clinicians in realistic patient care contexts.
Five structural limitations constrain safe adoption across all FM families: evaluation misalignment between benchmark tasks and clinical value; hallucination risk (91.8% of surveyed clinicians reported encountering medical hallucinations, and 84.7% considered them capable of causing patient harm); training data bias from overrepresentation of specific institutions and populations; training data access asymmetry that limits reproducibility and equitable performance; and regulatory immaturity under the SaMD framework for generative and multimodal FM types.
Deployment decisions should be anchored to evidence tier, not benchmark performance. Human-in-the-loop oversight is required for all high-stakes clinical applications given the current evidence base. Regulatory status should be verified directly — not inferred from vendor claims.