AI and Health: Core Concepts for Clinical AI Evaluation

What "AI and Health" Actually Covers

The phrase "AI and health" is broad enough to be nearly meaningless without further qualification. In regulatory and clinical contexts, it spans at least four distinct domains that operate under different rules, have different evidence standards, and carry different risks.

Four domains within AI and health, with distinct regulatory and risk profiles.
Domain	Examples	Regulatory Category	Primary Risk Surface
AI-enabled medical devices	Radiology triage tools, ECG analysis software, retinal screening	FDA SaMD (510(k) / De Novo / PMA)	Diagnostic error, population mismatch, model drift
Clinical decision support software	Sepsis prediction alerts, drug interaction flagging	Varies — some FDA-regulated, some excluded	Alert fatigue, unvalidated thresholds, EHR integration failures
Generative AI in clinical workflows	AI scribes, discharge summary drafting, patient communication	Largely unregulated as of Q2 2026; some CDS rules apply	Hallucination, omission of critical findings, liability gaps
Administrative and operational AI	Prior authorization automation, revenue cycle, scheduling	Not FDA-regulated; payer and CMS policy applies	Bias in coverage decisions, lack of transparency, audit trail gaps

Most public discourse conflates these domains. A performance claim from a radiology AI vendor (validated in a prospective trial, FDA-authorized via De Novo) is not comparable to a claim from an AI scribe company (no FDA pathway, no controlled trial, minimal external validation). Treating them as equivalent is one of the most common evaluation errors.

The Regulatory Anchor: SaMD

Software as a Medical Device (SaMD) is the FDA's foundational classification for AI software that meets the statutory definition of a medical device. The definition matters because it determines whether a tool requires FDA clearance before clinical use — and therefore whether any pre-market evidence review has occurred at all.

SaMD is defined by intended use, not by technical architecture. A machine learning model that analyzes chest X-rays to detect pneumothorax is SaMD. The same model repurposed to analyze images for quality assurance in a manufacturing setting is not. The distinction turns on whether the software's output is intended to inform a clinical decision about a specific patient — not on whether it uses AI.

Clearance Pathways and What They Signal

Three pathways govern most FDA-authorized AI medical devices. Each carries different evidentiary weight and different post-market obligations.

FDA clearance pathways for AI-enabled medical devices. Risk class determines post-market surveillance obligations.
Pathway	Risk Class	Evidence Required	Common AI Applications
510(k)	Class II	Substantial equivalence to a predicate device	Radiology CAD, ECG analysis, retinal screening
De Novo	Class I or II (novel)	Performance testing; establishes new predicate	Novel AI functions without prior predicate
PMA	Class III	Reasonable assurance of safety and effectiveness (clinical trial data typically required)	High-risk diagnostic AI, AI embedded in implantable devices

The majority of FDA-cleared AI devices have used the 510(k) pathway. This is worth noting because 510(k) does not require randomized controlled trials or prospective validation in diverse populations — it requires a showing of substantial equivalence. That gap between regulatory clearance and clinical validation is where most evidence quality concerns concentrate.

How AI Models Are Evaluated: Performance Metrics

When a study reports that an AI model achieved "high accuracy," that statement is incomplete without context. The metrics that matter depend on the clinical task — and the same metric can look very different across populations.

AUROC

The Area Under the Receiver Operating Characteristic curve (AUROC, also written AUC) measures a model's ability to discriminate between positive and negative cases across all possible decision thresholds. An AUROC of 1.0 is perfect discrimination; 0.5 is random.

AUROC is useful for comparing models on the same dataset, but it does not tell you how the model performs at the specific threshold deployed in clinical practice. A model with AUROC 0.92 can still have poor sensitivity at the operating threshold chosen for deployment — particularly if that threshold was optimized for specificity to reduce alert burden.
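The gap between a summary AUROC and performance at a deployed threshold can be made concrete with a small sketch. The scores, labels, and the 0.8 threshold below are hypothetical toy values, not data from any real device, and the pairwise AUROC calculation is a simple illustration rather than an optimized implementation.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive case outscores
    a random negative case (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_at(labels, scores, threshold):
    """Share of actual positives flagged when score >= threshold."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    return sum(s >= threshold for s in pos) / len(pos)


# Hypothetical toy cohort: 4 positives, 6 negatives.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.70, 0.65, 0.60, 0.75, 0.30, 0.25, 0.20, 0.15, 0.10]

print(auroc(labels, scores))                # 0.875
print(sensitivity_at(labels, scores, 0.8))  # 0.25
```

On this toy data the model ranks cases well overall, yet a threshold set high to limit alerts catches only one of four true cases. That is the reason to ask for performance at the deployed operating point, not only the AUROC.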

Sensitivity and Specificity

Sensitivity is the proportion of actual positive cases the model correctly identifies. Specificity is the proportion of actual negative cases it correctly identifies. Both are threshold-dependent: raising the detection threshold increases specificity and decreases sensitivity, and vice versa.

In high-stakes screening applications (e.g., detecting intracranial hemorrhage or diabetic retinopathy), sensitivity is usually the priority metric. Missing a true positive is more costly than a false alarm that gets resolved downstream. The clinical deployment threshold should reflect that priority — but studies sometimes report performance at the threshold that maximizes overall accuracy, which can obscure poor sensitivity.
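As a rough illustration of how threshold selection changes the reported picture, the sketch below compares an accuracy-maximizing threshold with a sensitivity-first rule on a hypothetical low-prevalence cohort. The scores, the candidate thresholds, and the 0.95 sensitivity target are assumptions chosen for the example.

```python
def metrics_at(labels, scores, threshold):
    """Sensitivity, specificity, and accuracy when score >= threshold is called positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        flagged = s >= threshold
        if y == 1 and flagged:
            tp += 1
        elif y == 1:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return {
        "threshold": threshold,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical screening cohort: 4 positives followed by 16 negatives.
labels = [1] * 4 + [0] * 16
scores = [0.9, 0.6, 0.4, 0.35,
          0.5, 0.45, 0.4, 0.3, 0.3, 0.2, 0.2, 0.2,
          0.1, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05]

rows = [metrics_at(labels, scores, t) for t in (0.35, 0.55, 0.8)]

# Rule 1: pick the threshold with the highest overall accuracy.
by_accuracy = max(rows, key=lambda r: r["accuracy"])

# Rule 2: pick the highest threshold that still meets a sensitivity target.
eligible = [r for r in rows if r["sensitivity"] >= 0.95]
sensitivity_first = max(eligible, key=lambda r: r["threshold"])

print(by_accuracy)        # threshold 0.55: accuracy 0.9, sensitivity 0.5
print(sensitivity_first)  # threshold 0.35: sensitivity 1.0, specificity 0.8125
```

With few positives in the cohort, the accuracy-maximizing threshold misses half of the true cases while still reporting 90% accuracy. A sensitivity-first rule accepts more false alarms in exchange for catching every positive in this toy set.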

Algorithmic Bias in Healthcare AI

Algorithmic bias in healthcare AI refers to systematic performance disparities across patient subgroups — typically defined by race, sex, age, body habitus, or imaging equipment type. It is not a hypothetical concern: documented cases exist in dermatology AI (lower accuracy on darker skin tones), pulse oximetry-based models (known measurement artifacts in patients with higher melanin levels), and sepsis prediction tools (differential performance across hospital types).

A primary route by which bias enters AI models is training data composition. If a model is trained predominantly on data from academic medical centers in one geographic region, its performance on community hospital patients or patients from underrepresented demographic groups may be substantially lower than reported aggregate metrics suggest.

The practical implication: aggregate AUROC or sensitivity figures from a published study do not guarantee equivalent performance for every patient subgroup in your clinical setting. Subgroup analysis should be a standard part of any evidence review, and its absence should be flagged as a limitation — not overlooked.
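A subgroup check of this kind can be sketched in a few lines. The record layout (`group`, `label`, `score`), the subgroup names, and the 0.5 threshold are hypothetical; a real review would use the subgroups and operating point relevant to the deployment setting, with enough cases per subgroup for the estimates to be meaningful.

```python
from collections import defaultdict


def sensitivity(records, threshold):
    """Share of actual positives flagged at the threshold; None if no positives."""
    positives = [r for r in records if r["label"] == 1]
    if not positives:
        return None
    return sum(r["score"] >= threshold for r in positives) / len(positives)


def sensitivity_by_subgroup(records, threshold):
    """Apply the same sensitivity calculation within each subgroup."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["group"]].append(r)
    return {g: sensitivity(rs, threshold) for g, rs in sorted(grouped.items())}


# Hypothetical records from two care settings.
records = (
    [{"group": "academic", "label": 1, "score": s} for s in (0.9, 0.8, 0.7, 0.6, 0.3)]
    + [{"group": "community", "label": 1, "score": s} for s in (0.9, 0.6, 0.4, 0.3, 0.2)]
    + [{"group": g, "label": 0, "score": 0.1} for g in ("academic", "community")]
)

print(sensitivity(records, 0.5))              # 0.6
print(sensitivity_by_subgroup(records, 0.5))  # {'academic': 0.8, 'community': 0.4}
```

The aggregate figure of 0.6 sits between the two subgroups and describes neither of them well, which is why an aggregate metric alone cannot stand in for subgroup reporting.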

Model Drift and the Problem of Deployment Over Time

Model drift describes the degradation in a deployed model's performance as the real-world data it encounters diverges from the data it was trained on. In healthcare, drift sources include changes in patient population demographics, shifts in clinical practice (e.g., new imaging protocols, updated EHR coding practices), equipment upgrades, and seasonal disease pattern variation.

Drift is not always visible. A model can continue to generate outputs that look plausible while its calibration has degraded significantly. Without active monitoring — comparing model outputs against ground-truth labels on an ongoing basis — drift may go undetected for months.
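One simple form of active monitoring compares the mean predicted risk with the observed event rate in each time window. The sketch below assumes ground-truth outcomes become available for each window; the window names, the data, and the 0.05 tolerance are hypothetical, and a production monitor would need larger windows and a statistically grounded alert rule.

```python
def monitor_calibration(windows, tolerance=0.05):
    """For each (name, predictions, outcomes) window, compare the mean predicted
    probability with the observed event rate and flag large gaps."""
    report = []
    for name, preds, outcomes in windows:
        mean_predicted = sum(preds) / len(preds)
        observed_rate = sum(outcomes) / len(outcomes)
        gap = observed_rate - mean_predicted
        report.append({
            "window": name,
            "mean_predicted": mean_predicted,
            "observed_rate": observed_rate,
            "gap": gap,
            "flag": abs(gap) > tolerance,
        })
    return report


# Hypothetical data: the model's outputs look the same in both windows,
# but the observed event rate has shifted in the second one.
windows = [
    ("window_1", [0.1, 0.2, 0.1, 0.2, 0.4], [0, 0, 0, 0, 1]),
    ("window_2", [0.1, 0.2, 0.1, 0.2, 0.4], [0, 1, 0, 1, 1]),
]

report = monitor_calibration(windows)
for row in report:
    print(row["window"], round(row["gap"], 2), row["flag"])
```

The predictions in the second window are just as plausible as in the first. Only the comparison against outcomes reveals that the model now underestimates risk.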

The FDA's Predetermined Change Control Plan (PCCP) framework addresses this only partly: it lets manufacturers pre-specify the types of model modifications they expect to make post-clearance, along with the validation procedures for each, so that those changes can be implemented without a new marketing submission. But a PCCP only governs changes the manufacturer makes to the software; it does not govern the drift that occurs in the deployment environment without any software change.

Foundation Models and Multimodal AI

A foundation model is a large AI model trained on broad, often unlabeled data at scale, which is then adapted (fine-tuned) for specific downstream tasks. In healthcare, foundation models trained on medical imaging, clinical notes, genomic sequences, or combinations of these data types are being adapted for diagnostic, prognostic, and administrative tasks.

The regulatory and evidence implications are significant. A foundation model fine-tuned for one clinical task is not automatically validated for adjacent tasks — even if the underlying architecture is identical. Each downstream application requires its own validation, and in most cases, its own regulatory review if it meets the SaMD definition.

Multimodal AI refers to models that process and integrate multiple data types simultaneously — for example, combining radiology images with clinical notes and lab values to generate a diagnostic output. Multimodal architectures are increasingly common in clinical AI research, but they introduce additional validation complexity: performance can degrade when one modality is missing or lower quality, and subgroup bias may manifest differently across modality combinations.

Hallucination in Clinical Generative AI

Hallucination refers to the generation of plausible-sounding but factually incorrect outputs by large language models (LLMs). In general-purpose LLM use, hallucination is an inconvenience. In clinical contexts, it is a patient safety risk.

Documented hallucination patterns in clinical LLM applications include: fabricated medication dosages in discharge summaries, invented laboratory values in clinical notes, incorrect drug interaction warnings, and misattributed patient history. These errors are particularly dangerous because they are embedded in fluent, grammatically correct text that clinicians may not scrutinize at the level they would a structured data field.

Hallucination vs. Model Error

Hallucination is distinct from ordinary model error. A discriminative model (e.g., a classifier) makes errors by assigning the wrong label to an input. A generative model hallucinates by constructing outputs that have no grounding in the input at all — or that selectively misrepresent it. The error mode is different, and so is the detection strategy.

Retrieval-augmented generation (RAG) architectures reduce hallucination by grounding LLM outputs in retrieved source documents, but do not eliminate it. Outputs should still be reviewed against source records, particularly for clinical documentation that will be used in care decisions.
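Part of that review can be supported by crude automated checks. The sketch below flags numeric values in a generated draft that never appear in the source record. It is a deliberately narrow heuristic, not a hallucination detector: the function name and the example texts are hypothetical, it compares numbers as literal strings (so "25" and "25.0" would not match), and it ignores units and context.

```python
import re

# Matches integers and simple decimals such as "25" or "4.1".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def ungrounded_numbers(draft, source):
    """Return numbers that appear in the draft but nowhere in the source text."""
    source_numbers = set(NUMBER.findall(source))
    return [n for n in NUMBER.findall(draft) if n not in source_numbers]


# Hypothetical source record and generated draft.
source = "Metoprolol 25 mg twice daily. Potassium 4.1 mmol/L."
draft = "Discharged on metoprolol 50 mg twice daily; potassium was 4.1 mmol/L."

print(ungrounded_numbers(draft, source))  # ['50']
```

A flagged value is a prompt for human review, and an empty result does not mean the draft is correct: omissions, misattributions, and wrong non-numeric statements all pass this check untouched.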

External Validation and Why It Matters

External validation means testing a model on a dataset that was not used in training or internal testing — typically from a different institution, geographic region, or time period. It is the most important single indicator of whether a reported performance metric is likely to hold in a new deployment context.

A substantial share of published healthcare AI studies rely on internal validation only — splitting a single dataset into training and test sets from the same source. Internal validation metrics are almost always optimistic. Performance typically drops when the model is applied to data from a different hospital system, scanner manufacturer, or patient population.

Internal validation: model tested on held-out portion of the same dataset used for training. Optimistic; does not generalize reliably.
External validation (single site): model tested on data from one independent institution. Better, but may not reflect broader population diversity.
Multi-site external validation: model tested across multiple independent sites with different patient populations and equipment. Strongest generalizability evidence.
Prospective validation: model tested on prospectively collected data in a real clinical workflow. Addresses temporal drift and workflow integration effects that retrospective studies miss.

NLP in Healthcare: What It Does and Where It Fails

Natural language processing (NLP) in healthcare refers to AI methods applied to unstructured clinical text — physician notes, radiology reports, discharge summaries, operative records. NLP enables tasks like information extraction (pulling diagnoses or medications from free text), clinical coding, and documentation generation.

The failure modes are well documented. NLP models trained on notes from one institution often underperform on notes from another because clinical documentation style varies significantly: abbreviations, negation patterns, and template structures differ across EHR systems and specialties. A model that correctly extracts "no history of MI" from one note format may misclassify the same clinical fact written differently in another.

Negation handling is a persistent challenge. Clinical notes are full of negated findings ("denies chest pain," "no acute findings"), and models that fail to correctly process negation will systematically over-report conditions — a bias that compounds in downstream tasks like cohort identification or automated coding.
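The sketch below shows why negation is fragile, using a simplified trigger-word rule in the spirit of rule-based approaches such as NegEx. The trigger list, the five-token window, and the example sentences are illustrative assumptions, not a validated clinical lexicon.

```python
import re

# Illustrative trigger phrases that negate a concept when they precede it.
NEGATION_TRIGGERS = ("no", "denies", "without", "negative for")


def is_negated(sentence, concept, window=5):
    """Return True if a trigger appears within `window` tokens before the concept,
    False if the concept appears without one, None if the concept is absent."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    concept_tokens = concept.lower().split()
    n = len(concept_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == concept_tokens:
            preceding = " ".join(tokens[max(0, i - window):i])
            return any(
                re.search(rf"\b{re.escape(trigger)}\b", preceding)
                for trigger in NEGATION_TRIGGERS
            )
    return None


print(is_negated("Patient denies chest pain.", "chest pain"))       # True
print(is_negated("No history of MI.", "MI"))                        # True
print(is_negated("Reports chest pain on exertion.", "chest pain"))  # False
print(is_negated("MI ruled out.", "MI"))                            # False (missed negation)
```

The last case is the failure mode described above: the same clinical fact, written with the negation after the concept, is read as a positive finding. A system built this way would over-report MI wherever that phrasing is common.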

Reporting Standards: CONSORT-AI and TRIPOD-AI

Two reporting standards are relevant when evaluating published healthcare AI studies.

CONSORT-AI extends the standard CONSORT checklist for randomized controlled trials to cover AI-specific elements: description of the AI intervention, handling of missing data, subgroup analyses by relevant demographic factors, and reporting of model failure modes. Studies that do not follow CONSORT-AI may omit information that is material to evaluating generalizability.

TRIPOD-AI (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis, AI extension; published in 2024 as TRIPOD+AI) applies to prediction model studies. It calls for explicit reporting of model development and validation methodology, calibration statistics (not just discrimination metrics like AUROC), and the intended clinical context. Calibration, meaning how well predicted probabilities match observed event rates, is frequently omitted from published studies but matters substantially for clinical use.
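To see why calibration needs its own reporting, the sketch below compares two hypothetical sets of predictions that rank patients identically (and so share the same AUROC) but differ in calibration. The probabilities, outcomes, and two-bin grouping are toy assumptions; real calibration assessment uses far more cases and typically a calibration plot.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def calibration_bins(probs, outcomes, n_bins=2):
    """Group predictions into equal-width probability bins and compare the
    mean predicted probability with the observed event rate in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue
        rows.append({
            "bin": i,
            "n": len(members),
            "mean_predicted": sum(p for p, _ in members) / len(members),
            "observed_rate": sum(y for _, y in members) / len(members),
        })
    return rows


# Hypothetical outcomes and two models with the same rank order of predictions.
outcomes = [0, 0, 1, 0, 1, 1]
probs_a = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
probs_b = [p / 2 for p in probs_a]  # same ranking, systematically lower risk

brier_a = brier_score(probs_a, outcomes)
brier_b = brier_score(probs_b, outcomes)
rows_b = calibration_bins(probs_b, outcomes)

print(round(brier_a, 4), round(brier_b, 4))  # 0.18 0.2533
print(rows_b)
```

Model B predicts an average risk of about 0.25 for a group in which half the patients had the event. A discrimination metric cannot show this, because halving every probability leaves the ranking unchanged.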

Ambient Intelligence in Clinical Settings

Ambient intelligence refers to AI systems that passively observe clinical encounters — typically through microphones, cameras, or sensor arrays — and generate structured outputs without requiring active clinician input. The primary current application is ambient documentation: AI listens to a patient-clinician conversation and generates a draft clinical note.

Ambient AI systems sit at the intersection of several regulatory and ethical ambiguities. Most ambient documentation tools are not classified as medical devices under current FDA policy, because their primary function is administrative rather than diagnostic. But their outputs directly influence clinical documentation, which in turn influences care decisions — creating an indirect patient safety pathway that existing regulatory frameworks do not fully address.

Consent and privacy are separate concerns. Recording clinical conversations raises HIPAA considerations and, in some states, two-party (all-party) consent requirements. Institutions deploying ambient AI should verify that their legal counsel has reviewed the consent framework, not just the technical integration.

How These Concepts Connect Across the Site

These concepts are not isolated definitions — they recur across every content group on this site and need to be read in relation to each other.

FDA Device Records list the clearance pathway, intended use, and authorization date: the regulatory anchor for any evaluation.
Evidence Appraisals apply the validation concepts above — external validation status, AUROC, subgroup diversity — to specific published studies.
Regulatory Tracker records how FDA guidance on SaMD, PCCP, and CDS exemptions has evolved over time.
Clinical Deployment Reports show where model drift, hallucination, and algorithmic bias manifest in actual hospital deployments — as opposed to controlled study conditions.
Specialty Landscapes aggregate these concepts per clinical specialty, showing which bias concerns are specific to radiology data versus pathology data versus primary care documentation.
