What "AI" Actually Means in a Healthcare Context
The term "artificial intelligence" covers a wide range of computational methods, and its meaning shifts significantly depending on whether you're reading a clinical study, an FDA submission, or a vendor press release. For evaluation and regulatory purposes, what matters is not the label but the function: what the system does, what decisions it influences, and whether it meets the definition of a Software as a Medical Device (SaMD) under FDA's classification framework.
FDA's AI/ML-based SaMD framework applies to software that uses machine learning, deep learning, or related techniques to analyze patient data and inform clinical decisions, provided its intended use meets the definition of a device. The deciding factor is intended use, not the underlying technique. Clinical decision support whose reasoning a clinician can independently review is often outside device regulation, which is more commonly the case for tools that follow fixed, rule-based logic. A model that learns from data to produce patient-specific outputs typically falls within it.
This distinction matters practically. Many tools marketed as "AI-powered" in healthcare include rule-based components, statistical models, and machine learning in varying proportions. Evaluating a tool requires knowing which components drive the clinical output and whether those components have been through regulatory review.
Major Application Categories
Healthcare AI divides into several distinct application categories that differ in their data inputs, regulatory pathways, evidence requirements, and deployment risks. Conflating them makes evaluation harder.
| Category | Typical Input | Regulatory Pathway | Primary Risk |
|---|---|---|---|
| Medical imaging AI | Radiology, pathology, or ophthalmology images | 510(k) or De Novo (SaMD) | False negatives, demographic performance gaps |
| Clinical decision support (CDS) | EHR data, lab values, vital signs | 510(k) if meeting SaMD criteria; exempt if non-device CDS | Alert fatigue, over-reliance, miscalibration |
| Ambient documentation / AI scribe | Audio of clinical encounter | Generally not device-regulated (as of Q2 2026) | Hallucination in transcription, consent issues |
| Drug discovery AI | Molecular structure, genomic, proteomic data | Not device-regulated at discovery stage | Reproducibility, training data bias |
| Revenue cycle and prior authorization AI | Claims data, billing codes, utilization patterns | Not device-regulated | Denial accuracy, equity in coverage decisions |
| Predictive risk scoring | Longitudinal patient records | 510(k) or De Novo depending on intended use | Model drift, training-deployment distribution shift |
How FDA Classifies and Authorizes Healthcare AI
FDA authorization is not a single category. The three pathways — 510(k), De Novo, and PMA — carry different evidentiary requirements and imply different levels of regulatory scrutiny.
- 510(k) clearance requires demonstrating substantial equivalence to a legally marketed predicate device. Most FDA-authorized radiology AI tools have gone through this pathway. It does not require prospective clinical trials.
- De Novo authorization applies when no predicate exists. It establishes a new device type and sets special controls for future submissions. Several AI tools addressing novel clinical functions have used this pathway.
- PMA (Premarket Approval) is reserved for Class III high-risk devices and requires valid scientific evidence — typically including clinical studies. Very few AI devices have gone through PMA to date.
Key Technical Concepts for Evaluation
AUROC and Performance Metrics
The Area Under the Receiver Operating Characteristic Curve (AUROC, also written AUC) is the most commonly reported performance metric for binary classification tasks in healthcare AI — detecting disease vs. not, flagging high-risk vs. low-risk. An AUROC of 1.0 represents perfect discrimination; 0.5 is no better than chance.
AUROC alone does not tell you how a model performs at the operating threshold actually used in clinical practice. Sensitivity (the proportion of patients who have the condition that the model correctly flags) and specificity (the proportion of patients without the condition that the model correctly clears) describe performance at a specific decision threshold. A model with an AUROC of 0.92 might still have a clinically unacceptable false negative rate at the threshold a health system actually deploys.
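To make the distinction concrete, here is a minimal sketch in plain Python. The ten labels and scores are invented for illustration and do not come from any real device, and the function names `auroc` and `sensitivity_specificity` are assumptions of this example, not part of any library. It shows that one AUROC value is compatible with very different sensitivity and specificity depending on the threshold chosen.

```python
def auroc(labels, scores):
    """Probability that a random positive case outscores a random negative case.

    Ties count as half a win. This pairwise form is O(n^2) and is meant for
    illustration, not for large datasets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Invented toy data: 1 = disease present, 0 = disease absent.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.35, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05]

print("AUROC:", auroc(labels, scores))
for t in (0.30, 0.50, 0.75):
    sens, spec = sensitivity_specificity(labels, scores, t)
    print(f"threshold={t:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy data the AUROC stays at 0.875 regardless of threshold, while moving the threshold from 0.30 to 0.75 drops sensitivity from 1.0 to 0.5. That is the gap a headline AUROC cannot show.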
When reading evidence appraisals, check whether the reported metrics come from the same dataset used to train the model (internal validation), a held-out test set from the same institution, or an independent external dataset. External validation on a demographically distinct population is the strongest signal that a model generalizes beyond its training environment.
Model Drift
Model drift refers to the degradation of a deployed AI model's performance over time as the real-world data distribution shifts away from the training distribution. In healthcare, this can happen when patient demographics change, when clinical protocols are updated, when imaging equipment is upgraded, or when EHR coding practices evolve.
Drift is a post-deployment concern, not a pre-deployment one. A model can perform well at launch and silently degrade over 12–18 months without triggering any visible failure. Health systems deploying AI tools need monitoring frameworks that compare current model output distributions against baseline, not just periodic manual audits.
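One way to compare current output distributions against a baseline is the Population Stability Index (PSI). The sketch below is a minimal illustration under stated assumptions: model scores lie in [0, 1], bins are equal-width, and the function name `population_stability_index` and the synthetic score lists are invented for this example. The commonly quoted cut-offs (roughly 0.1 for "watch" and 0.25 for "investigate") are a convention borrowed from credit-risk modelling, not a clinical or regulatory standard.

```python
import math


def population_stability_index(baseline, current, n_bins=10, eps=1e-6):
    """PSI between two samples of model scores in [0, 1], using equal-width bins.

    eps replaces empty-bin fractions so the logarithm stays defined.
    """
    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = min(int(v * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
            counts[idx] += 1
        total = len(values)
        return [max(c / total, eps) for c in counts]

    base_frac = bin_fractions(baseline)
    curr_frac = bin_fractions(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base_frac, curr_frac))


# Synthetic scores: a uniform baseline and a current window shifted upward.
baseline_scores = [i / 100 for i in range(100)]
current_scores = [min(i / 100 + 0.2, 0.99) for i in range(100)]

print("PSI, baseline vs itself:", population_stability_index(baseline_scores, baseline_scores))
print("PSI, baseline vs shifted:", population_stability_index(baseline_scores, current_scores))
```

PSI only detects a change in the score distribution. It does not say whether accuracy has changed, so a high value is a prompt to re-check performance against fresh labels, not a verdict on the model.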
FDA's concept of a Predetermined Change Control Plan (PCCP) addresses this directly. A PCCP allows manufacturers to specify in advance what types of model updates are permissible without requiring a new marketing submission (for most AI devices, a new 510(k)), provided the modifications stay within the performance boundaries the plan defines and the plan itself was authorized by FDA as part of the device's marketing submission.
Algorithmic Bias
Algorithmic bias in healthcare AI refers to systematic differences in model performance across demographic subgroups — race, sex, age, body habitus, socioeconomic status — that produce inequitable clinical outcomes. The mechanism is usually training data: if a model is trained predominantly on data from one population, it will often perform less accurately on underrepresented groups.
Bias is not always visible in aggregate performance metrics. A model with an AUROC of 0.89 across the full test set might have an AUROC of 0.74 for a specific demographic subgroup — a gap that aggregate reporting would obscure. Evaluators should look for subgroup performance stratification in any study they appraise.
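The following sketch shows what subgroup stratification looks like in practice. It is a toy illustration: the two groups "A" and "B", their labels, and their scores are invented, and `auroc` and `auroc_by_subgroup` are hypothetical helper names. Real appraisals would also need confidence intervals, since subgroup samples are often small.

```python
from collections import defaultdict


def auroc(labels, scores):
    """Pairwise AUROC: share of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_by_subgroup(records):
    """records: iterable of (subgroup, label, score). Returns {subgroup: AUROC}."""
    grouped = defaultdict(lambda: ([], []))
    for group, label, score in records:
        grouped[group][0].append(label)
        grouped[group][1].append(score)
    return {g: auroc(ys, ss) for g, (ys, ss) in grouped.items()}


# Invented toy records: (subgroup, true label, model score).
records = [
    ("A", 1, 0.9), ("A", 1, 0.8), ("A", 0, 0.2), ("A", 0, 0.1),
    ("B", 1, 0.6), ("B", 1, 0.3), ("B", 0, 0.5), ("B", 0, 0.2),
]

overall = auroc([y for _, y, _ in records], [s for _, _, s in records])
per_group = auroc_by_subgroup(records)
print("overall:", overall)
print("by subgroup:", per_group)
```

Here the pooled AUROC is 0.9375, while group B on its own reaches only 0.75. A report that gives only the pooled figure would hide that difference.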
Foundation Models and Generative AI
Foundation models are large-scale models trained on broad datasets — text, images, or both — that can be fine-tuned for specific downstream tasks. In healthcare, this includes large language models (LLMs) fine-tuned for clinical note summarization, question answering, or discharge documentation, as well as vision-language models applied to radiology and pathology.
Generative AI introduces a specific failure mode: hallucination. A generative model can produce fluent, confident-sounding text that contains factual errors — fabricated lab values, misattributed diagnoses, invented medication names. In ambient documentation tools, this means a clinician reviewing an AI-generated note may encounter plausible but inaccurate content that was never spoken during the encounter.
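As a rough illustration of how such content might be surfaced for review, the sketch below flags numeric values that appear in a generated note but nowhere in the encounter transcript. It is a deliberately naive heuristic under stated assumptions: the transcript already renders numbers as digits, the example strings are invented, and `unsupported_numbers` is a hypothetical helper, not a feature of any documentation product. It cannot catch non-numeric fabrications such as an invented diagnosis.

```python
import re

# Matches integers and simple decimals such as 128 or 4.1.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(transcript, note):
    """Return numbers found in the note that never occur in the transcript."""
    spoken = set(NUMBER.findall(transcript))
    return [n for n in NUMBER.findall(note) if n not in spoken]


# Invented example strings for illustration only.
transcript = "Blood pressure today was 128 over 82. Continue metformin 500 mg twice daily."
note = "BP 128/82. Metformin 500 mg BID. Potassium 4.1."

flagged = unsupported_numbers(transcript, note)
print("values to verify against the encounter:", flagged)
```

In this example the potassium value is flagged because nothing in the transcript supports it. A check like this can direct a reviewer's attention but does not replace clinician review of the full note.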
Multimodal AI
Multimodal AI refers to models that integrate more than one data type — for example, combining imaging data with clinical text, genomic sequences with structured EHR variables, or audio with visual signals. In healthcare, multimodal approaches are particularly active in oncology (combining pathology images with molecular profiles) and in primary care risk stratification (combining imaging findings with patient history).
Multimodal models introduce evaluation complexity: a model that performs well on combined inputs may degrade substantially when one modality is missing or of lower quality than the training distribution. Deployment environments rarely match training conditions on all modalities simultaneously.
Evidence Standards: What to Look For
The quality of evidence supporting a healthcare AI tool varies enormously. Most FDA-cleared devices were authorized based on retrospective studies or reader studies — not prospective randomized trials. The gap between regulatory authorization and clinical evidence is one of the most important distinctions to track.
| Study Type | Strength | Common Limitation in AI Studies |
|---|---|---|
| Randomized controlled trial (RCT) | Highest — controls for confounders | Few AI RCTs exist; often underpowered or short follow-up |
| Prospective validation | Strong — real-world data collection, pre-specified endpoints | Single-site or limited demographic diversity |
| Retrospective cohort | Moderate: large datasets feasible | Selection bias, data quality issues, often no measurement of patient outcomes |
| Reader study | Moderate for imaging tasks — direct comparison with clinicians | Controlled conditions rarely match real deployment |
| Internal validation only | Weak for generalizability claims | Overfitting risk; does not demonstrate external performance |
The CONSORT-AI reporting standard is a checklist extension to the CONSORT trial reporting guidelines, written specifically for trials of AI interventions. Studies that follow CONSORT-AI report the AI system's version, training data characteristics, and how the model was integrated into the trial workflow. When appraising a study, checking whether it adheres to CONSORT-AI (or, for prediction model studies, TRIPOD+AI, sometimes written TRIPOD-AI) gives a quick signal of methodological transparency.
Deployment Conditions vs. Study Conditions
A model's performance in a controlled study environment and its performance after deployment in a real clinical workflow are frequently different. Several factors account for this gap.
- Case mix shift: The patient population at the deploying institution may differ from the study population on age, comorbidity burden, imaging equipment, or documentation practices.
- Workflow integration: How a model output is surfaced to clinicians — as an alert, a worklist flag, a background annotation — affects whether it changes clinical behavior. A model that performs well in isolation may have minimal impact if its output is ignored due to alert fatigue.
- Label quality: Retrospective studies often use administrative codes or historical clinical judgments as ground truth. These labels may not reflect the actual clinical finding with the precision the model was trained to detect.
- Temporal drift: Equipment upgrades, protocol changes, and population shifts over time can all degrade model performance without any change to the model itself.
This is why deployment reports — accounts of what actually happened when a tool was integrated into a real health system — carry different information value than controlled studies. They don't replace evidence appraisals, but they surface failure modes and implementation realities that controlled studies are not designed to capture.
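A short worked example shows why case mix shift alone can change what clinicians experience, even when the model is unchanged. The sketch below applies Bayes' rule to compute positive predictive value (PPV) from sensitivity, specificity, and prevalence. The figures of 0.90 sensitivity and specificity and the two prevalence values are illustrative assumptions, not results from any study, and it assumes sensitivity and specificity carry over unchanged to the new site, which is itself often not true.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of positive model outputs that are true positives, via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Illustrative assumption: the same model at two sites with different prevalence.
study_ppv = positive_predictive_value(0.90, 0.90, prevalence=0.20)
deployed_ppv = positive_predictive_value(0.90, 0.90, prevalence=0.02)

print(f"PPV at 20% prevalence: {study_ppv:.2f}")
print(f"PPV at 2% prevalence:  {deployed_ppv:.2f}")
```

With identical sensitivity and specificity, roughly two in three alerts are correct at 20% prevalence, but fewer than one in six at 2%. A tool validated in an enriched study cohort can therefore generate far more false alerts in a general population, which feeds directly into the alert fatigue described above.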
How This Site Organizes Healthcare AI Information
The concepts defined in this entry recur across all content groups on this site. Understanding them precisely is necessary for using the site's structured records effectively.
- FDA Device Records document authorization pathway, intended use, and real-world evidence status for individual cleared devices. Clearance pathway (510(k), De Novo, PMA) is a primary organizing field.
- Evidence Appraisals apply the evidence hierarchy above to individual peer-reviewed studies, reporting study design, dataset characteristics, external validation status, and performance metrics in context.
- Clinical Deployment Reports cover real-world implementation — EHR integration method, staff adoption, documented failure modes — distinct from controlled study evidence.
- Regulatory Tracker logs FDA guidances, policy statements, and enforcement actions affecting SaMD — including PCCP guidance, AI transparency requirements, and ONC/CMS policy intersections.
- Specialty Landscapes synthesize the state of AI across a clinical specialty, aggregating cleared devices, key studies, active trials, and known equity concerns in one navigable layer.