Figure: Conceptual illustration of a partially built bridge, missing planks, between a bright blue research lab with benchmark scoreboards and a muted amber hospital scene behind a locked FDA gate.
Caption: The gap between benchmark performance and clinical deployment readiness remains the defining feature of Google's clinical AI models as of mid-2026.

Introduction: The Benchmark-to-Bedside Gap

Google's clinical AI models have achieved headline-grabbing benchmark scores: Med-PaLM 2 became the first large language model to exceed 85% on USMLE-style questions, and Med-Gemini pushed that figure to 91.1%. These numbers circulate widely in conference presentations, press releases, and vendor evaluations. For clinicians and clinical informaticians evaluating AI tools for institutional adoption, however, benchmark performance is only one dimension of a much more complex assessment.

This article provides a systematic, evidence-grounded evaluation of Google's clinical AI models (from Med-PaLM through Med-Gemini, MedLM, and Vertex AI Search for Healthcare) against five domains: accuracy, safety, bias, deployment readiness, and regulatory status. The core finding is consistent across every model family: state-of-the-art benchmark performance has not yet translated into clinically deployable, FDA-authorized products. As of June 2026, no Google clinical AI model has received FDA 510(k) clearance, De Novo authorization, or PMA approval, and all carry explicit disclaimers against direct clinical use.

The analysis is organized by model family, with each section examining published evidence, methodological limitations, and the gap between research claims and what is required for safe clinical deployment. For readers seeking background on the underlying technology, the site's Foundation Models in Healthcare entry provides architectural context.

Med-PaLM and Med-PaLM 2: The Foundation Models

Med-PaLM, published in Nature in 2023, was the first large language model to exceed the pass mark on USMLE-style questions, achieving 67.6% accuracy on the MedQA benchmark. Google's evaluation framework went beyond a single accuracy figure, assessing the model across multiple criteria including scientific consensus, medical reasoning, knowledge recall, bias, and likelihood of possible harm. This multi-dimensional approach set a methodological precedent for subsequent evaluations.

Med-PaLM 2, announced in April 2023 and made available in limited access to select Google Cloud customers, raised MedQA accuracy to 85%+, a gain of more than 17 percentage points over Med-PaLM. It was also the first AI system reported to reach a passing score on MedMCQA, a dataset of Indian AIIMS and NEET medical examination questions, with 72.3% accuracy. The model was evaluated on the same multi-criteria framework as its predecessor, with published results showing improvement across all dimensions.

Comparison of Med-PaLM and Med-PaLM 2 benchmark performance and availability status.
| Metric | Med-PaLM | Med-PaLM 2 |
| --- | --- | --- |
| MedQA (USMLE-style) accuracy | 67.6% | 85%+ |
| AIIMS / NEET exam performance | Not evaluated | 72.3% (first AI to pass) |
| Publication venue | Nature (2023) | Nature (2023); arXiv |
| Evaluation criteria | Multi-criteria (consensus, reasoning, bias, harm) | Multi-criteria (same framework) |
| Availability | Research only | Limited access via Google Cloud (April 2023) |
| FDA clearance | None | None |

Med-Gemini: Multimodal State-of-the-Art

Med-Gemini, published in May 2024, extended Google's clinical AI capabilities from text-only to multimodal inputs. The model achieved 91.1% accuracy on MedQA, surpassing Med-PaLM 2 by 4.6 percentage points, a margin that puts Med-PaLM 2 at 86.5% and is consistent with the 85%+ figure cited earlier. Med-Gemini set a new state of the art on 10 of 14 medical benchmarks spanning text, multimodal, and long-context applications, and it outperformed GPT-4V on every benchmark where a direct comparison was possible.

Beyond benchmark scores, Med-Gemini demonstrated novel capabilities that distinguish it from prior models: interpretation of complex 3D CT scans, and genomic risk prediction via Med-Gemini-Polygenic, which outperformed previous polygenic risk scores for eight health outcomes including coronary artery disease, type 2 diabetes, stroke, and all-cause mortality. More than half of Med-Gemini-3D-generated CT reports were determined to result in the same care recommendations as radiologist-generated reports.

Med-Gemini benchmark performance and novel capabilities as reported in Google's May 2024 publication.
| Capability | Med-Gemini Performance | Comparison |
| --- | --- | --- |
| MedQA accuracy | 91.1% | Surpassed Med-PaLM 2 by 4.6 percentage points |
| Medical benchmarks (text, multimodal, long-context) | State-of-the-art on 10/14 | Surpassed GPT-4V on all directly comparable benchmarks |
| 3D CT interpretation | >50% of reports equivalent to radiologist recommendations | Novel capability, not previously benchmarked |
| Genomic risk prediction (Med-Gemini-Polygenic) | Outperformed previous polygenic scores for 8 outcomes | Predicted 6 additional untrained outcomes |
| Publication status | Research blog (May 2024) | Not a commercial product |

A related open-weight model, MedGemma, was released in July 2025. The 4B parameter variant scored 64.4% on MedQA, and in an unblinded study, 81% of its chest X-ray reports were judged by a single US board-certified radiologist to result in similar patient management compared to original radiologist reports. The 27B text variant scored 87.7% on MedQA, within 3 points of DeepSeek R1 at approximately one-tenth the inference cost. These models are open-weight and intended for research and development, not clinical deployment.

MedLM and Vertex AI Search for Healthcare: Enterprise Deployment

MedLM, a family of two models built on Med-PaLM 2, was made available on Vertex AI to US customers in December 2023. Unlike the research-stage Med-PaLM and Med-Gemini models, MedLM is positioned as an enterprise tool for healthcare organizations to build and deploy AI applications within Google Cloud's infrastructure. The two-model architecture — one optimized for complex reasoning tasks and another for faster, lighter inference — reflects a practical deployment consideration.

The most advanced clinical deployment of MedLM to date is the HCA Healthcare pilot with Augmedix, an ambient documentation solution. Physicians at four emergency department sites use a hands-free device to capture patient encounters, with MedLM generating medical notes that are reviewed before transfer to the EHR. This represents a real-world clinical workflow integration, but published outcomes data remain minimal.

  • HCA Healthcare / Augmedix pilot: 4 ED sites, ambient documentation, MedLM-powered note generation
  • BenchSci integration: MedLM incorporated into ASCEND platform for pre-clinical research literature analysis
  • Accenture partnership: Healthcare process automation using MedLM on Vertex AI
  • Deloitte partnership: Provider search and clinical workflow optimization

Vertex AI Search for Healthcare, announced alongside MedLM, provides medically-tuned search capabilities over FHIR-formatted clinical records. The tool is designed to enable clinicians and researchers to query structured and unstructured EHR data using natural language, with results grounded in the organization's own data rather than general web knowledge. Like MedLM, it is an enterprise tool, not a medical device, and carries disclaimers against direct clinical decision-making.
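For readers unfamiliar with what "FHIR-formatted" means in practice, the sketch below parses one synthetic FHIR R4 Observation and flattens the fields a natural-language query might refer to. It does not show or imply anything about the Vertex AI Search for Healthcare API itself. The record, the patient reference, and the helper name `summarize_observation` are all hypothetical and exist only for illustration.

```python
import json

# A hand-written, synthetic FHIR R4 Observation. Not real patient data.
OBSERVATION_JSON = """
{
  "resourceType": "Observation",
  "status": "final",
  "code": {
    "coding": [
      {"system": "http://loinc.org", "code": "4548-4",
       "display": "Hemoglobin A1c/Hemoglobin.total in Blood"}
    ]
  },
  "subject": {"reference": "Patient/example"},
  "effectiveDateTime": "2025-01-15",
  "valueQuantity": {"value": 7.2, "unit": "%",
                    "system": "http://unitsofmeasure.org", "code": "%"}
}
"""


def summarize_observation(resource):
    """Flatten a few commonly used Observation fields into a plain dict."""
    if resource.get("resourceType") != "Observation":
        raise ValueError("expected an Observation resource")
    coding = resource["code"]["coding"][0]  # first coding only, for brevity
    quantity = resource.get("valueQuantity", {})
    return {
        "patient": resource["subject"]["reference"],
        "loinc": coding["code"],
        "label": coding.get("display", ""),
        "value": quantity.get("value"),
        "unit": quantity.get("unit"),
        "date": resource.get("effectiveDateTime"),
    }


summary = summarize_observation(json.loads(OBSERVATION_JSON))
print(summary)
```

Structured resources like this sit alongside unstructured notes in an EHR, which is why the tool is described as querying both kinds of data.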

Evidence Quality Assessment: What Is Published vs. What Is Not

A systematic assessment of the evidence base for Google's clinical AI models reveals a clear pattern: strong published benchmark performance, but significant gaps in clinical validation. The table below evaluates each model family against five domains drawn from the Clinical AI Model Evaluation Metrics framework.

Evidence quality assessment across five domains for Google's clinical AI model families as of June 2026.
| Domain | Med-PaLM / Med-PaLM 2 | Med-Gemini | MedLM | Vertex AI Search |
| --- | --- | --- | --- | --- |
| Peer-reviewed publication | Nature (2023) | Research blog; arXiv | Google Cloud blog | Google Cloud blog |
| Prospective clinical trial | None | None | None | None |
| External independent validation | Limited | None published | None published | None published |
| FDA clearance | None | None | None | None |
| Real-world deployment data | None | None | 4 ED sites (HCA/Augmedix) | None published |

The pattern is consistent: benchmark results are published, in a peer-reviewed journal for the Med-PaLM line and in preprints or blog posts for the rest, but there are no prospective clinical trials, no FDA clearance, and limited to no independent third-party replication. The HCA Healthcare pilot is the only real-world deployment with published details, and it covers only four emergency department sites, with no published outcomes data on clinical impact, error rates, or workflow efficiency.

Known Limitations from Published Research

Google's own publications acknowledge several limitations that are critical for clinicians and informaticians to understand before considering these models for institutional use. These limitations are not hypothetical — they are documented in the published research and should be weighed against benchmark claims.

  • Dataset bias: Like all large language models, Med-PaLM, Med-Gemini, and MedLM are trained on datasets that may not represent the full diversity of patient populations. The published research does not provide detailed demographic breakdowns of training or evaluation datasets sufficient to assess representation across race, ethnicity, socioeconomic status, or geographic region. For a deeper discussion of this issue, see the site's Algorithmic Bias in Clinical AI entry.
  • Hallucination risk: LLMs, including those fine-tuned for medical applications, can generate plausible-sounding but factually incorrect information. Google's multi-criteria evaluation framework for Med-PaLM 2 included assessment of likelihood of possible harm, but the published results do not provide granular data on hallucination rates in clinical scenarios.
  • Single-radiologist evaluation: The MedGemma chest X-ray report equivalence study (81%) used a single unblinded US board-certified radiologist. This is not a multi-reader, multi-case (MRMC) study design, which is the standard for evaluating diagnostic AI tools. The result should be interpreted as preliminary, not definitive.
  • Self-published benchmarks: Several key performance figures, including Med-Gemini's 91.1% MedQA and MedGemma's benchmarks, come from Google's own published research. Some of this work is peer-reviewed (Nature), but arXiv preprints and blog posts are not, and the absence of independent third-party replication for several benchmarks means the results have not been independently verified.
  • Transfer learning and fine-tuning considerations: MedLM's two-model architecture and the fine-tuning options available on Vertex AI raise questions about how model performance changes when adapted to specific institutional data. The Transfer Learning and Fine-Tuning in Clinical AI entry discusses the governance implications of these adaptation strategies.
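To make concrete why the single-radiologist 81% figure listed above should be read as preliminary, the sketch below computes a Wilson score interval for an observed proportion of 0.81 at several sample sizes. The sample sizes are hypothetical: this article does not state how many reports were reviewed, so the numbers only illustrate how uncertainty shrinks with more cases. The interval also captures sampling variability only. It says nothing about reader-to-reader variation or unblinded assessment, which is what an MRMC design addresses.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# HYPOTHETICAL (agreements, total reports) pairs, all equal to 81%.
# The true study size is not given in this article.
for successes, n in [(81, 100), (162, 200), (810, 1000)]:
    low, high = wilson_interval(successes, n)
    print(f"{successes}/{n}: 95% interval {low:.3f} to {high:.3f}")
```

With 81 agreements out of a hypothetical 100 reports, the interval runs from roughly 0.72 to 0.87, which is wide enough that the point estimate alone should not drive an adoption decision.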

Regulatory Assessment: Why No Google LLM Has FDA Clearance

As of June 2026, no Google clinical AI model (Med-PaLM, Med-PaLM 2, Med-Gemini, MedLM, MedGemma, or Vertex AI Search for Healthcare) has received FDA 510(k) clearance, De Novo authorization, or PMA approval. This is not an oversight; it reflects the fundamental regulatory challenge posed by generative AI and large language models in healthcare.

According to a March 2026 analysis of the FDA's AI/ML medical device database, the agency has authorized 1,451 AI-based medical devices, but no device has been authorized that uses generative AI or is powered by large language models. Of those 1,451 authorizations, 1,104 are in radiology, 141 in cardiovascular medicine, and 67 in neurology. The vast majority (1,396) went through the 510(k) pathway, with 37 De Novo and 18 pre-market approval (PMA) authorizations.
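The counts quoted above can be turned into shares with a few lines of arithmetic. The sketch below uses only the figures given in this article; the variable and function names are illustrative.

```python
# Counts quoted in the text (March 2026 analysis of the FDA AI/ML device database).
TOTAL = 1451
BY_PATHWAY = {"510(k)": 1396, "De Novo": 37, "PMA": 18}
BY_SPECIALTY = {"Radiology": 1104, "Cardiovascular": 141, "Neurology": 67}


def shares(counts, total):
    """Return each count as a percentage of total, rounded to one decimal."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# The three pathways account for every authorization in the quoted total.
assert sum(BY_PATHWAY.values()) == TOTAL

pathway_shares = shares(BY_PATHWAY, TOTAL)
specialty_shares = shares(BY_SPECIALTY, TOTAL)
# Authorizations outside the three specialties named in the text.
other_specialties = TOTAL - sum(BY_SPECIALTY.values())

print(pathway_shares)
print(specialty_shares)
print(other_specialties)
```

On these figures, about 96.2% of authorizations went through 510(k), about 76.1% are in radiology, and 139 fall outside the three named specialties. The 510(k) pathway relies on substantial equivalence to a predicate device, which helps explain why a generative model with no obvious predicate faces a harder route.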

The FDA's premarket pathways for AI/ML software as a medical device (SaMD), namely 510(k), De Novo, and PMA, are well established. The agency has also issued several documents relevant to AI/ML devices: the 2021 AI/ML SaMD Action Plan, Good Machine Learning Practice (GMLP) guiding principles, June 2024 guiding principles on transparency for machine learning-enabled devices, a December 2024 final guidance on predetermined change control plans (PCCP), and a January 2025 draft guidance on lifecycle management. Even with this framework in place and still evolving, no generative AI or LLM-based device has yet navigated a pathway to authorization.

For procurement teams and health system administrators, the regulatory status has direct implications: without FDA clearance, Google's clinical AI models cannot be marketed as medical devices, and health systems cannot rely on FDA review as a proxy for safety and effectiveness evaluation. Institutional adoption would require independent validation studies, local performance testing, and establishment of human oversight workflows — all of which carry significant cost and liability considerations.
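As one illustration of what local performance testing might involve, the sketch below summarizes accuracy per subgroup on a locally adjudicated test set and flags subgroups that are either too small to judge or below a target. The thresholds (`min_cases`, `min_accuracy`), the subgroup labels, and the demo records are all hypothetical placeholders, not values drawn from any Google publication or FDA guidance. A real validation plan would set them with clinical and statistical input.

```python
from collections import defaultdict


def stratified_accuracy(records, min_cases=30, min_accuracy=0.90):
    """Summarize per-subgroup accuracy and flag subgroups needing review.

    records: iterable of (subgroup, is_correct) pairs from a locally
    adjudicated test set. Both thresholds are hypothetical placeholders.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subgroup, is_correct in records:
        totals[subgroup] += 1
        correct[subgroup] += int(bool(is_correct))

    report = {}
    for subgroup, n in totals.items():
        accuracy = correct[subgroup] / n
        if n < min_cases:
            status = "insufficient data"  # too few cases to judge either way
        elif accuracy < min_accuracy:
            status = "below threshold"
        else:
            status = "meets threshold"
        report[subgroup] = {"n": n, "accuracy": accuracy, "status": status}
    return report


# Synthetic, made-up records purely to exercise the function.
demo = (
    [("site_a", True)] * 38 + [("site_a", False)] * 2
    + [("site_b", True)] * 30 + [("site_b", False)] * 10
    + [("site_c", True)] * 9 + [("site_c", False)] * 1
)
report = stratified_accuracy(demo)
for subgroup, row in report.items():
    print(subgroup, row)
```

Reporting "insufficient data" separately from "below threshold" matters here: a small subgroup with a high score is not evidence that the model works for that population.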

Conclusion: Readiness Scorecard and Competitive Context

The evidence reviewed in this article supports a clear conclusion: Google's clinical AI models have achieved state-of-the-art benchmark performance, but clinical deployment readiness remains unproven. The readiness scorecard below summarizes the assessment across key dimensions.

Readiness scorecard for Google's clinical AI models as of June 2026.
| Dimension | Assessment | Evidence Basis |
| --- | --- | --- |
| Benchmark performance | State-of-the-art | Med-Gemini 91.1% MedQA; Med-PaLM 2 85%+; state of the art on 10/14 benchmarks and ahead of GPT-4V on all directly comparable ones |
| Peer-reviewed publication | Moderate | Nature (Med-PaLM); arXiv preprint (Med-Gemini); research blogs for MedLM and Vertex AI Search |
| Prospective clinical validation | None | No prospective clinical trials published for any model |
| Independent third-party replication | Limited | Few independent replication studies; most benchmarks from Google's own research |
| FDA clearance | None | No Google LLM has FDA 510(k) clearance, De Novo authorization, or PMA approval; no GenAI/LLM device authorized by FDA as of March 2026 |
| Real-world deployment data | Minimal | HCA Healthcare / Augmedix pilot at 4 ED sites; no published outcomes data |
| Bias and equity assessment | Incomplete | Multi-criteria evaluation includes bias assessment, but demographic composition of training/evaluation datasets not fully disclosed |
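The scorecard above can also be read as a gate in which every dimension is necessary and none is sufficient alone. The sketch below encodes that reading. The ratings are transcribed from the scorecard, but the choice of which ratings count as unmet is an illustrative assumption, not a rule stated in this article.

```python
# Ratings transcribed from the readiness scorecard.
SCORECARD = {
    "Benchmark performance": "State-of-the-art",
    "Peer-reviewed publication": "Moderate",
    "Prospective clinical validation": "None",
    "Independent third-party replication": "Limited",
    "FDA clearance": "None",
    "Real-world deployment data": "Minimal",
    "Bias and equity assessment": "Incomplete",
}

# ASSUMPTION: which ratings count as unmet is an illustrative choice.
UNMET = {"None", "Limited", "Minimal", "Incomplete"}


def blocking_dimensions(scorecard, unmet=UNMET):
    """Return the dimensions whose rating falls in the unmet set."""
    return [dim for dim, rating in scorecard.items() if rating in unmet]


def is_deployment_ready(scorecard, unmet=UNMET):
    """Ready only when no dimension is unmet: each one is necessary."""
    return not blocking_dimensions(scorecard, unmet)


blockers = blocking_dimensions(SCORECARD)
print(is_deployment_ready(SCORECARD), blockers)
```

Under this reading, a strong benchmark rating cannot offset the five unmet dimensions, which is the sense in which benchmark scores are necessary but not sufficient.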

Google's position within the broader AI-in-healthcare landscape is distinctive: it has invested heavily in clinical AI research, with a publication record in high-impact journals and a clear progression from text-only to multimodal models. The State-of-the-Industry Evidence Assessment provides broader context for comparing Google's approach to competitors.

For clinicians and clinical informaticians evaluating these models for institutional adoption, the practical implications are straightforward: benchmark scores are necessary but not sufficient evidence of clinical readiness. The absence of FDA clearance, the lack of prospective clinical trials, and the limited real-world deployment data mean that any decision to pilot or adopt these models should be accompanied by rigorous local validation, clear human oversight protocols, and explicit acknowledgment that these tools are not yet proven safe and effective for direct clinical use.