
Introduction: The Benchmark-to-Bedside Gap
Google's clinical AI models have achieved headline-grabbing benchmark scores: Med-PaLM 2 became the first large language model to exceed 85% on USMLE-style questions, and Med-Gemini pushed that figure to 91.1%. These numbers circulate widely in conference presentations, press releases, and vendor evaluations. For clinicians and clinical informaticians evaluating AI tools for institutional adoption, however, benchmark performance is only one dimension of a much more complex assessment.
This article provides a systematic, evidence-grounded evaluation of Google's clinical AI models, from Med-PaLM through Med-Gemini, MedLM, and Vertex AI Search for Healthcare, against five domains: accuracy, safety, bias, deployment readiness, and regulatory status. The core finding is consistent across every model family: state-of-the-art benchmark performance has not yet translated into clinically deployable, FDA-authorized products. As of June 2026, no Google clinical AI model has received FDA marketing authorization through the 510(k), De Novo, or PMA pathway, and all carry explicit disclaimers against direct clinical use.
The analysis is organized by model family, with each section examining published evidence, methodological limitations, and the gap between research claims and what is required for safe clinical deployment. For readers seeking background on the underlying technology, the site's Foundation Models in Healthcare entry provides architectural context.
Med-PaLM and Med-PaLM 2: The Foundation Models
Med-PaLM, published in Nature in 2023, was the first large language model to exceed the passing mark on USMLE-style questions, achieving 67.6% accuracy on the MedQA benchmark. Google's evaluation framework went beyond a single accuracy figure, assessing the model across multiple criteria including scientific consensus, medical reasoning, knowledge recall, bias, and likelihood of possible harm. This multi-dimensional approach set a methodological precedent for subsequent evaluations.
Med-PaLM 2, announced in April 2023 and made available in limited access to select Google Cloud customers, represented a significant leap. It achieved 85%+ accuracy on MedQA and became the first AI system to reach a passing score on MedMCQA, a dataset of Indian AIIMS and NEET medical entrance exam questions, with 72.3% accuracy. The model was evaluated on the same multi-criteria framework as its predecessor, with published results showing improvement across all dimensions.
| Metric | Med-PaLM | Med-PaLM 2 |
|---|---|---|
| MedQA (USMLE-style) accuracy | 67.6% | 85%+ |
| AIIMS / NEET exam performance | Not evaluated | 72.3% (first AI to pass) |
| Publication venue | Nature (2023) | Nature (2023); arXiv |
| Evaluation criteria | Multi-criteria (consensus, reasoning, bias, harm) | Multi-criteria (same framework) |
| Availability | Research only | Limited access via Google Cloud (April 2023) |
| FDA clearance | None | None |
Med-Gemini: Multimodal State-of-the-Art
Med-Gemini, published in May 2024, extended Google's clinical AI capabilities from text-only to multimodal inputs. The model achieved 91.1% accuracy on MedQA, surpassing Med-PaLM 2 by 4.6 percentage points. Across 14 medical benchmarks spanning text, multimodal, and long-context applications, it set a new state of the art on 10, and it outperformed GPT-4V on every benchmark where a direct comparison was possible.
Beyond benchmark scores, Med-Gemini demonstrated novel capabilities that distinguish it from prior models: interpretation of complex 3D CT scans, and genomic risk prediction via Med-Gemini-Polygenic, which outperformed previous polygenic risk scores for eight health outcomes including coronary artery disease, type 2 diabetes, stroke, and all-cause mortality. More than half of Med-Gemini-3D-generated CT reports were determined to result in the same care recommendations as radiologist-generated reports.
| Capability | Med-Gemini Performance | Comparison |
|---|---|---|
| MedQA accuracy | 91.1% | Surpassed Med-PaLM 2 by 4.6 percentage points |
| Medical benchmarks (text, multimodal, long-context) | State-of-the-art on 10/14 | Surpassed GPT-4V on all directly comparable benchmarks |
| 3D CT interpretation | >50% reports equivalent to radiologist recommendations | Novel capability, not previously benchmarked |
| Genomic risk prediction (Med-Gemini-Polygenic) | Outperformed previous polygenic scores for 8 outcomes | Predicted 6 additional untrained outcomes |
| Publication status | Research blog (May 2024) | Not a commercial product |
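The 4.6-point MedQA margin can be read in more than one way, and the distinction matters when comparing vendor claims. The short Python sketch below uses only the two figures quoted above (91.1% and the 4.6-point gap) to derive the implied Med-PaLM 2 score, the relative accuracy gain, and the relative reduction in error rate. The function name and output keys are illustrative, not taken from any Google publication.

```python
def compare_scores(new_acc: float, gap_points: float) -> dict:
    """Compare two benchmark accuracies given in percent.

    new_acc    -- accuracy of the newer model, in percent
    gap_points -- absolute gap to the older model, in percentage points
    """
    old_acc = new_acc - gap_points
    # Gain expressed relative to the older model's accuracy.
    relative_gain = gap_points / old_acc * 100
    # Share of the older model's errors that the newer model eliminates.
    old_err = 100 - old_acc
    new_err = 100 - new_acc
    error_reduction = (old_err - new_err) / old_err * 100
    return {
        "old_acc": round(old_acc, 1),
        "relative_gain_pct": round(relative_gain, 1),
        "error_reduction_pct": round(error_reduction, 1),
    }


result = compare_scores(91.1, 4.6)
print(result)
```

On these inputs the implied Med-PaLM 2 score is 86.5%, the relative accuracy gain is about 5.3%, and roughly a third of the earlier model's errors are eliminated. The same absolute gap therefore looks modest or substantial depending on which framing a presentation chooses.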
A related open-weight model, MedGemma, was released in July 2025. The 4B parameter variant scored 64.4% on MedQA, and in an unblinded study, 81% of its chest X-ray reports were judged by a single US board-certified radiologist to result in similar patient management compared to original radiologist reports. The 27B text variant scored 87.7% on MedQA, within 3 points of DeepSeek R1 at approximately one-tenth the inference cost. These models are open-weight and intended for research and development, not clinical deployment.
MedLM and Vertex AI Search for Healthcare: Enterprise Deployment
MedLM, a family of two models built on Med-PaLM 2, was made available on Vertex AI to US customers in December 2023. Unlike the research-stage Med-PaLM and Med-Gemini models, MedLM is positioned as an enterprise tool for healthcare organizations to build and deploy AI applications within Google Cloud's infrastructure. The family pairs one model optimized for complex reasoning tasks with another for faster, lighter inference, letting customers trade output quality against latency and cost.
The most advanced clinical deployment of MedLM to date is the HCA Healthcare pilot with Augmedix, an ambient documentation solution. Physicians at four emergency department sites use a hands-free device to capture patient encounters, with MedLM generating medical notes that are reviewed before transfer to the EHR. This represents a real-world clinical workflow integration, but no outcomes data from the pilot have been published.
- HCA Healthcare / Augmedix pilot: 4 ED sites, ambient documentation, MedLM-powered note generation
- BenchSci integration: MedLM incorporated into ASCEND platform for pre-clinical research literature analysis
- Accenture partnership: Healthcare process automation using MedLM on Vertex AI
- Deloitte partnership: Provider search and clinical workflow optimization
Vertex AI Search for Healthcare, announced alongside MedLM, provides medically-tuned search capabilities over FHIR-formatted clinical records. The tool is designed to enable clinicians and researchers to query structured and unstructured EHR data using natural language, with results grounded in the organization's own data rather than general web knowledge. Like MedLM, it is an enterprise tool, not a medical device, and carries disclaimers against direct clinical decision-making.
Evidence Quality Assessment: What Is Published vs. What Is Not
A systematic assessment of the evidence base for Google's clinical AI models reveals a clear pattern: strong published benchmark performance, but significant gaps in clinical validation. The table below evaluates each model family against five domains drawn from the Clinical AI Model Evaluation Metrics framework.
| Domain | Med-PaLM / Med-PaLM 2 | Med-Gemini | MedLM | Vertex AI Search |
|---|---|---|---|---|
| Peer-reviewed publication | Nature (2023) | Research blog; arXiv | Google Cloud blog | Google Cloud blog |
| Prospective clinical trial | None | None | None | None |
| External independent validation | Limited | None published | None published | None published |
| FDA clearance | None | None | None | None |
| Real-world deployment data | None | None | 4 ED sites (HCA/Augmedix) | None published |
The pattern is consistent: published benchmark results (peer-reviewed only for the Med-PaLM family), but no prospective clinical trials, no FDA clearance, and limited to no independent third-party replication. The HCA Healthcare pilot is the only real-world deployment with published details, and it covers only four emergency department sites with no published outcomes data on clinical impact, error rates, or workflow efficiency.
Known Limitations from Published Research
Google's own publications acknowledge several limitations that are critical for clinicians and informaticians to understand before considering these models for institutional use. These limitations are not hypothetical — they are documented in the published research and should be weighed against benchmark claims.
- Dataset bias: Like all large language models, Med-PaLM, Med-Gemini, and MedLM are trained on datasets that may not represent the full diversity of patient populations. The published research does not provide detailed demographic breakdowns of training or evaluation datasets sufficient to assess representation across race, ethnicity, socioeconomic status, or geographic region. For a deeper discussion of this issue, see the site's Algorithmic Bias in Clinical AI entry.
- Hallucination risk: LLMs, including those fine-tuned for medical applications, can generate plausible-sounding but factually incorrect information. Google's multi-criteria evaluation framework for Med-PaLM 2 included assessment of likelihood of possible harm, but the published results do not provide granular data on hallucination rates in clinical scenarios.
- Single-radiologist evaluation: The MedGemma chest X-ray report equivalence study (81%) used a single unblinded US board-certified radiologist. This is not a multi-reader, multi-case (MRMC) study design, which is the standard for evaluating diagnostic AI tools. The result should be interpreted as preliminary, not definitive.
- Self-published benchmarks: Several key performance figures, including Med-Gemini's 91.1% MedQA score and MedGemma's benchmarks, come from Google's own published research. Only the Med-PaLM results appeared in a peer-reviewed journal (Nature); the others were released as arXiv preprints or blog posts, which are not peer-reviewed, and without independent third-party replication they have not been externally verified.
- Transfer learning and fine-tuning considerations: MedLM's two-model architecture and the fine-tuning options available on Vertex AI raise questions about how model performance changes when adapted to specific institutional data. The Transfer Learning and Fine-Tuning in Clinical AI entry discusses the governance implications of these adaptation strategies.
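To make the single-radiologist limitation concrete, the sketch below computes a Wilson score interval around the reported 81% figure. The article does not state how many reports were reviewed, so the sample sizes in the loop are hypothetical and chosen only to show how the uncertainty changes with study size. The interval reflects sampling variability alone; it says nothing about between-reader variability, which is what an MRMC design is meant to capture.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# HYPOTHETICAL sample sizes: the source does not report the number of cases.
for n in (50, 100, 300):
    lo, hi = wilson_interval(0.81, n)
    print(f"n={n:>3}: 95% CI {lo:.3f} to {hi:.3f}")
```

Even before reader effects are considered, a point estimate from a small sample is compatible with a fairly wide range of true agreement rates, which is one reason the result is best treated as preliminary.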
Regulatory Assessment: Why No Google LLM Has FDA Clearance
As of June 2026, no Google clinical AI model (Med-PaLM, Med-PaLM 2, Med-Gemini, MedLM, MedGemma, or Vertex AI Search for Healthcare) has received FDA marketing authorization through the 510(k), De Novo, or PMA pathway. This is not an oversight; it reflects the fundamental regulatory challenge posed by generative AI and large language models in healthcare.
According to a March 2026 analysis of the FDA's AI/ML medical device database, the agency has authorized 1,451 AI-based medical devices, but no device has been authorized that uses generative AI or is powered by large language models. Of those 1,451 authorizations, 1,104 are in radiology, 141 in cardiovascular medicine, and 67 in neurology. The vast majority (1,396) went through the 510(k) pathway, with 37 De Novo and 18 pre-market approval (PMA) authorizations.
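The counts above are easier to compare as shares of the total. The following Python sketch uses only the figures quoted in this paragraph, checks that the three pathway counts add up to 1,451, and derives the number of authorizations in specialties the text does not itemize. The variable and function names are illustrative.

```python
total = 1451
by_pathway = {"510(k)": 1396, "De Novo": 37, "PMA": 18}
by_specialty = {"Radiology": 1104, "Cardiovascular": 141, "Neurology": 67}

# The three pathway counts should account for every authorization.
assert sum(by_pathway.values()) == total


def shares(counts: dict[str, int], denominator: int) -> dict[str, float]:
    """Return each count as a percentage of the denominator, rounded to 0.1."""
    return {name: round(100 * n / denominator, 1) for name, n in counts.items()}


pathway_share = shares(by_pathway, total)
specialty_share = shares(by_specialty, total)
# Authorizations in specialties not itemized in the text.
other_specialties = total - sum(by_specialty.values())

print(pathway_share)
print(specialty_share)
print("Other specialties:", other_specialties)
```

On these figures, radiology accounts for about 76% of authorizations and the 510(k) pathway for about 96%, leaving 139 devices spread across all remaining specialties.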
The FDA's regulatory framework for AI/ML software as a medical device (SaMD) is well-established, with premarket pathways including 510(k), De Novo, and PMA. The agency has issued multiple guidance documents relevant to AI/ML devices: the 2021 AI/ML SaMD Action Plan, Good Machine Learning Practice (GMLP) principles, a June 2024 draft guidance on transparency, a December 2024 final guidance on predetermined change control plans (PCCP), and a January 2025 draft guidance on lifecycle management. Despite this evolving framework, no generative AI or LLM-based device has yet navigated the pathway to authorization.
For procurement teams and health system administrators, the regulatory status has direct implications: without FDA clearance, Google's clinical AI models cannot be marketed as medical devices, and health systems cannot rely on FDA review as a proxy for safety and effectiveness evaluation. Institutional adoption would require independent validation studies, local performance testing, and establishment of human oversight workflows — all of which carry significant cost and liability considerations.
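As one illustration of what local performance testing might involve, the sketch below tallies model accuracy by subgroup from locally adjudicated cases and flags subgroups too small to support a conclusion. The record fields, the subgroup labels, and the `min_n` threshold are all hypothetical; a real validation protocol would define them with clinical and statistical input.

```python
from collections import defaultdict


def accuracy_by_subgroup(records, min_n=30):
    """Compute accuracy per subgroup from locally adjudicated cases.

    records -- iterable of (subgroup, model_correct) pairs; both fields are
               hypothetical and would come from local clinician review
    min_n   -- subgroups smaller than this are flagged as underpowered
               (an arbitrary illustrative threshold)
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subgroup, model_correct in records:
        totals[subgroup] += 1
        correct[subgroup] += int(model_correct)
    return {
        g: {
            "n": totals[g],
            "accuracy": correct[g] / totals[g],
            "underpowered": totals[g] < min_n,
        }
        for g in totals
    }


# Toy data for illustration only; not drawn from any study.
sample = (
    [("site_A", True)] * 40
    + [("site_A", False)] * 10
    + [("site_B", True)] * 6
    + [("site_B", False)] * 4
)
report = accuracy_by_subgroup(sample)
for group, stats in report.items():
    print(group, stats)
```

Reporting the subgroup size next to each accuracy figure keeps a small, noisy subgroup from being read as evidence of either parity or disparity.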
Conclusion: Readiness Scorecard and Competitive Context
The evidence reviewed in this article supports a clear conclusion: Google's clinical AI models have achieved state-of-the-art benchmark performance, but clinical deployment readiness remains unproven. The readiness scorecard below summarizes the assessment across key dimensions.
| Dimension | Assessment | Evidence Basis |
|---|---|---|
| Benchmark performance | State-of-the-art | Med-Gemini 91.1% MedQA; Med-PaLM 2 85%+; Med-Gemini state-of-the-art on 10/14 benchmarks and ahead of GPT-4V on all directly comparable ones |
| Peer-reviewed publication | Moderate | Nature (Med-PaLM); arXiv preprint (Med-Gemini); blog posts only for MedLM and Vertex AI Search |
| Prospective clinical validation | None | No prospective clinical trials published for any model |
| Independent third-party replication | Limited | Few independent replication studies; most benchmarks from Google's own research |
| FDA clearance | None | No Google LLM has FDA 510(k), De Novo, or PMA clearance; no GenAI/LLM device authorized by FDA as of March 2026 |
| Real-world deployment data | Minimal | HCA Healthcare / Augmedix pilot at 4 ED sites; no published outcomes data |
| Bias and equity assessment | Incomplete | Multi-criteria evaluation includes bias assessment, but demographic composition of training/evaluation datasets not fully disclosed |
Google's position within the broader AI-in-healthcare landscape is distinctive: it has invested more heavily in clinical AI research than any other major technology company, with a publication record in high-impact journals and a clear progression from text-only to multimodal models. The State-of-the-Industry Evidence Assessment provides broader context for comparing Google's approach to competitors.
For clinicians and clinical informaticians evaluating these models for institutional adoption, the practical implications are straightforward: benchmark scores are necessary but not sufficient evidence of clinical readiness. The absence of FDA clearance, the lack of prospective clinical trials, and the limited real-world deployment data mean that any decision to pilot or adopt these models should be accompanied by rigorous local validation, clear human oversight protocols, and explicit acknowledgment that these tools are not yet proven safe and effective for direct clinical use.