
Why the Emergency Department Is a High-Stakes Environment for AI Bias
The emergency department concentrates several conditions that make algorithmic errors unusually consequential. Decisions happen fast, often within minutes of a patient's arrival. Triage assessments depend heavily on subjective interpretation of patient-reported symptoms — chest pain, dyspnea, undifferentiated pain — where the same complaint can receive meaningfully different clinical responses depending on the clinician. Patient volume is high and continuous, meaning a small systematic error in an AI recommendation compounds across tens of thousands of encounters per year.
The patient population is also among the most demographically diverse in any clinical setting, and frequently includes patients who are uninsured, have limited English proficiency, or present with conditions shaped by social determinants of health. These patients are often the same groups underrepresented in the training datasets on which ED AI tools are built.
This article is not a general overview of AI bias in healthcare. It is anchored to the specific tools and workflows of emergency medicine — triage scoring systems, sepsis prediction models, and LLM-based clinical decision support — and to the specific evidence documenting how bias manifests in each. The core problem is not that AI introduces prejudice from scratch. It is that AI inherits and scales the biases already present in the clinical data it trains on, then delivers those biases at a speed and volume no individual clinician could match.
A Taxonomy of Bias Types Relevant to Emergency Medicine AI
The ACEP AI Task Force (Abbott et al., J Am Coll Emerg Physicians Open, 2025) defines bias in this context as a systematic flaw in a decision-making process — including data collection, algorithm design, and human interpretation — that results in unfair or unintended outcomes. Four distinct bias types operate in emergency medicine AI, and they operate through different mechanisms.

| Bias Type | Definition in EM Context | Example in ED AI |
|---|---|---|
| Data bias | Training data that underrepresents or mislabels certain demographic groups, or reflects historical clinical inequities as if they were ground truth | EHR triage records reflecting consistent undertriage of Black patients for chest pain, used as training labels for an AI triage tool |
| Algorithmic bias | Proxy discrimination (using variables correlated with race or gender) or aggregate optimization that masks subgroup underperformance | A sepsis model optimized for overall AUC that performs significantly worse for minority subgroups without flagging the gap |
| Measurement bias | Systematic inaccuracy in the clinical devices that generate AI input features for specific patient populations | Pulse oximetry hardware overestimating SpO2 in patients with darker skin, contaminating any sepsis or deterioration model that uses SpO2 as a feature |
| Human-interaction bias | Automation bias causing clinicians to defer to AI outputs even when incorrect; implicit clinician bias amplified by AI that surfaces and reinforces biased recommendations | Clinicians accepting lower triage acuity suggested by an AI tool for a Black patient with dyspnea, even when clinical judgment would have escalated |

The automation bias finding from the ACEP Task Force paper is particularly important for the clinical workflow context: the paper documents that diagnostic accuracy remained significantly lower when clinicians were supported by a biased AI than when they received no AI support at all. Introducing a biased tool does not merely fail to help; it actively degrades care quality below the unaided baseline.
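
The algorithmic bias row in the table above describes aggregate optimization hiding a subgroup gap. A minimal sketch of that effect follows, assuming made-up risk scores and two hypothetical groups labeled "A" and "B"; none of the numbers come from the cited studies.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical records: (risk_score, true_label, group). Illustrative only.
records = [
    (0.9, 1, "A"), (0.8, 1, "A"), (0.7, 1, "A"),
    (0.3, 0, "A"), (0.2, 0, "A"), (0.1, 0, "A"),
    (0.6, 1, "B"), (0.35, 1, "B"),
    (0.5, 0, "B"), (0.4, 0, "B"),
]

# Aggregate AUC across everyone.
overall = auc([s for s, _, _ in records], [y for _, y, _ in records])

# The same metric computed separately for each group.
by_group = {}
for group in sorted({g for _, _, g in records}):
    subset = [(s, y) for s, y, g in records if g == group]
    by_group[group] = auc([s for s, _ in subset], [y for _, y in subset])

print(f"overall AUC: {overall:.2f}")
for group, value in by_group.items():
    print(f"group {group} AUC: {value:.2f}")
```

In this toy data the aggregate AUC looks strong while group B is ranked no better than chance, which is the pattern a single headline metric cannot reveal.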
Evidence by Use Case: Triage Tools
The clearest quantitative evidence of bias in ED triage comes from a retrospective analysis by Peitzman et al. (West J Emerg Med, 2023) of 297,355 adult ED visits at an urban academic center between 2016 and 2019. After adjusting for patient acuity, chief complaint, and other confounders, Black patients had an adjusted odds ratio of 0.76 (95% CI 0.73–0.79) and Hispanic patients had an aOR of 0.87 (95% CI 0.84–0.90) for receiving high-acuity triage compared with White patients.
The disparity was concentrated entirely in subjective complaints. For chest pain, the aOR was 0.76 for Black patients and 0.88 for Hispanic patients. For dyspnea: 0.79 and 0.84. For any pain: 0.83 and 0.89. For protocolized complaints — STEMI and stroke — no racial differences appeared. The pattern, as the authors concluded, cannot be explained by true differences in resource requirements. It reflects a consistent tendency to underestimate the needs of Black and Hispanic patients when clinical judgment is required.
The downstream consequence was measurable. Among patients who ultimately required high-acuity resources, Black patients were 1.47 times and Hispanic patients 1.27 times more likely to have been initially undertriaged. This is the bias already embedded in EHR training data — and it is exactly the signal that an AI triage tool trained on that data will learn to replicate.
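
To make the odds ratios above more concrete, the sketch below converts an odds ratio into an implied probability. The 30% baseline is a hypothetical figure chosen for illustration, not a number from the study, and applying an adjusted odds ratio to a single baseline ignores the covariate adjustment in the original model.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Return the probability implied by scaling the baseline odds by odds_ratio."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)

# Hypothetical baseline: 30% of comparable White patients triaged as high acuity.
BASELINE = 0.30

# Odds ratios as reported in the text for overall high-acuity triage.
for label, ratio in [("Black patients (aOR 0.76)", 0.76),
                     ("Hispanic patients (aOR 0.87)", 0.87)]:
    implied = apply_odds_ratio(BASELINE, ratio)
    print(f"{label}: implied probability {implied:.3f} vs baseline {BASELINE:.3f}")
```

Under that assumed baseline, an aOR of 0.76 corresponds to roughly a five-percentage-point lower chance of high-acuity triage, and the gap changes with whatever baseline is assumed.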
The ACEP AI Task Force paper adds a second concern specific to LLM-based triage tools. An LLM used for Emergency Severity Index acuity estimation learns from, and is evaluated against, clinical narratives written by clinicians. Those narratives already encode implicit bias: in how pain is described for different patients, in which complaints are treated as urgent versus functional, and in the language used across demographic groups. The Task Force paper flags this as a labeling bias problem: the model learns to reproduce biased clinical judgments because those judgments are its ground truth.
A June 2025 systematic review of AI-based ED triage systems (Cureus, 2025) covering six studies across South Korea, Germany, Greece, Taiwan, and China found that AI systems reduced mis-triage rates (ranging from 0.3% to 8.9%) and improved AUC performance versus traditional triage. But the review explicitly identifies equity considerations and patient-centered outcomes as unstudied gaps across virtually the entire literature. Reduced aggregate mis-triage does not mean reduced mis-triage for marginalized subgroups.
Evidence by Use Case: Sepsis Prediction and Measurement Bias
Sepsis prediction models present three distinct racial bias channels, each operating through a different mechanism. A health law analysis by Epstein Becker Green identifies each pathway clearly, though readers should note this is a legal advisory publication rather than peer-reviewed research.
- Racially skewed vital signs as model features. Researchers have demonstrated that standard vital signs — heart rate, respiratory rate, blood pressure, temperature — can predict a patient's race with high accuracy when combined. This means that sepsis models trained primarily on majority-race populations and using vital signs as core features may have systematically underfit the vital sign distributions of minority patients, producing higher false-negative rates for those groups.
- Pulse oximetry hardware inaccuracy. Pulse oximeters perform less accurately on patients with darker skin, systematically overestimating blood oxygen saturation (SpO2). Because SpO2 is a feature in many sepsis and deterioration algorithms, the hardware bias propagates directly into any AI model consuming that measurement. This is not a data quality problem addressable by better labeling — it is a measurement device problem that corrupts input data at the point of collection.
- Physician notes carrying implicit bias. The Epstein Becker Green analysis notes that physicians interpret pain reports differently across racial groups and that underinsured patients — disproportionately Black and Hispanic — have sparser EHR documentation overall. Note-based features in sepsis models are therefore racially unequal both in content and in coverage.
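
The pulse oximetry channel in the list above can be illustrated with a small simulation. This is a sketch under stated assumptions: the 92% cutoff, the 3-point overestimate, and the patient values are all hypothetical and are not taken from the cited sources.

```python
HYPOXEMIA_THRESHOLD = 92  # hypothetical cutoff a model or alert rule might use

def flagged(spo2_reading, threshold=HYPOXEMIA_THRESHOLD):
    """True when a reading falls below the low-oxygen cutoff."""
    return spo2_reading < threshold

def missed_flags(true_spo2_values, device_offset):
    """Count patients truly below the cutoff whose device reading is not flagged.

    device_offset is the (hypothetical) number of points by which the
    oximeter overestimates saturation for this group of patients.
    """
    missed = 0
    for true_value in true_spo2_values:
        reading = min(100, true_value + device_offset)
        if flagged(true_value) and not flagged(reading):
            missed += 1
    return missed

# Hypothetical true saturation values for eight patients.
true_values = [86, 88, 89, 90, 91, 93, 95, 97]

print("missed with accurate device:", missed_flags(true_values, 0))
print("missed with 3-point overestimate:", missed_flags(true_values, 3))
```

The downstream model never sees the true values, so no amount of relabeling or retraining on the recorded readings recovers the patients who were hidden at the point of measurement.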
On the peer-reviewed side, the ACEP AI Task Force paper references an external validation of a widely deployed proprietary sepsis prediction model that revealed poor discrimination and calibration overall, citing Wong et al. (JAMA Internal Medicine, 2021). The Task Force paper does not provide racial subgroup performance figures for that specific model. The Epstein Becker Green analysis describes one ML sepsis alert system where the overall alert confirmation rate was 36%, dropping to 33% for Black patients and rising to 42% for Asian patients — but this is a legal advisory source, not a peer-reviewed subgroup study.
Evidence That Equitably Designed AI Can Reduce Disparities
The preceding sections document what happens when AI triage and prediction tools are trained on historically biased data without equity-focused design. The Yale New Haven experience provides a contrasting data point: what becomes possible when equity is an explicit design objective rather than an afterthought.
Research from Yale New Haven Hospital and Johns Hopkins, published in NEJM AI and Annals of Emergency Medicine (Taylor et al., 2025), evaluated an AI-driven triage tool that incorporated equity considerations in its design and validation. The findings included a 33% reduction in time to initial care overall, a 17.3% decrease in time-to-care (median 58 minutes) for critically ill patients, safe reclassification of 18.7% of mid-acuity patients, and — critically — greater accuracy in high-severity illness detection specifically for marginalized groups.
This distinction matters for procurement. A vendor claiming that their triage AI improves overall performance metrics is not making an equity claim. Equity-validated performance requires explicit reporting of accuracy, sensitivity, and care-escalation outcomes stratified by race, ethnicity, language, insurance status, and other demographic variables — the same subgroup analysis that the Yale New Haven study provided.
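
As a rough illustration of what subgroup-stratified reporting involves, the sketch below computes sensitivity and false-negative rate per group. The record layout and the group labels "X" and "Y" are assumptions made for this example; a real audit would use the institution's own data and demographic categories.

```python
from collections import defaultdict

def stratified_sensitivity(records):
    """Compute per-group sensitivity for high-acuity detection.

    records: iterable of (group, truly_high_acuity, flagged_high_acuity).
    Only truly high-acuity cases contribute to sensitivity.
    """
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for group, truth, predicted in records:
        if truth:
            if predicted:
                true_pos[group] += 1
            else:
                false_neg[group] += 1

    report = {}
    for group in sorted(set(true_pos) | set(false_neg)):
        n = true_pos[group] + false_neg[group]
        report[group] = {
            "n_high_acuity": n,
            "sensitivity": true_pos[group] / n,
            "false_negative_rate": false_neg[group] / n,
        }
    return report

# Hypothetical audit sample. Illustrative only.
sample = [
    ("X", True, True), ("X", True, True), ("X", True, True), ("X", True, False),
    ("X", False, False), ("X", False, True),
    ("Y", True, True), ("Y", True, True), ("Y", True, False), ("Y", True, False),
    ("Y", False, False),
]

report = stratified_sensitivity(sample)
for group, stats in report.items():
    print(group, stats)
```

A vendor report in this shape, extended with confidence intervals and adequate sample sizes, is what separates an equity claim from an aggregate performance claim.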
Mitigation Frameworks: Lifecycle Auditing and Fairness-Constrained Design
The ACEP AI Task Force paper provides the only EM-specialty lifecycle framework currently available for systematic bias identification and mitigation. It organizes evaluation across three stages:
| Stage | Key Activities | Tools Referenced |
|---|---|---|
| Predeployment | Audit training data for representation gaps; review outcome labels for embedded historical bias; test for subgroup performance differentials before clinical exposure | Aequitas, AIF360, PROBAST+AI |
| Deployment | Monitor real-time outputs for demographic performance divergence; implement clinician override workflows; document automation bias incidents | CHAI model cards, institutional dashboards |
| Postdeployment | Conduct ongoing demographic subgroup performance reporting; feed local patient population data back into model retraining pipelines; document and disclose failure modes | Post-market surveillance protocols, CHAI standards |

Beyond the clinical framework, a 2026 paper by Bahamazava in the Journal of Economy and Technology adds an economic dimension. Using numerical simulation modeling, the paper finds that fairness-constrained optimization, which builds equity requirements directly into the algorithm's objective function, simultaneously improves both equity and operational efficiency. In emergency management contexts, the paper identifies that biases in data quality and algorithm design lead to delayed responses, inefficient resource utilization, and measurable welfare losses. The implication for health system decision-makers is that equitable AI is not a cost to be weighed against performance; it is an efficiency improvement.
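
To show what building an equity requirement into an objective function can look like mechanically, here is a toy sketch that adds a penalty on the sensitivity gap between two groups when choosing an alert threshold. It is not the paper's simulation model and does not reproduce its efficiency finding; the data, group labels, candidate thresholds, and `fairness_weight` parameter are all hypothetical.

```python
def sensitivity(records, group, threshold):
    """Share of a group's true positives scoring at or above the threshold."""
    scores = [s for s, y, g in records if g == group and y == 1]
    return sum(s >= threshold for s in scores) / len(scores)

def misclassifications(records, threshold):
    """Number of cases where the thresholded prediction disagrees with the label."""
    return sum((s >= threshold) != (y == 1) for s, y, _ in records)

def pick_threshold(records, candidates, fairness_weight):
    """Choose the threshold minimizing errors plus a penalty on the sensitivity gap."""
    groups = sorted({g for _, _, g in records})

    def objective(threshold):
        sens = [sensitivity(records, g, threshold) for g in groups]
        gap = max(sens) - min(sens)
        return misclassifications(records, threshold) + fairness_weight * gap

    return min(candidates, key=objective)

# Hypothetical records: (risk_score, true_label, group). Illustrative only.
records = [
    (0.9, 1, "A"), (0.8, 1, "A"), (0.7, 1, "A"),
    (0.4, 0, "A"), (0.3, 0, "A"), (0.2, 0, "A"),
    (0.6, 1, "B"), (0.5, 1, "B"), (0.35, 1, "B"),
    (0.45, 0, "B"), (0.4, 0, "B"), (0.1, 0, "B"),
]
candidates = [0.65, 0.5, 0.35]

unconstrained = pick_threshold(records, candidates, fairness_weight=0.0)
penalized = pick_threshold(records, candidates, fairness_weight=10.0)
print("threshold without fairness term:", unconstrained)
print("threshold with fairness term:", penalized)
```

With the penalty switched off, the chosen threshold misses a true positive in group B only; with it switched on, the optimizer moves to a threshold that detects high-acuity cases equally in both groups. Whether that shift also improves operational efficiency depends on how costs are modeled, which is the question the paper's simulations address.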
The audit tools named in the ACEP framework serve distinct functions. Aequitas and IBM's AIF360 are open-source toolkits for measuring and remediating fairness metrics across demographic groups in model outputs. PROBAST+AI extends the PROBAST prediction model bias assessment tool to AI-specific contexts. CHAI model cards provide a structured disclosure format for communicating known model limitations and demographic performance gaps to downstream clinical users. None of these tools is a substitute for prospective validation in the local patient population — they are complements to it.
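
The postdeployment stage of the framework calls for ongoing subgroup performance reporting. A minimal monitoring sketch follows, assuming the institution already tallies missed high-acuity cases per group per reporting period. The period labels, group names, counts, and the 0.10 tolerance are hypothetical, and the code does not use the APIs of any of the toolkits named above.

```python
def flag_drift(periods, tolerance=0.10):
    """Return periods where the false-negative-rate gap between groups exceeds tolerance.

    periods: list of (label, {group: (missed_high_acuity, total_high_acuity)}).
    """
    flagged = []
    for label, counts in periods:
        fnr = {
            group: missed / total
            for group, (missed, total) in counts.items()
            if total > 0
        }
        if len(fnr) < 2:
            continue  # a gap needs at least two groups with data
        gap = max(fnr.values()) - min(fnr.values())
        if gap > tolerance:
            flagged.append((label, round(gap, 3)))
    return flagged

# Hypothetical monthly tallies. Illustrative only.
periods = [
    ("month_1", {"group_1": (4, 50), "group_2": (5, 50)}),
    ("month_2", {"group_1": (4, 50), "group_2": (8, 50)}),
    ("month_3", {"group_1": (3, 50), "group_2": (11, 50)}),
]

alerts = flag_drift(periods)
print(alerts)
```

A check like this is only a tripwire: a flagged period should trigger clinical review of the underlying cases rather than an automatic conclusion about the model.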
Regulatory Context and the Institutional Governance Imperative
FDA oversight of AI/ML software as a medical device (SaMD) provides a floor of premarket evaluation — but that floor does not require demographic subgroup performance validation as a condition of clearance. A tool can receive 510(k) clearance on aggregate performance metrics without demonstrating equity across racial or ethnic subgroups. Post-market surveillance requirements exist but remain limited in scope for AI-specific bias monitoring.
The Supreme Court's 2024 Loper Bright decision overruled Chevron deference, so courts no longer defer to an agency's reasonable interpretation of an ambiguous statute. That limits the FDA's practical authority to expand equity requirements through administrative action alone. The limitation is particularly relevant for AI regulation, where guidance rather than formal rulemaking has historically been the FDA's primary tool.
A KFF brief published in April 2026 documents that the Trump administration has actively moved to weaken AI equity requirements at the federal level. Executive Order 14179 shifted federal AI policy away from equity and algorithmic fairness mandates. A subsequent EO 14365 (December 2025) directed the Department of Justice to challenge state AI bias-audit laws, and the DOJ created an AI Litigation Task Force in January 2026 specifically to contest state laws with strict anti-bias requirements for AI systems.
For emergency medicine specifically, the regulatory trajectory has a direct implication: as federal oversight of AI equity weakens, the responsibility for vetting demographic performance gaps shifts to health systems and to the emergency physicians overseeing AI deployment. Institutional governance — local validation studies, demographic performance review committees, contractual requirements for subgroup data from vendors — becomes more critical precisely because federal requirements are becoming less so.
The Emergency Physician as Active Bias Auditor
This article has covered three analytically distinct scenarios, and it is worth being precise about what each implies.
- Pre-existing human bias in EHR training data that AI inherits and scales. The Peitzman undertriage findings are the clearest example. The bias did not originate with AI — it was encoded in triage decisions over years of clinical practice. When AI learns from those decisions, it learns to reproduce them at scale and at speed. The procurement implication: before deploying any AI triage tool, health systems should audit the training data source for demographic undertriage patterns.
- Bias introduced by AI design choices. Proxy variable selection, aggregate optimization, and consumption of measurement-biased inputs (pulse oximetry) are design decisions that can introduce new inequity even in models trained on adequately representative data. The procurement implication: vendors should be required to disclose which variables function as proxies for demographic attributes and how the model was validated across demographic subgroups.
- Bias that AI tools can measurably reduce when designed for equity. The Yale New Haven results show that this is achievable — not hypothetical. The procurement implication: equity-conscious design is a verifiable specification, not a marketing claim. Vendors should be asked for subgroup-stratified performance data from independent validation studies, not aggregate metrics from internal evaluations.
The ACEP AI Task Force framing is useful here: emergency physicians are not passive consumers of vendor-supplied performance claims. They are the clinical professionals with the contextual knowledge to identify when an AI recommendation diverges from clinical judgment, to flag anomalous patterns in care escalation rates across patient demographics, and to advocate within their institutions for the subgroup validation data that procurement teams should be requiring.
For each AI tool under evaluation, the relevant questions are specific: What was the demographic composition of the training data? Were outcomes validated separately for Black, Hispanic, and other demographic subgroups? What is the false-negative rate for high-acuity illness detection in those subgroups? Has the tool been locally validated for this institution's patient population? Is there a postdeployment monitoring protocol for detecting demographic performance drift?
In an environment where federal equity requirements are weakening and most deployed EM AI tools have not undergone formal demographic fairness evaluation, these questions are not optional due diligence. They are the institutional safeguard that exists where regulatory requirements do not.
