
Your AI is 99% accurate. So what? Why accuracy alone isn’t enough


From prediction to action: accuracy powers the model — workflow determines the outcome.

“99% accurate” is an impressive statistic. In healthcare AI, it is also an incomplete one.


Accuracy is necessary but rarely sufficient, not because predictive performance doesn’t matter, but because models don’t create value in isolation. A model produces an output; value emerges only if that output reliably changes decisions and actions in real clinical or operational workflows.


For CMIOs, CAIOs, and clinical operations leaders, this is the gap between an AI that demos well and one that actually has an impact on mortality, length of stay, and workload on the ground.


This gap explains a common pattern: the same model can perform similarly across two organizations, yet deliver very different clinical or operational impact. The difference is often attributed to “local data.” But local data is only one piece of what is better described as the local context – the reality of how a tool performs on your data, in your workflows, and under your constraints.


If you remember nothing else, ask of any “99% accurate” model in healthcare:

  • How will this perform on our data, in our workflows, under our constraints?

  • Is there capacity to act on it?

  • What intervention pathway does it trigger?


Local context is broader than local data


When healthcare leaders hear “local context,” they often interpret it as data compatibility: whether input features resemble the training population, whether performance will hold under drift, and whether recalibration is needed.
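

As a concrete illustration, here is a minimal sketch of one such check: comparing a vendor model’s predicted risk against observed event rates on a local validation cohort. The function name and the use of scikit-learn are our illustration, not a prescribed method; it assumes you can obtain the model’s predicted risks and the observed outcomes for your own patients.

  from sklearn.calibration import calibration_curve

  def local_calibration_check(y_true, y_pred_prob, n_bins=10):
      """Compare predicted risk against observed event rate on local data."""
      # calibration_curve bins the predictions and returns, per bin, the
      # observed event rate (prob_true) and the mean predicted risk (prob_pred).
      prob_true, prob_pred = calibration_curve(y_true, y_pred_prob, n_bins=n_bins)
      # Large, systematic gaps between the two columns suggest the model
      # needs recalibration for this population before go-live.
      return list(zip(prob_pred, prob_true))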


That is an important foundation. But in practice, the dominant drivers of impact are often operational:

  • Workflow integration into day-to-day clinical practice

  • Where the output appears (EHR alert, worklist, inbox, messaging, dashboard)

  • Whether it leads to action or becomes noise

  • Clear ownership of the response (individual clinician vs. centralized team)

  • Capacity and resourcing to execute the intervention

  • Adoption behavior (trust, usability, friction, alert fatigue)

  • Downstream coordination across teams (nursing, pharmacy, consult services, transport)


In other words: “local context” is the environment that determines whether an output becomes a meaningful intervention, and ultimately whether it produces local performance and ROI.


A real-world proof point: sepsis outcomes depended largely on implementation


UC San Diego Health published results from deploying an AI model in emergency departments to identify patients at risk for sepsis, reporting an association with a 17% reduction in in-hospital sepsis mortality, alongside improved bundle compliance.


The most instructive part of this story is not simply that the model worked, but how the organization framed its impact.


Christopher Longhurst (then CMO and Chief Digital Officer) described the distribution of value bluntly:

“Twenty or 30 percent of the outcome we saw was because of the algorithm. The secret sauce, maybe 70% to 80%, was actually all of the local context…”


He pointed to concrete operational steps: redesigning processes so the alert reached both frontline clinicians and a centralized response team, and supporting the deployment with clinician education.


That framing is worth emphasizing because it highlights a consistent reality across healthcare AI: outcomes are usually shaped more by implementation than by model performance alone.


The lesson: evaluation cannot stop at technical validation. Governance and oversight must extend into workflow design, ownership, and execution capacity.


Why accuracy alone fails: predictions do not equal impact


A model can be technically excellent and still have limited clinical value if:

  • it arrives too late to influence action

  • it is delivered to the wrong person or location in the workflow

  • no one owns the response pathway

  • operational constraints prevent follow-through

  • clinicians experience it as noise rather than a usable signal

  • downstream steps are not coordinated across teams


In other words, accuracy describes whether the model is correct. It does not describe whether the system can operationalize what the model recommends.


The following examples show how this gap plays out across common AI use cases.


Example 1: Sepsis prediction is not just detection — it is activation


Many sepsis models are deployed with an implicit goal: “detect sepsis earlier.” Yet frontline clinicians in high-acuity environments often initiate evaluation and treatment based on clinical assessment before an alert fires.


Sepsis AI tends to have greater impact when it functions not as a standalone diagnostic, but as a workflow activation mechanism that informs diagnosis and management:

  • quicker mobilization of response teams

  • more reliable bundle execution

  • earlier alignment between physicians, nursing, and pharmacy


This also raises an implementation design question: who should receive the alert? MetroHealth has described routing its interruptive sepsis alert to clinical pharmacists, reflecting the principle that alerts create value only when they reach the team that can act on them decisively.


The lesson: detection accuracy matters, but activation design determines whether mortality, compliance, and response times actually improve.


Example 2: Radiology AI — sensitivity is not the same as time-to-treatment impact


Consider two stroke detection solutions:

  • Vendor A: 99% sensitivity, but flags findings only when the radiologist opens the study

  • Vendor B: 90% sensitivity, but prioritizes the study in the worklist and automatically escalates to the on-call team


In many time-sensitive conditions, the second system may deliver more meaningful clinical impact because it reduces latency, improves prioritization, and strengthens escalation pathways. The key variable is not the model’s correctness in isolation — it is whether the output changes the system’s speed and coordination.
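

A back-of-envelope calculation makes the trade-off concrete. The numbers below are our illustrative assumptions, not measured results:

  # Illustrative assumptions only: per 100 true stroke cases, impact is
  # roughly (cases detected) x (minutes of latency the workflow removes).
  def minutes_saved_per_100_cases(sensitivity, minutes_saved_per_detection):
      return 100 * sensitivity * minutes_saved_per_detection

  # Vendor A: flags only at study-open, so little latency is removed.
  vendor_a = minutes_saved_per_100_cases(0.99, 2)    # 198 minutes
  # Vendor B: reprioritizes the worklist and auto-escalates the on-call team.
  vendor_b = minutes_saved_per_100_cases(0.90, 25)   # 2,250 minutes

Under these assumed numbers, the less sensitive system removes more than ten times as much treatment delay: latency and escalation can dominate raw sensitivity.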


The lesson: spend less time debating sensitivity in isolation and more time asking how the tool changes escalation speed, prioritization logic, and team coordination.


Example 3: In-basket AI — high-quality output can still fail adoption


In-basket drafting tools may score well in performance evaluations, yet struggle to create value in practice because adoption is shaped by friction:

  • clinicians may dislike the tone or phrasing

  • heavy editing can negate time savings (see the sketch after this list)

  • perceived risk drives clinicians to rewrite from scratch

  • added steps disrupt already overloaded workflows
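

To see why the editing point matters, here is a sketch with purely illustrative numbers:

  # Purely illustrative numbers: the net value of a drafted reply depends
  # on how much clinician time review and editing consume.
  def net_seconds_saved(write_from_scratch_s, review_and_edit_s):
      # Positive: the draft helps. Negative: it adds work.
      return write_from_scratch_s - review_and_edit_s

  light_edit = net_seconds_saved(90, 40)    # +50s per message
  heavy_edit = net_seconds_saved(90, 120)   # -30s: editing costs more
                                            #   than starting fresh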


Different organizations can extract different value from the same tool depending on workflow design. Some use it purely for drafting; others use it to support triage and routing — automating low-risk categories under structured oversight while escalating higher-risk messages appropriately.

Once again, the output quality may be similar. The impact depends on workflow design, governance, and the division of labor.


The lesson: adoption behavior and governance design determine whether time savings and workload reduction are realized, or whether the tool becomes optional and underutilized.


The central takeaway


Accuracy and technical performance are foundational. But in healthcare, they are best understood as table stakes. Meaningful outcomes depend on local context: how the model is integrated into workflows, who receives the output, what response pathway exists, and whether the organization can operationalize intervention reliably under real-world constraints.

UC San Diego Health’s experience captures the point clearly: if most of the value comes from local context, then evaluation and oversight should focus not only on model performance, but on the design and readiness of the operational systems around it.

 
 

