Clinical Reality Check: Dr. Sarah Gebauer on Aligning AI Innovation with the Realities of Clinical Care
- CHARGE
- May 29
- 7 min read

Dr. Sarah Gebauer is an anesthesiologist and healthcare leader with over 20 years of experience across medicine, informatics, AI strategy, and policy. She founded Validara Health to bring clinical rigor to AI evaluation and serves as a senior researcher at RAND, focusing on generative AI and tech policy.
In this CHARGE interview, Dr. Gebauer discusses the real-world challenges of adopting AI in healthcare, from fragmented infrastructure to the limits of technical metrics. She shares insights on outcome-focused governance, workflow integration, and practical strategies for resource-limited systems, among other key topics shaping the future of healthcare AI.
Read the full conversation below.
Q: Let’s start with your background. Can you share your clinical and professional journey, and what led you to focus on healthcare AI?
I worked at Bain doing strategy consulting after college, went to Stanford for medical school, and then completed my anesthesiology residency at UCSF, followed by a palliative care fellowship. I’ve held clinical and leadership roles across multiple health systems, from academic centers to rural hospitals. I’ve always gravitated toward work at the intersection of data, quality, and real-world care delivery, and I earned a degree in clinical informatics early in my career.
My interest in AI started with natural language processing, since I found the concept of teaching a computer to understand language fascinating. I started a Slack group and a Substack to connect with other physicians interested in the space and to learn. That’s turned into an engaged community of over 500 doctors trying to figure out how to make this technology actually work in practice.
What really drew me into healthcare AI was the disconnect between how healthcare technology tools are developed and implemented and how care actually happens. I used nine different EHRs during residency, and I remember the promises that EHRs would make our lives easier. The story is clearly more complex, and I want to help ensure that AI lives up to its promise of actually improving care, not just adding more noise.
Q: You’ve worked closely with both hospital leadership and frontline clinicians. What are the challenges health systems face today when evaluating and adopting vendor-provided AI tools and managing third-party risk?
The biggest challenge is infrastructure, both technical and organizational. AI governance requires ML expertise that health systems may have difficulty recruiting, and it asks teams that don’t normally interact much, such as legal, IT, and clinical, to develop a shared language and use the same platforms. Challenges range from the most basic, like developing a system to manage documents and sign-offs, to the complex and unanswered, like the best ways to demonstrate adherence to guidance about bias and fairness.
Liability and third-party risk are particularly challenging because those fields often rely on legal verdicts and payouts, which are hard to predict. Will juries be more or less forgiving of software errors than human ones? We don’t know yet. And frontline clinicians feel they’re in a lose-lose position: if they follow the AI and it’s wrong, they’re blamed for not using their judgment; if they ignore it and it’s right, they’re blamed for not trusting the system. It’s not clear where the boundaries of shared liability for an AI product are among the vendor, health system, and clinician.
Q: You’ve written about the need for a “Common App” approach to streamline interactions between vendors and health systems. Can you walk us through what that could look like in practice and why it’s needed?
Most industries have standardized ways to test and certify tools before they’re deployed. In entertainment, for instance, networks require shows to meet technical specifications before they’ll even consider airing them. Producers submit their work to a centralized platform that checks for technical quality and compliance before approval. It’s astonishing that TV shows undergo more structured pre-review than most AI tools in healthcare.
The idea behind a “Common App” is to create a shared core set of governance questions and baseline performance metrics: something flexible enough to accommodate different risk levels, but consistent enough to reduce duplication. A shared standard would raise the floor, set shared expectations, and focus attention on the meaningful questions without limiting innovation.
Q: Efforts like model card registries and the ONC’s HTI-1 rule aim to support better AI transparency. In your view, do these initiatives actually help health systems make better decisions?
Transparency efforts are well-intentioned but limited. They often focus on processes, like what data was used to train a model, rather than outcomes, like whether the model is actually biased in practice. We’ve seen this repeatedly in clinical quality metrics: process metrics don’t always correlate with better patient care.
Even when transparency is available, it’s not always interpretable or actionable. How much representation is “enough” for a subpopulation in training data? There’s no agreed-upon answer. And if transparency doesn’t lead to different decisions - if we’re not going to change management based on it - then it’s just not helpful. Evidence-based medicine teaches us not to order tests that won’t change our plan. We need to ask questions that help us make better decisions, not just check regulatory boxes.
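One way a health system might operationalize that outcome focus, as a rough sketch rather than a prescribed method, is to measure whether a model is biased in practice by comparing an outcome metric such as sensitivity across patient subgroups on its own local data, instead of counting training-data representation. The subgroup names and case data below are made up for illustration.

```python
# Hypothetical outcome-focused check: compare a model's sensitivity
# (true positive rate) across patient subgroups on local data. All
# subgroup names and case data below are made up for illustration.
cases = [
    # (subgroup, model_flagged, condition_present)
    ("group_a", True, True), ("group_a", False, True), ("group_a", True, True),
    ("group_b", False, True), ("group_b", False, True), ("group_b", True, True),
]

def sensitivity_by_subgroup(cases):
    """True positive rate per subgroup, among patients who had the condition."""
    totals, hits = {}, {}
    for group, flagged, present in cases:
        if present:
            totals[group] = totals.get(group, 0) + 1
            hits[group] = hits.get(group, 0) + (1 if flagged else 0)
    return {g: hits[g] / totals[g] for g in totals}

print(sensitivity_by_subgroup(cases))  # group_a ≈ 0.67, group_b ≈ 0.33
```

A gap like the one above is actionable in a way that a training-data summary often is not: it points to a decision about whether and where to deploy the tool.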
Q: There’s a growing emphasis on technical metrics (drift detection, performance benchmarks, fairness audits) but those alone often miss the bigger picture. How can health systems move beyond technical metrics and shift toward evaluating AI based on real clinical and business outcomes?
Technical metrics matter, but they’re often proxies for what we actually care about. And they’re hard for non-technical stakeholders to interpret. A fairness audit or drift score doesn’t mean much if you can’t connect it to clinical or operational outcomes.
Health systems should start with outcome questions: Will this tool reduce unnecessary admissions? Improve time to treatment? Increase adherence? Then work backward to see if the AI supports those goals. And all of these outcomes should be interpreted in context. The baseline comparison should be current clinical performance, not perfection. We know there are many parts of the healthcare system that aren’t at the standard we’d like them to be. If the AI performs as well or better than the status quo, and we understand its risks, that’s often good enough.
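As a rough illustration of that framing (the use case and every number here are hypothetical), comparing an AI-supported pathway against the current baseline rather than against perfection can be as simple as a two-proportion test on an outcome like 30-day readmissions:

```python
# Hypothetical example: compare 30-day readmission rates for the current
# workflow (baseline) vs. the AI-supported workflow. All numbers are made up.
from math import sqrt, erf

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p1, p2, z, p_value

# Baseline: 180 readmissions out of 1,200 discharges.
# AI-supported: 150 readmissions out of 1,150 discharges.
baseline_rate, ai_rate, z, p = two_proportion_ztest(180, 1200, 150, 1150)
print(f"Baseline readmission rate: {baseline_rate:.1%}")
print(f"AI-supported readmission rate: {ai_rate:.1%}")
print(f"z = {z:.2f}, p = {p:.3f}")
```

The point is not the statistics but the comparator: the system’s own current performance, measured on its own patients, rather than an abstract benchmark.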
Q: Evaluation is only one part of the journey. Implementation is often the harder lift. What advice would you offer to health systems trying to integrate AI into real-world clinical or administrative workflows?
Aligning products with workflow is an under-appreciated art. Workflows are often not explicitly written out and require active discovery, but that discovery can surface important risks and opportunities. Training end users and ensuring they understand the appropriate use cases and limitations is really important, especially in these early phases of AI implementation.
Identifying the clinicians who are excited to test this new technology, and compensating them for time spent on product testing, is an important part of AI implementation. Asking frontline clinicians to care for a full patient load while trialing tools that may never be adopted is unsustainable and will limit adoption; there are only so many new AI tools frontline clinicians are willing to learn and then abandon. Many clinicians want to be more involved in AI and don’t have a direct path in their health systems. There’s a great opportunity to capitalize on this interest by engaging targeted users and recognizing their contributions. In the rollout stage, feedback loops between end users and leadership should be short and meaningful. Knowing that a suggestion will be taken seriously encourages real engagement.
And finally, you can’t measure success if you don’t understand the current state. If a health system doesn’t know how well it’s performing today, it won’t be able to tell whether the AI tool is benefitting patients after implementation.
Q: Many health systems, especially those outside large academic centers, face resource constraints and limited AI expertise. What practical steps can they take to establish meaningful AI governance processes without overextending?
I work in a rural hospital, so I see this up close. Interestingly, some smaller or independent health systems are actually more agile than large academic centers. They often have flatter hierarchies and tighter feedback loops between clinicians and administration, which can lead to faster, better-aligned product choice and implementation.
For hospitals with fewer resources, start with outcome-focused questions: What problem are we trying to solve? What will we measure to know if it worked? These are transferable skills that hospitals have developed with other kinds of technology and quality improvement processes over the years. Increasingly, there are open-source tools like Pacific AI’s policies that health systems can use as scaffolding.
Q: You lead a community of over 500 physicians exploring AI in medicine. What are frontline clinicians most excited, or most anxious, about when it comes to AI in their day-to-day work?
Clinicians are excited about anything that reduces documentation burden: ambient scribe tools, preauthorization support, clinical documentation improvement. Tools that let them spend more time with patients and less time at the keyboard see the fastest adoption. There’s also enthusiasm for tools that directly help them take better care of patients, such as OpenEvidence.
The anxiety I see is about being replaced and about feeling left behind. I don’t think that being replaced is likely for most roles, since we don’t need AI for a huge percentage of what we do. When a patient is wheeled in with a clearly fractured leg sticking out of a ski boot, it’s pretty clear to everyone what needs to be done. The feeling of not knowing enough about AI, and not wanting to look stupid by asking questions, is a real problem - doctors like to be experts. I think this concern will lessen as everyone starts to use AI more in their daily lives, and doctors realize they don’t have to be experts to be able to ask good questions. But I do think clinicians need to understand and engage with these tools to ensure they’re developed and implemented to truly benefit our patients and communities.
Q: Finally, looking ahead: What do you hope to see in the next 2–3 years that would meaningfully improve how we govern, evaluate, and use AI in healthcare?
We need a fast, affordable, and clinically grounded way to evaluate AI tools. It should incorporate the latest machine learning technology and reflect medicine’s foundational values, like “first, do no harm” and equitable treatment of patients.
And we need continuous monitoring, not just pre-market testing. As more AI tools become agentic, and those agents start interacting with each other, we’ll also need ways to evaluate those interactions and unintended consequences.
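As a simple sketch of what continuous monitoring might look like in practice (the metric, names, and thresholds below are hypothetical, not a specific tool discussed in this interview), a governance group could track a deployed model’s weekly performance against its documented validation baseline and flag sustained degradation for review:

```python
# Illustrative post-deployment monitoring check (all names and thresholds
# are hypothetical). Flag the model for review when weekly performance
# drops below the validation baseline by more than an agreed tolerance.
from dataclasses import dataclass

@dataclass
class WeeklyMetric:
    week: str
    auroc: float  # area under the ROC curve on that week's labeled cases

VALIDATION_AUROC = 0.82   # performance documented during pre-deployment testing
TOLERANCE = 0.05          # degradation the governance group agrees to accept

def flag_degradation(history: list[WeeklyMetric]) -> list[str]:
    """Return the weeks in which performance fell below the agreed floor."""
    floor = VALIDATION_AUROC - TOLERANCE
    return [m.week for m in history if m.auroc < floor]

history = [
    WeeklyMetric("2025-W01", 0.81),
    WeeklyMetric("2025-W02", 0.79),
    WeeklyMetric("2025-W03", 0.74),  # below the floor -> flagged for review
]
print(flag_degradation(history))  # ['2025-W03']
```

The specific metric matters less than agreeing, before deployment, on what the baseline is and how much degradation should trigger a review.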