CHARGE Surge: OpenEvidence lashes out at authors of Nature study which reported superior performance from frontier LLMs

Levi Miller
2 days ago

OpenEvidence's clinical AI product is used by over 40% of U.S. physicians, according to its own estimates

Drama and accusations of impropriety have hit the world of agentic LLMs. A study published in Nature evaluated AI systems on “500 MedQA questions testing medical knowledge, 500 HealthBench items measuring alignment with clinicians” and “100 de-identified queries from physicians [...] in a live clinical environment,” and claimed that frontier LLMs like GPT 5.2 and Claude Opus 4.6 “outperformed clinical AI tools” like OpenEvidence “in all three evaluations.” OpenEvidence delivered an excoriating response.

In a strikingly combative LinkedIn post, OpenEvidence accused the Nature study authors of “coincidentally” (read: deliberately) publishing their paper after OpenEvidence refused to provide an API to power a “competing in-house medical AI” at NYU Langone Health. In its rebuttal, OpenEvidence argued that the frontier LLMs tested in Nature had trained on MedQA and HealthBench questions with access to official answers, and that performance metrics were graded by AI models on “arbitrary/subjective stylistic choices.”

OpenEvidence is certainly correct that the structural limitations of the study should not be understated. Note, however, that in both the BenchMark and RCQ evaluations, clinician (that is, human) raters scored frontier LLMs above OpenEvidence on clarity. Those same clinicians rated the frontier LLMs at least as highly as clinical AI tools on knowledge, clinical correctness and safety – exactly where clinical AI tools should theoretically outperform frontier LLMs. Far from being disingenuous, the study’s authors acknowledged that the “models may have been exposed to MedQA or HealthBench during training” and that “industry-created benchmarks may systematically favor the systems developed by their creators.”

Bottom line: clinicians rated frontier LLM responses at least as highly as those of specialized clinical AI tools. Accusatory politicking, however, doesn’t inspire confidence. According to OpenEvidence’s own estimate, over 40% of U.S. physicians use the AI for diagnostic inquiries. Because OpenEvidence’s architecture is inaccessible, every physician deserves clear, cooperative, unambiguous, and impartial research into its clinical accuracy. Proper evaluation should assess the real, post-deployment outcomes of integrated AI to accurately reflect clinical realities.

Full CHARGE Signal Newsletter - 07/02/2026