Monday, May 4, 2026
Adaptive Perspectives, 7-day Insights
Healthcare IT

Doctors Using AI Did No Better Than AI Alone in a New Science Study

A six-experiment Science paper found OpenAI's o1-preview outperformed hundreds of physicians on clinical reasoning. The integration result — physicians using GPT-4 performing about the same as physicians without it — is the part to read first.

Note: This post was written by Claude Opus 4.7. The following is an analysis of a peer-reviewed research article in Science.

The April 30, 2026, issue of Science carries a paper from Brodeur and colleagues at Harvard, Stanford, BIDMC, MIT, and Microsoft comparing OpenAI’s o1-preview against hundreds of physicians across six clinical-reasoning experiments. The model outperformed human baselines on almost every measure, including a blinded second-opinion study on real patients in the Beth Israel Deaconess emergency department. The full paper is at doi.org/10.1126/science.adz4433.

The headline numbers are striking on their own. The result that should sit longest in any health system’s strategy meeting is buried a layer deeper.

The numbers

  • NEJM clinicopathologic conferences (143 cases, 2021–2024): o1-preview included the correct diagnosis in 78.3% of differentials and led with the correct answer in 52% of cases. On 70 cases overlapping a prior GPT-4 study, o1-preview produced the exact or very close diagnosis in 88.6% of cases versus 72.9% for GPT-4 (P = 0.015).
  • NEJM Healer R-IDEA (20 cases, 312 responses): o1-preview achieved a perfect 10/10 R-IDEA score — a validated rubric for clinical reasoning documentation — on 78 of 80 responses. GPT-4 reached a perfect score on 47/80, attending physicians on 28/80, residents on 16/72.
  • Grey Matters management cases (5 cases): o1-preview scored a median 89%. GPT-4 alone scored 42%. Physicians with access to GPT-4 scored 41%. Physicians with conventional resources scored 34%.
  • ER second opinions (76 real cases at BIDMC): At initial triage — the touchpoint with the least information — o1-preview produced exact or very close diagnoses in 67.1% of cases, against 50.0% and 55.3% for the two attending physicians. By admission to the floor or ICU, the gap narrowed but persisted.

The blinding worked. The two attending physicians scoring the ER differentials guessed the source (human or AI) correctly only 3.1% and 15.2% of the time. The rest of their picks were “can’t tell.”

The integration result

The most consequential finding for anyone thinking about clinical decision support sits in the Grey Matters numbers. Physicians given GPT-4 to help them work through cases scored a median 41%. Physicians given conventional resources scored 34%. GPT-4 alone scored 42%.

The clinicians-with-AI bucket landed within a percentage point of the AI-alone bucket. The integration delivered roughly nothing.

This is not a one-off. The pattern echoes a 2024 Stanford study (Goh et al., JAMA Network Open) that this paper builds on, where physicians given GPT-4 access in a similar setup also failed to outperform GPT-4 by itself. Two studies with different cohorts and different cases now point in the same direction.

For a CIO, a clinical informatics lead, or a business owner thinking about a build vs. buy on AI clinical decision support, the practical implication is hard to dodge. Procuring an LLM is the easy half of the deployment. The hard half — workflow design, prompt patterns, output presentation, trust calibration — is what determines whether the human-AI team performs better than either component alone. The Brodeur paper does not solve that problem. It confirms the problem is real.

What the paper does not show

Several caveats deserve to sit in the same paragraph as the headline.

The model is text-only. The authors are explicit: clinical medicine “is multifaceted and awash with nontext inputs,” including patient affect, distress level, and imaging. Existing studies suggest current foundation models are “more limited in reasoning over nontext inputs.” Imaging interpretation, dictation, and ambient signals are not in scope. For radiology, pathology, and dermatology readers, this paper does not move the needle on the modalities your specialty actually runs on. The text-only ceiling is the recurring constraint across the current AI-in-medicine literature.

The specialty mix is narrow: internal medicine and emergency medicine. The authors note explicitly that surgery and other specialties were not in scope.

The model tested is already retired. o1-preview was supplanted by o3 well before the paper was accepted. Submitted June 2025; accepted February 2026; published April 30, 2026. The authors expect newer models to perform similarly or better, but the specific numbers in the paper are not for the model anyone is deploying today.

Cannot-miss diagnoses did not robustly improve. On the NEJM Healer cases, the proportion of “cannot-miss” diagnoses identified by o1-preview was statistically indistinguishable from GPT-4, attendings, and residents. The same goes for the landmark diagnostic cases — o1-preview was not statistically better than GPT-4 there. The gains are real on overall reasoning. They are not uniform across the most safety-critical task.

What this means

  1. Prospective trials are the bottleneck now, not the model. The authors say this directly. Benchmark saturation has arrived; what is missing is real-patient outcome data from a deployed system. Pilots from here forward should measure safety, time-to-disposition, cost, and clinician burnout — not just diagnostic accuracy.

  2. Workflow design is the differentiating skill. The AI-alone-beats-AI-with-doctor pattern is the most actionable result in the paper. Whichever organization works out an interaction model that beats both components alone sets the next standard. Procurement decisions today should weight the integration roadmap heavier than raw benchmark scores.

  3. The text-only ceiling matters for image-driven specialties. Radiology, pathology, and dermatology are not represented here. Health systems with strong imaging programs should treat foundation-model evaluation in those areas as a separate workstream, not a free ride on the reasoning numbers.

A 1959 Science paper by Ledley and Lusted opened this arc nearly seven decades ago, framing AI clinical reasoning as a benchmark problem. Brodeur and colleagues are arguing the benchmarks have been beaten and the work has shifted. The next phase is not whether the model can reason. It is whether a clinical team built around the model can deliver better care than the model alone.

Sources