AI in Medicine: A Nature Medicine Study Already Obsolete at Publication
Bean AM, et al. Nat Med. 2026
This Oxford-based randomized study tested whether LLMs could help 1,298 UK participants identify medical conditions and choose appropriate care across ten clinical scenarios. The models (GPT-4o, Llama 3, and Command R+) correctly identified the conditions 95% of the time when tested on their own. But when real people used these same models, performance dropped sharply: fewer than 35% of participants identified the relevant conditions, no better than a control group using Google.
The problem wasn’t the AI’s medical knowledge. It was the human-AI interaction. People gave incomplete information, asked leading questions, and ignored correct suggestions. Even more striking: standard medical benchmarks and simulated patient interactions completely failed to predict these real-world failures.
The authors conclude that no tested LLM is ready for direct patient care and call for mandatory human user testing before healthcare deployment.
My Take
This study has a fatal flaw that Nature Medicine should have caught: it was obsolete before it was published. The models tested (GPT-4o, Llama 3, and Command R+) are already one to two generations behind. Data collection ran from August to October 2024, yet the paper did not appear until February 2026, after a 10-month peer review cycle in a field where model capabilities change every few weeks, if not days.

Publishing this in Nature Medicine, with the implicit authority of that journal's brand, creates the misleading impression that "AI doesn't help patients." What the data actually show is narrower: those particular models, tested in that particular way, 18 months ago, didn't help patients.

The interaction transcripts are genuinely valuable, and the finding that people give chatbots incomplete information mirrors what we already know about doctor-patient communication. But drawing sweeping deployment conclusions from models that no longer exist is like testing a flip phone and concluding that mobile internet won't work.
Conclusion
The real scandal here isn't AI; it's the broken peer review timeline.
Ten months from submission to publication, in a field moving at the speed of AI, is editorial malpractice dressed up as academic rigor.
By the time readers see this study, the models it tested are museum pieces. Nature Medicine needs to decide: does it want to publish timely, actionable research, or does it want to operate 18 months behind reality and still claim relevance? If major journals cannot review AI studies within eight weeks, they should decline to publish them at all rather than present stale findings as current science. The peer review system was built for fields where knowledge moves in years, not weeks. Medicine and AI now move faster than journals can keep pace, and patients deserve better than yesterday's answers to tomorrow's questions.


