A small study found that ChatGPT outperformed human doctors at assessing medical case histories, even when the doctors were using the chatbot themselves.
Dr. Adam Rodman, an internal medicine physician at Beth Israel Deaconess Medical Center in Boston, fully expected that a chatbot developed using artificial intelligence would help doctors diagnose illnesses.
He was wrong.
Instead, in a study that Dr. Rodman helped design, doctors who were given ChatGPT-4 along with traditional resources performed only slightly better than doctors who didn’t have access to the bot. And, to the researchers’ surprise, ChatGPT alone outperformed the doctors.
“I was shocked,” said Dr. Rodman.
OpenAI’s chatbot scored an average of 90% when diagnosing illnesses from case reports and explaining its reasoning. Doctors randomly assigned to use the chatbot scored an average of 76%. Those randomly assigned not to use it scored an average of 74%.
The study showed more than the chatbot’s superior performance. It revealed that doctors sometimes hold unshakable faith in their own diagnoses, even when a chatbot suggests a better one.
And the study revealed that although many doctors are exposed to artificial intelligence tools in their work, few know how to exploit a chatbot’s capabilities. As a result, they failed to take advantage of the AI system’s ability to solve complex diagnostic problems and explain its diagnoses.
AI systems should be “physician’s assistants,” Dr. Rodman said, offering valuable second opinions on diagnoses.
But it seems there is still a long way to go before this potential is realized.
Case Histories, Case Futures
Fifty physicians, a mix of residents and attending physicians recruited through several large hospital systems in the United States, took part in the experiment. The paper was published last month in the journal JAMA Network Open.
Subjects were given six case histories and graded on their ability to suggest diagnoses and explain why they favored or ruled them out. Their scores also reflected whether they got the final diagnosis right.
The graders were medical experts who saw only the participants’ answers, without knowing whether an answer came from a physician with ChatGPT, a physician without ChatGPT, or ChatGPT alone.
The case histories used in the study were based on real patients and are part of a set of 105 cases that researchers have used since the 1990s. The cases were intentionally never made public, so that medical students and others could be tested on them without any prior knowledge. That also meant that ChatGPT could not have been trained on them.
But to give a sense of the study, the researchers published one of the six cases the doctors were tested on, along with answers to the test questions about that case from doctors who scored high.
The published case involved a 76-year-old patient who experienced severe pain in his lower back, buttocks, and calves while walking. The pain began several days after he was treated with balloon angioplasty, a procedure to widen his coronary arteries. After the procedure, he was treated with the anticoagulant heparin for 48 hours.
The man complained of fever and fatigue. His cardiologist performed laboratory tests that revealed new-onset anemia and an accumulation of nitrogen and other kidney waste products in his blood. The man had undergone bypass surgery for heart disease 10 years earlier.
The case vignette continued with details of the man’s physical examination, and then provided his laboratory test results.
The correct diagnosis was cholesterol embolism, a condition in which pieces of cholesterol break off from plaque in arteries and clog blood vessels.
Participants were asked to name three possible diagnoses and the evidence supporting each. For each possible diagnosis, they were also asked to indicate findings that did not support it, or findings that were expected but not present.
Participants were also asked to provide a definitive diagnosis. They were then asked to identify up to three additional steps they would take in the diagnostic process.
Like the published case, the other five cases in the study had diagnoses that were not easy to pin down, but they were not so rare that little was known about them. Even so, the doctors on average performed worse than the chatbot.
Researchers asked themselves, what’s going on?
The answer seems to depend on questions of how doctors arrive at a diagnosis, and how they use tools such as artificial intelligence.
The Doctor in the Machine
So how does a doctor make a diagnosis?
There’s a problem, said Andrew Lea, a medical historian at Brigham and Women’s Hospital who was not involved in the study: “We don’t really know how doctors think.” Asked to describe how they arrive at a diagnosis, doctors tend to say things like “based on my experience,” Dr. Lea said.
This kind of ambiguity has been a challenge for decades to researchers trying to develop computer programs that can think like doctors.
The search began nearly 70 years ago.
“For as long as computers have existed, people have been trying to use them to make diagnoses,” Dr. Lea said.
One of the most ambitious attempts began in the 1970s at the University of Pittsburgh, where computer scientists recruited Dr. Jack Myers, chairman of the medical school’s department of internal medicine, who was known as a master diagnostician. He had a photographic memory and spent 20 hours a week in the medical library, learning everything there was to know about medicine.
Dr. Myers was given details of medical cases and explained his reasoning as he worked toward a diagnosis. Computer scientists translated his chains of logic into code. The resulting program, called INTERNIST-1, covered more than 500 diseases and about 3,500 symptoms of disease.