ChatGPT Candidate Performs Well in Obstetrics and Gynaecology Clinical Examination

In a study examining how the Chat Generative Pre-Trained Transformer, or ChatGPT, would fare, without additional training, in a medical specialist examination against human candidates, the artificial intelligence chatbot outperformed the human candidates in a mock Obstetrics and Gynaecology (O&G) specialist clinical examination used to assess whether individuals are ready to become O&G specialists. ChatGPT also achieved high scores in empathetic communication, information gathering and clinical reasoning.

The tabulated results showed that ChatGPT attained an average score of 77.2%, compared with an average of 73.7% for the human candidates. ChatGPT also took an average of 2 minutes and 54 seconds to complete each station, well within the stipulated 10 minutes. Despite completing the stations quickly, it did not outperform every individual candidate in each cohort. To minimise bias, the responses of all three candidates were submitted to the examination panel with ChatGPT's identity concealed.

In the study, the team selected seven objective structured clinical examination (OSCE) stations that had been used in mock examinations over the previous two years, all similar in scope and difficulty. Stations requiring visual interpretation were excluded to accommodate ChatGPT's limitations at the time of the study. Each station presents multiple layers of evolving questions based on the initial data given and the candidate's subsequent responses. The OSCE is a criterion-based assessment in which each candidate's clinical competencies are evaluated across a circuit of stations in a simulated environment.

At each station, the candidate has 10 minutes to work through an unfamiliar clinical scenario, together with the information needed to make an informed clinical decision. Within the time limit, the candidate is expected to articulate a care plan while demonstrating communication, information gathering, application of clinical knowledge and attention to patient safety. The stations were presented in an identical format and in the same order to two human candidates, Candidates A and B, and to ChatGPT, designated Candidate C.

The study team from the Department of O&G at the Yong Loo Lin School of Medicine, National University of Singapore (NUS Medicine), led by Associate Professor Mahesh Choolani, Head of the Department of O&G, also analysed the answers and found that ChatGPT scored very well in the empathetic communication domain. It skilfully and rapidly generated factually accurate and contextually relevant answers to evolving clinical questions based on unfamiliar data, a feat that would normally require more than 10 years of clinical training before a person of average intelligence could understand and appropriately answer questions in examinations of this complexity.

It is remarkable that generative AI, still in its infancy, can quickly consolidate and interpret large volumes of general content and organise it into coherent, concise conversational responses, something that does not come naturally to non-native English speakers or to candidates under examination stress. Despite best efforts to blind the examination panel, examiners were generally, though not always, able to identify the responses from ChatGPT.

The answers from the human candidates and ChatGPT were transcribed verbatim and assessed by 14 trained clinician examiners. Although English was used throughout, the human candidates made extensive use of Singlish and of words borrowed from Malay, Tamil and Chinese dialects. This intonation and vocabulary are familiar and endearing to Singaporeans and long-term residents of Singapore, and such communication can help build closeness and trust and ease patients' nervousness, in contrast to ChatGPT's more polished, scripted answers. This lack of local cultural and ethnic knowledge is one of ChatGPT's major limitations, alongside its lack of up-to-date medical references and data, which can lead to hallucinations in which it produces irrelevant or incorrect answers and conclusions.

Crucially, the results also revealed that ChatGPT is less able to handle questions in which the scenario changes repeatedly and requires open interpretation. Stations with such evolving scenarios would demand additional training in context-specific medical knowledge on highly specialised topics; they remain manageable for a highly trained human candidate who has cultivated the higher-level discernment and flexible reasoning needed to tackle the ambiguities in these questions. ChatGPT outperformed the human candidates in several knowledge areas, including labour management, gynaecologic oncology and postoperative care, stations that largely involved standard, protocol-driven decision-making, but not in highly contextual situations.

"The arrival and increased use of ChatGPT has proven that it can be a viable resource in guiding medical education, possibly provide adjunct support for clinical care in real time, and even support the monitoring of medical treatment in patients. In an era where accurate knowledge and information is instantly accessible, and these capabilities could be embedded within appropriate context by Generative AI in the foreseeable future, the need for future generations of medical doctors to clearly demonstrate the value and importance of the human touch is now saliently obvious. As doctors and medical educators, we need to strongly emphasise and exemplify the use of soft skills, compassionate communication and knowledge application in medical training and clinical care," said Associate Professor Mahesh Choolani.

Li SW, Kemp MW, Logan SJS, Dimri PS, Singh N, Mattar CNZ, Dashraath P, Ramlal H, Mahyuddin AP, Kanayan S, Carter SWD, Thain SPT, Fee EL, Illanes SE, Choolani MA; National University of Singapore Obstetrics and Gynecology Artificial Intelligence (NUS OBGYN-AI) Collaborative Group.
ChatGPT outscored human candidates in a virtual objective structured clinical examination in obstetrics and gynecology.
Am J Obstet Gynecol. 2023 Apr 22:S0002-9378(23)00251-X. doi: 10.1016/j.ajog.2023.04.020
