Item metadata

dc.contributor.author: Buhr, Christoph Raphael
dc.contributor.author: Smith, Harry
dc.contributor.author: Huppertz, Tilman
dc.contributor.author: Bahr-Hamm, Katharina
dc.contributor.author: Matthias, Christoph
dc.contributor.author: Blaikie, Andrew
dc.contributor.author: Kelsey, Tom
dc.contributor.author: Kuhn, Sebastian
dc.contributor.author: Eckrich, Jonas
dc.date.accessioned: 2024-02-22T12:30:11Z
dc.date.available: 2024-02-22T12:30:11Z
dc.date.issued: 2023-12-05
dc.identifier: 297350302
dc.identifier: 9a3fb956-eaa4-43f1-9bb0-08afc2528e31
dc.identifier: 85180330765
dc.identifier.citation: Buhr, C R, Smith, H, Huppertz, T, Bahr-Hamm, K, Matthias, C, Blaikie, A, Kelsey, T, Kuhn, S & Eckrich, J 2023, 'ChatGPT versus consultants: blinded evaluation on answering otorhinolaryngology case-based questions', JMIR Medical Education, vol. 9, e49183. https://doi.org/10.2196/49183 [en]
dc.identifier.issn: 2369-3762
dc.identifier.other: ORCID: /0000-0002-8091-1458/work/148888725
dc.identifier.other: ORCID: /0000-0001-7913-6872/work/153977058
dc.identifier.uri: https://hdl.handle.net/10023/29325
dc.description.abstract: Background: Large language models (LLMs), such as ChatGPT (OpenAI), are increasingly used in medicine and supplement standard search engines as information sources. This leads to more “consultations” of LLMs about personal medical symptoms. Objective: This study aims to evaluate ChatGPT’s performance in answering clinical case-based questions in otorhinolaryngology (ORL) in comparison to ORL consultants’ answers. Methods: We used 41 case-based questions from established ORL study books and past German state examinations for doctors. The questions were answered by both ORL consultants and ChatGPT 3. ORL consultants rated all responses, except their own, on medical adequacy, conciseness, coherence, and comprehensibility using a 6-point Likert scale. They also identified (in a blinded setting) whether the answer was created by an ORL consultant or ChatGPT. Additionally, the character count was compared. Due to the rapidly evolving pace of technology, a comparison between responses generated by ChatGPT 3 and ChatGPT 4 was included to give an insight into the evolving potential of LLMs. Results: Ratings in all categories were significantly higher for ORL consultants (P<.001). Although inferior to the scores of the ORL consultants, ChatGPT’s scores were relatively higher in the semantic categories (conciseness, coherence, and comprehensibility) than in medical adequacy. ORL consultants identified ChatGPT as the source correctly in 98.4% (121/123) of cases. ChatGPT’s answers had a significantly higher character count compared to ORL consultants (P<.001). Comparison between responses generated by ChatGPT 3 and ChatGPT 4 showed a slight improvement in medical accuracy as well as better coherence of the answers provided. In contrast, neither the conciseness (P=.06) nor the comprehensibility (P=.08) improved significantly despite a significant increase in the mean character count of 52.5% (n=(1470-964)/964; P<.001). Conclusions: While ChatGPT provided longer answers to medical problems, medical adequacy and conciseness were significantly lower compared to ORL consultants’ answers. LLMs have potential as augmentative tools for medical care, but their “consultation” for medical problems carries a high risk of misinformation, as their high semantic quality may mask contextual deficits.
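The 52.5% figure in the abstract is simply the relative increase in mean character count between ChatGPT 3 and ChatGPT 4 responses; restated as a worked calculation from the two means reported there (964 and 1470 characters):

\[
\frac{1470 - 964}{964} = \frac{506}{964} \approx 0.525 = 52.5\%
\]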
dc.format.extent: 9
dc.format.extent: 374457
dc.language.iso: eng
dc.relation.ispartof: JMIR Medical Education [en]
dc.subject: Large language models [en]
dc.subject: LLMs [en]
dc.subject: LLM [en]
dc.subject: Artificial intelligence [en]
dc.subject: AI [en]
dc.subject: ChatGPT [en]
dc.subject: Otorhinolaryngology [en]
dc.subject: ORL [en]
dc.subject: Digital health [en]
dc.subject: Chatbots [en]
dc.subject: Global health [en]
dc.subject: Low- and middle-income countries [en]
dc.subject: Telemedicine [en]
dc.subject: Telehealth [en]
dc.subject: Language model [en]
dc.subject: Chatbot [en]
dc.subject: QA75 Electronic computers. Computer science [en]
dc.subject: R Medicine (General) [en]
dc.subject: 3rd-DAS [en]
dc.subject.lcc: QA75 [en]
dc.subject.lcc: R1 [en]
dc.title: ChatGPT versus consultants: blinded evaluation on answering otorhinolaryngology case-based questions [en]
dc.type: Journal article [en]
dc.contributor.institution: University of St Andrews. School of Computer Science [en]
dc.contributor.institution: University of St Andrews. Sir James Mackenzie Institute for Early Diagnosis [en]
dc.contributor.institution: University of St Andrews. Centre for Interdisciplinary Research in Computational Algebra [en]
dc.contributor.institution: University of St Andrews. Infection and Global Health Division [en]
dc.contributor.institution: University of St Andrews. School of Medicine [en]
dc.identifier.doi: https://doi.org/10.2196/49183
dc.description.status: Peer reviewed [en]

