

Item metadata

dc.contributor.author: Buhr, Christoph R
dc.contributor.author: Smith, Harry
dc.contributor.author: Huppertz, Tilman
dc.contributor.author: Bahr-Hamm, Katharina
dc.contributor.author: Matthias, Christoph
dc.contributor.author: Cuny, Clemens
dc.contributor.author: Snijders, Jan Phillipp
dc.contributor.author: Ernst, Benjamin Philipp
dc.contributor.author: Blaikie, Andrew
dc.contributor.author: Kelsey, Tom
dc.contributor.author: Kuhn, Sebastian
dc.contributor.author: Eckrich, Jonas
dc.date.accessioned: 2024-06-04T16:30:07Z
dc.date.available: 2024-06-04T16:30:07Z
dc.date.issued: 2024-05-23
dc.identifier: 302369391
dc.identifier: 0474e647-31de-427e-a662-ee2893b76830
dc.identifier: 85193978711
dc.identifier.citation: Buhr, C R, Smith, H, Huppertz, T, Bahr-Hamm, K, Matthias, C, Cuny, C, Snijders, J P, Ernst, B P, Blaikie, A, Kelsey, T, Kuhn, S & Eckrich, J 2024, 'Assessing unknown potential—quality and limitations of different large language models in the field of otorhinolaryngology', Acta Oto-Laryngologica, vol. Latest Articles, pp. 1-6. https://doi.org/10.1080/00016489.2024.2352843
dc.identifier.issn: 1651-2251
dc.identifier.other: Bibtex: doi:10.1080/00016489.2024.2352843
dc.identifier.other: ORCID: /0000-0001-7913-6872/work/160316516
dc.identifier.other: ORCID: /0000-0002-8091-1458/work/160317043
dc.identifier.uri: https://hdl.handle.net/10023/29986
dc.description.abstract:
Background: Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.
Aims/objectives: Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).
Material and methods: Case-based questions were extracted from the literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert scale for medical adequacy, comprehensibility, coherence, and conciseness. Given answers were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared.
Results: The LLMs' answers were rated inferior to the consultants' in all categories. Yet, the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among the LLMs, Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/246) for Bard 2023.07.13, and 6% (71/1230) for consultants.
Conclusions and significance: Despite the consultants' superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on a larger scale.
dc.format.extent: 6
dc.format.extent: 1576870
dc.language.iso: eng
dc.relation.ispartof: Acta Oto-Laryngologica
dc.subject: 3rd-NDAS
dc.title: Assessing unknown potential—quality and limitations of different large language models in the field of otorhinolaryngology
dc.type: Journal article
dc.contributor.institution: University of St Andrews. School of Medicine
dc.contributor.institution: University of St Andrews. Sir James Mackenzie Institute for Early Diagnosis
dc.contributor.institution: University of St Andrews. Infection and Global Health Division
dc.contributor.institution: University of St Andrews. School of Computer Science
dc.contributor.institution: University of St Andrews. Centre for Interdisciplinary Research in Computational Algebra
dc.identifier.doi: 10.1080/00016489.2024.2352843
dc.description.status: Peer reviewed

