Item metadata

dc.contributor.author: Fraile Navarro, David
dc.contributor.author: Coiera, Enrico
dc.contributor.author: Hambly, Thomas W.
dc.contributor.author: Triplett, Zoe
dc.contributor.author: Asif, Nahyan
dc.contributor.author: Susanto, Anindya
dc.contributor.author: Chowdhury, Anamika
dc.contributor.author: Azcoaga Lorenzo, Amaya
dc.contributor.author: Dras, Mark
dc.contributor.author: Berkovsky, Shlomo
dc.date.accessioned: 2025-02-17T15:30:21Z
dc.date.available: 2025-02-17T15:30:21Z
dc.date.issued: 2025-01-07
dc.identifier: 314731018
dc.identifier: 54989524-be10-49ff-b2d7-fad4e2bb9d78
dc.identifier: 85214354497
dc.identifier.citation: Fraile Navarro, D, Coiera, E, Hambly, T W, Triplett, Z, Asif, N, Susanto, A, Chowdhury, A, Azcoaga Lorenzo, A, Dras, M & Berkovsky, S 2025, 'Expert evaluation of large language models for clinical dialogue summarization', Scientific Reports, vol. 15, 1195. https://doi.org/10.1038/s41598-024-84850-x
dc.identifier.issn: 2045-2322
dc.identifier.other: ORCID: /0000-0003-3307-878X/work/178724556
dc.identifier.uri: https://hdl.handle.net/10023/31411
dc.description.abstract: We assessed the performance of large language models in summarizing clinical dialogues, using computational metrics and human evaluations to compare automatically generated summaries against human-produced ones. We conducted an exploratory evaluation of five language models: one general summarization model, one fine-tuned for general dialogues, two fine-tuned with anonymized clinical dialogues, and one large language model (ChatGPT). The models were assessed using ROUGE and UniEval metrics, as well as expert human evaluation by clinicians, who compared the generated summaries against a clinician-generated summary (gold standard). The fine-tuned transformer model scored highest when evaluated with ROUGE, while ChatGPT scored lowest overall. With UniEval, however, ChatGPT scored highest across all evaluated domains (coherence 0.957, consistency 0.7583, fluency 0.947, relevance 0.947, and overall score 0.9891). Similar results were obtained when the systems were evaluated by clinicians, with ChatGPT scoring highest in four domains (coherency 0.573, consistency 0.908, fluency 0.96, and overall clinical use 0.862). Statistical analyses showed differences between ChatGPT and human summaries versus all other models. These exploratory results indicate that ChatGPT's performance in summarizing clinical dialogues approached the quality of human summaries. The study also found that ROUGE metrics may not be reliable for evaluating clinical summary generation, whereas UniEval correlated well with human ratings. Large language models may provide a successful path for automating clinical dialogue summarization, although privacy concerns and the restricted nature of health records remain challenges for their integration. Further evaluations using diverse clinical dialogues and multiple initialization seeds are needed to verify the reliability and generalizability of automatically generated summaries.
dc.format.extent: 11
dc.format.extent: 1422864
dc.language.iso: eng
dc.relation.ispartof: Scientific Reports
dc.rights: © The Author(s) 2025. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
dc.subject: Natural language processing
dc.subject: Electronic health records
dc.subject: Primary care
dc.subject: Artificial intelligence
dc.subject: E-DAS
dc.subject: MCC
dc.title: Expert evaluation of large language models for clinical dialogue summarization
dc.type: Journal article
dc.contributor.institution: University of St Andrews. School of Medicine
dc.identifier.doi: 10.1038/s41598-024-84850-x
dc.description.status: Peer reviewed

