Files in this item
Quality assurance and validity of AI-generated single best answer questions
Item metadata
dc.contributor.author | Ahmed, Ayla | |
dc.contributor.author | Kerr, Ellen | |
dc.contributor.author | O'Malley, Andrew Stephen | |
dc.date.accessioned | 2025-02-26T11:30:06Z | |
dc.date.available | 2025-02-26T11:30:06Z | |
dc.date.issued | 2025-02-25 | |
dc.identifier | 315038251 | |
dc.identifier | b3930d55-1f62-4a05-8eb0-30278eedf932 | |
dc.identifier.citation | Ahmed, A, Kerr, E & O'Malley, A S 2025, 'Quality assurance and validity of AI-generated single best answer questions', BMC Medical Education, vol. 25, 300. https://doi.org/10.1186/s12909-025-06881-w | en |
dc.identifier.issn | 1472-6920 | |
dc.identifier.uri | https://hdl.handle.net/10023/31513 | |
dc.description.abstract | Background: Recent advancements in generative artificial intelligence (AI) have opened new avenues in educational methodologies, particularly in medical education. This study seeks to assess whether generative AI might be useful in addressing the depletion of assessment question banks, a challenge intensified during the Covid era due to the prevalence of open-book examinations, and to augment the pool of formative assessment opportunities available to students. While many recent publications have sought to ascertain whether AI can achieve a passing standard in existing examinations, this study investigates the potential for AI to generate the exam itself. Summary of work: This research utilized a commercially available AI large language model (LLM), OpenAI GPT-4, to generate 220 single best answer (SBA) questions, adhering to Medical Schools Council Assessment Alliance guidelines and a selection of Learning Outcomes (LOs) of the Scottish Graduate-Entry Medicine (ScotGEM) program. All questions were assessed by an expert panel for accuracy and quality. A total of 50 AI-generated and 50 human-authored questions were used to create two 50-item formative SBA examinations for Year 1 and Year 2 ScotGEM students. Each exam, delivered via the Speedwell eSystem, comprised 25 AI-generated and 25 human-authored questions presented in random order. Students completed the online, closed-book exams on personal devices under exam conditions that reflected summative examinations. The performance of both AI-generated and human-authored questions was evaluated, focusing on facility and discrimination index as key metrics. Summary of results: The screening process revealed that 69% of AI-generated SBAs were fit for inclusion in the examinations with little or no modification required. Modifications, when necessary, were predominantly due to the inclusion of "all of the above" options, the use of American English spellings, and non-alphabetized answer choices. The remaining 31% of questions were rejected from inclusion in the examinations due to factual inaccuracies and non-alignment with students’ learning. For the questions that were included, post hoc statistical analysis indicated no significant difference in performance between the AI-generated and human-authored questions in terms of facility and discrimination index. Discussion and conclusion: The outcomes of this study suggest that AI LLMs can generate SBA questions that are in line with best-practice guidelines and specific LOs. However, a robust quality assurance process is necessary to ensure that erroneous questions are identified and rejected. The insights gained from this research provide a foundation for further investigation into refining AI prompts, aiming for a more reliable generation of curriculum-aligned questions. LLMs show significant potential in supplementing traditional methods of question generation in medical education. This approach offers a viable solution to rapidly replenish and diversify assessment resources in medical curricula, marking a step forward in the intersection of AI and education. | |
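Note on the item metrics named in the abstract (facility and discrimination index): the short sketch below is not part of the published record and only illustrates one common way these metrics are computed from a 0/1 response matrix. The paper does not state its exact formulae, so the upper/lower-group (top vs. bottom 27% of total scorers) definition of discrimination used here is an assumption; a point-biserial correlation is an equally plausible choice.

    # Minimal sketch (assumed definitions, not the authors' code):
    # facility = proportion of candidates answering an item correctly;
    # discrimination = difference in facility between top and bottom 27% of scorers.
    import numpy as np

    def item_metrics(responses: np.ndarray, group_frac: float = 0.27):
        """responses: (n_students, n_items) array of 0/1 item scores."""
        n_students, _ = responses.shape
        facility = responses.mean(axis=0)            # proportion correct per item

        totals = responses.sum(axis=1)               # each student's total score
        order = np.argsort(totals)
        k = max(1, int(round(group_frac * n_students)))
        lower, upper = order[:k], order[-k:]         # bottom / top scorers

        discrimination = responses[upper].mean(axis=0) - responses[lower].mean(axis=0)
        return facility, discrimination

    # Hypothetical usage: 6 students, 2 items
    resp = np.array([[1, 0], [1, 1], [0, 0], [1, 1], [0, 1], [1, 0]])
    fac, disc = item_metrics(resp)
    print(fac, disc)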
dc.format.extent | 10 | |
dc.format.extent | 1076915 | |
dc.language.iso | eng | |
dc.relation.ispartof | BMC Medical Education | en |
dc.rights | Copyright © The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. | en |
dc.subject | Generative AI | en |
dc.subject | Artificial intelligence | en |
dc.subject | ChatGPT | en |
dc.subject | Assessment | en |
dc.subject | LLM | en |
dc.subject | LB2300 Higher Education | en |
dc.subject | RR-NDAS | en |
dc.subject | SDG 3 - Good Health and Well-being | en |
dc.subject.lcc | LB2300 | en |
dc.title | Quality assurance and validity of AI-generated single best answer questions | en |
dc.type | Journal article | en |
dc.contributor.institution | University of St Andrews.Education Division | en |
dc.contributor.institution | University of St Andrews.School of Medicine | en |
dc.identifier.doi | 10.1186/s12909-025-06881-w | |
dc.description.status | Peer reviewed | en |