Templated text synthesis for expert-guided multi-label extraction from radiology reports

Training medical image analysis models traditionally requires large amounts of expertly annotated imaging data, which is time-consuming and expensive to obtain. One solution is to automatically extract scan-level labels from radiology reports. Previously, we showed that, by extending BERT with a per-label attention mechanism, we can train a single model to perform automatic extraction of many labels in parallel. However, if we rely on purely data-driven learning, the model sometimes fails to learn critical features or learns the correct answer via simplistic heuristics (e.g., that “likely” indicates positivity), and thus fails to generalise to rarer cases which have not been learned or where the heuristics break down (e.g., “likely represents prominent VR space or lacunar infarct”, which indicates uncertainty over two differential diagnoses). In this work, we propose template creation for data synthesis, which enables us to inject expert knowledge about unseen entities from medical ontologies, and to teach the model rules on how to label difficult cases, by producing relevant training examples. Using this technique alongside domain-specific pre-training for our underlying BERT architecture (i.e., PubMedBERT), we improve F1 micro from 0.903 to 0.939 and F1 macro from 0.512 to 0.737 on an independent test set for 33 labels in head CT reports for stroke patients. Our methodology offers a practical way to combine domain knowledge with machine learning for text classification tasks.
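To make the idea of template-based synthesis concrete, the following is a minimal sketch and not the authors' implementation. The template strings, the label values ("positive", "negative", "uncertain"), the synonym lists, and the function name `synthesise` are all hypothetical choices made for illustration; in the described approach the synonyms would come from a medical ontology and the templates from expert input.

```python
from itertools import product

# Hypothetical templates: each pairs a sentence pattern with the label value
# that an expert would assign to any sentence generated from it.
TEMPLATES = [
    ("There is {entity}.", "positive"),
    ("No evidence of {entity}.", "negative"),
    ("Likely represents {entity} or {other}.", "uncertain"),
]

# Hypothetical synonym lists, standing in for terms drawn from a medical ontology.
SYNONYMS = {
    "lacunar_infarct": ["lacunar infarct", "lacunar infarction"],
    "prominent_vr_space": ["prominent VR space", "prominent perivascular space"],
}


def synthesise(label, other_label):
    """Fill every template with every synonym to produce labelled training text.

    Returns a list of (text, label, value) tuples for `label`; `other_label`
    supplies the second differential in templates that mention two entities.
    """
    examples = []
    combos = product(TEMPLATES, SYNONYMS[label], SYNONYMS[other_label])
    for (pattern, value), entity, other in combos:
        # str.format ignores unused keyword arguments, so single-entity
        # templates simply do not use `other`.
        text = pattern.format(entity=entity, other=other)
        examples.append((text, label, value))
    # Single-entity templates yield duplicates across `other`; drop them
    # while preserving order.
    return list(dict.fromkeys(examples))


if __name__ == "__main__":
    for text, label, value in synthesise("lacunar_infarct", "prominent_vr_space"):
        print(f"{value:9s} {label:16s} {text}")
```

Under these assumptions, the third template generates sentences in which “likely” co-occurs with an "uncertain" label rather than a "positive" one, which is the kind of counterexample to a simplistic heuristic that the abstract describes.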

Citation

Schrempf, P, Watson, H, Park, E, Pajak, M, MacKinnon, H, Muir, K W, Harris-Birtill, D & O’Neil, A Q 2021, 'Templated text synthesis for expert-guided multi-label extraction from radiology reports', Machine Learning and Knowledge Extraction, vol. 3, no. 2, pp. 299-317. https://doi.org/10.3390/make3020015

Publication

Machine Learning and Knowledge Extraction

Status

Peer reviewed

DOI

https://doi.org/10.3390/make3020015

ISSN

2504-4990

Type

Journal article

Description

Funding: This work is part of the Industrial Centre for AI Research in digital Diagnostics (iCAIRD), which is funded by Innovate UK on behalf of UK Research and Innovation (UKRI), project number 104690. The Data Lab has also provided support and funding.