Evaluating data linkage algorithms with perfect synthetic ground truth
Abstract
Data linkage algorithms join datasets by identifying commonalities between them.
Evaluating the efficacy of different linkage algorithms is a challenging problem
that is often overlooked. If a linkage algorithm makes false links or misses
true links, conclusions drawn from its output may be unfounded. Evaluating
linkage quality is particularly challenging in domains where datasets are large and
the number of links is low. Example domains include historical population data,
bibliographic data, and administrative data. In these domains the evaluation of
linkage quality is not well understood.
A common approach to evaluating linkage quality is the use of metrics, most commonly
precision, recall, and F-measure. These metrics indicate how often links are
missed or false links are made. To calculate a metric, datasets are used where the
true links and non-links are known. The linkage algorithm attempts to link the
datasets and the constructed set of links is compared with the set of true links. In
these domains we can rarely have confidence that the evaluation datasets contain
all the true links and that no false links have been included. If such errors exist in
the evaluation datasets, the calculated metrics may not truly reflect the performance
of the linkage algorithm. This presents issues when making comparisons between
linkage algorithms.
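The metric calculation described above can be sketched as follows. This is a minimal illustration, not the thesis's own evaluation code: each link is represented as a hypothetical pair of record identifiers, and the metrics are computed by comparing the algorithm's constructed set of links with the set of true links.

```python
def linkage_metrics(true_links, predicted_links):
    """Compute precision, recall, and F-measure for a set of
    predicted links against the known true links."""
    true_links = set(true_links)
    predicted_links = set(predicted_links)

    true_positives = len(true_links & predicted_links)   # correct links made
    false_positives = len(predicted_links - true_links)  # false links made
    false_negatives = len(true_links - predicted_links)  # true links missed

    precision = (true_positives / (true_positives + false_positives)
                 if predicted_links else 0.0)
    recall = (true_positives / (true_positives + false_negatives)
              if true_links else 0.0)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return precision, recall, f_measure


# Hypothetical example: the ground truth contains three true links;
# the algorithm found two of them and made one false link.
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
found = {("a1", "b1"), ("a2", "b2"), ("a4", "b9")}
p, r, f = linkage_metrics(truth, found)
# p = 2/3, r = 2/3, f = 2/3
```

Note that these metrics are only as trustworthy as the ground truth itself: if `truth` is missing a true link or contains a false one, the computed values misstate the algorithm's real performance, which is precisely the problem the abstract identifies.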
To rigorously evaluate the efficacy of linkage algorithms, it is necessary to objectively
measure an algorithm’s linkage quality with a range of different configuration
parameters and datasets. These datasets must be of appropriate scale and
have ground truth which denotes all true links and non-links. Evaluating algorithms
using shared standardised datasets enables objective comparisons between linkage
algorithms. To facilitate objective linkage evaluation, a set of standardised datasets
needs to be shared and widely adopted. This thesis establishes an approach for the
construction of synthetic datasets that can be used to evaluate linkage algorithms.
This thesis addresses the following research questions:
• What are appropriate approaches to the evaluation of linkage algorithms?
• Is it feasible to synthesise realistic evaluation data?
• Is synthetic evaluation data with perfect ground truth useful for evaluation?
• How should synthesised data be statistically validated for correctness?
• How should sets of synthesised data be used to evaluate linkage?
• How can the evaluation of linkage algorithms be effectively communicated?
This thesis makes a number of contributions, most notably a framework for the
comprehensive evaluation of data linkage algorithms, thus significantly improving
the comparability of linkage algorithms, especially in domains lacking evaluation
data. The thesis demonstrates these techniques within the population reconstruction
domain. Integral to the evaluation framework, approaches to synthesis and statistical
validation of evaluation datasets have been investigated, resulting in a simulation
model able to create many large-scale datasets with varied characteristics.
Type: PhD Thesis (Doctor of Philosophy)