St Andrews Research Repository


Evaluating data linkage algorithms with perfect synthetic ground truth

View/Open
Thesis-Tom-Dalton-complete-version.pdf (26.22Mb)
final-thesis-latex.zip (23.15Mb)
Date
15/06/2022
Author
Dalton, Thomas Stanley
Supervisor
Kirby, Graham N. C.
Dearle, Alan
Funder
Engineering and Physical Sciences Research Council (EPSRC)
Grant ID
EP/M508214/1
Keywords
data linkage
record linkage
evaluation
synthetic data
ground truth
synthetic ground truth
gold standard
linkage evaluation
Abstract
Data linkage algorithms join datasets by identifying commonalities between them. The ability to evaluate the efficacy of different algorithms is a challenging problem that is often overlooked. If incorrect links are made, or links are missed, by a linkage algorithm, then conclusions based on its linkage may be unfounded. Evaluating linkage quality is particularly challenging in domains where datasets are large and the number of links is low, such as historical population data, bibliographic data, and administrative data. In these domains the evaluation of linkage quality is not well understood.

A common approach to evaluating linkage quality is the use of metrics, most commonly precision, recall, and F-measure. These metrics indicate how often links are missed or false links are made. To calculate a metric, datasets are used in which the true links and non-links are known. The linkage algorithm attempts to link the datasets, and the constructed set of links is compared with the set of true links. In these domains we can rarely be confident that the evaluation datasets contain all the true links and include no false links. If such errors exist in the evaluation datasets, the calculated metrics may not truly reflect the performance of the linkage algorithm, which presents issues when making comparisons between linkage algorithms.

To rigorously evaluate the efficacy of linkage algorithms, it is necessary to objectively measure an algorithm's linkage quality with a range of different configuration parameters and datasets. These datasets must be of appropriate scale and have ground truth denoting all true links and non-links. Evaluating algorithms using shared, standardised datasets enables objective comparisons between linkage algorithms; to facilitate such evaluation, a set of standardised datasets needs to be shared and widely adopted.
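The metric calculation described above can be sketched in a few lines of Python: predicted links are compared against a perfect ground-truth link set, and precision, recall, and F-measure are derived from the overlap. The function and the record-pair identifiers are illustrative assumptions, not taken from the thesis.

```python
def linkage_metrics(predicted_links, true_links):
    """Return (precision, recall, f_measure) for a set of record-pair links,
    given ground truth that denotes all true links."""
    predicted = set(predicted_links)
    truth = set(true_links)
    true_positives = len(predicted & truth)   # links correctly made
    false_positives = len(predicted - truth)  # false links made
    false_negatives = len(truth - predicted)  # true links missed

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Hypothetical record pairs linking dataset A to dataset B:
predicted = {("a1", "b1"), ("a2", "b3"), ("a4", "b4")}
truth = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}
print(linkage_metrics(predicted, truth))
```

Note that if the "ground truth" itself contains missed or spurious links, the same arithmetic silently produces misleading scores, which is the motivation for perfect synthetic ground truth.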
This thesis establishes an approach for the construction of synthetic datasets that can be used to evaluate linkage algorithms. It addresses the following research questions:
  • What are appropriate approaches to the evaluation of linkage algorithms?
  • Is it feasible to synthesise realistic evaluation data?
  • Is synthetic evaluation data with perfect ground truth useful for evaluation?
  • How should synthesised data be statistically validated for correctness?
  • How should sets of synthesised data be used to evaluate linkage?
  • How can the evaluation of linkage algorithms be effectively communicated?

This thesis makes a number of contributions, most notably a framework for the comprehensive evaluation of data linkage algorithms, significantly improving the comparability of linkage algorithms, especially in domains lacking evaluation data. The thesis demonstrates these techniques within the population reconstruction domain. Integral to the evaluation framework, approaches to the synthesis and statistical validation of evaluation datasets have been investigated, resulting in a simulation model able to create many characteristically varied, large-scale datasets.
DOI
https://doi.org/10.17630/sta/247
Type
Thesis, PhD Doctor of Philosophy
Collections
  • Computer Science Theses
URI
http://hdl.handle.net/10023/26784

Items in the St Andrews Research Repository are protected by copyright, with all rights reserved, unless otherwise indicated.
