On defining rules for cancer data fabrication
Abstract
Data is essential for machine learning projects, and data accuracy is crucial for trusting the results obtained from the associated machine learning models. Previously, we developed machine learning models for predicting the treatment outcome of breast cancer patients who have undergone chemotherapy, together with a monitoring system for their treatment timeline that interactively shows the options and associated predictions. Available cancer datasets, such as the one used earlier, are often too small to yield significant results, which makes it difficult to explore ways of further improving the predictive capability of the models. In this paper, we explore synthetic data generation as a way to enhance our datasets. From our original dataset, we extract rules for generating fabricated data that capture the different characteristics inherent in the dataset. Additional rules can be used to capture general medical knowledge. We show how to formulate rules for our cancer treatment data, and use the IBM solver to obtain a corresponding synthetic dataset. We discuss challenges for future work.
Citation
Kuster Filipe Bowles, J, Silvina, A, Bin, E & Vinov, M 2020, On defining rules for cancer data fabrication. in V Gutiérrez Basulto, T Kliegr, A Soylu, M Giese & D Roman (eds), Rules and Reasoning: 4th International Joint Conference, RuleML+RR 2020, Oslo, Norway, June 29–July 1, 2020, Proceedings. Lecture Notes in Computer Science (Programming and Software Engineering), vol. 12173 LNCS, Springer, Cham, pp. 168-176, 4th International Joint Conference on Rules and Reasoning (RuleML+RR 2020), Oslo, Norway, 29/06/20. https://doi.org/10.1007/978-3-030-57977-7_13
Publication
Rules and Reasoning
ISSN
0302-9743
Type
Conference item
Rights
Copyright © 2020 Springer Nature Switzerland AG. This work has been made available online in accordance with publisher policies or with permission. Permission for further reuse of this content should be sought from the publisher or the rights holder. This is the author created accepted manuscript following peer review and may differ slightly from the final published version. The final published version of this work is available at https://doi.org/10.1007/978-3-030-57977-7_13.
Description
Funding: This research is partially funded by the Data Lab, and the EU H2020 project Serums: Securing Medical Data in Smart Patient-Centric Healthcare Systems (grant 826278).