Statistics, machine learning and deep learning for population genetic inference
Deciphering evolutionary changes from raw DNA data without losing intrinsic information is a core task in population genetics. However, several statistical challenges still limit inferential performance, for example the undue emphasis that different statistics place on rare or common alleles, the ubiquitous multimodal genetic structure within populations, and complex genotype-by-environment associations. In this thesis, I propose to integrate information-based statistics with machine learning approaches to address these challenges in population genetic inference. First, I evaluated the performance of information-based summary statistics for spatial demography inference. I showed that summary statistics based on Shannon differentiation and the transformed diversity of order q=1 had higher power to discriminate spatially structured scenarios than the traditional summary statistics based on allelic richness and heterozygosity. This provides guidelines for using summary statistics to infer spatial demography and for developing new statistical methods to detect signatures of evolutionary change. Second, I proposed Kernel Local Fisher Discriminant Analysis of Principal Components (KLFDAPC) for inferring population genetic structure, a method that accounts for nonlinear and multimodal genetic information between individuals. KLFDAPC outperformed both PCA and DAPC in discriminatory power and in predicting individual geographic origin. KLFDAPC is useful for geographic ancestry inference and for correcting population stratification in GWAS. Finally, I proposed a deep learning-based approach (DeepGenomeScan) to detect signals of selection. DeepGenomeScan had higher power than commonly used machine learning approaches such as pcadapt and RDA in identifying signatures of selection.
Furthermore, DeepGenomeScan can be extended to various genome-wide association studies (GWAS, TWAS, PWAS, and MWAS) by systematically scanning genome-wide variants to detect the genetic variation responsible for complex traits or involved in adaptation. In summary, this dissertation addresses several foundational questions in statistics-based and machine learning-based inference, contributing several state-of-the-art statistical tools for population genetic inference.
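The abstract contrasts diversity of order q=1 with allelic richness and heterozygosity. A minimal sketch of that contrast, assuming the standard Hill-number definition of diversity of order q, is given below. The function name `hill_number` and the example frequencies are illustrative and are not taken from the thesis, whose exact "transformed diversity" statistic may differ. Order q=0 counts alleles (richness), q=2 is the inverse of expected homozygosity, and q=1 is the exponential of Shannon entropy, which weights alleles in proportion to their frequency.

```python
import math


def hill_number(freqs, q):
    """Effective number of alleles (Hill number) of order q.

    Illustrative helper, not code from the thesis. `freqs` are allele
    frequencies or counts at one locus; they are normalised to sum to 1.
    """
    positive = [f for f in freqs if f > 0]
    total = sum(positive)
    p = [f / total for f in positive]
    if abs(q - 1.0) < 1e-12:
        # The general formula is undefined at q = 1; its limit is the
        # exponential of Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


# Hypothetical locus with one common and one rare allele.
skewed = [0.9, 0.1]
richness = hill_number(skewed, 0)      # counts every allele equally
shannon_div = hill_number(skewed, 1)   # weights alleles by frequency
simpson_div = hill_number(skewed, 2)   # dominated by the common allele
print(richness, shannon_div, simpson_div)
```

For any set of frequencies the value is non-increasing in q, so the q=1 measure always lies between the richness-based and heterozygosity-based values; it is neither dominated by rare alleles nor by common ones.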
Thesis, PhD Doctor of Philosophy
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
http://creativecommons.org/licenses/by-nc-nd/4.0/
Embargo Date: 2022-06-02
Embargo Reason: Thesis restricted in accordance with University regulations. Print and electronic copy of all chapters and appendices restricted until 2nd June 2022
Except where otherwise noted within the work, this item's license for re-use is described as Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
Items in the St Andrews Research Repository are protected by copyright, with all rights reserved, unless otherwise indicated.