Genomics reveal population structure, evolutionary history, and signatures of selection in the northern bottlenose whale, Hyperoodon ampullatus

Abstract Information on wildlife population structure, demographic history, and adaptations are fundamental to understanding species evolution and informing conservation strategies. To study this ecological context for a cetacean of conservation concern, we conducted the first genomic assessment of the northern bottlenose whale, Hyperoodon ampullatus, using whole‐genome resequencing data (n = 37) from five regions across the North Atlantic Ocean. We found a range‐wide pattern of isolation‐by‐distance with a genetic subdivision distinguishing three subgroups: the Scotian Shelf, western North Atlantic, and Jan Mayen regions. Signals of elevated levels of inbreeding in the Endangered Scotian Shelf population indicate this population may be more vulnerable than the other two subgroups. In addition to signatures of inbreeding, evidence of local adaptation in the Scotian Shelf was detected across the genome. We found a long‐term decline in effective population size for the species, which poses risks to their genetic diversity and may be exacerbated by the isolating effects of population subdivision. Protecting important habitat and migratory corridors should be prioritized to rebuild population sizes that were diminished by commercial whaling, strengthen gene flow, and ensure animals can move across regions in response to environmental changes.


Tables and Figures
Page 7

Synteny
We mapped the northern bottlenose whale (Hyperoodon ampullatus) genome to the blue whale (Balaenoptera musculus) to determine chromosomal positions with SatsumaSynteny v2 (Grabherr et al., 2010). Due to the large genome size (>1.5 Gb), we ran SatsumaSynteny for each blue whale chromosome separately. We then filtered out scaffold alignments with an identity value below 0.6 and scaffolds with less than 50% of their length mapped to a chromosome.

Annotation
Genome annotation through MAKER v2.31.10 (Holt & Yandell, 2011) required multiple iterations to train SNAP (computational gene prediction) and create hidden Markov models (HMM) to improve annotations. In the first round, we generated gene models using protein data obtained from Ensembl database (Cunningham et al., 2019) for three model species: blue whale, sperm whale (Physeter macrocephalus), and cow (Bos taurus). We adjusted parameters for masking repeats with RepeatMasker (model_org=all) and aligning splice sites for protein sequences with Exonerate (protein2genome=1). In the second round, we used the generated HMM results from the first round as an input parameter for SNAP to update the annotation. We repeated this process in the third round, including the parameter for additional gene prediction with Augustus (augustus_species=human). In the fourth round, we updated the annotation with the previous round's HMM results for a final time, then filtered out annotations with an annotation edit distance (AED) greater than 0.5. We then aligned the northern bottlenose whale annotation with protein data with the same three model organisms (blue whale, sperm whale, cow) from uniprot database (The UniProt Consortium, 2019) using BLAST+ v2.11.0 (Camacho et al., 2009) and protein functions using InterProScan v5.23-62.0 (Jones et al., 2014), providing gene identifications and gene ontology (GO) information.

Merging sequence data for two individuals
We had duplicate sequencing reads for some of our individuals (due to a laboratory error, 24 samples were re-sequenced). To recover some individuals that were removed from the SNP dataset due to high missingness, we identified sample matches and merged the sequencing data to increase sample completeness. In this process, SNPs were called with Platypus v0.8.1 (Rimmer et al., 2014), using a total of 73 bam files (the original 49 and resequenced 24 samples). Then we ran a kinship analysis with PLINK v1.9 (Purcell et al., 2007) to determine duplicates. Through this method, we were able to confidently identify which sequencing reads to merge for two individuals, allowing us to retain these two samples that would have been filtered out from our main dataset otherwise. We merged aligned bam files with SAMtools v1.9 (Li et al., 2009) and redid deduplication and read group addition for these two samples with Picard v2.20.6 (Broad Institute, 2019). These updated bam files were included in SNP calling for downstream analyses.

Sex determination
Sex chromosomes in mammals have large areas that are non-recombining, and differences in sex chromosomes between males (XY) and females (XX) may drive a proportion of genetic variation among individuals and bias inferences of evolutionary processes that assume recombination occurs. To isolate autosomes from sex chromosomes for downstream analyses, we identified genomic regions with scaffolds exhibiting differences in coverage consistent with being on the X and Y chromosomes. We compared sequencing depth in windows across the genome between male and female samples with DifCover (Smith et al., 2018), using bam files aligned to the reference genome. We used two samples from each sex to run four experimental comparisons with different-sex samples and two control comparisons with same-sex samples (Grayson et al., 2022). BEDTools v2.25.0 (Quinlan & Hall, 2010) was used to compare results from each run and list genomic regions that were consistently different in coverage between males and females and consistently similar in coverage in same-sex comparisons, identifying scaffolds located on the X and Y chromosomes.

Structural variants
Through the program BreakDancer v1.3.6 (Fan et al. 2014), we used 10 samples (bam files aligned to the reference genome) with 8-10x coverage to identify structural variants, containing inter-chromosomal translocations, intra-chromosomal translocations, and inversions. We selected structural variants with confidence scores above 50, then created a list of all SNPs that fell in these regions to use in filtering the SNP dataset.

Linkage disequilibrium decay
Alleles found together at a greater frequency compared to random assortment are described to be in linkage disequilibrium (LD), where greater r 2 indicates higher linkage disequilibrium (Slatkin, 2009). Generally, LD decay exhibits a decrease in r 2 with increasing inter-SNP distance. We used PopLDdecay to complete a LD decay analysis directly on the SNP vcf file (Zhang et al., 2019). After observing a bump in the LD decay plot that did not align with expected LD, we predicted that small scaffolds may be influencing results. Accordingly, we removed SNPs on scaffolds shorter than 50 kb, which resulted in more accurate LD decay estimates. Given impact of these short scaffolds, we excluded them from downstream analyses.    Figure S1. Proportion of autosomes (blue) and sex-chromosome (orange) lengths in final scaffold mapping between the northern bottlenose whale (NBW) and the blue whale (BLW).