Low-cost and clinically applicable copy number profiling using repeat DNA

Large-scale cancer genome studies suggest that tumors are driven by somatic copy number alterations (SCNAs) or single-nucleotide variants (SNVs). Due to the low-cost, the clinical use of genomics assays is biased towards targeted gene panels, which identify SNVs. There is a need for a comparably low-cost and simple assay for high-resolution SCNA profiling. Here we present our method, conliga, which infers SCNA profiles from a low-cost and simple assay.

. Focal deletions such as these 99 may be functionally relevant, potentially rendering tumor suppressor genes inactive. 100 The purity of tumor samples obtained by dissection can vary widely [24], as can samples obtained non- 101 invasively, e.g ctDNA from plasma [25]. As tumor purity reduces, the copy number signal to noise ratio decreases. 102 To determine the performance of conliga and QDNAseq under different purity conditions, we generated samples 103 with varying purity by mixing sequencing reads from normal and OAC samples (Methods). FAST-SeqS samples 104 were generated with two million reads and LC WGS samples were generated with nine million reads. 105 Figure 2: Comparing the performance of SCNA detection in low tumor purity samples and determining the limit of detection. (a) left column: relative copy number calls by conliga at different dilutions of sample OAC3, compared to ASCAT relative copy number profile (top left), discrete copy number states are colored with a gradient (light green to purple), highlighting regions with differing SCNAs. right column: relative copy number calls by QDNAseq at different dilutions of sample OAC3, compared to ASCAT relative copy number profile (top right). (b) The number of copy number states detected by conliga in each of eight OAC samples at differing purity levels. The limit of detection is determined by the lowest purity level in which more than one copy number state is detected. Figure 2a shows the performance of both methods for sample OAC3. At 30% purity, both conliga and 106 QDNAseq recapitulate the copy number profile as determined by ASCAT. At 5%, other than the focal amplifi-107 cation on chromosome 12, QDNAseq fails to detect sub chromosomal SCNAs, whereas conliga shows evidence 108 of chromosome arm and sub-chromosomal arm changes. At 2% purity, conliga is able to distinguish some of the 109 more prominent chromosomal arm SCNAs. The focal amplification on chromosome 12 is identified by conliga 110 at 0.75% and 0.5% purity and not detected by QDNAseq below 1%. At 0.75%, 0.5% and 0% purity, it is hard 111 to distinguish whole chromosome SCNAs from noise generated by segmentation in the QDNAseq profiles. This 112 highlights the advantage of conliga's ability to assign loci to discrete states, meaning we can easily distinguish when SCNAs are and are not different between loci. Despite using 4.5 fold fewer reads, conliga appears to be 114 more sensitive than QDNAseq.

115
In Figure 2b, we show that conliga is able to detect SCNAs at 3% purity in all samples (eight), five at 2% 116 and one at 0.5%. The limit of detection is dependent on the amplitude and lengths of SCNAs present in the 117 sample. Long chromosomal arm amplifications can be detected at 2-3% purity, while some focal amplifications 118 (particularly those occurring at loci with a bias towards obtaining a high number of counts) can be detected at 119 <1% purity (e.g. chr12 in OAC3, Figure 2a). The limit of detection also depends on the technical variability 120 of the protocol and the total number of reads per sample. Increasing the total number of reads beyond two 121 million and reducing technical variability would further improve the limit of detection.

122
These data demonstrate the potential clinical utility of FAST-SeqS coupled with conliga. Ciriello  We model the sample counts, in L selected loci, by assuming that the count at locus l in chromosome arm r in 155 sample j is distributed: Here, n j is the total number of sequencing reads aligned to the L loci in sample j, θ r,l,j represents the probability 157 of observing an aligned read at locus l in chromosome arm r in sample j. We model θ r,l,j as follows: Here, s j is the inverse dispersion variable for sample j where s j > 0, m r,l represents the probability of an aligned 159 sequencing read originating from locus l in chromosome arm r in a control sample, where r Lr l=1 m r,l = 1 160 andĉ r,l,j is the relative copy number at locus l in chromosome arm r in sample j. The number of loci in each 161 chromosome arm is denoted as L r and so the total number of loci, L = r L r . 162 We can interpret m as defining the bias in observing aligned read counts from the FAST-SeqS protocol. This 163 bias can be explained by unequal PCR efficiencies between loci in addition to biases in aligning reads uniquely 164 to FAST-SeqS loci, among other factors. Note that: We can be interpret this equation intuitively; the relative copy number scales the probability of reads to 166 align to a locus. For example, if the relative copy number of a locus is 2 we expect the proportion of reads at 167 the locus to double. This fits with our observations shown in Supplementary Fig. 1.

168
The inverse dispersion variable, s j , is sample specific and reflects our observations that the level of dispersion 169 varies between samples. This variation in dispersion between samples might be due to varying levels of DNA 170 degradation and/or varying quantities of starting material between samples, among other factors. s j relates to 171 the variance and the mean of θ r,l,j in the following way: E [y r,l,j | θ r,l,j ] = µ = n jĉr,l,j m r,l The variance of y r,l,j can be written as a quadratic function of µ with the coefficients being a function of n j 174 and s j :

175
Var (y r,l,j | θ r,l,j ) = 1 + n j − 1 Note that in the limit s j → ∞, a Binomial noise model is recovered.

176
Probabilistic generative model of loci counts for control samples 177 We assume that the loci within a control sample, k, have equal copy numbers (diploid). This means that the 178 RCN for each locus is 1. By settingĉ r,l,k = 1, we model the generative process of counts from a control sample 179 as follows: θ r,l,k | s k , m r,l ∼ Beta(s k m r,l , s k (1 − m r,l )) x r,l,k | θ r,l,k , n k ∼ Binomial(n k , θ r,l,k ) Here, Gamma(ψ shape , ψ scale ) represents the prior distribution over the sample specific inverse dispersion pa-181 rameter, s k , and Beta(φ c,r,l , φ d,r,l ) defines the prior distribution over m r,l .

182
Linking FAST-SeqS loci using a hidden Markov model 183 We assume that chromosome arms are independent. By that we mean, the RCN of the first locus in arm q 184 is independent of the RCN of the last locus in arm p from the same chromosome (and all other chromosome 185 arms). As such, we model each chromosome arm as an independent Markov chain for each sample j. We denote 186 (note that for simplicity we have dropped the sample index j):

187
• z r,l as the hidden state (or copy number state) of the Markov chain at locus l in chromosome arm r

188
• π 0 as the initial distribution of the first locus (l = 1), in chromosome r

189
• π u as the transition distribution for hidden state, u 190 •ĉ u as the relative copy number associated with hidden state, u.

197
The joint density for all L loci in the genome is given by: The number of copy number states present in a sample is unknown a priori. In samples that have equal copies 200 of each locus, only one copy number state is present. Conversely, it is possible (although unlikely) that each 201 locus has its own unique copy number, meaning that there could be up to L copy number states in a sample.

202
Additionally, we expect neighboring loci to share the same copy number given their genomic distance from 203 each other ( Supplementary Fig. 1). To address these two features of the data, we used the sticky hierarchical Note that we useñ,s,θ r,l to distinguish these variables from those in the probabilistic model of control counts matrix, π u , is drawn from a Dirichlet Process and depends on β, α and κ. It can be shown that: where δ u,v represents the discrete Kronecker delta function. If we define ρ = κ α+κ (as in Fox et al. [21]) and by 215 noting that α = (1 − ρ)(α + κ), we obtain: As such, we see that ρ defines how much weight is placed on self-transition within a copy number state.

217
The vector, β, itself drawn from a Dirichlet Process, represents the global transition distribution and holds 218 information about the proportion of loci expected in each copy number state.

219
The variance of the transition probability from copy number state u to v is given by: We see that α + κ is inversely proportional to the variance of the state transition probabilities.  Supplementary Table 10.

238
For each sequencing experiment, a suitable set of controls samples were used (see Supplementary Table 11 239 for the list of samples used in each experiment). As described in equation 7, control samples were assumed 240 to have a relative copy number of one at each locus. In all experiments described in this paper, we used the   template depending on the extracted concentration, and RNAse free water to make the total reaction volume.

294
The cycling conditions for the L1PA7 primers were 98 • C for 120 s followed 2 cycles of 98 • C for 10 s, 57 • C for 295 120 s, and 72 • C for 120 s. The second round was also carried out as a 50 µl sample reaction using 20 µl taken from  In silico generation of FAST-SeqS dilution data 332 We performed an in silico dilution of FAST-SeqS data by mixing sequencing reads from control samples with 333 reads from OAC samples. Since the number of reads in the matched controls were insufficient to create samples 334 with two million reads, we created a pool of control reads (in silico) which were used to dilute the OAC samples.

335
This was done by sub-sampling two million reads from 12 control samples (which were prepared and sequenced in 336 the same batch as the OAC samples). The total number of reads from these 12 control samples was 14,405,596.

337
To obtain a pool of 2 million reads, we used the 'sample' command from seqtk (urlhttps://github.com/lh3/seqtk, file was then processed through a custom pipeline which we describe below.

376
Identifying forward primer position 377 For each read in the FASTQ file, the position of the forward primer sequence was detected by searching for the 378 sequence with the minimum hamming distance to the forward primer sequence using a sliding window. Reads 379 with a minimum hamming distance greater than 5 were discarded.

380
The portion of the reads before and including the forward primer sequence were trimmed. The ends of the 382 reads were also trimmed such that the length of the reads used for downstream analyses were 100 base pairs 383 minus the forward primer length. Any reads shorter than 100 base pairs minus the forward primer length after 384 trimming were discarded.

385
Quality control

386
After trimming, reads were discarded if they contained at least one base with a Phred quality score less than 387 20 and/or contained one or more ambiguous base calls (N).  To make a fair comparison between the tools, it was necessary to convert ASCAT's TCN calls to RCN as follows: Here, normal represents the estimated normal contamination value provided by ASCAT and i represents a contiguous genomic region or a discrete locus or fragment. In the case of a contiguous region, the mean TCN (or ploidy) was calculated as follows: and in the case of discrete loci or fragments: where L represents the total number of loci or fragments considered.

431
In Figure 1e and where l i represents the length of the overlapping portion of the called region with the gene.