Methods to estimate the proportion of extra-nuclear inserts
General method. Figure 1 shows an example in which the proportion of reads from the mitochondrial DNA (orange boxes) differs between three samples. We wish to estimate v /N , the proportion of the nuclear DNA which consists of NUMTs (orange and green areas within blue boxes). We cannot observe this value directly – because some NUMT reads are indistinguishable from organellar reads.
In Figure 1, the difference in the allele frequencies among the reads from different samples is due to the different amounts of organellar DNA. The observed frequency of the NUMT-specific alleles in each sample,Po = n/m, should show a predictable relationship with the proportion of reads mapping to the mitochondrial genome,m/N. Using the notation from Figure 1, this observed frequency would be
Po = n/m = n/N × N/m , [Equation 1]
taking logs,
log(Po ) = log(n/N ) + log(N /m ).
Hence, if the underlying assumptions hold, the plot of log(Po ) vs. log(N /m ) should be a 1:1 line with an intercept of log(n/N ). Possible deviations from these assumptions are that the frequency of the NUMT-specific alleles in the nuclear genome differ between samples, the total number of NUMT sequences in the nuclear genome differ between samples, the relative mapping rate of the different categories of read differ between samples, and that the identity of the organellar allele has been mistaken (either an outright error, or because of heteroplasmy).
For SNP loci showing the expected relationship, the intercept,n /N, can be used as a lower bound on the value we wish to estimate (v/N ) because nv . Different SNP loci will give different estimates because their NUMT allele frequency (n /v ) will vary, depending on the evolutionary history of the integration of mitochondrial sequences into the nuclear genome. A SNP locus would give a precise estimate, only if n=v , which would mean that all NUMT loci had a different allele from the organellar genome.
We used a linear mixed effects model (using the lmer function from the R package lme4; Bates, Mächler, Bolker, & Walker, 2015) to identify the locus with the highest intercept from a selection of loci with high values of n/v. Multiple loci were included in the regression because they provide within-sample replication, which can be used to characterise any consistent deviation of each sample from the regression. Sample-specific deviations could be due to different mapping efficiencies of that mitochondrial genome to the reference, or undetected contamination by DNA of another species. Any such effects were allowed for by fitting a random intercept for each sample. We call the estimate obtained from this regression the “intercept estimate” ofv/N .
Mapping depth estimate . An upper-bound on v/N can be obtained simply from the lowest alignment rate (m/N ) observed in any sample. In the case of Figure 1 the lowest estimate comes from Sample 1: m1 / N1. This is greater than or equal to v/N because mv . It is a precise estimate only when m = v , which would mean that there were no reads from the organellar mitochondrial genome.  We call this estimate the “mapping depth estimate”.
Diverged sites estimate . A second statistical estimate can be obtained if some individuals have a different organellar mitochondrial haplotypes. This is illustrated in Figure 1 by Sample 4, whose mitochondria carry the T allele. In Samples 1, 2 and 3 the nreads carrying allele T, G, and C are unambiguously of NUMT origin, with counts nT , nG, and nC respectively. Similarly in Sample 4, the kreads of NUMT origin can be broken down into kA ,kG, and kC . We cannot directly count the NUMT reads with A alleles in Samples 1, 2 and 3, but we can estimate their proportion from sample 4 as pT =kA4 / N4 . Similarly we cannot count the NUMT reads with the T allele in Sample 4, but we can estimate their proportion as pT = Σi (nTi ) /Σi (Ni ), where summation is over samples.
These estimates of the obscured proportions can be used obtain an estimate of the desired quantity v/N. Consider two sets of samples A and B:
the mitochondrial allele is x (x∈ {T, A, G, C}), sample i has vi unambiguous NUMT reads,mi which map to the mitochondrial genome, andNi reads which do not. An estimate of the frequency of the x allele is obtained from the B sample aspx = Σj(nxj ) / Σj(Ni ),
the mitochondrial allele is y (y∈ {T, A, G, C} \ {x}), sample j has kjunambiguous NUMT reads, mj reads which map to the mitochondrial genome, and Nj reads which do not. An estimate of the frequency of the y allele is obtained from the A sample as py = Σi(nxi ) / Σi(Ni ),
The estimates of v/N for the A set is therefore
px + (Σivi / Σi Ni),
And for the B set, the equivalent value is
py + (Σjkj / Σj Nj).
We call these the “diverged sites estimates”.