Methods to estimate the proportion of extra-nuclear inserts
General method. Figure 1 shows an example in which the proportion
of reads from the mitochondrial DNA (orange boxes) differs between three
samples. We wish to estimate v /N , the proportion of the
nuclear DNA which consists of NUMTs (orange and green areas within blue
boxes). We cannot observe this value directly – because some NUMT reads
are indistinguishable from organellar reads.
In Figure 1, the difference in the allele frequencies among the reads
from different samples is due to the different amounts of organellar
DNA. The observed frequency of the NUMT-specific alleles in each sample,Po = n/m, should show a predictable relationship
with the proportion of reads mapping to the mitochondrial genome,m/N. Using the notation from Figure 1, this observed frequency
would be
Po = n/m = n/N × N/m , [Equation 1]
taking logs,
log(Po ) = log(n/N ) +
log(N /m ).
Hence, if the underlying assumptions hold, the plot of
log(Po ) vs. log(N /m ) should be a
1:1 line with an intercept of log(n/N ). Possible deviations from
these assumptions are that the frequency of the NUMT-specific alleles in
the nuclear genome differ between samples, the total number of NUMT
sequences in the nuclear genome differ between samples, the relative
mapping rate of the different categories of read differ between samples,
and that the identity of the organellar allele has been mistaken (either
an outright error, or because of heteroplasmy).
For SNP loci showing the expected relationship, the intercept,n /N, can be used as a lower bound on the value we wish to
estimate (v/N ) because n ≤ v . Different SNP loci
will give different estimates because their NUMT allele frequency
(n /v ) will vary, depending on the evolutionary history of
the integration of mitochondrial sequences into the nuclear genome. A
SNP locus would give a precise estimate, only if n=v , which would
mean that all NUMT loci had a different allele from the organellar
genome.
We used a linear mixed effects model (using the lmer function from the R
package lme4; Bates, Mächler, Bolker, & Walker, 2015) to identify the
locus with the highest intercept from a selection of loci with high
values of n/v. Multiple loci were included in the regression
because they provide within-sample replication, which can be used to
characterise any consistent deviation of each sample from the
regression. Sample-specific deviations could be due to different mapping
efficiencies of that mitochondrial genome to the reference, or
undetected contamination by DNA of another species. Any such effects
were allowed for by fitting a random intercept for each sample. We call
the estimate obtained from this regression the “intercept estimate” ofv/N .
Mapping depth estimate . An upper-bound on v/N can be
obtained simply from the lowest alignment rate (m/N ) observed in
any sample. In the case of Figure 1 the lowest estimate comes from
Sample 1: m1 / N1. This is
greater than or equal to v/N because m ≥ v . It is a
precise estimate only when m = v , which would mean that there
were no reads from the organellar mitochondrial genome. We call this
estimate the “mapping depth estimate”.
Diverged sites estimate . A second statistical estimate can be
obtained if some individuals have a different organellar mitochondrial
haplotypes. This is illustrated in Figure 1 by Sample 4, whose
mitochondria carry the T allele. In Samples 1, 2 and 3 the nreads carrying allele T, G, and C are unambiguously of NUMT origin, with
counts nT , nG, and
nC respectively. Similarly in Sample 4, the kreads of NUMT origin can be broken down into kA ,kG, and kC . We cannot directly
count the NUMT reads with A alleles in Samples 1, 2 and 3, but we can
estimate their proportion from sample 4 as pT =kA4 / N4 . Similarly we
cannot count the NUMT reads with the T allele in Sample 4, but we can
estimate their proportion as pT =
Σi (nTi ) /Σi (Ni ), where summation is over
samples.
These estimates of the obscured proportions can be used obtain an
estimate of the desired quantity v/N. Consider two sets of
samples A and B:
the mitochondrial allele is x (x∈ {T, A, G, C}),
sample i has vi unambiguous NUMT reads,mi which map to the mitochondrial genome, andNi reads which do not. An estimate of the
frequency of the x allele is obtained from the B sample aspx = Σj(nxj ) / Σj(Ni ),
the mitochondrial allele is y (y∈ {T, A, G, C}
\ {x}), sample j has kjunambiguous NUMT reads, mj reads which map to
the mitochondrial genome, and Nj reads which do
not. An estimate of the frequency of the y allele is obtained from the
A sample as py = Σi(nxi ) / Σi(Ni ),
The estimates of v/N for the A set is therefore
px + (Σivi / Σi Ni),
And for the B set, the equivalent value is
py + (Σjkj / Σj Nj).
We call these the “diverged sites estimates”.