Estimates of NUMT abundance
Numerous studies have made use of public genome assemblies and have
estimated in these assemblies the nuclear proportion of NUMTs. Estimates
of the proportion of NUMTs in the human genome range from 0.0087%
(Hazkani-Covo et al., 2010) and 0.0096% (Richly & Leister, 2004) to
0.012% (Bensasson, Zhang, et al., 2001) and 0.016% (Woischnik &
Moraes, 2002). Our (upper-bound) mapping depth estimate was 0.056%. The
intercept estimate of 0.014% falls inside the range of previous
estimates, confirming the accuracy of our approach.
Our estimate for the nuclear proportion of NUMTs in the grasshopperP. pedestris was 0.056%, which may at first sight seem low given
that a number of previous reports, based on PCR and cloning studies,
inferred a large number of nuclear inserts in this species (Bensasson,
Petrov, Zhang, Hartl, & Hewitt, 2001; Bensasson et al., 2000). However,
considering the species’ enormous genome size of 16.93 pg (Westerman,
Barton, & Hewitt, 1987), which corresponds to 16.6 Gbp (Doležel,
Bartoš, Voglmayr, & Greilhuber, 2003), the P. pedestris NUMTs
content amounts to 8.4 Mbp or the equivalent of over 500 whole
mitochondrial genomes inserted into the nuclear genome. Our second
estimate based on sites with fixed differences between the grasshopper
populations was very close – 0.055%.
For the parrot P. varius , we obtained an intercept estimate of
0.0043%. While the number of NUMTs in birds was thought to be generally
low with a nuclear proportion in chicken estimated to be 0.00078%
(Pereira & Baker, 2004), more recent reports give a more differentiated
picture with high numbers observed in certain songbirds (Liang, Wang,
Li, Kimball, & Braun, 2018) and falcons (Nacer & Raposo do Amaral,
2017). The diverged sites estimate was 0.0033%.
Robustness of the approach
Our general approach produced an estimate comparable to previous
estimates of the human genome’s NUMT proportion. Because our methods
rely on replicated data, they are less susceptible to the idiosyncrasies
of individual samples, such as contamination with non-specific DNA
sequences. Making use of biological replication is also beneficial
because inserts of extra-nuclear DNA have been shown to be polymorphic
within populations and species (Ricchetti, Tekaia, & Dujon, 2004;
Woischnik & Moraes, 2002). Harnessing biological replication is thus
likely to give less biased estimates. Because our methods are designed
with low-coverage sequencing data in mind, the use of multiple samples
reduces the effect of stochasticity in allele sampling (assuming that
the sampling error is independent among samples).
The mapping depth estimate, which is based on the individual with the
lowest alignment rate, is inflated by the smallest proportion of
extranuclear DNA found in the set of samples. Consequently, it should be
considered a crude upper bound most useful as a cross validation check,
unless the samples were deliberately depleted for extranuclear DNA.
The diverged sites estimate is, in theory, more accurate than the
intercept estimate, because it does not require the assumption that the
NUMTs have a high frequency of some SNP alleles that are different from
the mitochondrial reference. It does make different assumptions, in
particular that the number and genotypes of the NUMTs in the A and B
populations are similar. In applying these methods to other species and
other vagrant genomes, it would seem prudent to obtain both the
intercept and diverged sites estimate where possible. The diverged sites
estimate should be the same or slightly higher (within the precision of
the estimates’ confidence intervals). This cross-validation serves to
check for substantial violations of the underlying assumptions. This
check will be especially effective in studies with substantial
divergence among extranuclear genotypes like that between the A and B
mitochondrial haplotypes in our samples, so the diverged sites estimate
has low standard error.
If the sequencing platform used for data generation has a GC bias (Ross
et al., 2013), this could affect the ratio of insert DNA and other
nuclear DNA sequences and may thus bias our method’s estimate. We
noticed higher allele frequencies in six grasshopper samples that we had
sequenced with an older version of the Illumina NextSeq chemistry.
However, exclusion of these samples did not change our estimates
qualitatively (data not shown). With the current trend to PCR-free
sequencing libraries and improving sequencing technologies such biases
may be less of a concern in the future.
Contamination with non-specific sequences can affect our estimates
because our method implies non-aligned sequence to be part of the
nuclear genome. This is not of concern in our datasets, because the
datasets we analysed, mito-like sequences accounted for less than 3% of
the data. As for any bioinformatics project, it is advisable to subject
the original data to standard quality assurance measures like FastQC
(https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and
removal of sequencing adapters.
When estimating the abundance of nuclear inserts derived from more
complex genomes, it may be useful to restrict the analysis to a subset
of the variant data depending on genome features. For instance, plastid
genomes tend to contain a large inverted repeat region (Kolodner &
Tewari, 1979), where sequence alignment and variant calling may be
unreliable. In addition to regions excluded due to prior knowledge,
regions can be excluded if they show aberrant signals. Because our
method is based on regression, an analysis of the residuals may be used
to uncover individuals or genome regions showing unexpected patterns. To
this end, we have implemented in our R package an
intercept-position-plot function (see tutorial in the Supplemental
Materials).
Our method relies on multiple samples with varying ratios of true
mitochondrial DNA to nuclear DNA. We found that sufficient variation
(several fold) arises in the standard use of commercial extraction kits.
However, this variation could be accentuated by deliberately choosing
tissues with different mitochondrial density, or modifying the
extraction process (Lansman, Shade, Shapira, & Avise, 1981; Macher,
Zizka, Weigand, & Leese, 2018; Zhang & Hewitt, 1996a).
Finally, we would like to point our approach is robust to the existence
of nuclear vagrant DNAs that are identical in sequence to their
extranuclear original DNAs. This is because, different to other methods,
we do not directly identify all copies of vagrant DNA, which would not
be possible with low-coverage (< 1x) sequencing data. Instead,
our methods are based on high allele frequencies as would be observed at
a site that recently experienced a nucleotide substitution in the
extranuclear genome – leading to (near) perfect divergence between
nuclear inserts and extranuclear DNA at that site.
Utility of the approach
Our method is applicable to a wide range of species and sources of
extranuclear DNA. Because it is based on allele frequencies rather than
the length of sequence inserts, knowledge of part of the extranuclear
genome is sufficient for our method to work. This should be particularly
useful where extranuclear genomes are difficult to assemble as in the
case of plant mitochondria (Mower, Sloan, & Alverson, 2012). In
addition, parts of the reference containing repeats such as the inverted
repeat region on land plant plastids (Wicke, Schneeweiss, DePamphilis,
Müller, & Quandt, 2011) can be excluded, which might otherwise confound
variant calling.
Because our method relies on low-coverage sequencing, it is also
suitable for museum samples, which often contain degraded DNA and where
contaminations may complicate analyses that target specific marker
loci. Low-coverage sequencing or ‘genome skimming’ (Straub et al., 2012)
data is straightforward to generate at comparatively low cost. Different
to marker-based approaches such as PCR/Sanger-sequencing or target
capture, no prior knowledge about the samples’ genomes is required. The
analysis of low-coverage data is also less computationally demanding
than the generation of a high-quality genome assembly. In addition,
population-level low-coverage sequencing data naturally lend themselves
to population or landscape genetic analyses because it is usually
possible to reconstruct from such data rDNA, mitochondrial, and (in
plants) plastid sequences (Dodsworth et al., 2015; Twyford & Ness,
2016).