Estimates of NUMT abundance
Numerous studies have made use of public genome assemblies and have estimated in these assemblies the nuclear proportion of NUMTs. Estimates of the proportion of NUMTs in the human genome range from 0.0087% (Hazkani-Covo et al., 2010) and 0.0096% (Richly & Leister, 2004) to 0.012% (Bensasson, Zhang, et al., 2001) and 0.016% (Woischnik & Moraes, 2002). Our (upper-bound) mapping depth estimate was 0.056%. The intercept estimate of 0.014% falls inside the range of previous estimates, confirming the accuracy of our approach.
Our estimate for the nuclear proportion of NUMTs in the grasshopperP. pedestris was 0.056%, which may at first sight seem low given that a number of previous reports, based on PCR and cloning studies, inferred a large number of nuclear inserts in this species (Bensasson, Petrov, Zhang, Hartl, & Hewitt, 2001; Bensasson et al., 2000). However, considering the species’ enormous genome size of 16.93 pg (Westerman, Barton, & Hewitt, 1987), which corresponds to 16.6 Gbp (Doležel, Bartoš, Voglmayr, & Greilhuber, 2003), the P. pedestris NUMTs content amounts to 8.4 Mbp or the equivalent of over 500 whole mitochondrial genomes inserted into the nuclear genome. Our second estimate based on sites with fixed differences between the grasshopper populations was very close – 0.055%.
For the parrot P. varius , we obtained an intercept estimate of 0.0043%. While the number of NUMTs in birds was thought to be generally low with a nuclear proportion in chicken estimated to be 0.00078% (Pereira & Baker, 2004), more recent reports give a more differentiated picture with high numbers observed in certain songbirds (Liang, Wang, Li, Kimball, & Braun, 2018) and falcons (Nacer & Raposo do Amaral, 2017). The diverged sites estimate was 0.0033%.
Robustness of the approach
Our general approach produced an estimate comparable to previous estimates of the human genome’s NUMT proportion. Because our methods rely on replicated data, they are less susceptible to the idiosyncrasies of individual samples, such as contamination with non-specific DNA sequences. Making use of biological replication is also beneficial because inserts of extra-nuclear DNA have been shown to be polymorphic within populations and species (Ricchetti, Tekaia, & Dujon, 2004; Woischnik & Moraes, 2002). Harnessing biological replication is thus likely to give less biased estimates. Because our methods are designed with low-coverage sequencing data in mind, the use of multiple samples reduces the effect of stochasticity in allele sampling (assuming that the sampling error is independent among samples).
The mapping depth estimate, which is based on the individual with the lowest alignment rate, is inflated by the smallest proportion of extranuclear DNA found in the set of samples. Consequently, it should be considered a crude upper bound most useful as a cross validation check, unless the samples were deliberately depleted for extranuclear DNA.
The diverged sites estimate is, in theory, more accurate than the intercept estimate, because it does not require the assumption that the NUMTs have a high frequency of some SNP alleles that are different from the mitochondrial reference. It does make different assumptions, in particular that the number and genotypes of the NUMTs in the A and B populations are similar. In applying these methods to other species and other vagrant genomes, it would seem prudent to obtain both the intercept and diverged sites estimate where possible. The diverged sites estimate should be the same or slightly higher (within the precision of the estimates’ confidence intervals). This cross-validation serves to check for substantial violations of the underlying assumptions. This check will be especially effective in studies with substantial divergence among extranuclear genotypes like that between the A and B mitochondrial haplotypes in our samples, so the diverged sites estimate has low standard error.
If the sequencing platform used for data generation has a GC bias (Ross et al., 2013), this could affect the ratio of insert DNA and other nuclear DNA sequences and may thus bias our method’s estimate. We noticed higher allele frequencies in six grasshopper samples that we had sequenced with an older version of the Illumina NextSeq chemistry. However, exclusion of these samples did not change our estimates qualitatively (data not shown). With the current trend to PCR-free sequencing libraries and improving sequencing technologies such biases may be less of a concern in the future.
Contamination with non-specific sequences can affect our estimates because our method implies non-aligned sequence to be part of the nuclear genome. This is not of concern in our datasets, because the datasets we analysed, mito-like sequences accounted for less than 3% of the data. As for any bioinformatics project, it is advisable to subject the original data to standard quality assurance measures like FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and removal of sequencing adapters.
When estimating the abundance of nuclear inserts derived from more complex genomes, it may be useful to restrict the analysis to a subset of the variant data depending on genome features. For instance, plastid genomes tend to contain a large inverted repeat region (Kolodner & Tewari, 1979), where sequence alignment and variant calling may be unreliable. In addition to regions excluded due to prior knowledge, regions can be excluded if they show aberrant signals. Because our method is based on regression, an analysis of the residuals may be used to uncover individuals or genome regions showing unexpected patterns. To this end, we have implemented in our R package an intercept-position-plot function (see tutorial in the Supplemental Materials).
Our method relies on multiple samples with varying ratios of true mitochondrial DNA to nuclear DNA. We found that sufficient variation (several fold) arises in the standard use of commercial extraction kits. However, this variation could be accentuated by deliberately choosing tissues with different mitochondrial density, or modifying the extraction process (Lansman, Shade, Shapira, & Avise, 1981; Macher, Zizka, Weigand, & Leese, 2018; Zhang & Hewitt, 1996a).
Finally, we would like to point our approach is robust to the existence of nuclear vagrant DNAs that are identical in sequence to their extranuclear original DNAs. This is because, different to other methods, we do not directly identify all copies of vagrant DNA, which would not be possible with low-coverage (< 1x) sequencing data. Instead, our methods are based on high allele frequencies as would be observed at a site that recently experienced a nucleotide substitution in the extranuclear genome – leading to (near) perfect divergence between nuclear inserts and extranuclear DNA at that site.
Utility of the approach
Our method is applicable to a wide range of species and sources of extranuclear DNA. Because it is based on allele frequencies rather than the length of sequence inserts, knowledge of part of the extranuclear genome is sufficient for our method to work. This should be particularly useful where extranuclear genomes are difficult to assemble as in the case of plant mitochondria (Mower, Sloan, & Alverson, 2012). In addition, parts of the reference containing repeats such as the inverted repeat region on land plant plastids (Wicke, Schneeweiss, DePamphilis, Müller, & Quandt, 2011) can be excluded, which might otherwise confound variant calling.
Because our method relies on low-coverage sequencing, it is also suitable for museum samples, which often contain degraded DNA and where contaminations may complicate analyses that target specific marker loci. Low-coverage sequencing or ‘genome skimming’ (Straub et al., 2012) data is straightforward to generate at comparatively low cost. Different to marker-based approaches such as PCR/Sanger-sequencing or target capture, no prior knowledge about the samples’ genomes is required. The analysis of low-coverage data is also less computationally demanding than the generation of a high-quality genome assembly. In addition, population-level low-coverage sequencing data naturally lend themselves to population or landscape genetic analyses because it is usually possible to reconstruct from such data rDNA, mitochondrial, and (in plants) plastid sequences (Dodsworth et al., 2015; Twyford & Ness, 2016).