Introduction
Nuclear genomes of most eukaryotes contain insertions of vagrant (extranuclear) DNA. Particularly common are inserts derived from organellar DNA, which are termed nuclear DNAs of mitochondrial origin (NUMTs) or nuclear DNAs of plastid origin (NUPTs) (Hazkani-Covo, Zeller, & Martin, 2010). Vagrant inserts were studied in a phylogenetic context as early as 1994 by Lopez, Yuhki, Masuda, Modi, & O’Brien (1994), who discovered a nuclear insert of mitochondrial origin, which they called ‘Numt’, in several species of Felis. The advent of polymerase chain reaction (PCR) and long-range PCR facilitated the study of organellar inserts, and by the mid-1990s, Zhang & Hewitt (1996b) had reviewed the reports of NUMTs and pointed out their promises and problems for evolutionary analysis: NUMTs could be used to study the pace of evolution in the cell’s genomes and the progress of endosymbiosis, but they also had the potential to mislead barcoding studies (Naciri & Manen, 2010; Schultz & Hebert, 2022). While some NUMTs may be functional and under selection (Vendrami et al., 2022), the majority are likely pseudogenes. Besides organelle-derived DNAs, there are also reports of extranuclear inserts derived from endosymbionts such as Wolbachia (Hotopp et al., 2007) and Buchnera (Nikoh et al., 2010).
The quantification of vagrant DNA inserts is relatively straightforward in the presence of a high-quality genome assembly. For instance, NUMTs have been identified by mapping mitochondrial genome assemblies to nuclear ones, often using BLAST (Bensasson, Zhang, Hartl, & Hewitt, 2001; Hazkani-Covo et al., 2010; Richly & Leister, 2004). It is also possible to screen for regions with k-mer profiles resembling mitochondrial genomes (W. Li, Freudenberg, & Freudenberg, 2019). However, despite rapid advances in sequencing and assembly technology, and the emergence of numerous large-scale genome sequencing projects such as the Earth Biogenome Project (https://www.earthbiogenome.org; Lewin et al., 2018), the Darwin Tree of Life Project (https://www.darwintreeoflife.org; The Darwin Tree of Life Project Consortium, 2022), the Vertebrate Genome Project (https://genome10k.soe.ucsc.edu; Rhie et al., 2021), and the 10000 Plant Genomes Project (https://db.cngb.org/10kp/; Cheng et al., 2018), high-quality assemblies are available for only a minority of species. In the absence of high-quality genome assemblies, the quantification of extranuclear inserts is more challenging. Fragmented genome assemblies commonly lack repetitive sequences, and even assemblies that appear to be complete, or nearly so, can contain regions where repetitive sequences have been collapsed. Such collapse makes the assembled length shorter than the actual genome size, which would bias estimates of the frequency of inserts. Thus, an assembly-free approach to quantifying extranuclear inserts is desirable both for fragmented assemblies and for cross-verifying the results from more complete assemblies.
Instead of using a nuclear genome assembly, we propose to estimate the frequency of vagrant inserts directly from sequencing reads. However, the estimation is not a straightforward case of counting the relative numbers of insert-derived and genuinely extranuclear sequences. For example, in the case of NUMTs, a high-throughput sequencing dataset could be mapped against a mitochondrial genome assembly. The main obstacle would then be classifying the mapped reads into NUMT and true mitochondrial categories. They might be classified according to their sequence divergence from the reference, whereby low-divergence matches could be assumed to be true mitochondrial DNA, and higher-divergence matches assumed to be derived from NUMTs. Such an approach has two obvious drawbacks: the estimate will depend on a user-chosen divergence threshold, and, in addition, some reads that are identical to the true mitochondrial genome might actually be derived from a NUMT (they may come from recent, not-yet-diverged inserts). An alternative approach would be to screen sequencing data for reads spanning NUMT insertion sites. This approach would be most effective with high-quality long sequencing reads as produced by PacBio’s circular consensus technology, because longer reads are more likely to contain junctions between NUMT and ordinary nuclear DNA, which makes it possible to detect NUMT sequences even if they have not yet diverged from the true mitochondrial sequence. Unfortunately, high-quality long-read sequences are still comparatively expensive to generate, and possibly prohibitively so for species with large genomes including (but not restricted to) many grasses and other monocots, grasshoppers, and newts.
As an alternative to these approaches, we propose to exploit a sampling design which uses low-coverage (< 1x) high-quality short reads (also known as low-pass or genome skimming data) from multiple individuals. This type of data is commonly generated for population studies of mitochondrial and plastid DNA or for the analysis of genomic repeats. We will show that such data sets can be used to estimate the proportion of the nuclear genome that consists of inserts of a particular vagrant origin. The approach exploits the information that arises when the samples contain different relative proportions of the extranuclear DNA. For example, the proportion of mitochondrial reads tends to vary among samples in routine DNA extractions, whereas vagrant inserts appear at a constant stoichiometry with other nuclear DNA sequences. For this approach to work, there must be some sites that are diverged between vagrant inserts and extranuclear sequences. Beyond this requirement, large-scale similarity, as observed between NUMTs and mitochondrial sequences, does not pose a problem, because the approach is based on regression and not on the identification of every single insert sequence.
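The reasoning in the preceding paragraph can be illustrated with a deliberately simplified toy model. It is a sketch under stated assumptions, not the estimator developed in this paper. We assume that a fraction m of a sample's reads is truly mitochondrial (m varies among samples), that inserts make up a fraction q of the nuclear genome (q is constant), and that every insert read carries a diagnostic allele at a diverged site. The fraction of reads mapping to the mitochondrial reference is then x = m + (1 - m)q, and the insert-allele frequency among mapped reads is a = (1 - m)q / x, so that log a = log(q / (1 - q)) + log((1 - x) / x). Regressing log a on log((1 - x) / x) across samples recovers q from the intercept. All function and variable names below are hypothetical, and the input values are simulated rather than real data.

```python
import math


def estimate_insert_fraction(mapped_fraction, insert_allele_freq):
    """Toy estimator (assumed model, not the paper's method).

    mapped_fraction: per-sample fraction x of reads mapping to the
        extranuclear (e.g. mitochondrial) reference.
    insert_allele_freq: per-sample frequency a of the insert-specific
        allele among the mapped reads at a diverged site.
    Returns (q_hat, slope); under the toy model the slope should be 1.
    """
    u = [math.log((1.0 - x) / x) for x in mapped_fraction]
    v = [math.log(a) for a in insert_allele_freq]
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    # Ordinary least squares for a single predictor.
    sxx = sum((ui - mean_u) ** 2 for ui in u)
    sxy = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_u
    odds = math.exp(intercept)  # estimates q / (1 - q)
    return odds / (1.0 + odds), slope


def simulate_sample(mito_fraction, insert_fraction):
    """Noise-free expectations for one sample under the toy model."""
    insert_reads = (1.0 - mito_fraction) * insert_fraction
    mapped = mito_fraction + insert_reads
    return mapped, insert_reads / mapped


# Simulated (not real) samples that differ only in mitochondrial content.
TRUE_Q = 0.002
samples = [simulate_sample(m, TRUE_Q) for m in (0.001, 0.005, 0.02, 0.05)]
xs = [s[0] for s in samples]
freqs = [s[1] for s in samples]
q_hat, slope = estimate_insert_fraction(xs, freqs)
print(q_hat, slope)
```

In this noise-free setting the regression recovers the simulated q, even though no individual read is ever classified as insert-derived or mitochondrial. Real data would add sampling noise and inserts that lack the diagnostic allele, neither of which is modelled here.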
Grasshoppers make a good test bed for this approach, since they are notorious for having genomes with multiple NUMTs, which complicate phylogenetic analyses (Hawlitschek et al., 2017; Song, Buhay, Whiting, & Crandall, 2008). One representative, the grasshopper Podisma pedestris, has been studied for half a century for its hybrid zone (Hewitt & John, 1972; John & Hewitt, 1970). Strong selection against hybrids has been shown in lab experiments (Barton, 1980; Barton & Hewitt, 1981) and in the field (Nichols & Hewitt, 1988), suggesting some level of divergence between the populations. However, to date we still have no data on mitochondrial differentiation between the two hybridising populations because the presence of NUMTs has made population studies almost impossible (Bensasson, Zhang, & Hewitt, 2000; Vaughan, Heslop-Harrison, & Hewitt, 1999). P. pedestris is also the insect species with the largest genome size recorded (http://www.genomesize.com, accessed 24 November 2022).
As a second example, we chose to analyse data from an organism that might have a relatively small number of NUMTs: the parrot Psephotellus varius. There are reports of low NUMT content in the chicken genome, which have been extrapolated to other bird species (Pereira & Baker, 2004), although Nacer & Raposo do Amaral (2017) discovered higher NUMT contents in two species of falcon using BLAST searches against published genome assemblies (Zhan et al., 2013). Even in small numbers, bird NUMTs are known to be a potential source of misleading data in the genetic analysis of their mitochondrial DNA (Sorenson & Quinn, 1998).
We first confirm the accuracy of our general method by testing it on human data, where the number of NUMTs is relatively well characterised because of the exceptionally high quality of the human genome assembly. We then proceed to quantify the NUMT content in a dedicated dataset of the grasshopper Podisma pedestris and in a re-analysis of a museomics dataset of the parrot Psephotellus varius (McElroy, Beattie, Symonds, & Joseph, 2018).