Introduction
Nuclear genomes of most eukaryotes contain insertions of vagrant
(extranuclear) DNA. Particularly common are inserts derived from
organellar DNA, which are termed nuclear DNAs of mitochondrial origin
(NUMTs) or nuclear DNAs of plastid origin (NUPTs) (Hazkani-Covo, Zeller,
& Martin, 2010). Vagrant inserts were studied in a phylogenetic context
as early as 1994 by Lopez, Yuhki, Masuda, Modi, & O’Brien (1994), who
discovered a nuclear insert of mitochondrial origin, which they called
‘Numt’, in several species of Felis. The advent of polymerase
chain reaction (PCR) and long-range PCR facilitated the study of
organellar inserts, and by the mid-1990s Zhang & Hewitt (1996b)
reviewed the reports of NUMTs and pointed out the promises and problems
of NUMTs for evolutionary analysis: NUMTs could be used to study the
pace of evolution in the cell’s genomes and the progress of
endosymbiosis, but they also had the potential to mislead barcoding
studies (Naciri & Manen, 2010; Schultz & Hebert, 2022). While some
NUMTs may be functional under selection (Vendrami et al., 2022), the
majority are likely pseudogenes. Besides organelle-derived DNAs, there
are also reports of extranuclear inserts derived from endosymbionts such
as Wolbachia (Hotopp et al., 2007) and Buchnera (Nikoh et
al., 2010).
The quantification of vagrant DNA inserts is relatively straightforward
in the presence of a high-quality genome assembly. For instance, NUMTs have
been identified by mapping mitochondrial genome assemblies to nuclear
ones, often using BLAST (Bensasson, Zhang, Hartl, & Hewitt, 2001;
Hazkani-Covo et al., 2010; Richly & Leister, 2004). It is also possible
to screen for regions with k-mer profiles resembling mitochondrial
genomes (W. Li, Freudenberg, & Freudenberg, 2019). However, despite
rapid advances in sequencing and assembly technology, and the emergence
of numerous large-scale genome sequencing projects such as the Earth
BioGenome Project (https://www.earthbiogenome.org; Lewin et al., 2018),
the Darwin Tree of Life Project (https://www.darwintreeoflife.org; The
Darwin Tree of Life Project Consortium, 2022), the Vertebrate Genomes
Project (https://genome10k.soe.ucsc.edu; Rhie et al., 2021), and the
10000 Plant Genomes Project (https://db.cngb.org/10kp/; Cheng et al.,
2018), high-quality assemblies are available for only a minority of
species. In the absence of
high-quality genome assemblies, the quantification of extranuclear
inserts is more challenging. Fragmented genome assemblies commonly lack
repetitive sequences, and even assemblies that appear to be complete, or
nearly so, can contain regions where repetitive sequences have been
collapsed. The assembled length is then shorter than the actual genome
size, which biases estimates of the frequency of inserts. Thus, an
assembly-free approach to quantifying extranuclear inserts is desirable
both for species with fragmented assemblies and for cross-verifying
results from more complete assemblies.
Instead of using a nuclear genome assembly, we propose to estimate the
frequency of vagrant inserts directly from sequencing reads. However,
the estimation is not a straightforward case of counting the relative
numbers of vagrant and extranuclear sequences. For example, in the case
of NUMTs, a high-throughput sequencing dataset could be mapped against a
mitochondrial genome assembly. The main obstacle would then be
classifying the mapped reads as either NUMT-derived or truly
mitochondrial (organellar). They might be classified according to their
sequence divergence from the reference, whereby low-divergence matches
are assumed to be true mitochondrial DNA and higher-divergence
matches are assumed to be derived from NUMTs. Such an approach has two
obvious drawbacks: the estimate depends on a user-defined
divergence threshold and, in addition, some reads that are identical to
the true mitochondrial genome might actually be derived from a NUMT
(a recent insert that has not yet diverged). An alternative approach
would be to screen sequencing data for reads spanning NUMT insertion
sites. This approach would be most effective with high-quality long
sequencing reads such as those produced by PacBio’s circular consensus
sequencing technology.
This is because longer reads are more likely to contain junctions
between NUMT and ordinary nuclear DNA, which makes it possible to detect
NUMT sequences even if they have not yet diverged from the true
mitochondrial sequence. Unfortunately, high-quality long-read sequences
are still comparatively expensive to generate, and possibly
prohibitively so for species with large genomes including (but not
restricted to) many grasses and other monocots, grasshoppers, and newts.
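The threshold dependence of the divergence-based classification described above can be illustrated with a minimal Python sketch. The read divergences, origin labels, thresholds, and the function name `classify_reads` are all hypothetical and chosen purely for illustration; they are not data or code from this study.

```python
def classify_reads(divergences, threshold):
    """Label a read 'NUMT' if its divergence from the mitochondrial
    reference exceeds the threshold, otherwise 'mito'."""
    return ["NUMT" if d > threshold else "mito" for d in divergences]


# Hypothetical reads: (divergence from the reference, true origin).
# The third read is a recent NUMT that has not yet diverged.
reads = [
    (0.00, "mito"),
    (0.00, "mito"),
    (0.00, "NUMT"),
    (0.02, "NUMT"),
    (0.05, "NUMT"),
    (0.12, "NUMT"),
]

divergences = [d for d, _ in reads]
true_numt_count = sum(origin == "NUMT" for _, origin in reads)

# Number of reads called 'NUMT' under two arbitrary thresholds.
counts_by_threshold = {
    t: classify_reads(divergences, t).count("NUMT") for t in (0.01, 0.03)
}

for t, n in counts_by_threshold.items():
    print(f"threshold {t}: {n} reads called NUMT (true number: {true_numt_count})")
```

In this toy example, the number of reads called NUMT changes with the threshold, and the undiverged NUMT read is labelled mitochondrial under either choice, which mirrors the two drawbacks noted above.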
As an alternative to these approaches, we propose to exploit a sampling
design which uses low-coverage (< 1x) high-quality short reads
(also known as low-pass or genome skimming data), from multiple
individuals. This type of data is commonly generated for population
studies of mitochondrial and plastid DNA or for the analysis of genomic
repeats. We will show that such data sets can be used to estimate the
proportion of the nuclear genome that consists of inserts of a
particular vagrant origin. The approach exploits the information that
arises when the samples contain different relative proportions of the
extranuclear DNA. For example, the proportion of mitochondrial reads
tends to vary among samples in routine DNA extractions, whereas
vagrant inserts occur at a constant stoichiometry with other
nuclear DNA sequences. For this approach to work, there must be some
sites that have diverged between vagrant inserts and extranuclear
sequences. Beyond this requirement, the large-scale similarity observed
between NUMTs and mitochondrial sequences does not pose a problem, because
the approach is based on regression and not on the identification of every
single insert sequence.
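To make the regression idea concrete, the following Python sketch uses a deliberately simplified toy model; it is not the estimator developed in this study. It assumes that (i) every read mapping to the mitochondrial reference is either truly mitochondrial or NUMT-derived, (ii) at the diverged site considered, all NUMT reads carry the NUMT allele, and (iii) there is no sampling noise. Under these assumptions, with f the mitochondrial share of reads in a sample and p the NUMT share of the nuclear genome, the NUMT allele frequency among mapped reads equals p/(1 - p) times the ratio of unmapped to mapped reads, so a regression through the origin across samples recovers p. All function names and numerical values are hypothetical.

```python
def simulate_sample(mito_fraction, numt_fraction):
    """Return (mapped_fraction, numt_allele_freq) for one sample.

    mito_fraction: share of all reads from true mitochondrial DNA (varies
                   among samples).
    numt_fraction: share of the nuclear genome consisting of NUMTs
                   (constant among samples).
    """
    numt_reads = (1.0 - mito_fraction) * numt_fraction
    mapped = mito_fraction + numt_reads
    return mapped, numt_reads / mapped


def fit_slope_through_origin(xs, ys):
    """Least-squares slope of y = slope * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def estimate_numt_fraction(samples):
    """Estimate the NUMT share of the nuclear genome from several samples."""
    # Predictor: ratio of unmapped to mapped reads in each sample.
    xs = [(1.0 - mapped) / mapped for mapped, _ in samples]
    # Response: NUMT allele frequency among mapped reads.
    ys = [freq for _, freq in samples]
    slope = fit_slope_through_origin(xs, ys)  # equals p / (1 - p) in this model
    return slope / (1.0 + slope)


# Hypothetical values: one NUMT fraction, five samples differing in mito content.
TRUE_NUMT_FRACTION = 0.001
MITO_FRACTIONS = [0.002, 0.005, 0.01, 0.02, 0.05]

samples = [simulate_sample(f, TRUE_NUMT_FRACTION) for f in MITO_FRACTIONS]
estimate = estimate_numt_fraction(samples)
print(f"estimated NUMT fraction: {estimate:.6f}")
```

The point of the sketch is that the variation in mitochondrial content among samples, which is a nuisance for threshold-based classification, is exactly what makes the NUMT proportion identifiable. Real data would additionally require handling of sampling noise, sequencing error, and sites at which only some NUMT copies have diverged.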
Grasshoppers make a good test-bed for this approach, since they are
notorious for having genomes with multiple NUMTs, which complicate
phylogenetic analyses (Hawlitschek et al., 2017; Song, Buhay, Whiting,
& Crandall, 2008). One representative, the grasshopper Podisma
pedestris, has been studied for half a century for its hybrid zone
(Hewitt & John, 1972; John & Hewitt, 1970). Strong selection against
hybrids has been shown in lab experiments (Barton, 1980; Barton &
Hewitt, 1981) and in the field (Nichols & Hewitt, 1988), suggesting
some level of divergence between the populations. However, to date we
still have no data on mitochondrial differentiation between the two
hybridising populations because the presence of NUMTs has made
population studies almost impossible (Bensasson, Zhang, & Hewitt, 2000;
Vaughan, Heslop-Harrison, & Hewitt, 1999). P. pedestris is also
the insect species with the largest genome size recorded
(http://www.genomesize.com, accessed 24 November 2022).
As a second example, we chose to analyse data from an organism that
might have a relatively small number of NUMTs: the parrot Psephotellus
varius. There are reports of low NUMT content in the
chicken genome, which have been extrapolated to other bird species
(Pereira & Baker, 2004), although Nacer & Raposo do Amaral (2017)
discovered higher NUMT contents in two species of falcon using BLAST
searches against published genome assemblies (Zhan et al., 2013). Even in
small numbers, bird NUMTs are known to be a potential source of
misleading data in the genetic analysis of their mitochondrial DNA
(Sorenson & Quinn, 1998).
We first confirm the accuracy of our general method by testing it on human
data, where the number of NUMTs is relatively well characterised because
of the exceptionally high quality of the human genome assembly. We then
proceed to quantify the NUMT content in a dedicated dataset of the
grasshopper Podisma pedestris and in a re-analysis of a museomics
dataset of the parrot Psephotellus varius (McElroy, Beattie,
Symonds, & Joseph, 2018).