Simulations
To test the intercept method, we generated a simulation pipeline that
can be run locally on a desktop PC. The pipeline generates a random
genome sequence whose length can be specified. It also generates a
random extranuclear sequence of 16000 nt length. Then, over the course
of 15000000 ‘years’, it is checked each year whether the extranuclear
DNA experiences a nucleotide substitution (at a rate typical for
insects) and whether an insertion event happens (at a rate that can be
specified). Then the nuclear and extranuclear genomes are written out.
After that, sequencing reads are simulated from the nuclear genome using
wgsim (https://github.com/lh3/wgsim), adding varying amounts of
extranuclear DNA (per-sample proportions drawn from an exponential
distribution with a mean of 1%). The reads are then mapped using
bwa-mem2 (Vasimuddin, Misra, Li, & Aluru, 2019) and variants are called
with Freebayes. The resulting VCF files are processed as described
above.
Using this pipeline, we simulated three nuclear genome sizes, 250, 500,
and 1000 Mbp, and seven insertion rates, 0.00001-0.00004 insertions per
year (in steps of 0.000005). We simulated ten replicates for each
parameter combination and recorded how many times the proportion of
nuclear inserts observed in each simulation was within the confidence
interval returned by the intercept method. We then fitted a
binomial-family generalised linear model using the log-transformed
values of ‘genome size in Mbp’ and ‘expected mapping depth of nuclear
reads to insert sequences’ as predictors. To assess the effect of
variable insertion rates along the extranuclear sequence, we also ran
simulations where the start location of the insert sequence
(extrapolated to the length of the extranuclear DNA) was drawn from an
asymmetric unimodal beta distribution (shape parameters set to 3 and 2).