Simulations
To test the intercept method, we generated a simulation pipeline that can be run locally on a desktop PC. The pipeline generates a random genome sequence whose length can be specified. It also generates a random extranuclear sequence of 16000 nt length. Then, over the course of 15000000 ‘years’, it is checked each year whether the extranuclear DNA experiences a nucleotide substitution (at a rate typical for insects) and whether an insertion event happens (at a rate that can be specified). Then the nuclear and extranuclear genomes are written out. After that, sequencing reads are simulated from the nuclear genome using wgsim (https://github.com/lh3/wgsim), adding varying amounts of extranuclear DNA (per-sample proportions drawn from an exponential distribution with a mean of 1%). The reads are then mapped using bwa-mem2 (Vasimuddin, Misra, Li, & Aluru, 2019) and variants are called with Freebayes. The resulting VCF files are processed as described above.
Using this pipeline, we simulated three nuclear genome sizes, 250, 500, and 1000 Mbp, and seven insertion rates, 0.00001-0.00004 insertions per year (in steps of 0.000005). We simulated ten replicates for each parameter combination and recorded how many times the proportion of nuclear inserts observed in each simulation was within the confidence interval returned by the intercept method. We then fitted a binomial-family generalised linear model using the log-transformed values of ‘genome size in Mbp’ and ‘expected mapping depth of nuclear reads to insert sequences’ as predictors. To assess the effect of variable insertion rates along the extranuclear sequence, we also ran simulations where the start location of the insert sequence (extrapolated to the length of the extranuclear DNA) was drawn from an asymmetric unimodal beta distribution (shape parameters set to 3 and 2).