lcWGS data simulated from distinct demographic backgrounds and
coverage
We employed simulated data from
Lou
et al.
(2021)
to demonstrate the ability of these bioinformatic suites to utilize
lcWGS data to accurately estimate allele frequencies and identify
outlier regions at various coverages and sample sizes. Details of
simulated data are included in Lou et al. (2021), but briefly,
nucleotide sequence data including mutation and recombination on a
single 30Mb chromosome were simulated for two populations exchanging
genes under two demographic scenarios: a lower effective population size
and lower rate of gene exchange that produced a
higher
background FST (hereafter, the ‘high background
FST’ scenario) and a higher gene exchange and effective
population size that produced a lower background FST(‘low background FST’ scenario). In both scenarios,
several sites under selection were introduced and allowed to evolve
under divergent selection in each population, ultimately resulting in
seven outlier regions (Supplemental Figure 1). Sequence data from the
simulated chromosomes, generated to reflect Illumina-style paired-end
reads including sequencing errors, reflected 8x coverage for each of
several hundred individuals from each population to enable down-sampling
at various coverage levels. See Lou et al. (2021) for additional details
about simulated data.
While extensive scenarios were tested in Lou et al. (2021)with these
simulated data, we selected four scenarios to benchmark
PoolParty2 against angsd including: a) high background
FST & low sequence coverage; b) high background
FST & higher sequence coverage; c) low background &
low sequence coverage; d) low background FST and higher
sequence coverage. From the simulated sequence data, we randomly
selected a number of individuals from each population to reflect sample
sizes that are common for empirical datasets in the literature and the
dataset included in this study (70 and 63 from each population) and
down-sampled the simulated sequence data (8x) to reflect the median
individual coverage in our empirical data (~0.33x) as
well as a common sequencing target for lcWGS studies (1x). As Lou et al.
(2021) examined the effects of depth and sample size on allele frequency
estimation accuracy across a greater range of values, it was not our
intent to duplicate their efforts except to compare the abilities of the
two bioinformatic suites at these coverage depths. However, to
additionally challenge these suites with the range of variability in
coverage among individuals commonly reflected in empirical data, we fit
several simple mathematical distributions to the sums of individual
coverage in our empirical data for variant positions following an
initial set of global filters (global depth of 10 and minor allele
frequency, MAF, of 0.005). A logistic distribution exhibited the best
fit based on information theoretic criteria (results not shown), and
using parameters for this distribution, we down-sampled the 8x simulated
data. For the 1x coverage dataset, the empirical scale parameter for
this distribution was used, but the location parameter was set to 1, to
produce higher coverage with similar variation. Individuals utilized at
different coverages of the same scenario were not identical. This
resulted in four simulated datasets (low and high background
FST at 0.33x and 1x coverage).
For each of the four datasets, sequence data were processed with the
PPalign module of PoolParty2, including quality
trimming (sliding window
PHRED
≥ 20, retained length ≥ 50bp), read mapping (mapQ
≥
20), variant scoring and filtering by global parameters
(snpQ
≥ 20, global depth of 10 reads, MAF of 0.001), and estimation of raw and
normalized allele frequencies. Unless otherwise indicated, normalized
allele frequencies were utilized in all analyses. SNP variants and their
normalized frequencies identified by
PPalign
were used as input to
PPanalyze,
which applied additional filters for sites observed in fewer than 10
reads per population. PPanalyze also calculated
FSTfor individual sites and sliding windows of 100Kbp windows in 5Kbp sets,
as well as applying Fisher’s Exact tests (FET) for differences in allele
frequency between the two populations. Significance (p ) values
for the latter test were used as input for the Local Score test
(Fariello et al., 2017), which ‘smooths’ the background variation in
significance to identify contiguous outlier regions depending on a
user-specified smoothing parameter (ξ). In addition, this test
determines the significance of putative outlier regions through
accounting for linkage (autocorrelation in p -values) in a
chromosome-specific manner. However, because each run samples different
SNPs to calculate autocorrelation and determine significance, individual
runs may fail to identify outlier regions with borderline significance.
Lower rates of smoothing (smaller ξ) generally retain more power, but
are less precise in determining the boundaries of outlier regions, and
moreover, because the landscape of background significance changes with
various ξ values, and thus the thresholds for significance in this test,
power and smoothing do not have a directly inverse linear relationship.
For these reasons, multiple runs with application of different values of
ξ are useful. We therefore ran the local score test three times for
increasing values of
ξ
representing the 70th,
80th,
90th, 95th, and
99th quantiles of significance values from the Exact
tests of each dataset, and tallied how often each simulated outlier
region was recovered as well as the width estimated for each region.
Using the filtered BAM files produced by
PPalign
as input, we applied angsd to all four simulated datasets in
two manners. First, we ran angsd undirected by any variant
identification from PoolParty2, relying on angsd’s
filters to identify and screen variant sites. We applied filters to
several runs of each dataset (similar to PPalign: mapQ
≥
20, snpQ
≥
20, global depth ≥ 20, MAF
≥
0.001) but with varying significance thresholds for variant discovery:
none, 0.01, and 0.001. We ran this configuration to compare
angsd’s
ability to detect true variants and discover outlier regions to
PoolParty2. Subsequently, specifying for angsd to
consider only variants it discovered with the most stringent
significance threshold (p ≤ 0.001), we ran
angsd
on data from each simulated population separately in order to generate
site frequency spectra and calculate FST using
angsd’s realSFS utility. We only retained sites
observed in more than 10 reads and 3 individuals in each population, and
then directed ANGSD to run association tests on these sites (-doAssoc 1,
2, and 5) to identify outliers, specifying population membership of each
individual. Second, we directed angsd to consider only sites
discovered by PPalign and subsequently retained by filtering
with PPanalyze (“PoolParty2 to angsd”), and
we utilized allele frequencies estimated from these runs to compare the
ability of PoolParty2 and angsd to recover the known
allele frequencies accurately. Subsequently, genotype likelihoods
inferred by angsd from these runs were used as input for
estimation of linkage using ngsLD (Fox, Wright, Fumagalli, &
Vieira, 2019), and we constrained linkage estimation to sites ≤ 100Kbp
from one another. Using these linkage estimates, we calculated mean LD
in 100Kbp windows in 5Kbp steps in R, and identified outlier regions as
contiguous series of ≥10 windows exceeding 2x the interquartile range
(2xIQR) for mean windowed LD.