lcWGS data simulated from distinct demographic backgrounds and coverage
We employed simulated data from Lou et al. (2021) to demonstrate the ability of these bioinformatic suites to utilize lcWGS data to accurately estimate allele frequencies and identify outlier regions at various coverages and sample sizes. Details of simulated data are included in Lou et al. (2021), but briefly, nucleotide sequence data including mutation and recombination on a single 30Mb chromosome were simulated for two populations exchanging genes under two demographic scenarios: a lower effective population size and lower rate of gene exchange that produced a higher background FST (hereafter, the ‘high background FST’ scenario) and a higher gene exchange and effective population size that produced a lower background FST(‘low background FST’ scenario). In both scenarios, several sites under selection were introduced and allowed to evolve under divergent selection in each population, ultimately resulting in seven outlier regions (Supplemental Figure 1). Sequence data from the simulated chromosomes, generated to reflect Illumina-style paired-end reads including sequencing errors, reflected 8x coverage for each of several hundred individuals from each population to enable down-sampling at various coverage levels. See Lou et al. (2021) for additional details about simulated data.
While extensive scenarios were tested in Lou et al. (2021)with these simulated data, we selected four scenarios to benchmark PoolParty2 against angsd including: a) high background FST & low sequence coverage; b) high background FST & higher sequence coverage; c) low background & low sequence coverage; d) low background FST and higher sequence coverage. From the simulated sequence data, we randomly selected a number of individuals from each population to reflect sample sizes that are common for empirical datasets in the literature and the dataset included in this study (70 and 63 from each population) and down-sampled the simulated sequence data (8x) to reflect the median individual coverage in our empirical data (~0.33x) as well as a common sequencing target for lcWGS studies (1x). As Lou et al. (2021) examined the effects of depth and sample size on allele frequency estimation accuracy across a greater range of values, it was not our intent to duplicate their efforts except to compare the abilities of the two bioinformatic suites at these coverage depths. However, to additionally challenge these suites with the range of variability in coverage among individuals commonly reflected in empirical data, we fit several simple mathematical distributions to the sums of individual coverage in our empirical data for variant positions following an initial set of global filters (global depth of 10 and minor allele frequency, MAF, of 0.005). A logistic distribution exhibited the best fit based on information theoretic criteria (results not shown), and using parameters for this distribution, we down-sampled the 8x simulated data. For the 1x coverage dataset, the empirical scale parameter for this distribution was used, but the location parameter was set to 1, to produce higher coverage with similar variation. Individuals utilized at different coverages of the same scenario were not identical. This resulted in four simulated datasets (low and high background FST at 0.33x and 1x coverage).
For each of the four datasets, sequence data were processed with the PPalign module of PoolParty2, including quality trimming (sliding window PHRED ≥ 20, retained length ≥ 50bp), read mapping (mapQ ≥ 20), variant scoring and filtering by global parameters (snpQ ≥ 20, global depth of 10 reads, MAF of 0.001), and estimation of raw and normalized allele frequencies. Unless otherwise indicated, normalized allele frequencies were utilized in all analyses. SNP variants and their normalized frequencies identified by PPalign were used as input to PPanalyze, which applied additional filters for sites observed in fewer than 10 reads per population. PPanalyze also calculated FSTfor individual sites and sliding windows of 100Kbp windows in 5Kbp sets, as well as applying Fisher’s Exact tests (FET) for differences in allele frequency between the two populations. Significance (p ) values for the latter test were used as input for the Local Score test (Fariello et al., 2017), which ‘smooths’ the background variation in significance to identify contiguous outlier regions depending on a user-specified smoothing parameter (ξ). In addition, this test determines the significance of putative outlier regions through accounting for linkage (autocorrelation in p -values) in a chromosome-specific manner. However, because each run samples different SNPs to calculate autocorrelation and determine significance, individual runs may fail to identify outlier regions with borderline significance. Lower rates of smoothing (smaller ξ) generally retain more power, but are less precise in determining the boundaries of outlier regions, and moreover, because the landscape of background significance changes with various ξ values, and thus the thresholds for significance in this test, power and smoothing do not have a directly inverse linear relationship. For these reasons, multiple runs with application of different values of ξ are useful. We therefore ran the local score test three times for increasing values of ξ representing the 70th, 80th, 90th, 95th, and 99th quantiles of significance values from the Exact tests of each dataset, and tallied how often each simulated outlier region was recovered as well as the width estimated for each region.
Using the filtered BAM files produced by PPalign as input, we applied angsd to all four simulated datasets in two manners. First, we ran angsd undirected by any variant identification from PoolParty2, relying on angsd’s filters to identify and screen variant sites. We applied filters to several runs of each dataset (similar to PPalign: mapQ ≥ 20, snpQ ≥ 20, global depth ≥ 20, MAF ≥ 0.001) but with varying significance thresholds for variant discovery: none, 0.01, and 0.001. We ran this configuration to compare angsd’s ability to detect true variants and discover outlier regions to PoolParty2. Subsequently, specifying for angsd to consider only variants it discovered with the most stringent significance threshold (p ≤ 0.001), we ran angsd on data from each simulated population separately in order to generate site frequency spectra and calculate FST using angsd’s realSFS utility. We only retained sites observed in more than 10 reads and 3 individuals in each population, and then directed ANGSD to run association tests on these sites (-doAssoc 1, 2, and 5) to identify outliers, specifying population membership of each individual. Second, we directed angsd to consider only sites discovered by PPalign and subsequently retained by filtering with PPanalyze (“PoolParty2 to angsd”), and we utilized allele frequencies estimated from these runs to compare the ability of PoolParty2 and angsd to recover the known allele frequencies accurately. Subsequently, genotype likelihoods inferred by angsd from these runs were used as input for estimation of linkage using ngsLD (Fox, Wright, Fumagalli, & Vieira, 2019), and we constrained linkage estimation to sites ≤ 100Kbp from one another. Using these linkage estimates, we calculated mean LD in 100Kbp windows in 5Kbp steps in R, and identified outlier regions as contiguous series of ≥10 windows exceeding 2x the interquartile range (2xIQR) for mean windowed LD.