Bioinformatic suites for low coverage whole genome sequencing
data
PoolParty2
is a bioinformatic pipeline that was designed for the utilization of
pooled or individually barcoded low coverage whole genome sequencing
(lcWGS) data with a high-quality reference genome to perform genome-wide
genetic association analysis with binary phenotypic traits or
environmental variables. Since its original presentation, it has seen
numerous updates, though most of these will be invisible to the user
since their effect is to provide greater efficiency or stability. For
example, the scripts that implement the Local Score analysis of Fariello
et al. (2017) have been improved to make them more robust to variations
in SNP and allele frequency count and distribution and provide more
accurate reports with respect to outlier region margins. The major
workhorse of the pipeline, the
PPalign
module, has received the most updates since the pipeline’s first
presentation. Notably, this module is now able to call variants in
parallel (across multiple computational threads) by dividing chromosomes
or scaffolds into groups based on a user-specified number of threads,
which substantially reduces processing time for large datasets. Another
notable update includes the user’s ability to limit the number of
individuals to normalize simultaneously, which reduces RAM memory
requirements in memory-limited or shared computing environments, with
the proviso that this also causes the duration of processing time for
normalization to increase linearly.
These improvements to the PPalign module of PoolParty2
are important since, while there are myriad analyses available for the
discovery of large effect loci, few bioinformatic suites are designed to
convey
lcWGS
data from sequencing output, via quality assurance analyses, through to
analytical results.
angsd,
another useful bioinformatic suite designed for barcoded but not pooled
lcWGS data, requires read alignments as input, and provides additional
functionalities that PoolParty2 does not, making these
bioinformatic suites strong complements. In particular, angsd
can estimate individual genotype likelihoods or posterior probabilities.
For analyses where individual phenotype data are important, such as
random forest analyses (e.g. Hess, Zendt, Matala, & Narum, 2016),
genotype likelihoods would be necessary. Moreover, as we demonstrate
here, these genotype likelihoods allow the estimation of linkage
disequilibrium in specified windows, and which can be used to identify
outlier regions under putative selection, though, as our empirical
results portray, most effectively in combination with analyses that
consider divergence among focal sets of samples.
The
analyses available through the PPanalyze module of
PoolParty2 have most often been applied for the discovery of
one or a few genomic regions with large, independent or
simple-interaction effects on traits (e.g. direct epistasis), though, as
results from our empirical data also portray, these analyses, with
appropriate corroboration, may identify many loci with smaller effects
on polygenic traits. Loci contributing to polygenic traits with
continuous variation and low penetrance, however, will be challenging to
discover with analyses that rely mainly on allele frequency, and their
discovery will likely depend on identifying combinatorial association of
non-contiguous genotypes, for which genotype likelihoods may be better
suited. It should be noted, however, that genotype likelihoods may only
be sufficiently informative for some analyses if coverage is high enough
for each individual, and the decision of how to multiplex individuals or
how deeply to sequence, made early in the data preparation phase, must
be done with respect to the types of analyses that will be needed (Lou
et al., 2021; Paril et al., 2022).
Using simulated data, we demonstrated that
angsd
and PoolParty2 both provide remarkably precise estimates of
allele frequency considering the degree of individual coverage
constraint and variation with which we challenged them, with
PoolParty2
exhibiting a moderate edge in identifying true sites with fewer
artifacts and accurately estimating their allele frequencies. Both
bioinformatic suites also produce very low rates of false positive
outliers, at least when data are examined on a site-aggregate (windowed
or serial) basis. A number of the per-site divergence analyses are
roughly equivalent between bioinformatic suites, including estimates of
FST, sliding window FST, and basic tests
of allele frequency proportions, though we note that to estimate site
frequency spectra in order to calculate FST with
angsd,
each population must be analyzed separately, and if the user desires to
utilize the empirical major/minor alleles rather than
reference/alternate or ancestral/derived, a prior run with all data must
be made. Beyond these, however, the Local Score analyses (Fariello et
al., 2017) and estimate of linkage outliers from ngsLD (Fox et
al., 2019) provide powerful and complementary means to identify true
outlier regions. Moreover, these analyses can be run in parallel by
supplying filtered BAM files and a list of variant sites to
angsd to generate genotype likelihoods for linkage estimation
with ngsLD while simultaneously running PPanalyze and
Local Score analyses. We provide the code to demonstrate how we executed
this for our simulated and empirical data (Supplemental File 1).
However,
results from the analysis of simulated data also indicate that, while
both angsd and PoolParty2 are effective at identifying
segregating variants, estimating allele frequencies, and identifying
outlier loci using lcWGS data, the particular evolutionary context and
genomic environment of putatively adaptive loci may constrain the
efficacy of any bioinformatic suite at a given depth of coverage. For
example, we observed that while both analytical suites provided results
that correctly portrayed the presence and location of outlier loci in
each of the simulated datasets, peaks in the higher background
divergence scenarios were more difficult to discern visually and an
analysis designed to objectively discriminate outlier loci from
background rates of variation, Local Score, identified fewer significant
regions. While outlier identification using linkage was more effective
in high background divergence scenarios, and coordination of this
approach with the other analyses increases flexibility of the combined
toolkit, this analysis was less effective with lower coverage regardless
of evolutionary context. Although we did not investigate it directly
here because of the nature of the simulated data, one strategy to
increase power in challenging scenarios is through the use of paired
sample replicates, i.e., samples of the same phenotype or across the
same environmental axes from multiple populations (e.g. Lotterhos &
Whitlock, 2015). Analyses that emphasize repeated differences across
these replicates, such as the Cochran–Mantel–Haenszel test available
in PoolParty2, increase the power to identify regions
associated with the focal contrasts while minimizing the influence of
background variation (Cochran, 1954; Mantel & Haenszel, 1959). Our
recommendation based on these observations is for researchers to
consider existing information about the demographic and genetic context
of the traits and populations under study, and to carefully arrange
experimental design to maximize power for any given scenario (Lou et
al., 2021).
One underappreciated phenomenon in lcWGS are technical artifacts
commonly known as batch effects (Leek et al., 2010; Lou & Therkildsen,
2022). These occur when idiosyncrasies of the library and sequencing
process introduce artifacts for samples processed together that are
later misinterpreted as true biological differences between groups under
analysis, and may have a number of contributing causes in lcWGS data
(Lou & Therkildsen, 2022). While batch effects are certainly not
exclusive to lcWGS, barcoded and especially pooled lcWGS data that are
prepared group-wise may be subject to them. Importantly, both
PoolParty2 and angsd have a number of utilities
included that help eliminate some batch effects, including read
trimming, mapping and SNP quality filtering, minor allele frequency
thresholds, and read and individual observation minima. Moreover, with
any bioinformatic suite it is wise that users conduct tests to determine
if results are sensitive to various trimming and filtering parameters.
The alignment filtering utilities provided with angsd appear to
be more stringent, even with the same parameters, than what is currently
applied in PoolParty2, as evidenced by diminished correlation
in allele frequencies for sites with higher depth as estimated by
PoolParty2
but not for
angsd.
Additional variant filters are accessible in angsd, including
those that examine strand balance, and though not built-in, these same
filters can be applied to the variants discovered by
PoolParty2
through VCF filtering tools such as bcftools or vcflib
(Danecek et al., 2021; Garrison, Kronenberg, Dawson, Pedersen, & Prins,
2021), as described in our tutorials. Aside from these filters, one
important means of controlling batch effects is to distribute barcoded
samples from different analytical groups (e.g., populations) among
sequencing runs, and ideally library preparation groups, since this
reduces the chance that technical artifacts will be identified as
group-wise differences (O’Leary, Puritz, Willis, Hollenbeck, & Portnoy,
2018).
In
addition, another strategy to minimize the influence of batch effects
and other false positives is through the use of paired sample replicates
as described above: analyses that utilize paired replicates reduce the
chance that artifacts owing to any technical pair will be identified as
consistent, significant differences (Cochran, 1954; Mantel & Haenszel,
1959).
Discovery of large effect loci in non-model organisms
We successfully apply our integrated pipeline to a common data
arrangment of low coverage whole genome sequencing (lcWGS) data from
steelhead trout (Oncorhynchus mykiss ) to identify loci strongly
divergent among populations with distinct histories and putatively of
strong effect on traits under selection in these groups. This species
exhibits an impressive array of life history diversity even among
salmonids, including notable and important variation in migration
phenology (run timing), age at maturity (age at first reproduction),
propensity for residency or anadromy, precocial sexual maturation, and
iteroparity (repeat spawning) (Busby et al., 1999; Carlson & Seamons,
2008; Quinn, Seamons, Vollestad, & Duffy, 2011). Many of these traits
are highly variable both among and within phylogeographic lineages and
local populations (Busby et al., 1999), are part of the portfolio of
life history diversity that assist populations in persisting through
natural and anthropogenic environmental challenges (Quinn, McGinnity, &
Reed, 2016), and are among the outward physical traits used by managers
to estimate the contribution of distinct stocks to mixed-stock fisheries
(Hess et al., 2021).
The diversity of life history ensembles among regional and local stocks
of steelhead reflects a multifaceted history of selection, and upon
these effects anthropogenic forces have applied new challenges. One of
the most direct human impacts on steelhead are use of hatcheries to
mitigate for the loss of fish from dams, overfishing, habitat
degradation, and other human actions. While hatcheries have begun to
shift towards inclusion of natural origin (NOR) returns in broodstock
recruitment to avoid ‘domestication’ selection, these are still the
minority, and indeed some operations have taken the opposite approach,
actively choosing hatchery returns with characteristics desirable for
hatchery managers or fishers, with many consequences to genomic
variation. While the typical application of our pipeline for discovering
loci under selection relies upon groups with known phenotypic
differences, discrete phenotypes with heritable bases may be overall
uncommon in non-model organisms, at least relative to opportunities for
examining populations arrayed across replicate environmental axes
(e.g.
Lotterhos & Whitlock, 2015). To reflect this, we chose to examine
genomic divergence in a hatchery stock with a notable history of
hatchery selection without a priori designation other than group
membership, as would be the case for most landscape-level genome scans.
However, in this case the Skamania Hatchery stock is well known for its
phenotypic ensemble, its history of hatchery origin recruitment and
artificial selection, and the extent to which this stock has been
outplanted or strayed into numerous Columbia Basin tributaries. Not
surprisingly, we observed strong divergence in two regions known to be
associated with early migration timing and age at maturity (ocean
duration) on chromosomes 28 and 25, respectively. However, we also saw
significant divergences in regions that contained several dozen
additional genes, including a prominent region on chromosome 20
containing two genes important in metabolism and cellular signaling.
This region on chromosome 20 is also known to harbor a structural
inversion in this species (Pearse et al., 2019), which may have been
involved in artificial selection of the hatchery stock. Importantly,
these regions were emphasized as outliers both by the analyses we
applied that considered only global linkage patterns as well as the
analyses that considered contrast-specific divergence while controlling
for linkage. However, the linkage-only analysis identified additional
regions that did not appear to exhibit strong divergence across our
contrast of interest (hatchery vs. natural origin), demonstrating the
importance of corroborating outliers identified from any single
analysis. While we decline to hypothesize how these additional genes may
be involved in domestication, these results demonstrate the utility of
applying our pipeline to landscape-level samples and the efficacy and
complementarity of the PoolParty and angsd pipelines
for analyzing lcWGS data.
Acknowledgements
We thank colleagues at the Columbia River Inter-Tribal Fish Commission
(CRITFC) in Portland, OR and Hagerman, ID for facilitating the current
dataset and the associate editor and two anonymous reviewers for
comments on a previous version of this manuscript. We also greatly
appreciate the assistance by Nicolas R. Lou and Nina Overgaard
Therkildsen for providing and explaining the simulated sequence data.
Samples were collected by CRITFC staff or graciously provided by the
Oregon and Washington Departments of Fisheries and Wildlife. Funding for
this project was contributed by Bonneville Power Administration (Grant
no. 2008-907-00), by the NSF Idaho EPSCoR Program, and by the National
Science Foundation (award no. OIA-1757324).
Data availability
Data referenced in this manuscript are available from NBCI Short Read
Archive (PRJNA854899). The pipeline used for bioinformatic processing of
these data, PoolParty, is available from Github
(https://github.com/stuartwillis/poolparty).
References Cited
Aguirre-Ramirez, E., Velasco-Cuervo, S., & Toro-Perea, N. (2021).
Genomic traces of the fruit fly Anastrepha obliqua associated with its
polyphagous nature. Insects , 12 (12). doi:
10.3390/insects12121116
Andrews, K. R., Good, J. M., Miller, M. R., Luikart, G., & Hohenlohe,
P. A. (2016). Harnessing the power of RADseq for ecol & evol
genomics.pdf. Nature Reviews Genetics , 17 .
Ayerst, J. D. (1976). The Role of Hatcheries in Rebuilding Stcelhead
Runs of the Columbia River System. American Fisheries Society
Special Publication: Proceedings of a Symposium Held in Vancouver,
Washington, March 5-6, 1976 , 84–88.
Baird, N. A., Etter, P. D., Atwood, T. S., Currey, M. C., Shiver, A. L.,
Lewis, Z. A., … Johnson, E. A. (2008). Rapid SNP discovery and
genetic mapping using sequenced RAD markers. PLoS ONE ,3 (e3376).
Bonhomme, M., Chevalet, C., Servin, B., Boitard, S., Abdallah, J.,
Blott, S., & SanCristobal, M. (2010). Detecting selection in population
trees: The Lewontin and Krakauer test extended. Genetics ,186 (1). doi: 10.1534/genetics.110.117275
Busby, P., Wainwright, T., & Bryant, G. (1999). Status Review of
Steelhead from Washington, Idaho, Oregon, and California.Sustainable Fisheries Management . doi: 10.1201/9781439822678.ch11
Carlson, S. M., & Seamons, T. R. (2008). SYNTHESIS: A review of
quantitative genetic components of fitness in salmonids: implications
for adaptation to future change. Evolutionary Applications ,1 (2), 222–238. doi: 10.1111/j.1752-4571.2008.00025.x
Cochran, W. G. (1954). Some Methods for Strengthening the Common χ 2
Tests. Biometrics , 10 (4). doi: 10.2307/3001616
Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V.,
Pollard, M. O., … Li, H. (2021). Twelve years of SAMtools and
BCFtools. GigaScience , 10 (2). doi:
10.1093/gigascience/giab008
Fariello, M. I., Boitard, S., Mercier, S., Robelin, D., Faraut, T.,
Arnould, C., … SanCristobal, M. (2017). Accounting for linkage
disequilibrium in genome scans for selection without individual
genotypes: The local score approach. Molecular Ecology ,26 (14). doi: 10.1111/mec.14141
Fox, E. A., Wright, A. E., Fumagalli, M., & Vieira, F. G. (2019).
NgsLD: Evaluating linkage disequilibrium using genotype likelihoods.Bioinformatics , 35 (19). doi: 10.1093/bioinformatics/btz200
Futschik, A., & Schlötterer, C. (2010). The next generation of
molecular markers from massively parallel sequencing of pooled DNA
samples. Genetics , 186 (1). doi:
10.1534/genetics.110.114397
Garrison, E., Kronenberg, Z. N., Dawson, E. T., Pedersen, B. S., &
Prins, P. (2021). Vcflib and tools for processing the VCF variant call
format. BioRxiv .
Günther, T., & Coop, G. (2013). Robust identification of local
adaptation from allele frequencies. Genetics , 195 (1). doi:
10.1534/genetics.113.152462
Hess, J. E., Collins, E. E., Harmon, S. A., Horn, R. L., Koch, I. J.,
Stephenson, J., … Narum, S. R. (2021). GENETIC ASSESSMENT
OF COLUMBIA RIVER STOCKS: 1/1/2020 - 12/31/2020 Annual Report,
2008-907-00 .
Hess, J. E., Zendt, J. S., Matala, A. R., & Narum, S. R. (2016).
Genetic basis of adult migration timing in anadromous steelhead
discovered through multivariate association testing. Proceedings
of the Royal Society B: Biological Sciences , 283 (1830). doi:
10.1098/rspb.2015.3064
Hoban, S., Kelley, J. L., Lotterhos, K. E., Antolin, M. F., Bradburd,
G., Lowry, D. B., … Whitlock, M. C. (2016). Finding the genomic
basis of local adaptation: Pitfalls, practical solutions, and future
directions. American Naturalist . doi: 10.1086/688018
Horn, R. L., Kamphaus, C., Murdoch, K., & Narum, S. R. (2020).
Detecting genomic variation underlying phenotypic characteristics of
reintroduced Coho salmon (Oncorhynchus kisutch). Conservation
Genetics , 21 (6). doi: 10.1007/s10592-020-01307-0
Korneliussen, T. S., Albrechtsen, A., & Nielsen, R. (2014). ANGSD:
Analysis of Next Generation Sequencing Data. BMC Bioinformatics ,15 (1). doi: 10.1186/s12859-014-0356-4
Lamichhaney, S., Barrio, A. M., Rafati, N., Sundström, G., Rubin, C. J.,
Gilbert, E. R., … Andersson, L. (2012). Population-scale
sequencing reveals genetic differentiation due to local adaptation in
Atlantic herring. Proceedings of the National Academy of Sciences
of the United States of America , 109 (47). doi:
10.1073/pnas.1216128109
Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B.,
Johnson, W. E., … Irizarry, R. A. (2010). Tackling the widespread
and critical impact of batch effects in high-throughput data.Nature Reviews Genetics , Vol. 11. doi: 10.1038/nrg2825
Lotterhos, K. E., & Whitlock, M. C. (2015). The relative power of
genome scans to detect local adaptation depends on sampling design and
statistical method. Molecular Ecology . doi: 10.1111/mec.13100
Lou, R. N., Jacobs, A., Wilder, A. P., & Therkildsen, N. O. (2021). A
beginner’s guide to low-coverage whole genome sequencing for population
genomics. Molecular Ecology , 30 (23). doi:
10.1111/mec.16077
Lou, R. N., & Therkildsen, N. O. (2022). Batch effects in population
genomic studies with low-coverage whole genome sequencing data: Causes,
detection and mitigation. Molecular Ecology Resources ,22 (5). doi: 10.1111/1755-0998.13559
Lowry, D. B., Hoban, S., Kelley, J. L., Lotterhos, K. E., Reed, L. K.,
Antolin, M. F., & Storfer, A. (2017). Breaking RAD: an evaluation of
the utility of restriction site-associated DNA sequencing for genome
scans of adaptation. Molecular Ecology Resources , 17 (2).
doi: 10.1111/1755-0998.12635
Lyu, G., Feng, C., Zhu, S., Ren, S., Dang, W., Irwin, D. M., …
Zhang, S. (2021). Whole Genome Sequencing Reveals Signatures for
Artificial Selection for Different Sizes in Japanese Primitive Dog
Breeds. Frontiers in Genetics , 12 . doi:
10.3389/fgene.2021.671686
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis
of data from retrospective studies of disease. Journal of the
National Cancer Institute , 22 (4). doi: 10.1093/jnci/22.4.719
Micheletti, S. J., Hess, J. E., Zendt, J. S., & Narum, S. R. (2018).
Selection at a genomic region of major effect is responsible for
evolution of complex life histories in anadromous steelhead 06
Biological Sciences 0604 Genetics. BMC Evolutionary Biology ,18 (1). doi: 10.1186/s12862-018-1255-5
Micheletti, S. J., & Narum, S. R. (2018). Utility of pooled sequencing
for association mapping in nonmodel organisms. Molecular Ecology
Resources . doi: 10.1111/1755-0998.12784
O’Leary, S. J., Puritz, J. B., Willis, S. C., Hollenbeck, C. M., &
Portnoy, D. S. (2018). These aren’t the loci you’re looking for:
Principles of effective SNP filtering for molecular ecologists.Molecular Ecology , 0 (ja). doi: 10.1111/mec.14792
Paril, J. F., Balding, D. J., & Fournier-Level, A. (2022). Optimizing
sampling design and sequencing strategy for the genomic analysis of
quantitative traits in natural populations. Molecular Ecology
Resources , 22 (1). doi: 10.1111/1755-0998.13458
Pearse, D. E., Barson, N. J., Nome, T., Gao, G., Campbell, M. A.,
Abadía-Cardoso, A., … Lien, S. (2019). Sex-dependent dominance
maintains migration supergene in rainbow trout. Nature Ecology and
Evolution , 3 (12). doi: 10.1038/s41559-019-1044-6
Puritz, J. B., Matz, M. V, Toonen, R. J., Weber, J. N., Bolnick, D. I.,
& Bird, C. E. (2014). Demystifying the RAD fad. Molecular
Ecology , 23 (24), 5937–5942.
Quinn, T. P., McGinnity, P., & Reed, T. E. (2016). The paradox of
“premature migration” by adult anadromous salmonid fishes: Patterns
and hypotheses. Canadian Journal of Fisheries and Aquatic
Sciences , Vol. 73, pp. 1015–1030. doi: 10.1139/cjfas-2015-0345
Quinn, T. P., Seamons, T. R., Vollestad, L. A., & Duffy, E. (2011).
Effects of growth and reproductive history on the egg size-fecundity
trade-off in steelhead. Transactions of the American Fisheries
Society . doi: 10.1080/00028487.2010.550244
Ren, S., Lyu, G., Irwin, D. M., Liu, X., Feng, C., Luo, R., …
Wang, Z. (2021). Pooled Sequencing Analysis of Geese (Anser cygnoides)
Reveals Genomic Variations Associated With Feather Color.Frontiers in Genetics , 12 . doi: 10.3389/fgene.2021.650013
Schlötterer, C., Tobler, R., Kofler, R., & Nolte, V. (2014). Sequencing
pools of individuals-mining genome-wide polymorphism data without big
funding. Nature Reviews Genetics , 15 (11). doi:
10.1038/nrg3803
Therkildsen, N. O., & Palumbi, S. R. (2017). Practical low-coverage
genomewide sequencing of hundreds of individually barcoded samples for
population and evolutionary genomics in nonmodel species.Molecular Ecology Resources , 17 (2). doi:
10.1111/1755-0998.12593
Tiffin, P., & Ross-Ibarra, J. (2014). Advances and limits of using
population genetics to understand local adaptation. Trends in
Ecology and Evolution , Vol. 29. doi: 10.1016/j.tree.2014.10.004
Willis, S. C., Hess, J. E., Fryer, J. K., Whiteaker, J. M., Brun, C.,
Gerstenberger, R., & Narum, S. R. (2020). Steelhead (Oncorhynchus
mykiss) lineages and sexes show variable patterns of association of
adult migration timing and age-at-maturity traits with two genomic
regions. Evolutionary Applications , 13 (10), 2836–2856.
doi: 10.1111/eva.13088
Willis, S. C., Hollenbeck, C. M., Puritz, J. B., & Portnoy, D. S.
(2022). Genetic recruitment patterns are patchy and spatiotemporally
unpredictable in a deep-water snapper (Lutjanus vivanus) sampled in
fished and protected areas of western Puerto Rico. Conservation
Genetics , 23 (3). doi: 10.1007/s10592-021-01426-2
Zhu, Y., Bergland, A. O., González, J., & Petrov, D. A. (2012).
Empirical validation of pooled whole genome population re-sequencing in
Drosophila melanogaster. PLoS ONE , 7 (7). doi:
10.1371/journal.pone.0041901
Tables
Table 1. Results of SNP calling and filtering by PoolParty2 and
angsd with different modules and filtering thresholds in two
different demographic scenarios (high and low background
FST) and coverage levels (0.33x and 1x). The true number
of simulated sites for the high
background
FST scenarios with minor allele frequency
>=0.001 was 158,746 SNPs and for low background
FST was 789,423 SNPs.
Table 2.
Outlier
regions identified from simulated data using windowed mean linkage and
regions identified with Local Score analysis of significance from
Fisher’s Exact tests. For each outlier region identified, the width in
Kbp is reported. For windowed linkage analysis, the mean and maximum
windowed mean of linkage is listed, while for Local Score analyses
across quantiles of significance as smoothing parameter values, ξ, the
number of replicates out of three in which a region was significant are
listed.
Table
3. Outlier regions identified in select chromosomes from empirical
steelhead trout data
using
windowed mean linkage and regions identified with Local Score analysis
of significance from Fisher’s Exact tests. For each outlier region
identified, the beginning and end of the regions (in Mbp), width (in
Kbp), and, for regions identified with windowed linkage analysis, the
mean and maximum windowed mean of linkage, are reported. Local score
results are reported across quantiles of significance as smoothing
parameter values, ξ.
Supplemental Table 1. Outlier regions identified in across all
chromosomes from empirical steelhead trout data using Local Score
analysis of significance from Fisher’s Exact tests. For each outlier
region identified, the beginning and end, width, local score of the
peak, and number of replicates out of 3 in which the region was
significant, are reported. Regions are reported
across
quantiles of significance
as
smoothing parameter values, ξ.
Figures
Figure 1. Graphical representation of the normalization process of
allele frequency estimates for variance in read depth possible with the
incorporation on barcoded individual samples.
Figure 2. Comparing estimated allele frequencies with true allele
frequencies across depth cutoffs for data simulated with high background
FST and low coverage (0.33x). Left axis: This red and
blue lines show correlation for
PoolParty2
estimated frequencies for each population for normalized (solid) and
non-normalized (dashed) data. Orange and green lines show correlation
with
angsd
estimated frequencies. Black line shows correlation in depths estimated
by angsd and PoolParty2. Right axis: thick purple line
shows number of sites passing depth thresholds. Left and right panels
shows depths reported by PoolParty2 and angsd,
respectively.
Figure 3. Manhattan plots of divergence estimated for variant sites
identified by PoolParty2 in simulated data in scenarios with
high (A,B) or low (C,D) background FST and 0.33x (A,C)
or 1x coverage (B,D). For each dataset, the divergence
(FST) estimated in 100Kbp sliding windows (left column),
Local Score estimated with
ξ
reflecting the 80th quantile of Fisher’s Exact Test
significance values (middle column), and mean linkage in 100Kbp windows
(right column) are shown. Red dots indicate series of 10+ windows of
mean linkage above 2x the interquartile range. Blue dots in each panel
reflect the positions (but not values) of loci simulated under
selection.
Figure 4. Manhattan plots of divergence and linkage among hatchery and
natural origin population samples of steelhead. A) Individual site
FST corrected for sample size (red line indicates
FST of 0.5) B) Mean
FSTin sliding windows of 100Kb across the genome. C) -Log10 significance
(p ) values from individual site Fisher’s Exact tests D-G) Local
score analysis of significance values from Fisher’s Exact tests for the
80th, 90th, 95th,
and 99th quantile of significance values as ξ
(red
line indicates FDR of 0.05); H-J) Mean linkage in 100Kbp windows for
chromosomes 20, 25, and 28 (red dots indicate series of 20+ windows of
mean linkage above 2x the interquartile range).
Supplemental Figure 1. True FST for loci simulated under
demographic scenarios producing A) high background FST(low NE, low migration) or B) low background
FST (high NE, high migration). Black
dots indicated the FST of positions simulated under
selection.
Supplemental Figure 2. Comparing estimated allele frequencies with true
allele frequencies across depth cutoffs for data simulated with high
background FST and low coverage (0.33x). Left axis: This
red and blue lines show correlation for PoolParty2 estimated
frequencies for each population for normalized (solid) and
non-normalized (dashed) data. Orange and green lines show correlation
with angsd estimated frequencies. Black line shows correlation
in depths estimated by angsd and PoolParty2. Right
axis: thick blue line shows number of sites passing depth thresholds.
Left and right panels shows depths reported by PoolParty2 and
angsd, respectively. A) high background FST,
0.33x coverage; B) high background FST, 1x coverage; C)
low background FST, 0.33x coverage; D) low background
FST, 1x coverage
Supplemental Figure 3. Manhattan plot of results from analysis of 0.33x
coverage data simulated with high background FST. A)
FST estimated by PoolParty; B) 100Kbp sliding
windows of FST estimated by PoolParty2; C)
-Log10 significance (p) values from Fisher’s Exact test of allele
frequencies between populations calculated by PoolParty2; D)
FST estimated by angsd; E) 100Kbp sliding
windows of FST estimated by angsd; F)
Likelihood ratio of frequency differences between population calculated
by angsd; G) Mean linkeage in 100Kbp windows estimated for
sites identified by PoolParty2, genotype likelihood estimated
by angsd, and linkage calculated with ngsLD, with
series of 10+ windows with mean linkage above 2x the interquartile range
in red; H-L) Local score values for ξ values reflecting the
70th, 80th, 90th,
95th, and 99th quantile of
significance values from Fisher’s Exact test. Blue dots in each panel
reflect the positions (but not values) of loci simulated under
selection.
Supplemental Figure 4. Manhattan plot of results from analysis of 1x
coverage data simulated with high background FST. A)
FST estimated by PoolParty2; B) 100Kbp sliding
windows of FST estimated by PoolParty2; C)
-Log10 significance (p) values from Fisher’s Exact test of allele
frequencies between populations calculated by PoolParty2; D)
FST estimated by angsd; E) 100Kbp sliding
windows of FST estimated by angsd; F)
Likelihood ratio of frequency differences between population calculated
by angsd; G) Mean linkeage in 100Kbp windows estimated for
sites identified by PoolParty2, genotype likelihood estimated
by angsd, and linkage calculated with ngsLD, with
series of 10+ windows with mean linkage above 2x the interquartile range
in red; H-L) Local score values for ξ values reflecting the
70th, 80th, 90th,
95th, and 99th quantile of
significance values from Fisher’s Exact test. Blue dots in each panel
reflect the positions (but not values) of loci simulated under
selection.
Supplemental Figure 5. Manhattan plot of results from analysis of 0.33x
coverage data simulated with low background FST. A)
FST estimated by PoolParty2; B) 100Kbp sliding
windows of FST estimated by PoolParty2; C)
-Log10 significance (p) values from Fisher’s Exact test of allele
frequencies between populations calculated by PoolParty2; D)
FST estimated by angsd; E) 100Kbp sliding
windows of FST estimated by angsd; F)
Likelihood ratio of frequency differences between population calculated
by angsd; G) Mean linkeage in 100Kbp windows estimated for
sites identified by PoolParty2, genotype likelihood estimated
by angsd, and linkage calculated with ngsLD, with
series of 10+ windows with mean linkage above 2x the interquartile range
in red; H-L) Local score values for ξ values reflecting the
70th, 80th, 90th,
95th, and 99th quantile of
significance values from Fisher’s Exact test. Blue dots in each panel
reflect the positions (but not values) of loci simulated under
selection.
Supplemental Figure 6. Manhattan plot of results from analysis of 1x
coverage data simulated with low background FST. A)
FST estimated by PoolParty; B) 100Kbp sliding
windows of FST estimated by PoolParty2; C)
-Log10 significance (p) values from Fisher’s Exact test of allele
frequencies between populations calculated by PoolParty2; D)
FST estimated by angsd; E) 100Kbp sliding
windows of FST estimated by angsd; F)
Likelihood ratio of frequency differences between population calculated
by angsd; G) Mean linkeage in 100Kbp windows estimated for
sites identified by PoolParty2, genotype likelihood estimated
by angsd, and linkage calculated with ngsLD, with
series of 10+ windows with mean linkage above 2x the interquartile range
in red; H-L) Local score values for ξ values reflecting the
70th, 80th, 90th,
95th, and 99th quantile of
significance values from Fisher’s Exact test. Blue dots in each panel
reflect the positions (but not values) of loci simulated under
selection.
Supplemental Figure 7. Quality control analyses for steelhead from
Skamania
hatchery or two natural origin populations, Willametter River (Eagle
Creek) or Lewis River. A) Proportion of the genome covered at various
minimum coverage limits. B) Mean coverage across sites for each
population sample (summed across individuals) C) Principal components
analysis of allele frequencies for variants with a maximum allele
frequency difference ≤ 0.9. D) Mean proportion across population samples
of each chromosome. E) Number of SNPs per 10 Kb across the genome (SNP
density).