PoolParty2: An integrated pipeline for analyzing pooled or indexed low coverage whole genome sequencing data to discover the genetic basis of diversity
Stuart Willis*1, Steven Micheletti2, Kimberly R. Andrews3, and Shawn Narum1
1Hagerman Genetics Lab, Columbia River Inter-Tribal Fish Commission, Hagerman, ID, USA
2Dept. Zoology, University of British Columbia, Vancouver, BC, Canada
3Institute for Interdisciplinary Data Sciences, University of Idaho, Moscow, ID, USA
4Dept. Fishery Science, Columbia River Inter-Tribal Fish Commission, Portland, OR, USA
*corresponding author’s email: swillis@critfc.org
Abstract
Whole genome sequencing data allow survey of variation from across the genome, reducing the constraint of balancing genome sub-sampling with recombination rates and linkage between sampled markers and target loci. As sequencing costs decrease, low coverage whole genome sequencing of pooled or indexed-individual samples is commonly utilized to identify loci associated with phenotypes or environmental axes in non-model organisms. There are, however, relatively few publicly available bioinformatic pipelines designed explicitly to analyze these types of data, and fewer still that process the raw sequencing data, provide useful metrics of quality control, and then execute analyses. Here, we present an updated version of a bioinformatics pipeline called PoolParty2 that can effectively handle either pooled or indexed DNA samples and includes new features to improve computational efficiency. Using simulated data, we demonstrate the ability of our pipeline to recover segregating variants, estimate their allele frequencies accurately, and identify genomic regions harboring loci under selection. Based on the simulated data set, we benchmark the efficacy of our pipeline with another bioinformatic suite, angsd, and illustrate the compatibility and complementarity of these suites by using angsd to generate genotype likelihoods as input for identifying linkage outlier regions using alignment files and variants provided by PoolParty2. Finally, we apply our updated pipeline to an empirical dataset of low coverage whole genomic data from uncurated population samples of Columbia River steelhead trout (Oncorhynchus mykiss ), results from which demonstrate the genomic impacts of decades of artificial selection in a prominent hatchery stock.
Introduction
A primary goal of molecular ecology is to understand the genetic basis of diversity such as targets of divergent selection or loci underlying heritable life history variations or ecotypes. Critical to this endeavor is the ability to survey the genome to discover genetic variants associated with phenotypic differences or environmental axes (Günther & Coop, 2013; Hoban et al., 2016; Paril, Balding, & Fournier-Level, 2022). Massively parallel or ‘next-generation’ sequencing has dramatically decreased the cost of surveying genetic variation across statistically meaningful numbers of individuals and has made these kinds of investigations accessible for researchers working with limited budgets on non-model organisms. However, despite the rapid decrease in per-base sequencing costs, sequencing the complete genome of each surveyed individual at high coverage is often not practical, in part because as the cost of sequencing has decreased but demands for statistically robust sample sizes have become more ardent (Schlötterer, Tobler, Kofler, & Nolte, 2014). As a result, geneticists are still faced with the task of determining the appropriate compromise between number of reads devoted to surveying each individual, which in many cases determines the extent of the genome that can be observed, and the number of individuals surveyed (Lou, Jacobs, Wilder, & Therkildsen, 2021).
This compromise has been addressed in a number of ways depending on the goals of the individual project. Many researchers have opted to survey only a fraction of the genome, creating ‘reduced representation’ libraries wherein sequencing coverage is spread across fewer, reproducible subsets of loci (e.g. Baird et al., 2008). With careful tuning of library preparation and sequencing methods, the coverage at each locus may be sufficient to confidently infer genotypes across nearly all individuals at hundreds to tens of thousands of variable loci (Andrews, Good, Miller, Luikart, & Hohenlohe, 2016; Puritz et al., 2014). For analyses where individual genotypes are important but high genomic density of loci is not critical, such as determining relatedness or migration among recently diverged populations, these reduced representation techniques can produce data cost effectively for hundreds of individuals (e.g. Willis, Hollenbeck, Puritz, & Portnoy, 2022). However, while these techniques provide data on many more loci than what was historically accessible, only a fraction of the genome is ultimately surveyed, meaning that for species for which linkage blocks are typically less than 100Kb, many linkage blocks may not be surveyed. As a result, except in cases of regions of high linkage disequilibrium such as inversions or strong selective sweeps, investigations that only survey a few thousand linkage groups may fail to identify loci strongly associated with selection or heritable phenotypic variation (Lowry et al., 2017; Tiffin & Ross-Ibarra, 2014).
On the other hand, for many association and genome scan methods the input is allele frequencies rather than individual genotypes. And notably, because of sampling variance, sampling more individuals at low coverage actually provides more accurate estimates of phenotype or population allele frequencies than sequencing fewer individuals at high coverage (Futschik & Schlötterer, 2010; Günther & Coop, 2013; Schlötterer et al., 2014; Zhu, Bergland, González, & Petrov, 2012). Many analyses, including those that compare allele frequencies between phenotypic variants or populations situated along an environmental gradient and depend on high density sampling across linkage groups to discover the regions of highest divergence, may thus be performed more effectively by low coverage whole genome sequencing (lcWGS) (Lamichhaney et al., 2012; Lou et al., 2021; Schlötterer et al., 2014; Therkildsen & Palumbi, 2017). Moreover, there have been a proliferation of analyses that are able to account for uncertainty in the genotype of each individual (likelihoods), even with data sequenced with <1x coverage per individual (Lou et al., 2021). This low coverage sequencing approach provides compromise among the portion of the genome surveyed, accurate allele frequency estimates, and in many cases analyses that require individual genotype data (Lou et al., 2021; Therkildsen & Palumbi, 2017). However, while lcWGS data may be highly appropriate for these types of investigations and the toolkit for analyzing allele frequency and genotype probability data is expanding, there remain few pipelines specifically designed to take unmapped lcWGS reads and convey the data through quality control and bioinformatic analyses.
To address that need, an integrated, modular bioinformatic pipeline, PoolParty, was developed that facilitates the use of lcWGS data to search for genomic regions showing strong divergence between samples with discrete phenotypic differences or other group-wise characteristics (Micheletti & Narum, 2018). This pipeline has been applied to detect genome-wide genetic association across multiple species (e.g. Aguirre-Ramirez, Velasco-Cuervo, & Toro-Perea, 2021; Horn, Kamphaus, Murdoch, & Narum, 2020; Lyu et al., 2021; Ren et al., 2021). Although most published applications have utilized data from libraries of pooled DNA, the pipeline can also utilize data from individuals sequenced in multiplex using indexed or barcoded libraries, which allows a normalization procedure that corrects for uneven contribution to group allele frequencies across individuals. This normalization is a pseudo-genotying method wherein each individual, regardless of total reads, is allowed to contribute only one or two alleles per locus to the count from which allele frequencies are calculated, depending on that individual’s depth of coverage and the ratio of major and minor alleles (>10:1 is considered a homozygote; Figure 1). PoolParty shares this goal of managing uneven contribution among individuals when estimating allele frequency with another bioinformatic suite designed for use with lcWGS data, angsd, which also generates individual genotype likelihood or posterior probabilities from lcWGS data (Korneliussen, Albrechtsen, & Nielsen, 2014). However, PoolParty takes sequence read files as input, performs sequence cleaning and mapping to a reference genome, produces numerous assurance reports regarding sequence and mapping quality, and facilitates several analyses to identify regions of significant genomic divergence between samples, while angsd requires mapped read alignments produced by other tools as input. Moreover, as demonstrated below, the alignment files created by PoolParty are compatible input to angsd, making these complementary bioinformatic tools for lcWGS data analysis.
To demonstrate various utilities and upgrades of the PoolParty2 pipeline and compatibility with angsd, we apply it to two lcWGS datasets, one simulated and one empirical, that reflect the type of questions to which PoolParty2 may be routinely applied. We utilize data simulated to reflect different demographic contexts and degrees of sequence coverage to show the relative strengths, accuracy, and complementarity of PoolParty2 and angsd to identify segregating loci, estimate their allele frequencies, and identify outlier loci and the boundaries regions affected by selection. Then, using barcoded lcWGS data from natural and hatchery populations of steelhead trout (anadromous Oncorhynchus mykiss ), we demonstrate the potential of integrated application of these bioinformatic suites to identify regions under selection in landscape-level population samples.