2.3 | Sequencing and data processing
We followed the bioinformatic pipeline described in Andrews at al. (2018; Figure 1), with slight modification for STACKS version 2.0 (Catchen et al. 2013). Briefly, a custom PERL script was used to flip the raw reads so that each 140 bp read was aligned starting with the barcode, and the Sbf1 cutsite sequence. STACKS 2.0 (Catchen et al. 2013) program process_radtags was used to demultiplex the raw reads followed by program clone_filter to remove PCR duplicates. BOWTIE2 version 2.3.4.3 (Langmead and Salzberg; 2012) was used to align the sequences to Sebastes nigrocinctus reference genomes downloaded from the ncbi database (https://www.ncbi.nlm.nih.gov/genome/14568). The S. nigrocinctus aligned reads were then processed using the refmap.pl pipeline in STACKS 2.0. Filtering of the final set of SNPs was done using POPULATIONS module in STACKS 2.0 with the minimum percent of individuals genotyped at a locus in a population set at 10% and the minimum global minor allele frequency of SNPs set at 0.1. Subsequent analysis was conducted using R statistical software (R Core Team 2016) using data in genepop format exported from POPULATIONS module.
CLUSTER analysis was conducted using package adegenet (Jombart et Devillard; 2010) and poppr (Kamvar et al., 2014, 2015) using all samples, including the seven remaining replicate pairs (one replicate was not recovered during sequencing) to select the optimal set of filters for removing individuals and loci based on the level of missing data. These filter settings were varied until the CLUSTER plot showed the paired replicates to be most closely related. This resulted in removal of loci which were absent in at least 15% of individuals and genotypes having more than 20% of total identified loci missing. For subsequent analyses, only one from each pair of replicate samples with the most loci was retained. We used the R package sequoia(Huisman, 2017) to identify related individuals, up to half – siblings; this program is specifically designed to use large SNP datasets and does not require a parent to be present in the sample. This was done for each of the two cohorts in order to make sure no related individuals were included in the Genome-Environment Association (GEA) tests.
We estimated the number of ancestral populations represented in the sample using the LEA R package (Frichot and François 2015). The analysis employed population clustering analysis with sparse non-negative matrix factorization optimization (sNMF) (Frichot et al. 2014) to estimate number of ancestral populations represented in the sample. The number of populations was determined from the cross entropy criteria and Cattell’s rule (Cattell, 1966) from the sNMF output. We favored the sNMF routine because it is robust to departures from Hardy-Weinberg equilibrium as compared to Bayesian and Maximum Likelihood approaches (Frichot et al. 2014). We also compared the sNMF results to STRUCTURE 2.3.4 (Pritchard, Stephens & Donnelly 2000) derived population clustering.
We examined whether selection pressure is consistent from year-to-year by testing for a difference in the number of private alleles, or homozygous loci in each year. The number of private alleles that were only found in 2014 but not in 2015 was quantified specifically to each sNMF derived population and across all SNPs. To compute whether the number of private alleles was significantly different between years, we needed to account for the difference in sample sizes between the years. We wrote a custom bootstrap routine in R to create a null distribution of the expected number of lost alleles in the smaller sample size by selecting without replacement from the larger year’s sample, the reduced sample size. The significance (p=0.05) was then based on where the observed number of private alleles lies in the null distribution.
Latent factors mixed model (LFMM) algorithm in R package LEA(Frichot, 2014) was conducted to identify loci influenced by selection. For subsequent analysis we imputed any remaining missing data (3.5% in 2014 and 4.1% in 2015). The missing genotypes were imputed using the random forest algorithm in the R packages randomForestSRC andradiator (Gosselin, 2018). We used the R package hierfstat(Goudet, 2005) to estimate pairwise FST according to Nei (1987). Significance of FST was calculated through 1,000 permutations of population indices. PCA analysis was conducted using the dudi.pca routine in ade4 R package (Dray 2007). Environmental variables included in the GEA included sample date and latitude, sea water temperature, and chlorophyll concentration. Phenotypic metrics were also included in the GEA, including percent lipid content and condition index. This analysis was done for 2014 and 2015 data separately with four latent factors to account for population structure while testing for genome-environment association. This was followed by nucleotide BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi) search of nucleotide sequences and their corresponding protein coding gene regions where selection may be occurring. Loci annotation and BLAST searches of the associated 140 bp sequences were accepted when below the nucleotide and protein e-value threshold of \(1x10^{-10}\). BLAST e-value score is the probability that the similarity is due to chance.
The gene ontology (GO) enrichment analysis was used to determine whether the groups of genes associated with each of the environmental variables was enriched for certain biological processes. This analysis was done by querying geneontology.org database using zebra fish (Danio rerio ) as a reference organism, and the alpha level was set at p=0.05 with no multiple test correction applied. Subsequently, the biocyc.org and informatics.jax.org were queried to determine general biological functions of the gene aggregates.