2.3 | Sequencing and data processing
We followed the bioinformatic pipeline described in Andrews at al.
(2018; Figure 1), with slight modification for STACKS version 2.0
(Catchen et al. 2013). Briefly, a custom PERL script was used to flip
the raw reads so that each 140 bp read was aligned starting with the
barcode, and the Sbf1 cutsite sequence. STACKS 2.0 (Catchen et al. 2013)
program process_radtags was used to demultiplex the raw reads
followed by program clone_filter to remove PCR duplicates.
BOWTIE2 version 2.3.4.3 (Langmead and Salzberg; 2012) was used to align
the sequences to Sebastes nigrocinctus reference genomes
downloaded from the ncbi database
(https://www.ncbi.nlm.nih.gov/genome/14568). The S.
nigrocinctus aligned reads were then processed using the refmap.pl
pipeline in STACKS 2.0. Filtering of the final set of SNPs was done
using POPULATIONS module in STACKS 2.0 with the minimum percent of
individuals genotyped at a locus in a population set at 10% and the
minimum global minor allele frequency of SNPs set at 0.1. Subsequent
analysis was conducted using R statistical software (R Core Team 2016)
using data in genepop format exported from POPULATIONS module.
CLUSTER analysis was conducted using package adegenet (Jombart et
Devillard; 2010) and poppr (Kamvar et al., 2014, 2015) using all
samples, including the seven remaining replicate pairs (one replicate
was not recovered during sequencing) to select the optimal set of
filters for removing individuals and loci based on the level of missing
data. These filter settings were varied until the CLUSTER plot showed
the paired replicates to be most closely related. This resulted in
removal of loci which were absent in at least 15% of individuals and
genotypes having more than 20% of total identified loci missing. For
subsequent analyses, only one from each pair of replicate samples with
the most loci was retained. We used the R package sequoia(Huisman, 2017) to identify related individuals, up to half – siblings;
this program is specifically designed to use large SNP datasets and does
not require a parent to be present in the sample. This was done for each
of the two cohorts in order to make sure no related individuals were
included in the Genome-Environment Association (GEA) tests.
We estimated the number of ancestral populations represented in the
sample using the LEA R package (Frichot and François 2015). The
analysis employed population clustering analysis with sparse
non-negative matrix factorization optimization (sNMF) (Frichot et al.
2014) to estimate number of ancestral populations represented in the
sample. The number of populations was determined from the cross entropy
criteria and Cattell’s rule (Cattell, 1966) from the sNMF output. We
favored the sNMF routine because it is robust to departures from
Hardy-Weinberg equilibrium as compared to Bayesian and Maximum
Likelihood approaches (Frichot et al. 2014). We also compared the sNMF
results to STRUCTURE 2.3.4 (Pritchard, Stephens &
Donnelly 2000)
derived population clustering.
We examined whether selection pressure is consistent from year-to-year
by testing for a difference in the number of private alleles, or
homozygous loci in each year. The number of private alleles that were
only found in 2014 but not in 2015 was quantified specifically to each
sNMF derived population and across all SNPs. To compute whether the
number of private alleles was significantly different between years, we
needed to account for the difference in sample sizes between the years.
We wrote a custom bootstrap routine in R to create a null distribution
of the expected number of lost alleles in the smaller sample size by
selecting without replacement from the larger year’s sample, the reduced
sample size. The significance (p=0.05) was then based on where the
observed number of private alleles lies in the null distribution.
Latent factors mixed model (LFMM) algorithm in R package LEA(Frichot, 2014) was conducted to identify loci influenced by selection.
For subsequent analysis we imputed any remaining missing data (3.5% in
2014 and 4.1% in 2015). The missing genotypes were imputed using the
random forest algorithm in the R packages randomForestSRC andradiator (Gosselin, 2018). We used the R package hierfstat(Goudet, 2005) to estimate pairwise FST according
to Nei (1987). Significance of FST was calculated
through 1,000 permutations of population indices. PCA analysis was
conducted using the dudi.pca routine in ade4 R package (Dray
2007). Environmental variables included in the GEA included sample date
and latitude, sea water temperature, and chlorophyll concentration.
Phenotypic metrics were also included in the GEA, including percent
lipid content and condition index. This analysis was done for 2014 and
2015 data separately with four latent factors to account for population
structure while testing for genome-environment association. This was
followed by nucleotide BLAST
(https://blast.ncbi.nlm.nih.gov/Blast.cgi) search of nucleotide
sequences and their corresponding protein coding gene regions where
selection may be occurring. Loci annotation and BLAST searches of the
associated 140 bp sequences were accepted when below the nucleotide and
protein e-value threshold of \(1x10^{-10}\). BLAST e-value score is the
probability that the similarity is due to chance.
The gene ontology (GO) enrichment analysis was used to determine whether
the groups of genes associated with each of the environmental variables
was enriched for certain biological processes. This analysis was done by
querying geneontology.org database using zebra fish (Danio rerio )
as a reference organism, and the alpha level was set at p=0.05 with no
multiple test correction applied. Subsequently, the biocyc.org and
informatics.jax.org were queried to determine general biological
functions of the gene aggregates.