Aligning the gill transcriptome to the reference genome, filtering and normalization of transcripts
The reads were mapped towards the Gasterosteus aculeatus genome with corresponding gene annotation (BROAD S1 Ensembl release 90, 2017) using bowtie2 v2.2.2 and tophat2 v2.0.14, and assembly of transcripts and expression performed using cufflinks v2.1.1. according to the Trapnell et al. protocol (Trapnell et al. 2012). All steps were done on the high-performance computing cluster Abel at the University of Oslo (now replaced by the computing clusters hosted by Sigma2).
All analyses were run in R 4.0.1. (R Development Core Team 2020) using the package edgeR (Robinson et al. 2010). First, the number of expressed reads for the individuals were compared as if an individual by chance was sequenced at greater depth, that might give this individual an un-naturally high gene expression count. The number of expressed reads before filtering varied from 3.832.109 to 25.515.142 (with an average of 8.756.949). The library sizes were adjusted by transforming the raw-scale libraries to logged counts per million (CMP), before filtering out the genes with low expression values, as low read counts are a significant source of measurement error in differential expression analyses (Robinson and Smyth 2007). Filtering out insignificant genes also increases detection power of significant discoveries (Bourgon et al. 2010). Therefore, a total of 597 genes (2.87%) that were unexpressed across all samples were excluded. We further excluded genes that was expressed at low levels across the individuals by the use of the function “FilterByExpr”; keeping genes with a CMP-value of > 0.68, and being expressed in at least three individuals, as that was the lower group size (Supplementary Figure 1 ).
As small differences in expression of highly expressed genes between samples can give the appearance that many of the low-expressed genes are differentially expressed between treatments, we normalized the read counts among libraries with the function “calcNormFactors”. This function uses the method of trimmed means of m values (TMM; Robinson and Oshlack 2010) and normalizes the data by removing the extremely lowly and highly expressed genes, and also removes the genes that are very differentially expressed between samples, by keeping genes that was expressed at least 6-7 times in the smallest sample, and being expressed in at least two of the libraries. The “calcNormFactors” then compares the total count for this subset of genes between the two samples and calculates a set of scaling factors for the library size that minimizes the log-fold changes between samples for most genes, as the method assumes that the majority of genes are equally expressed between any two samples. The scaling factors in this study varied from 0.764 to 1.281 (Supplementary Table 1 ). The calculated effective library size is then used as the original library size in all downstream analyses (Supplementary Figure 2 ). Out of the 20.789 genes retrieved from the stickleback genome (Ensembl version 90), a total of 16.211 genes (77.9%) were kept for further analysis.
As the variance in RNA-seq measurement of gene expression is typically overdispersed, a negative binominal distribution is used to model the variance. We calculated the common dispersion, using the same value for dispersion when modelling the variance for each gene, with the function “estimateCommonDisp”, and found the biological coefficient of variation to be 0.364, meaning that the true abundance for each gene can vary up or down by 36.4% between replicates. Each gene likely differs in dispersion, and the common dispersion model was extended to model the mean variance relationship between genes and the dispersion estimation per gene was calculated, shrinking the dispersion toward the trended dispersion due to low sample sizes, by the function “estimateGLMTagwiseDisp”. The results are visualized with a multidimensional scaling (MDS) plot, using the pairwise biological coefficients of variation as a distance measure to visualize the overall expressional relationships between individuals, and a principal component analysis (PCA) of the gene expression profiles by the use of the prcomp function in R on log2-transformed data.