Aligning the gill transcriptome to the reference genome, filtering
and normalization of transcripts
The reads were mapped towards the Gasterosteus aculeatus genome
with corresponding gene annotation (BROAD S1 Ensembl release 90, 2017)
using bowtie2 v2.2.2 and tophat2 v2.0.14, and assembly of transcripts
and expression performed using cufflinks v2.1.1. according to the
Trapnell et al. protocol (Trapnell et al. 2012). All steps were done on
the high-performance computing cluster Abel at the University of Oslo
(now replaced by the computing clusters hosted by Sigma2).
All analyses were run in R 4.0.1. (R Development Core Team 2020) using
the package edgeR (Robinson et al. 2010). First, the number of expressed
reads for the individuals were compared as if an individual by chance
was sequenced at greater depth, that might give this individual an
un-naturally high gene expression count. The number of expressed reads
before filtering varied from 3.832.109 to 25.515.142 (with an average of
8.756.949). The library sizes were adjusted by transforming the
raw-scale libraries to logged counts per million (CMP), before filtering
out the genes with low expression values, as low read counts are a
significant source of measurement error in differential expression
analyses (Robinson and Smyth 2007). Filtering out insignificant genes
also increases detection power of significant discoveries (Bourgon et
al. 2010). Therefore, a total of 597 genes (2.87%) that were
unexpressed across all samples were excluded. We further excluded genes
that was expressed at low levels across the individuals by the use of
the function “FilterByExpr”; keeping genes with a CMP-value of
> 0.68, and being expressed in at least three individuals,
as that was the lower group size (Supplementary Figure 1 ).
As small differences in expression of highly expressed genes between
samples can give the appearance that many of the low-expressed genes are
differentially expressed between treatments, we normalized the read
counts among libraries with the function “calcNormFactors”. This
function uses the method of trimmed means of m values (TMM; Robinson and
Oshlack 2010) and normalizes the data by removing the extremely lowly
and highly expressed genes, and also removes the genes that are very
differentially expressed between samples, by keeping genes that was
expressed at least 6-7 times in the smallest sample, and being expressed
in at least two of the libraries. The “calcNormFactors” then compares
the total count for this subset of genes between the two samples and
calculates a set of scaling factors for the library size that minimizes
the log-fold changes between samples for most genes, as the method
assumes that the majority of genes are equally expressed between any two
samples. The scaling factors in this study varied from 0.764 to 1.281
(Supplementary Table 1 ). The calculated effective library size
is then used as the original library size in all downstream analyses
(Supplementary Figure 2 ). Out of the 20.789 genes retrieved
from the stickleback genome (Ensembl version 90), a total of 16.211
genes (77.9%) were kept for further analysis.
As the variance in RNA-seq measurement of gene expression is typically
overdispersed, a negative binominal distribution is used to model the
variance. We calculated the common dispersion, using the same value for
dispersion when modelling the variance for each gene, with the function
“estimateCommonDisp”, and found the biological coefficient of
variation to be 0.364, meaning that the true abundance for each gene can
vary up or down by 36.4% between replicates. Each gene likely differs
in dispersion, and the common dispersion model was extended to model the
mean variance relationship between genes and the dispersion estimation
per gene was calculated, shrinking the dispersion toward the trended
dispersion due to low sample sizes, by the function
“estimateGLMTagwiseDisp”. The results are visualized with a
multidimensional scaling (MDS) plot, using the pairwise biological
coefficients of variation as a distance measure to visualize the overall
expressional relationships between individuals, and a principal
component analysis (PCA) of the gene expression profiles by the use of
the prcomp function in R on log2-transformed
data.