Genome-wide association study – exploring correlations between
allelic frequencies and colour morphs
Because population stratification is known to be a confounding factor in
genome-wide association studies, we tested for the existence of
population clusters in our dataset with a PCA performed in SNPrelate and
a discriminant analyses of principal components (DAPC) in R’s package
adegenet (Jombart, 2008). We searched for the likely number of clusters
with the function find.clusters and chose the likely number of K
according to Bayesian Information Criteria (BIC). The two main linear
discriminants, i.e., explaining most the largest amount of variation due
to putative population structure, were used as covariates in downstream
genome-wide association analyses (GWAS). Three approaches were employed
to detect associations between colour (binary phenotype eithergrey or brown ) and underlying genetic variation among
RADtags.
We
used three separate techniques also to assess the robustness of each
technique in the dataset and to identify concordance between techniques.
First, we utilized Plink2’s (Chang et al., 2015) generalized linear
model logistic-Firth hybrid regression while after applying the
following filters to the SNP dataset: Hardy-Weinberg Equilibrium =
p-value < 1E-6; minor allelic frequency = 0.05; SNP presence =
80%; and r2 = 0.1 for pairwise
linkage-disequilibrium. We then utilized GEMMA (Zhou & Stephens, 2012)
with the following filters: SNP presence = 5%, minor allelic frequency
= 0.05, Hardy-Weinberg Equilibrium = 0.001. Family structure and/or
pedigree was considered by calculating and incorporating a relationship
matrix with centered genotypic variance. In GEMMA, associations between
SNPs and covariates were inspected with a univariate linear mixed model
with one intercept and two covariates corresponding to linear
discriminants obtained with DAPC The p-values to consider
associations as significant or suggestive were defined by correcting for
multiple comparisons according to the number of utilized SNPs following
Guo et al. (2017). Specifically, the thresholds (log(p) ) defined
for suggestive and significant associations corrected for the number of
input loci ranged between 3.65 and 4.95 for the 4755 loci utilized in
PLINK’s GLM after filtering and between 4.10 and 5.40 for the 12673 loci
utilized in GEMMA.
Lastly, we also used the R package wtest to further explore
allele-allele interactions (Sun et al., 2019). Wtest is based on
its namesake, the W-test (Wang et al., 2016), developed for post-hoc
exploration of epistatic interactions.