Genome-wide association study – exploring correlations between allelic frequencies and colour morphs
Because population stratification is known to be a confounding factor in genome-wide association studies, we tested for the existence of population clusters in our dataset with a PCA performed in SNPrelate and a discriminant analyses of principal components (DAPC) in R’s package adegenet (Jombart, 2008). We searched for the likely number of clusters with the function find.clusters and chose the likely number of K according to Bayesian Information Criteria (BIC). The two main linear discriminants, i.e., explaining most the largest amount of variation due to putative population structure, were used as covariates in downstream genome-wide association analyses (GWAS). Three approaches were employed to detect associations between colour (binary phenotype eithergrey or brown ) and underlying genetic variation among RADtags. We used three separate techniques also to assess the robustness of each technique in the dataset and to identify concordance between techniques. First, we utilized Plink2’s (Chang et al., 2015) generalized linear model logistic-Firth hybrid regression while after applying the following filters to the SNP dataset: Hardy-Weinberg Equilibrium = p-value < 1E-6; minor allelic frequency = 0.05; SNP presence = 80%; and r2 = 0.1 for pairwise linkage-disequilibrium. We then utilized GEMMA (Zhou & Stephens, 2012) with the following filters: SNP presence = 5%, minor allelic frequency = 0.05, Hardy-Weinberg Equilibrium = 0.001. Family structure and/or pedigree was considered by calculating and incorporating a relationship matrix with centered genotypic variance. In GEMMA, associations between SNPs and covariates were inspected with a univariate linear mixed model with one intercept and two covariates corresponding to linear discriminants obtained with DAPC The p-values to consider associations as significant or suggestive were defined by correcting for multiple comparisons according to the number of utilized SNPs following Guo et al. (2017). Specifically, the thresholds (log(p) ) defined for suggestive and significant associations corrected for the number of input loci ranged between 3.65 and 4.95 for the 4755 loci utilized in PLINK’s GLM after filtering and between 4.10 and 5.40 for the 12673 loci utilized in GEMMA.
Lastly, we also used the R package wtest to further explore allele-allele interactions (Sun et al., 2019). Wtest is based on its namesake, the W-test (Wang et al., 2016), developed for post-hoc exploration of epistatic interactions.