Exploratory Analysis
PCA did not clearly separate the four groups of bees defined by the combination of dance behaviour (dancer (D) or non-dancer (N)) and perceived distance (long (L) or short (S)) (Figure 2). However, when considering the dance factor alone, a low-dimensional projection of the data onto only a few Principal Components (PCs) produced easily distinguishable clusters of samples, i.e., dancers and non-dancers (Figure 3). The representation was dominated by PC1, which accounted for 51% of the variance in the data, while PC2 accounted for only 14.4% (Figure 3; see also Supplemental Information).
In particular, dancers clustered together towards the centre of the plot and showed lower variance than non-dancers, which is indicative of more consistent global patterns of gene expression in dancers than in non-dancers. We also found four non-dancers (2 NL + 2 NS) that formed a separate cluster further along the first Principal Component (PC1). These samples showed the highest scores on PC1, with values around 200 that were much higher than those of dancers (centred around 0) and of the other non-dancers (all below 0). We identified the three genes with maximal loadings for PC1: GB52651 (diphthine-ammonia ligase), GB49108 (PDZ domain-containing protein 8) and GB44753 (an uncharacterized gene). Moreover, dancers showed the highest levels of positive correlation between global patterns of gene expression and the first two principal components, PC1 and PC2 (DL = 0.711 and DS = 0.574, Figure 2). Overall, the data show a clear underlying structure with respect to the dance component (dancers vs. non-dancers), while no evident structure appeared to be associated with perceived distance (long vs. short). Based on these findings, we proceeded with our ML analyses, focusing on the “dance” factor alone.
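The quantities reported above (the fraction of variance explained by each PC, per-sample scores, and the genes with the largest PC1 loadings) can be illustrated with a minimal sketch. It is not the pipeline used in this study: the expression matrix is synthetic, the function name `pca_summary` and the gene labels are hypothetical, and "loading" is assumed here to mean the coefficient of each gene in the PC1 direction.

```python
import numpy as np

def pca_summary(expr, gene_ids, n_top=3):
    """PCA via SVD on a samples x genes matrix (hypothetical helper).

    Returns the fraction of variance explained per PC, the sample
    scores, and the n_top genes with the largest absolute PC1 loading.
    """
    # Centre each gene (column) so that the PCs describe variance.
    centered = expr - expr.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Squared singular values are proportional to per-PC variance.
    explained = s**2 / np.sum(s**2)
    # Sample coordinates on each PC (the points drawn in a PCA plot).
    scores = u * s
    # Row 0 of vt holds the gene coefficients of PC1.
    pc1_loadings = vt[0]
    top_idx = np.argsort(np.abs(pc1_loadings))[::-1][:n_top]
    top_genes = [gene_ids[i] for i in top_idx]
    return explained, scores, top_genes

# Synthetic data only: 12 samples x 50 genes of random noise, with the
# last four samples shifted in three genes to mimic an outlying cluster.
rng = np.random.default_rng(0)
n_samples, n_genes = 12, 50
expr = rng.normal(size=(n_samples, n_genes))
expr[-4:, :3] += 10.0
gene_ids = [f"gene_{i}" for i in range(n_genes)]

explained, scores, top_genes = pca_summary(expr, gene_ids)
print("Variance explained by PC1 and PC2:", explained[:2])
print("PC1 scores per sample:", scores[:, 0])
print("Genes with the largest PC1 loadings:", top_genes)
```

In this toy setting, the shifted samples lie far from the rest along PC1 and the shifted genes carry the largest PC1 loadings, which mirrors the kind of pattern described for the four outlying non-dancers. The sign of a PC is arbitrary, so only the relative positions of samples along PC1 are meaningful.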