Exploratory Analysis
PCA was unable to clearly separate the four groups of bees according to
the combination of dance behaviour (dancer (D)/non-dancer (N)) and
distance perceived (long (L)/short (S)) (Figure 2). However, when
considering the dance factor alone, we obtained a low-dimensional
representation of the data using only a few Principal
Components (PCs), which produced easily distinguishable clusters of
samples, i.e., dancers and non-dancers (Figure 3). The representation was
dominated by PC1, which accounted for 51% of the variance in the data, while
PC2 accounted for only 14.4% (Figure 3; see also Supplemental
Information).
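For readers who wish to reproduce this kind of summary, the proportion of variance captured by each PC can be obtained from a singular value decomposition of the centred expression matrix. The following is a minimal sketch only, not the pipeline used in this study: the function name `pca_svd`, the toy matrix and the group shift are hypothetical, and the data are simulated rather than taken from our samples.

```python
import numpy as np

def pca_svd(X):
    """PCA of a samples-by-genes matrix via SVD.

    Returns sample scores, gene loadings (one row per PC) and the
    fraction of total variance explained by each PC.
    """
    Xc = X - X.mean(axis=0)                      # centre each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * S                               # sample coordinates on each PC
    explained = S**2 / np.sum(S**2)              # variance fraction per PC
    return scores, Vt, explained

# Simulated toy data (assumption): 12 samples x 50 genes, with the
# first 10 genes shifted upwards in the six "N" samples.
rng = np.random.default_rng(0)
labels = np.array(["D"] * 6 + ["N"] * 6)
X = rng.normal(size=(12, 50))
X[labels == "N", :10] += 5.0

scores, loadings, explained = pca_svd(X)
print("Variance explained by PC1 and PC2:", explained[:2])
print("PC1 scores, D:", np.round(scores[labels == "D", 0], 1))
print("PC1 scores, N:", np.round(scores[labels == "N", 0], 1))
```

In this toy setting the two groups fall on opposite sides of PC1. Note that the sign of each PC is arbitrary, so only the relative positions of samples are meaningful.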
In particular, dancers were clustered together towards the centre of the
plot, showing lower variance than non-dancers, which is indicative of
more consistent global patterns of gene expression in dancers than in
non-dancers. We also found four non-dancers (2 NL + 2 NS) that formed a
separate cluster further along the first Principal Component (PC1).
These samples showed the highest scores on PC1, with values around
200, much higher than those of dancers (centred around 0) and the other
non-dancers (all below 0). We identified the three genes with maximal
loadings for PC1: GB52651 (diphthine-ammonia ligase), GB49108
(PDZ domain-containing protein 8) and GB44753 (an uncharacterized
gene). Moreover, dancers showed the highest levels of positive
correlation between global patterns of gene expression and the first two
principal components, PC1 and PC2 (DL = 0.711 and DS = 0.574; Figure 2).
Overall, the data show a clear underlying structure with
respect to the dance component (dancers vs non-dancers), while no evident
structure appeared to be associated with the perceived distance (long vs
short). Based on these findings, we focused our ML analyses
on the “dance” factor alone.
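As an illustration of how outlying samples and the genes driving a component can be inspected, the sketch below ranks samples by their PC1 coordinate and genes by the absolute value of their PC1 loading. It is a hedged example under stated assumptions: the gene identifiers, the matrix and the injected outlier signal are all hypothetical and simulated, and the code is not the analysis that produced the results above.

```python
import numpy as np

# Simulated toy data (assumption): 10 samples x 30 genes.
rng = np.random.default_rng(1)
gene_ids = [f"gene_{i}" for i in range(30)]      # hypothetical identifiers
X = rng.normal(size=(10, 30))
X[:3, 4] += 20.0                                 # three samples strongly express gene_4

# PCA via SVD of the centred matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

pc1_sample_coords = U[:, 0] * S[0]               # one value per sample
pc1_gene_loadings = Vt[0]                        # one value per gene

# Samples lying furthest from the origin along PC1.
extreme_samples = np.argsort(np.abs(pc1_sample_coords))[::-1][:3]

# Genes with the largest absolute loading on PC1.
top_idx = np.argsort(np.abs(pc1_gene_loadings))[::-1][:3]
top_genes = [gene_ids[i] for i in top_idx]

print("Most extreme samples on PC1:", sorted(int(i) for i in extreme_samples))
print("Top PC1 genes:", top_genes)
```

Ranking by absolute loading is one simple choice; because the sign of a PC is arbitrary, the direction of a gene's contribution should be read relative to the sample coordinates rather than in isolation.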