Selected Machine Learning Algorithms
We used Principal Component Analysis (PCA) to explore the underlying
structure of our dataset. Based on these preliminary analyses, we
selected the classification algorithms shown in Table 1; for a brief
description of each algorithm, see Supplemental Information. We also
used “Feature Selection” (FS) techniques to identify the features
(genes) most suitable for predicting dance behaviour from gene
expression data.
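As an illustration of the exploratory PCA step, the following is a
minimal sketch using NumPy. It is not the code used in this study: the
toy expression matrix, the group labels, and the helper names `pca` and
`make_toy_expression` are all hypothetical and serve only to show how
samples can be projected onto the leading principal components.

```python
import numpy as np


def pca(X, n_components):
    """Project the rows of X (samples) onto the top principal components.

    Returns the component scores and the fraction of total variance
    explained by each retained component.
    """
    Xc = X - X.mean(axis=0)  # centre each feature (gene)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)
    return scores, explained[:n_components]


def make_toy_expression(n_samples=40, n_genes=200, seed=0):
    """Hypothetical data: two behavioural groups, ten shifted genes."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n_samples // 2)
    X = rng.normal(size=(n_samples, n_genes))
    X[:, :10] += 3.0 * labels[:, None]  # assumed group effect
    return X, labels


X, labels = make_toy_expression()
scores, explained = pca(X, n_components=2)
print("fraction of variance explained by PC1 and PC2:", explained)
```

Plotting the two columns of `scores` against each other, coloured by
`labels`, gives the usual PCA scatter plot for inspecting group
structure.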
We explored three approaches with implicit feature ranking procedures,
chosen on the basis of previous studies (see Table 1 and Supplemental
Information): Random Forests (RF), Lasso and Elastic-Net Regularized
Generalized Linear Models (GLMNET), and Support Vector Machines (SVM).
Given the complexity of the data, we used a radial kernel for the SVM,
as supported by previous research. These methods, also known as
“embedded techniques”, rank the features using the already trained
classifier; as a result, the predictive power of the selected features
depends on the performance of the model. The selected approaches
converged on the same final set of predictors even under repeated
random starting conditions.
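The idea of an embedded ranking, and of checking its stability under
different random starting conditions, can be sketched as follows for
the RF case. This is an illustrative example only: the use of
scikit-learn, the synthetic data, and the helper names `make_toy_data`
and `top_features` are assumptions, not the implementation used in
this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def make_toy_data(n_samples=120, n_genes=30, n_informative=3, seed=0):
    """Hypothetical data: only the first n_informative genes carry signal."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_samples // 2)
    X = rng.normal(size=(n_samples, n_genes))
    X[:, :n_informative] += 4.0 * y[:, None]  # assumed class effect
    return X, y


def top_features(X, y, k, seed):
    """Rank features by the importances of an already trained forest."""
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1]  # best first
    return set(order[:k].tolist())


X, y = make_toy_data()
# Repeat the ranking with different random seeds and keep the features
# that are selected every time.
selections = [top_features(X, y, k=3, seed=s) for s in range(5)]
stable = set.intersection(*selections)
print("features selected under every seed:", sorted(stable))
```

Because the ranking is read off the fitted model, a poorly performing
classifier would yield an unreliable ranking, which is the dependence
noted above.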
Whereas embedded methods derive the importance of features from the
trained model, wrapper methods, such as Recursive Feature Elimination
(RFE), embed the model hypothesis search within the feature subset
search. RFE uses backward selection to assess the importance of each
feature to the model, repeatedly refitting the model and discarding the
lowest-ranked features. The features are ranked by the underlying
algorithm, which can be RF, SVM, or others. Given the promising
properties of RF for genomics studies, we used RF as the underlying
model for RFE.
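The backward-selection loop behind RFE with an RF ranker can be
sketched as follows. This is a minimal illustration under stated
assumptions, not the procedure used in this study: it relies on
scikit-learn, uses synthetic data, omits the resampling that is
normally used to choose the subset size, and the names
`rfe_random_forest`, `n_keep`, and `drop_per_round` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rfe_random_forest(X, y, n_keep, drop_per_round=3, seed=0):
    """Backward elimination: refit, drop the weakest features, repeat."""
    remaining = list(range(X.shape[1]))  # original column indices
    eliminated = []                      # in order of removal
    while len(remaining) > n_keep:
        model = RandomForestClassifier(n_estimators=100, random_state=seed)
        model.fit(X[:, remaining], y)
        n_drop = min(drop_per_round, len(remaining) - n_keep)
        # Positions (within `remaining`) of the least important features.
        worst = np.argsort(model.feature_importances_)[:n_drop]
        dropped = [remaining[i] for i in worst]
        eliminated.extend(dropped)
        remaining = [f for f in remaining if f not in dropped]
    return remaining, eliminated


# Hypothetical data: only the first three "genes" separate the classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 60)
X = rng.normal(size=(120, 30))
X[:, :3] += 4.0 * y[:, None]

kept, eliminated = rfe_random_forest(X, y, n_keep=3)
print("features kept:", sorted(kept))
```

scikit-learn also provides this procedure as
`sklearn.feature_selection.RFE`; the explicit loop is shown here only
to make the refit-and-discard cycle visible.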