Selected Machine Learning Algorithms
We used Principal Component Analysis (PCA) to explore the underlying structure of our dataset. Based on these preliminary analyses, we selected the classification algorithms shown in Table 1. For a brief description of these algorithms, see Supplemental Information. We also used “Feature Selection” (FS) techniques to identify the features (genes) that are most informative for predicting dance behaviour from gene expression data.
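As an illustration of this exploratory step, the following minimal sketch projects samples onto the leading principal components. It assumes Python with NumPy and scikit-learn, which the text does not specify; the function name `pca_scores`, the synthetic matrix, and its dimensions are hypothetical stand-ins and not the study's data or code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_scores(expression, n_components=2):
    """Project samples onto the leading principal components.

    expression: array of shape (n_samples, n_genes).
    Returns the sample scores and the explained variance ratio per component.
    """
    # Standardize each gene so that highly expressed genes do not dominate.
    scaled = StandardScaler().fit_transform(expression)
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(scaled)
    return scores, pca.explained_variance_ratio_


# Synthetic stand-in for an expression matrix (assumption): 40 samples x 200 genes,
# with the first 10 genes shifted in one of two hypothetical behavioural groups.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
expression = rng.normal(size=(40, 200))
expression[labels == 1, :10] += 2.0

scores, var_ratio = pca_scores(expression)
```

Plotting `scores[:, 0]` against `scores[:, 1]`, coloured by `labels`, gives the usual view of whether the groups separate along the main axes of variation.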
We explored three approaches with implicit feature ranking procedures based on previous studies (see Table 1 and also Supplemental Information): Random Forests (RF), Lasso and Elastic-Net Regularized Generalized Linear Models (GLMNET), and Support Vector Machines (SVM). Due to the complexity of the data, we used a radial kernel for SVM, as supported by previous research. These methods, also known as “embedded techniques”, rank the features using the already trained classifier; as a result, the predictive power of the selected features depends on the performance of the model. The selected approaches converged on the same final set of predictors even when subjected to repeated random starting conditions.
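The idea of ranking features from an already trained classifier can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: it assumes Python with scikit-learn, uses synthetic data, and uses an elastic-net logistic regression as a stand-in for GLMNET. Because scikit-learn's radial-kernel SVM exposes no per-feature coefficients, the sketch substitutes permutation importance for the SVM ranking; that substitution is ours.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 120 samples, 30 features, 5 informative.
X, y = make_classification(n_samples=120, n_features=30, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)
X = StandardScaler().fit_transform(X)

# Random forest: impurity-based importances come from the fitted trees.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_rank = np.argsort(rf.feature_importances_)[::-1]

# Elastic-net logistic regression (stand-in for GLMNET): rank by |coefficient|.
enet = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5,
                          C=0.5, max_iter=5000, random_state=0).fit(X, y)
enet_rank = np.argsort(np.abs(enet.coef_[0]))[::-1]

# Radial-kernel SVM: no native coefficients, so permutation importance is used
# here (our assumption, not necessarily the ranking used in the study).
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
perm = permutation_importance(svm, X, y, n_repeats=10, random_state=0)
svm_rank = np.argsort(perm.importances_mean)[::-1]

# Features that appear among the top ten of all three rankings.
shared_top = set(rf_rank[:10]) & set(enet_rank[:10]) & set(svm_rank[:10])
```

In each case the ranking is only as trustworthy as the fitted model, which is the dependence on model performance noted above.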
Whereas embedded methods obtain the importance of each feature from the trained model, wrapper methods, such as Recursive Feature Elimination (RFE), embed the model hypothesis search within the feature subset search. RFE uses backward selection to assess the importance of each feature to the model. The features are ranked by the underlying algorithm, which can be RF, SVM, or others. Considering the promising properties of RF for genomics studies, we used RF as the underlying model for recursive feature elimination.
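A minimal sketch of recursive feature elimination with a random forest as the underlying model is given below. It assumes Python with scikit-learn's `RFE` class and synthetic data; the number of trees, the number of retained features, and the step size are illustrative choices rather than the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data (assumption): 120 samples, 30 features, 5 informative.
X, y = make_classification(n_samples=120, n_features=30, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

# The forest supplies the feature importances used at each elimination round.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Backward selection: refit, drop the least important feature (step=1),
# and repeat until only n_features_to_select features remain.
selector = RFE(estimator=forest, n_features_to_select=5, step=1).fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
ranking = selector.ranking_  # 1 = retained; larger values were eliminated earlier
```

In practice the number of retained features would be chosen by resampling, for example with scikit-learn's `RFECV`, rather than fixed in advance.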