2.4.2 | Complex admixture model-choice with Random-Forest ABC
For ABC model-choice, we performed 10,000 independent MetHis simulations for each of the nine competing scenarios. To mimic our case study
datasets (see 2.4.3), we simulated 100,000 SNPs and sampled 50
individuals in population H, and 90 and 89 individuals respectively in
the African and European source populations.
Using 27 cores and the above
design, we performed the 90,000 simulations with MetHis in four
days, with two-thirds of that time spent on summary-statistics calculation alone
(Supplementary Note S1).
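As a rough check of this computational budget, the figures above can be combined as in the following minimal Python sketch. It assumes that all 27 cores were fully occupied for the whole four days, which the text does not state, so the per-simulation cost is only an upper-bound estimate; all variable names are illustrative.

```python
# Design figures quoted in the text.
N_SCENARIOS = 9
SIMS_PER_SCENARIO = 10_000
N_CORES = 27
WALL_DAYS = 4
SUMSTAT_FRACTION = 2 / 3  # share of run time spent on summary statistics

# Total number of simulations across the nine scenarios.
total_sims = N_SCENARIOS * SIMS_PER_SCENARIO

# ASSUMPTION: every core is busy for the full wall-clock time.
core_hours = N_CORES * WALL_DAYS * 24
sumstat_core_hours = core_hours * SUMSTAT_FRACTION
core_seconds_per_sim = core_hours * 3600 / total_sims

print(f"total simulations:       {total_sims}")
print(f"core-hours (upper bound): {core_hours}")
print(f"of which summary stats:   {sumstat_core_hours:.0f}")
print(f"core-seconds per sim:     {core_seconds_per_sim:.2f}")
```

Under that assumption, summary-statistics calculation rather than the simulation itself dominates the cost per simulated dataset.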
We performed model-choice with Random-Forest ABC, as implemented in the abcrf function of the abcrf package, to obtain the
cross-validation table and associated prior error rate using an
out-of-bag approach (Figure 2). We considered a uniform prior
probability for the nine competing models. We used 1,000 decision
trees in the forest after visually checking, with the err.abcrf function, that error rates converged
appropriately (Supplementary Figure S3). RF-ABC cross-validation procedures using
groups of scenarios were conducted using the group definition option in
the abcrf function (Estoup, Raynal,
Verdu, & Marin, 2018). Finally, the relative importance of each summary statistic
to the model-choice cross-validation was computed using the abcrf function (Supplementary Figure S4).
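To make the cross-validation table and the prior error rate concrete, the following minimal Python sketch computes both from a list of true scenario labels and the corresponding out-of-bag predictions. It does not train a forest and does not reproduce the abcrf package; the scenario labels S1 to S4, the groups A and B, and the toy predictions are hypothetical. The grouping step simply relabels scenarios after prediction, which is a simplification and may differ from how the group option of abcrf operates.

```python
from collections import Counter


def confusion_table(true_labels, oob_predictions, labels):
    """Rows are true scenarios, columns are predicted scenarios."""
    counts = Counter(zip(true_labels, oob_predictions))
    return [[counts[(t, p)] for p in labels] for t in labels]


def prior_error_rate(true_labels, oob_predictions):
    """Fraction of simulations assigned to the wrong scenario."""
    wrong = sum(t != p for t, p in zip(true_labels, oob_predictions))
    return wrong / len(true_labels)


def regroup(label_seq, groups):
    """Map each scenario label to the name of the group containing it."""
    lookup = {s: g for g, members in groups.items() for s in members}
    return [lookup[s] for s in label_seq]


# Hypothetical toy data: 4 scenarios, 2 simulations each.
labels = ["S1", "S2", "S3", "S4"]
true = ["S1", "S1", "S2", "S2", "S3", "S3", "S4", "S4"]
pred = ["S1", "S2", "S2", "S2", "S3", "S4", "S4", "S3"]

table = confusion_table(true, pred, labels)
error = prior_error_rate(true, pred)

# Hypothetical groups of scenarios; errors within a group no longer count.
groups = {"A": ["S1", "S2"], "B": ["S3", "S4"]}
grouped_error = prior_error_rate(regroup(true, groups), regroup(pred, groups))

for name, row in zip(labels, table):
    print(name, row)
print("prior error rate:", error)
print("grouped prior error rate:", grouped_error)
```

In this toy case every misassignment falls within a group, so the grouped error rate drops to zero; this is the kind of pattern that a cross-validation using groups of scenarios is meant to reveal.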
We explored erroneous model-choice assignments due to model nestedness in
the parameter space by considering 1,000 randomly chosen simulations per
model as pseudo-observed data (Supplementary Figure S5). We
trained the RF algorithm on the 9,000 remaining simulations per model
using the abcrf function as above, which provided
results highly similar to those obtained with 10,000 simulations per model
(results not shown). We then used the predict.abcrf function to
perform model choice independently for each of the 1,000 simulated
pseudo-observed datasets with known parameter vectors.
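The hold-out design described above can be sketched as follows in Python. This is only an illustration of the split into pseudo-observed and training simulations, not the authors' code: the function name, the model names, the integer simulation identifiers, and the fixed seed are all assumptions.

```python
import random


def split_pseudo_observed(sim_ids_by_model, n_pod, seed=0):
    """Hold out n_pod randomly chosen simulations per model.

    Returns (training, pods): two dicts mapping each model name to the
    simulation ids kept for training and held out as pseudo-observed data.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    training, pods = {}, {}
    for model, ids in sim_ids_by_model.items():
        held_out = set(rng.sample(ids, n_pod))  # sampling without replacement
        pods[model] = sorted(held_out)
        training[model] = [i for i in ids if i not in held_out]
    return training, pods


# Nine hypothetical models with 10,000 simulation ids each.
sims = {f"model_{m}": list(range(10_000)) for m in range(1, 10)}
training, pods = split_pseudo_observed(sims, n_pod=1_000)

print(sum(len(v) for v in training.values()), "training simulations")
print(sum(len(v) for v in pods.values()), "pseudo-observed datasets")
```

Because the held-out simulations never enter the training set, their known scenario and parameter vectors can be compared directly with the scenario predicted for each of them.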
To empirically evaluate the power of the RF-ABC model-choice to
distinguish complex admixture processes, we conducted similar
cross-validation procedures based on an additional 10,000 simulations per scenario for
50,000 and, separately, 10,000 SNPs, instead of 100,000 SNPs (180,000
simulations in total; Supplementary Figure S6A-B).
Furthermore, using 100,000 SNPs, we produced 90,000 simulations and
performed cross-validations (Supplementary Figure S6C),
considering a five-times smaller sample set, with 10 sampled individuals
in population H (instead of 50 as previously) and 18 individuals in each
source population (instead of 90 and 89).
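The simulation designs used for this power evaluation can be summarised in a small Python data structure, which also checks the totals quoted in the text. This is a bookkeeping sketch only: the class and field names are illustrative, and it assumes that the 50,000-SNP and 10,000-SNP designs keep the original sample sizes, which the text implies but does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    n_snps: int
    n_admixed: int          # individuals sampled in population H
    n_african: int          # individuals in the African source population
    n_european: int         # individuals in the European source population
    sims_per_scenario: int = 10_000
    n_scenarios: int = 9

    @property
    def total_simulations(self) -> int:
        return self.sims_per_scenario * self.n_scenarios


main = Design(n_snps=100_000, n_admixed=50, n_african=90, n_european=89)
# ASSUMPTION: fewer SNPs, same sample sizes as the main design.
snps_50k = Design(n_snps=50_000, n_admixed=50, n_african=90, n_european=89)
snps_10k = Design(n_snps=10_000, n_admixed=50, n_african=90, n_european=89)
# Five-times smaller sample set, same number of SNPs as the main design.
small_sample = Design(n_snps=100_000, n_admixed=10, n_african=18, n_european=18)

reduced_snp_total = snps_50k.total_simulations + snps_10k.total_simulations
print("reduced-SNP simulations:", reduced_snp_total)
print("small-sample simulations:", small_sample.total_simulations)
```

Dividing the original sample sizes by five gives 10, 18 and 17.8 individuals, so the 18 individuals used for each source population correspond to rounding to the nearest integer.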