Can we accurately predict the distribution of soil microorganism presence and relative abundance from environmental DNA?

Abstract

Soil microbes play a key role in shaping terrestrial ecosystems. It is therefore essential to understand what drives their distributions. While multivariate analyses have been used to characterise microbial communities and drivers of their spatial patterns, few studies focused on modelling the distribution of Operational Taxonomic Units (OTUs). Here, we evaluate the potential of species distribution models (SDMs), to predict the presence-absence and relative abundance distribution of bacteria, archaea, fungi and protist OTUs from the Swiss Alps. Advanced automated selection of abiotic covariates was used to circumvent the lack of knowledge on the ecology of each OTU. ‘Presence-absence’ SDMs were successfully applied to most OTUs, yielding better predictions than null models. ‘Relative-abundance’ SDMs were less successful, yet, they were able to correctly rank sites according to their relative abundance values. Archaea and bacteria SDMs displayed better predictive power than fungi and protist ones, indicating a closer link of the latter with the abiotic covariates used. Microorganism distributions were mostly related to edaphic covariates. In particular, pH was the most selected covariate across models. The study shows the potential of using SDM frameworks to predict the distribution of OTUs obtained from environmental DNA (eDNA) data. It underscores the importance of edaphic covariates and the need for further development of precise edaphic mapping and scenario modelling to enhance prediction of microorganism distributions in the future.
Keywords : amplicon sequencing, species distribution model, topsoil, eDNA, bacteria, fungi, archaea, protist, cross-validation, environmental niche.

Introduction

Soil microbes play a key role in shaping terrestrial ecosystems and their responses to climate change and land degradation (Karhu et al., 2014; Cavicchioli et al., 2019) by driving soil functions such as cycling of nutrients and carbon (Philippot et al., 2013; Bardgett & Van Der Putten, 2014; Jiao et al., 2021). For example, rising temperatures could enhance microbial activity, leading to increased carbon release from soil to the atmosphere (Crowther et al., 2016; Ballantyne et al., 2017; Rocci et al., 2021) thereby further amplifying global temperature rise. However, the mechanisms and rates of carbon and nutrient releases depend on the composition and spatial patterns of the soil microbial communities present in the environment (Nottingham et al., 2015, 2019). Variations of these patterns have been observed from micro (Nunan et al., 2003) to regional (Yashiro et al., 2018; Pinto-figueroa et al., 2019; Mazel et al., 2021; Seppey et al., 2023) and global scales (Birkhofer et al., 2012; Bahram et al., 2018). These distribution patterns could, in turn, be retroactively affected by future land-use and climatic changes (Guo et al., 2018; Cavicchioli et al., 2019; Mod et al., 2021).
To better spatially characterise and quantify soil functions, there is a need to improve knowledge on the spatial patterns of soil microbial communities and their components (Bardgett & Van Der Putten, 2014; Mod et al., 2020). Spatial patterns are commonly studied for macro-organisms using species distribution models (SDMs; Franklin, 2010; Peterson et al., 2011; Guisan et al., 2017). SDMs were first developed to relate ‘presence-absence’ of a taxonomic unit (e.g. species for most macro-organisms studies) to environmental conditions, relying on Hutchinson’s realised environmental niche theory (Hutchinson, 1957; Baquero et al., 2021). The realised environmental niche refers to a multidimensional volume in environmental space, representing the different characteristics (e.g. soil, climate, topography, land cover/use) where the taxonomic unit can be found. SDMs characterise this hyper-volume and project it onto geographical space based on environmental conditions, in order to predict the spatial distribution of the modelled unit (Guisan et al., 2017; Araújo et al., 2019). Most of the time, these models predict probabilities of occurrence, but other properties can also be predicted, such as abundance (Waldock et al., 2022), or species and population characteristics such as fitness and genetic diversity (Lee-Yaw et al., 2022).
For microbial diversity, spatial distribution had been previously studied mostly at the community level by linking soil microbial community characteristics such as diversity metrics (Fierer & Jackson, 2006; Griffiths et al., 2016; Ren et al., 2018; Seppey et al., 2020), abundance patterns (Pinto-figueroa et al., 2019), dominance patterns (de Vries et al., 2012), and total biomass (Serna-Chavez et al., 2013; Horrigue et al., 2016) to environmental abiotic predictors such as climate and soil edaphic conditions. Community-level characteristics are commonly obtained by summarising information about Operational Taxonomic Units (OTU) within and across sampling sites, in which OTUs are defined as operationally relevant clusters of environmental DNA (eDNA) amplicon sequenced reads grouped at specific taxon levels. However, individual OTUs may have different responses to environmental factors, which may not be revealed by community-level analyses. This problem, already discussed for macro-organisms (Guisan & Rahbek, 2011), leads to the challenge of modelling individual OTUs’ distributions.
To our knowledge, few microbial studies attempted to model the distribution of microorganisms at the OTU level (Mod et al., 2021). The main reasons could be methodological limitations regarding the lack of appropriate environmental descriptors for soils (Lembrechts et al., 2020), the tremendous number of models to calibrate for a whole library of OTUs, and the difficulty to compare OTUs derived from different databases because of the lack of standardisation of algorithms that cluster sequences into OTUs (Deiner et al., 2015). Some studies used SDM frameworks below the community level, by working with sequences clustered into clades at coarse taxonomic levels (King et al., 2010). Moreover, it has been shown in plants and animals that SDM frameworks can be applied with success to taxonomic units above and below the species level, as long as there was a link between the biological unit and the environmental covariates used (Hadly et al., 2009; Smith et al., 2019).
Relating geographically-referenced eDNA sampling points to information on local environmental condition provides a possibility to model the presence vs. absence of an OTU by characterising the OTU’s environmental niche (i.e. “OTU-environment relationships”). Since read counts per OTU can be somehow related to species abundance (Giner et al., 2016; Galazzo et al., 2020), abundance models can also be fitted for OTUs (Mod et al. 2021). However, the compositionality of eDNA data (Gloor et al., 2017) also means that the direct modelling of absolute abundance (i.e. read counts) might not be meaningful, leaving only the possibility to model ‘presence-absence’ or ‘relative-abundance’.
In this study, we take advantage of recent advances in computing facilities, bioinformatics, and modelling frameworks to fit individual SDMs for all OTUs from a comprehensive mountain soil eDNA database (Yashiro et al., 2016; Pinto-figueroa et al., 2019; Seppey et al., 2020). Our study aims more specifically to test the predictive power of the SDM approach applied to a high number of OTUs using both ‘presence-absence’ and ‘relative-abundance’ data, and comparing their predictive performance. To achieve this, we first generated SDMs for more than 60,000 bacterial, archaeal, fungal and protist OTUs across a wide elevational gradient in the Western Swiss Alps. We then evaluated the predictive power of the models and explored differences between the ‘presence-absence’ and ‘relative-abundance’ based SDMs across the four microbial groups and their constitutive phyla.

Methods

Study area and data collection

Soil samples were collected at a subset of sites from a larger set of grassland plots (Dubuis et al., 2011, 2013) of the Western Swiss Alps (46°10-46°30′N; 6°50′-7°10′E, http://rechalp.unil.ch; Von Däniken et al., 2014; see Figure 1). It is a mountainous region with an elevation range from 425 to 3120 metres above sea level, and very heterogeneous climatic and edaphic conditions. To relate the soil microbiota to environmental values in the area, we used data from 250 sampling sites for bacteria and archaea (Yashiro et al., 2016; Mod et al., 2020), 217 for fungi (Pinto-figueroa et al., 2019; Mod et al., 2020) and 166 for protists (Seppey et al., 2020) . Details on the sampling and DNA sequencing for the three respective groups can be found in the references above, whereas information on assignment of sequenced reads to OTUs have been published in Malard et al. (2022). In brief, soil sampling was conducted from June to September (growing season) during the summers 2012 and 2013. At each selected sampling site, a 2x2 m quadrat was used to sample the top 5 cm of soil with sterilised tools at each corner and at the middle of the quadrat. The five subsamples were then pooled and homogenised into a sample of 500 g representing the site. DNA extraction was done within 36h after collection. Amplification was done targeting the V5 region of the 16S rRNA gene for bacteria and archaea (Lazarevic et al., 2009), the ITS1 rRNA gene operon region for fungi (Schmidt et al., 2013), and the V4 region of the 18S rRNA gene for protists (Stoeck et al., 2010). PCR products were sequenced on the Illumina HiSeq 2500 for 16S and ITS1 amplicons, and on the Illumina MiSeq for 18S amplicons (see Supporting information). Demultiplexing, trimming and merging of sequences, as well as clustering of sequences to obtain zero-radius OTUs (Edgar, 2018), were performed using a custom made pipeline (see details in Mod et al., (2021; Malard et al., 2022). The raw sequence data is available on NCBI bioproject number PRJNA810480 and PRJEB30010.
Proportional abundances (hereafter ‘relative-abundance’) were obtained by dividing each OTU read count by sequencing depth. ‘Presence-absence’ data were obtained for each OTU using counts superior or equal to one as presence, and lack of detection as absence. Taxonomic assignment of OTUs was performed using the IDtaxa classifier (Murali et al., 2018) against the Silva v138 database for bacteria and archaea (Quast et al., 2012), the UNITE+INSD v9.0 database for fungi (Abarenkov et al., 2022), and the PR2 4.5 database for protists (Guillou et al., 2012). OTUs not corresponding to bacteria, archaea, fungi or protists for the corresponding markers were discarded ( Supporting Information).
For each site, values representing a wide range of 78 covariates covering climatic, edaphic, topographic, landuse-landcover and remote-sensing conditions were obtained (Supporting Information). For edaphic covariates (i.e. soil characteristics), data were measured from samples collected in situ as described in Buri et al. (2020) and Yashiro et al. (2016). For other covariates, data were extracted from spatial layers available at 25 m resolution (Broennimann & Guisan, In review; Külling et al., In review).

Modelling framework

‘Presence-absence’ modelling was performed for all OTUs present in more than 5% and less than 95% of sites, leading to 47,520, 163, 17,318 and 2,147 OTUs for bacteria, archaea, fungi and protists, respectively. ‘Relative-abundance’ modelling was performed for all OTUs present in more than 5% of sites, resulting in modelling of 48,316 bacterial, 163 archaeal, 17,345 fungal and 2155 protist OTUs. This selection was done in order to have enough data points to model each OTU (see suppl. Table 1 for data about OTUs removed).
The following framework was applied to each selected OTU in R v4.3.0 (R Core Team, 2023). All ‘presence-absence’ models were fitted using a Binomial probability distribution. ‘Relative-abundance’ models were fitted as read counts models using a Poisson probability distribution and the sequencing depth as an offset (i.e. equivalent to fitting a ‘relative-abundance’ model; Mod et al., 2021).
Covariates were selected by applying a two-step procedure (Adde et al., 2023). This procedure circumvents the lack of a priori knowledge about most OTUs ecology, and optimises model predictive performances, thereby bringing information on the ecology of the modelled organisms (Adde et al., 2023).
The first step consists of a “data snooping” approach (Dormann et al., 2013). In other words, for each of the 78 candidate covariates, univariate Generalised Linear Models (GLM) with quadratic effect were fitted (Guisan et al., 2002). The models’ predictive power, estimated using the difference between the null deviance and the residual deviance, was used to select the 15 best non-correlated covariates recursively, while excluding covariates having a Pearson correlation greater than 0.7 with already selected covariates (Dormann et al., 2013). The number of preselected covariates was capped at 15 in order to limit the computing power needed for the subsequent step of the analysis which further reduced the number of selected covariates using model-embedded regularisation techniques. Four algorithms were used: Generalised linear models (GLM; Guisan et al., 2002), Generalised additive models (GAM; Guisan et al., 2002), Random Forest (RF; Cutler et al., 2007) and Gradient Boosting Machine (GBM; Elith et al., 2008). GLMs had quadratic terms, and a lasso secondary covariate selection and regularisation (“glmnet” package v4.1-7; Tay et al., 2023). GAMs used null-space penalization for covariate selection (“mgcv” package v1.8-42; Wood, 2017). A regularised form of Random Forests (RF) was built using the “RRF” package v1.9.4 (Deng, 2013; Deng & Runger, 2013). For each OTU RF model, the number of trees was determined through hyper-parametrization as in Elith et al. (2008) in order to minimise the error rate of the model (Hastie et al., 2009). We tested four different values (“ntree”=10,100,1000 or 10000). The “mtry” option was left as default value (3 for ‘presence-absence’ models, 5 for ‘relative-abundance’ models). Gradient Boosting Machine models were built with the “gbm” package v2.1.8.1 (Greenwell et al., 2022), in which hyper-parametrization procedure was performed on the number of trees (10, 100, 1000 or 10000 trees) and the shrinkage value (0.001, 0.01, or 0.1). We only tested a limited number of hyper-parameters to reduce computing costs.

Cross-validation of models’ predictive power

For each of the four algorithms’ best models, predictive power was assessed using the “bootstrap .632+” cross-validation procedure (Efron & Tibshirani, 1997) with 100 iterations per model. Unlike classical SDMs cross-validations that sample data without replacement (e.g. split-sampling, k-fold, see Guisan et al., 2017), this approach uses a bootstrap sample (i.e. with replacement) and generates better estimations of model error rates (Efron, 1983; Efron & Tibshirani, 1997). For each bootstrap iteration of the ‘presence-absence’ model, the difference between predictions and validation data was computed using AUC (Swets, 1988) and maximised values of TSS (maxTSS) and of Kappa (Allouche et al., 2006). The values were then averaged across the 100 iterations. Results were compared to those obtained by using null models fitted with randomised data (Collart & Guisan, 2023). For each prevalence value existing in the dataset, 100 randomised models were fitted (i.e. 100 models fitting “randomly distributed OTUs” with fixed prevalence). These randomised data were used to rescale real OTUs’ maxTSS values for each OTU following this formula:
maxTSSadj = (maxTSSobs-maxTSSnull)/(1-maxTSSnull)
with maxTSSadj being the adjusted maxTSS for the considered OTU’s model; maxTSSobs being the raw maxTSS value obtained for that OTU’s model; and maxTSSnullbeing the 95th percentile of the distribution of models fitted on random data with the same prevalence as the considered OTU. Models having a positive maxTSSadj were considered as having higher predictive power than expected by chance given the environmental dataset specificities. By applying the reasoning of Swets (1988) for AUC to our metric, models with maxTSSadj>0.5 were considered as having a high predictive power.
The same procedure was applied to ‘relative-abundance’ models using Spearman correlation (rho) to check whether or not models were accurately ranking sites by their ‘relative-abundance’ values, and Coefficient of Variation (CoV) to assess the difference between predicted ‘relative-abundance’ values and validation values. The null model procedure could not be applied to ‘relative-abundance’ models due to the high computing burden required to process the thousands of microbial OTUs in the dataset, as each OTU model would necessitate its own random distribution, whereas ‘presence-absence’ models with the same prevalence could use the same random distribution. Hence, models were classified as “useful” when rho>0.2 (Mod et al., 2021) and as “good” when rho>0.4 (Landis & Koch, 1977).
After computing the predictive power for all OTU models, differences between organism groups and phyla within each group were explored. Model predictive powers were compared among the four main organism groups using ANOVA followed by Tukey HSD tests with Bonferroni correction, and Cohen’s D effect size metrics. To check for the potential effect of the number and locations of sampling sites on model predictive power, additional bacterial models were constructed using only the exact same 166 sites as used for the protist models. Then, performances of models using 250 sites were compared performances of models using 166 sites for each OTU with a paired student’s t-test.
Covariate selection and importance
To have an insight on which covariates drive OTUs presence and relative abundance, an analysis per organism group was performed. For each group and each environmental covariate, the proportion of models selecting that covariate was computed by clustering the covariates by their correlation values to match clusters formed during covariates selection (data snooping, Dormann et. al. 2013). For GLMs and GAMs, the importance of covariates was assessed using the coefficients of each covariate. For GBM and RF, covariate importance was assessed using Gini coefficients (Deng, 2013; Greenwell et al., 2022). The obtained values were scaled so that the best covariate from each model had an importance of 1, and the other covariates were linearly rescaled from 1 to 0.

Results

Data on model performance and covariate importance for each OTU can be downloaded in FigShare:https://figshare.com/s/825799db5d4fdc9a2f87(private link, which will be published upon acceptance).
1.Evaluation of ‘presence-absence’ models
Out of the 67,148 preselected OTUs in the dataset, all OTUs were successfully fitted by at least one ‘presence-absence’ model algorithm and 65,554 OTUs were successfully fitted by all four algorithms (details in Supporting Information). Overall, 91% of bacteria, 98% of archaea, 81% of fungi, and 60% of protists had higher predictive power than null models when modelling their distribution using GLMs (Figure 2a-d; Supporting Information). However, the proportion of “high predictive power” models (i.e. maxTSSadj>0.4) was rather low in all groups, with 15% of bacteria, 15% of archaea, 6% of fungi, and 0.1% of protists presenting a maxTSSadj>0.4 for GLM (see Supporting information for the other algorithms). ‘Presence-absence’ models for bacteria and archaea had the best predictive power followed by fungi models and protist models across all 4 modelling algorithms (Figure 2a-d). When the number of sites for bacteria and archaea was reduced from 250 to the corresponding 166 sampled of protists, we observed a small, yet significant, decrease in maxTSSadj (Student’s paired tests: bacteria: p<0.001, df=45349, Cohen’s D=0.15±0.01; archaea: p<0.001, df=157, Cohen’s D=0.33±0.22, Supporting Information). However, these models still had higher maxTSSadj values than protists models (bacteria: padj<0.001, Cohen’s D=0.96±0.05; archaea: padj<0.001, Cohen’s D=1.55±0.17).
Differences in performances were observed among phyla. Within bacteria, phyla such as Chloroflexi, Acidobacteriota, and Planctomycetota displayed a higher proportion of high predictive power models (see Figure 3 for GLMs, Supporting Information for other algorithms), with 528/1537, 1171/5163, and 576/2765, respectively. Within archaea, only some OTUs from Crenarchaeota had high predictive power models (20/131), while most phyla had few assigned OTUs and no high predictive power models. Few protists and fungi had good models, with Ascomycota and Mortierellomycota phyla having some high predictive power models (589/8631 and 89/1117, respectively).
2.Evaluation of ‘relative-abundance’ models
‘Relative-abundance’ models of all 67,979 preselected OTUs were successfully fitted by at least one algorithm, while 64,732 OTUs were fitted by four algorithms (details in Supporting Information). Spearman correlation, which evaluates the ability of the models to discriminate sites by their ‘relative-abundance’ values, demonstrated consistent results with the ‘presence-absence’ models results (Figure 2). We obtained numerous “useful” models (i.e. rho>0.2; e.g. for GLMs: 85% of bacteria, 83% or archaea, 63% of fungi, 49% of protists OTUs) and some “good” models (Spearman’s rho>0.4; e.g. for GLMs: 40% of bacteria, 40% of archaea, 20% of fungi, 9% of protists OTUs). Bacteria and archaea models had higher predictive power than fungi and protist ones (Figure 2; Supporting Information). Some phyla had a higher proportion of ‘relative-abundance’ models that correctly ranked sites of OTUs (i.e. rho>0.4), such as for the bacterial phyla Acidobacteriota (1887/5163), Chloroflexi (489/1537) and Planctomycetota (756/2755; Figure 4). Among archaea, Crenarchaeota was the only phylum with good predictive models (28/131). Among fungi, Mortierellomycota had a higher proportion of good performing models (194/1117). For some phyla such as Nanoarchaeota and all protist phyla, the modelling pipeline could not produce ‘relative-abundance’ models with rho>0.4. Prediction of exact ‘relative-abundance’ values was less successful, because their coefficient of variation between predicted and validation values were high, with median prediction error between 10% and 100% of mean observed ‘relative-abundance’ among the 4 studied model algorithms (Supporting Information).
3. Covariate selection and importance
Edaphic covariates were the most selected across all groups for both ‘presence-absence’ and ‘relative-abundance’ models (Figure 5). For example in GLMs, at least one edaphic covariate was selected in models for 97% of archaea OTUs, 95% of bacteria OTUs, 87% of fungi OTUs and 80% of protist OTUs. In particular, pH was the most selected covariate in models for all groups (Figure 5; Supporting Information). Climatic covariates were also highly selected in GLMs, with 67% of archaea, 56% of bacteria, 67% of fungi and 69% of protist best models including at least one climatic covariate (Figure 5). For bacteria and archaea, winter temperature was the most selected climatic covariate, while fungi and protist models selected the set of covariates corresponding to yearly average temperature, precipitation, and elevation covariates (Supporting Information). Surprisingly, for fungi, distance to roads appeared within the list of the most-selected covariates alongside the edaphic and some topographic covariates. For protists, the most selected covariates were found among distance to roads, climate, edaphic, topographic and land-use covariates (Figure 5).