Can we accurately predict the distribution of soil microorganism
presence and relative abundance from environmental DNA?
Abstract
Soil microbes play a key role in shaping terrestrial ecosystems. It is
therefore essential to understand what drives their distributions. While
multivariate analyses have been used to characterise microbial
communities and drivers of their spatial patterns, few studies focused
on modelling the distribution of Operational Taxonomic Units (OTUs).
Here, we evaluate the potential of species distribution models (SDMs),
to predict the presence-absence and relative abundance distribution of
bacteria, archaea, fungi and protist OTUs from the Swiss Alps. Advanced
automated selection of abiotic covariates was used to circumvent the
lack of knowledge on the ecology of each OTU. ‘Presence-absence’ SDMs
were successfully applied to most OTUs, yielding better predictions than
null models. ‘Relative-abundance’ SDMs were less successful, yet, they
were able to correctly rank sites according to their relative abundance
values. Archaea and bacteria SDMs displayed better predictive power than
fungi and protist ones, indicating a closer link of the latter with the
abiotic covariates used. Microorganism distributions were mostly related
to edaphic covariates. In particular, pH was the most selected covariate
across models. The study shows the potential of using SDM frameworks to
predict the distribution of OTUs obtained from environmental DNA (eDNA)
data. It underscores the importance of edaphic covariates and the need
for further development of precise edaphic mapping and scenario
modelling to enhance prediction of microorganism distributions in the
future.
Keywords : amplicon sequencing, species distribution model,
topsoil, eDNA, bacteria, fungi, archaea, protist, cross-validation,
environmental niche.
Introduction
Soil microbes play a key role in shaping terrestrial ecosystems and
their responses to climate change and land degradation
(Karhu et al., 2014;
Cavicchioli et al., 2019) by driving soil functions such as cycling of
nutrients and carbon
(Philippot et al.,
2013; Bardgett & Van Der Putten, 2014; Jiao et al., 2021). For
example, rising temperatures could enhance microbial activity, leading
to increased carbon release from soil to the atmosphere
(Crowther et al.,
2016; Ballantyne et al., 2017; Rocci et al., 2021) thereby further
amplifying global temperature rise. However, the mechanisms and rates of
carbon and nutrient releases depend on the composition and spatial
patterns of the soil microbial communities present in the environment
(Nottingham et al.,
2015, 2019). Variations of these patterns have been observed from micro
(Nunan et al., 2003)
to regional (Yashiro
et al., 2018; Pinto-figueroa et al., 2019; Mazel et al., 2021; Seppey et
al., 2023) and global scales
(Birkhofer et al.,
2012; Bahram et al., 2018). These distribution patterns could, in turn,
be retroactively affected by future land-use and climatic changes
(Guo et al., 2018;
Cavicchioli et al., 2019; Mod et al., 2021).
To better spatially characterise and quantify soil functions, there is a
need to improve knowledge on the spatial patterns of soil microbial
communities and their components
(Bardgett & Van Der
Putten, 2014; Mod et al., 2020). Spatial patterns are commonly studied
for macro-organisms using species distribution models
(SDMs;
Franklin, 2010;
Peterson et al., 2011; Guisan et al., 2017). SDMs were first developed
to relate ‘presence-absence’ of a taxonomic unit (e.g. species for most
macro-organisms studies) to environmental conditions, relying on
Hutchinson’s realised environmental niche theory
(Hutchinson, 1957;
Baquero et al., 2021). The realised environmental niche refers to a
multidimensional volume in environmental space, representing the
different characteristics (e.g. soil, climate, topography, land
cover/use) where the taxonomic unit can be found. SDMs characterise this
hyper-volume and project it onto geographical space based on
environmental conditions, in order to predict the spatial distribution
of the modelled unit
(Guisan et al., 2017;
Araújo et al., 2019). Most of the time, these models predict
probabilities of occurrence, but other properties can also be predicted,
such as abundance
(Waldock et al.,
2022), or species and population characteristics such as fitness and
genetic diversity
(Lee-Yaw et al.,
2022).
For microbial diversity, spatial distribution had been previously
studied mostly at the community level by linking soil microbial
community characteristics such as diversity metrics
(Fierer & Jackson,
2006; Griffiths et al., 2016; Ren et al., 2018; Seppey et al., 2020),
abundance patterns
(Pinto-figueroa et
al., 2019), dominance patterns
(de Vries et al.,
2012), and total biomass
(Serna-Chavez et al.,
2013; Horrigue et al., 2016) to environmental abiotic predictors such
as climate and soil edaphic conditions. Community-level characteristics
are commonly obtained by summarising information about Operational
Taxonomic Units (OTU) within and across sampling sites, in which OTUs
are defined as operationally relevant clusters of environmental DNA
(eDNA) amplicon sequenced reads grouped at specific taxon levels.
However, individual OTUs may have different responses to environmental
factors, which may not be revealed by community-level analyses. This
problem, already discussed for macro-organisms
(Guisan & Rahbek,
2011), leads to the challenge of modelling individual OTUs’
distributions.
To our knowledge, few microbial studies attempted to model the
distribution of microorganisms at the OTU level
(Mod et al., 2021).
The main reasons could be methodological limitations regarding the lack
of appropriate environmental descriptors for soils
(Lembrechts et al.,
2020), the tremendous number of models to calibrate for a whole library
of OTUs, and the difficulty to compare OTUs derived from different
databases because of the lack of standardisation of algorithms that
cluster sequences into OTUs
(Deiner et al.,
2015). Some studies used SDM frameworks below the community level, by
working with sequences clustered into clades at coarse taxonomic levels
(King et al., 2010).
Moreover, it has been shown in plants and animals that SDM frameworks
can be applied with success to taxonomic units above and below the
species level, as long as there was a link between the biological unit
and the environmental covariates used
(Hadly et al., 2009;
Smith et al., 2019).
Relating geographically-referenced eDNA sampling points to information
on local environmental condition provides a possibility to model the
presence vs. absence of an OTU by characterising the OTU’s environmental
niche (i.e. “OTU-environment relationships”). Since read counts per
OTU can be somehow related to species abundance
(Giner et al., 2016;
Galazzo et al., 2020), abundance models can also be fitted for OTUs
(Mod et al. 2021). However, the compositionality of eDNA data
(Gloor et al., 2017)
also means that the direct modelling of absolute abundance (i.e. read
counts) might not be meaningful, leaving only the possibility to model
‘presence-absence’ or ‘relative-abundance’.
In this study, we take advantage of recent advances in computing
facilities, bioinformatics, and modelling frameworks to fit individual
SDMs for all OTUs from a comprehensive mountain soil eDNA database
(Yashiro et al., 2016;
Pinto-figueroa et al., 2019; Seppey et al., 2020). Our study aims more
specifically to test the predictive power of the SDM approach applied to
a high number of OTUs using both ‘presence-absence’ and
‘relative-abundance’ data, and comparing their predictive performance.
To achieve this, we first generated SDMs for more than 60,000 bacterial,
archaeal, fungal and protist OTUs across a wide elevational gradient in
the Western Swiss Alps. We then evaluated the predictive power of the
models and explored differences between the ‘presence-absence’ and
‘relative-abundance’ based SDMs across the four microbial groups and
their constitutive phyla.
Methods
Study area and data
collection
Soil samples were collected at a subset of sites from a larger set of
grassland plots
(Dubuis et al., 2011,
2013) of the Western Swiss Alps (46°10-46°30′N; 6°50′-7°10′E,
http://rechalp.unil.ch;
Von Däniken et al.,
2014; see Figure 1). It is a mountainous region with an elevation range
from 425 to 3120 metres above sea level, and very heterogeneous climatic
and edaphic conditions. To relate the soil microbiota to environmental
values in the area, we used data from 250 sampling sites for bacteria
and archaea (Yashiro
et al., 2016; Mod et al., 2020), 217 for fungi
(Pinto-figueroa et
al., 2019; Mod et al., 2020) and 166 for protists
(Seppey et al., 2020)
. Details on the sampling and DNA sequencing for the three respective
groups can be found in the references above, whereas information on
assignment of sequenced reads to OTUs have been published in Malard et
al. (2022). In brief,
soil sampling was conducted from June to September (growing season)
during the summers 2012 and 2013. At each selected sampling site, a 2x2
m quadrat was used to sample the top 5 cm of soil with sterilised tools
at each corner and at the middle of the quadrat. The five subsamples
were then pooled and homogenised into a sample of 500 g representing the
site. DNA extraction was done within 36h after collection. Amplification
was done targeting the V5 region of the 16S rRNA gene for bacteria and
archaea (Lazarevic et
al., 2009), the ITS1 rRNA gene operon region for fungi
(Schmidt et al.,
2013), and the V4 region of the 18S rRNA gene for protists (Stoeck et
al., 2010). PCR products were sequenced on the Illumina HiSeq 2500 for
16S and ITS1 amplicons, and on the Illumina MiSeq for 18S amplicons (see
Supporting information). Demultiplexing, trimming and merging of
sequences, as well as clustering of sequences to obtain zero-radius OTUs
(Edgar, 2018), were performed using a custom made pipeline (see details
in Mod et al., (2021;
Malard et al., 2022). The raw sequence data is available on NCBI
bioproject number PRJNA810480 and PRJEB30010.
Proportional abundances (hereafter ‘relative-abundance’) were obtained
by dividing each OTU read count by sequencing depth. ‘Presence-absence’
data were obtained for each OTU using counts superior or equal to one as
presence, and lack of detection as absence. Taxonomic assignment of OTUs
was performed using the IDtaxa classifier
(Murali et al., 2018)
against the Silva v138 database for bacteria and archaea
(Quast et al., 2012),
the UNITE+INSD v9.0 database for fungi
(Abarenkov et al.,
2022), and the PR2 4.5 database for protists
(Guillou et al.,
2012). OTUs not corresponding to bacteria, archaea, fungi or protists
for the corresponding markers were discarded ( Supporting Information).
For each site, values representing a wide range of 78 covariates
covering climatic, edaphic, topographic, landuse-landcover and
remote-sensing conditions were obtained (Supporting Information). For
edaphic covariates (i.e. soil characteristics), data were measured from
samples collected in situ as described in Buri et al.
(2020) and Yashiro et
al. (2016). For other
covariates, data were extracted from spatial layers available at 25 m
resolution
(Broennimann &
Guisan, In review; Külling et al., In review).
Modelling
framework
‘Presence-absence’ modelling was performed for all OTUs present in more
than 5% and less than 95% of sites, leading to 47,520, 163, 17,318 and
2,147 OTUs for bacteria, archaea, fungi and protists, respectively.
‘Relative-abundance’ modelling was performed for all OTUs present in
more than 5% of sites, resulting in modelling of 48,316 bacterial, 163
archaeal, 17,345 fungal and 2155 protist OTUs. This selection was done
in order to have enough data points to model each OTU (see suppl. Table
1 for data about OTUs removed).
The following framework was applied to each selected OTU in R v4.3.0
(R Core Team, 2023).
All ‘presence-absence’ models were fitted using a Binomial probability
distribution. ‘Relative-abundance’ models were fitted as read counts
models using a Poisson probability distribution and the sequencing depth
as an offset (i.e. equivalent to fitting a ‘relative-abundance’ model;
Mod et al., 2021).
Covariates were selected by applying a two-step procedure
(Adde et al., 2023).
This procedure circumvents the lack of a priori knowledge about most
OTUs ecology, and optimises model predictive performances, thereby
bringing information on the ecology of the modelled organisms
(Adde et al., 2023).
The first step consists of a “data snooping” approach
(Dormann et al.,
2013). In other words, for each of the 78 candidate covariates,
univariate Generalised Linear Models (GLM) with quadratic effect were
fitted (Guisan et al.,
2002). The models’ predictive power, estimated using the difference
between the null deviance and the residual deviance, was used to select
the 15 best non-correlated covariates recursively, while excluding
covariates having a Pearson correlation greater than 0.7 with already
selected covariates
(Dormann et al.,
2013). The number of preselected covariates was capped at 15 in order
to limit the computing power needed for the subsequent step of the
analysis which further reduced the number of selected covariates using
model-embedded regularisation techniques. Four algorithms were used:
Generalised linear models
(GLM;
Guisan et al., 2002),
Generalised additive models
(GAM;
Guisan et al., 2002),
Random Forest (RF;
Cutler et al., 2007)
and Gradient Boosting Machine (GBM;
Elith et al., 2008).
GLMs had quadratic terms, and a lasso secondary covariate selection and
regularisation (“glmnet” package v4.1-7; Tay et al.,
2023). GAMs used
null-space penalization for covariate selection (“mgcv” package
v1.8-42; Wood, 2017).
A regularised form of Random Forests (RF) was built using the “RRF”
package v1.9.4 (Deng,
2013; Deng & Runger, 2013). For each OTU RF model, the number of trees
was determined through hyper-parametrization as in Elith et al.
(2008) in order to
minimise the error rate of the model
(Hastie et al.,
2009). We tested four different values (“ntree”=10,100,1000 or
10000). The “mtry” option was left as default value (3 for
‘presence-absence’ models, 5 for ‘relative-abundance’ models). Gradient
Boosting Machine models were built with the “gbm” package v2.1.8.1
(Greenwell et al.,
2022), in which hyper-parametrization procedure was performed on the
number of trees (10, 100, 1000 or 10000 trees) and the shrinkage value
(0.001, 0.01, or 0.1). We only tested a limited number of
hyper-parameters to reduce computing costs.
Cross-validation of models’
predictive
power
For each of the four algorithms’ best models, predictive power was
assessed using the “bootstrap .632+” cross-validation procedure
(Efron & Tibshirani,
1997) with 100 iterations per model. Unlike classical SDMs
cross-validations that sample data without replacement (e.g.
split-sampling, k-fold, see
Guisan et al., 2017),
this approach uses a bootstrap sample (i.e. with replacement) and
generates better estimations of model error rates
(Efron, 1983; Efron &
Tibshirani, 1997). For each bootstrap iteration of the
‘presence-absence’ model, the difference between predictions and
validation data was computed using AUC
(Swets, 1988) and
maximised values of TSS (maxTSS) and of Kappa
(Allouche et al.,
2006). The values were then averaged across the 100 iterations. Results
were compared to those obtained by using null models fitted with
randomised data
(Collart & Guisan,
2023). For each prevalence value existing in the dataset, 100
randomised models were fitted (i.e. 100 models fitting “randomly
distributed OTUs” with fixed prevalence). These randomised data were
used to rescale real OTUs’ maxTSS values for each OTU following this
formula:
maxTSSadj =
(maxTSSobs-maxTSSnull)/(1-maxTSSnull)
with maxTSSadj being the adjusted maxTSS for the
considered OTU’s model; maxTSSobs being the raw maxTSS
value obtained for that OTU’s model; and maxTSSnullbeing the 95th percentile of the distribution of models fitted on random
data with the same prevalence as the considered OTU. Models having a
positive maxTSSadj were considered as having higher
predictive power than expected by chance given the environmental dataset
specificities. By applying the reasoning of Swets
(1988) for AUC to our
metric, models with maxTSSadj>0.5 were
considered as having a high predictive power.
The same procedure was applied to ‘relative-abundance’ models using
Spearman correlation (rho) to check whether or not models were
accurately ranking sites by their ‘relative-abundance’ values, and
Coefficient of Variation (CoV) to assess the difference between
predicted ‘relative-abundance’ values and validation values. The null
model procedure could not be applied to ‘relative-abundance’ models due
to the high computing burden required to process the thousands of
microbial OTUs in the dataset, as each OTU model would necessitate its
own random distribution, whereas ‘presence-absence’ models with the same
prevalence could use the same random distribution. Hence, models were
classified as “useful” when rho>0.2
(Mod et al., 2021)
and as “good” when rho>0.4
(Landis & Koch,
1977).
After computing the predictive power for all OTU models, differences
between organism groups and phyla within each group were explored. Model
predictive powers were compared among the four main organism groups
using ANOVA followed by Tukey HSD tests with Bonferroni correction, and
Cohen’s D effect size metrics. To check for the potential effect of the
number and locations of sampling sites on model predictive power,
additional bacterial models were constructed using only the exact same
166 sites as used for the protist models. Then, performances of models
using 250 sites were compared performances of models using 166 sites for
each OTU with a paired student’s t-test.
Covariate selection and
importance
To have an insight on which covariates drive OTUs presence and relative
abundance, an analysis per organism group was performed. For each group
and each environmental covariate, the proportion of models selecting
that covariate was computed by clustering the covariates by their
correlation values to match clusters formed during covariates selection
(data snooping, Dormann et. al.
2013). For GLMs and
GAMs, the importance of covariates was assessed using the coefficients
of each covariate. For GBM and RF, covariate importance was assessed
using Gini coefficients
(Deng, 2013; Greenwell
et al., 2022). The obtained values were scaled so that the best
covariate from each model had an importance of 1, and the other
covariates were linearly rescaled from 1 to 0.
Results
Data on model performance and covariate importance for each OTU can be
downloaded in FigShare:https://figshare.com/s/825799db5d4fdc9a2f87(private link, which will be published upon acceptance).
1.Evaluation of ‘presence-absence’ models
Out of the 67,148 preselected OTUs in the dataset, all OTUs were
successfully fitted by at least one ‘presence-absence’ model algorithm
and 65,554 OTUs were successfully fitted by all four algorithms (details
in Supporting Information). Overall, 91% of bacteria, 98% of archaea,
81% of fungi, and 60% of protists had higher predictive power than
null models when modelling their distribution using GLMs (Figure 2a-d;
Supporting Information). However, the proportion of “high predictive
power” models (i.e. maxTSSadj>0.4) was
rather low in all groups, with 15% of bacteria, 15% of archaea, 6% of
fungi, and 0.1% of protists presenting a
maxTSSadj>0.4 for GLM (see Supporting
information for the other algorithms). ‘Presence-absence’ models for
bacteria and archaea had the best predictive power followed by fungi
models and protist models across all 4 modelling algorithms (Figure
2a-d). When the number of sites for bacteria and archaea was reduced
from 250 to the corresponding 166 sampled of protists, we observed a
small, yet significant, decrease in maxTSSadj (Student’s
paired tests: bacteria: p<0.001, df=45349, Cohen’s
D=0.15±0.01; archaea: p<0.001, df=157, Cohen’s D=0.33±0.22,
Supporting Information). However, these models still had higher
maxTSSadj values than protists models (bacteria:
padj<0.001, Cohen’s D=0.96±0.05; archaea:
padj<0.001, Cohen’s D=1.55±0.17).
Differences in performances were observed among phyla. Within bacteria,
phyla such as Chloroflexi, Acidobacteriota, and Planctomycetota
displayed a higher proportion of high predictive power models (see
Figure 3 for GLMs, Supporting Information for other algorithms), with
528/1537, 1171/5163, and 576/2765, respectively. Within archaea, only
some OTUs from Crenarchaeota had high predictive power models (20/131),
while most phyla had few assigned OTUs and no high predictive power
models. Few protists and fungi had good models, with Ascomycota and
Mortierellomycota phyla having some high predictive power models
(589/8631 and 89/1117, respectively).
2.Evaluation of ‘relative-abundance’ models
‘Relative-abundance’ models of all 67,979 preselected OTUs were
successfully fitted by at least one algorithm, while 64,732 OTUs were
fitted by four algorithms (details in Supporting Information). Spearman
correlation, which evaluates the ability of the models to discriminate
sites by their ‘relative-abundance’ values, demonstrated consistent
results with the ‘presence-absence’ models results (Figure 2). We
obtained numerous “useful” models (i.e. rho>0.2; e.g. for
GLMs: 85% of bacteria, 83% or archaea, 63% of fungi, 49% of protists
OTUs) and some “good” models (Spearman’s rho>0.4; e.g.
for GLMs: 40% of bacteria, 40% of archaea, 20% of fungi, 9% of
protists OTUs). Bacteria and archaea models had higher predictive power
than fungi and protist ones (Figure 2; Supporting Information). Some
phyla had a higher proportion of ‘relative-abundance’ models that
correctly ranked sites of OTUs (i.e. rho>0.4), such as for
the bacterial phyla Acidobacteriota (1887/5163), Chloroflexi (489/1537)
and Planctomycetota (756/2755; Figure 4). Among archaea, Crenarchaeota
was the only phylum with good predictive models (28/131). Among fungi,
Mortierellomycota had a higher proportion of good performing models
(194/1117). For some phyla such as Nanoarchaeota and all protist phyla,
the modelling pipeline could not produce ‘relative-abundance’ models
with rho>0.4. Prediction of exact ‘relative-abundance’
values was less successful, because their coefficient of variation
between predicted and validation values were high, with median
prediction error between 10% and 100% of mean observed
‘relative-abundance’ among the 4 studied model algorithms (Supporting
Information).
3. Covariate selection and importance
Edaphic covariates were the most selected across all groups for both
‘presence-absence’ and ‘relative-abundance’ models (Figure 5). For
example in GLMs, at least one edaphic covariate was selected in models
for 97% of archaea OTUs, 95% of bacteria OTUs, 87% of fungi OTUs and
80% of protist OTUs. In particular, pH was the most selected covariate
in models for all groups (Figure 5; Supporting Information). Climatic
covariates were also highly selected in GLMs, with 67% of archaea, 56%
of bacteria, 67% of fungi and 69% of protist best models including at
least one climatic covariate (Figure 5). For bacteria and archaea,
winter temperature was the most selected climatic covariate, while fungi
and protist models selected the set of covariates corresponding to
yearly average temperature, precipitation, and elevation covariates
(Supporting Information). Surprisingly, for fungi, distance to roads
appeared within the list of the most-selected covariates alongside the
edaphic and some topographic covariates. For protists, the most selected
covariates were found among distance to roads, climate, edaphic,
topographic and land-use covariates (Figure 5).