1. Introduction

Species distribution models (SDMs), which combine species occurrence data with environmental variables, are practical tools for understanding the dynamics of biodiversity distribution in space and time. A wealth of literature exists on the utility of SDMs, all aiming at explaining, predicting, and projecting species distributions (Araújo et al., 2019). In particular, identifying geographic distributions and the most influential variables at different geographic scales (Brito et al., 2009; Hemami et al., 2018; Vale et al., 2014), assessing the conservation coverage and efficiency of protected areas (Farhadinia et al., 2015; Lentini and Wintle, 2015; Zupan et al., 2014), predicting biological invasions by alien species (Thuiller et al., 2005; Tingley et al., 2014) and climate change-induced range shifts (Thuiller et al., 2011; Waltari and Guralnick, 2009; Yousefi et al., 2017), and combining SDM results with phylogenetic analyses to explore the evolutionary history of species (Ahmadi et al., 2018; Ahmadzadeh et al., 2016; Boucher et al., 2015; Saladin et al., 2019) are among the most widely used applications of SDMs.
In general, a variety of SDMs based on different algorithms have been developed, and these may produce different results for the same target species (Elith and Graham, 2009; Merow et al., 2014). Consequently, manipulating models and comparing their results have been the subject of considerable debate and research (Elith et al., 2006; Shabani et al., 2016; Wisz et al., 2008). According to Araújo et al. (2019), four aspects of SDMs determine the quality of the resulting model: the response variable (i.e. occurrence records of the species), the predictor variables, model building, and model evaluation. Because an SDM is a process of modeling and prediction, it carries uncertainty that arises from each of these aspects.
At each level, solutions have been proposed to increase the quality of the data and reduce the negative effects of uncertainty on the output models. For example, in the first step, improving the sampling design can reduce bias and inaccuracy in the geographical distribution of the collected data (Araújo and Guisan, 2006). At this level, the results of an SDM analysis are improved by ensuring that the collected data correctly represent the actual distribution of the species (Guillera‐Arroita et al., 2015; Tessarolo et al., 2014), ensuring that the scale of modeling and of the independent variables is consistent with sampling precision (Guisan et al., 2007; Wiens et al., 2009), and reducing bias in the taxonomic identification of the species (Hortal et al., 2008; Rocchini et al., 2011).
A basic assumption of statistical methods is that the recorded data are independent (i.e. randomly allocated samples with independent distribution), which requires that the entire area of interest is randomly or systematically sampled. In practice, available data on species distributions are spatially biased toward areas that are easily accessed and/or better surveyed (Araújo and Guisan, 2006; Boria et al., 2014). Differences in sampling strategy and intensity cause an uneven distribution of recorded data that is inconsistent with the real spatial ecology of the target species. This spatial bias may result in spatial clumping, which in turn leads to the over-representation in the model of areas with a higher density of input data. This can lead to spatial autocorrelation (SAC) of occurrence points (Dormann et al., 2007), which inflates model accuracy (Veloz, 2009) and biases parameter estimates (Kramer-Schadt et al., 2013).
In general, manipulating the input data (Elith et al., 2010; Phillips et al., 2009) and parametrizing the modeling method (Fithian et al., 2015; Muscarella et al., 2014) are two strategies that have been used to account for bias in SDM efforts. In particular, the bias caused by spatial autocorrelation can be reduced by spatial filtering (Boria et al., 2014; Kramer-Schadt et al., 2013) and by background weighting schemes, the latter also called ‘target-group background’ (Elith et al., 2010; Phillips et al., 2009). In spatial filtering, the severity of clumping is decreased by removing neighboring occurrence points that fall within a specified radius of one another. The idea behind background weighting comes from the fact that presence‐absence models are much less affected by sampling bias than presence‐only models (Phillips et al., 2009), because in presence-absence models the spatial sampling bias is reflected in both the presence and the absence data. Accordingly, background weighting selects background data (e.g. pseudo-absences) with the same bias as the occurrence points. This method reduces the bias that favors densely sampled areas over sparsely sampled areas (Phillips et al., 2009; Shabani et al., 2016). Elith et al. (2010) recommended this method for invasive species experiencing range shifts in invaded areas, particularly those with more recent invasions.
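The two data-manipulation strategies described above can be illustrated with a minimal sketch. It is not the procedure used in this study: the function names (haversine_km, spatial_filter, biased_background), the greedy thinning rule, the coordinates, and the per-cell sampling-effort values are all hypothetical and serve only to show the logic. Spatial filtering is approximated here by keeping a record only if no previously kept record lies within a given radius, and target-group background selection is approximated by drawing background cells with probability proportional to an assumed measure of sampling effort.

```python
import math
import random


def haversine_km(p, q):
    """Great-circle distance in km between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def spatial_filter(points, radius_km):
    """Greedy thinning: keep a point only if every kept point is at least radius_km away.

    The result depends on the input order; this is a simplification.
    """
    kept = []
    for p in points:
        if all(haversine_km(p, k) >= radius_km for k in kept):
            kept.append(p)
    return kept


def biased_background(cells, effort, n, seed=0):
    """Draw n background cells (with replacement), weighted by assumed sampling effort."""
    rng = random.Random(seed)
    return rng.choices(cells, weights=effort, k=n)


# Hypothetical occurrence records as (longitude, latitude) pairs.
occurrences = [(46.00, 38.00), (46.01, 38.00), (46.50, 38.40), (51.20, 32.10)]
thinned = spatial_filter(occurrences, radius_km=10.0)

# Hypothetical grid cells with an assumed effort surface (e.g. records of related taxa).
cells = ["cell_a", "cell_b", "cell_c"]
effort = [1.0, 8.0, 1.0]
background = biased_background(cells, effort, n=100)
```

In this toy example the second record lies less than 1 km from the first and is dropped, while the heavily surveyed cell_b is expected to dominate the background sample, so that presences and background share a similar sampling bias.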
In contrast, parametrizing SDMs to obtain a fine-tuned model has received little attention. In almost all cases, default settings are used to run SDMs, especially complex machine learning ones (Kass et al., 2021). In addition to increasing the possibility of overfitting caused by noisy data (Merow et al., 2014), default settings decrease model transferability during projection to novel environments (Guevara et al., 2018). Applying different levels of complexity and evaluating the balance between the bias and variance of the models makes it possible to find the optimal model with a justified level of complexity (Araújo et al., 2019; Radosavljevic and Anderson, 2014). Among the few attempts to parametrize SDMs are tuning the best combination of primary models into a final ensemble model (Kindt, 2018; Thuiller et al., 2009) and applying a set of input parameters to fine-tune the MaxEnt model, e.g. with the package ENMeval (Muscarella et al., 2014). The development of new tools, for example the h2o platform (Candel et al., 2016) or the caret package (Kuhn, 2021), can bring SDM parametrization into a new focus. However, a holistic effort in which a wider range of species distribution models is fine-tuned has, to our knowledge, not yet been implemented in this field of research.
Using SDMs is particularly useful for scarce species, as the results provide valuable information for implementing their conservation (Farhadinia et al., 2015; Franklin, 2010) and for identifying target areas for future sampling (Galante et al., 2018). Nevertheless, data on scarce species mostly suffer from spatial bias due to unbalanced sampling surveys (Rebelo and Jones, 2010). In this research, we evaluated the performance of different SDMs in identifying new populations of the rare species of the genus Montivipera in the mountains of Iran, Turkey, and Armenia. From a phylogeographic point of view, the species of this genus, owing to their rapid rate of speciation on recent evolutionary time scales, show interesting forms of neo-endemism in the Near and Middle East (Behrooz et al., 2018; Stümpel et al., 2016). The genus consists of two species complexes, the M. xanthina complex and the M. raddei complex. Here, we focused on the M. raddei complex (MRC), distributed across the mountainous landscapes of northeastern Turkey, Armenia, and Iran. The northern populations of the MRC are well described, and all their potential habitats are geographically well sampled. In contrast, the southern ranges in Iran across the Zagros Mountains have not been sampled proportionally, and some new populations of these species, plus a newly defined species, have only recently been identified (Behrooz et al., 2018). Accordingly, owing to the different intensity and quality of sampling, the distribution data of the species are geographically imbalanced and biased. We integrated model parametrization and data manipulation to evaluate the proficiency of four correlative SDMs, namely generalized linear models (GLM), gradient boosting models (GBM), random forest (RF), and maximum entropy (MaxEnt), in locating recently discovered Montivipera populations.
Models were fine-tuned based on their intrinsic parameters, and the input data were bias-corrected by implementing a background weighting procedure. We then compared the results with those of models built with a random background procedure, using the new populations as out-of-bag data to test the models. In addition to AUC and TSS, two commonly used measures of model accuracy, we also depicted the accuracy of the models across the entire gradient of suitability thresholds to provide a better understanding of how the models behave when classifying spatially imbalanced and biased data of the species.
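To make the evaluation measures concrete, the following minimal sketch computes AUC and traces TSS along a gradient of suitability thresholds. It assumes the standard definitions (TSS = sensitivity + specificity − 1; AUC as the probability that a presence site scores higher than a background or absence site, with ties counted as one half). The function names and the suitability scores are hypothetical illustrations, not results or code from this study, and the accuracy measure actually plotted across thresholds in the analysis may differ.

```python
def confusion_rates(scores, labels, threshold):
    """Sensitivity and specificity when sites with score >= threshold are classed as suitable."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def tss_curve(scores, labels, thresholds):
    """Return (threshold, TSS) pairs, with TSS = sensitivity + specificity - 1."""
    curve = []
    for t in thresholds:
        sensitivity, specificity = confusion_rates(scores, labels, t)
        curve.append((t, sensitivity + specificity - 1.0))
    return curve


def auc(scores, labels):
    """Rank-based AUC: share of presence/absence pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted suitability at test sites (1 = presence, 0 = background/absence).
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0]
thresholds = [i / 10 for i in range(1, 10)]

curve = tss_curve(scores, labels, thresholds)
best_threshold, best_tss = max(curve, key=lambda pair: pair[1])
model_auc = auc(scores, labels)
```

Plotting the full curve rather than reporting only the maximum shows how quickly classification accuracy degrades when the threshold moves away from its optimum, which is the kind of behavior a single summary statistic cannot reveal.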