Metadata analysis
In order to inquire which ecological species properties influence the intraspecific genetic differentiation, we investigated for correlation between infraspecific genetic distances of COI barcode data and geographical distances among infraspecific samples using three different approaches (Figure 2, see below). Due some controversy in the scientific community about Mantel test (Peres-Neto et al., 2001; Legendre et al., 2015), we included here also ordination techniques (Legendre & Legendre 2012) and combined these to the method of principal coordinates of neighbor matrices in a Procrustes analysis (PCNM; Borcard & Legendre, 2002; Borcard et al., 2004; Dray et al., 2006). In a second step, we linked the outcome of these comparisons to the ecological properties for each species. Significance is expected to be in major part is also dependent from intraspecific sampling of each species which is in many species rather limited. The sheer amount of available data nevertheless promises sufficient significant results.
Meta-analysis was performed using R (R Core Team, 2017) in R-studio (Rstudio Team, 2016), Euclidean distance matrices from the set of geographical coordinates of collection sites of each specimen were generated with geosphere package (Hijmans et al., 2016). Pairwise distances from COI sequences using the Kimura 2-parameters DNA substitution model (Kimura, 1980) were calculated in the Apepackage (Paradis et al., 2014). Even there was some controversy about the use of Kimura 2-parameters as DNA substitution model in barcodes (Magnacca & Brown, 2010; Moniz & Kaczmarska, 2010; Srivathsan & Meier, 2012), we use this substitution model as recommended by Hebert et al. (2003) since this study inquires at “population” level.
On these two distance matrices we subsequently performed the following analyses to infer the number of species of each ecological class whose intraspecific genetic distances were correlated with the geographical distances: 1) a Mantel test, two variants of Procrustes test using either 2) Principal coordinates of neighbor matrices (PCNM) and non-metric multidimensional scaling (NMDS) or, 3) transformation-based principal components analysis (Tb-PCA) and NMDS. PCNM represents the geographical distance matrix while PcoA and NMDS represent the genetic distance matrix. Statistical significance in these comparisons indicates strength of evidence about the probable trend of differentiation of the populations involved (Allendorf & Luikart, 2007).
The Mantel test is a multivariate statistic method which typically compares two distance matrices that were calculated for the same set of objects but that are based on two independent sets of variables (e.g., a species dissimilarity matrix and site distance matrix) (Mantel, 1967). The test calculates the correlation between values in the corresponding positions of two matrices. Significance of the linear relationship between matrices is assessed through permutation of objects (Peres-Neto, 2001). Being first applied in population genetics by Sokal (1979), the Mantel test is currently one of the most commonly used methods to evaluate the relationship between geographic distance and genetic divergence (Mantel, 1967; see Manly, 1985, 1997; Diniz-Fhilo et al., 2013) – despite recent controversy and criticism about its statistical performance (e.g., Harmon & Glor, 2010; Legendre & Fortin, 2010; Guillot & Rousset, 2013, Catellano & Balletto, 2002) and the existence of more sophisticated and complex approaches to analyze spatial multivariate data (Diniz-Fhilo et al., 2013) – the Mantel test is still one of the most employed method in matrix data correlations. While Oden & Sokal (1992) reported a problem for the partial Mantel test for spatially autocorrelated data, Lapointe (1995) found problems with the simple Mantel test when using it for the comparison of dendrograms, and Peres-Neto et al. (2001) tested error type I on Mantel statistic compared to other techniques. Lastly, Nunn et al. (2006) and Harmon & Glor (2010) expressed concerns about the simple and partial Mantel tests when used for phylogenetic comparative analyses. Essentially, all the issues reported by these studies relate to inflated type I error rate or low power (Guillot & Rousset, 2013). In regard of the study design, the quality and the amount of the analyzed data of our case study, the Mantel test seems to be a reasonable technique- to inquire on the degree of relation between the two distance matrices. We ran the Mantel test on submatrices, each one representing a single species composed from more than four specimens. The null hypothesis, i.e., the absence of relationship, was rejected when the p value was lower than 0.05. To run these analyses we used the function “mantel()” in the vegan package (Oksanen et al., 2017) where the Mantel statistics was defined as a matrix correlation between the two, geographical and genetic, dissimilarity matrices.
Ordination techniques provide several alternative ways to search for a correlation among different kind of matrices (Legendre & Legendre, 2012). The most common current alternative to the Mantel tests is to ordinate the genetic distances and compare them with geographic coordinates or other vector representations of geographical distances (Diniz-Fhilo et al., 2013). Principal coordinates of neighbor matrices (PCNM; Borcard & Legendre, 2002; Borcard et al., 2004; Dray et al., 2006), is a powerful approach able to detect the spatial structure of varying scale in response data. Essentially, spatial variables are used to determine the distance between sites with special focus on neighboring sites (Borcard & Legendre, 2002). These distances are decomposed into a new set of independent (and hence orthogonal) spatial variables; it can be considered as a more general approach to transform geographic space in a raw data form. There are several variants of this approach (see Griffith & Peres-Neto, 2006; Bini et al., 2009; Landeiro et al., 2011; Diniz-Filho et al., 2009, 2012). The methodology followed here is the spatial Eigenfunction analysis (SEA) which has been extensively used in ecology, and also gained attention from landscape geneticists (i.e., Manel et al., 2003; Manel & Holderegger, 2013). PCNM can detect a wide range of spatial structures, including autocorrelation and periodic structures (Dray et al., 2006). The distance between objects is represented as a Euclidean distance matrix, calculated from spatial data (latitude and longitude values) associated with the sample locations. As the name suggests, PCNM is primarily concerned with ’neighboring’ sites. We used the PCNM() tool in the vegan package, that automatically set the threshold distance above which distances are simply considered ”large”. Any geographical distances above this value were set to four times the threshold value (see Borcard & Legendre, 2002). Finally, the modified distance matrix was subject to principal coordinates analysis (PcoA). Due to the ’truncation’ of the original distance matrix to create a neighbor matrix, a PcoA on a neighbor matrix (typically) produce more eigenvectors relative to the same analysis on a standard distance matrix. All resulting eigenvectors with positive eigenvalues have been used as a new set of explanatory spatial variables in a multiple regression approach trough Procrustes superimposition method. Therefore, when genetic data are regressed against these eigenvectors, some of them tend to describe the spatial patterns in genetic variation (Legendre & Fortin, 2010).
The transformation-based principal components analysis (Tb-PCA) is a two-step ordination method that it is considered to be the two-step equivalent of the principal coordinates analysis (PcoA). The principal coordinates analysis (PcoA) is a conceptual extension of the principal components analysis (PCA) technique (Pearson, 1901) also integrated in the PCNM procedure (Figure 2). It similarly seeks to order the objects along the axes of principal coordinates while attempting to explain the variance in the original data set. However, while PCA organizes objects by an eigenanalysis of a correlation or covariance matrix, PcoA can be applied to any distance (dissimilarity) matrix (Gower, 1966). As the PCNM technique, PCA generates a set of eigenvectors summarizing the distance matrix information. Legendre & Gallagher (2001) recommended, in order to simplify the analysis and standardize the data entries, to detrend the data (decostand() function in “vegan” package) before performing the PCA. Only the resulting positive eigenvectors have been combined with PCNM eigenvectors in a Procrustes analysis, which aims to find the species where the two distance matrices are better correlated.
Non-metric multidimensional scaling (NMDS) is a unique ordination technique in that a (small) number of ordination axes are explicitly chosen prior to the analysis and the data are then fitted to those dimensions. Because NMDS is a numerical rather than an analytical technique, it does not produce a unique solution. A ‘stress’ parameter is computed to measure the lack of fit between object distances in the NMDS ordination space and the calculated dissimilarities among objects (Paliy & Shankar, 2016). In contrast to tb-PCA, NMDS is a non-eigenvector-based approach, i.e., all axes produced are equally representing the data variance (Legendre et al., 2005). In order to compare and verify the reliability of the results we have chosen to perform this ordination method as well as the eigenvector-based one, and to compare the results then with each other. We performed NMDS analysis aiming at transforming the genetic distance matrices in a set of row data representing the genetic distance variance for every species (Figure 2).
In order to investigate the impact of sampling area size on the results, a fifth categorical, sampling-dependent variable have been introduced: the distance classes which represent the average distance between all individuals of the same species. A high distance class is equivalent with a larger sampling area of a species. The distance class thus represents the distance in which most of the sampled specimens were found. The largest geographical distance value was found to be 719 km (Longitarsus parvulus Paykull, 1799), the smallest was about 5 km (Chrysolina marginata Linnaeus, 1758). All mean distances within 100 km were in the distance class “100”, all those within 101 and 150 km were in the distance class “150”, those between 151 and 200 km in class “200”, to class “300” belong all those species ranging between 201 and 300 km and the last class includes all the species with a mean geographic distance higher than 301 km.
At first, we performed Procrustes analysis on PCNM axes and tb-PCA axes, representing the geographical structure and the genetic distances of each species, respectively (the latter calculated on the detrended data frame as recommended by Borcard & Legendre, 2002) (Figure 2). We have chosen Procrustes superimposition technique because variables could be either ordination axes (usually those that explain most of the variation in a data set) or original variables; it may also be applied to (dis)similarity matrices describing the same objects. Procrustes analysis is a statistical method which compares a collection of (multidimensional) shapes by attempting to transform them into a state of maximal superimposition. It does so by attempting to minimize the sum of squared distances between corresponding points in each shape through translation, reflection, rigid rotation, and dilation (scaling) of their coordinate matrices. In contrast to the Mantel test, Procrustes analysis allows to determine how much variance in one matrix is attributable to the variance in the other. When comparing ordinations from NMDS or PCA a ’shape’ may be defined by treating each ordinated point as a vertex (Peres-Neto & Jackson, 2001). In the present context, Procrustes analysis is applied on matrices that are back-converted into an ”object-by-variable” table by principal coordinates analysis (PcoA), non-metric multidimensional scaling (NMDS), or other suitable ordination methods (Jackson 2001). Configurations resulting from Procrustes analysis can be tested for non-randomness through repeated symmetric Procrustes analyses allowing to test for significant differences (protest()-function in vegan). We repeated the same analysis using the NMDS axes instead of the tb-PCA eigenvectors (both representing the genetic distance).
Finally, we performed linear regressions one by one within the same species on the sets of axes representing genetic distances against the PCNM axes representing the geographic distances to test for significance of the relationship between the two variables. PCNM axes were computed both times using pcnm() function in vegan package (Oksanen et al., 2017), while tb-PCA and NMDS axes were computed respectively using the pcoa() function in vegan and the nmds() function in “ecodist” package (Goslee & Urban, 2007).
Regarding the results of Procrustes analysis, we refer in the following text (statistics and significance values) to “NMDS statistics”, “NMDS significance” or generally to “NMDS results” for results that were computed using regressing NMDS and PCNM axes, while we use “PCA statistics”, “PCA significance” and “PCA results” for results referring to the Procrustes output involving tb-PCA axes and PCNM eigenvectors.