Metadata analysis
In order to inquire which ecological species properties influence the
intraspecific genetic differentiation, we investigated for correlation
between infraspecific genetic distances of COI barcode data and
geographical distances among infraspecific samples using three different
approaches (Figure 2, see below). Due some controversy in the scientific
community about Mantel test (Peres-Neto et al., 2001; Legendre et al.,
2015), we included here also ordination techniques (Legendre & Legendre
2012) and combined these to the method of principal coordinates of
neighbor matrices in a Procrustes analysis (PCNM; Borcard & Legendre,
2002; Borcard et al., 2004; Dray et al., 2006). In a second step, we
linked the outcome of these comparisons to the ecological properties for
each species. Significance is expected to be in major part is also
dependent from intraspecific sampling of each species which is in many
species rather limited. The sheer amount of available data nevertheless
promises sufficient significant results.
Meta-analysis was performed using R (R Core Team, 2017) in R-studio
(Rstudio Team, 2016), Euclidean distance matrices from the set of
geographical coordinates of collection sites of each specimen were
generated with geosphere package (Hijmans et al., 2016). Pairwise
distances from COI sequences using the Kimura 2-parameters DNA
substitution model (Kimura, 1980) were calculated in the Apepackage (Paradis et al., 2014). Even there was some controversy about
the use of Kimura 2-parameters as DNA substitution model in barcodes
(Magnacca & Brown, 2010; Moniz & Kaczmarska, 2010; Srivathsan &
Meier, 2012), we use this substitution model as recommended by Hebert et
al. (2003) since this study inquires at “population” level.
On these two distance matrices we subsequently performed the following
analyses to infer the number of species of each ecological class whose
intraspecific genetic distances were correlated with the geographical
distances: 1) a Mantel test, two variants of Procrustes test using
either 2) Principal coordinates of neighbor matrices (PCNM) and
non-metric multidimensional scaling (NMDS) or, 3) transformation-based
principal components analysis (Tb-PCA) and NMDS. PCNM represents the
geographical distance matrix while PcoA and NMDS represent the genetic
distance matrix. Statistical significance in these comparisons indicates
strength of evidence about the probable trend of differentiation of the
populations involved (Allendorf & Luikart, 2007).
The Mantel test is a multivariate statistic method which typically
compares two distance matrices that were calculated for the same set of
objects but that are based on two independent sets of variables (e.g., a
species dissimilarity matrix and site distance matrix) (Mantel, 1967).
The test calculates the correlation between values in the corresponding
positions of two matrices. Significance of the linear relationship
between matrices is assessed through permutation of objects (Peres-Neto,
2001). Being first applied in population genetics by Sokal (1979), the
Mantel test is currently one of the most commonly used methods to
evaluate the relationship between geographic distance and genetic
divergence (Mantel, 1967; see Manly, 1985, 1997; Diniz-Fhilo et al.,
2013) – despite recent controversy and criticism about its statistical
performance (e.g., Harmon & Glor, 2010; Legendre & Fortin, 2010;
Guillot & Rousset, 2013, Catellano & Balletto, 2002) and the existence
of more sophisticated and complex approaches to analyze spatial
multivariate data (Diniz-Fhilo et al., 2013) – the Mantel test is still
one of the most employed method in matrix data correlations. While Oden
& Sokal (1992) reported a problem for the partial Mantel test for
spatially autocorrelated data, Lapointe (1995) found problems with the
simple Mantel test when using it for the comparison of dendrograms, and
Peres-Neto et al. (2001) tested error type I on Mantel statistic
compared to other techniques. Lastly, Nunn et al. (2006) and Harmon &
Glor (2010) expressed concerns about the simple and partial Mantel tests
when used for phylogenetic comparative analyses. Essentially, all the
issues reported by these studies relate to inflated type I error rate or
low power (Guillot & Rousset, 2013). In regard of the study design, the
quality and the amount of the analyzed data of our case study, the
Mantel test seems to be a reasonable technique- to inquire on the degree
of relation between the two distance matrices. We ran the Mantel test on
submatrices, each one representing a single species composed from more
than four specimens. The null hypothesis, i.e., the absence of
relationship, was rejected when the p value was lower than 0.05. To run
these analyses we used the function “mantel()” in the vegan package
(Oksanen et al., 2017) where the Mantel statistics was defined as a
matrix correlation between the two, geographical and genetic,
dissimilarity matrices.
Ordination techniques provide several alternative ways to search for a
correlation among different kind of matrices (Legendre & Legendre,
2012). The most common current alternative to the Mantel tests is to
ordinate the genetic distances and compare them with geographic
coordinates or other vector representations of geographical distances
(Diniz-Fhilo et al., 2013). Principal coordinates of neighbor matrices
(PCNM; Borcard & Legendre, 2002; Borcard et al., 2004; Dray et al.,
2006), is a powerful approach able to detect the spatial structure of
varying scale in response data. Essentially, spatial variables are used
to determine the distance between sites with special focus on
neighboring sites (Borcard & Legendre, 2002). These distances are
decomposed into a new set of independent (and hence orthogonal) spatial
variables; it can be considered as a more general approach to transform
geographic space in a raw data form. There are several variants of this
approach (see Griffith & Peres-Neto, 2006; Bini et al., 2009; Landeiro
et al., 2011; Diniz-Filho et al., 2009, 2012). The methodology followed
here is the spatial Eigenfunction analysis (SEA) which has been
extensively used in ecology, and also gained attention from landscape
geneticists (i.e., Manel et al., 2003; Manel & Holderegger, 2013). PCNM
can detect a wide range of spatial structures, including autocorrelation
and periodic structures (Dray et al., 2006). The distance between
objects is represented as a Euclidean distance matrix, calculated from
spatial data (latitude and longitude values) associated with the sample
locations. As the name suggests, PCNM is primarily concerned with
’neighboring’ sites. We used the PCNM() tool in the vegan package, that
automatically set the threshold distance above which distances are
simply considered ”large”. Any geographical distances above this value
were set to four times the threshold value (see Borcard & Legendre,
2002). Finally, the modified distance matrix was subject to principal
coordinates analysis (PcoA). Due to the ’truncation’ of the original
distance matrix to create a neighbor matrix, a PcoA on a neighbor matrix
(typically) produce more eigenvectors relative to the same analysis on a
standard distance matrix. All resulting eigenvectors with positive
eigenvalues have been used as a new set of explanatory spatial variables
in a multiple regression approach trough Procrustes superimposition
method. Therefore, when genetic data are regressed against these
eigenvectors, some of them tend to describe the spatial patterns in
genetic variation (Legendre & Fortin, 2010).
The transformation-based principal components analysis (Tb-PCA) is a
two-step ordination method that it is considered to be the two-step
equivalent of the principal coordinates analysis (PcoA). The principal
coordinates analysis (PcoA) is a conceptual extension of the principal
components analysis (PCA) technique (Pearson, 1901) also integrated in
the PCNM procedure (Figure 2). It similarly seeks to order the objects
along the axes of principal coordinates while attempting to explain the
variance in the original data set. However, while PCA organizes objects
by an eigenanalysis of a correlation or covariance matrix, PcoA can be
applied to any distance (dissimilarity) matrix (Gower, 1966). As the
PCNM technique, PCA generates a set of eigenvectors summarizing the
distance matrix information. Legendre & Gallagher (2001) recommended,
in order to simplify the analysis and standardize the data entries, to
detrend the data (decostand() function in “vegan” package) before
performing the PCA. Only the resulting positive eigenvectors have been
combined with PCNM eigenvectors in a Procrustes analysis, which aims to
find the species where the two distance matrices are better correlated.
Non-metric multidimensional scaling (NMDS) is a unique ordination
technique in that a (small) number of ordination axes are explicitly
chosen prior to the analysis and the data are then fitted to those
dimensions. Because NMDS is a numerical rather than an analytical
technique, it does not produce a unique solution. A ‘stress’ parameter
is computed to measure the lack of fit between object distances in the
NMDS ordination space and the calculated dissimilarities among objects
(Paliy & Shankar, 2016). In contrast to tb-PCA, NMDS is a
non-eigenvector-based approach, i.e., all axes produced are equally
representing the data variance (Legendre et al., 2005). In order to
compare and verify the reliability of the results we have chosen to
perform this ordination method as well as the eigenvector-based one, and
to compare the results then with each other. We performed NMDS analysis
aiming at transforming the genetic distance matrices in a set of row
data representing the genetic distance variance for every species
(Figure 2).
In order to investigate the impact of sampling area size on the results,
a fifth categorical, sampling-dependent variable have been introduced:
the distance classes which represent the average distance between all
individuals of the same species. A high distance class is equivalent
with a larger sampling area of a species. The distance class thus
represents the distance in which most of the sampled specimens were
found. The largest geographical distance value was found to be 719 km
(Longitarsus parvulus Paykull, 1799), the smallest was about 5 km
(Chrysolina marginata Linnaeus, 1758). All mean distances within
100 km were in the distance class “100”, all those within 101 and 150
km were in the distance class “150”, those between 151 and 200 km in
class “200”, to class “300” belong all those species ranging between
201 and 300 km and the last class includes all the species with a mean
geographic distance higher than 301 km.
At first, we performed Procrustes analysis on PCNM axes and tb-PCA axes,
representing the geographical structure and the genetic distances of
each species, respectively (the latter calculated on the detrended data
frame as recommended by Borcard & Legendre, 2002) (Figure 2). We have
chosen Procrustes superimposition technique because variables could be
either ordination axes (usually those that explain most of the variation
in a data set) or original variables; it may also be applied to
(dis)similarity matrices describing the same objects. Procrustes
analysis is a statistical method which compares a collection of
(multidimensional) shapes by attempting to transform them into a state
of maximal superimposition. It does so by attempting to minimize the sum
of squared distances between corresponding points in each shape through
translation, reflection, rigid rotation, and dilation (scaling) of their
coordinate matrices. In contrast to the Mantel test, Procrustes analysis
allows to determine how much variance in one matrix is attributable to
the variance in the other. When comparing ordinations from NMDS or PCA a
’shape’ may be defined by treating each ordinated point as a vertex
(Peres-Neto & Jackson, 2001). In the present context, Procrustes
analysis is applied on matrices that are back-converted into an
”object-by-variable” table by principal coordinates analysis (PcoA),
non-metric multidimensional scaling (NMDS), or other suitable ordination
methods (Jackson 2001). Configurations resulting from Procrustes
analysis can be tested for non-randomness through repeated symmetric
Procrustes analyses allowing to test for significant differences
(protest()-function in vegan). We repeated the same analysis using the
NMDS axes instead of the tb-PCA eigenvectors (both representing the
genetic distance).
Finally, we performed linear regressions one by one within the same
species on the sets of axes representing genetic distances against the
PCNM axes representing the geographic distances to test for significance
of the relationship between the two variables. PCNM axes were computed
both times using pcnm() function in vegan package (Oksanen et al.,
2017), while tb-PCA and NMDS axes were computed respectively using the
pcoa() function in vegan and the nmds() function in “ecodist” package
(Goslee & Urban, 2007).
Regarding the results of Procrustes analysis, we refer in the following
text (statistics and significance values) to “NMDS statistics”, “NMDS
significance” or generally to “NMDS results” for results that were
computed using regressing NMDS and PCNM axes, while we use “PCA
statistics”, “PCA significance” and “PCA results” for results
referring to the Procrustes output involving tb-PCA axes and PCNM
eigenvectors.