HTS dataset
The total cleaned HTS dataset (six samples) comprised 208,720 reads. The
number of reads retained in each pre-processing step, their length, GC
content and distribution in the six samples are reported in
Supplementary file S2.
The HTS dataset showed a length range of 254–399 bp, and a
PGC range of 42.6–64.5%, thus matching those of the
reference dataset. Extreme values were exclusive to singleton reads
(abundance = 1). The application of different abundance thresholds (≥2,
up to ≥25) did not affect the length range and the occurrence of two
clear classes of variants (Supplementary file S2). Reads with abundance
≥2 showed PGC = 48.1–57.1%; excluding those with lower
occurrence (abundance <5) had little effect. Filtering for
reads with higher abundance increasingly eliminated lower GC values and
further approached the mean range of the reference dataset; reads with
abundance ≥25 showed a PGC of 52.5–56.4%. Therefore,
high abundance cut-offs (i.e., 10 or 25) represent all the major 5S-IGS
variants of the species included in the investigated samples and remove
rare variants. The latter may include spurious data, pseudogenic
variants, and potential outliers of biological significance, or may
reflect inherent biases of the HTS methodology that lead to scarce
amplification of certain individuals. As such, they should be carefully
evaluated.
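The effect of the abundance cut-offs described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the function names (`gc_percent`, `summarize_by_cutoff`) and the toy reads are assumptions introduced here, and reads are simply collapsed into unique variants by exact sequence identity.

```python
from collections import Counter


def gc_percent(seq):
    """Return the GC content of a sequence as a percentage."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def summarize_by_cutoff(reads, cutoffs=(1, 2, 5, 10, 25)):
    """Summarize unique variants retained at each abundance cut-off.

    `reads` is a list of sequence strings; identical strings are counted
    as one variant whose abundance is the number of copies.
    """
    abundance = Counter(reads)
    summary = {}
    for cutoff in cutoffs:
        kept = [seq for seq, n in abundance.items() if n >= cutoff]
        if not kept:
            summary[cutoff] = None  # nothing survives this threshold
            continue
        gc_values = [gc_percent(seq) for seq in kept]
        lengths = [len(seq) for seq in kept]
        summary[cutoff] = {
            "variants": len(kept),
            "length_range": (min(lengths), max(lengths)),
            "gc_range": (min(gc_values), max(gc_values)),
        }
    return summary


# Toy data (hypothetical): one abundant variant, one moderately
# abundant variant, and one AT-rich singleton.
toy_reads = ["GGCCAT"] * 30 + ["GCGAAA"] * 6 + ["ATATAT"]
result = summarize_by_cutoff(toy_reads)
for cutoff, stats in result.items():
    print(cutoff, stats)
```

In this toy case the singleton carries the extreme GC value, so the GC range narrows as soon as a cut-off ≥2 is applied, which mirrors the pattern reported for the real data.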
Reference tree and basic geno-taxonomic capacity of 5S-IGS sequences
The final multiple sequence alignment of the cloned reference data
produced a matrix with 492 characters. Figure 2A–D shows the ML tree of
the reference dataset, rooted between members of sections Ponticae and
Quercus (subgenus Quercus) and sections Cerris and Ilex (subgenus
Cerris; following Denk et al. 2017; Hipp et al. 2019). The main lineages
(sections) are clearly differentiated, with Q. pontica (sect. Ponticae)
being embedded in the section Quercus subtree (Fig. 2A). Regarding their
geno-taxonomic association, i.e. the ability of genetic data to
recognize a (morphologically defined) target species, it is important to
note that although species do not form exclusive clades (Fig. 2B–D; the
exceptions are Q. alnifolia (sect. Ilex), Q. afares (sect. Cerris) and
Q. pontica (sect. Ponticae)), conspecific 5S-IGS sequences always
cluster within
certain subtrees. In sect. Ilex (Fig. 2B), determination of non-
or near-identical sequences via tree-inference is possible down to the
species level: Q. alnifolia 5S-IGS variants are unique; common
variants of Q. coccifera and Q. aucheri intermix (and may
include occasional Q. ilex clones) and are separated from a large
clade including most Q. ilex sequences (cf. Denk and Grimm 2010).
In section Cerris (Fig. 2C), distinct subtrees collect accessions
of the central-western Mediterranean Q. suber-Q. crenata lineage,
the eastern Mediterranean subsection Macrolepides (Q.
brantii, Q. macrolepis, Q. ithaburensis), the Anatolian Q.
libani-trojana lineage, and the mixed ‘oriental’ and ‘occidental’
lineages comprising the remaining species (Q. afares, Q. castaneifolia,
Q. cerris, Q. look, Q. euboica, Q. trojana; cf. Simeone et al. 2018). In
section Quercus (Fig. 2D), no obvious structure is visible, and
association of sequences to discrete species is largely impossible.
Three distinct subtrees consistently collected only sequences of
Q. pyrenaica-Q. canariensis (genotype Q1), Q. robur-Q. dalechampii
(genotype Q2), and Q. boissieri (Q3 genotypes). Sequences of Q. pontica
form an exclusive subtree, corresponding to sect. Ponticae. A visual
representation of the identified diagnostic groups of variants is
presented in Supplementary file S3. The NEWICK format of the RAxML
reference tree can be downloaded at
https://doi.org/10.6084/m9.figshare.12016317.v1.
Geno-taxonomic composition of samples, cut-off effects
Genetic assessment via BLAST and per-sample ML phylogenetic inferences
provided taxonomic patterns congruent with the species composition of
each sample (Tab. 1, Supplementary files S2, S4).
Sample D5, pure Q. afares—Of the 31,620 HTS sequences retrieved
in this sample, 31,121 (98.4%) were assigned by BLAST to Q. afares
and a few hundred to members of the Cerris crown (Q. castaneifolia,
Q. cerris, Q. trojana) or to other members of the same section
(Q. suber, Q. brantii). Negligible amounts of sequences were assigned
to a further species of the same section (Q. macrolepis; three
sequences), or to different sections (18 sequences assigned to members
of sections Quercus and Ilex). Using an abundance cut-off ≥25, all HTS
sequences grouped within the Q. afares reference ML subtree. With
cut-off ≥10, the HTS sequences expanded the larger subtree of sect.
Cerris that comprises Q. afares and approached a minor clade with
Q. cerris, Q. trojana, Q. brantii (Type 4b, collecting likely
ancestral, underived sequences; Fig. S3-2 in Supplementary file S3);
with cut-off ≥5, the Q. afares subtree was inflated further and
included a single reference sequence of Q. trojana. The main Cerris
lineages could hardly be differentiated with cut-off ≥2. A dozen HTS
reads joined sequences of the oriental Q. trojana-Q. libani lineage,
and a few others were highly differentiated and formed a separate
cluster (possible pseudogenes).
Sample G2, pure Q. ilex—Of the 13,091 HTS sequences produced,
nearly all (13,067; 99.8%) were assigned by BLAST to Q. ilex,
and negligible amounts were assigned to Q. coccifera or members
of different sections. In the ML analysis, all HTS sequences with
abundance ≥5 grouped within a large Q. ilex subtree of the
reference data; with cut-off ≥2, the HTS sequences grouped across
different Q. ilex subtrees, and only a few sequences were placed
together with Q. aucheri.
Sample H1, pure Q. faginea—Of the 58,003 HTS sequences
produced, 33,546 (57.8%) were assigned by BLAST to Q. canariensis,
an Iberian-North African sister species of Q. faginea, and 23,770
(41.0%) were variously assigned to either ambiguous or specific
sequences of section Quercus, including Q. faginea, Q. pyrenaica,
Q. petraea, Q. pubescens, and (secondarily) Q. frainetto,
Q. vulcanica, and Q. robur. With ML, most HTS sequences with
abundance ≥25 and ≥10 grouped within the Q. pyrenaica-Q. canariensis
subtree corresponding to the canariensis-pyrenaica-unique type Q1, and
all other sequences were scattered across the undiagnostic Q0 subtrees,
often, but not necessarily, grouping with Q. faginea sequences. With
decreasing abundance thresholds (cut-offs = 5 and 2), nearly the entire
section Quercus was covered by HTS sequences (except the eastern
Mediterranean Q3 type; Fig. S3-4 in Supplementary file S3).
Sample F2, mixed, one species per section—Of the 41,527 HTS sequences
produced, 1,307, 6,399 and 9,454 (i.e., total of 48.5%) were assigned
by BLAST to the target species Q. suber, Q. ilex and Q.
canariensis, respectively. All other sequences were assigned to the
Q. suber-crenata shared types (1,297), Q. coccifera (36),
sister of Q. ilex, and to ambiguous or specific sequences of
section Quercus (mainly Q. faginea, Q. pyrenaica, Q. petraea,
Q. pubescens, and Q. frainetto). With ML,
dispersion of HTS sequences onto the reference tree increased with
decreasing abundance thresholds; however, a Q. ilex subtree was
always identified, and placements outside the Q. suber-Q. crenata
subtree were only recorded with cut-off = 2. Conversely, most HTS
sequences centered around Q. faginea, Q. pyrenaica and Q. canariensis
references (ubiquitous, undiagnostic type Q0 and
canariensis-pyrenaica-specific type Q1; Fig. S3-4 in
Supplementary file S3) only when a cut-off = 25 was applied.
Sample E5, mixed, one species per section—Of the 26,352 HTS sequences
produced, only 1,140, 1,838 and 302 (i.e., in total 12.4%) were
assigned by BLAST to the target species Q. coccifera, Q. suber,
and Q. canariensis, respectively. Mirroring sample F2, all other
sequences were assigned to the Q. suber-crenata shared types,
Q. ilex, and to ambiguous or specific sequences of section Quercus
(mostly Q. faginea, Q. pyrenaica, Q. petraea, Q. pubescens, and
Q. frainetto). ML tree inferences also mirror those of sample F2:
the Q. coccifera and Q. suber-Q. crenata subtrees are clearly
identified with all abundance thresholds, together with the increasing
dispersion along the sect. Quercus lineage. Only with cut-off = 2 were
a few HTS sequences placed in a Q. ilex-Q. aucheri subtree and together
with the likely ancestral Q. cerris-Q. trojana-Q. suber sequence clade.
Sample E4, mixed, eight species, all three sections—Of the 38,127
sequences produced in this sample, 10,699 (28.1%) were assigned by
BLAST to the target species included: Q. coccifera (28), Q. cerris
(335), Q. trojana (479), Q. macrolepis (4,107), Q. frainetto (2,549),
Q. petraea (352), Q. infectoria (332), and Q. pubescens (2,517). All
other sequences were generally assigned to members of subsection
Macrolepides (sect. Cerris), and to sect. Quercus. With ML, the
dispersion of HTS reads increased with the reduced cut-offs, as in
the less complex mixed samples. Reflecting the species composition of
the sample, a minor part of the HTS sequences was always placed within
the Macrolepides and the ‘occidental’ Cerris clades, whereas
most sequences were dispersed across the section Quercus subtree.
Only with cut-off = 2 was a Q. coccifera-exclusive clade identified.
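The per-sample shares reported above are simple ratios of BLAST-assigned reads to the reads produced in a sample. As a minimal sketch of that arithmetic (the helper `target_share` is a name introduced here for illustration; the counts are those given in the text for sample E4 and for the six sample totals):

```python
def target_share(total_reads, target_counts):
    """Return (reads assigned to target species, percentage of sample)."""
    assigned = sum(target_counts.values())
    return assigned, round(100.0 * assigned / total_reads, 1)


# BLAST assignments to the eight target species of sample E4 (from the text).
e4_counts = {
    "Q. coccifera": 28,
    "Q. cerris": 335,
    "Q. trojana": 479,
    "Q. macrolepis": 4107,
    "Q. frainetto": 2549,
    "Q. petraea": 352,
    "Q. infectoria": 332,
    "Q. pubescens": 2517,
}
assigned, share = target_share(38127, e4_counts)
print(assigned, share)

# Reads produced per sample (from the text); the sum is the cleaned dataset.
sample_reads = {
    "D5": 31620, "G2": 13091, "H1": 58003,
    "F2": 41527, "E5": 26352, "E4": 38127,
}
total_reads = sum(sample_reads.values())
print(total_reads)
```

The same helper can be applied to any other sample by swapping in its total and its per-species counts.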
Figure 3A, B reports the ML trees of samples E5 (cut-off = 5), and E4
(cut-off = 2), depicting the different levels of information that can be
obtained in relation to abundance cut-offs and/or complexity of samples.
The NEWICK format of the 24 RAxML trees can be downloaded at
https://doi.org/10.6084/m9.figshare.12016317.v1.
Automated species recognition using BLAST and EPA
Figure 4 summarizes the results of the taxonomic assignment of HTS
sequences with abundance ≥25 obtained for each sample using
the BLAST and EPA identification approaches (details provided in
Supplementary file S5). Both approaches yielded correct, unequivocal
assignments of the species of sections Ilex and Cerris included in the
pure (D5 and G2) and in the mixed samples (E4, E5, F2).
For samples including material covering sect. Quercus, BLAST and
EPA results differ (below section-level) because the former is overly
specific when assigning a sequence to a reference. Our reference data do
not include 5S-IGS variants unique to Q. frainetto, Q.
infectoria, or Q. petraea-pubescens (identified in samples E4,
E5, F2; see Reference dataset). In sample H1 (pure Q.
faginea, ecomorphotype ‘Q. lusitanica’), all genotype-Q1 5S-IGS
variants (Fig. S3-4 in Supplementary file S3) uniquely shared by
Q. canariensis-Q. pyrenaica in the reference data are identified
as Q. canariensis by BLAST, but either as Q. canariensis
or Q. canariensis-pyrenaica by EPA.
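Why a single best hit can be overly specific is easy to see with a toy comparison. The sketch below is purely illustrative and is not how BLAST or EPA work internally (EPA places queries on a phylogeny using likelihoods); the functions `best_hit` and `ambiguity_aware`, the `tolerance` parameter, and the scores are all hypothetical. It only contrasts reporting one top-scoring reference with reporting every reference that scores equally well.

```python
def best_hit(scores):
    """Report a single species: the top score, ties broken alphabetically."""
    return max(sorted(scores), key=lambda species: scores[species])


def ambiguity_aware(scores, tolerance=0.0):
    """Report all species whose score is within `tolerance` of the top."""
    top = max(scores.values())
    tied = sorted(sp for sp, s in scores.items() if top - s <= tolerance)
    return "-".join(tied)


# Hypothetical similarity scores of one HTS read against reference
# variants of three species; two references match equally well.
hits = {"Q. canariensis": 98.0, "Q. pyrenaica": 98.0, "Q. faginea": 91.5}

print(best_hit(hits))         # one species, although two are tied
print(ambiguity_aware(hits))  # the shared label
```

With tied references, the first function still commits to one species name, whereas the second returns a combined label, analogous to a variant shared by Q. canariensis and Q. pyrenaica being labelled with both names.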
Fig. 5 shows two examples of the EPA assignments of samples containing
members of section Quercus on the reference tree. The HTS
sequences of sample F2 (containing Q. ilex, Q. suber, and
Q. canariensis) unambiguously group on Q. ilex and Q. suber
branches; a minor Q. canariensis-pyrenaica cluster is also identified,
together with Q. faginea and other section Quercus subclades, often
in basal positions. The HTS sequences of sample H1 (pure Q. faginea)
aggregate on nearly the same sect. Quercus subclades as those of
sample F2, with a larger occurrence of the most derived Q. faginea
types, and a second cluster, including Q. canariensis and Q. petraea.
The overall ability of EPA to identify the species in the six samples
with cut-off ≥25 is shown in Fig. 6.