HTS dataset
The total cleaned HTS dataset (six samples) comprised 208,720 reads. The number of reads retained in each pre-processing step, their length, GC content and distribution in the six samples are reported in Supplementary file S2.
The HTS dataset showed a length range of 254–399 bp and a PGC range of 42.6–64.5%, thus matching those of the reference dataset. Extreme values were exclusive to singleton reads (abundance = 1). The application of different abundance thresholds (≥2, up to ≥25) did not affect the length range or the occurrence of two clear classes of variants (Supplementary file S2). Reads with abundance ≥2 showed PGC = 48.1–57.1%; excluding those with lower occurrence (abundance <5) had little effect. Filtering for reads with higher abundance progressively eliminated lower GC values and further approached the mean range of the reference dataset; reads with abundance ≥25 showed a PGC of 52.5–56.4%. Therefore, high abundance cut-offs (i.e., 10 or 25) represent all the major 5S-IGS variants of the species included in the investigated samples and remove rare variants. The latter may include spurious data, pseudogenic variants, potential outliers of biological significance, and variants resulting from inherent biases of the HTS methodology that lead to scarce amplification of certain individuals. As such, they should be carefully evaluated.
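The effect of abundance cut-offs on length and GC ranges described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names (`gc_percent`, `summarize_by_cutoff`) and the toy reads are hypothetical, and the sketch assumes that abundance is simply the number of identical reads after dereplication.

```python
from collections import Counter


def gc_percent(seq):
    """Return the percentage of G and C bases in a sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def summarize_by_cutoff(reads, cutoffs=(1, 2, 5, 10, 25)):
    """Dereplicate reads and report length and GC ranges per abundance cut-off."""
    # Assumption: abundance = number of identical reads in the sample
    abundance = Counter(reads)
    summary = {}
    for cutoff in cutoffs:
        kept = [seq for seq, n in abundance.items() if n >= cutoff]
        if not kept:
            summary[cutoff] = None  # no variant reaches this cut-off
            continue
        lengths = [len(seq) for seq in kept]
        gcs = [gc_percent(seq) for seq in kept]
        summary[cutoff] = {
            "variants": len(kept),
            "length_range": (min(lengths), max(lengths)),
            "gc_range": (round(min(gcs), 1), round(max(gcs), 1)),
        }
    return summary


# Hypothetical toy reads (not data from this study):
# one abundant variant, one moderately abundant variant, one singleton
toy_reads = ["GGCCGGCC"] * 30 + ["GCATGCAT"] * 6 + ["ATATATATATAT"] * 1
toy_summary = summarize_by_cutoff(toy_reads)
```

In this toy case the singleton carries the extreme GC value and the extreme length, so both ranges narrow as soon as singletons are excluded, which mirrors the pattern reported for the real dataset.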

Reference tree and basic geno-taxonomic capacity of 5S-IGS sequences

The final multiple sequence alignment of the cloned reference data produced a matrix with 492 characters. Figure 2A–D shows the ML tree of the reference dataset, rooted between members of sections Ponticae and Quercus (subgenus Quercus) and sections Cerris and Ilex (subgenus Cerris; following Denk et al. 2017; Hipp et al. 2019). The main lineages (sections) are clearly differentiated, with Q. pontica (sect. Ponticae) being embedded in the section Quercus subtree (Fig. 2A). Regarding their geno-taxonomic association, i.e. the ability of genetic data to recognize a (morphologically defined) target species, it is important to note that even though species do not form exclusive clades (Fig. 2B–D), except for Q. alnifolia (sect. Ilex), Q. afares (sect. Cerris) and Q. pontica (sect. Ponticae), conspecific 5S-IGS sequences always cluster within certain subtrees. In sect. Ilex (Fig. 2B), determination of non- or near-identical sequences via tree-inference is possible down to the species level: Q. alnifolia 5S-IGS variants are unique; common variants of Q. coccifera and Q. aucheri intermix (and may include occasional Q. ilex clones) and are separated from a large clade including most Q. ilex sequences (cf. Denk and Grimm 2010). In section Cerris (Fig. 2C), distinct subtrees collect accessions of the central-western Mediterranean Q. suber-Q. crenata lineage, the eastern Mediterranean subsection Macrolepides (Q. brantii, Q. macrolepis, Q. ithaburensis), the Anatolian Q. libani-trojana lineage, and the mixed ‘oriental’ and ‘occidental’ lineages comprising the remaining species (Q. afares, Q. castaneifolia, Q. cerris, Q. look, Q. euboica, Q. trojana; cf. Simeone et al. 2018). In section Quercus (Fig. 2D), no obvious structure is visible, and association of sequences to discrete species is largely impossible. Three distinct subtrees consistently collected only sequences of Q. pyrenaica-Q. canariensis (genotype Q1), Q. robur-Q. dalechampii (genotype Q2), and Q. boissieri (Q3 genotypes). Sequences of Q. pontica form an exclusive subtree, corresponding to sect. Ponticae. A visual representation of the identified diagnostic groups of variants is presented in Supplementary file S3. The NEWICK format of the RAxML reference tree can be downloaded at https://doi.org/10.6084/m9.figshare.12016317.v1.

Geno-taxonomic composition of samples, cut-off effects

Genetic assessment via BLAST and per-sample ML phylogenetic inferences provided taxonomic patterns congruent with the species composition of each sample (Tab. 1, Supplementary files S2, S4).
Sample D5, pure Q. afares—Of the 31,620 HTS sequences retrieved in this sample, 31,121 (98.4%) were assigned by BLAST to Q. afares and a few hundred to members of the Cerris crown (Q. castaneifolia, Q. cerris, Q. trojana) or to the same section (Q. suber, Q. brantii). Negligible amounts of sequences were assigned to a further species of the same section (Q. macrolepis; three sequences), or to different sections (18 sequences assigned to members of sections Quercus and Ilex). Using an abundance cut-off ≥25, all HTS sequences grouped within the Q. afares reference ML subtree. With cut-off ≥10, the HTS sequences expanded the larger subtree of sect. Cerris that comprises Q. afares and brushed against a minor clade with Q. cerris, Q. trojana, Q. brantii (Type 4b, collecting likely ancestral, underived sequences; Fig. S3-2 in Supplementary file S3); with cut-off ≥5, the Q. afares subtree was inflated further and included a single reference sequence of Q. trojana. The main Cerris lineages could hardly be differentiated with cut-off ≥2. A dozen HTS reads joined sequences of the oriental Q. trojana-Q. libani, and a few others were highly differentiated and formed a separate cluster (possible pseudogenes).
Sample G2, pure Q. ilex—Of the 13,091 HTS sequences produced, nearly all (13,067; 99.8%) were assigned by BLAST to Q. ilex, and negligible amounts were assigned to Q. coccifera or members of different sections. In the ML analysis, all HTS sequences with abundance ≥5 grouped within a large Q. ilex subtree of the reference data; with cut-off ≥2, the HTS sequences grouped across different Q. ilex subtrees, and only a few sequences were placed together with Q. aucheri.
Sample H1, pure Q. faginea—Of the 58,003 HTS sequences produced, 33,546 (57.8%) were assigned by BLAST to Q. canariensis, an Iberian-North African sister species of Q. faginea, and 23,770 (41%) were variously assigned to either ambiguous or specific sequences of section Quercus, including Q. faginea, Q. pyrenaica, Q. petraea, Q. pubescens, and (secondarily) Q. frainetto, Q. vulcanica, and Q. robur. With ML, most HTS sequences with abundance ≥25 and ≥10 grouped within the Q. pyrenaica-Q. canariensis subtree corresponding to the canariensis-pyrenaica-unique type Q1, and all other sequences were scattered across the undiagnostic Q0 subtrees, often, but not necessarily, grouping with Q. faginea sequences. With decreasing abundance thresholds (cut-offs = 5 and 2), nearly the entire section Quercus was covered by HTS sequences (except the eastern Mediterranean Q3 type; Fig. S3-4 in Supplementary file S3).
Sample F2, mixed, one species per section—Of the 41,527 HTS sequences produced, 1,307, 6,399 and 9,454 (i.e., total of 48.5%) were assigned by BLAST to the target species Q. suber, Q. ilex and Q. canariensis, respectively. All other sequences were assigned to the Q. suber-crenata shared types (1,297), Q. coccifera (36), sister of Q. ilex, and to ambiguous or specific sequences of section Quercus (mainly Q. faginea, Q. pyrenaica, Q. petraea, Q. pubescens, and Q. frainetto). With ML, dispersion of HTS sequences onto the reference tree increased with decreasing abundance thresholds; however, a Q. ilex subtree was always identified, and placements outside the Q. suber-Q. crenata subtree were only recorded with cut-off = 2. Conversely, most HTS sequences centered around Q. faginea, Q. pyrenaica and Q. canariensis references (ubiquitous, undiagnostic type Q0 and canariensis-pyrenaica-specific type Q1; Fig. S3-4 in Supplementary file S3) only when a cut-off = 25 was applied.
Sample E5, mixed, one species per section—Of the 26,352 HTS sequences produced, only 1,140, 1,838 and 302 (i.e., in total 12.4%) were assigned by BLAST to the target species Q. coccifera, Q. suber, and Q. canariensis, respectively. Mirroring sample F2, all other sequences were assigned to the Q. suber-crenata shared types, Q. ilex, and to ambiguous or specific sequences of section Quercus (mostly Q. faginea, Q. pyrenaica, Q. petraea, Q. pubescens, and Q. frainetto). ML tree inferences also mirror those of sample F2: the Q. coccifera and Q. suber-Q. crenata subtrees are clearly identified with all abundance thresholds, together with the increasing dispersion along the sect. Quercus lineage. Only with cut-off = 2 were a few HTS sequences placed in a Q. ilex-Q. aucheri subtree and together with the likely ancestral Q. cerris-Q. trojana-Q. suber sequence clade.
Sample E4, mixed, 8 species, all three sections—Of the 38,127 sequences produced in this sample, 10,699 (28.1%) were assigned by BLAST to the target species included: Q. coccifera (28), Q. cerris (335), Q. trojana (479), Q. macrolepis (4,107), Q. frainetto (2,549), Q. petraea (352), Q. infectoria (332), and Q. pubescens (2,517). All other sequences were generally assigned to members of subsection Macrolepides (sect. Cerris), and to sect. Quercus. With ML, the dispersion of HTS reads increased with the reduced cut-offs, as in the less complex mixed samples. Reflecting the species composition of the sample, a minor part of the HTS sequences was always placed within the Macrolepides and the ‘occidental’ Cerris clades, whereas most sequences were dispersed across the section Quercus subtree. Only with cut-off = 2 was a Q. coccifera-exclusive clade identified.
Figure 3A, B shows the ML trees of samples E5 (cut-off = 5) and E4 (cut-off = 2), illustrating the different levels of information that can be obtained in relation to abundance cut-offs and/or complexity of samples. The NEWICK format of the 24 RAxML trees can be downloaded at https://doi.org/10.6084/m9.figshare.12016317.v1.
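As a worked check of the BLAST tallies, the following Python sketch recomputes the share of target-species assignments reported for sample E4 and the total number of cleaned reads across the six samples. All counts are taken from the text above; the variable names are illustrative only.

```python
# BLAST assignments to the eight target species of sample E4 (counts from the text)
e4_target_hits = {
    "Q. coccifera": 28,
    "Q. cerris": 335,
    "Q. trojana": 479,
    "Q. macrolepis": 4107,
    "Q. frainetto": 2549,
    "Q. petraea": 352,
    "Q. infectoria": 332,
    "Q. pubescens": 2517,
}

# HTS sequences per sample (counts from the text)
sample_totals = {
    "D5": 31620,
    "G2": 13091,
    "H1": 58003,
    "F2": 41527,
    "E5": 26352,
    "E4": 38127,
}

# Number and percentage of E4 sequences assigned to target species
e4_assigned = sum(e4_target_hits.values())
e4_share = 100.0 * e4_assigned / sample_totals["E4"]

# Size of the whole cleaned dataset (six samples)
total_reads = sum(sample_totals.values())

print(f"E4 target-species assignments: {e4_assigned} ({e4_share:.1f}%)")
print(f"Total cleaned reads: {total_reads}")
```

The same pattern (per-species counts summed and divided by the sample total) applies to the other samples.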

Automated species recognition using BLAST and EPA

Figure 4 summarizes the results of the taxonomic assignment of HTS sequences with abundance ≥25 obtained for each sample using the BLAST and EPA identification approaches (details provided in Supplementary file S5). Both approaches yielded correct, unequivocal assignments of the species of sections Ilex and Cerris included in the pure (D5 and G2) and in the mixed samples (E4, E5, F2). For samples including material covering sect. Quercus, BLAST and EPA results differ (below section-level) because the former is overly specific when assigning a sequence to a reference. Our reference data do not include 5S-IGS variants unique to Q. frainetto, Q. infectoria, or Q. petraea-pubescens (identified in samples E4, E5, F2; see Reference dataset). In sample H1 (pure Q. faginea, ecomorphotype ‘Q. lusitanica’), all genotype-Q1 5S-IGS variants (Fig. S3-4 in Supplementary file S3) uniquely shared by Q. canariensis-Q. pyrenaica in the reference data are identified as Q. canariensis by BLAST, but either as Q. canariensis or Q. canariensis-pyrenaica by EPA.
Fig. 5 shows two examples of the EPA assignments of samples containing members of section Quercus on the reference tree. The HTS sequences of sample F2 (containing Q. ilex, Q. suber, and Q. canariensis) unambiguously group on Q. ilex and Q. suber branches; a minor Q. canariensis-pyrenaica cluster is also identified, together with Q. faginea and other section Quercus subclades, often in basal positions. The HTS sequences of sample H1 (pure Q. faginea) aggregate on nearly the same sect. Quercus subclades as those of sample F2, with a larger occurrence of the most derived Q. faginea types, and a second cluster including Q. canariensis and Q. petraea. The overall performance of EPA in the identification process in the six tubes with cut-off ≥25 is shown in Fig. 6.