Discussion

Diversity of the 5S-IGS in Quercus

Observed intra-, inter-specific and intra-individual polymorphism of the intergenic spacer of the 5S cistron in Quercus (in terms of unit length, GC content, and nucleotide sequence) is consistent with many other plant groups (Negi et al. 2002; Fulnecěk et al. 2002; Forest et al. 2005; Grimm and Denk 2010; Mahelka et al. 2013; Mlinarec et al. 2016; Tynkevich and Volkov 2019). The intra-individual and/or intra-specific occurrence of two (or more) distinct 5S array types (e.g. short and long units in members of sect. Cerris , mainly Q. libani ) is also consistent with the cited studies. The evolutionary significance of such intra-genomic polymorphisms is at odds with the generally acknowledged model of concerted evolution of rDNA, and its presence across many living organisms has led to the definition of the ‘birth-and-death’ model of evolution (Nei and Rooney 2005; Rooney and Ward 2005). Nevertheless, the occurrence of single rDNA loci in an individual, and the location of rDNA arrays far from the telomere may potentially facilitate concerted evolution (Eickbush and Eickbush 2007). In Quercus , one single 5S rDNA locus was physically mapped at a pericentromeric region (Ribeiro et al. 2011), and several species show only one single length and/or main sequence variant; a combined effect of both concerted and birth-and-death evolution models is therefore possible (Galian et al. 2014).
New gene variants can be generated by gene duplication, or gained by auto- or allopolyploidization (rDNA homoeologues; cf. Cronn et al. 2002), hybridization (crossing of evolutionary lineages) and unilateral gene flow (introgression). As they diverge, some may become non-functional pseudogenes, characterized by increased substitution rates and reduced GC content, and may eventually be eliminated (Volkov et al. 2007; see also Dzialuk et al. 2007 and Ribeiro et al. 2011, for the detection of triploids in natural oak populations). Contrary to the 35S (45S) rDNA cistron (encoding for the 18S, 5.8S and 25S rRNA genes), little is known about the GC content in functional 5S repeat units and its non-coding intergenic spacers (Symonová 2019). In this study, outlier sequence variants in CG content, relatively to each species’ mean (generally lower than 50%, with the exception of Q. alnifolia ) were coincident with long terminal branches on the reference tree, and/or incoherent groupings (Supplementary file S3), thus indicating potential pseudogeny. This number was, however, negligible (8 sequences in the reference dataset and 248 in the HTS dataset, corresponding to 0.004 and 0.002%, respectively). Conversely, the short sequence variants exhibited by e.g. Q. libani are inconspicuous (Tynkevich and Volkov 2019); these and other length outliers (Supplementary file S1) from known hybrids or species able to hybridize (e.g., Q. petraea-Q.robur-Q. pubescens ) were discussed in Denk and Grimm (2010) and Simeone et al. (2018).
In general, the 5S-IGS sequences of endemic species appeared more homogeneous (e.g., Q. alnifolia , Q. pontica , Q. afares ). Inter-specifically shared variants included sister species (e.g., Q. suber - Q. crenata , Q. cerris - Q. look ), members of the same phylogenetic series (e.g., Q. cerris - Q. trojana, Q. faginea - Q. canariensis ), or interfertile species (e.g.,Q. cerris - Q. suber, Q. petraea - Q. pubescens, Q. petraea - Q. robur ). One ambiguous variant of section Ilex was generated by aQ. baloot outlier with reduced GC content (potential pseudogene). Other variants were shared among more distantly related species, but always within the same section, and could be explained by reticulation or retained ancient polymorphism. Species of section Quercus were the most difficult to resolve. Their 5S-IGS variants are more homogeneous in structure and sequence than those of the two other sections. This likely reflects the origin of the three sections on topographically different continents (Greenland/America, no west-east crossing mountain ranges vs. Asia, complex tectonic history), the more ancient origin of sections Ilex and Cerris , and the relatively recent expansion of white oaks into the Euro-Mediterranean region, involving bottlenecks and rapid radiation (Hipp et al. 2019b). At the same time, the hybridization rate in sect. Quercus is so extensive that the whole clade can be considered a syngameon (Hipp et al. 2019a). Besides a complex molecular evolution of the 5S cistron, further ecological events such as retention of ancestral traits, hybridization, species/population isolation, and drift may have played key roles in shaping the diversity patterns and the partial homogenization observed in the reference dataset (cf. Denk and Grimm 2010; Simeone et al. 2018). Nevertheless, the 5S-IGS data confirmed its utility to resolve species or closely related species-groups in sectionsCerris, Ilex, Ponticae and (to a lower degree) Quercus .

Uncertainty of the HTS data

In HTS amplicon sequencing (e.g. DNA metabarcoding), abundance of reads does not generally reflect taxa abundance (or biomass) in the sample matrix or bulk communities (but see Keller et al. 2015; Hirai et al. 2015; Piredda et al. 2017). A recent review on meta-analyses suggested that the methodology possesses some quantitative ability, but with a large degree of uncertainty (Lamb et al. 2019). Factors affecting the quantitative performance seem to be not related to the different sequencing platforms, whereas primer affinity, PCR biases, natural variation in copy number (and variation in biomass), are the most common explanations proposed (Lamb et al. 2019; Kelly et al. 2019). In our study, we applied a conservative approach where the performances of the HTS results and the original species composition of samples were considered only as incidence-data (presence/absence of taxa) (Hajibabaei et al. 2011; Porter et al. 2018). In relation to the potential sources of misidentifications (i.e., false-presence and false-absence), BLAST and EPA performed equally well. We did not score species present in the original DNA sample but not collected by the HTS data (false-absence). However, cases of false-presence (species revealed by the HTS data but not present in the original DNA sample) were reported both by BLAST and EPA, when reads with cut-off abundances generally lower than 10 were included in the analyses. These outcomes are well known in metabarcoding studies and confirm that a filtering of low abundance reads can generally avoid an inflation of diversity of samples with false-present taxa. Explaining the possible sources of false-presence is a hard task. In a qualitative and quantitative HTS assessment of artificial pollen mixtures, Bell et al. (2019) detected false presences also in negative controls, mostly corresponding to species included in their sample mixtures. Such false presences can be related to cross-contamination, flow-cell chimeras, or sequencing errors in the index sequences and usually consist of very low abundance reads. Exploring different abundance thresholds to control contamination is therefore an important step. However, we note that the false-presence signals retrieved in our data correlate not only with the used cut-offs but also with the taxonomic complexity of the samples. In the pure and mixed samples including members of sect. Quercus, for instance, only sequences with abundance >25 allowed exclusive identification of the correct species or species groups, whereas sequences with lower abundances were only assignable to the entire section. In contrast, the pure and mixed samples with members of sect. Ilex and Cerris scored sequences assignable to different (sister or close) species only with cut-offs <5. Interestingly, the pure samples referring to members of different sections never revealed sequences assignable to members of absent sections. These low-abundance sequences can all be therefore interpreted as ancestral, underived 5S IGS variants (within each section), or possible pseudogenes, and not exclusively as contaminations. Conversely, in the more complex sample E4 (including 16 individuals of eight total species), the three target species of sect. Cerris became all detectable with cut-off = 10, whereas Q. coccifera could be identified only with cut-off = 2. Therefore, the effect of different abundance cut-offs cannot be standardized, but rather adjusted based on the complexity of samples and the stated aims of a study, trading-off comprehensiveness vs. identification reliability. In sections Ilex and Cerris, utilization of medium-high abundance sequences (>10 or 25) can be more desirable for a net identification of species in samples containing one or few individuals. Data can be used to identify ambiguous specimens (e.g. hybrids, allochtonous species, morphologically deviating individuals), to delineate intra-specific variation and estimate inter-population diversity. Lower abundances (<10) can be more useful to reveal significant phylogenetic signals such as ancestry, introgression, pseudogeny. In more complex samples, such as environmental bulk samples, lower abundances (<10) may allow detection of less frequent species, although resulting in a reduced taxonomic resolution. Concerning the species of sect. Quercus, ‘per-se’ impossible to differentiate, abundances lower than 25 appear of no use, unless for complex samples containing members of the other sections. A valuable development of our approach would be a thorough inspection of all the abundance cut-offs in pure samples of any given species, to evaluate and assess the consistency of the retrievable signal in relation to contamination, possible pseudogenes, and potential (or the designed) applications.

General geno-taxonomic potential

BLAST and EPA approaches provided congruent identifications at high thresholds (abundance cut-off ≥25). Both methods largely agreed in identifying HTS sequences of sections Ilex and Cerris in the pure and mixed samples and, given the absence of sub-sectional resolution of the backbone tree (see Fig. 2) and the reference dataset, neither method could unambiguously assign sequences to the target species of sect. Quercus . Nevertheless, concordant species groups were always detected by the two methods, with respect to the maximum possible phylogenetic and geno-taxonomic resolution within the section. EPA always identified the correct target species or species lineage of each sample; BLAST often retrieved one or several of the possible species, thus over-taxonomizing the query and including species absent in the sample. This is likely due to the common share of 5S-IGS variants among oak species, especially in section Quercus (cf. Denk and Grimm 2010), and to the biased species representation in the reference dataset, where some species were under-represented compared to others such as Q. canariensis . We expect that both EPA and BLAST will be co-informative when more 5S-IGS references will be newly made available and more data accumulate. EPA will likely outperform BLAST in case of new, distinct variants in the data that are absent in the reference. In the future, samples displaying incongruence in the BLAST vs. EPA results will require additional nucleome- and plastome-based analyses, ideally accompanied by morpho-ecological re-examinations. With this workflow, our method has the potential as a fast identification tool to evaluate the possible occurrence of peculiar (e.g. alien) genotypes, rare lineages, or hidden hybridization/introgression events. Keeping in mind the overall species identification capacity of the 5S-IGS DNA region, the combination of BLAST and EPA automated taxomomy on the HTS data could be also useful to survey the overall genetic richness of local oak communities, as well as filtering sequences of certain species or species-groups from mixed samples for large-scale phylogenetic and biogeographic analyses.