Discussion
Diversity of the 5S-IGS in Quercus
Observed intra-, inter-specific and intra-individual polymorphism of the
intergenic spacer of the 5S cistron in Quercus (in terms of unit
length, GC content, and nucleotide sequence) is consistent with many
other plant groups (Negi et al. 2002; Fulnecěk et al. 2002; Forest et
al. 2005; Grimm and Denk 2010; Mahelka et al. 2013; Mlinarec et al.
2016; Tynkevich and Volkov 2019). The intra-individual and/or
intra-specific occurrence of two (or more) distinct 5S array types (e.g.
short and long units in members of sect. Cerris , mainly Q.
libani ) is also consistent with the cited studies. The evolutionary
significance of such intra-genomic polymorphisms is at odds with the
generally acknowledged model of concerted evolution of rDNA, and its
presence across many living organisms has led to the definition of the
‘birth-and-death’ model of evolution (Nei and Rooney 2005; Rooney and
Ward 2005). Nevertheless, the occurrence of single rDNA loci in an
individual, and the location of rDNA arrays far from the telomere may
potentially facilitate concerted evolution (Eickbush and Eickbush 2007).
In Quercus , one single 5S rDNA locus was physically mapped at a
pericentromeric region (Ribeiro et al. 2011), and several species show
only one single length and/or main sequence variant; a combined effect
of both concerted and birth-and-death evolution models is therefore
possible (Galian et al. 2014).
New gene variants can be generated by gene duplication, or gained by
auto- or allopolyploidization (rDNA homoeologues; cf. Cronn et al.
2002), hybridization (crossing of evolutionary lineages) and unilateral
gene flow (introgression). As they diverge, some may become
non-functional pseudogenes, characterized by increased substitution
rates and reduced GC content, and may eventually be eliminated (Volkov
et al. 2007; see also Dzialuk et al. 2007 and Ribeiro et al. 2011, for
the detection of triploids in natural oak populations). Contrary to the
35S (45S) rDNA cistron (encoding for the 18S, 5.8S and 25S rRNA genes),
little is known about the GC content in functional 5S repeat units and
its non-coding intergenic spacers (Symonová 2019). In this study,
outlier sequence variants in CG content, relatively to each species’
mean (generally lower than 50%, with the exception of Q.
alnifolia ) were coincident with long terminal branches on the reference
tree, and/or incoherent groupings (Supplementary file S3), thus
indicating potential pseudogeny. This number was, however, negligible (8
sequences in the reference dataset and 248 in the HTS dataset,
corresponding to 0.004 and 0.002%, respectively). Conversely, the short
sequence variants exhibited by e.g. Q. libani are inconspicuous
(Tynkevich and Volkov 2019); these and other length outliers
(Supplementary file S1) from known hybrids or species able to hybridize
(e.g., Q. petraea-Q.robur-Q. pubescens ) were discussed in Denk
and Grimm (2010) and Simeone et al. (2018).
In general, the 5S-IGS sequences of endemic species appeared more
homogeneous (e.g., Q. alnifolia , Q. pontica , Q.
afares ). Inter-specifically shared variants included sister species
(e.g., Q. suber - Q. crenata , Q. cerris - Q. look ),
members of the same phylogenetic series (e.g., Q. cerris - Q.
trojana, Q. faginea - Q. canariensis ), or interfertile species (e.g.,Q. cerris - Q. suber, Q. petraea - Q. pubescens, Q. petraea - Q.
robur ). One ambiguous variant of section Ilex was generated by aQ. baloot outlier with reduced GC content (potential pseudogene).
Other variants were shared among more distantly related species, but
always within the same section, and could be explained by reticulation
or retained ancient polymorphism. Species of section Quercus were
the most difficult to resolve. Their 5S-IGS variants are more
homogeneous in structure and sequence than those of the two other
sections. This likely reflects the origin of the three sections on
topographically different continents (Greenland/America, no west-east
crossing mountain ranges vs. Asia, complex tectonic history), the more
ancient origin of sections Ilex and Cerris , and the
relatively recent expansion of white oaks into the Euro-Mediterranean
region, involving bottlenecks and rapid radiation (Hipp et al. 2019b).
At the same time, the hybridization rate in sect. Quercus is so
extensive that the whole clade can be considered a syngameon (Hipp et
al. 2019a). Besides a complex molecular evolution of the 5S cistron,
further ecological events such as retention of ancestral traits,
hybridization, species/population isolation, and drift may have played
key roles in shaping the diversity patterns and the partial
homogenization observed in the reference dataset (cf. Denk and Grimm
2010; Simeone et al. 2018). Nevertheless, the 5S-IGS data confirmed its
utility to resolve species or closely related species-groups in sectionsCerris, Ilex, Ponticae and (to a lower degree) Quercus .
Uncertainty of the HTS data
In HTS amplicon sequencing (e.g. DNA metabarcoding),
abundance of reads does not generally reflect taxa abundance (or
biomass) in the sample matrix or bulk communities (but see Keller et al.
2015; Hirai et al. 2015; Piredda et al. 2017). A recent review on
meta-analyses suggested that the methodology possesses some quantitative
ability, but with a large degree of uncertainty (Lamb et al. 2019).
Factors affecting the quantitative performance seem to be not related to
the different sequencing platforms, whereas primer affinity, PCR biases,
natural variation in copy number (and variation in biomass), are the
most common explanations proposed (Lamb et al. 2019; Kelly et al. 2019).
In our study, we applied a conservative approach where the performances
of the HTS results and the original species composition of samples were
considered only as incidence-data (presence/absence of taxa) (Hajibabaei
et al. 2011; Porter et al. 2018). In relation to the potential sources
of misidentifications (i.e., false-presence and false-absence), BLAST
and EPA performed equally well. We did not score species present in the
original DNA sample but not collected by the HTS data (false-absence).
However, cases of false-presence (species revealed by the HTS data but
not present in the original DNA sample) were reported both by BLAST and
EPA, when reads with cut-off abundances generally lower than 10 were
included in the analyses. These outcomes are well known in metabarcoding
studies and confirm that a filtering of low abundance reads can
generally avoid an inflation of diversity of samples with false-present
taxa. Explaining the possible sources of false-presence is a hard task.
In a qualitative and quantitative HTS assessment of artificial pollen
mixtures, Bell et al. (2019) detected false presences also in negative
controls, mostly corresponding to species included in their sample
mixtures. Such false presences can be related to cross-contamination,
flow-cell chimeras, or sequencing errors in the index sequences and
usually consist of very low abundance reads. Exploring different
abundance thresholds to control contamination is therefore an important
step. However, we note that the false-presence signals retrieved in our
data correlate not only with the used cut-offs but also with the
taxonomic complexity of the samples. In the pure and mixed samples
including members of sect. Quercus, for instance, only sequences with
abundance >25 allowed exclusive identification of the
correct species or species groups, whereas sequences with lower
abundances were only assignable to the entire section. In contrast, the
pure and mixed samples with members of sect. Ilex and Cerris scored
sequences assignable to different (sister or close) species only with
cut-offs <5. Interestingly, the pure samples referring to
members of different sections never revealed sequences assignable to
members of absent sections. These low-abundance sequences can all be
therefore interpreted as ancestral, underived 5S IGS variants (within
each section), or possible pseudogenes, and not exclusively as
contaminations. Conversely, in the more complex sample E4 (including 16
individuals of eight total species), the three target species of sect.
Cerris became all detectable with cut-off = 10, whereas Q. coccifera
could be identified only with cut-off = 2. Therefore, the effect of
different abundance cut-offs cannot be standardized, but rather adjusted
based on the complexity of samples and the stated aims of a study,
trading-off comprehensiveness vs. identification reliability. In
sections Ilex and Cerris, utilization of medium-high abundance sequences
(>10 or 25) can be more desirable for a net identification
of species in samples containing one or few individuals. Data can be
used to identify ambiguous specimens (e.g. hybrids, allochtonous
species, morphologically deviating individuals), to delineate
intra-specific variation and estimate inter-population diversity. Lower
abundances (<10) can be more useful to reveal significant
phylogenetic signals such as ancestry, introgression, pseudogeny. In
more complex samples, such as environmental bulk samples, lower
abundances (<10) may allow detection of less frequent species,
although resulting in a reduced taxonomic resolution. Concerning the
species of sect. Quercus, ‘per-se’ impossible to differentiate,
abundances lower than 25 appear of no use, unless for complex samples
containing members of the other sections. A valuable development of our
approach would be a thorough inspection of all the abundance cut-offs in
pure samples of any given species, to evaluate and assess the
consistency of the retrievable signal in relation to contamination,
possible pseudogenes, and potential (or the designed)
applications.
General geno-taxonomic
potential
BLAST and EPA approaches provided congruent identifications at high
thresholds (abundance cut-off ≥25). Both methods largely agreed in
identifying HTS sequences of sections Ilex and Cerris in
the pure and mixed samples and, given the absence of sub-sectional
resolution of the backbone tree (see Fig. 2) and the reference dataset,
neither method could unambiguously assign sequences to the target
species of sect. Quercus . Nevertheless, concordant species groups
were always detected by the two methods, with respect to the maximum
possible phylogenetic and geno-taxonomic resolution within the section.
EPA always identified the correct target species or species lineage of
each sample; BLAST often retrieved one or several of the possible
species, thus over-taxonomizing the query and including species absent
in the sample. This is likely due to the common share of 5S-IGS variants
among oak species, especially in section Quercus (cf. Denk and
Grimm 2010), and to the biased species representation in the reference
dataset, where some species were under-represented compared to others
such as Q. canariensis . We expect that both EPA and BLAST will be
co-informative when more 5S-IGS references will be newly made available
and more data accumulate. EPA will likely outperform BLAST in case of
new, distinct variants in the data that are absent in the reference. In
the future, samples displaying incongruence in the BLAST vs. EPA results
will require additional nucleome- and plastome-based analyses, ideally
accompanied by morpho-ecological re-examinations. With this workflow,
our method has the potential as a fast identification tool to evaluate
the possible occurrence of peculiar (e.g. alien) genotypes, rare
lineages, or hidden hybridization/introgression events. Keeping in mind
the overall species identification capacity of the 5S-IGS DNA region,
the combination of BLAST and EPA automated taxomomy on the HTS data
could be also useful to survey the overall genetic richness of local oak
communities, as well as filtering sequences of certain species or
species-groups from mixed samples for large-scale phylogenetic and
biogeographic analyses.