Reference dataset: cloned 5S IGS data covering all four sections
of Quercus in western Eurasia
The dereplication step (identification and removal of identical
sequences) reduced the total Quercus reference dataset (1770
sequences obtained via PCR, cloning and Sanger-sequencing; Denk and
Grimm 2010; Simeone et al. 2018) to 1160 representative sequences. Most
identical sequences (442) occurred within single individuals or within
species. Twenty-two variants, comprising 163 identical sequences, were shared by
members of different species, generating ambiguities in the taxonomic
assignment. These sequences were flagged as “ambiguous”; they
comprised two variants shared by Q. baloot with Q. ilex and Q. coccifera (section Ilex), eight variants shared
between two or three different members of section Cerris (Q. crenata, Q. suber, Q. cerris, Q. trojana, Q. look, Q.
brantii), and thirteen variants shared by several members of section Quercus (details shown in Supplementary file S1). Quercus
pontica, the western Eurasian species of the disjunct and relict
two-species sect. Ponticae, is characterized by species-unique
5S-IGS variants.
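The dereplication and flagging logic described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name `dereplicate`, the record layout (clone identifier, species label, sequence) and the toy sequences are assumptions made for illustration only. Identical sequences are collapsed into one representative variant, and a variant is flagged as ambiguous when its copies come from more than one species.

```python
from collections import defaultdict

def dereplicate(records):
    """Collapse identical sequences into representative variants.

    records: iterable of (clone_id, species, sequence) tuples
    (a hypothetical layout, not the study's actual input format).
    """
    groups = defaultdict(list)
    for clone_id, species, seq in records:
        # Case-insensitive exact match defines "identical" here.
        groups[seq.upper()].append((clone_id, species))

    variants = []
    for seq, members in groups.items():
        species = sorted({sp for _, sp in members})
        variants.append({
            "sequence": seq,
            "n_copies": len(members),
            "species": species,
            # Shared between species: taxonomic assignment is ambiguous.
            "ambiguous": len(species) > 1,
        })
    return variants

# Toy input with made-up sequences, not real 5S-IGS data.
toy_records = [
    ("c1", "Q. ilex", "ACGT"),
    ("c2", "Q. ilex", "acgt"),
    ("c3", "Q. baloot", "ACGT"),
    ("c4", "Q. suber", "GGCC"),
]
toy_variants = dereplicate(toy_records)
```

In this toy case the four clones collapse to two variants, one of which is shared by two species and would therefore be flagged.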
The length range of the reference sequences was 258–407 bp (289–407 bp
in section Ilex, 297–397 bp in section Cerris, 258–296
bp in section Quercus, and 291 bp in Q. pontica). Some
species exhibited identical length variants (Q. canariensis, Q. faginea, Q. pyrenaica, Q. boissieri, Q. infectoria, Q.
pontica). All the remaining species revealed intragenomic and/or
intraspecific variation, with units differing by <10 bp (e.g., Q. alnifolia, Q. afares, Q. suber, Q. frainetto), or by
20–30 bp (e.g., Q. aucheri, Q. coccifera, Q. ilex, Q. brantii, Q.
crenata, Q. look, Q. robur, Q. petraea, Q. pubescens). Two distinct
classes of length variants, differing by 40–90 bp, were found in Q. floribunda, Q. ilex, Q. cerris, Q. trojana, Q. ithaburensis, Q.
macrolepis, and Q. libani. Apart from unique, unrepresentative
variants of Q. floribunda and Q. ilex, the short variants
reflected both inter-individual (Q. libani: 2 individuals) and
intra-individual variation (Q. cerris: 3 individuals; Q.
ithaburensis, Q. macrolepis and Q. trojana: 1 individual; cf.
Denk and Grimm 2010; Simeone et al. 2018). GC content
(PGC) of the reference sequences ranged from 44.1 to
56.9% (median = 53.61%, mean = 53.66%; SD = 1.25%) (Fig. 1A). The lowest
GC contents (PGC < 49%) were found in a few
sequences of Q. trojana, Q. cerris, Q. ilex and Q.
suber, which exhibit a known pseudogenous tendency (Denk and Grimm 2010;
Simeone et al. 2018). Only one species, Q. alnifolia, a Cypriot
endemic of section Ilex, was entirely below the mean range of the
genus. Outlier sequence variants (1st and last three
percentiles) in length and GC content relative to each species’ median
were labelled in the downstream analyses. Length and GC content of the
total dataset and each species are shown in Fig. 1A, B and Supplementary
file S1.
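As a rough illustration of how GC content and percentile-based outlier labels of this kind could be computed, the sketch below uses plain Python. It rests on assumptions that are not taken from the study: the helper names `gc_content`, `nearest_rank` and `flag_outliers` are hypothetical, the percentile is a simple nearest-rank estimate, and the 3rd and 97th percentile cut-offs are only one possible reading of the "1st and last three percentiles" criterion. In practice the flagging would be applied separately to the length and GC values of each species.

```python
import math

def gc_content(seq):
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else float("nan")

def nearest_rank(sorted_vals, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    k = max(1, math.ceil(pct * len(sorted_vals) / 100.0))
    return sorted_vals[k - 1]

def flag_outliers(values, lower_pct=3.0, upper_pct=97.0):
    """Return True for values outside the chosen percentile band.

    The default cut-offs are an assumed interpretation, not the
    thresholds documented for the original analysis.
    """
    ordered = sorted(values)
    low = nearest_rank(ordered, lower_pct)
    high = nearest_rank(ordered, upper_pct)
    return [v < low or v > high for v in values]

# Toy values only: integers 1..100 stand in for per-species lengths.
toy_values = list(range(1, 101))
toy_flags = flag_outliers(toy_values)
toy_gc = gc_content("GGCCAATT")
```

Calling `flag_outliers` once per species and per variable (length, GC content) would yield the labels carried into downstream analyses under these assumptions.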