Reference dataset: cloned 5S IGS data covering all four sections of Quercus in western Eurasia
The dereplication step (identification and removal of identical sequences) reduced the total Quercus reference dataset (1770 sequences obtained via PCR, cloning and Sanger sequencing; Denk and Grimm 2010; Simeone et al. 2018) to 1160 representative sequences. Most identical sequences (442) occurred within single individuals or within species. Twenty-two variants, comprising 163 identical sequences, were shared by members of different species, generating ambiguities in the taxonomic assignment. These sequences were flagged as “ambiguous”; they comprised two variants shared by Q. baloot with Q. ilex and Q. coccifera (section Ilex), eight variants shared between two or three different members of section Cerris (Q. crenata, Q. suber, Q. cerris, Q. trojana, Q. look, Q. brantii), and thirteen variants shared by several members of section Quercus (details shown in Supplementary file S1). Quercus pontica, the western Eurasian species of the disjunct and relict two-species sect. Ponticae, is characterized by species-unique 5S-IGS variants.
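The dereplication and ambiguity flagging described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study. It assumes a hypothetical input of (sequence, species) pairs, treats sequences as identical only on exact string match after upper-casing, and marks a variant as ambiguous when more than one species carries it. The function name `dereplicate` and the toy records are illustrative assumptions.

```python
from collections import defaultdict


def dereplicate(records):
    """Collapse identical sequences and flag variants shared across species.

    records: iterable of (sequence, species) pairs (hypothetical input format).
    Returns a dict mapping each unique sequence to its copy number, the set of
    species carrying it, and an "ambiguous" flag.
    """
    groups = defaultdict(lambda: {"count": 0, "species": set()})
    for seq, species in records:
        key = seq.upper()  # assumption: identity means an exact string match
        groups[key]["count"] += 1
        groups[key]["species"].add(species)
    for info in groups.values():
        # A variant found in more than one species cannot be assigned uniquely
        info["ambiguous"] = len(info["species"]) > 1
    return dict(groups)


# Toy records for illustration only; these are not sequences from the study
toy = [
    ("ACGTAC", "Q. ilex"),
    ("ACGTAC", "Q. ilex"),
    ("ACGTAC", "Q. baloot"),
    ("GGGTAC", "Q. suber"),
]
variants = dereplicate(toy)
```

In this toy case the four input sequences collapse to two representative variants, one of which is shared by two species and would therefore be flagged as “ambiguous”.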
The length range of the reference sequences was 258–407 bp (289–407 bp in section Ilex, 297–397 bp in section Cerris, 258–296 bp in section Quercus, and 291 bp in Q. pontica). Some species exhibited identical length variants (Q. canariensis, Q. faginea, Q. pyrenaica, Q. boissieri, Q. infectoria, Q. pontica). All the remaining species revealed intragenomic and/or intraspecific variation, with units differing by <10 bp (e.g., Q. alnifolia, Q. afares, Q. suber, Q. frainetto), or by 20–30 bp (e.g., Q. aucheri, Q. coccifera, Q. ilex, Q. brantii, Q. crenata, Q. look, Q. robur, Q. petraea, Q. pubescens). Two distinct classes of length variants, differing by 40–90 bp, were found in Q. floribunda, Q. ilex, Q. cerris, Q. trojana, Q. ithaburensis, Q. macrolepis, and Q. libani. Besides unique, unrepresentative variants of Q. floribunda and Q. ilex, the short variants represented both inter-individual (Q. libani: 2 individuals) and intra-individual variation (Q. cerris: 3 individuals; Q. ithaburensis, Q. macrolepis and Q. trojana: 1 individual; cf. Denk and Grimm 2010; Simeone et al. 2018).
GC content (PGC) of the reference sequences ranged from 44.1 to 56.9% (median = 53.61%, mean = 53.66%; SD = 1.25%) (Fig. 1A). The lowest GC contents (PGC < 49%) were found in a few sequences of Q. trojana, Q. cerris, Q. ilex and Q. suber, exhibiting a known pseudogenous tendency (Denk and Grimm 2010; Simeone et al. 2018). Only one species, Q. alnifolia, a Cypriot endemic of section Ilex, was entirely below the mean range of the genus. Outlier sequence variants (1st and last three percentiles) in length and GC content relative to each species’ median were labelled in the downstream analyses. Length and GC content of the total dataset and each species are shown in Fig. 1A, B and Supplementary file S1.
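The GC content calculation and the per-species outlier labelling can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: GC content is computed as the percentage of G and C over the full sequence length (ambiguity codes are not counted as G or C), and an outlier is assumed to be any sequence falling below the 3rd or above the 97th percentile of its species for either length or GC content. The function names, the (id, species, sequence) record layout and the percentile cut-offs are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def gc_percent(seq):
    """Percentage of G and C bases over the full sequence length."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(seq.count(base) for base in "GC") / len(seq)


def flag_outliers(records, low=3, high=97):
    """Return the ids of sequences with extreme length or GC content.

    records: list of (seq_id, species, sequence) tuples (hypothetical layout).
    low/high: percentile cut-offs, assumed here to be the 3rd and 97th.
    """
    by_species = defaultdict(list)
    for seq_id, species, seq in records:
        by_species[species].append((seq_id, len(seq), gc_percent(seq)))

    flagged = set()
    for rows in by_species.values():
        if len(rows) < 2:
            continue  # percentiles are not meaningful for a single sequence
        for col in (1, 2):  # column 1 = length, column 2 = GC content
            cuts = quantiles([r[col] for r in rows], n=100, method="inclusive")
            lo, hi = cuts[low - 1], cuts[high - 1]
            flagged.update(r[0] for r in rows if r[col] < lo or r[col] > hi)
    return flagged


# Toy records for illustration only; these are not sequences from the study
toy = [(f"s{i}", "sp. A", "ACGTACGTAC") for i in range(50)]
toy.append(("long", "sp. A", "ACGT" * 10))
outliers = flag_outliers(toy)
```

Computing the cut-offs within each species, rather than across the whole dataset, keeps a species with consistently short or GC-poor units from being flagged wholesale.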