Materials and Methods
Sampling and lab
procedures
Forty-one oak trees belonging to 13 species (three Quercussections) were morphologically identified and collected in the wild. DNA
extractions were performed from silica-gel dried leaves with the DNeasy
plant minikit (QIAGEN). DNAs were quantified with a NanoDrop
spectrophotometer (TermoFisher Scientific). We prepared six artificial
samples, consisting of pure and mixed species (Tab. 1), by pooling equal
amounts of DNA of every individual up to 20 total ng. Paired-end
Illumina sequencing (2×300 bp sequencing) was performed by a commercial
provider (LGC Genomics GmbH). The nuclear ribosomal 5S intergenic spacer
(5S-IGS) was amplified with the plant-specific primer pair
CGTGTTTGGGCGAGAGTAGT (forward) and CTGCGGAGTTCTGATGG (reverse),
developed in collaboration with Dr. Berthold Fartmann (LGC Genomics).
Raw sequences were deposited in the Sequence Read Archive under
BioProject PRJNA611057.
Bioinformatics analyses
We assembled a reference dataset consisting of 1770 5S-IGS sequences
covering the taxonomic breadth and distribution range of western
Eurasian Quercus by combining data from previous works (Denk and
Grimm 2010; Simeone et al. 2018). The total dataset was dereplicated to
identify and remove identical sequences using mothur v.1.33.0
(Schloss et al. 2009). Unique (nonredundant) sequences were aligned
using mafft (Katoh and Standley 2013) and the alignment was
manually checked using SeaView v.4.0 (Gouy et al. 2010). Sequence length
and percentage of GC content were calculated within jemboss 1.5
(Carver and Bleasby 2003) and plotted using ggplot2 R package (Wickham
2016). A maximum likelihood tree was inferred using RAxML
v.8.2.11 (Stamatakis 2014), with the GTR+GAMMA model of nucleotide
substitution. The tree was visualized in iTOL
(www.itol.embl.de) (Letunic and
Bork 2019). The database including unique 5S-IGS representative
sequences (unaligned fasta file), the corresponding taxonomy file (in
Mothur compatible format), and an Excel file (providing correspondence
between clone codes, GenBank accessions and geographic origin) is
available on FigShare (https://doi.org/10.6084/m9.figshare.12016272.v1).
Illumina paired-end reads were processed using mothur v.1.33.0
(Schloss et al. 2009). Contigs between read pairs were assembled and
differences in base calls in the overlapping region were solved using
the ΔQ parameter as described in Kozich et al. (2013). Primer sequences
were removed, and no ambiguous bases were allowed. The remaining reads
were dereplicated and screened for chimeras using UCHIME inde-novo mode (Edgar et al. 2011) within mothur.
Taxonomic assignment of the dereplicated reads was performed using
standalone BLAST in BLAST+ suite (Altschul et al. 1990; Camacho et al.
2009) against the reference dataset. Reads assigned with a query
coverage <220 bp and with similarity <90% were
removed.
Length and percentage of GC content of the cleaned dataset were
calculated within jemboss 1.5 (Carver and Bleasby 2003).
Scatter plots between GC content and reads abundance were visualized
with ggplot2 R package (Wickham 2016). Within each sample, we generated
four subsets based on the abundance of reads (subset reads ≥2; subset
reads ≥5; subset reads ≥10; subset reads ≥25). This procedure produced
24 subsets (4 subsets x 6 samples); each of them was aligned with the
reference data and used to infer the corresponding tree with RAxML
v.8.2.11 (Stamatakis 2014) and the GTR+GAMMA model, then visualized in
iTOL to inspect coherence of clades.
Reads with total abundance ≥25 in the total dataset (six samples) were
also assigned using Evolutionary Placement Algorithm (EPA; Berger et al
2011) as implemented in RAxML onto the phylogenetic tree inferred from
the Quercus reference data. For computing the phylogenetic
placements, an alignment comprising both the reference and the query
reads was performed within mothur (align.seqs). EPA outputs
showed multiple possible placement positions with different likelihood
weights (probabilities) in all branches of the tree. Placement results
(tree.jplace standard placement format) for the total Quercusreference dataset and each Quercus clade were visualized using
iTOL. The relative performances of the taxonomic assignments obtained
with BLAST (best hit) and EPA (tree branch placement) were visualized
using semicircle donut plots generated with geom_arc_bar() function in
ggforce R package (https://ggforce.data-imaginist.com/).