Bioinformatics
Three different bioinformatics workflows (pipelines) were used to
process raw paired-end Illumina data: 1) ESVs (exact sequence variants)
pipeline as implemented in DADA2 (Callahan et al. 2016); 2) 95%
OTUs (operational taxonomic units) pipeline, where OTUs are clustered at
95% sequence identity; and 3) a pipeline based on OTUs clustered at
97% identity. The processing of sequencing data to generate ESVs and
95% OTUs followed the workflows as described in Tapolczai, Keck,
Bouchez, Rimet and Vasselon (2019) and Rivera, Vasselon, Bouchez and
Rimet (2020), respectively, except that taxonomy assignment of the
representative sequences was performed using blastn algorithm (instead
of Naïve Bayesian classifier) (Camacho et al. 2009) with e-value
= 0.001, word size = 7, reward = 1, penalty = -1, gap open = 1 and gap
extend = 2 (against R-Syst v.7.2 diatom database (Rimet et al.2016)). Based on our positive and negative controls, the 95% OTUs data
set was further filtered to exclude low occurrence (≤ 3) reads per OTU
per sample to alleviate to ‘tag-switching’ error. The latter was not
performed for the ESVs data set as no sequences were observed in the
negative controls and no ‘read-leakage’ from the positive control
sample. Singleton ESVs/OTUs were discarded from the data sets (i.e.
ESVs/OTUs that had only one read across samples).
To generate 97% OTUs, raw paired-end Illumina sequencing data was
processed in PipeCraft (Anslan, Bahram, Hiiesalu & Tedersoo 2017),
which incorporates all the following tools (except LULU). Reads were
assembled and quality filtered using vsearch (fastq_minoverlen 15,
maxdiffs 45, fastq_minmergelen 200, fastq_maxee 1, fastq_maxns 0,
fastq_truncqual 5, fastq_allowmergestagger) (Rognes, Flouri, Nichols,
Quince & Mahé 2016). Chimera filtering step included vsearch
uchime_denovo algorithm with options id 0.97 and abskew 2. The filtered
sequences were clustered using UPARSE (Edgar 2013) with 97% similarity
threshold and minimum cluster size of 2 (i.e. singletons excluded). The
obtained OTU table was further curated with post-clustering algorithms
as implemented in LULU (Frøslev et al. 2017) to merge
consistently co-occurring ‘daughter’ OTUs (minimum_ratio = 1,
minimum_ratio_type = “avg”, minimum_relative_cooccurence = 0.8,
minimum_match = 96.97). Potential tag-switching errors were also
corrected based on negative and positive controls based on relative
abundances of sequences in the control samples (Taberlet, Bonin, Coissac
& Zinger 2018). To account for unequal sequencing depth, we rarefied
samples to common depth of 7000 sequences using the mothur software
(Schloss et al. 2009). The latter removed five samples from the
data set. Representative sequences per OTU were compared against R-Syst
v.7.2 diatom database using blasn as stated above.
The used primers (rbcl-646F and rbcL-998R) amplify DNA also from other
algae and bacteria (especially Proteobacteria and Cyanobacteria). To
exclude the non-target taxa, only OTUs that demonstrated the match
coverage and identity of ≥ 90% against a reference database, were
considered as diatom OTUs and included in the final tables produced by
each pipeline. According to additional blastn search against NCBI
database (Geer et al. 2009), the above threshold was confirmed to
include only diatom taxa into the final OTU table. OTUs with lower
thresholds to reference diatom sequences (in R-Syst) were often more
closely related (based on e-value, sequence similarity and coverage) to
other micro-algae (e.g. taxa from Mischococcales, Tribonematales,
Eustigmatophyceae), thus excluded from the downstream analyses.
Because of the uncertainty of the most adequate species-level sequence
similarity threshold for diatoms, the taxonomic composition comparisons
between metabarcoding treatments (HTS 10 and HTS 0.5) and microscopy was
performed on genus level. Reliable genus level classification of the
OTUs in the HTS data set was here defined when sequence similarity and
coverage was ≥ 95% against a reference sequence in the R-Syst database.
OTUs that displayed lower values against the R-Syst database sequences
were blastn-searched against the NCBI database to check for additional
genus-level annotations. Synonym names for genera were also explored in
the case of mismatches between microscopy and metabarcoding data sets.