Discussion
Amplicon sequencing remains the most common method for identifying
microbial communities, largely due to its low price and high throughput
relative to more novel techniques (e.g., long-read sequencing, shotgun
metagenomics). As the popularity of amplicon sequencing continues to
grow, so does the wealth of archived 16S rRNA sequences. Understanding
how bioinformatics choices affect the definition of species, and how
this in turn affects the detection of microbial diversity and of changes
in that diversity, is essential for the interpretation and reuse of
these data (Jurburg et al., 2022). This work evaluated how shorter read
lengths affect the
detection of microbial taxa, their taxonomic assignments, and
biodiversity estimates derived from these data. Its findings indicate
that short read lengths recover biodiversity patterns, but special
caution should be taken in the selection of biodiversity metrics to
examine these data.
As expected, shorter read lengths resulted in more unclassified ASVs,
but this was dependent on the target taxonomic level and varied across
read lengths. Classification was best in the animal dataset, which was
the least diverse and best-characterized system. Importantly, read
lengths greater than 100 bp yielded only marginal improvements in
taxonomic assignments at the family level and above for all the
datasets used, suggesting that, if only forward reads are available,
little information is lost with 100 bp reads relative to the full forward
read. Our results also highlight that genus-level taxonomic assignments
greatly depend on how well-characterized the microbiota of the target
environment are, and suggest that interpretations of genus-level
assignments are not recommended for shorter reads (Thompson et al.,
2017).
Further analyses highlighted the robustness of alpha and beta diversity
metrics, especially abundance-weighted metrics (i.e., the inverse Simpson
index and Bray-Curtis dissimilarities), to shorter read lengths. Reads of
90 bp could recover the majority of the alpha diversity observed with
200 bp, as well as the dissimilarity between communities belonging to
both biological replicates (i.e., variance or dispersion) and different
treatments. Importantly, the similarity between the 200 bp datasets and
their shorter versions increased with read length when assessed with
incidence-based Sorensen dissimilarities, but remained high for
abundance-weighted Bray-Curtis dissimilarities, even for the shortest
reads. As these two dissimilarity metrics differ only in their abundance
weighting, the differences observed between them suggest that rare
taxa are the ones most affected by shorter read lengths, highlighting
the sensitivity of rare-taxon detection to bioinformatics parameters.
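To make the contrast between the two dissimilarities concrete, the following is a minimal Python sketch, not the analysis code used in this study. The count vectors are hypothetical and the function names are illustrative; Sorensen dissimilarity is computed here as Bray-Curtis applied to presence/absence data, which is what makes the two metrics differ only in abundance weighting.

```python
def bray_curtis(a, b):
    """Abundance-weighted Bray-Curtis dissimilarity between two count vectors."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(a) + sum(b)
    return numerator / denominator


def sorensen(a, b):
    """Incidence-based Sorensen dissimilarity (Bray-Curtis on presence/absence)."""
    presence_a = [1 if x > 0 else 0 for x in a]
    presence_b = [1 if y > 0 else 0 for y in b]
    return bray_curtis(presence_a, presence_b)


# Hypothetical ASV counts for one sample processed at two read lengths.
# In the shorter version the three rarest ASVs are no longer detected
# (assumption made purely for illustration).
full_length = [500, 300, 150, 40, 5, 3, 2]
short_reads = [500, 300, 150, 50, 0, 0, 0]

bc = bray_curtis(full_length, short_reads)
so = sorensen(full_length, short_reads)
print(f"Bray-Curtis: {bc:.3f}  Sorensen: {so:.3f}")
```

In this toy case, losing three rare ASVs barely moves the Bray-Curtis dissimilarity (0.01) while the Sorensen dissimilarity rises to roughly 0.27, which is the pattern described above.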
Similarly, the detection of ASVs increased linearly with read length
until a saturation point that aligned with the expected diversity in
each environment explored (i.e., from least to most diverse, the animal,
aquatic, and soil microbiomes), emphasizing the importance of defining
diversity estimates relative to the trimming parameters. These results
highlight the importance of considering diversity estimates,
particularly incidence-based alpha diversity metrics (i.e., richness),
as a function of read length. In the case of data reuse and comparison
among datasets, this study demonstrates the importance of applying a
uniform read length across datasets in order to have comparable
diversity estimates.
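As an illustration of what applying a uniform read length might look like before reprocessing, the sketch below truncates reads from two hypothetical studies to a common length. It is a simplified stand-in rather than the pipeline used in this work: the function name, the in-memory (sequence, quality) tuples, and the choice to discard reads shorter than the target length are all assumptions made for the example.

```python
def truncate_reads(reads, trunc_len):
    """Truncate (sequence, quality) pairs to trunc_len bases.

    Reads shorter than trunc_len are discarded so that every retained
    read has exactly the same length (an assumption of this sketch).
    """
    truncated = []
    for seq, qual in reads:
        if len(seq) < trunc_len:
            continue  # too short to be compared at this length
        truncated.append((seq[:trunc_len], qual[:trunc_len]))
    return truncated


# Hypothetical forward reads from two studies with different read lengths.
study_a = [("ACGTACGTAC", "IIIIIIIIII"), ("ACGTAC", "IIIIII")]
study_b = [("ACGTACGTACGTACG", "IIIIIIIIIIIIIII")]

common_len = 8  # illustrative value, not a recommendation
trimmed_a = truncate_reads(study_a, common_len)
trimmed_b = truncate_reads(study_b, common_len)

# After truncation, all retained reads share one length across studies.
lengths = {len(seq) for seq, _ in trimmed_a + trimmed_b}
print(lengths)
```

Truncating both datasets before denoising, rather than trimming the resulting ASVs afterwards, keeps the whole reprocessing step identical across studies.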
With second-generation sequence data (i.e., Illumina MiSeq), sequence
quality decreases with read length (Callahan et al., 2016).
Consequently, at longer read lengths, fewer reads pass quality
checking, resulting in fewer reads (or observations) in the final,
processed dataset. Short read lengths may therefore increase the number
of observations per sample, particularly in low-quality sequences.
Furthermore, different studies employ different sequencing platforms,
which produce reads of different lengths; the shortest come from
Illumina HiSeq, which has a maximum read length of 150 bp, including
barcodes and primers (Di Bella, Bao, Gloor, Burton, & Reid, 2013). In
the case of paired-end sequence data, often only forward or merged reads
are archived (Jurburg et al., 2020). This work demonstrates how one
aspect of sequence
processing (i.e., trimming) affects the detection and taxonomic
assignment of microbial diversity. While several studies have examined
how technical choices (i.e., primer choice (Fouhy, Clooney, Stanton,
Claesson, & Cotter, 2016; Martínez-Porchas, Villalpando-Canchola, &
Vargas-Albores, 2016; Tremblay et al., 2015), pipeline selection
(Marizzoni et al., 2020), and rarefaction (McKnight et al., 2018; Weiss
et al., 2017)) affect the detection of diversity, systematic assessments
of how other technical choices (particularly bioinformatics parameters,
e.g., chimera checking) affect microbial diversity estimates are
lacking but urgently needed. Importantly, short
reads enable the reuse of sequence data in their rawest form, allowing
for complete and unified reprocessing of the sequence data from
different studies, which may in turn improve comparability among them
(Kang et al., 2021).
Processing metabarcoding data requires making a series of choices that
affect the final dataset and its interpretation (Abellan-Schneyder et
al., 2021). Sequence trimming is a critical part of processing, but its
effect on the resulting diversity estimates is often overlooked.
The analyses presented focused on the effect of sequence trimming in the
popular DADA2 pipeline, which detects amplicon sequence variants
(ASVs) rather than grouping sequences into clusters of 97% sequence
similarity. DADA2 has been extensively validated and exhibits high
sensitivity in detecting ASVs (Prodan et al., 2020). While the findings
in this study may guide the general processing of amplicon sequencing
data, it is important to note that they are specific to the DADA2
pipeline.
This study lays the groundwork for the analysis and reanalysis of
metabarcoding data using short read lengths, and yields several
recommendations. First, when comparing data with different technical
backgrounds (i.e., from different studies), trimming to the same read
length is important, especially for the analysis of alpha diversity.
Second, when using short read lengths, caution should be taken with the
interpretation of genus-level classifications. Third, abundance-weighted
diversity metrics (i.e., inverse Simpson index, Bray-Curtis
dissimilarity) are more robust to read length than incidence-based
metrics (i.e., richness and Sorensen dissimilarity). Finally, the
detection of microbial diversity from sequence data is far from
absolute, and should instead be considered relative to the read length
employed.
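The third recommendation can be illustrated with a small Python sketch comparing an incidence-based and an abundance-weighted alpha diversity metric on the same sample. The counts are hypothetical and the function names are illustrative; this is not the code used to produce the results in this study.

```python
def richness(counts):
    """Incidence-based alpha diversity: number of ASVs with non-zero counts."""
    return sum(1 for c in counts if c > 0)


def inverse_simpson(counts):
    """Abundance-weighted alpha diversity: 1 / sum of squared proportions."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts if c > 0)


# Hypothetical ASV counts for one sample at two read lengths. The shorter
# version no longer detects the three rarest ASVs (illustrative assumption).
full_length = [500, 300, 150, 40, 5, 3, 2]
short_reads = [500, 300, 150, 50, 0, 0, 0]

for label, counts in (("full", full_length), ("short", short_reads)):
    print(label, richness(counts), round(inverse_simpson(counts), 3))
```

Here richness drops from 7 to 4 ASVs, whereas the inverse Simpson index changes by less than 1%, because the lost taxa contribute very little to the squared proportions.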