Introduction
The 16S rRNA gene is about 1,550 bp long and encodes the RNA component
of the small ribosomal subunit. Originally used by Woese and Fox to
examine the phylogeny of prokaryotes (Woese & Fox, 1977), the 16S rRNA
gene currently serves as a molecular clock (Woese, 1987) and as a means
for differentiating prokaryotic taxa (Callahan, McMurdie, & Holmes,
2017; Clarridge, 2004). Sequencing
this gene has revolutionized microbial ecology, allowing for the
identification of microbes that cannot be studied in isolation or that
exist in complex mixtures, and revealing the astounding complexity and
ubiquity of microbes globally (Thompson et al., 2017).
The structure of the 16S rRNA transcript and its essential function in
protein synthesis have limited the rate of evolutionary change in the
gene, resulting in highly conserved regions that can be leveraged as
primer targets and that flank variable regions suitable for sequencing
(Clarridge, 2004). Initial sequence-based assessments of prokaryotic
diversity
relied on Sanger sequencing, which could cover the entire 16S rRNA gene
but yielded few reads at high cost and effort. The advent of
next-generation sequencing technologies (e.g., Illumina HiSeq,
IonTorrent), hereafter amplicon sequencing, allowed a much greater
number of sequences to be obtained, but at lengths of <600 base pairs
(Caporaso et al., 2011).
Despite the development of third-generation sequencing techniques that
allow sequencing of full-length 16S rRNA genes and provide a higher
taxonomic resolution (Johnson et al., 2019; Matsuo et al., 2021),
amplicon sequencing of shorter
segments remains the most accessible method for the identification of
microbial communities. Amplicon sequencing data continues to grow
exponentially in sequence archives
(Jurburg, Konzack, Eisenhauer, & Heintz-Buschart, 2020), representing an
important data resource for future research, and has already provided
important insights into the abundance of prokaryotes (e.g., Louca,
Mazel, Doebeli, & Parfrey, 2019). A wealth of literature describes
the limitations and biases of amplicon sequencing, including the impact
of amplification (Brooks et al., 2015; Schloss, Gevers, & Westcott,
2011), bioinformatics processing (Kang, Deng, Crielaard, & Brandt, 2021;
Marizzoni et al., 2020; Prodan et al., 2020), and hypervariable region
(Bukin et al., 2019; Tremblay et al., 2015; Yang, Wang, & Qian, 2016;
Yu, García-González, Schanbacher, & Morrison, 2008) on the resulting
microbial diversity data.
Critically, while the positive impact of full 16S rRNA gene sequences on
taxonomic assignment has been well documented
(Curry et al., 2022; Johnson et al., 2019), the extent to which short
reads (i.e., <200 base pairs) are able to recover
higher-level taxonomic assignments and ecological patterns has received
little attention. Understanding the opportunities and limitations of
shorter 16S rRNA gene read lengths is essential, especially for the
reuse of rapidly growing sequence data archives
(Jurburg et al., 2020). Read truncation is a common practice that
removes lower-quality read ends (e.g., Callahan, Sankaran, Fukuyama,
McMurdie, & Holmes, 2016). Most bioinformatics workflows aim to maximize
read length; however, allowing
shorter read lengths can improve the comparability of sequences across
datasets, allowing for re-analyses that target the identical 16S rRNA
gene region and avoid biases that emerge from sequencing different but
overlapping target regions (Bukin et al., 2019; Tremblay et al., 2015;
Yang et al., 2016; Yu et al., 2008), or from differential read lengths.
Characterizing the impact of
shorter read lengths on 16S rRNA gene-based ecological assessments may
also support the integration of data from diverse platforms that
produce a range of sequence lengths (e.g., from full-length gene
sequencing with Nanopore to 150 bp single-end sequencing on the HiSeq).
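As a minimal illustration of the read truncation discussed above, the sketch below trims reads to a fixed length and discards reads that are too short. The function name, the toy sequences, and the choice to drop short reads are assumptions made for illustration only; they are not the workflow used in this study. The example also shows how truncation can merge sequences that differ only near the read end, which is one way shorter reads may reduce resolution.

```python
def truncate_reads(reads, length):
    """Trim every read to `length` bases and drop reads shorter than that.

    Hypothetical helper for illustration; real pipelines typically apply
    truncation together with quality filtering.
    """
    return [seq[:length] for seq in reads if len(seq) >= length]


# Toy reads (invented): the first two differ only in their last two bases.
reads = ["ACGTACGTAC", "ACGTACGTTT", "TTGACCGTAA", "ACGTAC"]

truncated = truncate_reads(reads, 8)

# Count distinct sequences before and after truncation.
unique_before = len(set(reads))
unique_after = len(set(truncated))
```

With these toy reads, the two sequences that differ only at their ends become identical after truncation to 8 bases, and the 6-base read is discarded.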
To examine the effect of sequence length on microbial diversity
estimates, three datasets from disturbed soil, water, and animal
microbiomes, all sequenced using the same primer set and sequencing
platform, were reprocessed across a gradient of read lengths. It was
hypothesized that shorter reads would result in 1) a higher percentage
of unclassified amplicon sequence variants (ASVs) and 2) lower richness
estimates, but that 3) the relationship between disturbed and
undisturbed samples in each environment would still be detectable.
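The two quantities named in the hypotheses can be sketched as follows, assuming richness is taken as the number of ASVs observed in a sample and an ASV counts as unclassified at a rank when it has no assignment at that rank. The data structures, ASV identifiers, and function names are hypothetical and chosen only for illustration; they do not describe the analysis performed here.

```python
def richness(counts):
    """Observed richness: number of ASVs with a non-zero count in a sample."""
    return sum(1 for c in counts.values() if c > 0)


def percent_unclassified(taxonomy, rank):
    """Percentage of ASVs lacking an assignment at the given rank."""
    if not taxonomy:
        return 0.0
    missing = sum(1 for ranks in taxonomy.values() if not ranks.get(rank))
    return 100.0 * missing / len(taxonomy)


# Invented example: read counts for one sample and taxonomy per ASV.
counts = {"asv1": 10, "asv2": 0, "asv3": 5}
taxonomy = {
    "asv1": {"phylum": "Proteobacteria", "genus": "Escherichia"},
    "asv2": {"phylum": "Firmicutes", "genus": None},
    "asv3": {"phylum": None, "genus": None},
}

sample_richness = richness(counts)
unclassified_genus = percent_unclassified(taxonomy, "genus")
unclassified_phylum = percent_unclassified(taxonomy, "phylum")
```

Computing both values at each read length would allow the first two hypotheses to be compared directly across the length gradient.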