Introduction
The 16S rRNA gene is about 1,550 bp long and encodes the RNA component of the small ribosomal subunit. Originally used by Woese and Fox to examine the phylogeny of prokaryotes (Woese & Fox, 1977), the 16S rRNA gene currently serves as a molecular clock (Woese, 1987) and as a means of differentiating prokaryotic taxa (Callahan, McMurdie, & Holmes, 2017; Clarridge, 2004). Sequencing this gene has revolutionized microbial ecology, allowing for the identification of microbes that cannot be studied in isolation or that exist in complex mixtures, and revealing the astounding complexity and ubiquity of microbes globally (Thompson et al., 2017).
The structure of the 16S rRNA transcript and its essential function in protein synthesis have limited the rate of evolutionary change in the gene, resulting in highly conserved regions that can be leveraged as primer targets and that flank the variable regions targeted for sequencing (Clarridge, 2004). Initial sequence-based assessments of prokaryotic diversity relied on Sanger sequencing, which could sequence the entirety of the 16S rRNA gene, but only for a small number of reads and at a high cost and effort. The advent of next-generation sequencing technologies (e.g., Illumina HiSeq, IonTorrent), hereafter amplicon sequencing, allowed for the sequencing of a much greater number of reads, but at a length of <600 base pairs (Caporaso et al., 2011).
Despite the development of third-generation sequencing techniques that allow the sequencing of full-length 16S rRNA genes and provide a higher taxonomic resolution (Johnson et al., 2019; Matsuo et al., 2021), amplicon sequencing of shorter segments remains the most accessible method for the characterization of microbial communities. Amplicon sequencing data continue to grow exponentially in sequence archives (Jurburg, Konzack, Eisenhauer, & Heintz-Buschart, 2020), representing an important data resource for future research, and have already provided important insights into the abundance of prokaryotes (e.g., Louca, Mazel, Doebeli, & Parfrey, 2019). A wealth of literature describes the limitations and biases of amplicon sequencing, including the impact of amplification (Brooks et al., 2015; Schloss, Gevers, & Westcott, 2011), bioinformatics processing (Kang, Deng, Crielaard, & Brandt, 2021; Marizzoni et al., 2020; Prodan et al., 2020), and the choice of hypervariable region (Bukin et al., 2019; Tremblay et al., 2015; Yang, Wang, & Qian, 2016; Yu, García-González, Schanbacher, & Morrison, 2008) on the resulting microbial diversity data.
Critically, while the positive impact of full 16S rRNA gene sequences on taxonomic assignment has been well documented (Curry et al., 2022; Johnson et al., 2019), the extent to which short read lengths (i.e., <200 base pairs) are able to recover higher-level taxonomic assignments and ecological patterns has received little attention. Understanding the opportunities and limitations of shorter 16S rRNA gene read lengths is essential, especially for the reuse of rapidly growing sequence data archives (Jurburg et al., 2020). Read truncation is a common practice that removes lower-quality read ends (e.g., Callahan, Sankaran, Fukuyama, McMurdie, & Holmes, 2016). Most bioinformatics workflows aim to maximize read length. However, allowing shorter read lengths can improve the comparability of sequences across datasets, enabling re-analyses that target the identical 16S rRNA gene region and avoid the biases that emerge from sequencing different but overlapping target regions (Bukin et al., 2019; Tremblay et al., 2015; Yang et al., 2016; Yu et al., 2008) or from differing read lengths. Characterizing the impact of shorter read lengths on 16S rRNA gene-based ecological assessments may also support the integration of data from diverse platforms that produce a range of sequence lengths (e.g., from full-gene sequencing with Nanopore to 150 bp with single-end sequencing on HiSeq).
To examine the effect of sequence length on microbial diversity estimates, three datasets from disturbed soil, water, and animal microbiomes, all sequenced using the same primer set and sequencing platform, were reprocessed across a gradient of read lengths. It was hypothesized that shorter reads would result in 1) a higher percentage of unclassified amplicon sequence variants (ASVs) and 2) lower richness estimates, but that 3) the relationship between disturbed and undisturbed samples in each environment would still be detectable.
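The intuition behind the second hypothesis can be illustrated with a minimal sketch. It is not the workflow used in this study: the function names (`truncate_reads`, `richness_by_length`) and the toy reads are hypothetical, and the count of unique sequences is used only as a crude stand-in for ASV richness. The sketch truncates the same reads to a gradient of lengths and shows how sequences that differ only toward their 3' end collapse into one as reads get shorter.

```python
def truncate_reads(reads, length):
    """Truncate each read to its first `length` bases.

    Reads shorter than `length` are discarded, so every retained
    read covers exactly the same positions.
    """
    return [r[:length] for r in reads if len(r) >= length]


def richness_by_length(reads, lengths):
    """Count unique sequences (a crude richness proxy) at each length."""
    return {n: len(set(truncate_reads(reads, n))) for n in lengths}


# Hypothetical toy reads: four distinct sequences, one of them duplicated.
toy_reads = [
    "ACGTACGTAC",
    "ACGTACGTAG",  # differs from the first read only at position 10
    "ACGTACCTAC",  # differs from the first read only at position 7
    "ACGTACGTAC",  # exact duplicate of the first read
    "ACGTTCGTAC",  # differs from the first read only at position 5
]

result = richness_by_length(toy_reads, [4, 6, 8, 10])
print(result)  # {4: 1, 6: 2, 8: 3, 10: 4}
```

In this toy case the number of distinguishable sequences falls from four at full length to one at the shortest truncation, because the distinguishing positions are progressively cut away. Real ASV inference also involves error modelling and chimera removal, which this sketch deliberately omits.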