Sequence processing and data analysis
Sequence data and metadata were downloaded from NCBI and processed using
the dada2 pipeline (B. J. Callahan et al., 2016) with standard parameters
(maxN=0, maxEE=2,
truncQ=2). As our goal was to explore how shorter read lengths affect
the taxonomic assignment of prokaryotes and the ecological conclusions
derived from the data, only the forward reads from each dataset were
used. Importantly for sequence data reuse,
reverse reads are often not available in archived sequence data
(Jurburg et al., 2020), either because paired-end sequencing was not
performed or because the reverse reads were not archived. Indeed, one of
the datasets used (Qian et al., 2017) had its paired-end reads merged
prior to archiving. For each sample, read length was varied from 50 to
200 bp in intervals of 10 bp.
This range of read lengths was selected as it represents the minimum
output of all next-generation sequencing technologies. Taxonomy was
assigned using SILVA v138 (Quast et al., 2013). For all samples, the
number of unassigned reads at each taxonomic level and the percentage of
original reads included in the final ASV table were recorded.
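The truncation and filtering steps described above can be illustrated with a minimal Python sketch. It is not the dada2 implementation (dada2 is an R package), and the function names truncation_lengths, expected_errors, and filter_read are hypothetical. The sketch assumes that each read-length treatment is produced by truncating forward reads to a fixed length, that reads shorter than that length are discarded, that the expected number of errors is the sum of 10^(-Q/10) over the retained bases, and that the filters are applied in the order shown.

```python
def truncation_lengths(start=50, stop=200, step=10):
    """Read lengths from 50 to 200 bp in 10 bp steps (inclusive)."""
    return list(range(start, stop + 1, step))


def expected_errors(quals):
    """Sum of per-base error probabilities implied by Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def filter_read(seq, quals, trunc_len, max_n=0, max_ee=2.0, trunc_q=2):
    """Return the truncated read, or None if it fails a filter.

    Assumed order (a simplification): quality truncation, length
    truncation, then the ambiguous-base and expected-error checks.
    """
    # Cut the read at the first base with quality <= trunc_q.
    for i, q in enumerate(quals):
        if q <= trunc_q:
            seq, quals = seq[:i], quals[:i]
            break
    # Reads shorter than the target length are dropped, not padded.
    if len(seq) < trunc_len:
        return None
    seq, quals = seq[:trunc_len], quals[:trunc_len]
    # Reject reads with more ambiguous bases than allowed.
    if seq.count("N") > max_n:
        return None
    # Reject reads whose expected error count exceeds the threshold.
    if expected_errors(quals) > max_ee:
        return None
    return seq


if __name__ == "__main__":
    read = "ACGT" * 60   # a made-up 240 bp forward read
    quals = [35] * 240   # made-up uniform Phred scores
    for length in truncation_lengths():
        kept = filter_read(read, quals, length)
        print(length, "kept" if kept else "discarded")
```

Because the expected-error sum grows with the number of retained bases, a read that passes maxEE=2 at 50 bp may fail at 200 bp. This is one reason the percentage of original reads retained in the final ASV table can differ between read lengths.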
ASV tables were analyzed using phyloseq (McMurdie & Holmes, 2013) and
vegan (Oksanen et al., 2007). To compare diversity estimates, all versions of each
dataset were rarefied to the lowest number of reads (23,354 reads for
the water dataset, 28,105 reads for the soil dataset, and 12,481 reads
for the animal dataset). Unless otherwise noted, all analyses were
performed on chimera-checked data. To explore the impact of read length
on the detection of microbial alpha diversity, the five control samples
of each dataset were selected to measure richness and inverse Simpson
diversity (Chao, Chiu, & Jost, 2014), which are more heavily weighted by
rare and dominant taxa, respectively. Similarly, to explore the effects of read
length on beta diversity, Bray-Curtis and Sorensen dissimilarities
between samples were examined. To assess the extent to which read length
affected the ecological conclusions derived from the data, samples from
before and 1 day after disturbance for each dataset were compared. For
alpha diversity, control and disturbed samples were compared using a
Wilcoxon test, and for beta diversity, control and disturbed samples
were compared using a PERMANOVA (adonis2) for each read length. Finally,
to examine the loss of ecological information with decreasing read
length, a Mantel test was performed for each dataset between the
dissimilarities (Bray-Curtis and Sorensen) at the longest read length
(200 bp) and those at each shorter read length.
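The diversity measures named above can be made concrete with a short Python sketch. This is not the phyloseq or vegan code used in the analysis. All function names are hypothetical, samples are represented as plain lists of per-ASV read counts, and the Mantel test is assumed to be a one-sided permutation test on the Pearson correlation between the two dissimilarity matrices.

```python
import random
from itertools import combinations
from math import sqrt


def rarefy(counts, depth, rng):
    """Subsample one sample's ASV counts to `depth` reads without replacement."""
    pool = [asv for asv, n in enumerate(counts) for _ in range(n)]
    rarefied = [0] * len(counts)
    for asv in rng.sample(pool, depth):
        rarefied[asv] += 1
    return rarefied


def richness(counts):
    """Number of ASVs observed; every ASV counts equally, however rare."""
    return sum(1 for n in counts if n > 0)


def inverse_simpson(counts):
    """1 / sum(p_i^2); dominated by the most abundant ASVs."""
    total = sum(counts)
    return 1.0 / sum((n / total) ** 2 for n in counts if n > 0)


def bray_curtis(a, b):
    """Abundance-based dissimilarity between two samples."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


def sorensen(a, b):
    """Presence/absence dissimilarity: 1 - 2 * shared / (S_a + S_b)."""
    shared = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    return 1.0 - 2.0 * shared / (richness(a) + richness(b))


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    return cov / sqrt(sum((p - mx) ** 2 for p in x) * sum((q - my) ** 2 for q in y))


def mantel(d1, d2, n_perm, rng):
    """Correlate two square dissimilarity matrices; permute sample labels of d2."""
    n = len(d1)
    pairs = list(combinations(range(n), 2))
    x = [d1[i][j] for i, j in pairs]
    observed = pearson(x, [d2[i][j] for i, j in pairs])
    hits = 0
    for _ in range(n_perm):
        p = list(range(n))
        rng.shuffle(p)
        if pearson(x, [d2[p[i]][p[j]] for i, j in pairs]) >= observed:
            hits += 1
    # The observed arrangement is counted as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)
```

In this framing, d1 would hold the pairwise dissimilarities among samples at 200 bp and d2 those among the same samples at a shorter read length. A Mantel statistic near 1 would then indicate that the shorter reads preserve the between-sample structure seen at 200 bp.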