3. Challenges in amplicon sequence data analysis
3.1. Primer selection dictates phylogenetic coverage
As choice of primers can influence the taxa observed in an amplicon sequencing dataset, it is of utmost importance to take care with regard to primer selection and the interpretation of resulting data as to the community changes/impacts of treatments. Given the high diversity of soil communities (of which the understanding is constantly growing due to massive sequencing efforts), no primer pair will cover the complete phylogenetic breadth on a high rank such as domain (e.g., bacteria, archaea, fungi). As a consequence, part of the communities will always be missing. This is inevitable but it is of particular concern when studies attribute soil functions to taxa found to be "rare" in their amplicon sequencing data due to low coverage of that group
\citep{Chen_2020}. Nevertheless, evaluated and recommended primer pairs are available including the updated versions of 515F-806R primers for surveys of archaea and bacteria (e.g.,
https://earthmicrobiome.org/). Primer selection is even more challenging for studies of eukaryotes, owing to hypervariable sequence lengths and multiple gene copy numbers (due to multiple operons and/or polykaryosis) that contribute to biased amplification of some phylogenetic groups during PCR. This bias may for example lead to the under-estimation of some fungal groups, having a downstream effect on diversity estimates
\citep{Baldrian_2021}. Arbuscular mycorrhizal fungi for instance are largely overlooked by commonly used ITS primers which could lead investigators to infer that arbuscular mycorrhizal fungi are rare
\citep{George_2019}. A promising alternative to ITS-targeted short-read sequencing is the use of
long-read sequencing (e.g., PacBio) which enables the investigation of most fungi (including Glomeromycota) and other soil eukaryotes through covering both the full ITS region and part of the small subunite of the rRNA gene \cite{Tedersoo_2019,Tedersoo_2020}. We refer readers to in-depth reviews that further discuss challenges regarding amplicon sequencing of fungi specifically, including discussion of primer selection and coverage
\citep{Nilsson2019,Baldrian_2021}.
The choice of primers has substantial impacts on estimates of diversity in community studies. As a consequence, we urge researchers to use tools such as TestPrime (
https://www.arb-silva.de/search/testprime/) to evaluate the current status of the coverage of their target microbial groups of interest before sequencing and to discuss this aspect in their publications. We also recommend that reviewers critically assess the coverage of the target group of organisms used in a study to improve future evaluation of sequencing-based research in soil ecology.
3.2. Compositionality necessitates careful data processing
One of the first steps in the analysis of amplicon sequencing data is the
removal of potential sequencing errors. Doing so eliminates sequencing artefacts that may falsely boost diversity levels \citep{Edgar2011,Haas2011}. The use
of amplicon sequencing variants (ASVs), instead of operational taxonomic
units (OTUs) helps to overcome this issue by assigning a greater
probability of a true biological sequence being more abundant than an
error-containing sequence
\citep{Callahan_2017}. To that end,
bioinformatic tools such as DADA2
\citep{Callahan_2016} and Deblur
\citep{Amir_2017} attempt to
use sequencing error profiles to resolve amplicon sequencing data into
ASVs. An ASV is more likely to have an
intrinsic biological meaning (i.e., being a true DNA sequence), as opposed to an OTU which can either be a representation of the most abundant biological sequence or a consensus sequence (of which the latter may not exist in reality). In addition, ASVs facilitate the merging of datasets, particularly when the same sequencing primer pairs are used.
Another relevant step when analyzing sequencing data is to account for
the different sequencing efforts across samples (i.e., sequencing depth) that can result in a substantially different number of recovered
reads even among replicates. Ways to tackle this include
total library size normalization and rarefaction, with both remaining debated to date \cite{McMurdie_2014,Weiss_2017}.
Bioinformatic tools such as DeSeq2 and EdgeR provide ways to normalize count tables \citep{Love_2014,Robinson_2010}. Both
methods are applied on raw or low-abundance filtered count tables, and
have performed well in both real as well as simulated datasets and outperform
rarefaction-based approaches
\citep{McMurdie_2014}. Other
alternatives that account for the compositional aspect of sequencing data
include centered log-ratio (CLR), isometric log-ratio (ILR) or additive log-ratio (ALR)
ratios transformations on a count data matrix with adequate replacements of zeros \cite{Aitchison_1984,Egozcue_2003}.
Following data normalization, traditional workflows
include the generation of distance matrices for ordination, clustering,
and variance partitioning analyses. Commonly used distance metrics
include Bray-Curtis, Jaccard and Unifrac (weighted and unweighted). These metrics are often used although they do not take into
account the compositional nature of sequencing data. The Aitchison
distance - defined as the Euclidian distance on top of a centered log-ratio
transformed count matrix – is a viable compositional alternative \cite{Aitchison_1984} that allows performing ordinations (e.g., PCA biplots). Additionally, the
"Philr" transformation metric has been introduced as a compositional alternative
to the weighted Unifrac that carries phylogenetic information
\citep{Silverman_2017}. Most of the
above mentioned compositional options are implemented in R packages and
include publicly available tutorials. In light of the challenges related to normalization and analysis of compositional data, we recommend a critical evaluation of available data analysis tools to best address the nature of each experimental setup (see also section 6).
Another aspect that prevents data analyses from being fully quantitative is the potential of multiple copies of marker genes present per organism, which may also vary across taxa. For example, the 16S rRNA gene copy number per bacterial cell can vary between 1 and 18 and can additionally show variation within different strains of the same species \citep{Stoddard_2014,Coenye2003,Lavrinienko2021}. Therefore, relying solely on the number and diversity of markers such a 16S rRNA genes can lead to inaccurate estimates of the relative abundance and diversity of microbial communities. Several computational tools can correct amplicon datasets for the number of 16S rRNA gene copies based on existing genome information (e.g., PICRUSt2 \cite{Douglas_2020} and CopyRighter \citep{Angly_2014}). However, correcting for 16S rRNA gene copy numbers in sequencing surveys remains challenging, particularly for soil, as the gene copy numbers are only known for a subset of the soil microbes \citep{Louca_2018,Nunan_2020}. This challenge becomes even more problematic for marker genes of fungi and other eukaryotes, such as protists, as the copy number here can vary drastically between taxa \citep{Gong_2013,Gong_2019}. Other housekeeping genes, which occur only once in a genome, have been proposed as universal phylogenetic marker genes (such as recA \citep{Eisen1995}), but their use remains limited due to lower phylogenetic resolution and limited availability in databases.
3.3. Insufficient data availability contributes to a lack of reproducibility
Reproducibility and reusability of research results are predicated on sharing data and analysis scripts, a topic of growing relevance in light of increasing amounts of sequencing data obtained from soils around the globe and with the increasing complexity of analyses. Proper data sharing practices allow researchers to re-analyze specific aspects of published datasets, and/or investigate patterns in soil communities across datasets in the form of meta-analyses. A prerequisite to ensure data storage and availability in a usable format is that authors are required to do so by respective journals. In order to assess the current state of data deposition in the field, we searched the author guidelines of the 10 specialized soil journals (see Fig. 1 for reference). Out of the 10 journals, many "encourage their authors to make data available" while only 2 journals specifically require sequencing data to be deposited in public repositories such as GenBank before a manuscript is accepted for publication. Even if authors feel encouraged to comply, storage of their data in a repository does not always facilitate reproducibility of the reported research. Deposited datasets often contain only raw results from whole sequencing runs, and provide little meaningful information on the individual amplicons and on the corresponding metadata. As a consequence, it may be difficult to reconstitute the exact datasets used for the reported statistics and illustrations from such data. This requires that the applied quality filters and processing steps (see section 3.1), as well as the versions of applied software packages, be precisely reported.
Consequently, we call on all specialized soil journals that accept and publish sequencing data to (i) provide community standards for reproducible data analysis in their data policy statements and (ii) require the submission of sequencing data, ASV/OTU tables, together with sample metadata, to open repositories (such as GenBank, Dryad, or FigShare) and (iii) require that analysis scripts be made available on open hosting services (such as GitHub) or accompany the publication as a supplement. These steps will greatly facilitate reproducibility, open science, and meta-analyses.
4. Addressing and interpreting compositional sequencing data
4.1. Interpreting relative abundance data
The compositionality of amplicon sequencing data presents challenges to the interpretation of changes in microbial community structure. The amount of sequence data obtained through high-throughput sequencing is a fixed value, resulting in a random sampling of sequences from a sample that cannot be directly linked to absolute abundance based on sequences alone \citep{Gloor_2017}. Numerous studies have revealed shifts in microbial community composition across treatments including gradients of temperature, pH, and salinity, as well as seasonal or temporal parameters. This practice is robust on a community level when broad-scale changes in taxa are of interest (e.g., phylum level), and has resulted in similar ecological conclusions as data generated with more quantitative approaches \citep{Piwosz2020}. However, at higher taxonomic resolution (e.g., genus level), quantitative inferences from relative abundance sequencing data become more challenging. Due to the nature of sequencing, a change in the relative abundance of one species is always reflected in a corresponding change in one or more other species. We depict such challenges in interpretation in the following thought experiment (Fig. 3).
Amplicon sequencing data obtained from the same soil sample at two different time points (t1, t2) consists of two species (A, B). The relative abundance observed for species A and B is 0.55 and 0.45 at time point 1 (t1), and 0.8 and 0.2 at time point 2 (t2), respectively (Fig. 3). From t1 to t2, species B decreases in relative abundance coupled to an increase in the relative abundance of species A. The bars below (t2a-t2e) illustrate five examples of changes in absolute abundance in t2 that could underlie the patterns observed in relative abundance data. The initial time point (t1) is also shown for comparison.
The first case represents a situation where the absolute abundance matches the relative abundance observations. There are no changes in total biomass from t1 to t2 and species A increases, whereas species B decreases (Fig. 3, t2a). The second case depicts an increase in overall biomass between t1 and t2 caused by an absolute increase in species A and no absolute changes in species B (Fig. 3, t2b). The third case represents an opposite scenario where the decreases in total biomass between t1 to t2 is caused by a decrease in species B and no changes in species A (Fig. 3, t2c). The fourth case represents a situation where there is a general increase in biomass from t1 to t2 prompted by increases in absolute abundances of both species A and B (Fig. 3, t2d), while the fifth case represents an opposite scenario (Fig. 3, t2e). For some of these examples, observed changes in relative abundance may accurately reflect true biological changes (t2a, t2d and t2e), whereas interpretation of the community shifts that underlie observed patterns remains more difficult for the other scenarios (t2b and t2c). Without information on absolute abundances, there is still room for ambiguous interpretations solely based on relative abundance plots (see section 4.2). This theoretical exercise shows, that even for a community of only two member species, there are five potential scenarios of changes in the absolute abundance that could cause the observed shift in relative abundance. Given that soil communities usually harbour thousands of species, the degree of complexity increases dramatically.