3. Challenges in amplicon sequence data analysis

3.1. Primer selection dictates coverage

We advise soil researchers to take the utmost care with regard to the choice of primers and the resulting interpretation of community changes/impacts of treatments. Given the high diversity of soil communities (which is constantly growing due to massive sequencing efforts), no primer pair will cover the complete taxonomic breadth on a high rank such as kingdom (e.g. bacteria, archaea, fungi). As a consequence, missing part of a community is inevitable but it is of particular concern when studies attribute soil functions to taxa found to be "rare" in their amplicon sequencing data due to low coverage of that group \citep{Chen_2020}.  Nevertheless, evaluated and recommended primer pairs are available including the updated versions of 515F-806R primers for surveys of archaea and bacteria (e.g. https://earthmicrobiome.org/). Primer selection is even more challenging for studies of eukaryotes, due to hypervariable sequence lengths and multiple gene copy numbers (due to multiple operons and/or polykaryosis) that contribute to biased amplification of some taxonomic groups during PCR. This bias may lead to the under-estimation of some fungal groups, having a downstream effect on diversity estimates \citep{Baldrian_2021}. For example, arbuscular mycorrhizal fungi are largely overlooked by commonly used ITS primers which could lead investigators to infer that arbuscular mycorrhizal fungi are rare \citep{George_2019}. An alternative to ITS for broad eukaryotic community profiling is the use of the 18S ribosomal gene, which enables the investigation of protists and most fungi. We refer readers to in-depth reviews that further discuss challenges regarding amplicon sequencing of fungi specifically, including discussion of primer selection and coverage  \citep{Nilsson2019,Baldrian_2021}.  
The choice of primers has substantial impacts on estimates of diversity in community studies. As a consequence, we urge researchers to use tools such as TestPrime (https://www.arb-silva.de/search/testprime/) to evaluate the current status of the coverage of the target group of interest before sequencing and to discuss this aspect in their publications. We also recommend that reviewers critically assess the coverage of the target group of organisms used in a study to improve future evaluation of sequencing-based research in soil ecology.

3.2. Compositionality necessitates careful data processing 

One of the first steps in the analysis of amplicon sequencing data is the removal of potential sequencing errors. Doing so eliminates sequencing artefacts that may falsely boost diversity levels \citep{Edgar2011,Haas2011}. The use of amplicon sequencing variants (ASVs), instead of operational taxonomic units (OTUs) helps to overcome this issue by assigning a greater probability of a true biological sequence being more abundant than an error-containing sequence \citep{Callahan_2017}⁠. To that end, bioinformatic tools such as DADA2 \citep{Callahan_2016}⁠ and Deblur \citep{Amir_2017}⁠ attempt to use sequencing error profiles to resolve amplicon sequencing data into ASVs. An ASV is more likely to have an intrinsic biological meaning (i.e. true DNA sequence), as opposed to an OTU which can either be a representation of the most abundant biological sequence or a consensus sequence. Additionally, ASVs make the merging of datasets possible, even when the sequencing primer pairs are different  \citep{Callahan_2017}.
Another relevant step when analyzing sequencing data is to account for the different sequencing efforts across samples (i.e. library size, sequencing depth) that can result in a substantially different number of recovered reads even among replicates. Ways to tackle this include total library size normalization and rarefaction, with the latter remaining debated to date \citep{McMurdie_2014}⁠. Bioinformatic tools such as DeSeq2 and EdgeR provide ways to normalize count tables \citep{Love_2014,Robinson_2010}⁠. Both methods are applied on raw or low-abundance filtered count tables, and have performed well in both real as well as simulated datasets and outperform rarefaction-based approaches  \citep{McMurdie_2014}⁠. Other alternatives that account for the compositional aspect of sequencing data include centered log-ratio (CLR), isometric log-ratio (ILR) or additive log-ratio (ALR) ratios transformations on a count data matrix \cite{Aitchison_1984,Egozcue_2003}
Following data normalization, traditional workflows include the generation of distance matrices for ordination, clustering, and variance partitioning analyses. Commonly used distance metrics include Bray-Curtis, Jaccard and Unifrac (weighted and unweighted). These metrics are often used although they do not take into account the compositional nature of sequencing data. The Aitchison distance - defined as the Euclidian distance on top of a centered log-ratio transformed count matrix – is a viable compositional alternative \cite{Aitchison_1984} that allows performing ordinations (e.g. PCA biplots). Additionally, the "Philr" transformation metric has been introduced as a compositional alternative to the weighted Unifrac that carries phylogenetic information \citep{Silverman_2017}⁠. Most of the above mentioned compositional options are implemented in R packages and include publicly available tutorials. In light of the challenges related to normalization and analysis of compositional data, we recommend a critical evaluation of available data analysis tools to best address the nature of each experimental setup (see also section 6).  
Another aspect that prevents data analyses from being fully quantitative is the potential multiple copies of marker genes per organism. For example, the 16S rRNA gene copy number per bacterial cell can vary between 1 and 18 and can additionally show variation within different strains of the same species \citep{Stoddard_2014,Coenye2003,Lavrinienko2021}. Therefore, relying solely on the number and diversity of markers such a 16S rRNA genes can lead to inaccurate estimates of the abundance and diversity of microbial communities. Several computational tools can correct amplicon datasets for the number of 16S rRNA gene copies based on existing genome information (e.g. PICRUSt2 \citep{Douglas_2019} and CopyRighter \citep{Angly_2014}). However, correcting for 16S rRNA gene copy numbers in sequencing surveys remains challenging, particularly for soil, as the gene copy numbers are only known for a subset of the soil microbes \citep{Louca_2018,Nunan_2020}. This challenge becomes even more problematic for marker genes of fungi and other eukaryotes, such as protists, as the copy number here can vary drastically between taxa \citep{Gong_2013,Gong_2019}. Other housekeeping genes, which occur only once in a genome, have been proposed as universal phylogenetic marker genes (such as recA \citep{Eisen1995}), but their use remains limited due to lower taxonomic resolution and limited availability in databases. 

3.3. Insufficient data availability contributes to a lack of reproducibility

 Reproducibility and reusability of research results are predicated on sharing data and analysis scripts, a topic of growing relevance in light of increasing amounts of sequencing data obtained from soils around the globe and with the increasing complexity of analyses. A prerequisite to ensure data storage and availability in a usable format is that authors are required to do so by respective journals. In order to assess the current state of data deposition in the field, we searched the author guidelines of the 10 specialized soil journals that contributed most studies with amplicon sequencing data (see Fig. 1 for reference). Out of the 10 journals, many "encourage their authors to make data available" while only 2 journals require sequencing data to be deposited in public repositories such as GenBank before a manuscript is accepted for publication. Even if authors feel encouraged to comply, storage of  their data in a repository does not always facilitate reproducibility of the reported research. Deposited datasets often contain only raw results from whole sequencing runs, and provide little meaningful information on the individual amplicons and on the corresponding metadata. As a consequence, it may be difficult to reconstitute the exact datasets used for the reported statistics and illustrations from such data. This requires that the applied quality filters and processing steps (see section 3.1), as well as the versions of applied software packages, be precisely reported.
Consequently, we call on all specialized soil journals that accept and publish sequencing data to (i) provide community standards for reproducible data analysis in their data policy statements and (ii) require the submission of sequencing data, ASV/OTU tables, together with sample metadata, to open repositories (such as GenBank, or FigShare)  and (iii) require that analysis scripts be made available on open hosting services (such as GitHub) or accompany the publication as a supplementary. These steps will greatly facilitate reproducibility, open science, and meta-analyses.

4. Addressing and interpreting compositional sequencing data

4.1. Interpreting relative abundance data

The compositionality of amplicon sequencing data presents challenges to the interpretation of changes in microbial community structure. The amount of sequence data obtained through high-throughput sequencing is a fixed value, resulting in a random sampling of sequences from a sample that cannot be directly linked to actual counts based on sequences alone \citep{Gloor_2017}. Numerous studies have revealed shifts in microbial community composition across treatments including gradients of temperature, pH, and salinity, as well as seasonal or temporal parameters. This practice is robust on a community level when broad-scale changes in taxa are of interest (e.g. phylum), and has resulted in similar ecological conclusions as data generated with more quantitative approaches \citep{Piwosz2020}. However, at higher taxonomic resolution (e.g. genus), quantitative inferences from relative abundance sequencing data become more challenging. Due to the nature of sequencing, a change in the relative abundance of one species is often reflected in a corresponding change in another species. We depict such challenges in interpretation in the following theoretical experiment. 
Amplicon sequencing data obtained from the same soil sample at two different time points (t1, t2) consists of two species (A, B). The relative abundance observed for species A and B is 0.55 and 0.45 at time point 1 (t1), and 0.8 and 0.2 at time point 2 (t2), respectively (Fig. 3). From t1 to t2, species B decreases in relative abundance coupled to an increase in the relative abundance of species A. The bars below (t2a-t2e) illustrate five examples of changes in absolute abundance in t2 that could underlie the patterns observed in relative abundance data. The initial time point (t1) is also shown for comparison. 
The first case represents a biologically true situation where the absolute abundance matches the relative abundance observations. There are no changes in biomass from t1 to t2 and species A increases, whereas species B decreases (Fig. 3, t2a). The second case depicts an increase in overall biomass between t1 and t2 caused by an absolute increase in species A and no absolute changes in species B (Fig. 3, t2b). In relative abundance data, species B would appear to reduce in abundance  when in fact that was caused by a change in species A alone. The third case represents an opposite scenario where there is a total decrease in biomass from t1 to t2 caused by a decrease in species B and no changes in species A (Fig. 3, t2c). Again, in a relative abundance plot, species B would appear to have decreased in relative abundance (true), but inferences regarding species A would be false. The fourth case represents a situation where there is a general increase in biomass from t1 to t2 prompted by increases in absolute abundances of both species A and B (Fig. 3, t2d) while the fifth case represents an opposite scenario where there is a general decrease in biomass from t1 to t2 caused by decreases in absolute abundances of both species A and B (Fig. 3, t2e). These last two cases could reflect true biological changes and would be accurately represented by the relative abundance plot t2 as long as the changes in absolute abundances are proportional between species A and B. However, without information on absolute abundances, there is still room for ambiguous interpretations solely based on relative abundance plots (see section 4.2).  
This theoretical exercise shows, that even for a community of only two member species, there are five potential changes in the absolute abundance that could cause the observed shift in relative abundance. Given that soil communities usually harbour thousands of species, the degree of complexity increases dramatically. This is further affected by factors such as PCR bias, which highlights the caution that must be applied when interpreting relative abundances derived from amplicon sequencing.