2.2.2 Amplicon sequence variants and ‘aminotypes’
By default, vAMPirus generates nucleotide-based (ASV) and protein-based
(aminotype) results. ASVs support cross-study comparisons and offer a
statistically supported view of virus sequence diversity, as
biologically inaccurate sequences are removed during denoising (Callahan
et al., 2017; Edgar, 2016b). However, ASV results for virus lineages
with high mutation rates (e.g., RNA viruses with quasispecies
heterogeneity) may still contain high levels of noise that mask
biological patterns. It may be beneficial to group ASVs into distinct
clusters based on genetic or ecological similarities in such use cases.
In vAMPirus, ‘aminotypes’ (unique amino acid sequences; Grupstra et al.,
2022) are generated by translating ASVs with VirtualRibosome (v2.0,
Wernersson, 2006) and subsequently dereplicating these translations
using the program CD-HIT (v4.8.1, Fu et al., 2012; Li & Godzik, 2006).
As direct products of specific ASVs, aminotypes maintain sequence
tractability, reproducibility, and comparability, and therefore differ
from de novo OTUs or cASVs (see Section 2.2.3). The ‘aminotyping’
approach not only reduces noise but also removes sequences with internal
stop codons (deleterious mutations) and reveals nonsynonymous mutations
that may indicate differences in virus functionality (e.g., infection
efficiency, host range; DeFilippis & Villarreal, 2000).
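
The translate-then-dereplicate logic described above can be sketched in a few lines of Python. This is a minimal illustration only, not the VirtualRibosome or CD-HIT implementation: it assumes that every ASV is already in the first reading frame, uses the standard genetic code, treats any stop codon as internal, and collapses only exactly identical translations. The function names (`translate`, `aminotype`) and the toy sequences are hypothetical.

```python
from collections import defaultdict

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nucleotides):
    """Translate a nucleotide sequence in frame 1 (assumption: ASVs are in frame)."""
    seq = nucleotides.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    # Codons with ambiguous bases are translated to 'X'.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def aminotype(asvs):
    """Map each unique translation (aminotype) to the ASVs that produce it.

    Amplicons are assumed to be internal gene fragments, so any translation
    containing a stop codon ('*') is discarded.
    """
    groups = defaultdict(list)
    for asv_id, seq in asvs.items():
        protein = translate(seq)
        if "*" in protein:
            continue  # internal stop codon: drop this ASV
        groups[protein].append(asv_id)
    return dict(groups)


# Hypothetical toy ASVs: ASV1 and ASV2 differ by a synonymous change,
# ASV3 carries a stop codon, ASV4 carries a nonsynonymous change.
toy_asvs = {
    "ASV1": "ATGGCTAAA",
    "ASV2": "ATGGCCAAA",
    "ASV3": "ATGTAAAAA",
    "ASV4": "ATGGATAAA",
}
aminotypes = aminotype(toy_asvs)
```

In this toy case, the two synonymous ASVs collapse into a single aminotype, the ASV with a stop codon is removed, and the nonsynonymous variant is retained as its own aminotype, while every aminotype remains traceable to its source ASVs.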
vAMPirus provides two additional (optional) ASV or aminotype
“grouping” approaches that are alternatives to de novo clustering:
Minimum Entropy Decomposition (MED) and phylogeny-based
clustering or ‘phylogrouping’. MED is a method of sequence clustering
that utilizes Shannon entropy (Shannon, 1948) to partition marker gene
datasets into ‘MED nodes’ (Eren et al., 2015). With this approach, users
identify sequence positions in a set of ASVs or aminotypes that are
information-rich (positions of high variability) or information-poor
(positions of high conservation) and use these positions to assign
ASVs/aminotypes to ‘MED groups’ (sequences with identical bases at
specified positions) (Eren et al., 2015). Users can also specify and
assign sequences to MED groups based on sequence positions of interest
(e.g., positions of a protein sequence known to influence a viral
characteristic such as host cell attachment; see Harvey et al., 2021).
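
The entropy-based grouping described above can be illustrated with a short Python sketch. It is a simplified, single-pass illustration and not the MED implementation of Eren et al. (2015): it assumes the sequences are already aligned to equal length, and the entropy cutoff (0.5 bits), function names, and toy alignment are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def shannon_entropy(column):
    """Shannon entropy (in bits) of the characters in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def position_entropies(aligned):
    """Entropy of every column in a dict of equal-length aligned sequences."""
    return [shannon_entropy(col) for col in zip(*aligned.values())]


def group_by_positions(aligned, positions):
    """Group sequences that share identical characters at the given positions."""
    groups = defaultdict(list)
    for name, seq in aligned.items():
        key = "".join(seq[p] for p in positions)
        groups[key].append(name)
    return dict(groups)


# Hypothetical toy alignment: only the third column (index 2) varies.
alignment = {
    "ASV1": "ACGT",
    "ASV2": "ACGT",
    "ASV3": "ACTT",
    "ASV4": "ACTT",
}
entropies = position_entropies(alignment)
# Treat columns above an assumed cutoff of 0.5 bits as information-rich.
informative = [i for i, h in enumerate(entropies) if h > 0.5]
med_groups = group_by_positions(alignment, informative)
```

Passing a hand-picked list of positions to `group_by_positions` instead of `informative` corresponds to grouping on user-specified positions of interest.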
Phylogrouping is performed with the TreeCluster program (v1.0.3, Balaban
et al., 2019). With this approach, ASV or aminotype sequences are
assigned to “phylogroups” based on user-specified TreeCluster
parameters and the phylogenetic tree produced during analysis (see
Figure 4-V, VI). All grouping methods can be applied at the same time;
coupled with the use of the Nextflow ‘-resume’ feature, adjusting
specific parameters and generating new results to review and compare is
straightforward and does not require re-running the entire DataCheck or
Analyze pipelines.
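
To give a sense of how a phylogenetic tree and a distance threshold can define phylogroups, the following Python sketch assigns leaves to clusters so that each cluster is a clade whose maximum leaf-to-leaf (patristic) distance does not exceed a threshold. It is an illustration under stated assumptions, not the TreeCluster code: the `Node` class, the function names, the toy tree, and the thresholds are all hypothetical, and the criterion shown is only one of several that a tree-based clustering tool might apply.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    length: float = 0.0  # branch length to the parent node
    name: str = ""       # set for leaves only
    children: list = field(default_factory=list)


def leaves(node):
    """Names of all leaves below (or at) this node."""
    if not node.children:
        return [node.name]
    return [n for child in node.children for n in leaves(child)]


def height_and_diameter(node):
    """Return (max node-to-leaf distance, max leaf-to-leaf distance) for a subtree."""
    if not node.children:
        return 0.0, 0.0
    heights, diameter = [], 0.0
    for child in node.children:
        h, d = height_and_diameter(child)
        heights.append(h + child.length)
        diameter = max(diameter, d)
    heights.sort(reverse=True)
    if len(heights) >= 2:
        # Longest path passing through this node joins its two deepest children.
        diameter = max(diameter, heights[0] + heights[1])
    return heights[0], diameter


def clade_clusters(node, threshold):
    """Cut the tree top-down into the largest clades with diameter <= threshold."""
    _, diameter = height_and_diameter(node)
    if diameter <= threshold:
        return [leaves(node)]
    return [c for child in node.children for c in clade_clusters(child, threshold)]


# Hypothetical tree: ((A:0.1,B:0.1):0.5,(C:0.2,D:0.2):0.5)
tree = Node(children=[
    Node(0.5, children=[Node(0.1, "A"), Node(0.1, "B")]),
    Node(0.5, children=[Node(0.2, "C"), Node(0.2, "D")]),
])
loose = clade_clusters(tree, 0.5)
strict = clade_clusters(tree, 0.3)
```

With the looser threshold the tree splits into two phylogroups, whereas the stricter threshold additionally separates C and D, which illustrates why comparing results across parameter settings can be informative.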
2.2.3 Optional de novo sequence clustering
vAMPirus provides the option to perform de novo clustering of
ASVs into ‘clustered ASVs’ or ‘cASVs’ based on pairwise nucleotide
(ncASV) and/or protein (pcASV) sequence similarity using the programs
VSEARCH (Rognes et al., 2016) and CD-HIT (Fu et al., 2012; Li & Godzik,
2006), respectively. cASVs differ from traditional de novo OTUs
because sequences are denoised before they are clustered.
The de novo clustering of ASVs is most useful for well-characterized
virus systems where the degree of sequence divergence between
taxonomically or ecologically distinct groups is known. Note that, from
a methodological standpoint, representative sequences generated by a
cASV approach exhibit the same issues as de novo OTUs (e.g.,
dataset dependence; see Callahan et al., 2017).
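
The idea of similarity-based de novo clustering, and its dataset dependence, can be illustrated with a greedy centroid-based sketch in Python. This is a deliberately simplified stand-in and does not reproduce VSEARCH or CD-HIT: it assumes pre-aligned sequences of equal length, measures identity as the fraction of matching positions, and processes sequences in input order. The function names, the 90% threshold, and the toy sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first centroid it matches at >= threshold.

    A sequence that matches no existing centroid becomes a new centroid, so
    the result depends on which sequences are present and on their order.
    """
    clusters = {}  # centroid name -> list of member names
    for name, seq in seqs.items():
        for centroid in clusters:
            if identity(seq, seqs[centroid]) >= threshold:
                clusters[centroid].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Hypothetical toy ASVs (already denoised), clustered at 90% identity.
toy_asvs = {
    "ASV1": "ACGTACGTAC",
    "ASV2": "ACGTACGTAA",  # 9/10 positions match ASV1
    "ASV3": "TTGTACGTCC",  # 7/10 positions match ASV1
}
casvs = greedy_cluster(toy_asvs, 0.9)
```

Because centroids are chosen from the sequences at hand, adding or removing samples can change which sequence represents a cluster, which is the dataset dependence noted above.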