Materials and Methods
We conducted a non-exhaustive survey via Web of Science (April 2018) to
understand what methodologies were being used for core membership
assignment. We limited our survey to papers containing the terms ‘core’
and ‘microbiome’ within the title or abstract, resulting in 1034 papers.
We selected papers from 2008-2018, then ordered the search results by
the number of citations and considered the 200 most cited papers. Of the
200 papers, 45 publications sufficiently detailed their methods for
assigning core membership. These remaining papers were then subdivided
into five categories according to the varying methodologies and criteria
that were used to identify core taxa (Table 1). While not an exhaustive
review, the survey provided an adequate representation and summary of
the core assignment methodologies used in contemporary analyses. The
categories are as follows: cumulative proportion of sequence reads,
proportion of replicates, proportion of sequence reads and replicates,
hard cutoff, and the Venn diagram method. In practice, DNA sequence
reads (hereafter referred to as “reads”) within a given level of
sequence similarity (e.g. 97%, 99%, or 100%) are counted as a measure
of taxon abundance. The counts of all taxa found in the taxonomic table,
as used here to describe the composition of microbiomes, are analogous
to counts of plants, animals, or any other counts of organisms in
community ecology. In microbial and other organismal ecology, the total
sampling effort is finite and greater certainty of taxon relative
abundance is achieved with increasing sampling effort (increasing read
depth in microbiome research), with diminishing returns (Forcino,
Leighton, Twerdy, & Cahill, 2015; Zaheer et al., 2018).
Here we use four out of the five methods described in our literature
review on both simulated and real published datasets: cumulative
proportion of sequence reads, proportion of replicates, proportion of
sequence reads and replicates, and hard cutoffs (Table 1). We excluded
the Venn Diagram method from our analysis because it can be considered
as a more stringent version of the proportion of replicates method, in
that a taxon must be present in every sample to be included in the core.
The first method, cumulative proportion of sequence reads, ranks taxa
based on relative abundance and includes the most abundant taxa in the
core. Briefly, this is calculated by ranking all taxa by their relative
abundance, from highest to lowest. Starting with the most abundant, the
percentage of total reads is summed cumulatively until 75% of the total
reads are accounted for. Taxa that account for a portion of the first
75% of total reads are assigned core membership. This method was
adopted from vegetation counts in classic ecology (Hanski, 1982) and
accounts for the relative abundance of individual taxa and the total
sampling effort. The second method, proportion of replicates, assigns a
taxon to the core when that taxon is present in at least 50% of samples
within a given treatment (or observational category; our study included
a single treatment). This method assigns taxa to the core that are found
in the majority of samples and accounts for sample size alone. The third
method, proportion of sequence reads and replicates, assigns a taxon to
the core if it is present in a pre-determined proportion of the total
number of samples from a given treatment (or observational category) as
well a predetermined proportion of the total reads (Delgado-Baquerizo et
al., 2018). This third method accounts for both sample size and sampling
effort simultaneously. Though various proportions of samples and reads
were used in publications, for our analysis, we set these thresholds to
include taxa present in at least 50% of samples within a given
treatment (we only included a single treatment) and account for at least
0.02% of the total reads across all samples (adapted from Callahan et
al. (2016)). For the fourth method, hard cutoffs, we implemented a
cutoff similar to Lundberg et al. (2012), such that for a taxon to be
considered part of the core microbiome, it must be present in at
least some number of samples and with at least some number of reads.
This method differs from the proportion of sequence reads and replicates
method in that the hard-cutoff method a priori assigns an
absolute value for the cutoff that is not based on the size of the
experiment or sequencing depth achieved. In our analysis, we used the
following cutoffs: the taxon must be present in five samples with at
least 25 total reads across all samples (Lundberg et al., 2012).
To test each of the four core methods, we used two published and 6,250
simulated datasets. For the two published datasets (Table 2), we used 1)
the rhizosphere and site M21 subset of the final rarified operational
taxon table from the Arabidopsis thaliana root microbiome project
((Lundberg et al., 2012); dataset name “arabidopsis_R_M21.rda”) and
2) the fecal sample subset from The Human Microbiome Project
((Consortium, 2012); dataset name “human_stool.rda”). The two
published datasets were plotted by the log-transformed taxon mean
abundance and the coefficient of variance (CV) to examine the grouping
of taxa assigned to the core. This was done to assess whether a clear
threshold in abundance and coefficient of variance existed between core
and non-core taxa for any of the examined core assignment methods, as an
obvious threshold may provide support for that core assignment
methodology. Additionally, we used 250 simulations for each of 25
possible combinations of 1) five levels of magnitude of difference in
abundance (π) of core versus non-core taxa (represented as the
πcore/πnon-core , ranging from 1 to 25),
and 2) five levels of variance of the abundances πcoreto πnon-core among replicates (quantified by an
intensity parameter θ, ranging from 1 to 50). This resulted in a total
of 6,250 unique simulations to assess each assignment method. Each
simulation of taxon relative abundances involved random draws from a
Dirichlet distribution parameterized by the expected frequencies of all
taxa (Σπi = 1, with 25 taxa parameterized by
πcore and 975 by πnon-core) and a single
intensity parameter (θ) that affects the precision of taxon abundances
(i.e. scales the variance around expected taxon abundance defined by
πcore and πnon-core), using R v3.4.2 (R
Development Core Team, 2020). Across sets of simulations, we varied the
relative abundance of core and non-core taxa
(πcore/πnon-core), with
πcore/πnon-core = 1 corresponding to a
community that lacks a true core, because all taxa have equal expected
abundances. This ratio acts as a control, in that core taxa should not
be recovered from this dataset. On the other end, a
πcore/πnon-core = 25 simulated a dataset
in which core taxa had an abundance 25 times greater than non-core taxa.
Further, we utilized the intensity parameter (θ) to set the precision of
taxa abundances across replicates for a given set of expected
frequencies (π), with θ of 50 corresponding to high precision and low
variance in taxon relative abundances among replicates and a θ of 1
leading to low precision and a large variance in taxon relative
abundances. All simulations with
πcore/πnon-core > 1 were of
25 taxa as core taxa and the remainder 975 as non-core. The 25 core taxa
were simulated to receive the expected abundances and precision of core
taxa, while the non-core received the abundances and precision expected
for non-core members. Consequently, up to 25 taxa could be detected as
true core taxa (true positives), and 975 taxa as false core members
(false positives) or true non-core members (true negatives).
Additionally, simulations of
πcore/πnon-core = 1 were useful in
quantifying the false positive rate, as these communities did not
include any core taxa. A simulation’s random draw from the Dirichlet
distribution yielded a vector of sample proportions for each of 1000
taxa (P(p1, p2, …,
p1000| πcore,πnon-core, θ), to which the four criteria for core
membership were applied directly.
The ability of each method to accurately recover the known core was
assessed using simulated taxon tables by the following metrics: true
positive rate (signal), false positive rate (noise), and net assignment
value (signal-noise). The true positive rate represents the proportion
of known core taxa that were classified as such, regardless of the
number of false negatives. This is expressed as the probability of a
true core taxon being assigned as such. The false positive rate
represents the proportion of non-core taxa classified as core and is
represented as the probability of a non-core taxon being assigned as a
member of the core. The net assignment value represents the differences
between the absolute number of true positives and the number of false
positives. A net assignment value of 25 represents perfect
classification. A net assignment value of -975 would indicate all
non-core taxa were ill-assigned to the core with no true positive
classifications. The more negative the number, the more highly inflated
the core is. This metric can be interpreted as the difference in signal
and noise.
Next, to examine to what extent core assignments produce the same
differences in community diversity (beta-diversity) as the entire
dataset, we created dissimilarity matrices with two different distance
metrics (Bray-Curtis and Jaccard) using each set of core assignments and
the full dataset as independent inputs. We then used PermANOVA testing
(adonis) in the vegan package (Oksanen et al., 2018) to examine the
variance explained by categorical predictors in each of the core and
full datasets. We examined three categorical predictors for the human
microbiome project dataset: patient visit number (1st,
2nd, or 3rd), sex of subject (male
vs. female), and where the sequencing was performed (12 different
sequencing centers). In the Arabidopsis dataset we were able to
examine two categorical predictors: developmental stage (young vs. old)
and genotype (nine different genotypes). Results from the core subsets
and total dataset were compared and differing statistical significance
for categorical variables was noted.
In addition to examining differences in beta-diversity between the full
dataset and core subsets, we also used cooccurrence networks to identify
significant taxa. Cooccurrence networks are used in microbial ecology to
determine microbial taxa that occur together in a statistically
significant manner, with nodes and edges representing significant
microbes and the connections between them respectively. Network analyses
can be used to mine for keystone or hub taxa and can be used to identify
interactions between groups of microbial taxa (Banerjee et al., 2016,
2018; Shi et al., 2020). The Arabidopsis rhizosphere microbiome
graph consisted of 14,890 agents (taxa) and 288 artifacts (samples), and
the human microbiome project graph consisted of 11,752 agents (taxa) and
319 artifacts (samples). From each bipartite graph we obtained the
weighted bipartite projection, then extracted its signed backbone using
the backbone package (Domagalski, Neal, & Sagan, 2019). Edges were
retained in the backbone if their weights were statistically significant
(alpha = 1e-04, Bonferroni corrected) by comparison to a null
Hypergeometric Model (Neal, 2013). The corresponding nodes IDs (taxa
IDs) with significant edges were then extracted and compared to the core
assignments core assignment methods. Assuming core taxa drive abundance
and occurrence patterns, and that rare taxa do not largely contribute to
community structure, one could expect to find core taxa serving as
important nodes with many significant edges (higher degree centrality)
and the opposite to be true for non-core assignments.
To facilitate the use of our analytical methods by researchers curious
about the validity of core microbiome assignment, we wrote an R package,
CoreMicro, that can be installed through github
(github.com/mayagans/coremicro).
The package includes functions that accept a taxon table as an input and
can be used to generate plots and tables of core inclusion by method.
This functionalized approach facilitates the comparison of methods and
provides a means to check for the existence of the hypothesized
core:non-core divide. In addition, the package includes all data,
including the full used within this study as well as the code for all
simulations. Full OTU tables and metadata files of theArabidopsis and Human Microbiome Project datasets can be found at
10.5281/zenodo.4909346.