Improved Genetic Resources for Polyploids
Our technique of developing SNP markers using stringently filtered
reduced representation genomic libraries for pairing with the
ploidy-aware high throughput genotyping pipeline, as demonstrated here
with white sturgeon, should make genetic data more accessible for a
range of polyploid organisms. This process addresses several of the
challenges inherent to polyploid organisms: discrimination of homeolog
amplicons, dosage of polysomic alleles, and inference of ploidy in each
individual. Pairing these data with recent ploidy-flexible software will
make investigations of polysomic organisms more efficient.
There are some caveats to this procedure, however. Perhaps the most
idiosyncratic part of this combined technique is the SNP-discovery
process. Our matching and filtering process was designed to retain only
variants that segregated in otherwise highly similar sequence loci,
which helps to discriminate sites, whether fixed or variable, in
homeologs. Because this filtering is stringent, a large number of
candidate loci must be surveyed initially so that enough candidates
remain available for PCR testing. The number of samples in the
ascertainment panel must therefore be balanced against the sequencing
effort to produce sufficient coverage across all the initial candidates
and provide informative read-depth distributions. These parameters will
have to be tailored to individual polyploid species, as will the exact
filtering thresholds used to identify loci with the base segregation
pattern (e.g., tetrasomy in the case of white sturgeon).
One of the great advantages of our method is the ability to
simultaneously infer ploidy and genotype individuals in ploidy-variable
species. While we provide a rough guideline of sequence depth for
accurate ploidy inference and genotyping in white sturgeon, this value
will need to be tailored for the number of markers surveyed and their
on-target efficiency in individual species. It is worth reiterating
that the current genotyping function uses allele ratios predicted from a
normal distribution with standard deviation that is inversely related to
the number of genotype categories, i.e. ploidy. As ploidy increases, the
width of the allowed distributions for each genotype category is
reduced, and greater precision in allele ratios is required to genotype
each marker. Thus, the required sequence coverage, which provides the
sample size for each marker, will be higher with increasing ploidy.
Similarly, the confidence level in ploidy inferences, or minimum
alternate LLR, will also need to be tailored to individual species,
their levels of heterozygosity, and ploidy range. While Delomas et al.
(submitted) make some recommendations (e.g., a minimum LLR of 10 for the
panel described here), individual researchers may find it useful to
employ a higher or lower stringency threshold for genotyping, as the
updated GT-seq genotyping pipeline currently only provides genotypes for
individuals passing the user-specified minimum alternate LLR.
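
The relationship between ploidy, genotype-category width, and required read depth described above can be illustrated with a minimal sketch. This is not the GT-seq pipeline's genotyping function: the function names, the `scale` constant, and the `max_z` cutoff are hypothetical choices made only for illustration, and the standard deviation is simply assumed to equal `scale / ploidy`.

```python
import math


def category_sd(ploidy, scale=0.2):
    """Assumed SD of each genotype category: inversely related to ploidy.

    `scale` is a hypothetical constant, not a value from the pipeline.
    """
    return scale / ploidy


def call_dosage(ref_reads, alt_reads, ploidy, scale=0.2, max_z=2.0):
    """Return the reference-allele dosage (0..ploidy), or None if no call.

    The observed allele ratio is compared with the expected ratios
    k / ploidy; a call is made only if the ratio lies within `max_z`
    assumed SDs of the nearest expected ratio.
    """
    depth = ref_reads + alt_reads
    if depth == 0:
        return None
    ratio = ref_reads / depth
    best = min(range(ploidy + 1), key=lambda k: abs(ratio - k / ploidy))
    z = abs(ratio - best / ploidy) / category_sd(ploidy, scale)
    return best if z <= max_z else None


def depth_for_precision(ploidy, scale=0.2):
    """Reads needed so the binomial sampling SD of a 0.5 allele ratio
    does not exceed the assumed category SD."""
    sd = category_sd(ploidy, scale)
    return math.ceil(0.25 / sd ** 2)


if __name__ == "__main__":
    for ploidy in (4, 8):
        print(ploidy, category_sd(ploidy), depth_for_precision(ploidy))
    print(call_dosage(52, 48, 4))  # ratio near 2/4
    print(call_dosage(62, 38, 4))  # between categories at 4N
    print(call_dosage(62, 38, 8))  # ratio near 5/8
```

Under these assumptions, halving the category SD (4N to 8N) roughly quadruples the depth needed for the same relative precision, which is the qualitative pattern described in the text; the absolute numbers depend entirely on the hypothetical `scale`.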
Two additional limitations of this ploidy estimation function are worth
noting. First, the function assumes that all loci within a single
individual have the same ploidy. Accommodation of loci with multiple
ploidies within an individual, e.g. tetraploid and octoploid loci within
an ancestral octoploid, can be achieved by fitting models to each group
of loci separately by ploidy. The likelihood across all loci could then
be calculated as the product of the likelihoods for each group. Second,
discrimination of ploidies that are exact multiples of one another may
yield results that are less straightforward to interpret because of the
current lack of a penalty function for overfitting (fitting noise with
the higher ploidy model). For example, in the current dataset, for the
same individuals from which the 4N allele plots were generated, a
comparison of 4N and 8N models demonstrated that the 8N model had higher
likelihood for 54% of samples, although <1% of incorrect ploidy
estimates had a minimum alternate LLR higher than 10. As
pointed out by Delomas et al. (submitted), the most likely 8N model will
always have likelihood higher than or equal to the most likely 4N model,
apart from deviations due to the threshold at which convergence of the
EM algorithm is assumed, because 4N models are a subset of the space of
all possible 8N models. However, ploidy can still be inferred in these
situations: individuals of the lower ploidy will have LLR distributed
close to zero and individuals of the higher ploidy will have LLR
distributed further away from zero. Critical values for assigning ploidy
can be chosen using LLRs from a set of known ploidy individuals.
Alternatively, if individual ploidies are not known, but a group is
presumed to have variable ploidy, the LLRs from individuals in this
group are expected to have a bimodal distribution (one mode for each
ploidy). If these modes are sufficiently distinct, critical values can
be chosen to separate the two clusters. Nevertheless, we look forward to
continued development of these functions to facilitate an even greater
variety of tests.
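
The two ways of choosing critical values described above (from known-ploidy individuals, or from a bimodal LLR distribution) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the midpoint rule and the largest-gap split are simple stand-ins for whatever criterion a researcher prefers, and the LLR values in the example are invented.

```python
def critical_value_known(llr_low, llr_high):
    """Critical value from individuals of known ploidy.

    Returns the midpoint between the largest LLR among lower-ploidy
    individuals and the smallest LLR among higher-ploidy individuals,
    or None if the two groups overlap.
    """
    top_of_low = max(llr_low)
    bottom_of_high = min(llr_high)
    if top_of_low >= bottom_of_high:
        return None
    return (top_of_low + bottom_of_high) / 2


def critical_value_largest_gap(llrs):
    """Critical value when ploidies are unknown but assumed bimodal.

    Splits the sorted LLRs at the widest gap between neighbours; this
    is only sensible if the two modes are clearly distinct.
    """
    ordered = sorted(llrs)
    if len(ordered) < 2:
        raise ValueError("need at least two LLR values")
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, i = max(gaps)
    return (ordered[i] + ordered[i + 1]) / 2


def assign_ploidy(llr, critical, low=4, high=8):
    """Assign the higher ploidy when the LLR exceeds the critical value."""
    return high if llr > critical else low


if __name__ == "__main__":
    # Invented LLRs (higher-ploidy model vs. lower-ploidy model).
    known_4n = [0.1, 0.5, 1.2]
    known_8n = [25.0, 40.0, 31.0]
    cut = critical_value_known(known_4n, known_8n)
    print(cut, [assign_ploidy(x, cut) for x in known_4n + known_8n])
    print(critical_value_largest_gap(known_4n + known_8n))
```

When the two groups overlap, `critical_value_known` returns `None` rather than guessing, reflecting the caveat that the modes must be sufficiently distinct for a critical value to be meaningful.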
Recent development of population genetic software that accommodates
polysomic data has advanced the evolutionary analysis of contemporary
polyploids, as demonstrated here for white sturgeon with Polygene.
Polygene includes a number of functions that were previously
unavailable for polysomic data, including single-parent assignment and
maximum likelihood estimates of relatedness. Sibship estimation remains
unavailable, however, and for situations where candidate parents are not
available, we demonstrated that SNP data could be coded as
pseudo-dominant diploid data for use in Colony2 with
reasonable, though time-consuming, efficacy. We note, however, that one
limitation of Polygene is its treatment of ploidy as invariant
within a “population”, which prevents the estimation of parentage and
other statistics across ploidy states. While other packages such as
Genodive and adegenet do not appear to have this
limitation, they nonetheless lack some of the functionality of Polygene.
Though not demonstrated here, we envision scenarios in which
simulations, relationships inferred from same-ploidy individuals, or
known families, could be used to identify thresholds in ploidy-agnostic
measures of relatedness to estimate relationships between individuals of
different ploidy. As demonstrated here, these measures exhibit variance
and potentially downward bias, which suggests that such thresholds need
to be estimated for each population separately and treated
conservatively. In any event, we
expect that the greater accessibility of ploidy-correct genotype data,
as provided by the techniques demonstrated here, will spur further
development of software facilitating the genetic analysis of polyploids.
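
For readers unfamiliar with the pseudo-dominant coding mentioned above for sibship analysis, the following minimal sketch shows one simple way polysomic SNP dosages could be collapsed to presence/absence data. It is an assumption-laden illustration, not the exact recoding used in this study: the function names are hypothetical, and the integer codes for present, absent, and missing are placeholders that should be checked against the input format required by the target software.

```python
def pseudo_dominant(dosage, present_code=1, absent_code=2, missing_code=0):
    """Collapse an alternate-allele dosage to a presence/absence code.

    `dosage` is the number of copies of the alternate allele (0..ploidy),
    or None for a missing genotype. The default codes are placeholders.
    """
    if dosage is None:
        return missing_code
    if dosage < 0:
        raise ValueError("dosage must be non-negative")
    return present_code if dosage > 0 else absent_code


def recode_individual(dosages, **codes):
    """Recode all loci for one individual."""
    return [pseudo_dominant(d, **codes) for d in dosages]


if __name__ == "__main__":
    # Invented tetrasomic dosages at five loci; None marks a missing call.
    print(recode_individual([0, 1, 4, None, 2]))
```

Because every dosage above zero collapses to the same code, this recoding discards dosage information, which is one reason such analyses are expected to be less powerful than methods that use the full polysomic genotypes.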