Improved Genetic Resources for Polyploids
Our approach, which develops SNP markers from stringently filtered reduced-representation genomic libraries and pairs them with a ploidy-aware, high-throughput genotyping pipeline, should make genetic data more accessible for a range of polyploid organisms, as demonstrated here with white sturgeon. It addresses several challenges inherent to polyploids: discriminating homeolog amplicons, determining the dosage of polysomic alleles, and inferring the ploidy of each individual. Pairing these data with recent ploidy-flexible software will make investigations of polysomic organisms more efficient.
There are some caveats to this procedure, however. Perhaps the most idiosyncratic part of this combined technique is the SNP-discovery process. Our matching and filtering process was designed to retain only variants that segregated in otherwise highly similar sequence loci, which helps to discriminate sites, whether fixed or variable, in homeologs. This requires that a large number of candidate loci be surveyed initially so that enough candidates remain for PCR testing after stringent filtering. The number of samples in the ascertainment panel must therefore be balanced against the sequencing effort, both to produce sufficient coverage across all the initial candidates and to provide informative read-depth distributions. These parameters will have to be tailored to individual polyploid species, as will the exact filtering thresholds used to identify loci with the base segregation pattern (e.g., tetrasomy in the case of white sturgeon).
One of the great advantages of our method is the ability to simultaneously infer ploidy and genotype individuals in ploidy-variable species. While we provide a rough guideline of sequence depth for accurate ploidy inference and genotyping in white sturgeon, this value will need to be tailored to the number of markers surveyed and their on-target efficiency in individual species. It is worth reiterating that the current genotyping function uses allele ratios predicted from a normal distribution whose standard deviation is inversely related to the number of genotype categories, which increases with ploidy. As ploidy increases, the width of the allowed distribution for each genotype category is reduced, and greater precision in allele ratios is required to genotype each marker. Thus, the required sequence coverage, which provides the sample size for each marker, will be higher with increasing ploidy. Similarly, the confidence level in ploidy inferences, or minimum alternate LLR, will also need to be tailored to individual species, their levels of heterozygosity, and their ploidy range. While Delomas et al. (submitted) make some recommendations (e.g., a minimum LLR of 10 for the panel described here), individual researchers may find it useful to employ a higher or lower stringency threshold for genotyping, as the updated GT-seq genotyping pipeline currently provides genotypes only for individuals passing the user-specified minimum alternate LLR.
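The relationship between ploidy, category width, and required read depth can be illustrated with a minimal Python sketch. It is not the pipeline's implementation. It assumes that the standard deviation is exactly inversely proportional to ploidy (`sd = sd_scale / ploidy`), that a genotype is accepted only when the observed ratio falls within `n_sd` standard deviations of the nearest expected ratio, and that read sampling is binomial. The function names and the default values of `sd_scale` and `n_sd` are hypothetical.

```python
import math

def expected_ratios(ploidy):
    """Expected reference-allele ratios for dosages 0..ploidy at a biallelic SNP."""
    return [k / ploidy for k in range(ploidy + 1)]

def call_genotype(ref_reads, alt_reads, ploidy, sd_scale=0.2, n_sd=2.0):
    """Return the reference-allele dosage, or None if the ratio is ambiguous.

    ASSUMPTION: sd = sd_scale / ploidy, so the accepted window around each
    expected ratio narrows as ploidy increases.
    """
    depth = ref_reads + alt_reads
    if depth == 0:
        return None  # no reads, no call
    ratio = ref_reads / depth
    sd = sd_scale / ploidy
    ratios = expected_ratios(ploidy)
    # With equal SDs, the most likely category is the nearest expected ratio.
    dosage = min(range(ploidy + 1), key=lambda k: abs(ratio - ratios[k]))
    if abs(ratio - ratios[dosage]) > n_sd * sd:
        return None  # falls between categories
    return dosage

def depth_for_precision(ploidy, p=0.5, sd_scale=0.2):
    """Depth at which the binomial SD of the observed ratio, sqrt(p(1-p)/n),
    drops to the assumed model SD (sd_scale / ploidy)."""
    sd = sd_scale / ploidy
    return math.ceil(p * (1 - p) / sd**2 - 1e-9)  # tolerance guards float error

if __name__ == "__main__":
    # The same read counts can be callable as 4N but ambiguous as 8N.
    print(call_genotype(56, 44, ploidy=4))
    print(call_genotype(56, 44, ploidy=8))
    # Depth needed at a balanced heterozygote under each ploidy.
    print(depth_for_precision(4), depth_for_precision(8))
```

Under these assumptions, doubling ploidy halves the allowed standard deviation and therefore quadruples the depth needed to reach the same relative precision, which is consistent with the qualitative point above that coverage requirements grow with ploidy.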
Two additional limitations of this ploidy estimation function are worth noting. First, the function assumes that all loci within a single individual have the same ploidy. Loci with multiple ploidies within an individual (e.g., tetraploid and octoploid loci within an ancestral octoploid) can be accommodated by grouping loci by ploidy and fitting models to each group separately. The likelihood across all loci could then be calculated as the product of the likelihoods for each group. Second, discrimination of ploidies that are exact multiples of one another may yield results that are less straightforward to interpret because there is currently no penalty function for overfitting (fitting noise with the higher ploidy model). For example, in the current dataset, for the same individuals from which the 4N allele plots were generated, a comparison of 4N and 8N models demonstrated that the 8N model had the higher likelihood for 54% of samples, although fewer than 1% of incorrect ploidy estimates had a minimum alternate LLR higher than 10. As pointed out by Delomas et al. (submitted), the most likely 8N model will always have a likelihood higher than or equal to that of the most likely 4N model, apart from deviations due to the threshold at which convergence of the EM algorithm is assumed, because 4N models are a subset of the space of all possible 8N models. However, ploidy can still be inferred in these situations: individuals of the lower ploidy will have LLRs distributed close to zero, and individuals of the higher ploidy will have LLRs distributed further from zero. Critical values for assigning ploidy can be chosen using LLRs from a set of individuals of known ploidy. Alternatively, if individual ploidies are not known but a group is presumed to have variable ploidy, the LLRs from individuals in this group are expected to have a bimodal distribution (one mode for each ploidy). If these modes are sufficiently distinct, critical values can be chosen to separate the two clusters.
Nevertheless, we look forward to continued development of these functions to facilitate an even greater variety of tests.
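The two ways of choosing a critical LLR value described above can be sketched in Python as follows. This is an illustrative heuristic only, not a function of the genotyping pipeline. The function names, the midpoint rule, the `min_gap` default, and the LLR values in the usage example are all assumptions made up for demonstration, with LLR taken as the log-likelihood of the best 8N model minus that of the best 4N model.

```python
def critical_from_known(llr_low_ploidy, llr_high_ploidy):
    """Critical value from individuals of known ploidy.

    ASSUMPTION: use the midpoint between the largest low-ploidy LLR and the
    smallest high-ploidy LLR; return None if the two groups overlap.
    """
    upper_low = max(llr_low_ploidy)
    lower_high = min(llr_high_ploidy)
    if upper_low >= lower_high:
        return None  # groups overlap, so no clean critical value exists
    return (upper_low + lower_high) / 2

def critical_from_gap(llrs, min_gap=5.0):
    """Critical value when ploidies are unknown but LLRs look bimodal.

    ASSUMPTION: split at the midpoint of the widest gap between sorted LLRs,
    and only if that gap is at least min_gap (modes 'sufficiently distinct').
    """
    ordered = sorted(llrs)
    if len(ordered) < 2:
        return None
    # Pair each gap width with the index of its lower neighbour.
    width, i = max((ordered[j + 1] - ordered[j], j) for j in range(len(ordered) - 1))
    if width < min_gap:
        return None
    return (ordered[i] + ordered[i + 1]) / 2

def assign_ploidy(llr, critical, low=4, high=8):
    """Assign the higher ploidy only when the LLR exceeds the critical value."""
    return high if llr > critical else low

if __name__ == "__main__":
    # Made-up LLRs: 4N individuals cluster near zero, 8N individuals do not.
    known_4n = [0.0, 0.3, 1.1]
    known_8n = [14.2, 20.5, 31.0]
    crit = critical_from_known(known_4n, known_8n)
    print(crit, assign_ploidy(0.8, crit), assign_ploidy(18.0, crit))
    print(critical_from_gap(known_4n + known_8n))
```

In practice, a density-based or mixture-model split may be preferable to the widest-gap rule when the two modes are close together; the rule here is used only because it is easy to inspect.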
Recent development of population genetic software that accommodates polysomic data has advanced the evolutionary analysis of contemporary polyploids, as demonstrated here for white sturgeon with Polygene. Polygene includes a number of functions that were previously inaccessible, including single-parent assignment and maximum likelihood estimates of relatedness. Sibship estimation remains unavailable, however, and for situations where candidate parents are not available, we demonstrated that SNP data could be coded as pseudo-dominant diploid data for use in Colony2 with reasonable, though time-consuming, efficacy. We note, however, that one limitation of Polygene is its treatment of ploidy as invariant within a “population”, which prevents the estimation of parentage and other statistics across ploidy states. While other packages such as Genodive and adegenet do not appear to have this limitation, they nonetheless lack some of the functionality of Polygene. Though not demonstrated here, we envision scenarios in which simulations, relationships inferred from same-ploidy individuals, or known families, could be used to identify thresholds in ploidy-agnostic measures of relatedness to estimate relationships between individuals of different ploidy. As demonstrated here, these measures exhibit variance and potentially downward bias that suggest they need to be estimated for each population separately and treated conservatively. In any event, we expect that the greater accessibility of ploidy-correct genotype data, as provided by the techniques demonstrated here, will spur further development of software facilitating the genetic analysis of polyploids.