Discussion

Using publicly available data from 442 fish species comprising five vertebrate classes, we developed a model to predict species maximum lifespan from genomic CpG density alone. The accuracy of the fish lifespan predictions was consistent across genome assemblies of different samples of the same species, indicating that the analysis of only a single individual is required to predict a species’ lifespan using this method. We anticipate that this novel approach will have immediate utility in any fishery management case where lifespan approximation by other means is impracticable, and here identify areas for future research that may improve the predictive power of the model for broader application.

Robustness, accuracy and potential application of genomic lifespan prediction

The fish lifespan model demonstrates that there is a strong association between genomic CpG density and lifespan. Based on this association, the model is robust to sequence differences between zebrafish promoters and orthologous promoters in distantly related species, as well as to differences in genome assembly completeness. The resulting predictions had approximately double the error of the reported values of lifespan, which require far more intensive research efforts to obtain. To predict lifespan using this method, our results indicate that the genome sequence of just a single individual (no repeated sampling) is required. This involves the acquisition of a small piece of tissue (e.g., a fin clip), genome sequencing and assembly, followed by downstream bioinformatic analysis. Contig-level assemblies for genomes up to 1 Gbp in size (i.e., most fish) can be produced for less than $5000 USD and in under two weeks (R. Huerlimann, pers. comm.). If a genomic assembly for the species is already available, model predictions can be generated immediately and with no associated consumable expenses. At present, known lifespan estimation involves observing the age at death of fish held in aquaria, repeated sampling in the field to determine maximum observed age, modelling the maximum based on trends in survivorship with age, or estimations based on maximum length. The cost and time involved in housing animals in aquaria, or in monitoring enough individuals to confidently identify or calculate maximum age using current methods, likely far exceed what is required for genomic lifespan prediction.

Molecular predictors of lifespan

In addition to providing lifespan predictions, the model may provide insight into the molecular biology of fish lifespan. For example, it has been hypothesised that the association between genomic CpG density and lifespan is due to a protective effect of increased CpG density against age-related epigenomic changes. Previous results in mammals showed that CpG density was positively associated with lifespan in 94% of promoters, providing strong support for this theory. However, the vertebrate model showed this positive association was only present for 62% of modelled promoters, and here we observed positive associations in just 38%. These results highlight that differences in CpG density, rather than simply increases as previously hypothesised, are important for predicting lifespan. This is evident in mammals and other vertebrates, but is particularly pronounced in fish.
Previous functional analyses of lifespan-related promoters in CpG density models have been unable to identify any significantly enriched gene functions. However, analysis of the lifespan-associated genes here revealed functions related to intracellular components, transport and immune functioning pathways. Specifically, we identified a number of pathway components related to T and B cell functioning as well as NF-κB signalling pathways, all of which are of central importance in immune functioning. Transcriptional regulation by RUNX3, a gene that functions in the suppression of tumours, was also identified. Collectively, these immune system components are protective against toxins, infection, and cancer and thus are highly likely to influence longevity. These results are consistent with epigenetic age predictors, which commonly select for genomic regions associated with immune function.
We also observed enrichment for specific signal transduction pathway elements, with many involved in Hedgehog repression and RAF/MAP kinase pathways, which regulate programmed cell differentiation and aspects of immune functioning. Interestingly, the analysis revealed enrichment for 44 genes associated with abnormal hair formation in humans. Due to the presence of many shared signalling pathways, Actinopterygian scales are thought to be evolutionary precursors to mammalian hair, which is known to degenerate with increasing age. Fish also have hair cells in their lateral line for sensing prey as well as in their ear canals for sensing barometric pressure. Promoters for genes that are important for species survival may have been altered in different lineages under varying selection pressures, leading to lifespan changes among fish species.
We observed no Pearson correlation between global CpG O/E and lifespan. This provides support for the hypothesis that age-related changes in DNA methylation in promoter regions specifically (as opposed to across the genome more generally) are strongly associated with lifespan. We also observed a significant negative Pearson correlation between genome size and CpG O/E. This is consistent with previous reports that high levels of DNA methylation (and therefore low CpG O/E) lead to increases in genome size via the suppression of transposable element (TE) activity. In our results, when genome size and the interaction between genome size and CpG O/E were controlled for, we observed a positive relationship between global CpG O/E and lifespan for small genomes and a negative relationship for large genomes. The differing pattern for larger genomes may be related to increased TE load. However, as this was not the focus of the work, the present results are inconclusive. The relationship between global CpG O/E and genome size, and how it relates to species lifespan, warrants further investigation.
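To make the global CpG O/E statistic discussed above concrete, the following is a minimal Python sketch. It assumes the widely used observed/expected definition (the number of CpG dinucleotides multiplied by sequence length, divided by the product of the C and G counts). The exact formula and any windowing applied in this study are not restated in this section, so the function name and the handling of ambiguous bases are illustrative assumptions only.

```python
def cpg_observed_expected(sequence: str) -> float:
    """Return CpG O/E for a DNA sequence (hypothetical helper, not the study's code).

    Assumed definition: (count of 'CG' * N) / (count of 'C' * count of 'G'),
    where N counts only unambiguous bases (A, C, G, T).
    """
    seq = sequence.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    # 'CG' cannot overlap itself, so str.count gives the true dinucleotide count.
    cpg_count = seq.count("CG")
    # Assumption: ambiguous bases such as 'N' are excluded from the length.
    length = sum(seq.count(base) for base in "ACGT")
    if c_count == 0 or g_count == 0:
        # O/E is undefined without both C and G; report 0.0 by convention here.
        return 0.0
    return cpg_count * length / (c_count * g_count)
```

Under this definition, a value near 1 indicates that CpG dinucleotides occur about as often as expected from base composition, while lower values indicate CpG depletion, which is the pattern associated above with high levels of DNA methylation.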

Limitations and future directions

Despite the broad applicability and predictive power of the fish lifespan model, variable levels of prediction accuracy may limit its application in its present form. The accuracy of machine learning models, including elastic net regression, is substantially impaired by poor quality training data (e.g., incorrect, inconsistent, or missing values). In many cases, increasing sample size and using techniques such as cross validation and bagging as applied here will reduce the effects of outliers and increase model accuracy. Our model predictions would be further improved if the quality of the training data (here, the known lifespan values) were increased. Maximum age and therefore lifespan values are difficult to determine for many fish species. The most common aging technique in bony fish, otolith aging, is subject to observation error and is especially difficult to perform for long-lived species. For example, reported orange roughy lifespan estimates range from 10 to 230 years, and despite extensive investigation the true value is still disputed. For cartilaginous fish (sharks and rays), lifespan estimation is particularly difficult because a reliable method for aging is yet to be established. At present, the fish lifespan model relies upon existing lifespan data for training and validation. As such, improvements in the accuracy of training data would greatly improve the accuracy of the model’s predictions. There is little research on how to measure data quality for robust machine learning model development, although software tools for data quality control are emerging in different fields.
The lifespan model training data also suffer from inconsistent taxonomic coverage, for example the over-representation of Sebastes species (n=57) and the under-representation of chondrichthyans (n=9). To overcome this, the model could be recalibrated with additional fish genome sequences with broad taxonomic coverage as they are released from individual sequencing projects, or by collaborative efforts such as Beijing Genome Institute’s Fish10K. Finally, a lack of sequence similarity between the target species and zebrafish resulted in reduced length or completely absent BLAST hits (i.e., a large amount of missing data). While we opted to use fish-specific reference sequences and did not observe any bias towards higher prediction error in more divergent species, the model primarily selected promoters with non-zero values. Thus, any model using the same sequence similarity approach is likely to suffer from some degree of bias in divergent species. An alternative to using gene promoters as reference sequences may be to analyse genomic regions that can be identified by location. For example, DNA methylation in first introns is highly correlated with gene expression. However, this approach would require comparable genome annotations and would be computationally expensive to execute.
The most immediate application for the lifespan predictions is likely to be the estimation of natural mortality for use in fisheries stock assessments. Lifespan (tmax) based estimators consistently perform better than other methods for calculating natural mortality, one of the most widely used and most difficult to estimate stock assessment parameters. A primary advantage of both lifespan-based estimators of mortality and the lifespan predictor presented here is the ability to provide rapid and cost-effective analyses. The provision of these data can assist in overcoming deficiencies in the expertise and funding required to undertake formal stock assessments (approximately $50,000 USD per species). The accuracy and precision of parameter estimates vary markedly between assessments, but error rates of 10% are reported as optimal. Although the median error rate for the fish lifespan model was 37%, the same value for the reported lifespans was 20%. This re-emphasises the marked lack of appropriate lifespan estimates, and the need for better training data to build a more refined genomic lifespan predictor. In its present form, the model is likely to be most applicable for data limited or newly targeted fisheries, data deficient species under significant threat, and in any case where lifespan approximation by other means is impracticable.
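As a hedged illustration of how a predicted lifespan could feed a tmax-based natural mortality estimate, the sketch below uses a power-law form, M = a * tmax^b. The default coefficients (a = 4.899, b = -0.916) are an assumption taken from one published tmax-based estimator (Then et al. 2015) and are not necessarily those intended by the estimators cited above. The function names are hypothetical, and the 37% relative error is used simply to show how the median prediction error reported here would propagate to M.

```python
def natural_mortality_from_tmax(tmax_years: float,
                                a: float = 4.899,
                                b: float = -0.916) -> float:
    """Estimate natural mortality M (per year) from maximum lifespan.

    Assumed form: M = a * tmax**b. The default coefficients are an
    assumption (see the text above), not values from this study.
    """
    if tmax_years <= 0:
        raise ValueError("tmax_years must be positive")
    return a * tmax_years ** b


def mortality_range(tmax_years: float, relative_error: float = 0.37):
    """Return (low, high) M when tmax is off by +/- relative_error.

    A longer lifespan implies lower mortality, so the upper tmax bound
    gives the lower M bound and vice versa.
    """
    low_m = natural_mortality_from_tmax(tmax_years * (1 + relative_error))
    high_m = natural_mortality_from_tmax(tmax_years * (1 - relative_error))
    return low_m, high_m
```

For example, `mortality_range(20.0)` returns the interval of M values implied by a 20-year lifespan prediction that is wrong by up to 37% in either direction. Any alternative tmax-based estimator could be substituted by changing `a` and `b`.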

Conclusion

We derived a model that predicts lifespan for any fish species from the genomic CpG density of a single individual. The model is highly robust to variation in genome quality and is applicable to all classes of fish, a taxonomically diverse and highly speciose group of marked ecological and economic importance. The predictions are likely to be of use for both commercially valuable and highly vulnerable species, as lifespan enables approximation of natural mortality and rate of population increase. The work demonstrates the remarkable power of genomic CpG density alone to predict fish lifespan, and the predictive capacity of the model is likely to improve as the quantity and quality of available training data increase. Fish lifespan prediction is a significant problem for many species, and the value of estimating this fundamental life history parameter has driven interest in developing unconventional lifespan measurement technologies. We envisage that the utility of our novel approach to estimating this central life history trait is likely to be far reaching, with both commercial and environmental impacts.

Acknowledgements

This project was funded by the CSIRO Environomics Future Science Platform. Fish photographs were kindly provided by Alastair Graham from the Australian National Fish Collections. The authors would like to thank all individuals who were involved in the creation, submission and curation of publicly available data that enabled this work to be carried out. We would also like to thank the reviewers for offering their time and expertise to improve the manuscript.