Discussion
Using publicly available data from 442 fish species spanning five
vertebrate classes, we developed a model to predict species maximum
lifespan from genomic CpG density alone. The accuracy of the fish
lifespan predictions was consistent across genome assemblies of
different samples of the same species, indicating that analysis of only
a single individual is required to predict a species’ lifespan using
this method. We anticipate this novel approach having immediate utility in
any fishery management case where lifespan approximation by other means
is impracticable, and here identify areas for future research that may
improve the predictive power of the model for broader application.
Robustness, accuracy and potential application of genomic lifespan prediction
The fish lifespan model demonstrates that there is a strong association
between genomic CpG density and lifespan. Based on this association, the
model is robust to sequence differences between zebrafish promoters and
orthologous promoters in distantly related species, as well as
differences in genome assembly completeness. The resulting predictions
had approximately double the error of the reported values of lifespan,
which require far more intensive research efforts to obtain. To predict
lifespan using this method, our results indicate that the genome
sequence of just a single individual (no repeated sampling) is required.
This involves the acquisition of a small piece of tissue (e.g., a fin
clip), genome sequencing and assembly followed by downstream
bioinformatic analysis. Contig-level assemblies for genomes up to 1 Gbp
in size (i.e., most fish) can be produced for less than US$5000 and
in under two weeks (R. Huerlimann, pers. comm.). If a genomic
assembly for the species is already available, model predictions can be
generated immediately and with no associated consumable expenses. At
present, lifespan estimation involves observing the age at
death of fish held in aquaria, repeated sampling in the field to
determine maximum observed age, modelling the maximum based on trends
in survivorship with age, or estimation based on maximum length. The
cost and time involved in housing animals in aquaria or monitoring
enough individuals to confidently identify or calculate maximum age
using current methods likely far exceeds what is required for genomic
lifespan prediction.
Molecular predictors of lifespan
In addition to providing lifespan predictions, the model may provide
insight into the molecular biology of fish lifespan. For example, it has
been hypothesised that the association between genomic CpG density and
lifespan is due to a protective effect of increased CpG density against
age-related epigenomic changes. Previous results in mammals showed that
CpG density was positively associated with lifespan in 94% of promoters,
providing strong support for this theory. However, the vertebrate
model showed this positive association was only present for 62% of
modelled promoters and here we observed positive associations in just
38%. These results indicate that differences in CpG density in either
direction, rather than increases alone as previously hypothesised, are
important for predicting lifespan. This is evident in mammals and other
vertebrates, but is particularly pronounced in fish.
Previous functional analyses of lifespan-related promoters in CpG
density models have been unable to identify any significantly enriched
gene functions. However, analysis of the lifespan-associated genes here
revealed functions related to intracellular components, transport and
immune functioning pathways. Specifically, we identified a number of
pathway components related to T and B cell functioning as well as NF-κB
signalling pathways, all of which are of central importance in immune
functioning. Transcriptional regulation by RUNX3, a gene that functions
in the suppression of tumours, was also identified. Collectively, these
immune system components are protective against toxins, infection, and
cancer and thus are highly likely to influence longevity. These results
are consistent with epigenetic age predictors, which commonly select for
genomic regions associated with immune function.
We also observed enrichment for specific signal transduction pathway
elements, with many involved in Hedgehog repression and RAF/MAP kinase
pathways, which regulate programmed cell differentiation and aspects of
immune functioning. Interestingly, the analysis revealed enrichment for
44 genes associated with abnormal hair formation in humans. Due to the
presence of many shared signalling pathways, Actinopterygian scales are
thought to be evolutionary precursors to mammalian hair, which is known
to degenerate with increasing age. Fish also have hair cells in their
lateral line for sensing prey as well as in their ear canals for sensing
barometric pressure. Promoters for genes that are important for species
survival may have been altered in different lineages under varying
selection pressures, leading to lifespan changes among fish species.
We observed no Pearson correlation between global CpG O/E and lifespan.
This provides support for the hypothesis that age-related changes in DNA
methylation in promoter regions specifically (as opposed to across the
genome more generally) are strongly associated with lifespan. We also
observed a significant negative Pearson correlation between genome size
and CpG O/E. This is consistent with previous reports that high levels
of DNA methylation (and therefore low CpG O/E) lead to increases in
genome size via the suppression of transposable element (TE) activity.
In our results, when genome size and the interaction between genome size
and CpG O/E were controlled for, we observed a positive relationship
between global CpG O/E and lifespan for small genomes and a negative
relationship for large genomes. The differing pattern for larger genomes
may be related to increased TE load. However, as this was not the focus
of the work, the present results are inconclusive. The relationship
among global CpG O/E, genome size and species lifespan warrants
further investigation.
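The observed/expected CpG ratio discussed above can be illustrated with a
short sketch. It assumes the commonly used definition, CpG O/E = (number
of CpG dinucleotides x N) / (number of C x number of G), where N is the
number of unambiguous bases. The text does not state which exact
definition was used, so the formula, the function name cpg_obs_exp and
the handling of ambiguous bases are assumptions made for illustration
only.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio for one sequence (illustrative only).

    Assumed definition: (n_CpG * N) / (n_C * n_G), with N counted over
    unambiguous bases (A, C, G, T) so that runs of 'N' do not inflate it.
    """
    s = seq.upper()
    n = sum(s.count(base) for base in "ACGT")  # unambiguous bases only
    n_c = s.count("C")
    n_g = s.count("G")
    n_cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if n_c == 0 or n_g == 0:
        return 0.0  # ratio is undefined without both C and G; report 0
    return n_cpg * n / (n_c * n_g)
```

A ratio near 1 means CpG dinucleotides occur about as often as base
composition alone would predict; values well below 1 indicate CpG
depletion, which is typically associated with heavy methylation.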
Limitations and future directions
Despite the broad applicability and predictive power of the fish
lifespan model, variable levels of prediction accuracy may limit its
application in its present form. The accuracy of machine learning
models, including elastic net regression, is substantially impaired by
poor quality training data (e.g., incorrect, inconsistent, or missing
values). In many cases, increasing sample size and using techniques
such as cross-validation and bagging, as applied here, will reduce the
effects of outliers and increase model accuracy. Our model predictions
would be further improved if the quality of the training data (here, the
known lifespan values) were increased. Maximum age and therefore
lifespan values are difficult to determine for many fish species. The
most common aging technique in bony fish, otolith aging, is subject to
observation error and is especially difficult to perform for long-lived
species. For example, reported orange roughy lifespan estimates range
from 10 to 230 years, and despite extensive investigation the true value
is still disputed. For cartilaginous fish (sharks and rays), lifespan
estimation is particularly difficult because a reliable method for aging
is yet to be established. At present, the fish lifespan model relies
upon existing lifespan data for training and validation. As such,
improvements in the accuracy of training data would greatly improve the
accuracy of the model’s predictions. There is little research on how to
measure data quality for robust machine learning model development,
although software tools for data quality control are emerging in
different fields.
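To illustrate why held-out evaluation exposes poor quality training
values, the following is a minimal sketch of k-fold cross-validation. It
is not the pipeline used in this work: a one-predictor ordinary
least-squares fit stands in for elastic net regression, and the names
fit_line and kfold_mae are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx


def kfold_mae(xs, ys, k=5, seed=0):
    """Mean absolute error of predictions made on k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # reproducible fold assignment
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        # Score only the samples the model did not see during fitting.
        errors.extend(abs(ys[i] - (slope * xs[i] + intercept))
                      for i in held_out)
    return mean(errors)
```

With perfectly linear toy data the held-out error is essentially zero,
while a single badly wrong response value (analogous to an incorrect
reported lifespan) raises it noticeably.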
The lifespan model training data also suffer from inconsistent taxonomic
coverage, for example the over-representation of Sebastes species
(n=57) and the under-representation of chondrichthyans (n=9). To
overcome this, the model could be recalibrated with additional fish
genome sequences with broad taxonomic coverage as they are released from
individual sequencing projects, or by collaborative efforts such as
Beijing Genome Institute’s Fish10K. Finally, a lack of sequence
similarity between the target species and zebrafish resulted in reduced
length or completely absent BLAST hits (i.e., a large amount of missing
data). While we opted to use fish-specific reference sequences and did
not observe any bias towards higher prediction error in more divergent
species, the model primarily selected promoters with non-zero values.
Thus, any model using the same sequence similarity approach is likely to
suffer from some degree of bias in divergent species. An alternative to
using gene promoters as reference sequences may be to analyse genomic
regions that can be identified by location. For example, DNA methylation
in first introns is highly correlated with gene expression. However,
this approach would require comparable genome annotations and would be
computationally expensive to execute.
The most immediate application for the lifespan predictions is likely
for the estimation of natural mortality for use in fisheries stock
assessments. Lifespan (tmax) based estimators consistently perform
better than other methods for calculating natural mortality, one of the
most widely used and difficult to estimate stock assessment
parameters. A primary advantage of both lifespan-based
estimators of mortality and the lifespan predictor presented here is the
ability to provide rapid and cost-effective analyses. The provision of
these data can assist in overcoming deficiencies in the expertise and
funding required to undertake formal stock assessments (approximately
US$50,000 per species). The accuracy and precision of parameter
estimates vary markedly between assessments, but error rates of 10%
are reported as optimal. Although the median error rate for the fish
lifespan model was 37%, the corresponding value for the reported
lifespans was 20%. This re-emphasises the marked absence of appropriate lifespan
estimates available, and the need for better training data to build a
more refined genomic lifespan predictor. In its present form, the model
is likely to be most applicable for data limited or newly targeted
fisheries, data deficient species under significant threat, and in any
case where lifespan approximation by other means is impracticable.
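As a worked illustration of how a lifespan estimate feeds into natural
mortality (M), the sketch below uses one widely cited tmax-based
estimator, M = 4.899 x tmax^(-0.916) (Then et al. 2015). The text above
does not specify which estimator is intended, so the coefficients and
both function names are assumptions made here; the second function shows
how a relative error in tmax propagates to M under that same power law.

```python
def natural_mortality_from_tmax(tmax_years, a=4.899, b=-0.916):
    """Natural mortality (per year) from maximum age via M = a * tmax**b.

    Default a and b are the Then et al. (2015) values, assumed here for
    illustration; substitute the estimator used in a given assessment.
    """
    if tmax_years <= 0:
        raise ValueError("tmax_years must be positive")
    return a * tmax_years ** b


def relative_change_in_m(rel_error_tmax, b=-0.916):
    """Fractional change in M when tmax is off by rel_error_tmax.

    Because M scales as tmax**b, the factor a cancels out:
    M_est / M_true - 1 = (1 + rel_error_tmax)**b - 1.
    """
    if rel_error_tmax <= -1:
        raise ValueError("rel_error_tmax must be greater than -1")
    return (1 + rel_error_tmax) ** b - 1
```

Under these assumptions, overestimating tmax by 37% (the median error
rate of the fish lifespan model) would make M roughly 25% too low, while
a 20% overestimate (the median for reported lifespans) would make it
roughly 15% too low.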
Conclusion
We derived a model that predicts lifespan for any fish species from the
genomic CpG density of a single individual. The model is highly robust
to variation in genome quality and is applicable to all classes of fish,
a taxonomically diverse and highly speciose group of marked ecological
and economic importance. The predictions are likely to be of use for
both commercially valuable and highly vulnerable species, as lifespan
enables approximation of natural mortality and rate of population
increase. The work demonstrates the remarkable power of genomic CpG
density alone to predict fish lifespan, and the predictive capacity of
the model is likely to improve as the quantity and quality of available
training data increases. Fish lifespan prediction is a significant
problem for many species, and the value of estimating this fundamental
life history parameter has driven interest in developing unconventional
lifespan measurement technologies. We envisage that the utility of our
novel approach for estimating this central life history trait will be
far reaching, with both commercial and environmental impacts.
Acknowledgements
This project was funded by the CSIRO Environomics Future Science
Platform. Fish photographs were kindly provided by Alastair Graham from
the Australian National Fish Collections. The authors would like to
thank all individuals who were involved in the creation, submission and
curation of publicly available data that enabled this work to be carried
out. We would also like to thank the reviewers for offering their time
and expertise to improve the manuscript.