Results
Fish lifespan prediction
Final data set
A total of 1804 reported lifespan values were obtained from six online
databases, ten published data sets and over 100 additional
species-specific publications (Figure S1, Table S1). The reported
lifespans were used to calculate known lifespan estimates (i.e., the
mean of the reported lifespan values for each species) for 442 fish
species with publicly available genome assemblies (Figure 1, Table S1,
Figure S2). Known lifespan values ranged from mean 0.57 (SD 0.46) years
for the Turquoise killifish (Nothobranchius furzeri ) to mean
183.33 (SD 33.57) years in the rougheye rockfish (Sebastes
aleutianus ) (Figure 1, Table S1, Figure S2). Orange roughy
(Hoplostethus atlanticus ) exhibited the greatest variance in
reported lifespan values, with a mean 85.57 (SD 59.24) and a range of 10
to 149 years (Table S1, Figure S2).
The maximum number of BLAST hits to a total of 10,230 zebrafish promoter
regions was 9447 in the orange finned danio (Danio kyathit ), and
the minimum 8 hits in the Arctic lamprey (Lethenteron
camtschaticum ) (Figure 1). The average hit length for the 201 bp region
across all 10,230 promoters ranged from 177.11 bp in D. kyathitto 0.05 bp in L. camtschaticum . According to TimeTree, the
estimated divergence time between zebrafish and orange finned danio, and
zebrafish and Arctic lamprey are 16.4 million years and 599 million
years, respectively. CpG O/E values within the promoter BLAST hits
ranged from 0 to 28, with a minimum non-zero value of 0.06 (Figure 1).
The number of BLAST hits, BLAST hit length and the average CpG O/E all
decreased with divergence time from zebrafish (Figure 1, Figure S3).
Known lifespan increased, although the relationship was not significant
(Figure 1, Figure S3).
Model cross validation
Ten-fold nested cross validation resulted in 10 models with lambda.1se
values ranging from 1.79 – 4.34, where the lower penalty values were
associated with lower mean squared error in the training data but larger
differences in the residuals between testing and training model
predictions (i.e., overfitting; Figure S4). Minimum alpha values ranging
between 0.01 and 0.03, indicating that lifespan predictions with lower
error are produced using a penalty ratio closer to 0 (ridge regression;
L2 penalty) than 1 (lasso regression; L1 penalty) (Figure S4). The lower
alpha value indicates that the lifespan model is more accurate where a
larger number of features (here, promoters) are included. The number of
promoters included in each model ranged from 144 to 541, and 126
promoters were represented in all 10 models (Figure S5). Despite the
variance in the promoters used to predict fish lifespan, the
correlations between known and predicted lifespans were consistent
across models incorporating different combinations of promoters.
Specifically, for all 10 models, the Pearson correlation coefficient was
greater than 0.7 (training: R = 0.8 – 0.87; testing: R = 0.7 – 0.74),
the coefficient of determination was greater than 0.49 (training: R2 =
0.63 – 0.76; testing: R2 = 0.49 – 0.54) and the correlation p-value
was less than 0.05 (Figure S6).
Lifespan model, prediction accuracy and
variability
The final model used a total of 932 promoters to predict fish lifespan
with a correlation coefficient of 0.8 (p =< 0.001), explaining
64 % of the total variance between known and predicted lifespans
(Figure 2A). The median relative and absolute error for all predicted
lifespans were 3.81 years and 36.78 %, respectively, and were
approximately double the median absolute and relative error of 1.5 years
and 20 % for the known lifespan values (Figure 2B). The least accurate
prediction in terms of relative error was for the Neosho madtom
(Notorus placidus ) with a known lifespan of 1 year, a predicted
lifespan of 8.97 years and a relative error of 797.11 %. The least
accurate prediction in terms of absolute error was for the rougheye
rockfish (S. aleutianus ), with a known lifespan of 183.33 years,
a predicted lifespan of 33.07 years and an absolute error of 150.26
years (Table S3). The most accurate prediction was for the olive
flounder (Paralichthys olivaceus ) with a known and predicted
lifespan of 12.5 years, a relative error of 0.02 % and absolute error
of 0 years (Table S3).
Lifespan predictions produced using different genome assemblies (and
associated biosamples) for a given species were highly consistent, with
standard deviations of less than one year for all species (Figure 2C;
Table S4). The sole exception was the Japanese eel (Anguilla
japonica ), for which one of the assemblies had a BUSCO genome
completeness score of 0.1 % (Figure 2C, Table S5). This resulted in a
lifespan prediction that was approximately 8 years less than that
produced by the remaining five eel assemblies (Figure 2C, Table S5).
Genome completeness score did not correlate with error in the predicted
lifespans, demonstrating that the model is highly robust to low quality
genome assemblies (Figure S7G). However, the very poor quality of the
Japanese eel genome assembly and associated prediction suggest that a
low stringency cut-off (e.g., 10 % complete) would be beneficial.
Variables associated with error in lifespan
prediction
There was no correlation between relative error in the predicted
lifespans and: 1) known lifespan; 2) predicted lifespan; 3) relative
known lifespan error or; 4) the number of reported lifespan values used
to calculate known lifespan (Figure S8). However, the number of reported
values resulted in a correlation coefficient with relative error of
-0.08 (p < 0.1), suggesting that known lifespan estimates
derived from a larger number of input values may lead to lower percent
error in the predictions (Figure S8D). To further investigate this
relationship, generalised linear modelling (GLM) was carried out to
model percent prediction error and known lifespan, the number of known
lifespan values and the interaction between the two. The GLM revealed
that this trend (of more input values leading to lower prediction error)
was both influential and significant, but only for shorter lived species
(less than 40-year lifespan; Table S6, Figure S9). This likely reflects
a general tendency of smaller measured values to have higher relative
error (e.g., Figure S8A; p < 0.1).
No significant correlations were identified between the relative error
for predicted lifespans and: 1) the total number of BLAST hits; 2) mean
BLAST hit length; 3) mean sequence identity; 4) genome assembly
completeness (BUSCO completeness score) or; 5) divergence time from
zebrafish (Figure S7). However, the variance in divergence times
produced by TimeTree was limited, where the pairwise distances were
uniform for 75 % of species (Figure S10). Nonetheless, negative trends
for hit number and hit length suggests that decreases in promoter
sequence information used by the lifespan model led to decreases in
prediction accuracy, although the variance was large (Figure S7). The
range of predicted lifespans was smaller than known lifespan range, most
obviously in species of the Sebastes genus (Figure 2A). In
general, CpG O/E values were less variable among Sebastes spp .
compared to fish in other genera, although known lifespans varied
considerably (e.g., Figure 3A, Figure S11). Invariable CpG O/E values
may have led to an inability of the model to accurately predict lifespan
in fish from this group. This is difficult to measure statistically due
to the over representation of Sebastes species in the data set
(57 Sebastes species compared to a mean of 1.56 for all other
genera).
Model composition
Promoter correlations and model
weighting
CpG O/E was negatively associated with lifespan for more than 60 % of
promoters in the model (Figure 4). Specifically, of a total of 932
promoters in the lifespan model, 582 were negatively weighted, and 350
were positively weighted (Figure 4). These results were consistent with
Pearson correlations for negatively weighted promoters, where 570
promoters were negatively correlated with lifespan, and only 12 were
positively correlated (Figure 4B). The results were more varied for
promoters positively weighted in the lifespan model, where 274 had
negative Pearson correlations and 76 had positive Pearson correlations
(Figure 4B).
Promoter CpG observed over expected
ratios
CpG O/E was 0 for 96 % of all promoters in the complete data set and 82
% of promoters in the model. Mean CpG O/E values were significantly
higher in the selected promoters compared to those not selected by the
model (Figure S12A). However, when zero values derived from the absence
of a BLAST hit were removed from the data set, the pattern was reversed
(Figure S12B). These results indicate that the model selects for
promoters with non-zero CpG O/E values, but beyond this does not select
for larger CpG O/E values. The promoter weights were more variable and
of larger magnitude for smaller values of mean promoter CpG O/E, however
the data was skewed toward smaller CpG O/E values (i.e., CpG O/E
< 0.25; Figure 4C).
Functional analysis
Functional analysis revealed enrichment for genes associated with
several GO terms, Reactome pathways and tissue specificity from the
Human Protein Atlas (Figure 5). Promoters positively weighted in the
lifespan model were enriched for genes associated with intracellular
anatomical structures and catalytic activity (Figure 5A). Negatively
weighted promoters were enriched for genes with functions largely
related to intracellular components, including those involved in
cellular transport (Figure 5B). Negatively weighted genes were also
enriched for various biological signalling pathways from the Reactome
data base. These include five with roles in immune system functioning
(Downstream signalling evens of B Cell Receptor (BCR), CLEC7A (Dectin-1)
signalling, TCR signalling, Downstream TCR signalling and Activation of
NF-kappaB in B cells), two in signal transduction (GLI3 is processed to
GLI3R by the proteasome, Regulation of RAS by GAPs), two in metabolism
(Respiratory electron transport, Complex I biogenesis), two in cell
cycling (Autodegeneration of Cdh1 by Cdh1:APC/C, APC/C:Cdc20 mediated
degradation of Securin) and one in gene expression (Transcriptional
regulation by RUNX3; Figure 5B).
Global trends
No significant Pearson correlation between global CpG O/E and species
known lifespan or genome size was observed, however, genome size was
negatively correlated with global CpG O/E (Figure 3). A subsequent GLM
revealed the relationship between CpG O/E and lifespan is apparent
(despite the absence of a Pearson correlation) but is influenced by the
interaction between global CpG O/E and genome size. More specifically,
while known lifespan increases with global CpG O/E, this relationship is
reduced, and even reversed as genome size increases (Figure 3D, Table
S8).