Results

Fish lifespan prediction

Final data set

A total of 1804 reported lifespan values were obtained from six online databases, ten published data sets and over 100 additional species-specific publications (Figure S1, Table S1). The reported lifespans were used to calculate known lifespan estimates (i.e., the mean of the reported lifespan values for each species) for 442 fish species with publicly available genome assemblies (Figure 1, Table S1, Figure S2). Known lifespan values ranged from mean 0.57 (SD 0.46) years for the Turquoise killifish (Nothobranchius furzeri ) to mean 183.33 (SD 33.57) years in the rougheye rockfish (Sebastes aleutianus ) (Figure 1, Table S1, Figure S2). Orange roughy (Hoplostethus atlanticus ) exhibited the greatest variance in reported lifespan values, with a mean 85.57 (SD 59.24) and a range of 10 to 149 years (Table S1, Figure S2).
The maximum number of BLAST hits to a total of 10,230 zebrafish promoter regions was 9447 in the orange finned danio (Danio kyathit ), and the minimum 8 hits in the Arctic lamprey (Lethenteron camtschaticum ) (Figure 1). The average hit length for the 201 bp region across all 10,230 promoters ranged from 177.11 bp in D. kyathitto 0.05 bp in L. camtschaticum . According to TimeTree, the estimated divergence time between zebrafish and orange finned danio, and zebrafish and Arctic lamprey are 16.4 million years and 599 million years, respectively. CpG O/E values within the promoter BLAST hits ranged from 0 to 28, with a minimum non-zero value of 0.06 (Figure 1). The number of BLAST hits, BLAST hit length and the average CpG O/E all decreased with divergence time from zebrafish (Figure 1, Figure S3). Known lifespan increased, although the relationship was not significant (Figure 1, Figure S3).

Model cross validation

Ten-fold nested cross validation resulted in 10 models with lambda.1se values ranging from 1.79 – 4.34, where the lower penalty values were associated with lower mean squared error in the training data but larger differences in the residuals between testing and training model predictions (i.e., overfitting; Figure S4). Minimum alpha values ranging between 0.01 and 0.03, indicating that lifespan predictions with lower error are produced using a penalty ratio closer to 0 (ridge regression; L2 penalty) than 1 (lasso regression; L1 penalty) (Figure S4). The lower alpha value indicates that the lifespan model is more accurate where a larger number of features (here, promoters) are included. The number of promoters included in each model ranged from 144 to 541, and 126 promoters were represented in all 10 models (Figure S5). Despite the variance in the promoters used to predict fish lifespan, the correlations between known and predicted lifespans were consistent across models incorporating different combinations of promoters. Specifically, for all 10 models, the Pearson correlation coefficient was greater than 0.7 (training: R = 0.8 – 0.87; testing: R = 0.7 – 0.74), the coefficient of determination was greater than 0.49 (training: R2 = 0.63 – 0.76; testing: R2 = 0.49 – 0.54) and the correlation p-value was less than 0.05 (Figure S6).

Lifespan model, prediction accuracy and variability

The final model used a total of 932 promoters to predict fish lifespan with a correlation coefficient of 0.8 (p =< 0.001), explaining 64 % of the total variance between known and predicted lifespans (Figure 2A). The median relative and absolute error for all predicted lifespans were 3.81 years and 36.78 %, respectively, and were approximately double the median absolute and relative error of 1.5 years and 20 % for the known lifespan values (Figure 2B). The least accurate prediction in terms of relative error was for the Neosho madtom (Notorus placidus ) with a known lifespan of 1 year, a predicted lifespan of 8.97 years and a relative error of 797.11 %. The least accurate prediction in terms of absolute error was for the rougheye rockfish (S. aleutianus ), with a known lifespan of 183.33 years, a predicted lifespan of 33.07 years and an absolute error of 150.26 years (Table S3). The most accurate prediction was for the olive flounder (Paralichthys olivaceus ) with a known and predicted lifespan of 12.5 years, a relative error of 0.02 % and absolute error of 0 years (Table S3).
Lifespan predictions produced using different genome assemblies (and associated biosamples) for a given species were highly consistent, with standard deviations of less than one year for all species (Figure 2C; Table S4). The sole exception was the Japanese eel (Anguilla japonica ), for which one of the assemblies had a BUSCO genome completeness score of 0.1 % (Figure 2C, Table S5). This resulted in a lifespan prediction that was approximately 8 years less than that produced by the remaining five eel assemblies (Figure 2C, Table S5). Genome completeness score did not correlate with error in the predicted lifespans, demonstrating that the model is highly robust to low quality genome assemblies (Figure S7G). However, the very poor quality of the Japanese eel genome assembly and associated prediction suggest that a low stringency cut-off (e.g., 10 % complete) would be beneficial.

Variables associated with error in lifespan prediction

There was no correlation between relative error in the predicted lifespans and: 1) known lifespan; 2) predicted lifespan; 3) relative known lifespan error or; 4) the number of reported lifespan values used to calculate known lifespan (Figure S8). However, the number of reported values resulted in a correlation coefficient with relative error of -0.08 (p < 0.1), suggesting that known lifespan estimates derived from a larger number of input values may lead to lower percent error in the predictions (Figure S8D). To further investigate this relationship, generalised linear modelling (GLM) was carried out to model percent prediction error and known lifespan, the number of known lifespan values and the interaction between the two. The GLM revealed that this trend (of more input values leading to lower prediction error) was both influential and significant, but only for shorter lived species (less than 40-year lifespan; Table S6, Figure S9). This likely reflects a general tendency of smaller measured values to have higher relative error (e.g., Figure S8A; p < 0.1).
No significant correlations were identified between the relative error for predicted lifespans and: 1) the total number of BLAST hits; 2) mean BLAST hit length; 3) mean sequence identity; 4) genome assembly completeness (BUSCO completeness score) or; 5) divergence time from zebrafish (Figure S7). However, the variance in divergence times produced by TimeTree was limited, where the pairwise distances were uniform for 75 % of species (Figure S10). Nonetheless, negative trends for hit number and hit length suggests that decreases in promoter sequence information used by the lifespan model led to decreases in prediction accuracy, although the variance was large (Figure S7). The range of predicted lifespans was smaller than known lifespan range, most obviously in species of the Sebastes genus (Figure 2A). In general, CpG O/E values were less variable among Sebastes spp . compared to fish in other genera, although known lifespans varied considerably (e.g., Figure 3A, Figure S11). Invariable CpG O/E values may have led to an inability of the model to accurately predict lifespan in fish from this group. This is difficult to measure statistically due to the over representation of Sebastes species in the data set (57 Sebastes species compared to a mean of 1.56 for all other genera).

Model composition

Promoter correlations and model weighting

CpG O/E was negatively associated with lifespan for more than 60 % of promoters in the model (Figure 4). Specifically, of a total of 932 promoters in the lifespan model, 582 were negatively weighted, and 350 were positively weighted (Figure 4). These results were consistent with Pearson correlations for negatively weighted promoters, where 570 promoters were negatively correlated with lifespan, and only 12 were positively correlated (Figure 4B). The results were more varied for promoters positively weighted in the lifespan model, where 274 had negative Pearson correlations and 76 had positive Pearson correlations (Figure 4B).

Promoter CpG observed over expected ratios

CpG O/E was 0 for 96 % of all promoters in the complete data set and 82 % of promoters in the model. Mean CpG O/E values were significantly higher in the selected promoters compared to those not selected by the model (Figure S12A). However, when zero values derived from the absence of a BLAST hit were removed from the data set, the pattern was reversed (Figure S12B). These results indicate that the model selects for promoters with non-zero CpG O/E values, but beyond this does not select for larger CpG O/E values. The promoter weights were more variable and of larger magnitude for smaller values of mean promoter CpG O/E, however the data was skewed toward smaller CpG O/E values (i.e., CpG O/E < 0.25; Figure 4C).

Functional analysis

Functional analysis revealed enrichment for genes associated with several GO terms, Reactome pathways and tissue specificity from the Human Protein Atlas (Figure 5). Promoters positively weighted in the lifespan model were enriched for genes associated with intracellular anatomical structures and catalytic activity (Figure 5A). Negatively weighted promoters were enriched for genes with functions largely related to intracellular components, including those involved in cellular transport (Figure 5B). Negatively weighted genes were also enriched for various biological signalling pathways from the Reactome data base. These include five with roles in immune system functioning (Downstream signalling evens of B Cell Receptor (BCR), CLEC7A (Dectin-1) signalling, TCR signalling, Downstream TCR signalling and Activation of NF-kappaB in B cells), two in signal transduction (GLI3 is processed to GLI3R by the proteasome, Regulation of RAS by GAPs), two in metabolism (Respiratory electron transport, Complex I biogenesis), two in cell cycling (Autodegeneration of Cdh1 by Cdh1:APC/C, APC/C:Cdc20 mediated degradation of Securin) and one in gene expression (Transcriptional regulation by RUNX3; Figure 5B).

Global trends

No significant Pearson correlation between global CpG O/E and species known lifespan or genome size was observed, however, genome size was negatively correlated with global CpG O/E (Figure 3). A subsequent GLM revealed the relationship between CpG O/E and lifespan is apparent (despite the absence of a Pearson correlation) but is influenced by the interaction between global CpG O/E and genome size. More specifically, while known lifespan increases with global CpG O/E, this relationship is reduced, and even reversed as genome size increases (Figure 3D, Table S8).