Materials and methods

Known lifespan database curation

A comprehensive dataset of fish lifespan values [including those reported as longevity or maximum age (tmax)] was built by combining information from existing databases, publicly available fisheries data and by conducting a manual literature search (Table S1). The mean of all recorded values for a given species was used as an estimate of known lifespan (referred to as ‘known lifespan’ hereafter) as there was high variability in reported lifespan values. The mean lifespan value was selected as it is more likely to be representative of the lifespan of all individuals of a given species than the measured value of the single oldest individual reported .

Genomic data and promoter sequence generation

All available fish genomes were downloaded from the National Centre for Biotechnology Information (NCBI), filtering for classes Actinopteri, Cladista, Chondrichthyes, Cephalaspidomorphi, Hyperoartia, Myxini and Sarcopterygii (see Table S2 for accession numbers). If multiple genome assemblies were available for a species, NCBI’s ‘representative’ and ‘reference’ genome classes were used to select the most appropriate assembly for downstream analyses. For species with more than five genome assemblies derived from different individuals available, all assemblies were downloaded and used to assess within-species variability in lifespan predictions. Genome completeness was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO; version 5.2.2), specifying the Actinopterygii lineage dataset (actinopterygii_odb10) and Augustus gene predictor.
Promoter sequences that have been experimentally validated for transcriptional activity in zebrafish were downloaded from the Eukaryotic Promoter Database (EPD) using the EPDnew selection tool . At present, zebrafish are the only fish species for which EPD promoter sequences are available. For each gene, the region ±100 nucleotides surrounding the transcription start site (TSS) of the most representative gene promoter was extracted. This region was selected as it most likely encompasses the core promoter, a region immediately surrounding the TSS that functions in controlling the activity of RNA polymerase II, and therefore gene transcription . As described previously , the EPD promoter sequences were used to query each genome via Basic Local Alignment Search Tool (BLAST+; version 2.12.0) using a minimum sequence identity of 70 %. The single top hit for each promoter in each species was used to calculate CpG density.

Calculation of CpG observed/expected ratio

The observed/expected ratio of CpGs (CpG O/E) was used as a measure of under- or over-representation of the density of CpG dinucleotides in fish genomes and promoter regions. This measure was developed by to identify CpG islands. CpG O/E is calculated by first obtaining the CpG density [i.e., the total number of CpG dinucleotides (CpG) divided by the sequence length (N)] and dividing it by the expected CpG density, or the C density [i.e., total number of cytosines (C) divided by N] multiplied by the G density [i.e., total number of guanines (G) divided by N] as follows:
\begin{equation} CpG\ Observed/Expected\ =\frac{\text{CpG\ density}}{C\ density*G\ density}\nonumber \\ \end{equation}
Is equal to:
\begin{equation} CpG\ O/E=\frac{\frac{\text{CpG}}{N}}{\frac{C}{N}\times\frac{G}{N}}\nonumber \\ \end{equation}
Which can be simplified to:
\begin{equation} CpG\ O/E=\frac{\text{CpG}}{C\times G}\times N\nonumber \\ \end{equation}
Using this equation, values for CpG O/E were calculated for each promoter sequence and genome in each species. If no matching promoter sequence was obtained during the BLAST search, CpG O/E was given as 0 in the lifespan prediction model.

Lifespan prediction modelling

To predict fish lifespan from CpG O/E, an elastic net regression model was developed using 10-fold nested cross-validation in R version 4.1.2 . First, lifespan values from all fish species with genomic information available were natural log transformed to enable the data to fit a linear model. Based on the percentiles of the transformed values, the data was then split 70/30 for training and testing, respectively. The split was performed 10 times to create 10 outer folds. Within each of the 10 outer folds, the glmnet and glmnetUtils packages were used to perform the elastic net regression, including 10-fold inner cross validation to determine the optimal values for alpha and lambda (hyperparameter optimisation). Using the minimum value of alpha, the model was fitted to the training data for 100 values of lambda. The resulting model was then used to predict lifespan values for the training and testing data, specifying the optimal lambda.1se (lambda “one standard error”; the largest value of lambda within one standard error of the minimum lambda value) from the previous cross validation step. Pearson correlation coefficients between known and predicted lifespan values were calculated for both the testing and training datasets. Comparisons between the testing and training data correlations and residuals were identified using Fisher’s z test (cocor R package) and Students unpaired t-test, respectively. The results of each of the 10 models where then bagged (bootstrap aggregated) to produce more accurate lifespan predictions . To enable correlations between prediction error and distance from the zebrafish last common ancestor, a tree including all chordates was obtained using TimeTree . The chordate tree was then subset for all fish species in our data set, and pairwise distances between zebrafish and all other species were calculated using the ape package .

Gene ontology and analysis

Gene ontology (GO) enrichment was performed using gprofiler2 (an interface to the gprofiler tool g:GOSt) specifying zebrafish as the reference organism. The analyses were performed on all promoters used to predict lifespan, divided into two groups based on the weighting of their average coefficient values (negative or positive).