Materials and methods
Known lifespan database
curation
A comprehensive dataset of fish lifespan values [including those
reported as longevity or maximum age (tmax)] was built
by combining information from existing databases, publicly available
fisheries data and by conducting a manual literature search (Table S1).
The mean of all recorded values for a given species was used as an
estimate of known lifespan (referred to as ‘known lifespan’ hereafter)
as there was high variability in reported lifespan values. The mean
lifespan value was selected as it is more likely to be representative of
the lifespan of all individuals of a given species than the measured
value of the single oldest individual reported .
Genomic data and promoter sequence
generation
All available fish genomes were downloaded from the National Centre for
Biotechnology Information (NCBI), filtering for classes Actinopteri,
Cladista, Chondrichthyes, Cephalaspidomorphi, Hyperoartia, Myxini and
Sarcopterygii (see Table S2 for accession numbers). If multiple genome
assemblies were available for a species, NCBI’s ‘representative’ and
‘reference’ genome classes were used to select the most appropriate
assembly for downstream analyses. For species with more than five genome
assemblies derived from different individuals available, all assemblies
were downloaded and used to assess within-species variability in
lifespan predictions. Genome completeness was assessed using
Benchmarking Universal Single-Copy Orthologs (BUSCO; version 5.2.2),
specifying the Actinopterygii lineage dataset (actinopterygii_odb10)
and Augustus gene predictor.
Promoter sequences that have been experimentally validated for
transcriptional activity in zebrafish were downloaded from the
Eukaryotic Promoter Database (EPD) using the EPDnew selection tool . At
present, zebrafish are the only fish species for which EPD promoter
sequences are available. For each gene, the region ±100 nucleotides
surrounding the transcription start site (TSS) of the most
representative gene promoter was extracted. This region was selected as
it most likely encompasses the core promoter, a region immediately
surrounding the TSS that functions in controlling the activity of RNA
polymerase II, and therefore gene transcription . As described
previously , the EPD promoter sequences were used to query each genome
via Basic Local Alignment Search Tool (BLAST+; version 2.12.0) using a
minimum sequence identity of 70 %. The single top hit for each promoter
in each species was used to calculate CpG density.
Calculation of CpG observed/expected
ratio
The observed/expected ratio of CpGs (CpG O/E) was used as a measure of
under- or over-representation of the density of CpG dinucleotides in
fish genomes and promoter regions. This measure was developed by to
identify CpG islands. CpG O/E is calculated by first obtaining the CpG
density [i.e., the total number of CpG dinucleotides (CpG) divided by
the sequence length (N)] and dividing it by the expected CpG density,
or the C density [i.e., total number of cytosines (C) divided by N]
multiplied by the G density [i.e., total number of guanines (G)
divided by N] as follows:
\begin{equation}
CpG\ Observed/Expected\ =\frac{\text{CpG\ density}}{C\ density*G\ density}\nonumber \\
\end{equation}Is equal to:
\begin{equation}
CpG\ O/E=\frac{\frac{\text{CpG}}{N}}{\frac{C}{N}\times\frac{G}{N}}\nonumber \\
\end{equation}Which can be simplified to:
\begin{equation}
CpG\ O/E=\frac{\text{CpG}}{C\times G}\times N\nonumber \\
\end{equation}Using this equation, values for CpG O/E were calculated for each
promoter sequence and genome in each species. If no matching promoter
sequence was obtained during the BLAST search, CpG O/E was given as 0 in
the lifespan prediction model.
Lifespan prediction
modelling
To predict fish lifespan from CpG O/E, an elastic net regression model
was developed using 10-fold nested cross-validation in R version 4.1.2 .
First, lifespan values from all fish species with genomic information
available were natural log transformed to enable the data to fit a
linear model. Based on the percentiles of the transformed values, the
data was then split 70/30 for training and testing, respectively. The
split was performed 10 times to create 10 outer folds. Within each of
the 10 outer folds, the glmnet and glmnetUtils packages were used to
perform the elastic net regression, including 10-fold inner cross
validation to determine the optimal values for alpha and lambda
(hyperparameter optimisation). Using the minimum value of alpha, the
model was fitted to the training data for 100 values of lambda. The
resulting model was then used to predict lifespan values for the
training and testing data, specifying the optimal lambda.1se (lambda
“one standard error”; the largest value of lambda within one standard
error of the minimum lambda value) from the previous cross validation
step. Pearson correlation coefficients between known and predicted
lifespan values were calculated for both the testing and training
datasets. Comparisons between the testing and training data correlations
and residuals were identified using Fisher’s z test (cocor R package)
and Students unpaired t-test, respectively. The results of each of the
10 models where then bagged (bootstrap aggregated) to produce more
accurate lifespan predictions . To enable correlations between
prediction error and distance from the zebrafish last common ancestor, a
tree including all chordates was obtained using TimeTree . The chordate
tree was then subset for all fish species in our data set, and pairwise
distances between zebrafish and all other species were calculated using
the ape package .
Gene ontology and
analysis
Gene ontology (GO) enrichment was performed using gprofiler2 (an
interface to the gprofiler tool g:GOSt) specifying zebrafish as the
reference organism. The analyses were performed on all promoters used to
predict lifespan, divided into two groups based on the weighting of
their average coefficient values (negative or positive).