Distribution and linkage patterns of mzl-USCOs
All available (as of July 2021) metazoan genomes assembled to chromosome
level were downloaded from NCBI RefSeq (O’Leary et al. 2016; see Table
S1). Contigs not assembled to chromosome level were excluded with a
custom Perl script (Supplementary Material). The genomic nucleotide
sequences were then searched for mzl-USCOs with the program BUSCO v.
4.0.6 (Manni et al., 2021) using the program’s default parameters for
genomic data and the metazoa_odb10 dataset from the BUSCO website.
Mzl-USCOs were first sorted according to a) the chromosome on which they
were located and b) their start position on a given chromosome. Next, we
calculated the distances between start positions of consecutive
mzl-USCOs on the same chromosome as predicted by BUSCO. Distances were
recorded as absolute distances (nucleotides) and as normalized
distances, with the latter being calculated by dividing the absolute
distance by the genome size of the respective species. In a second step,
both absolute and normalized distance values (d) were binned into ten
categories based on log(d) values. For normalized distances, these were
log(d) < -6, -6 ≤ log(d) < -5.5, …, -2.5 ≤ log(d) < -2, and log(d) ≥ -2.
For absolute distances, 9 was added to the logarithmic range of each
category, as the average size of the analyzed genomes was about 10^9
nucleotides. For each taxon, the
proportion of distances in each bin was calculated with a custom Perl
script (Supplementary Material). Based on the proportions across all
taxa, we conducted principal component analyses (PCA) for absolute and
normalized distances, respectively, in PAST v. 4.03 (Hammer et al.,
2001). The resulting scores of all taxa for the first and the second PC
axes were mapped on the phylogenetic tree of the taxa (see below) with
MESQUITE v. 3.51 (Maddison & Maddison, 2018). Furthermore, we analyzed
the distribution of distances between start positions of adjacent
mzl-USCOs to assess whether mzl-USCOs cluster spatially more tightly
than an equal number of randomly chosen protein-coding genes. To this
end, we downloaded the official gene set of coding
sequences (CDS) for each genome and, using a custom Perl script
(Supplementary Material), randomly selected the same number of
protein-coding genes as the number of mzl-USCOs found in the respective
taxon. This random drawing was repeated 10,000 times for each genome,
and for each replicate, the median distance (absolute and normalized,
separately) between neighboring genes was calculated. To infer whether
mzl-USCOs cluster significantly more than randomly chosen genes, we
counted for each taxon the number of replicates in which the median
distance between neighboring protein-coding genes was lower than the
median distance between neighboring mzl-USCOs.
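
The distance, binning, and random-draw steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' custom Perl scripts, which are provided in the Supplementary Material. All function names are hypothetical. The sketch assumes base-10 logarithms, skips zero distances (whose logarithm is undefined), and treats a gene set without any same-chromosome neighbors as having an infinite median distance.

```python
import math
import random
from statistics import median


def consecutive_distances(starts_by_chrom):
    """Start-to-start distances of neighboring genes on the same chromosome."""
    dists = []
    for starts in starts_by_chrom.values():
        ordered = sorted(starts)
        dists.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return dists


def bin_index(d, lowest_edge=-6.0, width=0.5, n_bins=10):
    """Bin number (0..n_bins-1) of a positive distance d on a log10 scale.

    Bin 0 holds log10(d) < lowest_edge and the last bin is open-ended.
    Use lowest_edge=3.0 for absolute distances (the edges shifted by 9).
    """
    k = math.floor((math.log10(d) - lowest_edge) / width) + 1
    return min(max(k, 0), n_bins - 1)


def bin_proportions(dists, lowest_edge=-6.0, n_bins=10):
    """Proportion of distances per bin; zero distances are skipped (assumption)."""
    positive = [d for d in dists if d > 0]
    counts = [0] * n_bins
    for d in positive:
        counts[bin_index(d, lowest_edge, n_bins=n_bins)] += 1
    return [c / len(positive) for c in counts] if positive else counts


def median_neighbor_distance(genes):
    """Median neighbor distance for an iterable of (chromosome, start) pairs."""
    by_chrom = {}
    for chrom, start in genes:
        by_chrom.setdefault(chrom, []).append(start)
    dists = consecutive_distances(by_chrom)
    # Assumption: no neighbors on any chromosome counts as "infinitely far".
    return median(dists) if dists else math.inf


def count_tighter_replicates(all_genes, usco_genes, n_rep=10_000, seed=0):
    """Count random draws whose median neighbor distance is below the USCO median."""
    rng = random.Random(seed)
    observed = median_neighbor_distance(usco_genes)
    hits = 0
    for _ in range(n_rep):
        sample = rng.sample(all_genes, len(usco_genes))
        if median_neighbor_distance(sample) < observed:
            hits += 1
    return hits
```

Normalized distances would be obtained by dividing each absolute distance by the genome size before binning. Under this sketch, a small count returned by `count_tighter_replicates` relative to `n_rep` would indicate that random gene sets are rarely more tightly spaced than the mzl-USCOs.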
We used a custom Perl script (Supplementary Material) to estimate the
adjusted evenness of the distribution of mzl-USCOs between chromosomes
in each taxon according to the formula e^H/S, where H
is the Shannon-Wiener entropy (Heip et al. 1998) of the distribution and
S the number of chromosomes. Whereas in ecology species that are absent
from a sample are not considered when calculating evenness
(Heip et al. 1998), S here includes all chromosomes, even those without
mzl-USCOs; we therefore refer to this measure as “adjusted evenness”.
For comparison,
we also calculated the adjusted evenness of the number of all
protein-coding genes on the chromosomes, as well as that of the length
of the chromosomes in base pairs. Additionally, we used another Perl
script (Supplementary Material) to conduct chi-square tests of whether
the distribution of mzl-USCOs among chromosomes deviates significantly
(i) from a distribution proportional to chromosome length and (ii) from
the distribution of all protein-coding genes. To assess the degree to
which chromosomal linkage between mzl-USCOs is phylogenetically
conserved across taxa, we used a custom Perl script (Supplementary
Material) to calculate the proportion of taxa in which a given pair of
mzl-USCOs was found to be co-located on the same chromosome.
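
The three per-taxon statistics of this paragraph can be illustrated with a short Python sketch. It is not the authors' Perl implementation, and every function name is hypothetical. The sketch assumes that the evenness formula is exp(H) divided by S with H computed from natural logarithms, that the chi-square test uses expected counts proportional to a weight per chromosome (its length or its number of protein-coding genes), and that co-location proportions are taken over all taxa in the input.

```python
import math
from collections import Counter
from itertools import combinations


def adjusted_evenness(counts):
    """exp(H) / S, where S counts every chromosome, including empty ones.

    counts: number of items (e.g. mzl-USCOs) per chromosome.
    """
    total = sum(counts)
    s = len(counts)
    # Shannon entropy; chromosomes with zero items contribute nothing to H.
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return math.exp(h) / s


def chi_square_statistic(observed, weights):
    """Chi-square statistic and degrees of freedom against expected counts
    proportional to positive weights (e.g. chromosome lengths)."""
    total = sum(observed)
    weight_sum = sum(weights)
    stat = 0.0
    for obs, w in zip(observed, weights):
        expected = total * w / weight_sum
        stat += (obs - expected) ** 2 / expected
    return stat, len(observed) - 1


def colocation_proportions(taxa):
    """Proportion of taxa in which each pair of markers shares a chromosome.

    taxa: dict mapping taxon -> dict mapping marker id -> chromosome.
    Pairs that are never co-located are absent from the result (proportion 0).
    Assumption: the denominator is the total number of taxa.
    """
    together = Counter()
    for placement in taxa.values():
        for a, b in combinations(sorted(placement), 2):
            if placement[a] == placement[b]:
                together[(a, b)] += 1
    return {pair: n / len(taxa) for pair, n in together.items()}
```

With this definition, markers spread equally over all chromosomes give an adjusted evenness of 1, whereas markers confined to one of two chromosomes give 0.5. Converting the chi-square statistic into a p-value would additionally require the chi-square distribution with the returned degrees of freedom.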