Distribution and linkage patterns of mzl-USCOs
All available (as of July 2021) metazoan genomes assembled to chromosome level were downloaded from NCBI RefSeq (O’Leary et al. 2016; see Table S1). Contigs not assembled to chromosome level were excluded with a custom Perl script (Supplementary Material). The genomic nucleotide sequences were then searched for mzl-USCOs with the program BUSCO v. 4.0.6 (Manni et al., 2021) using the program’s default parameters for genomic data and the metazoa_odb10 dataset from the BUSCO website. Mzl-USCOs were first sorted according to a) the chromosome on which they were located and b) their start position on a given chromosome. Next, we calculated the distances between start positions of consecutive mzl-USCOs on the same chromosome as predicted by BUSCO. Distances were recorded as absolute distances (nucleotides) and as normalized distances, with the latter being calculated by dividing the absolute distance by the genome size of the respective species. In a second step, both absolute and normalized distance values (d) were binned into ten categories based on log(d) values. For normalized distances, these were log(d) < -6, -6 ≤ log(d) < -5.5, …, -2.5 ≤ log(d) < -2, log(d) ≥ -2. For absolute distances, 9 was added to the logarithmic range of each category, as the average size of analyzed genomes was about 10^9. For each taxon, the proportion of distances in each bin was calculated with a custom Perl script (Supplementary Material). Based on the proportions across all taxa, we conducted principal component analyses (PCA) for absolute and normalized distances, respectively, in PAST v. 4.03 (Hammer et al., 2001). The resulting scores of all taxa for the first and the second PC axes were mapped on the phylogenetic tree of the taxa (see below) with MESQUITE v. 3.51 (Maddison & Maddison, 2018).
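The distance calculation and binning described above can be illustrated with a minimal Python sketch. It is not the custom Perl script from the Supplementary Material. The function names (`consecutive_distances`, `bin_proportions`), the input format (a list of chromosome and start-position pairs), the use of base-10 logarithms, and the handling of zero distances are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def consecutive_distances(hits):
    """Distances between start positions of consecutive hits on each chromosome.

    hits: iterable of (chromosome, start) tuples for one genome (assumed format).
    """
    by_chrom = defaultdict(list)
    for chrom, start in hits:
        by_chrom[chrom].append(start)
    distances = []
    for starts in by_chrom.values():
        starts.sort()
        # Pair each start position with the next one on the same chromosome.
        distances.extend(b - a for a, b in zip(starts, starts[1:]))
    return distances


def bin_proportions(distances, offset=0.0):
    """Proportion of distances falling into each of ten log10 bins.

    offset=0 reproduces the bins for normalized distances
    (log(d) < -6, ..., log(d) >= -2); offset=9 shifts them for
    absolute distances. Base 10 is an assumption of this sketch.
    """
    # Nine inner edges (-6, -5.5, ..., -2) delimit ten bins.
    edges = [-6 + 0.5 * i + offset for i in range(9)]
    counts = [0] * 10
    for d in distances:
        if d <= 0:
            # Assumption: identical start positions go into the lowest bin.
            counts[0] += 1
            continue
        x = math.log10(d)
        counts[sum(x >= e for e in edges)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 10
```

For normalized distances, each value returned by `consecutive_distances` would first be divided by the genome size of the respective species and then passed to `bin_proportions` with `offset=0`. The resulting ten proportions per taxon form one row of the matrix used for the PCA.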
Furthermore, we analyzed the distribution of distances between start positions of adjacent mzl-USCOs to assess whether mzl-USCOs cluster spatially more strongly than an equal number of randomly chosen protein-coding genes. To this end, we downloaded the official gene set of coding sequences (CDS) for each genome and, using a custom Perl script (Supplementary Material), randomly selected the same number of protein-coding genes as the number of mzl-USCOs found in the respective taxon. This random drawing was repeated 10,000 times for each genome, and for each replicate, the median distance (absolute and normalized, separately) between neighboring genes was calculated. To infer whether mzl-USCOs cluster significantly more than randomly chosen genes, we counted for each taxon the number of replicates in which the median distance between neighboring protein-coding genes was lower than the median distance between neighboring mzl-USCOs.
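The resampling procedure can be sketched in Python as follows. This is an illustration under stated assumptions and not the Perl script from the Supplementary Material: the function names, the `(chromosome, start)` input format, the fixed random seed, and the choice to skip replicates without any same-chromosome neighbors are hypothetical.

```python
import random
from collections import defaultdict
from statistics import median


def median_neighbor_distance(genes):
    """Median distance between start positions of neighboring genes.

    genes: iterable of (chromosome, start) tuples (assumed format).
    Returns None if no chromosome carries at least two genes.
    """
    by_chrom = defaultdict(list)
    for chrom, start in genes:
        by_chrom[chrom].append(start)
    dists = []
    for starts in by_chrom.values():
        starts.sort()
        dists.extend(b - a for a, b in zip(starts, starts[1:]))
    return median(dists) if dists else None


def count_lower_replicates(all_genes, uscos, replicates=10_000, seed=0):
    """Count replicates whose median neighbor distance is lower than
    the median neighbor distance observed for the mzl-USCOs."""
    observed = median_neighbor_distance(uscos)
    if observed is None:
        raise ValueError("no neighboring mzl-USCOs on any chromosome")
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    lower = 0
    for _ in range(replicates):
        # Draw as many protein-coding genes as there are mzl-USCOs.
        sample = rng.sample(all_genes, len(uscos))
        m = median_neighbor_distance(sample)
        if m is not None and m < observed:
            lower += 1
    return lower
```

Within a single genome, dividing all distances by the genome size rescales the observed and the resampled medians by the same constant, so the sketch works on absolute distances only.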
We used a custom Perl script (Supplementary Material) to estimate the adjusted evenness of the distribution of mzl-USCOs between chromosomes in each taxon according to the formula e(H/S), where H is the Shannon-Wiener entropy (Heip et al. 1998) of the distribution and S the number of chromosomes. While in ecology species that are not present in a sample are not considered in the calculation of evenness (Heip et al. 1998), here S includes all chromosomes, even those with no mzl-USCOs, and the measure thus represents an “adjusted evenness”. For comparison, we also calculated the adjusted evenness of the number of all protein-coding genes on the chromosomes, as well as that of the length of the chromosomes in base pairs. Additionally, we used another Perl script (Supplementary Material) to conduct chi-square tests for significant deviations of the distribution of mzl-USCOs between chromosomes from (i) a distribution proportional to chromosome length and (ii) the distribution of protein-coding genes in general. To assess the degree to which chromosomal linkage between mzl-USCOs is phylogenetically conserved across taxa, we calculated with the aid of a custom Perl script (Supplementary Material) the proportion of taxa in which a given pair of mzl-USCOs was found to be co-located on a chromosome.
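A minimal Python sketch of the adjusted evenness and of the chi-square statistic is given below for illustration. It is not the Perl code from the Supplementary Material. Two assumptions should be noted: the evenness formula is read here as exp(H) divided by S, with H computed using natural logarithms, and the chi-square function returns only the test statistic and the degrees of freedom, leaving the p-value to a statistics package. The function names are hypothetical.

```python
import math


def adjusted_evenness(counts):
    """Adjusted evenness, read here as exp(H) / S (an assumption).

    counts: one value per chromosome, including zeros for chromosomes
    without any mzl-USCO, so that S is the total number of chromosomes.
    """
    total = sum(counts)
    s = len(counts)
    # Shannon-Wiener entropy; empty chromosomes contribute nothing to H
    # but still count towards S.
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return math.exp(h) / s


def chi_square_statistic(observed, weights):
    """Chi-square statistic for observed counts against expected counts
    proportional to weights (e.g. chromosome lengths or gene counts).

    Returns (statistic, degrees_of_freedom).
    """
    total_obs = sum(observed)
    total_w = sum(weights)
    stat = 0.0
    for o, w in zip(observed, weights):
        if w <= 0:
            raise ValueError("expected counts must be positive")
        expected = total_obs * w / total_w
        stat += (o - expected) ** 2 / expected
    return stat, len(observed) - 1
```

Under this reading, a perfectly even distribution over all chromosomes yields a value of 1, and a distribution confined to half of the chromosomes yields 0.5. The same `adjusted_evenness` function can be applied to protein-coding gene counts or to chromosome lengths in base pairs for comparison.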