Spatial distribution and potential linkage patterns of mzl-USCOs in genomes
We extracted mzl-USCOs from chromosome-level genome assemblies of 239 species of Metazoa, covering almost all major lineages of Protostomia and Deuterostomia. As expected, the large majority of the mzl-USCOs were consistently present in most investigated species, and the pairwise aligned nucleotide or amino acid sequences of mzl-USCOs from different species overlapped to a high degree in the multiple sequence alignment of each gene (Fig. S1). Across species, the median distance between neighboring mzl-USCOs on a chromosome averaged 742,876 bp (± 607,054 bp SD). Considering all possible pairs of mzl-USCOs, we found that in the vast majority of genomes the two mzl-USCOs in a pair were located on different chromosomes (Fig. 1). Specifically, only 1.3% of the analyzed pairs of mzl-USCOs were located on the same chromosome in more than 50% of the analyzed species, and only 0.2% were located on the same chromosome in more than 75% of the analyzed species. Looking at these latter pairs in more detail, we found the two mzl-USCOs in each pair to be separated on a given chromosome by a mean distance across taxa of 11.6 Mbp (± 6.1 Mbp SD), with the spatial separation differing widely between taxa (average standard deviation 18.2 Mbp, ± 8.8 Mbp SD).
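The pairwise co-location summary above can be illustrated with a minimal Python sketch. The input layout (a mapping from species to gene-to-chromosome assignments), the function names, and the choice to count only species in which both genes of a pair were found are assumptions made for illustration; they are not taken from the analysis pipeline used here.

```python
from collections import Counter
from itertools import combinations


def same_chromosome_fractions(placements):
    """For every gene pair, return the fraction of species in which both
    genes lie on the same chromosome.

    placements: {species: {gene_id: chromosome_name}} (hypothetical layout).
    Assumption: only species in which both genes are present are counted.
    """
    together = Counter()  # species count with both genes on one chromosome
    present = Counter()   # species count in which both genes were found
    for gene_to_chrom in placements.values():
        for a, b in combinations(sorted(gene_to_chrom), 2):
            present[(a, b)] += 1
            if gene_to_chrom[a] == gene_to_chrom[b]:
                together[(a, b)] += 1
    return {pair: together[pair] / n for pair, n in present.items()}


def share_of_pairs_above(fractions, threshold):
    """Share of gene pairs whose same-chromosome fraction exceeds threshold."""
    if not fractions:
        return 0.0
    return sum(f > threshold for f in fractions.values()) / len(fractions)
```

Calling `share_of_pairs_above(fractions, 0.5)` and `share_of_pairs_above(fractions, 0.75)` on the output of the first function would yield the kind of percentages reported above (1.3% and 0.2% in the real data).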
Although these data imply that mzl-USCOs can be regarded as largely genetically unlinked in practical applications, mzl-USCOs show a slight tendency to cluster compared with randomly chosen protein-coding genes. Specifically, physical distances between neighboring mzl-USCOs, normalized by genome size, were consistently slightly lower than expected by chance when compared with distances between the same number of randomly chosen protein-coding genes. In all but three taxa, the median distance, both absolute and normalized by genome size, was lower in the USCO data than the median in the randomly chosen protein-coding genes (inferred from 10,000 simulations in each taxon; Fig. 2). In 195 taxa (82% of all investigated taxa), the difference was statistically significant (p < 0.05). On average, the median absolute distance was lower by 106,062 ± 91,451 bp in the real data, and the normalized distance by 9.91 × 10⁻⁵ ± 6.57 × 10⁻⁵ of genome size (15.77 ± 9.4%). The extent to which mzl-USCOs cluster more than randomly chosen genes tends to be larger in arthropods than in vertebrates (Table S1).
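A comparison of this kind can be sketched as a simple resampling test. The sketch below assumes that each gene is represented by a single coordinate on its chromosome, that random gene sets are drawn without replacement from all protein-coding genes of the taxon, and that significance is expressed as a one-sided empirical p-value; the exact procedure used for Fig. 2 may differ, and all names are hypothetical.

```python
import math
import random
from statistics import median


def neighbor_distances(positions_by_chrom):
    """Distances between adjacent genes, computed within each chromosome."""
    dists = []
    for positions in positions_by_chrom.values():
        ordered = sorted(positions)
        dists.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return dists


def median_neighbor_distance(gene_ids, gene_locations):
    """Median neighbor distance for a gene set (NaN if no neighbors exist)."""
    by_chrom = {}
    for gene_id in gene_ids:
        chrom, pos = gene_locations[gene_id]
        by_chrom.setdefault(chrom, []).append(pos)
    dists = neighbor_distances(by_chrom)
    return median(dists) if dists else math.nan


def clustering_test(gene_locations, usco_ids, n_sim=10_000, seed=0):
    """Compare the observed median with medians of random gene sets.

    gene_locations: {gene_id: (chromosome, position)} for all protein-coding
    genes of one taxon (hypothetical layout).
    Returns (observed median, median of simulated medians, empirical p-value).
    """
    rng = random.Random(seed)
    observed = median_neighbor_distance(usco_ids, gene_locations)
    pool = sorted(gene_locations)
    simulated = []
    for _ in range(n_sim):
        sample = rng.sample(pool, len(usco_ids))
        m = median_neighbor_distance(sample, gene_locations)
        if not math.isnan(m):  # skip draws with no two genes on a chromosome
            simulated.append(m)
    if not simulated:
        raise ValueError("no simulation produced neighboring genes")
    # One-sided test: how often is a random set at least as clustered?
    n_le = sum(m <= observed for m in simulated)
    p_value = (n_le + 1) / (len(simulated) + 1)
    return observed, median(simulated), p_value
```

Dividing every position by the genome size before calling `clustering_test` would give the normalized variant of the same comparison.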
We found the distribution of absolute distances (in nucleotides) between neighboring mzl-USCOs on chromosomes to be highly correlated with the taxon’s genome size (correlation of median distance with genome size: r = 0.9714, p < 0.001). When binning absolute distances into eleven categories and using a PCA to visualize the degree of similarity between taxa in their distance values (plot not shown), the separation of taxa along the first axis (which explained 71% of the total variance) correlated strongly with the logarithm of the taxon’s genome size (r = -0.9818, p < 0.001). In the present investigation, we therefore focused on the conspicuous patterns found in normalized distances (nucleotides divided by genome size), as this metric was less confounded by the organism’s genome size: the correlation of median normalized distance with genome size was r = -0.17201 (p = 0.008). When binning normalized distances between neighboring mzl-USCOs on chromosomes into eleven categories and using a PCA to visualize the degree of similarity (Fig. 3b), we found the clustering of taxa in some instances to correspond noticeably with higher-level systematic units, such as Insecta (red triangles), teleost fishes (gray dots), birds (black squares), and mammals (black triangles; Fig. 3).
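The preprocessing behind this comparison (normalizing distances by genome size, binning them into eleven categories, and correlating a summary value with genome size) can be sketched as follows. The bin boundaries used in the analysis are not given in this section, so the ten log-spaced cut points in `EDGES` are purely hypothetical, as are the function names. The PCA step itself is omitted; the per-taxon profiles returned by `binned_profile` are the kind of input such an ordination would take.

```python
import math
from bisect import bisect_right

# Hypothetical cut points: ten log-spaced edges define eleven bins
# (one open-ended bin below the first edge and one above the last).
EDGES = [10 ** (-7 + 0.5 * i) for i in range(10)]


def normalized_distances(distances_bp, genome_size_bp):
    """Express neighbor distances as fractions of the genome size."""
    return [d / genome_size_bp for d in distances_bp]


def binned_profile(values, edges=EDGES):
    """Proportion of values falling into each of len(edges) + 1 bins."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

With one profile per taxon, `pearson_r` applied to the per-taxon median normalized distance and genome size corresponds to the weak correlation reported above (r = -0.17201).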
The adjusted evenness of the distribution of mzl-USCOs between chromosomes ranged between 0.58 and 0.99 (mean 0.87 ± 0.09). It tended to be especially low in birds and especially high in teleost fish (Table S1). It was highly correlated with both the evenness of chromosome length (r = 0.83, p = 6.26 × 10⁻⁶¹) and, especially, the evenness of the distribution of all protein-coding genes (r = 0.94, p = 1.98 × 10⁻¹¹⁰).
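To make the notion of evenness concrete, the sketch below computes Pielou's evenness index, J = H / ln(S), from per-chromosome gene counts. This is an assumption chosen for illustration: the adjustment applied to the evenness values reported here is not specified in this section, so the function is a stand-in rather than a reimplementation of the statistic in Table S1.

```python
import math


def pielou_evenness(counts):
    """Pielou's evenness J = H / ln(S) for counts per chromosome.

    H is the Shannon entropy of the count proportions and S the number of
    chromosomes. J is 1 for a perfectly even distribution and 0 when all
    genes sit on a single chromosome.
    """
    total = sum(counts)
    n_chrom = len(counts)
    if total == 0 or n_chrom < 2:
        raise ValueError("need at least two chromosomes and one gene")
    entropy = -sum(
        (c / total) * math.log(c / total) for c in counts if c > 0
    )
    return entropy / math.log(n_chrom)
```

A genome with a few large and many small chromosomes, each carrying USCOs roughly in proportion to its length, would score low on such an index, which is consistent with the direction of the pattern reported for birds.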
In many taxa, our chi-square test showed significant deviations of the USCO distribution from the distribution of chromosome lengths (Table S1). In 215 taxa (90% of all investigated taxa), the chi-square test showed a statistically significant (p < 0.05) deviation without correction for multiple testing, and in 153 taxa (64%), the result remained significant after Bonferroni correction. The deviation tended to be particularly high in birds and particularly low in teleost fish. The deviation from the distribution of all protein-coding genes was significant in 170 taxa (71%), but remained so after Bonferroni correction in only 43 taxa (18%). A correlation with the phylogenetic placement of the taxa was less obvious than in the comparison with chromosome length.
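A goodness-of-fit test of this kind can be sketched with the standard library alone. The sketch assumes that expected USCO counts per chromosome are proportional to a weight (chromosome length, or the number of protein-coding genes on the chromosome) and that the degrees of freedom equal the number of chromosomes minus one; the function names are hypothetical. The upper-tail probability uses the exact recurrence Q(a + 1, y) = Q(a, y) + yᵃ e⁻ʸ / Γ(a + 1) for the regularized incomplete gamma function, which is valid for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # Q(1/2, y) = erfc(sqrt(y))
    while a < df / 2.0 - 1e-9:
        # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def chi_square_gof(observed_counts, weights):
    """Test observed counts against expectations proportional to weights.

    Returns (chi-square statistic, p-value).
    """
    total = sum(observed_counts)
    weight_sum = sum(weights)
    expected = [total * w / weight_sum for w in weights]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected))
    return stat, chi2_sf(stat, len(observed_counts) - 1)


def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests that stay significant at alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

Running `chi_square_gof` once per taxon and passing the resulting p-values to `bonferroni_significant` mirrors the two counts reported above (uncorrected and Bonferroni-corrected).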
To assess whether the phylogenetic signal contained in mzl-USCOs is sufficient to infer the phylogenetic relationships of the investigated taxa, we used the extracted mzl-USCOs of the 239 species of Metazoa for phylogenetic analyses. The phylogenetic trees inferred from a supermatrix of amino acid sequences (Fig. S2) were largely consistent with current state-of-the-art phylogenetic hypotheses (e.g., Laumer et al., 2019; Irisarri et al., 2017; Esselstyn et al., 2017). Discrepancies occurred in a few rapid radiations. For example, in the USCO-derived phylogenies of Neoaves we found hummingbirds to be more closely related to passerines than to falcons and parrots, contradicting the results of the phylogenomic studies of Jarvis et al. (2014) and Prum et al. (2015). Such discrepancies were also found in the multi-species coalescent-based trees obtained from amino acid data (Fig. S4), which, however, had low support values overall. Both supermatrix- and coalescent-based phylogenetic inferences from nucleotide sequence data using codon positions 1 and 2 (Fig. S3, S5) resulted in some highly questionable phylogenetic estimates, such as a non-monophyly of Arthropoda.