Distribution and potential linkage patterns of mzl-USCOs
This study is the first comparative analysis of the physical
distribution of mzl-USCOs in the genomes of a wide range of animal taxa.
We did not find mzl-USCOs to exhibit a noteworthy tendency toward physical
linkage when compared to randomly chosen protein-coding genes. Physical
distances between USCO genes were generally much larger
than the average distance across which loci can be assumed to be linked
on evolutionary timescales (<1000 bp; Springer & Gatesy,
2016). The resulting average extent of linkage of loci located on the
same chromosome is thus likely negligible and cannot be assumed a priori
to violate assumptions of multispecies coalescent analyses,
irrespective of whether the method is used for phylogenetic
reconstruction or species delimitation. Although there was considerable
variation across taxa, we found neighboring pairs of mzl-USCOs to be
located, on average, somewhat closer together than pairs
obtained by randomly choosing the same number of annotated
protein-coding genes. A possible explanation for this result is
that mzl-USCOs have a slight tendency to cluster in genomic regions that
are under selection to remain single-copy.
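The comparison of neighboring mzl-USCO distances with those of randomly chosen protein-coding genes can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names, the `(chromosome, start, end)` tuple layout, and the number of replicates are assumptions made for illustration only.

```python
import random
from collections import defaultdict
from statistics import mean


def neighbor_distances(genes):
    """Return distances (bp) between consecutive genes on the same chromosome.

    `genes` is an iterable of (chromosome, start, end) tuples; this layout
    is a hypothetical input format, not one prescribed by the study.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in genes:
        by_chrom[chrom].append((start, end))

    distances = []
    for intervals in by_chrom.values():
        intervals.sort()
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
            # Overlapping genes are assigned a distance of zero.
            distances.append(max(0, next_start - prev_end))
    return distances


def random_baseline(all_genes, n_genes, n_replicates=1000, seed=1):
    """Mean neighbor distance for repeated random draws of `n_genes` genes."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_replicates):
        sample = rng.sample(all_genes, n_genes)
        d = neighbor_distances(sample)
        if d:  # a draw with one gene per chromosome yields no pairs
            means.append(mean(d))
    return means


# Toy data: 100 evenly spaced genes on one chromosome (illustrative only).
all_genes = [("chr1", i * 10_000, i * 10_000 + 2_000) for i in range(100)]
usco_genes = all_genes[::10]

observed = mean(neighbor_distances(usco_genes))
baseline = random_baseline(all_genes, len(usco_genes), n_replicates=200)
# Fraction of random draws whose mean distance is at most the observed one.
fraction_closer = sum(m <= observed for m in baseline) / len(baseline)
```

With real annotations, `all_genes` would hold every annotated protein-coding gene of a genome and `usco_genes` the subset identified as mzl-USCOs. A small `fraction_closer` would suggest that USCOs lie closer together than expected by chance.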
Mzl-USCOs were found to be rather evenly distributed over the
chromosomes and did not cluster on particular chromosomes, as indicated by
high values of adjusted evenness of the USCO distribution. However, taxa
with chromosomes of unequal length tended to have an unequal
distribution of mzl-USCOs. This was demonstrated by the positive and
significant correlation of the evenness of chromosome length and
protein-coding gene distribution with that of the USCO distribution. As
expected, longer chromosomes, and especially chromosomes with relatively
more protein-coding genes than others, also contain more mzl-USCOs.
However, chi-square tests showed that this correlation is not
necessarily linear. In nematodes, for example, the correlation of the
number of mzl-USCOs with that of protein-coding genes was negative,
although this was based on few chromosomes of rather similar length. In
particular, the deviation of USCO number from the number expected given
chromosome length tended to be higher in birds, which also have highly
unequal chromosome sizes within their genomes. This deviation is probably due to the fact that
gene density is high in short chromosomes (microchromosomes; e.g.,
International Chicken Genome Sequencing Consortium, 2004), which are
particularly common in birds but are also found in some other
vertebrates (Waters et al., 2021). Significant deviations from the
distribution of protein-coding genes in general are probably caused by
taxon-specific groupings of mzl-USCOs on certain chromosomes. However,
such deviations do not seem to be conserved across major lineages, a
pattern that is consistent with our observation that groupings of
mzl-USCOs on the same chromosome are in most cases not phylogenetically
conserved according to the current sampling of taxa. Because some
lineages were poorly covered by these analyses, it remains difficult to
generalize this observation to metazoans as a whole.
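The two quantities discussed above, the evenness of the USCO distribution over chromosomes and the deviation of USCO counts from those expected given chromosome length, can be sketched as follows. This is an illustration under stated assumptions: Pielou's evenness is used as a stand-in and is not necessarily identical to the adjusted evenness reported in this study, and the function names and example counts are hypothetical.

```python
import math


def pielou_evenness(counts):
    """Pielou's evenness J' = H / ln(k) for per-chromosome counts.

    Used here as an assumed stand-in for the adjusted evenness measure;
    1.0 means a perfectly even distribution, 0.0 a single occupied chromosome.
    """
    total = sum(counts)
    k = len(counts)
    if total == 0 or k < 2:
        raise ValueError("need at least two chromosomes and one gene")
    shannon = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return shannon / math.log(k)


def chi_square_statistic(observed, weights):
    """Chi-square goodness-of-fit statistic.

    Expected counts are proportional to `weights`, e.g. chromosome lengths
    or per-chromosome numbers of protein-coding genes.
    """
    total_observed = sum(observed)
    total_weight = sum(weights)
    statistic = 0.0
    for obs, weight in zip(observed, weights):
        expected = total_observed * weight / total_weight
        statistic += (obs - expected) ** 2 / expected
    return statistic


# Hypothetical example: USCO counts and chromosome lengths (Mb).
usco_counts = [120, 95, 60, 25]
chrom_lengths_mb = [200, 150, 100, 50]

evenness = pielou_evenness(usco_counts)
chi2 = chi_square_statistic(usco_counts, chrom_lengths_mb)
```

The statistic would be compared against a chi-square distribution with k - 1 degrees of freedom, where k is the number of chromosomes. A statistics library could supply the corresponding p-value (for example, `scipy.stats.chisquare` with `f_exp` set to the expected counts).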
Intra-locus recombination is known to bias coalescent-based phylogenomic
analyses (Gatesy & Springer, 2014; Edwards et al., 2016; Springer &
Gatesy, 2018). Among eukaryotes, the genome-wide recombination rate is
known to vary over at least one order of magnitude (Stapley et al.,
2017). Intraspecific recombination rates are also known to vary between
the sexes and across the genome, with recombination hot spots in which
most crossovers occur (Jeffreys et al., 2001; Kauppi et al., 2004;
Niehuis et al., 2010). Recombination hot spots have been studied in a
variety of species, including fruit flies (Chan et al., 2012), crickets
(Blankers et al., 2018), birds (Kawakami et al., 2017), and mammals
(Jeffreys et al., 2001; Kauppi et al., 2004; Arnheim et al., 2007;
Penalba & Wolf, 2020). In humans, recombination hot spots are regions
of 1 to 2 kb that are spatially separated from each other by larger
regions (50–100 kb) with lower recombination activity (Myers et al.,
2005; Baudat et al., 2010). Simulation studies have shown that species
tree estimation is robust to recombination even if the amount of
recombination exceeds that found in extant organisms (Lanier & Knowles,
2012; Zhu et al., 2022). However, these studies used a model of constant
recombination rates across the genome (instead of a model of
recombination hot spots), which might not properly reflect the situation
in a given genome. We therefore expect that data partitioning and
its implementation within models of species inference using the
multispecies coalescent will remain an active research topic, as will
other parameters relevant to species delimitation approaches, e.g.,
effective population size, whose fluctuation is known to affect species
delimitation analyses (Ahrens et al., 2016).
The distribution of distances between USCO genes reported here
exhibited lineage-specific patterns (Fig. 3; Figures S6, S7). Some of
these lineages showed extraordinary variation. This lineage-specific
variation likely reflects peculiarities in the genomic architecture of
different higher taxa, but a closer investigation of these phenomena is
beyond the scope of this study.