Case studies: phylogenetic trees and species clustering using
extracted mzl-USCOs
Mzl-USCOs extracted from WGS were confirmed to separate closely related
species from each other across a wide systematic context, as previously
shown with USCO data generated with DNA target enrichment (Dietz et al.,
2023). Our results demonstrate that most phylogenetic
topologies obtained with mzl-USCOs are consistent with the relationships
and species entities previously inferred from WGS datasets, as exemplified
in the four case studies of vertebrate and arthropod taxa. The few
exceptions of deviating topologies include cases of closely related
species that are still frequently hybridizing and for which phylogenies
based on single or few genes may give unreliable results. For example,
in Darwin’s finches and Drosophila, the monophyly of some species
was not confirmed. For some of these species this was also the case in
the original analyses with WGS data (Lamichhaney et al., 2015; Mai et
al., 2019). These results are likely caused by extensive hybridization
between species (e.g., in Anopheles and Heliconius), but
possibly also by large amounts of missing data (particularly in
Darwin’s finches). Introgression was reported in Heliconius
butterflies by Martin et al. (2013) and Edelman et al. (2019) and is
also known in other groups studied here, including Drosophila
(Suvorov et al., 2022). Introgression has been
identified in all major organism groups, such as fungi, vertebrates,
insects, and angiosperms (Suvorov et al., 2022), indicating that
hybridization across species barriers is not uncommon.
In several cases, we found discrepancies between the concatenation
data-based and the coalescent analysis-based trees (Table 3). In most of
these cases, the ASTRAL trees agreed better with previously published
WGS phylogenies (e.g., monophyly of Heliconius melpomene and
interspecific phylogeny of Darwin’s finches) than the concatenation
data-based trees. This confirms that coalescent-based approaches are
more reliable for inferring the phylogeny of closely related species
that still experience introgression than concatenation data-based phylogenies,
which are based on the often-incorrect assumption that all loci share
the same phylogenetic history (e.g., Solís-Lemus et al., 2016; Bryant &
Hahn, 2020; Stolle et al., 2022). However, concatenation data-based
approaches seem to give better results if data completeness is highly
heterogeneous across individuals. Samples with a high proportion of
missing data are often placed closer to the root in the
ASTRAL trees, as seen, for example, in Drosophila. The underlying
cause may be mapping reference bias and low coverage of some
samples (Stolle et al., 2022).
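The assumption that all loci share one history can be examined directly by asking in how many per-locus trees a given species is recovered as a clade. The following is a minimal sketch and not the pipeline used in this study: rooted trees are written as nested Python tuples instead of being parsed from tree files, every tree is assumed to contain all samples, and the function names `clades` and `clade_support` as well as the toy sample labels are hypothetical.

```python
from typing import FrozenSet, Iterable, Sequence, Set, Tuple, Union

# A rooted tree is either a tip label or a tuple of subtrees (assumed encoding).
Tree = Union[str, Tuple["Tree", ...]]


def clades(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the tip sets found below every internal node of a rooted tree."""
    found: Set[FrozenSet[str]] = set()

    def tips(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):            # a tip carries one sample label
            return frozenset([node])
        below = frozenset().union(*(tips(child) for child in node))
        found.add(below)                     # record the clade under this node
        return below

    tips(tree)
    return found


def clade_support(gene_trees: Sequence[Tree], samples: Iterable[str]) -> float:
    """Fraction of gene trees in which `samples` form a clade (are monophyletic)."""
    target = frozenset(samples)
    hits = sum(1 for tree in gene_trees if target in clades(tree))
    return hits / len(gene_trees)


# Toy data: a1 and a2 belong to species A, b1 and b2 to species B.
gene_trees = [
    ((("a1", "a2"), ("b1", "b2")), "out"),    # both species monophyletic
    ((("a1", "a2"), ("b1", "b2")), "out"),    # both species monophyletic
    (((("a1", "b1"), "a2"), "b2"), "out"),    # neither species monophyletic
    (("a1", ("a2", ("b1", "b2"))), "out"),    # only species B monophyletic
]

support_a = clade_support(gene_trees, ["a1", "a2"])
support_b = clade_support(gene_trees, ["b1", "b2"])
```

In this toy set, species A is recovered as a clade in half of the gene trees and species B in three quarters of them. Support values well below 1 indicate that the loci do not share a single history, which is the situation in which concatenation is expected to be least appropriate.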
Analyses of nucleotide sequence variation based on SNPs extracted from
mzl-USCOs confirmed the results of the phylogenetic analyses regarding
the circumscription of species entities. NMDS and STRUCTURE plots
allowed us to visually distinguish generally recognized species in most
case studies, as was the case in studies that analyzed more extensive
WGS data. However, in Darwin’s finches, several closely related species
were indistinguishable from each other. This result is probably a
consequence of a high degree of admixture between the species.
Alternatively or additionally, it could reflect the high proportion of
missing data in the analyzed dataset. Finally,
it is possible that the separation of some species requires the analysis
to include more than two dimensions due to the complex distribution of
variation. NMDS may also be unreliable if more data are missing in some
specimens than in others, as was the case with some Drosophila
individuals, which were placed far apart from others of the same species.
Clustering of SNPs with STRUCTURE did not exhibit this problem due to
the simpler nature of this analysis as a group reassignment test.
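One way to see why unevenly distributed missing data can distort an ordination is to inspect the pairwise distances that enter it. The sketch below is illustrative only and is not the procedure used in this study: genotypes are assumed to be coded as alternate-allele counts (0, 1, 2) with `None` for missing calls, and the function name `pairwise_distance` and the toy genotypes are hypothetical. It computes the mean per-site difference over sites called in both samples and also reports how many sites the estimate rests on.

```python
from typing import Optional, Sequence, Tuple

# Assumed coding: alternate-allele count 0/1/2, or None for a missing call.
Genotype = Optional[int]


def pairwise_distance(
    a: Sequence[Genotype], b: Sequence[Genotype]
) -> Tuple[Optional[float], int]:
    """Mean absolute genotype difference over sites called in both samples.

    Returns (distance, number_of_shared_sites). The distance is None when
    the two samples share no called site.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None, 0
    total = sum(abs(x - y) for x, y in shared)
    return total / len(shared), len(shared)


# Toy data: s2 matches s1 at every called site but lacks most calls.
s1 = [0, 0, 2, 2, 1, 0, 2, 0]
s2 = [0, None, None, None, 1, None, None, None]
s3 = [2, 2, 0, 0, 1, 2, 0, 2]

d12, n12 = pairwise_distance(s1, s2)
d13, n13 = pairwise_distance(s1, s3)
d23, n23 = pairwise_distance(s2, s3)
```

Here s2 is indistinguishable from s1 where it has data, yet its distance to s3 (1.0, from two shared sites) differs clearly from that of s1 to s3 (1.75, from eight sites). Distances estimated from few shared sites are noisy, so reporting the shared-site count alongside each distance makes it possible to flag or exclude such samples before ordination.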