Case studies: phylogenetic trees and species clustering using extracted mzl-USCOs
Mzl-USCOs extracted from WGS data were confirmed to separate closely related species across a wide systematic context, as previously shown with USCO data generated by DNA target enrichment (Dietz et al., 2023). Our results demonstrate that the majority of phylogenetic topologies obtained with mzl-USCOs are consistent with the relationships and species entities previously inferred from WGS datasets, as exemplified in the four case studies of vertebrate and arthropod taxa. The few deviating topologies involve closely related species that still hybridize frequently and for which phylogenies based on single or few genes may give unreliable results. For example, in Darwin’s finches and Drosophila, the monophyly of some species was not confirmed. For some of these species, this was also the case in the original analyses with WGS data (Lamichhaney et al., 2015; Mai et al., 2019). These results are likely caused by extensive hybridization between species (e.g., in Anopheles and Heliconius) but possibly also by large amounts of missing data (particularly in Darwin’s finches). Introgression was reported in Heliconius butterflies by Martin et al. (2013) and Edelman et al. (2019) and is also known in other groups studied by us, including Drosophila (Suvorov et al., 2022). Introgression has been identified in all major organism groups, such as fungi, vertebrates, insects, and angiosperms (Suvorov et al., 2022), indicating that hybridization across species barriers is not uncommon.
In several cases, we found discrepancies between the concatenation-based and the coalescent-based trees (Table 3). In most of these cases, the ASTRAL trees agreed better with previously published WGS phylogenies (e.g., monophyly of Heliconius melpomene and interspecific phylogeny of Darwin’s finches) than the concatenation-based trees did. This confirms that coalescent-based approaches are more reliable than concatenation-based approaches for inferring the phylogeny of closely related species still undergoing introgression, because concatenation rests on the often-incorrect assumption that all loci share the same phylogenetic history (e.g., Solís-Lemus et al., 2016; Bryant & Hahn, 2020; Stolle et al., 2022). However, concatenation-based approaches seem to give better results if data completeness is highly heterogeneous across individuals. Samples with a high proportion of missing data are often placed closer to the root in the ASTRAL trees, as seen, for example, in Drosophila. The underlying cause may be mapping reference bias and low coverage of some samples (Stolle et al., 2022).
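To make the notion of species monophyly used in these tree comparisons concrete, the following minimal Python sketch tests whether a set of specimens forms a clade in a rooted tree. It is an illustration only and not the procedure used in this study: the nested-tuple tree representation, the function names `tips` and `is_monophyletic`, and the specimen labels are all hypothetical.

```python
def tips(node):
    """Return the set of tip labels below a node.

    A node is either a tip label (str) or a tuple of child nodes.
    """
    if isinstance(node, str):
        return {node}
    labels = set()
    for child in node:
        labels |= tips(child)
    return labels


def is_monophyletic(tree, group):
    """Return True if some clade of the rooted tree holds exactly `group`."""
    group = set(group)
    if tips(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical specimens: A1 and A2 belong to species A, B1 to B, C1 to C.
tree_consistent = ((("A1", "A2"), "B1"), "C1")   # species A forms a clade
tree_deviating = (("A1", "B1"), ("A2", "C1"))    # species A is split

print(is_monophyletic(tree_consistent, {"A1", "A2"}))
print(is_monophyletic(tree_deviating, {"A1", "A2"}))
```

The check assumes a rooted tree, so the outcome depends on the rooting. Running it on a concatenation-based and a coalescent-based tree of the same specimens would show whether the two disagree about a given species.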
Analyses of nucleotide sequence variation based on SNPs extracted from mzl-USCOs confirmed the results of the phylogenetic analyses regarding the circumscription of species entities. NMDS and STRUCTURE plots allowed us to visually distinguish generally recognized species in most case studies, as was the case in studies that analyzed more extensive WGS data. However, in Darwin’s finches, several closely related species were indistinguishable from each other. This result is probably a consequence of a high degree of admixture between the species. Alternatively or additionally, it could have been caused by the high degree of missing data in the analyzed dataset. Finally, it is possible that separating some species requires more than two NMDS dimensions because of the complex distribution of variation. NMDS may also be unreliable if more data are missing in some specimens than in others, as was the case with some Drosophila individuals, which were placed far apart from others of the same species. Clustering of SNPs with STRUCTURE did not exhibit this problem, owing to the simpler nature of this analysis as a group reassignment test.
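The following minimal Python sketch illustrates how uneven missing data can affect the pairwise distances on which an ordination such as NMDS is based. It is not the pipeline used in this study: the genotype coding (0, 1, 2 alternate-allele counts, `None` for a missing call), the distance measure, the function names, and the sample values are all assumptions made for illustration.

```python
MISSING = None  # assumed placeholder for a genotype that was not called


def missing_fraction(genotypes):
    """Fraction of sites without a genotype call in one sample."""
    return sum(g is MISSING for g in genotypes) / len(genotypes)


def pairwise_distance(a, b):
    """Mean absolute allele-count difference over sites called in both samples.

    Returns (distance, number_of_shared_sites); distance is None when the
    two samples share no called site.
    """
    shared = [(x, y) for x, y in zip(a, b)
              if x is not MISSING and y is not MISSING]
    if not shared:
        return None, 0
    distance = sum(abs(x - y) for x, y in shared) / len(shared)
    return distance, len(shared)


# Hypothetical samples: s3 lacks calls at four of six sites.
samples = {
    "s1": [0, 1, 2, 0, 1, 2],
    "s2": [0, 1, 2, 0, 1, 1],
    "s3": [0, None, None, None, None, 2],
}

for name, genotypes in samples.items():
    print(name, round(missing_fraction(genotypes), 2))

print(pairwise_distance(samples["s1"], samples["s2"]))
print(pairwise_distance(samples["s1"], samples["s3"]))
```

In this toy example, the distance between s1 and s3 rests on only two shared sites, so it is far less reliable than the distance between s1 and s2. Reporting the number of shared sites alongside each distance is one way to flag specimens whose position in an ordination should be interpreted with caution.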