Systematics with mzl-USCOs from whole genomes: data recovery of different extraction methods
The reference genomes of each of the four study groups contained at least 90% of the mzl-USCOs in exactly one copy. We found no consistent differences in the number of detected mzl-USCOs across the analyzed individuals (Figure S8), irrespective of which software we used to identify mzl-USCOs and their copy numbers. In all four study groups, all mzl-USCOs present in the reference genomes were recovered in at least some target individuals, and in all specimens except some Darwin’s finches, the majority of mzl-USCOs were recovered (Figure S8).
The concatenated multiple nucleotide sequence alignments of mzl-USCOs extracted with the BUSCO software were more than a million sites long; the corresponding supermatrices of USCO nucleotide sequences extracted with Orthograph were on average about 30% shorter (Table 2). The Orthograph/bwa-based approach consistently missed some mzl-USCOs in some specimens: the number of mzl-USCOs recovered across all specimens was consistently lower when using Orthograph for target gene identification than when using BUSCO (Figure S8). Total alignment completeness at the nucleotide level exceeded 90% in all study groups except Darwin’s finches, in which completeness was 45–52%. Alignment completeness of the Orthograph-based datasets was slightly lower than that of the BUSCO-based datasets (Figure S9). The number of SNP sites exceeded 5,000 in all studied taxonomic groups except Darwin’s finches. The number of SNP sites was generally much smaller in the Orthograph-derived datasets than in the BUSCO-derived ones (Table 2).
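For readers who want to reproduce summary statistics of this kind on their own supermatrices, the following is a minimal sketch rather than the procedure used in this study. It assumes that alignment completeness is the proportion of alignment cells holding an unambiguous nucleotide (A, C, G, or T), and that a SNP site is an alignment column with at least two distinct unambiguous nucleotides; the definitions applied to Table 2 and Figure S9 may differ. The function names and the toy alignment are hypothetical.

```python
from typing import Dict

# Assumption: only unambiguous bases count as recovered data;
# gaps, N, and other ambiguity codes are treated as missing.
NUCLEOTIDES = set("ACGT")


def alignment_completeness(alignment: Dict[str, str]) -> float:
    """Proportion of alignment cells that hold an unambiguous nucleotide."""
    total = sum(len(seq) for seq in alignment.values())
    if total == 0:
        return 0.0
    called = sum(
        1
        for seq in alignment.values()
        for base in seq.upper()
        if base in NUCLEOTIDES
    )
    return called / total


def count_variable_sites(alignment: Dict[str, str]) -> int:
    """Number of columns with at least two distinct unambiguous nucleotides."""
    sequences = [seq.upper() for seq in alignment.values()]
    variable = 0
    for column in zip(*sequences):
        states = {base for base in column if base in NUCLEOTIDES}
        if len(states) >= 2:
            variable += 1
    return variable


# Hypothetical toy alignment: three sequences of equal length.
toy = {
    "ref": "ACGTACGT",
    "ind1": "ACGTAC--",
    "ind2": "ACCTNCGA",
}

print(alignment_completeness(toy))  # 21 of 24 cells are called
print(count_variable_sites(toy))    # columns 3 and 8 are variable
```

Under these assumptions, a column in which the only deviation is a gap or an N is not counted as variable, so datasets with more missing data tend to yield fewer SNP sites.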