Case studies: combination of different USCO extraction methods
Species delimitation methods are highly sensitive to intraspecific nucleotide sequence variation, which affects branch lengths, the topology of single gene trees, and ultimately the delimitation of populations and species. To assess whether data from the three USCO extraction approaches are comparable and can be combined, we analyzed a multiple nucleotide sequence alignment that combined data from all three approaches (Figures S16–23). Ideally, the nucleotide sequences that the three USCO extraction methods obtain from a given specimen would form a monophyletic group with little or no nucleotide sequence divergence. In practice, we observed that nucleotide sequence alignment positions with missing data had an enormous impact on the tree topology, on species monophyly, and on the clustering of sequences belonging to the same specimen (Table 4). Discrepancies between the BUSCO-based and the Orthograph-based data extraction, both in data yield and in the extracted nucleotide sequences themselves, had a particularly strong impact in this regard.
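The expectation that sequences extracted from the same specimen by different methods show almost no divergence can be checked numerically. The following is a minimal sketch, not the procedure used in this study: it computes uncorrected pairwise distances (p-distances) between the sequences of each specimen, skipping positions where either sequence has a gap or missing data. The function names, the set of characters treated as missing, the label `methodC`, and the toy sequences are all assumptions made for illustration.

```python
from itertools import combinations

# Characters treated as gap or missing data (an assumption of this sketch).
MISSING = set("-?NX")


def p_distance(seq_a, seq_b):
    """Proportion of differing sites among positions called in both sequences.

    Returns None when the two sequences share no comparable position.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same aligned length")
    compared = 0
    differing = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a in MISSING or base_b in MISSING:
            continue  # ignore positions with gaps or missing data
        compared += 1
        if base_a != base_b:
            differing += 1
    return differing / compared if compared else None


def within_specimen_divergence(alignment):
    """Pairwise p-distances between extraction methods, grouped by specimen.

    `alignment` maps (specimen, method) tuples to aligned sequences.
    """
    by_specimen = {}
    for (specimen, method), seq in alignment.items():
        by_specimen.setdefault(specimen, {})[method] = seq
    return {
        specimen: {
            (m1, m2): p_distance(seqs[m1], seqs[m2])
            for m1, m2 in combinations(sorted(seqs), 2)
        }
        for specimen, seqs in by_specimen.items()
    }


# Hypothetical toy alignment of one specimen and three extraction methods.
toy = {
    ("A", "BUSCO"): "ATGCATGC",
    ("A", "Orthograph"): "ATGCATGA",
    ("A", "methodC"): "ATGC--GC",
}
divergence = within_specimen_divergence(toy)
```

In this toy example the two complete sequences differ at one of eight sites, whereas the sequence with gaps is compared at six sites only. Distances well above zero within a specimen would flag loci where the extraction methods recovered different sequences.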
Visual inspection of the alignments in all four case studies revealed discrepancies between the results of the three USCO extraction methods in 29 to 79 of the multiple nucleotide sequence alignments of individual USCO genes (5–14% of all alignments). These discrepancies manifested as a clustering of nucleotide sequences that reflected the extraction method rather than the individual specimen (Figures S16–23). This pattern was observed in all four case studies, sometimes affecting only a few taxa and sometimes all of them, and it was more prevalent when the extracted data were analyzed phylogenetically as a supermatrix than when using a summary multispecies coalescent approach that depends on gene trees as input. In a minority of gene loci, the discrepancies could be explained by incorrect alignment of nucleotides across gaps and positions with missing data, or at one of the ends of the nucleotide sequence. In the majority of instances, the extraction methods had extracted partially different sequences from the WGS libraries. Such differences were almost always found at the ends of the nucleotide sequences obtained with BUSCO and Orthograph, indicating that the two methods evaluated different coding nucleotide sequence fragments as being part of the gene and joined them together. Editing the datasets by excluding positions with gaps and/or missing data reduced the erroneous inference of non-monophyly of individual samples, whose monophyly is the ultimate test scenario for an error-free species delimitation procedure (Table 4; Figures S16–23).
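Excluding alignment positions with gaps and/or missing data can be illustrated with a short sketch. The code below is an assumption-laden example rather than the tool or settings used in this study: it keeps only those alignment columns in which every sequence has a called nucleotide. The function name, the characters treated as missing, the label `methodC`, and the toy sequences are hypothetical.

```python
# Characters treated as gap or missing data (an assumption of this sketch).
MISSING = set("-?NnXx")


def strip_incomplete_columns(alignment):
    """Keep only columns in which no sequence has a gap or missing data.

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must have the same aligned length")
    # Indices of columns that are complete across all sequences.
    keep = [
        i for i in range(length)
        if all(seq[i] not in MISSING for seq in seqs)
    ]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment: one specimen, three extraction methods.
toy = {
    "spec1_BUSCO": "ATGCC-GTA",
    "spec1_Orthograph": "ATGCCAGTN",
    "spec1_methodC": "?TGCCAGTA",
}
filtered = strip_incomplete_columns(toy)
```

In the toy alignment, three of the nine columns contain a gap or missing data in at least one sequence, so six columns remain and the three sequences become identical. Such strict column filtering discards data, so the trade-off between alignment length and completeness needs to be weighed for each dataset.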