Case studies: combination of different USCO extraction methods
Species delimitation methods are highly sensitive to intraspecific
nucleotide sequence variation, which affects branch lengths, the
topology of single gene trees, and in the end the outcome of the
delimitation of populations and species. We used a multiple nucleotide
sequence alignment combining data from all three USCO extraction
approaches to assess the comparability and combinability of data from
different approaches (Figures S16–23). In the ideal case, the nucleotide
sequences obtained for a given specimen by the three different USCO
extraction methods would form a monophyletic group with little or no
nucleotide sequence divergence. In practice, we
observed that nucleotide sequence alignment positions with missing data
had an enormous impact on the tree topology, on species monophyly, and
on the clustering of sequences belonging to the same specimen (Table 4).
Discrepancies between the BUSCO-based and the Orthograph-based data
extraction, both in data yield and in the actual extracted nucleotide
sequences, had a particularly strong impact in this regard.
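
The ideal case described above can be stated as a simple check on a gene tree: for every specimen, the sequences from the different extraction methods should form one clade. The following is a minimal sketch of such a check, not the procedure used in this study. It assumes a rooted tree given as nested tuples and hypothetical leaf labels of the form "specimen|method"; the label "M3" is a placeholder for the third extraction method, which is not named in this section.

```python
from typing import Union

# Assumption: a rooted tree is a leaf label (str) or a tuple of child subtrees.
Tree = Union[str, tuple]


def leaf_set(tree: Tree) -> frozenset:
    """Return the set of leaf labels below this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def is_monophyletic(tree: Tree, target: set) -> bool:
    """True if some clade of the rooted tree contains exactly the target leaves."""
    if leaf_set(tree) == frozenset(target):
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, target) for child in tree)


def specimens_by_monophyly(tree: Tree, sep: str = "|") -> dict:
    """Map each specimen to whether all of its sequences form one clade."""
    groups = {}
    for label in leaf_set(tree):
        specimen = label.split(sep)[0]  # hypothetical "specimen|method" labels
        groups.setdefault(specimen, set()).add(label)
    return {s: is_monophyletic(tree, tips) for s, tips in groups.items()}


# Ideal case: sequences cluster by specimen.
ideal = (
    (("s1|BUSCO", "s1|Orthograph"), "s1|M3"),
    (("s2|BUSCO", "s2|Orthograph"), "s2|M3"),
)
# Problem case: sequences cluster by extraction method.
by_method = (
    ("s1|BUSCO", "s2|BUSCO"),
    (("s1|Orthograph", "s2|Orthograph"), ("s1|M3", "s2|M3")),
)

print(specimens_by_monophyly(ideal))
print(specimens_by_monophyly(by_method))
```

Because monophyly depends on the position of the root, a check of this kind is only meaningful for trees that have been rooted consistently, for example with an outgroup.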
Visual inspection of the alignments in all four case studies revealed
discrepancies between the results of the three USCO extraction methods
in 29 to 79 (5–14%) of the multiple nucleotide sequence
alignments of individual USCO genes. The discrepancies manifested as a
clustering of nucleotide sequences that reflected the extraction method
rather than the individual specimens (Figures S16–23). This pattern was
observed in all four case studies, sometimes affecting only a few taxa
and sometimes all of them, and it was more prevalent when the extracted
data were analyzed phylogenetically as a supermatrix than when a summary
multispecies coalescent approach that depends on gene trees as input was used. In
a minority of gene loci, the discrepancies could be explained by
incorrect alignment of nucleotides across gaps and positions with
missing data or at one of the ends of the nucleotide sequence. In the
majority of instances, however, the extraction methods had extracted
partially different sequences from the WGS libraries. Such differences were almost
always found at the ends of the nucleotide sequences obtained with BUSCO
and Orthograph, indicating that different coding nucleotide sequence
fragments were evaluated as being part of the gene and were joined
together. Excluding alignment positions with gaps and/or missing data
reduced the erroneous inference of non-monophyly of individual samples
(Table 4; Figures S16–23). Recovering each sample as monophyletic is the
ultimate test for an error-free species delimitation procedure.
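
The editing step described above, excluding alignment positions with gaps and/or missing data, can be illustrated with a short sketch. It is not the tool used in this study. It assumes that an alignment is held as a dictionary mapping sequence names to aligned strings and that the symbols "-", "?", and "N" mark gaps or missing data. The sequence names are hypothetical, and "M3" again stands in for the third extraction method.

```python
# Assumption: these symbols denote a gap or missing data in the alignment.
MISSING = set("-?Nn")


def drop_incomplete_columns(alignment: dict) -> dict:
    """Keep only the alignment columns in which every sequence has a nucleotide."""
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(seq) for seq in seqs}) > 1:
        raise ValueError("all sequences must have the same aligned length")
    # A column is kept only if none of its characters is a gap/missing symbol.
    keep = [i for i, column in enumerate(zip(*seqs)) if not MISSING & set(column)]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


# Hypothetical alignment of one USCO gene for a single specimen.
alignment = {
    "s1|BUSCO": "ATGCCGTA--",
    "s1|Orthograph": "ATGCCGTACG",
    "s1|M3": "NTGCCGTACG",
}

filtered = drop_incomplete_columns(alignment)
for name, seq in filtered.items():
    print(name, seq)
```

In this toy example the first column (missing in one sequence) and the last two columns (gaps in another) are removed, so the three sequences become identical over the retained positions. Complete deletion of this kind is the strictest option; it can remove a large share of columns when many sequences are fragmentary.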