Case studies using mzl-USCOs from whole genome sequences: data
extraction
To investigate the usefulness of mzl-USCOs to resolve species boundaries
in recent radiations and to assess the practicability of the data
extraction and assembly pipelines that we developed and applied, we
analyzed mzl-USCOs obtained from raw reads of WGS data sets of species
of four well-studied radiations: Heliconius butterflies, Darwin’s
finches, Anopheles mosquitoes, and Drosophila fruit flies
(Table 1; Table S2). Each of these four case studies included multiple
specimens of each species involved. The WGS raw reads were downloaded
from NCBI. To assemble the genomic raw reads into individual USCOs, we
extracted mzl-USCOs (Eberle et al., 2020; Dietz et al., 2023) from one
selected fully assembled and annotated genome per study group (Table 1)
and then used each extracted gene as a reference onto which the raw
reads of each individual were mapped (see below).
One rationale for prioritizing USCOs over other genomic nuclear markers
(Eberle et al., 2020) is that they allow us to build a comprehensive
database in which USCO data referring to different taxonomic groups are
stored. These data can be obtained at different times (i.e., with
different ortholog sets) and with different data extraction approaches
(e.g., DNA target enrichment, WGS; Eberle et al., 2020; Dietz et al.,
2023). To evaluate the data yield and ability to resolve species-level
relationships with different extraction approaches and genome reference
systems (Zdobnov et al., 2017; Kriventseva et al., 2019), mzl-USCO
nucleotide sequences were extracted from the reference genomes of the
four case studies with three different methods. In the first approach,
exonic nucleotide sequences of USCOs were extracted from the assembled
genomes with the BUSCO program v. 4.0.6 (Simão et al., 2015; Manni et
al., 2021) in genome mode with the Metazoa dataset from OrthoDB v.
10 (Kriventseva et al., 2019); the result is hereafter referred to as
the BUSCO data set. In the second approach, Orthograph v. 0.7.1
(Petersen et al., 2017) was used with HMMs from OrthoDB v. 9 (Zdobnov
et al., 2017); the result is hereafter referred to as the OrthoDB v. 9
data set. For this, we
downloaded the official gene sets (OGS) of all species included in the
Metazoa OrthoDB v. 9 dataset from the OrthoDB site and the HMMs and
information files for that dataset from the BUSCO website
(https://busco-archive.ezlab.org/v3/). We
used these to create an SQLite database with Orthograph. This database
was then used together with the HMMs from BUSCO to extract the
respective USCO nucleotide sequences from the coding sequences (CDS) of
each taxon’s OGS, running Orthograph with its default settings. Our
methodology was thus
identical to the one used in approach A2 by Dietz et al. (2023) to
assemble USCO raw reads retrieved via DNA target enrichment. The third
approach was identical to the second, except that we used
OrthoDB v. 10
(https://busco.ezlab.org/busco_v4_data.html)
instead of OrthoDB v. 9; the result is hereafter referred to as the
OrthoDB v. 10 data set.
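
To illustrate how single-copy reference USCOs could be selected from the output of the first approach, the following minimal Python sketch filters a BUSCO full table for entries with the status "Complete". It assumes the tab-separated layout that BUSCO writes to its full table (ID in the first column, status in the second, '#'-prefixed header lines). The function name and the example IDs are hypothetical and are not taken from the pipeline described here.

```python
def single_copy_busco_ids(full_table_text):
    """Return the IDs of entries whose status is 'Complete'.

    Assumption: the input follows the BUSCO full-table layout, i.e.
    tab-separated columns with the BUSCO ID first and the status
    (Complete, Duplicated, Fragmented, Missing) second.
    """
    ids = []
    for line in full_table_text.splitlines():
        # Skip blank lines and '#'-prefixed header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        busco_id, status = fields[0], fields[1]
        # 'Duplicated' entries are multi-copy and are therefore excluded.
        if status == "Complete":
            ids.append(busco_id)
    return ids


# Hypothetical example table (IDs, coordinates, and scores are made up).
example = (
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength\n"
    "1001at33208\tComplete\tchr1\t100\t900\t+\t512.3\t267\n"
    "1002at33208\tDuplicated\tchr2\t50\t700\t-\t401.0\t216\n"
    "1002at33208\tDuplicated\tchr5\t10\t660\t+\t398.2\t216\n"
    "1003at33208\tMissing\n"
    "1004at33208\tFragmented\tchr3\t5\t300\t+\t120.4\t98\n"
)

print(single_copy_busco_ids(example))
```

Only the single "Complete" entry is retained in this example; whether fragmented hits should also serve as references is a design choice that the sketch does not address.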
In all three approaches, nucleotide sequences of single-copy USCOs
extracted from the respective genome were used as a reference against
which raw reads were mapped with bwa v. 2.1 (Li & Durbin, 2009) using
the software’s default settings, except that the minimum seed length was
set to 30. Diploid consensus sequences, in which heterozygous sites were
represented by an IUPAC ambiguity code, were generated with samtools v.
1.10 (Li et al., 2009) and bcftools v. 1.10.2
(https://github.com/samtools/bcftools). Because
the nucleotide sequences had already been aligned to the reference
sequences by bwa, no further alignment was necessary. Phylogenetic
analyses were done with
IQ-TREE v. 2.1.2 (Minh et al., 2020) using a supermatrix of the
concatenated nucleotide sequences (positions with missing data or gaps
were not removed at this point). The substitution model and partitioning
schemes were chosen as described above, and 50 replicate analyses were
performed for each dataset. With the same method, we performed
phylogenetic analyses based on the nucleotide sequence alignment of each
individual USCO and used the resulting trees as input for a multispecies
coalescent analysis with ASTRAL v. 5.6.1 (Zhang et al., 2018). All trees
were rooted with the outgroup taxa used in the respective original
studies from which the data were taken (Table 1).
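
Two data-handling steps in this paragraph can be sketched in Python: encoding a diploid genotype as an IUPAC ambiguity code, and concatenating per-USCO alignments into a supermatrix with partition coordinates. This is a minimal illustration, not the samtools/bcftools or IQ-TREE implementation. The function names, the use of "N" for taxa lacking a locus, and the example sequences are assumptions introduced here.

```python
# Standard IUPAC codes for one or two distinct nucleotides.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def diploid_code(allele1, allele2):
    """Encode a diploid genotype as a single IUPAC character."""
    return IUPAC[frozenset((allele1.upper(), allele2.upper()))]


def concatenate(loci, taxa, missing="N"):
    """Concatenate aligned loci into a supermatrix.

    loci: dict mapping locus name -> dict of taxon -> aligned sequence.
    Taxa absent from a locus are padded with `missing` (an assumption).
    Returns (supermatrix, partitions), where partitions holds 1-based,
    inclusive (name, start, end) coordinates.
    """
    parts = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for name, alignment in loci.items():
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"locus {name!r} is empty or not aligned")
        length = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(alignment.get(taxon, missing * length))
        partitions.append((name, start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(seqs) for taxon, seqs in parts.items()}
    return supermatrix, partitions


# Hypothetical example: sp2 is heterozygous (G/A -> R) in uscoA
# and has no data for uscoB.
loci = {
    "uscoA": {"sp1": "ACGT", "sp2": "AC" + diploid_code("G", "A") + "T"},
    "uscoB": {"sp1": "GGC"},
}
matrix, partitions = concatenate(loci, ["sp1", "sp2"])
print(matrix)
print(partitions)
```

The 1-based, inclusive coordinates correspond to the ranges typically listed in RAxML-style partition files, so the partition list can be written out directly for a partitioned analysis.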