Phylogenetic analysis of metazoan genomes
We performed phylogenetic analyses with the Metazoa-level USCO nucleotide sequences to assess their reliability in recovering phylogenies and classifications. To this end, we analyzed all orthologous nucleotide sequences of each mzl-USCO gene from all genome assemblies in which more than half of the loci were recovered as complete and single-copy. Nucleotide and amino-acid sequences of USCOs were taken from the output of the BUSCO software. Amino-acid sequences were aligned with MAFFT v. 7.305b (Katoh & Standley, 2013) using the L-INS-i algorithm. Poorly aligned regions were identified and removed from the amino-acid alignments with ALISCORE v. 2.0 (Misof & Misof, 2009; Kück et al., 2010) and ALICUT v. 2.31 (available from: https://github.com/PatrickKueck/AliCUT), and outlier sequences were identified and removed with OliInSeq v. 0.9.3 (https://github.com/cmayer/OliInSeq). Multiple nucleotide sequence alignments based on the amino-acid alignments were inferred with pal2nal v. 14.1 (Suyama et al., 2006), and all third codon positions were excluded with a custom Perl script (Supplementary Material). Maximum-likelihood analyses were performed with IQ-TREE v. 2.1.2 (Minh et al., 2020) on multiple sequence alignments of individual genes and on concatenated multiple sequence alignments of all genes, analyzing either amino-acid sequence data or nucleotide sequence data with third codon positions removed. For both the concatenated nucleotide dataset and the concatenated amino-acid dataset, the best-fitting substitution model and partitioning scheme were inferred with ModelFinder (Chernomor et al., 2016; Kalyaanamoorthy et al., 2017) and PartitionFinder (Lanfear et al., 2014) as implemented in IQ-TREE, using the full list of models and the IQ-TREE option -m MFP+MERGE. The USCO genes served as the data blocks in the partition merging steps.
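The exclusion of third codon positions can be illustrated with a minimal Python sketch. It is not the custom Perl script from the Supplementary Material. It assumes that each codon alignment is in frame, starts at codon position 1, and has a length that is a multiple of three (as is typical for pal2nal output); the function names and the example sequences are hypothetical.

```python
def drop_third_codon_positions(seq: str) -> str:
    """Return the sequence without every third character (codon position 3).

    Assumption: the sequence is an in-frame codon alignment row that starts
    at codon position 1, so indices 2, 5, 8, ... are third positions.
    """
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


def strip_alignment(alignment: dict[str, str]) -> dict[str, str]:
    """Apply the third-position filter to every row of an alignment."""
    return {name: drop_third_codon_positions(seq)
            for name, seq in alignment.items()}


# Hypothetical two-taxon codon alignment (gaps are kept as ordinary columns).
example = {"taxon_A": "ATGGCC---TTA", "taxon_B": "ATGGCA---CTA"}
stripped = strip_alignment(example)
print(stripped)
```

Gap columns are treated like any other column, so the filtered alignment keeps its rectangular shape with two thirds of the original length.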
To analyze the nucleotide dataset, we applied the inferred substitution model and partitioning scheme and performed 50 replicate maximum-likelihood tree searches from random starting trees. For the amino-acid dataset, we performed a single maximum-likelihood tree search, because the computational cost of replicate searches would have been unreasonably high relative to the expected benefit. Branch support was estimated from 1,000 ultrafast bootstrap replicates (UFBoot; Hoang et al., 2018) as well as from approximate likelihood ratio tests (aLRT), with nearest neighbor interchange (NNI) as the tree rearrangement method. Among all replicates, the tree with the highest likelihood was then selected. The individual gene trees were further used for a coalescent-based tree analysis with ASTRAL v. 5.6.1 (Zhang et al., 2018), applying the program’s default settings.
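Selecting the best tree among replicate searches amounts to taking the replicate with the maximum log-likelihood. The sketch below shows only this selection step. It assumes the log-likelihood of each replicate has already been collected into a dictionary; how these values are extracted from the program output is not described in the text, and the replicate labels and numbers are invented for illustration.

```python
def best_replicate(log_likelihoods: dict[str, float]) -> str:
    """Return the label of the replicate with the highest log-likelihood."""
    if not log_likelihoods:
        raise ValueError("no replicates supplied")
    # Log-likelihoods are negative; the value closest to zero is the best.
    return max(log_likelihoods, key=log_likelihoods.get)


# Hypothetical values for three replicate tree searches.
example = {
    "rep_01": -1234567.89,
    "rep_02": -1234560.12,
    "rep_03": -1234571.40,
}
best = best_replicate(example)
print(best)
```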
Sequence overlap in multiple sequence alignments was examined using the concatenated alignment containing all taxa. Using a custom script (Supplementary Material), we calculated the overlap for each pair of individuals, defined as the number of alignment positions with data in both individuals divided by the number of alignment positions with data in at least one of the two individuals.
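The overlap measure defined above can be written compactly in Python. This is a sketch under stated assumptions, not the custom script from the Supplementary Material: it treats only "-" and "?" as symbols without data (the set is a parameter, since the symbols counted as missing are not specified in the text), and the function names and example sequences are hypothetical.

```python
from itertools import combinations

# Assumption: gaps and "?" mark positions without data. Extend this set
# (e.g. with "N" for nucleotide data) if other symbols denote missing data.
DEFAULT_MISSING = frozenset("-?")


def pairwise_overlap(seq_a: str, seq_b: str,
                     missing: frozenset = DEFAULT_MISSING) -> float:
    """Positions with data in both rows / positions with data in either row."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    both = either = 0
    for a, b in zip(seq_a, seq_b):
        data_a, data_b = a not in missing, b not in missing
        both += data_a and data_b
        either += data_a or data_b
    # Two rows without any data have no defined overlap; report 0.0 here.
    return both / either if either else 0.0


def all_pairwise_overlaps(alignment: dict[str, str]) -> dict[tuple[str, str], float]:
    """Compute the overlap for every unordered pair of rows."""
    return {(x, y): pairwise_overlap(alignment[x], alignment[y])
            for x, y in combinations(alignment, 2)}


# Hypothetical alignment rows: 3 shared positions, 8 with data in either row.
example = {"ind_1": "ACGT--AC", "ind_2": "--GTACA-"}
overlaps = all_pairwise_overlaps(example)
print(overlaps)
```

The ratio is the Jaccard index of the two sets of data-bearing alignment positions, so it ranges from 0 (no shared positions) to 1 (identical coverage).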