Introduction
The declining cost of DNA sequencing has increased the adoption of
phylogenetic studies using thousands of loci, but this trend has also
brought substantial analytical challenges. The most demanding parts of
typical phylogenomic workflows, such as raw sequence read cleaning and
adapter trimming (e.g., Fastp (Chen et al., 2018)), contig assembly (e.g.,
SPAdes (Bankevich et al., 2012)), sequence alignment (e.g., MAFFT (Katoh
et al., 2002)), phylogenetic tree estimation (e.g., RAxML-NG (Kozlov et
al., 2019) and IQ-TREE (Minh et al., 2020; Nguyen et al., 2015)), and
species delimitation (e.g., BPP (Yang, 2015)), are implemented in
high-performance programming languages (e.g., C and C++). However, other essential
workflow processes, such as alignment manipulation (e.g., filtering,
splitting, extracting, and concatenating) and summary statistic
calculation (e.g., number of parsimony informative sites and percent
missing data), are typically carried out with applications written in
interpreted
languages, such as Python (e.g., Borowiec, 2016; Faircloth, 2016), R
(e.g., Hutter et al., 2022), or Perl (e.g., Kück & Longo, 2014). The
computational efficiency of this approach is limited by the requirement
of an interpreter running alongside the application, type inference at
runtime, and garbage collection memory management, which together result
in a high memory footprint. One program, goalign (Lemoine & Gascuel,
2021), uses a compiled programming language and eliminates dependencies
required at runtime. Many of its functions, however, operate on only a
single file at a time, forcing users to write custom scripts to process
thousands of genomic files.
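To make the summary statistics mentioned above concrete, the following is a minimal sketch in Rust of how parsimony informative sites and percent missing data could be computed for a small in-memory DNA alignment. It is an illustration only and not the implementation of any program cited here. The function names (`is_missing`, `parsimony_informative_sites`, `percent_missing`) are hypothetical, and the sketch assumes that `-`, `?`, and `N` all count as missing data.

```rust
use std::collections::HashMap;

/// Assumption for this sketch: gaps ('-'), '?', and 'N' are missing data (DNA only).
fn is_missing(c: u8) -> bool {
    matches!(c.to_ascii_uppercase(), b'-' | b'?' | b'N')
}

/// Counts columns in which at least two different states each occur
/// in at least two sequences (the usual definition of an informative site).
fn parsimony_informative_sites(alignment: &[&str]) -> usize {
    let n_sites = alignment.iter().map(|s| s.len()).max().unwrap_or(0);
    (0..n_sites)
        .filter(|&col| {
            let mut counts: HashMap<u8, usize> = HashMap::new();
            for seq in alignment {
                // `get` avoids a panic if a sequence is shorter than the others.
                if let Some(&c) = seq.as_bytes().get(col) {
                    if !is_missing(c) {
                        *counts.entry(c.to_ascii_uppercase()).or_insert(0) += 1;
                    }
                }
            }
            counts.values().filter(|&&n| n >= 2).count() >= 2
        })
        .count()
}

/// Percentage of all characters in the alignment that are missing data.
fn percent_missing(alignment: &[&str]) -> f64 {
    let total: usize = alignment.iter().map(|s| s.len()).sum();
    if total == 0 {
        return 0.0;
    }
    let missing: usize = alignment
        .iter()
        .map(|s| s.bytes().filter(|&b| is_missing(b)).count())
        .sum();
    100.0 * missing as f64 / total as f64
}

fn main() {
    // Toy alignment: four taxa, six sites.
    let alignment = ["ACGTAC", "ACGTTC", "AAGTT-", "AAG?AC"];
    println!("informative sites: {}", parsimony_informative_sites(&alignment));
    println!("missing data: {:.2}%", percent_missing(&alignment));
}
```

In the toy alignment, only the second and fifth columns contain two states that are each shared by two sequences, so two sites are informative, and 2 of the 24 characters are missing. A phylogenomic dataset would repeat this calculation over thousands of alignment files, which is where execution speed and memory usage start to matter.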
With some exceptions (e.g., BPP (Yang, 2015) and FastQC (Andrews,
2010)), most commonly used phylogenomic programs are available only as
command-line interface (CLI) applications. CLI programs use computer
resources more efficiently than graphical user interface (GUI)
applications, and they are easier to develop than comparable GUI
software, but they present barriers for scientists with limited
computing knowledge or support. An approach using a high-performance
programming language with a GUI would produce fast execution and
efficient memory usage while minimizing the computing skills needed to
study phylogenomics.
A fast, memory-efficient, reduced-dependency application for
phylogenomic studies would enhance research efficiency and
repeatability, while also improving accessibility for evolutionary
biologists with limited computing resources. Furthermore, efficient
computing reduces the carbon footprint of bioinformatics (Grealey et
al., 2022). Such applications often require programmers to use a fast,
compiled programming language that allows fine control over how data are
managed in computer memory. In this context, the two most commonly used
programming languages are C and C++. They require programmers to ensure
valid memory access, choose correct variable types to store data, and
prevent data races (i.e., multiple cores/threads modifying the same data
concurrently), which makes them challenging to use (Perkel, 2020). These
code-correctness issues are difficult to avoid and represent common
problems in phylogenetic software (Darriba et al., 2018). The recently
emergent programming language, Rust, offers a memory-safe alternative to
C/C++ (Köster, 2016; Perkel, 2020). It comes with efficient development
tools (e.g., a package manager and a simple build system), guarantees
valid memory access, does not require garbage collection, and prevents
data races in multithreaded applications. As a compiled programming
language, Rust produces applications that have zero runtime dependencies
and can be distributed as
a single executable CLI. Developing phylogenomic tools in Rust promises
efficient performance, while eliminating dependency issues at runtime.
Reducing dependencies minimizes conflict with other applications when
used as part of analysis pipelines and leads to improved research
reproducibility.
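As an illustration of the data-race guarantee described above, the sketch below processes several toy alignments on separate threads using only the Rust standard library. It is a minimal example under stated assumptions, not code from any phylogenomic program. The function name `total_characters_parallel` is hypothetical, and spawning one thread per alignment is a simplification chosen for brevity.

```rust
use std::sync::Mutex;
use std::thread;

/// Sums the number of characters across alignments, one thread per alignment.
/// Hypothetical helper for illustration only.
fn total_characters_parallel(alignments: &[Vec<&str>]) -> usize {
    // The shared counter must be wrapped in a synchronization type.
    // Mutating a plain `usize` from several threads is rejected by the compiler.
    let total = Mutex::new(0usize);

    // Scoped threads may borrow local data because they are all
    // joined before `thread::scope` returns.
    thread::scope(|scope| {
        for alignment in alignments {
            let total = &total;
            scope.spawn(move || {
                let characters: usize = alignment.iter().map(|seq| seq.len()).sum();
                *total.lock().unwrap() += characters;
            });
        }
    });

    total.into_inner().unwrap()
}

fn main() {
    let alignments = vec![vec!["ACGT", "AC-T"], vec!["GGA", "GGC", "G?A"]];
    println!("total characters: {}", total_characters_parallel(&alignments));
}
```

If the `Mutex` were removed and the threads wrote directly to a shared `usize`, the program would fail to compile rather than produce an intermittent wrong result at runtime. In C or C++, the equivalent unsynchronized write would compile without complaint.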
GUI development is more complicated than CLI development, especially
when targeting multiple platforms. A common cross-platform approach uses
Java (e.g., BEAST (Suchard et al., 2018), BEAST2 (Bouckaert et al.,
2019), FastQC (Andrews, 2010)), but this strategy is often limited by
the language’s memory management. Furthermore, it is challenging to
maintain a consistent user interface (UI) across operating systems (see
TaxonDNA documentation, https://github.com/gaurav/taxondna (commit
hash: 50584ac)). An alternative uses the Shiny package in R (e.g.,
phruta (Román-Palacios, 2023), treehouse (Steenwyk & Rokas, 2019)), but
it is less efficient because the application runs inside both the R
environment and a browser. An emergent cross-platform framework, Flutter, promises
mobile and desktop support with consistent UI across platforms. The
programming language Dart, required to write Flutter applications, uses
garbage-collected memory management but features an excellent
foreign-function interface to interact with higher-performance
programming languages. Writing our application using the Flutter
framework and Rust allows us to develop a cross-platform,
high-performance GUI application for phylogenomics.
We developed the SEGUL applications for phylogenomic data manipulation
and summarization. They are available as a CLI, a GUI, and a programming
language library, with support for macOS, Linux, Windows, iOS, and
Android. We designed SEGUL with beginners in mind, while still providing
advanced features for experienced users. As such, the applications are
suitable for both research and teaching.