(1) Seed selection
As the initial step of sequence extension, seed selection is crucial in determining the accuracy and completeness of an assembly. To simplify this process, GeneMiner scans the sequencing data to identify and select appropriate seeds automatically, without manual intervention, which saves time and effort. We assume that k-mers occurring at high frequency in both the reference and the sequencing data represent conserved regions and are therefore more likely to occur in the target genes. To achieve an unbiased selection of candidate seeds, we apply a weighted seed model that accounts for the k-mer counts in both the reference and the sequencing data. This model assigns a weighted score to each candidate seed, with a stronger preference for conserved regions in the reference, to reduce the risk of selecting high-frequency false positives or repeat regions from the sequencing data. If the assembly resulting from a particular candidate seed is unsatisfactory in terms of length and completeness, GeneMiner selects a new candidate seed and repeats the assembly.
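The idea of weighted seed scoring can be illustrated with a minimal Python sketch. It is not GeneMiner's implementation: the scoring formula (reference count raised to a power, multiplied by read count), the `ref_weight` parameter, and the function names `count_kmers` and `rank_seeds` are assumptions chosen only to show how a stronger preference for the reference can be expressed.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every k-mer occurrence across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def rank_seeds(ref_seqs, reads, k, ref_weight=2.0):
    """Return candidate seeds ordered from best to worst.

    Assumed score (hypothetical, for illustration only):
        ref_count ** ref_weight * read_count
    With ref_weight > 1 the reference count dominates, so a k-mer that is
    merely frequent in the reads (e.g. a repeat) cannot outrank a conserved one.
    """
    ref_counts = count_kmers(ref_seqs, k)
    read_counts = count_kmers(reads, k)
    scored = []
    for kmer, ref_n in ref_counts.items():
        read_n = read_counts.get(kmer, 0)
        if read_n == 0:
            continue  # a seed must occur in the reads to be extended
        scored.append((ref_n ** ref_weight * read_n, kmer))
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [kmer for _, kmer in scored]


refs = ["ACGTACGT", "ACGTTCGT"]
reads = ["ACGTA", "ACGTA", "TTTTT", "TTTTT", "TTTTT"]
seeds = rank_seeds(refs, reads, k=4)
```

In this toy input, `TTTT` is the most frequent k-mer in the reads but is absent from the reference, so it is never proposed as a seed, whereas `ACGT` ranks first. A caller could walk down the returned list and try the next candidate whenever an assembly is judged unsatisfactory.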
(2) Weighted node model
De Bruijn graphs underlie almost all short-read genome and transcriptome assembly tools (Bao et al., 2014; Cameron et al., 2017; Chang et al., 2015; Li et al., 2017; Pandey et al., 2017). In genomics, each node in a de Bruijn graph represents a k-mer, and a directed edge connects one node to another when the (k-1)-length suffix of the first matches the (k-1)-length prefix of the second. The k-mers are most often derived from unassembled DNA sequencing reads. GeneMiner uses a de Bruijn graph to establish connections between k-mers. We employ a weighted node model that combines information from both the reference sequences and the input reads to guide seed selection and node connection, and we use depth-first search with a stack to enable efficient greedy seed extension and backtracking. The weighted node model covers both seed selection and the choice of node-to-node connections, assigning a weighted score in each case. By taking into account information from both the reads and the reference sequences, the model balances the impact of sequencing errors against that of reference bias on the assembly. As a result, the weighted node model helps resolve the large numbers of non-unique node-to-node connections and seed choices.
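The following Python sketch illustrates greedy extension with stack-based depth-first search and backtracking over an implicit de Bruijn graph. It is a simplified illustration under stated assumptions, not GeneMiner's actual algorithm: the node weight formula, the `ref_weight` and `max_steps` parameters, and the function names are hypothetical, and only rightward extension is shown.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every k-mer occurrence across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def extend_seed(seed, read_counts, ref_counts, ref_weight=2.0, max_steps=10000):
    """Extend a seed to the right and return the longest contig found."""

    def weight(kmer):
        # Assumed node weight: read depth, boosted when the reference agrees.
        return read_counts.get(kmer, 0) * (1.0 + ref_weight * ref_counts.get(kmer, 0))

    def successors(kmer):
        # Nodes whose (k-1)-prefix equals this node's (k-1)-suffix.
        cands = [kmer[1:] + base for base in "ACGT"]
        cands = [c for c in cands if read_counts.get(c, 0) > 0]
        # Ascending order: the heaviest node is pushed last and popped first.
        return sorted(cands, key=weight)

    path, best = [], []
    stack = [(seed, 0)]  # entries are (node, depth of node in the path)
    steps = 0
    while stack and steps < max_steps:
        steps += 1
        node, depth = stack.pop()
        del path[depth:]  # backtrack: discard the abandoned branch
        path.append(node)
        if len(path) > len(best):
            best = list(path)
        on_path = set(path)  # avoid revisiting nodes, which would loop on cycles
        for nxt in successors(node):
            if nxt not in on_path:
                stack.append((nxt, depth + 1))
    # Spell the contig: the first k-mer plus the last base of each later node.
    return best[0] + "".join(node[-1] for node in best[1:])


reads = ["ACGTTGCA", "ACGTA", "ACGTA", "ACGTA"]
read_counts = count_kmers(reads, 4)
ref_counts = count_kmers([], 4)  # no reference support in this toy example
contig = extend_seed("ACGT", read_counts, ref_counts)
```

In this example the heaviest successor of the seed, `CGTA`, is a dead end, so the search backtracks to `CGTT` and recovers the longer contig `ACGTTGCA`. The `max_steps` cap is a safeguard added for the sketch, since exhaustive depth-first search can be expensive on highly branched graphs.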