Assembly
Similar to Easy353, GeneMiner uses a reference-guided de Brujin graph
assembly with seed self-discovey and a greedy extension method (Figure
2-B). Using this design, a k-mer is initially chosen as a seed, acting
as the starting point for the assembly and it is self-discovered from
the datasets. After identifying the seed, a greedy extension technique
is applied, progressively appending the most probable next k-mer at each
stage. This is completed through utilization of information from the de
Bruijn graph and k-mer count. The reference sequence is utilized
throughout the program to ensure the assembled sequence resembles the
reference sequence.
The GeneMiner method is updated from our Easy353 program in the
following two ways: seed selection and the method for extending the seed
into contigs. For choosing the seed, GeneMiner splits both of the
reference sequences and filtered reads into k-mers usingka (different from the kfused in the filtering) and builds k-mer sets as R and T ,
respectively. The intersection of the two sets becomes the potential
seed set. Seeds are weighted, and those with higher weights are
prioritized. In Easy353, only the number of reads were considered
(Wseed = C read ). However,
in GeneMiner, Wseed is jointly determined by both
reference sequences and reads. The weight is calculated as\(W_{\text{seed}}=C_{\text{ref}}\ast\operatorname{}{{(C}_{\text{read}}+}1)\),
where Cread represents the counts of the k-mer in
the set T , and Cref represents the counts
of the k-mer in the set R. For extending the seed into contigs,
each k-mer is taken as a node and assigned a weighted score according to
its frequency and position in the reference sequences. In Easy353, the
node-weighted score only considered the position of a k-mer in the
reference sequences, and the weight formula used was\(W_{\text{node\ }}={C_{\text{read}}}^{(1-P_{R})}\). However, in
GeneMiner, the node-weighted score is calculated
as\(\text{\ \ }W_{\text{node\ }}={C_{\text{read}}\ast e}^{(1-|P_{R}-\ P_{T}|)}\),
where Cread represents the occurrences of the
k-mer in the set T , PR represents the
average percentage distance of the k-mer in the set R , andPT represents the percentage distance of the
k-mer in target gene. If a k-mer appears at multiple positions in a
reference sequence, the Wnode is downgraded toCread to avoid the introducing of insertion
errors or duplication issues. In the case of a given seed, any secondary
branches branching off from the main path, such as ”bubbles” in the
graph, will be disregarded during sequence reconstruction as long as
they have a lower score.
Verifying results
One valuable step, which is often overlooked, when designing tools for
NGS data, is verification of the results. GeneMiner employs a novel
verification method based on mutation models and repeated resampling to
assess the influence of the reference and also confirm the stability of
assembly results statistically (Figure 2-C). The mutation model is
generated by calculating both the variation rate and the indel rate in
the reference sequences. Based on the initially generated contig
(Seq1 ), mutations and resamples are applied using
the mutation model to obtain ref1 ,ref2 …. refn . All
simulated ref1 , ref2 …refn are implemented as the new reference for the
whole assembly process and get each new recovered target gene is noted
as Seq2 . A bootstrap score is then used to assess
the stability of the assembly results, calculated as\(\frac{N}{L}\ast 100\), where N represents the matched base
number in the pairwise alignment Seq1 andSeq2 , and L represents the alignment
length. A recovered target gene with a high bootstrap scoreindicates a stable assembly and a probably well-chosen reference data.
Conversely, a recovered target gene with a low bootstrap scorerequires further consideration for reference selection and is probably
not able to be used for phylogenetic studies.