Reads filtering
GeneMiner retains reads with high similarity to reference sequences from
closely related taxa. The more closely related the input sequences are to
the reference sequences, the better the results of the program. The process
includes two stages: building the reference hash table and querying it
(Figure 2-A). To create the reference hash table, the reference
sequences of multiple genes are split into a k-mer dictionary, which is
simply a dictionary of length-k nucleotide strings. A record of the
position information, the occurrence number, and the gene name of each
k-mer is added to the hash table. The k-mer size, kf, is set by the user.
To use the reference hash
table, raw reads from NGS data (from Illumina, Roche-454, ABI, or other
sequencing platforms) in FASTQ format are used as input. Similar to the
reference sequences, each raw read is also split into k-mers, using the
same k-mer size kf previously designated
by the user. If a k-mer of a read matches a k-mer in the reference
hash table, the whole read will be kept and assigned to the target
gene’s filtered dataset. GeneMiner can process multiple genes and
samples concurrently during this step, which accelerates the
analysis.
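
The two stages described above can be illustrated with a minimal Python sketch. It is not GeneMiner's actual implementation: the function names build_reference_table and filter_reads are hypothetical, reads are passed as plain strings rather than parsed from FASTQ, only the forward strand is considered, and a single shared k-mer is assumed to be enough to keep a read.

```python
from collections import defaultdict


def build_reference_table(references, kf):
    """Stage 1: map each reference k-mer to {gene name: [positions]}.

    The occurrence number of a k-mer is the total number of recorded positions.
    """
    table = defaultdict(lambda: defaultdict(list))
    for gene, seq in references.items():
        seq = seq.upper()
        for pos in range(len(seq) - kf + 1):
            table[seq[pos:pos + kf]][gene].append(pos)
    return table


def filter_reads(reads, table, kf):
    """Stage 2: keep a whole read for every gene it shares a k-mer with."""
    filtered = defaultdict(list)
    for read in reads:
        read_upper = read.upper()
        genes = set()
        for pos in range(len(read_upper) - kf + 1):
            # .get() avoids creating empty entries in the defaultdict
            hit = table.get(read_upper[pos:pos + kf])
            if hit:
                genes.update(hit)  # adds the gene names (dict keys)
        for gene in genes:
            filtered[gene].append(read)
    return filtered


# Toy data, invented for illustration only
references = {"geneA": "ACGTACGTGA", "geneB": "TTGCATTGCC"}
table = build_reference_table(references, kf=5)
reads = ["GGACGTACC", "CATTGCAAA", "GGGGGGGGG"]
filtered = filter_reads(reads, table, kf=5)
print(dict(filtered))  # {'geneA': ['GGACGTACC'], 'geneB': ['CATTGCAAA']}
```

In this toy run the third read shares no 5-mer with either reference and is discarded. A smaller kf makes chance matches more likely and therefore keeps more reads, while a larger kf is stricter.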