Parameter and performance testing
To evaluate the overall performance of GeneMiner, we conducted two rounds of evaluation, Test I and Test II. In each round, we supplied GeneMiner with short-read NGS data (sequence data, 50-300 bp) and the sequences of target genes from closely related taxa (reference sequences). We then compared the results from GeneMiner with the known results, referred to as the gold standard.
For Test I, we used two public RNA-seq data sets from Arabidopsis thaliana and Oryza sativa as the sequencing files (see the Data Accessibility section). The reference sequences were derived from Angiosperms353 of Brassicaceae (53,379 sequences) and Poaceae (224,049 sequences), excluding A. thaliana and O. sativa. Angiosperms353 is a set of targeted sequencing probes designed to reconstruct the Tree of Angiosperms. It was developed based on 353 putatively single-copy protein-coding genes from the One Thousand Plant Transcriptomes Initiative project (Johnson et al., 2019). The gold standard for this test was the set of Angiosperms353 genes specific to these two species: 349 genes from A. thaliana and 347 genes from O. sativa. Within each collection of reference sequences, the variability between the gold standard gene and its closest reference sequence ranges from 0-15.4% and 0-18.3%, respectively. Both the reference sequences and the gold standard were downloaded from the Kew Tree of Life Explorer (https://treeoflife.kew.org). We compared the performance of aTRAM, Easy353, GeneMiner and HybPiper in Test I, evaluating speed, memory efficiency, and disk space usage under similar parameters. Based on the difference between the recovered target genes and the gold standard, we categorized the results into four levels: level1 (100% identity and 90-100% coverage, high-quality), level2 (100% identity and 0-90% coverage, medium-quality), level3 (only with indels, medium-quality), and level4 (with substitutions, low-quality).
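The four-level scheme can be read as a small decision rule. The following Python sketch is illustrative only and is not taken from the GeneMiner test scripts. The function name and its inputs (counts of substitutions and indels relative to the gold standard, and coverage in percent) are hypothetical, and it assumes that a coverage of exactly 90% falls into level1, since the stated ranges overlap at that boundary.

```python
def classify_recovery(substitutions: int, indels: int, coverage: float) -> str:
    """Assign a recovered gene to one of the four quality levels.

    Hypothetical helper: inputs are assumed to come from an alignment of the
    recovered sequence against its gold standard gene.
    """
    if substitutions > 0:
        return "level4"  # any substitution: low-quality
    if indels > 0:
        return "level3"  # indels only: medium-quality
    # From here on the aligned region is 100% identical to the gold standard.
    if coverage >= 90.0:  # assumption: the 90% boundary counts as level1
        return "level1"  # high-quality
    return "level2"  # identical but incomplete: medium-quality
```

For example, a recovered gene with no mismatches that covers 72% of the gold standard would be placed in level2 under these assumptions.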
While Test I used real data, Test II used simulated datasets. We simulated different input data to examine the impact of data sources and program parameters on the results, including sequencing depth, reference sequences, and the parameter-bootstrap. We constructed simulated datasets with depths ranging from 1x to 100x based on the transcriptome of A. thaliana from Phytozome version 12 (release TAIR10) (Goodstein et al., 2012) using NGSNGS version 0.9 (Henriksen et al., 2023). We used two different sets of reference sequences: the real reference sequences used for Brassicaceae in Test I, and a set of simulated reference sequences. The simulated reference sequences were generated from the gold standard using custom scripts, producing 31 groups of reference sequences with variances ranging from 0 to 30%. Each group contained 347 genes, with each gene having 50 sequences. The gold standard was also constructed from the transcriptome of A. thaliana, including a total of 349 angiosperm genes. The scripts used to construct the datasets and the completed simulated reference sequences can be found in the Supplementary Materials.
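To make the structure of the simulated reference sets concrete, the sketch below generates 31 divergence groups (0% to 30%) of 50 sequences each from a single gold standard gene. It is not the authors' custom script. The function names are hypothetical, and it assumes that variance is introduced by substitutions only, at distinct random positions and in 1% steps, which the text does not specify.

```python
import random

BASES = "ACGT"


def mutate(seq: str, percent: int, rng: random.Random) -> str:
    """Return a copy of seq with `percent` % of its positions substituted."""
    n_sub = round(len(seq) * percent / 100)
    positions = rng.sample(range(len(seq)), n_sub)  # distinct positions
    out = list(seq)
    for pos in positions:
        # Always pick a base that differs from the original one.
        out[pos] = rng.choice([b for b in BASES if b != out[pos]])
    return "".join(out)


def simulate_reference_groups(gold: str, copies: int = 50,
                              max_percent: int = 30, seed: int = 0):
    """Build one group of `copies` sequences per divergence level 0..max_percent."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return {
        percent: [mutate(gold, percent, rng) for _ in range(copies)]
        for percent in range(max_percent + 1)
    }


groups = simulate_reference_groups("ACGT" * 25)  # toy 100 bp "gene"
```

In this toy run, `groups[10]` holds 50 sequences that each differ from the input at exactly 10 of 100 positions. A realistic simulation might also introduce indels or use a substitution model, which this sketch deliberately omits.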
In Test II, we first used data of varying depths together with the real reference sequences, with and without the Weight-node model, to test the impact of sequencing depth on the GeneMiner results. In the test that did not use the Weight-node model, the weights of the nodes were set equal to the counts of k-mers generated by the filtered reads. We then conducted a grid test using data with depths ranging from 1x to 100x and simulated reference sequences with variances ranging from 0% to 30%. This round of evaluation examined the impact of the distance between reference sequences and target sequences on the results. Based on the evaluation results, we assessed the capability of the parameter-bootstrap to validate results using data with a depth of 50x and 0-30% variance. In Test II, we used the same four levels as outlined for Test I to categorize the assembly results. All code, reference sequences, and test data involved in Test I and Test II are available on the following GitHub page: https://github.com/yyscu/GeneMiner-Test.
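As an illustration of the baseline used when the Weight-node model is switched off, the sketch below counts k-mers across a set of filtered reads, so that each node's weight would simply be its k-mer count. This is a minimal sketch and not GeneMiner's implementation. The function name, the toy reads, and the value of k are assumptions chosen for readability.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the given reads.

    Reads shorter than k contribute nothing.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy filtered reads; in the unweighted setting each node weight is its count.
weights = kmer_counts(["ACGTAC", "CGTA"], k=3)
```

Here the k-mers `CGT` and `GTA` occur in both reads and therefore receive a weight of 2, while `ACG` and `TAC` receive a weight of 1.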