A decade ago, de novo transcriptome assembly emerged as a versatile and powerful approach to draw evolutionary inferences, analyze gene expression, and annotate novel transcripts, in particular for non-model organisms lacking an appropriate reference genome. Various tools have been developed to generate a transcriptome assembly, and even more computational methods depend on the results of these tools for further downstream analyses. In this issue of Molecular Ecology Resources, Freedman et al. (2020) present a comprehensive analysis of errors in de novo transcriptome assemblies across public data sets and different assembly methods. They focus on two implicit assumptions that are often violated: first, that the assembly presents an unbiased view of the transcriptome, and second, that the expression estimates derived from the assembly are reasonable, albeit noisy, approximations of the relative frequency of expressed transcripts. They show that appropriate filtering can reduce this bias but can also lead to the loss of a substantial number of highly expressed transcripts. Thus, to partly alleviate the noise in expression estimates, they propose a new normalization method called length-rescaled CPM. Remarkably, the authors found considerable distortions at the nucleotide level, which lead to an underestimation of diversity in transcriptome assemblies. The study by Freedman et al. clearly shows that we have not yet reached “high quality” in the field of transcriptome assembly. Above all, it helps researchers be aware of these problems and to filter and interpret their transcriptome assembly data appropriately and with caution.
In software development, it usually doesn’t take long for an approach to basically work, or at least to look like it will. However, there is always a lot of work left to be done “behind the scenes” to catch edge cases, handle errors, and find and fix all the little bugs. Accordingly, a rule of thumb derived from the Pareto principle, named after the economist and philosopher Vilfredo Federico Damaso Pareto, states that the last 20% of a software project usually takes 80% of the time.
This rule of thumb can also be applied to many areas of bioinformatics. Massively parallel sequencing of DNA has led to a wealth of new genomes, which are being assembled, annotated, and analyzed more rapidly than ever before. However, we are still struggling to get it right (the last 20%, so to speak), as Steven Salzberg noted in a recent report on pervasive assembly and annotation errors (Salzberg, 2019). The same, if not worse, applies to the analysis of high-throughput transcriptome sequencing data (RNA-Seq), where (de novo) assembly is a prominent first analysis step. While the assembly of transcriptomes has become an everyday bioinformatics task, dealing with all the potential errors and small caveats is still challenging and error-prone, even a decade after the emergence of the first tools (Birol et al., 2009; Grabherr et al., 2011; Schulz et al., 2012).
In their recent study, Freedman et al. extensively analyzed errors, bias, and noise in de novo transcriptome assemblies. In its most common application, RNA-Seq short reads are aligned to a reference genome (map-to-reference, as Freedman et al. refer to it) to functionally annotate genomic features (such as genes) and estimate their expression levels. Alternatively, RNA-Seq reads can first be assembled de novo to reconstruct the transcriptome, which is then used as a proxy for annotation and expression estimation (map-to-transcriptome).
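To make the distinction concrete, the following minimal Python sketch contrasts the two routes for a paired-end sample by wrapping commonly used command-line tools; the tool choices, index names, and file paths are illustrative assumptions, not the pipeline used by Freedman et al.

import subprocess

# Illustrative paired-end input files (assumed to exist).
R1, R2 = "reads_1.fq", "reads_2.fq"

# Route 1: map-to-reference -- align reads to an annotated reference genome,
# then count reads per annotated gene.
subprocess.run(["hisat2", "-x", "genome_index", "-1", R1, "-2", R2,
                "-S", "aligned.sam"], check=True)
subprocess.run(["featureCounts", "-a", "annotation.gtf",
                "-o", "gene_counts.txt", "aligned.sam"], check=True)

# Route 2: map-to-transcriptome -- assemble the reads de novo first,
# then quantify expression against the resulting contigs.
subprocess.run(["Trinity", "--seqType", "fq", "--left", R1, "--right", R2,
                "--max_memory", "20G", "--output", "trinity_out"], check=True)
subprocess.run(["salmon", "index", "-t", "trinity_out/Trinity.fasta",
                "-i", "txome_index"], check=True)
subprocess.run(["salmon", "quant", "-i", "txome_index", "-l", "A",
                "-1", R1, "-2", R2, "-o", "quant_out"], check=True)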
According to Freedman et al., de novo transcriptome assembly is based on two implicit assumptions: first, that the assembled sequences represent an unbiased view of the underlying expressed transcriptome, and second, that the expression estimates derived from the assembly are good, if noisy, approximations of the relative frequency of expressed transcripts (Freedman, Clamp and Sackton, 2020). It is evident that these two assumptions have important implications for further downstream analysis steps and directly affect gene expression estimates, variant calling, and evolutionary analyses based on a de novo transcriptome assembly. In their work, Freedman et al. show that these assumptions are frequently violated across different public mouse RNA-Seq data sets and assembly algorithms, thus directly impacting downstream analyses performed on de novo transcriptome assemblies. In particular, they focused on expression estimation bias and differences in nucleotide variant calls while also comparing de novo results against a map-to-reference approach.
First, Freedman et al. dispel the illusion that de novo transcriptome assemblies are mainly composed of full-length transcripts, which is typically not the case for short reads. The authors further show that the functional composition of a transcriptome assembly is biased towards intronic, UTR, and intergenic sequences, although most studies focus on protein-coding genes. As an important finding, they describe genotyping error rates ranging from 30% to 83% that, in particular, negatively bias heterozygosity estimates (Fig. 1). Their results also show that single contigs are poor expression estimators. Although commonly done in the current gene expression literature, the use of single contigs as proxies for gene-level expression appears to be problematic according to their study. Based on their results, it might be interesting to investigate whether cluster- or graph-based expression estimates can overcome such limitations, as sketched below.
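As a toy illustration of the cluster-based idea (the counts and cluster assignments below are made up for illustration; they are not from Freedman et al.): when a gene is fragmented into several contigs, its reads are split across them, so any single contig underestimates the gene’s expression, whereas summing counts over a contig-to-cluster mapping, for example from a clustering tool such as Corset or from the assembler’s own gene/isoform grouping, yields a more sensible gene-level value.

# Hypothetical contig-level read counts from a fragmented assembly: the first
# three contigs are fragments of the same underlying gene.
contig_counts = {"contig_1": 120, "contig_2": 80, "contig_3": 35, "contig_4": 500}

# Assumed contig-to-cluster assignment (e.g., from a clustering step).
contig_to_cluster = {"contig_1": "geneA", "contig_2": "geneA",
                     "contig_3": "geneA", "contig_4": "geneB"}

# Aggregate counts per cluster instead of treating each contig as a "gene".
cluster_counts = {}
for contig, count in contig_counts.items():
    cluster = contig_to_cluster[contig]
    cluster_counts[cluster] = cluster_counts.get(cluster, 0) + count

print(cluster_counts)  # {'geneA': 235, 'geneB': 500}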
Alongside these interesting, but also alarming, findings, Freedman et al. suggest ways to deal with individual errors and minimize them. Among other ideas, they propose a new formula for normalizing gene expression, the length-rescaled CPM (counts per million). It is best practice in transcriptomics to account for measures such as sequencing depth and feature length when estimating and comparing expression values derived from RNA-Seq counts. However, correctly determining a feature’s length from a de novo transcriptome assembly alone can be difficult because gene lengths are not adequately represented by the fragmented gene models that are typically derived from de novo transcriptome assemblies. To account for such biases, the authors investigated whether rescaling CPM using length metrics based on information from both reference transcripts (observed length) and de novo assembled contigs (effective length) improves expression estimates. By combining effective and observed length, they adjust the CPM values to better represent the actual transcriptome expression. They show that, to some extent, the expression bias at gene level can be corrected by this formula. However, estimating the observed length is difficult for non-model organisms lacking a good reference genome or transcriptome and annotation.
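As a rough sketch of the idea (the exact formula of Freedman et al. may differ; the scaling shown here is an assumption for illustration), a standard CPM can be rescaled by the ratio of the observed (reference-based) length to the effective (contig-based) length, so that a fragmented contig is not penalized for the part of the transcript it fails to represent:

def cpm(count, library_size):
    # Standard counts-per-million normalization.
    return count / library_size * 1_000_000

def length_rescaled_cpm(count, library_size, observed_length, effective_length):
    # Illustrative length rescaling: scale CPM by how much of the reference
    # transcript the assembled contig actually covers. This is a sketch of
    # the idea, not necessarily the exact formula used by Freedman et al.
    return cpm(count, library_size) * (observed_length / effective_length)

# Toy example: a contig covering only half of a 2,000 bp reference transcript.
print(cpm(500, 10_000_000))                              # 50.0
print(length_rescaled_cpm(500, 10_000_000, 2000, 1000))  # 100.0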
So, are we there yet? With the transcriptome assembly methods for short RNA-Seq reads developed over the last decade, we have come quite close to the first 80%. However, as Freedman et al. impressively demonstrate, the last 20% still poses a challenge. Multiple tools and parameter settings are often used and their outputs merged to generate a comprehensive de novo transcriptome assembly, but this introduces further bias and redundancy that researchers need to deal with (Hölzer and Marz, 2019). Nevertheless, modern multi-tool ensemble approaches for de novo transcriptome assembly achieve promising results (Voshall et al., 2020). However, the implicit assumptions and their violation, as discussed extensively by Freedman et al., urgently require control mechanisms and corresponding normalization and filtering steps, especially with such combined approaches.
Finally, Freedman et al. give a brief outlook on the application of long reads derived from single-molecule sequencing, as provided, for example, by Pacific Biosciences (PacBio) single-molecule real-time (SMRT) sequencing or Oxford Nanopore Technologies (ONT), to generate a provisional genome assembly in the absence of a suitable reference genome. Such a draft can then be used for map-to-reference transcriptome analyses. However, other problems may arise, and, as Freedman et al. describe, genome assembly is not necessarily a panacea for all issues related to expression analysis.
Looking at today’s technology, one could even argue that the transcriptome assembly of short reads will become obsolete in the coming years. Long-read sequencing is already capable of generating reads that can potentially span full-length transcripts: no assembly required!? In addition, ONT allows for the direct sequencing of native RNA molecules (dRNA-Seq) without any fragmentation or cDNA conversion. Recently, the application of ONT dRNA-Seq to detect differential expression in human cell populations impressively showed the potential of the technology to overcome many limitations of short- and long-read cDNA sequencing methods (Gleeson et al., 2020). However, even if the biases introduced by de novo transcriptome assembly of short reads are avoided entirely, not all problems are immediately solved by switching to another technology. Instead, other classes of noise arise, such as the higher sequencing error rate of dRNA-Seq, which researchers need to be aware of and which novel tools must take into account. Thus, hybrid approaches combining the strengths of both short and long reads will become more important, in particular in the context of de novo assembly and transcriptome analyses. In any case, one thing will certainly stay with us: the careful handling of transcriptome data and their interpretation with regard to error, noise, and bias.