GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference due to RNA-seq reads misalignment

Zhaojun Zhang1, Shunping Huang1, Jack Wang1, Xiang Zhang2, Fernando Pardo Manuel de Villena4, Leonard McMillan1, and Wei Wang3

1Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC.
2Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH.
3Department of Computer Science, University of California, Los Angeles, CA.
4Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC.

RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base-pair-level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that a sizable fraction of the fragments/reads align to multiple locations in the genome. These multiple alignments pose substantial challenges to existing RNA-seq analysis tools. Inappropriate treatment may result in reporting spurious expressed genes (false positives) and missing genuinely expressed genes (false negatives). Such errors propagate to subsequent analyses, such as differential expression analysis. In our study, we observe that about 3.5% of the transcripts reported by the TopHat/Cufflinks pipeline correspond to annotated nonfunctional pseudogenes. Moreover, about 10.0% of the reported transcripts are not annotated in the Ensembl database. These unannotated transcripts could be either novel expressed genes or false discoveries. We examine the underlying genomic features that lead to multiple alignments and investigate how they generate systematic errors in RNA-seq analysis. We develop a general tool, GeneScissors, which exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods. In our simulation study, GeneScissors predicts spurious transcriptome calls due to misalignment with an accuracy close to 90%. It provides a substantial improvement over the widely used TopHat/Cufflinks and MapSplice/Cufflinks pipelines in both precision and F-measure. On real data, GeneScissors reports 53.6% fewer pseudogenes and 0.97% more expressed and annotated transcripts than the TopHat/Cufflinks pipeline. In addition, among the 10.0% of transcripts reported by TopHat/Cufflinks that are unannotated, GeneScissors finds that more than 16.3% are false positives.
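For readers unfamiliar with the evaluation metrics named above, the following minimal sketch shows how precision, recall, and F-measure can be computed from a set of reported transcripts and a set of truly expressed transcripts. It is an illustration only: the function name `precision_recall_f` and the toy identifiers are hypothetical, and the numbers are not taken from our experiments.

```python
def precision_recall_f(reported, truth):
    """Return (precision, recall, F-measure) for a set of reported calls.

    `reported` and `truth` are iterables of transcript identifiers.
    The F-measure here is the balanced F1 score (harmonic mean of
    precision and recall).
    """
    reported, truth = set(reported), set(truth)
    true_positives = len(reported & truth)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(truth) if truth else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy data (hypothetical): one spurious pseudogene call, one missed gene.
truth = {"geneA", "geneB", "geneC", "geneD"}
reported = {"geneA", "geneB", "geneC", "pseudoX"}

precision, recall, f_measure = precision_recall_f(reported, truth)
print(precision, recall, f_measure)
```

In this toy case a spurious call lowers precision and a missed gene lowers recall, which is why both error types caused by multiple alignments affect the F-measure.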