EasyCluster2: an improved tool for clustering and assembling long transcriptome reads

Bevilacqua, Vitoantonio; Pietroleonardo, Nicola; Giannino, Ely Ignazio; Stroppa, Fabio; Simone, Domenico; Pesole, Graziano; Picardi, Ernesto

doi:10.1186/1471-2105-15-S15-S7

Background: Expressed sequences (e.g. ESTs) are a strong source of evidence to improve gene structures and predict reliable alternative splicing events. When a genome assembly is available, ESTs are suitable to generate gene-oriented clusters through the well-established EasyCluster software. Nowadays, EST-like sequences can be massively produced using Next Generation Sequencing (NGS) technologies. In order to handle genome-scale transcriptome data, we present here EasyCluster2, a reimplementation of EasyCluster able to speed up the creation of gene-oriented clusters and facilitate downstream analyses as the assembly of full-length transcripts and the detection of splicing isoforms. Results: EasyCluster2 has been developed to facilitate the genome-based clustering of EST-like sequences generated through the NGS 454 technology. Reads mapped onto the reference genome can be uploaded using the standard GFF3 file format. Alignment parsing is initially performed to produce a first collection of pseudo-clusters by grouping reads according to the overlap of their genomic coordinates on the same strand. EasyCluster2 then refines read grouping by including in each cluster only reads sharing at least one splice site and optionally performs a Smith-Waterman alignment in the region surrounding splice sites in order to correct for potential alignment errors. In addition, EasyCluster2 can include unspliced reads, which generally account for >50% of 454 datasets, and collapses overlapping clusters. Finally, EasyCluster2 can assemble full-length transcripts using a Directed-Acyclic-Graph-based strategy, simplifying the identification of alternative splicing isoforms, thanks also to the implementation of the widespread AStalavista methodology. Accuracy and performances have been tested on real as well as simulated datasets. Conclusions: EasyCluster2 represents a unique tool to cluster and assemble transcriptome reads produced with 454 technology, as well as ESTs and full-length transcripts. The clustering procedure is enhanced with the employment of genome annotations and unspliced reads. Overall, EasyCluster2 is able to perform an effective detection of splicing isoforms, since it can refine exon-exon junctions and explore alternative splicing without known reference transcripts. Results in GFF3 format can be browsed in the UCSC Genome Browser. Therefore, EasyCluster2 is a powerful tool to generate reliable clusters for gene expression studies, facilitating the analysis also to researchers not skilled in bioinformatics.

EasyCluster2: an improved tool for clustering and assembling long transcriptome reads / Bevilacqua, V., Pietroleonardo, N., Giannino, E.I., Stroppa, F., Simone, D., Pesole, G., Picardi, E.. - In: BMC BIOINFORMATICS. - ISSN 1471-2105. - 15:15 suppl.(2014). [10.1186/1471-2105-15-S15-S7]