December 2014 RNA-Seq de novo assembly training session Day 2 hands-on Useful links: Assemblathon An offshoot of the Genome 10K project, and primarily organized by the UC Davis Genome Center, Assemblathons are contests to assess state-of-the-art methods in the field of genome assembly CD-HIT CD-HIT is a very widely used program for clustering and comparing protein or nucleotide sequences. CD-HIT was originally developed by Dr. Weizhong Li. TGICL This package automates clustering and assembly of a large EST/mRNA dataset. The clustering is performed by a slightly modified version of NCBI's megablast, and the resulting clusters are then assembled using CAP3 assembly program. Oases Oases is a de novo transcriptome assembler designed to produce transcripts from short read sequencing technologies, such as Illumina, SOLiD, or 454 in the absence of any genomic assembly. Trinity Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. BWA BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. Samtools SAMTools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format. IGV The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. Training session aims: This training session provides you the manipulation of some de novo assemblers. 1 December 2014 Data used in the exercises can be found at: http://genoweb.toulouse.inra.fr/~formation/RNASeq_de_novo/Assembly Exercise n°1: Assembly quality assessment For each of the fasta file from the directory exercise_1:  compute generic metrics using the assemblathon statistics script.  draw the contig length histogram using the python length_histogram.py script.  compute the realignment mapping rates (mapped and paired).  does one of the assemblies seem obviously better than the others? Exercise n°2: Assembly using Velvet/Oases Assemble reads from runs ERR145651_t and ERR145651_t_norm inside exercise_2 directory using Velvet/Oases with the following parameters (assemble runs separately):  k-mers list: 29, 37, 45, 53, 61, 69  -min_contig_lgth = 200 for velvetg Use the *_70_LONG command versions (velveth_70_LONG…). Job resources reservation: -l mem=8G,h_vmem=32G Locate the output contigs fasta file (named transcripts.fa inside the oases -merge output directories) and:  compute the realignment mapping rates (mapped and paired).  Blat contigs to Danio rerio chr3 and extract the best blat hit (in psl format).  Exonerate Danio rerio proteins to contigs. Start IGV and compare assemblies of the two runs. Load following files:  Genome -> Load genome from file -> Danio rerio chr3 fasta file  File -> Load from file : o Danio rerio chr3 GTF file o ERR145651_t_vs_genome BAM file o ERR145651_t_norm_vs_genome BAM file igv_exercise_2.xml o ERR145651_t_vs_genome TDF file o ERR145651_t_norm_vs_genome TDF file o ERR145651_t best blat hits versus genome o ERR145651_t_norm best blat hits versus genome Locate particular regions:  Transcripts correctly assembled using one run and not the other  All isoforms of transcripts correctly or not correctly assembled 2 December 2014    Contigs found inside UTRs Contigs found inside introns Transcripts not correctly assembled whereas reads coverage seems sufficient Run ERR145651_t_norm is a normalized version of the ERR145651_t run. What are the normalization main effects on the assembly? IGV Tips:   Once all files have been load and tracks correctly formatted, don’t forget to save your session (File -> Save sessions) Use the Region -> Region navigator tool to store particular regions Exercise n°3: Assembly using Trinity Assemble reads from runs ERR145651_t and ERR145651_s inside exercise_3 directory using Trinity with the following parameters:  number of CPUs = 4  memory = 64G Job resources reservation: -l mem=8G,h_vmem=32G -pe parallel_smp 4 Locate the output contigs fasta file (named Trinity.fasta inside the output directories) and:  compute the realignment mapping rates (mapped and paired).  Blat contigs to Danio rerio chr3 and extract the best blat hit (in psl format).  Exonerate Danio rerio proteins to contigs. Start IGV and compare assemblies of the two runs. Load following files:  Genome -> Load genome from file -> Danio rerio chr3 fasta file  File -> Load from file : o Danio rerio chr3 GTF file o ERR145651_t_vs_genome BAM file o ERR145651_s_vs_genome BAM file igv_exercise_3.xml o ERR145651_t_vs_genome TDF file o ERR145651_s_vs_genome TDF file o ERR145651_t best blat hits versus genome o ERR145651_s best blat hits versus genome Locate particular regions. Compare mapping rates between ERR145651_t assembled with Trinity, ERR145651_t assembled with oases in exercise n°2 and ERR145651_t_norm assembled with Oases 3 December 2014 also in exercise n°2. What about the mapping rates differences? Where would they come from? For further questions :  e-mail : [email protected] .  You can check the FAQ of the genotoul website: http://bioinfo.genotoul.fr/index.php?id=11 .  Using the following link, you can have more information about the other training sessions provided by BIOINFO GENOTOUL: http://bioinfo.genotoul.fr/index.php?id=10. 4