Programs and Web Tools Status Update GEP Alumni Workshop Wilson Leung 08/05/2011.

Programs and Web Tools Status Update

GEP Alumni Workshop

Wilson Leung

08/05/2011

Outline

• GEP web framework updates
– GEP web site
– Gene Record Finder
– Gene Model Checker
– Small Exon Finder

• Tools under development
– modENCODE mRNA-Seq data
– Designing and managing your own projects

• Discussions on needed improvements

Graded Web Browser Support

• GEP web framework aims to provide support for the following web browsers:
– Based on graded browser support policy from Yahoo!

Web Browsers Win XP Win 7 / Vista Mac OS 10.6

Safari 5 A-grade

Chrome (latest stable) A-grade A-grade

Firefox (latest stable) A-grade A-grade A-grade

Firefox 3.6 A-grade A-grade A-grade

IE 9.0 A-grade

IE 8.0 A-grade A-grade

IE 7.0 A-grade

IE 6.0 A-grade

* Other configurations may work but may not be tested (X-grade)

Goals for GEP Web Site Update

• More easily find and discover materials

• Search engine optimizations and site search

• Added Quick Start and FAQ sections

• Search for related documents using tags

• Standardize layout and file download links

• New section for contributions from GEP members

• Maintain backward compatibility

• Improve support for modern web browsers

GEP Web Site Demo

http://gep.wustl.edu/

GEP Web Site Questions

• GEP glossary
– Currently listed under Introducing Students to DNA Sequencing and Genomic Analysis

• GEP photos
– Community section
– Facebook groups
– Flickr groups

GEP Wiki and Forum

• Bulletin board software upgraded to phpBB3
– Allow upload of images and other attachments
– Automatic image thumbnails
– More powerful full text search
– Better cross-browser support

• Plan to migrate GEP Wikis (both private and public) to a newer version of the MediaWiki software in Fall 2011

Gene Record Finder Update

• Two FlyBase updates
– Releases 5.29, 5.32
– Release 5.39 for Fall 2011

• Start and end columns now refer to the 5’ start and 3’ end coordinates
– For features on the minus strand, start coordinates are larger than the end coordinates

• Added new section for D. melanogaster genes with non-canonical splice sites
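
The 5’ start / 3’ end convention can be illustrated with a minimal Python sketch. It assumes GFF-style input, where the start coordinate is never larger than the end coordinate on either strand. The function name `five_to_three` is hypothetical and is not part of the Gene Record Finder.

```python
def five_to_three(gff_start, gff_end, strand):
    """Return (5' start, 3' end) for a feature given GFF-style coordinates.

    Assumption: the input follows the GFF convention, where
    gff_start <= gff_end regardless of strand.
    """
    if gff_start > gff_end:
        raise ValueError("GFF coordinates must satisfy start <= end")
    if strand == "+":
        return gff_start, gff_end
    if strand == "-":
        # On the minus strand the 5' end has the larger coordinate.
        return gff_end, gff_start
    raise ValueError(f"unknown strand: {strand!r}")
```

For a minus-strand feature stored as 100..200, the sketch reports a start of 200 and an end of 100.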

Gene Record Finder Demo

http://gander.wustl.edu/~wilson/dmelgenerecord/retrievegenerecord.php?searchname=eIF4G&db=dm3

Potential Issues with Gene Record Finder

• Phases for coding exons were based on GFF files provided by FlyBase

• Since Release 5.33, the phase entries for CDS features may be incorrect
– In older releases, the phase column in the FlyBase GFF file represented the reading frame
– Issue has not yet been resolved as of Release 5.39

• Instead of relying on the FlyBase entries, phase and CDS translations are calculated separately
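
One way phase could be calculated without using the GFF phase column is sketched below. It follows the GFF3 definition of phase (the number of bases to skip at the start of a CDS segment to reach the first base of the next complete codon). The function name `cds_phases` and the input format (CDS lengths listed in 5’ to 3’ order, starting from a complete start codon) are assumptions for illustration, not the Gene Record Finder's actual code.

```python
def cds_phases(cds_lengths):
    """Return the GFF3 phase of each CDS segment.

    cds_lengths: lengths of the CDS segments in 5' to 3' order.
    Assumption: translation starts at the first base of the first segment.
    """
    phases = []
    consumed = 0  # coding bases seen in earlier segments
    for length in cds_lengths:
        # Bases left over from the previous codon determine how many
        # bases of this segment are needed to complete it.
        phases.append((3 - consumed % 3) % 3)
        consumed += length
    return phases
```

For example, a first CDS of 100 bases ends one base into a codon, so the next CDS must contribute two bases to complete that codon and has phase 2.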

Keeping up with FlyBase Releases

• Release 6 assembly may be released in September

• New modENCODE RNA-Seq data has led to many updates to the D. melanogaster gene annotations

Graveley BR, et al. The developmental transcriptome of Drosophila melanogaster. Nature (471) 473-479

• More up-to-date Gene Record Finder available at: http://gander.wustl.edu/~wilson/dmelgenerecord_current/index.html


Gene Model Checker User Interface Improvements

• Form values (except the sequence file) will persist when you refresh the web page

• Added support for sequence file in rich text format

• Improve detection of overlapping coordinates
– Overlap among exon coordinates
– Overlap between exon coordinates and stop codons
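
A check for overlapping coordinates could look roughly like the sketch below. It is only an illustration: the function name `find_overlaps` and the `(name, start, end)` input format are hypothetical, and the Gene Model Checker's own implementation may differ. Coordinates are treated as 1-based and inclusive, and may be given in either orientation.

```python
def find_overlaps(features):
    """Return pairs of feature names whose coordinate ranges overlap.

    features: list of (name, start, end) tuples with 1-based inclusive
    coordinates; start may be larger than end for minus-strand features.
    """
    # Normalize so that each range runs low..high, then sort by position.
    normalized = sorted((min(s, e), max(s, e), name) for name, s, e in features)
    overlaps = []
    for i, (start_a, end_a, name_a) in enumerate(normalized):
        for start_b, end_b, name_b in normalized[i + 1:]:
            if start_b > end_a:
                break  # sorted by start, so no later feature overlaps a
            overlaps.append((name_a, name_b))
    return overlaps
```

The same function covers both cases listed above if the stop codon is passed in as one more feature alongside the exons.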

Gene Model Checker Updates

• New “Warn” level in the checklist
– Non-canonical splice donor site (GC)
– Number of coding exons in submitted model differs from the D. melanogaster ortholog
– Cannot find the putative D. melanogaster ortholog

• Global alignment between submitted model and the D. melanogaster ortholog

• Color dot plot for complete gene models

Gene Model Checker Demo

http://gander.wustl.edu/~wilson/genechecker/index.html

Annotation Files Merger

• Combine files generated by the Gene Model Checker for different gene models into a single file
– Use this tool to reduce submission errors

• Added link to view combined GFF file as a custom track on the UCSC Genome Browser mirror

• Updated documentation for annotation submission shows how to use this tool to prepare files for project submission
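
The idea of combining per-gene files can be sketched in Python for the GFF case. This is a simplified illustration, not the Annotation Files Merger itself: the function name `merge_gff` is hypothetical, and the sketch assumes that lines beginning with "#" are header or comment lines that only need to appear once in the merged output.

```python
def merge_gff(texts):
    """Merge the contents of several GFF files into a single GFF string.

    texts: list of strings, each holding the contents of one GFF file.
    Assumption: lines starting with '#' are headers/comments and are
    kept only once; all other non-empty lines are feature records.
    """
    header, records = [], []
    for text in texts:
        for line in text.splitlines():
            if not line.strip():
                continue  # skip blank lines
            if line.startswith("#"):
                if line not in header:
                    header.append(line)
            else:
                records.append(line)
    return "\n".join(header + records) + "\n"
```

Records keep the order in which the input files are supplied, so the merged file lists the gene models in submission order.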

Annotation Files Merger Demo

http://gander.wustl.edu/~wilson/submissionhelper/index.php

Small Exon Finder

• Search for small open reading frames that cannot be identified through sequence alignments

• Search for small exons that satisfy a set of biological constraints:
– CDS type (initial, internal, terminal)
– CDS size
– Donor and acceptor phase

• Documentation available in the Small Exon Finder User Guide (under Help -> Documentations)
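
A constraint-based search of this kind can be sketched as follows for the internal CDS case. This is a simplified illustration and not the Small Exon Finder's algorithm. It assumes a forward-strand sequence, canonical AG acceptor and GT donor sites, and these phase definitions: acceptor phase is the number of bases at the start of the exon before the first complete codon, and donor phase is the number of bases left after the last complete codon. The function name `find_internal_exons` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_internal_exons(seq, size, acceptor_phase, donor_phase):
    """Return 1-based inclusive (start, end) of candidate internal exons.

    A candidate has the requested size, is preceded by AG and followed
    by GT, and contains no in-frame stop codon for the given phases.
    """
    seq = seq.upper()
    if (size - acceptor_phase) % 3 != donor_phase:
        return []  # size and phases are mutually inconsistent
    hits = []
    for start in range(2, len(seq) - size - 1):
        end = start + size  # exon occupies seq[start:end] (0-based)
        if seq[start - 2:start] != "AG" or seq[end:end + 2] != "GT":
            continue
        # Keep only the complete codons inside the exon.
        coding = seq[start + acceptor_phase:end - donor_phase]
        codons = (coding[i:i + 3] for i in range(0, len(coding), 3))
        if not any(codon in STOP_CODONS for codon in codons):
            hits.append((start + 1, end))
    return hits
```

Searching the reverse strand would require running the same scan on the reverse complement and converting the coordinates back.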

Small Exon Finder Demo

http://gander.wustl.edu/~wilson/smallexonfinder/index.html

UCSC Genome Browser Mirror Update

• Updated genome browser software to release 238
– New navigation features
• Search for blastx hits, drag and zoom, re-order tracks
– Improved support for second-generation sequence data

• Added initial set of mRNA-Seq and TopHat junction predictions to the D. mojavensis dot chromosome assembly

Finishing Updates

• GEP LiveCD
– Re-mastered image with new kernel and software updates
– Consed updated to version 20
– Created new VirtualBox appliance with support for both SATA and IDE drives
– Improved support for VirtualBox 4
– Updated documentation for storing user data on USB and local VirtualBox disk image

• Finishing Packages
– New naming conventions for fosmid end traces
– Updated configuration files (e.g. for autofinish, digests)

Tools Under Development

http://www.flickr.com/photos/gullevek/155604654/sizes/z/


modENCODE Drosophila mRNA-Seq Data

• modENCODE project (Brian Oliver) has generated mRNA-Seq data for multiple Drosophila species
– mRNA-Seq of head tissues from D. mojavensis
• Data tracks added to GEP Genome Browsers in Fall 2010
• Unpaired, Illumina Genome Analyzer and Genome Analyzer II
– mRNA-Seq of whole flies from Drosophila
• Paired-end, Illumina Genome Analyzer II and HiSeq 2000
• mRNAs from males and females of multiple Drosophila species
– Verify D. melanogaster gene models
– Examine differences in gene expression

modENCODE Drosophila RNA-Seq Data

• Species with mRNA-Seq data that are also in the GEP annotation pipeline
– D. ananassae
– D. mojavensis

Figure legend: Reference, Published, In Progress, RNA-Seq

mRNA-Seq Overview

Wang et al. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics (10) 57-63

Types of mRNA-Seq Data Tracks

• Mapping mRNA-Seq reads onto contig sequences
– Read coverage and alignment summary

• Splice junction predictions
– TopHat predictions, spliced reads alignments

• Transcriptome assembly
– Velvet and Oases
– Cufflinks

• Reads unmapped by TopHat

mRNA-Seq Alignment Summary Track

• Read coverage is too high to display every read without risking overloading the browser

• Composite multi-wiggle track captures the number of high quality reads aligned at each position
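
The per-position counts behind a coverage track can be computed along the lines of the sketch below. It is an illustration only: the function name `coverage_per_position`, the `(start, end, mapq)` input format, and the mapping quality cutoff of 30 are all assumptions, since the slides do not state how "high quality" is defined.

```python
def coverage_per_position(alignments, contig_length, min_mapq=30):
    """Count reads covering each position of a contig.

    alignments: iterable of (start, end, mapq) with 1-based inclusive
    coordinates. Reads below min_mapq (an assumed cutoff) are ignored.
    Returns a list where index 0 holds the depth at position 1.
    """
    # Difference array: +1 where a read starts, -1 just past its end.
    diff = [0] * (contig_length + 2)
    for start, end, mapq in alignments:
        if mapq < min_mapq:
            continue
        diff[start] += 1
        diff[end + 1] -= 1
    coverage = []
    depth = 0
    for pos in range(1, contig_length + 1):
        depth += diff[pos]
        coverage.append(depth)
    return coverage
```

Storing one number per position keeps the track size proportional to the contig length rather than to the number of reads.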

http://gander.wustl.edu/cgi-bin/hgTracks?&clade=insect&org=D.+mojavensis&db=Dmoj5&position=DMAC14:20367-28371

Identify Splice Junctions with mRNA-Seq

• Two additional tracks can be used to identify splice junctions
– TopHat junctions
• For reads >= 75bp, search for GT-AG, GC-AG, and AT-AC intron junctions
• Search for joins between neighboring coverage islands
• Use mate pair information to estimate intron sizes
– Average mate pair distance for this library is 150 bp
– Spliced mRNA-Seq
• Subset of read mate pairs mapped by TopHat with at least three alignment blocks
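
The three intron junction types named above can be recognized from the first and last two bases of a candidate intron. The sketch below is a minimal illustration for forward-strand coordinates; the function name `junction_type` is hypothetical and this is not TopHat's code.

```python
JUNCTION_TYPES = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}

def junction_type(contig, intron_start, intron_end):
    """Classify an intron by its donor and acceptor dinucleotides.

    intron_start, intron_end: 1-based inclusive coordinates on the
    forward strand. Returns e.g. "GT-AG", or None if the intron does
    not match one of the three junction types.
    """
    intron = contig[intron_start - 1:intron_end].upper()
    if len(intron) < 4:
        return None  # too short to hold both dinucleotides
    pair = (intron[:2], intron[-2:])
    return "-".join(pair) if pair in JUNCTION_TYPES else None
```

For minus-strand introns the same check would be applied to the reverse complement of the intron sequence.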

mRNA-Seq Transcriptome Assembly

• Two basic approaches to transcriptome assembly
– Assemble reads first then map the assembled transcripts back to the genome
• Trinity, ABySS, Oases
– Map reads onto the reference genome first and then merge overlapping reads to create transcripts
• Cufflinks, Scripture

• Because of limited computational resources:
– Map reads against each contig with TopHat
– Extract mapped reads and assemble with Oases
– Align assembled transcripts back to the contig with BLAT

Number of Unmapped mRNA-Seq Reads Against the D. mojavensis Assembly

• ~300 million reads unmapped by TopHat

SRR Accession    # Missing Reads
SRR166832_1           19,005,424
SRR166832_2           19,005,425
SRR166833_1           54,724,955
SRR166833_2           54,724,955
SRR166834_1           22,019,750
SRR166834_2           22,019,751
SRR166835_1           62,179,296
SRR166835_2           62,179,296
Total                315,858,852

Number of Reads Removed because of Low Quality or Unknown Bases

• Only about 1% of the reads contain unknown bases once the reads are trimmed from 101 to 75 bases

SRR Accession    # Removed    Total # Missing    % Removed
SRR166832_1        834,883         19,005,424       4.39 %
SRR166832_2        541,113         19,005,425       2.85 %
SRR166833_1        505,054         54,724,955       0.92 %
SRR166833_2        646,392         54,724,955       1.18 %
SRR166834_1        227,753         22,019,750       1.03 %
SRR166834_2        165,507         22,019,751       0.75 %
SRR166835_1        411,722         62,179,296       0.66 %
SRR166835_2        566,942         62,179,296       0.91 %
Total            3,899,366        315,858,852       1.23 %
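
The percentages in the table can be reproduced with a short calculation. The sketch below copies the counts from the table above; the names `ROWS` and `percent_removed` are introduced only for this illustration.

```python
# (reads removed, total unmapped reads) per read file, from the table above.
ROWS = {
    "SRR166832_1": (834883, 19005424),
    "SRR166832_2": (541113, 19005425),
    "SRR166833_1": (505054, 54724955),
    "SRR166833_2": (646392, 54724955),
    "SRR166834_1": (227753, 22019750),
    "SRR166834_2": (165507, 22019751),
    "SRR166835_1": (411722, 62179296),
    "SRR166835_2": (566942, 62179296),
}

def percent_removed(removed, total):
    """Percentage of reads removed, rounded to two decimal places."""
    return round(100.0 * removed / total, 2)

total_removed = sum(removed for removed, _ in ROWS.values())
total_reads = sum(total for _, total in ROWS.values())
overall = percent_removed(total_removed, total_reads)
```

The overall value is a read-weighted percentage, so it is not the simple average of the per-file percentages.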

Potential Problems with TopHat Alignments

• Bowtie is optimized for ungapped alignment

• TopHat subdivides each read into 25bp segments
– Each segment is mapped independently
– Alignment blocks are then merged back together

• TopHat could fail to map reads that are derived from multiple exons
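
The sketch below illustrates why fixed-size segments are a problem for reads that cross exon boundaries: any segment with a splice junction in its interior cannot be placed by an ungapped aligner. It is a simplified model, not TopHat's implementation. The function name `segments_spanning_junctions` is hypothetical, and the read length is assumed to be a multiple of the segment length (as with 75-base reads and 25-base segments).

```python
def segments_spanning_junctions(read_length, junction_offsets, segment_length=25):
    """Return indexes of read segments that contain a splice junction.

    junction_offsets: 0-based read offsets; an offset j means the
    junction lies between read bases j-1 and j. A junction that falls
    exactly on a segment boundary does not split any segment.
    """
    spanning = []
    for index, seg_start in enumerate(range(0, read_length, segment_length)):
        seg_end = min(seg_start + segment_length, read_length)
        if any(seg_start < j < seg_end for j in junction_offsets):
            spanning.append(index)
    return spanning
```

For a 75-base read with junctions at offsets 30 and 55, two of the three segments contain a junction, leaving only the first segment available for ungapped mapping.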

Coding Exons

Alignment

Mapping unaligned reads with BLAT

http://gander.wustl.edu/cgi-bin/hgTracks?&clade=insect&org=D.+mojavensis&db=Dmoj5&position=DMAC4:22,402-22,980

Intron Sizes Distribution in D. melanogaster

Comeron JM and Kreitman M. The Correlation Between Intron Length and Recombination in Drosophila: Dynamic Equilibrium Between Mutational and Selective Forces. Genetics (156): 1175-1190

Filter the Unaligned Reads using Minimum Intron Size

Figure: Alignment blocks relative to Coding Exons, with a gap of >= 30 bases between adjacent blocks

• Minimum intron size in Drosophila is ~40 bases

• Only keep alignments that consist of 3 blocks where the distance between each adjacent block is at least 30 bases
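
The filter described above can be sketched as a small predicate on the alignment blocks. The function name `passes_intron_filter` and the `(start, end)` block format are hypothetical; the distance between blocks is taken to be the number of bases strictly between them, which is an assumption about how the distance was measured.

```python
def passes_intron_filter(blocks, required_blocks=3, min_gap=30):
    """Decide whether an alignment should be kept.

    blocks: list of (start, end) genomic coordinates, 1-based inclusive.
    The alignment is kept only if it has exactly required_blocks blocks
    and every pair of adjacent blocks is separated by >= min_gap bases.
    """
    if len(blocks) != required_blocks:
        return False
    blocks = sorted(blocks)
    # Number of bases strictly between each pair of adjacent blocks.
    gaps = [nxt[0] - prev[1] - 1 for prev, nxt in zip(blocks, blocks[1:])]
    return all(gap >= min_gap for gap in gaps)
```

A threshold of 30 is slightly below the ~40 base minimum intron size, which leaves some tolerance for imprecise block boundaries.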

Example of Unaligned Reads that Span Multiple Exons

http://gander.wustl.edu/cgi-bin/hgTracks?clade=insect&org=D.+mojavensis&db=Dmoj5&position=DMAC33:9,272-9,787

Using mRNA-Seq Tracks in Annotation

Plans to Incorporate mRNA-Seq Data into the GEP Annotation Pipeline

• Develop and continue to improve programs used to manipulate and process mRNA-Seq datasets

• New Homework #2 from Dr. Buhler on how to use mRNA-Seq data for annotation

• Test and validate the new mRNA-Seq tracks
– Han and William have used the mRNA-Seq data when they checked the annotation submissions this summer
– Also received feedback from Bio 4342 this year

Cross-species mRNA-Seq

• How can we incorporate the mRNA-Seq data into the D. erecta and D. grimshawi annotation projects?

Figure legend: Reference, Published, In Progress, RNA-Seq

Map D. yakuba mRNA-Seq Reads onto D. erecta Contigs

• Use D. yakuba reads to generate TopHat junctions and coverage data tracks for D. erecta

• Cannot generate Cufflinks and Oases transcripts directly from D. erecta alignments
– Reads from less conserved regions may not be mapped

• Build transcriptome library from whole genome alignments to D. yakuba
– Map assembled transcripts against the D. erecta contigs with BLAT

Incorporating mRNA-Seq data into D. grimshawi projects

• D. virilis and D. mojavensis are the two species that are most closely related to D. grimshawi
– Cannot reliably detect conserved nucleotide sequences with BLASTN, BLAT, ClustalW

• Build transcriptome library based on whole genome alignments to D. virilis and D. mojavensis
– Run Cufflinks and Oases in a sliding window
– Align the assembled transcripts against the D. grimshawi contigs with translated BLAT and TBLASTX

Cross-species mRNA-Seq

Designing and Managing Your Own Annotation Projects

• Three major components:
– Workflow system for creating custom genome browsers

• Command-line based workflow is now operational

– Galaxy modules for performing statistical analysis and data mining (with modENCODE data)

– Create and manage your own annotation projects using the Project Management System

Project Management System Demo

Conclusions

• During the past year, we have made substantial improvements to the GEP web framework

• New mRNA-Seq data should help resolve many ambiguous cases and speed up annotation
– mRNA-Seq evidence tracks now available on gander

• Continue to work on a system that will allow you to create and manage your own projects

Questions and Group Discussion

rRNA in D. mojavensis mRNA-Seq
