Programs and Web Tools Status Update GEP Alumni Workshop Wilson Leung 08/05/2011.
Programs and Web Tools Status Update
GEP Alumni Workshop
Wilson Leung
08/05/2011
Outline
• GEP web framework updates
  – GEP web site
  – Gene Record Finder
  – Gene Model Checker
  – Small Exon Finder
• Tools under development
  – modENCODE mRNA-Seq data
  – Designing and managing your own projects
• Discussions on needed improvements
Graded Web Browser Support
• GEP web framework aims to provide support for the following web browsers:
  – Based on graded browser support policy from Yahoo!
Web Browsers Win XP Win 7 / Vista Mac OS 10.6
Safari 5 A-grade
Chrome (latest stable) A-grade A-grade
Firefox (latest stable) A-grade A-grade A-grade
Firefox 3.6 A-grade A-grade A-grade
IE 9.0 A-grade
IE 8.0 A-grade A-grade
IE 7.0 A-grade
IE 6.0 A-grade
* Other configurations may work but may not be tested (X-grade)
Goals for GEP Web Site Update
• More easily find and discover materials
• Search engine optimizations and site search
• Added Quick Start and FAQ sections
• Search for related documents using tags
• Standardize layout and file download links
• New section for contributions from GEP members
• Maintain backward compatibility
• Improve support for modern web browsers
GEP Web Site Demo
GEP Web Site Questions
• GEP glossary
  – Currently listed under Introducing Students to DNA Sequencing and Genomic Analysis
• GEP photos
  – Community section
  – Facebook groups
  – Flickr groups
GEP Wiki and Forum
• Bulletin board software upgraded to phpBB3
  – Allow upload of images and other attachments
  – Automatic image thumbnails
  – More powerful full text search
  – Better cross-browser support
• Plan to migrate GEP Wikis (both private and public) to newer version of the Mediawiki software in Fall 2011
Gene Record Finder Update
• Two FlyBase updates
  – Releases 5.29, 5.32
  – Release 5.39 for Fall 2011
• Start and end columns now refer to the 5’ start and 3’ end coordinates
  – For features on the minus strand, start coordinates are larger than the end coordinates
• Added new section for D. melanogaster genes with non-canonical splice sites
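The coordinate convention described on this slide can be illustrated with a minimal sketch. It assumes the input uses GFF-style coordinates (start <= end plus a strand column); the function name `to_five_prime_coords` is hypothetical and is not part of the Gene Record Finder.

```python
def to_five_prime_coords(start, end, strand):
    """Convert GFF-style coordinates (start <= end) into 5'-start / 3'-end order.

    Assumption: 'start' and 'end' are 1-based inclusive positions as in a GFF file.
    """
    if start > end:
        raise ValueError("GFF-style input expects start <= end")
    if strand == "+":
        return start, end
    if strand == "-":
        # On the minus strand the 5' end of the feature has the larger coordinate.
        return end, start
    raise ValueError(f"unknown strand: {strand!r}")


print(to_five_prime_coords(1200, 1500, "+"))  # (1200, 1500)
print(to_five_prime_coords(1200, 1500, "-"))  # (1500, 1200)
```

With this convention, a start coordinate that is larger than the end coordinate signals a minus-strand feature.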
Gene Record Finder Demo
Potential Issues with Gene Record Finder
• Phases for coding exons were based on GFF files provided by FlyBase
• Since Release 5.33, the phase entries for CDS features may be incorrect
  – In older releases, the phase column in the FlyBase GFF file represents the reading frame
  – Issue has not yet been resolved as of Release 5.39
• Instead of relying on the FlyBase entries, phase and CDS translations are calculated separately
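One way to calculate phase independently of the GFF phase column is sketched below. It assumes the GFF3 meaning of phase (the number of bases to skip at the start of a CDS feature to reach the first complete codon) and a complete coding region whose first CDS starts at phase 0. The function name `cds_phases` is hypothetical; this is not the Gene Record Finder code.

```python
def cds_phases(cds_blocks, strand):
    """Return [((start, end), phase), ...] in transcript (5' to 3') order.

    cds_blocks: list of (start, end), 1-based inclusive, start <= end.
    Assumption: the first CDS begins with a complete codon (phase 0).
    """
    # Transcript order is ascending on the plus strand, descending on the minus strand.
    ordered = sorted(cds_blocks, reverse=(strand == "-"))
    result = []
    phase = 0
    for start, end in ordered:
        result.append(((start, end), phase))
        length = end - start + 1
        # Bases left over after the last complete codon must be completed
        # by the first bases of the next CDS.
        phase = (3 - (length - phase) % 3) % 3
    return result


for block, phase in cds_phases([(100, 109), (200, 210), (300, 311)], "+"):
    print(block, phase)
# (100, 109) 0
# (200, 210) 2
# (300, 311) 0
```

The translation can then be produced by concatenating the CDS blocks in the same order, which avoids any dependence on the phase entries in the source file.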
Keeping up with FlyBase Releases
• Release 6 assembly may be released in September
• New modENCODE RNA-Seq data has led to many updates to the D. melanogaster gene annotations
Graveley BR, et al. The developmental transcriptome of Drosophila melanogaster. Nature (471) 473-479
• More up-to-date Gene Record Finder available at: http://gander.wustl.edu/~wilson/dmelgenerecord_current/index.html
Gene Model Checker User Interface Improvements
• Form values (except the sequence file) will persist when you refresh the web page
• Added support for sequence file in rich text format
• Improve detection of overlapping coordinates
  – Overlap among exon coordinates
  – Overlap between exon coordinates and stop codons
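The kind of overlap check described above can be sketched as follows. This is an illustration only, not the Gene Model Checker implementation; the function name `find_overlaps` and the feature labels are hypothetical.

```python
def find_overlaps(features):
    """Return pairs of feature labels whose coordinate ranges overlap.

    features: list of (label, start, end) with 1-based inclusive coordinates.
    Start may be larger than end (minus strand); ranges are normalized first.
    """
    normalized = [(min(s, e), max(s, e), label) for label, s, e in features]
    normalized.sort()
    overlaps = []
    for i, (lo_a, hi_a, label_a) in enumerate(normalized):
        for lo_b, hi_b, label_b in normalized[i + 1:]:
            if lo_b > hi_a:
                # Sorted by start, so no later feature can overlap this one.
                break
            overlaps.append((label_a, label_b))
    return overlaps


model = [
    ("exon1", 100, 250),
    ("exon2", 240, 400),   # overlaps exon1
    ("exon3", 500, 600),
    ("stop", 598, 600),    # stop codon placed inside exon3
]
print(find_overlaps(model))  # [('exon1', 'exon2'), ('exon3', 'stop')]
```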
Gene Model Checker Updates
• New “Warn” level in the checklist
  – Non-canonical splice donor site (GC)
  – Number of coding exons in submitted model differs from the D. melanogaster ortholog
  – Cannot find the putative D. melanogaster ortholog
• Global alignment between submitted model and the D. melanogaster ortholog
• Color dot plot for complete gene models
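The GC donor warning in the checklist above can be illustrated with a short sketch. The slide only states that a GC donor is reported at the "Warn" level; treating GT as "Pass" and anything else as "Fail" is an assumption made here, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def donor_site(sequence, exon_start, exon_end, strand):
    """Return the two intron bases that follow an exon (1-based, start <= end).

    Intended for exons that are followed by an intron; returns "" when the
    site would fall outside the sequence on the minus strand.
    """
    if strand == "+":
        site = sequence[exon_end:exon_end + 2]
    else:
        if exon_start < 3:
            return ""
        # On the minus strand the donor lies just before the exon in
        # contig coordinates and must be reverse complemented.
        site = reverse_complement(sequence[exon_start - 3:exon_start - 1])
    return site.upper()


def classify_donor(site):
    if site == "GT":
        return "Pass"
    if site == "GC":
        return "Warn"   # non-canonical donor named on the slide
    return "Fail"       # assumption: other dinucleotides are rejected


contig = "ATGGCCAAGGTAAGT"            # exon at 1-9, followed by GT
print(classify_donor(donor_site(contig, 1, 9, "+")))   # Pass
print(classify_donor("GC"))                            # Warn
```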
Gene Model Checker Demo
Annotation Files Merger
• Combine files generated by the Gene Model Checker for different gene models into a single file
  – Use this tool to reduce submission errors
• Added link to view combined GFF file as a custom track on the UCSC Genome Browser mirror
• Updated documentation for annotation submission shows how to use this tool to prepare files for project submission
Annotation Files Merger Demo
Small Exon Finder
• Search for small open reading frames that cannot be identified through sequence alignments
• Search for small exons that satisfy a set of biological constraints:
  – CDS type (initial, internal, terminal)
  – CDS size
  – Donor and acceptor phase
• Documentation available in the Small Exon Finder User Guide (under Help -> Documentations)
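A constraint-based search of this kind can be sketched for the simplest case: an internal CDS on the plus strand. The sketch assumes that acceptor phase means the number of bases between the acceptor site and the first complete codon, that donor phase means the number of bases between the last complete codon and the donor site, and that candidates must be flanked by AG and GT. The function name `find_internal_exons` is hypothetical, and this is not the Small Exon Finder implementation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_internal_exons(seq, size, acceptor_phase, donor_phase):
    """Return (start, end) pairs, 1-based inclusive, for candidate internal CDSs."""
    seq = seq.upper()
    # The bases between the two partial codons must form whole codons.
    if (size - acceptor_phase - donor_phase) % 3 != 0:
        return []
    hits = []
    for start in range(2, len(seq) - size - 1):
        end = start + size                      # 0-based, exclusive
        if seq[start - 2:start] != "AG":        # splice acceptor
            continue
        if seq[end:end + 2] != "GT":            # canonical splice donor
            continue
        coding = seq[start + acceptor_phase:end - donor_phase]
        codons = [coding[i:i + 3] for i in range(0, len(coding), 3)]
        if any(codon in STOP_CODONS for codon in codons):
            continue                            # in-frame stop codon
        hits.append((start + 1, end))
    return hits


region = "TTCAG" + "GCCAAGGA" + "GTAAGT"
print(find_internal_exons(region, size=8, acceptor_phase=0, donor_phase=2))
# [(6, 13)]
```

Initial and terminal CDSs would replace one of the splice-site checks with a start codon or stop codon check, and minus-strand searches can be run on the reverse complement of the sequence.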
Small Exon Finder Demo
UCSC Genome Browser Mirror Update
• Updated genome browser software to release 238
  – New navigation features
    • Search for blastx hits, drag and zoom, re-order tracks
  – Improved support for second-generation sequence data
• Added initial set of mRNA-Seq and TopHat junction predictions to the D. mojavensis dot chromosome assembly
Finishing Updates
• GEP LiveCD
  – Re-master image with new kernel and software updates
  – Consed updated to version 20
  – Created new VirtualBox appliance with support for both SATA and IDE drives
  – Improve support for VirtualBox 4
  – Updated documentation for storing user data on USB and local VirtualBox disk image
• Finishing Packages
  – New naming conventions for fosmid end traces
  – Updated configuration files (e.g. for autofinish, digests)
Tools Under Development
http://www.flickr.com/photos/gullevek/155604654/sizes/z/
modENCODE Drosophila mRNA-Seq Data
• modENCODE project (Brian Oliver) has generated mRNA-Seq data for multiple Drosophila species
  – mRNA-Seq of head tissues from D. mojavensis
    • Data tracks added to GEP Genome Browsers in Fall 2010
    • Unpaired, Illumina Genome Analyzer and Genome Analyzer II
  – mRNA-Seq of whole flies from Drosophila
    • Paired-end, Illumina Genome Analyzer II and HiSeq 2000
    • mRNAs from males and females of multiple Drosophila species
  – Verify D. melanogaster gene models
  – Examine differences in gene expression
modENCODE Drosophila RNA-Seq Data
• Species with mRNA-Seq data that are also in the GEP annotation pipeline
  – D. ananassae
  – D. mojavensis
[Figure legend: Reference, Published, In Progress, RNA-Seq]
mRNA-Seq Overview
Wang et al. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics (10) 57-63
Types of mRNA-Seq Data Tracks
• Mapping mRNA-Seq reads onto contig sequences
  – Read coverage and alignment summary
• Splice junction predictions
  – TopHat predictions, spliced reads alignments
• Transcriptome assembly
  – Velvet and Oases
  – Cufflinks
• Reads unmapped by TopHat
mRNA-Seq Alignment Summary Track
• Because of the high read coverage, the individual reads are not displayed; showing all of them could overload the browser
• Composite multi-wiggle track captures the number of high quality reads aligned at each position
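The per-position read counts behind such a track can be computed with a simple difference array, as in the sketch below. The input format, the function name `coverage`, and the mapping-quality cutoff of 20 used to define "high quality" are all assumptions made for illustration; the slide does not state how the track was actually generated.

```python
def coverage(alignments, contig_length, min_quality=20):
    """Count reads covering each position of a contig.

    alignments: iterable of (start, end, mapq), 1-based inclusive coordinates.
    Returns a list where index 0 holds the depth at position 1.
    """
    diff = [0] * (contig_length + 2)
    for start, end, mapq in alignments:
        if mapq < min_quality:
            continue          # skip low-quality alignments (assumed cutoff)
        diff[start] += 1      # coverage rises where the read begins
        diff[end + 1] -= 1    # and falls just past where it ends
    depth = []
    running = 0
    for pos in range(1, contig_length + 1):
        running += diff[pos]
        depth.append(running)
    return depth


reads = [(1, 5, 60), (3, 8, 60), (4, 6, 5)]   # the last read is low quality
print(coverage(reads, contig_length=10))
# [1, 1, 2, 2, 2, 1, 1, 1, 0, 0]
```

Storing one number per position rather than every read is what keeps the track small enough to display.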
Identify Splice Junctions with mRNA-Seq
• Two additional tracks can be used to identify splice junctions
  – TopHat junctions
    • For reads >= 75bp, search for GT-AG, GC-AG, and AT-AC intron junctions
    • Search for joins between neighboring coverage islands
    • Use mate pair information to estimate intron sizes
      – Average mate pair distance for this library is 150 bp
  – Spliced mRNA-Seq
    • Subset of read mate pairs mapped by TopHat with at least three alignment blocks
mRNA-Seq Transcriptome Assembly
• Two basic approaches to transcriptome assembly
  – Assemble reads first then map the assembled transcripts back to the genome
    • Trinity, ABySS, Oases
  – Map reads onto the reference genome first and then merge overlapping reads to create transcripts
    • Cufflinks, Scripture
• Because of limited computational resources:
  – Map reads against each contig with TopHat
  – Extract mapped reads and assemble with Oases
  – Align assembled transcripts back to the contig with BLAT
Number of Unmapped mRNA-Seq Reads Against the D. mojavensis Assembly
• ~300 million reads unmapped by TopHat
SRR Accession #    Missing Reads
SRR166832_1           19,005,424
SRR166832_2           19,005,425
SRR166833_1           54,724,955
SRR166833_2           54,724,955
SRR166834_1           22,019,750
SRR166834_2           22,019,751
SRR166835_1           62,179,296
SRR166835_2           62,179,296
Total                315,858,852
Number of Reads Removed because of Low Quality or Unknown Bases
• Only about 1% of the reads contain unknown bases once the reads are trimmed from 101 to 75 bases
SRR Accession #      Removed    Total # Missing    % Removed
SRR166832_1          834,883         19,005,424       4.39 %
SRR166832_2          541,113         19,005,425       2.85 %
SRR166833_1          505,054         54,724,955       0.92 %
SRR166833_2          646,392         54,724,955       1.18 %
SRR166834_1          227,753         22,019,750       1.03 %
SRR166834_2          165,507         22,019,751       0.75 %
SRR166835_1          411,722         62,179,296       0.66 %
SRR166835_2          566,942         62,179,296       0.91 %
Total              3,899,366        315,858,852       1.23 %
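The trimming and filtering step summarized in this table can be sketched as follows. The sketch assumes that reads are trimmed from the 3' end and that a read is removed if any unknown base (N) remains after trimming; quality-based filtering is omitted, and the function names are hypothetical.

```python
def trim_and_filter(reads, keep_length=75):
    """Trim each read to keep_length bases and drop reads that still contain N.

    reads: iterable of (name, sequence). Returns (kept_reads, removed_count).
    """
    kept = []
    removed = 0
    for name, seq in reads:
        trimmed = seq[:keep_length]     # assumption: trim from the 3' end
        if "N" in trimmed.upper():
            removed += 1
        else:
            kept.append((name, trimmed))
    return kept, removed


def percent_removed(removed, total):
    return round(100.0 * removed / total, 2)


example_reads = [
    ("read1", "A" * 79 + "N" + "A" * 21),   # N at base 80 is trimmed away
    ("read2", "A" * 9 + "N" + "A" * 91),    # N at base 10 remains
]
kept, removed = trim_and_filter(example_reads)
print(len(kept), removed)                    # 1 1
print(percent_removed(834883, 19005424))     # 4.39
print(percent_removed(3899366, 315858852))   # 1.23
```

The last two lines reproduce the first and the total rows of the table from the counts listed there.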
Potential Problems with TopHat Alignments
• Bowtie is optimized for ungapped alignment
• TopHat subdivides each read into 25bp segments
  – Each segment is mapped independently
  – Alignment blocks are then merged back together
• TopHat could fail to map reads that are derived from multiple exons
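The effect of the segmentation described above can be illustrated with a simplified model. This is not TopHat's algorithm; it only shows which fixed-size segments of a read would contain a splice junction and therefore could not be placed by an ungapped aligner. The function name and the junction offsets are hypothetical.

```python
def segments_with_junctions(read_length, junction_offsets, segment_length=25):
    """Split a read into fixed-size segments and flag those containing a junction.

    junction_offsets: 0-based read offsets where a new exon begins.
    Returns a list of (segment_start, segment_end, contains_junction),
    with segment_end exclusive.
    """
    flagged = []
    for offset in range(0, read_length, segment_length):
        end = min(offset + segment_length, read_length)
        # A junction exactly on a segment boundary does not split the segment.
        contains = any(offset < j < end for j in junction_offsets)
        flagged.append((offset, end, contains))
    return flagged


# A 75 bp read covering three exons: 20 bases, 25 bases, 30 bases.
for segment in segments_with_junctions(75, [20, 45]):
    print(segment)
# (0, 25, True)
# (25, 50, True)
# (50, 75, False)
```

In this example only one of the three segments lies entirely within a single exon, which is the situation in which a read derived from multiple exons may be left unmapped.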
[Figure labels: Coding Exons, Alignment]
Mapping unaligned reads with BLAT
Intron Sizes Distribution in D. melanogaster
Comeron JM and Kreitman M. The Correlation Between Intron Length and Recombination in Drosophila: Dynamic Equilibrium Between Mutational and Selective Forces. Genetics (156): 1175-1190
Filter the Unaligned Reads using Minimum Intron Size
[Figure labels: Coding Exons, Alignment; gaps between adjacent alignment blocks >= 30 and >= 30]
• Minimum intron size in Drosophila is ~40 bases
• Only keep alignments that consist of 3 blocks where the distance between each adjacent block is at least 30 bases
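The filter described above can be sketched as a small predicate over the alignment blocks of one read. The block representation (contig start and end, 1-based inclusive) and the function name `passes_intron_filter` are assumptions; parsing the BLAT output is left out.

```python
def passes_intron_filter(blocks, required_blocks=3, min_gap=30):
    """Keep alignments with exactly three blocks separated by at least 30 bases.

    blocks: list of (start, end) contig coordinates, 1-based inclusive.
    """
    if len(blocks) != required_blocks:
        return False
    ordered = sorted(blocks)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - prev_end - 1   # unaligned contig bases between blocks
        if gap < min_gap:
            return False
    return True


print(passes_intron_filter([(100, 120), (180, 210), (300, 323)]))  # True
print(passes_intron_filter([(100, 120), (130, 160), (300, 323)]))  # False
```

The 30-base threshold is slightly below the ~40-base minimum intron size quoted on the slide, so gaps shorter than that are unlikely to be introns.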
Example of Unaligned Reads that Span Multiple Exons
Using mRNA-Seq Tracks in Annotation
Plans to Incorporate mRNA-Seq Data into the GEP Annotation Pipeline
• Develop and continue to improve programs used to manipulate and process mRNA-Seq datasets
• New Homework #2 from Dr. Buhler on how to use mRNA-Seq data for annotation
• Test and validate the new mRNA-Seq tracks
  – Han and William have used the mRNA-Seq data when they checked the annotation submissions this summer
  – Also received feedback from Bio 4342 this year
Cross-species mRNA-Seq
• How can we incorporate the mRNA-Seq data into the D. erecta and D. grimshawi annotation projects?
[Figure legend: Reference, Published, In Progress, RNA-Seq]
Map D. yakuba mRNA-Seq Reads onto D. erecta Contigs
• Use D. yakuba reads to generate TopHat junctions and coverage data tracks for D. erecta
• Cannot generate Cufflinks and Oases transcripts directly from D. erecta alignments
  – Reads from less conserved regions may not be mapped
• Build transcriptome library from whole genome alignments to D. yakuba
  – Map assembled transcripts against the D. erecta contigs with BLAT
Incorporating mRNA-Seq data into D. grimshawi projects
• D. virilis and D. mojavensis are the two species that are most closely related to D. grimshawi
  – Cannot reliably detect conserved nucleotide sequences with BLASTN, BLAT, Clustalw
• Build transcriptome library based on whole genome alignments to D. virilis and D. mojavensis
  – Run Cufflinks and Oases in sliding window
  – Align the assembled transcripts against the D. grimshawi contigs with translated BLAT and TBLASTX
Cross-species mRNA-Seq
Designing and Managing Your Own Annotation Projects
• Three major components:
  – Workflow system for creating custom genome browsers
    • Command-line based workflow is now operational
  – Galaxy modules for performing statistical analysis and data mining (with modENCODE data)
  – Create and manage your own annotation projects using the Project Management System
Project Management System Demo
Conclusions
• During the past year, we have made substantial improvements to the GEP web framework
• New mRNA-Seq data should help resolve many ambiguous cases and speed up annotation
  – mRNA-Seq evidence tracks now available on gander
• Continue to work on system that will allow you to create and manage your own projects
Questions and Group Discussion
rRNA in D. mojavensis mRNA-Seq