Advanced Computational Biology Project Presentation

Post on 23-Feb-2016



ADVANCED COMPUTATIONAL BIOLOGY

PROJECT PRESENTATION

Team Members: Joshua Wu 11174269

Shuyu (Christine) Xu 11161640

OVERVIEW

Project Description
Introduction
Motivation
Bioinformatics Application
Explicit vs Implicit
Problem Analysis
Implement Files
Experimental Results
Conclusion
Possible Future Work

Project Description: Explicit Suffix Trees

Suppose that we want to explicitly store all strings that are edge labels of a suffix tree.

The main question of this project is how much space explicit suffix trees require compared with implicit suffix trees.

Approach: implement a suffix tree algorithm and run it on substrings of real data.

Introduction

Any string of length m can be decomposed into m suffixes, and these suffixes can be stored in a suffix tree.

Setup time: O(m), where m is the length of the string

Search time: O(n), where n is the length of the pattern
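The idea of storing all suffixes and then searching in time proportional to the pattern length can be sketched in Python. This is a minimal illustration, not the presenters' program: it builds an uncompressed suffix trie with a naive O(m^2) construction rather than a linear-time suffix tree, and the function names `suffixes`, `build_suffix_trie`, and `contains` are hypothetical.

```python
def suffixes(text):
    """Return all suffixes of text, longest first (m suffixes for length m)."""
    return [text[i:] for i in range(len(text))]


def build_suffix_trie(text):
    """Build an uncompressed suffix trie as nested dicts.

    Assumption: this naive construction costs O(m^2) time and space.
    It is NOT the O(m) construction; it only illustrates the structure.
    """
    root = {}
    for suffix in suffixes(text):
        node = root
        for ch in suffix:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Follow one child link per pattern character: O(n) for length n."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


text = "ABCABC$"
trie = build_suffix_trie(text)
print(suffixes(text))
print(contains(trie, "CAB"), contains(trie, "CBA"))
```

The search cost depends only on the pattern length, because every substring of the text is a prefix of some suffix and therefore a path from the root.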

Motivation

"Suffix trees are widely used in the computer field... Recent improvements in the method have cut the memory requirement to 17 bytes per letter, which brings the method to the verge of practicality [for bioinformatics applications]" -- Nat Goodman (Genome Technology).

Bioinformatics Application

1. multiple genome alignment (Michael Hohl et al., 2002)

2. selection of signature oligonucleotides for DNA arrays (Kaderali and Schliep, 2002)

3. identification of sequence repeats (Kurtz and Schleiermacher, 1999)

Explicit vs Implicit

String: ABC$ (positions 1 2 3 4)

Explicit edge labels: ABC$, $, BC$, C$

Implicit edge labels (start,end): 1,4  4,4  2,4  3,4
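The two labelings of the ABC$ example can be compared with a short Python sketch. It assumes 1-based, inclusive (start, end) index pairs as shown on the slide, and it counts one unit per stored character or number; the helper name `explicit_label` is hypothetical.

```python
text = "ABC$"

# Implicit labels: (start, end) pairs, 1-based and inclusive, as on the slide.
implicit = [(1, 4), (4, 4), (2, 4), (3, 4)]


def explicit_label(text, start, end):
    """Recover the explicit substring for a 1-based inclusive index pair."""
    return text[start - 1:end]


explicit = [explicit_label(text, s, e) for s, e in implicit]

# Assumed cost model: one unit per character (explicit) or per number (implicit).
explicit_cost = sum(len(label) for label in explicit)
implicit_cost = 2 * len(implicit)

print(explicit)
print(explicit_cost, implicit_cost)
```

Each implicit edge costs two numbers regardless of label length, while the explicit cost grows with the total length of the labels.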

Problem Analysis

Best case for explicit and implicit suffix trees: all characters are different

The best case is unlikely with DNA inputs, which use only 4 distinct characters

Worst case: the same character throughout

Assumptions

In implicit trees, each number counts as one unit of memory, regardless of its size (for example, the number 10 counts as 1 unit).

The sequence contains only alphabetic characters.

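Example: all different characters

String: ABCD$ (positions 1 2 3 4 5)

Implicit edge labels: 1,5  2,5  3,5  4,5  5,5

N (string length) = 5, Memory = 10 (best case)

Example: ABCABC$

String: ABCABC$ (positions 1 2 3 4 5 6 7)

Implicit edge labels: 1,3  2,3  6,6  7,7 from the root, then 4,7 and 7,7 below each of the three internal nodes

N (string length) = 7, Memory = 20

Example: all same character

String: AAAA$ (positions 1 2 3 4 5)

Implicit edge labels: 1,1  5,5  2,2  5,5  3,3  5,5  4,5  5,5

N (string length) = 5, 6, 7 gives Memory = 16, 20, 24, so Memory = 4n-4 (worst case)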
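The memory figures in these examples can be reproduced with a small Python sketch. It is an illustration under stated assumptions, not the presenters' implementation: it builds the suffix tree by naive suffix insertion (quadratic time in the worst case), stores each edge as a 0-based half-open (start, end) pair, requires a unique `$` terminator, and charges two units per edge. The names `build_suffix_tree`, `count_edges`, and `implicit_memory` are hypothetical.

```python
def build_suffix_tree(text):
    """Naive compressed suffix tree; each node is a dict
    first_char -> (start, end, child), with text[start:end] as edge label."""
    # Assumption: a unique terminator guarantees no suffix is a prefix of another.
    assert text.endswith("$") and text.count("$") == 1
    n = len(text)
    root = {}
    for i in range(n):
        node, pos = root, i
        while True:
            first = text[pos]
            if first not in node:
                node[first] = (pos, n, {})  # new leaf edge
                break
            start, end, child = node[first]
            k = 0
            while start + k < end and text[start + k] == text[pos + k]:
                k += 1
            if start + k == end:  # whole edge matched: descend
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split it at offset k.
            middle = {text[start + k]: (start + k, end, child)}
            middle[text[pos + k]] = (pos + k, n, {})
            node[first] = (start, start + k, middle)
            break
    return root


def count_edges(node):
    """Count all edges in the subtree below node."""
    return sum(1 + count_edges(child) for _, _, child in node.values())


def implicit_memory(text):
    """Two numbers per edge, one unit per number (the slides' cost model)."""
    return 2 * count_edges(build_suffix_tree(text))


for s in ["ABCD$", "ABCABC$", "AAAA$", "AAAAA$", "AAAAAA$"]:
    print(s, len(s), implicit_memory(s))
```

Under this cost model a string of n distinct characters has n edges (memory 2n), while a run of one repeated character plus the terminator has 2n-2 edges (memory 4n-4).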

Program Input Data

Sequence data (DNA and protein) from a variety of organisms:

Homo Sapiens, Monkeys, Chickens, …

Sample input: Homo Sapiens

cagctcctgagactgctggcatgaaggggagccgtgccctcctgctggtggccctcaccctgttctgcatctgccggatggccacaggggaggacaacgatgagtttttcatggacttcctgcaaacactactggtggggaccccagaggagctctatgaggggaccttgggcaagtacaatgtcaacgaagatgccaaggcagcaatgactgaactcaagtcctgcagagatggcctgcagccaatgcacaaggcggagctggtcaagctgctggtgcaagtgctgggcagtcaggacggtgcctaagtggacctcagacatggctcagccataggacctgccacacaagcagccgtggacacaacgcccactaccacctcccacatggaaatgtatcctcaaaccgtttaatcaataa

Sample result

Sample input 2: plants

EARPIVVGPPPPLSGGLPGTENSDQARDGTLPYTKDRFYLQPLPPTEAAQRAKVSASEILNVKQFIDRKAWPSLQNDLRLRASYLRYDLKTVISAKPKDEKKSLQELTSKLFSSIDNLDHAAKIKSPTEAEKYYGQTVSNINEVLAKLG

Sample output:

Homo Sapiens

Sample Input: Homo Sapiens

atgaaggggagccgtgccctcctgctggtggccctcaccctgttctgcatctgccggatggccacaggggaggacaacgatgagtttttcatggacttcctgcaaacactactggtggggaccccagaggagctctatgaggggaccttgggcaagtacaatgtcaacgaagatgccaaggcagcaatgactgaactcaagtcctgcagagatggcctgcagccaatgcacaaggcggagctggtcaagctgctggtgcaagtgctgggcagtcaggacggtgcctaa

Comparisons: Homo Sapiens

Monkey Virus

Sample Input: Monkey Virus

GGSCFKCGKKGHFAKNCHEHAHNNAEPKVPGLCPRCKRGKHWANECKSKTDNQGNPIPPH

Monkey Virus

Plants

Sample Input: Plants

EARPIVVGPPPPLSGGLPGTENSDQARDGTLPYTKDRFYLQPLPPTEAAQRAKVSASEILNVKQFIDRKAWPSLQNDLRLRASYLRYDLKTVISAKPKDEKKSLQELTSKLFSSIDNLDHAAKIKSPTEAEKYYGQTVSNINEVLAKLG

Plants

Tobacco

Sample input: tobacco

SYSITTPSQFVFLSSAWADPIELINLCTNALGNQFQTQQARTVVQRQFSEVWKPSPQVTVRFPDSDFKVYRYNAVLDPLVTALLGAFDTRNRIIEVENQANPTTAETLDATRRVDDATVAIRSAINNLIVELIRGTGSYNRSSFESSSGLVWTSGPAT

Tobacco

Insects

Sample Input: Insects

DCLSGRYKGPCAVWDNETCRRVCKEEGRSSGHCSPSLKCWCEGC

Insects

Birds

Sample Input: Birds

IDTCRLPSDRGRCKASFERWYFNGRTCAKFIYGGCGGNGNKFPTQEACMKRCAKA

Birds

SARS

Sample Input: SARS

ALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEV

SARS

Fish

Sample Input: Fish

GHHHHHHLEDPSGGTPYIGSKISLISKAEIRYEGILYTIDTENSTVALAKVRSFGTEDRPTDRPIAPRDETFEYIIFRGSDIKDLTVCEPPKPIM

Fish

Chicken

Sample Input: Chicken

RVKRVWPLVIRTVIAGYNLYRAIKKK

Chicken

files Code

Results

Conclusion

Explicit suffix trees require more space than implicit suffix trees on real data.

Data comparison: the worst case is DNA input (least variety of characters).

Result: implicit trees should be used when storage needs to be kept small.

[Chart: variety of string vs tree size. x-axis: # of alphabets (ticks 1 to 25); y-axis ticks 0 to 3000]

Conclusion

Application: it is easier to compare structures for implicit than for explicit suffix trees (number comparisons)

Saves space

Easy to implement

Further improvement?

Possible Future Work

The program runs too slowly

The interface of our program (Matlab) should be improved

A greater variety of inputs should be tested

References

Online info:

http://en.wikipedia.org/wiki/Suffix_tree

http://marknelson.us/1996/08/01/suffix-trees/

http://homepage.usask.ca/~ctl271/857/suffix_tree.shtml

http://www.cs.uku.fi/~kilpelai/BSA05/lectures/print07.pdf

THANK YOU!