Advanced ComputationAL Biology Project Presentation

Team Members: Joshua Wu 11174269, Shuyu (Christine) Xu 11161640

Page 1: Advanced ComputationAL Biology Project Presentation

ADVANCED COMPUTATIONAL BIOLOGY

PROJECT PRESENTATION

Team Members: Joshua Wu 11174269

Shuyu (Christine) Xu 11161640

Page 2: Advanced ComputationAL Biology Project Presentation

OVERVIEW

Project Description
Introduction
Motivation
Bioinformatics Application
Explicit vs Implicit
Problem Analysis
Implement Files
Experimental Results
Conclusion
Possible Future Work

Page 3: Advanced ComputationAL Biology Project Presentation

Project Description: Explicit Suffix Trees

Suppose that we want to explicitly store every string that is an edge label of a suffix tree.

The main question of this project is how much space explicit suffix trees require compared to implicit suffix trees.

Implement a suffix tree algorithm and run it on substrings of real data.

Page 5: Advanced ComputationAL Biology Project Presentation

Introduction

Any string of length m can be decomposed into m suffixes, and these suffixes can be stored in a suffix tree.

Setup time: O(m), where m is the length of the string

Search time: O(n), where n is the length of the pattern
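
As a minimal sketch of the decomposition above (not the project's code), the snippet below lists the suffixes of a string and uses them for a naive pattern check. The names `suffixes` and `occurs` are illustrative, and the string is assumed to already end with a terminator such as `$`. This brute-force version needs quadratic space and does not reach the O(m) setup or O(n) search times quoted above; those require an actual suffix tree built with a linear-time algorithm such as Ukkonen's.

```python
def suffixes(s: str) -> list[str]:
    """Return all len(s) suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def occurs(pattern: str, s: str) -> bool:
    """A pattern occurs in s exactly when it is a prefix of some suffix of s."""
    return any(suffix.startswith(pattern) for suffix in suffixes(s))


print(suffixes("ABC$"))      # ['ABC$', 'BC$', 'C$', '$']
print(occurs("BC", "ABC$"))  # True
print(occurs("CB", "ABC$"))  # False
```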

Page 7: Advanced ComputationAL Biology Project Presentation

Motivation

"Suffix trees are widely used in the computer field... Recent improvements in the method have cut the memory requirement to 17 bytes per letter, which brings the method to the verge of practicality [for bioinformatics applications]" -- Nat Goodman (Genome Technology).

Page 9: Advanced ComputationAL Biology Project Presentation

Bioinformatics Application

1. multiple genome alignment (Michael Hohl et al., 2002)

2. selection of signature oligonucleotides for DNA arrays (Kaderali and Schliep, 2002)

3. identification of sequence repeats (Kurtz and Schleiermacher, 1999)

Page 11: Advanced ComputationAL Biology Project Presentation

Explicit vs Implicit

String: ABC$ (positions 1 2 3 4)

Explicit    Implicit
ABC$        1,4
$           4,4
BC$         2,4
C$          3,4
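
The correspondence between the two labelings can be sketched in a few lines, assuming the implicit pairs are 1-based, inclusive (start, end) positions in the string, as the ABC$ example suggests. The helper name `explicit_label` is illustrative and not taken from the project's code.

```python
def explicit_label(text: str, start: int, end: int) -> str:
    """Recover the explicit label from a 1-based, inclusive (start, end) pair."""
    return text[start - 1:end]


text = "ABC$"
implicit_labels = [(1, 4), (4, 4), (2, 4), (3, 4)]  # pairs from the slide
explicit_labels = [explicit_label(text, s, e) for s, e in implicit_labels]
print(explicit_labels)  # ['ABC$', '$', 'BC$', 'C$']
```

An implicit label always costs two numbers, while an explicit label costs as many characters as the substring it spells out.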

Page 13: Advanced ComputationAL Biology Project Presentation

Problem Analysis

Best case for explicit and implicit suffix trees: all characters are different

The best case is not likely with DNA inputs, which use a total of 4 characters

Worst case: the same character throughout

Page 14: Advanced ComputationAL Biology Project Presentation

Assumptions

In implicit trees, each number is assumed to take up only one bit, whatever its value (for example, the number 10 takes up 1 bit)

The sequence contains only alphabetic characters

Page 15: Advanced ComputationAL Biology Project Presentation

Example: all different characters

String: ABCD$ (positions 1 2 3 4 5)

Implicit edge labels: 1,5  2,5  3,5  4,5  5,5

N: string length, N = 5

Memory = 10 (best case)

Page 16: Advanced ComputationAL Biology Project Presentation

Example

String: ABCABC$ (positions 1 2 3 4 5 6 7)

Implicit edge labels: 1,3  2,3  6,6  4,7  4,7  4,7  7,7  7,7  7,7  7,7

N: string length, N = 7

Memory = 20
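
The memory figures in these examples can be reproduced with a small sketch. It is not the project's program: it builds the tree by naive quadratic insertion of each suffix rather than with a linear-time algorithm, and all names (`build_suffix_tree`, `edge_labels`, `memory`) are illustrative. The implicit cost is taken as two numbers per edge, which matches Memory = 10 and Memory = 20 above. The explicit cost of one unit per stored character is an assumption made here for comparison.

```python
def build_suffix_tree(s: str) -> dict:
    """Naive suffix tree: insert every suffix into a compressed trie.

    A node is a dict mapping the first character of an edge to
    [start, end, child], where s[start:end] is the edge label.
    The last character of s must be a unique terminator such as '$'.
    """
    if not s or s.count(s[-1]) != 1:
        raise ValueError("s must end with a unique terminator")
    root: dict = {}
    for i in range(len(s)):
        node, pos = root, i
        while True:
            c = s[pos]
            if c not in node:
                node[c] = [pos, len(s), {}]  # new leaf edge
                break
            start, end, child = node[c]
            k = 0  # number of characters matched along this edge
            while start + k < end and s[start + k] == s[pos + k]:
                k += 1
            if start + k == end:
                node, pos = child, pos + k  # edge fully matched, descend
            else:
                # Split the edge at the mismatch and hang a new leaf there.
                mid = {s[start + k]: [start + k, end, child]}
                mid[s[pos + k]] = [pos + k, len(s), {}]
                node[c] = [start, start + k, mid]
                break
    return root


def edge_labels(node: dict):
    """Yield the (start, end) pair of every edge in the tree."""
    for start, end, child in node.values():
        yield start, end
        yield from edge_labels(child)


def memory(s: str) -> tuple[int, int]:
    """Return (implicit, explicit) cost under the assumptions stated above."""
    labels = list(edge_labels(build_suffix_tree(s)))
    implicit = 2 * len(labels)                     # two numbers per edge
    explicit = sum(end - start for start, end in labels)  # one unit per character
    return implicit, explicit


print(memory("ABCD$"))    # (10, 15)
print(memory("ABCABC$"))  # (20, 22)
```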

Page 17: Advanced ComputationAL Biology Project Presentation

Example: all same character

String: AAAA$ (positions 1 2 3 4 5)

Implicit edge labels: 1,1  5,5  2,2  5,5  3,3  5,5  4,5  5,5

N: string length, N = 5, 6, 7

Memory = 16, 20, 24

Memory = 4N-4

Worst case
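
One way to see where the 4N-4 figure comes from, assuming N counts the terminator and each edge stores two numbers: in the tree for N-1 copies of A followed by $, the branching nodes are the root and the prefixes A, AA, and so on up to N-2 copies of A. That gives N-1 branching nodes, each with exactly two outgoing edges (one continuing with A, one with $).

```latex
\[
\#\text{edges} = 2\,(N-1), \qquad
\text{Memory} = 2 \cdot \#\text{edges} = 4N - 4 .
\]
\[
N = 5:\; 16, \qquad N = 6:\; 20, \qquad N = 7:\; 24 .
\]
```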

Page 18: Advanced ComputationAL Biology Project Presentation

Program Input Data

DNA for all kinds of creatures:

Homo Sapiens, Monkeys, Chickens, …

Page 20: Advanced ComputationAL Biology Project Presentation

Sample input: Homo sapiens

cagctcctgagactgctggcatgaaggggagccgtgccctcctgctggtggccctcaccctgttctgcatctgccggatggccacaggggaggacaacgatgagtttttcatggacttcctgcaaacactactggtggggaccccagaggagctctatgaggggaccttgggcaagtacaatgtcaacgaagatgccaaggcagcaatgactgaactcaagtcctgcagagatggcctgcagccaatgcacaaggcggagctggtcaagctgctggtgcaagtgctgggcagtcaggacggtgcctaagtggacctcagacatggctcagccataggacctgccacacaagcagccgtggacacaacgcccactaccacctcccacatggaaatgtatcctcaaaccgtttaatcaataa

Page 21: Advanced ComputationAL Biology Project Presentation

Sample result

Page 22: Advanced ComputationAL Biology Project Presentation

Sample input 2: plants

EARPIVVGPPPPLSGGLPGTENSDQARDGTLPYTKDRFYLQPLPPTEAAQRAKVSASEILNVKQFIDRKAWPSLQNDLRLRASYLRYDLKTVISAKPKDEKKSLQELTSKLFSSIDNLDHAAKIKSPTEAEKYYGQTVSNINEVLAKLG

Page 23: Advanced ComputationAL Biology Project Presentation

Sample output:

Page 25: Advanced ComputationAL Biology Project Presentation

Homo sapiens

Page 26: Advanced ComputationAL Biology Project Presentation

Sample Input: Homo Sapiens

atgaaggggagccgtgccctcctgctggtggccctcaccctgttctgcatctgccggatggccacaggggaggacaacgatgagtttttcatggacttcctgcaaacactactggtggggaccccagaggagctctatgaggggaccttgggcaagtacaatgtcaacgaagatgccaaggcagcaatgactgaactcaagtcctgcagagatggcctgcagccaatgcacaaggcggagctggtcaagctgctggtgcaagtgctgggcagtcaggacggtgcctaa

Page 27: Advanced ComputationAL Biology Project Presentation

Comparisons: Homo Sapiens

Page 28: Advanced ComputationAL Biology Project Presentation

Comparisons: Homo Sapiens

Page 29: Advanced ComputationAL Biology Project Presentation

Monkey Virus

Page 30: Advanced ComputationAL Biology Project Presentation

Sample Input: Monkey Virus

GGSCFKCGKKGHFAKNCHEHAHNNAEPKVPGLCPRCKRGKHWANECKSKTDNQGNPIPPH

Page 31: Advanced ComputationAL Biology Project Presentation

Monkey Virus

Page 32: Advanced ComputationAL Biology Project Presentation

Plants

Page 33: Advanced ComputationAL Biology Project Presentation

Sample Input: Plants

EARPIVVGPPPPLSGGLPGTENSDQARDGTLPYTKDRFYLQPLPPTEAAQRAKVSASEILNVKQFIDRKAWPSLQNDLRLRASYLRYDLKTVISAKPKDEKKSLQELTSKLFSSIDNLDHAAKIKSPTEAEKYYGQTVSNINEVLAKLG

Page 34: Advanced ComputationAL Biology Project Presentation

Plants

Page 35: Advanced ComputationAL Biology Project Presentation

Tobacco

Page 36: Advanced ComputationAL Biology Project Presentation

Sample input: tobacco

SYSITTPSQFVFLSSAWADPIELINLCTNALGNQFQTQQARTVVQRQFSEVWKPSPQVTVRFPDSDFKVYRYNAVLDPLVTALLGAFDTRNRIIEVENQANPTTAETLDATRRVDDATVAIRSAINNLIVELIRGTGSYNRSSFESSSGLVWTSGPAT

Page 37: Advanced ComputationAL Biology Project Presentation

Tobacco

Page 38: Advanced ComputationAL Biology Project Presentation

Insects

Page 39: Advanced ComputationAL Biology Project Presentation

Sample Input: Insects

DCLSGRYKGPCAVWDNETCRRVCKEEGRSSGHCSPSLKCWCEGC

Page 40: Advanced ComputationAL Biology Project Presentation

Insects

Page 41: Advanced ComputationAL Biology Project Presentation

Birds

Page 42: Advanced ComputationAL Biology Project Presentation

Sample Input: Birds

IDTCRLPSDRGRCKASFERWYFNGRTCAKFIYGGCGGNGNKFPTQEACMKRCAKA

Page 43: Advanced ComputationAL Biology Project Presentation

Birds

Page 44: Advanced ComputationAL Biology Project Presentation

SARS

Page 45: Advanced ComputationAL Biology Project Presentation

Sample Input: SARS

ALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEV

Page 46: Advanced ComputationAL Biology Project Presentation

SARS

Page 47: Advanced ComputationAL Biology Project Presentation

Fish

Page 48: Advanced ComputationAL Biology Project Presentation

Sample Input: Fish

GHHHHHHLEDPSGGTPYIGSKISLISKAEIRYEGILYTIDTENSTVALAKVRSFGTEDRPTDRPIAPRDETFEYIIFRGSDIKDLTVCEPPKPIM

Page 49: Advanced ComputationAL Biology Project Presentation

Fish

Page 50: Advanced ComputationAL Biology Project Presentation

Chicken

Page 51: Advanced ComputationAL Biology Project Presentation

Sample Input: Chicken

RVKRVWPLVIRTVIAGYNLYRAIKKK

Page 52: Advanced ComputationAL Biology Project Presentation

Chicken

Page 53: Advanced ComputationAL Biology Project Presentation

Files

Code

Results

Page 55: Advanced ComputationAL Biology Project Presentation

Conclusion

Explicit suffix trees require more space than implicit suffix trees on real data.

Data comparison: the worst case is DNA input (least variety of characters)

Results

Implicit trees should be used to reduce storage use
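
The "variety of characters" in the comparison can be measured as the number of distinct symbols in an input. A small sketch, using a prefix of the Homo sapiens DNA sample and the Chicken sample from the slides; `alphabet_size` is an illustrative name, not part of the project's code.

```python
def alphabet_size(sequence: str) -> int:
    """Number of distinct symbols in the sequence, ignoring letter case."""
    return len(set(sequence.upper()))


dna_prefix = "cagctcctgagactgctggcatgaagg"  # start of the Homo sapiens sample
chicken = "RVKRVWPLVIRTVIAGYNLYRAIKKK"      # Chicken sample
print(alphabet_size(dna_prefix))  # 4
print(alphabet_size(chicken))     # 12
```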

Page 56: Advanced ComputationAL Biology Project Presentation

[Chart: variety of string vs tree size. x-axis: # of alphabets (1 to 25); y-axis: 0 to 3000]

Page 57: Advanced ComputationAL Biology Project Presentation

Conclusion

Application: it is easier to compare structures for implicit than for explicit suffix trees (number comparisons)

Saves space

Easy to implement

Further improvement?

Page 59: Advanced ComputationAL Biology Project Presentation

Possible Future Work

Program speed is too slow

The interface of our program should be improved. (Matlab)

More variety of input

Page 61: Advanced ComputationAL Biology Project Presentation

References

Online info

http://en.wikipedia.org/wiki/Suffix_tree
http://marknelson.us/1996/08/01/suffix-trees/
http://homepage.usask.ca/~ctl271/857/suffix_tree.shtml
http://www.cs.uku.fi/~kilpelai/BSA05/lectures/print07.pdf

Page 62: Advanced ComputationAL Biology Project Presentation

THANK YOU!