M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

Page 1: M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

IBM Research, NY; Scripps Genome Center, San Diego; University of California, Riverside

Page 2: M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

how can we visualize DNA data?

GTTAATGTAGCTTAAATATTTATAAAGCAAAACACTGAAAATGTTTAGATGGGTTTAATTAACCCCATTGACATTAAAGGTTTGGTCCCAGCCTTTCTATTAGTTCTAAACAGACTTACACATGCGAGCATCTACATCCCAGTGAGAACGCCCTCTAAATCATCAAGGATCAAAAGGAGCGGGTATCAAGCACACTAACACTAGTAGCTCACAACGCCTCGCTTAGCCACACCCCCACGGGACACAGCAGTGATAAAAATTAAGCCATGAACGAAAGTTTGACTAAGTCATGTTTACAAGGGTTGGTAAACTTCGTGCCAGCCACCGCGGTCATACGATTAACCCAAATTAATAGAAACACGGCGTAAAGAGTGTTAAGGAGTCACGTAAAATAAAGTCAAGCCTTAATTAAGCTGTAAAAAGCCCTAATTAAAACTAAGCCAAACTACGAAAGTGACTTTAATATAATCTGATTACACGACAGCTAAGACCCAAACTGGGATTAGATACCCCACTATGCTTAGCCATAAACTCTAATAGTCACAAAACAAGACTACTCGCCAGAGTACTACTAGCAATAGCCTAAAACTCAAAGGACTTGGCGGTGCTTCATACCCCCCTAGAGGAGCCTGTTCTATAAACGATAAACCCCGATCAACCTCACCAACCCTTGCTACTCCAGTCTATATACCGCCATCTT………….

Thousands or millions of base pairs long

• Humans cannot easily compare or visualize text

• We understand and visualize shapes better

• Can we find a way to visually represent large collections of DNA sequences?

• How can we represent the relationships between DNA sequences in an accessible manner?

I think it’s time I bought that new pair of reading glasses…


CAGCAAACCCTAAAAGGGAACGAAAGTAAGCATAACCATCCTACATAAAAACGTTAGGTCAAGGTGTAACCTATGGGTTGGGAAGAAATGGGCTACATTTTCTATATTAAGAACATTCCTTATACTCACACGAAAGTTTTTATGAAACTTAAAAACCAAAGGAGGATTTAGTAGTAAATCAAGAGCAGAGTGCTTGATTGAACAAGGCCATGGAGCACGCACACACCGCCCGTCACCCTCCTCAAGTACCCTAGCAAAGCCCCAGTTCGTTAACTCACGCCAAGCAATCATACGAGAGGAGACAAGTCGTAACAAGGTAAGCATACCGGAAGGTGTGCTTGGATGAATCAAGATATAGCTTAAACAAAGCATCTAGTTTACACCTAGAAGATTCCACACCCTGTGTATATCTTGAACCAATTCTAGCCCACACCCTCCCCACTTCTACTACTACAAACCAATCAAATAAAACATTCACCATACATTTTAAAGTATAGGAGATAGAAATTTAATTACCAGTGGCGCTATAGAGATAGTACCGTAAGGGAAAGATGAAAGAAAACCTAAAAGTAGTAAAAAGCAAAGCTTACCCCTTGTACCTTTTGCATAATGACTTAACTAGTAATAACTTAGCAAAGAGACCTTAAGTTAAATTACCCGAAACCAGACGAGCTACTTATGAGCAGTATTTAGAACGAAC…………...

Page 3: M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

dendrogram visualization

• Dendrograms present a hierarchy of affinity/similarity

• However, they offer no solution for representing the DNA itself

• Dendrograms cannot capture pairwise relationships; these are lost during the grouping

GATAAAAATTAAGCCATGAACGAAAGTTTGACTAAGTCATGTTTACAAGGGTTGGTAAACTTCGTGCCAGCCACCGCGGTCATACGATTAACCCAAATTAATAGAAACACGGCGTAAAGAGTGTTAAGGAGTCACGTAAAATAAAGTCAAGCCTTAATTAAGCTGTAAAAAGCCCTAATTAAAACTAAGCCAAACTACGAAAGTGACTTTAATATAATCTGATTACA

GATAAAAATTAAGCCATGAACGAAAGTTTGACTAAGTCATGTTTACAAGGGTTGGTAAACTTCGTGCCAGCCACCGCGGTCATACGATTAACCCAAATTAATAGAAACACGGCGTAAAGATAAGGAGTCACGTAAAATAAAGTCAAGCCTTAATTAAGCTGTAAAAAGCCCTAATTAAAACTAAGCCAAACTACGAAAGTGACTTTAATATAATCTGATTACA

AATTGATAAAAATTAAGCCATGAACGAAAGTTTGACTAAGTCATGTTTACAAGGGTTCGTGCCAGCCACCGCGGTCATACGATTAACCCAAATAGAAACACGGCGTAAAGATAAGGAGTCACGTAAAATAAAGTCAAGCCTTAATTAAGCTGTAAAAAGCCCTAATTAAAACTAAGCCAAACTACGAAAGTGACTTTAATATAATCTGATTACATTGGTAAAC

Other techniques:

• HyperTree

• PattVision

Page 4: M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

what we propose …

…GTACTTAGCGATTTAAATTC…

DNA string

Trajectory (and possible simplification)

Easier to visualize and compare trajectories rather than strings


• Transform sequences into 2-dimensional trajectories

• Compute elastic matching between DNA trajectories

• Plot their relationships on the 2D plane using a spanning-tree mapping

[Figure: spanning tree visualization plus relative distance toward the pivot point]

Page 5: M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

converting DNA to trajectories

• This process converts a long string of nucleotides into a 2D trajectory that can be easily visualized

• Similar DNA sequences will result in similar trajectories

• Resulting trajectories can be downsampled or compressed for easier plotting

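Trajectory(i) = Trajectory(i-1) + V, where V is the step vector assigned to the i-th nucleotide

[Figure: the four step directions, one each for T, G, A and C]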

• Given an initial point on the 2D space (e.g. [0,0])

• Start moving up/down/left/right based on the DNA letter you encounter
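The walk described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the slides do not say which letter maps to which direction, so the `STEPS` table below is an assumption, and the function names `dna_to_trajectory` and `downsample` are hypothetical.

```python
# Assumed step assignment: the slides show one direction per nucleotide,
# but the exact letter-to-direction mapping is not recoverable from the text.
STEPS = {
    "A": (-1, 0),  # left  (assumed)
    "C": (1, 0),   # right (assumed)
    "G": (0, 1),   # up    (assumed)
    "T": (0, -1),  # down  (assumed)
}


def dna_to_trajectory(dna, start=(0, 0)):
    """Return the list of 2D points visited while walking the DNA string.

    Implements Trajectory(i) = Trajectory(i-1) + V, with V looked up in STEPS.
    """
    x, y = start
    points = [(x, y)]
    for letter in dna.upper():
        dx, dy = STEPS[letter]  # raises KeyError on symbols other than A/C/G/T
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def downsample(points, every):
    """Keep every `every`-th point, always retaining the final point."""
    kept = list(points[::every])
    if (len(points) - 1) % every != 0:
        kept.append(points[-1])
    return kept


trajectory = dna_to_trajectory("GAATTC")
coarse = downsample(trajectory, 3)
```

A different step assignment rotates or mirrors the shapes but leaves the idea unchanged: identical substrings always produce identical trajectory segments.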

Example:

[Figure: a DNA string (…GAATTC…) and its trajectory; trajectory of cow DNA]

Page 6: M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

example

Human vs Bear

• Species with similar DNA content will also have very similar DNA trajectories

• The elastic matching offered by the warping function can find flexible similarities

Dynamic Time Warping Primer

• Use dynamic programming to solve the matching problem
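As a rough sketch of the dynamic program behind Dynamic Time Warping, the function below fills the standard cumulative-cost table for two trajectories. It assumes the local cost is the Euclidean distance between matched points and applies no warping-window constraint; the slides specify neither, and the name `dtw_distance` is hypothetical.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two non-empty 2D trajectories.

    a, b: lists of (x, y) points. Local cost is the Euclidean distance
    between matched points (an assumption, not stated in the slides).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # Extend the best of: advance in a only, advance in b only, advance in both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because one point may be matched to several points of the other trajectory, a sequence with a repeated or stretched segment can still align at low cost, which is the flexibility the slide refers to. The table takes O(nm) time and memory for trajectories of length n and m.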

Human vs Chimpanzee

Page 7: M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

is this representation meaningful?

• The dendrogram built from the pairwise distances between the trajectories is correct

• It accurately captures the predominant views about the affinity of species

[Figure: dendrogram over Chimpanzee, Pygmy Chimpanzee, Human, Blue Whale, Finback Whale, Hippopotamus, Dog, American Bear, Polar Bear, Indian Elephant and African Elephant; labeled groups: Proboscidea, Carnivora, Ursus, Cetartiodactyla, Balaenoptera, Eutheria, Panines, Hominidae]

Page 8: M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

spanning tree visualization

• The distances between any three points can be retained exactly on the 2D space

• For each additional point we can retain two distances: the one to its nearest-neighbor (NN) point and the one to a single pivot point

• Out of n² distances we can preserve a total of 3 + 2(n-3) = 2n - 3 distances

Advantages of the Mapping:

• Important distances are exactly preserved

• Local and global structure is preserved

• Preservation of distances against the pivot point allows for a very powerful visualization
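Preserving two distances for a new point amounts to intersecting two circles: one centered on the already placed nearest neighbor, one on the pivot, each with the required distance as its radius. The sketch below shows that geometric step together with the distance count from this slide. The function names are hypothetical, and the slides do not say which of the two intersection points is chosen.

```python
import math


def circle_intersections(c1, r1, c2, r2):
    """Return the 0, 1 or 2 points lying on both circles."""
    (x1, y1), (x2, y2) = c1, c2
    d = math.hypot(x2 - x1, y2 - y1)
    if d == 0 or d > r1 + r2 or d < abs(r1 - r2):
        return []  # concentric, disjoint, or one circle encloses the other
    # a: signed distance from c1 to the chord, measured along the center line
    a = (r1 * r1 - r2 * r2 + d * d) / (2 * d)
    h = math.sqrt(max(r1 * r1 - a * a, 0.0))  # half-length of the chord
    mx, my = x1 + a * (x2 - x1) / d, y1 + a * (y2 - y1) / d
    ox, oy = -h * (y2 - y1) / d, h * (x2 - x1) / d
    if h == 0:
        return [(mx, my)]  # circles touch in a single point
    return [(mx + ox, my + oy), (mx - ox, my - oy)]


def preserved_distances(n):
    """Distances kept exactly for n >= 3 points: 3 + 2(n - 3)."""
    return 3 + 2 * (n - 3)


# Pivot at the origin, nearest neighbor at (3, 0); the new point must lie
# at distance 4 from the pivot and 5 from the nearest neighbor.
candidates = circle_intersections((0, 0), 4, (3, 0), 5)
```

For 11 species, for example, `preserved_distances(11)` gives 19 exactly preserved distances out of the 55 distinct pairs.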

Page 9: M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

visualization: humans & ‘relatives’

Using warping distance

Using Euclidean distance

• Pivot point is the human

• Species that diverged closer in time are also placed closer together on the 2D space.

• Gibbon is erroneously placed closer to human compared to the orangutan.

Page 10: M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

visualization: relationship between mammals

Human is closer to pygmy chimpanzee than to regular chimpanzee.

C. Lockwood, W. Kimbel, and J. Lynch. Morphometrics and hominoid phylogeny: Support for a chimpanzee-human clade and differentiation among great ape sub-species. In Proc. Natl. Acad. Sci. USA, 101(13), 4356-4360, 2004.


Hippopotamus is indeed closer to whale than to any other species - Cetartiodactyla

B. M. Ursing and U. Arnason. Analyses of mitochondrial genomes strongly support a hippopotamus-whale clade. In Proc. of the Royal Society of London, Series B, vol 265: 2251-2255, 1998.


Page 11: M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

spanning tree for non-metric distances

• When dealing with a non-metric distance (such as DTW), the circles in the spanning-tree visualization method may not intersect.

• In that case we need to find the point in 2D space that is closest to the two reference circles.

• The reference circles can either enclose one another or lie entirely outside each other.

Disjoint Reference Circles

Enclosed Reference Circles

C is the point closer to the centers A1 and A2 of the two reference circles.
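The slides do not give a formula for C, so the following is only one plausible reading, stated as an assumption: C is taken on the line through the centers A1 and A2, at the midpoint of the narrowest gap between the two circles, so that it is equally far from both. The function name `fallback_point` is hypothetical.

```python
import math


def fallback_point(a1, r1, a2, r2):
    """Midpoint of the narrowest gap between two non-intersecting circles.

    Assumed rule (not taken from the slides): the returned point lies on
    the line through the centers a1 and a2 and is equidistant from both
    circles. Raises ValueError if the circles intersect.
    """
    d = math.dist(a1, a2)
    if d == 0:
        ux, uy = 1.0, 0.0  # concentric circles: any direction works, pick +x
    else:
        ux, uy = (a2[0] - a1[0]) / d, (a2[1] - a1[1]) / d
    # t is the signed offset from a1 along the unit vector pointing to a2
    if d > r1 + r2:        # disjoint: the gap lies between the centers
        t = (r1 + (d - r2)) / 2
    elif r1 > d + r2:      # circle 2 enclosed by circle 1
        t = (r1 + (d + r2)) / 2
    elif r2 > d + r1:      # circle 1 enclosed by circle 2
        t = (-r1 + (d - r2)) / 2
    else:
        raise ValueError("circles intersect; use an intersection point instead")
    return (a1[0] + t * ux, a1[1] + t * uy)
```

With this choice neither requested distance is met exactly, but the error is split evenly between the two: for centers (0, 0) and (10, 0) with radii 1 and 3, the point (4, 0) is 3 units from each circle.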


Page 12: M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

extensions (search and indexing)

• Can we utilize advanced compression and indexing schemes for the DNA trajectories in order to do fast prefiltering between millions of DNA sequences?

Project all sequences into a new space, and search this space instead (e.g. project each trajectory from 100-D space to 2-D space)

[Figure: a query among projected points labeled A, B and C; axes: Feature 1, Feature 2]

Organize the low-dimensional points into a hierarchical ‘index’ structure.

Page 13: M. Vlachos, B. Taneri, E. Keogh, P.S. Yu

extensions (medical screening)

• DNA trajectories can be an interesting approach to clinical screening and diagnostics

• E.g. does this tissue/cell look more like a cancerous one or not?

• By evaluating the distance between the DNA trajectories, can we evaluate the cancer stage of a tissue?

Cancer Tissues

Normal Tissues

• Perform clustering

• Discover classification rules, e.g. through nearest neighbors

Is this tissue cancerous or not and at what stage?
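A nearest-neighbor rule of the kind mentioned above can be sketched as follows. The distance function is passed in as a parameter; `pointwise_distance` is a simple stand-in for equal-length trajectories, and a warping distance could be supplied instead. The trajectories and labels in the usage lines are made-up toy values, not data from the presentation, and all names are hypothetical.

```python
import math


def pointwise_distance(a, b):
    """Sum of Euclidean distances between corresponding points.

    Stand-in distance for equal-length trajectories; extra points in the
    longer trajectory are ignored by zip().
    """
    return sum(math.dist(p, q) for p, q in zip(a, b))


def nearest_neighbor_label(query, labeled, distance=pointwise_distance):
    """Return the label of the labeled trajectory closest to `query`.

    labeled: list of (trajectory, label) pairs.
    """
    best_label, best_dist = None, float("inf")
    for trajectory, label in labeled:
        d = distance(query, trajectory)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


# Toy, made-up trajectories for illustration only
labeled = [
    ([(0, 0), (1, 0)], "normal"),
    ([(0, 0), (0, 5)], "cancerous"),
]
prediction = nearest_neighbor_label([(0, 0), (0, 4)], labeled)
```

Labeling the training trajectories by stage rather than by a yes/no class would let the same rule suggest a stage, which is the question the slide poses.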
