PhD Completion Seminar Algorithms for the Study of RNA and...

70
PhD Completion Seminar Algorithms for the Study of RNA and Protein Structure Alex Stivala University of Melbourne, CSSE Supervisors: Prof. Peter Stuckey, Dr Tony Wirth August 27, 2010 1 / 70

Transcript of PhD Completion Seminar Algorithms for the Study of RNA and...

Page 1: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

PhD Completion Seminar

Algorithms for the Study of RNA and Protein

Structure

Alex StivalaUniversity of Melbourne, CSSE

Supervisors: Prof. Peter Stuckey, Dr Tony Wirth

August 27, 2010

1 / 70

Page 2: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Overview

◮ RNA structural alignment

◮ Parallelism 1: Parallel dynamic programming

◮ Protein substructure searching

◮ Parallelism 2: Faster protein substructure searching

◮ Protein structure cartoons

2 / 70

Page 3: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

RNA structural alignment

3 / 70

Page 4: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Introduction: Dynamic Programming

◮ Optimization problems where optimum can be computedefficiently from optimal solutions to subproblems (Bellman1957)

◮ Bottom-up

◮ Top-down and memoization

4 / 70

Page 5: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Introduction: RNA secondary structure

◮ RNA has a secondary structure determined by basepairinginteractions within the RNA molecule.

◮ This can be predicted from sequence by a D.P. algorithm suchas that implemented in RNAfold in the Vienna RNA package(Zuker & Stiegler (1981); Hofacker et al. (1994)).

◮ Rather than one single “best” structure, a base pairingprobability matrix (McCaskill (1990)) can be computed,showing probabilities of bases pairing with each other.

5 / 70

Page 6: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Example RNA secondary structure (1)

GCGGGGGUGCC

CGAGCCUGGCCAA

AG G

G G U C G G G C U CA G

GA

CC

CGA

U GG

CG U

AGG

CCUG

CGUG

GG

UU

CAAAU

CCCACCCCCCGCA

AP000063

G C G G G G G U G C C C G A G C C U G G C C A A A G G G G U C G G G C U C A G G A C C C G A U G G C G U A G G C C U G C G U G G G U U C A A A U C C C A C C C C C C G C A

G C G G G G G U G C C C G A G C C U G G C C A A A G G G G U C G G G C U C A G G A C C C G A U G G C G U A G G C C U G C G U G G G U U C A A A U C C C A C C C C C C G C AGC

GG

GG

GU

GC

CC

GA

GC

CU

GG

CC

AA

AG

GG

GU

CG

GG

CU

CA

GG

AC

CC

GA

UG

GC

GU

AG

GC

CU

GC

GU

GG

GU

UC

AA

AU

CC

CA

CC

CC

CC

GC

A

GC

GG

GG

GU

GC

CC

GA

GC

CU

GG

CC

AA

AG

GG

GU

CG

GG

CU

CA

GG

AC

CC

GA

UG

GC

GU

AG

GC

CU

GC

GU

GG

GU

UC

AA

AU

CC

CA

CC

CC

CC

GC

A

Figure: Secondary structure (left) and base pairing probability matrix diagram (right) of a tRNA fromAeropyrum pernix, an aeoribic hyper-thermophillic Archaea species.

6 / 70

Page 7: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

RNA structural alignment

D.P. from Hofacker et al. (2004):

S(i , j , k, l) = max

S(i + 1, j , k, l) + γ,

S(i , j , k + 1, l) + γ,

S(i + 1, j , k + 1, l) + σ(Ai ,Bk),maxh≤j ,q≤l S

M(i , h, k, q) + S(h + 1, j , q + 1, l)

SM(i , j , k, l) = S(i+1, j−1, k+1, l−1)+ΨAij +ΨB

kl+τ(Ai ,Aj ,Bk ,Bl)

7 / 70

Page 8: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Parallel dynamic programming

8 / 70

Page 9: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Previous work on parallelizing D.P.

◮ Used the bottom-up approach

◮ Analyze data dependencies and compute independent cells inparallel

◮ So only works on problems where it is feasible to applybottom-up, i.e. compute every single subproblem

◮ And requires careful analysis of data dependencies in eachparticular D.P. for parallelization

9 / 70

Page 10: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Overview of our approach

Parallelize any D.P. on a multicore processor with shared memory.

◮ Each thread computes the entire problem.

◮ The results are shared via a lock-free hash table in sharedmemory.

◮ Randomization of subproblem ordering causes divergence ofthread computations.

10 / 70

Page 11: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

General form of a D.P.

f (x̄) = if b(x̄) then g(x̄)else F (f (x̄1), . . . , f (x̄n))

where

◮ b(x̄) holds for the base cases

◮ g(x̄) is the result for base cases,

◮ F is a function combining the optimal answers to a number ofsub-problems x̄1, . . . , x̄n.

11 / 70

Page 12: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Computing the D.P.: serial version

f(x̄)v ← lookup(x̄)if v 6= KEY NOT FOUND

return v

if b(x̄) then v ← g(x̄)else

for i ∈ 1..nv [i ] ← f(x̄i )

v ← F (v [1], . . . , v [n])insert(x̄ ,v)return v

12 / 70

Page 13: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Computing the D.P.: parallel version

f(x̄)v ← par lookup(x̄)if v 6= KEY NOT FOUND

return v

if b(x̄) then v ← g(x̄)else

for i ∈ 1..n in random order

v [i ] ← f(x̄i )v ← F (v [1], . . . , v [n])

par insert(x̄ ,v)return v

13 / 70

Page 14: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

The D.P.s we use to demonstrate our method

◮ knapsack

◮ shortest paths

◮ RNA structural alignment (we are familiar with this already)

◮ open stacks (skipping this here)

For the D.P. experiments, we used the open addressing hash table.

14 / 70

Page 15: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Application to knapsack

k(i ,w) =

0 if i = 0k(i − 1,w) if w < wi

max{k(i − 1,w), otherwisek(i − 1,w − wi ) + pi}

There are two possible orderings of the two subproblems in lastcase. Each thread chooses one of the two, with equal probability.

15 / 70

Page 16: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Shortest paths

Floyd-Warshall algorithm: s(i , j , k) is the length of the shortestpath from node i to node j using only 1, . . . , k as intermediatenodes.

s(i , j , k) =

0, if i = j

dij , if k = 0min{s(i , j , k − 1), otherwise

s(i , k, k − 1) + s(k, j , k − 1)}

where dij is the weight (distance) of the directed edge from i .The randomization chooses, with equal probability, between the3! = 6 orderings of the subproblems.

16 / 70

Page 17: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

RNA structural alignment (revisited)

D.P. from Hofacker et al. (2004):

S(i , j , k, l) = max

S(i + 1, j , k, l) + γ,

S(i , j , k + 1, l) + γ,

S(i + 1, j , k + 1, l) + σ(Ai ,Bk),maxh≤j ,q≤l S

M(i , h, k, q) + S(h + 1, j , q + 1, l)

SM(i , j , k, l) = S(i+1, j−1, k+1, l−1)+ΨAij +ΨB

kl+τ(Ai ,Aj ,Bk ,Bl)

To randomize, we take a random permutation of the list of allsubproblems in the S(·) and SM(·) computation.

17 / 70

Page 18: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

UltraSPARC T1 results (1)

0 5 10 15 20 25 30

02

46

810

12

threads

spee

dup

knapsackknapsack (no rand)shortest pathsshortest paths (no rand)RNA struct. alignmentRNA struct. alignment (no rand)open stacks (tiebreak)open stacks (full)open stacks (no rand)

18 / 70

Page 19: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

UltraSPARC T1 results (2)

problem speedup threads

knapsack 8.97 31shortest paths 9.88 32RNA struct. alignment 9.17 31open stacks (tiebreak) 6.96 27open stacks (full) 10.83 32

19 / 70

Page 20: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Conclusions (parallel dynamic programming)

◮ We demonstrated a 10x speedup (for 32 threads) on the openstacks D.P.

◮ Applicable to any dynamic program

◮ Including very large ones where bottom-up impractical

◮ Can also be applied to smaller problems using array not hashtable

◮ No need to analyze data dependencies etc. for vectorization

◮ Only need to consider how to randomize subproblem ordering:generally trivial but “more random” is better

20 / 70

Page 21: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Protein substructure searching

(the short version — no equations)

21 / 70

Page 22: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Introduction

◮ We want to search for occurrence of substructures in adatabase of structures.

◮ This is important to understand protein structure andfunction.

◮ Structural matching is a computationally challenging problem.

22 / 70

Page 23: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Protein structure in 3 easy steps (1): Primary

>1UBI:A|PDBID|CHAIN|SEQUENCE

MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRL

IFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG

23 / 70

Page 24: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Protein structure in 3 easy steps (2): Secondary

24 / 70

Page 25: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Protein structure in 3 easy steps (3): Tertiary

25 / 70

Page 26: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Existing methods

◮ There are lots; I’m not going to go into details here.

◮ Basically two classes:

1. Residue or atom level structural alignment;2. SSE matching

◮ Class 1 methods are slow as they work at a detailed level sohave large amounts of data;

◮ Class 2 methods are slow as they tend to require solvingcomputationally hard problems (but on much smaller amountsof data than class 1).

◮ Lots of heuristic methods to get faster approximations.

26 / 70

Page 27: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

TableauSearch

◮ Konagurthu, Stuckey, Lesk (2008) Bioinformatics24(5):645–651.

◮ This work is an extension of the work cited above.

27 / 70

Page 28: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Tableau encoding

P

O

L R

E D

S T

45°−45°

−135° 135°

90°

180°

−90°

28 / 70

Page 29: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

β-grasp query structure and tableau

e

OT e

LE RT xa

PD OS RD xg

RT LE RT LS e

LE RD LE LS OT e

RT LS LS RD PE OS xg

PE RT LE RD OT PE RT e

0.000

2.654 0.000

−1.173 2.147 1.000

0.389 −2.757 1.352 3.000

2.040 −1.438 2.079 −1.651 0.000

−1.258 1.556 −1.108 −1.647 2.985 0.000

1.693 −1.808 −1.851 1.309 −0.370 −2.913 3.000

−0.590 2.102 −1.231 0.959 2.566 −0.721 2.264 0.000

(a) (b)

(c) (d)

N

12

3

4

5

6

7

8

C

29 / 70

Page 30: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Maximally-similar subtableaux

◮ Finding a substructure within a structure is to findmaximally-similar subtableaux in the two tableaux.

◮ This problem can be formulated as a quadratic integerprogram (QIP).

◮ This problem is equivalent to the quadratic assignmentproblem and thus NP-hard.

30 / 70

Page 31: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

TableauSearch

◮ TableauSearch is a method of approximating the optimummatching score.

◮ It uses dynamic programming (DP) in an alignment-likeapproach.

◮ It is extremely fast; e.g. searching 75632 ASTRAL domains in66s (Konagurthu et al. (2008)).

◮ But it is only an approximation to the score, and can missmatches.

◮ More importantly, it is a global alignment, giving high scoresto similar structures - we cannot use it to search for asubstructure in a larger structure.

◮ It also depends on SSE ordering (sequence) being maintained

◮ IR Tableau (Zhang et al. 2010) is an even faster method.

31 / 70

Page 32: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

MNAligner

◮ Li et al. 2007 “Alignment of molecular networks by integerquadratic programming” Bioinformatics 23(13):1631-1639

◮ The formulation of this problem as a QIP in this paper isstrikingly similar to the tableau matching QIP.

◮ But the authors note that the constraints are totallyunimodular.

◮ This means that the QIP can be relaxed to a QP which willhave an integer solution(under certain conditions)

32 / 70

Page 33: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

QP tableau search

◮ So by relaxing the QIP to a QP as in MNAligner, we canmuch more efficiently solve it. This is essentially what our QPtableau search method does.

◮ We also need to relax some of the constraints and penalizetheir violation in the objective function instead.

◮ I am leaving out the details of the formulation and solution byinterior point method for this short talk.

33 / 70

Page 34: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Evaluation

◮ We ran queries on the ASTRAL SCOP 1.73 95% sequenceidentity non-redundant subset (15273 domains).

◮ For assessing whole structure matches, we used a randomlychosen set of 200 query structures.

◮ For substructure matches, we must make some more manualcomparisons with other methods. We will just show anexample here.

◮ Similarly for non-linear matches.

34 / 70

Page 35: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

ROC curves in 200 query set in ASTRAL SCOP 95%

subset

False positive rate

Tru

e po

sitiv

e ra

te

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

QP tableau search (norm2)VASTSHEBATableauSearch (norm2)TOPS

35 / 70

Page 36: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Serpin B/C sheet substructure (left) found in 1MTP (right)

36 / 70

Page 37: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Conclusions (QP Tableau Search)

◮ Relaxing the QIP tableau matching algorithm to real QP andsolving by interior point method is a sensitive and efficientmethod to locate occurrences of a query substructure in astructure database.

◮ Although slower than the DP TableauSearch or IR Tableau,this technique is much faster than solving the QIP or ILPdirectly with CPLEX and (in common with the lattermethods):

◮ can handle substructure queries;◮ can remove the ordering constraint to find non-sequential

alignments;◮ can return the SSE matches, not just a score.

◮ If only we could do it even faster...

37 / 70

Page 38: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Faster protein substructure searching

38 / 70

Page 39: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

To try to maximize the same objective function as before, butmuch faster, we will:

◮ use simulated annealing

◮ create a parallel implementation on a graphics processing unit(GPU)

39 / 70

Page 40: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

State of the system

Pairwise comparison of structure A with NA SSEs and structure Bwith NB SSEs, represented by tableaux TA and TB

The state of the matching is vector v , where

vi = j , 1 ≤ i ≤ NA, 0 ≤ j ≤ NB

means that the ith SSE in structure A is matched with the jth SSEin structure B, or with no SSE if vi = 0.

40 / 70

Page 41: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Initial state

Each SSE in structure A is matched with the first SSE of the sametype (helix, strand) in structure B, with probability pm.

41 / 70

Page 42: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Move to neighbor state

◮ An SSE i is chosen uniformly at random in structure A and itsmapping changed to a random SSE j of the same type instructure B, (optionally) without violating the non-crossingconstraint.

◮ If there is no such j , then set vi = 0 i.e. that SSE is no longermatched.

42 / 70

Page 43: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Simulated annealing schedule

◮ a move is accepted if objective function is better, or if

exp(g(v ′)−g(v)T

) > p where p is a random number in [0, 1]

◮ T ← αT for the next iteration

◮ repeat until iteration limit (100 iterations) reached

Slightly different from “vanilla” simulated annealing:

◮ The best state ever found is remembered, not necessarily thefinal state.

◮ The whole schedule is run M times, we call each a restart.

◮ The answer is the best objective value (and its state) everfound in any of the restarts.

43 / 70

Page 44: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Graphics Programming Units

◮ Originally for graphics rendering, now “GPGPU” (generalpurpose GPU) also used for other parallel algorithms.

◮ NVIDIA proprietary CUDA: extensions to C for GPGPUprogramming

◮ OpenCL is a standard for doing the same thing

◮ Data-parallelism: basically SIMD - very many threads runningthe same code simultaneously on different data locations.

◮ “co-processing” — host CPU has to send data to GPU andthen get results back from GPU.

◮ Need to use a large number (thousands) of threads to getgood results

44 / 70

Page 45: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

GPU Memory hierarchy

◮ Large, high latency global memory

◮ Small (e.g. 16 KB) fast shared memory

◮ Small fast constant (readonly) memory

◮ Thread local memory: this “local” memory is slow like global!

◮ No cache: the programmer manages memory hierarchy.

45 / 70

Page 46: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

CUDA programming model

Thread

Key:

Block 5Block 3 Block 4

Grid

Block 2Block 0

Shared memory

Block 1

46 / 70

Page 47: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Parallelization using CUDA (1)

◮ The query data is stored in the constant memory, all threadscan read it quickly

◮ The whole database of structures is stored in the large (slow)global memory

◮ Two levels of parallelism:◮ Each block of threads does a single pairwise comparison of the

query to a structure in the database.◮ Each thread in the block runs the simulated annealing

schedule, so we do the restarts in parallel.

◮ Each block of threads first parallel copies the databasestructure it is going to use into its shared memory for fastaccess.

47 / 70

Page 48: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Parallelization using CUDA (2)

◮ The only synchronization necessary is:◮ a barrier after loading structures into shared memory, so all

threads in a block can safely use it◮ a barrier at the end of the restart to do the MAX reduction

operation to find the best value of objective function over allthreads in the block.

◮ If the number of restarts is larger than the number of threadsin a block, threads in a block will have to run more restartsuntil done.

◮ If the number of database structures is larger than thenumber of blocks in the grid, the blocks will have to load andprocess more db structures until they are done.

48 / 70

Page 49: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Parallelization using CUDA (3)

◮ Structures too large for the shared or constant memory canstill be handled, by leaving them in the global memory, butthis is slower.

◮ The tableau representation is compact — means we don’thave a problem with overhead communicating data to GPU(and individual structures are usually small enough for fastshared/constant memory).

49 / 70

Page 50: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

NVIDIA hardware used in experiments

Tesla C1060 4 GB global memory, 30 multiprocessors, 240 cores,1.30 GHz.

GTX 285 1 GB global memory, 30 multiprocessors, 240 cores,1.48 GHz.

We used 128 threads/block and 128 blocks/grid (16384 threadstotal) for all experiments.

50 / 70

Page 51: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Results for ASTRAL 95% 200 query set

standard 95% C. I.Method Platform Restarts Elapsed time AUC error lower upperSHEBA CPU - 25 h 22 m 0.931 0.004 0.924 0.938SA Tableau Search CPU 4096 142 h 42 m 0.930 0.004 0.923 0.937SA Tableau Search GTX 285 4096 4 h 11 m 0.930 0.004 0.923 0.937SA Tableau Search Tesla C1060 4096 5 h 40 m 0.930 0.004 0.923 0.937DaliLite CPU - 620 h 03 m 0.929 0.004 0.922 0.936SA Tableau Search CPU 128 4 h 18 m 0.921 0.004 0.913 0.929SA Tableau Search GTX 285 128 0 h 08 m 0.920 0.004 0.912 0.927SA Tableau Search Tesla C1060 128 0 h 11 m 0.920 0.004 0.912 0.927QP Tableau Search CPU - 157 h 51 m 0.914 0.004 0.906 0.922LOCK2 CPU - 208 h 09 m 0.912 0.004 0.904 0.920TableauSearch CPU - 1 h 11 m 0.858 0.005 0.848 0.867TOPS CPU - 1 h 11 m 0.855 0.005 0.845 0.864VAST CPU - 14 h 26 m 0.855 0.005 0.846 0.865IR Tableau CPU - 0 h 01 m 0.854 0.005 0.844 0.864YAKUSA CPU - 4 h 14 m 0.827 0.005 0.816 0.837

51 / 70

Page 52: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

AUC versus elapsed time on CPU and GPUs

0 200 400 600 800

0.92

00.

922

0.92

40.

926

0.92

80.

930

elapsed time (minutes)

AU

C

128

256

512

1024 2048

4096 8192

128

256

512

1024 2048

4096 8192

128

256

GTX 285 Tesla C1060 CPU (AMD Opteron)

52 / 70

Page 53: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

β-grasp motif query

53 / 70

Page 54: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Motif query results

standard 95% C. I.Query Method Platform Restarts Elapsed time AUC error lower upperd1ubia SA GTX 285 128 00 m 03 s 0.918 0.011 0.896 0.940d1ubia SA CPU 128 01 m 17 s 0.912 0.011 0.890 0.935d1ubia QP CPU - 05 m 11 s 0.902 0.012 0.879 0.926d1ubia TOPS CPU - 00 m 10 s 0.894 0.012 0.870 0.918β-grasp SA CPU 128 01 m 11 s 0.939 0.010 0.920 0.958β-grasp QP CPU - 02 m 01 s 0.938 0.010 0.918 0.957β-grasp SA GTX 285 128 00 m 03 s 0.934 0.010 0.914 0.954β-grasp TOPS CPU - 00 m 09 s 0.847 0.014 0.819 0.875serpin B/C sheet SA GTX 285 128 00 m 03 s 0.993 0.013 0.968 1.019serpin B/C sheet SA CPU 128 01 m 19 s 0.991 0.015 0.962 1.021serpin B/C sheet QP CPU - 08 m 16 s 0.986 0.019 0.949 1.023serpin B/C sheet TOPS CPU - 00 m 24 s 0.491 0.054 0.385 0.597

54 / 70

Page 55: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Conclusions (SA Tableau Search)

◮ Comparable in speed and accuracy with widely used methods

◮ Up to 34× speedup on GPU over CPU

◮ Making it one of the fastest available methods

◮ The first use of a GPU for protein structure search

◮ If only there were a graphical interface for building motifqueries...

55 / 70

Page 56: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Protein structure cartoons

56 / 70

Page 57: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Cartoons

◮ hand-drawn

◮ HERA / ProMotif (Hutchinson and Thornton 1990, 1996)

◮ TOPS (Flores et al. 1994, Westhead et al. 1999)

◮ hand-drawn with TopDraw (Bond, 2003)

◮ PDBSum topology cartoons (Laskowski et al. 2006), based onHERA.

57 / 70

Page 58: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Example - 1HH1

Holliday junction resolvase, PDB id 1HH1, Bond et al. 2001, image generated with PyMol

58 / 70

Page 59: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

HERA / ProMotif (Hutchinson and Thornton 1996)

1HH1

96 V

97 L

98 K

99 F

100 I

101 P

102 F

103 E

104 K

105 L

93 K

92 K

91 V

90 G

89 L

88 F

87 L

86 S

57 K

56 M

55 E

54 I

53 L

52 I

51 I

50 V

42 D

43 I

44 I

45 A

46 L

47 K

29 A

28 R

27 V

26 V

25 A

22 K

21 D

20 R

19 L

18 R

17 S

16 V

15 I

14 N

13 R

12 E

11 V

10 A

9 S

59 R

60 K

61 D

84 G

83 S

82 K

81 R

80 A

79 F

78 E

77 I

76 I

75 G

74 E

73 A

72 Q

71 E

70 R

69 R

68 V

67 Y

66 I

65 K

112 N

113 Y

114 V

115 A

123 D

124 L

125 E

126 D

127 L

128 V

129 R

130 L

131 V

132 E

133 A

134 K

135 I

136 S

137 R

138 T

139 L

140 D

108 T

107 R

106 R

N

C

24

23

24

64

85

85

95

122 122

111

116

59 / 70

Page 60: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

PDBSum

96

102

103105

86

93

50

57

42

47 25

29

9

22

N

5961

69

84

65

68

112

115

123

140

C

106108

60 / 70

Page 61: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

N1

C2

Figure: TOPS vs manually drawn with TopDraw

61 / 70

Page 62: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

N

A

12

3

4

5 B

6

7

8

9CC

Figure: Pro-origami vs manually drawn with TopDraw

62 / 70

Page 63: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

The essence of our approach

◮ Generate specifications of cartoon elements (SSEs)

◮ Generate connections and relationships between the cartoonelements as constraints

◮ Give this information to the constraint-based diagram editorDunnart (Wybrow et al. 2006; Dwyer et al. 2009)

◮ Dunnart will lay out the diagram consistent with theconstraints and can be used interactively to edit it,maintaining the constraints.

63 / 70

Page 64: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Plu-MACPF, PDB id 2QP2 (Rosado et al. 2007)

64 / 70

Page 65: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Comparison of TOPS, PDBSum and Pro-origami for Plu-MACPF

65 / 70

Page 66: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Using Pro-origami to make structural motif query

66 / 70

Page 67: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Conclusions

◮ We found that bounding for the RNA structural alignmentd.p. was not successful

◮ But we demonstrated a 9× speedup (32 threads) byparallelizing that d.p.,

◮ as a specific example of a completely general method forparallelizing any d.p., without regard to its specific properties.

◮ We developed two new algorithms for tableau-based proteinstructure/motif search,

◮ and a parallel implementation of one on a GPU — the firstuse of a GPU for the protein structure search problem, andone of the fastest available methods.

◮ We developed a system for automatically generating proteinstructure cartoons by means of constraint-based diagrams,

◮ and its use as a substructure query building interface to ourprotein substructure search algorithms.

67 / 70

Page 68: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Acknowledgments

◮ Prof. Peter Stuckey

◮ Dr Tony Wirth

◮ Dr Linda Stern

◮ A/Prof. Maria Garcia de la Banda

◮ Prof. Manuel Hermenegildo

◮ Dr Michael Wybrow

◮ Prof. James Whisstock

◮ Dr Arun Konagurthu

◮ VPAC

◮ Tech. services.

68 / 70

Page 69: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Publications

◮ A. Stivala, A. Wirth, P.J. Stuckey, “Tableau-based proteinsubstructure search using quadratic programming”, BMC

Bioinformatics (2009), 10:153

◮ A. Stivala, P.J. Stuckey, M. Garcia de la Banda, M.Hermenegildo, A. Wirth, “Lock-free parallel dynamicprogramming”, J. Parallel Distrib. Comput. (2010),70:839-848

◮ A. Stivala, P.J. Stuckey, A. Wirth, “Fast and accurate proteinsubstructure searching with simulated annealing and GPUs”,submitted to BMC Bioinformatics (under review)

◮ A. Stivala, M.J. Wybrow, A. Wirth, J.C. Whisstock, P.J.Stuckey, “Automatic generation of protein structure cartoonswith Pro-origami”, submitted to Bioinformatics (under review)

69 / 70

Page 70: PhD Completion Seminar Algorithms for the Study of RNA and ...munk.cis.unimelb.edu.au/~stivalaa/completion_slides.pdf · PhD Completion Seminar Algorithms for the Study of RNA and

Source code, data sets, web server, etc.

◮ QP Tableau Search:http://www.csse.unimelb.edu.au/~astivala/qpprotein

◮ SA Tableau Search:http://www.csse.unimelb.edu.au/~astivala/satabsearch

◮ Pro-origami server:http://munk.csse.unimelb.edu.au/pro-origami

70 / 70