Versatile fixed-parameter tractable sampling for multi ...

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

Versatile fixed-parameter tractable sampling formulti-target RNA design

Sebastian Will

Bled 2018

TBI · University of Vienna

Fixed-Parameter Tractable Sampling for RNADesign with Multiple Target Structures

Stefan Hammer1,2,3, Yann Ponty4,5,?, Wei Wang4,5, and Sebastian Will2

1 University Leipzig, Department of Computer Science and Interdisciplinary Centerfor Bioinformatics, 04107 Leipzig, Germany

2 University of Vienna, Faculty of Chemistry, Department of Theoretical Chemistry,1090 Vienna, Austria

3 University of Vienna, Faculty of Computer Science, Research Group Bioinformaticsand Computational Biology, 1090 Vienna, Austria

4 CNRS UMR 7161 LIX, Ecole Polytechnique, Bat. Turing, 91120 Palaiseau, France5 AMIBio team, Inria Saclay, Bat Alan Turing, 91120 Palaiseau, France

? To whom correspondence should be addressed: [email protected]

Motivation. Engineering artificial biological systems promises broad applicationsin synthetic biology, biotechnology and medicine. Here, the rational design ofmulti-stable RNA molecules is especially powerful, since RNA can be generatedwith highly specific properties and programmable functions. In particular, de-signing artificial riboswitches became popular due to their potential as versatilebiosensors [1]. Effective in-silico methods proved to greatly facilitate the designapproach and have tremendous impact on their cost and feasibility.

Statement of problem. Most methods for computational design share a similaroverall strategy: one or several initial seed sequences are generated and optimizedsubsequently. In this contribution we revisit the first main ingredient of (multi-target) design methods, namely the sampling of sequences, which energeticallyfavor several given target structures at the same time. While previous multi-target methods [4, 6] relied on ad-hoc sampling strategies, sampling seeds fromthe uniform distribution was solved only recently [3, 2].

Algorithmic contributions. We generalize Boltzmann sampling for RNA design,which was recently shown powerful for single targets in IncaRNAtion [5], to de-sign for multiple structural targets. After showing that even uniform sampling is#P-hard, we introduce the tree decomposition-based fixed parameter tractable(FPT) sampling algorithm RNARedPrint. Finally, we combine our FPT stochas-tic sampling algorithm with multi-dimensional Boltzmann sampling over distri-butions controlled by expressive RNA energy models. We show that sampling tsequences of length n for k target structures takes O(2d nk + t n k) time, whered := min(w+ c+ 1, 2(w+ 1)), depending on the tree width w of the dependencygraph (covering all dependencies between sequence positions introduced by theenergy function) as well as the number c of connected components in the com-patibility graph (covering the constraints enforcing canonical base pairings). Dueto a constraint framework, RNARedPrint supports generic Boltzmann-weightedsampling for arbitrary additive RNA energy models; this moreover enables tar-geting specific free energies or GC-content, compare Figure 1.

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

Multi-target design of RNA sequences

For example: design riboswitches for translational control

+

Multiple structures (=multiple design targets)

b

c

de

f gh i

j k l m

nopq

rs

a t

uv

a b c d e

f

gh

ijk l

mn o

pqr

s t u v

a

b

c

d

e f

g

h ij

k

l m n o

p

q

r st

u

v

abcdefghijklmnopqrstuv

(((((.)).(((..))).))).

((.))((...))..(((..)))

....(((((..)))...))...

Task: generate sequences with specific properties

• low/specific energy for multiple structures

• specific CG content

• specific energy differences

• specific sequence/structure motifs (enforce/forbid)

Approach:(defined) sampling

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

Uniform sampling: multiple structures

1 2 3 4 5

( . . ) .

. ( ( ) )

( ( . ) )

A A A U U

A A G U UA G A U UA G G U UG A A U CG A A U UG A G U CG A G U UG G A U CG G A U UG G G C CG G G C UG G G U CG G G U U...

• For uniform: choose first positionA : C : G : U = 4 : 4 : 10 : 10Then, e.g. after G, choose secondA : G = 4 : 6, . . .

• → counting

• (Why) is this hard?

Theorem: Counting of sequences formultiple targets is #P-hard.

Proof: equiv. to counting independent sets

1. dependency graph is bipartite

{A,G} vs. {C ,U}

2. A and C cannot pair: independent3. Selecting all As and C s, i.e. independent

sets, determines a design (and vice versa)

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

Uniform sampling: multiple structures

1 2 3 4 5

( . . ) .

. ( ( ) )

( ( . ) )

A A A U UA A G U UA G A U UA G G U UG A A U CG A A U UG A G U CG A G U UG G A U CG G A U UG G G C CG G G C UG G G U CG G G U U...

• For uniform: choose first positionA : C : G : U = 4 : 4 : 10 : 10Then, e.g. after G, choose secondA : G = 4 : 6, . . .

• → counting

• (Why) is this hard?

Theorem: Counting of sequences formultiple targets is #P-hard.

Proof: equiv. to counting independent sets

1. dependency graph is bipartite

{A,G} vs. {C ,U}

2. A and C cannot pair: independent3. Selecting all As and C s, i.e. independent

sets, determines a design (and vice versa)

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

Systematic efficient counting (and sampling)

Recipe:

1. Decompose dependency graph2. Apply dynamic programming3. Sample

1 2 3 4 5

( . . ) .

. ( ( ) )

( ( . ) )

target structures

1

45

2 3

dependency graph

5 2

5 42

4

5 14

1

4 3

3

2

5

2

tree decomposition

Theorem: Counting and sampling efficient for fixed tree width.

(So far) Blueprint / RNAdesign similar, but ear decomposition[Blueprint] Hammer, Tschiatschek, Flamm, Hofacker, Findeiß. Bioinformatics, 2017.

[RNAdesign] Hoener, Hammer, Abfalter, Hofacker, Flamm, Stadler, Biopolymers, 2013.

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

Systematic efficient counting (and sampling)

Recipe:

1. Decompose dependency graph2. Apply dynamic programming3. Sample

1 2 3 4 5

( . . ) .

. ( ( ) )

( ( . ) )

target structures

1

45

2 3

dependency graph

5 2

5 42

4

5 14

1

4 3

3

2

5

2

Cts_4_5 A C G UA:1 0 0 0C:0 1 0 0G:1 0 2 0U:0 1 0 2

Cts_4A:1C:1G:2U:2

Cts_2_5 A C G UA:? ? ? ?C:? ? ? ?G:? ? ? ?U:? ? ? ?

tree decomposition

Theorem: Counting and sampling efficient for fixed tree width.

(So far) Blueprint / RNAdesign similar, but ear decomposition[Blueprint] Hammer, Tschiatschek, Flamm, Hofacker, Findeiß. Bioinformatics, 2017.

[RNAdesign] Hoener, Hammer, Abfalter, Hofacker, Flamm, Stadler, Biopolymers, 2013.

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

From uniform sampling to Boltzmann sampling

• counts for all subtrees −→ uniform sampling

• Analogously: partition functions −→ Boltzmann sampling

Boltzmann sampling: P(S) ∼ exp(−βE (S)).

• Energy = (weighted) sum of energies for single structures

• energy models• Base pair model

“like counting”• Nearest neighbor model (Turner model)

requires multi-ary dependencies: constraint framework∗

• Stacking model“in-between”, score stacks (4-cliques of dependencies)

∗Constraint networks / cluster tree elimination [Rina Dechter]

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

Example dependency graphs((((.((....)).)))).((.(((.((((.....(((..((((((.((..(((.(.....).)))..)).)).))))..)))..)))).))).))....

..(((((.....(((.(((((((.....))))..))).))).....)))))..((((((((((...))).)....))))))...((((((....))))))

......(((((.....(((...(((.((.((.(((....((......))...))).)).)))))..))).............))))).((((...)))).

base pair1 18

2 17

3

1651

4

1550

5

49

6

13

48

7

1247

87

8 86

9 85

10 84

11

83

41

14 40

39

3769

3668

19 3567

20 3296

21 3195

22 3023 29

6493

24 6392

25

6291

26

27

6189

28

6088

58

57

33 55

34 54

5382

38 81

78

42

77

43

76

44 75

45 73

46 72

70

52 66

65

80

56

79

59

71

74

100

99

98

97

90

94

12

17

18

316

4

15

50

51

5

49

6

48

7

12

13

47

8

86

87

9

85

10

8411

83

14

40

41

39

36

37

68

6919

35

67

20

21

31

32

95

96

22

30

23

2924

6364

92

93

25

62

91

26

27

28

60

61

88

89

57

5833

34

54

55

53

82

38

81

42

77

78

43

76

4475

4546

72

73

70

52

65

66

80

56

79

59

71

74

99

100

98

97

90

94

stacking

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

Example tree decompositions((((.((....)).)))).((.(((.((((.....(((..((((((.((..(((.(.....).)))..)).)).))))..)))..)))).))).))....

..(((((.....(((.(((((((.....))))..))).))).....)))))..((((((((((...))).)....))))))...((((((....))))))

......(((((.....(((...(((.((.((.(((....((......))...))).)).)))))..))).............))))).((((...)))).

base pair 48 41

48 4113

13

41 78

78

70 48

70

48 136

6

57 78

57

57 31

31

21 31

21

95 21

95

95 90

90

98 90

98

98 87

87

29 87

29

7 87

7

23 29

23

64 23

64

23 93

93

54 64

54

54 34

34

81 54

81

33

55 33

55

11

11 83

83

84

10 84

10

9

85 9

85

8

8 86

86

83 36

36

36 18

18

68 18

68

1 18

1

62 68

62

62 25

25

56 62

56

91 25

91

97 91

97

97 88

88

99 86

99

30 86

30

99 89

89

85 100

100

96 89

96

89 27

27

92 96

92

20 96

20

92 24

24

63 24

63

63 67

67

19 67

19

61 27

61

69 61

69

28 88

28

28 60

60

71 60

71

59

76 59

76

30 58

58

30 22

22

77 58

77

56 79

79

80 55

80

20 32

32

19 35

35

53 35

53

39

15 39

15

49 69

49

17 69

17

40 49

40

49 5

5

40 14

14

47 7

47

7 12

12

4 15

4

4 50

50

51

51 3

3

53 65

65

66

66 52

52

72

46 72

46

73

45 73

45

75

75 44

44

76 43

43

77 42

42

81 38

38

17 37

37

17 2

2

37 82

82

16 3

16

97 90

97 9098

98

97 9098 63

63

97 62 9098 63

62

97 61 6290 98 63

61

97 31 6162 90 98

63

31

97 64 3161 62 90

98 63

64

97 83 6431 61 6290 98 63

83

30 97 83 6431 61 62 90

98 63

30

30 97 83 647 31 61 6290 98 63

7

30 83 6431 7 6162 63 35

35

30 97 83 6431 7 61 6223 90 98 63

23

30 41 83 647 31 61 62

63 35

41

30 81 41 8364 31 7 6162 63 35

81

81 41 8368 7 61

62 63 35

68

30 81 7941 64 31

35

79

97 96 83 6461 23 90 6330 31 7 62

98

96

30 97 96 8331 7 61 2923 90 98

29

97 96 9164 62 2390 98 63

91

17 81 41 8368 7 61 62

63 35

17

17 81 41 8368 7 61 6218 63 35

18

17 81 8368 36 6218 63 35

36

17 41 687 61 6269 18

69

30 81 7941 56 64

31 35

56

30 41 7956 31 58

58

81 80 7956 64 35

80

67 68 3662 18 63

35

67

17 37 8183 36 18

37

17 15 417 18 69

15

30 97 96 8331 7 61 2923 90 89 98

89

30 97 83 967 61 29 9098 89 87

87

30 96 3129 23 90

89 22

22

30 99 837 29 9098 89 87

99

97 96 6129 88 89

98 87

88

30 99 837 29 8698 87

86

8 83 997 86 87

8

97 92 9691 64 62

23 63

92

28 61 2988 89 87

28

92 91 6424 62 23

63

24

30 96 9531 90 89

22

95

67 19 6836 18 35

19

30 79 4156 57 31

58

57

41 79 5657 58 78

78

81 80 7956 64 55

35

55

81 80 5464 55 35

54

30 96 9521 31 22

21

17 4 1541 7 18

69

4

17 4 1541 7 18

69 3

3

4 41 1549 7 69

3

49

17 4 1516 18 3

16

48 4 4115 49 7

69 3

48

48 4 4115 7 49

13 3

13

70 4849 69

70

92 91 2462 63 25

25

92 2423 93

93

48 4 749 13 3

6

6

48 41 1540 49 13

40

48 4 495 3 6

5

48 76 47

47

7 1312 6

12

41 15 4013 14

14

50 4 495 3

50

54 34 6455 35

34

77 41 5758 78

77

37 81 8336 82

82

28 61 8988 27

27

20 96 9521 31

20

17 16 182 3

2

8 85 9983 86

85

53 54 6434 35

53

33 5434 55

33

8 85 8386 9

9

85 99100 86

100

85 8384 9

84

10 85 8384 9

10

10 8311 84

11

15 3940 14

39

51 450 3

51

53 5465 64

65

81 3738 82

38

28 6160 27

60

32 2021 31

32

17 118 2

1

77 41 4258 78

42

77 7642 58

76

77 7658 59

59

77 7642 43

43

53 6566

66

53 6566 52

52

72

46 72

46

46 7273

73

45 4672 73

45

75 7643

75

75 7644 43

44

stacking

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

Targeting specific properties:multi-dimensional Boltzmann sampling

A B

Weight and combine single structure energies and features (“GC content”)

A Learn weights by adaptive scheme→ target specific energies and GC content

B Sampling: targets Turner energies by linear fitting of energies

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

Boltzmann vs. uniform samplingfor multi-target RNA design

Dataset Redprint Uniform ImprovementSeeds 2str 21.67 (±4.38) 37.74 (±6.45) 73%

3str 18.09 (±3.98) 30.49 (±5.41) 71%4str 19.94 (±3.84) 32.29 (±5.24) 63%

Optimized 2str 5.84 (±1.31) 7.95 (±1.76) 28%3str 5.08 (±1.10) 7.04 (±1.52) 31%4str 8.77(±1.48) 13.13 (±2.13) 37%

Multi-target design objective [Blueprint] on the Modena benchmark

Modena benchmark: Taneda. BMC Bioinformatics, 2015.

2str = ”rnatabupath”

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

Thank you!

Questions? ⇒ me, , and/or

b

c

de

f gh i

j k l m

nopq

rs

a t

uv

a b c d e

f

gh

ijk l

mn o

pqr

s t u v

a

b

c

d

e f

g

h ij

k

l m n o

p

q

r st

u

v

R1

R2

R3

i) Input Structures

ab c d e f g h i j k l mn o p q r s t u v

ii) Merged Base-Pairs

aeg

p uk

csn

j

qtb

d h m

f l o

vi

r

iii) Compatibility Graph

∅

v o

o ll ff r

l i

j qq t

t bb dd hh m

k pk p g

g n g p uu g e

u e a e ss c

iv) Tree Decomposition

RNARedPrint

Partition FunctionStochastic Backtrack

GACUUUAGCU

UGUUUGAAGU

UA

GACUUGGGAC

UUUUAGGUGU

CU

UUAGAUUCCA

GGGGCUUGU

AGG

GGUUUAGGGC

CUUUAGGUAU

CU

AUUGUGACGG

UCGUGACUAG

UU

. . .

0

0

0

−5

−5

−5

kcal.mol−1

kcal.mol−1

kcal.mol−1

E1

E2

E3GC%

0 50 100

v) Weight Optimization (Adaptive Sampling)

Weightsupd

ate

GCCGCGGUAGCUACAGCCGGCUUUGGGGUUGGGUAGACUCCGGUGCUGCAGCGGCUGUGGCUGGCCGGUUCUGGUUUGCUUAGGGCUACGACGGCGGUGCCGGCAUUUGC. . .

0kcal.mol−1

E1E2E3

GC%

0 50 100

vi) Final Designs

(workflow shown for base pair energy model; for more sophisticated models and

applications, we introduced n-ary dependencies in a flexible constraint framework)

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

APPENDIX

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

Uniform sampling: one single structure

1 2 3 4

. ( . )

A A A UA C A GA G A CA G A UA G A U

...

U U U AU U U G

→ 4 · 6 · 4 = 96

• draw unpaired bases uniformly

• choose 1st end of each base pair, s.t. A : C : G : U = 1 : 1 : 2 : 2

• select 2nd end accordingly (if first is G or U: choose)

Sa

mp

lin

gfo

rm

ult

i-ta

rget

RN

Ad

esig

n·

S.W

ill

Tree widths over benchmark sets

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●

●●●●●●●●

●

●

Modena RF3 RF4 RF5 RF6

510

15

Tree

wid

th

●●

●

●●●●

●●●

●●

Modena RF3 RF4 RF5 RF65

1015

2025

30

Base pair model vs. stacking model

Versatile fixed-parameter tractable sampling for multi ...

Documents

Transcript of Versatile fixed-parameter tractable sampling for multi ...