Versatile fixed-parameter tractable sampling for multi ...
Transcript of Versatile fixed-parameter tractable sampling for multi ...
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Versatile fixed-parameter tractable sampling formulti-target RNA design
Sebastian Will
Bled 2018
TBI · University of Vienna
Fixed-Parameter Tractable Sampling for RNADesign with Multiple Target Structures
Stefan Hammer1,2,3, Yann Ponty4,5,?, Wei Wang4,5, and Sebastian Will2
1 University Leipzig, Department of Computer Science and Interdisciplinary Centerfor Bioinformatics, 04107 Leipzig, Germany
2 University of Vienna, Faculty of Chemistry, Department of Theoretical Chemistry,1090 Vienna, Austria
3 University of Vienna, Faculty of Computer Science, Research Group Bioinformaticsand Computational Biology, 1090 Vienna, Austria
4 CNRS UMR 7161 LIX, Ecole Polytechnique, Bat. Turing, 91120 Palaiseau, France5 AMIBio team, Inria Saclay, Bat Alan Turing, 91120 Palaiseau, France
? To whom correspondence should be addressed: [email protected]
Motivation. Engineering artificial biological systems promises broad applicationsin synthetic biology, biotechnology and medicine. Here, the rational design ofmulti-stable RNA molecules is especially powerful, since RNA can be generatedwith highly specific properties and programmable functions. In particular, de-signing artificial riboswitches became popular due to their potential as versatilebiosensors [1]. Effective in-silico methods proved to greatly facilitate the designapproach and have tremendous impact on their cost and feasibility.
Statement of problem. Most methods for computational design share a similaroverall strategy: one or several initial seed sequences are generated and optimizedsubsequently. In this contribution we revisit the first main ingredient of (multi-target) design methods, namely the sampling of sequences, which energeticallyfavor several given target structures at the same time. While previous multi-target methods [4, 6] relied on ad-hoc sampling strategies, sampling seeds fromthe uniform distribution was solved only recently [3, 2].
Algorithmic contributions. We generalize Boltzmann sampling for RNA design,which was recently shown powerful for single targets in IncaRNAtion [5], to de-sign for multiple structural targets. After showing that even uniform sampling is#P-hard, we introduce the tree decomposition-based fixed parameter tractable(FPT) sampling algorithm RNARedPrint. Finally, we combine our FPT stochas-tic sampling algorithm with multi-dimensional Boltzmann sampling over distri-butions controlled by expressive RNA energy models. We show that sampling tsequences of length n for k target structures takes O(2d nk + t n k) time, whered := min(w+ c+ 1, 2(w+ 1)), depending on the tree width w of the dependencygraph (covering all dependencies between sequence positions introduced by theenergy function) as well as the number c of connected components in the com-patibility graph (covering the constraints enforcing canonical base pairings). Dueto a constraint framework, RNARedPrint supports generic Boltzmann-weightedsampling for arbitrary additive RNA energy models; this moreover enables tar-geting specific free energies or GC-content, compare Figure 1.
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Multi-target design of RNA sequences
For example: design riboswitches for translational control
+
Multiple structures (=multiple design targets)
b
c
de
f gh i
j k l m
nopq
rs
a t
uv
a b c d e
f
gh
ijk l
mn o
pqr
s t u v
a
b
c
d
e f
g
h ij
k
l m n o
p
q
r st
u
v
abcdefghijklmnopqrstuv
(((((.)).(((..))).))).
((.))((...))..(((..)))
....(((((..)))...))...
Task: generate sequences with specific properties
• low/specific energy for multiple structures
• specific CG content
• specific energy differences
• specific sequence/structure motifs (enforce/forbid)
Approach:(defined) sampling
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Multi-target design of RNA sequences
For example: design riboswitches for translational control
+
Multiple structures (=multiple design targets)
b
c
de
f gh i
j k l m
nopq
rs
a t
uv
a b c d e
f
gh
ijk l
mn o
pqr
s t u v
a
b
c
d
e f
g
h ij
k
l m n o
p
q
r st
u
v
abcdefghijklmnopqrstuv
(((((.)).(((..))).))).
((.))((...))..(((..)))
....(((((..)))...))...
Task: generate sequences with specific properties
• low/specific energy for multiple structures
• specific CG content
• specific energy differences
• specific sequence/structure motifs (enforce/forbid)
Approach:(defined) sampling
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Multi-target design of RNA sequences
For example: design riboswitches for translational control
+
Multiple structures (=multiple design targets)
b
c
de
f gh i
j k l m
nopq
rs
a t
uv
a b c d e
f
gh
ijk l
mn o
pqr
s t u v
a
b
c
d
e f
g
h ij
k
l m n o
p
q
r st
u
v
abcdefghijklmnopqrstuv
(((((.)).(((..))).))).
((.))((...))..(((..)))
....(((((..)))...))...
Task: generate sequences with specific properties
• low/specific energy for multiple structures
• specific CG content
• specific energy differences
• specific sequence/structure motifs (enforce/forbid)
Approach:(defined) sampling
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Uniform sampling: multiple structures
1 2 3 4 5
( . . ) .
. ( ( ) )
( ( . ) )
A A A U U
A A G U UA G A U UA G G U UG A A U CG A A U UG A G U CG A G U UG G A U CG G A U UG G G C CG G G C UG G G U CG G G U U...
• For uniform: choose first positionA : C : G : U = 4 : 4 : 10 : 10Then, e.g. after G, choose secondA : G = 4 : 6, . . .
• → counting
• (Why) is this hard?
Theorem: Counting of sequences formultiple targets is #P-hard.
Proof: equiv. to counting independent sets
1. dependency graph is bipartite
{A,G} vs. {C ,U}
2. A and C cannot pair: independent3. Selecting all As and C s, i.e. independent
sets, determines a design (and vice versa)
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Uniform sampling: multiple structures
1 2 3 4 5
( . . ) .
. ( ( ) )
( ( . ) )
A A A U UA A G U UA G A U UA G G U UG A A U CG A A U UG A G U CG A G U UG G A U CG G A U UG G G C CG G G C UG G G U CG G G U U...
• For uniform: choose first positionA : C : G : U = 4 : 4 : 10 : 10Then, e.g. after G, choose secondA : G = 4 : 6, . . .
• → counting
• (Why) is this hard?
Theorem: Counting of sequences formultiple targets is #P-hard.
Proof: equiv. to counting independent sets
1. dependency graph is bipartite
{A,G} vs. {C ,U}
2. A and C cannot pair: independent3. Selecting all As and C s, i.e. independent
sets, determines a design (and vice versa)
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Uniform sampling: multiple structures
1 2 3 4 5
( . . ) .
. ( ( ) )
( ( . ) )
A A A U UA A G U UA G A U UA G G U UG A A U CG A A U UG A G U CG A G U UG G A U CG G A U UG G G C CG G G C UG G G U CG G G U U...
• For uniform: choose first positionA : C : G : U = 4 : 4 : 10 : 10Then, e.g. after G, choose secondA : G = 4 : 6, . . .
• → counting
• (Why) is this hard?
Theorem: Counting of sequences formultiple targets is #P-hard.
Proof: equiv. to counting independent sets
1. dependency graph is bipartite
{A,G} vs. {C ,U}
2. A and C cannot pair: independent3. Selecting all As and C s, i.e. independent
sets, determines a design (and vice versa)
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Systematic efficient counting (and sampling)
Recipe:
1. Decompose dependency graph2. Apply dynamic programming3. Sample
1 2 3 4 5
( . . ) .
. ( ( ) )
( ( . ) )
target structures
1
45
2 3
dependency graph
5 2
5 42
4
5 14
1
4 3
3
2
5
2
tree decomposition
Theorem: Counting and sampling efficient for fixed tree width.
(So far) Blueprint / RNAdesign similar, but ear decomposition[Blueprint] Hammer, Tschiatschek, Flamm, Hofacker, Findeiß. Bioinformatics, 2017.
[RNAdesign] Hoener, Hammer, Abfalter, Hofacker, Flamm, Stadler, Biopolymers, 2013.
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Systematic efficient counting (and sampling)
Recipe:
1. Decompose dependency graph2. Apply dynamic programming3. Sample
1 2 3 4 5
( . . ) .
. ( ( ) )
( ( . ) )
target structures
1
45
2 3
dependency graph
5 2
5 42
4
5 14
1
4 3
3
2
5
2
Cts_4_5 A C G UA:1 0 0 0C:0 1 0 0G:1 0 2 0U:0 1 0 2
Cts_4A:1C:1G:2U:2
Cts_2_5 A C G UA:? ? ? ?C:? ? ? ?G:? ? ? ?U:? ? ? ?
tree decomposition
Theorem: Counting and sampling efficient for fixed tree width.
(So far) Blueprint / RNAdesign similar, but ear decomposition[Blueprint] Hammer, Tschiatschek, Flamm, Hofacker, Findeiß. Bioinformatics, 2017.
[RNAdesign] Hoener, Hammer, Abfalter, Hofacker, Flamm, Stadler, Biopolymers, 2013.
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
From uniform sampling to Boltzmann sampling
• counts for all subtrees −→ uniform sampling
• Analogously: partition functions −→ Boltzmann sampling
Boltzmann sampling: P(S) ∼ exp(−βE (S)).
• Energy = (weighted) sum of energies for single structures
• energy models• Base pair model
“like counting”• Nearest neighbor model (Turner model)
requires multi-ary dependencies: constraint framework∗
• Stacking model“in-between”, score stacks (4-cliques of dependencies)
∗Constraint networks / cluster tree elimination [Rina Dechter]
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
From uniform sampling to Boltzmann sampling
• counts for all subtrees −→ uniform sampling
• Analogously: partition functions −→ Boltzmann sampling
Boltzmann sampling: P(S) ∼ exp(−βE (S)).
• Energy = (weighted) sum of energies for single structures
• energy models• Base pair model
“like counting”• Nearest neighbor model (Turner model)
requires multi-ary dependencies: constraint framework∗
• Stacking model“in-between”, score stacks (4-cliques of dependencies)
∗Constraint networks / cluster tree elimination [Rina Dechter]
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Example dependency graphs((((.((....)).)))).((.(((.((((.....(((..((((((.((..(((.(.....).)))..)).)).))))..)))..)))).))).))....
..(((((.....(((.(((((((.....))))..))).))).....)))))..((((((((((...))).)....))))))...((((((....))))))
......(((((.....(((...(((.((.((.(((....((......))...))).)).)))))..))).............))))).((((...)))).
base pair1 18
2 17
3
1651
4
1550
5
49
6
13
48
7
1247
87
8 86
9 85
10 84
11
83
41
14 40
39
3769
3668
19 3567
20 3296
21 3195
22 3023 29
6493
24 6392
25
6291
26
27
6189
28
6088
58
57
33 55
34 54
5382
38 81
78
42
77
43
76
44 75
45 73
46 72
70
52 66
65
80
56
79
59
71
74
100
99
98
97
90
94
12
17
18
316
4
15
50
51
5
49
6
48
7
12
13
47
8
86
87
9
85
10
8411
83
14
40
41
39
36
37
68
6919
35
67
20
21
31
32
95
96
22
30
23
2924
6364
92
93
25
62
91
26
27
28
60
61
88
89
57
5833
34
54
55
53
82
38
81
42
77
78
43
76
4475
4546
72
73
70
52
65
66
80
56
79
59
71
74
99
100
98
97
90
94
stacking
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Example tree decompositions((((.((....)).)))).((.(((.((((.....(((..((((((.((..(((.(.....).)))..)).)).))))..)))..)))).))).))....
..(((((.....(((.(((((((.....))))..))).))).....)))))..((((((((((...))).)....))))))...((((((....))))))
......(((((.....(((...(((.((.((.(((....((......))...))).)).)))))..))).............))))).((((...)))).
base pair 48 41
48 4113
13
41 78
78
70 48
70
48 136
6
57 78
57
57 31
31
21 31
21
95 21
95
95 90
90
98 90
98
98 87
87
29 87
29
7 87
7
23 29
23
64 23
64
23 93
93
54 64
54
54 34
34
81 54
81
33
55 33
55
11
11 83
83
84
10 84
10
9
85 9
85
8
8 86
86
83 36
36
36 18
18
68 18
68
1 18
1
62 68
62
62 25
25
56 62
56
91 25
91
97 91
97
97 88
88
99 86
99
30 86
30
99 89
89
85 100
100
96 89
96
89 27
27
92 96
92
20 96
20
92 24
24
63 24
63
63 67
67
19 67
19
61 27
61
69 61
69
28 88
28
28 60
60
71 60
71
59
76 59
76
30 58
58
30 22
22
77 58
77
56 79
79
80 55
80
20 32
32
19 35
35
53 35
53
39
15 39
15
49 69
49
17 69
17
40 49
40
49 5
5
40 14
14
47 7
47
7 12
12
4 15
4
4 50
50
51
51 3
3
53 65
65
66
66 52
52
72
46 72
46
73
45 73
45
75
75 44
44
76 43
43
77 42
42
81 38
38
17 37
37
17 2
2
37 82
82
16 3
16
97 90
97 9098
98
97 9098 63
63
97 62 9098 63
62
97 61 6290 98 63
61
97 31 6162 90 98
63
31
97 64 3161 62 90
98 63
64
97 83 6431 61 6290 98 63
83
30 97 83 6431 61 62 90
98 63
30
30 97 83 647 31 61 6290 98 63
7
30 83 6431 7 6162 63 35
35
30 97 83 6431 7 61 6223 90 98 63
23
30 41 83 647 31 61 62
63 35
41
30 81 41 8364 31 7 6162 63 35
81
81 41 8368 7 61
62 63 35
68
30 81 7941 64 31
35
79
97 96 83 6461 23 90 6330 31 7 62
98
96
30 97 96 8331 7 61 2923 90 98
29
97 96 9164 62 2390 98 63
91
17 81 41 8368 7 61 62
63 35
17
17 81 41 8368 7 61 6218 63 35
18
17 81 8368 36 6218 63 35
36
17 41 687 61 6269 18
69
30 81 7941 56 64
31 35
56
30 41 7956 31 58
58
81 80 7956 64 35
80
67 68 3662 18 63
35
67
17 37 8183 36 18
37
17 15 417 18 69
15
30 97 96 8331 7 61 2923 90 89 98
89
30 97 83 967 61 29 9098 89 87
87
30 96 3129 23 90
89 22
22
30 99 837 29 9098 89 87
99
97 96 6129 88 89
98 87
88
30 99 837 29 8698 87
86
8 83 997 86 87
8
97 92 9691 64 62
23 63
92
28 61 2988 89 87
28
92 91 6424 62 23
63
24
30 96 9531 90 89
22
95
67 19 6836 18 35
19
30 79 4156 57 31
58
57
41 79 5657 58 78
78
81 80 7956 64 55
35
55
81 80 5464 55 35
54
30 96 9521 31 22
21
17 4 1541 7 18
69
4
17 4 1541 7 18
69 3
3
4 41 1549 7 69
3
49
17 4 1516 18 3
16
48 4 4115 49 7
69 3
48
48 4 4115 7 49
13 3
13
70 4849 69
70
92 91 2462 63 25
25
92 2423 93
93
48 4 749 13 3
6
6
48 41 1540 49 13
40
48 4 495 3 6
5
48 76 47
47
7 1312 6
12
41 15 4013 14
14
50 4 495 3
50
54 34 6455 35
34
77 41 5758 78
77
37 81 8336 82
82
28 61 8988 27
27
20 96 9521 31
20
17 16 182 3
2
8 85 9983 86
85
53 54 6434 35
53
33 5434 55
33
8 85 8386 9
9
85 99100 86
100
85 8384 9
84
10 85 8384 9
10
10 8311 84
11
15 3940 14
39
51 450 3
51
53 5465 64
65
81 3738 82
38
28 6160 27
60
32 2021 31
32
17 118 2
1
77 41 4258 78
42
77 7642 58
76
77 7658 59
59
77 7642 43
43
53 6566
66
53 6566 52
52
72
46 72
46
46 7273
73
45 4672 73
45
75 7643
75
75 7644 43
44
stacking
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Targeting specific properties:multi-dimensional Boltzmann sampling
A B
Weight and combine single structure energies and features (“GC content”)
A Learn weights by adaptive scheme→ target specific energies and GC content
B Sampling: targets Turner energies by linear fitting of energies
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Boltzmann vs. uniform samplingfor multi-target RNA design
Dataset Redprint Uniform ImprovementSeeds 2str 21.67 (±4.38) 37.74 (±6.45) 73%
3str 18.09 (±3.98) 30.49 (±5.41) 71%4str 19.94 (±3.84) 32.29 (±5.24) 63%
Optimized 2str 5.84 (±1.31) 7.95 (±1.76) 28%3str 5.08 (±1.10) 7.04 (±1.52) 31%4str 8.77(±1.48) 13.13 (±2.13) 37%
Multi-target design objective [Blueprint] on the Modena benchmark
Modena benchmark: Taneda. BMC Bioinformatics, 2015.
2str = ”rnatabupath”
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Thank you!
Questions? ⇒ me, , and/or
b
c
de
f gh i
j k l m
nopq
rs
a t
uv
a b c d e
f
gh
ijk l
mn o
pqr
s t u v
a
b
c
d
e f
g
h ij
k
l m n o
p
q
r st
u
v
R1
R2
R3
i) Input Structures
ab c d e f g h i j k l mn o p q r s t u v
ii) Merged Base-Pairs
aeg
p uk
csn
j
qtb
d h m
f l o
vi
r
iii) Compatibility Graph
∅
v o
o ll ff r
l i
j qq t
t bb dd hh m
k pk p g
g n g p uu g e
u e a e ss c
iv) Tree Decomposition
RNARedPrint
Partition FunctionStochastic Backtrack
GACUUUAGCU
UGUUUGAAGU
UA
GACUUGGGAC
UUUUAGGUGU
CU
UUAGAUUCCA
GGGGCUUGU
AGG
GGUUUAGGGC
CUUUAGGUAU
CU
AUUGUGACGG
UCGUGACUAG
UU
. . .
0
0
0
−5
−5
−5
kcal.mol−1
kcal.mol−1
kcal.mol−1
E1
E2
E3GC%
0 50 100
v) Weight Optimization (Adaptive Sampling)
Weightsupd
ate
GCCGCGGUAGCUACAGCCGGCUUUGGGGUUGGGUAGACUCCGGUGCUGCAGCGGCUGUGGCUGGCCGGUUCUGGUUUGCUUAGGGCUACGACGGCGGUGCCGGCAUUUGC. . .
0kcal.mol−1
E1E2E3
GC%
0 50 100
vi) Final Designs
(workflow shown for base pair energy model; for more sophisticated models and
applications, we introduced n-ary dependencies in a flexible constraint framework)
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
APPENDIX
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Uniform sampling: one single structure
1 2 3 4
. ( . )
A A A UA C A GA G A CA G A UA G A U
...
U U U AU U U G
→ 4 · 6 · 4 = 96
• draw unpaired bases uniformly
• choose 1st end of each base pair, s.t. A : C : G : U = 1 : 1 : 2 : 2
• select 2nd end accordingly (if first is G or U: choose)
Sa
mp
lin
gfo
rm
ult
i-ta
rget
RN
Ad
esig
n·
S.W
ill
Tree widths over benchmark sets
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●●●●●●●●●●●●●
●●●●●●●●
●
●
Modena RF3 RF4 RF5 RF6
510
15
Tree
wid
th
●●
●
●●●●
●●●
●●
Modena RF3 RF4 RF5 RF65
1015
2025
30
Base pair model vs. stacking model