8/2/2019 SCFG Smaller
Stochastic Context-Free Grammars
for noncoding RNA gene prediction
CBB 231 / COMPSCI 261
Formal Languages
A formal language is simply a set of strings (i.e., sequences). That set may be infinite.

Let M be a model denoting a language. If M is a generative model such as an HMM or a grammar, then L(M) denotes the language generated by M. If M is an acceptor model such as a finite automaton, then L(M) denotes the language accepted by M.

When all the parameters of a stochastic generative model are known, we can ask:

What is the probability that model M will generate string S?

which we denote:

P(S | M)
Recall: The Chomsky Hierarchy

all languages ⊃ recursively enumerable languages (Turing machines) ⊃ recursive languages (halting TMs) ⊃ context-sensitive languages (linear-bounded TMs) ⊃ context-free languages (SCFGs / PDAs) ⊃ regular languages (HMMs / reg. exp.s)

* each class is a subset of the next higher class in the hierarchy

Examples:
* gene-finding assumes DNA is regular
* secondary structure prediction assumes RNA is context-free
* RNA pseudoknots are context-sensitive
Context-free Grammars (CFGs)
A context-free grammar is a generative model denoted by a 4-tuple:

G = (V, Σ, S, R)

where:
Σ is a terminal alphabet (e.g., {a, c, g, t}),
V is a nonterminal alphabet (e.g., {A, B, C, D, E, ...}),
S ∈ V is a special start symbol, and
R is a set of rewriting rules called productions.

Productions in R are rules of the form:

X → α

for X ∈ V, α ∈ (V ∪ Σ)*; such a production denotes that the nonterminal symbol X may be rewritten by the expression α, which may consist of zero or more terminals and nonterminals.
A Simple Example
As an example, consider G = (VG, Σ, S, RG), for VG = {S, L, N}, Σ = {a, c, g, t}, and RG the set consisting of:

S → aSt | tSa | cSg | gSc | L
L → N N N N
N → a | c | g | t

One possible derivation using this grammar is:

S ⇒ aSt                  (S → aSt)
  ⇒ acSgt                (S → cSg)
  ⇒ acgScgt              (S → gSc)
  ⇒ acgtSacgt            (S → tSa)
  ⇒ acgtLacgt            (S → L)
  ⇒ acgtNNNNacgt         (L → N N N N)
  ⇒ acgtaNNNacgt         (N → a)
  ⇒ acgtacNNacgt         (N → c)
  ⇒ acgtacgNacgt         (N → g)
  ⇒ acgtacgtacgt         (N → t)
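The grammar above can be exercised with a small sampler. This is a sketch: the function name `derive` and the geometric choice of stem length are assumptions for illustration, not part of the grammar definition.

```python
import random

# Sampler for the example grammar:
#   S -> aSt | tSa | cSg | gSc | L      L -> N N N N      N -> a | c | g | t
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def derive(rng, p_continue=0.5):
    """One random leftmost derivation from S: each S -> xSy step adds a
    complementary pair around the growing stem; S -> L ends the stem and
    emits a 4-nucleotide loop (L -> N N N N, N -> a|c|g|t)."""
    stem = ""
    while rng.random() < p_continue:     # choose one of the four S -> xSy rules
        stem += rng.choice("acgt")
    loop = "".join(rng.choice("acgt") for _ in range(4))   # L -> N N N N
    right = "".join(COMPLEMENT[c] for c in reversed(stem))
    return stem + loop + right

s = derive(random.Random(1))
```

Every string produced this way has the stem / 4-nt loop / reverse-complement-stem shape that the grammar enforces.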
Derivations
Suppose a CFG G has generated a terminal string x ∈ Σ*. A derivation denotes a single way by which G may have generated x. For a grammar G and a string x, there may exist multiple distinct derivations.

A derivation (or parse) consists of a series of applications of productions from R, beginning with the start symbol S and ending with the terminal string x:

S ⇒ s1 ⇒ s2 ⇒ s3 ⇒ ... ⇒ x

We can denote this more compactly as: S ⇒* x. Each string si in a derivation is called a sentential form, and may consist of both terminal and nonterminal symbols: si ∈ (V ∪ Σ)*. Each step in a derivation must be of the form:

wXz ⇒ wαz

for w, z ∈ (V ∪ Σ)*, where X → α is a production in R; note that w and z may be empty (ε denotes the empty string).
Leftmost Derivations
A leftmost derivation is one in which at each step the leftmost nonterminal in the current sentential form is the one which is rewritten:

S ⇒ abXdYZ ⇒ abxxxdYZ ⇒ abxxxdyyyZ ⇒ abxxxdyyyzzz

For many applications, it is not necessary to restrict one's attention to only the leftmost parses. In that case, there may exist multiple derivations which can produce the same exact string.

However, when we get to stochastic CFGs, it will be convenient to assume that only leftmost derivations are valid. This will simplify probability computations, since we don't have to model the process of stochastically choosing a nonterminal to rewrite. Note that doing this does not reduce the representational power of the CFG in any way; it just makes it easier to work with.
Context-freeness
The context-freeness of context-free grammars is imposed by the requirement that the l.h.s. of each production rule may contain only a single symbol, and that symbol must be a nonterminal:

X → α

for X ∈ V, α ∈ (V ∪ Σ)*. That is, X is a nonterminal and α is any (possibly empty) string of terminals and/or nonterminals. Thus, a CFG cannot specify context-sensitive rules such as:

wXz → wαz

which states that nonterminal X can be rewritten by α only when X occurs in the local context wXz in a sentential form. Such productions are possible in context-sensitive grammars (CSGs).
Context-free Versus Regular

The advantage of CFGs over HMMs lies in their ability to model arbitrary runs of matching pairs of palindromic elements, such as nested pairs of parentheses:

(((((((())))))))

where each opening parenthesis must have exactly one matching closing parenthesis on the right. When the number of nested pairs is unbounded (i.e., a matching close parenthesis can be arbitrarily far away from its open parenthesis), a finite-state model such as a DFA or an HMM is inadequate to enforce the constraint that all left elements must have a matching right element.

In contrast, the modeling of nested pairs of elements can be readily achieved in a CFG using rules such as X → (X). A sample derivation using such a rule is:

X ⇒ (X) ⇒ ((X)) ⇒ (((X))) ⇒ ((((X)))) ⇒ (((((X)))))

An additional rule such as X → ε is necessary to terminate the recursion.
CFGs and Pushdown Automata

Unlike finite-state automata (and HMMs), which have only a finite amount of memory, a machine for recognizing context-free languages needs an unbounded amount of memory (in general).

One such machine is the pushdown automaton (PDA), which consists of a (nondeterministic) finite automaton plus a stack.

[figure: a finite automaton coupled to an unbounded stack]

The unbounded memory provided by the stack allows the PDA to recognize nested structures of arbitrary depth, whereas an HMM has no such capability since its only memory is its finite set of states. That is, CFGs relax the Markov assumption.

In practice, a recognizer for a CFL needn't have an infinite stack, but does require a dynamic programming matrix of size O(L²), for L the length of the sequence to be parsed. We will see such an algorithm later.
Pushdown Automaton Example

the grammar: S → 1S1 | 0S0 | ε        the input: 110011

[figure: the stack and remaining input at times t=0 through t=10; the action sequence is: push S; expand S → 1S1; match 1; expand S → 1S1; match 1; expand S → 0S0; match 0; expand S → ε; match 0; match 1; match 1; accept]

Note that in principle this procedure is nondeterministic, so that back-tracking may be necessary in practice!

Q: Why didn't the PDA use S → 0S0 here?
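The PDA's top-down matching can be sketched as a recursive check. A sketch: for this particular grammar (S → 1S1 | 0S0 | ε, i.e., even-length binary palindromes) the check is effectively deterministic, because the characters at the two ends of a span decide which production (if any) can apply; `accepts` is an illustrative name.

```python
def accepts(s):
    """Top-down check for the example grammar S -> 1S1 | 0S0 | eps."""
    def derives(i, j):           # can S derive s[i:j]?
        if i == j:               # S -> eps consumes nothing
            return True
        # S -> xSx: both ends must carry the same symbol x, and S must
        # then derive the interior span
        return j - i >= 2 and s[i] == s[j - 1] and s[i] in "01" \
            and derives(i + 1, j - 1)
    return derives(0, len(s))
```

For grammars where several productions could apply at a step, this recursion would need to try each alternative (the back-tracking mentioned above).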
Limitations of CFGs

One thing that CFGs can't model is the matching of arbitrary runs of matching elements in the same direction (i.e., not palindromic):

......abcdefg.......abcdefg.....

In other words, languages of the form:

w x w

for strings w and x of arbitrary length, cannot be modeled using a CFG.

More relevant to ncRNA prediction is the case of pseudoknots, which also cannot be recognized using standard CFGs:

....abcde....rstuv.....edcba.....vutsr....

The problem is that the matching palindromes (and the regions separating them) are of arbitrary length.

Q: why isn't this very relevant to RNA structure prediction? Hint: think of the directionality of paired strands.
Stochastic CFGs (SCFGs)
A stochastic context-free grammar (SCFG) is a CFG plus a probability distribution on productions:

G = (V, Σ, S, R, Pp)

where Pp : R → [0,1], and probabilities are normalized at the level of each l.h.s. symbol X:

∀X∈V [ ∑α Pp(X → α) = 1 ]

Thus, we can compute the probability of a single derivation S ⇒* x by multiplying the probabilities for all productions used in the derivation:

∏i P(Xi → αi)

We can sum over all possible (leftmost) derivations of a given string x to get the probability that G will generate x at random:

P(x | G) = ∑j P(S ⇒j* x | G).
A Simple Example
As an example, consider G = (VG, Σ, S, RG, PG), for VG = {S, L, N}, Σ = {a, c, g, t}, and RG the set consisting of:

S → aSt | tSa | cSg | gSc | L     (P = 0.2 each)
L → N N N N                       (P = 1.0)
N → a | c | g | t                 (P = 0.25 each)

where each S-production has PG = 0.2, PG(L → NNNN) = 1, and each N-production has PG = 0.25. Then the probability of the sequence acgtacgtacgt is given by:

P(acgtacgtacgt) =
P( S ⇒ aSt ⇒ acSgt ⇒ acgScgt ⇒ acgtSacgt
     ⇒ acgtLacgt ⇒ acgtNNNNacgt ⇒ acgtaNNNacgt
     ⇒ acgtacNNacgt ⇒ acgtacgNacgt ⇒ acgtacgtacgt ) =
0.2 × 0.2 × 0.2 × 0.2 × 0.2 × 1 × 0.25 × 0.25 × 0.25 × 0.25 = 1.25 × 10⁻⁶

because this sequence has only one possible leftmost derivation under grammar G.
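Since leftmost derivations make the production sequence unambiguous, the derivation probability is just the product of the probabilities of the productions used. A quick sanity check of the arithmetic (the rule-string encoding is purely illustrative):

```python
from functools import reduce

# Production sequence of the unique leftmost derivation of acgtacgtacgt.
derivation = ["S->aSt", "S->cSg", "S->gSc", "S->tSa", "S->L",
              "L->NNNN", "N->a", "N->c", "N->g", "N->t"]
# In this grammar the probability happens to be constant per l.h.s. symbol.
prob_by_lhs = {"S": 0.2, "L": 1.0, "N": 0.25}
p = reduce(lambda acc, r: acc * prob_by_lhs[r.split("->")[0]], derivation, 1.0)
```

The product comes out to 0.2⁵ × 1 × 0.25⁴ = 1.25 × 10⁻⁶, matching the slide.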
Chomsky Normal Form (CNF)
Any CFG which does not derive the empty string (i.e., ε ∉ L(G)) can be converted into an equivalent grammar in Chomsky Normal Form (CNF). A CNF grammar is one in which all productions are of the form:

X → YZ
or:
X → a

for nonterminals X, Y, Z, and terminal a.

Transforming a CFG into CNF can be accomplished by appropriately-ordered application of the following operations:
* eliminating useless symbols (nonterminals that only derive ε)
* eliminating null productions (X → ε)
* eliminating unit productions (X → Y)
* factoring long r.h.s. expressions (A → abc factored into A → aB, B → bC, C → c)
* factoring terminals (A → cB is factored into A → CB, C → c)

(see, e.g., Hopcroft & Ullman, 1979).
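The "factoring long r.h.s. expressions" step can be sketched in a few lines. A sketch under assumptions: this variant chains binary productions directly (ending with a length-2 r.h.s.) rather than leaving a trailing unit production as in the slide's example, and `factor_long_rhs` / `fresh_names` are invented names.

```python
def factor_long_rhs(lhs, symbols, fresh_names):
    """Split lhs -> s1 s2 ... sn (n >= 3) into a chain of binary
    productions, introducing one fresh nonterminal per extra symbol."""
    rules, current = [], lhs
    for s in symbols[:-2]:
        nxt = next(fresh_names)          # e.g. A -> a B, with B fresh
        rules.append((current, (s, nxt)))
        current = nxt
    rules.append((current, tuple(symbols[-2:])))   # final binary rule
    return rules
```

For example, A → abcd factors into A → aB, B → bC, C → cd (terminal factoring would then finish the job).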
CNF - Example

Non-CNF:
S → aSt | tSa | cSg | gSc | L
L → N N N N
N → a | c | g | t

CNF:
S → A ST | T SA | C SG | G SC | N L1
SA → S A
ST → S T
SC → S C
SG → S G
L1 → N L2
L2 → N N
N → a | c | g | t
A → a
C → c
G → g
T → t

Disadvantages of CNF: (1) more nonterminals & productions, (2) more convoluted relation to problem domain (can be important when implementing posterior decoding).
Advantages: (1) easy implementation of parsers.
The Parsing Problem
Two questions for a CFG:
1) Can a grammar G derive string x?
2) If so, what series of productions would be used during the derivation? (there may be multiple answers!)

Additional questions for an SCFG:
1) What is the probability that G derives string x? (likelihood)
2) What is the most probable derivation of x via G?
The CYK Parsing Algorithm

[figure: a triangular DP matrix over the input sequence, with cells indexed (i, j) and corner cell (0, n-1)]

Cell (i, j) contains all the nonterminals X which can derive the entire subsequence xi ... xj. For a split point k, cell (i, k) contains only those nonterminals which can derive the left part of that subsequence, and cell (k+1, j) contains only those nonterminals which can derive the right part.

initialization:   X → xi   (diagonal)
inductive step:   A → BC   (for all A, all B ∈ (i, k), C ∈ (k+1, j), and all k)
termination:      is S ∈ D(0, n-1)?
The CYK Parsing Algorithm (CFGs)

Given a grammar G = (V, Σ, S, R) in CNF, we initialize a DP matrix D such that:

D(i, i) = { X | (X → xi) ∈ R }   for 0 ≤ i < L

and then fill in successively longer spans via:

D(i, j) = { A | (A → BC) ∈ R, B ∈ D(i, k), C ∈ D(k+1, j) for some k, i ≤ k < j }

The input x is in L(G) iff S ∈ D(0, L-1).
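These DP equations transcribe almost directly into code. A sketch: the (lhs, rhs)-pair encoding of the grammar is an arbitrary choice, not a standard representation.

```python
def cyk(x, rules, start="S"):
    """CYK recognizer for a CNF grammar.
    rules: (A, rhs) pairs, where rhs is a terminal character for A -> a,
    or a (B, C) tuple of nonterminals for A -> BC."""
    n = len(x)
    D = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):                          # initialization (diagonal)
        D[i][i] = {A for A, rhs in rules if rhs == x[i]}
    for span in range(2, n + 1):                # inductive step: short spans first
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):               # split point
                for A, rhs in rules:
                    if isinstance(rhs, tuple) and rhs[0] in D[i][k] \
                            and rhs[1] in D[k + 1][j]:
                        D[i][j].add(A)
    return n > 0 and start in D[0][n - 1]       # termination
```

Tried on the CNF fragment for aⁿtⁿ (S → A ST | A T, ST → S T, A → a, T → t), it accepts exactly the balanced strings.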
Relation to PDAs

If this is equivalent to a pushdown automaton, then where is the pushdown stack? And where are the states of the automaton?

[figure: the PDA's automaton-plus-stack diagram juxtaposed with the CYK matrix — cells (i, j), (i, k), (k+1, j), and corner cell (0, n-1)]
Relation to PDAs

The stack exists implicitly as a jagged slice across the DP matrix. Each such slice represents a possible configuration for the stack at some point in time during the operation of a PDA. CYK effectively checks whether there exists a series of stack configurations (plus discarded input prefixes) corresponding to a valid derivation of the input string.

Under CYK there is implicitly a single state q1, in addition to the silent start/stop state q0. Each collapse of a production during DP corresponds to the self-transition q1 → q1. At the end of the algorithm, if the answer is ACCEPT then the transition q1 → q0 : 1 (terminate and emit a 1) is taken; if the answer is REJECT then the transition q1 → q0 : 0 (terminate and emit a 0) is taken.

[figure: two-state automaton q0, q1 with transitions labeled (stack, input : emission)]
Modified CYK for SCFGs (Inside Algorithm)
CYK can be easily modified to compute the probability of a string. We associate a probability with each cell Di,j, as follows:

1) For each nonterminal A we multiply the probabilities associated with B and C when applying the production A → BC (and also multiply by the probability attached to the production itself).

2) We sum the probabilities associated with different productions for A and different values of k.

The probability of the input string is then given by the probability associated with the start symbol S in cell (0, n-1).

If we instead want the single highest-scoring parse, we can simply perform an argmax operation rather than the sums in step #2.
The Inside Algorithm
Recall that for the forward algorithm we defined a forward variable f(i, j). Similarly, for the inside algorithm we define an inside variable α(i, j, X):

α(i, j, X) = P(X ⇒* xi ... xj | X)

which denotes the probability that nonterminal X will derive subsequence xi ... xj. Computing this variable for all integers i and j and all nonterminals X constitutes the inside algorithm:

for i = 0 up to L-1 do
  foreach nonterminal X do
    α(i, i, X) = P(X → xi);
for i = L-2 down to 0 do
  for j = i+1 up to L-1 do
    foreach nonterminal X do
      α(i, j, X) = ∑Y ∑Z ∑k=i..j-1 P(X → YZ) α(i, k, Y) α(k+1, j, Z);

Note that P(X → YZ) = 0 if X → YZ is not a valid production in the grammar.

The probability P(x | G) of the full input sequence x of length L can then be found in the final cell of the matrix: α(0, L-1, S) (the corner cell). Reconstructing the most probable derivation (parse) can be done by modifying this algorithm to (1) compute maxes instead of sums, and (2) keep traceback pointers as in Viterbi.

time = O(L³ M³)

Q: why did we assume the grammar was in CNF?
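The pseudocode above translates nearly line-for-line into Python. A sketch: the dictionary-based grammar encoding (separate maps for binary and unary productions) is an assumption for illustration.

```python
from collections import defaultdict

def inside(x, binary, unary, start="S"):
    """Inside algorithm for a CNF SCFG.
    binary: {(X, Y, Z): p} for productions X -> Y Z
    unary:  {(X, a): p}    for productions X -> a
    Returns alpha(0, L-1, start) = P(x | G)."""
    L = len(x)
    alpha = defaultdict(float)         # alpha[(i, j, X)] = P(X =>* x_i..x_j)
    for i in range(L):                 # diagonal: alpha(i, i, X) = P(X -> x_i)
        for (X, t), p in unary.items():
            if t == x[i]:
                alpha[(i, i, X)] += p
    for i in range(L - 2, -1, -1):
        for j in range(i + 1, L):
            for (X, Y, Z), p in binary.items():
                for k in range(i, j):  # split point, as in CYK
                    alpha[(i, j, X)] += p * alpha[(i, k, Y)] * alpha[(k + 1, j, Z)]
    return alpha[(0, L - 1, start)]
```

On a toy SCFG for aⁿtⁿ with P(S → A ST) = P(S → A T) = 0.5, the string "at" scores 0.5 and "aatt" scores 0.25, as a hand calculation confirms.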
Training an SCFG
Two common methods for training an SCFG:
1) If parses are known for the training sequences, we can simply count the number of times each production occurs in the training parses and normalize these counts into probabilities. This is analogous to labeled sequence training of an HMM (i.e., when each symbol in a training sequence is labeled with an HMM state).
2) If parses are NOT known for the training sequences, we can use an
EM algorithm similar to the Forward-Backward (Baum-Welch)
algorithm for HMMs. The EM algorithm for SCFGs is called
Inside-Outside.
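Method 1 (counting known parses) amounts to a few lines. A sketch: a parse is represented here as a list of (lhs, rhs) production applications, which is an illustrative encoding rather than a standard one.

```python
from collections import Counter, defaultdict

def estimate_from_parses(parses):
    """Count each production across all training parses, then normalize
    the counts per left-hand-side nonterminal into probabilities."""
    counts = Counter(rule for parse in parses for rule in parse)
    totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in counts.items()}
```

For example, if S → AB appears twice and S → a once across the training parses, the estimates become 2/3 and 1/3.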
Recall: Forward-Backward

[figure: in an HMM, the FORWARD and BACKWARD variables partition the sequence around a single position; in an SCFG, the INSIDE and OUTSIDE variables analogously partition the sequence around a subsequence]

Inside-Outside uses a similar trick to estimate the expected number of times each production is used.

Forward/Backward: time = O(L N²) each.    Inside/Outside: time = O(L³ M³) each.
The Outside Algorithm

For the outside algorithm we define an outside variable β(i, j, Y):

β(i, j, Y) = P(S ⇒* x0..xi-1 Y xj+1..xL-1)

which denotes the probability that the start symbol S will derive the sentential form x0..xi-1 Y xj+1..xL-1 (i.e., that S will derive some string having prefix x0...xi-1 and suffix xj+1...xL-1, and that the region between will be derived through nonterminal Y).

β(0, L-1, S) = 1;
foreach X ≠ S do β(0, L-1, X) = 0;
for i = 0 up to L-1 do
  for j = L-1 down to i do
    foreach nonterminal X do
      if β(i, j, X) undefined then
        β(i, j, X) = ∑Y ∑Z ∑k=j+1..L-1 P(Y → XZ) α(j+1, k, Z) β(i, k, Y)
                   + ∑Y ∑Z ∑k=0..i-1 P(Y → ZX) α(k, i-1, Z) β(k, j, Y);

time = O(L³ M³)
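Both passes can be sketched together, which allows checking the standard consistency identity P(x | G) = ∑X β(i, i, X) · P(X → xi) at every position i. A sketch under assumptions: the grammar encoding mirrors the Inside sketch earlier, and the toy grammar below (generating aⁿtⁿ) is purely illustrative.

```python
from collections import defaultdict

def inside_outside(x, binary, unary, nonterms, start="S"):
    """binary: {(X,Y,Z): p} for X -> Y Z; unary: {(X,a): p} for X -> a (CNF).
    Returns (alpha, beta) dictionaries keyed by (i, j, X)."""
    L = len(x)
    alpha = defaultdict(float)
    for i in range(L):                           # inside pass: diagonal
        for (X, t), p in unary.items():
            if t == x[i]:
                alpha[(i, i, X)] += p
    for span in range(2, L + 1):                 # inside pass: longer spans
        for i in range(L - span + 1):
            j = i + span - 1
            for (X, Y, Z), p in binary.items():
                for k in range(i, j):
                    alpha[(i, j, X)] += p * alpha[(i, k, Y)] * alpha[(k + 1, j, Z)]
    beta = defaultdict(float)
    beta[(0, L - 1, start)] = 1.0                # outside pass: corner cell
    for span in range(L - 1, 0, -1):             # larger spans before smaller
        for i in range(L - span + 1):
            j = i + span - 1
            for X in nonterms:
                total = 0.0
                for (Y, B, C), p in binary.items():
                    if B == X:                   # X is the left child of Y
                        for k in range(j + 1, L):
                            total += p * alpha[(j + 1, k, C)] * beta[(i, k, Y)]
                    if C == X:                   # X is the right child of Y
                        for k in range(i):
                            total += p * alpha[(k, i - 1, B)] * beta[(k, j, Y)]
                beta[(i, j, X)] = total
    return alpha, beta
```

The identity holds because every derivation of x must pass through exactly one nonterminal at each terminal position.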
Inside-Outside Parameter Estimation

EM-update equations:

Pnew(X → YZ) = E(X → YZ | x) / E(X | x)
  = [ ∑i=0..L-2 ∑j=i+1..L-1 ∑k=i..j-1 β(i, j, X) P(X → YZ) α(i, k, Y) α(k+1, j, Z) ]
    / [ ∑i=0..L-1 ∑j=i..L-1 α(i, j, X) β(i, j, X) ]

Pnew(X → a) = E(X → a | x) / E(X | x)
  = [ ∑i=0..L-1 kron(xi, a) β(i, i, X) P(X → a) ]
    / [ ∑i=0..L-1 ∑j=i..L-1 α(i, j, X) β(i, j, X) ]

where kron(xi, a) = 1 if xi = a, and 0 otherwise.

(see Durbin et al., 1998, section 9.6)
Posterior Decoding for SCFGs

What is the probability that nonterminal X generates the subsequence xi ... xj via production X → YZ, with Y generating xi ... xk and Z generating xk+1 ... xj?

P(Xi,j → Yi,k Zk+1,j | x) = β(i, j, X) P(X → YZ) α(i, k, Y) α(k+1, j, Z) / α(0, L-1, S)

What is the probability that nonterminal X generates xi (in some particular sequence)?

P(X → xi | x) = β(i, i, X) P(X → xi) / α(0, L-1, S)

What is the probability that a structural feature of type F will occupy sequence positions i through j?

P(Fi,j | x) = [some product of α's and β's (and other things)] / α(0, L-1, S)
An SCFG for Simple Stem-loop Structures

S → aSt (15%) | tSa (15%) | cSg (25%) | gSc (25%) | gSt (5%) | tSg (5%) | L (10%)
    (generate paired bases in the stem)

L → N L (75%) | N N N N (25%)
    (generate a loop of length ≥ 4)

N → a (20%) | c (20%) | g (30%) | t (30%)
    (generate each nucleotide in the loop)
Sliding Windows to Find ncRNA Genes

Given a grammar G describing ncRNA structures and an input sequence Z, we can slide a window of length L across the sequence, computing the probability P(Zi,i+L-1 | G) that the subsequence Zi,i+L-1 falling within the current window could have been generated by grammar G.

This is the Bayesian inverse of P(G | Zi,i+L-1), the probability that this subsequence would (if transcribed into ssRNA) fold into a shape characteristic of the modeled class of ncRNAs.

Using a likelihood ratio:

R = P(Zi,i+L-1 | G) / P(Zi,i+L-1 | background),

we can impose the rule that any subsequence having a score R >> 1 is likely to contain a ncRNA gene (where the background model is typically a 0th-order Markov chain).

[figure: a window sliding across a DNA sequence; the score R = 0.99537 for the current window is computed by summing over all possible secondary structures under the grammar]
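The scan itself is a one-liner per window. A sketch: `loglik_grammar` stands in for the log Inside probability of the window and is an assumed callable, and the background is the 0th-order Markov chain mentioned above.

```python
import math

def scan(seq, L, loglik_grammar, base_freqs):
    """Slide a length-L window across seq, returning for each window the
    log-likelihood ratio log P(window | G) - log P(window | background)."""
    scores = []
    for i in range(len(seq) - L + 1):
        w = seq[i:i + L]
        # 0th-order Markov chain background: independent base frequencies
        log_bg = sum(math.log(base_freqs[c]) for c in w)
        scores.append(loglik_grammar(w) - log_bg)
    return scores
```

Windows with scores well above 0 (i.e., R >> 1) would be flagged as candidate ncRNA genes.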
What about Pseudoknots?

Among the most prevalent RNA structures is a motif known as the pseudoknot. (Staple & Butcher, 2005)
A Grammar for Simple Pseudoknots

L(G) = { x y xr yr | x ∈ Σ*, y ∈ Σ* }    (xr denotes the reverse complement of x)

S → L X R
X → aXt | tXa | cXg | gXc | Y                  (generate x and xr)
Y → aAY | cCY | gGY | tTY | ε                  (generate y and an encoded 2nd copy)
Aa → aA, Ac → cA, Ag → gA, At → tA
Ca → aC, Cc → cC, Cg → gC, Ct → tC
Ga → aG, Gc → cG, Gg → gG, Gt → tG             (propagate encoded copy of y to end of sequence)
Ta → aT, Tc → cT, Tg → gT, Tt → tT
AR → tR, CR → gR, GR → cR, TR → aR             (reverse-complement second y at end of sequence)
La → aL, Lc → cL, Lg → gL, Lt → tL, LR → ε     (erase extra markers)

Q: what type of grammar is this?
Implementing Zuker in an SCFG
(Rivas & Eddy, 2000)

[figure: productions for the cases — i is unpaired; j is unpaired; i pairs with j; i+1 pairs with j; i pairs with j-1; i+1 pairs with j-1; serial structures; loops, bulges, and multiloops — these allow incorporation of stacking energies]
Implementing Zuker in an SCFG
(Rivas & Eddy, 2000)
Attribute grammars & Generalized SCFGs

What is an attribute-SCFG?

An attribute-SCFG (ASCFG) is an SCFG in which each nonterminal X has associated with it a set of attributes AX = {x0, x1, ..., xn}, and each production has associated with it a set of attribute update equations, by which the attributes of the nonterminals on the left-hand side of the production can be updated from the values of the attributes of the nonterminals on the right-hand side. For example:

X → YZ   { x0 = y3 + 3z1;  x1 = 6z0 }

Attributes can be computed using these equations during the Inside algorithm, in addition to (or in place of) the usual probabilistic DP equations.

We can also think of this as being something of a Generalized SCFG (as in Generalized HMMs), since they can encapsulate arbitrary additional computations within each state (i.e., nonterminal).
Batzoglou's GSCFG (CONTRAfold)
(Do et al., 2006)

Attribute updates allow them to perform additional computation during Inside, so as to provide more fine-grained modeling of the structure.

Much like in a GHMM, they can incorporate explicit length scores, etc.
Discriminative Training of GSCFGs
Summary

* An SCFG is a generative model utilizing production rules to generate strings.
* The probability of a string S being generated by an SCFG G can be computed using the Inside algorithm.
* Given a set of productions for an SCFG, the parameters can be estimated using the Inside-Outside (EM) algorithm.
* SCFGs are more powerful than HMMs because they can model arbitrary runs of paired nested elements, such as base-pairings in a stem-loop structure, though they can't model pseudoknots.
* Thermodynamic folding algorithms can be simulated in an SCFG.
* SCFGs can be generalized via explicit attribute equations, and they can be discriminatively trained.