Transcript of DTC Gerton Lunter, WTCHG February 10, 2010 Hidden Markov models in Computational Biology.
- Slide 1
- DTC Gerton Lunter, WTCHG February 10, 2010 Hidden Markov models
in Computational Biology
- Slide 2
- Overview. First part: Mathematical context: Bayesian networks,
Markov models, hidden Markov models. Second part: Worked example: the
occasionally crooked casino; two applications in computational
biology. Third part: Practical 0: a bit more theory on HMMs;
Practicals I-V: theory, implementation, biology. Pick &
choose.
- Slide 3
- Part I HMMs in (mathematical) context
- Slide 4
- Probabilistic models. A probabilistic model is a mathematical
model describing how variables occur together. Three types of
variables are distinguished: observed variables, latent (hidden)
variables, and parameters. Latent variables are often the quantities
of interest, and can be inferred from observations using the model.
Sometimes they are nuisance variables, used only to describe the
relationships in the data correctly. Example: P(clouds,
sprinkler_used, rain, wet_grass)
- Slide 5
- Some notation. P(X,Y,Z): probability of (X,Y,Z) occurring
(simultaneously). P(X,Y): probability of (X,Y) occurring. P(X,Y|Z):
probability of (X,Y) occurring, provided that it is known that Z
occurs (conditional on Z, or given Z).
$P(X,Y) = \sum_{Z} P(X,Y,Z)$
$P(Z) = \sum_{X,Y} P(X,Y,Z)$
$P(X,Y \mid Z) = P(X,Y,Z) / P(Z)$
$\sum_{X,Y,Z} P(X,Y,Z) = 1$
$P(Y \mid X) = P(X \mid Y) P(Y) / P(X)$ (Bayes' rule)
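The identities above can be checked numerically on a small table. The following is a minimal sketch, assuming three binary variables and a made-up joint distribution; the weights and the helper names (`marginal`, `cond_xy_given_z`, and so on) are illustrative and not part of the original slides.

```python
from itertools import product

# Hypothetical joint distribution P(X, Y, Z) over three binary variables.
# The weights are made up for illustration and normalized to sum to 1.
weights = {xyz: 1 + i for i, xyz in enumerate(product([0, 1], repeat=3))}
total = sum(weights.values())
P = {xyz: w / total for xyz, w in weights.items()}


def marginal(P, keep):
    """Sum out every variable whose index is not listed in `keep`."""
    out = {}
    for values, p in P.items():
        key = tuple(values[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


P_xy = marginal(P, (0, 1))  # P(X,Y) = sum over Z of P(X,Y,Z)
P_z = marginal(P, (2,))     # P(Z)   = sum over X,Y of P(X,Y,Z)
P_x = marginal(P, (0,))
P_y = marginal(P, (1,))


def cond_xy_given_z(x, y, z):
    """P(X,Y | Z) = P(X,Y,Z) / P(Z)."""
    return P[(x, y, z)] / P_z[(z,)]


def cond_y_given_x(y, x):
    """P(Y | X) = P(X,Y) / P(X)."""
    return P_xy[(x, y)] / P_x[(x,)]


def cond_x_given_y(x, y):
    """P(X | Y) = P(X,Y) / P(Y)."""
    return P_xy[(x, y)] / P_y[(y,)]


# Bayes' rule: P(Y=1 | X=0) computed directly and via P(X | Y) P(Y) / P(X).
lhs = cond_y_given_x(1, 0)
rhs = cond_x_given_y(0, 1) * P_y[(1,)] / P_x[(0,)]
print(lhs, rhs)
```

The two printed numbers should agree up to floating-point rounding, since both sides reduce to P(X=0, Y=1) / P(X=0).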
- Slide 6
- Independence. Two variables X, Y are independent if P(X,Y) =
P(X)P(Y). Knowing that two variables are independent reduces the
model complexity. Suppose X, Y each take N possible values:
specification of P(X,Y) requires $N^2 - 1$ numbers; specification of
P(X), P(Y) requires $2N - 2$ numbers. Two variables X, Y are
conditionally independent (given Z) if P(X,Y|Z) =
P(X|Z)P(Y|Z).
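As a rough illustration of the saving, the sketch below counts free parameters for a few values of N and builds an independent joint table from two marginals. The marginals `p_x` and `p_y` are made-up numbers, and the function names are hypothetical.

```python
def free_parameters_joint(n):
    """Numbers needed for a full table P(X,Y): n*n entries minus one sum-to-1 constraint."""
    return n * n - 1


def free_parameters_independent(n):
    """Numbers needed for P(X) and P(Y) separately: (n - 1) each."""
    return 2 * (n - 1)


for n in (2, 4, 20):
    print(n, free_parameters_joint(n), free_parameters_independent(n))

# Under independence the joint table is the outer product of the marginals.
# These marginals are assumptions chosen only for illustration.
p_x = [0.1, 0.2, 0.3, 0.4]
p_y = [0.25, 0.25, 0.25, 0.25]
joint = [[px * py for py in p_y] for px in p_x]

# Recover the marginals from the joint table by summing rows and columns.
row_sums = [sum(row) for row in joint]
col_sums = [sum(joint[i][j] for i in range(len(p_x))) for j in range(len(p_y))]
```

For N = 4 (for example, the four nucleotides) the full table needs 15 numbers, while the independent model needs 6.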
- Slide 7
- Probabilistic model: example. P(Clouds, Sprinkler, Rain,
WetGrass) = P(Clouds) P(Sprinkler|Clouds) P(Rain|Clouds) P(WetGrass
| Sprinkler, Rain). This specification of the model determines which
variables are deemed to be (conditionally) independent. These
independence assumptions simplify the model. Using formulas as
above to describe the independence relationships is not very
intuitive, particularly for large models. Graphical models (in
particular, Bayesian networks) are a more intuitive way to do the
same.
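The factorization above can be turned into a small program. The sketch below assumes hypothetical conditional probability tables (every number is invented for illustration), builds the joint distribution from the four factors, and then infers P(Rain | WetGrass) by brute-force enumeration. It also checks that Sprinkler and Rain are conditionally independent given Clouds, as the factorization implies.

```python
from itertools import product

# Hypothetical conditional probability tables; all numbers are assumptions.
p_clouds = {True: 0.5, False: 0.5}                   # P(Clouds)
p_sprinkler_given_clouds = {True: 0.1, False: 0.5}   # P(Sprinkler=True | Clouds)
p_rain_given_clouds = {True: 0.8, False: 0.2}        # P(Rain=True | Clouds)
p_wet_given_sprinkler_rain = {                       # P(WetGrass=True | Sprinkler, Rain)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.9,
    (False, False): 0.0,
}


def bernoulli(p_true, value):
    """Probability of a binary outcome given the probability of True."""
    return p_true if value else 1.0 - p_true


def joint(c, s, r, w):
    """P(Clouds, Sprinkler, Rain, WetGrass) as the product of the four factors."""
    return (
        p_clouds[c]
        * bernoulli(p_sprinkler_given_clouds[c], s)
        * bernoulli(p_rain_given_clouds[c], r)
        * bernoulli(p_wet_given_sprinkler_rain[(s, r)], w)
    )


states = [True, False]
total = sum(joint(*v) for v in product(states, repeat=4))

# Inference by enumeration: P(Rain=True | WetGrass=True).
p_wet = sum(joint(c, s, r, True) for c, s, r in product(states, repeat=3))
p_rain_and_wet = sum(joint(c, s, True, True) for c, s in product(states, repeat=2))
p_rain_given_wet = p_rain_and_wet / p_wet


def p_sr_given_c(s, r, c):
    """P(Sprinkler, Rain | Clouds), obtained by summing out WetGrass."""
    return sum(joint(c, s, r, w) for w in states) / p_clouds[c]


print(total, p_rain_given_wet)
```

Enumeration is exponential in the number of variables, so it only works for toy models; it is used here because it follows the definitions directly.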
- Slide 8
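- Bayesian network: example. [Figure: directed graph with nodes
Cloudy, Sprinkler, Rain, Wet grass; edges Cloudy to Sprinkler,
Cloudy to Rain, Sprinkler to Wet grass, Rain to Wet grass; annotated
with P(Clouds), P(Sprinkler|Clouds), P(Rain|Clouds), P(WetGrass |
Sprinkler, Rain)] Rule: each node is conditionally independent of
its non-descendants given the state of its parents. E.g. Sprinkler
and Rain are independent given Cloudy.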
- Slide 9
- Bayesian network: example. [Figure: the same graph with nodes
Cloudy, Sprinkler, Rain, Wet grass, annotated with P(Clouds),
P(Sprinkler|Clouds), P(Rain|Clouds), P(WetGrass | Sprinkler, Rain)]
Convention: latent variables are open; observed variables are
shaded.
- Slide 10
- Bayesian network: example Combat Air Identification algorithm;
www.wagner.com
- Slide 11
- Bayesian networks. Intuitive formalism to develop models.
Algorithms to learn parameters from training data (maximum
likelihood; EM). General and efficient algorithms to infer latent
variables from observations (message passing algorithm). Allows
dealing with missing data in a robust and coherent way (make the
relevant node a latent variable). Simulate data.
- Slide 12
- Markov model. A particular kind of Bayesian network. All
variables are observed. Good for modeling dependencies within
sequences.
$P(S_n \mid S_1, S_2, \ldots, S_{n-1}) = P(S_n \mid S_{n-1})$ (Markov property)
$P(S_1, S_2, S_3, \ldots, S_n) = P(S_1) P(S_2 \mid S_1) \cdots P(S_n \mid S_{n-1})$
[Figure: chain of nodes S1 to S8]
- Slide 13
- Markov model. States: letters in English words. Transitions:
which letter follows which. [Figure: chain of nodes S1 to S8]
Training text: MR SHERLOCK HOLMES WHO WAS USUALLY VERY LATE IN THE
MORNINGS SAVE UPON THOSE NOT INFREQUENT OCCASIONS WHEN HE WAS UP
ALL ...
$S_1$ = M, $S_2$ = R, $S_3$ = (space), $S_4$ = S, $S_5$ = H, ...
$P(S_n = y \mid S_{n-1} = x)$ (parameters) $= P(S_{n-1} S_n = xy) / P(S_{n-1} = x) \approx$ (frequency of xy) / (frequency of x) (max likelihood)
Generated text: UNOWANGED HE RULID THAND TROPONE AS ORTIUTORVE OD T
HASOUT TIVE IS MSHO CE BURKES HEST MASO TELEM TS OME SSTALE MISSTISE
S TEWHERO
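The maximum-likelihood estimate above amounts to counting letter pairs. Below is a minimal sketch, assuming only the short text fragment quoted on the slide as training data (the original presumably used far more text); `fit_transitions` and `sample` are hypothetical helper names.

```python
import random
from collections import Counter, defaultdict

# Training text: only the fragment quoted on the slide (an assumption;
# a realistic model would be trained on a full book).
text = (
    "MR SHERLOCK HOLMES WHO WAS USUALLY VERY LATE IN THE MORNINGS "
    "SAVE UPON THOSE NOT INFREQUENT OCCASIONS WHEN HE WAS UP ALL"
)


def fit_transitions(text):
    """Estimate P(S_n = y | S_{n-1} = x) as count(xy) / count(x as predecessor)."""
    pair_counts = defaultdict(Counter)
    for x, y in zip(text, text[1:]):
        pair_counts[x][y] += 1
    return {
        x: {y: n / sum(counts.values()) for y, n in counts.items()}
        for x, counts in pair_counts.items()
    }


def sample(transitions, start, length, rng):
    """Generate text by repeatedly drawing the next letter given the current one."""
    out = [start]
    while len(out) < length:
        row = transitions.get(out[-1])
        if row is None:  # letter never observed with a successor
            break
        letters = list(row)
        out.append(rng.choices(letters, weights=[row[l] for l in letters])[0])
    return "".join(out)


transitions = fit_transitions(text)
generated = sample(transitions, "M", 60, random.Random(0))
print(generated)
```

Using triplets of letters as states, as on the next slide, only changes what is counted: overlapping 3-letter windows instead of single letters.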
- Slide 14
- Markov model. States: triplets of letters. Transitions: which
(overlapping) triplet follows which. [Figure: chain of nodes S1 to
S8] Training text: MR SHERLOCK HOLMES WHO WAS USUALLY VERY LATE IN
THE MORNINGS SAVE UPON THOSE NOT INFREQUENT OCCASIONS WHEN HE WAS UP
ALL ...
$S_1$ = "MR ", $S_2$ = "R S", $S_3$ = " SH", $S_4$ = "SHE", $S_5$ = "HER", ...
$P(S_n = xyz \mid S_{n-1} = wxy) = P(wxyz) / P(wxy) \approx$ (frequency of wxyz) / (frequency of wxy)
Generated text: THERE THE YOU SOME OF FEELING WILL PREOCCUPATIENCE
CREASON LITTLED MASTIFF HENRY MALIGNATIVE LL HAVE MAY UPON IMPRESENT
WARNESTLY
- Slide 15
- Markov model. States: word pairs. Text from:
http://www.gutenberg.org/etext/1105
Then churls their thoughts (although their eyes were kind) To thy
fair appearance lies To side this title is impanelled A quest of
thoughts all tenants to the sober west As those gold candles fixed
in heaven's air Let them say more that like of hearsay well I will
drink Potions of eisel 'gainst my strong infection No bitterness
that I was false of heart Though absence seemed my flame to qualify
As easy might I not free When thou thy sins enclose! That tongue
that tells the story of thy love Ay fill it full with feasting on
your sight Book both my wilfulness and errors down And on just
proof surmise accumulate Bring me within the level of your eyes And
in mine own when I of you beauteous and lovely youth When that
churl death my bones with dust shall cover And shalt by fortune
once more re-survey These poor rude lines of life thou art forced
to break a twofold truth Hers by thy deeds
- Slide 16
- Hidden Markov model. HMM = probabilistic observation of a Markov
chain. Another special kind of Bayesian network. The $S_i$ form a
Markov chain as before, but the states are unobserved. Instead, the
$y_i$ (dependent on $S_i$) are observed. Generative viewpoint: state
$S_i$ emits symbol $y_i$. The $y_i$ do not form a Markov chain (= do
not satisfy the Markov property). They exhibit more complex (and
long-range) dependencies. [Figure: chain of hidden nodes S1 to S8,
each with an observed node y1 to y8 attached]
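The generative viewpoint can be made concrete by sampling from a small HMM. This is a minimal sketch with a hypothetical two-state model; the state names, symbols, and all probabilities are assumptions chosen for illustration.

```python
import random

# Hypothetical two-state HMM; every number below is an assumption.
start = {"A": 0.5, "B": 0.5}                                 # P(S_1)
trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}  # P(S_i | S_{i-1})
emit = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.3, "y": 0.7}}   # P(y_i | S_i)


def draw(dist, rng):
    """Draw one outcome from a dict mapping outcomes to probabilities."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


def sample_hmm(n, rng):
    """Return (hidden states, emitted symbols), each of length n."""
    states, symbols = [], []
    state = draw(start, rng)
    for _ in range(n):
        states.append(state)
        symbols.append(draw(emit[state], rng))  # state S_i emits y_i
        state = draw(trans[state], rng)         # advance the hidden chain
    return states, symbols


hidden, observed = sample_hmm(20, random.Random(1))
print("".join(hidden))
print("".join(observed))
```

Only the second printed line would be available to an observer; the first line is the latent sequence that inference tries to recover.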
- Slide 17
- Hidden Markov model. The notation above emphasizes the relation
to Bayesian networks. A different graph notation emphasizes the
transition probabilities $P(S_i \mid S_{i-1})$, e.g. in the case
$S_i \in \{A, B, C, D\}$. Notes: Emission probabilities
$P(y_i \mid S_i)$ are not explicitly represented. The advance from i
to i+1 is also implicit. Not all arrows need to be present (prob =
0). [Figure: Bayesian-network view with hidden nodes S1 to S8 and
observed nodes y1 to y8, next to a state-transition graph on the
states A, B, C, D]
- Slide 18
- Pair Hidden Markov model. [Figure: two-dimensional grid of
hidden states $S_{ij}$, with observed sequence y1 to y5 along one
axis and observed sequence z1 to z3 along the other]
- Slide 19
- Pair Hidden Markov model. [Figure: two-dimensional grid of
hidden states $S_{ij}$, with observed sequence y1 to y5 along one
axis and observed sequence z1 to z3 along the other] Normalization:
$\sum_{\text{paths } p} \sum_{s_{p(1)} \ldots s_{p(N)}} \sum_{y_1 \ldots y_A} \sum_{z_1 \ldots z_B} P(s_{p(1)}, \ldots, s_{p(N)}, y_1 \ldots y_A, z_1 \ldots z_B) = 1$,
where N = N(p) = length of the path. States may emit a symbol in
sequence y, or in z, or both, or neither (silent state). If a
symbol is emitted, the associated coordinate subscript increases by
one. E.g. diagonal transitions are associated with simultaneous
emissions in both sequences. A realization of the pair HMM consists
of a state sequence, with each symbol emitted by exactly one state,
and the associated path through the 2D table. (A slightly more
general viewpoint decouples the states and the path; then the
hidden variables are the sequence of states S, and a path through
the table. In this viewpoint the transitions, not the states, emit
symbols. The technical term in finite state machine theory is Mealy
machine; the standard viewpoint is also known as Moore
machine.)
- Slide 20
- Inference in HMMs. So HMMs can describe complex (temporal,
spatial) relationships in data. But how can we use the model? A
number of (efficient) inference algorithms exist for HMMs. Viterbi
algorithm: most likely state sequence, given observables. Forward
algorithm: likelihood of model given observables. Backward
algorithm: together with Forward, allows computation of posterior
probabilities. Baum-Welch algorithm: parameter estimation given
observables.
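Two of these algorithms fit in a few lines each. The sketch below implements Forward and Viterbi for a two-state "occasionally crooked casino" with a fair and a loaded die. The transition and emission probabilities are assumptions chosen for illustration, not values taken from the slides, and the Forward recursion is left unscaled, so it is only suitable for short sequences.

```python
import math

# Hypothetical casino HMM; all probabilities are assumptions.
states = ("fair", "loaded")
start = {"fair": 0.5, "loaded": 0.5}
trans = {
    "fair": {"fair": 0.95, "loaded": 0.05},
    "loaded": {"fair": 0.10, "loaded": 0.90},
}
emit = {
    "fair": {r: 1 / 6 for r in range(1, 7)},
    "loaded": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5},
}


def forward(obs):
    """Return P(obs | model), summing over all hidden state sequences."""
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for y in obs[1:]:
        f = {t: emit[t][y] * sum(f[s] * trans[s][t] for s in states) for t in states}
    return sum(f.values())


def viterbi(obs):
    """Return (most likely state path, log of its joint probability with obs)."""
    v = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []  # back[i][t] = best predecessor of state t at position i+1
    for y in obs[1:]:
        pointers, new_v = {}, {}
        for t in states:
            best = max(states, key=lambda s: v[s] + math.log(trans[s][t]))
            pointers[t] = best
            new_v[t] = v[best] + math.log(trans[best][t]) + math.log(emit[t][y])
        back.append(pointers)
        v = new_v
    last = max(states, key=lambda s: v[s])
    path = [last]
    for pointers in reversed(back):  # trace the best path backwards
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[last]


rolls = [1, 3, 6, 6, 6, 6, 6, 6, 2, 4]
likelihood = forward(rolls)
path, log_p_path = viterbi(rolls)
print(likelihood, path, log_p_path)
```

Viterbi works in log space here because it only needs products and maxima; Forward needs sums of probabilities, which is why it is written with plain probabilities in this sketch.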
- Slide 21
- Summary of part I. Probabilistic models: observed variables;
latent variables (of interest for inference, or nuisance variables);
parameters (obtained from training data, or prior knowledge).
Bayesian networks: independence structure of the model represented
as a graph. Markov models: linear Bayesian network; all nodes
observed. Hidden Markov models: an observed layer and a hidden
(latent) layer of nodes; efficient inference algorithm (Viterbi
algorithm). Pair Hidden Markov model: two observed sequences with
interdependencies, determined by an unobserved Markov sequence.
- Slide 22
- Part II Examples of HMMs
- Slide 23
- Detailed example: The Occasionally Crooked Casino. Dirk
Husmeier's slides:
http://www.bioss.sari.ac.uk/staff/dirk/talks/tutorial_hmm.pdf
Slides 1-15. Recommended reading: Slides 16-23: the Forward and
Backward algorithm, and posteriors.
- Slide 24
- Applications in computational biology. Dirk Husmeier's slides:
http://www.bioss.sari.ac.uk/staff/dirk/talks/tutorial_hmm_bioinf.pdf
Slides 1-8: pairwise alignment. Slides 12-16: Profile HMMs.
- Slide 25
- Part III Practicals
- Slide 26
- Practical 0: HMMs What is the interpretation of the probability
computed by the Forward (FW) algorithm? The Viterbi algorithm also
computes a probability. How does that relate to the one computed by
the FW algorithm? How do the probabilities computed by FW and
Backward algorithms compare? Explain what a posterior is, either in
the context of alignment using an HMM, or of profile HMMs. Why is
the logarithm trick useful for the Viterbi algorithm? Does the same
trick work for the FW algorithm?
- Slide 27
- Practical I: Profile HMMs in context
- Slide 28
- Look up the protein sequence of PRDM9 in the UCSC genome browser.
Search InterPro for the protein sequence. Look at the ProSite
profile and sequence logo. Work out the syntax of the profile
(HMMer syntax), and relate the logo and profile. Which residues are
highly conserved? What structural role do these play? Which are not
very much conserved? Can you infer that these are less important
biologically? Read PMID: 19997497 (PubMed). What is the meaning of
the changed number of zinc finger motifs across species? Relate the
conserved and changeable positions in the zinc fingers to the
InterPro motif. Do these match the predicted pattern? Read PMID:
19008249 and PMID: 20044541. Explain the relationship between the
recombination motif and the zinc fingers. What do you think is the
cellular function of PRDM9? Relate the fact that recombination
hotspots in chimpanzee do not coincide with those in human to
PRDM9. What do you predict about recombination hotspots in other
mammalian species? Why do you think PRDM9 evolves so fast?
Background information on motif finding:
www.bx.psu.edu/courses/bx-fall04/phmm.ppt
http://compbio.soe.ucsc.edu/SAM_T08/T08-query.html
- Slide 29
- Practical II: HMMs and population genetics
- Slide 30
- Read PMID: 17319744 and PMID: 19581452. What is the difference
between phylogeny and genealogy? What is incomplete lineage
sorting? The model operates on multiple sequences. Is it a linear
HMM, a pair HMM, or something else? What do the states represent?
How could the model be improved? Which patterns in the data is the
model looking for? Would it be possible to analyze these patterns
without a probabilistic model? (Estimate how frequently (per
nucleotide) mutations occur between the species considered. What is
the average distance between recombinations?) How does the method
scale to more species?
- Slide 31
- Practical III: HMMs and alignment
- Slide 32
- Read PMID: 18073381. What are the causes of inaccuracies in
alignments? Would a more accurate model of sequence evolution
improve alignments? Would this be a large improvement? What is the
practical limit (in terms of evolutionary distance, in
mutations/site) on pairwise alignment? Would multiple alignment
allow more divergent species to be aligned? How does the complexity
scale for multiple alignment using HMMs, in a naive implementation?
What could you do to improve this? What is posterior decoding and
how does it work? In what way does it improve alignments, compared
to Viterbi? Why is this?
- Slide 33
- Practical IV: HMMs and conservation: phastCons
- Slide 34
- Read PMID: 16024819. What is the difference between a phyloHMM
and a standard HMM? How does the model identify conserved regions?
How is the model helped by the use of multiple species? How is the
model parameterized? The paper uses the model to estimate the
fraction of the human genome that is conserved. How can this
estimate be criticized? Look at a few protein-coding genes, and
their conservation across mammalian species, using the UCSC genome
browser. Is it always true that (protein-coding) exons are well
conserved? Can you see regions of conservation outside of
protein-coding exons? Do these observations suggest that the model
is inaccurate? Read PMID: 19858363. Summarize the differences of
approaches of the new methods and the old phyloHMM.
- Slide 35
- Practical V: Automatic code generation for HMMs
- Slide 36
- http://www.well.ox.ac.uk/~gerton/Gulbenkian/HMMs and
alignments.doc. Skip sections 1-3.
Implementing the various algorithms for HMMs can be hard work,
particularly when reasonable efficiency is required. Library
implementations, however, are neither fast nor flexible enough. This
practical demonstrates a code generator that takes the pain out of
working with HMMs. It takes you through an existing alignment HMM
and modifies it to identify conserved regions (à la phastCons).
Requirements: a Linux system with Java and GCC installed.
Experience with C and/or C++ is helpful for this tutorial.