
Page 1: CS5263 Bioinformatics

CS5263 Bioinformatics

Lecture 9: Motif finding

Biological & Statistical background

Page 2: CS5263 Bioinformatics

Roadmap

• Review of last lecture

• Intro to probability and statistics

• Intro to motif finding problems
– Biological background

Page 3: CS5263 Bioinformatics

Multiple Sequence Alignment

Page 4: CS5263 Bioinformatics

Scoring functions

• Ideally:
– Maximizes the probability that the sequences evolved from a common ancestor

• In practice:
– Sum of Pairs

[Figure: tree relating sequences x, y, z through ancestral sequences w and v, rooted at an unknown common ancestor "?"]

Multiple alignment:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induced pairwise alignments:

x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
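To make the sum-of-pairs score concrete, here is a minimal Python sketch that scores an alignment column by column over all pairs of rows; the match/mismatch/gap values are made-up toy parameters, not ones prescribed in the lecture.

```python
def sp_score(alignment, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of a gapped multiple alignment (list of rows)."""
    def pair(p, q):
        if p == "-" and q == "-":
            return 0          # gap-gap pairs are conventionally free
        if p == "-" or q == "-":
            return gap
        return match if p == q else mismatch

    total = 0
    for col in zip(*alignment):          # iterate over alignment columns
        for i in range(len(col)):
            for j in range(i + 1, len(col)):
                total += pair(col[i], col[j])
    return total

print(sp_score(["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]))
```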

Page 5: CS5263 Bioinformatics

Algorithms

• MDP (multidimensional dynamic programming)

• Progressive alignment

• Iterative refinement

• Restricted DP

Page 6: CS5263 Bioinformatics

MDP

• Similar to pair-wise alignment
– O(2^N L^N) running time
– O(L^N) memory

F(i,j,k) = max of:

F(i-1, j-1, k-1) + S(xi, yj, zk),
F(i-1, j-1, k  ) + S(xi, yj, -),
F(i-1, j,   k-1) + S(xi, -, zk),
F(i,   j-1, k-1) + S(-, yj, zk),
F(i-1, j,   k  ) + S(xi, -, -),
F(i,   j-1, k  ) + S(-, yj, -),
F(i,   j,   k-1) + S(-, -, zk)

[Figure: the 3D DP cube; cell (i,j,k) is reached from its seven neighbors (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1)]
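As a rough illustration of this recurrence, the sketch below fills the three-dimensional table directly; the column scorer sp_column and its toy values are assumptions for demonstration, and traceback is omitted.

```python
from itertools import product

def mdp_align3(x, y, z, S):
    """Fill the 3D DP table F(i,j,k); return the optimal alignment score."""
    NEG = float("-inf")
    F = [[[NEG] * (len(z) + 1) for _ in range(len(y) + 1)]
         for _ in range(len(x) + 1)]
    F[0][0][0] = 0.0
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1),
                           range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = NEG
        # 2^3 - 1 = 7 predecessors: each sequence emits a character or a gap
        for di, dj, dk in product((0, 1), repeat=3):
            if (di, dj, dk) == (0, 0, 0) or di > i or dj > j or dk > k:
                continue
            a = x[i - 1] if di else "-"
            b = y[j - 1] if dj else "-"
            c = z[k - 1] if dk else "-"
            best = max(best, F[i - di][j - dj][k - dk] + S(a, b, c))
        F[i][j][k] = best
    return F[len(x)][len(y)][len(z)]

def sp_column(a, b, c):
    """Toy sum-of-pairs column score: match +1, mismatch/gap -1."""
    def pair(p, q):
        if "-" in (p, q):
            return 0 if p == q else -1
        return 1 if p == q else -1
    return pair(a, b) + pair(a, c) + pair(b, c)

print(mdp_align3("ACGCGGC", "ACGCGAG", "GCCGCGAG", sp_column))
```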

Page 7: CS5263 Bioinformatics

Progressive alignment

• Most popular multiple alignment algorithm
– CLUSTALW

• Main idea:
– Construct a guide tree based on pair-wise alignment scores (a sketch of this ordering step follows below)
– Align the most similar sequences first
– Progressively add other sequences

• Pros: fast (O(NL^2))
• Cons: an initial bad alignment is frozen
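As referenced above, here is a hypothetical sketch of only the guide-tree ordering step: repeatedly join the two closest clusters, which gives the order in which a progressive aligner would merge sequences and profiles. The distances and merging rule are assumptions; the actual profile alignment performed at each join is omitted.

```python
def merge_order(names, dist):
    """Join order from pairwise distances; dist keys are sorted 2-tuples."""
    clusters = [frozenset([n]) for n in names]

    def d(c1, c2):  # single-linkage distance between clusters
        return min(dist[tuple(sorted((a, b)))] for a in c1 for b in c2)

    order = []
    while len(clusters) > 1:
        # pick the closest pair of clusters and merge them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: d(clusters[p[0]], clusters[p[1]]))
        order.append((clusters[i], clusters[j]))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return order

# x and y are most similar, so they are aligned first, then z is added
for a, b in merge_order(["x", "y", "z"],
                        {("x", "y"): 1.0, ("x", "z"): 3.0, ("y", "z"): 2.5}):
    print(sorted(a), "+", sorted(b))
```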

Page 8: CS5263 Bioinformatics

Iterative Refinement

• Basic idea:
– Do progressive alignment first
– Iteratively: remove a sequence, and realign it back while keeping the rest fixed

• A note on its convergence guarantee:
– Every time we realign a sequence, we improve (or at least do not decrease) its score
– Therefore, the algorithm must converge to either a global or a local maximum

Page 9: CS5263 Bioinformatics

Restricted MDP

• Similar to bounded DP in pair-wise alignment:
1. Construct a progressive multiple alignment m
2. Run MDP, restricted to radius R from m

• Running time: O(2^N R^(N-1) L)

[Figure: the alignment paths of sequences x, y, z, with the search restricted to a band of radius R around m]

Page 10: CS5263 Bioinformatics

Today

• Probability and statistics

• Biology background for motif finding

Page 11: CS5263 Bioinformatics

Probability Basics

• Definition (informal)
– Probabilities are numbers assigned to events that indicate “how likely” it is that the event will occur when a random experiment is performed
– A probability law for a random experiment is a rule that assigns probabilities to the events in the experiment
– The sample space S of a random experiment is the set of all possible outcomes

Page 12: CS5263 Bioinformatics

Example

0 ≤ P(Ai) ≤ 1

P(S) = 1

Page 13: CS5263 Bioinformatics

Random variable

• A random variable is a function from the sample space to the space of possible values of the variable
– When we toss a coin, the number of times that we see heads is a random variable
– Can be discrete or continuous

• The resulting number after rolling a die (discrete)
• The weight of an individual (continuous)

Page 14: CS5263 Bioinformatics

Cumulative distribution function (cdf)

• The cumulative distribution function FX(x) of a random variable X is defined as the probability of the event {X≤x}

FX(x) = P(X ≤ x) for −∞ < x < +∞

Page 15: CS5263 Bioinformatics

Probability density function (pdf)

• The probability density function of a continuous random variable X, if it exists, is defined as the derivative of FX(x):

fX(x) = dFX(x) / dx

• For discrete random variables, the equivalent of the pdf is the probability mass function (pmf):

pX(x) = P(X = x)

Page 16: CS5263 Bioinformatics

Probability density function vs probability

• What is the probability of somebody weighing 200 lb?
• The figure shows a density of about 0.62
– What then is the probability of weighing 200.00001 lb?

• The right question would be:
– What's the probability of somebody weighing between 199 and 201 lb?

• The probability mass function, in contrast, gives true probabilities
– The chance to get any face of a fair die is 1/6
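To illustrate the density-versus-probability distinction in code: a small sketch assuming, purely hypothetically, that weight follows a Normal(180, 25) distribution; the parameters are made up, so the numbers will not match the slide's figure.

```python
import math

mu, sigma = 180.0, 25.0  # hypothetical weight distribution, in lb

def pdf(x):  # density of Normal(mu, sigma)
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) \
        / (sigma * math.sqrt(2 * math.pi))

def cdf(x):  # P(X <= x), via the error function
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

print(pdf(200.0))               # a density value, not a probability
print(cdf(201.0) - cdf(199.0))  # P(199 <= X <= 201): a true probability
print(cdf(200.0) - cdf(200.0))  # a single point has probability 0
```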

Page 17: CS5263 Bioinformatics

Some common distributions

• Discrete:
– Binomial
– Multinomial
– Geometric
– Hypergeometric
– Poisson

• Continuous:
– Normal (Gaussian)
– Uniform
– Extreme value distribution (EVD)
– Gamma
– Beta
– …

Page 18: CS5263 Bioinformatics

Probabilistic Calculus

• If A, B are mutually exclusive:
– P(A U B) = P(A) + P(B)

• Thus: P(not(A)) = P(Ac) = 1 - P(A)

[Figure: Venn diagram of two disjoint events A and B]

Page 19: CS5263 Bioinformatics

Probabilistic Calculus

• P(A U B) = P(A) + P(B) – P(A ∩ B)

Page 20: CS5263 Bioinformatics

Conditional probability

• The joint probability of two events A and B, P(A∩B), or simply P(A, B), is the probability that events A and B occur at the same time.

• The conditional probability P(A|B) is the probability that A occurs given that B occurred:

P(A | B) = P(A ∩ B) / P(B)

Page 21: CS5263 Bioinformatics

Example

• Roll a die
– If I tell you the number is less than 4
– What is the probability of an even number?

• P(d = even | d < 4) = P(d = even ∩ d < 4) / P(d < 4)
= P(d = 2) / P(d = 1, 2, or 3) = (1/6) / (3/6) = 1/3
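The same answer falls out of brute-force enumeration over the sample space, using exact fractions:

```python
from fractions import Fraction

outcomes = range(1, 7)                    # a fair six-sided die
lt4 = [d for d in outcomes if d < 4]      # condition: d < 4
even = [d for d in lt4 if d % 2 == 0]     # event within the condition
print(Fraction(len(even), len(lt4)))      # P(even | d < 4) = 1/3
```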

Page 22: CS5263 Bioinformatics

Independence

• P(A | B) = P(A ∩ B) / P(B)
=> P(A ∩ B) = P(B) * P(A | B)

• A, B are independent iff
– P(A ∩ B) = P(A) * P(B)
– That is, P(A) = P(A | B)

• Also implies that P(B) = P(B | A)
– P(A ∩ B) = P(B) * P(A | B) = P(A) * P(B | A)

Page 23: CS5263 Bioinformatics

Examples

• Are P(d = even) and P(d < 4) independent?
– P(d = even and d < 4) = 1/6
– P(d = even) = ½
– P(d < 4) = ½
– ½ * ½ = ¼ > 1/6, so they are not independent

• If your die actually has 8 faces, will P(d = even) and P(d < 5) be independent? (See the check below.)

• Are P(even in first roll) and P(even in second roll) independent?

• Playing cards: are the suit and rank independent?
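These questions can be settled by enumeration; a small sketch that checks the product rule exactly for the six-sided die and the hypothetical eight-sided die:

```python
from fractions import Fraction

def independent(n_faces, ev_a, ev_b):
    faces = range(1, n_faces + 1)
    p = lambda ev: Fraction(sum(ev(d) for d in faces), n_faces)
    p_ab = Fraction(sum(ev_a(d) and ev_b(d) for d in faces), n_faces)
    return p_ab == p(ev_a) * p(ev_b)    # exact test of P(A∩B) = P(A)P(B)

print(independent(6, lambda d: d % 2 == 0, lambda d: d < 4))  # False
print(independent(8, lambda d: d % 2 == 0, lambda d: d < 5))  # True
```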

Page 24: CS5263 Bioinformatics

Theorem of total probability

• Let B1, B2, …, BN be mutually exclusive events whose union equals the sample space S. We refer to these sets as a partition of S.

• An event A can be represented as:

A = (A∩B1) U (A∩B2) U … U (A∩BN)

• Since B1, B2, …, BN are mutually exclusive, then

P(A) = P(A∩B1) + P(A∩B2) + … + P(A∩BN)

• And therefore

P(A) = P(A|B1)*P(B1) + P(A|B2)*P(B2) + … + P(A|BN)*P(BN)

= Σi P(A | Bi) * P(Bi)

Page 25: CS5263 Bioinformatics

Example

• Roll a loaded die: 50% of the time it shows 6, and 10% of the time each for 1 to 5

• What's the probability of getting an even number?

Prob(even) = Prob(even | d < 6) * Prob(d < 6) + Prob(even | d = 6) * Prob(d = 6)
= 2/5 * 0.5 + 1 * 0.5
= 0.7

Page 26: CS5263 Bioinformatics

Another example

• We have a box of dice: 99% of them are fair, with probability 1/6 for each face; 1% are loaded so that six comes up 50% of the time. We pick a die at random and roll it; what's the probability we'll get a six?

• P(six) = P(six | fair) * P(fair) + P(six | loaded) * P(loaded)
= 1/6 * 0.99 + 0.5 * 0.01 = 0.17 > 1/6
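A quick numerical check of this total-probability computation:

```python
p_fair, p_loaded = 0.99, 0.01
p_six = (1 / 6) * p_fair + 0.5 * p_loaded  # theorem of total probability
print(p_six)                               # 0.17, greater than 1/6
```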

Page 27: CS5263 Bioinformatics

Bayes theorem

• P(A ∩ B) = P(B) * P(A | B) = P(A) * P(B | A)

=> P(B | A) = P(A | B) * P(B) / P(A)

where P(B | A) is the posterior probability of B given A, P(A | B) is the likelihood, P(B) is the prior of B, and P(A) is the normalizing constant.

This is known as Bayes Theorem or Bayes Rule, and is (one of) the most useful relations in probability and statistics

Bayes Theorem is definitely the fundamental relation in Statistical Pattern Recognition

Page 28: CS5263 Bioinformatics

Bayes theorem (cont’d)

• Given B1, B2, …, BN, a partition of the sample space S. Suppose that event A occurs; what is the probability of event Bj?

• P(Bj | A) = P(A | Bj) * P(Bj) / P(A)

= P(A | Bj) * P(Bj) / Σj P(A | Bj) * P(Bj)

Bj: different models

Given the observation A, should you choose the model that maximizes P(Bj | A) or P(A | Bj)? It depends on how much you know about Bj!
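A generic sketch of this update rule; the priors and likelihoods below are the fair/loaded-die numbers from the earlier example:

```python
def posterior(priors, likelihoods):
    """P(Bj | A) for a partition B1..BN, given P(Bj) and P(A | Bj)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(joint)                  # P(A), the normalizing constant
    return [j / z for j in joint]

# fair vs loaded die, after observing a single six
print(posterior([0.99, 0.01], [1 / 6, 0.5]))  # [~0.97, ~0.03]
```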

Page 29: CS5263 Bioinformatics

Example

• Prosecutor's fallacy
– Some crime happened
– The suspect did not leave any evidence, except some hair
– The police got his DNA from his hair

• An expert matched the DNA with that of a suspect
– The expert said that both the false-positive and false-negative rates are 10^-6

• Can this be used as evidence of guilt against the suspect?

Page 30: CS5263 Bioinformatics

Prosecutor’s fallacy

• Prob (match | innocent) = 10^-6

• Prob (no match | guilty) = 10^-6

• Prob (match | guilty) = 1 - 10^-6 ≈ 1

• Prob (no match | innocent) = 1 - 10^-6 ≈ 1

• Prob (guilty | match) = ?

Page 31: CS5263 Bioinformatics

Prosecutor’s fallacy

• P(g | m) = P(m | g) * P(g) / P(m) ≈ P(g) / P(m)

• P(g): the probability for someone to be guilty with no other evidence
• P(m): the probability for a DNA match

• How to get these two numbers?
– We don't really care about P(m)
– We want to compare two models:
• P(g | m) and P(i | m)

Page 32: CS5263 Bioinformatics

Prosecutor’s fallacy

• P(i | m) = P(m | i) * P(i) / P(m) = 10^-6 * P(i) / P(m)

• Therefore
P(i | m) / P(g | m) = 10^-6 * P(i) / P(g)

• P(i) + P(g) = 1

• It is clear, therefore, that whether we can conclude the suspect is guilty depends on the prior probability P(i)

• How do you get P(i)?

Page 33: CS5263 Bioinformatics

Prosecutor’s fallacy

• How do you get P(i)?
• It depends on what other information you have about the suspect
• Say the suspect has no other connection to the crime, and the overall crime rate is 10^-7

• That's a reasonable prior for P(g)
• P(g) = 10^-7, P(i) ≈ 1
• P(i | m) / P(g | m) = 10^-6 * P(i) / P(g) ≈ 10^-6 / 10^-7 = 10

Page 34: CS5263 Bioinformatics

• P(observation | model1) / P(observation | model2): likelihood-ratio (LR) test

• Often take the logarithm: log (P(m | i) / P(m | g))
• This is the log likelihood ratio (LLR) score, or log odds ratio score

• Bayesian model selection: log (P(model1 | observation) / P(model2 | observation))

= LLR + log P(model1) - log P(model2)
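Plugging the prosecutor's-fallacy numbers into this decomposition (base-10 logs for readability) recovers the factor of 10 derived on the next slide:

```python
import math

# log-likelihood ratio for innocent vs guilty, given a match
llr = math.log10(1e-6 / (1 - 1e-6))    # log10 [P(m|i) / P(m|g)] ~ -6
prior = math.log10((1 - 1e-7) / 1e-7)  # log10 [P(i) / P(g)] ~ +7
print(llr + prior)                     # log10 [P(i|m) / P(g|m)] ~ 1, ratio 10
```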

Page 35: CS5263 Bioinformatics

Prosecutor’s fallacy

• P(i | m) / P(g | m) = 10^-6 / 10^-7 = 10
• Therefore, we would say the suspect is more likely to be innocent than guilty, given only the DNA samples

• We can also explicitly calculate P(i | m):

P(m) = P(m | i) * P(i) + P(m | g) * P(g)

= 10^-6 * 1 + 1 * 10^-7

= 1.1 x 10^-6

P(i | m) = P(m | i) * P(i) / P(m) = 1 / 1.1 = 0.91
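The explicit calculation, checked numerically:

```python
p_m_i, p_m_g = 1e-6, 1 - 1e-6   # P(match | innocent), P(match | guilty)
p_g = 1e-7                      # prior: the overall crime rate
p_i = 1 - p_g
p_m = p_m_i * p_i + p_m_g * p_g  # total probability of a match
print(p_m)                       # ~1.1e-6
print(p_m_i * p_i / p_m)         # P(innocent | match) ~ 0.91
```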

Page 36: CS5263 Bioinformatics

Prosecutor’s fallacy

• If you have other evidence, P(g) could be much larger than the average crime rate
• In that case, the DNA test may give you higher confidence

• How to decide the prior?
– Subjective?
– Important?
– There have been historical debates about Bayesian statistics
– Some strongly support it, some are strongly against it
– Growing interest in many fields

• However, there is no question about conditional probability
• If all priors are equally likely, decisions based on Bayesian inference and the likelihood test are equivalent
• We use whichever is appropriate

Page 37: CS5263 Bioinformatics

Another example

• A test for a rare disease claims that it will report a positive result for 99.5% of people with the disease, and a negative result for 99.9% of those without.

• The disease is present in the population at 1 in 100,000

• What is P(disease | positive test)?

• What is P(disease | negative test)?
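One way to work out the answers, sketched with the numbers from the slide (sensitivity 99.5%, specificity 99.9%, prevalence 1 in 100,000):

```python
sens = 0.995        # P(positive | disease)
spec = 0.999        # P(negative | no disease)
p_d = 1e-5          # prevalence: 1 in 100,000

p_pos = sens * p_d + (1 - spec) * (1 - p_d)
print(sens * p_d / p_pos)          # P(disease | positive): about 1%

p_neg = (1 - sens) * p_d + spec * (1 - p_d)
print((1 - sens) * p_d / p_neg)    # P(disease | negative): ~5e-8
```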

Page 38: CS5263 Bioinformatics

Yet another example

• We've talked about the casino's box of dice

• 99% fair, 1% loaded (six comes up 50% of the time)

• We said that if we randomly pick a die and roll it, we have a 17% chance of getting a six

• If we get 3 sixes in a row, what's the chance that the die is loaded?

• How about 5 sixes in a row?

Page 39: CS5263 Bioinformatics

• P(loaded | 3 sixes in a row) = P(3 sixes in a row | loaded) * P(loaded) / P(3 sixes in a row)
= 0.5^3 * 0.01 / (0.5^3 * 0.01 + (1/6)^3 * 0.99) = 0.21

• P(loaded | 5 sixes in a row) = P(5 sixes in a row | loaded) * P(loaded) / P(5 sixes in a row)
= 0.5^5 * 0.01 / (0.5^5 * 0.01 + (1/6)^5 * 0.99) = 0.71
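The same computation, parameterized by the run length n:

```python
def p_loaded_given_sixes(n, prior_loaded=0.01):
    """Posterior P(loaded | n sixes in a row), by Bayes rule."""
    num = 0.5 ** n * prior_loaded
    return num / (num + (1 / 6) ** n * (1 - prior_loaded))

print(p_loaded_given_sixes(3))  # ~0.21
print(p_loaded_given_sixes(5))  # ~0.71
```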

Page 40: CS5263 Bioinformatics

Relation to multiple testing problem

• When searching a DNA sequence against a database, you get a high score, with a significant p-value

• P(unrelated | high score) / P(related | high score)

= [P(high score | unrelated) * P(unrelated)] / [P(high score | related) * P(related)]

where P(high score | unrelated) / P(high score | related) is the likelihood ratio

• P(high score | unrelated) is much smaller than P(high score | related)

• But your database is huge, and most sequences should be unrelated, so P(unrelated) is much larger than P(related)

Page 41: CS5263 Bioinformatics

Question

• We've seen that, given a sequence of observations and two models, we can test which model is more likely to have generated the data
– Is the die loaded or fair?
– Either a likelihood test or Bayesian inference

• Given a set of observations and a model, can you estimate the parameters?
– Given the results of rolling a die, how do we infer the probability of each face?

Page 42: CS5263 Bioinformatics

Question

• You are told that there are two dice; one is loaded with a 50% chance of showing six, one is fair.

• You are given a series of numbers resulting from rolling the two dice

• Assume die switching is rare

• Can you tell which number is generated by which die?

Page 43: CS5263 Bioinformatics

Question

• You are told that there are two dice; one is loaded, one is fair. But you don't know how it is loaded

• You are given a series of numbers resulting from rolling the two dice

• Assume die switching is rare

• Can you tell how the die is loaded, and which number is generated by which die?