Position-specific scoring matrices: Decrease complexity through info analysis


Transcript of: Position-specific scoring matrices: Decrease complexity through info analysis

Page 1: Position-specific scoring matrices Decrease complexity through info analysis


Training set including sequences from two Nostoc strains

71-devB CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA

Np-devB CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA

71-glnA AGGTTAATATTACCTGTAATCCAGACGTTCTGTAACAAAGACTACAAAACTGTCTAATGTTTAGAATCTACGATAT

Np-glnA AGGTTAATATAACCTGATAATCCAGATATCTGTAACATAAGCTACAAAATCCGCTAATGTCTACTATTTAAGATAT

71-hetC GTTATTGTTAGGTTGCTATCGGAAAAAATCTGTAACATGAGATACACAATAGCATTTATATTTGCTTTAGTATCTC

71-nirA TATTAAACTTACGCATTAATACGAGAATTTTGTAGCTACTTATACTATTTTACCTGAGATCCCGACATAACCTTAG

Np-nirA CATCCATTTTCAGCAATTTTACTAAAAAATCGTAACAATTTATACGATTTTAACAGAAATCTCGTCTTAAGTTATG

71-ntcB ATTAATGAAATTTGTGTTAATTGCCAAAGCTGTAACAAAATCTACCAAATTGGGGAGCAAAATCAGCTAACTTAAT

Np-ntcB TTATACAAATGTAAATCACAGGAAAATTACTGTAACTAACTATACTAAATTGCGGAGAATAAACCGTTAACTTAGT

71-urt ATTAATTTTTATTTAAAGGAATTAGAATTTAGTATCAAAAATAACAATTCAATGGTTAAATATCAAACTAATATCA

Np-urt TTATTCTTCTGTAACAAAAATCAGGCGTTTGGTATCCAAGATAACTTTTTACTAGTAAACTATCGCACTATCATCA

It might increase the performance of our PSSM if we can filter out columns that don’t have “enough information”

Not every column is as well conserved – some seem to be more informative about what a binding site looks like!
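As a concrete starting point, here is a minimal Python sketch (not from the slides; the name `column_probs` is made up for illustration) that tallies per-column base frequencies for the aligned training set:

```python
# Minimal sketch: per-column base frequencies for an aligned training set.
# `seqs` would hold all eleven promoter sequences shown above; only two
# are listed here for brevity.
from collections import Counter

seqs = [
    "CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA",  # 71-devB
    "CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA",  # Np-devB
    # ... remaining training sequences ...
]

def column_probs(seqs):
    """Return one {base: frequency} dict per alignment column."""
    probs = []
    for col in zip(*seqs):  # iterate over columns of the alignment
        counts = Counter(col)
        total = sum(counts.values())
        probs.append({b: n / total for b, n in counts.items()})
    return probs
```

These per-column frequencies are all we need for the uncertainty calculation that follows.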

Page 2: Position-specific scoring matrices Decrease complexity through info analysis


Uncertainty: H_c = -Σ_{i=1}^{M} p_ic log2(p_ic), where p_ic is the frequency of base i in column c
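Applying that formula directly, a small sketch (the helper name `column_uncertainty` is hypothetical) that computes H_c from one column’s base frequencies:

```python
import math

def column_uncertainty(probs):
    """H_c = -sum over bases of p_ic * log2(p_ic) for one column."""
    return sum(-p * math.log2(p) for p in probs.values() if p > 0)

# A perfectly conserved column carries no uncertainty; an even split
# over all four bases carries the maximum of 2 bits:
print(column_uncertainty({"A": 1.0}))                                    # 0.0
print(column_uncertainty({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))  # 2.0
```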

[Figure: “Uncertainty Distribution” plot; axes labeled “fraction” (0 to 1) and “Uncertainty (H)” (0 to 0.6).]

Confusing!!!

Page 3: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Uncertainty when all outcomes are equally probable

Pretend we have a machine that spits out an infinitely long string of nucleotides, but that each one is EQUALLY LIKELY to occur:

G A T G A C T C …

How uncertain are we about the outcome BEFORE we see each new character produced by the machine?

Intuitively, this uncertainty will depend on how many possibilities exist

Page 4: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Quantifying uncertainty when outcomes are equally probable

One way to quantify uncertainty is to ask: “What is the minimum number of questions required to remove all ambiguity about the outcome?”

If the possibilities are: A or G or C or T, how many yes/no questions do we need to ask?

Page 5: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Quantifying uncertainty when outcomes are equally probable

A yes/no decision tree over the four bases:

        {A,G,C,T}
        /       \
     {A,G}     {C,T}
     /   \     /   \
    A     G   C     T

M = 4 (alphabet size)

H = log2(M)

The number of decisions depends on the height of the decision tree

With M = 4 we are uncertain by log2(4) = 2 bits before each new symbol is made by our machine
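To make the numbers concrete, a quick check with Python’s standard library, using the usual alphabet sizes (2 for a coin, 4 for DNA, 20 for amino acids):

```python
import math

# H = log2(M) bits of uncertainty per symbol when all M symbols are
# equally likely: 1 bit for a coin flip, 2 bits for DNA, ~4.32 for proteins.
for M in (2, 4, 20):
    print(M, math.log2(M))
# 2 1.0
# 4 2.0
# 20 4.321928094887363
```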

Page 6: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Uncertainty when all outcomes are equally probable

After we have received a new symbol from our machine, we are less uncertain

Intuitively, when we become less uncertain, it means we have gained information

Information = uncertainty_before - uncertainty_after, i.e. Information = H_before - H_after

Note that only in the special case where no uncertainty remains afterward (H_after = 0) does information = H_before

In the real world this never happens because of noise in the system!!

Page 7: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

Fine, but where did we get

H = -Σ_{i=1}^{M} P_i log2(P_i) ?

This formula is necessary when outcomes are not equally probable!

Page 8: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Uncertainty with unequal probabilities

Now our machine produces a string of symbols, but some are more likely to occur than others:

P_A = 0.6, P_G = 0.1, P_C = 0.1, P_T = 0.2

Page 9: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Uncertainty with unequal probabilities

Now our machine produces a string of symbols, but we know that some are more likely to occur than others:

A A A T A A G T C …

Now how uncertain are we about the outcome BEFORE we see each new character?

Are we more or less surprised when we see an “A” or a “C”?

Page 10: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Uncertainty with unequal probabilities

Now our machine produces a string of symbols, but we know that some are more likely to occur than others:

Do you agree that we are less surprised to see an “A” than we are to see a “G”?

A G A A A A T T C

Do you think that the output of our new machine is more or less uncertain?

Page 11: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: What about when outcomes are not equally probable?

log2(M) = -log2(M^-1) = -log2(1/M) = -log2(P)

where P = 1/M is the probability of a symbol appearing

Page 12: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: What about when outcomes are not equally probable?

P_A = 0.6, P_G = 0.1, P_C = 0.1, P_T = 0.2   (M = 4)

Σ_{i=1}^{M} P_i = 1

Remember that the probabilities of all possible symbols must sum to 1!

Page 13: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: How surprised are we to see a given symbol?

u_i = -log2(P_i)   (where P_i = the probability of the ith symbol)

u_A = -log2(0.6) = 0.7

u_G = -log2(0.1) = 3.3

u_C = -log2(0.1) = 3.3

u_T = -log2(0.2) = 2.3

u_i is therefore called the surprisal for symbol i
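These surprisal values are easy to verify with a quick sketch:

```python
import math

# Verify the surprisal values u_i = -log2(P_i) for the biased machine.
P = {"A": 0.6, "G": 0.1, "C": 0.1, "T": 0.2}
for base, p in P.items():
    print(base, round(-math.log2(p), 1))
# A 0.7, G 3.3, C 3.3, T 2.3
```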

Page 14: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

What does the surprisal for a symbol have to do with uncertainty?

u_i = -log2(P_i)   (the “surprisal”)

Uncertainty is the average surprisal for the infinite string of symbols produced by our machine

Page 15: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Let’s first imagine that our machine only produces a finite string of N symbols

N = Σ_{i=1}^{M} N_i

N_i is the number of times the ith symbol occurred in a string of length N

For example, for the string “AAGTAACGA” (N = 9):

N_A = 5, N_G = 2, N_C = 1, N_T = 1

Page 16: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

For every N_i, there is a corresponding surprisal u_i;

therefore the average surprisal for N symbols will be:

average surprisal = (Σ_{i=1}^{M} N_i u_i) / (Σ_{i=1}^{M} N_i)

                  = (Σ_{i=1}^{M} N_i u_i) / N

                  = Σ_{i=1}^{M} (N_i / N) u_i
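As a worked check using the finite string “AAGTAACGA” from the previous page, a short sketch computing the average surprisal directly from the counts:

```python
import math
from collections import Counter

# Average surprisal of the finite string from page 15, computed as
# sum over i of (N_i / N) * u_i, where u_i = -log2(N_i / N).
s = "AAGTAACGA"
counts = Counter(s)                     # N_A=5, N_G=2, N_C=1, N_T=1
N = len(s)                              # N = 9
avg = sum((n / N) * -math.log2(n / N) for n in counts.values())
print(round(avg, 3))                    # ~1.658 bits
```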

Page 17: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

For every N_i, there is a corresponding surprisal u_i;

therefore the average surprisal for N symbols will be:

Σ_{i=1}^{M} (N_i / N) u_i = Σ_{i=1}^{M} P_i u_i

Remember that P_i is simply the probability of generating the ith symbol!

But wait! We also already defined u_i !!

Page 18: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

H = Σ_{i=1}^{M} P_i u_i

and since u_i = -log2(P_i), therefore:

H = -Σ_{i=1}^{M} P_i log2(P_i)

Congratulations! This is Claude Shannon’s famous formula defining uncertainty when the probability of each symbol is unequal!
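A quick sketch evaluating Shannon’s formula for the biased machine above (P_A = 0.6, P_G = 0.1, P_C = 0.1, P_T = 0.2):

```python
import math

# Shannon uncertainty for the biased machine: H = -sum(P_i * log2(P_i)).
P = [0.6, 0.1, 0.1, 0.2]
H = sum(-p * math.log2(p) for p in P)
print(round(H, 3))   # ~1.571 bits, less than the fair machine's 2 bits
```

As expected, biasing the symbol probabilities lowers the uncertainty below the equiprobable maximum.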

Page 19: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

Uncertainty is largest when all symbols are equally probable!

How does it reduce assuming equiprobable symbols?

H_eq = -Σ_{i=1}^{M} (1/M) log2(1/M)

     = -(1/M) log2(1/M) · Σ_{i=1}^{M} 1

     = -M (1/M) log2(1/M)

     = -log2(1/M) = log2(M)

Page 20: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

Uncertainty when M = 2:

H = -Σ_{i=1}^{M} P_i log2(P_i)

Uncertainty is largest when all symbols are equally probable!

Page 21: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

OK, but how much information is present in each column?

Information (R) = H_before - H_after

R = log2(M) - ( -Σ_{i=1}^{M} P_i log2(P_i) )

Now “before” and “after” refer to before and after we examined the contents of a column
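Putting the pieces together, a minimal sketch (the function name and the 1.0-bit threshold are made-up illustrations, not from the slides) that scores each alignment column by R and keeps only the informative ones, which is exactly the filtering idea from page 1:

```python
import math
from collections import Counter

def informative_columns(seqs, threshold=1.0):
    """Score each alignment column by R = 2 - H and keep those
    at or above the (hypothetical) threshold in bits."""
    keep = []
    for c, col in enumerate(zip(*seqs)):
        counts = Counter(col)
        N = sum(counts.values())
        H_after = sum(-(n / N) * math.log2(n / N) for n in counts.values())
        R = 2.0 - H_after               # H_before = log2(4) = 2 bits for DNA
        if R >= threshold:
            keep.append((c, round(R, 2)))
    return keep
```

The PSSM would then be built from, or restricted to, the retained columns.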

Page 22: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

http://weblogo.berkeley.edu/

Sequence logos graphically display how much information is present in each column
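The conventional recipe behind such logos stacks each base to a height of P_bc × R_c in column c; a minimal sketch (ignoring the small-sample correction that tools like WebLogo apply):

```python
import math
from collections import Counter

def logo_heights(seqs):
    """Height of base b in column c of a sequence logo: P_bc * R_c."""
    heights = []
    for col in zip(*seqs):
        counts = Counter(col)
        N = sum(counts.values())
        probs = {b: n / N for b, n in counts.items()}
        H = sum(-p * math.log2(p) for p in probs.values())
        R = 2.0 - H                     # information content of this column
        heights.append({b: p * R for b, p in probs.items()})
    return heights
```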