Position-specific scoring matrices: Decrease complexity through info analysis


Transcript of: Position-specific scoring matrices: Decrease complexity through info analysis

Page 1: Position-specific scoring matrices Decrease complexity through info analysis


Training set including sequences from two Nostoc strains

71-devB CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA

Np-devB CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA

71-glnA AGGTTAATATTACCTGTAATCCAGACGTTCTGTAACAAAGACTACAAAACTGTCTAATGTTTAGAATCTACGATAT

Np-glnA AGGTTAATATAACCTGATAATCCAGATATCTGTAACATAAGCTACAAAATCCGCTAATGTCTACTATTTAAGATAT

71-hetC GTTATTGTTAGGTTGCTATCGGAAAAAATCTGTAACATGAGATACACAATAGCATTTATATTTGCTTTAGTATCTC

71-nirA TATTAAACTTACGCATTAATACGAGAATTTTGTAGCTACTTATACTATTTTACCTGAGATCCCGACATAACCTTAG

Np-nirA CATCCATTTTCAGCAATTTTACTAAAAAATCGTAACAATTTATACGATTTTAACAGAAATCTCGTCTTAAGTTATG

71-ntcB ATTAATGAAATTTGTGTTAATTGCCAAAGCTGTAACAAAATCTACCAAATTGGGGAGCAAAATCAGCTAACTTAAT

Np-ntcB TTATACAAATGTAAATCACAGGAAAATTACTGTAACTAACTATACTAAATTGCGGAGAATAAACCGTTAACTTAGT

71-urt ATTAATTTTTATTTAAAGGAATTAGAATTTAGTATCAAAAATAACAATTCAATGGTTAAATATCAAACTAATATCA

Np-urt TTATTCTTCTGTAACAAAAATCAGGCGTTTGGTATCCAAGATAACTTTTTACTAGTAAACTATCGCACTATCATCA

It might increase the performance of our PSSM if we can filter out columns that don’t have “enough information”

Not every column is as well conserved – some seem to be more informative about what a binding site looks like!
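As a concrete starting point, here is a minimal Python sketch (not from the slides; the name `column_probs` is made up for illustration) that tallies per-column base frequencies for the aligned training set:

```python
# Minimal sketch: per-column base frequencies for an aligned training set.
# `seqs` would hold all eleven promoter sequences shown above; only two
# are listed here for brevity.
from collections import Counter

seqs = [
    "CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA",  # 71-devB
    "CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA",  # Np-devB
    # ... remaining training sequences ...
]

def column_probs(seqs):
    """Return one {base: frequency} dict per alignment column."""
    probs = []
    for col in zip(*seqs):  # iterate over columns of the alignment
        counts = Counter(col)
        total = sum(counts.values())
        probs.append({b: n / total for b, n in counts.items()})
    return probs
```

These per-column frequencies are all we need for the uncertainty calculation that follows.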

Page 2: Position-specific scoring matrices Decrease complexity through info analysis


Uncertainty: H_c = -Σ_{i=1}^{M} p_ic log2(p_ic), where p_ic is the frequency of base i in column c
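Applying that formula directly, a small sketch (the helper name `column_uncertainty` is hypothetical) that computes H_c from one column’s base frequencies:

```python
import math

def column_uncertainty(probs):
    """H_c = -sum over bases of p_ic * log2(p_ic) for one column."""
    return sum(-p * math.log2(p) for p in probs.values() if p > 0)

# A perfectly conserved column carries no uncertainty; an even split
# over all four bases carries the maximum of 2 bits:
print(column_uncertainty({"A": 1.0}))                                    # 0.0
print(column_uncertainty({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))  # 2.0
```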

[Figure: “Uncertainty Distribution” plot; axes labeled “fraction” (0 to 1) and “Uncertainty (H)” (0 to 0.6).]

Confusing!!!

Page 3: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Uncertainty when all outcomes are equally probable

Pretend we have a machine that spits out an infinitely long string of nucleotides, but that each one is EQUALLY LIKELY to occur:

G A T G A C T C …

How uncertain are we about the outcome BEFORE we see each new character produced by the machine?

Intuitively, this uncertainty will depend on how many possibilities exist

Page 4: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Quantifying uncertainty when outcomes are equally probable

One way to quantify uncertainty is to ask: “What is the minimum number of questions required to remove all ambiguity about the outcome?”

If the possibilities are: A or G or C or T, how many yes/no questions do we need to ask?

Page 5: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Quantifying uncertainty when outcomes are equally probable

A yes/no decision tree over the four bases:

        {A,G,C,T}
        /       \
     {A,G}     {C,T}
     /   \     /   \
    A     G   C     T

M = 4 (alphabet size)

H = log2(M)

The number of decisions depends on the height of the decision tree

With M = 4 we are uncertain by log2(4) = 2 bits before each new symbol is made by our machine
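To make the numbers concrete, a quick check with Python’s standard library, using the usual alphabet sizes (2 for a coin, 4 for DNA, 20 for amino acids):

```python
import math

# H = log2(M) bits of uncertainty per symbol when all M symbols are
# equally likely: 1 bit for a coin flip, 2 bits for DNA, ~4.32 for proteins.
for M in (2, 4, 20):
    print(M, math.log2(M))
# 2 1.0
# 4 2.0
# 20 4.321928094887363
```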

Page 6: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Uncertainty when all outcomes are equally probable

After we have received a new symbol from our machine, we are less uncertain

Intuitively, when we become less uncertain, it means we have gained information

Information = uncertainty_before - uncertainty_after, i.e. Information = H_before - H_after

Note that only in the special case where no uncertainty remains afterward (H_after = 0) does information = H_before

In the real world this never happens because of noise in the system!!

Page 7: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

Fine, but where did we get

H = -Σ_{i=1}^{M} P_i log2(P_i) ?

This formula is necessary when outcomes are not equally probable!

Page 8: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Uncertainty with unequal probabilities

Now our machine produces a string of symbols, but some are more likely to occur than others:

P_A = 0.6, P_G = 0.1, P_C = 0.1, P_T = 0.2

Page 9: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Uncertainty with unequal probabilities

Now our machine produces a string of symbols, but we know that some are more likely to occur than others:

A A A T A A G T C …

Now how uncertain are we about the outcome BEFORE we see each new character?

Are we more or less surprised when we see an “A” or a “C”?

Page 10: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Uncertainty with unequal probabilities

Now our machine produces a string of symbols, but we know that some are more likely to occur than others:

Do you agree that we are less surprised to see an “A” than we are to see a “G”?

A G A A A A T T C

Do you think that the output of our new machine is more or less uncertain?

Page 11: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: What about when outcomes are not equally probable?

log2(M) = -log2(M^-1) = -log2(1/M) = -log2(P)

where P = 1/M is the probability of a symbol appearing

Page 12: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: What about when outcomes are not equally probable?

P_A = 0.6, P_G = 0.1, P_C = 0.1, P_T = 0.2   (M = 4)

Σ_{i=1}^{M} P_i = 1

Remember that the probabilities of all possible symbols must sum to 1!

Page 13: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: How surprised are we to see a given symbol?

u_i = -log2(P_i)   (where P_i = the probability of the ith symbol)

u_A = -log2(0.6) = 0.7

u_G = -log2(0.1) = 3.3

u_C = -log2(0.1) = 3.3

u_T = -log2(0.2) = 2.3

u_i is therefore called the surprisal for symbol i
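These surprisal values are easy to verify with a quick sketch:

```python
import math

# Verify the surprisal values u_i = -log2(P_i) for the biased machine.
P = {"A": 0.6, "G": 0.1, "C": 0.1, "T": 0.2}
for base, p in P.items():
    print(base, round(-math.log2(p), 1))
# A 0.7, G 3.3, C 3.3, T 2.3
```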

Page 14: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

What does the surprisal for a symbol have to do with uncertainty?

u_i = -log2(P_i)   (the “surprisal”)

Uncertainty is the average surprisal for the infinite string of symbols produced by our machine

Page 15: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory: Let’s first imagine that our machine only produces a finite string of N symbols

N = Σ_{i=1}^{M} N_i

N_i is the number of times the ith symbol occurred in a string of length N

For example, for the string “AAGTAACGA” (N = 9):

N_A = 5, N_G = 2, N_C = 1, N_T = 1

Page 16: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

For every N_i, there is a corresponding surprisal u_i;

therefore the average surprisal for N symbols will be:

average surprisal = (Σ_{i=1}^{M} N_i u_i) / (Σ_{i=1}^{M} N_i)

                  = (Σ_{i=1}^{M} N_i u_i) / N

                  = Σ_{i=1}^{M} (N_i / N) u_i
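As a worked check using the finite string “AAGTAACGA” from the previous page, a short sketch computing the average surprisal directly from the counts:

```python
import math
from collections import Counter

# Average surprisal of the finite string from page 15, computed as
# sum over i of (N_i / N) * u_i, where u_i = -log2(N_i / N).
s = "AAGTAACGA"
counts = Counter(s)                     # N_A=5, N_G=2, N_C=1, N_T=1
N = len(s)                              # N = 9
avg = sum((n / N) * -math.log2(n / N) for n in counts.values())
print(round(avg, 3))                    # ~1.658 bits
```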

Page 17: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

For every N_i, there is a corresponding surprisal u_i;

therefore the average surprisal for N symbols will be:

Σ_{i=1}^{M} (N_i / N) u_i = Σ_{i=1}^{M} P_i u_i

Remember that P_i is simply the probability of generating the ith symbol!

But wait! We also already defined u_i !!

Page 18: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

H = Σ_{i=1}^{M} P_i u_i

and since u_i = -log2(P_i), therefore:

H = -Σ_{i=1}^{M} P_i log2(P_i)

Congratulations! This is Claude Shannon’s famous formula defining uncertainty when the probability of each symbol is unequal!
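A quick sketch evaluating Shannon’s formula for the biased machine above (P_A = 0.6, P_G = 0.1, P_C = 0.1, P_T = 0.2):

```python
import math

# Shannon uncertainty for the biased machine: H = -sum(P_i * log2(P_i)).
P = [0.6, 0.1, 0.1, 0.2]
H = sum(-p * math.log2(p) for p in P)
print(round(H, 3))   # ~1.571 bits, less than the fair machine's 2 bits
```

As expected, biasing the symbol probabilities lowers the uncertainty below the equiprobable maximum.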

Page 19: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

Uncertainty is largest when all symbols are equally probable!

How does it reduce assuming equiprobable symbols?

H_eq = -Σ_{i=1}^{M} (1/M) log2(1/M)

     = -(1/M) log2(1/M) · Σ_{i=1}^{M} 1

     = -M (1/M) log2(1/M)

     = -log2(1/M) = log2(M)

Page 20: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

Uncertainty when M = 2:

H = -Σ_{i=1}^{M} P_i log2(P_i)

Uncertainty is largest when all symbols are equally probable!

Page 21: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

OK, but how much information is present in each column?

Information (R) = H_before - H_after

R = log2(M) - ( -Σ_{i=1}^{M} P_i log2(P_i) )

Now “before” and “after” refer to before and after we examined the contents of a column
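Putting the pieces together, a minimal sketch (the function name and the 1.0-bit threshold are made-up illustrations, not from the slides) that scores each alignment column by R and keeps only the informative ones, which is exactly the filtering idea from page 1:

```python
import math
from collections import Counter

def informative_columns(seqs, threshold=1.0):
    """Score each alignment column by R = 2 - H and keep those
    at or above the (hypothetical) threshold in bits."""
    keep = []
    for c, col in enumerate(zip(*seqs)):
        counts = Counter(col)
        N = sum(counts.values())
        H_after = sum(-(n / N) * math.log2(n / N) for n in counts.values())
        R = 2.0 - H_after               # H_before = log2(4) = 2 bits for DNA
        if R >= threshold:
            keep.append((c, round(R, 2)))
    return keep
```

The PSSM would then be built from, or restricted to, the retained columns.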

Page 22: Position-specific scoring matrices Decrease complexity through info analysis

Digression on information theory

http://weblogo.berkeley.edu/

Sequence logos graphically display how much information is present in each column
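The conventional recipe behind such logos stacks each base to a height of P_bc × R_c in column c; a minimal sketch (ignoring the small-sample correction that tools like WebLogo apply):

```python
import math
from collections import Counter

def logo_heights(seqs):
    """Height of base b in column c of a sequence logo: P_bc * R_c."""
    heights = []
    for col in zip(*seqs):
        counts = Counter(col)
        N = sum(counts.values())
        probs = {b: n / N for b, n in counts.items()}
        H = sum(-p * math.log2(p) for p in probs.values())
        R = 2.0 - H                     # information content of this column
        heights.append({b: p * R for b, p in probs.items()})
    return heights
```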