Position-specific scoring matrices
Decrease complexity through info analysis
Training set including sequences from two Nostocs
71-devB CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA
Np-devB CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA
71-glnA AGGTTAATATTACCTGTAATCCAGACGTTCTGTAACAAAGACTACAAAACTGTCTAATGTTTAGAATCTACGATAT
Np-glnA AGGTTAATATAACCTGATAATCCAGATATCTGTAACATAAGCTACAAAATCCGCTAATGTCTACTATTTAAGATAT
71-hetC GTTATTGTTAGGTTGCTATCGGAAAAAATCTGTAACATGAGATACACAATAGCATTTATATTTGCTTTAGTATCTC
71-nirA TATTAAACTTACGCATTAATACGAGAATTTTGTAGCTACTTATACTATTTTACCTGAGATCCCGACATAACCTTAG
Np-nirA CATCCATTTTCAGCAATTTTACTAAAAAATCGTAACAATTTATACGATTTTAACAGAAATCTCGTCTTAAGTTATG
71-ntcB ATTAATGAAATTTGTGTTAATTGCCAAAGCTGTAACAAAATCTACCAAATTGGGGAGCAAAATCAGCTAACTTAAT
Np-ntcB TTATACAAATGTAAATCACAGGAAAATTACTGTAACTAACTATACTAAATTGCGGAGAATAAACCGTTAACTTAGT
71-urt ATTAATTTTTATTTAAAGGAATTAGAATTTAGTATCAAAAATAACAATTCAATGGTTAAATATCAAACTAATATCA
Np-urt TTATTCTTCTGTAACAAAAATCAGGCGTTTGGTATCCAAGATAACTTTTTACTAGTAAACTATCGCACTATCATCA
We might increase the performance of our PSSM if we can filter out columns that don't have "enough information"
Not every column is equally well conserved – some seem to be more informative about what a binding site looks like!
Position-specific scoring matrices
Decrease complexity through info analysis
Uncertainty (Hc) = -Σi [pic log2(pic)]
[Figure: "Uncertainty Distribution" – fraction of columns plotted against uncertainty (H)]
Confusing!!!
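The per-column uncertainty Hc defined above can be sketched in Python. This is a minimal illustration, not the lecture's actual code: the four-row alignment below is a toy stand-in for the training set, and `column_uncertainty` is a hypothetical helper name.

```python
import math
from collections import Counter

# Toy stand-in for the training alignment (assumption: equal-length rows)
alignment = [
    "CTGTAACA",
    "CTGTAACA",
    "CTGTAACT",
    "CAGTATCA",
]

def column_uncertainty(column):
    """Hc = -sum(pic * log2(pic)) over the bases observed in one column."""
    counts = Counter(column)
    n = len(column)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

for c in range(len(alignment[0])):
    col = "".join(row[c] for row in alignment)
    print(c, col, round(column_uncertainty(col), 2))
```

A perfectly conserved column scores 0 bits of uncertainty; a column with all four bases equally represented scores the maximum of 2 bits.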
Digression on information theory
Uncertainty when all outcomes are equally probable
Pretend we have a machine that spits out an infinitely long string of nucleotides, but each one is EQUALLY LIKELY to occur:
A G A T G A C T C …
How uncertain are we about the outcome BEFORE we see each new character produced by the machine?
Intuitively, this uncertainty will depend on how many possibilities exist
Digression on information theory
Quantifying uncertainty when outcomes are equally probable
If the possibilities are: A or G or C or T
One way to quantify uncertainty is to ask: "What is the minimum number of questions required to remove all ambiguity about the outcome?"
How many yes/no questions do we need to ask?
Digression on information theory
Quantifying uncertainty when outcomes are equally probable
AGCT
AG  CT
A  G  C  T
M = 4 (alphabet size)
H = log2(M)
The number of decisions depends on the height of the decision tree
With M = 4 we are uncertain by log2(4) = 2 bits before each new symbol is produced by our machine
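The decision-tree picture can be checked with a few lines of Python. The halving loop below is an illustrative sketch (`questions_needed` is a hypothetical name): each yes/no question discards half of the remaining alphabet, so the count of questions matches log2(M).

```python
import math

def questions_needed(symbols):
    """Count halvings of the alphabet until one symbol remains
    (the height of the decision tree on the slide)."""
    q = 0
    while len(symbols) > 1:
        symbols = symbols[: len(symbols) // 2]  # the answer selects one half
        q += 1
    return q

print(questions_needed("AGCT"))  # 2 questions for M = 4
print(math.log2(4))              # H = log2(M) = 2.0 bits
```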
Digression on information theory
Uncertainty when all outcomes are equally probable
After we have received a new symbol from our machine, we are less uncertain
Intuitively, when we become less uncertain, it means we have gained information
Information = uncertainty_before - uncertainty_after, i.e. Information = Hbefore - Hafter
Note that only in the special case where no uncertainty remains afterwards (Hafter = 0) does information = Hbefore
In the real world this never happens because of noise in the system!!
Digression on information theory
Fine, but where did we get H = -Σi=1..M Pi log2(Pi)?
Necessary when outcomes are not equally probable!
Digression on information theory
Uncertainty with unequal probabilities
Now our machine produces a string of symbols, but some are more likely to occur than others:
PA = 0.6, PG = 0.1, PC = 0.1, PT = 0.2
Digression on information theory
Uncertainty with unequal probabilities
Now our machine produces a string of symbols, but we know that some are more likely to occur than others:
A A A T A A G T C …
Now how uncertain are we about the outcome BEFORE we see each new character?
Are we more or less surprised when we see an "A" or a "C"?
Digression on information theory
Uncertainty with unequal probabilities
Now our machine produces a string of symbols, but we know that some are more likely to occur than others:
A G A A A A T T C …
Do you agree that we are less surprised to see an "A" than we are to see a "G"?
Do you think that the output of our new machine is more or less uncertain?
Digression on information theory
What about when outcomes are not equally probable?
log2(M) = -log2(M⁻¹) = -log2(1/M) = -log2(P)
where P = 1/M = probability of a symbol appearing
Digression on information theory
What about when outcomes are not equally probable?
PA = 0.6, PG = 0.1, PC = 0.1, PT = 0.2   (M = 4)
Remember that the probabilities of all possible symbols must sum to 1:
Σi=1..M Pi = 1
Digression on information theory
How surprised are we to see a given symbol?
ui = -log2(Pi)   (where Pi = probability of the ith symbol)
uA = -log2(0.6) = 0.7
uG = -log2(0.1) = 3.3
uC = -log2(0.1) = 3.3
uT = -log2(0.2) = 2.3
ui is therefore called the surprisal for symbol i
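The surprisal values on the slide can be reproduced directly; a minimal sketch using the example probabilities above:

```python
import math

# Biased probabilities from the slide
probs = {"A": 0.6, "G": 0.1, "C": 0.1, "T": 0.2}

# Surprisal u_i = -log2(P_i): rare symbols are more surprising
for base, p in probs.items():
    print(base, round(-math.log2(p), 1))
```

Rounded to one decimal, this recovers uA = 0.7, uG = uC = 3.3, uT = 2.3 bits.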
Digression on information theory
What does the surprisal for a symbol have to do with uncertainty?
ui = -log2(Pi)   ← the "surprisal"
Uncertainty is the average surprisal for the infinite string of symbols produced by our machine
Digression on information theory
Let's first imagine that our machine only produces a finite string of N symbols:
N = Σi=1..M Ni
where Ni is the number of times each symbol occurred in a string of length N
For example, for the string "AAGTAACGA":
NA = 5, NG = 2, NC = 1, NT = 1
N = 9
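The symbol counts Ni for the example string can be tallied with the standard library; a quick sketch:

```python
from collections import Counter

s = "AAGTAACGA"
counts = Counter(s)  # N_i for each symbol
N = len(s)           # total string length
print(dict(counts), N)
```

This recovers NA = 5, NG = 2, NC = 1, NT = 1 and N = 9 from the slide.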
Digression on information theory
For every Ni there is a corresponding surprisal ui, therefore the average surprisal for N symbols will be:
(Σi=1..M Ni ui) / (Σi=1..M Ni) = (Σi=1..M Ni ui) / N = Σi=1..M (Ni/N) ui
Digression on information theory
For every Ni there is a corresponding surprisal ui, therefore the average surprisal for N symbols will be:
Σi=1..M (Ni/N) ui = Σi=1..M Pi ui
Remember that Pi = Ni/N is simply the probability of generating the ith symbol!
But wait! We also already defined ui!!
Digression on information theory
Therefore:
H = Σi=1..M Pi ui = -Σi=1..M Pi log2(Pi)
(substituting ui = -log2(Pi))
Congratulations! This is Claude Shannon's famous formula defining uncertainty when the probability of each symbol is unequal!
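Shannon's formula is a one-liner in Python. A minimal sketch (`shannon_uncertainty` is a hypothetical name), applied to the biased machine from the earlier slides:

```python
import math

def shannon_uncertainty(probs):
    """H = -sum(P_i * log2(P_i)); zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Biased machine: PA=0.6, PG=0.1, PC=0.1, PT=0.2
H = shannon_uncertainty([0.6, 0.1, 0.1, 0.2])
print(round(H, 2))  # about 1.57 bits, less than the 2 bits of the unbiased machine
```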
Digression on information theory
Uncertainty is largest when all symbols are equally probable!
How does H reduce assuming equiprobable symbols?
Heq = -Σi=1..M (1/M) log2(1/M) = -M (1/M) log2(1/M) = -log2(1/M) = log2(M)
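The reduction Heq = log2(M) can be verified numerically; a short sketch with an illustrative helper:

```python
import math

def shannon_uncertainty(probs):
    """H = -sum(P_i * log2(P_i)) for a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# With M equiprobable symbols, H collapses to log2(M)
for M in (2, 4, 8):
    H_eq = shannon_uncertainty([1 / M] * M)
    print(M, H_eq, math.log2(M))
```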
Digression on information theory
Uncertainty when M = 2
H = -Σi=1..M Pi log2(Pi)
Uncertainty is largest when all symbols are equally probable!
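For M = 2 the claim is easy to check by scanning the binary uncertainty curve; a minimal sketch (`H2` is a hypothetical name):

```python
import math

def H2(p):
    """Uncertainty of a binary alphabet where one symbol has probability p."""
    if p in (0.0, 1.0):
        return 0.0  # no uncertainty when the outcome is certain
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Evaluate H on a grid of probabilities and find the maximum
values = {p / 10: H2(p / 10) for p in range(11)}
best = max(values, key=values.get)
print(best, values[best])  # maximum of 1 bit at p = 0.5
```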
Digression on information theory
OK, but how much information is present in each column?
Information (R) = Hbefore - Hafter = log2(M) - (-Σi=1..M Pi log2(Pi))
Now "before" and "after" refer to before and after we examined the contents of a column
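Putting the pieces together, the per-column information R = Hbefore - Hafter can be sketched as follows (toy columns; `column_information` is a hypothetical name):

```python
import math
from collections import Counter

def column_information(column, M=4):
    """R = log2(M) - (-sum(p * log2(p))) for one alignment column."""
    counts = Counter(column)
    n = len(column)
    H_after = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return math.log2(M) - H_after

# Hypothetical columns: perfectly conserved vs. maximally variable
print(column_information("TTTTTTTT"))  # 2 bits: fully informative
print(column_information("ACGTACGT"))  # 0 bits: nothing gained by looking
```

Columns near 0 bits are the ones we might filter out of the PSSM; note that with small training sets a sampling correction is usually applied in practice.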
Digression on information theory
Sequence logos graphically display how much information is present in each column
http://weblogo.berkeley.edu/