Essential CS & Statistics
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Aug. 30, 2005
ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign
Essential CS Concepts
• Programming languages: languages that we use to communicate with a computer
  – Machine language (010101110111…)
  – Assembly language (move a, b; add c, b; …)
  – High-level language (x = a + 2*b …), e.g., C++, Perl, Java
  – Different languages are designed for different applications
• System software: software "assistants" to help a computer
  – Understand high-level programming languages (compilers)
  – Manage all kinds of devices (operating systems)
  – Communicate with users (GUI or command line)
• Application software: software for various kinds of applications
  – Stand-alone (running on a local computer, e.g., Excel, Word)
  – Client-server applications (running on a network, e.g., a web browser)
Intelligence/Capacity of a Computer
• The intelligence of a computer is determined by the intelligence of the software it can run
• Capacities of a computer for running software are mainly determined by its
  – Speed
  – Memory
  – Disk space
• Given a particular computer, we would like to write software that is highly intelligent, that can run fast, and that doesn’t need much memory (contradictory goals)
Algorithms vs. Software
• An algorithm is a procedure for solving a problem
  – Input: description of a problem
  – Output: solution(s)
  – Step 1: we first do this
  – Step 2: …
  – …
  – Step n: here's the solution!
• Software implements an algorithm (with a particular programming language)
Example: Change Problem
• Input:
  – M (total amount of money)
  – c1 > c2 > … > cd (denominations)
• Output:
  – i1, i2, …, id (number of coins of each kind), such that i1·c1 + i2·c2 + … + id·cd = M and i1 + i2 + … + id is as small as possible
Algorithm Example: BetterChange
BetterChange(M, c, d)      (input: amount M, denominations c = (c1, …, cd))
1 r = M
2 for k = 1 to d {
3   ik = floor(r/ck)       (take only the integer part)
4   r = r - ik*ck
5 }
6 return (i1, i2, …, id)   (output: number of coins of each kind)
Properties of an algorithm:
– Correct vs. incorrect algorithms (Is BetterChange correct?)
– Fast vs. slow algorithms (How do we quantify this?)
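The pseudocode above can be sketched directly in Python; this also answers the correctness question, because the greedy strategy fails for some denomination sets (the denominations below are illustrative):

```python
def better_change(M, c):
    """Greedy change-making; c holds denominations in decreasing order."""
    coins = []
    r = M
    for ck in c:
        ik = r // ck        # take only the integer part (floor)
        coins.append(ik)
        r = r - ik * ck
    return coins

# Greedy is NOT always correct: with denominations (25, 20, 10, 5, 1),
# change for 40 is optimally two 20-cent coins, but greedy uses three coins.
print(better_change(40, [25, 20, 10, 5, 1]))  # [1, 0, 1, 1, 0]
```

So BetterChange is incorrect in general, even though it is fast; with US-style denominations (25, 10, 5, 1) the greedy answer happens to be optimal.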
Big-O Notation
• How can we compare the running time of two algorithms in a computer-independent way?
• Observations:
  – In general, as the problem size grows, the running time increases (sorting 500 numbers would take more time than sorting 5 numbers)
  – Running time is more critical for large problem sizes (think about sorting 5 numbers vs. sorting 50000 numbers)
• How about measuring the growth rate of running time?
Big-O Notation (cont.)
• Define the problem size (e.g., the length of a sequence, n)
• Define "basic steps" (e.g., addition, division, …)
• Express the running time as a function of the problem size (e.g., 3*n*log(n) + n)
• As the problem size approaches positive infinity, only the highest-order term "counts"
• Big-O indicates the highest-order term, e.g., the algorithm has O(n*log(n)) time complexity
• Polynomial (e.g., O(n^2)) vs. exponential (e.g., O(2^n)) time
• NP-complete problems: no polynomial-time algorithm is known
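The claim that only the highest-order term "counts" can be checked numerically; a minimal sketch using the example running time 3*n*log(n) + n from above:

```python
import math

def t(n):
    # Example running-time function from the slide: 3*n*log(n) + n
    return 3 * n * math.log(n) + n

# As n grows, the lower-order term n becomes negligible and the
# ratio t(n) / (n*log(n)) approaches the leading constant 3.
for n in [10, 1000, 1000000]:
    print(n, t(n) / (n * math.log(n)))
```

This is why Big-O drops both the lower-order terms and the constant factor: for large n they do not change which algorithm wins.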
Basic Probability & Statistics
Purpose of Prob. & Statistics
• Deductive vs. Plausible reasoning
• Incomplete knowledge -> uncertainty
• How do we quantify inference under uncertainty?
  – Probability: models of random processes/experiments (how data are generated)
  – Statistics: draw conclusions about the whole population based on samples (inference from data)
Basic Concepts in Probability
• Sample space: all possible outcomes, e.g.,
  – tossing 2 coins: S = {HH, HT, TH, TT}
• Event: E ⊆ S; E happens iff the outcome is in E, e.g.,
  – E = {HH} (all heads)
  – E = {HH, TT} (same face)
• Probability of an event: 0 ≤ P(E) ≤ 1, such that
  – P(S) = 1 (the outcome is always in S)
  – P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅
Basic Concepts of Prob. (cont.)
• Conditional probability: P(B|A) = P(A∩B)/P(A)
  – P(A∩B) = P(A)P(B|A) = P(B)P(A|B)
  – So, P(A|B) = P(B|A)P(A)/P(B)
  – For independent events, P(A∩B) = P(A)P(B), so P(A|B) = P(A)
• Total probability: if A1, …, An form a partition of S, then
  – P(B) = P(B∩S) = P(B∩A1) + … + P(B∩An)
  – So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) (Bayes' rule)
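The total-probability and Bayes' rule steps above can be sketched for a two-event partition; the priors and likelihoods here are made-up illustrative numbers:

```python
# Hypothetical partition {A1, A2} of the sample space, with illustrative values.
prior = {"A1": 0.3, "A2": 0.7}          # P(Ai)
likelihood = {"A1": 0.9, "A2": 0.2}     # P(B | Ai)

# Total probability: P(B) = P(B|A1)P(A1) + P(B|A2)P(A2)
p_b = sum(likelihood[a] * prior[a] for a in prior)

# Bayes' rule: P(Ai|B) = P(B|Ai)P(Ai) / P(B)
posterior = {a: likelihood[a] * prior[a] / p_b for a in prior}
print(posterior)
```

Note how observing B shifts belief toward A1, whose likelihood P(B|A1) is larger, even though its prior is smaller.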
Interpretation of Bayes’ Rule
Hypothesis space: H = {H1, …, Hn}   Evidence: E

P(Hi|E) = P(E|Hi) P(Hi) / P(E)

– P(Hi|E): posterior probability of Hi
– P(Hi): prior probability of Hi
– P(E|Hi): likelihood of the data/evidence if Hi is true

If we want to pick the most likely hypothesis H*, we can drop P(E):

P(Hi|E) ∝ P(E|Hi) P(Hi)
Random Variable
• X: S → R (a "measure" of the outcome)
• Events can be defined in terms of X:
  – E(X=a) = {si | X(si) = a}
  – E(X≤a) = {si | X(si) ≤ a}
• So, probabilities can be defined on X:
  – P(X=a) = P(E(X=a))
  – P(X≤a) = P(E(X≤a))  (F(a) = P(X≤a) is the cumulative distribution function)
• Discrete vs. continuous random variables (think of "partitioning the sample space")
An Example
• Think of a DNA sequence as the result of tossing a 4-face die many times independently
• P(AATGC)=p(A)p(A)p(T)p(G)p(C)
• A model specifies {p(A),p(C), p(G),p(T)}, e.g., all 0.25 (random model M0)
• P(AATGC|M0) = 0.25*0.25*0.25*0.25*0.25
• Comparing 2 models:
  – M1: coding regions
  – M2: non-coding regions
  – Decide whether AATGC is more likely to be a coding region
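This model comparison can be sketched in Python; M0 is the random model from the slide, while the parameter values for the "coding" model M1 are hypothetical:

```python
# Each model specifies {p(A), p(C), p(G), p(T)}.
m0 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # random model M0 (from slide)
m1 = {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20}   # hypothetical coding model

def seq_prob(seq, model):
    # Independence: P(s1...sn | M) = p(s1) * ... * p(sn)
    p = 1.0
    for base in seq:
        p *= model[base]
    return p

print(seq_prob("AATGC", m0))    # P(AATGC|M0) = 0.25**5
print(seq_prob("AATGC", m1))    # P(AATGC|M1) under the hypothetical model
```

Whichever model assigns the sequence the higher probability (weighted by any prior over models, via Bayes' rule) is the preferred explanation.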
Probability Distributions
• Binomial: number of successes out of N trials

  p(k|N) = C(N,k) p^k (1−p)^(N−k)

• Gaussian: sum of N independent R.V.'s

  f(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))

• Multinomial: getting ni occurrences of outcome i

  p(n1, …, nk | N) = ( N! / (n1! … nk!) ) ∏i pi^ni
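The binomial formula above can be computed directly; a minimal sketch:

```python
from math import comb

def binomial_pmf(k, N, p):
    # p(k|N) = C(N,k) * p^k * (1-p)^(N-k)
    return comb(N, k) * p**k * (1 - p)**(N - k)

# Probability of exactly 2 heads in 4 fair coin tosses:
print(binomial_pmf(2, 4, 0.5))  # 0.375

# The pmf sums to 1 over k = 0..N:
print(sum(binomial_pmf(k, 4, 0.5) for k in range(5)))
```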
Parameter Estimation
• General setting:
  – Given a (hypothesized & probabilistic) model that governs the random experiment
  – The model gives a probability p(D|θ) of any data D, which depends on the parameter θ
  – Now, given actual sample data X = {x1, …, xn}, what can we say about the value of θ?
• Intuitively, take your best guess of θ: "best" means "best explaining/fitting the data"
• Generally an optimization problem
Maximum Likelihood Estimator
Data: a sequence d with counts c(w1), …, c(wN), and length |d|
Model: multinomial M with parameters {p(wi)}
Likelihood: p(d|M)
Maximum likelihood estimator: M* = argmax_M p(d|M)

p(d|M) = ( |d|! / (c(w1)! … c(wN)!) ) ∏i p(wi)^c(wi),  with Σi p(wi) = 1

We'll tune the p(wi) to maximize the log-likelihood l(d|M) (the multinomial coefficient does not depend on the p(wi), so it can be dropped):

l(d|M) = log p(d|M) = Σi c(wi) log p(wi)

Use the Lagrange multiplier approach to enforce the constraint Σi p(wi) = 1:

l'(d|M) = Σi c(wi) log p(wi) + λ (Σi p(wi) − 1)

Set the partial derivatives to zero:

∂l'/∂p(wi) = c(wi)/p(wi) + λ = 0,  so p(wi) = −c(wi)/λ

Since Σi p(wi) = 1 and Σi c(wi) = |d|, we get λ = −|d|. The ML estimate is therefore

p(wi) = c(wi) / |d|
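The closed-form ML estimate p(wi) = c(wi)/|d| is just normalized counting; a minimal sketch:

```python
from collections import Counter

def mle_multinomial(d):
    # ML estimate for a multinomial: p(w_i) = c(w_i) / |d|
    counts = Counter(d)
    n = len(d)
    return {w: c / n for w, c in counts.items()}

print(mle_multinomial("AATGC"))  # {'A': 0.4, 'T': 0.2, 'G': 0.2, 'C': 0.2}
```

For the sequence AATGC, A occurs 2 times out of 5, so its ML probability is 0.4; this is the "best fit" in the sense derived above.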
Maximum Likelihood vs. Bayesian
• Maximum likelihood estimation
  – "Best" means "the data likelihood reaches its maximum"

    θ̂ = argmax_θ P(X|θ)

  – Problem: small samples
• Bayesian estimation
  – "Best" means being consistent with our "prior" knowledge and explaining the data well

    θ̂ = argmax_θ P(θ|X) = argmax_θ P(X|θ)P(θ)

  – Problem: how to define the prior?
Bayesian Estimator
• ML estimator: M* = argmax_M p(d|M)
• Bayesian estimator:
  – First consider the posterior: p(M|d) = p(d|M)p(M)/p(d)
  – Then consider the mean or mode of the posterior distribution
• p(d|M): sampling distribution (of the data)
• p(M) = p(θ1, …, θN): our prior on the model parameters
• Conjugate prior: the prior can be interpreted as "extra"/"pseudo" data
• The Dirichlet distribution is a conjugate prior for the multinomial sampling distribution:

Dir(θ | α1, …, αN) = ( Γ(α1 + … + αN) / (Γ(α1) … Γ(αN)) ) ∏i θi^(αi − 1)

The αi act as "extra"/"pseudo" counts, e.g., αi = μ p(wi|REF)
Dirichlet Prior Smoothing (cont.)

Posterior distribution of the parameters:

p(θ|d) = Dir(θ | c(w1) + α1, …, c(wN) + αN)

Property: if θ ~ Dir(θ|α), then E(θi) = αi / Σj αj

The predictive distribution is the same as the posterior mean:

p(wi|θ̂) = ∫ p(wi|θ) Dir(θ|d) dθ = (c(wi) + αi) / (|d| + Σj αj) = (c(wi) + μ p(wi|REF)) / (|d| + μ)

This is the Bayesian estimate (what happens as |d| grows?)
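The smoothed estimate (c(wi) + μ p(wi|REF)) / (|d| + μ) can be sketched as follows; the reference model and μ here are illustrative choices:

```python
from collections import Counter

def dirichlet_smoothed(d, ref, mu):
    # Bayesian (posterior-mean) estimate with a Dirichlet prior:
    # p(w | theta-hat) = (c(w) + mu * p(w|REF)) / (|d| + mu)
    counts = Counter(d)
    return {w: (counts[w] + mu * ref[w]) / (len(d) + mu) for w in ref}

ref = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # illustrative reference model
print(dirichlet_smoothed("AATGC", ref, mu=4.0))
```

With μ = 4 the estimate for A is (2 + 1)/(5 + 4) = 1/3, pulled from the ML value 0.4 toward the prior 0.25; as |d| grows, the counts dominate and the estimate approaches the ML estimate.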
Illustration of Bayesian Estimation
Prior: p(θ)
Likelihood: p(D|θ), with D = (c1, …, cN)
Posterior: p(θ|D) ∝ p(D|θ)p(θ)
(Figure: three marked points on the θ axis: the prior mode, the ML estimate θml, and the posterior mode between them.)
Basic Concepts in Information Theory
• Entropy: Measuring uncertainty of a random variable
• Kullback-Leibler divergence: comparing two distributions
• Mutual Information: measuring the correlation of two random variables
Entropy

H(X) = H(p) = −Σx p(x) log2 p(x)   (sum over all possible values of X)

Define 0 log 0 = 0; log means log2.

Entropy H(X) measures the average uncertainty of the random variable X.

Example:
– fair coin, p(H) = 0.5: H(X) = 1
– biased coin, p(H) = 0.8: H(X) is between 0 and 1
– completely biased coin, p(H) = 1: H(X) = 0

Properties: H(X) ≥ 0; min = 0; max = log M, where M is the total number of values
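The three coin examples above can be checked with a few lines of Python:

```python
import math

def entropy(p):
    # H(p) = -sum p(x) * log2 p(x), with 0*log(0) defined as 0
    return -sum(px * math.log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))   # 1.0  (fair coin: maximum uncertainty)
print(entropy([0.8, 0.2]))   # ~0.72 (biased coin: between 0 and 1)
print(entropy([1.0, 0.0]))   # completely biased coin: no uncertainty
```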
Interpretations of H(X)
• Measures the “amount of information” in X– Think of each value of X as a “message”
– Think of X as a random experiment (20 questions)
• Minimum average number of bits to compress values of X– The more random X is, the harder to compress
– A fair coin has the maximum information, and is hardest to compress
– A biased coin has some information, and can be compressed to <1 bit on average
– A completely biased coin has no information, and needs 0 bits

"Information" of x = # of bits to code x = −log p(x);  H(X) = E_p[−log p(x)]
Cross Entropy H(p,q)
What if we encode X with a code optimized for a wrong distribution q?
Expected # of bits:

H(p,q) = E_p[−log q(x)] = −Σx p(x) log q(x)

Intuitively H(p,q) ≥ H(p), and mathematically:

H(p,q) − H(p) = −Σx p(x) log (q(x)/p(x))
            ≥ −log Σx p(x) (q(x)/p(x)) = −log Σx q(x) = 0

By Jensen's inequality, Σi pi f(xi) ≥ f(Σi pi xi), where f is a convex function (here f(y) = −log y) and Σi pi = 1.
Kullback-Leibler Divergence D(p||q)
What if we encode X with a code optimized for a wrong distribution q?
How many bits would we waste?

D(p||q) = H(p,q) − H(p) = Σx p(x) log (p(x)/q(x))

Properties:
– D(p||q) ≥ 0
– D(p||q) ≠ D(q||p) (not symmetric)
– D(p||q) = 0 iff p = q

KL-divergence (also called relative entropy) is often used to measure the "distance" between two distributions.

Interpretation:
– Fix p; then D(p||q) and H(p,q) vary in the same way
– If p is an empirical distribution, minimizing D(p||q) or H(p,q) is equivalent to maximizing the likelihood
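The three KL-divergence properties can be verified numerically; the distributions p and q below are illustrative:

```python
import math

def kl_divergence(p, q):
    # D(p||q) = sum p(x) * log2( p(x)/q(x) ), with 0*log(0/q) = 0
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]
q = [0.8, 0.2]
print(kl_divergence(p, q))                          # positive
print(kl_divergence(p, p))                          # 0.0, since p == q
print(kl_divergence(p, q) == kl_divergence(q, p))   # False: not symmetric
```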
Cross Entropy, KL-Div, and Likelihood

Data (a sample for X): Y = (y1, …, yN)

Empirical distribution: p̃(x) = (1/N) Σi δ(x, yi), where δ(x, y) = 1 if x = y and 0 otherwise

Likelihood: L(Y) = ∏i p(X = yi)

log-Likelihood: log L(Y) = Σi log p(X = yi) = Σx c(x) log p(X = x) = N Σx p̃(x) log p(x)

So, fixing the data:

(1/N) log L(Y) = −H(p̃, p) = −D(p̃||p) − H(p̃)

and therefore

argmax_p log L(Y) = argmin_p H(p̃, p) = argmin_p D(p̃||p)

This gives a criterion for estimating a good model.
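The identity (1/N) log L(Y) = −H(p̃, p) can be checked numerically; the sample Y and the model p below are illustrative:

```python
import math
from collections import Counter

def cross_entropy(p, q):
    # H(p, q) = -sum p(x) * log q(x)  (natural log here, for the likelihood)
    return -sum(p[x] * math.log(q[x]) for x in p if p[x] > 0)

Y = "AATGCAAT"                                        # illustrative sample
N = len(Y)
p_tilde = {x: c / N for x, c in Counter(Y).items()}   # empirical distribution
model = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}      # hypothetical model p

log_likelihood = sum(math.log(model[y]) for y in Y)

# (1/N) * log L(Y) == -H(p_tilde, p)
print(abs(log_likelihood / N + cross_entropy(p_tilde, model)) < 1e-12)
```

So maximizing the per-symbol log-likelihood is the same as minimizing the cross entropy between the empirical distribution and the model.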
Mutual Information I(X;Y)
Comparing two distributions: p(x,y) vs. p(x)p(y)

I(X;Y) = Σx,y p(x,y) log [ p(x,y) / (p(x)p(y)) ] = H(X) − H(X|Y) = H(Y) − H(Y|X)

Conditional entropy:

H(Y|X) = E_p(x,y)[−log p(y|x)] = −Σx,y p(x,y) log p(y|x) = −Σx p(x) Σy p(y|x) log p(y|x)

Properties: I(X;Y) ≥ 0; I(X;Y) = I(Y;X); I(X;Y) = 0 iff X and Y are independent

Interpretations:
– Measures the reduction in uncertainty of X given information about Y
– Measures the correlation between X and Y
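The definition of I(X;Y) can be sketched from a joint distribution table; the two coin-pair joints below are illustrative extremes:

```python
import math

def mutual_information(joint):
    # I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x)p(y)) )
    px = {x: sum(joint[x].values()) for x in joint}          # marginal p(x)
    py = {}                                                  # marginal p(y)
    for x in joint:
        for y, p in joint[x].items():
            py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for x in joint for y, p in joint[x].items() if p > 0)

# Independent coins: p(x,y) = p(x)p(y), so I(X;Y) = 0
indep = {"H": {"H": 0.25, "T": 0.25}, "T": {"H": 0.25, "T": 0.25}}
print(mutual_information(indep))   # 0.0

# Perfectly correlated coins: knowing Y removes all uncertainty about X
corr = {"H": {"H": 0.5, "T": 0.0}, "T": {"H": 0.0, "T": 0.5}}
print(mutual_information(corr))    # 1.0 bit = H(X)
```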
What You Should Know
• Computational complexity, big-O notation
• Probability concepts:
  – sample space, event, random variable, conditional probability, multinomial distribution, etc.
• Bayes' formula and its interpretation
• Statistics: know how to compute the maximum likelihood estimate
• Information theory concepts:
  – entropy, cross entropy, relative entropy, conditional entropy, KL-divergence, mutual information, and their relationships