Statistical Learning Methods
![Page 1: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/1.jpg)
Statistical Learning Methods
Russell and Norvig: Chapter 20 (20.1, 20.2, 20.4, 20.5)
CMSC 421 – Fall 2006
![Page 2: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/2.jpg)
Statistical Approaches
- Statistical Learning (20.1)
- Naïve Bayes (20.2)
- Instance-based Learning (20.4)
- Neural Networks (20.5)
![Page 3: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/3.jpg)
Statistical Learning (20.1)
![Page 4: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/4.jpg)
Example: Candy Bags
- Candy comes in two flavors: cherry and lime
- Candy is wrapped, so you can’t tell which flavor it is until it is opened
- There are 5 kinds of bags of candy:
  - H1 = 100% cherry
  - H2 = 75% cherry, 25% lime
  - H3 = 50% cherry, 50% lime
  - H4 = 25% cherry, 75% lime
  - H5 = 100% lime
- Given a new bag of candy, predict H
- Observations: D1, D2, D3, …
![Page 5: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/5.jpg)
Bayesian Learning
Calculate the probability of each hypothesis given the data, and make predictions weighted by these probabilities (i.e., use all the hypotheses, not just the single best one):

$$P(h_i \mid d) = \frac{P(d \mid h_i)\, P(h_i)}{P(d)} \propto P(d \mid h_i)\, P(h_i)$$

Now, if we want to predict some unknown quantity X:

$$P(X \mid d) = \sum_i P(X \mid h_i)\, P(h_i \mid d)$$
![Page 6: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/6.jpg)
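Bayesian Learning cont.
Calculating P(h|d):

$$P(h_i \mid d) \propto \underbrace{P(d \mid h_i)}_{\text{likelihood}} \; \underbrace{P(h_i)}_{\text{prior}}$$

Assume the observations are i.i.d. (independent and identically distributed):

$$P(d \mid h_i) = \prod_j P(d_j \mid h_i)$$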
![Page 7: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/7.jpg)
Example:
- Hypothesis prior over h1, …, h5 is {0.1, 0.2, 0.4, 0.2, 0.1}
- Data:
- Q1: After seeing d1, what is P(hi|d1)?
- Q2: After seeing d1, what is P(d2 = [flavor symbol lost] | d1)?
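A minimal sketch of how Q1 and Q2 could be worked out numerically. The observed data is not preserved in this transcript, so the sketch assumes, purely for illustration, that the first observation d1 is a lime candy. The names `posterior` and `predict_next` are hypothetical helpers, not part of the lecture.

```python
# Hypotheses h1..h5: fraction of lime candies in each kind of bag.
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]


def likelihood(flavor, p_lime):
    """P(d_j | h_i) for a single candy."""
    return p_lime if flavor == "lime" else 1.0 - p_lime


def posterior(observations, prior=PRIOR):
    """P(h_i | d) via Bayes' rule, assuming i.i.d. observations."""
    unnormalized = []
    for p_h, p_lime in zip(prior, P_LIME):
        p_d_given_h = 1.0
        for flavor in observations:
            p_d_given_h *= likelihood(flavor, p_lime)
        unnormalized.append(p_d_given_h * p_h)
    total = sum(unnormalized)  # this is P(d)
    return [u / total for u in unnormalized]


def predict_next(flavor, observations):
    """P(next candy = flavor | d) = sum_i P(flavor | h_i) P(h_i | d)."""
    post = posterior(observations)
    return sum(likelihood(flavor, p) * q for p, q in zip(P_LIME, post))


d = ["lime"]  # ASSUMED observation d1, not taken from the slide
print(posterior(d))             # Q1
print(predict_next("lime", d))  # Q2
```

Under that assumed observation, the posterior over h1, …, h5 is {0, 0.1, 0.4, 0.3, 0.2} and the probability that the next candy is lime is 0.65.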
![Page 8: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/8.jpg)
Making Statistical Inferences
Bayesian – predictions are made using all hypotheses, weighted by their probabilities
MAP – maximum a posteriori: uses the single most probable hypothesis to make predictions; often much easier than full Bayesian prediction, and as we get more and more data it gets closer to the Bayes-optimal prediction
ML – maximum likelihood assume uniform prior over H when
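The three kinds of prediction can be compared on the candy-bag example. The sketch below is illustrative only: the function name `predictions` and the choice of observed counts are assumptions, and ML is implemented simply as picking the hypothesis with the highest likelihood (i.e., ignoring the prior).

```python
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i) for h1..h5
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]


def data_likelihood(n_lime, n_cherry, p_lime):
    """P(d | h_i) for i.i.d. candies."""
    return (p_lime ** n_lime) * ((1.0 - p_lime) ** n_cherry)


def predictions(n_lime, n_cherry):
    """Return P(next = lime) under Bayesian, MAP and ML prediction."""
    like = [data_likelihood(n_lime, n_cherry, p) for p in P_LIME]
    joint = [l * p for l, p in zip(like, PRIOR)]
    post = [j / sum(joint) for j in joint]

    bayes = sum(p * q for p, q in zip(P_LIME, post))  # weighted over all h_i
    h_map = max(range(5), key=lambda i: post[i])      # most probable hypothesis
    h_ml = max(range(5), key=lambda i: like[i])       # prior ignored (uniform)
    return bayes, P_LIME[h_map], P_LIME[h_ml]


# ASSUMED data: a single lime candy has been observed.
print(predictions(n_lime=1, n_cherry=0))
```

With one lime candy observed, the Bayesian prediction is 0.65, MAP commits to h3 and predicts 0.5, and ML commits to h5 and predicts 1.0.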
![Page 9: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/9.jpg)
Naïve Bayes (20.2)
![Page 10: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/10.jpg)
Naïve Bayes
- aka Idiot Bayes
- a particularly simple BN (Bayesian network)
- makes overly strong independence assumptions
- but works surprisingly well in practice…
![Page 11: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/11.jpg)
Bayesian Diagnosis
- Suppose we want to make a diagnosis D and there are n possible mutually exclusive diagnoses d1, …, dn
- Suppose there are m boolean symptoms, E1, …, Em
- How do we make a diagnosis?

$$P(d_i \mid e_1, \ldots, e_m) = \frac{P(d_i)\, P(e_1, \ldots, e_m \mid d_i)}{P(e_1, \ldots, e_m)}$$

- We need: $P(d_i)$ and $P(e_1, \ldots, e_m \mid d_i)$
![Page 12: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/12.jpg)
Naïve Bayes Assumption
Assume each piece of evidence (symptom) is independent given the diagnosis; then

$$P(e_1, \ldots, e_m \mid d_i) = \prod_{k=1}^{m} P(e_k \mid d_i)$$

What is the structure of the corresponding BN?
![Page 13: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/13.jpg)
Naïve Bayes Example
- Possible diagnoses: Allergy, Cold and OK (Well)
- Possible symptoms: Sneeze, Cough and Fever

| | Well | Cold | Allergy |
|---|---|---|---|
| P(d) | 0.9 | 0.05 | 0.05 |
| P(sneeze \| d) | 0.1 | 0.9 | 0.9 |
| P(cough \| d) | 0.1 | 0.8 | 0.7 |
| P(fever \| d) | 0.01 | 0.7 | 0.4 |

My symptoms are: sneeze & cough. What is the diagnosis?
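One way to work through this example in code is sketched below. It uses the probabilities from the table and the naïve Bayes assumption; the function name `diagnose` is hypothetical, and the sketch assumes that fever is simply unobserved (so it drops out of the product), since the slide lists only sneeze and cough.

```python
PRIOR = {"Well": 0.9, "Cold": 0.05, "Allergy": 0.05}
P_SYMPTOM = {
    "sneeze": {"Well": 0.1, "Cold": 0.9, "Allergy": 0.9},
    "cough": {"Well": 0.1, "Cold": 0.8, "Allergy": 0.7},
    "fever": {"Well": 0.01, "Cold": 0.7, "Allergy": 0.4},
}


def diagnose(evidence):
    """evidence maps symptom -> True/False; unlisted symptoms are unobserved."""
    scores = {}
    for d, p_d in PRIOR.items():
        score = p_d
        for symptom, present in evidence.items():
            p = P_SYMPTOM[symptom][d]
            score *= p if present else 1.0 - p
        scores[d] = score
    total = sum(scores.values())  # P(e_1, ..., e_m)
    return {d: s / total for d, s in scores.items()}


# ASSUMPTION: fever is unobserved rather than known to be absent.
posterior = diagnose({"sneeze": True, "cough": True})
best = max(posterior, key=posterior.get)
print(best, posterior)
```

Under this reading the unnormalized scores are 0.009 (Well), 0.036 (Cold) and 0.0315 (Allergy), so Cold is the most probable diagnosis. If fever is instead taken to be absent (`"fever": False`), the same function favors Allergy, which shows how much the answer depends on how missing evidence is treated.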
![Page 14: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/14.jpg)
Learning the Probabilities
- aka parameter estimation
- we need:
  - P(di) – prior
  - P(ek|di) – conditional probability
- use training data to estimate
![Page 15: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/15.jpg)
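Maximum Likelihood Estimate (MLE)

Use frequencies in the training set to estimate:

$$p(d_i) = \frac{n_i}{N}$$

$$p(e_k \mid d_i) = \frac{n_{ik}}{n_i}$$

where $n_x$ is shorthand for the count of event x in the training set, and N is the total number of training examples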
![Page 16: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/16.jpg)
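Example:

| D | Sneeze | Cough | Fever |
|---|---|---|---|
| Allergy | yes | no | no |
| Well | yes | no | no |
| Allergy | yes | no | yes |
| Allergy | yes | no | no |
| Cold | yes | yes | yes |
| Allergy | yes | no | no |
| Well | no | no | no |
| Well | no | no | no |
| Allergy | no | no | no |
| Allergy | yes | no | no |

What is:
- P(Allergy)?
- P(Sneeze | Allergy)?
- P(Cough | Allergy)?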
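The three questions can be answered by counting rows, as in the following sketch. The ten rows are transcribed from the example table; the helper names `mle_prior` and `mle_conditional` are hypothetical.

```python
# Training set from the example table: (diagnosis, sneeze, cough, fever)
DATA = [
    ("Allergy", True, False, False),
    ("Well", True, False, False),
    ("Allergy", True, False, True),
    ("Allergy", True, False, False),
    ("Cold", True, True, True),
    ("Allergy", True, False, False),
    ("Well", False, False, False),
    ("Well", False, False, False),
    ("Allergy", False, False, False),
    ("Allergy", True, False, False),
]
SYMPTOMS = {"sneeze": 1, "cough": 2, "fever": 3}  # column index per symptom


def mle_prior(d):
    """p(d_i) = n_i / N"""
    n_i = sum(1 for row in DATA if row[0] == d)
    return n_i / len(DATA)


def mle_conditional(symptom, d):
    """p(e_k | d_i) = n_ik / n_i"""
    col = SYMPTOMS[symptom]
    rows = [row for row in DATA if row[0] == d]
    n_ik = sum(1 for row in rows if row[col])
    return n_ik / len(rows)


print(mle_prior("Allergy"))                  # 6 of 10 rows
print(mle_conditional("sneeze", "Allergy"))  # 5 of 6 Allergy rows
print(mle_conditional("cough", "Allergy"))   # 0 of 6 Allergy rows
```

Note that the estimate for P(Cough | Allergy) is exactly zero, which would rule out Allergy for any coughing patient regardless of the other evidence.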
![Page 17: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/17.jpg)
Laplace Estimate (smoothing)

Use smoothing to eliminate zeros:

$$p(d_i) = \frac{n_i + 1}{N + n}$$

$$p(e_k \mid d_i) = \frac{n_{ik} + 1}{n_i + 2}$$

where n is the number of possible values for d, and e is assumed to have 2 possible values

Many other smoothing schemes…
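A small sketch of the two Laplace formulas, applied to the counts for Allergy from the training example (6 of 10 rows are Allergy, none of them with cough, and there are 3 possible diagnoses). The function names are hypothetical.

```python
def laplace_prior(n_i, N, n_classes):
    """p(d_i) = (n_i + 1) / (N + n), with n possible values for d."""
    return (n_i + 1) / (N + n_classes)


def laplace_conditional(n_ik, n_i, n_values=2):
    """p(e_k | d_i) = (n_ik + 1) / (n_i + 2) for a two-valued e."""
    return (n_ik + 1) / (n_i + n_values)


print(laplace_prior(n_i=6, N=10, n_classes=3))  # 7/13 instead of 6/10
print(laplace_conditional(n_ik=0, n_i=6))       # 1/8 instead of 0
```

The smoothed estimate of P(Cough | Allergy) becomes 0.125 rather than 0, so a single unseen combination no longer forces the whole product to zero.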
![Page 18: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/18.jpg)
Comments
- Generally works well despite the blanket assumption of independence
- Experiments show it is competitive with decision trees on some well-known test sets (UCI)
- Handles noisy data
![Page 19: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/19.jpg)
Learning more complex Bayesian networks
Two subproblems:
- Learning structure: combinatorial search over the space of networks
- Learning parameter values: easy if all of the variables are observed in the training set; harder if there are ‘hidden variables’
![Page 20: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/20.jpg)
Instance-based Learning
![Page 21: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/21.jpg)
Instance/Memory-based Learning
- Non-parametric: hypothesis complexity grows with the data
- Memory-based learning: construct hypotheses directly from the training data itself
![Page 22: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/22.jpg)
Nearest Neighbor Methods
To classify a new input vector x, examine the k closest training data points to x and assign x to the most frequently occurring class among them.
[Figure: a query point x with its neighborhoods for k=1 and k=6]
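The nearest-neighbor rule can be sketched in a few lines. This is a brute-force version using Euclidean distance; the function name `knn_classify` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_classify(x, training_data, k=3):
    """training_data is a list of (vector, label) pairs."""
    # Sort all training points by Euclidean distance to x (one full pass).
    by_distance = sorted(training_data, key=lambda pair: math.dist(x, pair[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Majority vote among the k nearest neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two clusters in the plane.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]

print(knn_classify((0.5, 0.5), train, k=3))
print(knn_classify((5.5, 5.5), train, k=1))
```

Every classification sorts the whole training set, which is the cost that indexing structures are meant to reduce.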
![Page 23: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/23.jpg)
Issues
- Distance measure
  - Most common: Euclidean
  - Better distance measures: normalize each variable by its standard deviation
  - For discrete data, can use Hamming distance
- Choosing k
  - Increasing k reduces variance, increases bias
- In high-dimensional spaces, the nearest neighbor may not be very close at all!
- Memory-based technique: must make a pass through the data for each classification. This can be prohibitive for large data sets.
  - Indexing the data can help; for example, KD-trees
![Page 24: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/24.jpg)
Neural Networks (20.5)
![Page 25: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/25.jpg)
Neural function
- Brain function (thought) occurs as the result of the firing of neurons
- Neurons connect to each other through synapses, which propagate action potentials (electrical impulses) by releasing neurotransmitters
- Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds
- Learning occurs as a result of the synapses’ plasticity: they exhibit long-term changes in connection strength
- There are about 10^11 neurons and about 10^14 synapses in the human brain
![Page 26: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/26.jpg)
Biology of a neuron
![Page 27: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/27.jpg)
Brain structure
- Different areas of the brain have different functions
  - Some areas seem to have the same function in all humans (e.g., Broca’s region); the overall layout is generally consistent
  - Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly
- We don’t know how different functions are “assigned” or acquired
  - Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors)
  - Partly the result of experience (learning)
- We really don’t understand how this neural structure leads to what we perceive as “consciousness” or “thought”
- Our neural networks are not nearly as complex or intricate as the actual brain structure
![Page 28: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/28.jpg)
Comparison of computing power
- Computers are way faster than neurons…
- But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel
- Neural networks are designed to be massively parallel
- The brain is effectively a billion times faster
![Page 29: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/29.jpg)
Neural networks
- Neural networks are made up of nodes or units, connected by links
- Each link has an associated weight and activation level
- Each node has an input function (typically summing over weighted inputs), an activation function, and an output
![Page 30: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/30.jpg)
Neural unit
![Page 31: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/31.jpg)
Linear Threshold Unit (LTU)
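[Figure: inputs x_1, …, x_n with weights w_1, …, w_n, plus a constant input x_0 = 1 with weight w_0, feeding the sum $\sum_{i=0}^{n} w_i x_i$ followed by a threshold]

$$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$

Equivalently, with $x_0 = 1$:

$$o = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$$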
![Page 32: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/32.jpg)
Sigmoid Unit
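[Figure: inputs x_1, …, x_n with weights w_1, …, w_n, plus a constant input x_0 = 1 with weight w_0, feeding a summation followed by a sigmoid]

$$net = \sum_{i=0}^{n} w_i x_i$$

$$o = \sigma(net) = \frac{1}{1 + e^{-net}}$$

where $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ is the sigmoid function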
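A sigmoid unit is straightforward to write down in code. In the sketch below the function names and the weight-list layout (`weights[0]` is w0, paired with the constant input x0 = 1) are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x))"""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_unit(weights, inputs):
    """weights = [w0, w1, ..., wn]; inputs = [x1, ..., xn]; x0 = 1 is prepended."""
    net = sum(w * x for w, x in zip(weights, [1.0] + list(inputs)))
    return sigmoid(net)


print(sigmoid_unit([0.0, 1.0, -1.0], [2.0, 2.0]))  # net = 0
```

Unlike the hard threshold of an LTU, the sigmoid is differentiable, with derivative σ(x)(1 − σ(x)); this is the o(1 − o) factor that appears in backpropagation.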
![Page 33: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/33.jpg)
Neural Computation
- McCulloch and Pitts (1943) showed how LTUs can be used to compute logical functions
  - AND?
  - OR?
  - NOT?
- Two layers of LTUs can represent any boolean function
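One possible answer to the AND / OR / NOT questions is sketched below. The weight values are just one convenient choice among many (an assumption of this sketch); inputs are taken to be 0 or 1, and the output follows the LTU definition, with 1 meaning true and -1 meaning false.

```python
def ltu(weights, inputs):
    """Linear threshold unit: 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else -1


# Illustrative weights [w0, w1, ...]; other choices work equally well.
AND = [-1.5, 1.0, 1.0]  # fires only when both inputs are 1
OR = [-0.5, 1.0, 1.0]   # fires when at least one input is 1
NOT = [0.5, -1.0]       # fires only when the input is 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, ltu(AND, (a, b)), ltu(OR, (a, b)))
print(ltu(NOT, (0,)), ltu(NOT, (1,)))
```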
![Page 34: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/34.jpg)
Learning Rules
- Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, the weights can be changed incrementally to learn to produce these outputs using the perceptron learning rule
  - assumes binary-valued inputs/outputs
  - assumes a single linear threshold unit
![Page 35: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/35.jpg)
Perceptron Learning rule
If the target output for unit j is tj:

$$w_{ji} \leftarrow w_{ji} + \eta\,(t_j - o_j)\,o_i$$

where η is the learning rate and o_i is the output of input unit i.

Equivalent to the intuitive rules:
- If the output is correct, don’t change the weights
- If the output is low (oj=0, tj=1), increment the weights for all the inputs which are 1
- If the output is high (oj=1, tj=0), decrement the weights for all inputs which are 1
- Must also adjust the threshold. Or, equivalently, assume there is a weight wj0 for an extra input unit that has o0=1
![Page 36: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/36.jpg)
Perceptron Learning Algorithm
Repeatedly iterate through the examples, adjusting weights according to the perceptron learning rule until all outputs are correct:

1. Initialize the weights to all zero (or random)
2. Until outputs for all training examples are correct:
   - for each training example e do:
     - compute the current output oj
     - compare it to the target tj and update the weights

- Each execution of the outer loop is an epoch
- For multiple-category problems, learn a separate perceptron for each category and assign to the class whose perceptron most exceeds its threshold
- Q: When will the algorithm terminate?
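The algorithm above can be sketched as follows, using 0/1 outputs as in the perceptron learning rule. The `max_epochs` cap is an addition of this sketch, not part of the algorithm as stated: it keeps the loop from running forever on data that no single threshold unit can fit. The function names and the OR/XOR data sets are likewise illustrative.

```python
def output(weights, x):
    """Threshold unit with 0/1 output; weights[0] is the bias weight (input 1)."""
    net = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if net > 0 else 0


def train_perceptron(examples, eta=1.0, max_epochs=100):
    """examples: list of (inputs, target) pairs with 0/1 targets.

    Returns (weights, epochs_used), or (weights, None) if max_epochs
    (a safety cap assumed by this sketch) is reached first.
    """
    n = len(examples[0][0])
    weights = [0.0] * (n + 1)
    for epoch in range(max_epochs):
        all_correct = True
        for x, t in examples:
            o = output(weights, x)
            if o != t:
                all_correct = False
                # w_i <- w_i + eta * (t - o) * x_i, with x_0 = 1 for the bias
                weights[0] += eta * (t - o)
                for i, xi in enumerate(x):
                    weights[i + 1] += eta * (t - o) * xi
        if all_correct:
            return weights, epoch + 1
    return weights, None


OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(train_perceptron(OR_DATA))
print(train_perceptron(XOR_DATA))
```

On the OR data the loop stops after a few epochs with weights that classify every example correctly. On the XOR data it never reaches an all-correct epoch, which bears on the termination question: the stopping condition can only be met when the data is linearly separable.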
![Page 37: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/37.jpg)
![Page 38: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/38.jpg)
Representation Limitations of a Perceptron
Perceptrons can only represent linear threshold functions and can therefore only learn functions which linearly separate the data, i.e., the positive and negative examples are separable by a hyperplane in n-dimensional space
![Page 39: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/39.jpg)
Perceptron Learnability
- Perceptron Convergence Theorem: If there is a set of weights that is consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969)
- Unfortunately, many functions (like parity) cannot be represented by an LTU
![Page 40: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/40.jpg)
Layered feed-forward network
[Figure: a layered network with input units at the bottom, hidden units in the middle, and output units at the top]
![Page 41: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/41.jpg)
Backpropagation Algorithm
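Initialize all weights to small random numbers. Until satisfied, do:
- For each training example, do:
  1. Input the training example and compute the outputs
  2. For each output unit k:
     $$\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$$
  3. For each hidden unit h:
     $$\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k}\, \delta_k$$
  4. Update each network weight $w_{i,j}$:
     $$w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j} \quad \text{where} \quad \Delta w_{i,j} = \eta\, \delta_j\, x_{i,j}$$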
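The four steps can be sketched for a network with a single hidden layer of sigmoid units. This is an illustrative implementation under stated assumptions, not the lecture's code: the class name `Network`, the weight layout (index 0 of each row is a bias weight whose input is fixed at 1), the initial weight range, and the default learning rate are all choices made here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class Network:
    """One hidden layer of sigmoid units feeding sigmoid output units."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)

        def small():
            return rng.uniform(-0.05, 0.05)  # "small random numbers" (assumed range)

        # Index 0 of every row is the bias weight (its input is fixed at 1).
        self.w_hidden = [[small() for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.w_out = [[small() for _ in range(n_hidden + 1)] for _ in range(n_out)]

    def forward(self, x):
        xs = [1.0] + list(x)
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xs))) for row in self.w_hidden]
        hs = [1.0] + hidden
        outputs = [sigmoid(sum(w * v for w, v in zip(row, hs))) for row in self.w_out]
        return xs, hs, outputs

    def train_example(self, x, target, eta=0.5):
        # Step 1: input the example and compute the outputs.
        xs, hs, outputs = self.forward(x)
        # Step 2: delta_k = o_k (1 - o_k) (t_k - o_k) for each output unit.
        delta_out = [o * (1 - o) * (t - o) for o, t in zip(outputs, target)]
        # Step 3: delta_h = o_h (1 - o_h) sum_k w_hk delta_k for each hidden unit.
        delta_hidden = []
        for h in range(1, len(hs)):
            back = sum(self.w_out[k][h] * delta_out[k] for k in range(len(delta_out)))
            delta_hidden.append(hs[h] * (1 - hs[h]) * back)
        # Step 4: w_ij <- w_ij + eta * delta_j * x_ij for every weight.
        for k, row in enumerate(self.w_out):
            for j in range(len(row)):
                row[j] += eta * delta_out[k] * hs[j]
        for h, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                row[i] += eta * delta_hidden[h] * xs[i]
        return outputs


net = Network(n_in=2, n_hidden=2, n_out=1)
print(net.train_example([1.0, 0.0], [1.0]))
```

The outer "until satisfied" loop is left to the caller, for example a fixed number of passes over the training set or a check on the remaining error; which criterion is appropriate is not specified here.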
![Page 42: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/42.jpg)
“Executing” neural networks
- Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level
- Working forward through the network, the input function of each unit is applied to compute the input value
  - Usually this is just the weighted sum of the activations on the links feeding into this node
- The activation function transforms this input value into a final value
  - Typically this is a nonlinear function, often a sigmoid function corresponding to the “threshold” of that node
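The forward pass described here can be sketched as a loop over layers. The nested-list representation of the network and the name `run_network` are assumptions of this sketch; each unit is a list of weights whose first entry multiplies a constant input of 1.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def run_network(layers, inputs):
    """layers: list of layers; each layer is a list of units; each unit is a
    list of weights [w0, w1, ...] where w0 multiplies a constant input of 1."""
    activations = list(inputs)  # set from outside, like sensor readings
    for layer in layers:
        next_activations = []
        for weights in layer:
            # Input function: weighted sum of activations on incoming links.
            net = weights[0] + sum(w * a for w, a in zip(weights[1:], activations))
            # Activation function: squash the input value with a sigmoid.
            next_activations.append(sigmoid(net))
        activations = next_activations
    return activations


# Hypothetical 2-input network: one hidden unit, one output unit.
layers = [[[0.0, 1.0, -1.0]], [[0.0, 2.0]]]
print(run_network(layers, [3.0, 3.0]))
```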
![Page 43: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/43.jpg)
Neural Nets for Face Recognition
[Figure: network with 30x32 inputs and four outputs (left, straight, right, up), shown with typical input images]
90% accurate at learning head pose, and recognizing 1-of-20 faces
![Page 44: Statistical Learning Methods](https://reader033.fdocuments.in/reader033/viewer/2022061406/56815b7e550346895dc97b58/html5/thumbnails/44.jpg)
Summary: Statistical Learning Methods
- Statistical inference: use the likelihood of the data and the probability of each hypothesis to predict the value for the next instance
  - Bayesian
  - MAP
  - ML
- Naïve Bayes
- Nearest Neighbor
- Neural Networks