Page 1: Classification
Slides from Heng Ji
Page 2: Classification
- Define classes/categories
- Label text
- Extract features
- Choose a classifier (Naïve Bayes, Decision Trees, Maximum Entropy, …)
- Train it
- Use it to classify new examples
Page 3: Naïve Bayes
- More powerful than Decision Trees
- Every feature gets a say in determining which label should be assigned to a given input value.
(Figure: Decision Trees vs. Naïve Bayes)
Page 4: Naïve Bayes: Strengths
- Very simple model: easy to understand, very easy to implement
- Scales easily to millions of training examples (just needs counts!)
- Very efficient: fast training and classification
- Modest storage requirements
- Widely used because it works really well for text categorization
- Linear, but non-parallel decision boundaries
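Since training needs only counts, a working classifier fits in a few lines. Below is a minimal sketch (not from the slides; the class and variable names are invented): a multinomial Naïve Bayes over whitespace-tokenized text, with add-one smoothing so unseen words do not zero out a class.

```python
from collections import Counter, defaultdict
import math

class NaiveBayesText:
    """Multinomial Naive Bayes trained from raw counts (add-one smoothing)."""

    def fit(self, docs, labels):
        self.label_counts = Counter(labels)        # class priors via counts
        self.word_counts = defaultdict(Counter)    # per-class word counts
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        def log_score(label):
            total = sum(self.word_counts[label].values())
            prior = math.log(self.label_counts[label] / sum(self.label_counts.values()))
            likelihood = sum(
                math.log((self.word_counts[label][w] + 1) / (total + len(self.vocab)))
                for w in doc.split())
            return prior + likelihood
        return max(self.label_counts, key=log_score)

nb = NaiveBayesText().fit(["win money now", "meeting at noon"], ["spam", "ham"])
print(nb.predict("free money"))   # -> "spam"
```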
Page 5: Naïve Bayes: Weaknesses
- The Naïve Bayes independence assumption has two consequences:
  - The linear ordering of words is ignored (bag-of-words model)
  - The words are treated as independent of each other given the class. In reality they are not: "president" is more likely to occur in a context that contains "election" than in a context that contains "poet"
- The Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables
- Nonetheless, Naïve Bayes models do well in a surprisingly large number of cases, because often we are interested in classification accuracy and not in accurate probability estimates
- Does not optimize prediction accuracy
Page 6: The Naivete of Independence
- The Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables
- The classifier may end up "double-counting" the effect of highly correlated features, pushing the classifier closer to a given label than is justified
- Consider a name-gender classifier:
  - The features ends-with(a) and ends-with(vowel) are dependent on one another: if an input value has the first feature, then it must also have the second
  - For features like these, the duplicated information may be given more weight than is justified by the training set
Page 7: Decision Trees: Strengths
- Capable of generating understandable rules
- Perform classification without requiring much computation
- Capable of handling both continuous and categorical variables
- Provide a clear indication of which features are most important for prediction or classification
Page 8: Decision Trees: Weaknesses
- Prone to errors in classification problems with many classes and a relatively small number of training examples
  - Since each branch in the decision tree splits the training data, the amount of training data available to train nodes lower in the tree can become quite small
- Can be computationally expensive to train
  - Need to compare all possible splits
  - Pruning is also expensive
Page 9: Decision Trees: Weaknesses
- Typically examine one field at a time
- This leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space
- Such ordering limits their ability to exploit features that are relatively independent of one another
- Naïve Bayes overcomes this limitation by allowing all features to act "in parallel"
Page 10: Linearly Separable Data
(Figure: Class 1 and Class 2 separated by a linear decision boundary)
Page 11: Non-Linearly Separable Data
(Figure: Class 1 and Class 2, not separable by any line)
Page 12: Non-Linearly Separable Data
(Figure: Class 1 and Class 2 separated by a non-linear classifier)
Page 13: Linear versus Non-Linear Algorithms
- Linearly or non-linearly separable data? We can find out only empirically.
- Linear algorithms (algorithms that find a linear decision boundary):
  - Use when we think the data is linearly separable
  - Advantages: simpler, fewer parameters
  - Disadvantages: high-dimensional data (as in NLP) is usually not linearly separable
  - Examples: Perceptron, Winnow, large margin classifiers
  - Note: we can also use linear algorithms for non-linear problems (see kernel methods)
Page 14: Linear versus Non-Linear Algorithms
- Non-linear algorithms:
  - Use when the data is not linearly separable
  - Advantages: more accurate
  - Disadvantages: more complicated, more parameters
  - Example: kernel methods
  - Note: the distinction between linear and non-linear also applies to multi-class classification (we'll see this later)
Page 15: Simple Linear Algorithms
- Perceptron algorithm:
  - Linear
  - Binary classification
  - Online (processes data sequentially, one data point at a time)
  - Mistake-driven
  - A simple single-layer neural network
Page 16: Linear Algebra
(Figure: a hyperplane with normal vector w and offset b; wx + b > 0 on one side, wx + b < 0 on the other)
Page 17: Linear Binary Classification
- Data: {(xi, yi)}, i = 1…n
  - x in R^d: a feature vector in d-dimensional space
  - y in {-1, +1}: the label (class, category)
- Question: design a linear decision boundary wx + b = 0 (the equation of a hyperplane) such that the classification rule associated with it has minimal probability of error
- Classification rule: y = sign(wx + b), which means:
  - if wx + b > 0 then y = +1
  - if wx + b < 0 then y = -1
- Gert Lanckriet, Statistical Learning Theory Tutorial
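The rule y = sign(wx + b) is a one-liner in code. A minimal sketch (w, b, and the points are made up for illustration):

```python
import numpy as np

def classify(w, b, X):
    """Linear binary classification rule: y = sign(w.x + b)."""
    return np.sign(X @ w + b).astype(int)

w, b = np.array([1.0, -1.0]), 0.5
X = np.array([[2.0, 1.0],    # w.x + b = 1.5  -> +1
              [0.0, 2.0]])   # w.x + b = -1.5 -> -1
print(classify(w, b, X))     # [ 1 -1]
```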
Page 18: Linear Binary Classification
- Find a good hyperplane (w, b) in R^{d+1} that correctly classifies as many data points as possible
- In online fashion: one data point at a time, update the weights as necessary
- (Figure: separating hyperplane wx + b = 0)
- Classification rule: y = sign(wx + b)
- From Gert Lanckriet, Statistical Learning Theory Tutorial
Page 19: Perceptron
Page 20: Perceptron Learning Rule
- Assuming the problem is linearly separable, there is a learning rule that converges in finite time
- Motivation: a new (unseen) input pattern that is similar to an old (seen) input pattern is likely to be classified correctly
Page 21: Learning Rule, Ctd.
- Basic idea: go over all existing data patterns whose labeling is known, and check their classification with the current weight vector
- If correct, continue
- If not, add to the weights a quantity that is proportional to the product of the input pattern with the desired output Z (+1 or -1)
Page 22: Weight Update Rule
Wj+1 = Wj + η Zj Xj,   j = 0, …, n
where η is the learning rate.
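In code, the rule only fires on misclassified patterns, since the perceptron is mistake-driven. A one-step sketch (illustrative; following the toy example on the next slides, a point exactly on the boundary is treated as correct here):

```python
import numpy as np

def perceptron_step(w, x, z, eta):
    """W_{j+1} = W_j + eta * Z_j * X_j, applied only when x is misclassified."""
    if z * (w @ x) < 0:        # mistake: predicted sign disagrees with label z
        return w + eta * z * x
    return w                   # correct (or on the boundary): keep the weights
```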
Page 23: Hebb Rule
- In 1949, Hebb postulated that the changes in a synapse are proportional to the correlation between the firing of the neurons that are connected through the synapse (the pre- and post-synaptic neurons)
- "Neurons that fire together, wire together"
Page 24: Example: A Simple Problem
- 4 points, linearly separable
(Figure: points (1/2, 1) and (1, 1/2) with Z = +1; points (-1, 1/2) and (-1, 1) with Z = -1)
Page 25: Initial Weights
(Figure: the same four points, with initial weight vector W0 = (0, 1))
Page 26: Updating Weights
- The upper-left point is wrongly classified
- η = 1/3, W0 = (0, 1)
- W1 ← W0 + η Z X1
- W1 = (0, 1) + 1/3 · (-1) · (-1, 1/2) = (1/3, 5/6)
![Page 27: Classification Slides from Heng Ji Classification Define classes/categories Define classes/categories Label text Label text Extract features Extract.](https://reader035.fdocuments.in/reader035/viewer/2022062321/56649ef65503460f94c09fa5/html5/thumbnails/27.jpg)
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
W1 = (1/3,5/6)
First CorrectionFirst Correction
Page 28: Updating Weights, Ctd.
- The upper-left point is still wrongly classified
- W2 ← W1 + η Z X1
- W2 = (1/3, 5/6) + 1/3 · (-1) · (-1, 1/2) = (2/3, 2/3)
Page 29: Second Correction
(Figure: updated weight vector W2 = (2/3, 2/3))
Page 30: Example, Ctd.
- All 4 points are now classified correctly
- Toy problem: only 2 updates required
- The correction of the weights was simply a rotation of the separating hyperplane
- The rotation is applied in the right direction, but may require many updates
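The whole walk-through from pages 24-30 fits in a short loop. A sketch that reproduces the two corrections (as above, points exactly on the boundary are counted as correct, which matches the slide's tally of two updates):

```python
import numpy as np

X = np.array([[0.5, 1.0], [1.0, 0.5], [-1.0, 0.5], [-1.0, 1.0]])
Z = np.array([1, 1, -1, -1])           # labels from page 24
w, eta = np.array([0.0, 1.0]), 1 / 3   # W0 = (0, 1), learning rate 1/3

updates = 0
while True:
    mistakes = [(x, z) for x, z in zip(X, Z) if z * (w @ x) < 0]
    if not mistakes:
        break
    x, z = mistakes[0]                 # X1 = (-1, 1/2), picked twice in a row
    w = w + eta * z * x
    updates += 1
    print(f"W{updates} = {w}")         # W1 = [1/3, 5/6], then W2 = [2/3, 2/3]

print(updates)                         # 2, as on page 30
```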
Page 31: Support Vector Machines
Page 32: Large Margin Classifier
- Another family of linear algorithms
- Intuition (Vapnik, 1965): if the classes are linearly separable:
  - Separate the data
  - Place the hyperplane "far" from the data: large margin
  - Statistical results guarantee good generalization
- (Figure: a separating hyperplane that passes close to the data, labeled BAD)
- Gert Lanckriet, Statistical Learning Theory Tutorial
Page 33: Large Margin Classifier
- (Figure: a maximal margin classifier, labeled GOOD)
- Intuition (Vapnik, 1965): if linearly separable:
  - Separate the data
  - Place the hyperplane "far" from the data: large margin
  - Statistical results guarantee good generalization
- Gert Lanckriet, Statistical Learning Theory Tutorial
Page 34: Large Margin Classifier
- If not linearly separable:
  - Allow some errors
  - Still, try to place the hyperplane "far" from each class
- Gert Lanckriet, Statistical Learning Theory Tutorial
Page 35: Large Margin Classifiers
- Advantages: theoretically better (better error bounds)
- Limitations: computationally more expensive; a large quadratic programming problem
Page 36: Non-Linear Problem
Page 37: Non-Linear Problem
Page 38: Non-Linear Problem
- Kernel methods: a family of non-linear algorithms
- Transform the non-linear problem into a linear one (in a different feature space)
- Use linear algorithms to solve the linear problem in the new space
- Gert Lanckriet, Statistical Learning Theory Tutorial
Page 39: Basic Principle of Kernel Methods
- Mapping Φ: R^d → R^D (D >> d)
- Example: X = [x z], Φ(X) = [x^2 z^2 xz]
- Decision function: f(x) = sign(w1·x^2 + w2·z^2 + w3·xz + b), i.e. the boundary is w^T Φ(x) + b = 0
- Gert Lanckriet, Statistical Learning Theory Tutorial
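To make the mapping concrete, here is a sketch (data and weights invented for illustration): a circle-versus-outside problem has no linear boundary in (x, z), but in Φ(X) = [x², z², xz] the circle x² + z² = 1 becomes the hyperplane with w = (1, 1, 0), b = -1.

```python
import numpy as np

def phi(X):
    """Explicit feature map Phi([x, z]) = [x^2, z^2, x*z] from page 39."""
    x, z = X[:, 0], X[:, 1]
    return np.stack([x ** 2, z ** 2, x * z], axis=1)

X = np.array([[0.1, 0.2], [-0.2, 0.1],    # inside the unit circle: class -1
              [1.5, 1.0], [-1.2, -1.4]])  # outside:                class +1
y = np.array([-1, -1, 1, 1])

w, b = np.array([1.0, 1.0, 0.0]), -1.0    # linear in feature space: x^2 + z^2 - 1
print(np.sign(phi(X) @ w + b) == y)       # [ True  True  True  True]
```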
Page 40: Basic Principle of Kernel Methods
- Linear separability: more likely in high dimensions
- Mapping: Φ maps the input into a high-dimensional feature space
- Classifier: construct a linear classifier in the high-dimensional feature space
- Motivation: an appropriate choice of Φ leads to linear separability
- We can do this efficiently!
- Gert Lanckriet, Statistical Learning Theory Tutorial
Page 41: Basic Principle of Kernel Methods
- We can use the linear algorithms seen before (for example, the perceptron) for classification in the higher-dimensional space
Page 42: Multi-Class Classification
- Given: some data items that belong to one of M possible classes
- Task: train the classifier and predict the class for a new data item
- Geometrically: a harder problem, no more simple geometry
Page 43: Multi-Class Classification
Page 44: Linear Classifiers
- f(x, w, b) = sign(w·x + b)
- How would you classify this data?
- (Figure: points labeled +1 and -1; a candidate line w·x + b = 0, with w·x + b > 0 on one side and w·x + b < 0 on the other)
Page 45: Linear Classifiers
- f(x, w, b) = sign(w·x + b)
- How would you classify this data?
- (Figure: a different candidate separating line)
Linear Classifiers
f x
yest
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
How would you classify this data?
Page 47: Linear Classifiers
- f(x, w, b) = sign(w·x + b)
- Any of these would be fine…
- …but which is best?
Page 48: Linear Classifiers
- f(x, w, b) = sign(w·x + b)
- How would you classify this data?
- (Figure: a boundary that leaves a point misclassified to the +1 class)
Page 49: Classifier Margin
- f(x, w, b) = sign(w·x + b)
- Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Page 50: Maximum Margin
- f(x, w, b) = sign(w·x + b)
- The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
- This is the simplest kind of SVM, called a linear SVM (LSVM)
- Support vectors are those datapoints that the margin pushes up against
- Why maximum margin?
  1. Maximizing the margin is good according to intuition and PAC theory
  2. It implies that only the support vectors are important; the other training examples are ignorable
  3. Empirically it works very, very well
Page 51: Let Me Digress to… What is PAC Theory?
- Two important aspects of complexity in machine learning:
  - First, sample complexity: in many learning problems, training data is expensive and we should hope not to need too much of it.
  - Second, computational complexity: a neural network, for example, which takes an hour to train may be of no practical use in complex financial prediction problems.
- It is important that both the amount of training data required for a prescribed level of performance and the running time of the learning algorithm do not increase too dramatically as the "difficulty" of the learning problem increases.
Page 52: Let Me Digress to… What is PAC Theory?
- Such issues have been formalised and investigated over the past decade within the field of "computational learning theory".
- One popular framework for discussing such problems is the probabilistic framework which has become known as the "probably approximately correct", or PAC, model of learning.
Page 53: Linear SVM, Mathematically
- What we know:
  - w · x+ + b = +1 (boundary of the "predict class = +1" zone)
  - w · x- + b = -1 (boundary of the "predict class = -1" zone)
  - w · (x+ - x-) = 2
- Margin width: M = (x+ - x-) · w/||w|| = 2/||w||
- (Figure: the parallel hyperplanes wx + b = -1, wx + b = 0, wx + b = +1, with x+ and x- on the margin boundaries)
Page 54: Linear SVM, Mathematically
- Goal: 1) correctly classify all training data:
  - w·xi + b ≥ +1 if yi = +1
  - w·xi + b ≤ -1 if yi = -1
  - equivalently, yi (w·xi + b) ≥ 1 for all i
- 2) Maximize the margin M = 2/||w||, which is the same as minimizing Φ(w) = ½ w^T w
- We can formulate a quadratic optimization problem and solve for w and b:
  - Minimize ½ w^T w
  - subject to yi (w·xi + b) ≥ 1 for all i
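This primal problem is small enough to hand to a general-purpose solver. A sketch with invented toy data; it assumes NumPy and SciPy are available and uses the SLSQP method as a stand-in for a dedicated QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Hard-margin primal: minimize 1/2 w.w  subject to  y_i (w.x_i + b) >= 1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

def objective(p):                        # p = [w1, w2, b]
    return 0.5 * (p[0] ** 2 + p[1] ** 2)

constraints = [
    {"type": "ineq", "fun": lambda p, i=i: y[i] * (X[i] @ p[:2] + p[2]) - 1}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b)                              # a maximum-margin separator
print(np.sign(X @ w + b))                # [ 1.  1. -1. -1.]
```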
Page 55: Solving the Optimization Problem
- We need to optimize a quadratic function subject to linear constraints.
- Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them.
- The solution involves constructing a dual problem, where a Lagrange multiplier αi is associated with every constraint in the primal problem.
- Primal: find w and b such that Φ(w) = ½ w^T w is minimized, and for all {(xi, yi)}: yi (w^T xi + b) ≥ 1
- Dual: find α1…αN such that Q(α) = Σ αi - ½ ΣΣ αiαj yiyj xi^T xj is maximized, and
  (1) Σ αiyi = 0
  (2) αi ≥ 0 for all αi
Page 56: A Digression… Lagrange Multipliers
- In mathematical optimization, the method of Lagrange multipliers provides a strategy for finding the maxima and minima of a function subject to constraints.
- For instance, consider the optimization problem: maximize f(x, y) subject to g(x, y) = c.
- We introduce a new variable λ, called a Lagrange multiplier, and study the Lagrange function defined by Λ(x, y, λ) = f(x, y) + λ (g(x, y) - c)
  (the λ term may be either added or subtracted).
- If (x, y) is a maximum for the original constrained problem, then there exists a λ such that (x, y, λ) is a stationary point for the Lagrange function
  (stationary points are those points where the partial derivatives of Λ are zero).
Page 57: The Optimization Problem Solution
- The solution has the form:
  w = Σ αiyixi
  b = yk - w^T xk for any xk such that αk ≠ 0
- Each non-zero αi indicates that the corresponding xi is a support vector.
- The classifying function then has the form:
  f(x) = Σ αiyi xi^T x + b
- Notice that it relies on an inner product between the test point x and the support vectors xi.
- Also keep in mind that solving the optimization problem involved computing the inner products xi^T xj between all pairs of training points.
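A sketch of this structure, assuming scikit-learn is available: fit a linear SVM, rebuild w = Σ αiyixi from the stored dual coefficients (scikit-learn's dual_coef_ holds the products αiyi), and check the manual f(x) against the library's decision function.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # very large C ~ hard margin

w = clf.dual_coef_[0] @ clf.support_vectors_   # w = sum_i (alpha_i y_i) x_i
b = clf.intercept_[0]

x_test = np.array([1.0, 0.5])
f_manual = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_test) + b
print(np.isclose(f_manual, clf.decision_function([x_test])[0]))   # True
```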
Page 58: Dataset with Noise
- Hard margin: so far we have required that all data points be classified correctly: no training error
- What if the training set is noisy?
- Solution 1: use very powerful kernels
- (Figure: a noisy dataset with +1 and -1 points; a boundary that contorts to fit every point: OVERFITTING!)
Page 59: Soft Margin Classification
- Slack variables ξi can be added to allow misclassification of difficult or noisy examples.
- (Figure: the hyperplanes wx + b = -1, 0, +1, with slack variables ξ2, ξ7, ξ11 marking the violating points)
- What should our quadratic optimization criterion be?
- Minimize ½ w·w + C Σ_{k=1..R} ξk
Page 60: Hard Margin vs. Soft Margin
- The old formulation:
  Find w and b such that Φ(w) = ½ w^T w is minimized, and for all {(xi, yi)}: yi (w^T xi + b) ≥ 1
- The new formulation, incorporating slack variables:
  Find w and b such that Φ(w) = ½ w^T w + C Σ ξi is minimized, and for all {(xi, yi)}: yi (w^T xi + b) ≥ 1 - ξi and ξi ≥ 0 for all i
- The parameter C can be viewed as a way to control overfitting.
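A quick way to see C at work, assuming scikit-learn is available (the data here is invented): small C tolerates margin violations and keeps many support vectors; large C punishes slack and approaches the hard-margin behaviour.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)),   # overlapping classes, so
               rng.normal(+1.0, 1.0, (50, 2))])  # some slack is unavoidable
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Support-vector count typically shrinks as C grows.
    print(C, clf.n_support_.sum(), clf.score(X, y))
```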
Page 61: Linear SVMs: Overview
- The classifier is a separating hyperplane.
- The most "important" training points are the support vectors; they define the hyperplane.
- Quadratic optimization algorithms can identify which training points xi are support vectors, i.e. which have non-zero Lagrange multipliers αi.
- Both in the dual formulation of the problem and in the solution, training points appear only inside dot products:
  Find α1…αN such that Q(α) = Σ αi - ½ ΣΣ αiαj yiyj xi^T xj is maximized, and
  (1) Σ αiyi = 0
  (2) 0 ≤ αi ≤ C for all αi
  f(x) = Σ αiyi xi^T x + b
Page 62: Non-Linear SVMs
- Datasets that are linearly separable with some noise work out great.
- But what are we going to do if the dataset is just too hard?
- How about… mapping the data to a higher-dimensional space?
- (Figure: 1-D points on the x axis that no threshold separates; after mapping x → (x, x^2), a horizontal line separates them)
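The 1-D picture in this slide can be checked in a few lines (an illustrative sketch): no threshold on x separates the classes, but after lifting x to (x, x²) the horizontal line x² = 1 does.

```python
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])   # far from the origin vs. near it

mapped = np.stack([x, x ** 2], axis=1)   # lift to (x, x^2)
print(np.sign(mapped[:, 1] - 1.0) == y)  # all True: separable after mapping
```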
Page 63: Non-Linear SVMs: Feature Spaces
- General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
- Φ: x → φ(x)
The “Kernel Trick”
The linear classifier relies on dot product between vectors K(xi,xj)=xiTxj
If every data point is mapped into high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes:
K(xi,xj)= φ(xi) Tφ(xj)
A kernel function is some function that corresponds to an inner product in some expanded feature space.
Example:
2-dimensional vectors x=[x1 x2]; let K(xi,xj)=(1 + xiTxj)2
,
Need to show that K(xi,xj)= φ(xi) Tφ(xj):
K(xi,xj)=(1 + xiTxj)2
,
= 1+ xi12xj1
2 + 2 xi1xj1 xi2xj2+ xi2
2xj22 + 2xi1xj1 + 2xi2xj2
= [1 xi12 √2 xi1xi2 xi2
2 √2xi1 √2xi2]T [1 xj12 √2 xj1xj2 xj2
2 √2xj1 √2xj2]
= φ(xi) Tφ(xj), where φ(x) = [1 x1
2 √2 x1x2 x22 √2x1 √2x2]
Page 65: What Functions are Kernels?
- For some functions K(xi, xj), checking that K(xi, xj) = φ(xi)^T φ(xj) can be cumbersome.
- Mercer's theorem: every positive semi-definite symmetric function is a kernel.
Page 66: Examples of Kernel Functions
- Linear: K(xi, xj) = xi^T xj
- Polynomial of power p: K(xi, xj) = (1 + xi^T xj)^p
- Gaussian (radial-basis function network): K(xi, xj) = exp(-||xi - xj||^2 / (2σ^2))
- Sigmoid: K(xi, xj) = tanh(β0 xi^T xj + β1)
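Each of these kernels is a one-liner; a sketch with illustrative parameter defaults:

```python
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, p=2):
    return (1 + xi @ xj) ** p

def gaussian(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear(xi, xj), polynomial(xi, xj), gaussian(xi, xj), sigmoid(xi, xj))
```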
Page 67: Non-Linear SVMs, Mathematically
- Dual problem formulation:
  Find α1…αN such that Q(α) = Σ αi - ½ ΣΣ αiαj yiyj K(xi, xj) is maximized, and
  (1) Σ αiyi = 0
  (2) αi ≥ 0 for all αi
- The solution is: f(x) = Σ αiyi K(xi, x) + b
- The optimization techniques for finding the αi's remain the same!
Page 68: Non-Linear SVM: Overview
- The SVM locates a separating hyperplane in the feature space and classifies points in that space.
- It does not need to represent the space explicitly; it simply defines a kernel function.
- The kernel function plays the role of the dot product in the feature space.
Page 69: Properties of SVM
- Flexibility in choosing a similarity function
- Sparseness of the solution when dealing with large data sets: only support vectors are used to specify the separating hyperplane
- Ability to handle large feature spaces: complexity does not depend on the dimensionality of the feature space
- Overfitting can be controlled by the soft-margin approach
- Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution
- Feature selection
Page 70: SVM Applications
- SVMs have been used successfully in many real-world problems:
  - text (and hypertext) categorization
  - image classification, with different types of sub-problems
  - bioinformatics (protein classification, cancer classification)
  - hand-written character recognition
Page 71: Weaknesses of SVM
- It is sensitive to noise: a relatively small number of mislabeled examples can dramatically decrease performance
- It only considers two classes. How do we do multi-class classification with SVM? Answer:
  1. With output arity m, learn m SVMs:
     - SVM 1 learns "Output == 1" vs. "Output != 1"
     - SVM 2 learns "Output == 2" vs. "Output != 2"
     - …
     - SVM m learns "Output == m" vs. "Output != m"
  2. To predict the output for a new input, predict with each SVM and find out which one puts the prediction the furthest into the positive region (a sketch of this follows).
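A sketch of that one-vs-rest recipe, assuming scikit-learn is available (data invented; scikit-learn also ships its own multi-class handling, so this is purely to mirror the slide):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in ((0, 0), (3, 0), (0, 3))])
y = np.repeat([0, 1, 2], 30)

# Learn m binary SVMs: "Output == k" vs. "Output != k".
machines = [SVC(kernel="linear").fit(X, np.where(y == k, 1, -1)) for k in range(3)]

def predict(x):
    # The class whose SVM puts x furthest into the positive region wins.
    scores = [m.decision_function([x])[0] for m in machines]
    return int(np.argmax(scores))

print(predict(np.array([2.8, 0.2])))   # -> 1
```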
Page 72: Application: Text Categorization
- Task: the classification of natural text (or hypertext) documents into a fixed number of predefined categories based on their content
  - email filtering, web searching, sorting documents by topic, etc.
- A document can be assigned to more than one category, so this can be viewed as a series of binary classification problems, one for each category
Page 73: Application: Face Expression Recognition
- Construct the feature space, by use of eigenvectors or other means
- A multiple-class problem: several expressions
- Use a multi-class SVM
Page 74: Some Issues
- Choice of kernel
  - A Gaussian or polynomial kernel is the default
  - If ineffective, more elaborate kernels are needed
- Choice of kernel parameters
  - e.g. σ in the Gaussian kernel: σ is the distance between the closest points with different classifications
  - In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters
- Optimization criterion: hard margin vs. soft margin
  - typically a lengthy series of experiments in which various parameters are tested
Page 75: Additional Resources
- LibSVM
- An excellent tutorial on VC-dimension and Support Vector Machines: C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
- The VC/SRM/SVM bible: Statistical Learning Theory, Vladimir Vapnik, Wiley-Interscience, 1998
- http://www.kernel-machines.org/
Page 76: References
- Support Vector Machine Classification of Microarray Gene Expression Data. Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Sugnet, Manuel Ares, Jr., David Haussler.
- www.cs.utexas.edu/users/mooney/cs391L/svm.ppt
- Text Categorization with Support Vector Machines: Learning with Many Relevant Features. T. Joachims, ECML-98.