[Figure: two classes, Class1 and Class2]

The methods discussed so far are linear discriminants.
[Figure: the four XOR cases plotted on 0-1 axes]

XOR Problem: Not Linearly Separable!
Decision Rules for XOR Problem:

If x = 0 then
    If y = 0 then class = 0
    Else class = 1
Else if x = 1 then
    If y = 0 then class = 1
    Else class = 0
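These rules can be transcribed directly as a function (a minimal sketch; the function name is ours):

```python
def xor_class(x: int, y: int) -> int:
    """Decision rules for the XOR problem, transcribed directly."""
    if x == 0:
        return 0 if y == 0 else 1
    else:  # x == 1
        return 1 if y == 0 else 0

# The four XOR cases:
assert xor_class(0, 0) == 0 and xor_class(1, 1) == 0
assert xor_class(0, 1) == 1 and xor_class(1, 0) == 1
```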
A Sample Decision Tree

[Figure: a decision tree with internal nodes X (root) and Y, branch labels f and t, and leaves labeled C1, C2, and C1]

By default, a false value is to the left, true to the right. It is easy to generate a tree that perfectly classifies the training data; it is much harder to generate a tree that works well on the test data!
Decision Tree Induction

• Pick a feature to test, say X
• Split the training cases into two sets: one where X = True and one where X = False
• If a set contains cases of only one class, label it as a leaf. Alternatively, label a set as a leaf if it contains fewer than some threshold number of cases, e.g. 5
• Repeat the process on sets that are not leaves
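The steps above can be sketched as a recursive procedure over boolean features (a simplified sketch, assuming cases are (feature-dict, class) pairs; all names are ours, and the feature to test is picked naively rather than by a heuristic):

```python
def majority(cases, default=None):
    """Most common class label among cases; default if cases is empty."""
    labels = [c for _, c in cases]
    return max(set(labels), key=labels.count) if labels else default

def induce_tree(cases, features, min_leaf=5):
    """cases: list of (feature_dict, label) pairs; features: list of names."""
    classes = {c for _, c in cases}
    # Make a leaf when the set is pure, too small, or no features remain.
    if len(classes) <= 1 or len(cases) < min_leaf or not features:
        return majority(cases)
    X = features[0]  # naive pick; the heuristic choice goes here
    true_set = [(f, c) for f, c in cases if f[X]]
    false_set = [(f, c) for f, c in cases if not f[X]]
    return {"test": X,
            "true": induce_tree(true_set, features[1:], min_leaf),
            "false": induce_tree(false_set, features[1:], min_leaf)}

# A tiny sanity check: X alone determines the class.
cases = [({"X": True}, "C1")] * 5 + [({"X": False}, "C2")] * 5
tree = induce_tree(cases, ["X"], min_leaf=2)
assert tree == {"test": "X", "true": "C1", "false": "C2"}
```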
Decision Tree Induction

• How do we pick which feature to test? Use heuristic search!
• The entropy heuristic attempts to reduce the degree of randomness, or "impurity", of the selected split (the number of bits needed to encode the outcome)
• E.g. high randomness: a feature that splits the cases into two sets, each set 50% class 1 and 50% class 2
• Low randomness: a feature that splits the cases into two sets, where everything in set 1 is class 1 and everything in set 2 is class 2
The entropy of a particular state is the negative sum, over all the classes, of the probability of each class multiplied by the log of that probability:

entropy(n) = -Σ_c p_c lg(p_c)

2 classes, C1 and C2; 100 cases.
For this state, 50 cases are in C1 and 50 cases are in C2, so the probability of each class, P1 and P2, is 0.5.
The entropy of this node = -[(0.5)(lg 0.5) + (0.5)(lg 0.5)] = 1

75 cases in C1 and 25 cases in C2:
P(C1) = 0.75, P(C2) = 0.25
Entropy = -[(0.75)(lg 0.75) + (0.25)(lg 0.25)] = 0.81
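Both node entropies can be checked with a short helper (a sketch; `entropy` takes per-class case counts, a representation choice of ours):

```python
from math import log2

def entropy(counts):
    """Entropy of a node, given the number of cases in each class."""
    total = sum(counts)
    # -sum over classes of p_c * lg(p_c); classes with p_c = 0 contribute 0.
    return -sum((n / total) * log2(n / total) for n in counts if n > 0)

assert entropy([50, 50]) == 1.0             # the 50/50 node
assert round(entropy([75, 25]), 2) == 0.81  # the 75/25 node
```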
Our algorithm will pick the feature or test that reduces the entropy the most. When a test has two outcomes, this is achieved by finding the feature test that maximizes:

entropy(n) - p_left*entropy(n_left) - p_right*entropy(n_right)

If a test had more than two outcomes, we would have to include a p_branch*entropy(n_branch) term for each branch.
For example, let's say that we are at a node with an entropy of 1, as calculated previously. If we split the current set of cases by testing whether Feature A is true or false:

Feature A, 100 cases:
    F branch: 10 cases C1, 20 cases C2
    T branch: 50 cases C1, 20 cases C2

E(F) = -[(1/3)*lg(1/3) + (2/3)*lg(2/3)] = 0.92
E(T) = -[(5/7)*lg(5/7) + (2/7)*lg(2/7)] = 0.86

Change in entropy = 1 - (0.3)(0.92) - (0.7)(0.86) = 0.122
If we split the current set of cases by testing whether Feature B is true or false:

Feature B, 100 cases:
    F branch: 50 cases C1
    T branch: 50 cases C2

E(F) = -[(1)*lg(1) + (0)*lg(0)] = 0 (taking 0*lg(0) = 0, since lg(0) alone is undefined)
E(T) = -[(0)*lg(0) + (1)*lg(1)] = 0

Change in entropy = 1 - (0.5)(0) - (0.5)(0) = 1

A larger change in entropy, so we will pick Feature B over Feature A!
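The two candidate splits can be verified numerically (a sketch re-deriving the entropy helper; note the exact drop for Feature A is about 0.12, since the value 0.122 on the slide comes from rounding the branch entropies to 0.92 and 0.86 first):

```python
from math import log2

def entropy(counts):
    """Entropy from per-class case counts; zero counts contribute 0."""
    total = sum(counts)
    return -sum((n / total) * log2(n / total) for n in counts if n > 0)

parent_entropy = 1.0  # entropy of the node before splitting, as given

# Feature A: F branch 10 C1 / 20 C2 (30 cases), T branch 50 C1 / 20 C2 (70 cases)
drop_a = parent_entropy - 0.3 * entropy([10, 20]) - 0.7 * entropy([50, 20])
# Feature B: F branch 50 C1 / 0 C2, T branch 0 C1 / 50 C2
drop_b = parent_entropy - 0.5 * entropy([50, 0]) - 0.5 * entropy([0, 50])

assert round(drop_a, 2) == 0.12
assert drop_b == 1.0
assert drop_b > drop_a  # pick Feature B over Feature A
```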
[Plot: # of Terminals vs. Error Rates (for the Iris Data problem) — error rate from 0 to 1 against 1 to 9 terminal nodes, with curves for Apparent Error and True Error]
Reduction in Tree Size

• Prune branches:
    – Induct the tree
    – From the bottom, move up to the subtree starting at a non-terminal node
    – Prune this node
    – Test the new tree on the *test* cases
    – If it performs better than the original tree, keep the changes and continue
• Subtle flaw: this trains on the test data. Need a large sample size to get valid results.
Web Demo
• http://www.cs.ualberta.ca/~aixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html
Rule Induction Overview
• Generic separate-and-conquer strategy
• CN2 rule induction algorithm
• Improvements to rule induction
Problem

• Given:
    – A target concept
    – Positive and negative examples
    – Examples composed of features
• Find:
    – A simple set of rules that discriminates between (unseen) positive and negative examples of the target concept
Sample Unordered Rules
• If X then C1
• If X and Y then C2
• If NOT X and Z and Y then C3
• If B then C2
• What if two rules fire at once? Just OR together?
Target Concept

• The target concept is in the form of rules. If we only have 3 features, X, Y, and Z, then we could generate the following possible rules:
    – If X then ...
    – If X and Y then ...
    – If X and Y and Z then ...
    – If X and Z then ...
    – If Y then ...
    – If Y and Z then ...
    – If Z then ...
• Exponentially large space, larger if we allow NOTs
Generic Separate-and-Conquer Strategy

TargetConcept = NULL
While NumPositive(Examples) > 0
    BestRule = TRUE
    Rule = BestRule
    Cover = ApplyRule(Rule)
    While NumNegative(Cover) > 0
        For each feature in Features
            Refinement = Rule AND feature
            If Heuristic(Refinement, Examples) > Heuristic(BestRule, Examples)
                BestRule = Refinement
        Rule = BestRule
        Cover = ApplyRule(Rule)
    TargetConcept = TargetConcept OR Rule
    Examples = Examples - Cover
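The strategy above can be sketched in runnable Python (an illustrative sketch, not the slide's exact code: we represent a rule as a frozenset of required features, a rule covers an example when it is a subset of it, and the heuristic is the fraction of covered examples that are positive):

```python
def covers(rule, example):
    """A rule (set of required features) covers an example (set of features)."""
    return rule <= example

def heuristic(rule, pos, neg):
    """Fraction of covered examples that are positive (0 if none covered)."""
    p = sum(covers(rule, e) for e in pos)
    n = sum(covers(rule, e) for e in neg)
    return p / (p + n) if p + n else 0.0

def separate_and_conquer(pos, neg, features):
    target = []                       # TargetConcept: a disjunction of rules
    pos = list(pos)
    while pos:                        # NumPositive(Examples) > 0
        rule = frozenset()            # TRUE, the most general rule
        while any(covers(rule, e) for e in neg):   # NumNegative(Cover) > 0
            # Greedily add the feature whose refinement scores best.
            best = max((rule | {f} for f in features),
                       key=lambda r: heuristic(r, pos, neg))
            if best == rule:          # no refinement left to add
                break
            rule = best
        target.append(rule)                              # conquer
        pos = [e for e in pos if not covers(rule, e)]    # separate
    return target

# On the trivial example that follows (positives {a,b}, {b,c};
# negatives {c,d}, {d,e}), the learned concept is a OR b:
rules = separate_and_conquer([{'a', 'b'}, {'b', 'c'}],
                             [{'c', 'd'}, {'d', 'e'}],
                             ['a', 'b', 'c', 'd', 'e'])
assert rules == [frozenset({'a'}), frozenset({'b'})]
```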
Trivial Example

Positive examples: 1: a,b   2: b,c
Negative examples: 3: c,d   4: d,e

Heuristic(rule, examples) = #Positive / (#Positive + #Negative)

H(T) = 2/4
H(a) = 1/1
H(b) = 2/2
H(c) = 1/2
H(d) = 0/2
H(e) = 0/1

Say we pick a. Remove the covered examples:

Positive examples: 2: b,c
Negative examples: 3: c,d   4: d,e

H(a OR b) = 1/1
H(a OR c) = 1/2
H(a OR d) = 0/2
H(a OR e) = 0/1

Pick as our rule: a OR b.
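The heuristic values above can be checked directly (a sketch; modeling rules as sets of required features is our representation choice):

```python
def heuristic(rule, pos, neg):
    """Returns (#positive covered, #total covered) as a pair."""
    p = sum(set(rule) <= e for e in pos)
    n = sum(set(rule) <= e for e in neg)
    return (p, p + n)

pos = [{'a', 'b'}, {'b', 'c'}]   # examples 1 and 2
neg = [{'c', 'd'}, {'d', 'e'}]   # examples 3 and 4

assert heuristic([], pos, neg) == (2, 4)     # H(T) = 2/4
assert heuristic(['a'], pos, neg) == (1, 1)  # H(a) = 1/1
assert heuristic(['b'], pos, neg) == (2, 2)  # H(b) = 2/2
assert heuristic(['c'], pos, neg) == (1, 2)  # H(c) = 1/2
assert heuristic(['e'], pos, neg) == (0, 1)  # H(e) = 0/1
```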
CN2 Rule Induction (Clark & Boswell, 1991)

• A more specialized version of separate-and-conquer:

CN2Unordered(allexamples, allclasses)
    Ruleset <- {}
    For each class in allclasses
        Generate rules by CN2ForOneClass(allexamples, class)
        Add rules to ruleset
    Return ruleset
CN2

CN2ForOneClass(examples, class)
    Rules <- {}
    Repeat
        Bestcond <- FindBestCondition(examples, class)
        If bestcond <> null then
            Add the rule "IF bestcond THEN PREDICT class"
            Remove from examples all + cases in class covered by bestcond
    Until bestcond = null
    Return rules

Keeps negative examples around so future rules won't impact existing negatives (allows unordered rules).
CN2FindBestCondition(examples, class)
    MGC <- true                               ' the most general condition
    Star <- {MGC}, Newstar <- {}, Bestcond <- null
    While Star is not empty (and loopcount < MAXCONJUNCTS)
        For each rule R in Star
            For each possible feature F
                R' <- specialization of R formed by adding F as an extra
                      conjunct (i.e. R' = R AND F), removing null conditions
                      (i.e. A AND NOT A), redundancies (i.e. A AND A),
                      and previously generated rules
                If LaPlaceHeuristic(R', class) > LaPlaceHeuristic(Bestcond, class)
                    Bestcond <- R'
                Add R' to Newstar
                If size(Newstar) > MAXRULESIZE then
                    Remove the worst in Newstar until size = MAXRULESIZE
        Star <- Newstar
    Return Bestcond
LaPlace Heuristic

LaPlace(rule) = (NumCorrectCovered(rule) + 1) / (NumTotalCovered(rule) + NumClasses)

In our case, NumClasses = 2.

A common problem is a specific rule that covers only 1 example. In this case, LaPlace = (1+1)/(1+2) = 0.667. However, a rule that covers, say, 2 examples gets a higher value of (2+1)/(2+2) = 0.75.
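The heuristic is a one-liner (a sketch; the argument names are ours):

```python
def laplace(num_correct_covered, num_total_covered, num_classes=2):
    """(NumCorrectCovered + 1) / (NumTotalCovered + NumClasses)."""
    return (num_correct_covered + 1) / (num_total_covered + num_classes)

# A rule covering a single (correct) example:
assert round(laplace(1, 1), 3) == 0.667
# A rule covering two correct examples scores higher:
assert laplace(2, 2) == 0.75
```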
Trivial Example Revisited

Positive examples: 1: a,b   2: b,c
Negative examples: 3: c,d   4: d,e

L(T) = 3/6
L(a) = 2/3
L(b) = 3/4
L(c) = 2/4
L(d) = 1/4
L(e) = 1/3

Say we pick beam = 3. Keep T, a, b.

Specialize T: (all already done)

Specialize a:
L(a AND b) = 2/3
L(a AND c) = 0 (covers no examples)
L(a AND d) = 0
L(a AND e) = 0

Specialize b:
L(b AND a) = 2/3
L(b AND c) = 2/3
L(b AND d) = 0
L(b AND e) = 0

Our best rule out of all of these is still just "b". Continue until we are out of features, or the maximum number of conjuncts is reached.
Improvements to Rule Induction

• Better feature selection algorithm
• Add a rule pruning phase
    – Problem of overfitting the data
    – Split the training examples into a GrowSet (2/3) and a PruneSet (1/3)
        • Train on GrowSet
        • Test on PruneSet with the pruned rules, keep the rule with the best results
    – Needs more training examples!
Improvements to Rule Induction

• Ripper / Slipper
    – Rule induction with pruning, plus new heuristics on when to stop adding rules and when to prune rules
    – Slipper builds on Ripper, but uses boosting to reduce the weight of negative examples instead of removing them entirely
• Other search approaches
    – Instead of beam search: genetic search, pure hill climbing (which would be faster), etc.
In-Class VB Demo
• Rule Induction for Multiplexer