Slide 2
What is Machine Learning?
- Adapt to / learn from data to optimize a performance function
- Can be used to:
  - Extract knowledge from data
  - Learn tasks that are difficult to formalise
  - Create software that improves over time
Slide 3
When to learn
- Human expertise does not exist (navigating on Mars)
- Humans are unable to explain their expertise (speech recognition)
- The solution changes over time (routing on a computer network)
- The solution needs to be adapted to particular cases (user biometrics)
Learning involves:
- Learning general models from data
- Data is cheap and abundant; knowledge is expensive and scarce
- Building a model that is a good and useful approximation to the data
Slide 4
Applications
- Speech and handwriting recognition
- Autonomous robot control
- Data mining and bioinformatics: motifs, alignment, ...
- Playing games
- Fault detection
- Clinical diagnosis
- Spam email detection
- Credit scoring, fraud detection
Applications are diverse, but the methods are generic.
Slide 5
Learning applied to NLP problems
Decision problems involving ambiguity resolution:
- Word selection
- Semantic ambiguity (polysemy)
- PP attachment
- Reference ambiguity (anaphora)
- Text categorization
- Document filtering
- Word sense disambiguation
Slide 6
Learning applied to NLP problems
Problems involving sequence tagging and detection of sequential structures:
- POS tagging
- Named entity recognition
- Syntactic chunking
Problems with output as a hierarchical structure:
- Clause detection
- Full parsing
- IE of complex concepts
Slide 7
Example-based learning: Concept learning
- The computer attempts to learn a concept, i.e., a general description (e.g., arch-learning)
- Input = examples
- Output = a representation of the concept that can classify new examples
- The representation can also be approximate, e.g. "50% of stone objects are arches"; so if an unclassified example is made of stone, it is 50% likely to be an arch
- With multiple such features, more accurate classification can take place
Slide 8
Learning methodologies
- Learning from labelled data (supervised learning), e.g. classification, regression, prediction, function approximation
- Learning from unlabelled data (unsupervised learning), e.g. clustering, visualization, dimensionality reduction
- Learning from sequential data, e.g. speech recognition, DNA data analysis
- Associations
- Reinforcement learning
Slide 9
Inductive learning
- Data is produced by a target. A hypothesis is learned from the data in order to explain, predict, model or control the target.
- Generalization ability is essential.
- Inductive learning hypothesis: if the hypothesis works for enough data, then it will work on new examples.
Slide 10
Supervised Learning: Uses
- Prediction of future cases
- Knowledge extraction
- Compression
- Outlier detection
Slide 11
Unsupervised Learning
- Clustering: grouping similar instances
- Example applications:
  - Clustering items based on similarity
  - Clustering users based on interests
  - Clustering words based on similarity of usage
Slide 12
Reinforcement Learning
- Learning a policy: a sequence of outputs
- No supervised output, but a delayed reward
- Credit assignment problem
- Examples: game playing, a robot in a maze
- Multiple agents, partial observability
Slide 13
Statistical Learning
Machine learning methods can be unified within the framework of statistical learning:
- Data is considered to be a sample from a probability distribution.
- Typically, we don't expect perfect learning but only "probably correct" learning.
- Statistical concepts are the key to measuring our expected performance on novel problem instances.
Slide 14
Probabilistic models
Methods have an explicit probabilistic interpretation:
- Good for dealing with uncertainty, e.g. is a handwritten digit a three or an eight?
- Provides interpretable results
- Unifies methods from different fields
Slide 16
Introduction to concept learning
What is a concept?
- A concept describes a subset of objects or events defined over a larger set (e.g. the concepts "names of people", "names of places", "non-names")
Concept learning
- Acquire/infer the definition of a general concept given a sample of positive and negative training examples of the concept
- Each concept can be thought of as a Boolean-valued function; approximate that function from samples
Slide 17
Concept Learning
Examples: Bird vs. Lion; Sports vs. Entertainment?
Slide 18
Example-based learning: Concept learning
- The computer attempts to learn a concept, i.e., a general description (e.g., arch-learning)
- Input = examples. An example is described by values for the set of features/attributes, together with the concept the example represents.
- Output = a representation of the concept, e.g. made-of-stone & shape=arc => arch
- With multiple such features, more accurate classification can take place
Slide 19
Prototypical concept learning task
- Instance space X: animals, described by attributes such as Barks (Y/N), has_4_legs (Y/N), ...
- Concept space C: the set of possible target concepts, e.g. dog = (barks=Y) AND (has_4_legs=Y)
- Hypothesis space H: the set of possible hypotheses
- Training instances S: positive and negative examples of the target concept f in C
Determine:
- A hypothesis h in H such that h(x) = f(x) for all x in S?
- A hypothesis h in H such that h(x) = f(x) for all x in X?
Slide 20
Concept Learning: notation and basic terms
- Instances X: the set of items over which the concept is defined
- Target concept c: the concept or function to be learned
- Training examples D: the set of available training examples
- Positive (negative) examples: instances for which c(x) = 1 (0)
- Hypotheses H: all possible hypotheses considered by the learner regarding the identity of the target concept. In general, each hypothesis h in H represents a Boolean-valued function defined over X: h: X -> {0,1}
- Learning goal: find a hypothesis h satisfying h(x) = c(x) for all x in X
Slide 21
An example Concept Learning Task
Given:
- Instances X: possible days, described by the attributes Sky, Temp, Humidity, Wind, Water, Forecast
- Target function c: EnjoySport: X -> {0,1}
- Hypotheses H: conjunctions of literals
- Training examples D: positive and negative examples of the target function
Determine:
- A hypothesis h in H such that h(x) = c(x) for all x in D.
Slide 22
Learning Methods
A classifier is a function f(x) = p(class) from attribute vectors x = (x_1, x_2, ..., x_d) to target values p(class).
Example classifiers:
- (interest AND rate) OR (quarterly) -> "interest" category
- score = 0.3*interest + 0.4*rate + 0.1*quarterly; if score > 0.8, then "interest" category
Slide 23
Designing a learning system
- Select features
- Obtain training examples
- Select the hypothesis space
- Select / design a learning algorithm
Slide 24
Inductive Learning Methods
Supervised learning to build classifiers:
- Labeled training data (i.e., examples of items in each category)
- Learn a classifier
- Test effectiveness on new instances
- Statistical guarantees of effectiveness
Slide 25
Concept Learning
Concept learning as search:
- The hypothesis space and the hypothesis representation define the search space
- Search for the hypothesis that best fits the training examples
Slide 26
Example 1: Hand-written digits
- Data representation: greyscale images
- Task: classification (0, 1, 2, ..., 9)
- Problem features: highly variable inputs from the same class; imperfect human classification; high cost associated with errors, so a "don't know" output may be useful.
Slide 27
Example 2: Speech recognition
- Data representation: features from spectral analysis of speech signals
- Task: classification of vowel sounds in words of the form "h-?-d"
- Problem features: highly variable data with the same classification; good feature selection is very important.
Slide 28
Example 3: Text classification
- Task: classify a given text into some category
- Performance: percentage of texts correctly classified
- Examples: a database of texts with given correct classifications
Slide 29
Text Classification Process
text files -> word counts per file -> data set -> feature selection -> learning method (decision tree, Naive Bayes, Bayes nets, support vector machine) -> test classifier
Slide 30
Text Representation
- Vector space representation of documents: each document (Doc 1, Doc 2, Doc 3, ...) is a vector of weights over word1, word2, word3, word4, ...
- Mostly use: simple words, binary weights
- Text can have 10^7 or more dimensions, e.g. 100k web pages had 2.5 million distinct words
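As a rough illustration of the binary vector-space representation described above, here is a minimal Python sketch (the vocabulary and documents are made up for the example, not taken from the lecture):

```python
# Minimal sketch of a binary bag-of-words representation (illustrative documents).
docs = ["interest rate rises", "quarterly interest report", "football match report"]

# Build the vocabulary from all documents.
vocab = sorted({w for d in docs for w in d.split()})

def to_binary_vector(doc, vocab):
    """Return a 0/1 vector: 1 if the word occurs in the document, else 0."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

for d in docs:
    print(d, "->", to_binary_vector(d, vocab))
```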
Slide 31
Feature Selection
- Word distribution: remove very frequent and very infrequent words, based on Zipf's law: frequency * rank ~ constant
[Figure: word frequency (f) plotted against rank order (r)]
Slide 32
Feature Selection
- Fit to categories: use mutual information to select the features that best discriminate category vs. not
- Designer features: domain specific, including non-text features
- Use the 100-500 best features from this process as input to the learning methods
Slide 33
Training Examples for Concept EnjoySport
[Table of training instances over the attributes Sky, Temp, Humidity, Wind, Water, Forecast and the target EnjoySport; the full table is given on Slide 35]
- Concept: days on which my friend Aldo enjoys his favourite water sport
- Task: predict the value of EnjoySport for an arbitrary day, based on the values of the other attributes
Slide 34
Representing Hypotheses
A hypothesis h is a conjunction of constraints on attributes. Each constraint can be:
- A specific value, e.g. Water=Warm
- A "don't care" value, e.g. Water=?
- No value allowed (the empty constraint), e.g. Water=(empty)
Example: a hypothesis h is a tuple of constraints over (Sky, Temp, Humid, Wind, Water, Forecast)
Slide 35
EnjoySport Concept Learning Task
Consider the target concept "days on which Aldo enjoys his favorite sport".

Example | Sky   | AirTemp | Humidity | Wind   | Water | Forecast | EnjoySport
1       | Sunny | Warm    | Normal   | Strong | Warm  | Same     | Yes
2       | Sunny | Warm    | High     | Strong | Warm  | Same     | Yes
3       | Rainy | Cold    | High     | Strong | Warm  | Change   | No
4       | Sunny | Warm    | High     | Strong | Cool  | Change   | Yes

Positive and negative examples for the target concept EnjoySport.
Slide 36
EnjoySport Concept Learning Task
Given:
- Instances X: possible days, described by the attributes Sky, AirTemp, Humidity, Wind, Water and Forecast
- Hypotheses H: each hypothesis is described by a conjunction of constraints on attributes; a constraint may be "?", the empty constraint, or a specific value
- Target concept c: EnjoySport: X -> {0,1} (1: Yes, 0: No)
- Training examples D: positive and negative examples, see the table on Slide 35
Determine:
- A hypothesis h in H satisfying h(x) = c(x) for all x in X
Slide 37
General-to-Specific Ordering
More_general_than_or_equal_to: let h_j and h_k be Boolean-valued functions defined over X. h_j is more_general_than_or_equal_to h_k (written h_j >=_g h_k) iff
  (for all x in X) [(h_k(x) = 1) -> (h_j(x) = 1)]
- This defines a partial order over H
- Strictly more general: h_j >_g h_k
Slide 38
Find-S Algorithm
Find a maximally specific hypothesis: begin with the most specific possible hypothesis in H, then generalize it whenever it cannot cover a positive training example.
For example, on the EnjoySport data:
1. h = the most specific hypothesis (the empty constraint in every attribute)
2. h = (Sunny, Warm, Normal, Strong, Warm, Same)
3. h = (Sunny, Warm, ?, Strong, Warm, Same)
4. Ignore the negative example
5. h = (Sunny, Warm, ?, Strong, ?, ?)
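A minimal Python sketch of Find-S on the EnjoySport examples (the tuples are taken from the table on Slide 35; None plays the role of the empty constraint and "?" of "don't care"):

```python
# Find-S: maximally specific consistent hypothesis.
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
]

def find_s(examples):
    n = len(examples[0][0])
    h = [None] * n                      # most specific hypothesis: empty constraints
    for x, positive in examples:
        if not positive:
            continue                    # Find-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] is None:
                h[i] = value            # first positive example: copy its values
            elif h[i] != value:
                h[i] = "?"              # generalize the constraint to "don't care"
    return h

print(find_s(examples))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```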
Slide 39
Find-S Algorithm
Two assumptions:
- The correct target concept is contained in H
- The training examples are correct
Some questions:
- Does it converge to the correct concept?
- Why prefer the most specific hypothesis?
- The noise problem
- What if there are several maximally specific consistent hypotheses?
Slide 40
Inductive Bias
Slide 41
Inductive Bias
Fundamental assumption of inductive learning (the inductive learning hypothesis): any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.
Slide 42
Inductive Bias
Fundamental questions:
- What if the target concept is not contained in the hypothesis space?
- What is the relationship between the size of the hypothesis space, the ability of the algorithm to generalize to unobserved instances, and the number of training examples that must be observed?
Slide 43
Inductive Bias
Consider the following training examples: this target concept cannot be represented in the H we defined.

Example | Sky    | AirTemp | Humidity | Wind   | Water | Forecast | EnjoySport
1       | Sunny  | Warm    | Normal   | Strong | Warm  | Same     | Yes
2       | Rainy  | Warm    | Normal   | Strong | Warm  | Same     | No
3       | Cloudy | Warm    | Normal   | Strong | Warm  | Same     | Yes
Slide 44
Inductive Bias
Fundamental property of inductive inference: a learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances.
Inductive bias: the inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples D_c,
  (for all x_i in X) [ (B and D_c and x_i) |- L(x_i, D_c) ]
Slide 45
Inductive Bias
[Diagram comparing an inductive and a deductive system]
- Inductive system: the Candidate Elimination Algorithm using hypothesis space H; inputs are the training examples and a new instance; output is the classification of the new instance, or "don't know".
- Equivalent deductive system: a theorem prover; inputs are the training examples, the new instance, and the assertion "H contains the target concept" (the inductive bias); output is the classification of the new instance, or "don't know".
Slide 46
Inductive Learning Hypothesis
Any hypothesis found to approximate the target function well over the training examples will also approximate the target function well over the unobserved examples.
Slide 48
Inductive Learning Methods
- Find Similar
- Decision Trees
- Naive Bayes
- Bayes Nets
- Support Vector Machines (SVMs)
All support:
- Probabilities: graded membership; comparability across categories
- Adaptivity: over time; across individuals
Slide 49
Find Similar
- A.k.a. relevance feedback (Rocchio)
- Classifier parameters are a weighted combination of the weights in positive and negative examples -- a centroid
- New items are classified by their similarity to the centroid
- Use all features, idf weights
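A minimal sketch of a Rocchio-style centroid classifier over binary word vectors (the weighting constants and the cosine scoring are illustrative choices; the exact formula on the original slide was not preserved in the text):

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def rocchio_prototype(pos, neg, beta=16, gamma=4):
    """Weighted combination of positive and negative centroids (illustrative constants)."""
    cp, cn = centroid(pos), centroid(neg)
    return [beta * p - gamma * q for p, q in zip(cp, cn)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy vectors: classify a new item by its cosine similarity to the prototype.
pos = [[1, 1, 0], [1, 0, 1]]
neg = [[0, 0, 1], [0, 1, 1]]
proto = rocchio_prototype(pos, neg)
print(cosine([1, 1, 1], proto) > 0)   # True means "in the category"
```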
Slide 50
Decision Trees
- Learn a sequence of tests on features, typically using top-down, greedy search
- Binary (yes/no) or continuous decisions
[Figure: a small tree testing f1 and f7, with leaf probabilities P(class) = 0.6, 0.9 and 0.2]
Slide 51
Naive Bayes
- A.k.a. the binary independence model
- Maximize Pr(Class | Features)
- Assume features are conditionally independent given the class: the math is easy, and it is surprisingly effective
[Figure: graphical model with class C as the parent of features x1, x2, x3, ..., xn]
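A minimal sketch of a Naive Bayes text classifier with Laplace smoothing (the training documents are made up for illustration; the slide itself gives no implementation):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, label). Returns priors, per-class word counts, vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def classify_nb(words, priors, word_counts, vocab):
    """Pick the class maximizing log Pr(class) + sum of log Pr(word | class)."""
    best, best_score = None, float("-inf")
    for c in priors:
        total = sum(word_counts[c].values())
        score = math.log(priors[c])
        for w in words:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [(["interest", "rate"], "finance"), (["match", "goal"], "sport"),
        (["rate", "quarterly"], "finance"), (["goal", "team"], "sport")]
model = train_nb(docs)
print(classify_nb(["interest", "quarterly"], *model))   # finance
```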
Slide 52
Bayes Nets
- Maximize Pr(Class | Features)
- Do not assume independence of the features: dependency modeling
[Figure: graphical model with dependencies among the features x1, x2, x3, ..., xn and the class C]
Slide 53
Support Vector Machines
- Vapnik (1979)
- Binary classifiers that maximize the margin
- Find the hyperplane separating positive and negative examples
- Optimize for the maximum margin; classify new items by which side of the hyperplane they fall on
- The training examples closest to the hyperplane are the support vectors
Slide 54
Support Vector Machines
Extendable to:
- Non-separable problems (Cortes & Vapnik, 1995)
- Non-linear classifiers (Boser et al., 1992)
Good generalization performance:
- OCR (Boser et al.)
- Vision (Poggio et al.)
- Text classification (Joachims)
Slide 55
Machine Learning 3: Decision tree induction
Sudeshna Sarkar, IIT Kharagpur
Slide 56
Outline
- Decision tree representation
- ID3 learning algorithm
- Entropy, information gain
- Overfitting
Slide 57
Decision Tree for EnjoySport
[Figure: decision tree]
- Outlook = Sunny -> test Humidity: High -> No, Normal -> Yes
- Outlook = Overcast -> Yes
- Outlook = Rain -> test Wind: Strong -> No, Weak -> Yes
Slide 58
Decision Tree for EnjoySport
[Figure: part of the tree, showing the Outlook and Humidity tests]
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification
Slide 59
Decision Tree for EnjoySport
[Figure: the full tree from Slide 57]
Classify the new instance (Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak): following Outlook=Sunny and then Humidity=High, the tree predicts No.
Slide 60
Decision Tree for Conjunction: Outlook=Sunny AND Wind=Weak
[Figure: Outlook = Sunny -> test Wind (Strong -> No, Weak -> Yes); Outlook = Overcast -> No; Outlook = Rain -> No]
Slide 61
Decision Tree for Disjunction: Outlook=Sunny OR Wind=Weak
[Figure: Outlook = Sunny -> Yes; Outlook = Overcast -> test Wind (Strong -> No, Weak -> Yes); Outlook = Rain -> test Wind (Strong -> No, Weak -> Yes)]
Slide 62
Decision Tree for XOR: Outlook=Sunny XOR Wind=Weak
[Figure: Outlook = Sunny -> test Wind (Strong -> Yes, Weak -> No); Outlook = Overcast -> test Wind (Strong -> No, Weak -> Yes); Outlook = Rain -> test Wind (Strong -> No, Weak -> Yes)]
Slide 63
Decision Tree
[Figure: the EnjoySport tree from Slide 57]
Decision trees represent disjunctions of conjunctions:
(Outlook=Sunny AND Humidity=Normal) OR (Outlook=Overcast) OR (Outlook=Rain AND Wind=Weak)
Slide 64
When to consider Decision Trees
- Instances describable by attribute-value pairs
- Target function is discrete valued
- Disjunctive hypothesis may be required
- Possibly noisy training data
- Missing attribute values
Examples:
- Medical diagnosis
- Credit risk analysis
- Object classification for a robot manipulator (Tan 1993)
Slide 65
Top-Down Induction of Decision Trees (ID3)
1. A <- the best decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes.
(A code sketch of this procedure appears below, after Slide 77, once entropy and information gain have been defined.)
Slide 66
Which Attribute is best?
[Figure: splitting S = [29+,35-] on A1 gives [21+,5-] (True) and [8+,30-] (False); splitting on A2 gives [18+,33-] (True) and [11+,2-] (False)]
Slide 67
Entropy
- S is a sample of training examples
- p+ is the proportion of positive examples; p- is the proportion of negative examples
- Entropy measures the impurity of S:
  Entropy(S) = -p+ log2 p+ - p- log2 p-
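A direct Python translation of the entropy formula above (a small helper, not part of the original slides):

```python
import math

def entropy(pos, neg):
    """Entropy(S) = -p+ log2 p+ - p- log2 p-, taking 0*log2(0) as 0."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            result -= p * math.log2(p)
    return result

print(round(entropy(29, 35), 2))   # 0.99, as on the later slides
```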
Slide 68
Entropy
- Entropy(S) = the expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code)
- Information theory: the optimal-length code assigns -log2 p bits to messages having probability p
- So the expected number of bits to encode (+ or -) of a random member of S is:
  -p+ log2 p+ - p- log2 p-   (taking 0 log 0 = 0)
Slide 69
Information Gain
Gain(S,A): the expected reduction in entropy due to sorting S on attribute A:
  Gain(S,A) = Entropy(S) - sum over v in values(A) of (|S_v|/|S|) * Entropy(S_v)
[Figure: the two candidate splits of [29+,35-] on A1 and A2 from the previous slide]
  Entropy([29+,35-]) = -29/64 log2(29/64) - 35/64 log2(35/64) = 0.99
Slide 70
Information Gain
Split on A1: [29+,35-] -> [21+,5-] (True) and [8+,30-] (False)
  Entropy([21+,5-]) = 0.71, Entropy([8+,30-]) = 0.74
  Gain(S,A1) = Entropy(S) - 26/64 * Entropy([21+,5-]) - 38/64 * Entropy([8+,30-]) = 0.27
Split on A2: [29+,35-] -> [18+,33-] (True) and [11+,2-] (False)
  Entropy([18+,33-]) = 0.94, Entropy([11+,2-]) = 0.62
  Gain(S,A2) = Entropy(S) - 51/64 * Entropy([18+,33-]) - 13/64 * Entropy([11+,2-]) = 0.12
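The numbers above can be reproduced with a short script (the split sizes are the ones shown in the figure; the entropy helper follows the formula from Slide 67):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return sum(-c / total * math.log2(c / total) for c in (pos, neg) if c)

def gain(parent, splits):
    """Information gain of splitting `parent` = (pos, neg) into the given (pos, neg) subsets."""
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in splits)

S = (29, 35)
print(round(gain(S, [(21, 5), (8, 30)]), 2))    # A1: 0.27
print(round(gain(S, [(18, 33), (11, 2)]), 2))   # A2: 0.12
```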
Slide 77
Hypothesis Space Search in ID3
- The hypothesis space is complete! The target function is surely in there
- Outputs a single hypothesis
- No backtracking on selected attributes (greedy search): local minima (suboptimal splits)
- Statistically-based search choices: robust to noisy data
- Inductive bias (search bias): prefer shorter trees over longer ones; place high-information-gain attributes close to the root
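A compact Python sketch of the ID3 procedure listed on Slide 65, using information gain to choose the split (the example/attribute formats are simplified for illustration; this is not the original course code):

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return sum(-c / total * math.log2(c / total) for c in counts.values())

def info_gain(examples, attr):
    labels = [label for _, label in examples]
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [label for x, label in examples if x[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, attributes):
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                       # all examples agree: leaf node
        return labels[0]
    if not attributes:                              # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a))
    tree = {}
    for value in {x[best] for x, _ in examples}:    # one branch per observed value
        subset = [(x, label) for x, label in examples if x[best] == value]
        tree[value] = id3(subset, [a for a in attributes if a != best])
    return (best, tree)

# EnjoySport examples from Slide 35: (attribute dict, label)
data = [
    ({"Sky": "Sunny", "AirTemp": "Warm", "Humidity": "Normal", "Wind": "Strong", "Water": "Warm", "Forecast": "Same"}, "Yes"),
    ({"Sky": "Sunny", "AirTemp": "Warm", "Humidity": "High",   "Wind": "Strong", "Water": "Warm", "Forecast": "Same"}, "Yes"),
    ({"Sky": "Rainy", "AirTemp": "Cold", "Humidity": "High",   "Wind": "Strong", "Water": "Warm", "Forecast": "Change"}, "No"),
    ({"Sky": "Sunny", "AirTemp": "Warm", "Humidity": "High",   "Wind": "Strong", "Water": "Cool", "Forecast": "Change"}, "Yes"),
]
print(id3(data, ["Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast"]))
```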
Slide 78
Converting a Tree to Rules
[Figure: the PlayTennis decision tree]
R1: If (Outlook=Sunny) AND (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) AND (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) AND (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) AND (Wind=Weak) Then PlayTennis=Yes
Slide 79
Continuous Valued Attributes
Create a discrete attribute to test a continuous one:
- Temperature = 24.5 C
- (Temperature > 20.0 C) = {true, false}
Where to set the threshold?

Temperature | 15 C | 18 C | 19 C | 22 C | 24 C | 27 C
PlayTennis  | No   | No   | Yes  | Yes  | Yes  | No
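One common way to pick thresholds (as in C4.5-style learners) is to sort the values and consider midpoints between adjacent examples whose class changes. A small sketch, assuming the labels in the reconstructed table above:

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

temps  = [15, 18, 19, 22, 24, 27]                      # assumed values from the table
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, labels))             # [18.5, 25.5]
```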
Slide 80
Attributes with many Values
- Problem: if an attribute has many values, maximizing InformationGain will select it. E.g. using Date=12.7.1996 as an attribute perfectly splits the data into subsets of size 1.
- Use GainRatio instead of information gain as the criterion:
  GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)
  SplitInformation(S,A) = - sum_{i=1..c} (|S_i|/|S|) log2(|S_i|/|S|)
  where S_i is the subset for which attribute A has the value v_i
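The GainRatio computation can be written directly from the two formulas above (the gain value and subset sizes here are illustrative, not taken from the slides):

```python
import math

def split_information(subset_sizes):
    """SplitInformation(S,A) = -sum |S_i|/|S| * log2(|S_i|/|S|)."""
    n = sum(subset_sizes)
    return sum(-s / n * math.log2(s / n) for s in subset_sizes if s)

def gain_ratio(gain, subset_sizes):
    return gain / split_information(subset_sizes)

# A many-valued attribute (e.g. Date) splitting 8 examples into 8 singletons
# has split information log2(8) = 3, so even a large gain is divided down.
print(split_information([1] * 8))     # 3.0
print(gain_ratio(1.0, [1] * 8))       # ~0.33
print(gain_ratio(0.5, [5, 3]))        # divided by a much smaller split information
```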
Slide 81
Attributes with Cost
Consider:
- Medical diagnosis: a blood test costs 1000 SEK
- Robotics: width_from_one_feet has a cost of 23 seconds
How to learn a consistent tree with low expected cost? Replace Gain by:
- Gain^2(S,A) / Cost(A)   [Tan, Schlimmer 1990]
- (2^Gain(S,A) - 1) / (Cost(A) + 1)^w, where w in [0,1]   [Nunez 1988]
Slide 82
Unknown Attribute Values
What if examples are missing values of A? Use the training example anyway and sort it through the tree:
- If node n tests A, assign the most common value of A among the other examples sorted to node n
- Or assign the most common value of A among the other examples with the same target value
- Or assign probability p_i to each possible value v_i of A, and assign fraction p_i of the example to each descendant in the tree
Classify new examples in the same fashion.
Slide 83
Occam's Razor: prefer shorter hypotheses
Why prefer short hypotheses?
Argument in favor:
- There are fewer short hypotheses than long hypotheses
- A short hypothesis that fits the data is unlikely to be a coincidence
- A long hypothesis that fits the data might be a coincidence
Argument opposed:
- There are many ways to define small sets of hypotheses, e.g. all trees with a prime number of nodes that use attributes beginning with "Z"
- What is so special about small sets based on the size of the hypothesis?
Slide 84
Overfitting
Consider the error of hypothesis h over:
- the training data: error_train(h)
- the entire distribution D of the data: error_D(h)
Hypothesis h in H overfits the training data if there is an alternative hypothesis h' in H such that
  error_train(h) < error_train(h') and error_D(h) > error_D(h')
Slide 85
Overfitting in Decision Tree Learning
Slide 86
Avoiding Overfitting
How can we avoid overfitting?
- Stop growing the tree when a data split is not statistically significant
- Grow the full tree, then post-prune
Slide 87
Reduced-Error Pruning
- Split the data into a training and a validation set
- Do until further pruning is harmful:
  1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
  2. Greedily remove the one that most improves validation set accuracy
- Produces the smallest version of the most accurate subtree
Slide 88
Effect of Reduced-Error Pruning
Slide 89
Rule Post-Pruning
1. Convert the tree to an equivalent set of rules
2. Prune each rule independently of the others
3. Sort the final rules into the desired sequence for use
This is the method used in C4.5.
Slide 90
Cross-Validation
- Estimate the accuracy of a hypothesis induced by a supervised learning algorithm
- Predict the accuracy of a hypothesis over future unseen instances
- Select the optimal hypothesis from a given set of alternative hypotheses:
  - Pruning decision trees
  - Model selection
  - Feature selection
- Combining multiple classifiers (boosting)
Slide 91
Holdout Method
- Partition the data set D = {(v_1,y_1), ..., (v_n,y_n)} into a training set D_t and a validation set D_h = D \ D_t
- acc_h = (1/h) * sum over (v_i,y_i) in D_h of delta(I(D_t, v_i), y_i)
  where I(D_t, v_i) is the output, for instance v_i, of the hypothesis induced by learner I trained on data D_t, h = |D_h|, and delta(i,j) = 1 if i = j and 0 otherwise
- Problems: makes insufficient use of the data; the training and validation sets are correlated
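A direct sketch of the holdout accuracy estimate acc_h above. The learner here is a trivial majority-class inducer and the data is synthetic, just to keep the example self-contained:

```python
import random

def majority_inducer(train):
    """Trivial learner: always predict the most common label in the training data."""
    labels = [y for _, y in train]
    most_common = max(set(labels), key=labels.count)
    return lambda v: most_common

def holdout_accuracy(data, inducer, holdout_fraction=0.3, seed=0):
    random.Random(seed).shuffle(data)
    h = int(len(data) * holdout_fraction)
    validation, training = data[:h], data[h:]
    hypothesis = inducer(training)
    return sum(hypothesis(v) == y for v, y in validation) / len(validation)

data = [((i,), "pos" if i % 3 else "neg") for i in range(30)]
print(holdout_accuracy(data, majority_inducer))
```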
Slide 92
Cross-Validation
- k-fold cross-validation splits the data set D into k mutually exclusive subsets D_1, D_2, ..., D_k
- Train and test the learning algorithm k times; each time it is trained on D \ D_i and tested on D_i
[Figure: the k folds, with a different fold held out in each round]
- acc_cv = (1/n) * sum over (v_i,y_i) in D of delta(I(D \ D_i, v_i), y_i)
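A minimal k-fold cross-validation sketch matching the description above (again with a toy majority-class inducer standing in for a real learner):

```python
def majority_inducer(train):
    labels = [y for _, y in train]
    most_common = max(set(labels), key=labels.count)
    return lambda v: most_common

def k_fold_accuracy(data, inducer, k=4):
    """Train on D \\ D_i and test on D_i for each fold; average accuracy over all of D."""
    folds = [data[i::k] for i in range(k)]          # k mutually exclusive subsets
    correct = 0
    for i in range(k):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        hypothesis = inducer(train)
        correct += sum(hypothesis(v) == y for v, y in folds[i])
    return correct / len(data)

data = [((i,), "pos" if i % 3 else "neg") for i in range(30)]
print(k_fold_accuracy(data, majority_inducer, k=4))
```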
Slide 93
Cross-Validation
- Uses all the data for training and testing
- Complete k-fold cross-validation splits a dataset of size m in all (m choose m/k) possible ways (choosing m/k instances out of m)
- Leave-n-out cross-validation sets n instances aside for testing and uses the remaining ones for training (leave-one-out is equivalent to n-fold cross-validation, where n is the dataset size)
- Leave-one-out is widely used
- In stratified cross-validation, the folds are stratified so that they contain approximately the same proportion of labels as the original data set
Slide 94
Bootstrap
- Sample n instances uniformly from the data set with replacement
- The probability that any given instance is not chosen after n samples is (1 - 1/n)^n ~ e^-1 ~ 0.368
- The bootstrap sample is used for training; the remaining instances are used for testing
- acc_boot = (1/b) * sum_{i=1..b} (0.632 * eps0_i + 0.368 * acc_s)
  where eps0_i is the accuracy on the test data of the i-th bootstrap sample, acc_s is the accuracy estimate on the training set, and b is the number of bootstrap samples
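A sketch of the .632 bootstrap estimate above, with the same trivial majority-class learner used as a stand-in for a real inducer:

```python
import random

def majority_inducer(train):
    labels = [y for _, y in train]
    most_common = max(set(labels), key=labels.count)
    return lambda v: most_common

def accuracy(hypothesis, data):
    return sum(hypothesis(v) == y for v, y in data) / len(data)

def bootstrap_632(data, inducer, b=10, seed=0):
    rng = random.Random(seed)
    estimates = []
    for _ in range(b):
        sample = [rng.choice(data) for _ in range(len(data))]   # n draws with replacement
        test = [item for item in data if item not in sample]    # out-of-bag instances
        hypothesis = inducer(sample)
        eps0 = accuracy(hypothesis, test) if test else 0.0      # accuracy on held-out data
        acc_s = accuracy(hypothesis, sample)                    # resubstitution accuracy
        estimates.append(0.632 * eps0 + 0.368 * acc_s)
    return sum(estimates) / b

data = [((i,), "pos" if i % 3 else "neg") for i in range(30)]
print(round(bootstrap_632(data, majority_inducer), 3))
```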
Slide 95
Wrapper Model
[Diagram: input features -> feature subset search -> feature subset evaluation -> induction algorithm]
Slide 96
Wrapper Model
- Evaluate the accuracy of the inducer for a given subset of features by means of n-fold cross-validation
- The training data is split into n folds, and the induction algorithm is run n times; the accuracy results are averaged to produce the estimated accuracy
- Forward selection: starts with the empty set of features and greedily adds the feature that improves the estimated accuracy the most (see the sketch after this slide)
- Backward elimination: starts with the set of all features and greedily removes the worst feature
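A sketch of greedy forward selection as described above. The evaluation function stands in for the cross-validated accuracy of the inducer on the candidate feature subset; the feature scores are invented for the example:

```python
def forward_selection(all_features, evaluate):
    """Greedily add the feature that most improves the estimated accuracy."""
    selected, best_score = [], float("-inf")
    improved = True
    while improved:
        improved = False
        for f in all_features:
            if f in selected:
                continue
            score = evaluate(selected + [f])
            if score > best_score:
                best_score, best_feature, improved = score, f, True
        if improved:
            selected.append(best_feature)
    return selected, best_score

# Toy evaluation: pretend features 'a' and 'c' are the useful ones.
useful = {"a": 0.3, "b": 0.0, "c": 0.2, "d": -0.05}
evaluate = lambda subset: 0.5 + sum(useful[f] for f in subset)
print(forward_selection(["a", "b", "c", "d"], evaluate))   # (['a', 'c'], 1.0)
```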
Slide 97
Bagging
- For each trial t = 1, 2, ..., T, create a bootstrap sample of size N
- Generate a classifier C_t from the bootstrap sample
- The final classifier C* takes the class that receives the majority of votes among the C_t
[Figure: T bootstrap training sets, each producing a classifier C_1 ... C_T, whose votes on an instance are combined by C*]
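A bagging sketch following the description above: train T classifiers on bootstrap samples and combine them by majority vote. The base learner here is a toy 1-nearest-neighbour rule on a single numeric feature, chosen only to keep the example self-contained:

```python
import random
from collections import Counter

def bagging(data, base_inducer, T=5, seed=0):
    """Train T classifiers on bootstrap samples; C* predicts by majority vote."""
    rng = random.Random(seed)
    classifiers = [base_inducer([rng.choice(data) for _ in range(len(data))])
                   for _ in range(T)]
    def ensemble(v):
        votes = Counter(c(v) for c in classifiers)
        return votes.most_common(1)[0][0]
    return ensemble

# Toy base learner: 1-nearest-neighbour on a single numeric feature.
def nn_inducer(train):
    return lambda v: min(train, key=lambda item: abs(item[0][0] - v[0]))[1]

data = [((i,), "pos" if i >= 5 else "neg") for i in range(10)]
model = bagging(data, nn_inducer)
print(model((7,)), model((2,)))   # expected: pos neg
```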
Slide 98
Bagging
- Bagging requires unstable classifiers, for example decision trees or neural networks.
- "The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy." (Breiman 1996)