Transcript of Machine Learning

Page 1: Machine Learning

Machine Learning:

Concept Learning &

Decision-Tree Learning

Yuval Shahar M.D., Ph.D.

Medical Decision Support Systems

Page 2: Machine Learning

Machine Learning

• Learning: Improving (a program’s) performance in some task with experience

• Multiple application domains, such as
  – Game playing (e.g., TD-Gammon)
  – Speech recognition (e.g., Sphinx)
  – Data mining (e.g., marketing)
  – Driving autonomous vehicles (e.g., ALVINN)
  – Classification of ER and ICU patients
  – Prediction of financial and other fraud
  – Prediction of pneumonia patients’ recovery rate

Page 3: Machine Learning

Concept Learning

• Inference of a boolean-valued function (concept) from its I/O training examples

• The concept c is defined over a set of instances X
  – c: X → {0, 1}

• The learner is presented with a set of positive/negative training examples <x, c(x)> taken from X

• There is a set H of possible hypotheses that the learner might consider regarding the concept

• Goal: Find a hypothesis h, s.t. (∀x ∈ X), h(x) = c(x)

Page 4: Machine Learning

A Concept-Learning Example

#  Sky   Air Temp  Humid   Wind    Water  Forecast  Enjoy?
1  Sun   Warm      Normal  Strong  Warm   Same      Yes
2  Sun   Warm      High    Strong  Warm   Same      Yes
3  Rain  Cold      High    Strong  Warm   Change    No
4  Sun   Warm      High    Strong  Cool   Change    Yes

Page 5: Machine Learning

The Inductive Learning Hypothesis

Any hypothesis approximating the target function well over a large set of training examples will also approximate that target function well over other, unobserved, examples

Page 6: Machine Learning

Concept Learning as Search

• Learning is searching through a large space of hypotheses
  – The space is implicitly defined by the hypothesis representation

• General-to-specific ordering of hypotheses
  – H1 is more-general-than-or-equal-to H2 if any instance that satisfies H2 also satisfies H1
  – <Sun, ?, ?, ?, ?, ?> ≥g <Sun, ?, ?, Strong, ?, ?>
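To make the ordering concrete, here is a minimal Python sketch (the tuple representation and names are mine, not from the slides): a conjunctive hypothesis is a tuple with a concrete value, "?" (any value), or None (no value) per attribute, and h1 ≥g h2 is checked constraint by constraint.

```python
# Illustrative sketch of the more-general-or-equal ordering for conjunctive
# hypotheses over the EnjoySport attributes. "?" matches any value and
# None matches no value; these conventions are assumptions, not from the slides.

def covers(constraint, value):
    """Does a single attribute constraint accept a given attribute value?"""
    return constraint == "?" or constraint == value

def more_general_or_equal(h1, h2):
    """h1 >=g h2 iff every constraint of h1 is at least as permissive
    as the corresponding constraint of h2."""
    def at_least_as_general(c1, c2):
        return c1 == "?" or (c1 == c2 and c1 is not None) or c2 is None
    return all(at_least_as_general(c1, c2) for c1, c2 in zip(h1, h2))

h1 = ("Sun", "?", "?", "?", "?", "?")
h2 = ("Sun", "?", "?", "Strong", "?", "?")
print(more_general_or_equal(h1, h2))  # True: h1 >=g h2
print(more_general_or_equal(h2, h1))  # False
```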

Page 7: Machine Learning

The Find-S Algorithm

• Start with the most specific hypothesis h in H
  – h ← <∅, ∅, ∅, ∅, ∅, ∅>

• Generalize h to the next more general constraint (for each relevant attribute) whenever it fails to classify a positive training example correctly

• Here this finally leads to h = <Sun, Warm, ?, Strong, ?, ?>
• Finds only one (the most specific) hypothesis
• Cannot detect inconsistencies

– Ignores negative examples!

• Assumes no noise and no errors in the input
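A compact sketch of Find-S in Python over the training set from the earlier slide (the tuple representation and True/False labels are illustrative assumptions):

```python
# Find-S over the EnjoySport data from the earlier slide.

def find_s(examples):
    """examples: list of (instance_tuple, label) pairs; label True = positive.
    Returns the maximally specific conjunctive hypothesis consistent with
    the positive examples. Negative examples are ignored, as in Find-S."""
    n = len(examples[0][0])
    h = [None] * n                      # most specific hypothesis <0, 0, ..., 0>
    for x, label in examples:
        if not label:
            continue                    # Find-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] is None:
                h[i] = value            # first positive example: copy its values
            elif h[i] != value:
                h[i] = "?"              # generalize to the next more general constraint
    return tuple(h)

data = [
    (("Sun", "Warm", "Normal", "Strong", "Warm", "Same"),   True),
    (("Sun", "Warm", "High",   "Strong", "Warm", "Same"),   True),
    (("Rain", "Cold", "High",  "Strong", "Warm", "Change"), False),
    (("Sun", "Warm", "High",   "Strong", "Cool", "Change"), True),
]
print(find_s(data))  # ('Sun', 'Warm', '?', 'Strong', '?', '?')
```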

Page 8: Machine Learning

The Candidate-Elimination (CE) Algorithm (Mitchell, 1977, 1979)

• A Version Space: The subset of hypotheses of H consistent with the training examples set D

• A version space can be represented by:
  – Its general (maximally general) boundary set G of hypotheses consistent with D (G0 ← {<?, ?, ..., ?>})
  – Its specific (minimally general) boundary set S of hypotheses consistent with D (S0 ← {<∅, ∅, ..., ∅>})

• The CE algorithm updates the general and specific boundaries given each positive and negative example

• The resultant version space contains all and only the hypotheses consistent with the training set
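For a domain this small, the version space can also be computed by brute force, which makes the "all and only the consistent hypotheses" property easy to check. The sketch below is illustrative only; it enumerates conjunctive hypotheses directly rather than maintaining the G and S boundaries incrementally, as CE does.

```python
# Brute-force version space for the EnjoySport example: enumerate every
# conjunctive hypothesis and keep those consistent with the training data.
# This reproduces the version space CE would compute via its boundaries.

from itertools import product

values = [("Sun", "Rain"), ("Warm", "Cold"), ("Normal", "High"),
          ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]

def matches(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def consistent(h, examples):
    return all(matches(h, x) == label for x, label in examples)

data = [
    (("Sun", "Warm", "Normal", "Strong", "Warm", "Same"),   True),
    (("Sun", "Warm", "High",   "Strong", "Warm", "Same"),   True),
    (("Rain", "Cold", "High",  "Strong", "Warm", "Change"), False),
    (("Sun", "Warm", "High",   "Strong", "Cool", "Change"), True),
]

# Every attribute constraint is either a concrete value or "?"
# (the single all-rejecting "empty" hypothesis is omitted for brevity).
space = product(*[vals + ("?",) for vals in values])
version_space = [h for h in space if consistent(h, data)]
for h in version_space:
    print(h)   # six hypotheses, bounded by S = <Sun, Warm, ?, Strong, ?, ?>
               # and G = {<Sun, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
```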

Page 9: Machine Learning

Properties of The CE Algorithm

• Converges to the “correct” hypothesis if
  – There are no errors in the training set
    • Else, the correct target concept is always eliminated!
  – There is in fact such a hypothesis in H

• The best next query (a new training example to ask for) maximally separates the hypotheses in the version space (ideally into two halves)

• Partially learned concepts might suffice to classify a new instance with certainty, or at least with some confidence

Page 10: Machine Learning

Inductive Biases

• Every learning method is implicitly biased towards a certain hypothesis space H
  – The conjunctive hypothesis space (only one value or “?” per attribute) can represent only 973 of the 2^96 possible subsets, or target concepts, in our example domain (assuming 3x2x2x2x2x2 = 96 possible instances, for 3, 2, 2, 2, 2, 2 respective attribute values)

• Without an inductive bias (no a priori assumptions regarding the target concept) there is no way to classify new, unseen instances!
  – The S boundary will always be the disjunction of the positive example instances; the G boundary will be the negated disjunction of the negative example instances
  – Convergence is possible only when all of X has been seen!

• Strongly biased methods make more inductive leaps
• Inductive bias of CE: the target concept c is in H!
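The counts quoted above can be checked with a few lines of Python (the 973 figure is 1 + 4·3·3·3·3·3: one all-rejecting hypothesis, plus a concrete-value-or-“?” choice per attribute):

```python
# A quick check of the counts quoted above for the EnjoySport domain
# (3, 2, 2, 2, 2, 2 values for the six attributes).

from math import prod

attribute_values = [3, 2, 2, 2, 2, 2]

instances = prod(attribute_values)                     # 96 possible instances
target_concepts = 2 ** instances                       # 2**96 possible subsets of X

# Semantically distinct conjunctive hypotheses: each attribute is either a
# concrete value or "?", plus one extra hypothesis that rejects everything.
conjunctive_hypotheses = prod(v + 1 for v in attribute_values) + 1   # 973

print(instances, conjunctive_hypotheses)               # 96 973
print(target_concepts)                                 # 79228162514264337593543950336
```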

Page 11: Machine Learning

Decision-Tree Learning

• Decision trees: a method for representing classification functions
  – Can be represented as a set of If-Then rules
  – Each node represents a test of some attribute
  – An instance is classified by starting at the root, testing the attribute at each node, and moving along the branch corresponding to that attribute’s value

Page 12: Machine Learning

Example Decision Tree

Outlook?
  – Sun → Humidity?
      – High → No
      – Normal → Yes
  – Overcast → Yes
  – Rain → Wind?
      – Strong → No
      – Weak → Yes
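One way to make the classification procedure concrete is to store the tree as a nested dictionary and walk it from the root. This representation is my own choice for illustration, not something prescribed by the slides.

```python
# The example tree as a nested dict (an illustrative representation).
# Internal nodes are {"attribute": ..., "branches": {value: subtree}};
# leaves are the class labels "Yes"/"No".

tree = {
    "attribute": "Outlook",
    "branches": {
        "Sun":      {"attribute": "Humidity",
                     "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"attribute": "Wind",
                     "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

def classify(tree, instance):
    """Walk from the root, testing the node's attribute and following the
    branch that matches the instance's value, until a leaf is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

print(classify(tree, {"Outlook": "Sun", "Humidity": "Normal", "Wind": "Weak"}))  # Yes
```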

Page 13: Machine Learning

When Should Decision Trees Be Used?

• When instances are <attribute, value> pairs
  – Values are typically discrete, but can be continuous

• The target function has discrete output values
• Disjunctive descriptions might be needed
  – Natural representation of disjunctions of rules

• The training data might contain errors
  – Robust to errors in classification and in attribute values

• The training data might contain missing values
  – Several methods exist for completing unknown values

Page 14: Machine Learning

The Basic Decision-Tree Learning Algorithm: ID3

(Quinlan, 1986)

• A top-down greedy search through the hypothesis space of possible decision trees
• Originally intended for boolean-valued functions
• Extensions were incorporated in C4.5 (Quinlan, 1993)
• In each step, the “best” attribute for testing is selected using some measure, branching occurs along its values, and the process continues recursively
• Ends when all attributes have been used, or when all examples at the current node are either positive or negative
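The recursion just described can be sketched as follows. The attribute-selection measure (e.g., the information gain introduced on the next slides) is passed in as a parameter rather than fixed here, and the nested-dict tree format is the same illustrative one used for the earlier example tree.

```python
# A high-level sketch of the ID3 recursion described above. The helper
# `choose_attribute` is supplied by the caller (e.g., an information-gain
# measure); the data format (instance_dict, label) is an assumption.

from collections import Counter

def id3(examples, attributes, choose_attribute):
    """examples: list of (instance_dict, label); attributes: list of names."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                 # all positive or all negative
        return labels[0]
    if not attributes:                        # no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    best = choose_attribute(examples, attributes)   # the "best" attribute
    node = {"attribute": best, "branches": {}}
    for v in {x[best] for x, _ in examples}:  # branch on each observed value
        subset = [(x, label) for x, label in examples if x[best] == v]
        remaining = [a for a in attributes if a != best]
        node["branches"][v] = id3(subset, remaining, choose_attribute)
    return node
```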

Page 15: Machine Learning

Which Attribute is Best to Test?

• The central choice in the ID3 algorithm and similar approaches

• Here, an information-gain measure is used, which quantifies how well each attribute separates the training examples according to their target classification

Page 16: Machine Learning

Entropy

• Entropy: an information-theoretic measure that characterizes the (im)purity of an example set S using the proportions of positive (p⊕) and negative (p⊖) instances
• Informally: the number of bits needed to encode the classification of an arbitrary member of S
• Entropy(S) = –p⊕ log2 p⊕ – p⊖ log2 p⊖
• Entropy(S) is in [0..1]
• Entropy(S) is 0 if all members are positive or all are negative
• Entropy is maximal (1) when p⊕ = p⊖ = 0.5 (a uniform distribution of positive and negative cases)
• If the target concept has c different values,
  Entropy(S) = Σi=1..c –pi log2 pi   (pi is the proportion of class i)
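A small Python sketch of the entropy formula above (the list-of-labels input format is my own convention):

```python
# Entropy for the general c-class case: sum over classes i of -p_i * log2(p_i).

from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())

print(entropy(["+"] * 9 + ["-"] * 5))   # ~0.940, the root set of the later example
print(entropy(["+", "-"]))              # 1.0, maximal for a 50/50 split
print(entropy(["+", "+", "+"]))         # 0.0, a pure set
```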

Page 17: Machine Learning

Entropy Function for a Boolean Classification

[Figure: Entropy(S) plotted against the proportion of positive examples p⊕; entropy is 0.0 at p⊕ = 0 or 1 and reaches its maximum of 1.0 at p⊕ = 0.5]

Page 18: Machine Learning

Entropy and Surprise

• Entropy can also be considered the mean surprise on seeing the outcome (the actual class)

• –log2 p is also called the surprisal [Tribus, 1961]

• It is the only nonnegative function consistent with the principle that the surprise at the occurrence of two independent events with probabilities p1 and p2 equals the surprise at the occurrence of a single event with probability p1 × p2
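A quick numeric check of the additivity property just stated:

```python
# Surprisal of two independent events equals the sum of their surprisals.

from math import log2

def surprisal(p):
    return -log2(p)

p1, p2 = 0.5, 0.25
print(surprisal(p1) + surprisal(p2))   # 1.0 + 2.0 = 3.0 bits
print(surprisal(p1 * p2))              # -log2(0.125) = 3.0 bits
```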

Page 19: Machine Learning

Information Gain of an Attribute

• Sometimes termed the Mutual Information (MI) gained regarding a class (e.g., a disease) given an attribute (e.g., a test), since the measure is symmetric

• The expected reduction in entropy E(S) caused by partitioning the examples in S using attribute A and all of its corresponding values

• Gain(S, A) ≡ E(S) – Σv∈Values(A) (|Sv|/|S|) E(Sv)

• The attribute with maximal information gain is chosen by ID3 for splitting the node

• Follows from intuitive axioms [Benish, in press], e.g., not caring how the test result is revealed
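The gain formula can be sketched directly on top of the entropy helper (repeated here so the snippet is self-contained; the (instance-dict, label) data format is an illustrative assumption):

```python
# Gain(S, A) = E(S) - sum over values v of A of (|S_v|/|S|) * E(S_v)

from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute):
    """examples: list of (instance_dict, label) pairs."""
    labels = [label for _, label in examples]
    partitions = defaultdict(list)
    for x, label in examples:
        partitions[x[attribute]].append(label)   # split S by the values of A
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder
```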

Page 20: Machine Learning

Information Gain Example

S: {9+, 5–}, E(S) = 0.940

Humidity?
  – High: {3+, 4–}, E = 0.985
  – Normal: {6+, 1–}, E = 0.592

Wind?
  – Weak: {6+, 2–}, E = 0.811
  – Strong: {3+, 3–}, E = 1.0

Gain(S, Humidity) = 0.940 – (7/14)·0.985 – (7/14)·0.592 = 0.151

Gain(S, Wind) = 0.940 – (8/14)·0.811 – (6/14)·1.0 = 0.048
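The two calculations above can be reproduced directly from the counts shown on the slide (the small helper is mine; only the counts come from the slide):

```python
# Reproducing the worked gain computations from the positive/negative counts.

from math import log2

def entropy_from_counts(pos, neg):
    total = pos + neg
    return sum(-p / total * log2(p / total) for p in (pos, neg) if p > 0)

e_s = entropy_from_counts(9, 5)                                        # ~0.940

gain_humidity = e_s - (7/14) * entropy_from_counts(3, 4) - (7/14) * entropy_from_counts(6, 1)
gain_wind     = e_s - (8/14) * entropy_from_counts(6, 2) - (6/14) * entropy_from_counts(3, 3)

print(round(gain_humidity, 3))   # 0.152 (the slide's 0.151 rounds the entropies first)
print(round(gain_wind, 3))       # 0.048
```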

Page 21: Machine Learning

Properties of ID3

• Searches the hypothesis space of decision trees
  – A complete space of all finite discrete-valued functions (unlike the conjunctive hypothesis space)

• Maintains only a single hypothesis (unlike CE)
• Performs no backtracking; thus, it might get stuck in a local optimum
• Uses all training examples at every step to refine the current hypothesis (unlike Find-S or CE)
• (Approximate) inductive bias: prefers shorter trees over larger trees (Occam’s razor), and trees that place high-information-gain attributes close to the root over those that do not

Page 22: Machine Learning

The Data Over-Fitting Problem

• Occurs due to noise in the data or too few examples

• Handling the data over-fitting problem:
  – Stop growing the tree earlier, or
  – Prune the final tree retrospectively

• In either case, the correct final tree size is determined by

  • A separate validation set of examples, or

  • Using all examples, but deciding whether further expansion is likely to help, or

  • Using an explicit measure to encode the training examples and the tree, stopping when the measure is minimized
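A minimal sketch of the validation-set idea, using scikit-learn and synthetic noisy data (neither appears in the original slides): as tree depth grows, training accuracy typically keeps climbing while validation accuracy stops improving, or degrades, once the tree starts fitting noise.

```python
# Illustrative over-fitting demo on synthetic noisy data (an assumption,
# not the slides' example): compare training vs. validation accuracy
# for trees of increasing depth.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 8))                 # 8 binary attributes
y = (X[:, 0] & X[:, 1]) ^ (rng.random(300) < 0.15)    # target concept + 15% label noise

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 2, 4, 8, None):                      # None = grow the full tree
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # typically: training accuracy keeps climbing with depth,
    # while validation accuracy peaks at a small depth
    print(depth, round(model.score(X_tr, y_tr), 2), round(model.score(X_val, y_val), 2))
```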

Page 23: Machine Learning

Other Improvements to ID3

• Handling continuous-valued attributes
  – Pick a threshold that maximizes information gain

• Avoiding the selection of many-valued attributes (such as date) by using more sophisticated measures, such as the gain ratio (dividing the gain of S relative to A and the target concept by the entropy of S with respect to the values of A)

• Handling missing values (using the average value or the value distribution)

• Handling the costs of measuring attributes (e.g., laboratory tests) by including cost in the attribute-selection process
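A sketch of the gain-ratio idea described above, reusing the entropy() and information_gain() helpers from the earlier sketches (the split information is the entropy of S with respect to the values of A, which grows with the number of values and so penalizes attributes such as date):

```python
# GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
# (reuses entropy() and information_gain() defined in the earlier sketches)

def split_information(examples, attribute):
    """Entropy of S with respect to the values of A (not the class labels)."""
    return entropy([x[attribute] for x, _ in examples])

def gain_ratio(examples, attribute):
    si = split_information(examples, attribute)
    return information_gain(examples, attribute) / si if si > 0 else 0.0
```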

Page 24: Machine Learning

Summary: Concept and Decision-Tree Learning

• Concept learning is a search through a hypothesis space

• The Candidate Elimination algorithm uses general-to-specific ordering of hypotheses to compute the version space

• Inductive learning algorithms can classify unseen examples only because of their implicit inductive bias

• ID3 searches through the space of decision trees

• ID3 searches a complete hypothesis space and can handle noise and missing values in the training set

• Over-fitting the training data is a common problem that requires handling by methods such as post-pruning