CS 464: Introduction to
Machine Learning
Concept Learning
Slides adapted from Chapter 2,
Machine Learning by Tom M. Mitchell
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html
Concept Learning
Acquiring the definition of a general category from given sample positive and negative training examples of the category.
An example: learning the “bird” concept from the given examples of birds (positive examples) and non-birds (negative examples).
Inferring a boolean-valued function from training examples of its input and output.
Classification (Categorization)
• Given
– A fixed set of categories: C = {c1, c2, …, ck}
– A description of an instance, x = (x1, x2, …, xn), with n features, x ∈ X, where X is the instance space.
• Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
– If c(x) is a binary function, C = {1,0} ({true, false}, {positive, negative}), then it is called a concept.
Learning for Categorization
• A training example is an instance x ∈ X, paired with its correct category c(x): <x, c(x)>, for an unknown categorization function, c.
• Given a set of training examples, D.
• Find a hypothesized categorization function, h(x), such that:
    ∀<x, c(x)> ∈ D : h(x) = c(x)   (consistency)
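A minimal sketch of this consistency check in Python (illustrative, not from the slides; a hypothesis is modeled as any callable h(x), and D as a list of <x, c(x)> pairs):

    def is_consistent(h, D):
        # h is consistent with D iff it agrees with the label of every example
        return all(h(x) == c_x for x, c_x in D)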
A Concept Learning Task – EnjoySport
Training Examples
Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      YES
2        Sunny  Warm     High      Strong  Warm   Same      YES
3        Rainy  Cold     High      Strong  Warm   Change    NO
4        Sunny  Warm     High      Strong  Warm   Change    YES
• A set of example days, each described by six attributes.
• The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the values of its attributes.
(The six attribute columns are the ATTRIBUTES; the EnjoySport column is the CONCEPT.)
Hypothesis Space
• Restrict learned functions to a given hypothesis space, H, of functions h(x) that can be considered as definitions of c(x).
• For learning concepts on instances described by n discrete-valued features, consider the space of conjunctive hypotheses represented by a vector of n constraints, <c1, c2, …, cn>, where each ci is either:
– ?, a wild card indicating no constraint on that feature
– a specific value from the domain of that feature
– Ø, indicating that no value is acceptable
• Sample conjunctive hypotheses:
– <Sunny, ?, ?, Strong, ?, ?>
– <?, ?, ?, ?, ?, ?> (most general hypothesis)
– <Ø, Ø, Ø, Ø, Ø, Ø> (most specific hypothesis)
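One possible Python encoding of these conjunctive hypotheses, reused by the sketches later in these notes (the helper name matches is illustrative): a hypothesis is a tuple with '?' as the wild card and None standing in for Ø.

    def matches(h, x):
        # instance x (a tuple of feature values) satisfies hypothesis h iff
        # every constraint is '?' or equals x's value; None (Ø) matches nothing
        return all(c == '?' or (c is not None and c == v)
                   for c, v in zip(h, x))

    h_general  = ('?',) * 6                              # <?, ?, ?, ?, ?, ?>
    h_specific = (None,) * 6                             # <Ø, Ø, Ø, Ø, Ø, Ø>
    h_sample   = ('Sunny', '?', '?', 'Strong', '?', '?') # <Sunny, ?, ?, Strong, ?, ?>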
EnjoySport Concept Learning Task
Given
Instances X : set of all possible days, each described by the attributes
Sky – (values: Sunny, Cloudy, Rainy)
AirTemp – (values: Warm, Cold)
Humidity – (values: Normal, High)
Wind – (values: Strong, Weak)
Water – (values: Warm, Cold)
Forecast – (values: Same, Change)
Target Concept (Function) c : EnjoySport : X → {0,1}
Hypotheses H : Each hypothesis is described by a conjunction of constraints
on the attributes. Example hypothesis: <Sunny, ?, ?, ?, ?, ?>
Training Examples D : positive and negative examples of the target function
Determine
A hypothesis h in H such that h(x) = c(x) for all x in D.
Inductive Learning Hypothesis
Although the learning task is to determine a
hypothesis h identical to the target concept c over the
entire set of instances X , the only information
available about c is its value over the training
examples.
The Inductive Learning Hypothesis: Any function
that is found to approximate the target concept well on
a sufficiently large set of training examples will also
approximate the target function well on unobserved
examples.
• Assumes that the training and test examples are drawn independently from the same underlying distribution.
Concept Learning as Search
Concept learning can be viewed as the task of searching
through a large space of hypotheses implicitly defined
by the hypothesis representation.
• The goal of this search is to find the hypothesis that best
fits the training examples.
• By selecting a hypothesis representation, the designer of
the learning algorithm implicitly defines the space of all
hypotheses that the program can ever represent and
therefore can ever learn.
Concept Learning as Search (cont.)
• Consider an instance space consisting of n binary features, which therefore has 2^n instances.
• For conjunctive hypotheses, there are 4 choices for each feature (Ø, T, F, ?), so there are 4^n syntactically distinct hypotheses.
• However, all hypotheses with one or more Øs are equivalent, so there are 3^n + 1 semantically distinct hypotheses.
• The target function could in principle be any of the 2^(2^n) possible functions on n input bits.
• Therefore, conjunctive hypotheses are a small subset of the space of possible functions, but both are intractably large.
• All reasonable hypothesis spaces are intractably large or even infinite.
EnjoySport - Hypothesis Space
• Sky has 3 possible values; the other 5 attributes have 2 possible values each.
• There are 3·2·2·2·2·2 = 96 distinct instances in X.
• There are 5·4·4·4·4·4 = 5120 syntactically distinct hypotheses in H (two more values for each attribute: ? and Ø).
• Every hypothesis containing one or more Ø symbols represents the empty set of instances (it classifies every instance as negative).
• There are 1 + 4·3·3·3·3·3 = 973 semantically distinct hypotheses in H (only one more value for each attribute, ?, plus the single hypothesis representing the empty set of instances).
• Although EnjoySport has a small, finite hypothesis space, most learning tasks have much larger (even infinite) hypothesis spaces.
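These counts are easy to verify; a quick check in Python:

    instances = 3 * 2**5         # 96 distinct instances in X
    syntactic = 5 * 4**5         # 5120 syntactically distinct hypotheses
    semantic  = 1 + 4 * 3**5     # 973 semantically distinct hypotheses
    print(instances, syntactic, semantic)   # -> 96 5120 973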
Learning by Enumeration
• For any finite or countably infinite hypothesis space, one can simply enumerate and test hypotheses one at a time until a consistent one is found.
For each h in H do:
    If h is consistent with the training data D,
    then terminate and return h.
• This algorithm is guaranteed to terminate with a consistent hypothesis if one exists; however, it is obviously computationally intractable for almost any practical problem.
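A sketch of this enumerate-and-test loop for the conjunctive hypothesis space, reusing the matches helper from above (enumerate_hypotheses and domains are illustrative names; domains is a list of the value sets of the features):

    from itertools import product

    def enumerate_hypotheses(domains):
        # the single all-None hypothesis <Ø, ..., Ø>, then every combination
        # of '?' or a specific value for each feature
        yield (None,) * len(domains)
        for h in product(*[['?'] + list(d) for d in domains]):
            yield h

    def learn_by_enumeration(domains, D):
        # return the first hypothesis consistent with every example in D
        for h in enumerate_hypotheses(domains):
            if all(matches(h, x) == label for x, label in D):
                return h
        return None    # no consistent hypothesis exists in H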
Using the Generality Structure
• By exploiting the structure imposed by the generality of hypotheses, a hypothesis space can be searched for consistent hypotheses without enumerating or explicitly exploring all hypotheses.
• An instance x ∈ X is said to satisfy a hypothesis h iff h(x) = 1 (positive).
• Given two hypotheses h1 and h2, h1 is more general than or equal to h2 (h1 ≥ h2) iff every instance that satisfies h2 also satisfies h1.
• Given two hypotheses h1 and h2, h1 is (strictly) more general than h2 (h1 > h2) iff h1 ≥ h2 and it is not the case that h2 ≥ h1.
• Generality defines a partial order on hypotheses.
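For conjunctive feature vectors the generality test can be computed constraint by constraint; a sketch under the tuple encoding used above:

    def more_general_or_equal(h1, h2):
        # h1 >= h2 iff every instance satisfying h2 also satisfies h1
        if None in h2:           # h2 matches nothing, so any h1 covers it
            return True
        return all(c1 == '?' or c1 == c2 for c1, c2 in zip(h1, h2))

    def strictly_more_general(h1, h2):
        return more_general_or_equal(h1, h2) and not more_general_or_equal(h2, h1)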
Examples of Generality
• Conjunctive feature vectors
– <Sunny, ?, ?, ?, ?, ?> is more general than
<Sunny, ?, ?, Strong, ?, ?>
– Neither of <Sunny, ?, ?, ?, ?, ?> and
<?, ?, ?, Strong, ?, ?> is more general than the other.
• Axis-parallel rectangles in 2-D space
– A is more general than B.
– Neither of A and C is more general than the other.
[Figure: three axis-parallel rectangles A, B, C, with B contained inside A; A and C are not comparable.]
“More General Than” Relation
• h2 > h1 and h2 > h3.
• But there is no more-general-than relation between h1 and h3.
FIND-S Algorithm
• Find the most specific hypothesis (least-general generalization) that is consistent with the training data.
• Starting with the most specific hypothesis, incrementally update the hypothesis after every positive example, generalizing it just enough to satisfy the new example.
• Time complexity: O(|D| · n), where n is the number of features.
FIND-S Algorithm
Initialize h = <Ø, Ø, …, Ø> (the most specific hypothesis)
For each positive training instance x in D:
    For each feature fi:
        If the constraint on fi in h is not satisfied by x:
            If fi in h is Ø,
            then set fi in h to the value of fi in x,
            else set fi in h to “?”
If h is consistent with the negative training instances in D,
then return h,
else no consistent hypothesis exists
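A direct Python rendering of this pseudocode (a sketch, reusing the matches helper; D is a list of (instance, label) pairs with boolean labels):

    def find_s(D):
        n = len(D[0][0])
        h = [None] * n                       # most specific hypothesis <Ø, ..., Ø>
        for x, label in D:
            if not label:
                continue                     # negative examples are ignored
            for i, v in enumerate(x):
                if h[i] is None:
                    h[i] = v                 # Ø: adopt the example's value
                elif h[i] != v:
                    h[i] = '?'               # mismatch: generalize to the wild card
        h = tuple(h)
        # final check against the negative examples
        if any(matches(h, x) for x, label in D if not label):
            return None                      # no consistent hypothesis exists
        return h

On the four EnjoySport examples above, this returns ('Sunny', 'Warm', '?', 'Strong', 'Warm', '?').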
FIND-S Algorithm - Example
Issues with FIND-S
• Given sufficient training examples, does FIND-S converge to a correct definition of the target concept (assuming it is in the hypothesis space)?
• How do we know when the hypothesis has converged to a correct definition?
• Why prefer the most specific hypothesis? Are more general hypotheses consistent? What about the most general hypothesis? What about the simplest hypothesis?
• What if there are several maximally specific consistent hypotheses? Which one should be chosen?
• What if there is noise in the training data and some training examples are incorrectly labeled?
Candidate-Elimination Algorithm
• FIND-S outputs a hypothesis from H that is consistent with the training examples; it is just one of many hypotheses from H that might fit the training data equally well.
• The key idea in the Candidate-Elimination algorithm is to output a description of the set of all hypotheses consistent with the training examples, without explicitly enumerating all of its members.
• This is accomplished by using the more-general-than partial ordering and maintaining a compact representation of the set of consistent hypotheses.
Consistent Hypothesis
• The key difference between the definitions of satisfies and consistent:
• An example x is said to satisfy hypothesis h when h(x) = 1, regardless of whether x is a positive or negative example of the target concept.
• An example x is consistent with hypothesis h when h(x) = c(x), i.e., when h's prediction agrees with x's label under the target concept.
Version Space
• Given a hypothesis space, H, and training
data, D, the version space is the complete
subset of H that is consistent with D.
• The version space can be naively generated
for any finite H by enumerating all
hypotheses and eliminating the inconsistent
ones.
List-Then-Eliminate Algorithm
VersionSpace ← a list containing every hypothesis in H
For each training example <x, c(x)>:
    remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
Output the list of hypotheses in VersionSpace
List-Then-Eliminate Algorithm
• The version space of candidate hypotheses shrinks as more examples are observed, until ideally just one hypothesis remains that is consistent with all the observed examples.
What if there is insufficient training data? The algorithm will output the entire set of hypotheses consistent with the observed data.
• The List-Then-Eliminate algorithm can be applied whenever the hypothesis space H is finite.
It has many advantages, including the fact that it is guaranteed to output all hypotheses consistent with the training data.
Unfortunately, it requires exhaustively enumerating all hypotheses in H, an unrealistic requirement for all but the most trivial hypothesis spaces.
Can one compute the version space more efficiently than by enumeration?
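A sketch of the List-Then-Eliminate loop itself, using the enumerate_hypotheses and matches helpers from above; workable only for small, finite hypothesis spaces:

    def list_then_eliminate(domains, D):
        # keep exactly the hypotheses that agree with the label of every example
        return [h for h in enumerate_hypotheses(domains)
                if all(matches(h, x) == label for x, label in D)]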
Compact Representation of Version Spaces with S and G
• The version space can be represented more compactly by maintaining two boundary sets of hypotheses: S, the set of most specific consistent hypotheses, and G, the set of most general consistent hypotheses.
• S and G represent the entire version space via its boundaries in the generalization lattice:
    S = { s ∈ H | Consistent(s, D) ∧ ¬∃s′ ∈ H [(s > s′) ∧ Consistent(s′, D)] }
    G = { g ∈ H | Consistent(g, D) ∧ ¬∃g′ ∈ H [(g′ > g) ∧ Consistent(g′, D)] }
[Figure: the version space, bounded from above by the G boundary and from below by the S boundary.]
Compact Representation of Version Spaces
• The Candidate-Elimination algorithm represents
the version space by storing only its most general
members G and its most specific members S.
• Given only these two sets S and G, it is possible
to enumerate all members of a version space by
generating hypotheses that lie between these two
sets in the general-to-specific partial ordering over
hypotheses.
• Every member of the version space lies between
these boundaries.
Version Space Lattice
Features – Size: {sm, big}; Color: {red, blue}; Shape: {circ, squr}
Training examples: <big, red, squr> positive; <sm, blue, circ> negative.
[Figure: the full generalization lattice, from <?, ?, ?> at the top down to <Ø, Ø, Ø> at the bottom, with each hypothesis color-coded as a member of G, a member of S, or another member of the version space.]
Candidate-Elimination Algorithm
• The Candidate-Elimination algorithm computes the version space containing
all hypotheses from H that are consistent with an observed sequence of training
examples.
• It begins by initializing the version space to the set of all hypotheses in H; that is, by initializing the G boundary set to contain the most general hypothesis in H,
    G0 ← { <?, ?, ?, ?, ?, ?> }
and initializing the S boundary set to contain the most specific hypothesis,
    S0 ← { <Ø, Ø, Ø, Ø, Ø, Ø> }
• These two boundary sets delimit the entire hypothesis space, because every
other hypothesis in H is both more general than S0 and more specific than G0.
• As each training example is considered, the S and G boundary sets are
generalized and specialized, respectively, to eliminate from the version space
any hypotheses found inconsistent with the new training example.
• After all examples have been processed, the computed version space contains
all the hypotheses consistent with these examples and only these hypotheses.
Candidate-Elimination Algorithm
Initialize G to the set of most general hypotheses in H
Initialize S to the set of most specific hypotheses in H
For each training example, d, do:
    If d is a positive example then:
        Remove from G any hypothesis that is inconsistent with d
        For each hypothesis s in S that is inconsistent with d:
            Remove s from S
            Add to S all minimal generalizations, h, of s such that:
                1) h is consistent with d, and
                2) some member of G is more general than h
        Remove from S any h that is more general than another hypothesis in S
    If d is a negative example then:
        Remove from S any hypothesis that is inconsistent with d
        For each hypothesis g in G that is consistent with d:
            Remove g from G
            Add to G all minimal specializations, h, of g such that:
                1) h is inconsistent with d, and
                2) some member of S is more specific than h
        Remove from G any h that is more specific than another hypothesis in G
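A compact Python sketch of this algorithm for conjunctive feature vectors, building on the matches, more_general_or_equal, and strictly_more_general helpers above (minimal generalization and specialization here are the conjunctive-vector versions described on a later slide):

    def min_generalize(s, x):
        # the unique minimal generalization of s that covers positive instance x
        return tuple(v if c is None else (c if c == v else '?')
                     for c, v in zip(s, x))

    def min_specializations(g, x, domains):
        # minimal specializations of g that exclude negative instance x:
        # replace one '?' with any value that differs from x on that feature
        return [g[:i] + (v,) + g[i+1:]
                for i, c in enumerate(g) if c == '?'
                for v in domains[i] if v != x[i]]

    def candidate_elimination(domains, D):
        n = len(domains)
        G = {('?',) * n}                     # most general boundary
        S = {(None,) * n}                    # most specific boundary
        for x, label in D:
            if label:                        # positive example
                G = {g for g in G if matches(g, x)}
                new_S = set()
                for s in S:
                    s2 = s if matches(s, x) else min_generalize(s, x)
                    # keep s2 only if some member of G covers it
                    if any(more_general_or_equal(g, s2) for g in G):
                        new_S.add(s2)
                S = {s for s in new_S        # drop non-minimal members
                     if not any(strictly_more_general(s, t) for t in new_S)}
            else:                            # negative example
                S = {s for s in S if not matches(s, x)}
                new_G = set()
                for g in G:
                    if not matches(g, x):    # g already excludes x
                        new_G.add(g)
                        continue
                    for g2 in min_specializations(g, x, domains):
                        # keep g2 only if it still covers some member of S
                        if any(more_general_or_equal(g2, s) for s in S):
                            new_G.add(g2)
                G = {g for g in new_G        # drop non-maximal members
                     if not any(strictly_more_general(t, g) for t in new_G)}
            if not S or not G:
                return set(), set()          # no consistent hypothesis in H
        return S, G

On the four EnjoySport examples above, this should yield S = {<Sunny, Warm, ?, Strong, Warm, ?>} and G = {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}.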
Required Methods
• To instantiate the algorithm for a specific
hypothesis language requires the following
procedures:
– equal-hypotheses(h1, h2)
– more-general(h1, h2)
– consistent(h, i)
– initialize-g()
– initialize-s()
– generalize-to(h, i)
– specialize-against(h, i)
Minimal Specialization and Generalization
• The procedures generalize-to and specialize-against are specific to a hypothesis language and can be complex.
• For conjunctive feature vectors:
– generalize-to: unique; see FIND-S
– specialize-against: not unique; can convert each “?” to an alternative non-matching value for that feature
• Inputs:
– h = <?, red, ?>
– i = <small, red, triangle>
• Outputs:
– <big, red, ?>
– <medium, red, ?>
– <?, red, square>
– <?, red, circle>
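Running the min_specializations sketch from the Candidate-Elimination code above on this example reproduces the listed outputs (the feature domains are assumed from the values the slide mentions):

    domains = [['small', 'medium', 'big'],            # assumed Size domain
               ['red', 'blue'],                       # assumed Color domain
               ['triangle', 'square', 'circle']]      # assumed Shape domain
    h = ('?', 'red', '?')
    x = ('small', 'red', 'triangle')
    print(min_specializations(h, x, domains))
    # [('medium', 'red', '?'), ('big', 'red', '?'),
    #  ('?', 'red', 'square'), ('?', 'red', 'circle')]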
Example Version Space
• A version space with its general and specific boundary sets.
• The version space includes all six hypotheses shown here, but can be represented more simply by S and G.
Candidate-Elimination Algo. - Example
• S0 and G0 are the initial
boundary sets
corresponding to the most
specific and most general
hypotheses.
• Training examples 1
and 2 force the S
boundary to become
more general.
• They have no effect on
the G boundary.
Candidate-Elimination Algo. - Example
• Given that there are six attributes that could be specified to specialize G2, why are there only three new hypotheses in G3?
• For example, the hypothesis h = <?, ?, Normal, ?, ?, ?> is a minimal specialization of G2 that correctly labels the new example as negative, but it is not included in G3.
This hypothesis is excluded because it is inconsistent with S2. The algorithm determines this simply by noting that h is not more general than the current specific boundary, S2.
• In fact, the S boundary of the version space forms a summary of the previously encountered positive examples that can be used to determine whether any given hypothesis is consistent with these examples.
• The G boundary summarizes the information from previously encountered negative examples. Any hypothesis more specific than G is assured to be consistent with past negative examples.
Candidate-Elimination Algo. - Example
• The fourth training example further generalizes the S boundary.
• It also results in removing one member of the G boundary, because this member fails to cover the new positive example.
To understand the rationale for this step, it is useful to consider why the offending hypothesis must be removed from G.
Notice that it cannot be specialized, because specializing it would not make it cover the new example.
It also cannot be generalized, because by the definition of G, any more general hypothesis will cover at least one negative training example.
Therefore, the hypothesis must be dropped from the G boundary, thereby removing an entire branch of the partial ordering from the version space of hypotheses remaining under consideration.
Candidate-Elimination Algorithm – Example: Final Version Space
• After processing these four examples, the boundary sets
S4 and G4 delimit the version space of all hypotheses
consistent with the set of incrementally observed
training examples.
• This learned version space is independent of the
sequence in which the training examples are presented
(because in the end it contains all hypotheses consistent
with the set of examples).
• As further training data is encountered, the S and G boundaries will move monotonically closer to each other, delimiting a smaller and smaller version space of candidate hypotheses.
Properties of VS Algorithm
• S summarizes the relevant information in the positive examples (relative to H), so that positive examples do not need to be retained.
• G summarizes the relevant information in the negative examples, so that negative examples do not need to be retained.
• The result is not affected by the order in which examples are processed, but computational efficiency may be.
• Positive examples move the S boundary up; negative examples move the G boundary down.
• If S and G converge to the same hypothesis, then it is the only one in H that is consistent with the data.
• If S and G become empty (if one does, the other must also), then there is no hypothesis in H consistent with the data.
Correctness of Learning
• Since the entire version space is maintained, given a continuous stream of noise-free training examples, the VS algorithm will eventually converge to the correct target concept if it is in the hypothesis space, H, or eventually correctly determine that it is not in H.
• Convergence is correctly indicated when S = G.
Computational Complexity of VS
• Computing the S set for conjunctive feature
vectors is linear in the number of features and the
number of training examples.
• Computing the G set for conjunctive feature vectors is exponential in the number of training examples in the worst case.
• In more expressive languages, both S and G can grow exponentially.
• The order in which examples are processed can
significantly affect computational complexity.
Active Learning
• In active learning, the system is responsible for selecting good training examples and asking a teacher (oracle) to provide a class label.
• In sample selection, the system picks good examples to query from a provided pool of unlabeled examples.
• In query generation, the system must generate the description of an example for which to request a label.
• The goal is to minimize the number of queries required to learn an accurate concept description.
Active Learning with VS
• An ideal training example would eliminate half of the hypotheses in the current version space regardless of its label.
• If a training example matches half of the hypotheses in the version space, then the matching half is eliminated if the example is negative, and the other (non-matching) half is eliminated if the example is positive.
• Example:
– Assume the training set
  • Positive: <big, red, circle>
  • Negative: <small, red, triangle>
– Current version space:
  • {<big, red, circle>, <big, red, ?>, <big, ?, circle>, <?, red, circle>, <?, ?, circle>, <big, ?, ?>}
– An optimal query: <big, blue, circle>
• A ceiling of log2 |VS| such examples will result in convergence. This is the best possible guarantee in general. (See the query-scoring sketch below.)
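A sketch of scoring candidate queries against the current version space, reusing the matches helper (query_value and best_query are illustrative names):

    def query_value(x, VS):
        # fraction of VS eliminated whichever label comes back;
        # an ideal query matches exactly half of VS and scores 1/2
        pos = sum(matches(h, x) for h in VS)
        return min(pos, len(VS) - pos) / len(VS)

    def best_query(candidates, VS):
        # sample selection: pick the pool instance closest to a half split
        return max(candidates, key=lambda x: query_value(x, VS))

On the example above, the query <big, blue, circle> matches exactly 3 of the 6 hypotheses, so it scores the ideal 1/2.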
Using an Unconverged VS
• If the VS has not converged, how does it classify a novel test instance?
• If all elements of S match an instance, then the entire version space must match it (since every other member is more general), and it can be confidently classified as positive (assuming the target concept is in H).
• If no element of G matches an instance, then no element of the version space can match it (since every other member is more specific), and it can be confidently classified as negative (assuming the target concept is in H).
• Otherwise, one could vote all of the hypotheses in the VS (or just the G and S sets, to avoid enumerating the VS) to give a classification with an associated confidence value.
• Voting the entire VS is probabilistically optimal assuming the target concept is in H and all hypotheses in H are equally likely a priori.
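A sketch of this voting scheme over an explicitly enumerated version space; with threshold=1.0 it reproduces the refuse-unless-unanimous behavior:

    def classify(x, VS, threshold=1.0):
        # returns (label, confidence); label is None when the vote
        # falls below the requested confidence threshold
        frac = sum(matches(h, x) for h in VS) / len(VS)
        if frac >= threshold:
            return True, frac
        if 1 - frac >= threshold:
            return False, 1 - frac
        return None, max(frac, 1 - frac)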
Inductive Bias
• A hypothesis space that does not include all possible classification functions on the instance space incorporates a bias in the type of classifiers it can learn.
• Any means that a learning system uses to choose between two functions that are both consistent with the training data is called inductive bias.
• Inductive bias can take two forms:
– Language bias: the language for representing concepts defines a hypothesis space that does not include all possible functions (e.g., conjunctive descriptions).
– Search bias: the language is expressive enough to represent all possible functions (e.g., disjunctive normal form), but the search algorithm embodies a preference for certain consistent functions over others (e.g., syntactic simplicity).
Unbiased Learning
• For instances described by n features, each with m values, there are m^n instances. If these are to be classified into c categories, then there are c^(m^n) possible classification functions.
– For n=10, m=c=2, there are 2^1024 ≈ 1.8×10^308 possible functions, of which only 3^10 = 59,049 can be represented as conjunctions (an incredibly small percentage!).
• However, unbiased learning is futile, since if we consider all possible functions then simply memorizing the data without any real generalization is as good an option as any.
• Without bias, the version space is always trivial: the unique most specific hypothesis is the disjunction of the positive instances, and the unique most general hypothesis is the negation of the disjunction of the negative instances:
    S = { (p1 ∨ p2 ∨ … ∨ pk) }
    G = { ¬(n1 ∨ n2 ∨ … ∨ nj) }
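The counts quoted above, checked in Python (exponentiation is right-associative, so c ** m ** n is c^(m^n)):

    n, m, c = 10, 2, 2
    instances   = m ** n         # 1024 distinct instances
    functions   = c ** m ** n    # 2**1024, about 1.8e308 classification functions
    conjunctive = 3 ** n         # 59,049 representable as conjunctions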
Futility of Bias-Free Learning
• A learner that makes no a priori assumptions about the target concept has no rational basis for classifying any unseen instances.
• Inductive bias can also be defined as the assumptions that, when combined with the observed training data, logically entail the subsequent classification of unseen instances.
– (Training data + inductive bias) ⊢ novel classifications
• The bias of the VS algorithm (assuming it refuses to classify an instance unless it is classified the same by all members of the VS) is simply that H contains the target concept.
• The rote learner, which refuses to classify any instance unless it has seen it during training, is the least biased.
• Learners can be partially ordered by their amount of bias:
– Rote learner < VS algorithm < FIND-S
Inductive Bias – Three Learning Algorithms
ROTE-LEARNER: Learning corresponds simply to storing each observed training example in memory. Subsequent instances are classified by looking them up in memory. If the instance is found in memory, the stored classification is returned. Otherwise, the system refuses to classify the new instance.
Inductive bias: no inductive bias.
CANDIDATE-ELIMINATION: New instances are classified only in the case where all members of the current version space agree on the classification. Otherwise, the system refuses to classify the new instance.
Inductive bias: the target concept can be represented in its hypothesis space.
FIND-S: This algorithm finds the most specific hypothesis consistent with the training examples. It then uses this hypothesis to classify all subsequent instances.
Inductive bias: the target concept can be represented in its hypothesis space, and all instances are negative unless the opposite is entailed by its other knowledge.
Concept Learning - Summary
• Concept learning can be seen as a problem of searching through a
large predefined space of potential hypotheses.
• The general-to-specific partial ordering of hypotheses provides a
useful structure for organizing the search through the hypothesis
space.
• The FIND-S algorithm utilizes this general-to-specific ordering,
performing a specific-to-general search through the hypothesis space
along one branch of the partial ordering, to find the most specific
hypothesis consistent with the training examples.
• The CANDIDATE-ELIMINATION algorithm utilizes this general-
to-specific ordering to compute the version space (the set of all
hypotheses consistent with the training data) by incrementally
computing the sets of maximally specific (S) and maximally general
(G) hypotheses.
Concept Learning - Summary
• Because the S and G sets delimit the entire set of hypotheses
consistent with the data, they provide the learner with a
description of its uncertainty regarding the exact identity of the
target concept. This version space of alternative hypotheses can
be examined
– to determine whether the learner has converged to the target
concept,
– to determine when the training data are inconsistent,
– to generate informative queries to further refine the VS, and
– to determine which unseen instances can be unambiguously
classified based on the partially learned concept.
• The CANDIDATE-ELIMINATION algorithm is not robust to noisy data or to situations in which the unknown target concept is not expressible in the provided hypothesis space.
Concept Learning - Summary
• Inductive learning algorithms are able to classify unseen
examples only because of their implicit inductive bias
for selecting one consistent hypothesis over another.
• If the hypothesis space is enriched to the point where
there is a hypothesis corresponding to every possible
subset of instances (the power set of the instances), this
will remove any inductive bias from the CANDIDATE-ELIMINATION algorithm.
– Unfortunately, this also removes the ability to classify
any instance beyond the observed training examples.
– An unbiased learner cannot make inductive leaps to
classify unseen examples.