CS 464: Introduction to
Machine Learning
Concept Learning
Slides adapted from Chapter 2,
Machine Learning by Tom M. Mitchell
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html
Concept Learning
Acquiring the definition of a general category from given sample positive and negative training examples of the category.
An example: learning the “bird” concept from the given examples of birds (positive examples) and non-birds (negative examples).
Inferring a boolean-valued function from training examples of its input and output.
Classification (Categorization)
• Given
– A fixed set of categories: C = {c1, c2, …, ck}
– A description of an instance, x = (x1, x2, …, xn), with n features, x ∈ X, where X is the instance space.
• Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
– If c(x) is a binary function, C = {1,0} ({true, false}, {positive, negative}), then it is called a concept.
Learning for Categorization
• A training example is an instance x ∈ X, paired with its correct category c(x): <x, c(x)>, for an unknown categorization function, c.
• Given a set of training examples, D.
• Find a hypothesized categorization function, h(x), such that:
    ∀<x, c(x)> ∈ D : h(x) = c(x)   (consistency)
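A minimal sketch of this consistency check in Python (illustrative, not from the slides; a hypothesis is modeled as any callable h(x), and D as a list of <x, c(x)> pairs):

    def is_consistent(h, D):
        # h is consistent with D iff it agrees with the label of every example
        return all(h(x) == c_x for x, c_x in D)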
A Concept Learning Task – EnjoySport
Training Examples
Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      YES
2        Sunny  Warm     High      Strong  Warm   Same      YES
3        Rainy  Cold     High      Strong  Warm   Change    NO
4        Sunny  Warm     High      Strong  Warm   Change    YES
• A set of example days, each described by six attributes.
• The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the values of its attributes.
(The six attribute columns are the ATTRIBUTES; the EnjoySport column is the CONCEPT.)
Hypothesis Space
• Restrict learned functions to a given hypothesis space, H, of functions h(x) that can be considered as definitions of c(x).
• For learning concepts on instances described by n discrete-valued features, consider the space of conjunctive hypotheses represented by a vector of n constraints, <c1, c2, …, cn>, where each ci is either:
– ?, a wild card indicating no constraint on that feature
– a specific value from the domain of that feature
– Ø, indicating that no value is acceptable
• Sample conjunctive hypotheses:
– <Sunny, ?, ?, Strong, ?, ?>
– <?, ?, ?, ?, ?, ?> (most general hypothesis)
– <Ø, Ø, Ø, Ø, Ø, Ø> (most specific hypothesis)
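One possible Python encoding of these conjunctive hypotheses, reused by the sketches later in these notes (the helper name matches is illustrative): a hypothesis is a tuple with '?' as the wild card and None standing in for Ø.

    def matches(h, x):
        # instance x (a tuple of feature values) satisfies hypothesis h iff
        # every constraint is '?' or equals x's value; None (Ø) matches nothing
        return all(c == '?' or (c is not None and c == v)
                   for c, v in zip(h, x))

    h_general  = ('?',) * 6                              # <?, ?, ?, ?, ?, ?>
    h_specific = (None,) * 6                             # <Ø, Ø, Ø, Ø, Ø, Ø>
    h_sample   = ('Sunny', '?', '?', 'Strong', '?', '?') # <Sunny, ?, ?, Strong, ?, ?>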
EnjoySport Concept Learning Task
Given
Instances X : set of all possible days, each described by the attributes
Sky – (values: Sunny, Cloudy, Rainy)
AirTemp – (values: Warm, Cold)
Humidity – (values: Normal, High)
Wind – (values: Strong, Weak)
Water – (values: Warm, Cold)
Forecast – (values: Same, Change)
Target Concept (Function) c : EnjoySport : X → {0,1}
Hypotheses H : Each hypothesis is described by a conjunction of constraints
on the attributes. Example hypothesis: <Sunny, ?, ?, ?, ?, ?>
Training Examples D : positive and negative examples of the target function
Determine
A hypothesis h in H such that h(x) = c(x) for all x in D.
Inductive Learning Hypothesis
Although the learning task is to determine a
hypothesis h identical to the target concept c over the
entire set of instances X , the only information
available about c is its value over the training
examples.
The Inductive Learning Hypothesis: Any function
that is found to approximate the target concept well on
a sufficiently large set of training examples will also
approximate the target function well on unobserved
examples.
• Assumes that the training and test examples are drawn independently from the same underlying distribution.
Concept Learning as Search
Concept learning can be viewed as the task of searching
through a large space of hypotheses implicitly defined
by the hypothesis representation.
• The goal of this search is to find the hypothesis that best
fits the training examples.
• By selecting a hypothesis representation, the designer of
the learning algorithm implicitly defines the space of all
hypotheses that the program can ever represent and
therefore can ever learn.
Concept Learning as Search (cont.)
• Consider an instance space consisting of n binary features, which therefore has 2^n instances.
• For conjunctive hypotheses, there are 4 choices for each feature (Ø, T, F, ?), so there are 4^n syntactically distinct hypotheses.
• However, all hypotheses with one or more Øs are equivalent, so there are 3^n + 1 semantically distinct hypotheses.
• The target function could in principle be any of the 2^(2^n) possible functions on n input bits.
• Therefore, conjunctive hypotheses are a small subset of the space of possible functions, but both are intractably large.
• All reasonable hypothesis spaces are intractably large or even infinite.
EnjoySport - Hypothesis Space
• Sky has 3 possible values; the other 5 attributes have 2 possible values each.
• There are 3·2·2·2·2·2 = 96 distinct instances in X.
• There are 5·4·4·4·4·4 = 5120 syntactically distinct hypotheses in H (two more values for each attribute: ? and Ø).
• Every hypothesis containing one or more Ø symbols represents the empty set of instances (it classifies every instance as negative).
• There are 1 + 4·3·3·3·3·3 = 973 semantically distinct hypotheses in H (only one more value for each attribute, ?, plus the single hypothesis representing the empty set of instances).
• Although EnjoySport has a small, finite hypothesis space, most learning tasks have much larger (even infinite) hypothesis spaces.
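These counts are easy to verify; a quick check in Python:

    instances = 3 * 2**5         # 96 distinct instances in X
    syntactic = 5 * 4**5         # 5120 syntactically distinct hypotheses
    semantic  = 1 + 4 * 3**5     # 973 semantically distinct hypotheses
    print(instances, syntactic, semantic)   # -> 96 5120 973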
Learning by Enumeration
• For any finite or countably infinite hypothesis space, one can simply enumerate and test hypotheses one at a time until a consistent one is found.
For each h in H do:
    If h is consistent with the training data D,
    then terminate and return h.
• This algorithm is guaranteed to terminate with a consistent hypothesis if one exists; however, it is obviously computationally intractable for almost any practical problem.
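A sketch of this enumerate-and-test loop for the conjunctive hypothesis space, reusing the matches helper from above (enumerate_hypotheses and domains are illustrative names; domains is a list of the value sets of the features):

    from itertools import product

    def enumerate_hypotheses(domains):
        # the single all-None hypothesis <Ø, ..., Ø>, then every combination
        # of '?' or a specific value for each feature
        yield (None,) * len(domains)
        for h in product(*[['?'] + list(d) for d in domains]):
            yield h

    def learn_by_enumeration(domains, D):
        # return the first hypothesis consistent with every example in D
        for h in enumerate_hypotheses(domains):
            if all(matches(h, x) == label for x, label in D):
                return h
        return None    # no consistent hypothesis exists in H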
Using the Generality Structure
• By exploiting the structure imposed by the generality of hypotheses, a hypothesis space can be searched for consistent hypotheses without enumerating or explicitly exploring all hypotheses.
• An instance x ∈ X is said to satisfy a hypothesis h iff h(x) = 1 (positive).
• Given two hypotheses h1 and h2, h1 is more general than or equal to h2 (h1 ≥ h2) iff every instance that satisfies h2 also satisfies h1.
• Given two hypotheses h1 and h2, h1 is (strictly) more general than h2 (h1 > h2) iff h1 ≥ h2 and it is not the case that h2 ≥ h1.
• Generality defines a partial order on hypotheses.
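For conjunctive feature vectors the generality test can be computed constraint by constraint; a sketch under the tuple encoding used above:

    def more_general_or_equal(h1, h2):
        # h1 >= h2 iff every instance satisfying h2 also satisfies h1
        if None in h2:           # h2 matches nothing, so any h1 covers it
            return True
        return all(c1 == '?' or c1 == c2 for c1, c2 in zip(h1, h2))

    def strictly_more_general(h1, h2):
        return more_general_or_equal(h1, h2) and not more_general_or_equal(h2, h1)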
Examples of Generality
• Conjunctive feature vectors
– <Sunny, ?, ?, ?, ?, ?> is more general than
<Sunny, ?, ?, Strong, ?, ?>
– Neither of <Sunny, ?, ?, ?, ?, ?> and
<?, ?, ?, Strong, ?, ?> is more general than the other.
• Axis-parallel rectangles in 2-D space
– A is more general than B.
– Neither of A and C is more general than the other.
[Figure: three axis-parallel rectangles A, B, C, with B contained inside A; A and C are not comparable.]
“More General Than” Relation
• h2 > h1 and h2 > h3.
• But there is no more-general-than relation between h1 and h3.
FIND-S Algorithm
• Find the most specific hypothesis (least-general generalization) that is consistent with the training data.
• Starting with the most specific hypothesis, incrementally update the hypothesis after every positive example, generalizing it just enough to satisfy the new example.
• Time complexity: O(|D| · n), where n is the number of features.
FIND-S Algorithm
Initialize h = <Ø, Ø, …, Ø> (the most specific hypothesis)
For each positive training instance x in D:
    For each feature fi:
        If the constraint on fi in h is not satisfied by x:
            If fi in h is Ø,
            then set fi in h to the value of fi in x,
            else set fi in h to “?”
If h is consistent with the negative training instances in D,
then return h,
else no consistent hypothesis exists
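A direct Python rendering of this pseudocode (a sketch, reusing the matches helper; D is a list of (instance, label) pairs with boolean labels):

    def find_s(D):
        n = len(D[0][0])
        h = [None] * n                       # most specific hypothesis <Ø, ..., Ø>
        for x, label in D:
            if not label:
                continue                     # negative examples are ignored
            for i, v in enumerate(x):
                if h[i] is None:
                    h[i] = v                 # Ø: adopt the example's value
                elif h[i] != v:
                    h[i] = '?'               # mismatch: generalize to the wild card
        h = tuple(h)
        # final check against the negative examples
        if any(matches(h, x) for x, label in D if not label):
            return None                      # no consistent hypothesis exists
        return h

On the four EnjoySport examples above, this returns ('Sunny', 'Warm', '?', 'Strong', 'Warm', '?').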
FIND-S Algorithm - Example
Issues with FIND-S
• Given sufficient training examples, does FIND-S converge to a correct definition of the target concept (assuming it is in the hypothesis space)?
• How do we know when the hypothesis has converged to a correct definition?
• Why prefer the most specific hypothesis? Are more general hypotheses consistent? What about the most general hypothesis? What about the simplest hypothesis?
• What if there are several maximally specific consistent hypotheses? Which one should be chosen?
• What if there is noise in the training data and some training examples are incorrectly labeled?
Candidate-Elimination Algorithm
• FIND-S outputs a hypothesis from H that is consistent with the training examples; it is just one of many hypotheses from H that might fit the training data equally well.
• The key idea in the Candidate-Elimination algorithm is to output a description of the set of all hypotheses consistent with the training examples, without explicitly enumerating all of its members.
• This is accomplished by using the more-general-than partial ordering and maintaining a compact representation of the set of consistent hypotheses.
Consistent Hypothesis
• The key difference between the definitions of satisfies and consistent:
• An example x is said to satisfy hypothesis h when h(x) = 1, regardless of whether x is a positive or negative example of the target concept.
• An example x is consistent with hypothesis h when h(x) = c(x), i.e., when h's prediction agrees with x's label under the target concept.
Version Space
• Given a hypothesis space, H, and training
data, D, the version space is the complete
subset of H that is consistent with D.
• The version space can be naively generated
for any finite H by enumerating all
hypotheses and eliminating the inconsistent
ones.
List-Then-Eliminate Algorithm
VersionSpace ← a list containing every hypothesis in H
For each training example <x, c(x)>:
    remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
Output the list of hypotheses in VersionSpace
List-Then-Eliminate Algorithm
• The version space of candidate hypotheses shrinks as more examples are observed, until ideally just one hypothesis remains that is consistent with all the observed examples.
What if there is insufficient training data? The algorithm will output the entire set of hypotheses consistent with the observed data.
• The List-Then-Eliminate algorithm can be applied whenever the hypothesis space H is finite.
It has many advantages, including the fact that it is guaranteed to output all hypotheses consistent with the training data.
Unfortunately, it requires exhaustively enumerating all hypotheses in H, an unrealistic requirement for all but the most trivial hypothesis spaces.
Can one compute the version space more efficiently than by enumeration?
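A sketch of the List-Then-Eliminate loop itself, using the enumerate_hypotheses and matches helpers from above; workable only for small, finite hypothesis spaces:

    def list_then_eliminate(domains, D):
        # keep exactly the hypotheses that agree with the label of every example
        return [h for h in enumerate_hypotheses(domains)
                if all(matches(h, x) == label for x, label in D)]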
Compact Representation of Version Spaces with S and G
• The version space can be represented more compactly by maintaining two boundary sets of hypotheses: S, the set of most specific consistent hypotheses, and G, the set of most general consistent hypotheses.
• S and G represent the entire version space via its boundaries in the generalization lattice:
    S = { s ∈ H | Consistent(s, D) ∧ ¬∃s′ ∈ H [(s > s′) ∧ Consistent(s′, D)] }
    G = { g ∈ H | Consistent(g, D) ∧ ¬∃g′ ∈ H [(g′ > g) ∧ Consistent(g′, D)] }
[Figure: the version space, bounded from above by the G boundary and from below by the S boundary.]
Compact Representation of Version Spaces
• The Candidate-Elimination algorithm represents
the version space by storing only its most general
members G and its most specific members S.
• Given only these two sets S and G, it is possible
to enumerate all members of a version space by
generating hypotheses that lie between these two
sets in the general-to-specific partial ordering over
hypotheses.
• Every member of the version space lies between
these boundaries.
Version Space Lattice
Features – Size: {sm, big}; Color: {red, blue}; Shape: {circ, squr}
Training examples: <big, red, squr> positive; <sm, blue, circ> negative.
[Figure: the full generalization lattice, from <?, ?, ?> at the top down to <Ø, Ø, Ø> at the bottom, with each hypothesis color-coded as a member of G, a member of S, or another member of the version space.]
Candidate-Elimination Algorithm
• The Candidate-Elimination algorithm computes the version space containing
all hypotheses from H that are consistent with an observed sequence of training
examples.
• It begins by initializing the version space to the set of all hypotheses in H; that is, by initializing the G boundary set to contain the most general hypothesis in H,
    G0 ← { <?, ?, ?, ?, ?, ?> }
and initializing the S boundary set to contain the most specific hypothesis,
    S0 ← { <Ø, Ø, Ø, Ø, Ø, Ø> }
• These two boundary sets delimit the entire hypothesis space, because every
other hypothesis in H is both more general than S0 and more specific than G0.
• As each training example is considered, the S and G boundary sets are
generalized and specialized, respectively, to eliminate from the version space
any hypotheses found inconsistent with the new training example.
• After all examples have been processed, the computed version space contains
all the hypotheses consistent with these examples and only these hypotheses.
Candidate-Elimination Algorithm
Initialize G to the set of most general hypotheses in H
Initialize S to the set of most specific hypotheses in H
For each training example, d, do:
    If d is a positive example then:
        Remove from G any hypothesis that is inconsistent with d
        For each hypothesis s in S that is inconsistent with d:
            Remove s from S
            Add to S all minimal generalizations, h, of s such that:
                1) h is consistent with d, and
                2) some member of G is more general than h
        Remove from S any h that is more general than another hypothesis in S
    If d is a negative example then:
        Remove from S any hypothesis that is inconsistent with d
        For each hypothesis g in G that is consistent with d:
            Remove g from G
            Add to G all minimal specializations, h, of g such that:
                1) h is inconsistent with d, and
                2) some member of S is more specific than h
        Remove from G any h that is more specific than another hypothesis in G
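A compact Python sketch of this algorithm for conjunctive feature vectors, building on the matches, more_general_or_equal, and strictly_more_general helpers above (minimal generalization and specialization here are the conjunctive-vector versions described on a later slide):

    def min_generalize(s, x):
        # the unique minimal generalization of s that covers positive instance x
        return tuple(v if c is None else (c if c == v else '?')
                     for c, v in zip(s, x))

    def min_specializations(g, x, domains):
        # minimal specializations of g that exclude negative instance x:
        # replace one '?' with any value that differs from x on that feature
        return [g[:i] + (v,) + g[i+1:]
                for i, c in enumerate(g) if c == '?'
                for v in domains[i] if v != x[i]]

    def candidate_elimination(domains, D):
        n = len(domains)
        G = {('?',) * n}                     # most general boundary
        S = {(None,) * n}                    # most specific boundary
        for x, label in D:
            if label:                        # positive example
                G = {g for g in G if matches(g, x)}
                new_S = set()
                for s in S:
                    s2 = s if matches(s, x) else min_generalize(s, x)
                    # keep s2 only if some member of G covers it
                    if any(more_general_or_equal(g, s2) for g in G):
                        new_S.add(s2)
                S = {s for s in new_S        # drop non-minimal members
                     if not any(strictly_more_general(s, t) for t in new_S)}
            else:                            # negative example
                S = {s for s in S if not matches(s, x)}
                new_G = set()
                for g in G:
                    if not matches(g, x):    # g already excludes x
                        new_G.add(g)
                        continue
                    for g2 in min_specializations(g, x, domains):
                        # keep g2 only if it still covers some member of S
                        if any(more_general_or_equal(g2, s) for s in S):
                            new_G.add(g2)
                G = {g for g in new_G        # drop non-maximal members
                     if not any(strictly_more_general(t, g) for t in new_G)}
            if not S or not G:
                return set(), set()          # no consistent hypothesis in H
        return S, G

On the four EnjoySport examples above, this should yield S = {<Sunny, Warm, ?, Strong, Warm, ?>} and G = {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}.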
Required Methods
• To instantiate the algorithm for a specific
hypothesis language requires the following
procedures:
– equal-hypotheses(h1, h2)
– more-general(h1, h2)
– consistent(h, i)
– initialize-g()
– initialize-s()
– generalize-to(h, i)
– specialize-against(h, i)
Minimal Specialization and Generalization
• The procedures generalize-to and specialize-against are specific to a hypothesis language and can be complex.
• For conjunctive feature vectors:
– generalize-to: unique; see FIND-S
– specialize-against: not unique; can convert each “?” to an alternative non-matching value for that feature
• Inputs:
– h = <?, red, ?>
– i = <small, red, triangle>
• Outputs:
– <big, red, ?>
– <medium, red, ?>
– <?, red, square>
– <?, red, circle>
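Running the min_specializations sketch from the Candidate-Elimination code above on this example reproduces the listed outputs (the feature domains are assumed from the values the slide mentions):

    domains = [['small', 'medium', 'big'],            # assumed Size domain
               ['red', 'blue'],                       # assumed Color domain
               ['triangle', 'square', 'circle']]      # assumed Shape domain
    h = ('?', 'red', '?')
    x = ('small', 'red', 'triangle')
    print(min_specializations(h, x, domains))
    # [('medium', 'red', '?'), ('big', 'red', '?'),
    #  ('?', 'red', 'square'), ('?', 'red', 'circle')]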
Example Version Space
• A version space with its general and specific boundary sets.
• The version space includes all six hypotheses shown here, but can be represented more simply by S and G.
Candidate-Elimination Algo. - Example
• S0 and G0 are the initial
boundary sets
corresponding to the most
specific and most general
hypotheses.
• Training examples 1
and 2 force the S
boundary to become
more general.
• They have no effect on
the G boundary.
Candidate-Elimination Algo. - Example
• Given that there are six attributes that could be specified to specialize G2, why are there only three new hypotheses in G3?
• For example, the hypothesis h = <?, ?, Normal, ?, ?, ?> is a minimal specialization of G2 that correctly labels the new example as negative, but it is not included in G3.
This hypothesis is excluded because it is inconsistent with S2. The algorithm determines this simply by noting that h is not more general than the current specific boundary, S2.
• In fact, the S boundary of the version space forms a summary of the previously encountered positive examples that can be used to determine whether any given hypothesis is consistent with these examples.
• The G boundary summarizes the information from previously encountered negative examples. Any hypothesis more specific than G is assured to be consistent with past negative examples.
Candidate-Elimination Algo. - Example
• The fourth training example further generalizes the S boundary.
• It also results in removing one member of the G boundary, because this member fails to cover the new positive example.
To understand the rationale for this step, it is useful to consider why the offending hypothesis must be removed from G.
Notice that it cannot be specialized, because specializing it would not make it cover the new example.
It also cannot be generalized, because by the definition of G, any more general hypothesis will cover at least one negative training example.
Therefore, the hypothesis must be dropped from the G boundary, thereby removing an entire branch of the partial ordering from the version space of hypotheses remaining under consideration.
Candidate-Elimination Algorithm – Example: Final Version Space
• After processing these four examples, the boundary sets
S4 and G4 delimit the version space of all hypotheses
consistent with the set of incrementally observed
training examples.
• This learned version space is independent of the
sequence in which the training examples are presented
(because in the end it contains all hypotheses consistent
with the set of examples).
• As further training data is encountered, the S and G boundaries will move monotonically closer to each other, delimiting a smaller and smaller version space of candidate hypotheses.
Properties of VS Algorithm
• S summarizes the relevant information in the positive examples (relative to H), so that positive examples do not need to be retained.
• G summarizes the relevant information in the negative examples, so that negative examples do not need to be retained.
• The result is not affected by the order in which examples are processed, but computational efficiency may be.
• Positive examples move the S boundary up; negative examples move the G boundary down.
• If S and G converge to the same hypothesis, then it is the only one in H that is consistent with the data.
• If S and G become empty (if one does, the other must also), then there is no hypothesis in H consistent with the data.
Correctness of Learning
• Since the entire version space is maintained, given a continuous stream of noise-free training examples, the VS algorithm will eventually converge to the correct target concept if it is in the hypothesis space, H, or eventually correctly determine that it is not in H.
• Convergence is correctly indicated when S = G.
Computational Complexity of VS
• Computing the S set for conjunctive feature
vectors is linear in the number of features and the
number of training examples.
• Computing the G set for conjunctive feature vectors is exponential in the number of training examples in the worst case.
• In more expressive languages, both S and G can grow exponentially.
• The order in which examples are processed can
significantly affect computational complexity.
Active Learning
• In active learning, the system is responsible for selecting good training examples and asking a teacher (oracle) to provide a class label.
• In sample selection, the system picks good examples to query from a provided pool of unlabeled examples.
• In query generation, the system must generate the description of an example for which to request a label.
• The goal is to minimize the number of queries required to learn an accurate concept description.
Active Learning with VS
• An ideal training example would eliminate half of the hypotheses in the current version space regardless of its label.
• If a training example matches half of the hypotheses in the version space, then the matching half is eliminated if the example is negative, and the other (non-matching) half is eliminated if the example is positive.
• Example:
– Assume the training set
  • Positive: <big, red, circle>
  • Negative: <small, red, triangle>
– Current version space:
  • {<big, red, circle>, <big, red, ?>, <big, ?, circle>, <?, red, circle>, <?, ?, circle>, <big, ?, ?>}
– An optimal query: <big, blue, circle>
• A ceiling of log2 |VS| such examples will result in convergence. This is the best possible guarantee in general. (See the query-scoring sketch below.)
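A sketch of scoring candidate queries against the current version space, reusing the matches helper (query_value and best_query are illustrative names):

    def query_value(x, VS):
        # fraction of VS eliminated whichever label comes back;
        # an ideal query matches exactly half of VS and scores 1/2
        pos = sum(matches(h, x) for h in VS)
        return min(pos, len(VS) - pos) / len(VS)

    def best_query(candidates, VS):
        # sample selection: pick the pool instance closest to a half split
        return max(candidates, key=lambda x: query_value(x, VS))

On the example above, the query <big, blue, circle> matches exactly 3 of the 6 hypotheses, so it scores the ideal 1/2.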
Using an Unconverged VS
• If the VS has not converged, how does it classify a novel test instance?
• If all elements of S match an instance, then the entire version space must match it (since every other member is more general), and it can be confidently classified as positive (assuming the target concept is in H).
• If no element of G matches an instance, then no element of the version space can match it (since every other member is more specific), and it can be confidently classified as negative (assuming the target concept is in H).
• Otherwise, one could vote all of the hypotheses in the VS (or just the G and S sets, to avoid enumerating the VS) to give a classification with an associated confidence value.
• Voting the entire VS is probabilistically optimal assuming the target concept is in H and all hypotheses in H are equally likely a priori.
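A sketch of this voting scheme over an explicitly enumerated version space; with threshold=1.0 it reproduces the refuse-unless-unanimous behavior:

    def classify(x, VS, threshold=1.0):
        # returns (label, confidence); label is None when the vote
        # falls below the requested confidence threshold
        frac = sum(matches(h, x) for h in VS) / len(VS)
        if frac >= threshold:
            return True, frac
        if 1 - frac >= threshold:
            return False, 1 - frac
        return None, max(frac, 1 - frac)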
Inductive Bias
• A hypothesis space that does not include all possible classification functions on the instance space incorporates a bias in the type of classifiers it can learn.
• Any means that a learning system uses to choose between two functions that are both consistent with the training data is called inductive bias.
• Inductive bias can take two forms:
– Language bias: the language for representing concepts defines a hypothesis space that does not include all possible functions (e.g., conjunctive descriptions).
– Search bias: the language is expressive enough to represent all possible functions (e.g., disjunctive normal form), but the search algorithm embodies a preference for certain consistent functions over others (e.g., syntactic simplicity).
Unbiased Learning
• For instances described by n features, each with m values, there are m^n instances. If these are to be classified into c categories, then there are c^(m^n) possible classification functions.
– For n=10, m=c=2, there are 2^1024 ≈ 1.8×10^308 possible functions, of which only 3^10 = 59,049 can be represented as conjunctions (an incredibly small percentage!).
• However, unbiased learning is futile, since if we consider all possible functions then simply memorizing the data without any real generalization is as good an option as any.
• Without bias, the version space is always trivial: the unique most specific hypothesis is the disjunction of the positive instances, and the unique most general hypothesis is the negation of the disjunction of the negative instances:
    S = { (p1 ∨ p2 ∨ … ∨ pk) }
    G = { ¬(n1 ∨ n2 ∨ … ∨ nj) }
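The counts quoted above, checked in Python (exponentiation is right-associative, so c ** m ** n is c^(m^n)):

    n, m, c = 10, 2, 2
    instances   = m ** n         # 1024 distinct instances
    functions   = c ** m ** n    # 2**1024, about 1.8e308 classification functions
    conjunctive = 3 ** n         # 59,049 representable as conjunctions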
Futility of Bias-Free Learning
• A learner that makes no a priori assumptions about the target concept has no rational basis for classifying any unseen instances.
• Inductive bias can also be defined as the assumptions that, when combined with the observed training data, logically entail the subsequent classification of unseen instances.
– (Training data + inductive bias) ⊢ novel classifications
• The bias of the VS algorithm (assuming it refuses to classify an instance unless it is classified the same by all members of the VS) is simply that H contains the target concept.
• The rote learner, which refuses to classify any instance unless it has seen it during training, is the least biased.
• Learners can be partially ordered by their amount of bias:
– Rote learner < VS algorithm < FIND-S
Inductive Bias – Three Learning Algorithms
ROTE-LEARNER: Learning corresponds simply to storing each observed training example in memory. Subsequent instances are classified by looking them up in memory. If the instance is found in memory, the stored classification is returned. Otherwise, the system refuses to classify the new instance.
Inductive bias: no inductive bias.
CANDIDATE-ELIMINATION: New instances are classified only in the case where all members of the current version space agree on the classification. Otherwise, the system refuses to classify the new instance.
Inductive bias: the target concept can be represented in its hypothesis space.
FIND-S: This algorithm finds the most specific hypothesis consistent with the training examples. It then uses this hypothesis to classify all subsequent instances.
Inductive bias: the target concept can be represented in its hypothesis space, and all instances are negative unless the opposite is entailed by its other knowledge.
Concept Learning - Summary
• Concept learning can be seen as a problem of searching through a
large predefined space of potential hypotheses.
• The general-to-specific partial ordering of hypotheses provides a
useful structure for organizing the search through the hypothesis
space.
• The FIND-S algorithm utilizes this general-to-specific ordering,
performing a specific-to-general search through the hypothesis space
along one branch of the partial ordering, to find the most specific
hypothesis consistent with the training examples.
• The CANDIDATE-ELIMINATION algorithm utilizes this general-
to-specific ordering to compute the version space (the set of all
hypotheses consistent with the training data) by incrementally
computing the sets of maximally specific (S) and maximally general
(G) hypotheses.
Concept Learning - Summary
• Because the S and G sets delimit the entire set of hypotheses
consistent with the data, they provide the learner with a
description of its uncertainty regarding the exact identity of the
target concept. This version space of alternative hypotheses can
be examined
– to determine whether the learner has converged to the target
concept,
– to determine when the training data are inconsistent,
– to generate informative queries to further refine the VS, and
– to determine which unseen instances can be unambiguously
classified based on the partially learned concept.
• The CANDIDATE-ELIMINATION algorithm is not robust to noisy data or to situations in which the unknown target concept is not expressible in the provided hypothesis space.
Concept Learning - Summary
• Inductive learning algorithms are able to classify unseen
examples only because of their implicit inductive bias
for selecting one consistent hypothesis over another.
• If the hypothesis space is enriched to the point where
there is a hypothesis corresponding to every possible
subset of instances (the power set of the instances), this
will remove any inductive bias from the CANDIDATE-ELIMINATION algorithm.
– Unfortunately, this also removes the ability to classify
any instance beyond the observed training examples.
– An unbiased learner cannot make inductive leaps to
classify unseen examples.