Introduction to machine learning
Introduction to Machine Learning
Reminders
• HW6 posted
• All HW solutions posted
A Review
• You will be asked to learn the parameters of a probabilistic model of C given A,B, using some data
• The parameters model entries of a conditional probability distribution as coin flips
            A,B   A,¬B   ¬A,B   ¬A,¬B
P(C|A,B)    θ1    θ2     θ3     θ4
A Review
• Example data
• ML estimates
            A,B      A,¬B      ¬A,B      ¬A,¬B
P(C|A,B)    θ1=3/8   θ2=1/11   θ3=4/11   θ4=6/11
A B C # obs
1 1 1 3
1 1 0 5
1 0 1 1
1 0 0 10
0 1 1 4
0 1 0 7
0 0 1 6
0 0 0 5
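The ML computation on the counts above can be sketched in Python (an illustrative sketch; the dictionary below is just the table re-keyed as (A, B, C) → count):

```python
# Counts from the table above, keyed by (A, B, C).
obs = {
    (1, 1, 1): 3, (1, 1, 0): 5,
    (1, 0, 1): 1, (1, 0, 0): 10,
    (0, 1, 1): 4, (0, 1, 0): 7,
    (0, 0, 1): 6, (0, 0, 0): 5,
}

def ml_estimate(obs):
    """ML estimate: P(C=1 | A=a, B=b) = #(a, b, C=1) / #(a, b)."""
    return {
        (a, b): obs[(a, b, 1)] / (obs[(a, b, 1)] + obs[(a, b, 0)])
        for a in (1, 0) for b in (1, 0)
    }

print(ml_estimate(obs))
# θ1 = 3/8, θ2 = 1/11, θ3 = 4/11, θ4 = 6/11
```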
Maximum Likelihood for BN
• For any BN, the ML parameters of any CPT are obtained as the fractions of matching observations in the data
Earthquake → Alarm ← Burglar
Observed counts, N = 1000: E: 500, B: 200
P(E) = 0.5, P(B) = 0.2
A|E,B: 19/20   A|¬E,B: 188/200   A|E,¬B: 170/500   A|¬E,¬B: 1/380
E B P(A|E,B)
T T 0.95
F T 0.95
T F 0.34
F F 0.003
Maximum A Posteriori
• Example data
• MAP estimates, assuming a Beta prior with a = b = 3
• virtual counts: 2 heads, 2 tails
            A,B       A,¬B      ¬A,B      ¬A,¬B
P(C|A,B)    θ1=5/12   θ2=3/15   θ3=6/15   θ4=8/15
A B C # obs
1 1 1 3
1 1 0 5
1 0 1 1
1 0 0 10
0 1 1 4
0 1 0 7
0 0 1 6
0 0 0 5
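The MAP computation can be sketched the same way (illustrative Python; note that with virtual counts 2H, 2T the first entry comes out to (3+2)/(8+4) = 5/12):

```python
# Counts from the table above, keyed by (A, B, C).
obs = {
    (1, 1, 1): 3, (1, 1, 0): 5,
    (1, 0, 1): 1, (1, 0, 0): 10,
    (0, 1, 1): 4, (0, 1, 0): 7,
    (0, 0, 1): 6, (0, 0, 0): 5,
}

def map_estimate(obs, alpha=3, beta=3):
    """MAP with a Beta(alpha, beta) prior: add alpha-1 virtual heads
    and beta-1 virtual tails to each conditional count."""
    return {
        (a, b): (obs[(a, b, 1)] + alpha - 1)
                / (obs[(a, b, 1)] + obs[(a, b, 0)] + alpha + beta - 2)
        for a in (1, 0) for b in (1, 0)
    }

print(map_estimate(obs))
# θ1 = 5/12, θ2 = 3/15, θ3 = 6/15, θ4 = 8/15
```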
Discussion
• Where are we now?
• How does it relate to the future of AI?
• How will you be tested on it?
Motivation
• Agent has made observations (data)
• Now must make sense of it (hypotheses)
  – Hypotheses alone may be important (e.g., in basic science)
  – For inference (e.g., forecasting)
  – To take sensible actions (decision making)
• A basic component of economics, social and hard sciences, engineering, …
Topics in Machine Learning
Applications: document retrieval, document classification, data mining, computer vision, scientific discovery, robotics, …
Tasks & settings: classification, ranking, clustering, regression, decision-making
Settings: supervised, unsupervised, semi-supervised, active, reinforcement learning
Techniques: Bayesian learning, decision trees, neural networks, support vector machines, boosting, case-based reasoning, dimensionality reduction, …
What is Learning?
Mostly generalization from experience: “Our experience of the world is specific, yet we are able to formulate general theories that account for the past and predict the future” (M.R. Genesereth and N.J. Nilsson, in Logical Foundations of AI, 1987)
Concepts, heuristics, policies; supervised vs. unsupervised learning
Inductive Learning
• Basic form: learn a function from examples
• f is the unknown target function
• An example is a pair (x, f(x))
• Problem: find a hypothesis h
  – such that h ≈ f
  – given a training set of examples D
• Instance of supervised learning
  – Classification task: f → {0, 1, …, C}
  – Regression task: f → reals
Inductive learning method
• Construct/adjust h to agree with f on the training set
• (h is consistent if it agrees with f on all examples)
• E.g., curve fitting (figure of candidate fits omitted)
h=D is a trivial, but perhaps uninteresting solution (caching)
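The curve-fitting progression can be sketched numerically (an illustrative sketch assuming numpy is available; the data and degrees are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = 2.0 * x + rng.normal(0.0, 0.1, size=x.size)   # noisy samples of f(x) = 2x

h_simple = np.polyfit(x, y, deg=1)   # low capacity: a line, 2 parameters
h_interp = np.polyfit(x, y, deg=7)   # consistent: agrees with all 8 points

# The degree-7 hypothesis has essentially zero training error...
train_err = float(np.abs(np.polyval(h_interp, x) - y).max())
# ...but it has merely cached the data (h = D): between training points it
# oscillates, while the simple line captures the underlying trend.
```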
Classification Task
The concept f(x) takes on values True and False
An example is positive if f is True, else it is negative
The set X of all examples is the example set
The training set is a subset of X (a small one!)
Logic-Based Inductive Learning
• Here, examples (x, f(x)) take on discrete values
Concept
Note that the training set does not say whether an observable predicate is pertinent or not
Rewarded Card Example
Deck of cards, with each card designated by [r,s], its rank and suit, and some cards “rewarded”
Background knowledge KB:
  ((r=1) ∨ … ∨ (r=10)) ⇒ NUM(r)
  ((r=J) ∨ (r=Q) ∨ (r=K)) ⇒ FACE(r)
  ((s=S) ∨ (s=C)) ⇒ BLACK(s)
  ((s=D) ∨ (s=H)) ⇒ RED(s)
Training set D:
  REWARD([4,C]) REWARD([7,C]) REWARD([2,S]) ¬REWARD([5,H]) ¬REWARD([J,S])
Possible inductive hypothesis: h ≡ (NUM(r) ∧ BLACK(s) ⇔ REWARD([r,s]))
There are several possible inductive hypotheses
Learning a Logical Predicate (Concept Classifier)
Set E of objects (e.g., cards)
Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD)
Example: CONCEPT describes the precondition of an action, e.g., Unstack(C,A)
• E is the set of states
• CONCEPT(x) ⇔ HANDEMPTY ∈ x ∧ BLOCK(C) ∈ x ∧ BLOCK(A) ∈ x ∧ CLEAR(C) ∈ x ∧ ON(C,A) ∈ x
Learning CONCEPT is a step toward learning an action description
Learning a Logical Predicate (Concept Classifier)
Set E of objects (e.g., cards)
Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD)
Observable predicates A(x), B(x), … (e.g., NUM, RED)
Training set: values of CONCEPT for some combinations of values of the observable predicates
Find a representation of CONCEPT in the form:
  CONCEPT(x) ⇔ S(A,B,…)
where S(A,B,…) is a sentence built with the observable predicates, e.g.:
  CONCEPT(x) ⇔ A(x) ∧ (B(x) ∨ C(x))
A hypothesis is any sentence of the form CONCEPT(x) ⇔ S(A,B,…), where S(A,B,…) is a sentence built using the observable predicates
The set of all hypotheses is called the hypothesis space H
A hypothesis h agrees with an example if it gives the correct value of CONCEPT
Hypothesis Space
(figure: positive (+) and negative (−) examples scattered over the example set)
Inductive Learning Scheme
(diagram: Example set X {[A, B, …, CONCEPT]} → Training set D → Learner, which searches the hypothesis space H {[CONCEPT(x) ⇔ S(A,B,…)]} → Inductive hypothesis h)
Size of Hypothesis Space
• n observable predicates ⇒ 2^n entries in the truth table defining CONCEPT, and each entry can be filled with True or False
• In the absence of any restriction (bias), there are 2^(2^n) hypotheses to choose from
• n = 6 ⇒ ≈2×10^19 hypotheses!
Multiple Inductive Hypotheses
Deck of cards, with each card designated by [r,s], its rank and suit, and some cards “rewarded”
Background knowledge KB:
  ((r=1) ∨ … ∨ (r=10)) ⇒ NUM(r)
  ((r=J) ∨ (r=Q) ∨ (r=K)) ⇒ FACE(r)
  ((s=S) ∨ (s=C)) ⇒ BLACK(s)
  ((s=D) ∨ (s=H)) ⇒ RED(s)
Training set D:
  REWARD([4,C]) REWARD([7,C]) REWARD([2,S]) ¬REWARD([5,H]) ¬REWARD([J,S])
h1 ≡ NUM(r) ∧ BLACK(s) ⇔ REWARD([r,s])
h2 ≡ BLACK(s) ∧ ¬(r=J) ⇔ REWARD([r,s])
h3 ≡ ([r,s]=[4,C]) ∨ ([r,s]=[7,C]) ∨ ([r,s]=[2,S]) ⇔ REWARD([r,s])
h4 ≡ ¬([r,s]=[5,H]) ∧ ¬([r,s]=[J,S]) ⇔ REWARD([r,s])
All four hypotheses agree with all the examples in the training set
Need for a system of preferences – called an inductive bias – to compare possible hypotheses
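That several hypotheses fit the same data can be checked mechanically (an illustrative Python sketch; the set-based encodings of NUM, BLACK, and the hypotheses are assumptions of this sketch):

```python
# Hypothetical encoding of the card domain from the slides.
NUM = set(range(1, 11))        # ranks 1..10 (J, Q, K are face cards)
BLACK = {"S", "C"}             # spades and clubs

# Training set D: ((rank, suit), REWARD?)
examples = [
    ((4, "C"), True), ((7, "C"), True), ((2, "S"), True),
    ((5, "H"), False), (("J", "S"), False),
]

h1 = lambda r, s: r in NUM and s in BLACK
h2 = lambda r, s: s in BLACK and r != "J"
h3 = lambda r, s: (r, s) in {(4, "C"), (7, "C"), (2, "S")}
h4 = lambda r, s: (r, s) not in {(5, "H"), ("J", "S")}

# All four distinct hypotheses classify every training example correctly,
# so the data alone cannot choose among them.
for h in (h1, h2, h3, h4):
    assert all(h(r, s) == reward for (r, s), reward in examples)
print("all four hypotheses are consistent with the training set")
```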
Notion of Capacity
• It refers to the ability of a machine to learn any training set without error
• A machine with too much capacity is like a botanist with photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything he has seen before
• A machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree
• Good generalization can only be achieved when the right balance is struck between the accuracy attained on the training set and the capacity of the machine
Keep-It-Simple (KIS) Bias
Examples
• Use many fewer observable predicates than the training set
• Constrain the learnt predicate, e.g., to use only “high-level” observable predicates such as NUM, FACE, BLACK, and RED and/or to have simple syntax
Motivation
• If a hypothesis is too complex it is not worth learning it (data caching does the job as well)
• There are many fewer simple hypotheses than complex ones, hence the hypothesis space is smaller
Einstein: “A theory must be as simple as possible, but not simpler than this”
If the bias allows only sentences S that are conjunctions of k << n predicates picked from the n observable predicates, then the size of H is O(n^k)
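A quick computation contrasts the two space sizes (a sketch; the extra factor 2^k, allowing each chosen predicate to appear negated, is an assumption beyond the slide's O(n^k) statement):

```python
from math import comb

n, k = 6, 2

# Unrestricted: every boolean function of n predicates is a hypothesis.
unrestricted = 2 ** (2 ** n)          # 2^(2^6) = 2^64, about 1.8e19

# KIS bias: conjunctions of k of the n predicates, each plain or negated.
conjunctions = comb(n, k) * 2 ** k    # 15 * 4 = 60, i.e. O(n^k)

print(unrestricted)   # 18446744073709551616
print(conjunctions)   # 60
```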
Relation to Bayesian Learning
• Recall MAP learning:
  θ_MAP = argmax_θ P(D|θ) P(θ)
  – P(D|θ) is the likelihood ~ accuracy on the training set
  – P(θ) is the prior ~ inductive bias
• Equivalently:
  θ_MAP = argmin_θ [−log P(D|θ) − log P(θ)]
  where −log P(D|θ) measures accuracy on the training set and −log P(θ) is the minimum length encoding of θ
• MAP learning is a form of KIS bias
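The equivalence of the argmax and argmin formulations can be illustrated on a toy coin-flip example (a sketch; the counts and the Beta(3,3) prior here are chosen for illustration, not taken from the slides):

```python
from math import log

heads, tails = 3, 5                       # observed coin flips
grid = [i / 100 for i in range(1, 100)]   # candidate values of θ in (0, 1)

def posterior(t):
    """Unnormalized P(D|θ) P(θ) for a Bernoulli θ with a Beta(3,3) prior."""
    likelihood = t ** heads * (1 - t) ** tails
    prior = t ** 2 * (1 - t) ** 2         # Beta(3,3), up to a constant
    return likelihood * prior

# argmax of the product and argmin of the negative log pick the same θ.
t_max = max(grid, key=posterior)
t_min = min(grid, key=lambda t: -log(posterior(t)))
assert t_max == t_min
print(t_max)   # the MAP estimate, 5/12 ≈ 0.42
```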
Supervised Learning Flow Chart
(flow chart)
• Target concept: the unknown concept we want to approximate
• Training set: data points, the observations we have seen
• Learner: the choice of learning algorithm, which searches a hypothesis space
• Inductive hypothesis: the learner’s output, used to make predictions
• Test set: observations we will see in the future (better quantities to assess performance)
Capacity is Not the Only Criterion
• Accuracy on training set isn’t the best measure of performance
(figure: a hypothesis is learned from the training set D, then tested on the rest of the example set X)
Generalization Error
• A hypothesis h is said to generalize well if it achieves low error on all examples in X
(figure: h is learned from the training examples and then tested on all of the example set X)
Assessing Performance of a Learning Algorithm
• Samples from X are typically unavailable
• Take out some of the training set
  – Train on the remaining training set
  – Test on the excluded instances
  – Cross-validation
Cross-Validation
• Split the original set of examples and train on one part
(figure: the examples D are split; the hypothesis is trained on the training portion)
Cross-Validation
• Evaluate hypothesis on testing set
(figure: the trained hypothesis is applied to the held-out testing set)
Cross-Validation
• Compare true concept against prediction
(figure: predictions on the testing set compared against the true labels: 9/13 correct)
Common Splitting Strategies
• k-fold cross-validation
• Leave-one-out
(diagram: the dataset split into train and test portions)
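The k-fold splitting itself can be sketched as index bookkeeping (an illustrative implementation, not tied to any particular library):

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) pairs; each example is tested once."""
    indices = list(range(n_examples))
    fold_size, remainder = divmod(n_examples, k)
    start = 0
    for fold in range(k):
        # Earlier folds absorb the remainder when k does not divide n_examples.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop

for train, test in k_fold_splits(10, 5):
    print(len(train), len(test))   # 8 2 on every fold

# Leave-one-out is the special case k = n_examples.
```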
Cardinal sins of machine learning
• Thou shalt not:
  – Train on examples in the testing set
  – Form assumptions by “peeking” at the testing set, then formulating inductive bias
Tennis Example
• Evaluate learning algorithm: PlayTennis = S(Temperature, Wind)
Tennis Example
• Evaluate learning algorithm: PlayTennis = S(Temperature, Wind)
Trained hypothesis
PlayTennis = (T=Mild or Cool) ∧ (W=Weak)
Training errors = 3/10
Testing errors = 4/4
Tennis Example
• Evaluate learning algorithm: PlayTennis = S(Temperature, Wind)
Trained hypothesis
PlayTennis = (T=Mild or Cool)
Training errors = 3/10
Testing errors = 1/4
Tennis Example
• Evaluate learning algorithm: PlayTennis = S(Temperature, Wind)
Trained hypothesis
PlayTennis = (T=Mild or Cool)
Training errors = 3/10
Testing errors = 2/4
Tennis Example
• Evaluate learning algorithm: PlayTennis = S(Temperature, Wind)
Sum of all testing errors: 7/12
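The total above is just the per-fold testing errors pooled, as a quick sketch (fold numbers taken from the three evaluation slides above):

```python
# Per-fold testing errors from the three evaluation slides: 4/4, 1/4, 2/4.
fold_errors = [(4, 4), (1, 4), (2, 4)]        # (errors, test-set size)

errors = sum(e for e, _ in fold_errors)       # 7
total = sum(n for _, n in fold_errors)        # 12
cv_error_rate = errors / total                # 7/12, about 0.58

print(f"{errors}/{total} testing errors, rate {cv_error_rate:.2f}")
```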
How to construct a better learner?
• Ideas?
Next Time
• Decision tree learning