Introduction to Machine Learning.


Page 1: Introduction to Machine Learning.

A Brief Survey of Machine Learning

Used materials from: William H. Hsu, Linda Jackson, Lex Lane, Tom Mitchell (Machine Learning, McGraw-Hill, 1997), Allan Moser, Tim Finin, Marie desJardins, and Chuck Dyer.

Page 2: Introduction to Machine Learning.

ML Lectures Outline: What Will We Discuss?

• Why machine learning?

• Brief Tour of Machine Learning

– A case study

– A taxonomy of learning

– Intelligent systems engineering: specification of learning problems

• Issues in Machine Learning

– Design choices

– The performance element: intelligent systems

• Some Applications of Learning

– Database mining, reasoning (inference/decision support), acting

– Industrial usage of intelligent systems

– Robotics

Page 3: Introduction to Machine Learning.

What is Learning?

• “Learning denotes changes in a system that ... enable a system to do the same task more efficiently the next time.” -- Herbert Simon

• “Learning is constructing or modifying representations of what is being experienced.” -- Ryszard Michalski

• “Learning is making useful changes in our minds.” -- Marvin Minsky


Page 4: Introduction to Machine Learning.

Why Machine Learning? What Can ML Do?

• Discovers new things or structures that are unknown to humans

– Examples: data mining, knowledge discovery in databases (KDD)

• Fills in skeletal or incomplete specifications about a domain

– Large, complex AI systems cannot be completely derived by hand

– They require dynamic updating to incorporate new information.

– Learning new characteristics:

– 1. expands the domain or expertise

– 2. lessens the "brittleness" of the system

• Using learning, software agents can adapt:

– to their users,

– to other software agents,

– to the changing environment.

Page 5: Introduction to Machine Learning.

Why Machine Learning?

• New Computational Capability

– Database mining:

– converting (technical) records into knowledge

– Self-customizing programs:

– learning news filters,

– adaptive monitors

– Learning to act:

– robot planning,

– control optimization,

– decision support

– Applications that are hard to program:

– automated driving,

– speech recognition

Page 6: Introduction to Machine Learning.

Why Machine Learning?

• Better Understanding of Human Learning and Teaching

– Understand and improve efficiency of human learning

– Use to improve methods for teaching and tutoring people

– e.g., better computer-aided instruction (can our robot-head teach English?)

– Cognitive science: theories of knowledge acquisition (e.g., through practice)

– Performance elements: reasoning (inference) and recommender systems

• Time is Right

– Recent progress in algorithms and theory

– Rapidly growing volume of online data from various sources

– Available computational power

– Growth and interest of learning-based industries (e.g., data mining/KDD)

Page 7: Introduction to Machine Learning.

A General Model of Learning Agents

Page 8: Introduction to Machine Learning.

Three Aspects of Learning Systems

– 1. Models:

– decision trees,

– linear threshold units (winnow, weighted majority),

– neural networks,

– Bayesian networks (polytrees, belief networks, influence diagrams, HMMs),

– genetic algorithms,

– instance-based (nearest-neighbor)

– 2. Algorithms (e.g., for decision trees):

– ID3,

– C4.5,

– CART,

– OC1

– 3. Methodologies:

– supervised,

– unsupervised,

– reinforcement;

– knowledge-guided

Page 9: Introduction to Machine Learning.

What Are the Aspects of Research on Learning?

• 1. Theory of Learning

– Computational learning theory (COLT): complexity, limitations of learning

– Probably Approximately Correct (PAC) learning

– Probabilistic, statistical, information theoretic results

• 2. Multistrategy Learning:

– Combining Techniques,

– Knowledge Sources

• 3. Create and collect Data:

– Time Series,

– Very Large Databases (VLDB),

– Text Corpora

• 4. Select good applications

– Performance element:

– classification,

– decision support,

– planning,

– control

– Database mining and knowledge discovery in databases (KDD)

– Computer inference: learning to reason

Page 10: Introduction to Machine Learning.

Some Issues in Machine Learning

• What Algorithms Can Approximate Functions Well? When?

• How Do Learning System Design Factors Influence Accuracy?

– Number of training examples

– Complexity of hypothesis representation

• How Do Learning Problem Characteristics Influence Accuracy?

– Noisy data

– Multiple data sources

• What Are The Theoretical Limits of Learnability?

• How Can Prior Knowledge of Learner Help?

• What Clues Can We Get From Biological Learning Systems?

• How Can Systems Alter Their Own Representation?

Page 11: Introduction to Machine Learning.

Major Paradigms of Machine Learning

• Rote Learning

– One-to-one mapping from inputs to stored representation.

– "Learning by memorization."

– Association-based storage and retrieval.

• Clustering

• Analogy

– Determine correspondence between two different representations

• Induction

– Use specific examples to reach general conclusions

• Discovery

– Unsupervised, specific goal not given

• Genetic Algorithms

Page 12: Introduction to Machine Learning.

Major Paradigms of Machine Learning

• Neural Networks

• Reinforcement

– Feedback given at end of a sequence of steps.

– Feedback can be positive or negative reward.

– Assign reward to steps by solving the credit assignment problem: which steps should receive credit or blame for a final result?

Page 13: Introduction to Machine Learning.

The Inductive Learning Problem

• Induce rules that extrapolate from a given set of examples

– These rules should make “accurate” predictions about future examples.

• Supervised versus Unsupervised learning

– Learn an unknown function f(X) = Y, where:

– X is an input example and – Y is the desired output.

– Supervised learning implies we are given a training set of (X, Y) pairs by a "teacher."

– Unsupervised learning means we are only given the Xs and some (ultimate) feedback function on our performance.

Page 14: Introduction to Machine Learning.

The Inductive Learning Problem

• Concept learning

– Also called classification

– Given a set of examples of some concept/class/category, determine if a given example is an instance of the concept or not.

– If it is an instance, we call it a positive example.

– If it is not, it is called a negative example.

Page 15: Introduction to Machine Learning.

Supervised Concept Learning

• Given a training set of positive and negative examples of a concept

– Usually each example has a set of features/attributes

• Construct a description that will accurately classify whether future examples are positive or negative.

• That is, – learn some good estimate of function f

– given a training set {(x1, y1), (x2, y2), ..., (xn, yn)}

– where each yi is either + (positive) or - (negative).

– f is a function of the features/attributes

Page 16: Introduction to Machine Learning.

Inductive Learning Framework

• Raw input data from sensors are preprocessed to obtain a feature vector, X, that adequately describes all of the relevant features for classifying examples.

• Each X is a list of (attribute, value) pairs. For example,

X = [Person:Sue, EyeColor:Brown, Age:Young, Sex:Female]

• The number and names of attributes (aka features) are fixed (positive, finite).

• Each attribute has a fixed, finite number of possible values.

• Each example can be interpreted as a point in an n-dimensional feature space, where n is the number of attributes.
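To make this concrete, here is a minimal Python sketch of one way to represent such examples and a fixed attribute schema; the attribute value sets shown are assumptions for illustration, not part of the original slide.

# A fixed, finite schema: each attribute has a fixed set of possible values.
SCHEMA = {
    "EyeColor": {"Brown", "Blue", "Green"},   # assumed value sets, for illustration
    "Age": {"Young", "Old"},
    "Sex": {"Male", "Female"},
}

# An example is a list of (attribute, value) pairs, i.e. a point in
# an n-dimensional feature space (n = number of attributes).
x = [("Person", "Sue"), ("EyeColor", "Brown"), ("Age", "Young"), ("Sex", "Female")]

def is_valid(example, schema):
    """Check that every schema attribute appears with an allowed value."""
    values = dict(example)
    return all(attr in values and values[attr] in allowed
               for attr, allowed in schema.items())

print(is_valid(x, SCHEMA))  # True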

Page 17: Introduction to Machine Learning.

Inductive Learning by Nearest-Neighbor Classification

• One simple approach to inductive learning is to save each training example as a point in feature space

• Classify a new example by giving it the same classification (+ or -) as its nearest neighbor in feature space.

– 1. A variation involves computing a weighted sum of the classes of a set of neighbors, where the weights correspond to distances

– 2. Another variation uses the center of each class

• The problem with this approach is that it doesn't necessarily generalize well if the examples are not well "clustered."
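A minimal Python sketch of the basic approach (the numeric feature encoding and Euclidean distance are assumptions for illustration; real systems would also normalize features and handle ties):

# Nearest-neighbor classification: store every training example as a point,
# then classify a new point by copying the label of its closest stored neighbor.
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_neighbor_classify(training_set, x):
    """training_set: list of (feature_vector, label) pairs with labels '+' or '-'."""
    _, label = min(training_set, key=lambda pair: euclidean(pair[0], x))
    return label

train = [((1.0, 2.0), "+"), ((4.0, 5.0), "-"), ((1.5, 1.8), "+")]
print(nearest_neighbor_classify(train, (1.2, 2.1)))  # '+'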

Page 18: Introduction to Machine Learning.

Learning Decision Trees

• Goal: Build a decision tree for classifying examples as positive or negative instances of a concept using supervised learning from a training set.

• A decision tree is a tree where

– each non-leaf node is associated with an attribute (feature)

– each leaf node is associated with a classification (+ or -)

– each arc is associated with one of the possible values of the attribute at the node from which the arc is directed.

• Generalization: allow for more than 2 classes

– e.g., {sell, hold, buy}

[Figure: an example decision tree. The root tests Color (red, green, blue); lower nodes test Shape (round, square) and Size (big, small); each leaf is a classification, + or -.]
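As a sketch of the underlying data structure, a decision tree can be represented as nested tuples and dictionaries; the tree below uses the same attributes as the figure but is a hypothetical stand-in, not a reconstruction of it.

# A decision tree: interior nodes are (attribute, {value: subtree}) pairs;
# leaves are classifications ('+' or '-').
tree = ("Color", {
    "green": "+",
    "blue": ("Shape", {"round": "+", "square": "-"}),
    "red": ("Size", {"big": "-", "small": "+"}),
})

def classify(node, example):
    """Walk from the root, following the arc labeled with the example's value."""
    if isinstance(node, str):          # leaf: a classification
        return node
    attribute, branches = node
    return classify(branches[example[attribute]], example)

print(classify(tree, {"Color": "red", "Size": "small"}))  # '+'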

Page 19: Introduction to Machine Learning.

Preference Bias: Ockham's Razor

• Aka Occam's Razor, Law of Economy, or Law of Parsimony

• Principle stated by William of Ockham (1285-1347/49), a scholastic:

– "non sunt multiplicanda entia praeter necessitatem"

– or, entities are not to be multiplied beyond necessity.

• The simplest explanation that is consistent with all observations is the best.

• Therefore, the smallest decision tree that correctly classifies all of the training examples is best.

• Finding the provably smallest decision tree is NP-Hard

• Therefore we do not construct the absolute smallest tree consistent with the training examples.

• We construct a tree that is pretty small.

Page 20: Introduction to Machine Learning.

Inductive Learning and Bias

• Suppose that we want to learn a function f(x) = y and we are given some sample (x,y) pairs, as in figure (a).

• There are several hypotheses we could make about this function, e.g.: (b), (c) and (d).

• A preference for one over the others reveals the bias of our learning technique, e.g.:

– prefer piece-wise functions

– prefer a smooth function

– prefer a simple function and treat outliers as noise

Page 21: Introduction to Machine Learning.

Example of Using Probabilities to Create Trees: Huffman Code

• In 1952 MIT student David Huffman devised, in the course of doing a homework assignment, an elegant coding scheme

• This scheme is optimal in the case where all symbols’ probabilities are integral powers of 1/2.

• A Huffman code can be built in the following manner:

– 1. Rank all symbols in order of probability of occurrence.

– 2. Successively combine the two symbols of lowest probability to form a new composite symbol; eventually we will build a binary tree where each node is the probability of all nodes beneath it.

– 3. Trace a path to each leaf, noticing the direction at each node.
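A minimal Python sketch of this construction, applied to the four-symbol example on the next slide (the tie-breaking counter is an implementation detail, not part of the algorithm as stated):

import heapq

def huffman_code(probabilities):
    """Build a Huffman code from {symbol: probability}; returns {symbol: bitstring}."""
    # Step 1: rank all symbols by probability (a min-heap keeps the two
    # lowest-probability entries at hand; the counter breaks ties).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    count = len(heap)
    # Step 2: repeatedly merge the two lowest-probability subtrees.
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)
        p2, _, codes2 = heapq.heappop(heap)
        # Step 3 (implicit): record the direction taken at each node by
        # prefixing '0' on one side and '1' on the other.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"A": 0.125, "B": 0.125, "C": 0.25, "D": 0.5}
code = huffman_code(probs)
print(code)  # code lengths match the next slide (A:3, B:3, C:2, D:1); exact bits may differ
print(sum(probs[s] * len(c) for s, c in code.items()))  # average length: 1.75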

Page 22: Introduction to Machine Learning.

Huffman Code Example as a Prototypical Idea from Another Area

Message   Probability
A         .125
B         .125
C         .25
D         .5

[Figure: the corresponding Huffman tree. The root (probability 1) branches on 1 to leaf D (.5) and on 0 to an internal node (.5); that node branches on 1 to leaf C (.25) and on 0 to an internal node (.25), which branches on 0 to leaf A (.125) and on 1 to leaf B (.125).]

M    code   length   prob    length * prob
A    000    3        0.125   0.375
B    001    3        0.125   0.375
C    01     2        0.250   0.500
D    1      1        0.500   0.500

average message length: 1.750

If we need to send many messages (A, B, C, or D) with this probability distribution and we use this code, then over time the average bits/message should approach 1.75 (= 0.125*3 + 0.125*3 + 0.25*2 + 0.5*1).

Page 23: Introduction to Machine Learning.

• If a set T of records is partitioned into disjoint exhaustive classes (C1, C2, ..., Ck) on the basis of the value of a categorical attribute, then the information needed to identify the class of an element of T is Info(T) = I(P),

where P is the probability distribution of the partition (C1, C2, ..., Ck):

P = (|C1|/|T|, |C2|/|T|, ..., |Ck|/|T|)

and I(P) is its entropy: I(P) = -(p1*log2(p1) + p2*log2(p2) + ... + pk*log2(pk))

• If we partition T w.r.t. attribute X into sets {T1, T2, ..., Tn}, then the information needed to identify the class of an element of T becomes the weighted average of the information needed to identify the class of an element of Ti,

– i.e., the weighted average of Info(Ti):

Info(X,T) = sum over i of (|Ti|/|T|) * Info(Ti)
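A small Python sketch of these two quantities (the representation of T as a list of record dictionaries is an assumption for illustration):

import math
from collections import Counter

def info(records, class_attr):
    """Info(T): entropy of the class distribution over a list of record dicts."""
    counts = Counter(r[class_attr] for r in records)
    total = len(records)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def info_x(records, attr, class_attr):
    """Info(X,T): weighted average entropy after partitioning on attribute X."""
    total = len(records)
    partitions = Counter(r[attr] for r in records)
    return sum((n / total) * info([r for r in records if r[attr] == v], class_attr)
               for v, n in partitions.items())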

Page 24: Introduction to Machine Learning.

Gain

• Consider the quantity Gain(X,T), defined as Gain(X,T) = Info(T) - Info(X,T)

• This represents the difference between

– the information needed to identify an element of T, and

– the information needed to identify an element of T after the value of attribute X has been obtained;

that is, this is the gain in information due to attribute X.

• We can use this to rank attributes and to build decision trees in which each node holds the attribute with the greatest gain among the attributes not yet considered on the path from the root.

• The intent of this ordering is twofold:

– 1. To create small decision trees, so that records can be identified after only a few questions.

– 2. To match a hoped-for minimality of the process represented by the records being considered (Occam's Razor).

We will use this idea to build decision trees; it is the basis of the ID3 algorithm, sketched below.
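Continuing the sketch above, the gain computation and ID3's attribute choice at a node might look like this:

def gain(records, attr, class_attr):
    """Gain(X,T) = Info(T) - Info(X,T)."""
    return info(records, class_attr) - info_x(records, attr, class_attr)

def best_attribute(records, candidate_attrs, class_attr):
    """ID3's choice: the not-yet-used attribute with the greatest gain."""
    return max(candidate_attrs, key=lambda a: gain(records, a, class_attr))

records = [
    {"Color": "red", "Size": "big", "Class": "-"},
    {"Color": "red", "Size": "small", "Class": "+"},
    {"Color": "green", "Size": "big", "Class": "+"},
]
print(best_attribute(records, ["Color", "Size"], "Class"))  # 'Color' (tied with 'Size'; max takes the first)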

Page 25: Introduction to Machine Learning.

Rule and Decision Tree Learning

• Example: Rule Acquisition from Historical Data

• Data

– Patient 103 (time = 1): Age 23, First-Pregnancy: no, Anemia: no, Diabetes: no, Previous-Premature-Birth: no, Ultrasound: unknown, Elective C-Section: unknown, Emergency-C-Section: unknown

– Patient 103 (time = 2): Age 23, First-Pregnancy: no, Anemia: no, Diabetes: yes, Previous-Premature-Birth: no, Ultrasound: abnormal, Elective C-Section: no, Emergency-C-Section: unknown

– Patient 103 (time = n): Age 23, First-Pregnancy: no, Anemia: no, Diabetes: no, Previous-Premature-Birth: no, Ultrasound: unknown, Elective C-Section: no, Emergency-C-Section: YES

• Learned Rule

– IF no previous vaginal delivery, AND abnormal 2nd trimester ultrasound, AND malpresentation at admission, AND no elective C-Section, THEN probability of emergency C-Section is 0.6

– Training set: 26/41 = 0.634

– Test set: 12/20 = 0.600

Page 26: Introduction to Machine Learning.

Neural Network Learning

• Autonomous Land Vehicle In a Neural Network (ALVINN): Pomerleau et al.

– http://www.cs.cmu.edu/afs/cs/project/alv/member/www/projects/ALVINN.html

– Drives 70 mph on highways

Page 27: Introduction to Machine Learning.

Specifying a Learning Problem

• Learning = Improving with Experience at Some Task

– Improve over task T,

– with respect to performance measure P,

– based on experience E.

• Example: Learning to Play Checkers

– T: play games of checkers

– P: percent of games won in world tournament

– E: opportunity to play against self

• Refining the Problem Specification: Issues

– What experience?

– What exactly should be learned?

– How shall it be represented?

– What specific algorithm to learn it?

• Defining the Problem Milieu

– Performance element:

– How shall the results of learning be applied?

– How shall the performance element be evaluated? The learning system?

Page 28: Introduction to Machine Learning.

Example: Learning to Play Checkers

• Type of Training Experience

– Direct or indirect?

– Teacher or not?

– Knowledge about the game (e.g., openings/endgames)?

• Problem: Is Training Experience Representative (of Performance Goal)?

• Software Design

– Assumptions of the learning system: legal move generator exists

– Software requirements:

– generator,

– evaluator(s),

– parametric target function

• Choosing a Target Function

– ChooseMove: Board → Move (action selection function, or policy)

– V: Board → R (board evaluation function)

– Ideal target V; approximated target V̂

– Goal of learning process: operational description (approximation) of V

Page 29: Introduction to Machine Learning.

A Target Function for Learning to Play Checkers

• Possible Definition

– If b is a final board state that is won, then V(b) = 100

– If b is a final board state that is lost, then V(b) = -100

– If b is a final board state that is drawn, then V(b) = 0

– If b is not a final board state in the game, then V(b) = V(b’) where b’ is the best final board state that can be achieved starting from b and playing optimally until the end of the game

– Correct values, but not operational

• Choosing a Representation for the Target Function

– Collection of rules?

– Neural network?

– Polynomial function (e.g., linear, quadratic combination) of board features?

– Other?

• A Representation for the Learned Function

V̂(b) = w0 + w1*bp(b) + w2*rp(b) + w3*bk(b) + w4*rk(b) + w5*bt(b) + w6*rt(b)

– bp/rp = number of black/red pieces; bk/rk = number of black/red kings; bt/rt = number of black/red pieces threatened (can be taken on next turn)
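A minimal Python sketch of this linear evaluation function (the weights and the example feature values are placeholders, not learned values):

def v_hat(features, weights):
    """V̂(b) = w0 + w1*bp + w2*rp + w3*bk + w4*rk + w5*bt + w6*rt.
    features: (bp, rp, bk, rk, bt, rt) extracted from a board b."""
    w0, *ws = weights
    return w0 + sum(w * f for w, f in zip(ws, features))

weights = [0.5, 1.0, -1.0, 3.0, -3.0, -0.4, 0.4]   # placeholder values, to be learned
features = (12, 12, 0, 0, 1, 1)                     # an assumed opening-like position
print(v_hat(features, weights))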

Page 30: Introduction to Machine Learning.

A Training Procedure for Learning to Play Checkers

• Obtaining Training Examples

– V(b): the target function

– V̂(b): the learned function

– Vtrain(b): the training value

• One Rule for Estimating Training Values:

– Vtrain(b) ← V̂(Successor(b))

• Choose Weight Tuning Rule

– Least Mean Square (LMS) weight update rule:

REPEAT

• Select a training example b at random

• Compute the error for this training example: error(b) = Vtrain(b) - V̂(b)

• For each board feature fi, update weight wi as follows: wi ← wi + c * fi(b) * error(b), where c is a small, constant factor to adjust the learning rate
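A compact Python sketch of this LMS loop, building on the v_hat function above; sample_board, board_features, and successor_features are hypothetical stubs standing in for the game-playing machinery:

def lms_update(weights, features, v_train, c=0.1):
    """One LMS step: w_i <- w_i + c * f_i * error(b), with f_0 = 1 for the bias w_0."""
    error = v_train - v_hat(features, weights)
    full_features = (1,) + tuple(features)          # f0 = 1 pairs with w0
    return [w + c * f * error for w, f in zip(weights, full_features)]

# Training loop sketch: estimate Vtrain(b) from the learned value of b's successor.
# for _ in range(10000):
#     b = sample_board()                                # hypothetical stub
#     v_train = v_hat(successor_features(b), weights)   # Vtrain(b) <- V̂(Successor(b))
#     weights = lms_update(weights, board_features(b), v_train)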

Page 31: Introduction to Machine Learning.

Design Choices for Learning to Play Checkers

[Figure: a tree of design choices leading to the completed design.

Determine Type of Training Experience: games against experts / games against self / table of correct moves

Determine Target Function: Board → value / Board → move

Determine Representation of Learned Function: polynomial / linear function of six features / artificial neural network

Determine Learning Algorithm: gradient descent / linear programming

→ Completed Design]

Page 32: Introduction to Machine Learning.

Example of an Interesting Application: Data Mining

NCSA D2K - http://www.ncsa.uiuc.edu/STI/ALG

Page 33: Introduction to Machine Learning.

Example: Reasoning (Inference, Decision Support)

Cartia ThemeScapes - http://www.cartia.com

6,500 news stories from the WWW in 1997

Page 34: Introduction to Machine Learning.

Relevant Disciplines

• Artificial Intelligence

• Bayesian Methods

• Cognitive Science

• Computational Complexity Theory

• Control Theory

• Information Theory

• Neuroscience

• Philosophy

• Psychology

• Statistics

[Figure: a concept map of what these disciplines contribute to Machine Learning, with labels including: symbolic representation; planning/problem solving; knowledge-guided learning; Bayes's Theorem; missing data estimators; PAC formalism; mistake bounds; language learning; learning to reason; optimization; learning predictors; meta-learning; entropy measures; MDL approaches; optimal codes; ANN models; modular learning; Occam's Razor; inductive generalization; power law of practice; heuristic learning; bias/variance formalism; confidence intervals; hypothesis testing.]