Introduction to Artificial Intelligence CSCI 3202: The Perceptron Algorithm
Greg Grudic
Questions?
Binary Classification
• A binary classifier is a mapping from a set of d inputs to a single output which can take on one of TWO values
• In the most general setting:
  inputs: $\mathbf{x} \in \mathbb{R}^d$
  output: $y \in \{-1, +1\}$
• Specifying the output classes as -1 and +1 is arbitrary!
  – Often done as a mathematical convenience
A Binary Classifier
$\mathbf{x} \rightarrow$ Classification Model $\rightarrow \hat{y} \in \{-1, +1\}$
Given learning data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$
A model is constructed:
$\hat{y} = M(\mathbf{x})$
Linear Separating Hyper-Planes
[Figure: a linear separating hyperplane in the $(x_1, x_2)$ plane]
• Region where $\beta_0 + \sum_{i=1}^{d} \beta_i x_i \le 0$: labeled $y = -1$
• Region where $\beta_0 + \sum_{i=1}^{d} \beta_i x_i > 0$: labeled $y = +1$
• Decision boundary: $\beta_0 + \sum_{i=1}^{d} \beta_i x_i = 0$
Linear Separating Hyper-Planes
• The Model:
  $\hat{y} = M(\mathbf{x}) = \operatorname{sgn}\left[\hat{\beta}_0 + (\hat{\beta}_1, \ldots, \hat{\beta}_d)^T \mathbf{x}\right]$
• Where:
  $\operatorname{sgn}[A] = \begin{cases} 1 & \text{if } A > 0 \\ -1 & \text{otherwise} \end{cases}$
• The decision boundary:
  $\hat{\beta}_0 + (\hat{\beta}_1, \ldots, \hat{\beta}_d)^T \mathbf{x} = 0$
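As a concrete illustration, here is a minimal Python sketch of this model (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def sgn(a):
    """sgn[A] = +1 if A > 0, -1 otherwise."""
    return 1 if a > 0 else -1

def predict(beta0, beta, x):
    """M(x) = sgn[beta0 + (beta_1, ..., beta_d)^T x]."""
    return sgn(beta0 + np.dot(beta, x))
```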
Linear Separating Hyper-Planes
• The model parameters are: $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)$
• The hat on the betas means that they are estimated from the data
• Many different learning algorithms have been proposed for determining $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)$
Rosenblatt’s Perceptron Learning Algorithm
• Dates back to the 1950s and is the motivation behind Neural Networks
• The algorithm (a minimal sketch in code follows below):
  – Start with a random hyperplane $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)$
  – Incrementally modify the hyperplane such that points that are misclassified move closer to the correct side of the boundary
  – Stop when all learning examples are correctly classified
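A minimal Python sketch of this procedure, starting from a zero rather than a random hyperplane for simplicity:

```python
import numpy as np

def perceptron_train(X, y, rho=1.0, max_epochs=1000):
    """Rosenblatt's rule: nudge the hyperplane toward each misclassified
    point until every training example is correctly classified."""
    N, d = X.shape
    beta0, beta = 0.0, np.zeros(d)       # zero start (slides use a random one)
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            if y[i] * (beta0 + X[i] @ beta) <= 0:   # point i misclassified
                beta0 += rho * y[i]                  # move the boundary so point i
                beta += rho * y[i] * X[i]            # is closer to the correct side
                errors += 1
        if errors == 0:                              # all examples correct: stop
            break
    return beta0, beta                               # loops forever if not separable
```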
Rosenblatt’s Perceptron Learning Algorithm
• The algorithm is based on the following property:
  – The signed distance of any point $\mathbf{x}$ to the boundary is:
  $D(\mathbf{x}) = \frac{\hat{\beta}_0 + (\hat{\beta}_1, \ldots, \hat{\beta}_d)^T \mathbf{x}}{\sqrt{\sum_{i=1}^{d} \hat{\beta}_i^2}} \propto \hat{\beta}_0 + (\hat{\beta}_1, \ldots, \hat{\beta}_d)^T \mathbf{x}$
• Therefore, if $M$ is the set of misclassified learning examples, we can push them closer to the boundary by minimizing the following:
  $D(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d) = -\sum_{i \in M} y_i \left( \hat{\beta}_0 + (\hat{\beta}_1, \ldots, \hat{\beta}_d)^T \mathbf{x}_i \right)$
Rosenblatt’s Minimization Function
• This is classic Machine Learning!
• First define a cost function in model parameter space:
  $D(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d) = -\sum_{i \in M} y_i \left( \hat{\beta}_0 + \sum_{k=1}^{d} \hat{\beta}_k x_{ik} \right)$
• Then find an algorithm that modifies $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)$ such that this cost function is minimized
• One such algorithm is Gradient Descent
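A short Python sketch of this cost function (names are illustrative):

```python
import numpy as np

def perceptron_cost(beta0, beta, X, y):
    """D(beta) = -sum over misclassified i of y_i * (beta0 + beta^T x_i).
    Nonnegative, and zero exactly when nothing is misclassified."""
    margins = y * (beta0 + X @ beta)       # margin of each training point
    return -np.sum(margins[margins <= 0])  # keep only the misclassified set M
```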
Gradient Descent
[Figure: the cost surface $D(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)$ plotted over the $(\hat{\beta}_0, \hat{\beta}_1)$ parameter plane]
The Gradient Descent Algorithm
$\hat{\beta}_i \leftarrow \hat{\beta}_i - \rho \frac{\partial D(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)}{\partial \hat{\beta}_i}$
Where the learning rate is defined by: $\rho > 0$
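A hypothetical one-dimensional illustration of this update rule, minimizing $D(b) = (b - 3)^2$:

```python
# dD/db = 2 * (b - 3), so repeated updates drive b toward the minimizer b = 3.
b, rho = 0.0, 0.1            # initial guess and learning rate (rho > 0)
for _ in range(100):
    grad = 2 * (b - 3)       # gradient at the current b
    b -= rho * grad          # the update rule above
print(b)                     # close to 3
```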
The Gradient Descent Algorithm for the Perceptron
• The components of the gradient are:
  $\frac{\partial D(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)}{\partial \hat{\beta}_0} = -\sum_{i \in M} y_i \qquad \frac{\partial D(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)}{\partial \hat{\beta}_j} = -\sum_{i \in M} y_i x_{ij}, \quad j = 1, \ldots, d$
• Two Versions of the Perceptron Algorithm:
  – Update one misclassified point at a time (online):
    $\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_d \end{pmatrix} \leftarrow \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_d \end{pmatrix} + \rho \begin{pmatrix} y_i \\ y_i x_{i1} \\ \vdots \\ y_i x_{id} \end{pmatrix}$
  – Update all misclassified points at once (batch):
    $\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_d \end{pmatrix} \leftarrow \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_d \end{pmatrix} + \rho \begin{pmatrix} \sum_{i \in M} y_i \\ \sum_{i \in M} y_i x_{i1} \\ \vdots \\ \sum_{i \in M} y_i x_{id} \end{pmatrix}$
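Both versions, sketched in Python (illustrative names; the boolean mask `m` plays the role of the set $M$):

```python
import numpy as np

def online_step(beta0, beta, x_i, y_i, rho):
    """Online: update using one misclassified point (x_i, y_i)."""
    return beta0 + rho * y_i, beta + rho * y_i * x_i

def batch_step(beta0, beta, X, y, rho):
    """Batch: update using all currently misclassified points at once."""
    m = y * (beta0 + X @ beta) <= 0        # mask of misclassified points
    beta0 = beta0 + rho * np.sum(y[m])     # -rho * dD/d(beta_0)
    beta = beta + rho * (y[m] @ X[m])      # -rho * dD/d(beta_j), j = 1..d
    return beta0, beta
```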
The Learning Data
• Matrix representation of N learning examples of d-dimensional inputs
Training Data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$
$X = \begin{pmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{N1} & \cdots & x_{Nd} \end{pmatrix}, \quad Y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}$
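In Python with NumPy, a hypothetical $N = 4$, $d = 2$ dataset in this layout might look like:

```python
import numpy as np

# Row i of X is the input x_i; entry i of Y is the label y_i.
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])       # shape (N, d)
Y = np.array([-1, 1, 1, 1])      # shape (N,)
```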
The Good Theoretical Properties of the Perceptron Algorithm
• If a solution exists, the algorithm will always converge in a finite number of steps!
• Question: Does a solution always exist?
Linearly Separable Data
• Which of these datasets are separable by a linear boundary?
[Figure: two 2-D datasets, a) and b), each shown as a scatter plot of + and - points]
Linearly Separable Data
• Which of these datasets are separable by a linear boundary?
[Figure: the same two datasets, a) and b); dataset b) is marked “Not Linearly Separable!”]
Bad Theoretical Properties of the Perceptron Algorithm
• If the data is not linearly separable, the algorithm cycles forever!
  – Cannot converge!
  – This property “stopped” active research in this area between 1968 and 1984…
    • Perceptrons, Minsky and Papert, 1969
• Even when the data is separable, there are infinitely many solutions
  – Which solution is best?
• When data is linearly separable, the number of steps to converge can be very large (depends on the size of the gap between classes)
What about Nonlinear Data?
• Data that is not linearly separable is called nonlinear data
• Nonlinear data can often be mapped into a nonlinear basis function space where it is linearly separable
Nonlinear Models
• The Linear Model:
  $\hat{y} = M(\mathbf{x}) = \operatorname{sgn}\left[\hat{\beta}_0 + \sum_{i=1}^{d} \hat{\beta}_i x_i\right]$
• The Nonlinear (basis function) Model:
  $\hat{y} = M(\mathbf{x}) = \operatorname{sgn}\left[\hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i \phi_i(\mathbf{x})\right]$
• Examples of Nonlinear Basis Functions:
  $\phi_1(\mathbf{x}) = x_1^2, \quad \phi_2(\mathbf{x}) = x_2^2, \quad \phi_3(\mathbf{x}) = x_1 x_2, \quad \phi_4(\mathbf{x}) = \sin(5 x_5)$
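A Python sketch of the basis function model using the example $\phi_i$ above (the reconstruction of $\phi_4$ from the slide is an assumption):

```python
import numpy as np

def phi(x):
    """The k = 4 example basis functions (phi_4 assumes x has >= 5 components)."""
    return np.array([x[0] ** 2,           # phi_1(x) = x_1^2
                     x[1] ** 2,           # phi_2(x) = x_2^2
                     x[0] * x[1],         # phi_3(x) = x_1 * x_2
                     np.sin(5 * x[4])])   # phi_4(x) = sin(5 * x_5)

def predict_nonlinear(beta0, beta, x):
    """sgn[beta0 + sum_i beta_i * phi_i(x)]: the same linear model,
    applied in basis function space."""
    return 1 if beta0 + np.dot(beta, phi(x)) > 0 else -1
```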
Linear Separating Hyper-Planes In Nonlinear Basis Function Space
[Figure: a linear separating hyperplane in the $(\phi_1, \phi_2)$ basis function space]
• Region where $\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i \le 0$: labeled $y = -1$
• Region where $\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i > 0$: labeled $y = +1$
• Decision boundary: $\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i = 0$
An Example
[Figure: left, the training data (classes $y = +1$ and $y = -1$) plotted in the original $(x_1, x_2)$ space; right, the same data after the mapping $\Phi$ with $\phi_1 = x_1^2$ and $\phi_2 = x_2^2$, where the classes become linearly separable]
Kernels as Nonlinear Transformations
• Polynomial:
  $K(\mathbf{x}_i, \mathbf{x}_j) = \left(\langle \mathbf{x}_i, \mathbf{x}_j \rangle + q\right)^k$
• Sigmoid:
  $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\left(\kappa \langle \mathbf{x}_i, \mathbf{x}_j \rangle + \theta\right)$
• Gaussian or Radial Basis Function (RBF):
  $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{1}{2\sigma^2} \left\| \mathbf{x}_i - \mathbf{x}_j \right\|^2\right)$
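Python sketches of the three kernels, assuming the parameter names $q$, $k$, $\kappa$, $\theta$, and $\sigma$ as reconstructed above:

```python
import numpy as np

def poly_kernel(xi, xj, q=1.0, k=2):
    """Polynomial: K(xi, xj) = (<xi, xj> + q)^k."""
    return (np.dot(xi, xj) + q) ** k

def sigmoid_kernel(xi, xj, kappa=1.0, theta=0.0):
    """Sigmoid: K(xi, xj) = tanh(kappa * <xi, xj> + theta)."""
    return np.tanh(kappa * np.dot(xi, xj) + theta)

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian/RBF: K(xi, xj) = exp(-||xi - xj||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))
```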
The Kernel Model
$\hat{y} = M(\mathbf{x}) = \operatorname{sgn}\left[\hat{\beta}_0 + \sum_{i=1}^{N} \hat{\beta}_i K(\mathbf{x}, \mathbf{x}_i)\right]$
Training Data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$
The number of basis functions equals the number of training examples!
– Unless some of the betas get set to zero…
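A minimal Python sketch of the kernel model (illustrative names; `beta` holds $\hat{\beta}_1, \ldots, \hat{\beta}_N$):

```python
def kernel_predict(beta0, beta, X_train, x, kernel):
    """sgn[beta0 + sum_{i=1}^{N} beta_i * K(x, x_i)]: one basis function,
    and one beta_i, per training example x_i."""
    s = beta0 + sum(b * kernel(x, x_i) for b, x_i in zip(beta, X_train))
    return 1 if s > 0 else -1
```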
Gram (Kernel) Matrix
Training Data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$
$K = \begin{pmatrix} K(\mathbf{x}_1, \mathbf{x}_1) & \cdots & K(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ K(\mathbf{x}_N, \mathbf{x}_1) & \cdots & K(\mathbf{x}_N, \mathbf{x}_N) \end{pmatrix}$
Properties:
• Positive definite matrix
• Symmetric
• Positive on diagonal
• N by N
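A short Python sketch that builds the Gram matrix from any of the kernel functions above:

```python
import numpy as np

def gram_matrix(X, kernel):
    """The N x N matrix with entries K[i, j] = kernel(x_i, x_j);
    symmetric and positive on the diagonal for the kernels above."""
    N = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(N)]
                     for i in range(N)])
```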
Picking a Model Structure?
• How do you pick the Kernels?
  – Kernel parameters
• These are called learning parameters or hyperparameters
  – Two approaches to choosing learning parameters:
    • Bayesian
      – Learning parameters must maximize the probability of correct classification on future data, based on prior biases
    • Frequentist
      – Use the training data to learn the model parameters $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)$
      – Use validation data to pick the best hyperparameters
• More on learning parameter selection later
Perceptron Algorithm Convergence
• Two problems:
  – No convergence when data is not separable in basis function space
  – Gives infinitely many solutions when data is separable
• Can we modify the algorithm to fix these problems?