Supervised Learning: Artificial Neural Networks
Some slides adapted from Dan Klein et al. (Berkeley) and Percy Liang (Stanford)
Plan
• Project 4: http://www.mathcs.emory.edu/~eugene/cs325/p4/
• Review: Perceptron, MIRA
• New: Neural Networks (Supervised version)
– +Hidden Layer
– +Training algorithms (forward- and backward-propagation)
Single Neuron as Linear Classifier
• Inputs are feature values
• Each feature has a weight
• Sum is the activation
• If the activation is:
– Positive, output +1
– Negative, output -1
[Diagram: inputs f1, f2, f3 with weights w1, w2, w3 feeding a sum; output +1 if the activation w·f(x) > 0]
Weights
• Binary case: compare features to a weight vector
• Learning: figure out the weight vector from examples
Example (spam features):
  f(x1):  # free: 2,   YOUR_NAME: 0,   MISSPELLED: 2,  FROM_FRIEND: 0, ...
  w:      # free: 4,   YOUR_NAME: -1,  MISSPELLED: 1,  FROM_FRIEND: -3, ...
  f(x2):  # free: 0,   YOUR_NAME: 1,   MISSPELLED: 1,  FROM_FRIEND: 1, ...
A positive dot product w·f(x) means the positive class.
Binary Decision Rule
• In the space of feature vectors
– Examples are points
– Any weight vector is a hyperplane
– One side corresponds to Y=+1
– Other corresponds to Y=-1
Example weight vector:
  BIAS: -3, free: 4, money: 2, ...
[Plot: the hyperplane w·f(x) = 0 in the (free, money) feature plane]
+1 = SPAM
-1 = HAM
Learning: Binary Perceptron
• Start with weights = 0
• For each training instance:
– Classify with current weights
– If correct (i.e., y=y*), no change!
– If wrong: adjust the weight vector by adding or subtracting the feature vector: w ← w + y*·f(x) (subtract when y* = -1), as sketched below
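To make the update concrete, here is a minimal sketch of the binary perceptron in plain JavaScript (runnable in Node; the function names and data layout are my own, not from the course materials):

function dot(w, f) {
  let s = 0;
  for (let i = 0; i < w.length; i++) s += w[i] * f[i];
  return s;
}

function trainBinaryPerceptron(examples, numFeatures, epochs) {
  const w = new Array(numFeatures).fill(0);    // start with weights = 0
  for (let epoch = 0; epoch < epochs; epoch++) {
    for (const ex of examples) {               // ex = {f: featureVector, y: +1 or -1}
      const y = dot(w, ex.f) > 0 ? 1 : -1;     // classify with current weights
      if (y !== ex.y) {                        // if wrong: w <- w + y* f(x)
        for (let i = 0; i < w.length; i++) w[i] += ex.y * ex.f[i];
      }
    }
  }
  return w;
}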
Multiclass Decision Rule
• If we have multiple classes:
– A weight vector w_y for each class y
– Score (activation) of a class y: score(x, y) = w_y·f(x)
– Prediction: the highest score wins, y = argmax_y w_y·f(x)
Binary = multiclass where the negative class has weight zero
Learning: Multiclass Perceptron
• Start with all weights = 0
• Pick up training examples one by one
• Predict with current weights
• If correct, no change!
• If wrong: lower score of wrong answer, raise score of right answer
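A matching sketch of the multiclass update, reusing dot() from the previous snippet (names are again illustrative only):

function predictClass(weights, f) {            // weights[y] = weight vector for class y
  let best = 0;
  for (let y = 1; y < weights.length; y++) {
    if (dot(weights[y], f) > dot(weights[best], f)) best = y;  // highest score wins
  }
  return best;
}

function multiclassPerceptronUpdate(weights, f, yStar) {
  const y = predictClass(weights, f);
  if (y !== yStar) {
    for (let i = 0; i < f.length; i++) {
      weights[yStar][i] += f[i];               // raise score of the right answer
      weights[y][i] -= f[i];                   // lower score of the wrong answer
    }
  }
}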
MIRA*: Fixing the Perceptron
• What about “outliers” – correct or incorrect data points with abnormal feature values?
• MIRA: choose an update size τ that fixes the current mistake…
• … but minimizes the change to w
• The +1 helps to generalize
* Margin Infused Relaxed Algorithm
Minimum Correcting Update
• Choose the smallest update that corrects the mistake:
  min_w ½‖w − w′‖²  subject to  w_y*·f(x) ≥ w_y·f(x) + 1
• The minimizing update is not 0, or we would not have made an error, so the minimum is where equality holds:
  τ = ((w′_y − w′_y*)·f(x) + 1) / (2·f(x)·f(x))
Maximum Step Size
• In practice, it’s also bad to make updates that are too large
– Example may be labeled incorrectly
– You may not have enough features
– Solution: cap the maximum possible value of τ with some constant C
– Corresponds to an optimization that assumes non-separable data
– Usually converges faster than perceptron
– Usually better, especially on noisy data
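A sketch of the capped MIRA step size, assuming the closed-form minimum correcting update from the previous slide (function and variable names are mine, not the course's):

function miraTau(wWrong, wRight, f, C) {
  // tau = ((w'_y - w'_y*) . f(x) + 1) / (2 f(x).f(x)), then capped at C
  const tau = (dot(wWrong, f) - dot(wRight, f) + 1) / (2 * dot(f, f));
  return Math.min(tau, C);
}

// The update then mirrors the perceptron, scaled by tau:
//   w_y* <- w_y* + tau * f(x);   w_y <- w_y - tau * f(x)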
Linear Separators
• Which of these linear separators is optimal?
Support Vector Machines
• Maximizing the margin: good according to intuition, theory, practice
• Only support vectors matter; other training examples are ignorable
• Support vector machines (SVMs) find the separator with max margin
• Basically, SVMs are MIRA where you optimize over all examples at once
MIRA (one example at a time): minimize ½‖w − w′‖² subject to w_y*·f(x) ≥ w_y·f(x) + 1
SVM (all examples at once): minimize ½‖w‖² subject to w_y*·f(x_i) ≥ w_y·f(x_i) + 1 for all examples i and all classes y
Classification: Comparison
• Naïve Bayes
– Builds a model of the training data
– Gives prediction probabilities
– Strong assumptions about feature independence
– One pass through the data (counting)
• Perceptrons / MIRA
– Make fewer assumptions about the data
– Mistake-driven learning
– Multiple passes through the data (prediction)
– Often more accurate
Different Non-Linearly Separable Problems

Structure    | Types of Decision Regions
Single-Layer | Half plane bounded by a hyperplane
Two-Layer    | Convex open or closed regions
Three-Layer  | Arbitrary (complexity limited by the number of nodes)

[Figures: example A/B decision regions at each depth for the Exclusive-OR problem, classes with meshed regions, and the most general region shapes]
Non-Linearly Separable Data
• Demo: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
layer_defs = [];
layer_defs.push({type:'input', out_sx:1, out_sy:1, out_depth:2});   // 2 input features
layer_defs.push({type:'fc', num_neurons:1, activation:'sigmoid'});  // a single sigmoid neuron: linear decision boundary
layer_defs.push({type:'softmax', num_classes:2});                   // 2-class output
net = new convnetjs.Net();
net.makeLayers(layer_defs);
trainer = new convnetjs.SGDTrainer(net, {learning_rate:0.01, momentum:0.1, batch_size:10, l2_decay:0.001});
Multilayer Perceptron (MLP)
[Diagram: the input layer receives the input signals (external stimuli); adjustable weights connect the layers; the output layer produces the output values]
The Key Elements of Neural Networks
• Neural computing requires a number of neurons to be connected together into a neural network. Neurons are arranged in layers.
• Each neuron within the network is usually a simple processing unit which takes one or more inputs and produces an output. At each neuron, every input has an associated weight which modifies the strength of that input. The neuron simply adds together all the weighted inputs and calculates an output to be passed on.
[Diagram: inputs p1, p2, p3 with weights w1, w2, w3 and bias b feeding an activation function f]
  a = f(w1·p1 + w2·p2 + w3·p3 + b) = f(Σ_i w_i·p_i + b)
Activation functions
Sigmoid Unit
[Diagram: inputs x0 = 1, x1, …, xn with weights w0, w1, …, wn feeding a sigmoid unit with output o]

net = Σ_{i=0..n} w_i·x_i
o = σ(net) = 1 / (1 + e^(−net))

σ(x) is the sigmoid function: 1 / (1 + e^(−x))
dσ(x)/dx = σ(x)·(1 − σ(x))

Derive gradient descent rules to train:
• one sigmoid unit:
  ∂E/∂w_i = −Σ_d (t_d − o_d)·o_d·(1 − o_d)·x_{i,d}
• multilayer networks of sigmoid units: backpropagation
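In code, the sigmoid and its derivative are one-liners (a plain JavaScript sketch):

function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

function sigmoidDeriv(x) {
  const s = sigmoid(x);          // dσ/dx = σ(x)(1 − σ(x))
  return s * (1 - s);
}

Note that the derivative can be computed from the unit's output alone, which is one reason the sigmoid is convenient for training.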
Online Demo: Add 2nd Layer
• Demo: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
layer_defs = [];
layer_defs.push({type:'input', out_sx:1, out_sy:1, out_depth:2});   // 2 input features
layer_defs.push({type:'fc', num_neurons:5, activation:'sigmoid'});  // hidden layer: 5 sigmoid neurons
layer_defs.push({type:'softmax', num_classes:2});                   // 2-class output
net = new convnetjs.Net();
net.makeLayers(layer_defs);
trainer = new convnetjs.SGDTrainer(net, {learning_rate:0.01, momentum:0.1, batch_size:10, l2_decay:0.001});
2-Layer Neural Network
Hidden Layers
• Intermediate hidden units: learned features
– The activation vector h behaves like our feature vector f(x) for a linear classifier, but h is “learned” automatically
• What kind of features can be learned?
– Each hidden feature is a nonlinear function of a linear combination of the inputs, e.g. h_j = σ(w_j·x)
Feedforward NNs
• The basic structure of a feedforward Neural Network
• When the desired outputs are known, we have supervised learning (learning with a teacher)
(Feed-forward) Neural Network
[Diagram: inputs x1, x2, x3 and bias input +1; weights w1, w2, w3 and bias b feed a sum Σ, which produces the output]
(Feed-forward) Neural Network
• Stack many non-linear units (neurons)
[Diagram: Layer 1 (input): x1, x2, x3, +1 → Layer 2 (hidden): a1(2), a2(2), +1 → Layer 3 (output)]
w_ij = weight between the i-th input and the j-th hidden unit
Neural networks
• Very expressive
– Able to learn highly non-linear functions
• Supervised training
– Binary classification
– Multi-class classification
– Regression / composite loss
• Unsupervised training
– Dimensionality reduction
– Complex sequence modeling
– High-level feature learning
Backpropagation Algorithm – Main Idea: Error in Hidden Layers
The idea of the algorithm can be summarized as follows:
1. Compute the error term for the output units using the observed error.
2. Starting from the output layer, repeat
   – propagating the error term back to the previous layer, and
   – updating the weights between the two layers
   until the earliest hidden layer is reached.
Backpropagation Algorithm
• Initialize weights (typically random!)
• Keep doing epochs:
– For each example e in the training set:
  • Forward pass to compute
    – O = neural-net-output(network, e)
    – miss = (T − O) at each output unit
  • Backward pass to calculate deltas to weights
  • Update all weights
• Until tuning-set error stops improving
(Forward pass explained earlier; backward pass explained on the next slide. A skeleton of this loop is sketched below.)
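As a sketch, the epoch loop with tuning-set early stopping might look like this; the two callbacks are hypothetical placeholders standing in for the forward/backward pass and the held-out error measure described above:

function trainEpochs(trainSet, tuneSet, trainOneExample, tuningError) {
  let bestError = Infinity;
  for (;;) {                                      // keep doing epochs
    for (const e of trainSet) trainOneExample(e); // forward + backward pass, weight update
    const err = tuningError(tuneSet);             // error on the held-out tuning set
    if (err >= bestError) break;                  // stop when tuning error stops improving
    bestError = err;
  }
  return bestError;
}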
Backward Pass
• Compute deltas to weights
– from hidden layer
– to output layer
• Without changing any weights (yet), compute the actual contributions
– within the hidden layer(s)
– and compute deltas
Training: Learning Details
• Method for learning weights in feed-forward (FF) nets
• Can’t use the Perceptron Learning Rule
– no teacher values are available for hidden units
• Use gradient descent to minimize the error
– propagate deltas to adjust for errors, backward from the outputs to the hidden layers to the inputs
Error Surface
[Figure: error as a function of the weights in a multidimensional space]
Gradient
• Trying to make the error decrease the fastest
• Compute: ∇E = [∂E/∂w1, ∂E/∂w2, …, ∂E/∂wn]
• Change the i-th weight by: Δw_i = −α·∂E/∂w_i
• We need a derivative!
• The activation function must be continuous, differentiable, non-decreasing, and easy to compute
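The weight update itself is a short loop (sketch; gradE is an assumed function returning [∂E/∂w1, …, ∂E/∂wn] for the current weights):

function gradientStep(w, gradE, alpha) {
  const g = gradE(w);                  // GradE = [dE/dw1, ..., dE/dwn]
  for (let i = 0; i < w.length; i++) {
    w[i] -= alpha * g[i];              // delta w_i = -alpha * dE/dw_i
  }
}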
Derivatives of error for weights
[Equations: the error derivatives ∂E/∂w used to compute the deltas at each layer]
Can’t use LTU
• To effectively assign credit/blame to units in hidden layers, we want to look at the first derivative of the activation function
• The sigmoid function is easy to differentiate and easy to compute forward
[Figure: linear threshold unit (step function) vs. sigmoid function]
Training: Loss Function
• Stack many non-linear units (neurons)
[Diagram: Layer 1 (input): x1, x2, x3, +1 → Layer 2 (hidden): a1(2), a2(2), +1 → Layer 3 (output): a1(3), feeding the loss function; weights w11, w21, …]
Loss functions
• Classification
– Cross entropy (binary and multiclass)
• Regression
– Squared deviation (mean squared error)
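As a sketch, both losses for a single prediction (binary cross entropy takes y ∈ {0, 1} and a predicted probability p):

function crossEntropy(y, p) {
  // binary cross entropy: -[y log p + (1 - y) log(1 - p)]
  return -(y * Math.log(p) + (1 - y) * Math.log(1 - p));
}

function squaredError(y, yHat) {
  return (y - yHat) ** 2;            // squared deviation for regression
}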
Training: Forward Pass
[Diagram: the same 3-layer network; the inputs x1, x2, x3, +1 flow forward through the hidden layer to the output and into the loss function]
Training: Forward Pass
[Diagram: Layer k−1 → Layer k; each unit in Layer k computes a(k) = σ(W(k−1)·a(k−1) + b(k−1))]
Backpropagation Network Training
1. Initialize the network with random weights
2. For all training cases (called examples):
   a. Present the training inputs to the network and calculate the output
   b. For all layers (starting with the output layer, back to the input layer):
      i. Compare the network output with the correct output (the error function)
      ii. Adapt the weights in the current layer
The Learning Rule: Backpropagation
• The delta rule is often utilized by the most common class of ANNs, called backpropagation neural networks.
[Diagram: network mapping inputs to desired outputs]
Training: Back-propagation
• Use gradient descent optimization
  f(x) = σ(W(2)·σ(W(1)·x + b(1)) + b(2))
  Loss(x, y; W, b) = ‖y − f(x)‖²
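A direct transcription of f(x) into code, reusing sigmoid() and dot() from the earlier sketches (W1, W2 are arrays of row vectors, b1, b2 are bias arrays; all names are illustrative):

function layer(W, b, x) {
  // computes σ(W x + b), one sigmoid unit per row of W
  return W.map((row, j) => sigmoid(dot(row, x) + b[j]));
}

function f(W1, b1, W2, b2, x) {
  return layer(W2, b2, layer(W1, b1, x));   // f(x) = σ(W2 σ(W1 x + b1) + b2)
}

function loss(y, o) {
  let s = 0;                                // ‖y − f(x)‖²
  for (let j = 0; j < y.length; j++) s += (y[j] - o[j]) ** 2;
  return s;
}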
What is back-propagation?
• An algorithm to compute the gradients with respect to the parameters (W, b)
Chain Rule for Derivatives
• One path, x → g → f:
  d/dx f(g(x)) = f′(g(x))·g′(x)
• Two paths, x → g1, g2 → f:
  d/dx f(g1(x) + g2(x)) = f′(g1(x) + g2(x))·(g1′(x) + g2′(x))
• k paths, x → g1, …, gk → f:
  d/dx f(g1(x) + … + gk(x)) = f′(g1(x) + … + gk(x))·(g1′(x) + … + gk′(x))
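For example, with f(u) = u², g1(x) = 3x, and g2(x) = x²:
  d/dx f(g1(x) + g2(x)) = 2(3x + x²)·(3 + 2x)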
Gradients of W’s and b’s
How do we pick the learning rate α?
1. Tuning set, or
2. Cross validation, or
3. Small for slow, conservative learning
How many hidden layers?
• Usually just one (i.e., a 2-layer net)
• How many hidden units in the layer?
– Too few ==> can’t learn
– Too many ==> poor generalization
How big a training set?
• Determine your target error rate, e
• Success rate is 1 − e
• Typical training set size: approx. n/e, where n is the number of weights in the net
• Example:
– e = 0.1, n = 80 weights
– training set size: 800
– a net trained until 95% correct on the training set should typically produce about 90% correct classification on the test set
Back-propagation: Summary
• For each input (x, y):
– Compute the forward pass (values of the hidden units at all layers) using the training example (x, y)
– For each layer, compute the gradients of its outputs w.r.t. its inputs
– Compute the gradients of the loss w.r.t. the parameters (W, b)
– Update the parameters: W ← W − α·∂Loss/∂W, b ← b − α·∂Loss/∂b
(A worked sketch follows below.)
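Putting the pieces together, here is a minimal backprop step for the two-layer network f(x) = σ(W(2)·σ(W(1)·x + b(1)) + b(2)) with loss ½‖y − f(x)‖², reusing sigmoid(), dot(), and layer() from the earlier sketches (an illustrative sketch under those assumptions, not the course's reference code):

function backpropStep(W1, b1, W2, b2, x, y, alpha) {
  // Forward pass, keeping hidden activations for the backward pass.
  const h = layer(W1, b1, x);                  // hidden activations a(2)
  const o = layer(W2, b2, h);                  // outputs a(3)

  // Output deltas: dLoss/dnet = (o - y) * o * (1 - o).
  const d2 = o.map((oj, j) => (oj - y[j]) * oj * (1 - oj));

  // Hidden deltas: propagate d2 back through W2, times the sigmoid derivative.
  const d1 = h.map((hk, k) => {
    let s = 0;
    for (let j = 0; j < d2.length; j++) s += d2[j] * W2[j][k];
    return s * hk * (1 - hk);
  });

  // Gradient-descent updates, output layer then hidden layer.
  for (let j = 0; j < W2.length; j++) {
    for (let k = 0; k < h.length; k++) W2[j][k] -= alpha * d2[j] * h[k];
    b2[j] -= alpha * d2[j];
  }
  for (let k = 0; k < W1.length; k++) {
    for (let i = 0; i < x.length; i++) W1[k][i] -= alpha * d1[k] * x[i];
    b1[k] -= alpha * d1[k];
  }
}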
Neural Networks: Summary
[Diagram: a single neuron — inputs x1, x2, x3 and +1, weights w1, w2, w3 and bias b, feeding a sum Σ that produces the output; Layer 1 → Layer 2]
Neural Networks: Applications
• Image Modeling / Classification
• Feature Learning (Deep Learning; many layers)
• Compositional Text Representation
• Sequence Modeling