CS 343: Artificial Intelligence
Deep Learning
Prof. Scott Niekum — The University of Texas at Austin
[These slides based on those of Dan Klein, Pieter Abbeel, Anca Dragan for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Please fill out course evals online!
Review: Linear Classifiers
Feature Vectors
Hello,
Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just
# free : 2 YOUR_NAME : 0 MISSPELLED : 2 FROM_FRIEND : 0 ...
SPAM or +
PIXEL-7,12 : 1 PIXEL-7,13 : 0 ... NUM_LOOPS : 1 ...
“2”
Some (Simplified) Biology
▪ Very loose inspiration: human neurons
Linear Classifiers
▪ Inputs are feature values
▪ Each feature has a weight
▪ Sum is the activation: activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)
▪ If the activation is:
  ▪ Positive, output +1
  ▪ Negative, output -1
[Diagram: inputs f1, f2, f3 with weights w1, w2, w3 feed a sum Σ followed by a >0? threshold]
Non-Linearity
Non-Linear Separators
▪ Data that is linearly separable works out great for linear decision rules:
▪ But what are we going to do if the dataset is just too hard?
▪ How about… mapping data to a higher-dimensional space:
[Figure: 1-D points on the x axis that are not linearly separable become separable after mapping each point x to (x, x²)]
This and next slide adapted from Ray Mooney, UT
Non-Linear Separators
▪ General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
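A tiny runnable sketch of this idea (assuming the x → (x, x²) mapping shown in the figure above; the dataset is illustrative):

    import numpy as np

    # Toy 1-D dataset: label +1 when |x| > 1, else -1 (not linearly separable in x)
    x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
    y = np.where(np.abs(x) > 1, 1, -1)

    # Map to a higher-dimensional space: phi(x) = (x, x^2)
    phi = np.stack([x, x**2], axis=1)

    # In the new space, "x^2 > 1" is a linear decision rule:
    # weights w = (0, 1) and bias b = -1 separate the classes perfectly
    w, b = np.array([0.0, 1.0]), -1.0
    pred = np.where(phi @ w + b > 0, 1, -1)
    print(np.all(pred == y))  # True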
Computer Vision
Object Detection
Manual Feature Design
Features and Generalization
[DalalandTriggs,2005]
Features and Generalization
[Figure: an image and its HoG (Histogram of Oriented Gradients) representation]
Manual Feature Design → Deep Learning
▪ Manual feature design requires:
  ▪ Domain-specific expertise
  ▪ Domain-specific effort
▪ What if we could learn the features, too?
▪ Deep Learning
Perceptron
[Diagram: inputs f1, f2, f3 with weights w1, w2, w3 feed a sum Σ followed by a >0? threshold]
Two-Layer Perceptron Network
[Diagram: inputs f1, f2, f3 feed three perceptron units (weights w11…w33, each followed by >0?); their outputs feed a final perceptron with weights w1, w2, w3 and a >0? threshold]
N-Layer Perceptron Network
[Diagram: many stacked layers of perceptron units (Σ, >0?) between inputs f1, f2, f3 and the output]
Performance
[Graph: ImageNet classification error by year; the large drop coincides with AlexNet. Graph credit Matt Zeiler, Clarifai]
Speech Recognition
[Graph: speech recognition error over time. Graph credit Matt Zeiler, Clarifai]
N-Layer Perceptron Network
[Diagram repeated: many stacked layers of perceptron units (Σ, >0?) between inputs f1, f2, f3 and the output]
Local Search
▪ Simple, general idea:
  ▪ Start wherever
  ▪ Repeat: move to the best neighboring state
  ▪ If no neighbors better than current, quit
  ▪ Neighbors = small perturbations of w
▪ Properties
  ▪ Plateaus and local optima
How to escape plateaus and find a good local optimum? How to deal with very large parameter vectors? E.g., …
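A minimal hill-climbing sketch of this procedure (the objective and the random-perturbation neighbor scheme are illustrative assumptions):

    import numpy as np

    def hill_climb(score, w, step=0.1, n_neighbors=20, max_iters=1000):
        """Repeatedly move to the best neighboring w; quit when no neighbor is better."""
        rng = np.random.default_rng(0)
        for _ in range(max_iters):
            # Neighbors = small perturbations of w
            neighbors = w + step * rng.standard_normal((n_neighbors, w.size))
            best = max(neighbors, key=score)
            if score(best) <= score(w):  # no neighbor better than current: quit
                return w
            w = best
        return w

    # Example: maximize a smooth score with a single optimum at (1, -2)
    score = lambda w: -np.sum((w - np.array([1.0, -2.0]))**2)
    print(hill_climb(score, np.zeros(2)))  # ends near [1, -2]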
Perceptron
[Diagram: inputs f1, f2, f3 with weights w1, w2, w3 feed a sum Σ followed by a >0? threshold]
▪ Objective: Classification Accuracy
▪ Issue: many plateaus! How to measure incremental progress toward a correct label?
Soft-Max
▪ Score for y = +1: w · f(x)    Score for y = −1: −w · f(x)
▪ Probability of label: P(y = +1 | x; w) = e^(w·f(x)) / (e^(w·f(x)) + e^(−w·f(x)))
▪ Objective: maximize the likelihood ∏ P(y | x; w) over the training examples
▪ Log: equivalently, maximize Σ log P(y | x; w)
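A small runnable sketch of these formulas (binary soft-max over the scores ±w·f(x) as written above; the data and names are illustrative):

    import numpy as np

    def p_label(y, f, w):
        """P(y | x; w) for y in {+1, -1}, with scores +w.f(x) and -w.f(x)."""
        z = y * np.dot(w, f)
        return np.exp(z) / (np.exp(z) + np.exp(-z))

    def log_likelihood(data, w):
        """Objective: sum of log P(y | x; w) over the training set."""
        return sum(np.log(p_label(y, f, w)) for f, y in data)

    data = [(np.array([2.0, 0.0]), 1), (np.array([0.0, 2.0]), -1)]
    w = np.array([1.0, -1.0])
    print(log_likelihood(data, w))  # smooth in w, unlike 0/1 classification accuracy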
Two-Layer Neural Network
[Diagram: inputs f1, f2, f3 feed three hidden units (weights w11…w33, each with a >0? threshold); their outputs feed a final weighted sum Σ (weights w1, w2, w3) with no hard threshold]
N-Layer Neural Network
[Diagram: many stacked layers of units (Σ, >0?) between inputs f1, f2, f3 and a final soft output Σ]
Our Status
▪ Our objective
  ▪ Changes smoothly with changes in w
  ▪ Doesn't suffer from the same plateaus as the perceptron network
▪ Challenge: how to find a good w?
  ▪ Equivalently: max_w Σ log P(y | x; w)
1-D Optimization
▪ Could evaluate g(w₀ + h) and g(w₀ − h)
  ▪ Then step in best direction
▪ Or, evaluate derivative: dg(w₀)/dw = lim_{h→0} [g(w₀ + h) − g(w₀ − h)] / (2h)
▪ Tells which direction to step in
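A sketch of both options on a made-up 1-D objective g (central differences; all names are illustrative):

    # Two ways to pick a step direction in 1-D (toy objective, maximizing g)
    g = lambda w: -(w - 3.0)**2

    w0, h = 0.0, 1e-4

    # Option 1: evaluate g(w0 + h) and g(w0 - h), then step in the best direction
    direction = 1.0 if g(w0 + h) > g(w0 - h) else -1.0

    # Option 2: estimate the derivative; its sign tells which way to step
    dg = (g(w0 + h) - g(w0 - h)) / (2 * h)
    print(direction, dg)  # derivative is positive here, so step right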
2-D Optimization
Source: Thomas Jungblut’s Blog
▪ Idea:
  ▪ Start somewhere
  ▪ Repeat: take a step in the steepest descent direction
Steepest Descent
Figure source: Mathworks
What is the Steepest Descent Direction?
▪ Steepest direction = direction of the gradient: ∇g = [∂g/∂w₁, …, ∂g/∂wₙ]
Optimization Procedure 1: Gradient Descent
▪ Init: w (e.g., at random)
▪ For i = 1, 2, …: w ← w − α ∇g(w)
▪ α: learning rate, a tweaking parameter that needs to be chosen carefully
▪ How? Try multiple choices
  ▪ Crude rule of thumb: each update should change w by about 0.1–1%
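A minimal sketch of this loop (toy quadratic loss; grad, alpha, and the iteration count are illustrative):

    import numpy as np

    def gradient_descent(grad, w, alpha=0.1, iters=100):
        """Repeat: w <- w - alpha * grad(w)."""
        for _ in range(iters):
            w = w - alpha * grad(w)
        return w

    # Toy loss g(w) = ||w - t||^2 with gradient 2(w - t)
    t = np.array([1.0, -2.0])
    grad = lambda w: 2 * (w - t)
    print(gradient_descent(grad, np.zeros(2)))  # converges near t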
[Following slides adapted from Fei-Fei Li, Andrej Karpathy & Justin Johnson, Lecture 6, 25 Jan 2016]
Suppose the loss function is steep vertically but shallow horizontally:
Q: What is the trajectory along which we converge towards the minimum with Gradient Descent?
A: Very slow progress along the flat direction, jitter along the steep one.
Optimization Procedure 2: Momentum
▪ Gradient Descent
  ▪ Init: w
  ▪ For i = 1, 2, …: w ← w − α ∇g(w)
▪ Momentum
  ▪ Init: w, v ← 0
  ▪ For i = 1, 2, …: v ← μ v − α ∇g(w); w ← w + v
- Physical interpretation as ball rolling down the loss function + friction (μ coefficient).
- μ = usually ~0.5, 0.9, or 0.99 (sometimes annealed over time, e.g. from 0.5 → 0.99)
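A minimal sketch of the momentum update (the loss mimics the steep-vertical/shallow-horizontal example discussed next; all constants are illustrative):

    import numpy as np

    def momentum(grad, w, alpha=0.1, mu=0.9, iters=100):
        """v <- mu*v - alpha*grad(w); w <- w + v (ball rolling downhill with friction)."""
        v = np.zeros_like(w)
        for _ in range(iters):
            v = mu * v - alpha * grad(w)
            w = w + v
        return w

    # Steep vertically, shallow horizontally: g(w) = 0.1*w[0]^2 + 10*w[1]^2
    grad = lambda w: np.array([0.2 * w[0], 20.0 * w[1]])
    print(momentum(grad, np.array([5.0, 5.0]), alpha=0.04))  # near the minimum at (0, 0)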
Suppose the loss function is steep vertically but shallow horizontally:
Q: What is the trajectory along which we converge towards the minimum with Momentum?
How do we actually compute the gradient w.r.t. the weights?
Backpropagation!
Backpropagation Learning
15-486/782: Artificial Neural Networks, David S. Touretzky, Fall 2006
LMS / Widrow-Hoff Rule
Works fine for a single layer of trainable weights.
What about multi-layer networks?
[Diagram: linear unit with inputs x_i, weights w_i, sum Σ, and output y]
  Δw_i = −η (y − d) x_i
With Linear Units, Multiple Layers Don't Add Anything
U : 2×3 matrix, V : 3×4 matrix
  y = U(Vx) = (UV)x, where UV is a 2×4 matrix
Linear operators are closed under composition. Equivalent to a single layer of weights W = UV.
But with non-linear units, extra layers add computational power.
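A quick numerical check of this claim (random U and V; a sketch, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    U = rng.standard_normal((2, 3))  # 2x3
    V = rng.standard_normal((3, 4))  # 3x4
    x = rng.standard_normal(4)

    # Two linear layers...
    y_two_layers = U @ (V @ x)
    # ...equal one layer with W = UV (a 2x4 matrix)
    W = U @ V
    print(np.allclose(y_two_layers, W @ x))  # True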
What Can be Done with Non-Linear (e.g., Threshold) Units?
1 layer of trainable weights: separating hyperplane
2 layers of trainable weights: convex polygon region
3 layers of trainable weights: composition of polygons, including non-convex regions
How Do We Train A Multi-Layer Network?
Output unit: Error = d − y
Hidden units: Error = ???
Can't use the perceptron training algorithm because we don't know the 'correct' outputs for hidden units.
How Do We Train A Multi-Layer Network?
Define sum-squared error:
  E = ½ Σ_p (d_p − y_p)²
Use gradient descent error minimization:
  Δw_ij = −η ∂E/∂w_ij
Works if the nonlinear transfer function is differentiable.
Deriving the LMS or “Delta” Rule as Gradient Descent Learning
  y = Σ_i w_i x_i
  E = ½ Σ_p (d_p − y_p)²
  dE/dy = y − d
  ∂E/∂w_i = (dE/dy) · (∂y/∂w_i) = (y − d) x_i
  Δw_i = −η ∂E/∂w_i = −η (y − d) x_i
[Diagram: linear unit with inputs x_i, weights w_i, and output y]
How do we extend this to two layers?
Switch to Smooth Nonlinear Units
  net_j = Σ_i w_ij y_i
  y_j = g(net_j)    (g must be differentiable)
Common choices for g:
  g(x) = 1 / (1 + e^(−x)),   g′(x) = g(x) (1 − g(x))
  g(x) = tanh(x),   g′(x) = 1 / cosh²(x)
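A small sketch checking these derivative formulas against central finite differences (the test point x and step h are arbitrary):

    import numpy as np

    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    d_sigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))  # g'(x) = g(x)(1 - g(x))
    d_tanh = lambda x: 1.0 / np.cosh(x)**2                 # g'(x) = 1/cosh^2(x)

    x, h = 0.7, 1e-6
    print(np.isclose((sigmoid(x + h) - sigmoid(x - h)) / (2 * h), d_sigmoid(x)))  # True
    print(np.isclose((np.tanh(x + h) - np.tanh(x - h)) / (2 * h), d_tanh(x)))     # True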
Gradient Descent with Nonlinear Units
  y = g(net) = tanh(Σ_i w_i x_i)
  dE/dy = y − d,   dy/dnet = 1 / cosh²(net),   ∂net/∂w_i = x_i
  ∂E/∂w_i = (dE/dy) · (dy/dnet) · (∂net/∂w_i) = (y − d) · x_i / cosh²(Σ_i w_i x_i)
[Diagram: unit computing y = tanh(Σ_i w_i x_i) from inputs x_i and weights w_i]
Now We Can Use The Chain Rule
[Diagram: units y_i feed units y_j via weights w_ij; units y_j feed output units y_k via weights w_jk]
Output layer:
  ∂E/∂y_k = y_k − d_k
  δ_k = ∂E/∂net_k = (y_k − d_k) · g′(net_k)
  ∂E/∂w_jk = (∂E/∂net_k) · (∂net_k/∂w_jk) = (∂E/∂net_k) · y_j
Hidden layer:
  ∂E/∂y_j = Σ_k (∂E/∂net_k) · (∂net_k/∂y_j) = Σ_k δ_k w_jk
  δ_j = ∂E/∂net_j = (∂E/∂y_j) · g′(net_j)
  ∂E/∂w_ij = (∂E/∂net_j) · y_i
Weight Updates
  ∂E/∂w_jk = (∂E/∂net_k) · (∂net_k/∂w_jk) = δ_k · y_j
  ∂E/∂w_ij = (∂E/∂net_j) · (∂net_j/∂w_ij) = δ_j · y_i
  Δw_jk = −η ∂E/∂w_jk
  Δw_ij = −η ∂E/∂w_ij
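Putting these pieces together, a minimal sketch of backprop for one hidden layer (tanh units and sum-squared error as in the slides; the shapes, learning rate, and training data are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((4, 3)) * 0.5  # input -> hidden weights (w_ij)
    W2 = rng.standard_normal((1, 4)) * 0.5  # hidden -> output weights (w_jk)
    eta = 0.1

    def train_step(x, d):
        global W1, W2
        # Forward pass: net_j = sum_i w_ij y_i, y_j = g(net_j), with g = tanh
        net1 = W1 @ x;  y1 = np.tanh(net1)
        net2 = W2 @ y1; y2 = np.tanh(net2)
        # Output units: delta_k = (y_k - d_k) * g'(net_k)
        delta2 = (y2 - d) / np.cosh(net2)**2
        # Hidden units: dE/dy_j = sum_k delta_k w_jk, delta_j = dE/dy_j * g'(net_j)
        delta1 = (W2.T @ delta2) / np.cosh(net1)**2
        # Weight updates: delta_w = -eta * delta * (presynaptic activity)
        W2 -= eta * np.outer(delta2, y1)
        W1 -= eta * np.outer(delta1, x)
        return 0.5 * np.sum((d - y2)**2)

    x, d = np.array([1.0, -1.0, 0.5]), np.array([0.8])
    for step in range(200):
        loss = train_step(x, d)
    print(loss)  # decreases toward 0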
Deep learning is everywhere
Classification and Retrieval [Krizhevsky 2012]
Detection [Faster R-CNN: Ren, He, Girshick, Sun 2015]
Segmentation [Farabet et al., 2012]
Deep learning is everywhere
NVIDIA Tegra X1
self-driving cars
Deep learning is everywhere
Pose estimation [Toshev, Szegedy 2014]
Playing Atari games [Mnih 2013]
Deep learning is everywhere
[Ciresan et al. 2013] [Sermanet et al. 2011] [Ciresan et al.]
Deep learning is everywhere
Image Captioning [Vinyals et al., 2015]