Day 6: Neural networks, backpropagation
Introduction to Machine Learning Summer School
June 18 - June 29, 2018, Chicago
Instructor: Suriya Gunasekar, TTI Chicago
25 June 2018
Schedule
• 9:00am-10:25am – Lecture 6.a: Review of week 1, introduction to neural networks
• 10:30am-11:30am – Invited talk: Greg Durett (also the TTIC colloquium talk)
• 11:30am-12:30pm – Lunch
• 12:30pm-2:00pm – Lecture 6.b: Backpropagation
• 2:00pm-5:00pm – Programming
Review of week 1
Supervised learning – key questions
• Data: what kind of data can we get? how much data can we get?
• Model: what is the correct model for my data? – want to minimize the effort put into this question!
• Training: what resources (computation/memory) does the algorithm need to estimate the model f̂?
• Testing: how well will f̂ perform when deployed? what is the computational/memory requirement during deployment?
[Figure: ML pipeline – setup, data collection, representation, modeling, estimation/training, model selection; data and algorithm]
Linear regression
• Input x ∈ 𝒳 ⊂ ℝ^d, output y ∈ ℝ; want to learn f: 𝒳 → ℝ
• Training data S = {(x^(i), y^(i)) : i = 1, 2, …, N}
• Parameterize candidate f: 𝒳 → ℝ by linear functions, ℋ = {x → w·x : w ∈ ℝ^d}
• Estimate w by minimizing the loss on the training data:
  ŵ = argmin_w J_SSE(w) := Σ_{i=1}^N (w·x^(i) − y^(i))²
  o J_SSE(w) is convex in w → minimize J_SSE(w) by setting its gradient to 0
  o ∇_w J_SSE(w) = Σ_{i=1}^N 2(w·x^(i) − y^(i)) x^(i)
  o Closed-form solution: ŵ = (XᵀX)⁻¹ Xᵀy
• Can get non-linear functions by mapping x → φ(x) and doing linear regression on φ(x)
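As an illustrative sketch (my own code, not from the course), the closed-form least-squares solution can be computed with NumPy; `np.linalg.lstsq` solves the same problem as (XᵀX)⁻¹Xᵀy, but in a numerically stable way:

```python
import numpy as np

def fit_linear_regression(X, y):
    """X: (N, d) matrix with rows x^(i); y: (N,) targets. Returns w_hat."""
    # Minimizes J_SSE(w) = ||Xw - y||^2, i.e. w_hat = (X^T X)^{-1} X^T y
    # when X has full column rank.
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
w_true = np.array([2.0, -1.0])
y = X @ w_true          # noiseless toy data, so the fit recovers w_true
print(fit_linear_regression(X, y))  # ≈ [ 2. -1.]
```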
Overfitting
• For the same amount of data, more complex models (e.g., higher degree polynomials) overfit more
• or: more data is needed to fit more complex models
• complexity ≈ number of parameters
Model selection
• m model classes ℋ_1, ℋ_2, …, ℋ_m
• S = S_train ∪ S_val ∪ S_test
• Train on S_train to pick the best f̂_r ∈ ℋ_r
• Pick f̂* based on the validation loss on S_val
• Evaluate the test loss L_{S_test}(f̂*)
Regularization
• Complexity of the model class can also be controlled by the norm of the parameters – a smaller range of values is allowed
• Regularization for linear regression:
  argmin_w J_SSE(w) + λ‖w‖₂²   (ridge)
  argmin_w J_SSE(w) + λ‖w‖₁    (lasso)
• Again do model selection to pick λ – using S_val or cross-validation
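A minimal sketch of the ℓ₂-regularized (ridge) case, on toy data of my own: setting the gradient of J_SSE(w) + λ‖w‖₂² to zero gives ŵ = (XᵀX + λI)⁻¹Xᵀy, and a larger λ shrinks the solution toward 0:

```python
import numpy as np

def fit_ridge(X, y, lam):
    # Closed form for argmin_w ||Xw - y||^2 + lam * ||w||_2^2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w0 = fit_ridge(X, y, 0.0)    # lam = 0 recovers ordinary least squares
w1 = fit_ridge(X, y, 10.0)   # larger lam shrinks w toward 0
print(np.linalg.norm(w1) < np.linalg.norm(w0))  # True
```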
Classification
• Output y ∈ 𝒴 takes a discrete set of values, e.g., 𝒴 = {0, 1} or 𝒴 = {−1, 1} or 𝒴 = {spam, no spam}
  o Unlike regression, the label values do not have numerical meaning
• Classifiers divide the space of inputs 𝒳 (often ℝ^d) into "regions", where each region is assigned a label
• Non-parametric models
  o k-nearest neighbors – regions defined based on nearest neighbors
  o decision trees – structured rectangular regions
• Linear models – classifier regions are half-spaces
Classification – logistic regression
• 𝒳 = ℝ^d, 𝒴 = {−1, 1}, S = {(x^(i), y^(i)) : i = 1, 2, …, N}
• Linear model f(x) = f_w(x) = w·x
• Output classifier ŷ(x) = sign(w·x)
• Empirical risk minimization:
  ŵ = argmin_w Σ_i log(1 + exp(−y^(i) w·x^(i)))
• Probabilistic formulation: Pr(y = 1|x) = 1 / (1 + exp(−w·x))
• Multi-class generalization: 𝒴 = {1, 2, …, m}
  Pr(y|x) = exp(w_y·x) / Σ_{y′} exp(w_{y′}·x)
• Can again get non-linear decision boundaries by mapping x → φ(x)
Logistic loss: ℓ(f(x), y) = log(1 + exp(−y f(x)))
[Figure: the logistic loss ℓ(f(x), y) plotted as a function of the margin y f(x)]
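The logistic loss and its gradient can be sketched as follows (illustrative code on a toy separable dataset I made up); gradient descent on this loss is the standard way to fit logistic regression:

```python
import numpy as np

# ell(w; x, y) = log(1 + exp(-y * w.x)) for labels y in {-1, +1}
def logistic_loss(w, X, y):
    margins = y * (X @ w)
    # np.logaddexp(0, -m) = log(1 + exp(-m)), computed stably
    return np.logaddexp(0.0, -margins).sum()

def logistic_grad(w, X, y):
    margins = y * (X @ w)
    # d/dw log(1 + exp(-y w.x)) = -y * x * sigmoid(-y w.x)
    s = 1.0 / (1.0 + np.exp(margins))
    return -(X * (y * s)[:, None]).sum(axis=0)

# Tiny gradient-descent fit on linearly separable points
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.zeros(2)
for _ in range(200):
    w -= 0.1 * logistic_grad(w, X, y)
print(np.sign(X @ w))  # matches y
```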
Classification – maximum margin classifier
Separable data
• Original formulation:
  ŵ = argmax_{w∈ℝ^d} min_i y^(i) (w·x^(i)) / ‖w‖
• Fixing ‖w‖ = 1:
  ŵ = argmax_w min_i y^(i) w·x^(i)  s.t. ‖w‖ = 1
• Fixing min_i y^(i) w·x^(i) = 1:
  w̃ = argmin_w ‖w‖²  s.t. ∀i, y^(i)(w·x^(i)) ≥ 1
Slack variables for non-separable data:
  ŵ = argmin_{w,{ξ_i}} ‖w‖² + λ Σ_i ξ_i  s.t. ∀i, y^(i)(w·x^(i)) ≥ 1 − ξ_i, ξ_i ≥ 0
    = argmin_w ‖w‖² + λ Σ_i max(0, 1 − y^(i) w·x^(i))
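The slack-variable problem is equivalent to minimizing the hinge-loss objective; a small sketch (example data of my own):

```python
import numpy as np

# ||w||^2 + lam * sum_i max(0, 1 - y_i * w.x_i)
def svm_objective(w, X, y, lam):
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return w @ w + lam * hinge.sum()

X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
# w = [0.5, 0] classifies both points with margin exactly 1,
# so the hinge term vanishes and only ||w||^2 = 0.25 remains
print(svm_objective(np.array([0.5, 0.0]), X, y, lam=1.0))  # 0.25
```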
Kernel trick
• Using the representer theorem, w = Σ_{i=1}^N β_i x^(i):
  min_w ‖w‖² + λ Σ_i max(0, 1 − y^(i) w·x^(i))
  ≡ min_{β∈ℝ^N} βᵀGβ + λ Σ_i max(0, 1 − y^(i) (Gβ)_i)
  G ∈ ℝ^{N×N} with G_{ij} = x^(i)·x^(j) is called the Gram matrix
• The optimization depends on x^(i) only through G_{ij} = x^(i)·x^(j)
• For prediction, ŵ·x = Σ_i β_i x^(i)·x, so we again only need the inner products x^(i)·x
• The function K(x, x′) = x·x′ is called the kernel
• When learning non-linear classifiers using feature transformations x → φ(x) and f_w(x) = w·φ(x):
  o The classifier is fully specified in terms of K_φ(x, x′) = K(φ(x), φ(x′))
  o φ(x) itself can be very high dimensional (maybe even infinite dimensional)
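A sketch of the Gram matrix and the kernel trick (illustrative; the degree-2 polynomial kernel is my example of a K_φ evaluated without ever forming φ explicitly):

```python
import numpy as np

# G_ij = K(x_i, x_j) for a given kernel function
def gram_matrix(X, kernel):
    N = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

linear = lambda a, b: a @ b
# (a.b)^2 corresponds to phi(x) = all degree-2 monomials of x,
# but is computed in O(d) time instead of building phi
poly2 = lambda a, b: (a @ b) ** 2

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
G = gram_matrix(X, linear)
print(np.allclose(G, X @ X.T))  # True: the linear kernel gives the plain Gram matrix
```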
Optimization
• ERM + regularization optimization problem:
  ŵ = argmin_w J_reg(w) := Σ_{i=1}^N ℓ(w·φ(x^(i)), y^(i)) + λ‖w‖
• If J_reg(w) is convex in w, then ŵ is an optimum if and only if the gradient at ŵ is 0, i.e., ∇J_reg(ŵ) = 0
• Gradient descent: start with an initialization w^0 and iteratively update
  o w^(t+1) = w^(t) − η_t ∇J_reg(w^(t))
  o where ∇J_reg(w^(t)) = Σ_i ∇ℓ(w^(t)·φ(x^(i)), y^(i)) + λ∇‖w^(t)‖
• Stochastic gradient descent:
  o use the gradient from only one example
  o w^(t+1) = w^(t) − η_t ∇^(i) J_reg(w^(t))
  o where ∇^(i) J_reg(w^(t)) = ∇ℓ(w^(t)·φ(x^(i)), y^(i)) + λ∇‖w^(t)‖ for a random sample (x^(i), y^(i))
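The two update rules side by side, as a sketch on my own toy data (regularized squared loss, one of the convex losses above): full gradient descent sums the gradient over all examples, SGD uses one random example per step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true  # noiseless toy regression data

# J(w) = (1/N) sum_i (w.x_i - y_i)^2 + lam * ||w||^2
def grad_full(w, lam=0.01):
    return 2 * X.T @ (X @ w - y) / len(X) + 2 * lam * w

def grad_one(w, i, lam=0.01):
    return 2 * (X[i] @ w - y[i]) * X[i] + 2 * lam * w

w_gd = np.zeros(3)
for t in range(500):
    w_gd -= 0.1 * grad_full(w_gd)          # gradient descent

w_sgd = np.zeros(3)
for t in range(5000):
    w_sgd -= 0.01 * grad_one(w_sgd, rng.integers(len(X)))  # SGD

print(np.round(w_gd, 2), np.round(w_sgd, 2))  # both near w_true (shrunk slightly by lam)
```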
Other classification models
• Optimal unrestricted predictor:
  o Regression + squared loss → f**(x) = E[y|x]
  o Classification + 0-1 loss → ŷ**(x) = argmax_c Pr(y = c|x)
• Discriminative models: directly model Pr(y|x), e.g., logistic regression
• Generative models: model the full joint distribution Pr(y, x) = Pr(x|y) Pr(y)
• Why generative models?
  o One conditional might be simpler to model with prior knowledge, e.g., compare specifying Pr(image|digit = 1) vs Pr(digit = 1|image)
  o Naturally handles missing data
• Two examples of generative models:
  o Naïve Bayes classifier
  o Hidden Markov model
Other classifiers
• Naïve Bayes classifier: with d features x = [x_1, x_2, …, x_d], where each x_1, x_2, …, x_d can take one of K values → CK^d parameters
  o NB assumption: features are independent given the class y → CKd parameters:
    Pr(x_1, x_2, …, x_d|y) = Pr(x_1|y) Pr(x_2|y) ⋯ Pr(x_d|y) = Π_{k=1}^d Pr(x_k|y)
  o Training amounts to averaging samples across classes
• Hidden Markov model: variable-length input/observations {x_1, x_2, …, x_M} (e.g., words) and variable-length output/states {y_1, y_2, …, y_M} (e.g., tags)
  o HMM assumptions: (a) the current state, conditioned on the immediate previous state, is conditionally independent of all other variables, and (b) the current observation, conditioned on the current state, is conditionally independent of all other variables:
    Pr(x_1, …, x_M, y_1, …, y_M) = Pr(y_1) Pr(x_1|y_1) Π_{k=2}^M Pr(y_k|y_{k−1}) Pr(x_k|y_k)
  o Parameters estimated using MLE + dynamic programming
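"Training amounts to averaging samples across classes" can be sketched as counting (toy weather data of my own invention; no smoothing, so unseen values get probability 0):

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    """Naive Bayes with categorical features: fit by counting."""
    priors = Counter(y)                  # class counts -> Pr(y)
    N = len(y)
    cond = defaultdict(Counter)          # (feature index, class) -> value counts
    for xi, yi in zip(X, y):
        for k, v in enumerate(xi):
            cond[(k, yi)][v] += 1

    def predict(x):
        # argmax_c Pr(c) * prod_k Pr(x_k | c)
        def score(c):
            p = priors[c] / N
            for k, v in enumerate(x):
                p *= cond[(k, c)][v] / priors[c]
            return p
        return max(priors, key=score)
    return predict

X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cold")]
y = ["beach", "beach", "home", "home"]
predict = train_nb(X, y)
print(predict(("sunny", "mild")))  # "beach"
```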
Today
• Introduction to neural networks
• Backpropagation
Graph notation
• General variables can be input variables like x_1, x_2, …, x_d, the prediction ŷ, or any intermediate computation (we will see examples soon)
• A node z_3 with incoming edges from z_1 and z_2, with weights w_1 and w_2, denotes the computation z_3 = σ(w_1 z_1 + w_2 z_2) for some "activation" function σ (specified a priori)
Linear classifier
• Biological analogy: a single neuron – stimuli reinforce synaptic connections
• f(x) = 𝟏(w·x + w_0 ≥ 0)
[Figure: a single unit with inputs x_1, x_2, x_3, …, x_d and a constant input 1]
Slide credits: Nati Srebro, David McAllester
Shallow learning
• We already saw how to use linear models to get non-linear decision boundaries
• Feature transform: map x ∈ ℝ^d to φ(x) ∈ ℝ^{d′} and use f_w(x) = w·φ(x)
• Shallow learning: hand-crafted and non-hierarchical φ
  o Polynomial regression with squared or logistic loss, φ(x)_k = x^k
  o Kernel SVM: K(x, x′) = φ(x)·φ(x′)
• f(x) = 𝟏(w·φ(x) ≥ 0)
[Figure: a single unit over features φ(x)_1, φ(x)_2, φ(x)_3, …, φ(x)_{d′}]
Slide credit: Nati Srebro
Combining Linear Units
• z_1 = 𝟏(x_1 − x_2 > 0)
• z_2 = 𝟏(x_2 − x_1 > 0)
• f(x) = 𝟏(z_1 + z_2 > 0)
• The network represents the function f(x) = (x_1 and not x_2) or (x_2 and not x_1)
• Not a linear function of x
Slide credit: Nati Srebro
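The two-unit construction above, written out directly (the threshold function plays the role of the activation σ):

```python
def step(a):
    # threshold activation: 1 if a > 0 else 0
    return 1 if a > 0 else 0

def f(x1, x2):
    z1 = step(x1 - x2)   # fires when x1 > x2
    z2 = step(x2 - x1)   # fires when x2 > x1
    return step(z1 + z2)

# On boolean inputs this is XOR -- not computable by any single linear unit
print([f(0, 0), f(0, 1), f(1, 0), f(1, 1)])  # [0, 1, 1, 0]
```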
Combining Linear Units
• z_1 = 𝟏(w_1·x > 0)
• z_2 = 𝟏(w_2·x > 0)
• f(x) = 𝟏(w̃_1 z_1 + w̃_2 z_2 ≥ 0)
[Figure: a two-layer network over inputs x_1, x_2 with first-layer weights w_11, …, w_22]
Slide credit: Nati Srebro
Feed-Forward Neural Networks
[Figure: inputs x_1, x_2, x_3, …, x_d feed a hidden layer z_1, z_2, …, z_{d_1}, which feeds the output unit]
• z_i = σ(Σ_j W_1[j, i] x_j), for i = 1, 2, …, d_1
• f(x) = σ(Σ_j W_2[j] z_j)
Figure credit: Nati Srebro
Feed-Forward Neural Networks
Architecture:
• Directed Acyclic Graph G(V, E). Units (neurons) indexed by vertices in V.
• "Input units" v_1, …, v_d ∈ V: no incoming edges; they have value o[v_i] = x_i
• Each edge u → v has weight W[u → v]
• Pre-activation a[v] = Σ_{u→v ∈ E} W[u → v] o[u]
• Output value o[v] = σ(a[v])
• "Output unit" v_out ∈ V, with f_W(x) = a[v_out]
The network computes the function f_{G(V,E),σ,W}(x).
Note: some textbooks/conventions don't make the distinction between pre-activation and output value, and simply compute o[v] = σ(Σ_{u→v ∈ E} W[u → v] o[u]).
Slide credit: Nati Srebro
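A sketch of the forward pass exactly as defined above, on a toy network of my own (assuming the edge list is ordered so that every edge into a unit appears before any edge out of it, which a topological sort of the DAG guarantees):

```python
def forward(edges, inputs, v_out, sigma=lambda a: max(0.0, a)):  # ReLU as sigma
    """edges: dict (u, v) -> weight W[u -> v], in topological order.
    inputs: dict mapping each input unit to its value x_i."""
    o = dict(inputs)   # output values o[v], seeded with the input units
    a = {}             # pre-activations a[v]
    for (u, v), w in edges.items():
        # a[v] = sum over incoming edges of W[u -> v] * o[u]
        a[v] = a.get(v, 0.0) + w * o[u]
        # o[v] = sigma(a[v]); recomputed as incoming edges accumulate
        o[v] = sigma(a[v])
    return a[v_out]    # f(x) = a[v_out]: no activation on the output unit

# Tiny net: x1, x2 -> hidden unit h -> out
edges = {("x1", "h"): 1.0, ("x2", "h"): -1.0, ("h", "out"): 2.0}
print(forward(edges, {"x1": 3.0, "x2": 1.0}, "out"))  # a[h]=2, o[h]=2, f=4.0
```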
Feed-Forward Neural Networks
Parameters:
• Each edge u → v has weight W[u → v]
Activations:
• σ: ℝ → ℝ, for example:
  o σ(z) = sign(z)
  o σ(z) = 1/(1 + exp(−z)) (sigmoid)
  o σ(z) = ReLU(z) = max(0, z)
Deep learning
Generalize to a hierarchy of transformations of the input, learned end-to-end jointly with the predictor:
  f_W(x) = f_L(f_{L−1}(f_{L−2}(… f_1(x) …)))
  f_1(x) = σ(W_1 x)
  f_2(x) = σ(W_2 f_1(x))
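The layered composition above, sketched for fully connected layers (my own illustration, with ReLU as σ and a linear output layer as in the architecture definition):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def mlp(weights, x):
    """Computes f_L(...f_2(f_1(x))...) with f_k(h) = relu(W_k h),
    except the last layer, which is linear."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

rng = np.random.default_rng(0)
# Depth 3: input dim 3 -> hidden 4 -> hidden 4 -> scalar output
weights = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))]
x = np.array([1.0, -1.0, 0.5])
print(mlp(weights, x).shape)  # (1,)
```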
Neural Nets as Feature Learning
• Can think of the hidden layer as "features" φ(x), followed by a linear predictor based on w·φ(x)
• "Feature engineering" approach: design φ(·) based on domain knowledge
• "Deep learning" approach: learn the features from data
• Multilayer networks with non-linear activations build more and more complex features
[Figure: hidden units v_1, v_2, v_3 viewed as features φ(x)_1, …, φ(x)_k over inputs x_1, x_2, …, x_d]
Slide credit: Nati Srebro
Multi-Layer Feature Learning
Slide credit: Nati Srebro
More knowledge or more learning
• Expert knowledge: full specific knowledge → expert systems (no data at all)
• Use expert knowledge to construct φ(x) or K(x, x′), then use, e.g., an SVM on φ(x)
• "Deep learning": use very simple raw features as input, learn good features using a deep neural net
• No free lunch: the less prior knowledge we build in, the more data we need →
Slide credit: Nati Srebro
Neural networks as hypothesis class
• Hypothesis class specified by:
  o Graph G(V, E)
  o Activation function σ
  o Weights W, with weight W[u → v] for each edge u → v ∈ E
• ℋ = {f_{G(V,E),σ,W} | W: E → ℝ} – based on the architecture G(V, E) and a fixed σ
• Expressive power:
  {f | f computable in time T} ⊆ ℋ_{G(V,E),sign} with |E| = O(T²)
• Computation: empirical risk minimization
  Ŵ = argmin_W Σ_{i=1}^N ℓ(f_{G(V,E),σ,W}(x^(i)), y^(i))
  o Highly non-convex problem, even if the loss ℓ is convex
  o Hard to minimize: learning even tiny neural networks is computationally hard
So how do we learn?
  Ŵ = argmin_W Σ_{i=1}^N ℓ(f_{G(V,E),σ,W}(x^(i)), y^(i))
• Stochastic gradient descent: for a random (x^(i), y^(i)) ∈ S,
  W^(t+1) ← W^(t) − η^(t) ∇ℓ(f_{G(V,E),σ,W^(t)}(x^(i)), y^(i))
  (even though it's not convex)
• How do we efficiently calculate ∇ℓ(f_{G(V,E),σ,W^(t)}(x^(i)), y^(i))?
  o Karl will tell you!
• Now, a brief detour into the history and resurrection of NNs
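To see why "efficiently" matters, here is the naive alternative as a sketch (my own toy example): estimating the gradient the SGD update needs by finite differences costs one forward pass per parameter, whereas backpropagation produces the same numbers with a single backward pass.

```python
import numpy as np

def numerical_grad(loss, W, eps=1e-6):
    """Central finite-difference estimate of d loss / d W, entry by entry."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        g[idx] = (loss(Wp) - loss(Wm)) / (2 * eps)  # one forward pass per entry
    return g

# Toy one-layer example: loss(W) = (W.x - y)^2
x, y = np.array([1.0, 2.0]), 3.0
loss = lambda W: (W @ x - y) ** 2
W = np.array([0.5, 0.5])
print(numerical_grad(loss, W))  # ≈ analytic gradient 2*(W.x - y)*x = [-3, -6]
```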
ImageNet challenge – object classification
Object detection
Slide credit: David McAllester
History of Neural Networks
• 1940s-70s:
  o Inspired by learning in the brain, and as a model for the brain (Pitts, Hebb, and others)
  o Various models, directed and undirected, different activation and learning rules
  o Perceptron rule (Rosenblatt), problem of XOR, multilayer perceptron (Minsky and Papert)
  o Backpropagation (Werbos 1975)
• 1980s-early 1990s:
  o Practical backprop (Rumelhart, Hinton et al. 1986) and SGD (Bottou)
  o Relationship to distributed computing; "connectionism"
  o Initial empirical success
• 1990s-2000s:
  o Lost favor to implicitly linear methods: SVMs, boosting
• 2000-2010s:
  o Revival of interest (CIFAR groups)
  o ca. 2005: layer-wise pretraining of deep-ish nets
  o Progress in speech and vision with deep neural nets
• 2010s:
  o Computational advances allow training HUGE networks
  o … and also a few new tricks
  o Krizhevsky et al. win ImageNet
  o Empirical success and renewed interest
Deep learning – today
State-of-the-art performance in several tasks, actively deployed in real systems:
o Computer vision
o Speech recognition
o Machine translation
o Dialog systems
o Computer games
o Information retrieval