Day 6: Neural networks, backpropagation

Introduction to Machine Learning Summer School, June 18, 2018 - June 29, 2018, Chicago. Instructor: Suriya Gunasekar, TTI Chicago. 25 June 2018. Day 6: Neural networks, backpropagation.

Transcript of Day 6: Neural networks, backpropagation

Page 1:

Introduction to Machine Learning Summer School, June 18, 2018 - June 29, 2018, Chicago

Instructor: Suriya Gunasekar, TTI Chicago

25 June 2018

Day 6: Neural networks, backpropagation

Page 2:

Schedule

• 9:00am-10:25am – Lecture 6.a: Review of week 1, introduction to neural networks
• 10:30am-11:30am – Invited talk: Greg Durett (also the TTIC colloquium talk)
• 11:30am-12:30pm – Lunch
• 12:30pm-2:00pm – Lecture 6.b: Backpropagation
• 2:00pm-5:00pm – Programming


Page 3:

Review of week 1


Page 4:


Supervised learning – key questions

• Data: what kind of data can we get? How much data can we get?

• Model: what is the correct model for my data? – we want to minimize the effort put into this question!

• Training: what resources (computation/memory) does the algorithm need to estimate the model $\hat{f}$?

• Testing: how well will $\hat{f}$ perform when deployed? What is the computation/memory requirement during deployment?

[Figure: supervised learning pipeline – setup, data collection, representation, modeling, estimation/training, model selection – combining data and algorithm to produce $\hat{f}$]

Page 5:

Linear regression

• Input $\boldsymbol{x} \in \mathcal{X} \subseteq \mathbb{R}^d$, output $y \in \mathbb{R}$; want to learn $f: \mathcal{X} \to \mathbb{R}$

• Training data $S = \{(\boldsymbol{x}^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$

• Parameterize candidate $f: \mathcal{X} \to \mathbb{R}$ by linear functions, $\mathcal{H} = \{\boldsymbol{x} \mapsto \boldsymbol{w} \cdot \boldsymbol{x} : \boldsymbol{w} \in \mathbb{R}^d\}$

• Estimate $\boldsymbol{w}$ by minimizing loss on training data:
$$\hat{\boldsymbol{w}} = \arg\min_{\boldsymbol{w}} J_{SSE}(\boldsymbol{w}) := \sum_{i=1}^{N} \left(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} - y^{(i)}\right)^2$$
o $J_{SSE}(\boldsymbol{w})$ is convex in $\boldsymbol{w}$ → minimize $J_{SSE}(\boldsymbol{w})$ by setting the gradient to 0
o $\nabla_{\boldsymbol{w}} J_{SSE}(\boldsymbol{w}) = 2\sum_{i=1}^{N} \left(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} - y^{(i)}\right) \boldsymbol{x}^{(i)}$
o Closed-form solution: $\hat{\boldsymbol{w}} = \left(\boldsymbol{X}^\top \boldsymbol{X}\right)^{-1} \boldsymbol{X}^\top \boldsymbol{y}$

• Can get non-linear functions by mapping $\boldsymbol{x} \mapsto \phi(\boldsymbol{x})$ and doing linear regression on $\phi(\boldsymbol{x})$ (see the sketch below)
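A minimal NumPy sketch of the closed-form solution above, together with a polynomial feature map; the function and variable names are illustrative, not from the lecture:

```python
import numpy as np

def fit_linear_regression(X, y):
    """Closed-form least squares: w_hat = (X^T X)^{-1} X^T y.
    X is the (N, d) design matrix, y the (N,) targets.
    np.linalg.solve is used instead of an explicit inverse for stability."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def poly_features(x, degree):
    # Feature map phi(x)_k = x^k for scalar inputs: (N,) -> (N, degree + 1)
    return np.vander(x, degree + 1, increasing=True)

# Fit a cubic to noisy data by doing linear regression on phi(x).
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(50)
w_hat = fit_linear_regression(poly_features(x, 3), y)
```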


Page 6:

Overfitting
• For the same amount of data, more complex models (e.g., higher-degree polynomials) overfit more

• or: one needs more data to fit more complex models

• complexity ≈ number of parameters

Model selection
• $m$ model classes $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_m$

• $S = S_{train} \cup S_{val} \cup S_{test}$
• Train on $S_{train}$ to pick the best $\hat{f}_r \in \mathcal{H}_r$

• Pick $\hat{f}^*$ based on the validation loss on $S_{val}$
• Evaluate the test loss $L_{S_{test}}(\hat{f}^*)$ (see the sketch below)
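A hypothetical sketch of this split-based model selection loop. The interface on the candidate models is an assumption made for illustration (each model is assumed to expose `fit(X, y)` returning a fitted model and `loss(X, y)`); it is not a specific library API:

```python
import numpy as np

def select_model(models, X, y, rng=np.random.default_rng(0)):
    """Split S into S_train, S_val, S_test; train each class on S_train,
    pick by validation loss, and report test loss once at the end."""
    idx = rng.permutation(len(y))
    n_tr, n_val = int(0.6 * len(y)), int(0.2 * len(y))
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]

    fitted = [m.fit(X[tr], y[tr]) for m in models]            # best f_r in each H_r
    best = min(fitted, key=lambda m: m.loss(X[va], y[va]))    # pick f* via S_val
    return best, best.loss(X[te], y[te])                      # evaluate on S_test
```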



Page 7:

Regularization

• Complexity of a model class can also be controlled by the norm of the parameters – a smaller range of values is allowed
• Regularization for linear regression:

$$\arg\min_{\boldsymbol{w}} J_{SSE}(\boldsymbol{w}) + \lambda \|\boldsymbol{w}\|_2^2 \qquad (\ell_2)$$
$$\arg\min_{\boldsymbol{w}} J_{SSE}(\boldsymbol{w}) + \lambda \|\boldsymbol{w}\|_1 \qquad (\ell_1)$$

• Again do model selection to pick $\lambda$ – using $S_{val}$ or cross-validation (a minimal sketch follows)
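A minimal sketch of the $\ell_2$-regularized case (ridge regression), whose closed form follows the same derivation as the unregularized solution above; names like `val_loss` are hypothetical placeholders:

```python
import numpy as np

def fit_ridge(X, y, lam):
    """argmin_w J_SSE(w) + lam * ||w||_2^2 has the closed form
    w_hat = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Model selection over lambda using a validation set (splits and a
# hypothetical val_loss helper assumed given):
# best_lam = min([0.01, 0.1, 1.0, 10.0],
#                key=lambda l: val_loss(fit_ridge(X_tr, y_tr, l), X_val, y_val))
```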


Page 8:

Classification

• Output $y \in \mathcal{Y}$ takes a discrete set of values, e.g., $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = \{-1, 1\}$ or $\mathcal{Y} = \{\mathrm{spam}, \mathrm{no\ spam}\}$
o Unlike regression, the label values do not have numeric meaning

• Classifiers divide the input space $\mathcal{X}$ (often $\mathbb{R}^d$) into "regions", where each region is assigned a label

• Non-parametric models:
o k-nearest neighbors – regions defined based on nearest neighbors

o decision trees – structured rectangular regions

• Linear models – classifier regions are halfspaces


Page 9:

Classification – logistic regression

• $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{-1, 1\}$, $S = \{(\boldsymbol{x}^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$

• Linear model $f(\boldsymbol{x}) = f_{\boldsymbol{w}}(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}$

• Output classifier $\hat{y}(\boldsymbol{x}) = \mathrm{sign}(\boldsymbol{w} \cdot \boldsymbol{x})$

• Empirical risk minimization:
$$\hat{\boldsymbol{w}} = \arg\min_{\boldsymbol{w}} \sum_{i=1}^{N} \log\left(1 + \exp\left(-\boldsymbol{w} \cdot \boldsymbol{x}^{(i)}\, y^{(i)}\right)\right)$$

• Probabilistic formulation: $\Pr(y = 1 \mid \boldsymbol{x}) = \frac{1}{1 + \exp(-\boldsymbol{w} \cdot \boldsymbol{x})}$

• Multi-class generalization: $\mathcal{Y} = \{1, 2, \ldots, m\}$

• $\Pr(y \mid \boldsymbol{x}) = \frac{\exp(-\boldsymbol{w}_y \cdot \boldsymbol{x})}{\sum_{y'} \exp(-\boldsymbol{w}_{y'} \cdot \boldsymbol{x})}$

• Can again get non-linear decision boundaries by mapping $\boldsymbol{x} \mapsto \phi(\boldsymbol{x})$


Logistic loss: $\ell(f(\boldsymbol{x}), y) = \log\left(1 + \exp\left(-f(\boldsymbol{x})\, y\right)\right)$ (a gradient sketch follows the figure below)

[Figure: logistic loss $\ell(f(x), y)$ plotted against the margin $f(x)\, y$]
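A short sketch of the logistic loss and its gradient for labels $y \in \{-1, +1\}$, matching the formula above; function and variable names are illustrative:

```python
import numpy as np

def logistic_loss_and_grad(w, X, y):
    """Summed logistic loss log(1 + exp(-y * w.x)) and its gradient in w.
    X: (N, d), y: (N,) with entries in {-1, +1}."""
    margins = y * (X @ w)                        # y^(i) * (w . x^(i))
    loss = np.sum(np.logaddexp(0.0, -margins))   # stable log(1 + exp(-m))
    p = 1.0 / (1.0 + np.exp(margins))            # -d(loss_i)/d(margin_i)
    grad = -(X.T @ (p * y))
    return loss, grad
```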

Page 10:

Classification – maximum margin classifier

Separable data
• Original formulation:

$$\hat{\boldsymbol{w}} = \arg\max_{\boldsymbol{w} \in \mathbb{R}^d} \min_i \frac{y^{(i)}\left(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)}\right)}{\|\boldsymbol{w}\|}$$
• Fixing $\|\boldsymbol{w}\| = 1$:
$$\hat{\boldsymbol{w}} = \arg\max_{\boldsymbol{w}} \min_i y^{(i)}\left(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)}\right) \quad \text{s.t. } \|\boldsymbol{w}\| = 1$$
• Fixing $\min_i y^{(i)}\left(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)}\right) = 1$:
$$\boldsymbol{w}' = \arg\min_{\boldsymbol{w}} \|\boldsymbol{w}\|^2 \quad \text{s.t. } \forall i,\ y^{(i)}\left(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)}\right) \ge 1$$

Slack variables for non-separable data:
$$\hat{\boldsymbol{w}} = \arg\min_{\boldsymbol{w}, \{\xi_i \ge 0\}} \|\boldsymbol{w}\|^2 + \lambda \sum_i \xi_i \quad \text{s.t. } \forall i,\ y^{(i)}\left(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)}\right) \ge 1 - \xi_i$$
$$= \arg\min_{\boldsymbol{w}} \|\boldsymbol{w}\|^2 + \lambda \sum_i \max\left(0,\, 1 - y^{(i)}\left(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)}\right)\right)$$
The second form is the hinge-loss objective (a subgradient sketch follows).
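A minimal sketch of the hinge-loss objective and a subgradient, as one would feed to a (sub)gradient method; names are illustrative:

```python
import numpy as np

def hinge_objective_and_subgrad(w, X, y, lam):
    """||w||^2 + lam * sum_i max(0, 1 - y^(i) (w . x^(i))), y in {-1, +1}.
    Returns the objective and a subgradient (the hinge is non-differentiable
    at margin exactly 1; any subgradient works there)."""
    margins = y * (X @ w)
    active = margins < 1.0                       # examples with positive slack
    obj = w @ w + lam * np.sum(np.maximum(0.0, 1.0 - margins))
    subgrad = 2.0 * w - lam * (X[active].T @ y[active])
    return obj, subgrad
```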


Page 11:

Kernel trick
• Using the representer theorem, $\boldsymbol{w} = \sum_{i=1}^{N} \beta_i \boldsymbol{x}^{(i)}$:
$$\min_{\boldsymbol{w}} \|\boldsymbol{w}\|^2 + \lambda \sum_i \max\left(0,\, 1 - y^{(i)}\left(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)}\right)\right) \;\equiv\; \min_{\boldsymbol{\beta} \in \mathbb{R}^N} \boldsymbol{\beta}^\top \boldsymbol{G} \boldsymbol{\beta} + \lambda \sum_i \max\left(0,\, 1 - y^{(i)} (\boldsymbol{G}\boldsymbol{\beta})_i\right)$$
$\boldsymbol{G} \in \mathbb{R}^{N \times N}$ with $G_{ij} = \boldsymbol{x}^{(i)} \cdot \boldsymbol{x}^{(j)}$ is called the Gram matrix

• The optimization depends on $\boldsymbol{x}^{(i)}$ only through $G_{ij} = \boldsymbol{x}^{(i)} \cdot \boldsymbol{x}^{(j)}$

• For prediction, $\hat{\boldsymbol{w}} \cdot \boldsymbol{x} = \sum_i \beta_i\, \boldsymbol{x}^{(i)} \cdot \boldsymbol{x}$, so we again only need $\boldsymbol{x}^{(i)} \cdot \boldsymbol{x}$

• The function $K(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x} \cdot \boldsymbol{x}'$ is called the kernel
• When learning non-linear classifiers using feature transformations $\boldsymbol{x} \mapsto \phi(\boldsymbol{x})$ and $f_{\boldsymbol{w}}(\boldsymbol{x}) = \boldsymbol{w} \cdot \phi(\boldsymbol{x})$:
o The classifier is fully specified in terms of $K_\phi(\boldsymbol{x}, \boldsymbol{x}') = K(\phi(\boldsymbol{x}), \phi(\boldsymbol{x}'))$ (see the sketch below)
o $\phi(\boldsymbol{x})$ itself can be very, very high dimensional (maybe even infinite dimensional)
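A sketch of Gram-matrix construction and kernelized prediction, showing that $\phi(\boldsymbol{x})$ is never materialized; the RBF kernel at the end is just an example choice:

```python
import numpy as np

def gram_matrix(X, kernel):
    # G[i, j] = K(x^(i), x^(j)) for any kernel function of two vectors.
    N = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

def kernel_predict(beta, X_train, x, kernel):
    """w . phi(x) = sum_i beta_i K(x^(i), x): only kernel evaluations needed."""
    return sum(b * kernel(xi, x) for b, xi in zip(beta, X_train))

# Example kernel K_phi for an (implicitly) infinite-dimensional phi:
rbf = lambda u, v: np.exp(-np.sum((u - v) ** 2))
```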


Page 12:

Optimization
• ERM + regularization optimization problem:
$$\hat{\boldsymbol{w}} = \arg\min_{\boldsymbol{w}} J_S(\boldsymbol{w}) := \sum_{i=1}^{N} \ell\left(\boldsymbol{w} \cdot \phi(\boldsymbol{x}^{(i)}),\, y^{(i)}\right) + \lambda \|\boldsymbol{w}\|$$

• If $J_S(\boldsymbol{w})$ is convex in $\boldsymbol{w}$, then $\hat{\boldsymbol{w}}$ is an optimum if and only if the gradient at $\hat{\boldsymbol{w}}$ is 0, i.e., $\nabla J_S(\hat{\boldsymbol{w}}) = 0$

• Gradient descent: start with an initialization $\boldsymbol{w}_0$ and iteratively update
o $\boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \eta_t \nabla J_S(\boldsymbol{w}_t)$
o where $\nabla J_S(\boldsymbol{w}_t) = \sum_{i} \nabla \ell\left(\boldsymbol{w}_t \cdot \phi(\boldsymbol{x}^{(i)}),\, y^{(i)}\right) + \lambda \nabla \|\boldsymbol{w}_t\|$

• Stochastic gradient descent (sketched below):
o use the gradient from only one example
o $\boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \eta_t \tilde{\nabla}_i J_S(\boldsymbol{w}_t)$
o where $\tilde{\nabla}_i J_S(\boldsymbol{w}_t) = \nabla \ell\left(\boldsymbol{w}_t \cdot \phi(\boldsymbol{x}^{(i)}),\, y^{(i)}\right) + \lambda \nabla \|\boldsymbol{w}_t\|$ for a random sample $(\boldsymbol{x}^{(i)}, y^{(i)})$
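A generic SGD loop matching the update above, using an $\ell_2$ regularizer for concreteness; the per-example gradient function is assumed given (e.g., the logistic gradient sketched earlier), and all names are illustrative:

```python
import numpy as np

def sgd(grad_example, w0, X, y, lr=0.1, epochs=10, lam=0.0,
        rng=np.random.default_rng(0)):
    """Minimize sum_i l(w . phi(x^(i)), y^(i)) + lam * ||w||_2^2 by SGD.
    grad_example(w, x_i, y_i) returns the gradient of l for one sample."""
    w = w0.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(y)):            # random sample order
            g = grad_example(w, X[i], y[i]) + 2.0 * lam * w
            w -= lr * g                              # w_{t+1} = w_t - eta * grad
    return w
```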


Page 13:

Other classification models
• Optimal unrestricted predictor:
o Regression + squared loss → $f^{**}(\boldsymbol{x}) = \mathbf{E}[y \mid \boldsymbol{x}]$
o Classification + 0-1 loss → $\hat{y}^{**}(\boldsymbol{x}) = \arg\max_c \Pr(y = c \mid \boldsymbol{x})$

• Discriminative models: directly model $\Pr(y \mid \boldsymbol{x})$, e.g., logistic regression

• Generative models: model the full joint distribution $\Pr(y, \boldsymbol{x}) = \Pr(\boldsymbol{x} \mid y) \Pr(y)$

• Why generative models?
o One conditional might be simpler to model with prior knowledge, e.g., compare specifying $\Pr(\mathrm{image} \mid \mathrm{digit} = 1)$ vs $\Pr(\mathrm{digit} = 1 \mid \mathrm{image})$

o Naturally handles missing data

• Two examples of generative models:
o Naïve Bayes classifier
o Hidden Markov model


Page 14:

Other classifiers
• Naïve Bayes classifier: with $d$ features $\boldsymbol{x} = [x_1, x_2, \ldots, x_d]$, where each of $x_1, x_2, \ldots, x_d$ can take one of $K$ values → $C K^d$ parameters
o NB assumption: features are independent given the class $y$ → $C K d$ parameters
$$\Pr(x_1, x_2, \ldots, x_d \mid y) = \Pr(x_1 \mid y) \Pr(x_2 \mid y) \cdots \Pr(x_d \mid y) = \prod_{k=1}^{d} \Pr(x_k \mid y)$$
o Training amounts to averaging samples across classes (a counting sketch follows)
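A count-based training and prediction sketch under the NB assumption; the add-one (Laplace) smoothing is my addition, and the shapes and names are illustrative:

```python
import numpy as np

def fit_naive_bayes(X, y, K, C):
    """X: (N, d) integer features in {0..K-1}; y: (N,) classes in {0..C-1}.
    Estimates Pr(y = c) and Pr(x_k = v | y = c) by class-wise counting."""
    N, d = X.shape
    prior = np.bincount(y, minlength=C) / N
    cond = np.ones((C, d, K))                        # add-one (Laplace) smoothing
    for xi, yi in zip(X, y):
        cond[yi, np.arange(d), xi] += 1.0            # count feature values per class
    cond /= cond.sum(axis=2, keepdims=True)
    return prior, cond

def predict_naive_bayes(x, prior, cond):
    # argmax_c log Pr(y = c) + sum_k log Pr(x_k | y = c)
    d = len(x)
    logp = np.log(prior) + np.log(cond[:, np.arange(d), x]).sum(axis=1)
    return int(np.argmax(logp))
```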

• Hidden Markov model: variable-length input/observations $\{x_1, x_2, \ldots, x_m\}$ (e.g., words) and variable-length output/states $\{y_1, y_2, \ldots, y_m\}$ (e.g., tags)
o HMM assumption: (a) the current state, conditioned on the immediately previous state, is conditionally independent of all other variables; and (b) the current observation, conditioned on the current state, is conditionally independent of all other variables.
$$\Pr(x_1, \ldots, x_m, y_1, \ldots, y_m) = \Pr(y_1) \Pr(x_1 \mid y_1) \prod_{k=2}^{m} \Pr(y_k \mid y_{k-1}) \Pr(x_k \mid y_k)$$
o Parameters estimated using MLE; inference via dynamic programming

Page 15:

Today

Introduction to neural networks

Backpropagation


Page 16:

Graph notation


General variables:
- can be input variables like $x_1, x_2, \ldots, x_d$
- the prediction $\hat{y}$
- or any intermediate computation (we will see examples soon)

[Figure: a unit $z_3$ with incoming edges from $z_1$ and $z_2$, weighted $w_1$ and $w_2$]

$z_3$ denotes the computation $z_3 = \sigma(w_1 z_1 + w_2 z_2)$ for some "activation" function $\sigma$ (specified a priori; see the sketch below)
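A one-line illustration of this node computation with a sigmoid activation; the numeric values are arbitrary:

```python
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))      # one possible activation
w1, w2, z1, z2 = 0.5, -1.0, 2.0, 1.0            # arbitrary illustrative values
z3 = sigma(w1 * z1 + w2 * z2)                   # the node computes sigma(w1*z1 + w2*z2)
```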

Page 17:

Linear classifier

• Biological analogy: a single neuron – stimuli reinforce synaptic connections

[Figure: inputs $x_1, x_2, x_3, \ldots, x_d$ and a constant input 1 feeding a single threshold unit]

$$f(\boldsymbol{x}) = \mathbf{1}\left[\boldsymbol{w} \cdot \boldsymbol{x} + w_0 \ge 0\right]$$

Slide credits: Nati Srebro, David McAllester

Page 18:

Shallow learning

• We already saw how to use linear models to get non-linear decision boundaries
• Feature transform: map $\boldsymbol{x} \in \mathbb{R}^d$ to $\phi(\boldsymbol{x}) \in \mathbb{R}^{d'}$ and use $f_{\boldsymbol{w}}(\boldsymbol{x}) = \boldsymbol{w} \cdot \phi(\boldsymbol{x})$
• Shallow learning: hand-crafted and non-hierarchical $\phi$
o Polynomial regression with squared or logistic loss, $\phi(x)_k = x^k$

o Kernel SVM: $K(\boldsymbol{x}, \boldsymbol{x}') = \phi(\boldsymbol{x}) \cdot \phi(\boldsymbol{x}')$


[Figure: features $\phi(\boldsymbol{x})_1, \phi(\boldsymbol{x})_2, \phi(\boldsymbol{x})_3, \ldots, \phi(\boldsymbol{x})_{d'}$ feeding a linear threshold unit]

$$f(\boldsymbol{x}) = \mathbf{1}\left[\boldsymbol{w} \cdot \phi(\boldsymbol{x}) \ge 0\right]$$

Slide credit: Nati Srebro

Page 19:

Combining Linear Units

$$z_1 = \mathbf{1}(x_1 - x_2 > 0) \qquad z_2 = \mathbf{1}(x_2 - x_1 > 0)$$
$$f(\boldsymbol{x}) = \mathbf{1}(z_1 + z_2 > 0)$$

[Figure: inputs $x_1, x_2$ feeding two threshold units $z_1, z_2$, combined by a third threshold unit]

• The network represents the function $f(x) = (x_1 \text{ and not } x_2) \text{ or } (x_2 \text{ and not } x_1)$ – XOR on binary inputs (see the sketch below)

• Not a linear function of $x$
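A tiny sketch of this two-unit network on binary inputs, where it computes XOR; `step` plays the role of the indicator $\mathbf{1}(\cdot > 0)$:

```python
import numpy as np

step = lambda t: (np.asarray(t) > 0).astype(float)   # indicator 1(t > 0)

def xor_net(x1, x2):
    z1 = step(x1 - x2)          # x1 and not x2
    z2 = step(x2 - x1)          # x2 and not x1
    return step(z1 + z2)        # fires iff exactly one of z1, z2 fires

# On {0,1}^2 this reproduces XOR: f(0,0)=0, f(0,1)=1, f(1,0)=1, f(1,1)=0.
```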

Slide credit: Nati Srebro

Page 20:

Combining Linear Units

$$z_1 = \mathbf{1}(\boldsymbol{w_1} \cdot \boldsymbol{x} > 0) \qquad z_2 = \mathbf{1}(\boldsymbol{w_2} \cdot \boldsymbol{x} > 0)$$
$$f(\boldsymbol{x}) = \mathbf{1}\left(w'_1 z_1 + w'_2 z_2 \ge 0\right)$$

[Figure: inputs $x_1, x_2$ connected to $z_1, z_2$ through weights $w_{11}, \ldots, w_{22}$, combined at the output]

Page 21:

Feed-Forward Neural Networks

[Figure: inputs $x_1, x_2, x_3, \ldots, x_d$ feeding a hidden layer $z$, then an output unit]

$$z[i] = \sigma\left(\sum_j W^{(1)}[j, i]\, x_j\right) \qquad f(\boldsymbol{x}) = \sigma\left(\sum_j W^{(2)}[j]\, z[j]\right)$$

Figure credit: Nati Srebro

Page 22:

Feed-Forward Neural Networks

[Figure: the same network, highlighting hidden unit $z[1]$]

$$z[1] = \sigma\left(\sum_j W^{(1)}[j, 1]\, x_j\right) \qquad f(\boldsymbol{x}) = \sigma\left(\sum_j W^{(2)}[j]\, z[j]\right)$$

Page 23:

Feed-Forward Neural Networks

[Figure: the same network, highlighting hidden unit $z[2]$]

$$z[2] = \sigma\left(\sum_j W^{(1)}[j, 2]\, x_j\right) \qquad f(\boldsymbol{x}) = \sigma\left(\sum_j W^{(2)}[j]\, z[j]\right)$$

Page 24:

Feed-Forward Neural Networks

[Figure: the same network, highlighting hidden unit $z[3]$]

$$z[3] = \sigma\left(\sum_j W^{(1)}[j, 3]\, x_j\right) \qquad f(\boldsymbol{x}) = \sigma\left(\sum_j W^{(2)}[j]\, z[j]\right)$$

Page 25:

Feed-Forward Neural Networks

[Figure: the same network, highlighting the last hidden unit $z[d_1]$]

$$z[d_1] = \sigma\left(\sum_j W^{(1)}[j, d_1]\, x_j\right) \qquad f(\boldsymbol{x}) = \sigma\left(\sum_j W^{(2)}[j]\, z[j]\right)$$

Page 26:

Feed-Forward Neural Networks

[Figure: the same network, with generic hidden unit $z[i]$]

$$z[i] = \sigma\left(\sum_j W^{(1)}[j, i]\, x_j\right) \qquad f(\boldsymbol{x}) = \sigma\left(\sum_j W^{(2)}[j]\, z[j]\right)$$

Page 27:

Feed-Forward Neural Networks

Architecture:
• Directed acyclic graph $G(V, E)$. Units (neurons) are indexed by vertices in $V$.

[Figure: network with inputs $x_1, x_2, x_3, \ldots, x_d$ computing $f_{G(V,E),\sigma,\boldsymbol{W}}(\boldsymbol{x})$]

Slide credit: Nati Srebro

Page 28:

Feed-Forward Neural Networks

Architecture:
• Directed acyclic graph $G(V, E)$. Units (neurons) are indexed by vertices in $V$.
• "Input units" $v_1, \ldots, v_d \in V$: no incoming edges; they take the value $o[v_i] = x_i$

[Figure: the same network, with the input vertices labeled $v_1, v_2, v_3, \ldots, v_d$]

Slide credit: Nati Srebro

Page 29:

Feed-Forward Neural Networks

Architecture:
• Directed acyclic graph $G(V, E)$. Units (neurons) are indexed by vertices in $V$.
• "Input units" $v_1, \ldots, v_d \in V$: no incoming edges; they take the value $o[v_i] = x_i$
• Each edge $u \to v$ has a weight $\boldsymbol{W}[u \to v]$
• Pre-activation: $a[v] = \sum_{u \to v \in E} \boldsymbol{W}[u \to v]\, o[u]$

[Figure: the same network, highlighting an edge $u \to v$]

Slide credit: Nati Srebro

Page 30:

Feed-Forward Neural Networks

Architecture:
• Directed acyclic graph $G(V, E)$. Units (neurons) are indexed by vertices in $V$.
• "Input units" $v_1, \ldots, v_d \in V$: no incoming edges; they take the value $o[v_i] = x_i$
• Each edge $u \to v$ has a weight $\boldsymbol{W}[u \to v]$
• Pre-activation: $a[v] = \sum_{u \to v \in E} \boldsymbol{W}[u \to v]\, o[u]$
• Output value: $o[v] = \sigma(a[v])$

[Figure: the same network, highlighting the unit $v$ and its output value]

Slide credit: Nati Srebro

Page 31:

Feed-Forward Neural Networks

Architecture:
• Directed acyclic graph $G(V, E)$. Units (neurons) are indexed by vertices in $V$.
• "Input units" $v_1, \ldots, v_d \in V$: no incoming edges; they take the value $o[v_i] = x_i$
• Each edge $u \to v$ has a weight $\boldsymbol{W}[u \to v]$
• Pre-activation: $a[v] = \sum_{u \to v \in E} \boldsymbol{W}[u \to v]\, o[u]$
• Output value: $o[v] = \sigma(a[v])$
• "Output unit" $v_{out} \in V$: $f_{\boldsymbol{W}}(\boldsymbol{x}) = a[v_{out}]$

[Figure: the full network from inputs $x_1, \ldots, x_d$ through $u, v$ to $v_{out}$, computing $f_{G(V,E),\sigma,\boldsymbol{W}}(\boldsymbol{x})$]

Page 32:

Feed-Forward Neural Networks

Architecture:
• Directed acyclic graph $G(V, E)$. Units (neurons) are indexed by vertices in $V$.
• "Input units" $v_1, \ldots, v_d \in V$: no incoming edges; they take the value $o[v_i] = x_i$
• Each edge $u \to v$ has a weight $\boldsymbol{W}[u \to v]$
• Pre-activation: $a[v] = \sum_{u \to v \in E} \boldsymbol{W}[u \to v]\, o[u]$
• Output value: $o[v] = \sigma(a[v])$
• "Output unit" $v_{out} \in V$: $f_{\boldsymbol{W}}(\boldsymbol{x}) = a[v_{out}]$

[Figure: the full network from inputs $x_1, \ldots, x_d$ to $v_{out}$, computing $f_{G(V,E),\sigma,\boldsymbol{W}}(\boldsymbol{x})$]

Some textbooks/conventions do not make the distinction between pre-activation and output value, and simply compute $o[v] = \sigma\left(\sum_{u \to v \in E} W[u \to v]\, o[u]\right)$ (a forward-pass sketch follows).
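A minimal sketch of the forward pass over a DAG in topological order, following the definitions above; the graph representation (`parents`, `topo_order`, edge-keyed `W`) is an assumed encoding, not from the lecture:

```python
def forward(topo_order, inputs, W, parents, sigma, v_out):
    """Compute f(x) on a feed-forward DAG:
    a[v] = sum over edges u->v of W[(u, v)] * o[u],  o[v] = sigma(a[v]),
    and f(x) = a[v_out] (pre-activation at the output unit)."""
    a, o = {}, dict(inputs)                  # input units v_i carry o[v_i] = x_i
    for v in topo_order:
        if v in o:                           # skip input units
            continue
        a[v] = sum(W[(u, v)] * o[u] for u in parents[v])
        o[v] = sigma(a[v])
    return a[v_out]

# Example: the two-input, one-output unit from the graph-notation slide.
f = forward(["z1", "z2", "z3"], {"z1": 2.0, "z2": 1.0},
            {("z1", "z3"): 0.5, ("z2", "z3"): -1.0},
            {"z3": ["z1", "z2"]}, sigma=lambda t: max(0.0, t), v_out="z3")
```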

Page 33:

Feed-Forward Neural Networks

Parameters:
• Each edge $u \to v$ has a weight $\boldsymbol{W}[u \to v]$

Activations:
• $\sigma: \mathbb{R} \to \mathbb{R}$, for example (one-liners sketched below):
• $\sigma(z) = \mathrm{sign}(z)$ or $\sigma(z) = \frac{1}{1 + \exp(-z)}$
• $\sigma(z) = \mathrm{ReLU}(z) = \max(0, z)$

[Figure: the same network from inputs $x_1, \ldots, x_d$ to $v_{out}$]
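The three example activations above, written as vectorized NumPy one-liners:

```python
import numpy as np

sign = np.sign                                    # sigma(z) = sign(z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))      # sigma(z) = 1 / (1 + exp(-z))
relu = lambda z: np.maximum(0.0, z)               # sigma(z) = ReLU(z) = max(0, z)
```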

Page 34:

Feed-Forward Neural Networks

[Figure: the same network from inputs $x_1, \ldots, x_d$ to $v_{out}$]

Deep learning
Generalize to a hierarchy of transformations of the input, learned end-to-end, jointly with the predictor (a composition sketch follows):
$$f_{\boldsymbol{W}}(\boldsymbol{x}) = f_L\left(f_{L-1}\left(f_{L-2}\left(\cdots f_1(\boldsymbol{x}) \cdots\right)\right)\right)$$
$$f_1(\boldsymbol{x}) = \sigma\left(\boldsymbol{W}^{(1)} \boldsymbol{x}\right) \qquad f_2(\boldsymbol{x}) = \sigma\left(\boldsymbol{W}^{(2)} f_1(\boldsymbol{x})\right)$$
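A layered special case of this composition, with ReLU as the (assumed) activation and weight matrices of compatible shapes:

```python
import numpy as np

def deep_forward(x, Ws, sigma=lambda z: np.maximum(0.0, z)):
    """f_W(x) = f_L(f_{L-1}(... f_1(x))), where f_k(h) = sigma(W_k @ h).
    Ws is a list of weight matrices whose shapes chain together."""
    h = x
    for Wk in Ws:
        h = sigma(Wk @ h)     # each layer transforms the previous layer's output
    return h

# Example: d=4 inputs, two hidden layers of widths 8 and 3, scalar output.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal(s) for s in [(8, 4), (3, 8), (1, 3)]]
y_hat = deep_forward(rng.standard_normal(4), Ws)
```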

Page 35:

Neural Nets as Feature Learning

• Can think of the hidden layer as "features" $\phi(x)$, followed by a linear predictor based on $\boldsymbol{w} \cdot \phi(\boldsymbol{x})$
• "Feature engineering" approach: design $\phi(\cdot)$ based on domain knowledge
• "Deep learning" approach: learn features from the data
• Multilayer networks with non-linear activations:
o more and more complex features

[Figure: inputs $x_1, x_2, \ldots, x_d$ mapped through hidden units to features $\phi(x)_1, \ldots, \phi(x)_k$, then to $v_{out}$]

Slide credit: Nati Srebro

Page 36:

Multi-Layer Feature Learning

Slide credit: Nati Srebro

Page 37:

More knowledge or more learning

A spectrum, from expert knowledge to data-driven learning (no free lunch; more data →):
• Expert knowledge, fully specified: expert systems (no data at all)
• Use expert knowledge to construct $\phi(x)$ or $K(x, x')$, then use, e.g., an SVM on $\phi(x)$
• "Deep learning": use very simple raw features as input; learn good features using a deep neural net

Slide credit: Nati Srebro

Page 38:

Neural networks as hypothesis class

• Hypothesis class specified by:
o Graph $G(V, E)$ (based on architecture, and fixed)
o Activation function $\sigma$ (based on architecture, and fixed)
o Weights $\boldsymbol{W}$, with a weight $W[u \to v]$ for each edge $u \to v \in E$ (learned)
$$\mathcal{H} = \left\{ f_{G(V,E),\sigma,\boldsymbol{W}} \mid \boldsymbol{W}: E \to \mathbb{R} \right\}$$
• Expressive power:
$$\left\{ f \mid f \text{ computable in time } T \right\} \subseteq \mathcal{H}_{G(V,E),\,\mathrm{sign}} \text{ with } |E| = O(T^2)$$

• Computation: empirical risk minimization
$$\hat{\boldsymbol{W}} = \arg\min_{\boldsymbol{W}} \sum_{i=1}^{N} \ell\left(f_{G(V,E),\sigma,\boldsymbol{W}}(\boldsymbol{x}^{(i)}),\, y^{(i)}\right)$$
o Highly non-convex problem, even if the loss $\ell$ is convex
o Hard to minimize: training even tiny neural networks is computationally hard

Page 39:

So how do we learn?

$$\hat{\boldsymbol{W}} = \arg\min_{\boldsymbol{W}} \sum_{i=1}^{N} \ell\left(f_{G(V,E),\sigma,\boldsymbol{W}}(\boldsymbol{x}^{(i)}),\, y^{(i)}\right)$$

• Stochastic gradient descent: for a random $(\boldsymbol{x}^{(i)}, y^{(i)}) \in S$,
$$\boldsymbol{W}^{(t+1)} \leftarrow \boldsymbol{W}^{(t)} - \eta^{(t)}\, \nabla \ell\left(f_{G(V,E),\sigma,\boldsymbol{W}^{(t)}}(\boldsymbol{x}^{(i)}),\, y^{(i)}\right)$$

(even though it is not convex)

• How do we efficiently calculate $\nabla \ell\left(f_{G(V,E),\sigma,\boldsymbol{W}^{(t)}}(\boldsymbol{x}^{(i)}),\, y^{(i)}\right)$?
o Karl will tell you! (a naive finite-difference baseline is sketched below)
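As a baseline that shows why an efficient procedure is needed: the gradient in the SGD update could in principle be approximated by central finite differences, at the cost of two loss evaluations per parameter. A minimal sketch, with `loss` assumed to map a weight array to the scalar $\ell(f_{\boldsymbol{W}}(\boldsymbol{x}), y)$:

```python
import numpy as np

def numerical_grad(loss, W, eps=1e-6):
    """Central finite-difference approximation of d(loss)/dW.
    O(#parameters) loss evaluations per gradient -- far too slow for
    large networks, which is what backpropagation fixes."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W[idx] += eps
        lp = loss(W)
        W[idx] -= 2 * eps
        lm = loss(W)
        W[idx] += eps                     # restore the original weight
        g[idx] = (lp - lm) / (2 * eps)
    return g
```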

• Now, a brief detour into the history and resurrection of NNs

Page 40:


ImageNet challenge – object classification

Page 41:


Object detection

Slide credit: David McAllester

Page 42:

History of Neural Networks
• 1940s-70s:
o Inspired by learning in the brain, and as a model for the brain (Pitts, Hebb, and others)
o Various models, directed and undirected, different activations and learning rules
o Perceptron rule (Rosenblatt), the problem of XOR, multilayer perceptron (Minsky and Papert)
o Backpropagation (Werbos 1975)
• 1980s-early 1990s:
o Practical backprop (Rumelhart, Hinton et al. 1986) and SGD (Bottou)
o Relationship to distributed computing; "connectionism"
o Initial empirical success
• 1990s-2000s:
o Lost favor to implicitly linear methods: SVMs, boosting
• 2000-2010s:
o Revival of interest (CIFAR groups)
o ca. 2005: layer-wise pretraining of deepish nets
o Progress in speech and vision with deep neural nets
• 2010s:
o Computational advances allow training HUGE networks
o ...and also a few new tricks
o Krizhevsky et al. win ImageNet
o Empirical success and renewed interest


Page 43:

Deep learning – today

State-of-the-art performance in several tasks; deep networks are actively deployed in real systems:
o Computer vision
o Speech recognition
o Machine translation
o Dialog systems
o Computer games
o Information retrieval
