Day 6: Neural networks, backpropagation
Introduction to Machine Learning Summer School
June 18 - June 29, 2018, Chicago
Instructor: Suriya Gunasekar, TTI Chicago
25 June 2018
Schedule
• 9:00am-10:25am – Lecture 6.a: Review of week 1, introduction to neural networks
• 10:30am-11:30am – Invited talk: Greg Durett (also the TTIC colloquium talk)
• 11:30am-12:30pm – Lunch
• 12:30pm-2:00pm – Lecture 6.b: Backpropagation
• 2:00pm-5:00pm – Programming
Review of week 1
Supervised learning – key questions
• Data: what kind of data can we get? how much data can we get?
• Model: what is the correct model for my data? – want to minimize the effort put into this question!
• Training: what resources (computation/memory) does the algorithm need to estimate the model f̂?
• Testing: how well will f̂ perform when deployed? what is the computational/memory requirement during deployment?
[Figure: ML pipeline – setup, data collection, representation, modeling, estimation/training, model selection; data and algorithm]
Linear regression
• Input x ∈ 𝒳 ⊂ ℝ^d, output y ∈ ℝ; want to learn f: 𝒳 → ℝ
• Training data S = {(x^(i), y^(i)) : i = 1, 2, …, N}
• Parameterize candidate f: 𝒳 → ℝ by linear functions, ℋ = {x → w·x : w ∈ ℝ^d}
• Estimate w by minimizing the loss on the training data:
  ŵ = argmin_w J_SSE(w) := Σ_{i=1}^N (w·x^(i) − y^(i))²
  o J_SSE(w) is convex in w → minimize J_SSE(w) by setting its gradient to 0
  o ∇_w J_SSE(w) = Σ_{i=1}^N 2(w·x^(i) − y^(i)) x^(i)
  o Closed-form solution: ŵ = (XᵀX)⁻¹ Xᵀy
• Can get non-linear functions by mapping x → φ(x) and doing linear regression on φ(x)
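As an illustrative sketch (my own code, not from the course), the closed-form least-squares solution can be computed with NumPy; `np.linalg.lstsq` solves the same problem as (XᵀX)⁻¹Xᵀy, but in a numerically stable way:

```python
import numpy as np

def fit_linear_regression(X, y):
    """X: (N, d) matrix with rows x^(i); y: (N,) targets. Returns w_hat."""
    # Minimizes J_SSE(w) = ||Xw - y||^2, i.e. w_hat = (X^T X)^{-1} X^T y
    # when X has full column rank.
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
w_true = np.array([2.0, -1.0])
y = X @ w_true          # noiseless toy data, so the fit recovers w_true
print(fit_linear_regression(X, y))  # ≈ [ 2. -1.]
```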
Overfitting
• For the same amount of data, more complex models (e.g., higher degree polynomials) overfit more
• or: more data is needed to fit more complex models
• complexity ≈ number of parameters
Model selection
• m model classes ℋ_1, ℋ_2, …, ℋ_m
• S = S_train ∪ S_val ∪ S_test
• Train on S_train to pick the best f̂_r ∈ ℋ_r
• Pick f̂* based on the validation loss on S_val
• Evaluate the test loss L_{S_test}(f̂*)
Regularization
• Complexity of the model class can also be controlled by the norm of the parameters – a smaller range of values is allowed
• Regularization for linear regression:
  argmin_w J_SSE(w) + λ‖w‖₂²   (ridge)
  argmin_w J_SSE(w) + λ‖w‖₁    (lasso)
• Again do model selection to pick λ – using S_val or cross-validation
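A minimal sketch of the ℓ₂-regularized (ridge) case, on toy data of my own: setting the gradient of J_SSE(w) + λ‖w‖₂² to zero gives ŵ = (XᵀX + λI)⁻¹Xᵀy, and a larger λ shrinks the solution toward 0:

```python
import numpy as np

def fit_ridge(X, y, lam):
    # Closed form for argmin_w ||Xw - y||^2 + lam * ||w||_2^2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w0 = fit_ridge(X, y, 0.0)    # lam = 0 recovers ordinary least squares
w1 = fit_ridge(X, y, 10.0)   # larger lam shrinks w toward 0
print(np.linalg.norm(w1) < np.linalg.norm(w0))  # True
```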
Classification
• Output y ∈ 𝒴 takes a discrete set of values, e.g., 𝒴 = {0, 1} or 𝒴 = {−1, 1} or 𝒴 = {spam, no spam}
  o Unlike regression, the label values do not have numerical meaning
• Classifiers divide the space of inputs 𝒳 (often ℝ^d) into "regions", where each region is assigned a label
• Non-parametric models
  o k-nearest neighbors – regions defined based on nearest neighbors
  o decision trees – structured rectangular regions
• Linear models – classifier regions are half-spaces
Classification – logistic regression
• 𝒳 = ℝ^d, 𝒴 = {−1, 1}, S = {(x^(i), y^(i)) : i = 1, 2, …, N}
• Linear model f(x) = f_w(x) = w·x
• Output classifier ŷ(x) = sign(w·x)
• Empirical risk minimization:
  ŵ = argmin_w Σ_i log(1 + exp(−y^(i) w·x^(i)))
• Probabilistic formulation: Pr(y = 1|x) = 1 / (1 + exp(−w·x))
• Multi-class generalization: 𝒴 = {1, 2, …, m}
  Pr(y|x) = exp(w_y·x) / Σ_{y′} exp(w_{y′}·x)
• Can again get non-linear decision boundaries by mapping x → φ(x)
Logistic loss: ℓ(f(x), y) = log(1 + exp(−y f(x)))
[Figure: the logistic loss ℓ(f(x), y) plotted as a function of the margin y f(x)]
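The logistic loss and its gradient can be sketched as follows (illustrative code on a toy separable dataset I made up); gradient descent on this loss is the standard way to fit logistic regression:

```python
import numpy as np

# ell(w; x, y) = log(1 + exp(-y * w.x)) for labels y in {-1, +1}
def logistic_loss(w, X, y):
    margins = y * (X @ w)
    # np.logaddexp(0, -m) = log(1 + exp(-m)), computed stably
    return np.logaddexp(0.0, -margins).sum()

def logistic_grad(w, X, y):
    margins = y * (X @ w)
    # d/dw log(1 + exp(-y w.x)) = -y * x * sigmoid(-y w.x)
    s = 1.0 / (1.0 + np.exp(margins))
    return -(X * (y * s)[:, None]).sum(axis=0)

# Tiny gradient-descent fit on linearly separable points
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.zeros(2)
for _ in range(200):
    w -= 0.1 * logistic_grad(w, X, y)
print(np.sign(X @ w))  # matches y
```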
Classification – maximum margin classifier
Separable data
• Original formulation:
  ŵ = argmax_{w∈ℝ^d} min_i y^(i) (w·x^(i)) / ‖w‖
• Fixing ‖w‖ = 1:
  ŵ = argmax_w min_i y^(i) w·x^(i)  s.t. ‖w‖ = 1
• Fixing min_i y^(i) w·x^(i) = 1:
  w̃ = argmin_w ‖w‖²  s.t. ∀i, y^(i)(w·x^(i)) ≥ 1
Slack variables for non-separable data:
  ŵ = argmin_{w,{ξ_i}} ‖w‖² + λ Σ_i ξ_i  s.t. ∀i, y^(i)(w·x^(i)) ≥ 1 − ξ_i, ξ_i ≥ 0
    = argmin_w ‖w‖² + λ Σ_i max(0, 1 − y^(i) w·x^(i))
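The slack-variable problem is equivalent to minimizing the hinge-loss objective; a small sketch (example data of my own):

```python
import numpy as np

# ||w||^2 + lam * sum_i max(0, 1 - y_i * w.x_i)
def svm_objective(w, X, y, lam):
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return w @ w + lam * hinge.sum()

X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
# w = [0.5, 0] classifies both points with margin exactly 1,
# so the hinge term vanishes and only ||w||^2 = 0.25 remains
print(svm_objective(np.array([0.5, 0.0]), X, y, lam=1.0))  # 0.25
```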
Kernel trick
• Using the representer theorem, w = Σ_{i=1}^N β_i x^(i):
  min_w ‖w‖² + λ Σ_i max(0, 1 − y^(i) w·x^(i))
  ≡ min_{β∈ℝ^N} βᵀGβ + λ Σ_i max(0, 1 − y^(i) (Gβ)_i)
  G ∈ ℝ^{N×N} with G_{ij} = x^(i)·x^(j) is called the Gram matrix
• The optimization depends on x^(i) only through G_{ij} = x^(i)·x^(j)
• For prediction, ŵ·x = Σ_i β_i x^(i)·x, so we again only need the inner products x^(i)·x
• The function K(x, x′) = x·x′ is called the kernel
• When learning non-linear classifiers using feature transformations x → φ(x) and f_w(x) = w·φ(x):
  o The classifier is fully specified in terms of K_φ(x, x′) = K(φ(x), φ(x′))
  o φ(x) itself can be very high dimensional (maybe even infinite dimensional)
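A sketch of the Gram matrix and the kernel trick (illustrative; the degree-2 polynomial kernel is my example of a K_φ evaluated without ever forming φ explicitly):

```python
import numpy as np

# G_ij = K(x_i, x_j) for a given kernel function
def gram_matrix(X, kernel):
    N = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

linear = lambda a, b: a @ b
# (a.b)^2 corresponds to phi(x) = all degree-2 monomials of x,
# but is computed in O(d) time instead of building phi
poly2 = lambda a, b: (a @ b) ** 2

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
G = gram_matrix(X, linear)
print(np.allclose(G, X @ X.T))  # True: the linear kernel gives the plain Gram matrix
```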
Optimization
• ERM + regularization optimization problem:
  ŵ = argmin_w J_reg(w) := Σ_{i=1}^N ℓ(w·φ(x^(i)), y^(i)) + λ‖w‖
• If J_reg(w) is convex in w, then ŵ is an optimum if and only if the gradient at ŵ is 0, i.e., ∇J_reg(ŵ) = 0
• Gradient descent: start with an initialization w^0 and iteratively update
  o w^(t+1) = w^(t) − η_t ∇J_reg(w^(t))
  o where ∇J_reg(w^(t)) = Σ_i ∇ℓ(w^(t)·φ(x^(i)), y^(i)) + λ∇‖w^(t)‖
• Stochastic gradient descent:
  o use the gradient from only one example
  o w^(t+1) = w^(t) − η_t ∇^(i) J_reg(w^(t))
  o where ∇^(i) J_reg(w^(t)) = ∇ℓ(w^(t)·φ(x^(i)), y^(i)) + λ∇‖w^(t)‖ for a random sample (x^(i), y^(i))
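The two update rules side by side, as a sketch on my own toy data (regularized squared loss, one of the convex losses above): full gradient descent sums the gradient over all examples, SGD uses one random example per step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true  # noiseless toy regression data

# J(w) = (1/N) sum_i (w.x_i - y_i)^2 + lam * ||w||^2
def grad_full(w, lam=0.01):
    return 2 * X.T @ (X @ w - y) / len(X) + 2 * lam * w

def grad_one(w, i, lam=0.01):
    return 2 * (X[i] @ w - y[i]) * X[i] + 2 * lam * w

w_gd = np.zeros(3)
for t in range(500):
    w_gd -= 0.1 * grad_full(w_gd)          # gradient descent

w_sgd = np.zeros(3)
for t in range(5000):
    w_sgd -= 0.01 * grad_one(w_sgd, rng.integers(len(X)))  # SGD

print(np.round(w_gd, 2), np.round(w_sgd, 2))  # both near w_true (shrunk slightly by lam)
```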
Other classification models
• Optimal unrestricted predictor:
  o Regression + squared loss → f**(x) = E[y|x]
  o Classification + 0-1 loss → ŷ**(x) = argmax_c Pr(y = c|x)
• Discriminative models: directly model Pr(y|x), e.g., logistic regression
• Generative models: model the full joint distribution Pr(y, x) = Pr(x|y) Pr(y)
• Why generative models?
  o One conditional might be simpler to model with prior knowledge, e.g., compare specifying Pr(image|digit = 1) vs Pr(digit = 1|image)
  o Naturally handles missing data
• Two examples of generative models:
  o Naïve Bayes classifier
  o Hidden Markov model
Other classifiers
• Naïve Bayes classifier: with d features x = [x_1, x_2, …, x_d], where each x_1, x_2, …, x_d can take one of K values → CK^d parameters
  o NB assumption: features are independent given the class y → CKd parameters:
    Pr(x_1, x_2, …, x_d|y) = Pr(x_1|y) Pr(x_2|y) ⋯ Pr(x_d|y) = Π_{k=1}^d Pr(x_k|y)
  o Training amounts to averaging samples across classes
• Hidden Markov model: variable-length input/observations {x_1, x_2, …, x_M} (e.g., words) and variable-length output/states {y_1, y_2, …, y_M} (e.g., tags)
  o HMM assumptions: (a) the current state, conditioned on the immediate previous state, is conditionally independent of all other variables, and (b) the current observation, conditioned on the current state, is conditionally independent of all other variables:
    Pr(x_1, …, x_M, y_1, …, y_M) = Pr(y_1) Pr(x_1|y_1) Π_{k=2}^M Pr(y_k|y_{k−1}) Pr(x_k|y_k)
  o Parameters estimated using MLE + dynamic programming
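"Training amounts to averaging samples across classes" can be sketched as counting (toy weather data of my own invention; no smoothing, so unseen values get probability 0):

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    """Naive Bayes with categorical features: fit by counting."""
    priors = Counter(y)                  # class counts -> Pr(y)
    N = len(y)
    cond = defaultdict(Counter)          # (feature index, class) -> value counts
    for xi, yi in zip(X, y):
        for k, v in enumerate(xi):
            cond[(k, yi)][v] += 1

    def predict(x):
        # argmax_c Pr(c) * prod_k Pr(x_k | c)
        def score(c):
            p = priors[c] / N
            for k, v in enumerate(x):
                p *= cond[(k, c)][v] / priors[c]
            return p
        return max(priors, key=score)
    return predict

X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cold")]
y = ["beach", "beach", "home", "home"]
predict = train_nb(X, y)
print(predict(("sunny", "mild")))  # "beach"
```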
Today
• Introduction to neural networks
• Backpropagation
Graph notation
• General variables can be input variables like x_1, x_2, …, x_d, the prediction ŷ, or any intermediate computation (we will see examples soon)
• A node z_3 with incoming edges from z_1 and z_2, with weights w_1 and w_2, denotes the computation z_3 = σ(w_1 z_1 + w_2 z_2) for some "activation" function σ (specified a priori)
Linear classifier
• Biological analogy: a single neuron – stimuli reinforce synaptic connections
• f(x) = 𝟏(w·x + w_0 ≥ 0)
[Figure: a single unit with inputs x_1, x_2, x_3, …, x_d and a constant input 1]
Slide credits: Nati Srebro, David McAllester
Shallow learning
• We already saw how to use linear models to get non-linear decision boundaries
• Feature transform: map x ∈ ℝ^d to φ(x) ∈ ℝ^{d′} and use f_w(x) = w·φ(x)
• Shallow learning: hand-crafted and non-hierarchical φ
  o Polynomial regression with squared or logistic loss, φ(x)_k = x^k
  o Kernel SVM: K(x, x′) = φ(x)·φ(x′)
• f(x) = 𝟏(w·φ(x) ≥ 0)
[Figure: a single unit over features φ(x)_1, φ(x)_2, φ(x)_3, …, φ(x)_{d′}]
Slide credit: Nati Srebro
Combining Linear Units
• z_1 = 𝟏(x_1 − x_2 > 0)
• z_2 = 𝟏(x_2 − x_1 > 0)
• f(x) = 𝟏(z_1 + z_2 > 0)
• The network represents the function f(x) = (x_1 and not x_2) or (x_2 and not x_1)
• Not a linear function of x
Slide credit: Nati Srebro
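The two-unit construction above, written out directly (the threshold function plays the role of the activation σ):

```python
def step(a):
    # threshold activation: 1 if a > 0 else 0
    return 1 if a > 0 else 0

def f(x1, x2):
    z1 = step(x1 - x2)   # fires when x1 > x2
    z2 = step(x2 - x1)   # fires when x2 > x1
    return step(z1 + z2)

# On boolean inputs this is XOR -- not computable by any single linear unit
print([f(0, 0), f(0, 1), f(1, 0), f(1, 1)])  # [0, 1, 1, 0]
```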
Combining Linear Units
• z_1 = 𝟏(w_1·x > 0)
• z_2 = 𝟏(w_2·x > 0)
• f(x) = 𝟏(w̃_1 z_1 + w̃_2 z_2 ≥ 0)
[Figure: a two-layer network over inputs x_1, x_2 with first-layer weights w_11, …, w_22]
Slide credit: Nati Srebro
Feed-Forward Neural Networks
[Figure: inputs x_1, x_2, x_3, …, x_d feed a hidden layer z_1, z_2, …, z_{d_1}, which feeds the output unit]
• z_i = σ(Σ_j W_1[j, i] x_j), for i = 1, 2, …, d_1
• f(x) = σ(Σ_j W_2[j] z_j)
Figure credit: Nati Srebro
Feed-Forward Neural Networks
Architecture:
• Directed Acyclic Graph G(V, E). Units (neurons) indexed by vertices in V.
• "Input units" v_1, …, v_d ∈ V: no incoming edges; they have value o[v_i] = x_i
• Each edge u → v has weight W[u → v]
• Pre-activation a[v] = Σ_{u→v ∈ E} W[u → v] o[u]
• Output value o[v] = σ(a[v])
• "Output unit" v_out ∈ V, with f_W(x) = a[v_out]
The network computes the function f_{G(V,E),σ,W}(x).
Note: some textbooks/conventions don't make the distinction between pre-activation and output value, and simply compute o[v] = σ(Σ_{u→v ∈ E} W[u → v] o[u]).
Slide credit: Nati Srebro
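A sketch of the forward pass exactly as defined above, on a toy network of my own (assuming the edge list is ordered so that every edge into a unit appears before any edge out of it, which a topological sort of the DAG guarantees):

```python
def forward(edges, inputs, v_out, sigma=lambda a: max(0.0, a)):  # ReLU as sigma
    """edges: dict (u, v) -> weight W[u -> v], in topological order.
    inputs: dict mapping each input unit to its value x_i."""
    o = dict(inputs)   # output values o[v], seeded with the input units
    a = {}             # pre-activations a[v]
    for (u, v), w in edges.items():
        # a[v] = sum over incoming edges of W[u -> v] * o[u]
        a[v] = a.get(v, 0.0) + w * o[u]
        # o[v] = sigma(a[v]); recomputed as incoming edges accumulate
        o[v] = sigma(a[v])
    return a[v_out]    # f(x) = a[v_out]: no activation on the output unit

# Tiny net: x1, x2 -> hidden unit h -> out
edges = {("x1", "h"): 1.0, ("x2", "h"): -1.0, ("h", "out"): 2.0}
print(forward(edges, {"x1": 3.0, "x2": 1.0}, "out"))  # a[h]=2, o[h]=2, f=4.0
```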
Feed-Forward Neural Networks
Parameters:
• Each edge u → v has weight W[u → v]
Activations:
• σ: ℝ → ℝ, for example:
  o σ(z) = sign(z)
  o σ(z) = 1/(1 + exp(−z)) (sigmoid)
  o σ(z) = ReLU(z) = max(0, z)
Deep learning
Generalize to a hierarchy of transformations of the input, learned end-to-end jointly with the predictor:
  f_W(x) = f_L(f_{L−1}(f_{L−2}(… f_1(x) …)))
  f_1(x) = σ(W_1 x)
  f_2(x) = σ(W_2 f_1(x))
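The layered composition above, sketched for fully connected layers (my own illustration, with ReLU as σ and a linear output layer as in the architecture definition):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def mlp(weights, x):
    """Computes f_L(...f_2(f_1(x))...) with f_k(h) = relu(W_k h),
    except the last layer, which is linear."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

rng = np.random.default_rng(0)
# Depth 3: input dim 3 -> hidden 4 -> hidden 4 -> scalar output
weights = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))]
x = np.array([1.0, -1.0, 0.5])
print(mlp(weights, x).shape)  # (1,)
```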
Neural Nets as Feature Learning
• Can think of the hidden layer as "features" φ(x), followed by a linear predictor based on w·φ(x)
• "Feature engineering" approach: design φ(·) based on domain knowledge
• "Deep learning" approach: learn the features from data
• Multilayer networks with non-linear activations build more and more complex features
[Figure: hidden units v_1, v_2, v_3 viewed as features φ(x)_1, …, φ(x)_k over inputs x_1, x_2, …, x_d]
Slide credit: Nati Srebro
Multi-Layer Feature Learning
Slide credit: Nati Srebro
More knowledge or more learning
• Expert knowledge: full specific knowledge → expert systems (no data at all)
• Use expert knowledge to construct φ(x) or K(x, x′), then use, e.g., an SVM on φ(x)
• "Deep learning": use very simple raw features as input, learn good features using a deep neural net
• No free lunch: the less prior knowledge we build in, the more data we need →
Slide credit: Nati Srebro
Neural networks as hypothesis class
• Hypothesis class specified by:
  o Graph G(V, E)
  o Activation function σ
  o Weights W, with weight W[u → v] for each edge u → v ∈ E
• ℋ = {f_{G(V,E),σ,W} | W: E → ℝ} – based on the architecture G(V, E) and a fixed σ
• Expressive power:
  {f | f computable in time T} ⊆ ℋ_{G(V,E),sign} with |E| = O(T²)
• Computation: empirical risk minimization
  Ŵ = argmin_W Σ_{i=1}^N ℓ(f_{G(V,E),σ,W}(x^(i)), y^(i))
  o Highly non-convex problem, even if the loss ℓ is convex
  o Hard to minimize: learning even tiny neural networks is computationally hard
So how do we learn?
  Ŵ = argmin_W Σ_{i=1}^N ℓ(f_{G(V,E),σ,W}(x^(i)), y^(i))
• Stochastic gradient descent: for a random (x^(i), y^(i)) ∈ S,
  W^(t+1) ← W^(t) − η^(t) ∇ℓ(f_{G(V,E),σ,W^(t)}(x^(i)), y^(i))
  (even though it's not convex)
• How do we efficiently calculate ∇ℓ(f_{G(V,E),σ,W^(t)}(x^(i)), y^(i))?
  o Karl will tell you!
• Now, a brief detour into the history and resurrection of NNs
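To see why "efficiently" matters, here is the naive alternative as a sketch (my own toy example): estimating the gradient the SGD update needs by finite differences costs one forward pass per parameter, whereas backpropagation produces the same numbers with a single backward pass.

```python
import numpy as np

def numerical_grad(loss, W, eps=1e-6):
    """Central finite-difference estimate of d loss / d W, entry by entry."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        g[idx] = (loss(Wp) - loss(Wm)) / (2 * eps)  # one forward pass per entry
    return g

# Toy one-layer example: loss(W) = (W.x - y)^2
x, y = np.array([1.0, 2.0]), 3.0
loss = lambda W: (W @ x - y) ** 2
W = np.array([0.5, 0.5])
print(numerical_grad(loss, W))  # ≈ analytic gradient 2*(W.x - y)*x = [-3, -6]
```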
ImageNet challenge – object classification
Object detection
Slide credit: David McAllester
History of Neural Networks
• 1940s-70s:
  o Inspired by learning in the brain, and as a model for the brain (Pitts, Hebb, and others)
  o Various models, directed and undirected, different activation and learning rules
  o Perceptron rule (Rosenblatt), problem of XOR, multilayer perceptron (Minsky and Papert)
  o Backpropagation (Werbos 1975)
• 1980s-early 1990s:
  o Practical backprop (Rumelhart, Hinton et al. 1986) and SGD (Bottou)
  o Relationship to distributed computing; "connectionism"
  o Initial empirical success
• 1990s-2000s:
  o Lost favor to implicitly linear methods: SVMs, boosting
• 2000-2010s:
  o Revival of interest (CIFAR groups)
  o ca. 2005: layer-wise pretraining of deep-ish nets
  o Progress in speech and vision with deep neural nets
• 2010s:
  o Computational advances allow training HUGE networks
  o … and also a few new tricks
  o Krizhevsky et al. win ImageNet
  o Empirical success and renewed interest
Deep learning – today
State-of-the-art performance in several tasks, actively deployed in real systems:
o Computer vision
o Speech recognition
o Machine translation
o Dialog systems
o Computer games
o Information retrieval