CIS 519/419 Applied Machine Learning (Spring '18)
www.seas.upenn.edu/~cis519
[email protected] | http://www.cis.upenn.edu/~danroth/ | 461C, 3401 Walnut
Lecture given by Daniel Khashabi
Slides were created by Dan Roth (for CIS 519/419 at Penn or CS 446 at UIUC), Eric Eaton for CIS 519/419 at Penn, or from other authors who have made their ML slides available.
Functions Can be Made Linear
§ Data are not linearly separable in one dimension
§ Not separable if you insist on using a specific class of functions

[Figure: one-dimensional data along the x axis that is not linearly separable]
Blown-Up Feature Space
§ Data are separable in the $\langle x, x^2 \rangle$ space

[Figure: the same data plotted against $x$ and $x^2$, where a linear separator exists]
Multi-Layer Neural Network
§ Multi-layer networks were designed to overcome the computational (expressivity) limitation of a single threshold element.
§ The idea is to stack several layers of threshold elements, each layer using the output of the previous layer as input.
§ Multi-layer networks can represent arbitrary functions, but building effective learning methods for such networks was [thought to be] difficult.

[Figure: a network with Input, Hidden, and Output layers; activation flows from input to output]
Basic Units
§ Linear Unit: multiple layers of linear functions $o_j = w \cdot x$ produce linear functions. We want to represent nonlinear functions.
  § Need to do it in a way that facilitates learning.
§ Threshold units: $o_j = \text{sgn}(w \cdot x)$ are not differentiable, hence unsuitable for gradient descent.
§ The key idea was to notice that the discontinuity of the threshold element can be represented by a smooth non-linear approximation: $o_j = [1 + \exp\{-w \cdot x\}]^{-1}$
§ (Rumelhart, Hinton, Williams, 1986), (Linnainmaa, 1970); see: http://people.idsia.ch/~juergen/who-invented-backpropagation.html

[Figure: Input, Hidden, and Output layers with weights $w^1_{ij}$ and $w^2_{ij}$]
Model Neuron (Logistic)
§ Use a non-linear, differentiable output function such as the sigmoid or logistic function.
§ Net input to a unit is defined as: $net_j = \sum_i w_{ij} x_i$
§ Output of a unit is defined as: $o_j = \frac{1}{1 + e^{-(net_j - T_j)}}$

[Figure: a unit $j$ with inputs $x_1, \dots, x_7$, weights $w_{1j}, \dots, w_{7j}$, a summation node with threshold $T_j$, and output $o_j$]
Neural Networks
§ Neural Networks are functions: $NN: X \rightarrow Y$
  § where $X = [0,1]^n$ or $\{0,1\}^n$, and $Y = [0,1]$ or $\{0,1\}$
§ Robust approach to approximating real-valued, discrete-valued and vector-valued target functions.
§ Among the most effective general-purpose supervised learning methods currently known.
§ Effective especially for complex and hard-to-interpret input data such as real-world sensory data, where a lot of supervision is available.
§ The Backpropagation algorithm for neural networks has been shown successful in many practical problems
  § handwritten character recognition, speech recognition, object recognition, some NLP problems
Neural Networks
§ Neural Networks are functions: $NN: X \rightarrow Y$
  § where $X = [0,1]^n$ or $\{0,1\}^n$, and $Y = [0,1]$ or $\{0,1\}$
§ NN can be used as an approximation of a target classifier
  § In their general form, even with a single hidden layer, NN can approximate any function
  § Algorithms exist that can learn a NN representation from labeled training data (e.g., Backpropagation).
Multi-Layer Neural Networks
§ Multi-layer networks were designed to overcome the computational (expressivity) limitation of a single threshold element.
§ The idea is to stack several layers of threshold elements, each layer using the output of the previous layer as input.

[Figure: Input, Hidden, and Output layers]
Motivation for Neural Networks
§ Inspired by biological systems
  § But don't take this (as well as any other claims in the news on the "emergence" of intelligent behavior) too seriously;
§ We are currently on the rising part of a wave of interest in NN architectures, after a long downtime from the mid-'90s.
  § Better computer architectures (GPUs, parallelism)
  § A lot more data than before; in many domains, supervision is available.
§ The current surge of interest has seen very minimal algorithmic changes
Motivation for Neural Networks
§ Minimal to no algorithmic changes
§ One potentially interesting perspective:
  § Before, we looked at NN only as function approximators.
  § Now, we look at the intermediate representations generated while learning as meaningful.
  § Ideas are being developed on the value of these intermediate representations for transfer learning, etc.
§ We will present in the next two lectures a few of the basic architectures and learning algorithms, and provide some examples of applications
Neural Speed Constraints
§ Neuron "switching time" is on the order of milliseconds, compared to nanoseconds for transistors.
§ However, biological systems can perform significant cognitive tasks (vision, language understanding) in fractions of a second.
§ Even for limited abilities, current AI systems require orders of magnitude more steps.
§ The human brain has approximately $10^{10}$ neurons, each connected to $10^4$ others; it must exploit massive parallelism (but there's more...)
Basic Unit in Multi-Layer Neural Network
§ Linear Unit: $o_j = w \cdot x$; multiple layers of linear functions produce linear functions. We want to represent nonlinear functions.
§ Threshold units: $o_j = \text{sgn}(w \cdot x - T)$ are not differentiable, hence unsuitable for gradient descent.

[Figure: Input, Hidden, and Output layers]
Model Neuron (Logistic)
§ A neuron is modeled by a unit $j$ connected by weighted links $w_{ij}$ to other units $i$.
§ Use a non-linear, differentiable output function such as the sigmoid or logistic function.
§ Net input to a unit is defined as: $net_j = \sum_i w_{ij} x_i$
§ Output of a unit is defined as: $o_j = \frac{1}{1 + \exp\{-(net_j - T_j)\}}$

[Figure: unit $j$ with inputs $x_1, \dots, x_6$, weights $w_{1j}, \dots, w_{6j}$, and output $o_j$]
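As a concrete illustration, here is a minimal sketch of this unit in Python (NumPy); the function name and the example values are ours, not part of the slides:

```python
import numpy as np

def sigmoid_unit(x, w, T):
    """Logistic unit: net_j = sum_i w_ij x_i, o_j = 1 / (1 + exp(-(net_j - T_j)))."""
    net = np.dot(w, x)                 # net input to the unit
    return 1.0 / (1.0 + np.exp(-(net - T)))

x = np.array([0.5, 1.0, 0.0])          # example inputs x_1..x_3
w = np.array([0.2, -0.4, 0.7])         # weights w_1j..w_3j
print(sigmoid_unit(x, w, T=0.1))       # a value in (0, 1)
```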
History: Neural Computation
§ McCulloch and Pitts (1943) showed how linear threshold units can be used to compute logical functions
§ Can build basic logic gates
  § AND: $w_{ij} = T_j / n$
  § OR: $w_{ij} = T_j$
  § NOT: use a negative weight
§ Can build arbitrary logic circuits, finite-state machines and computers given these basic gates.
§ Can specify any Boolean function using a two-layer network (with negation)
  § DNF and CNF are universal representations
(Recall: $net_j = \sum_i w_{ij} x_i$, $\; o_j = \frac{1}{1 + \exp\{-(net_j - T_j)\}}$)
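A quick sketch of the AND/OR weight settings above, using a hard threshold unit that fires when the weighted sum reaches $T$ (the helper name and the test inputs are ours):

```python
import numpy as np

def ltu(x, w, T):
    """Linear threshold unit: fires (1) iff the weighted sum reaches the threshold T."""
    return int(np.dot(w, x) >= T)

T, n = 1.0, 3
for bits in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]:
    x = np.array(bits)
    and_out = ltu(x, np.full(n, T / n), T)   # AND: w_ij = T_j / n, fires only if all n inputs are on
    or_out  = ltu(x, np.full(n, T), T)       # OR:  w_ij = T_j, fires if any input is on
    print(bits, "AND:", and_out, "OR:", or_out)
```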
Representational Power
§ Any Boolean function can be represented by a two-layer network (simulate a two-layer AND-OR network).
§ Any bounded continuous function can be approximated with arbitrarily small error by a two-layer network.
§ Sigmoid functions provide a set of basis functions from which arbitrary functions can be composed.
§ Any function can be approximated to arbitrary accuracy by a three-layer network.
Quiz Time!
§ Given a neural network, how can we make predictions?
  § Given the input, calculate the output of each layer (starting from the first layer), until you get to the output.
§ What is required to fully specify a neural network?
  § The weights.
§ Why can NN predictions be quick?
  § Because many of the computations can be parallelized.
§ What makes a neural network a non-linear approximator?
  § The non-linear units.
Training a Neural Net
Widrow-Hoff Rule
§ This incremental update rule provides an approximation to the goal:
  § Find the best linear approximation of the data
$$Err(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
§ where:
  § $o_d = \sum_i w_{ij} x_i = \vec{w} \cdot \vec{x}$ is the output of the linear unit on example $d$
  § $t_d$ is the target output for example $d$
History: Learning Rules
§ Hebb (1949) suggested that if two units are both active (firing) then the weights between them should increase:
  $w_{ij} = w_{ij} + R \, o_i o_j$
  § $R$ is a constant called the learning rate
  § Supported by physiological evidence
§ Rosenblatt (1959) suggested that when a target output value is provided for a single neuron with fixed input, it can incrementally change weights and learn to produce the output using the Perceptron learning rule.
  § assumes binary output units; a single linear threshold unit
  § Led to the Perceptron Algorithm
§ See: http://people.idsia.ch/~juergen/who-invented-backpropagation.html
Perceptron Learning Rule
§ Given:
  § the target output for the output unit is $t_j$
  § the input the neuron sees is $x_i$
  § the output it produces is $o_j$
§ Update weights according to $w_{ij} \leftarrow w_{ij} + R (t_j - o_j) x_i$
  § If the output is correct, don't change the weights
  § If the output is wrong, change the weights for all inputs which are 1:
    § If the output is low (0, needs to be 1), increment the weights
    § If the output is high (1, needs to be 0), decrement the weights

[Figure: threshold unit with inputs $x_1, \dots, x_6$, weights $w_{1j}, \dots, w_{6j}$, threshold $T_j$, and output $o_j$]
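A minimal sketch of one such update for a binary threshold unit (the function name and the example are ours; the threshold is assumed fixed):

```python
import numpy as np

def perceptron_update(w, x, t, T=0.0, R=1.0):
    """One application of the rule w_ij <- w_ij + R (t_j - o_j) x_i."""
    o = int(np.dot(w, x) >= T)   # current binary output o_j
    return w + R * (t - o) * x   # no change when o == t

w = np.zeros(3)
w = perceptron_update(w, x=np.array([1.0, 0.0, 1.0]), t=1)  # output was low: weights of active inputs go up
print(w)   # [1. 0. 1.]
```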
Gradient Descent
§ We use gradient descent to determine the weight vector that minimizes $Err(\vec{w})$;
§ Fixing the set $D$ of examples, $Err$ is a function of $\vec{w}$.
§ At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface.

[Figure: the error curve $Err(w)$ over $w$, with successive iterates $w_3, w_2, w_1, w_0$ descending toward the minimum]
Summary: Single Layer Network
§ Variety of update rules
  § Multiplicative
  § Additive
§ Batch and incremental algorithms
§ Various convergence and efficiency conditions
§ There are other ways to learn linear functions
  § Linear Programming (general purpose)
  § Probabilistic Classifiers (some assumptions)
§ Key algorithms are driven by gradient descent
General Stochastic Gradient Algorithms
$$w_{t+1} = w_t - r_t \, \nabla_w Q(x_t, y_t, w_t) = w_t - r_t \, g_t$$
The loss $Q$ is a function of $x$, $y$ and $w$; $r_t$ is the learning rate and $g_t$ the gradient.

LMS: $Q((x,y), w) = \frac{1}{2}(y - w^T x)^2$ leads to the update rule (also called Widrow's Adaline):
$$w_{t+1} = w_t + r \, (y_t - w_t^T x_t) \, x_t$$
Here, even though we make binary predictions based on $\text{sgn}(w^T x)$, we do not take the sign of the dot product into account in the loss.

Another common loss function is the hinge loss: $Q((x,y), w) = \max(0, 1 - y \, w^T x)$. This leads to the perceptron update rule:
§ If $y_i \, w^T x_i > 1$ (no mistake, by a margin): no update
§ Otherwise (mistake, relative to the margin): $w_{t+1} = w_t + r \, y_t x_t$; here $g = -yx$.
Good to think about the case of Boolean examples.
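The two updates side by side, as a small sketch (function and variable names are ours):

```python
import numpy as np

def sgd_step(w, x, y, r, loss="lms"):
    """One stochastic step w_{t+1} = w_t - r * grad Q(x, y, w).

    lms:   Q = 1/2 (y - w.x)^2   -> w += r (y - w.x) x   (Widrow's Adaline)
    hinge: Q = max(0, 1 - y w.x) -> w += r y x, only when y w.x <= 1
    """
    if loss == "lms":
        return w + r * (y - np.dot(w, x)) * x
    if loss == "hinge" and y * np.dot(w, x) <= 1:  # mistake, relative to the margin
        return w + r * y * x
    return w                                       # no mistake by a margin: no update

w = np.zeros(2)
for x, y in [(np.array([1.0, 2.0]), 1), (np.array([2.0, 0.5]), -1)]:
    w = sgd_step(w, x, y, r=0.1, loss="hinge")
print(w)
```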
Summary: Single Layer Network
§ Variety of update rules
  § Multiplicative
  § Additive
§ Batch and incremental algorithms
§ Various convergence and efficiency conditions
§ There are other ways to learn linear functions
  § Linear Programming (general purpose)
  § Probabilistic Classifiers (some assumptions)
§ Key algorithms are driven by gradient descent
§ However, the representational restriction is limiting in many applications
Backpropagation Learning Rule
§ Since there could be multiple output units, we define the error as the sum over all the network output units:
$$Err(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in K} (t_{kd} - o_{kd})^2$$
  § where $D$ is the set of training examples,
  § and $K$ is the set of output units
§ This is used to derive the (global) learning rule which performs gradient descent in the weight space in an attempt to minimize the error function:
$$\Delta w_{ij} = -R \, \frac{\partial E}{\partial w_{ij}}$$

[Figure: a network with output units $o_1, \dots, o_k$ compared against a target vector such as (1, 0, 1, 0, 0)]
Learning with a Multi-Layer Perceptron
§ It's easy to learn the top layer – it's just a linear unit.
§ Given feedback (truth) at the top layer, and the activation at the layer below it, you can use the Perceptron update rule (more generally, gradient descent) to update these weights.
§ The problem is what to do with the other set of weights – we do not get feedback in the intermediate layer(s).

[Figure: Input, Hidden, and Output layers with weights $w^1_{ij}$ and $w^2_{ij}$]
Learning with a Multi-Layer Perceptron
§ The problem is what to do with the other set of weights – we do not get feedback in the intermediate layer(s).
§ Solution: If all the activation functions are differentiable, then the output of the network is also a differentiable function of the input and weights in the network.
§ Define an error function (e.g., sum of squares) that is a differentiable function of the output; this error function is then also a differentiable function of the weights.
§ We can then evaluate the derivatives of the error with respect to the weights, and use these derivatives to find weight values that minimize this error function, using gradient descent (or other optimization methods).
§ This results in an algorithm called back-propagation.

[Figure: Input, Hidden, and Output layers with weights $w^1_{ij}$ and $w^2_{ij}$]
Some facts from real analysis
First, let's get the notation right:
§ An arrow shows functional dependence of $z$ on $y$: given $y$, we can calculate $z$; for example, $z(y) = 2y^2$.
§ $\frac{dz}{dy}$ denotes the derivative of $z$ with respect to $y$.
Some facts from real analysis
§ Simple chain rule
  § If $z$ is a function of $y$, and $y$ is a function of $x$,
  § then $z$ is a function of $x$ as well.
  § Question: how to find $\frac{dz}{dx}$? Answer: $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$
We will use these facts to derive the details of the Backpropagation algorithm:
§ $z$ will be the error (loss) function. We need to know how to differentiate $z$.
§ Intermediate nodes use a logistic function (or another differentiable step function). We need to know how to differentiate it.
Some facts from real analysis
§ Multiple path chain rule: if $z$ depends on $x$ through two intermediate variables $y_1$ and $y_2$, then
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y_1}\frac{\partial y_1}{\partial x} + \frac{\partial z}{\partial y_2}\frac{\partial y_2}{\partial x}$$
(Slide credit: Richard Socher)
Some facts from real analysis
§ Multiple path chain rule, general case: if $z$ depends on $x$ through intermediate variables $y_1, \dots, y_n$, then
$$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\frac{\partial y_i}{\partial x}$$
(Slide credit: Richard Socher)
Key Intuitions Required for BP
§ Gradient Descent
  § Change the weights in the direction of the gradient to minimize the error function.
§ Chain Rule
  § Use the chain rule to calculate the gradients $\frac{\partial E}{\partial w_{ij}}$ of the intermediate weights.
§ Dynamic Programming (Memoization)
  § Memoize the intermediate gradient computations to make the updates faster.
  § This is the "back" part of "backpropagation".

[Figure: a network from input to output; the gradient flows backward]
Backpropagation: the big picture
§ Loop over instances:
  1. The forward step
    § Given the input, make predictions layer-by-layer, starting from the first layer.
  2. The backward step
    § Calculate the error in the output.
    § Update the weights layer-by-layer, starting from the final layer.
Quiz time!
§ What is the purpose of the forward step?
  § To make predictions, given an input.
§ What is the purpose of the backward step?
  § To update the weights, given an output error.
§ Why do we use the chain rule?
  § To calculate the gradient in the intermediate layers.
§ Why can backpropagation be efficient?
  § Because it can be parallelized.
Deriving the update rules
Reminder: Model Neuron (Logistic)
§ A neuron is modeled by a unit $j$ connected by weighted links $w_{ij}$ to other units $i$.
§ Use a non-linear, differentiable output function such as the sigmoid or logistic function.
§ Net input to a unit is defined as: $net_j = \sum_i w_{ij} x_i$
§ Output of a unit is defined as: $o_j = \frac{1}{1 + \exp\{-(net_j - T_j)\}}$
§ The parameters so far? The set of connection weights $w_{ij}$ and the threshold value $T_j$.

[Figure: unit $j$ with inputs $x_1, \dots, x_6$, weights $w_{1j}, \dots, w_{6j}$, and output $o_j$]
Derivatives
§ Function 1 (error): $E = \frac{1}{2} \sum_{k \in K} (t_k - o_k)^2$
  § $\frac{\partial E}{\partial o_i} = -(t_i - o_i)$
§ Function 2 (linear gate): $net_j = \sum_i w_{ij} x_i$
  § $\frac{\partial net_j}{\partial w_{ij}} = x_i$
§ Function 3 (differentiable step function): $o_i = \frac{1}{1 + \exp\{-(net_i - T)\}}$
  § $\frac{\partial o_i}{\partial net_i} = \frac{\exp\{-(net_i - T)\}}{(1 + \exp\{-(net_i - T)\})^2} = o_i (1 - o_i)$

[Figure: output units $o_1, \dots, o_k$; unit $j$ receives input from unit $i$ through weight $w_{ij}$]
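The identity for Function 3 is easy to sanity-check numerically; a tiny sketch (the sample point is ours):

```python
import numpy as np

o = lambda net, T=0.0: 1.0 / (1.0 + np.exp(-(net - T)))   # Function 3
eps, net = 1e-6, 0.3
numeric  = (o(net + eps) - o(net - eps)) / (2 * eps)      # finite-difference slope
analytic = o(net) * (1 - o(net))                          # o (1 - o)
print(abs(numeric - analytic) < 1e-8)                     # True
```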
Derivation of Learning Rule
§ The weights are updated incrementally; the error is computed for each example and the weight update is then derived:
$$E_d(\vec{w}) = \frac{1}{2} \sum_{k \in K} (t_k - o_k)^2$$
§ $w_{ij}$ influences the output only through $net_j$
§ Therefore:
$$\frac{\partial E_d}{\partial w_{ij}} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ij}}$$
(with $o_j = \frac{1}{1 + \exp\{-(net_j - T_j)\}}$ and $net_j = \sum_i w_{ij} x_i$)
Derivation of Learning Rule (2)
§ Weight updates of output units:
§ $w_{ij}$ influences the output only through $net_j$
§ Therefore:
$$\frac{\partial E_d}{\partial w_{ij}} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ij}} = -(t_j - o_j) \, o_j (1 - o_j) \, x_i$$
using $E_d(\vec{w}) = \frac{1}{2} \sum_{k \in K} (t_k - o_k)^2$, $\; net_j = \sum_i w_{ij} x_i$, $\; o_j = \frac{1}{1 + \exp\{-(net_j - T_j)\}}$, and $\frac{\partial o_j}{\partial net_j} = o_j (1 - o_j)$.
Derivation of Learning Rule (3)
§ Weights of output units:
§ $w_{ij}$ is changed by:
$$\Delta w_{ij} = R \, (t_j - o_j) \, o_j (1 - o_j) \, x_i = R \, \delta_j x_i$$
where we defined:
$$\delta_j = -\frac{\partial E_d}{\partial net_j} = (t_j - o_j) \, o_j (1 - o_j)$$
Derivation of Learning Rule (4)
§ Weights of hidden units:
§ $w_{ij}$ influences the output only through all the units whose direct input includes $j$:
$$\frac{\partial E_d}{\partial w_{ij}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ij}} = \frac{\partial E_d}{\partial net_j} \, x_i = \sum_{k \in \text{downstream}(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial net_j} \, x_i = \sum_{k \in \text{downstream}(j)} -\delta_k \, \frac{\partial net_k}{\partial net_j} \, x_i$$
(with $net_j = \sum_i w_{ij} x_i$)

[Figure: hidden unit $j$, fed by unit $i$ through $w_{ij}$, feeding downstream units $k$ with outputs $o_k$]
Derivation of Learning Rule (5)
§ Weights of hidden units:
§ $w_{ij}$ influences the output only through all the units whose direct input includes $j$:
$$\frac{\partial E_d}{\partial w_{ij}} = \sum_{k \in \text{downstream}(j)} -\delta_k \, \frac{\partial net_k}{\partial net_j} \, x_i = \sum_{k \in \text{downstream}(j)} -\delta_k \, \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j} \, x_i = \sum_{k \in \text{downstream}(j)} -\delta_k \, w_{jk} \, o_j (1 - o_j) \, x_i$$

[Figure: hidden unit $j$, fed by unit $i$ through $w_{ij}$, feeding downstream units $k$ with outputs $o_k$]
Derivation of Learning Rule (6)
§ Weights of hidden units:
§ $w_{ij}$ is changed by:
$$\Delta w_{ij} = R \, o_j (1 - o_j) \Big( \sum_{k \in \text{downstream}(j)} \delta_k \, w_{jk} \Big) \, x_i = R \, \delta_j x_i$$
§ where
$$\delta_j = o_j (1 - o_j) \sum_{k \in \text{downstream}(j)} \delta_k \, w_{jk}$$
(the minus signs from the previous slide cancel against the $-R$ in $\Delta w_{ij} = -R \, \partial E_d / \partial w_{ij}$)
§ First determine the error for the output units.
§ Then, back-propagate this error layer by layer through the network, changing the weights appropriately in each layer.
The Backpropagation Algorithm
§ Create a fully connected three-layer network. Initialize the weights.
§ Until all examples produce the correct output within $\epsilon$ (or some other criterion), for each example in the training set do:
  1. Compute the network output for this example.
  2. Compute the error between the output and the target value: for each output unit $k$, compute its error term
     $$\delta_k = (t_k - o_k) \, o_k (1 - o_k)$$
  3. For each hidden unit $j$, compute its error term:
     $$\delta_j = o_j (1 - o_j) \sum_{k \in \text{downstream}(j)} \delta_k \, w_{jk}$$
  4. Update the network weights with $\Delta w_{ij} = R \, \delta_j x_i$
End epoch
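Putting the algorithm together, here is a minimal self-contained sketch that trains a three-layer sigmoid network on XOR with the per-example updates above. It folds the thresholds $T_j$ into the weights via a constant bias input; all names and hyperparameters are ours, and with an unlucky initialization it may need a different seed or more epochs:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hid, n_out, R = 2, 3, 1, 0.5
W1 = rng.normal(scale=0.5, size=(n_hid, n_in + 1))   # hidden weights (+ bias column)
W2 = rng.normal(scale=0.5, size=(n_out, n_hid + 1))  # output weights (+ bias column)

# XOR as a tiny training set
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

for epoch in range(20000):
    for x, t in zip(X, T):
        # 1. forward step: compute the network output for this example
        x1 = np.append(x, 1.0)            # input plus bias
        h = sigmoid(W1 @ x1)
        h1 = np.append(h, 1.0)            # hidden plus bias
        o = sigmoid(W2 @ h1)
        # 2. error terms of output units: delta_k = (t_k - o_k) o_k (1 - o_k)
        delta_o = (t - o) * o * (1 - o)
        # 3. error terms of hidden units: delta_j = o_j (1 - o_j) sum_k delta_k w_jk
        delta_h = h * (1 - h) * (W2[:, :-1].T @ delta_o)
        # 4. weight updates: Delta w_ij = R * delta_j * x_i
        W2 += R * np.outer(delta_o, h1)
        W1 += R * np.outer(delta_h, x1)

for x in X:
    h1 = np.append(sigmoid(W1 @ np.append(x, 1.0)), 1.0)
    print(x, sigmoid(W2 @ h1))            # close to [0, 1, 1, 0]
```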
More Hidden Layers
§ The same algorithm holds for more hidden layers.

[Figure: a deeper network from input to output]
Demo time!
§ Link: https://playground.tensorflow.org/
Comments on Training
§ No guarantee of convergence; training may oscillate or reach a local minimum.
§ In practice, many large networks can be trained on large amounts of data for realistic problems.
§ Many epochs (tens of thousands) may be needed for adequate training. Large data sets may require many hours of CPU time.
§ Termination criteria: number of epochs; threshold on training set error; no decrease in error; increased error on a validation set.
§ To avoid local minima: several trials with different random initial weights, with majority or voting techniques.
Over-training Prevention
§ Running too many epochs may over-train the network and result in over-fitting (improved result on training, decreased performance on the test set).
§ Keep a hold-out validation set and test accuracy after every epoch.
§ Maintain the weights of the best-performing network on the validation set and return it when performance decreases significantly beyond that.
§ To avoid losing training data to validation:
  § Use 10-fold cross-validation to determine the average number of epochs that optimizes validation performance.
  § Train on the full data set using this many epochs to produce the final results.
Over-fitting Prevention
§ Too few hidden units prevent the system from adequately fitting the data and learning the concept.
§ Using too many hidden units leads to over-fitting.
§ A similar cross-validation method can be used to determine an appropriate number of hidden units. (general)
§ Another approach to preventing over-fitting is weight decay: all weights are multiplied by some fraction in (0,1) after every epoch.
  § Encourages smaller weights and a less complex hypothesis.
  § Equivalently: change the error function to include a term for the sum of the squares of the weights in the network. (general)
Dropout Training
§ Proposed by (Hinton et al., 2012)
§ Each time, decide whether to delete a hidden unit with some probability $p$

[Figure: a network with some hidden units dropped]
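A minimal sketch of such a dropout mask (the test-time scaling by $1-p$ is one common convention from the dropout literature; names and values are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    """Drop each hidden unit with probability p during training.

    At test time all units are kept; scaling by (1 - p) keeps the
    expected activation the same as during training.
    """
    if train:
        mask = rng.random(h.shape) >= p   # 1 = keep, 0 = drop
        return h * mask
    return h * (1 - p)

h = np.array([0.2, 0.9, 0.4, 0.7])
print(dropout(h))                 # roughly half the units zeroed out
print(dropout(h, train=False))
```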
Dropout Training
§ Dropout of 50% of the hidden units and 20% of the input units (Hinton et al., 2012)

[Figure: dropped units in the input and hidden layers]
Dropout Training
§ Model averaging effect
  § Among $2^H$ models, with shared parameters
    § $H$: the number of units in the network
  § Only a few get trained
  § Much stronger than the known regularizers
§ What about the input space?
  § Do the same thing!
Input-Output Coding
§ Appropriate coding of inputs and outputs can make the learning problem easier and improve generalization.
§ Encode each binary feature as a separate input unit.
§ For multi-valued features, include one binary unit per value rather than trying to encode the input information in fewer units.
  § It is very common today to use a distributed representation of the input – real-valued, dense representations.
§ For a disjoint categorization problem, it is best to have one output unit per category rather than encoding N categories into log N bits.

One way to do it, if you start with a collection of sparsely represented examples, is to use dimensionality reduction methods:
§ Your m examples are represented as an m × 10⁶ matrix
§ Multiply it by a random matrix of size 10⁶ × 300, say.
§ Random matrix: Normal(0, 1)
§ New representation: m × 300 dense rows
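A sketch of that random projection (the sparse dimension is shrunk from the slide's $10^6$ to keep the example's memory footprint small; everything else follows the recipe above):

```python
import numpy as np

rng = np.random.default_rng(0)

m, sparse_dim, dense_dim = 100, 10**4, 300   # slide uses 10^6; shrunk here for memory
# m sparse examples as an m x sparse_dim matrix (a few random bits on per row)
X = np.zeros((m, sparse_dim))
for row in X:
    row[rng.integers(0, sparse_dim, size=20)] = 1.0

P = rng.normal(0, 1, size=(sparse_dim, dense_dim))  # random Normal(0, 1) matrix
X_dense = X @ P                                     # new representation: m x 300 dense rows
print(X_dense.shape)                                # (100, 300)
```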
Hidden Layer Representation
§ The weight-tuning procedure sets weights that define whatever hidden-unit representation is most effective at minimizing the error.
§ Sometimes Backpropagation will define new hidden-layer features that are not explicit in the input representation, but which capture properties of the input instances that are most relevant to learning the target function.
§ Trained hidden units can be seen as newly constructed features that re-represent the examples so that they are linearly separable.
Gradient Checks are Useful!
§ They allow you to know that there are no bugs in your neural network implementation!
§ Implement your gradient.
§ Implement a finite-difference computation by looping through the parameters of your network, adding and subtracting a small epsilon (~$10^{-4}$) and estimating derivatives:
$$f'(\theta) \approx \frac{J(\theta^+) - J(\theta^-)}{2\epsilon}, \qquad \theta^{\pm} = \theta \pm \epsilon$$
§ Compare the two and make sure they are almost the same.
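A minimal sketch of that check against a toy loss whose gradient we know in closed form (names are ours):

```python
import numpy as np

def numeric_grad(J, theta, eps=1e-4):
    """Estimate dJ/dtheta_i as (J(theta + eps e_i) - J(theta - eps e_i)) / (2 eps)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

J = lambda w: 0.5 * np.sum(w ** 2)         # toy loss with known gradient dJ/dw = w
w = np.array([0.3, -1.2, 2.0])
print(np.allclose(numeric_grad(J, w), w))  # True: analytic and numeric match
```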
Auto-associative Network
§ An auto-associative network trained with 8 inputs, 3 hidden units and 8 output nodes, where the output must reproduce the input.
§ When trained with vectors with only one bit on:

  INPUT      HIDDEN
  10000000   .89 .40 .80
  01000000   .97 .99 .71
  ...
  00000001   .01 .11 .88

§ It learned the standard 3-bit encoding for the 8-bit vectors.
§ This also illustrates the data compression aspects of learning.

[Figure: the input vector 10001000 reproduced at the output]
Sparse Auto-encoder
§ Encoding: $y = f(Wx + b)$
§ Decoding: $\hat{x} = g(W'y + b')$
§ Goal: perfect reconstruction of the input vector $x$ by the output $\hat{x}$, where $\theta = \{W, W'\}$
§ Minimize an error function $l(\hat{x}, x)$
  § For example: $l(\hat{x}, x) = \| \hat{x} - x \|^2$
§ And regularize it:
$$\min_{\theta} \sum_{x} l(\hat{x}, x) + \sum_i |w_i|$$
§ After optimization, drop the reconstruction layer and add a new layer.
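A minimal sketch of this objective (choosing the sigmoid for $f$ and $g$, and a $\lambda$ weighting on the sparsity term, are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3                                    # input size, code size
W, b = rng.normal(size=(k, d)), np.zeros(k)    # encoder parameters
W2, b2 = rng.normal(size=(d, k)), np.zeros(d)  # decoder ("reconstruction") parameters
f = g = lambda z: 1.0 / (1.0 + np.exp(-z))     # any differentiable nonlinearity

def loss(x, lam=0.01):
    y = f(W @ x + b)          # encoding:  y = f(Wx + b)
    x_hat = g(W2 @ y + b2)    # decoding:  x_hat = g(W'y + b')
    return np.sum((x_hat - x) ** 2) + lam * np.abs(W).sum()  # reconstruction + L1 sparsity

x = rng.random(d)
print(loss(x))   # minimize this over {W, W', b, b'} with gradient descent
```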
Stacking Auto-encoders
§ Add a new layer, and a reconstruction layer for it.
§ Try to tune its parameters such that it reconstructs its own input (the representation below it) well.
§ Continue this for each layer.
Beyond supervised learning
§ So far, what we had was purely supervised.
  § Initialize parameters randomly.
  § Train in supervised mode, typically using backprop.
  § Used in most practical systems (e.g., speech and image recognition).
§ Unsupervised, layer-wise pre-training + supervised classifier on top
  § Train each layer unsupervised, one after the other.
  § Train a supervised classifier on top, keeping the other layers fixed.
  § Good when very few labeled samples are available.
§ Unsupervised, layer-wise pre-training + supervised fine-tuning
  § Train each layer unsupervised, one after the other.
  § Add a classifier layer, and retrain the whole thing supervised.
  § Good when the label set is poor (e.g., pedestrian detection).
We won't talk about unsupervised pre-training here. But it's good to have this in mind, since it is an active topic of research.
NN-2
Recap: Multi-Layer Perceptrons
§ Multi-layer network
  § A global approximator
  § Different rules for training it
§ The Back-propagation
  § Forward step
  § Backpropagation of errors
§ Congrats! Now you know one of the most important algorithms in neural networks!
§ Today:
  § Convolutional Neural Networks
  § Recurrent Neural Networks

[Figure: Input, Hidden, and Output layers]
Receptive Fields
§ The receptive field of an individual sensory neuron is the particular region of the sensory space (e.g., the body surface, or the retina) in which a stimulus will trigger the firing of that neuron.
  § In the auditory system, receptive fields can correspond to volumes in auditory space.
§ Designing "proper" receptive fields for the input neurons is a significant challenge.
§ Consider a task with image inputs:
  § Receptive fields should give expressive features from the raw input to the system.
  § How would you design the receptive fields for this problem?
§ A fully connected layer:
  § Example:
    § 100 × 100 images
    § 1000 units in the layer
  § Problems:
    § $10^7$ edges!
    § Spatial correlations lost!
    § Variable-sized inputs.
(Slide credit: Marc'Aurelio Ranzato)

[Figure: an input layer fully connected to the layer above]
§ Consider a task with image inputs:
§ A locally connected layer:
  § Example:
    § 100 × 100 images
    § 1000 units in the layer
    § Filter size: 10 × 10
  § Local correlations preserved!
  § Problems:
    § $10^5$ edges
    § This parameterization is good when the input image is registered (e.g., face recognition).
    § Variable-sized inputs, again.
(Slide credit: Marc'Aurelio Ranzato)

[Figure: an input layer locally connected to the layer above]
Convolutional Layer
§ A solution:
  § Filters to capture different patterns in the input space.
  § Share parameters across different locations (assuming the input is stationary).
    § This amounts to convolutions with learned filters.
  § Filters will be learned during training.
  § The issue of variable-sized inputs will be resolved with a pooling layer.
So what is a convolution?
(Slide credit: Marc'Aurelio Ranzato)
Convolution Operator
§ The convolution operator $*$ takes two functions and gives another function.
§ In one dimension:
$$(x * h)(t) = \int x(\tau) \, h(t - \tau) \, d\tau$$
$$(x * h)[n] = \sum_m x[m] \, h[n - m]$$
§ "Convolution" is very similar to "cross-correlation", except that in convolution one of the functions is flipped.
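A quick sketch of the discrete formula; `np.convolve` flips the second sequence and slides it, while `np.correlate` does not (the example values are ours):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, 0.0, -1.0])

# (x * h)[n] = sum_m x[m] h[n - m]: np.convolve flips h and slides it over x
print(np.convolve(x, h))                  # [ 1.  2.  2.  2. -3. -4.]
# cross-correlation is the same sliding without the flip:
print(np.correlate(x, h, mode="full"))    # [-1. -2. -2. -2.  3.  4.]
```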
Convolution Operator (2)
§ Convolution in two dimensions:
  § The same idea: flip one matrix and slide it over the other matrix.
  § Example: the sharpen kernel.
Try other kernels: http://setosa.io/ev/image-kernels/
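For instance, applying the sharpen kernel with SciPy (the toy 5 × 5 image is ours):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.arange(25, dtype=float).reshape(5, 5)
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

# convolve2d flips the kernel and slides it over the image
print(convolve2d(image, sharpen, mode="same"))
```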
Convolution Operator (3)
§ Convolution in two dimensions:
  § The same idea: flip one matrix and slide it over the other matrix.

[Figure: a kernel sliding over an input matrix to produce the output]
Complexity of Convolution
§ The complexity of the convolution operator is $O(n \log n)$ for $n$ inputs.
  § It uses the Fast Fourier Transform (FFT).
§ In two dimensions, each convolution takes $O(MN \log MN)$ time, where the size of the input is $M \times N$.
(Slide credit: Marc'Aurelio Ranzato)
Convolutional Layer
§ The convolution of the input (vector/matrix) with weights (vector/matrix) results in a response vector/matrix.
§ We can have multiple filters in each convolutional layer, each producing an output.
§ If it is an intermediate layer, it can have multiple inputs!
§ One can add a nonlinearity at the output of the convolutional layer.

[Figure: a convolutional layer applying several filters to one input]
Pooling Layer
§ How to handle variable-sized inputs?
  § A layer which reduces inputs of different sizes to a fixed size.
  § Pooling.
(Slide credit: Marc'Aurelio Ranzato)
Pooling Layer
§ How to handle variable-sized inputs?
  § A layer which reduces inputs of different sizes to a fixed size.
  § Pooling.
§ Different variations (where $N(n)$ is the pooling region):
  § Max pooling: $h[n] = \max_{i \in N(n)} \tilde{h}[i]$
  § Average pooling: $h[n] = \frac{1}{n} \sum_{i \in N(n)} \tilde{h}[i]$
  § L2 pooling: $h[n] = \sqrt{\frac{1}{n} \sum_{i \in N(n)} \tilde{h}^2[i]}$
  § etc.
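A sketch of these variants reducing inputs of different lengths to the same fixed size (splitting into equal regions is our choice of neighborhood):

```python
import numpy as np

def pool(h, out_size, kind="max"):
    """Reduce a variable-length response h to a fixed number of values."""
    chunks = np.array_split(h, out_size)        # pooling regions N(n)
    if kind == "max":
        return np.array([c.max() for c in chunks])
    if kind == "avg":
        return np.array([c.mean() for c in chunks])
    if kind == "l2":
        return np.array([np.sqrt((c ** 2).mean()) for c in chunks])

for n in (7, 12):                               # inputs of different sizes ...
    h = np.random.default_rng(0).random(n)
    print(pool(h, out_size=3))                  # ... always pooled down to 3 values
```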
Convolutional Nets
§ One-stage structure: Convolution → Pooling
§ Whole system: Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label
Training a ConvNet
§ The same procedure from Back-propagation applies here.
  § Remember: in backprop we started from the error terms in the last stage, and passed them back to the previous layers, one by one.
§ Back-prop for the pooling layer:
  § Consider, for example, the case of "max" pooling.
  § This layer only routes the gradient to the input that has the highest value in the forward pass.
  § Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.
§ Therefore the gradient $\delta = \frac{\partial E_d}{\partial y_i}$ at a pooling output is passed, unchanged, to the one input $x_i$ that attained the max, and is zero for the others.

[Figure: pipeline Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label, with $E_d$ flowing backward as $\delta_{\text{pool-stage}} = \partial E_d / \partial y_{\text{pool-stage}}$ and $\delta_{\text{conv-stage}} = \partial E_d / \partial y_{\text{conv-stage}}$]
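A minimal sketch of that routing for non-overlapping 1D windows (the names, and the reshape-based windowing, are ours):

```python
import numpy as np

def maxpool_forward(x, width):
    """Forward pass; remember the argmax index ("switch") of each window."""
    windows = x.reshape(-1, width)
    switches = windows.argmax(axis=1)
    return windows.max(axis=1), switches

def maxpool_backward(grad_y, switches, width):
    """Route each output gradient only to the input that won the max."""
    grad_x = np.zeros(grad_y.size * width)
    grad_x[np.arange(grad_y.size) * width + switches] = grad_y
    return grad_x

x = np.array([1.0, 3.0, 2.0, 0.0, 5.0, 4.0])
y, sw = maxpool_forward(x, width=2)        # y = [3, 2, 5], switches = [1, 0, 0]
print(maxpool_backward(np.array([0.1, 0.2, 0.3]), sw, width=2))
# [0.  0.1 0.2 0.  0.3 0. ]
```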
Training a ConvNet
§ Back-prop for the convolutional layer:
§ We derive the update rules for a 1D convolution, but the idea is the same for bigger dimensions.
§ The convolution:
$$\hat{y} = w * x \iff \hat{y}_i = \sum_{a=0}^{m-1} w_a \, x_{i-a} = \sum_{a=0}^{m-1} w_{i-a} \, x_a \quad \forall i$$
§ A differentiable nonlinearity:
$$y = f(\hat{y}) \iff y_i = f(\hat{y}_i) \quad \forall i$$
§ Updating the filter (now we have everything in this layer to update the filter):
$$\frac{\partial E_d}{\partial w_a} = \sum_{i=0}^{m-1} \frac{\partial E_d}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial w_a} = \sum_{i=0}^{m-1} \frac{\partial E_d}{\partial \hat{y}_i} \, x_{i-a}$$
§ Through the nonlinearity:
$$\frac{\partial E_d}{\partial \hat{y}_i} = \frac{\partial E_d}{\partial y_i} \frac{\partial y_i}{\partial \hat{y}_i} = \frac{\partial E_d}{\partial y_i} \, f'(\hat{y}_i)$$
§ Passing the gradient to the previous layer:
$$\delta = \frac{\partial E_d}{\partial x_a} = \sum_{i=0}^{m-1} \frac{\partial E_d}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial x_a} = \sum_{i=0}^{m-1} \frac{\partial E_d}{\partial \hat{y}_i} \, w_{i-a}$$
§ Now we can repeat this for each stage of the ConvNet.

[Figure: pipeline Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label, with $E_d$ flowing backward through the stages]
Convolutional Nets
§ Feature visualization of a convolutional net trained on ImageNet, from [Zeiler & Fergus 2013]

[Figure: visualized features for Stage 1, Stage 2, and Stage 3 of the pipeline Input Image → Stages → Fully Connected Layer → Class Label]
Demo (Teachable Machines)
§ https://teachablemachine.withgoogle.com/
ConvNet Roots
§ Fukushima (1980s) designed a network with the same basic structure but did not train it by backpropagation.
§ The first successful applications of Convolutional Networks were by Yann LeCun in the 1990s (LeNet)
  § It was used to read zip codes, digits, etc.
§ Many variants nowadays, but the core idea is the same
  § Example: a system developed at Google (GoogLeNet)
    § Compute different filters
    § Compose one big vector from all of them
    § Layer this iteratively
See more: http://arxiv.org/pdf/1409.4842v1.pdf
Depth Matters

[Slide from [Kaiming He, 2015]]
Vanishing/Exploding Gradients
§ Vanishing gradients are quite prevalent and a serious issue.
§ A real example:
  § Training a feed-forward network
  § y-axis: sum of the gradient norms
  § Earlier layers have an exponentially smaller sum of gradient norms
  § This will make training the earlier layers much slower.
§ The gradient can become very small or very large quickly, and the locality assumption of gradient descent breaks down (vanishing gradient) [Bengio et al., 1994]
Vanishing/Exploding Gradients
§ In architectures with many layers (e.g., > 10) the gradients can easily explode or vanish.
§ Many methods have been proposed to reduce the effect of vanishing gradients, although it is still a problem
  § Introduce shorter paths between long connections
  § Abandon stochastic gradient descent in favor of a much more sophisticated Hessian-Free (HF) optimization
  § Clip gradients with bigger sizes:
    Define $g = \frac{\partial E}{\partial W}$. If $\|g\| \geq threshold$, then $g \leftarrow \frac{threshold}{\|g\|} \, g$
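The clipping rule in a couple of lines (the example values are ours):

```python
import numpy as np

def clip_gradient(g, threshold):
    """If ||g|| >= threshold, rescale: g <- (threshold / ||g||) * g."""
    norm = np.linalg.norm(g)
    if norm >= threshold:
        return (threshold / norm) * g
    return g

g = np.array([30.0, 40.0])                # ||g|| = 50
print(clip_gradient(g, threshold=5.0))    # rescaled to norm 5: [3. 4.]
```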
Practical Tips
§ Before large-scale experiments, test on a small subset of the data and check that the error goes to zero.
  § A correct implementation should be able to overfit a small training set.
§ Visualize features: feature maps need to be uncorrelated and have high variance.
§ Bad training: many hidden units ignore the input and/or exhibit strong correlations.
(Figure credit: Marc'Aurelio Ranzato)
Debugging
§ Training diverges:
  § Learning rate may be too large → decrease the learning rate
  § BackProp is buggy → use numerical gradient checking
§ Loss is minimized but accuracy is low
  § Check the loss function: Is it appropriate for the task you want to solve? Does it have degenerate solutions?
§ NN is underperforming / under-fitting
  § Compute the number of parameters → if too small, make the network larger
§ NN is too slow
  § Compute the number of parameters → use a distributed framework, use a GPU, make the network smaller
Many of these points apply to many machine learning models, not just neural networks.
CNN for Vector Inputs
§ Let's study another variant of CNN for language
  § Example: sentence classification (say, spam or not spam)
§ First step: represent each word of "This is not a spam" with a vector in $\mathbb{R}^d$, and concatenate the vectors.
§ Now we can assume that the input to the system is a vector in $\mathbb{R}^{dl}$
  § where the input sentence has length $l$ ($l = 5$ in our example)
  § and each word vector has length $d$ ($d = 7$ in our example)
Convolutional Layer on Vectors
§ Think about a single convolutional layer
  § A bunch of vector filters, each defined in $\mathbb{R}^{dh}$
    § where $h$ is the number of words the filter covers
    § and $d$ is the size of the word vector
§ Find its (modified) convolution with the input vector.
§ Result of the convolution with the filter:
$$c_1 = f(w \cdot x_{1:h}), \quad c_2 = f(w \cdot x_{h+1:2h}), \quad c_3 = f(w \cdot x_{2h+1:3h}), \quad c_4 = f(w \cdot x_{3h+1:4h}), \quad \dots$$
$$c = [c_1, \dots, c_{n-h+1}]$$
§ Convolution with a filter that spans 2 words operates on all of the bigrams (vectors of two consecutive words, concatenated): "this is", "is not", "not a", "a spam".
  § Regardless of whether it is grammatical (not appealing linguistically).
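A minimal sketch of one filter sliding a word at a time over the concatenated sentence vector (choosing $\tanh$ for $f$, and random vectors in place of real word embeddings, are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, l, h = 7, 5, 2                     # word-vector size, sentence length, filter width
words = rng.random((l, d))            # one d-dimensional vector per word ("This is not a spam")
x = words.reshape(-1)                 # concatenated sentence vector in R^{dl}
w = rng.random(h * d)                 # one filter in R^{dh}
f = np.tanh                           # any nonlinearity

# c_i = f(w . x_{(i-1)h+1 : ih-ish slice}): the filter moves one word at a time,
# so a width-2 filter sees every bigram: "this is", "is not", "not a", "a spam"
c = np.array([f(w @ x[i * d : (i + h) * d]) for i in range(l - h + 1)])
print(c.shape)                        # (l - h + 1,) = (4,)
```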
Convolutional Layer on Vectors
§ The pipeline so far:
  § Get word vectors for each word of "This is not a spam".
  § Concatenate the vectors.
  § Perform convolution with each filter in a filter bank.
  § This yields a set of response vectors: one per filter (# of filters many), each of length (# words − length of filter + 1).
§ How are we going to handle the variable-sized response vectors? Pooling!
Convolutional Layer on Vectors
§ The same pipeline, now with pooling on the filter responses:
  § Get word vectors, concatenate, convolve with each filter in the filter bank, then pool each response down to a fixed size.
  § Some choices for pooling: k-max, mean, etc.
§ Now we can pass the fixed-size vector to a logistic unit (softmax), or give it to a multi-layer network (last session).
Recurrent Neural Networks
§ Prediction on chain-like input:
  § Example: POS tagging the words of a sentence

    This  is   a   sample  sentence
    DT    VBZ  DT  NN      NN

§ Issues:
  § Structure in the output: there are connections between labels.
  § Interdependence between elements of the inputs: the final decision is based on an intricate interdependence of the words on each other.
  § Variable-sized inputs: e.g., sentences differ in size.
§ How would you go about solving this task?
Recurrent Neural Networks
§ Infinite use of finite structure

[Figure: an unrolled RNN with inputs $X_0, \dots, X_3$, hidden state representations $H_0, \dots, H_3$, and outputs $Y_0, \dots, Y_3$]
Recurrent Neural Networks
§ A chain RNN:
  § Has a chain-like structure
  § Each input is replaced with its vector representation $x_t$
  § The hidden (memory) unit $h_t$ contains information about previous inputs and previous hidden units $h_{t-1}$, $h_{t-2}$, etc.
    § It is computed from the past memory and the current word. It summarizes the sentence up to that time.

[Figure: input layer $x_{t-1}, x_t, x_{t+1}$ feeding memory layer $h_{t-1}, h_t, h_{t+1}$]
Recurrent Neural Networks
§ A popular way of formalizing it:
$$h_t = f(W_h h_{t-1} + W_i x_t)$$
  § where $f$ is a nonlinear, differentiable (why?) function.
§ Outputs?
  § Many options; depending on the problem and the computational resources.

[Figure: input layer $x_{t-1}, x_t, x_{t+1}$ feeding memory layer $h_{t-1}, h_t, h_{t+1}$]
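A minimal sketch of this recurrence (the sizes, the choice of $\tanh$ for $f$, and the random input sequence are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3                                 # input size, hidden (memory) size
W_i = rng.normal(scale=0.5, size=(k, d))    # input-to-hidden weights
W_h = rng.normal(scale=0.5, size=(k, k))    # hidden-to-hidden weights
f = np.tanh                                 # nonlinear, differentiable

h = np.zeros(k)                             # h_0
for x_t in rng.random((6, d)):              # a length-6 input sequence
    h = f(W_h @ h + W_i @ x_t)              # h_t = f(W_h h_{t-1} + W_i x_t)
print(h)                                    # summary of the sequence so far
```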
Recurrent Neural Networks
§ Prediction for $x_t$, with $h_t$:
$$y_t = \text{softmax}(W_o h_t)$$
§ Some inherent issues with RNNs:
  § Recurrent neural nets cannot capture phrases without prefix context.
  § They often capture too much of the last words in the final vector.
§ A slightly more sophisticated solution: Long Short-Term Memory (LSTM) units

[Figure: the same chain, with an output layer $y_{t-1}, y_t, y_{t+1}$ above the memory layer]
Recurrent Neural Networks
§ Multi-layer feed-forward NN: a DAG
  § Just computes a fixed sequence of non-linear learned transformations to convert an input pattern into an output pattern.
§ Recurrent Neural Network: a digraph
  § Has cycles.
  § A cycle can act as a memory;
  § The hidden state of a recurrent net can carry along information about a "potentially" unbounded number of previous inputs.
  § They can model sequential data in a much more natural way.
Equivalence between RNN and Feed-forward NN
§ Assume that there is a time delay of 1 in using each connection.
§ The recurrent net is then just a layered net that keeps reusing the same weights.
(Slide credit: Geoff Hinton)

[Figure: a 3-unit recurrent net with weights $w_1, \dots, w_4$, unrolled into a layered net over time = 0, 1, 2, 3 that reuses the same weights $W_1, W_2, W_3, W_4$ at every step]
Bi-directional RNN
§ One of the issues with RNNs:
  § The hidden variables capture only one-sided context.
§ A bi-directional structure fixes this by running one chain forward and one backward.

[Figure: an RNN next to a bi-directional RNN]
Stack of Bi-directional Networks
§ Use the same idea and make your model more complex:

[Figure: stacked bi-directional recurrent layers]
Training RNNs
§ How to train such a model?
  § Generalize the same ideas from back-propagation.
§ Total output error: $E(\vec{y}, \vec{t}) = \sum_{t=1}^{T} E_t(y_t, t_t)$
§ Parameters? $W_o$, $W_i$, $W_h$, plus the vectors for the input.
(Reminder: $y_t = \text{softmax}(W_o h_t)$ and $h_t = f(W_h h_{t-1} + W_i x_t)$)
$$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$$
$$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_{t-k}} \frac{\partial h_{t-k}}{\partial W}$$
§ This is sometimes called "Backpropagation Through Time", since the gradients are propagated back through time.

[Figure: the unrolled chain with inputs $x_{t-1}, x_t, x_{t+1}$, memory layer $h_{t-1}, h_t, h_{t+1}$, and outputs $y_{t-1}, y_t, y_{t+1}$]
Recurrent Neural Network
(Reminder: $y_t = \text{softmax}(W_o h_t)$ and $h_t = f(W_h h_{t-1} + W_i x_t)$)
$$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_{t-k}} \frac{\partial h_{t-k}}{\partial W}$$
$$\frac{\partial h_t}{\partial h_{t-k}} = \prod_{j=t-k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=t-k+1}^{t} W_h \, \text{diag}\!\left[ f'(W_h h_{j-1} + W_i x_j) \right]$$
$$\frac{\partial h_t}{\partial h_{t-1}} = W_h \, \text{diag}\!\left[ f'(W_h h_{t-1} + W_i x_t) \right], \qquad \text{diag}(a_1, \dots, a_n) = \begin{pmatrix} a_1 & & \\ & \ddots & \\ & & a_n \end{pmatrix}$$

[Figure: the same unrolled chain of inputs, memory units, and outputs]
Unsupervised RNNs
§ What to put here?
  § "He was locked up after he ______."
§ Note that:
  § This is unsupervised; you can use tons of data to train this.
  § While training the model, we train the word representations too.

[Figure: an RNN language model: the context words $x_{t-2}, x_{t-1}, x_t$ feed the memory layer $h_{t-1}, h_t, h_{t+1}$, and the output layer predicts the next word $y$]
Unsupervised RNNs
§ This would result in word representations
  § that convey information about their co-occurrence,
  § or some form of weak "semantic" similarity.
§ A big part of the progress (past 5-10 years) is partly due to discovering better ways to create unsupervised context-sensitive representations.