RNN Review & Hierarchical Attention Networks
Shang Gao
Overview
◦ Review of Recurrent Neural Networks
◦ Advanced RNN Architectures
◦ Long Short-Term Memory
◦ Gated Recurrent Units
◦ RNNs for Natural Language Processing
◦ Word Embeddings
◦ NLP Applications
◦ Attention Mechanisms
◦ Hierarchical Attention Networks
Feedforward Neural Networks
In a regular feedforward network, each neuron takes in inputs from the neurons in the previous layer and then passes its output to the neurons in the next layer
The neurons at the end make a classification based only on the data from the current input
Recurrent Neural Networks
In a recurrent neural network, each neuron takes in data from the previous layer AND its own output from the previous time step
The neurons at the end make a classification decision based NOT ONLY on the input at the current time step BUT ALSO on the input from all time steps before it
Recurrent neural networks can thus capture patterns over time (e.g. weather, stock market data, speech audio, natural language)
Recurrent Neural Networks
In the example below, the neuron at the first time step takes in an input and generates an output
The neuron at the second time step takes in an input AND ALSO the output from the first time step to make its decision
The neuron at the third time step takes in an input and also the output from the second time step (which accounted for data from the first time step), so its output is affected by data from both the first and second time steps
Recurrent Neural Networks
Traditional neuron: output = sigmoid(weights * input + bias)
Recurrent neuron: output = sigmoid(weights1 * input + weights2 * previous_output + bias)
or: output = sigmoid(weights * concat(input, previous_output) + bias)
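A minimal numpy sketch of the second (concatenation) formulation above; the sizes and weight names are placeholders for illustration, not part of any particular library:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative recurrent neuron; sizes and names are arbitrary placeholders.
input_size, hidden_size = 4, 3
weights = np.random.randn(hidden_size, input_size + hidden_size) * 0.1
bias = np.zeros(hidden_size)

def step(x, previous_output):
    """One time step: combine the current input with the previous output."""
    combined = np.concatenate([x, previous_output])
    return sigmoid(weights @ combined + bias)

# Unroll over a short random input sequence.
output = np.zeros(hidden_size)
for x in np.random.randn(5, input_size):
    output = step(x, output)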
Toy RNN Example: Adding Binary
At each time step, the RNN takes in two values representing binary input
At each time step, the RNN outputs the sum of the two binary values, taking into account any carryover from the previous time step
Problems with Basic RNNs
In a basic RNN, new data is written into each cell at every time step
Data from time steps very early on gets diluted because it is written over so many times
In the example below, data from the first time step is read into the RNN
At each subsequent time step, the RNN factors in data from the current time step
By the end of the RNN, the data from the first time step has very little impact on the output of the RNN
Problems with Basic RNNs
Basic RNN cells can't retain information across a large number of time steps
Depending on the problem, RNNs can lose data in as few as 3-5 time steps
This causes problems on tasks where information needs to be retained over a long time
For example, in natural language processing, the meaning of a pronoun may depend on what was stated in a previous sentence
Long Short-Term Memory
Long Short-Term Memory cells are advanced RNN cells that address the problem of long-term dependencies
Instead of always writing to each cell at every time step, each unit has an internal 'memory' that can be written to selectively
Long Short-Term Memory
Input from the current time step is written to the internal memory based on how relevant it is to the problem (relevance is learned during training through backpropagation)
If the input isn't relevant, no data is written into the cell
This way data can be preserved over many time steps and retrieved when it is needed
Long Short-Term Memory
Movement of data into and out of an LSTM cell is controlled by "gates"
The "forget gate" outputs a value between 0 (delete) and 1 (keep) and controls how much of the internal memory to keep from the previous time step
For example, at the end of a sentence, when a '.' is encountered, we may want to reset the internal memory of the cell
Long Short-Term Memory
The "candidate value" is the processed input value from the current time step that may be added to memory
◦ Note that a tanh activation is used for the "candidate value" to allow negative values that subtract from memory
The "input gate" outputs a value between 0 (delete) and 1 (keep) and controls how much of the candidate value to add to memory
Long Short-Term Memory
Combined, the "input gate" and "candidate value" determine what new data gets written into memory
The "forget gate" determines how much of the previous memory to retain
The new memory of the LSTM cell is the "forget gate" * the previous memory state + the "input gate" * the "candidate value" from the current time step
Long Short-Term Memory
The LSTM cell does not output the full contents of its memory to the next layer
◦ Stored data in memory might not be relevant for the current time step, e.g., a cell can store a pronoun reference and only output it when the pronoun appears
Instead, an "output gate" outputs a value between 0 and 1 that determines how much of the memory to output
The memory goes through a final tanh activation before being passed to the next layer
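Putting the gates together, here is a minimal numpy sketch of one LSTM step; the weight shapes, names, and initialization are illustrative assumptions rather than any specific library's API:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal LSTM cell sketch; sizes and weight names are placeholders.
input_size, hidden_size = 4, 3
Wf, Wi, Wc, Wo = [np.random.randn(hidden_size, input_size + hidden_size) * 0.1
                  for _ in range(4)]
bf = bi = bc = bo = np.zeros(hidden_size)

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z + bf)         # forget gate: how much old memory to keep
    i = sigmoid(Wi @ z + bi)         # input gate: how much new data to write
    c_tilde = np.tanh(Wc @ z + bc)   # candidate value (tanh allows negative updates)
    c = f * c_prev + i * c_tilde     # new internal memory
    o = sigmoid(Wo @ z + bo)         # output gate: how much memory to expose
    h = o * np.tanh(c)               # output passed to the next layer
    return h, c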
Gated Recurrent Units
Gated Recurrent Units are very similar to LSTMs but use two gates instead of three
The "update gate" determines how much of the previous memory to keep
The "reset gate" determines how to combine the new input with the previous memory
The entire internal memory is output without an additional activation
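For comparison, a minimal numpy sketch of one GRU step under the same illustrative assumptions (shapes and weight names are placeholders):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal GRU cell sketch; sizes and weight names are placeholders.
input_size, hidden_size = 4, 3
Wz, Wr, Wh = [np.random.randn(hidden_size, input_size + hidden_size) * 0.1
              for _ in range(3)]

def gru_step(x, h_prev):
    z = sigmoid(Wz @ np.concatenate([x, h_prev]))            # update gate
    r = sigmoid(Wr @ np.concatenate([x, h_prev]))            # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h_prev]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                    # new memory, also the output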
LSTMs vs GRUs
Greff, et al. (2015) compared LSTMs and GRUs and found they perform about the same
Jozefowicz, et al. (2015) generated more than ten thousand RNN variants and determined that, depending on the task, some may perform better than LSTMs
GRUs train faster than LSTMs because they are less complex
Generally speaking, tuning hyperparameters (e.g. number of units, size of weights) will probably affect performance more than picking between GRU and LSTM
RNNs for Natural Language Processing
The natural input for a neural network is a vector of numeric values (e.g. pixel densities for imaging or audio frequencies for speech recognition)
How do you feed language as input into a neural network?
The most basic solution is one-hot encoding
◦ A long vector (equal to the length of your vocabulary) where each index represents one word in the vocabulary
◦ For each word, the index corresponding to that word is set to 1, and everything else is set to 0
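A tiny sketch of one-hot encoding over a made-up five-word vocabulary:

import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]           # toy vocabulary
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0                           # only this word's index is set to 1
    return vec

print(one_hot("cat"))   # [0. 1. 0. 0. 0.]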
One-Hot Encoding LSTM Example
Trained an LSTM to predict the next character given a sequence of characters
Training corpus: all books in the Hitchhiker's Guide to the Galaxy series
One-hot encoding used to convert each character into a vector
72 possible characters: lowercase letters, uppercase letters, numbers, and punctuation
Input vector is fed into a layer of 256 LSTM nodes
LSTM output fed into a softmax layer that predicts the following character
The character with the highest softmax probability is chosen as the next character
Generated Samples
700 iterations: aeae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae aeae ae ae ae ae
4200 iterations: the sand and the said the sand and the said the sand and the said the sand and the said the sand and the said the
36000 iterations: seared to be a little was a small beach of the ship was a small beach of the ship was a small beach of the ship
100000 iterations: the second the stars is the stars to the stars in the stars that he had been so the ship had been so the ship had been
290000 iterations: started to run a computer to the computer to take a bit of a problem off the ship and the sun and the air was the sound
500000 iterations: "I think the Galaxy will be a lot of things that the second man who could not be continually and the sound of the stars
One-Hot Encoding Shortcomings
One-hot encoding is lacking because it fails to capture semantic similarity between words, i.e., the inherent meaning of words
For example, the words "happy", "joyful", and "pleased" all have similar meanings, but under one-hot encoding they are three distinct and unrelated entities
What if we could capture the meaning of words within a numerical context?
Word Embeddings
Word embeddings are vector representations of words that attempt to capture semantic meaning
Each word is represented as a vector of numerical values
Each index in the vector represents some abstract "concept"
◦ These concepts are unlabeled and learned during training
Words that are similar will have similar vectors

            Masculinity   Royalty   Youth   Intelligence
King           0.95        0.95     -0.1       0.6
Queen         -0.95        0.95     -0.1       0.6
Prince         0.8         0.8       0.7       0.4
Woman         -0.95        0.01     -0.1       0.2
Peasant        0.1        -0.95      0.1      -0.3
Doctor         0.12        0.1      -0.2       0.95
Word2Vec
Words that appear in the same context are more likely to have the same meaning
◦ I am excited to see you today!
◦ I am ecstatic to see you today!
Word2Vec is an algorithm that uses a funnel-shaped, single-hidden-layer neural network (similar to an autoencoder) to create word embeddings
Given a word (in one-hot encoded format), it tries to predict the neighbors of that word (also in one-hot encoded format), or vice versa
Words that appear in the same context will have similar embeddings
Word2Vec
The model is trained on a large corpus of text using regular backpropagation
For each word in the corpus, predict the 5 words to the left and right (or vice versa)
Once the model is trained, the embedding for a particular word is the row of the weight matrix associated with that word
Many pretrained vectors (e.g. Google's) can be downloaded online
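A hedged sketch of training embeddings with the Python gensim package (the embedding-size keyword is vector_size in gensim 4.x; older versions call it size); the toy corpus below is made up for illustration:

from gensim.models import Word2Vec

corpus = [["i", "am", "excited", "to", "see", "you", "today"],
          ["i", "am", "ecstatic", "to", "see", "you", "today"]]

# Train a small Word2Vec model on the toy corpus.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1)

vector = model.wv["excited"]                 # the 100-dimensional embedding for a word
similar = model.wv.most_similar("excited")   # words with the most similar embeddings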
Word2Vec on 20 Newsgroups
Basic Deep Learning NLP Pipeline
Generate word embeddings (see the sketch below)
◦ Python gensim package
Feed word embeddings into an LSTM or GRU layer
Feed output of the LSTM or GRU layer into a softmax classifier
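A minimal Keras sketch of this pipeline, assuming placeholder values for the vocabulary size, embedding dimension, and number of classes:

from tensorflow.keras import layers, models

vocab_size, embed_dim, num_classes = 20000, 300, 5     # placeholder values

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),           # word embeddings
    layers.GRU(128),                                   # LSTM or GRU layer
    layers.Dense(num_classes, activation="softmax"),   # softmax classifier
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")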
NLP Applications for RNNs
Language Models
◦ Given a series of words, predict the next word
◦ Understand the inherent patterns in a given language
◦ Useful for autocompletion and machine translation
Sentiment Analysis
◦ Given a sentence or document, classify whether it is positive or negative
◦ Useful for analyzing the success of a product launch or automated stock trading based off news
Other forms of text classification
◦ Cancer pathology report classification
Advanced Applications
Question Answering
◦ Read a document and then answer questions
◦ Many models use RNNs as their foundation
Automated Image Captioning
◦ Given an image, automatically generate a caption
◦ Many models use both CNNs and RNNs
Machine Translation
◦ Automatically translate text from one language to another
◦ Many models (including Google Translate) use RNNs as their foundation
LSTM Improvements: Bi-directional LSTMs
Sometimes, important context for a word comes after the word (especially important for translation)
◦ I saw a crane flying across the sky
◦ I saw a crane lifting a large boulder
Solution: use two LSTM layers, one that reads the input forward and one that reads the input backwards, and concatenate their outputs
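A short Keras sketch of this idea; the layer size is a placeholder:

from tensorflow.keras import layers

# One LSTM reads the sequence forward, the other backward; their outputs are concatenated.
bi_lstm = layers.Bidirectional(layers.LSTM(128, return_sequences=True),
                               merge_mode="concat")
# Applied to embedded sequences of shape (batch, time, embed_dim),
# the output has shape (batch, time, 256).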
LSTM Improvements: Attention Mechanisms
Sometimes only a few words in a sentence or document are important, and the rest do not contribute as much meaning
◦ For example, when classifying cancer location from cancer pathology reports, we may only care about certain keywords like "right upper lung" or "ovarian"
In a traditional RNN, we usually take the output at the last time step
By the last time step, information from the important words may have been diluted, even with LSTM and GRU units
How can we capture the information at the most important words?
LSTM Improvements: Attention Mechanisms
Naïve solution: to prevent information loss, instead of using the LSTM output at the last time step, take the LSTM output at every time step and use the average
Better solution: find the important time steps and weight the output at those time steps much higher when doing the average
LSTM Improvements: Attention Mechanisms
An attention mechanism calculates how important the LSTM output at each time step is
It's a simple feedforward network with a single (tanh) hidden layer and a softmax output
At each time step, feed the output from the LSTM/GRU into the attention mechanism
LSTM Improvements: Attention Mechanisms
Once the attention mechanism has all the time steps, it calculates a softmax over all the time steps
◦ softmax always adds to 1
The softmax tells us how to weight the output at each time step, i.e., how important each time step is
Multiply the output at each time step by its corresponding softmax weight and add to create a weighted average
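A minimal numpy sketch of this attention mechanism over a sequence of LSTM/GRU outputs; the shapes and weight names are illustrative assumptions:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

time_steps, hidden_size, attn_size = 10, 128, 64
outputs = np.random.randn(time_steps, hidden_size)   # LSTM/GRU output at each time step

W = np.random.randn(attn_size, hidden_size) * 0.1    # single tanh hidden layer
u = np.random.randn(attn_size) * 0.1                 # learned importance/context vector

scores = np.tanh(outputs @ W.T) @ u                  # one relevance score per time step
weights = softmax(scores)                            # sums to 1 across time steps
weighted_average = weights @ outputs                 # attention-weighted output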
LSTM Improvements: Attention Mechanisms
Attention mechanisms can take into account "context" to determine what's important
Remember that the dot product is a measure of similarity: two vectors that are similar will have a larger dot product
In a normal softmax attention layer, we dot product the input with randomly initialized weights before applying the softmax function
LSTM Improvements: Attention Mechanisms
Instead, we can dot product with a vector that represents "context" to find the words most similar/relevant to that context:
◦ For question answering, it can represent the question being asked
◦ For machine translation, it can represent the previous word
◦ For classification, it can be initialized randomly and learned during training
LSTM Improvements: Attention Mechanisms
With attention, you can visualize how important each time step is for a particular task
CNNs for Text Classification
Start with word embeddings
◦ If you have 10 words and your embedding size is 300, you'll have a 10x300 matrix
3 parallel convolution layers
◦ Take in word embeddings
◦ Sliding window that processes 3, 4, and 5 words at a time (1D conv)
◦ Filter sizes are 3x300x100, 4x300x100, and 5x300x100 (width, in-channels, out-channels)
◦ Each conv layer outputs a 10x100 matrix
CNNs for Text Classification
Max-pool and concatenate
◦ For each filter channel, max-pool across the entire width of the sentence
◦ This is like picking the 'most important' word in the sentence for each channel
◦ Also ensures every sentence, no matter how long, is represented by a vector of the same length
◦ For each of the three 10x100 matrices, this returns a 1x100 matrix
◦ Concatenate the three 1x100 matrices into a 1x300 matrix
Dense and softmax layers produce the final classification (a sketch of this network follows below)
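A hedged Keras sketch of the network described above (10 words, 300-dimensional embeddings, 100 filters per window size; the class count is a placeholder):

from tensorflow.keras import layers, models

inputs = layers.Input(shape=(10, 300))                      # word-embedding matrix
branches = []
for window in (3, 4, 5):
    conv = layers.Conv1D(filters=100, kernel_size=window,
                         padding="same", activation="relu")(inputs)  # 10x100 per branch
    branches.append(layers.GlobalMaxPooling1D()(conv))      # max-pool over the sentence
merged = layers.Concatenate()(branches)                     # 1x300 sentence vector
outputs = layers.Dense(5, activation="softmax")(merged)     # dense + softmax
model = models.Model(inputs, outputs)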
Hierarchical Attention Networks
Problem Overview
The National Cancer Institute has asked Oak Ridge National Lab to develop a program that can automatically classify cancer pathology reports
Pathology reports are what doctors write up when they diagnose cancer, and NCI uses them to calculate national statistics and track health trends
Challenges:
◦ Different doctors use different terminology to label the same types of cancer
◦ Some diagnoses may reference other types of cancer or other organs that are not the actual cancer being diagnosed
◦ Typos
Task: given a pathology report, teach a program to find the type of cancer, location of cancer, histological grade, etc.
Approach
The performance of various classifiers was tested:
◦ Traditional machine learning classifiers: Naive Bayes, Logistic Regression, Support Vector Machines, Random Forests, and XGBoost
◦ Traditional machine learning classifiers require manually defined features, such as n-grams and tf-idf
◦ Deep learning methods: recurrent neural networks, convolutional neural networks, and hierarchical attention networks
◦ Given enough data, deep learning methods can learn their own features, such as which words or phrases are important
The Hierarchical Attention Network is a relatively new deep learning model that came out last year and is one of the top performers
HAN Architecture
The Hierarchical Attention Network (HAN) is a deep learning model for document classification
Built from bidirectional RNNs composed of GRUs/LSTMs with attention mechanisms
Composed of "hierarchies" where the outputs of the lower hierarchies become the inputs to the upper hierarchies
HAN Architecture
Before feeding a document into the HAN, we first break it down into sentences (or, in our case, lines)
The word hierarchy is responsible for creating sentence embeddings
◦ This hierarchy reads in one full sentence at a time, in the form of word embeddings
◦ The attention mechanism selects the most important words
◦ The output is a sentence embedding that captures the semantic content of the sentence based on the most important words
HAN Architecture
The sentence hierarchy is responsible for creating the final document embedding
◦ Identical in structure to the word hierarchy
◦ Reads in the sentence embeddings output from the word hierarchy
◦ The attention mechanism selects the most important sentences
◦ The output is a document embedding representing the meaning of the entire document
The final document embedding is used for classification (a conceptual sketch follows below)
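A conceptual numpy sketch of the two hierarchies, assuming the bidirectional GRU outputs are already computed (random stand-ins here) and reusing the attention mechanism described earlier; it only illustrates the flow from word vectors to sentence embeddings to a document embedding:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(states, W, u):
    """Attention-weighted average over a sequence of hidden states."""
    weights = softmax(np.tanh(states @ W.T) @ u)
    return weights @ states

hidden, attn = 128, 64
W_word, u_word = np.random.randn(attn, hidden), np.random.randn(attn)
W_sent, u_sent = np.random.randn(attn, hidden), np.random.randn(attn)

# A document of 4 sentences, each a sequence of 12 word-level GRU outputs (stand-ins).
doc = [np.random.randn(12, hidden) for _ in range(4)]

sentence_embeddings = np.stack([attend(s, W_word, u_word) for s in doc])
# (In the full model these would first pass through the sentence-level bidirectional GRU.)
document_embedding = attend(sentence_embeddings, W_sent, u_sent)  # used for classification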
Experimental Setup
945 cancer pathology reports, all cases of breast and lung cancer
10-fold cross-validation used, 30 epochs per fold
Hyperparameter optimization applied to the models to find optimal parameters
Two main tasks: primary site classification and histological grade classification
◦ Uneven class distribution, some classes with only ~10 occurrences in the dataset
◦ F-score used as the performance metric
◦ Micro F-score is the weighted F-score average based on class size
◦ Macro F-score is the unweighted F-score average across all classes
HAN Performance: Primary Site
12 possible cancer subsite locations
◦ 5 lung subsites
◦ 7 breast subsites
Deep learning methods outperformed all traditional ML methods except for XGBoost
HAN had the best performance; pretraining improved performance even further
Traditional Machine Learning Classifiers
Classifier                                          Primary Site Micro F-Score   Primary Site Macro F-Score
Naive Bayes                                         .554 (.521, .586)            .161 (.152, .170)
Logistic Regression                                 .621 (.589, .652)            .222 (.207, .237)
Support Vector Machine (C=1, gamma=1)               .616 (.585, .646)            .220 (.205, .234)
Random Forest (num trees=100)                       .628 (.597, .661)            .258 (.236, .283)
XGBoost (max depth=5, n estimators=300)             .709 (.681, .738)            .441 (.404, .474)

Deep Learning Classifiers
Recurrent Neural Network (with attention mechanism) .694 (.666, .722)            .468 (.432, .502)
Convolutional Neural Network                        .712 (.680, .736)            .398 (.359, .434)
Hierarchical Attention Network (no pretraining)     .784 (.759, .810)            .566 (.525, .607)
Hierarchical Attention Network (with pretraining)   .800 (.776, .825)            .594 (.553, .636)
HAN Performance: Histological Grade
4 possible histological grades
◦ 1-4, indicating how abnormal tumor cells and tumor tissues look under a microscope, with 4 being most abnormal
◦ Indicates how quickly a tumor is likely to grow and spread
Other than RNNs, deep learning models generally outperform traditional ML models
HAN had the best performance, but pretraining did not help performance
Traditional Machine Learning Classifiers
Classifier                                          Histological Grade Micro F-Score   Histological Grade Macro F-Score
Naive Bayes                                         .481 (.442, .519)                  .264 (.244, .283)
Logistic Regression                                 .540 (.499, .576)                  .340 (.309, .371)
Support Vector Machine (C=1, gamma=1)               .520 (.482, .558)                  .330 (.301, .357)
Random Forest (num trees=100)                       .597 (.558, .636)                  .412 (.364, .476)
XGBoost (max depth=5, n estimators=300)             .673 (.634, .709)                  .593 (.516, .662)

Deep Learning Classifiers
Recurrent Neural Network (with attention mechanism) .580 (.541, .617)                  .474 (.416, .536)
Convolutional Neural Network                        .716 (.681, .750)                  .521 (.493, .548)
Hierarchical Attention Network (no pretraining)     .916 (.895, .936)                  .841 (.778, .895)
Hierarchical Attention Network (with pretraining)   .904 (.881, .927)                  .822 (.744, .883)
TFIDF Document Embeddings
TFIDF-weighted Word2Vec embeddings reduced to 2 dimensions via PCA for (A) primary site train reports, (B) histological grade train reports, (C) primary site test reports, and (D) histological grade test reports.
HAN Document Embeddings
HAN document embeddings reduced to 2 dimensions via PCA for (A) primary site train reports, (B) histological grade train reports, (C) primary site test reports, and (D) histological grade test reports.
Pretraining
We have access to more unlabeled data than labeled data (approximately 1500 unlabeled, 1000 labeled)
To utilize the unlabeled data, we trained our HAN to create document embeddings that matched the corresponding TF-IDF weighted word embeddings for each document
HAN training and validation accuracy with and without pretraining for (A) the primary site task and (B) the histological grade task
HAN Document Annotations
Most Important Words per Task
We can also use the HAN's attention weights to find the words that contribute most towards the classification task at hand:
Primary Site: main stem, adenoca, lul, lower, breast, carina, cusa, upper, middle, rul, buttock, temporal, upper, retro, sputum
Histological Grade: poorly, g2, high, iii, dlr, undifferentiated, g3, iii, g1, moderately, intermediate, well, arising, 2
Scaling
Relative to other models, the HAN is very slow to train
◦ On CPU, the HAN takes approximately 4 hours to go through 30 epochs
◦ In comparison, the CNN takes around 40 minutes to go through 30 epochs, and traditional machine learning classifiers take less than a minute
◦ The HAN is slow due to its complex architecture and use of RNNs, so gradients are very expensive to compute
We are currently working to scale the HAN to run on multiple GPUs
◦ On TensorFlow, RNNs on GPU run slower than on CPU
◦ We are considering exploring a PyTorch implementation to get around this problem
We have successfully developed a distributed CPU-only HAN that runs on TITAN using MPI, with a 4x speedup on 8 nodes
Attention Is All You Need
A new paper that came out in June 2017 from Google Brain, in which they showed they could get competitive results in machine translation with only attention mechanisms and no RNNs
We applied the same architecture to replace the RNNs in our HAN
Because attention mechanisms are just matrix multiplications, it runs about 10x faster than the HAN with RNNs
This new model performs almost as well as the HAN with RNNs: 0.77 micro-F on primary site (compared to 0.78 in the original HAN), and 0.86 micro-F on histological grade (compared to 0.91 in the original HAN)
Because no RNNs are utilized, this model is much easier to scale on the GPU
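For reference, a minimal numpy sketch of the scaled dot-product self-attention used in that style of model; dimensions and weight names are placeholders:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

time_steps, d_model = 10, 64
X = np.random.randn(time_steps, d_model)                  # token representations

Wq, Wk, Wv = [np.random.randn(d_model, d_model) * 0.1 for _ in range(3)]
Q, K, V = X @ Wq, X @ Wk, X @ Wv                          # queries, keys, values

scores = Q @ K.T / np.sqrt(d_model)                       # scaled pairwise similarity
weights = softmax(scores, axis=-1)                        # each row sums to 1
attended = weights @ V                                    # matrix multiplications only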
Other Future Work
Multitask Learning
◦ Predict histological grade, primary site, and other tasks simultaneously within the same model
◦ Hopefully boost the performance of all tasks by sharing information across tasks
Semi-Supervised Learning
◦ Utilize unlabeled data during training rather than in pretraining, with the goal of improving classification performance
◦ This task is challenging because in most semi-supervised tasks, we know all the labels within the dataset; in our case, we only have a subset of the labels.
Questions?