Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition
Yanmin Qian, et al. "Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition." IEEE Transactions on Audio, Speech, and Language Processing. Accepted for publication in a future issue.
Presented by Peidong Wang, 09/09/2016
1
Content
• Abstract
• Review of Convolutional Neural Networks
• Model Description
• Experiments
• Conclusion
2
Content
• Abstract
• Review of Convolutional Neural Networks
• Model Description
• Experiments
• Conclusion
3
Abstract
• ASR: Previous attempts at increasing the number of convolutional layers in CNNs from 2 to 3 led to performance degradation.
• CV: Recent work on images shows that image classification accuracy can be improved by increasing the number of convolutional layers with a carefully tuned architecture.
• ASR: The Very Deep Convolutional Neural Network uses up to 10 convolutional layers and achieves a WER of 8.81% on Aurora 4, the best published result.
4
Content
• Abstract
• Review of Convolutional Neural Networks
• Model Description
• Experiments
• Conclusion
5
Review of Convolutional Neural Networks
• A Conventional Convolutional Neural Network (CNN)
6
From: Slides in CSE 5526 Neural Networks
Review of Convolutional Neural Networks
• Convolution and Pooling (Subsampling) (a minimal sketch follows this slide)
7
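[Illustration added in this write-up, not from the paper] To make the convolution and pooling (subsampling) operations concrete, here is a minimal PyTorch sketch. It applies one 3x3 convolution and one 2x2 max pooling to a single-channel time-frequency input and prints the resulting shapes; the channel count and filter sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# One input channel (static FBank) mapped to a few feature maps.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)  # no padding
pool = nn.MaxPool2d(kernel_size=2)                              # 2x2 subsampling

# Dummy input: 1 window, 1 channel, 11 frames x 40 FBank bins.
x = torch.randn(1, 1, 11, 40)
h = conv(x)   # an unpadded 3x3 convolution shrinks both axes by 2
p = pool(h)   # pooling halves both axes (integer division)

print(x.shape)  # torch.Size([1, 1, 11, 40])
print(h.shape)  # torch.Size([1, 8, 9, 38])
print(p.shape)  # torch.Size([1, 8, 4, 19])
```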
Content
• Abstract
• Review of Convolutional Neural Networks
• Model Description
• Experiments
• Conclusion
8
Model Description
• Context Window Extension
• A typical input feature size in speech recognition is 11 x 40, where 11 denotes the number of frames in a window and 40 denotes the dimension of the FBank features. [*]
• Using this context window size, convolutions can be performed in time 5 times with a filter size of 3, as in the following figure (vd6).
9
[*] Added by the presenter.
Model Description
• Context Window Extension (cont'd)
10
Model Description
• Context Window Extension (cont'd)
• In Very Deep Convolutional Neural Networks (VDCNNs), the context window size is extended to 17 (and further to 21), which allows 8 (and 10, respectively) convolutions to be performed in time (see the check after this slide).
11
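[Check added in this write-up, not from the paper] Each unpadded convolution with a filter of size 3 removes 2 frames from the time axis, so a window of T frames supports (T - 1) / 2 such convolutions before the time axis collapses to a single frame. The short Python check below confirms the counts quoted above for 11-, 17-, and 21-frame windows.

```python
def num_time_convs(frames, filter_size=3):
    """Count how many unpadded convolutions of the given width fit
    before the time axis is reduced to a single frame."""
    count = 0
    while frames >= filter_size:
        frames -= filter_size - 1  # each convolution removes (filter_size - 1) frames
        count += 1
    return count

for window in (11, 17, 21):
    print(window, "->", num_time_convs(window))  # 11 -> 5, 17 -> 8, 21 -> 10
```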
Model Description
• Context Window Extension (cont'd)
12
Model Description
• Context Window Extension (cont'd)
13
Model Description
• Feature Dimension Extension
• Based on 40-dim FBank features, at most 6 convolutions and 2 poolings can be performed in frequency, leading to the vd6 model.
• In VDCNN, the FBank features are extended to 64-dim, so that 4 more convolutions can be performed in frequency (see the check after this slide).
14
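[Check added in this write-up, not from the paper] The frequency axis can be tracked the same way. The helper below applies a hypothetical sequence of unpadded 3-wide convolutions and non-overlapping 2x poolings; the exact layer ordering of vd6/vd10 follows the paper's figures (not reproduced here), so the sequence used is only an assumption that illustrates why 64 bins leave room for more convolutions than 40 bins.

```python
def track_freq_dim(dim, ops):
    """Apply 'conv' (filter size 3, no padding) and 'pool' (size 2) operations
    along the frequency axis and report the running dimension."""
    for op in ops:
        dim = dim - 2 if op == "conv" else dim // 2
        print(op, "->", dim)
    return dim

ops = ["conv", "conv", "pool", "conv", "conv", "pool", "conv", "conv"]  # 6 convs, 2 pools

track_freq_dim(40, ops)  # 38, 36, 18, 16, 14, 7, 5, 3  -> almost exhausted
track_freq_dim(64, ops)  # 62, 60, 30, 28, 26, 13, 11, 9 -> room for 4 more convolutions
                         #    (9 -> 7 -> 5 -> 3 -> 1)
```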
Model Description
• Feature Dimension Extension (cont'd)
15
Model Description
• Feature Dimension Extension (cont'd)
• Finally, the input extension is performed in both time and frequency, leading to a 17 x 64 input. The resulting model is named vd10.
16
Model Description
• Feature Dimension Extension (cont'd)
17
Model Description
• Feature Dimension Extension (cont'd)
• The full-ext model further extends the number of time frames to 21 so that 2 more convolution operations can be performed in time, giving 10 convolution operations in both time and frequency.
18
Model Description
• Feature Dimension Extension (cont'd)
19
Model Description
• Feature Dimension Extension (cont'd)
• To confirm that the performance gain does not come from the extended input features alone, a model with the same wider input features (17 x 64) but shallow convolutional layers is also developed.
20
Model Description
• Feature Dimension Extension (cont'd)
21
Model Description
• Pooling in Time
• You may have noticed that the VDCNN models all use pooling in frequency and do no pooling in time.
• To investigate whether pooling in time is helpful, vd10-tpool is designed (a small sketch of time pooling follows this slide).
22
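[Illustration added in this write-up, not from the paper] Pooling in time versus frequency is only a matter of the pooling kernel shape. In the PyTorch sketch below (an illustration, not the actual vd10-tpool definition), a (1, 2) kernel pools only along frequency, while a (2, 1) kernel pools only along time.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 17, 32)                 # (batch, channels, time, frequency)

pool_freq = nn.MaxPool2d(kernel_size=(1, 2))  # pooling in frequency only
pool_time = nn.MaxPool2d(kernel_size=(2, 1))  # pooling in time only

print(pool_freq(x).shape)  # torch.Size([1, 8, 17, 16])
print(pool_time(x).shape)  # torch.Size([1, 8, 8, 32])
```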
Model Description
• Pooling in Time (cont'd)
23
Model Description
• Pooling in Time (cont'd)
24
Model Description
• Padding in Feature Maps
• In most work on CNNs for speech recognition, the convolutions are performed without padding.
• Padding preserves the size of the feature maps and makes better use of the border information (see the sketch after this slide).
25
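[Illustration added in this write-up, not from the paper] The effect of padding shows up directly in the output shapes: with padding, a 3x3 convolution preserves the feature map size, so border frames and frequency bins keep contributing and deeper stacks (or more poolings) remain possible.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 17, 64)                            # (batch, channel, time, frequency)

conv_nopad = nn.Conv2d(1, 8, kernel_size=3, padding=0)   # shrinks both axes by 2
conv_pad = nn.Conv2d(1, 8, kernel_size=3, padding=1)     # preserves the map size

print(conv_nopad(x).shape)  # torch.Size([1, 8, 15, 62])
print(conv_pad(x).shape)    # torch.Size([1, 8, 17, 64])
```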
Model Description
• Padding in Feature Maps (cont'd)
26
Model Description
• Padding in Feature Maps (cont'd)
• Model vd10-fpad pads only in frequency, allowing more pooling operations in frequency.
27
Model Description
• Padding in Feature Maps (cont'd)
28
Model Description
• Padding in Feature Maps (cont'd)
• Padding in both dimensions is also applied, denoted vd10-fpad-tpad.
• In this model, since pooling is necessary to reduce the feature map size, pooling in time is also applied.
29
Model Description
• Padding in Feature Maps (cont'd)
30
Model Description
• Padding in Feature Maps (cont'd)
31
Model Description
• Complete Figure
32
Model Description
• Complete Figure (cont'd)
33
Model Description
• 1-Channel vs. 3-Channel Input Feature Maps
• VDCNNs use a one-channel feature map as input, i.e. the static FBank features.
• Most work in speech recognition, however, uses three-channel features (static, ∆, and ∆∆).
• The number of input channels is compared for VDCNN (a sketch of how the channels are formed follows this slide).
34
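[Illustration added in this write-up, not from the paper] The two input variants can be sketched as follows. Kaldi computes ∆ and ∆∆ with a regression over several neighbouring frames; simple first-order differences are used here only as stand-ins.

```python
import numpy as np

fbank = np.random.randn(100, 64).astype(np.float32)  # (frames, FBank bins), static features

# First-order differences as stand-ins for the usual delta / delta-delta regression.
delta = np.diff(fbank, axis=0, prepend=fbank[:1])
delta_delta = np.diff(delta, axis=0, prepend=delta[:1])

one_channel = fbank[np.newaxis, ...]                           # shape (1, frames, bins)
three_channel = np.stack([fbank, delta, delta_delta], axis=0)  # shape (3, frames, bins)

print(one_channel.shape, three_channel.shape)  # (1, 100, 64) (3, 100, 64)
```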
Model Description
• 1-Channel vs. 3-Channel Input Feature Maps (cont'd)
35
Model Description
• 1-Channel vs. 3-Channel Input Feature Maps (cont'd)
• It is interesting to find that 1-channel VDCNNs are better than the models using 3 channels.
• One possible explanation is that the information in the dynamic features may be better extracted directly from the raw static features by the VDCNN.
36
Model Description
• 1-Channel vs. 3-Channel Input Feature Maps (cont'd)
• Another explanation may be as follows.
37
Model Description
• Model Parameter Size
• It is observed that although the number of convolutional layers is increased significantly in the proposed VDCNN, the total parameter size is smaller than that of the baseline CNN and DNN (see the parameter-counting check after this slide).
38
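[Check added in this write-up, not from the paper] The parameter-size observation is easy to verify for any model: the total count is simply the sum over all weight tensors. Small 3x3 filters contribute few parameters per layer, while the wide fully connected layers of a DNN dominate its size, which is why a much deeper stack of convolutions can still be smaller overall. The layer sizes below are toy values, not the paper's exact configuration.

```python
import torch.nn as nn

def count_params(model):
    return sum(p.numel() for p in model.parameters())

conv_layer = nn.Conv2d(64, 64, kernel_size=3)  # a 3x3 convolution, 64 -> 64 feature maps
fc_layer = nn.Linear(2048, 2048)               # one fully connected layer of a typical DNN

print(count_params(conv_layer))  # 36928 (64*64*3*3 weights + 64 biases)
print(count_params(fc_layer))    # 4196352 (2048*2048 weights + 2048 biases)
```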
Model Description
• Model Parameter Size (cont'd)
39
Model Description
• Convergence of Very Deep CNNs
• The VDCNN converges faster than other model types, in terms of the number of epochs [*].
• Accordingly, although VDCNNs need more computation in each iteration (9.5 times more than the baseline CNN), they take comparable time for model training.
40
[*] Added by the presenter.
Model Description
• Convergence of Very Deep CNNs (cont'd)
41
Model Description
• Noise Robustness of Very Deep CNNs
42
Model Description
• Noise Robustness of Very Deep CNNs (cont'd)
• To better understand how the VDCNN processes noisy speech, the same frame under each condition (A, B, C, or D) is propagated through the best performing model, vd10-fpad-tpad.
• The outputs of the 1st and the 6th convolutional layers for A, B, C, and D are plotted in the next figures.
43
Model Description
• Noise Robustness of Very Deep CNNs (cont'd)
44
Model Description
• Noise Robustness of Very Deep CNNs (cont'd)
• To further verify the observation, the differences between noisy feature maps and clean feature maps are measured for all convolutional layers.
• Using the test data, the averaged mean square error (MSE) is computed to evaluate the differences between the three noisy conditions and the clean condition (see the sketch after this slide).
45
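[Sketch added in this write-up, not from the paper] The averaged MSE measure in the last bullet is straightforward to reproduce once the clean and noisy feature maps for the same frames have been collected as arrays of identical shape (the names below are only placeholders).

```python
import numpy as np

def averaged_mse(clean_maps, noisy_maps):
    """Mean squared difference between clean and noisy feature maps,
    averaged over frames, channels, and positions."""
    clean_maps = np.asarray(clean_maps, dtype=np.float32)
    noisy_maps = np.asarray(noisy_maps, dtype=np.float32)
    return float(np.mean((clean_maps - noisy_maps) ** 2))

# Toy example: maps of shape (frames, channels, time, frequency) for one condition pair.
clean = np.random.randn(50, 8, 9, 16).astype(np.float32)
noisy = clean + 0.1 * np.random.randn(50, 8, 9, 16).astype(np.float32)
print(averaged_mse(clean, noisy))  # about 0.01 for this synthetic noise level
```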
Model Description
• Noise Robustness of Very Deep CNNs (cont'd)
• The MSE values after all operations are shown below.
46
Model Description
• Noise Robustness of Very Deep CNNs (cont'd)
• The MSE values for different CNN models.
47
Content
• Abstract
• Review of Convolutional Neural Networks
• Model Description
• Experiments
• Conclusion
48
Experiments
• Experimental Setup
• The GMM-HMM system is built with Kaldi.
• All neural network models, including DNN/CNN/LSTM, are trained using CNTK.
• The standard testing pipeline in the Kaldi recipes is used for decoding and scoring.
• A similar structure (IBM-VGG), designed by researchers at IBM and NYU, is also constructed for comparison.
49
Experiments
• Evaluation on Aurora 4
• Aurora 4 is a medium vocabulary task based on the Wall Street Journal corpus (WSJ0).
• The training sets contain 14,276 utterances.
• There are four conditions, A, B, C, and D, as mentioned before.
50
Experiments
• Evaluation on Aurora 4 (cont'd)
51
Experiments
• Evaluation on AMI
• The AMI corpus contains around 100 hours of meeting recordings.
• The signal was captured and synchronized with multiple microphones, such as individual headset microphones (IHM, close-talk) and microphone arrays (single distant microphone (SDM) and multiple distant microphones (MDM)).
• The MDM data was processed by a standard beamforming algorithm to generate a single-channel dataset.
52
Experiments
• Evaluation on AMI (cont'd)
• The size of the input features is investigated.
53
Experiments
• Evaluation on AMI (cont'd)
• The effects of the other design choices are also investigated.
54
Experiments
• Evaluation on AMI (cont'd)
• To better explain the superiority of VDCNNs, some related feature maps are examined.
55
Experiments
• Evaluation on AMI (cont'd)
• The same single synchronized frame is propagated through each model.
56
Experiments
• Evaluation on AMI (cont'd)
57
Content
• Abstract
• Review of Convolutional Neural Networks
• Model Description
• Experiments
• Conclusion
58
Conclusion
• Features of VDCNN
• The sizes of the filters and pooling templates are small.
• The input feature maps are large.
• Other design choices, such as pooling in time, padding, and input feature map selection, are adjusted.
• On Aurora 4, it achieves a WER of 8.81% (state of the art).
• On AMI, its accuracy is competitive with an LSTM.
59
Thank You!
60