Multi-talker Speech Separation and Tracing at AI NEXT Conference

Post on 11-Apr-2017


Dong Yu, Distinguished Scientist and Vice General Manager

Tencent AI Lab (work was done while at Microsoft Research)

Joint work with Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen

Multi-talker Speech Separation and Tracing with Permutation Invariant Training

Outline
• Motivation
• Problem Setup and Prior Arts
• Multi-talker Speech Separation
• Experiments
• Conclusion

3/27/17 Dong Yu: Multi-talker Speech Separation and Tracing with Permutation Invariant Training


Frontier Shift
• Driven by demand from users to interact with devices without wearing or carrying a close-talk microphone.
• Many difficulties hidden by close-talk microphones now surface:
  • The energy of the speech signal is very low when it reaches the microphones.
  • The interfering signals, such as background noise, reverberation, and speech from other talkers, become so distinct that they can no longer be ignored.

ASR in Real World Scenarios
[Figure labels: close-talk microphone; far-field microphone; source; reverberation from surface reflections; additive noise from other sound sources; channel distortion]

Cocktail Party Problem
• Term coined by Cherry
  • “One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem’…” (Cherry ’57)
• Human performance is superior to machine performance
  • “For ‘cocktail party’-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp ’92)
• Speech separation problem
  • Separate and trace audio streams
  • Sometimes called speech enhancement when dealing with non-speech interference


Is Speech Separation Work Needed?
• Is an end-to-end ASR system sufficient?
  • Current ASR techniques require a huge amount of training data that covers various conditions to train well
  • Speech separation can be used as an advanced front-end
  • The speech separation criterion can be used as regularization to aid and speed up training of ASR systems
  • More applications than ASR
    • Hearing aids
    • Cochlear implants
    • Noise reduction for mobile communication
    • Audio information retrieval
• Is using a microphone array sufficient?
  • A mic-array alone is not sufficient, e.g., when the talkers are in the same direction
  • Many recordings are still collected with a single microphone


Problem Definition
• Source speech streams
• Mixed speech
• STFT domain
• Estimate mask
• Reconstruct with mask
• Ill-posed problem (#constraints < #free params):
  • There are an infinite number of possible $X_s(t,f)$ combinations that lead to the same $Y(t,f)$
• Solution:
  • Learn from a training set to look for hidden regularities (complicated soft constraints)
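The equations that accompanied the problem-definition items did not survive in the transcript. The block below is a sketch of the usual single-channel formulation, assuming $S$ source streams that mix linearly and magnitude-domain masking. The time-domain symbols $x_s(t)$ and $y(t)$ and the hats for estimates are notation introduced here; $X_s(t,f)$, $Y(t,f)$ and the mask $M_s(t,f)$ are the slide's own symbols.

```latex
\begin{align}
  y(t) &= \sum_{s=1}^{S} x_s(t)
    && \text{(source streams $x_s$ mixed into $y$)} \\
  Y(t,f) &= \sum_{s=1}^{S} X_s(t,f)
    && \text{(the same relation in the STFT domain)} \\
  |\hat{X}_s(t,f)| &= \hat{M}_s(t,f)\,|Y(t,f)|
    && \text{(reconstruction with an estimated mask $\hat{M}_s$)}
\end{align}
```

Under these assumptions each time-frequency bin contributes one equation in $S$ unknowns, which is why the problem is ill-posed for $S \ge 2$ without learned regularities.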

Prior Arts Before Deep Learning Era
• Computational auditory scene analysis (CASA)
  • Use perceptual grouping cues to estimate time-frequency masks
• Non-negative matrix factorization (NMF)
  • Learn a set of non-negative bases during training
  • Estimate mixing factors during evaluation
• Model-based approaches such as factorial GMM-HMM
  • Model the interaction between the target and competing speech signals and their temporal dynamics
• Spatial filtering with a microphone array
  • Beamforming: extract target sound from a specific spatial direction
  • Independent component analysis: find a demixing matrix from multiple mixtures of sound sources


Training Criteria for Deep Learning
• Ideal amplitude mask (IAM): $M_s(t,f) = \frac{|X_s(t,f)|}{|Y(t,f)|}$
• Minimize mask estimation error (two problems)
  • In silence segments $X_s(t,f) = 0$ and $Y(t,f) = 0$, so $M_s(t,f)$ is not well defined
  • A smaller error on masks may not lead to a smaller error on magnitudes (which is what we care about)
• Minimize magnitude estimation error (used in this study)
  • Magnitudes are still estimated through masks, which often leads to better performance, especially when the training set is small
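The two criteria can be compared on a toy example. The following is a minimal sketch in plain Python, not the training code used in the study: the three-bin "spectrum", the `eps` guard for silent bins, and all function names are assumptions made for illustration.

```python
def ideal_amplitude_mask(src_mag, mix_mag, eps=1e-8):
    """IAM per bin: |X_s| / |Y|. eps is an assumed guard against 0/0 in silent bins."""
    return [x / (y + eps) for x, y in zip(src_mag, mix_mag)]

def mask_error(est_mask, ideal_mask):
    """Mean squared error measured on the masks themselves."""
    return sum((m - i) ** 2 for m, i in zip(est_mask, ideal_mask)) / len(est_mask)

def magnitude_error(est_mask, mix_mag, src_mag):
    """Mean squared error on magnitudes, with magnitudes still estimated through the mask."""
    return sum((m * y - x) ** 2 for m, y, x in zip(est_mask, mix_mag, src_mag)) / len(src_mag)

# Three bins: a loud bin, a quiet bin, and a silent bin.
src_mag = [4.0, 0.1, 0.0]
mix_mag = [5.0, 0.2, 0.0]
iam = ideal_amplitude_mask(src_mag, mix_mag)

est_a = [0.9, 0.5, 0.0]  # mask is off by 0.1 in the loud bin
est_b = [0.8, 0.6, 0.0]  # mask is off by 0.1 in the quiet bin
est_c = [0.8, 0.5, 1.0]  # mask is arbitrary in the silent bin

print(mask_error(est_a, iam), mask_error(est_b, iam))
print(magnitude_error(est_a, mix_mag, src_mag), magnitude_error(est_b, mix_mag, src_mag))
print(mask_error(est_c, iam), magnitude_error(est_c, mix_mag, src_mag))
```

In this toy case `est_a` and `est_b` have essentially the same mask error but very different magnitude errors, and `est_c` is penalized by the mask criterion for a silent bin that has no effect on the reconstructed magnitude.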


Prior Arts with DL: Speech + Others (many works, OSU, MERL, CUST, etc.)
• Basic architecture: mix of different types of signals
[Figure labels: Noise / Music / Other Speakers; Est. Noise / Music / Other Speakers]

Prior Arts with DL: Focus on Speech (many works, OSU, MERL, CUST, etc.)
• Basic architecture: mix of different types of signals
  • Speech + noise
  • Speech + music
  • Specific speaker + other speakers
[Figure labels: Noise / Music / Other Speakers; Est. Noise / Music / Other Speakers]


Multi-Talker Speech Separation
• Label ambiguity / label permutation problem
  • Speaker 1 → output 1? Speaker 1 → output 2?

Solution 1: Deep Clustering (Hershey, Chen, Le Roux, Watanabe, 2016)
• Learn a unit-size embedding for each time-frequency bin
  • If two bins belong to the same speaker they are close in the embedding space, and farther away otherwise.
  • Trained on a large window of frames
• Separation is done by clustering embedding space representations (i.e., segmenting the bins)
• Shortcomings
  • Pipeline is complicated
  • Each bin is assumed to belong to one and only one speaker → limits its ability to combine with other techniques
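The slide does not state the training objective. The sketch below assumes the affinity-matching form commonly associated with deep clustering, in which the pairwise similarity of unit-norm embeddings is pushed toward the pairwise "same speaker" indicator. The four-bin example and the function names are illustrative only.

```python
import math

def normalize(vec):
    """Scale an embedding to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def affinity(rows):
    """Pairwise inner products between all rows (one row per time-frequency bin)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]

def deep_clustering_loss(embeddings, one_hot_labels):
    """Squared Frobenius distance between embedding and label affinity matrices."""
    est = affinity([normalize(v) for v in embeddings])
    ref = affinity(one_hot_labels)
    return sum((e - r) ** 2 for est_row, ref_row in zip(est, ref)
               for e, r in zip(est_row, ref_row))

# Four bins: the first two belong to speaker A, the last two to speaker B.
labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
labels_swapped = [[0, 1], [0, 1], [1, 0], [1, 0]]  # same grouping, speakers renamed

good = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # bins grouped correctly
bad = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # bins grouped incorrectly

print(deep_clustering_loss(good, labels), deep_clustering_loss(bad, labels))
```

Because the loss depends only on which bins are grouped together, renaming the speakers (`labels_swapped`) leaves it unchanged, which is how this family of methods sidesteps the label permutation problem.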


Solution 2: Use Manually Defined Rules (Weng, Yu, Seltzer, Droppo, 14, 15)
• Use instantaneous energy instead of speaker ID to assign labels: manually designed, limited cues
[Figure labels: Low-energy speech; High-energy speech]
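One way to read the energy rule is sketched below: at every frame, whichever talker is louder becomes the target of the "high-energy" output and the other talker becomes the target of the "low-energy" output. This is an illustrative assumption about the rule, not the cited papers' exact procedure, and the frame values and names are made up.

```python
def frame_energy(frame):
    """Instantaneous energy of one frame (sum of squared samples)."""
    return sum(x * x for x in frame)

def energy_ordered_targets(src_a_frames, src_b_frames):
    """Build high- and low-energy target streams frame by frame."""
    high, low = [], []
    for frame_a, frame_b in zip(src_a_frames, src_b_frames):
        if frame_energy(frame_a) >= frame_energy(frame_b):
            high.append(frame_a)
            low.append(frame_b)
        else:
            high.append(frame_b)
            low.append(frame_a)
    return high, low

# Speaker A is louder in the first frame, speaker B in the second.
speaker_a = [[1.0, 1.0], [0.1, 0.1]]
speaker_b = [[0.2, 0.2], [0.5, 0.5]]
high_stream, low_stream = energy_ordered_targets(speaker_a, speaker_b)
print(high_stream, low_stream)
```

The labels are unambiguous, but each target stream switches between talkers whenever their relative energy flips, which illustrates why such hand-designed cues are limited.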

Our Solution: Permutation Invariant Training (Yu, Kolbæk, Tan, Jensen, 16, 17)
• Simple to implement
• Can be easily extended to 3 speakers
• Errors of the two possible output-to-speaker assignments in the 2-talker case (training minimizes the smaller one):
  $$\|X_1 - \hat{X}_1\|^2 + \|X_2 - \hat{X}_2\|^2$$
  $$\|X_2 - \hat{X}_1\|^2 + \|X_1 - \hat{X}_2\|^2$$
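The permutation-invariant criterion can be sketched in a few lines: score every assignment of outputs to reference streams and keep the smallest error. This is a minimal illustration in plain Python, assuming mean squared error over flat lists of magnitudes; the function names and toy values are not from the talk.

```python
from itertools import permutations

def mse(a, b):
    """Mean squared error between two equally long magnitude lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pit_loss(estimates, references):
    """Return (loss, perm) for the best assignment.

    perm[i] is the index of the reference stream assigned to output i.
    """
    num_spk = len(references)
    best_loss, best_perm = None, None
    for perm in permutations(range(num_spk)):
        loss = sum(mse(estimates[i], references[perm[i]]) for i in range(num_spk)) / num_spk
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Two talkers: the network happens to emit the streams in swapped order.
refs = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]
ests = [[0.1, -0.9, 0.4], [1.1, 2.1, 2.9]]
loss, perm = pit_loss(ests, refs)
print(loss, perm)

# The same function handles three talkers without changes.
refs3 = [[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]]
ests3 = [[-1.0, -1.0], [1.0, 1.0], [0.0, 0.0]]
loss3, perm3 = pit_loss(ests3, refs3)
```

Enumerating all S! permutations is cheap for two or three talkers, which is the setting discussed here; a larger S would call for a smarter assignment search.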

Testing
• Default assignment: concatenate each output's frames to form streams
• Optimal assignment: the output of each frame is correctly assigned to speakers; concatenate the frames that belong to each speaker to form streams
• The gap between them indicates the gain from additional speaker tracing
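The two test-time assignments can be contrasted with a small sketch. It assumes oracle access to the reference streams for the optimal assignment and uses one-value "frames" for brevity; the names and numbers are illustrative.

```python
from itertools import permutations

def sq_err(a, b):
    """Squared error between two frames (lists of magnitudes)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def default_assignment(outputs):
    """Stream k is simply every frame emitted by output layer k."""
    return [list(frames) for frames in outputs]

def optimal_assignment(outputs, references):
    """Per frame, map outputs to speakers with the lowest error (oracle)."""
    num_spk, num_frames = len(outputs), len(outputs[0])
    streams = [[] for _ in range(num_spk)]
    for t in range(num_frames):
        best = min(
            permutations(range(num_spk)),
            key=lambda p: sum(sq_err(outputs[k][t], references[p[k]][t])
                              for k in range(num_spk)),
        )
        for k in range(num_spk):
            streams[best[k]].append(outputs[k][t])
    return streams

def total_error(streams, references):
    """Summed squared error of all streams against their references."""
    return sum(sq_err(frame, ref_frame)
               for stream, ref in zip(streams, references)
               for frame, ref_frame in zip(stream, ref))

# Two talkers, three frames; the outputs swap talkers at the middle frame.
references = [[[1.0], [1.0], [1.0]], [[-1.0], [-1.0], [-1.0]]]
outputs = [[[0.9], [-1.1], [1.0]], [[-0.9], [1.2], [-1.0]]]

default_streams = default_assignment(outputs)
optimal_streams = optimal_assignment(outputs, references)
gap = total_error(default_streams, references) - total_error(optimal_streams, references)
print(gap)
```

The positive `gap` is the error that better speaker tracing could recover; it shrinks to zero when the outputs never swap talkers.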


Experiment Setup: Datasets
• WSJ0-2mix and 3-mix
  • Derived from the WSJ0 corpus
  • 2- and 3-speaker mixtures (artificially generated)
  • 30h training set, 10h validation set, 5h test set
  • Mixed at SIRs between 0 dB and 5 dB
• Danish-2mix and 3-mix
  • Derived from a Danish corpus
  • 2- or 3-speaker mixtures (artificially generated)
  • 10k, 1k, 1k+1k utterances in training, validation, and test sets
  • Mixed at 0 dB
• WSJ0-2mix-other
  • Same as WSJ0-2mix but mixed at 0 dB
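Mixing at a chosen signal-to-interference ratio (SIR) comes down to rescaling the interfering talker before adding it. The sketch below assumes SIR is defined as the ratio of average powers over the whole utterance; the sine-wave "utterances" and function names are placeholders, not the actual corpus tooling.

```python
import math

def power(signal):
    """Average power of a signal."""
    return sum(v * v for v in signal) / len(signal)

def mix_at_sir(target, interferer, sir_db):
    """Scale the interferer so target/interferer power equals sir_db, then add."""
    gain = math.sqrt(power(target) / (power(interferer) * 10 ** (sir_db / 10)))
    scaled = [gain * v for v in interferer]
    mixture = [a + b for a, b in zip(target, scaled)]
    return mixture, scaled

def measured_sir_db(target, interferer):
    """SIR in dB between a target and an (already scaled) interferer."""
    return 10 * math.log10(power(target) / power(interferer))

# Placeholder signals standing in for two utterances.
target = [math.sin(0.10 * n) for n in range(400)]
interferer = [0.3 * math.sin(0.37 * n + 1.0) for n in range(400)]

mixture, scaled = mix_at_sir(target, interferer, sir_db=5.0)
print(measured_sir_db(target, scaled))
```

Setting `sir_db=0.0` gives both talkers equal power, the condition described for the Danish and WSJ0-2mix-other sets.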


Models
• Implemented using the Microsoft Cognitive Toolkit (CNTK)
• Input: 257-dim STFT; Output: 257 x S streams
• Segment-based (PIT-S): each segment is independent, no tracing
  • DNN: 3 hidden layers, each with 1024 ReLU units
• PIT with tracing (PIT-T): force all frames from the same output layer to belong to the same speaker
  • LSTM: 3 LSTM layers, each with 1792 units
  • BLSTM: 3 BLSTM layers, each with 896 units
• Test conditions
  • Closed condition (CC): seen speakers
  • Open condition (OC): unseen speakers
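The 257-dimensional input and the 257 x S output can be made concrete with a small sketch. It assumes a 512-sample analysis frame, which is the frame length whose one-sided spectrum has 512/2 + 1 = 257 bins; the slide does not state the frame length, and the direct DFT and function names below are for illustration only.

```python
import cmath
import math

FRAME_LEN = 512  # assumption: 512 / 2 + 1 = 257 one-sided bins

def one_sided_magnitude(frame):
    """Magnitude of the one-sided DFT of a real frame (direct O(n^2) form)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def split_streams(output_vector, num_speakers):
    """Split a flat network output of size bins * S into S per-speaker vectors."""
    bins = len(output_vector) // num_speakers
    return [output_vector[s * bins:(s + 1) * bins] for s in range(num_speakers)]

# A pure tone that falls exactly on bin 32.
frame = [math.sin(2 * math.pi * 32 * t / FRAME_LEN) for t in range(FRAME_LEN)]
features = one_sided_magnitude(frame)
print(len(features), features.index(max(features)))

# A dummy output for S = 2 speakers, split back into two 257-bin streams.
dummy_output = [0.0] * (len(features) * 2)
streams = split_streams(dummy_output, num_speakers=2)
```

A real front-end would apply a window and use an FFT; the direct sum is kept here only so the example has no dependencies.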


PIT-S Training Behavior: WSJ0-2mix

PIT-S: SDR Gain (dB) on WSJ0-2mix

PIT-T Training Behavior: WSJ0-2mix

PIT-T: SDR Gain (dB) on WSJ0-2mix

SDR (dB) and PESQ Gain Comparison

Cross Language Behavior on 2-talker Mix

PIT-T on WSJ0-3mix

PIT-T Trained with Both 2- and 3-mix

Examples: 2-talker Mix
• Male + Female: Mix, S1, S2
• Female + Male: Mix, S1, S2
• Female + Female: Mix, S1, S2
• Male + Male: Mix, S1, S2

Examples: 3-talker Mix
• Male + 2 Female: Mix, S1, S2, S3
• Female + 2 Male: Mix, S1, S2, S3

Example: Trained on 3-Mix, Tested on 2-Mix
• Different gender: Mix, S1, S2, S3
• Same gender: Mix, S1, S2, S3

Example: Trained on 2- and 3-Mix, Tested on 2-Mix
• Different gender: Mix, S1, S2, S3
• Same gender: Mix, S1, S2, S3


Conclusion
• PIT can solve the label permutation problem
• PIT is effective in speech separation without knowing the number of speakers
• PIT-trained models generalize well to unseen speakers and languages
• PIT is simple to implement
• PIT has great potential since it can be easily integrated and combined with other techniques
[Figure labels: Classification view (supervised approach); Segmentation view (deep clustering); Separation view (PIT)]

PIT is an important ingredient in the final solution to the cocktail party problem