Speech Processing 11-492/18-492: Speech Translation
Source: tts.speech.cs.cmu.edu/courses/11492/slides/s2s_all.pdf

  • Speech Processing 11-492/18-492

    Speech Translation

  • Speech Translation

    Three part systems
      ASR -> Translation -> TTS

    System configurations
      One way – phrasal
      One way – broadcast/lecture
      1.5 way – phrasal with limited answers
      Two way – full two way
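
    The three-part cascade can be sketched as plain function composition. This is a toy illustration only: the function names (`recognize`, `translate`, `synthesize`) and the two-entry phrase table are hypothetical stand-ins, not components of any system described in these slides.

```python
# Hypothetical stand-ins for the three stages. A real system would wrap
# an ASR engine, an MT engine, and a speech synthesizer here.

def recognize(audio: bytes) -> str:
    """Toy ASR: pretend the audio bytes are already UTF-8 text."""
    return audio.decode("utf-8").strip().lower()

# Illustrative phrase table (an assumption, not from the slides).
PHRASES = {"hello": "bonjour", "thank you": "merci"}

def translate(text: str) -> str:
    """Toy MT: phrase lookup, falling back to the input unchanged."""
    return PHRASES.get(text, text)

def synthesize(text: str) -> bytes:
    """Toy TTS: return the text as bytes instead of a waveform."""
    return text.encode("utf-8")

def speech_to_speech(audio: bytes) -> bytes:
    """Cascade: ASR -> Translation -> TTS."""
    return synthesize(translate(recognize(audio)))
```

    Because each stage only sees the previous stage's output, an ASR mistake is passed straight into translation and then spoken by the synthesizer.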

  • Machine Translation Technologies

    Phrasal
      Phrase to phrase lookup

    Template
      Template fillers, fixed translation

    Interlingua
      Translation into meaning representation

    Statistical Machine Translation
      From large collections of parallel text

    Classification-based translation
      Identify classes and deal directly with them

  • Choices in Translation

    Choose any two …
      High accuracy
      Large vocabulary
      Fully automatic

    Speech vs Text
      Speech less clear than text
      Less speech to train from
      Needs to be real-time (probably)

  • Simple Translation

    Phrase to Phrase
      Greetings
      Do you need medical attention?
      Relatively easy to build, but limited use

    Template translations
      The next train leaves at TIME from gate GATE from PLACE
      Limited but still useful
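
    A minimal sketch of template translation with slot fillers, assuming a single template and slots that pass through untranslated. The regular expression, the Spanish rendering, and the function name `translate_template` are illustrative assumptions; only the TIME and GATE slots of the slide's template are modelled.

```python
import re
from typing import Optional

# Source-side pattern with named slots (TIME, GATE), following the slide's
# example template. The target string is an illustrative fixed translation.
SOURCE = re.compile(
    r"the next train leaves at (?P<TIME>\S+) from gate (?P<GATE>\S+)$",
    re.IGNORECASE,
)
TARGET = "El próximo tren sale a las {TIME} de la puerta {GATE}"

def translate_template(sentence: str) -> Optional[str]:
    """Fill the fixed target template with slot values from the source.

    Returns None when the sentence does not fit the template, which is
    the 'limited' part of template translation.
    """
    match = SOURCE.match(sentence.strip())
    if match is None:
        return None
    return TARGET.format(**match.groupdict())

print(translate_template("The next train leaves at 10:30 from gate 4"))
```

    Anything outside the template is rejected rather than mistranslated, which is why such systems are limited but still useful inside their domain.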

  • Interlingua

    Translate sentences into standard form
    Generate sentences from standard form

    PROS
      Can do multiple languages easily
      Can be very accurate

    CONS
      Designing universal interlingua is very hard
      Doesn’t do well when out of domain

  • Statistical Machine Translation

    Build probabilistic models from parallel text

    Parallel text often available from
      Bilingual organizations
      Governments, UN

    Relatively easy to collect
      Requires translators rather than MT experts

  • Learning from Parallel Text

    1. Ofi'at kowii'ã lhiyohli
    2. Kowii'at ofi'ã lhiyohli
    3. Ofi'at shoha
    4. Ihooat hattakã hollo
    5. Lhiyohlili
    6. Salhiyohli
    7. Hilha

    1. The dog chases the cat
    2. The cat chases the dog
    3. The dog stinks
    4. The woman loves the man
    5. I chase her/him
    6. She/he chases me
    7. She/he dances
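
    One way to see what can be learned from these seven pairs is to count which words co-occur across aligned sentences. The sketch below uses a simple Dice co-occurrence score. This is a swapped-in heuristic chosen for brevity, not the probabilistic alignment models that statistical MT actually trains, and the helper names are hypothetical.

```python
from collections import Counter
from itertools import product

# The seven sentence pairs from the slide, split on whitespace.
PAIRS = [
    ("Ofi'at kowii'ã lhiyohli", "The dog chases the cat"),
    ("Kowii'at ofi'ã lhiyohli", "The cat chases the dog"),
    ("Ofi'at shoha", "The dog stinks"),
    ("Ihooat hattakã hollo", "The woman loves the man"),
    ("Lhiyohlili", "I chase her/him"),
    ("Salhiyohli", "She/he chases me"),
    ("Hilha", "She/he dances"),
]

def dice_table(pairs):
    """Score each (source word, target word) pair by sentence co-occurrence."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words = set(src.lower().split())
        t_words = set(tgt.lower().split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        joint.update(product(s_words, t_words))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in joint.items()
    }

def best_match(word, table):
    """Return the target word with the highest Dice score for `word`."""
    candidates = {t: score for (s, t), score in table.items() if s == word}
    return max(candidates, key=candidates.get)

table = dice_table(PAIRS)
print(best_match("ofi'at", table), best_match("shoha", table))
```

    With so little data the heuristic is easily fooled: "lhiyohli" scores higher with "cat" than with "chases", because "cat" appears in exactly the same two sentences. That sparsity is one reason real systems need large parallel corpora.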

  • Statistical Machine Translation

    PROS
      Data collection doesn’t require MT experts
      Data driven
      Degrades gracefully when out of domain

    CONS
      Needs all language pairs
      Needs good/lots of data
      Hard to fix specific errors

  • SPEECH Translation

    Speech isn’t text
      Different style, hard to find lots of examples

    Speech isn’t fluent
      False starts, hesitations, ungrammatical

    ASR makes errors

  • One Way: Broadcast

    One speaker
      Lecturer: can modify language model

    Multiple speakers
      May be repeat speakers (News Anchor)
      May have other noises: music etc. (TV programs)

    Doesn’t need to be real time (maybe)

  • Two Way: Dialog

    Users can detect own errors and correct
    Needs to be real time
    One user may be much more familiar
      How do you teach the other user?
    Typically domain directed

  • Speech Technology Issues

    ASR:
      Disfluencies, dialects, speaking style
      Unfamiliarity with system

    TTS:
      MT output isn’t always fluent
      TTS says it anyway
      Can be hard to understand

  • Speech Technology Issues

    Spoken not Written Languages
      Arabic vs Arabic Dialects
      Mixture of languages
      Politeness levels
      Gender in speech

  • Phraselator: One Way Translation

    Commercial System
      VoxTec

    Rapid deployment
      Modules of 500ish utterances

  • Transtac: Two Way S2S System

    DARPA developed for
      Checkpoints, medical and civil defense

    Requirements
      Two way
      Eyes-free (no screen)
      Portable
      Usable by real users

  • Transtac System

    Laptop secured in Backpack

    Optional speech control

    Push-to-Talk Buttons

    Close-talking Microphone

    Small powerful Speakers

  • Transtac System Details

    Two way system
      2 ASR systems: English and Iraqi
      2 way statistical translation
      2 synthesizers

    Push-to-talk system
      (Users don’t like “translate everything mode”)

    Echo back ASR result
      And then translation

  • Iraqi Language

    Iraqi Arabic is a dialect
      Most Iraqis write Modern Standard Arabic
      Most Iraqis do not write their own dialect

    No standardized spelling
      Transtac project invented one
      But Iraqis may not be used to it

    Arabic (MSA and dialects)
      Do not write short vowels in words

  • Data for Training

    Collected human mediated dialogs
      Human acts as a machine
      Passed a microphone back and forth
      Try to get people not to talk at same time

    Large number of collections (over 4 years)
      650 thousand sentence pairs
      Many different speakers
      Hand transcribed by experts (in Iraqi spelling)
      Hand translated (Source sentences and Interpreter’s)

  • Iraqi ASR

    Acoustic model from Iraqi data
      Based on MSA phoneset
      Needs to be small fast models
      Discriminative Training
      Speaker specific adaptation

    Lexicon
      Based on LDC provided lexicon
      Multiple pronunciations/typos still a problem
      Statistically trained LTS rules

    Language Model
      Trained on Iraqi input (and translated output)

  • English ASR

    Acoustic model
      Originally using other models
      Then trained from collected data
      (Mostly military personnel)

    Lexicon
      Existing lexicon but needed to add Military speak: MRAP, IED

    Language model
      Trained from data provided
      Trained from “similar” data found on the web
      Trained from hand created “typical” examples

  • TTS

    Standard English TTS
      Appropriate “command” voice
      Unit selection
      Added lots of military vocabulary

    Iraqi TTS
      Recorded from Iraqi radio announcer
      Based on example sentences in the domain
      LDC lexicon and LTS rules (same as ASR)
      Hand tuned

  • S2S Interface Issues

    How do you teach people to use the system
      “Transtac say instructions”
      Not really sufficient

    How can you tell it translated correctly
      Give (speech) feedback.
      Backtranslation
      ASR echo back

  • S2S Interface Issues

    How do you translate names
      A correct translation/transliteration is hard to understand
      Mark names in translations
      “My name is … Abdullah”
      “He lives on … al-Aqar … street”

  • S2S Evaluation (Transtac)

    Offline tests
      ASR->Text and Text->Text
      Compare to translation references
      WER and “BLEU” score

    Online tests
      Concept transfer (through defined scenarios)
      Speed (number of concepts per minute)
      (English speech masking)

    Utility tests
      Does it really work
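
    For the offline tests, word error rate (WER) is the word-level edit distance between the ASR output and a reference transcript, divided by the reference length. A minimal sketch, assuming whitespace tokenization and no text normalization; the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

# One substitution (chases -> chased) and one deletion (the) over 5 words.
print(word_error_rate("the dog chases the cat", "the dog chased cat"))
```

    WER can exceed 1.0 when the hypothesis contains many insertions, so it is an error rate rather than a percentage of words wrong.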

  • Transtac Participants

    Developer groups
      IBM
      SRI
      BBN
      CMU
      USC

    Evaluations
      Twice a year in Iraqi (somewhere in DC)
      One surprise language
        Farsi, Bahasa Malay, Dari, Pashto
      Other evaluations with military groups

  • Does it work??

    Yes, mostly
      27 concepts out of 30-ish turns

    Systems are mostly similar
      But some better than others

    Other techniques
      Belt/holster based PC with handheld speaker
      Small PC in pouch
      Chest mounted array microphone

  • S2S ASR Advanced issues

    Tight coupling
      ASR should output N-best
      Translate all (lattice)
      Choose best translation
      (MT as a LM for ASR)

    Remove disfluencies/hesitations

    Add more relevant data
      Automatically convert past tense/third person data to present tense/first+second person …
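
    The tight-coupling idea (translate every N-best hypothesis, then pick the best translation) can be sketched as a rescoring step. The additive log-score combination, the `mt_weight` parameter, and the `Translator` interface are assumptions made for illustration. The slides do not specify how the scores are combined.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

@dataclass(frozen=True)
class AsrHypothesis:
    text: str
    logprob: float  # ASR score as a log-probability; higher is better

# Hypothetical MT interface: returns (translation, MT log-probability).
Translator = Callable[[str], Tuple[str, float]]

def choose_translation(
    nbest: Sequence[AsrHypothesis],
    translate: Translator,
    mt_weight: float = 1.0,
) -> Tuple[str, str]:
    """Translate every hypothesis and keep the best combined score.

    Returns (chosen ASR text, its translation). The MT score acts like an
    extra language model over the ASR hypotheses.
    """
    if not nbest:
        raise ValueError("N-best list is empty")
    scored = []
    for hyp in nbest:
        translation, mt_logprob = translate(hyp.text)
        combined = hyp.logprob + mt_weight * mt_logprob
        scored.append((combined, hyp.text, translation))
    _, source, target = max(scored, key=lambda item: item[0])
    return source, target
```

    With `mt_weight=0.0` this reduces to the loosely coupled cascade, where the top ASR hypothesis is translated regardless of how well it translates.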

  • S2S TTS Advanced Issues

    MT output isn’t grammatical
      TTS doesn’t care and just says it
      TTS should try to say MT output with more breaks.

    TTS (unit selection)
      As a LM on MT output
      Choose the best translation on what is said best

  • S2S MT Advanced issues

    Train on ASR output
      Do ASR on training data
      Build SMT model ASR-TEXT to TEXT

    Session adaptation
      Improve coverage from daily usage

  • S2S In-line Translation

    CMU-INESC (Portugal) project
      Translation of TED videos
      Align audio to give “dubbing” not “voiceover”
      Align: timing, breaks, focus across language
