Voice Recognition and Natural Language - Dallas TechFest 2016
Voice Recognition and Natural Language
Dallas TechFest, January 29, 2016
Crispin Reedy @crispinTX #DallasTechFest16
2016 Versay Solutions LLC
Voice User Interface Designer
10 years in the field
Former coder; got interested in UX
President of the Association for Voice Interaction Design
Consultant for Versay Solutions
DO NOT FORGET TO BRING THE MINI-SPEAKERS!!!
Siri, Alexa, Cortana: Voice recognition is hot. This session will give an overview of the voice recognition and natural language ecosystem and the technologies behind the experience. For example: What is the difference between voice recognition and natural language understanding? What are some of the common technologies in the market today? What are design considerations around these types of interfaces? This is an introductory session designed for people interested in exploring the possibilities of voice and conversational user interfaces, especially when considering the Internet of Things.
Disclaimers
This Session Is About:
  What is speech recognition anyway?
  Should I speech-enable X? How?
  In general, how does it work?
  What technologies should I consider?
  What skills are important?
  What are the design considerations?
It's NOT About:
  Detailed code
  In-depth how-tos
  Deep technical knowledge
  Advanced ASR
Should I Speech-Enable X?
What IS X?
Computers: apps and webpages
Consoles: gaming / connectivity
Mobile and tablet
Industrial devices, especially something task-driven
Gadgets
Cars
The phone
How does this new modality enable or enhance what I want to do on this platform?
Essentially, what we're coming to terms with here is a new input modality. It's one that doesn't always work very well, for reasons we'll get into later. But it can be a very powerful one when it does work well. It's also a lot harder to figure out how to properly combine speech with everything else that is going on in your environment.
Terms & Technologies
Speech Recognition
Natural Language Understanding
Text to Speech
Voice Verification (Biometrics)
Speech Recognition
Also known as ASR (automatic speech recognition)
Speech to text?
Spoken language to machine-readable format
Example: "See the cat."
Natural Language Understanding
Extracting meaning from natural text
Not necessarily tied to speech recognition
Example: "Hello, yes, I'd like to pay my water bill. Can you help me with that?"
Action = BillPay
BillType = Water
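A minimal sketch of the water-bill example, assuming a toy keyword approach. The pattern tables and the extract_meaning function are hypothetical names made up for illustration; a production NLU engine would be trained or built from much richer rules.

```python
import re

# Hypothetical keyword tables (assumptions for illustration only).
ACTIONS = {"BillPay": r"\bpay\b.*\bbill\b"}
BILL_TYPES = {"Water": r"\bwater\b", "Electric": r"\belectric(ity)?\b"}


def extract_meaning(text):
    """Return the semantic slots found in the text as a dict."""
    lowered = text.lower()
    slots = {}
    for action, pattern in ACTIONS.items():
        if re.search(pattern, lowered):
            slots["Action"] = action
    for bill_type, pattern in BILL_TYPES.items():
        if re.search(pattern, lowered):
            slots["BillType"] = bill_type
    return slots


utterance = "Hello, yes, I'd like to pay my water bill. Can you help me with that?"
print(extract_meaning(utterance))
```

Note that the input here is plain text, so the same function works whether the text came from a recognizer or from a keyboard.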
Text to Speech
Speech synthesis
Used to convert text to spoken words
Voice Verification
Also called voiceprints, biometrics, voice authentication, etc.
Recognizes a person, not necessarily what they are saying.
You can have ASR without voice verification, and vice versa.
Example: "My voice is my password." "Authenticated. Welcome, Mr. Smith."
Not going to discuss this one in a lot of detail today, but it's important that you understand the difference between these technologies.
Uses: Separate Applications
Speech Recognition
  Hands-free command / control
  Dictation
  Input text
  Small form factor device, etc.
Text To Speech
  Output text dynamically
  Respond to input
  Useful when no display is available
Natural Language Understanding
  Necessary at some level for all language-based input
  Also used to parse large volumes of text
Voice Verification
  Security
Uses: Combined
[Diagram. Components: ASR, NLU, TTS, Voiceprints / Verification, Application, Data. Flow labels: Sign-In, Interaction, Request, Action, Meaning, Access Data, Output.]
True Multimodality
[Diagram. Components: ASR, NLU, TTS, Voiceprints / Verification, Touch, Keyboard, Visual, Application, Data. Flow labels: Sign-In, Interaction, Request, Action, Meaning, Access Data, Output. Added tasks: Manage I/O Modality, Determine Meaning in Context. Callout: Context!]
Credit: Jon Bloom
Let's Talk Speech!
Output: Text to Speech
(Somewhat) mature technology
(Fairly) easy to understand and use
Note: Creating TTS audio is not the same as having a TTS engine
How It Works
Human voice talent
Hundreds of hours of recording
Digitized
Phonemes: concatenative speech synthesis
TTS Engine
Text in, speech out
May do some text pre-processing
  "St. James St." becomes "Saint James Street"
  Punctuation
  If it doesn't do this, you'll have to do it yourself.
Grapheme to phoneme transcription
Identify intonation patterns
Assign the correct lexical stress to the words
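If the engine does not pre-process text for you, the "St. James St." case can be sketched as below. This is a deliberately naive positional rule and an assumption for illustration, not how any particular TTS engine works: "St." before a capitalized word is read as "Saint", and any other "St." as "Street". It will misread inputs such as "Main St. Dallas", which is exactly why real normalizers are harder than they look.

```python
import re


def normalize(text):
    """Expand "St." to "Saint" or "Street" using a simple positional rule."""
    # "St." followed by a capitalized word is read as "Saint".
    text = re.sub(r"\bSt\.\s+(?=[A-Z])", "Saint ", text)
    # Any remaining "St." is read as "Street".
    text = re.sub(r"\bSt\.", "Street", text)
    return text


print(normalize("St. James St."))
```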
What Makes Good TTS?
Phonemes change based on location
  Cat
  Alligator
Elision
  "I'm. Awaiting. You."
  "I'm awaiting you."
Intonation
  "Do you want coffee?"
  "Do you want soda, tea, or coffee?"
SSML
XML-based W3C standard for Speech Synthesis Markup
Not universally supported by vendors
Tags for marking up text to produce a more natural quality output:
  Emphasis
  Break
  Voice
  Prosody
  Pitch
SSML Example
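A minimal sketch of what SSML markup can look like, built with Python's standard library. The element names (speak, emphasis, break, prosody) come from the W3C SSML specification; the prompt text, the attribute values, and the build_ssml helper are illustrative assumptions, and vendor support for individual tags varies.

```python
import xml.etree.ElementTree as ET


def build_ssml():
    """Build a short SSML document and return it as a string."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": "en-US",
    })
    speak.text = "Your balance is "
    # Stress the amount.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = "fifty dollars"
    emphasis.tail = "."
    # Pause before the follow-up question.
    ET.SubElement(speak, "break", {"time": "500ms"})
    # Slow down and lower the pitch for the question.
    prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "low"})
    prosody.text = "Is there anything else?"
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml()
print(ssml)
```

The resulting string is what you would hand to a TTS engine that accepts SSML instead of plain text.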
When To Use It
When high-quality audio is not a consideration
  TTS has improved considerably, but it is still noticeable
When you have a lot of dynamic data
  If you just need to say a few things, it may be overkill
Other Considerations
More phonemes = higher quality voice
  Also means a bigger download and install (if on device)
Exceptions (addresses, names) can be iffy
  May require a lot of work to handle well
Your data needs to be clean and ready to voice back
  Acronyms, incomplete sentences will not sound good
Some applications may have other acoustic limitations
  Telephony
It is possible to build a custom voice
  But it takes a lot of work!
Where To Find It
Many commercial products available
  Most languages and dialects, e.g. American English, British English, etc.
  Many different voices
  Nuance, Cepstral, Ivona
Some open source
Some APIs
  Chrome: https://developer.chrome.com/apps/tts
ASR and NLU
ASR and NLU: Topics
Complications of speech
  Why is it so hard?
How it works: overview
Early commercial adoptions
  IVR
Design considerations
Speech today
  Different vendors
Should I voice-enable X?
[Figure: The Speech Chain, Bell Labs, 1963]
[Figure from The Voice in the Machine (Pieraccini). Levels: World Knowledge, Semantics, Syntax, Lexicon, Morphology, Phonetics, Acoustics, Physiology. Units: Concepts, Phrases, Words, Phonemes, Sounds. Labels: Linguistics, Speaking / Listening, ASR, NLU.]
World Knowledge: Concepts of the world around us, e.g. tables have four legs, what is left and right, what is a car, etc. This is the level before language.
Semantics: The first level of language. Knowledge can be represented in structured meaningful elements. Example: the semantics of a party invitation.
Syntax: The rules that govern putting words together to form meaningful units.
Lexicon: What words mean.
Morphology: How words change their form to perform differently in a language, e.g. horse / horses.
Phonetics: Phonemes and how words are built.
Acoustics: What phonemes sound like and how to create them.
Speech Is Ambiguous
Speech is never stationary
Coarticulation
Noisy environments
Accents
Different speakers have voices with different acoustic qualities
Goats
Challenges vary depending on what you are going to recognize
  Spelling (short utterances) can be difficult even for humans
  Phonetic alphabet (military)
Language Is Ambiguous
Humans can deduce meaning from context and unknown words
"How can I help you?" "I'm having a problem with my account."
"I'd like that one. No, not the green one, the red one."
"Time flies like an arrow. Fruit flies like a banana."
Everything Is Ambiguous
All modern speech recognition is probabilistic
GUI: Button clicked? true / false
VUI: There is an 85% chance that button was clicked
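Because a recognizer returns a confidence rather than a yes or no, application code usually branches on thresholds. The sketch below shows one common pattern (accept, confirm, or reprompt). The threshold values and the next_step function are assumptions for illustration and would need tuning against real data.

```python
ACCEPT_THRESHOLD = 0.80   # assumed value; tune per application
CONFIRM_THRESHOLD = 0.45  # assumed value; tune per application


def next_step(hypothesis, confidence):
    """Decide what the dialog should do with a recognition result."""
    if confidence >= ACCEPT_THRESHOLD:
        return f"accept:{hypothesis}"      # confident enough to act
    if confidence >= CONFIRM_THRESHOLD:
        return f"confirm:{hypothesis}"     # ask "Did you say ...?"
    return "reprompt"                      # too uncertain, ask again


print(next_step("checking", 0.85))
print(next_step("checking", 0.60))
print(next_step("checking", 0.20))
```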
Three Dimensions of Speech Problems
[Figure from The Voice in the Machine (Pieraccini), with three axes and a point labeled "Humanlike"]
Speaker Independence: Speaker Dependent, Multiple Speakers, Speaker Independent
Speaking Style: Isolated Words, Connected Words, Natural Speech
Vocabulary Size: 10 words, 1000 words, 100,000 words, Unlimited
History of Speech Recognition
AUDREY: Davis, Biddulph, and Balashek, Bell Labs, 1952
  Analog
  Isolated digit recognition
  Pause between digits
  Speaker-dependent
Sampling
The start of being able to digitally manipulate audio
Spectrogram vs. Waveform
Waveforms show the variation in overall intensity (decibels) over time.
Spectrograms show the variation of individual frequency components.
1970s: Template Matching
Template matching approach
  Brute force model
  Quantized spectrograms
What about duration? Dynamic time warping
Endpoint detection
  Difficult to do
Feature extraction
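Dynamic time warping answers the duration question by letting one sequence stretch or compress in time to line up with another. Here is a minimal sketch of the classic dynamic-programming form, assuming one number per frame; a real recognizer of that era would compare feature vectors, and the example sequences are made up.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or step both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


template = [1, 2, 3, 2, 1]                 # stored template for a word
slow = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]      # same shape, spoken twice as slowly
other = [3, 3, 1, 1, 3, 3]                 # a different shape

print(dtw_distance(template, slow))
print(dtw_distance(template, other))
```

The slowed-down version matches the template with zero cost even though it is twice as long, while the different shape does not.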
1980s: The Power of Statistics
The recognition of connected speech becomes a search for the best path in a large network
  Problem of finding the probabilities
Statistical Language Models
  Not all sequences of words are equally probable
  Rank all permissible sentences in terms of probability
  Correct grammar is not applicable
  Restricted by domain
Hidden Markov Models (HMM)
  Unified probabilistic model for speech
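To make "not all sequences of words are equally probable" concrete, here is a minimal bigram language model. The three-sentence corpus is invented for illustration, and there is no smoothing, so any unseen word pair gets probability zero; real SLMs are trained on far larger domain corpora.

```python
from collections import Counter

# Tiny made-up training corpus (assumption for illustration).
corpus = [
    "pay my water bill",
    "pay my phone bill",
    "check my account balance",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words[:-1])             # counts of each history word
    bigrams.update(zip(words, words[1:]))   # counts of each adjacent pair


def sentence_probability(sentence):
    """Probability of a sentence under the unsmoothed bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob


print(sentence_probability("pay my water bill"))
print(sentence_probability("bill water my pay"))
```

The in-domain word order scores well above zero, while the scrambled version scores zero, which is how the model helps the recognizer rank competing hypotheses.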
Hidden Markov Model Example
[Figure: "HiddenMarkovModel" by Tdunning, vectorization (Wikimedia)]
x: states
y: possible observations
a: state transition probabilities
b: output probabilities
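A minimal sketch of how those four ingredients combine, using the standard forward algorithm to score an observation sequence. All numbers and names here are made-up assumptions, not values from the figure: states plays the role of x, the symbols "a" and "b" are the possible observations y, trans holds the state transition probabilities a, and emit holds the output probabilities b.

```python
states = ["s1", "s2"]
start = {"s1": 0.6, "s2": 0.4}                       # initial state probabilities
trans = {"s1": {"s1": 0.7, "s2": 0.3},               # state transition probabilities
         "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"a": 0.5, "b": 0.5},                  # output probabilities
        "s2": {"a": 0.1, "b": 0.9}}


def forward(observations):
    """Probability of the observation sequence, summed over all state paths."""
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit[s][obs] * sum(alpha[prev] * trans[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


print(forward(["a"]))
print(forward(["a", "b", "b"]))
```

In a recognizer the observations would be acoustic feature frames and the hidden states would correspond to parts of phonemes, but the arithmetic is the same.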
You're Only As Good As What You're Trained On
Corpora
  Collections of speech used to train a recognizer
Acoustic and/or Pronunciation Model
  Associates sounds with symbols and words
  Created from a general speech corpus and a phonetic and orthographic transcription
Statistical Language Model (SLM)
  A probability distribution over sequences of words
  Created from a domain-specific speech corpus and a tagged transcription to extract meaning
Training
[Diagram: a Speech Recognition Engine fed by an Acoustic Model, a Pronunciation Model, and an SLM and/or Grammar]
Language Model vs. Grammar
SLM
  Has to be trained against collected utterances
  Large potential set of what the caller can say
  Tagged with the meanings of what they can say
Grammar (GrXML)
  More tightly constrained than an SLM
  Easier to create
  Not trained in the same way
  System will only recognize what is in the grammar
[Diagram: recognition flow. Utterance (Noise levels? Barge-in?), Feature Extraction, Endpointing, Speech Recognition Engine, Grammar or SLM, Probabilities, n-best list, Literal return, Tokens, Recognition Event]
Natural Language Understanding
Parsing input to extract meaning
Covers a large field
  Commands
  Automatic classification of emails
  Newspaper articles, large chunks of text
Lexicon
Parser
Grammar rules
New tools / APIs
Levels of Meaning
Prompt: "How can I help you?"
Too broad / ambiguous: "I'm having a problem with my account."
Too much: "Well, I was looking at my bill, because I do that every week, and I was reviewing everything on there, and I saw..."
Just right: "I'm seeing an unusual charge on my bill."
Multi-Token Utterances
"I'd like to transfer $50 from my checking account to my savings account."
ACTION = Transfer
FROM_ACCOUNT = Checking
TO_ACCOUNT = Savings
AMOUNT = $50
Unfortunately, people don't often naturally produce these kinds of utterances.
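A minimal sketch of pulling all four tokens out of one utterance, assuming the text has already been recognized. The regular expression and the parse_transfer function are hypothetical and only cover this one phrasing; as the slide notes, callers rarely speak this neatly, so a real system needs to handle partial answers too.

```python
import re

# Hypothetical pattern for "transfer <amount> from <account> to <account>".
PATTERN = re.compile(
    r"transfer\s+\$?(?P<amount>\d+(?:\.\d{2})?)"
    r"\s+from\s+(?:my\s+)?(?P<source>\w+)(?:\s+account)?"
    r"\s+to\s+(?:my\s+)?(?P<target>\w+)(?:\s+account)?",
    re.IGNORECASE,
)


def parse_transfer(utterance):
    """Return the transfer slots, or None if the utterance does not match."""
    match = PATTERN.search(utterance)
    if match is None:
        return None
    return {
        "ACTION": "Transfer",
        "FROM_ACCOUNT": match.group("source").capitalize(),
        "TO_ACCOUNT": match.group("target").capitalize(),
        "AMOUNT": "$" + match.group("amount"),
    }


print(parse_transfer(
    "I'd like to transfer $50 from my checking account to my savings account."
))
print(parse_transfer("I want to move some money."))
```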
Early Commercial Adoption
IVR
  Touchtone / DTMF
    "For checking, press 1. For savings, press 2."
  Directed Dialog (grammar-based ASR)
    "Which account? Just say checking, savings, or money market."
  Natural Language (SLM-based ASR)
    "From which account?"
SpeechWorks / Nuance technology
Voice XML / GrXML
Typical IVR Architecture
[Diagram. Components: PSTN / VOIP, Voice Browser (VUI, VXML), App Server / Data Connection, Data, ASR Server, TTS Server. Protocols: SIP, MRCP, HTTP.]
Anatomy of a VUI + NLU Project
Voice User Interface Design
  High-level design
    Design style, sound and feel, IA
  Detailed design
    Prompts (recorded)
    Grammars for directed dialog states
    Data I/O
SLM Creation
  Utterance capture
  Transcription
  Tagging
  Compiling and deployment
Observations to make:
  Represents the entirety of a VUI experience
  Placement of Spanish prompt would vary depending on type of call
  Confirmation is variable
  Confirmation prompt is general
VUI Design Doc Detailed Example
Corpora Documentation Example
Design Considerations
Types of Speech User Interfaces
  Command and control
  Dictation
  Dialog-based
Speech is a linear, time-based interface
Multimodality introduces additional complications
Design Considerations
If the recognizer doesn't get something, you have to reprompt. Don't say sorry.
System: "Where are you traveling today?"
Caller: "I'm going to..."
System: "What city was that?"
Design Considerations
Speech is interruptible
"Main Menu: Choose from: Beverages, Sandwiches, Sides, Salads, or Alcoholic Drinks."
Design Considerations
Prompts imply more than choices
"Would you like chocolate or vanilla?"
  "Yes"
  "Both"
Design Considerations
Input must be limited *after* it is provided
Can't check the box on the client side to only allow input of valid amounts
"Sorry, you're only allowed to transfer up to $500."
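Since a caller can say any amount, the limit has to be enforced after recognition and answered with a prompt. A minimal sketch, assuming the amount has already been recognized and converted to a number; the check_amount function and the wording of the first prompt are illustrative assumptions, while the $500 limit and the second prompt follow the slide.

```python
MAX_TRANSFER = 500  # limit taken from the slide's example prompt


def check_amount(amount):
    """Return a prompt to play if the amount is unusable, else None."""
    if amount <= 0:
        # Nothing usable was recognized, so ask again.
        return "How much would you like to transfer?"
    if amount > MAX_TRANSFER:
        # Valid speech, but outside the business rule.
        return f"Sorry, you're only allowed to transfer up to ${MAX_TRANSFER}."
    return None


print(check_amount(50))
print(check_amount(750))
```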
Design Considerations
Avoid using the word "Help" as a global command.
Instead, if there is a need to give additional information, supply it in the first or second reprompt.
Or use specific keywords other than "help":
  "You can also say 'instructions.'"
  "Or, say 'It's something else.'"
User Centered Design Techniques
A set of techniques designed to keep the focus on the user during the design process
May include but are not limited to:
  Conversations: specific to VUI design
  Read Aloud: specific to VUI design
  Card Sorts: used to construct an IA
  Personas: used in all modalities
  Usability Testing: used in all modalities
  A/B Testing: useful for applications that are already in production
Usability Testing
Should I Speech-Enable X?
What IS X?
What's the Use Case For Speech?
Enabling application
  User can't do it any other way
  New tasks
Enhancing application
  User can do it now
  But speech makes it better
    Faster
    Safer
Credit: Bruce Ballentine, EIG
How Hard Is It To Do?
What do you need it for?
What kind of device will you be running it on?
  Connectivity? Can you use cloud-based ASR?
  Do you have to download it? If so, how much space do you have?
How much control do you need over the application / user interface?
Possibilities
Write an app (skill) for an agent such as Cortana / Alexa
Use cloud APIs to add ASR to your app / device / page / gadget
Download an ASR and use full-featured capabilities for more robust recognition
Build your own
Distributed: Today's Speech Agents
Siri
Cortana
Google Now
Amazon Echo (Alexa)
Today's Cloud-Based Speech APIs
Distributed speech recognition
  Collection and compression of speech is on the device
  The language models are typically on the network
Phone can be speaker-dependent
  Trains itself on your voice and on the acoustic environments you are in most often
Many companies are providing APIs to use their speech recognition
AVS vs. Amazon Echo
Could use AVS with the Amazon Echo, or with your own device
Speech API Example: Alexa Voice Services
Alexa Skill Example
Alexa Skills
"Alexa, ask Yelp to find me a restaurant."
Cortana has similar integration
Register your skill with Amazon and publish it
Cloud vs. Downloadable / Embedded
Cloud
  Microsoft
    Cortana integration
    Project Oxford API
  Google API
  Amazon
  Several recent startups
    Api.ai, Capio.ai, Speechmatics, iSpeech
Downloadable / Embedded
  Microsoft
    Windows 10 Speech APIs
    Microsoft Speech Server
  Nuance
    "The 800-pound gorilla in the room"
  Interactions
  IBM Watson
Cloud vs. Downloadable / Embedded
Cloud
  Easy to get started
  Lightweight
  Not much specialized knowledge
Downloadable / Embedded
  Customizable
  Probably better recognition
  Can be device-specific
  More features
  Higher powered
  Will require specialized knowledge
    Speech scientist
Today's NLU APIs
Microsoft LUIS (part of Project Oxford)
Api.ai
Open Source ASR
CMU Sphinx
  pocketsphinx
Kaldi
  http://kaldi-asr.org/
  GitHub
  New updates include some pretty interesting stuff (DNN)
Requires:
  Corpus
  Tech know-how
Who May You Need On Your Team
Speech Scientist
VUI Designer
Should I Speech-Enable X?
Desktop App / Website
  Easy to get started with API-based ASR
  But the use case may not be as powerful
Tablet / Mobile
  Stronger use case
  But will the network be available for APIs?
Industrial Device
  Great use case, especially with multimodal
  But this is harder to do and probably will be custom
Gadget
  Decent use case
  APIs are tailored for this
  Will they do everything you need?
  Will the extra modality be a plus or just a silly add-on?
Car
  Safety considerations are high here
  Need better and more robust user interfaces
IVR
  Touchtone can still be good for a lot of applications
  Speech is good for complex call routing and input
Resources
The Voice in the Machine: Building Computers that Understand Speech, Roberto Pieraccini
YouTube video: "Open the Pod Bay Doors, Siri"
Best Practices in VUI Design: AVIxD Wiki, http://videsign.wikispaces.com/
AVIxD: Quarterly Brown Bags
Thanks!