Conversational Computers: Always 10 Years Away?
Kai-Fu Lee
Corporate Vice President
Microsoft Corporation
Why Conversational Interface?
Speech: “invented” for interaction
  “[Speech & language are] a biological adaptation to communicate information… One of nature’s engineering marvels” – Steven Pinker
  “Vision evolved from the need to survive; speech evolved from the need to communicate” – Michael Dertouzos
Benefits of “Conversational Interface”
  “To me, speech recognition will be a transforming capability … when you can speak to your computer and it will understand what you're saying in context.” – Gordon Moore
  “Speech and natural language understanding are the key technologies that will have the most impact in the next 15 years.” – Bill Gates
Future UI visions assume conversational UI
  Apple’s “Knowledge Navigator”
  Microsoft’s “information at your fingertips”
Science fiction movies assume conversational UI
But “Always” 10 Years Away
1950: Jerome Wiesner predicted that machine translation might be possible by 1960
1957: Herbert Simon predicted that by 1967 machines would match human performance in many areas
1969: A US expert panel predicted “voice I/O will be in common use by 1978”
1993: I predicted that by 2003 every PC would ship with speech recognition
1998: Gartner Group predicted that PC UI would assume voice input by 2003
Decomposing the Prediction
Speech recognition
Text to speech
Natural language understanding
Why have we been a constant 10 years away?
My 3-year & 10-year predictions
[Diagram: three components labeled Natural Language Understanding, Speech Recognition, and Text to Speech]
Talk Outline
Speech recognition
Text to speech
Natural language understanding
Why have we been a constant 10 years away?
My 3-year & 10-year predictions
[Diagram: three components labeled Natural Language Understanding, Speech Recognition, and Text to Speech]
Fundamental Equation of Speech Recognition
X is the acoustic waveform
W is the word string
A speech recognizer finds the word string $\hat{W}$ such that
$$\hat{W} = \arg\max_{W} p(W \mid X) = \arg\max_{W} p(X \mid W)\, p(W)$$
p(X | W) is the acoustic model
p(W) is the language model
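The step from p(W | X) to p(X | W) p(W) in the equation above is Bayes' rule with the denominator dropped. A short derivation, writing $\hat{W}$ for the recognizer's output (a notational choice made here for clarity):

```latex
\begin{aligned}
\hat{W} &= \arg\max_{W} \; p(W \mid X) \\
        &= \arg\max_{W} \; \frac{p(X \mid W)\, p(W)}{p(X)} \\
        &= \arg\max_{W} \; p(X \mid W)\, p(W)
\end{aligned}
```

The last line follows because p(X) is the same for every candidate word string W, so dividing by it cannot change which W wins.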
Statistical Modeling
Improving the acoustic model – p(X | W)
Statistical approach:
  1. Build a detailed statistical model for each word.
     Detail could be based on phonetics, speaker, dialect, gender, data-driven details, etc.
  2. Collect a lot more samples for each word.
     There is no data like more data.
  3. Go to step one.
Improving the language model – p(W)
Statistical approach: trigrams.
There is no data like more data.
This helps recognition, not understanding.
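As an illustration of the trigram approach, here is a minimal sketch that estimates p(w3 | w1, w2) by counting. The toy corpus, the function names, and the unsmoothed maximum-likelihood estimate are all assumptions made for this example; a production language model would add smoothing and far more data.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word histories from tokenized sentences."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        # Pad so the first real word also has a two-word history.
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            tri[history + (padded[i],)] += 1
            bi[history] += 1
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3):
    """Maximum-likelihood estimate of p(w3 | w1, w2); 0.0 for unseen histories."""
    history_count = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / history_count if history_count else 0.0

# Hypothetical three-sentence corpus, purely for illustration.
corpus = [
    "book a trip to chicago".split(),
    "book a flight to chicago".split(),
    "book a trip to boston".split(),
]
tri, bi = train_trigram(corpus)
p_trip = trigram_prob(tri, bi, "book", "a", "trip")
```

With only three sentences the estimates are crude, which is the point of "there is no data like more data": every additional sentence sharpens the counts without any change to the model.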
Does Moore’s Law Help Speech?
Moore’s law is necessary but not sufficient
  Just faster chips means recognition errors appear faster.
Super-Moore’s law for speech:
  Faster processors/memory/disk +
  Getting more real data & feedback loop +
  Improved statistical models
Result:
  Moore’s law doubles performance in 18 months
  Super-Moore’s law halves errors in 60 months
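The two rates in the result above compound very differently. A small sketch of the arithmetic, assuming both trends are smooth exponentials and using a hypothetical 20% starting error rate chosen only for illustration:

```python
def moore_speedup(months, doubling_months=18.0):
    """Performance multiple after `months` if performance doubles every 18 months."""
    return 2.0 ** (months / doubling_months)

def error_rate(initial_error, months, halving_months=60.0):
    """Error rate after `months` if errors halve every 60 months."""
    return initial_error * 0.5 ** (months / halving_months)

ten_years = 120  # months
speedup = moore_speedup(ten_years)       # roughly a hundredfold
remaining = error_rate(0.20, ten_years)  # two halvings of the assumed 20% start
```

Over the same decade, raw compute grows by about two orders of magnitude while the error rate only falls to a quarter of its starting value, which is why faster chips alone are "necessary but not sufficient".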
Speech Recognition: Approaching Human Error Rate
[Chart: recognition error rate (vertical axis, 0% to 30%) by year (horizontal axis, 1993 to 2011), with a reference line for the human error rate. Milestones marked on the chart: Microsoft licensed CMU Sphinx-II; Whisper in MSR; Speech in Office XP; Speech in Tablet/Office 11; Speech in Longhorn.]
Fundamental Approach for TTS
Concatenative synthesis
  Concatenation of pre-recorded speech units
Front-end
  Natural language processing (word breaking, POS…)
  Determine emphasis to drive speed, pitch, loudness.
Back-end
  Collect a lot of data
  Carefully segment & store in a database
  Select the best units from the database
    Find statistical metrics that match “naturalness”, e.g., smoothness rather than specific duration targets
    Use these metrics to select units
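The back-end's "select the best units" step can be sketched as a shortest-path search over candidate units. The version below is a common textbook formulation (dynamic programming over a target cost plus a join cost), not necessarily the method behind this talk; the unit tuples, pitch values, and both cost functions are hypothetical.

```python
def select_units(candidates, target_cost, join_cost):
    """Pick one unit per position, minimizing total target + join cost (Viterbi)."""
    # best[i][j] = (cheapest cost of a path ending in candidates[i][j], backpointer)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            prev_costs = [
                best[i - 1][k][0] + join_cost(p, u)
                for k, p in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(prev_costs)), key=prev_costs.__getitem__)
            row.append((prev_costs[k_best] + target_cost(i, u), k_best))
        best.append(row)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total

# Hypothetical database: each unit is (name, pitch in Hz).
candidates = [
    [("h1", 100), ("h2", 120)],
    [("e1", 118), ("e2", 90)],
    [("l1", 110), ("l2", 125)],
]
desired_pitch = [110, 110, 110]  # assumed front-end targets

def target_cost(i, unit):
    # Distance from the front-end target, weighted lower than smoothness.
    return 0.5 * abs(unit[1] - desired_pitch[i])

def join_cost(prev_unit, unit):
    # Smoothness: penalize pitch jumps at the concatenation point.
    return abs(prev_unit[1] - unit[1])

path, total = select_units(candidates, target_cost, join_cost)
```

Weighting the join cost above the target cost mirrors the slide's preference for smoothness over hitting specific targets: here the search picks `h2` over `h1` because it joins more smoothly with `e1`, even though both are equally far from the target pitch.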
Text to Speech: Approaching Human Naturalness
[Chart: naturalness score (vertical axis, 0 to 5) of the best system in each year (horizontal axis, 1982 to 2010), with a reference line for human naturalness.]
ASR & TTS: Optimization & Engineering
By leveraging Moore’s law
  Exponential improvements from…
    Faster CPU + bigger database + better algorithm
Approaching human abilities, not through AI, but through…
  Optimization, or “speech engineering”
Still falls short of humans on:
  Learning, adaptation.
  Robustness to environment.
But many applications come from ASR & TTS alone:
  ASR: Dictation, speech search, speaker verification, language learning…
  TTS: Telephony info access, voice fonts, voice conversion…
Natural Language Understanding Combines:
Syntax (rules of the human’s language)
  Nouns, verbs, etc. and how they combine
    “Book about a trip to Chicago” vs. “Book a trip to Chicago”
  Normalize linguistic variations.
Semantics
  Meaning of the words
    Book means reserve a ticket; requires from-city, to-city, etc.
Context (additional hints)
  Domain knowledge:
    No train from Hawaii to Chicago
  Statistics: Book as a noun > Book as a verb
    “Book Chicago”
  Personal preferences:
    Where you live, your calendar, how you pay…
  Model of time, urgency, presence
Dialog (resolving ambiguity & determining intent)
  “Buy a book or book travel?”
  “What date would you like to travel?”
Applying Statistics to Understanding
Engineering approach:
  Focus on one domain, engineer all the knowledge.
  Collect data & create feedback loop to improve.
Applying Bayes’ rule to understanding
  W is the word string
  M is the meaning
  An understanding system finds the meaning $\hat{M}$ such that
  $$\hat{M} = \arg\max_{M} p(M \mid W) = \arg\max_{M} p(W \mid M)\, p(M)$$
  p(W | M) models all the ways to express a “meaning”
  p(M) is the semantic model
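To make the Bayes formulation concrete, here is a toy sketch that scores each candidate meaning by p(W | M) p(M) in a single travel domain. Every probability, meaning label, and the floor value for unseen phrasings is invented for illustration; a real system would estimate them from collected data.

```python
import math

# Hypothetical semantic model p(M): prior over meanings in a travel domain.
p_meaning = {"reserve_travel": 0.7, "buy_book": 0.3}

# Hypothetical expression model p(W | M): ways of phrasing each meaning.
p_words_given_meaning = {
    "reserve_travel": {"book chicago": 0.05, "book a trip to chicago": 0.20},
    "buy_book": {"book chicago": 0.01, "book about a trip to chicago": 0.10},
}

def understand(words):
    """Return the meaning M maximizing log p(W | M) + log p(M)."""
    def score(meaning):
        # Small floor for unseen phrasings so the logarithm stays defined.
        likelihood = p_words_given_meaning[meaning].get(words, 1e-9)
        return math.log(likelihood) + math.log(p_meaning[meaning])
    return max(p_meaning, key=score)

best = understand("book chicago")
```

The ambiguous request “book chicago” resolves to `reserve_travel` only because this domain's prior and phrasing statistics favor it, which illustrates why the engineering approach focuses on one domain at a time.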
What is “unsolved” by Statistics?
Fusion of many sources of knowledge
Domain-free understanding
  Instant context switching
General knowledge
  History, sports, etc.
Common sense reasoning
  “Least common of all senses”
Ambiguity
  “Mr. Wright should write to Mrs. Wright right away”
Emotion, humor, etc.
Many of the challenges are “AI-complete”
Milestones in Speech Technology Research
[Timeline chart; horizontal axis: 1962 to 2002 in five-year steps. Stages shown, in chronological order:]
  Small Vocabulary, Acoustic Phonetics-based
    Isolated Words
    Filter-bank analysis; Time-normalization; Dynamic programming
  Medium Vocabulary, Template-based
    Isolated Words; Connected Digits; Continuous Speech
    Pattern recognition; LPC analysis; Clustering algorithms
  Large Vocabulary, Statistical-based
    Connected Words; Continuous Speech
    Hidden Markov models; Stochastic language modeling
  Large Vocabulary; Syntax, Semantics
    Continuous Speech; Speech Understanding
    Stochastic language understanding; Finite-state machines; Statistical learning
  Very Large Vocabulary; Semantics, Multimodal Dialog, TTS
    Spoken dialog; Multiple modalities
    Concatenative synthesis; Machine learning; Mixed-initiative dialog
Fueled by Moore’s Law + Data + Research
Why Constant 10 Years Away?
Immature technology
  Improving but only recently becoming useful
Over-sold expectations
  Science fiction movies
  Effective (but not real product) demos
Under-estimated risks
  User habits are hard to change
  Cost of developing speech applications is high
Things are different now!
  Technology is ready
  And we have learned our lessons.
What Have We Learned?
Don’t make predictions.
  … based on extrapolating from one data point!
There is no data like more data.
  Real data & feedback > Moore’s Law.
Change the world, one domain at a time.
Breakthrough from data + rigor is just fine.
Start with user’s comfort zone.
Start with the greatest customer need & business opportunity.
3-Year Speech Prediction: Most Realistic Near-Term Speech Application
[Diagram. Dimension labels: Market Opportunity; Technology Readiness; Customer Need; Poor Alternative. Application labels:]
  Meeting / Voicemail Transcription
  Mobile Devices / Cars
  Telephony / Call Center
  Accessibility
  Desktop Dictation
  Windows Commands & Applications / API
10-Year Speech Predictions
[Roadmap chart; horizontal axis: 2005, 2008, 2010, 2013. Row labels: Telephony; Devices; Desktop; Voice data. Milestone labels recovered from the chart, in extraction order (their placement by row and year is not recoverable from the extracted text):]
  Dictation & New applications
  All phones have speech; Mainstream app
  Accessibility & Asian Dictation
  Mobility & Automotive applications
  Structured Search
  Delegation
  Call Center Mainstream app (unified msg…)
  VOIP converges data & voice
  Central Part of Mobile UI; Mobile dictation
  Key part of Desktop UI
  Planning Federation
  Question Answering
  Task-specific translation
  Home appliances
  Voicemail & Meeting Search
  Personal Annotations & Recording search
  Mining from audio data (e.g., call center)
  Voicemail & Meeting transcription
Conclusion
Speech technologies will follow Moore’s Law
  Faster CPU + more data + better algorithms.
  Near-human quality possible in 7-10 years
Natural language understanding is hard
  Domain-free reasoning & common sense hardest
  Truly human-level understanding likely elusive
Smart, conversational systems will emerge
  2-3 years: telephony, multimodal, accessibility.
  7-10 years: intelligent assistance, meeting search/transcription, speech everywhere.
© 2001 Microsoft Corporation. All rights reserved.