Conversational Computers: Always 10 Years Away?

Kai-Fu Lee
Corporate Vice President
Microsoft Corporation

Why Conversational Interface?

Speech: “invented” for interaction
  “[Speech & language are] a biological adaptation to communicate information… One of nature’s engineering marvels” – Steven Pinker
  “Vision evolved from the need to survive; speech evolved from the need to communicate” – Michael Dertouzos

Benefits of “Conversational Interface”
  “To me, speech recognition will be a transforming capability … when you can speak to your computer and it will understand what you're saying in context.” – Gordon Moore
  “Speech and natural language understanding are the key technologies that will have the most impact in the next 15 years.” – Bill Gates

Future UI visions assume conversational UI
  Apple’s “Knowledge Navigator”
  Microsoft’s “information at your fingertips”

Science fiction movies assume conversational UI

But “Always” 10 Years Away

1950: Jerome Wiesner predicted that by 1960 machine translation may be possible
1957: Herbert Simon predicted that by 1967 machines will match human performance in many areas
1969: US Expert Panel predicted “voice I/O will be in common use by 1978”
1993: I predicted that by 2003 every PC will ship with speech recognition
1998: Gartner Group predicted PC UI will assume voice input by 2003

Decomposing the Prediction

Speech recognition
Text to speech
Natural language understanding
Why have we been a constant 10 years away?
My 3-year & 10-year predictions

Talk Outline

[Diagram: Speech Recognition, Text to Speech, Natural Language Understanding]








Fundamental Equation of Speech Recognition

X is the acoustic waveform
W is the word string

A speech recognizer finds W such that
$\hat{W} = \arg\max_{W} p(W \mid X) = \arg\max_{W} p(X \mid W)\, p(W)$

p(X | W) is the acoustic model
p(W) is the language model
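The equation above can be sketched as a search over candidate word strings. This is a minimal illustration, not the recognizer described in the talk: the two candidate strings and all probability values are made-up assumptions, and a real system would score a very large hypothesis space rather than a two-entry table.

```python
import math

# Hypothetical toy scores; a real recognizer would estimate these from audio and text data.
ACOUSTIC = {  # p(X | W): how well each candidate word string explains the waveform X
    "recognize speech": 0.0020,
    "wreck a nice beach": 0.0025,
}
LANGUAGE = {  # p(W): prior probability of each word string
    "recognize speech": 0.010,
    "wreck a nice beach": 0.001,
}

def decode(acoustic, language):
    """Return argmax_W p(X | W) p(W), computed in log space to avoid underflow."""
    return max(acoustic, key=lambda w: math.log(acoustic[w]) + math.log(language[w]))

best = decode(ACOUSTIC, LANGUAGE)
print(best)
```

In this toy setup the acoustic model alone slightly prefers the wrong string, and the language model prior is what tips the decision, which is the reason the two models are combined.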

Statistical Modeling

Improving the acoustic model – p(X | W)

Statistical approach:
1. Build a detailed statistical model for each word. Detail could be based on phonetics, speaker, dialect, gender, or data-driven details, etc.
2. Collect a lot more samples for each word. There is no data like more data.
3. Go to step one.

Improving the language model – p(W)

Statistical approach: trigrams. There is no data like more data.

This helps recognition, not understanding.
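A trigram language model estimates the probability of each word given the two words before it. The sketch below uses plain maximum-likelihood counts on a three-sentence corpus; the corpus, the function names, and the padding tokens are illustrative assumptions, and production models add smoothing so that unseen trigrams do not get probability zero.

```python
from collections import Counter

def train_trigrams(sentences):
    """Count trigrams and their two-word histories over padded sentences."""
    tri, bi = Counter(), Counter()
    for s in sentences:
        words = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1
            bi[tuple(words[i - 2:i])] += 1
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3):
    """Maximum-likelihood estimate of p(w3 | w1, w2); 0.0 if the history is unseen."""
    history = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / history if history else 0.0

# Hypothetical toy corpus.
corpus = [
    "book a trip to chicago",
    "book a flight to chicago",
    "read a book about chicago",
]
tri, bi = train_trigrams(corpus)
print(trigram_prob(tri, bi, "book", "a", "trip"))
```

With so little data most histories are unseen and return 0.0, which is a small-scale view of why "there is no data like more data".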

Does Moore’s Law Help Speech?

Moore’s law is necessary but not sufficient
  Just faster chips means recognition errors appear faster.

Super-Moore’s law for speech:
  Faster processors/memory/disk +
  Getting more real data & feedback loop +
  Improved statistical models

Result:
  Moore’s law doubles performance in 18 months
  Super-Moore’s law halves errors in 60 months
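The two rates above can be compared with a little arithmetic. The sketch below simply applies the stated doubling and halving periods; the 20% starting error rate is a hypothetical value chosen for illustration, not a figure from the talk.

```python
def projected_error(initial_error, months, halving_months=60):
    """Error rate after `months` if errors halve every `halving_months` months."""
    return initial_error * 0.5 ** (months / halving_months)

def moore_speedup(months, doubling_months=18):
    """Raw performance multiple after `months` if it doubles every `doubling_months` months."""
    return 2 ** (months / doubling_months)

# Hypothetical starting point: a 20% error rate.
for years in (0, 5, 10, 15):
    m = years * 12
    print(years, round(projected_error(0.20, m), 4), round(moore_speedup(m), 1))
```

Under these assumptions, five years brings roughly a tenfold increase in raw performance but only a halving of errors, so accuracy improves far more slowly than hardware.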

Speech Recognition: Approaching Human Error Rate

[Chart: recognition error rate (0% to 30%) by year, 1993 to 2011, compared with a “Human Error Rate” reference level. Milestones marked: Microsoft licensed CMU Sphinx-II; Whisper in MSR; Speech in Office XP; Speech in Tablet/Office 11; Speech in Longhorn.]








Fundamental Approach for TTS

Concatenative Synthesis
  Concatenation of pre-recorded speech units

Front-end
  Natural language processing (word breaking, POS…)
  Determine emphasis to drive speed, pitch, loudness.

Back-end
  Collect a lot of data
  Carefully segment & store in a database
  Select the best units from the database
    Find statistical metrics that match “naturalness”, e.g., smoothness rather than specific duration targets
    Use these metrics to select units
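The back-end selection step can be sketched as a shortest-path search over candidate units. This is an illustration under stated assumptions, not the system described in the talk: the unit records, their pitch values, and the pitch-mismatch `join_cost` are all hypothetical, standing in for whatever smoothness metric a real synthesizer would use.

```python
def select_units(candidates, join_cost):
    """Pick one unit per position, minimizing the summed join cost between neighbors.

    candidates[i] lists the database units usable at position i.
    Uses Viterbi-style dynamic programming over the candidate lattice.
    """
    # Each entry is (cost of the cheapest path ending in this unit, that path).
    best = [(0.0, [u]) for u in candidates[0]]
    for options in candidates[1:]:
        new_best = []
        for u in options:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda t: t[0],
            )
            new_best.append((cost, path + [u]))
        best = new_best
    return min(best, key=lambda t: t[0])

def pitch_jump(prev, nxt):
    """Hypothetical smoothness metric: pitch mismatch at the concatenation point."""
    return abs(prev["end"] - nxt["start"])

# Hypothetical database: two recorded candidates for each of three positions.
candidates = [
    [{"id": "h1", "start": 110, "end": 120}, {"id": "h2", "start": 100, "end": 140}],
    [{"id": "e1", "start": 125, "end": 130}, {"id": "e2", "start": 150, "end": 150}],
    [{"id": "l1", "start": 128, "end": 120}, {"id": "l2", "start": 160, "end": 150}],
]
cost, path = select_units(candidates, pitch_jump)
print(cost, [u["id"] for u in path])
```

The search favors the sequence with the smoothest joins overall rather than the best unit at each position in isolation, which matches the slide's emphasis on smoothness over fixed targets.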

Text to Speech: Approaching Human Naturalness

[Chart: naturalness (scale 0 to 5) of the best system in each year, 1982 to 2010, compared with a “Human Naturalness” reference level.]

ASR & TTS: Optimization & Engineering

By leveraging Moore’s law
  Exponential improvements from…
    Faster CPU + bigger database + better algorithm
  Approaching human abilities, but not AI, but…
    Optimization, or “speech engineering”

Still falls short of humans on:
  Learning, adaptation.
  Robustness to environment.

But many applications just from ASR & TTS:
  ASR: Dictation, speech search, speaker verification, language learning…
  TTS: Telephony info access, voice fonts, voice conversion…








Natural Language Understanding Combines:

Syntax (rules of the human’s language)
  Nouns, verbs, etc. and how they combine
    “Book about a trip to Chicago” vs. “Book a trip to Chicago”
  Normalize linguistic variations

Semantics
  Meaning of the words
    Book means reserve a ticket; requires from-city, to-city, etc.

Context (additional hints)
  Domain knowledge:
    No train from Hawaii to Chicago
  Statistics:
    Book as a noun > Book as a verb
    “Book Chicago”
  Personal preferences:
    Where you live, your calendar, how you pay…
  Model of time, urgency, presence

Dialog (resolving ambiguity & determining intent)
  “Buy a book or book travel?”
  “What date would you like to travel?”
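The layers above can be sketched with a toy frame-and-slot parser. Everything here is a hypothetical illustration: the intent names, the slot names (modeled on the slide's from-city and to-city), the regular expressions, and the prompts are assumptions, and a real system would use a proper grammar and statistical models rather than two patterns.

```python
import re

# Hypothetical frame for the "book travel" meaning.
REQUIRED_SLOTS = ("from_city", "to_city", "date")
PROMPTS = {
    "from_city": "Where are you leaving from?",
    "to_city": "Where would you like to go?",
    "date": "What date would you like to travel?",
}

def parse(utterance):
    """Toy syntax + semantics: decide how 'book' is used, then fill slots by pattern."""
    text = utterance.lower()
    # "book about ..." uses book as a noun; "book a trip ..." uses it as a verb.
    if re.match(r"book\s+about\b", text):
        return {"intent": "find_book", "slots": {}}
    slots = {}
    m = re.search(r"\bto\s+([a-z]+)", text)
    if m:
        slots["to_city"] = m.group(1)
    m = re.search(r"\bfrom\s+([a-z]+)", text)
    if m:
        slots["from_city"] = m.group(1)
    return {"intent": "book_travel", "slots": slots}

def next_question(frame, context=None):
    """Dialog step: use context to fill gaps, then ask about the first missing slot."""
    if frame["intent"] != "book_travel":
        return None
    slots = dict(context or {})
    slots.update(frame["slots"])
    for name in REQUIRED_SLOTS:
        if name not in slots:
            return PROMPTS[name]
    return None

frame = parse("Book a trip to Chicago")
print(frame)
print(next_question(frame, context={"from_city": "seattle"}))
```

Here the `context` argument plays the role of personal preferences such as where you live: with the departure city already known, the dialog skips straight to asking for the travel date.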

Applying Statistics to Understanding

Engineering approach:
  Focus on one domain, engineer all the knowledge.
  Collect data & create feedback loop to improve.

Applying Bayes Rule to understanding
  W is the word string
  M is the meaning

An understanding system finds M such that
$\hat{M} = \arg\max_{M} p(M \mid W) = \arg\max_{M} p(W \mid M)\, p(M)$

p(W | M) models all the ways to express a “meaning”
p(M) is the semantic model
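Bayes rule for understanding can be sketched in one narrow domain with a handful of labeled utterances. The sketch below is an assumption-laden illustration: it models p(W | M) as a bag of words with add-one smoothing, which the talk does not specify, and the training utterances and meaning labels are invented.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Estimate counts for p(M) and per-meaning word counts from (utterance, meaning) pairs."""
    prior = Counter(m for _, m in examples)
    words = defaultdict(Counter)
    for text, m in examples:
        words[m].update(text.lower().split())
    vocab = {w for counts in words.values() for w in counts}
    return prior, words, vocab

def understand(text, prior, words, vocab):
    """Return argmax_M p(W | M) p(M), with p(W | M) as a smoothed unigram model."""
    total = sum(prior.values())

    def score(m):
        n = sum(words[m].values())
        s = math.log(prior[m] / total)
        for w in text.lower().split():
            # Add-one smoothing, with one extra slot reserved for unseen words.
            s += math.log((words[m][w] + 1) / (n + len(vocab) + 1))
        return s

    return max(prior, key=score)

# Hypothetical training data for a single travel/shopping domain.
examples = [
    ("book a trip to chicago", "book_travel"),
    ("book a flight to boston", "book_travel"),
    ("reserve a ticket to chicago", "book_travel"),
    ("buy a book about chicago", "buy_book"),
    ("find a book about travel", "buy_book"),
]
prior, words, vocab = train(examples)
print(understand("reserve a flight to chicago", prior, words, vocab))
```

The model only works because the domain is tiny and fully covered by data, which mirrors the engineering approach of focusing on one domain and improving through a feedback loop.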

What is “unsolved” by Statistics?

Fusion of many sources of knowledge
Domain-free understanding
  Instant context switching
General knowledge
  History, sports, etc.
Common sense reasoning
  “Least common of all senses”
Ambiguity
  “Mr. Wright should write to Mrs. Wright right away”
Emotion, humor, etc.
Many of the challenges are “AI-complete”

Milestones in Speech Technology Research

[Timeline figure, 1962 to 2002, in five stages from earliest to latest:]
Small Vocabulary, Acoustic Phonetics-based: Isolated Words. Filter-bank analysis; Time-normalization; Dynamic programming.
Medium Vocabulary, Template-based: Isolated Words; Connected Digits; Continuous Speech. Pattern recognition; LPC analysis; Clustering algorithms.
Large Vocabulary, Statistical-based: Connected Words; Continuous Speech. Hidden Markov models; Stochastic language modeling.
Large Vocabulary; Syntax, Semantics: Continuous Speech; Speech Understanding. Stochastic language understanding; Finite-state machines; Statistical learning.
Very Large Vocabulary; Semantics, Multimodal Dialog, TTS: Spoken dialog; Multiple modalities. Concatenative synthesis; Machine learning; Mixed-initiative dialog.

Fueled by Moore’s Law + Data + Research








Why Constant 10 Years Away?

Immature technology
  Improving but only recently becoming useful
Over-sold expectations
  Science fiction movies
  Effective (but not real product) demos
Under-estimated risks
  User habits are hard to change
  Cost of developing speech applications is high
Things are different now!
  Technology is ready
  And we have learned our lessons.

What Have We Learned?

Don’t make predictions.
  … based on extrapolating from one data point!
There is no data like more data.
  Real data & feedback > Moore’s Law.
Change the world, one domain at a time.
Breakthrough from data + rigor is just fine.
Start with user’s comfort zone.
Start with the greatest customer need & business opportunity.








3-Year Speech Prediction: Most Realistic Near-Term Speech Application

[Chart: candidate applications rated on Market Opportunity, Technology Readiness, Customer Need, and Poor Alternative. Applications shown: Meeting / Voicemail Transcription; Mobile Devices / Cars; Telephony / Call Center; Accessibility; Desktop Dictation; Windows Commands & Applications / API.]

10-Year Speech Predictions

[Roadmap chart with milestone years 2005, 2008, 2010, and 2013 and rows including Telephony, Devices, Desktop, Voice data, and New applications. Entries recoverable from the chart: All phones have speech; Mainstream app; Accessibility & Asian dictation; Mobility & automotive applications; Structured search; Delegation; Call center mainstream app (unified msg…); VOIP converges data & voice; Central part of mobile UI; Mobile dictation; Key part of desktop UI; Dictation; Planning; Federation; Question answering; Task-specific translation; Home appliances; Voicemail & meeting search; Personal annotations & recording search; Mining from audio data (e.g., call center); Voicemail & meeting transcription.]

Conclusion

Speech technologies will follow Moore’s Law
  Faster CPU + more data + better algorithms.
  Near-human quality possible in 7-10 years

Natural language understanding is hard
  Domain-free reasoning & common sense hardest
  Truly human-level understanding likely elusive

Smart, conversational systems will emerge
  2-3 years: telephony, multimodal, accessibility.
  7-10 years: intelligent assistance, meeting search/transcription, speech everywhere.