From Voice Browsers to Multimodal Systems
Transcript of From Voice Browsers to Multimodal Systems
W3C AC / WWW10, Hong Kong, May 2001
From Voice Browsers to Multimodal Systems
Dave Raggett
W3C Lead for Voice/Multimodal
W3C & Openwave
http://www.w3.org/Voice
With thanks to Jim Larson
The W3C Speech Interface Framework
Voice: The Natural Interface, available from over a billion phones
• Personal assistant functions
  – Name dialing and search
  – Personal Information Management
  – Unified Messaging (mail, fax & IM)
  – Call screening & call routing
• Voice Portals
  – Access to news, information, entertainment, customer service and V-commerce (e.g. find a friend, wine tips, flight info, find a hotel room, buy ringing tones, track a shipment)
• Front-ends for Call Centers
  – 90% cost savings over human agents
  – Reduced call abandonment rates (IVR)
  – Increased customer satisfaction
(Portal Demo)
W3C Voice Browser Working Group
http://www.w3.org/Voice/Group
• Founded: May 1999, following a workshop in October 1998
• Mission
  – Prepare and review markup languages to enable Internet-based speech applications
• Has published requirements and specifications for languages in the W3C Speech Interface Framework
• Is now due to be re-chartered with a clarified IP policy
Voice Browser WG Membership
Alcatel, AnyDevice, Ask Jeeves, AT&T, Avaya, BeVocal, Brience, BT, Canon, Cisco, Comverse, Conversay, EDF, France Telecom, General Magic, Hitachi, HP, IBM, Informio, Intel, IsSound, Lernout & Hauspie, Locus Dialogue, Lucent, Microsoft, Milo, Mitre, Motorola, Nokia, Nortel Networks, Nuance, Philips, Openwave, PipeBeach, SpeechHost, SpeechWorks, Sun Microsystems, Telecom Italia, Telera, Tellme, Unisys, Verascape, VoiceGenie, Voxeo, VoxSurf, Yahoo
W3C Speech Interface Framework
[Architecture diagram: the user interacts through the telephone system with a DTMF tone recognizer, ASR, language understanding and context interpretation feeding a dialog manager; on the output side the dialog manager drives media planning, language generation, TTS and a prerecorded audio player. A shared lexicon, call control and the World Wide Web complete the picture. Associated markup languages: Speech Synthesis ML, Speech Recognition Grammar ML, N-gram Grammar ML, Natural Language Semantics ML, VoiceXML 2.0 and Reusable Components.]
W3C Speech Interface Framework Published Documents
[Status table: for each document in the framework (Dialog, Speech Synthesis, Speech Grammar, N-gram, NL Semantics, Reusable Components, Lexicon, Call Control), requirements drafts (REQ) were published between 12-99 and 5-00; subsequent Working Drafts, Last Call drafts and later maturity stages (WD, LC, CR, PR, REC) follow from 1-01 onward, with several more expected soon.]
Documents available at http://www.w3.org/Voice
Voice User Interfaces and VoiceXML
• Why use voice as a user interface?
  – Far more phones than PCs
  – More wireless phones than PCs
  – Hands and eyes free operation
• Why do we need a language for specifying voice dialogs?
  – A high-level language simplifies application development
  – Separates the voice interface from the application server
  – Leverages existing Web application development tools
• What does VoiceXML describe?
  – Conversational dialogs: system and user take turns to speak
  – Dialogs based on a form-filling metaphor plus events and links
• W3C is standardizing VoiceXML based upon the VoiceXML 1.0 submission by AT&T, IBM, Lucent and Motorola
VoiceXML Architecture
[Diagram: any phone connects over the PSTN or VoIP, via a carrier, to a VoiceXML gateway; the gateway exchanges speech + DTMF with the caller and fetches VoiceXML, grammars and audio files from a consumer or corporate Web site.]
Brings the power of the Web to voice.
Reaching Out to Multiple Channels
[Diagram: an applications database serves XML, images, audio, etc. through a content adaptation layer, which adjusts as needed for each device and user before delivery as XHTML, VoiceXML, or WML/HDML.]
VoiceXML Features
• Menus, forms, sub-dialogs: <menu>, <form>, <subdialog>
• Input
  – Speech recognition: <grammar>
  – Recording: <record>
  – Keypad: <dtmf>
• Output
  – Audio files: <audio>
  – Text-to-speech
• Variables: <var>, <script>
• Events: <nomatch>, <noinput>, <help>, <catch>, <throw>
• Transition & submission: <goto>, <submit>
• Telephony: call transfer, telephony information
• Platform: objects
• Performance: fetch
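Several of the features above can be combined in one small dialog; a minimal sketch (drink.gram and menu.vxml are illustrative resources, not from the talk):

```xml
<!-- Hedged sketch: combines <field>, <nomatch>, <noinput> and <goto>
     from the feature list above. drink.gram and menu.vxml are
     hypothetical resources, not part of the original slides. -->
<form id="order">
  <field name="drink">
    <prompt>What would you like to drink?</prompt>
    <grammar src="drink.gram" type="application/x-jsgf"/>
    <nomatch> Sorry, I didn't understand. <reprompt/> </nomatch>
    <noinput> I didn't hear anything. <reprompt/> </noinput>
  </field>
  <block>
    <goto next="menu.vxml"/>
  </block>
</form>
```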
Example VoiceXML

<menu>
  <prompt>
    <speak>
      Welcome to Ajax Travel. Do you want to fly to
      <emphasis> New York </emphasis>
      or
      <emphasis> Washington </emphasis>
    </speak>
  </prompt>
  <choice next="http://www.NY...">
    <grammar>
      <choice>
        <item> New York </item>
        <item> Big Apple </item>
      </choice>
    </grammar>
  </choice>
  <choice next="http://www.Wash...">
    <grammar>
      <choice>
        <item> Washington </item>
        <item> The Capital </item>
      </choice>
    </grammar>
  </choice>
</menu>
Example VoiceXML

<form id="weather_info">
  <block>Welcome to the international weather service.</block>
  <field name="country">
    <prompt>What country?</prompt>
    <grammar src="country.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the country for which you want the weather.
    </catch>
  </field>
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the city for which you want the weather.
    </catch>
  </field>
  <block>
    <submit next="/servlet/weather" namelist="city country"/>
  </block>
</form>
VoiceXML Implementations
• BeVocal
• General Magic
• HeyAnita
• IBM
• Lucent
• Motorola
• Nuance
• PipeBeach
• SpeechWorks
• Telera
• Tellme
• VoiceGenie
See http://www.w3.org/Voice
These are the companies who asked to be listed on the W3C Voice page
Reusable Components
[Diagram: instead of the voice application developer writing VoiceXML scripts directly for the dialog manager, the developer assembles reusable components that supply those scripts.]
Reusable Dialog Modules
• Express the application at the task level rather than the interaction level
• Save development time by reusing tried and effective modules
• Increase consistency among applications
Examples include:
Credit card number
Date
Name
Address
Telephone number
Yes/No question
Shopping cart
Order status
Weather
Stock quotes
Sport scores
Word games
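A task-level module such as the date collector above might be invoked with VoiceXML's <subdialog> element; a sketch, assuming a hypothetical date.vxml module that returns a variable named date:

```xml
<!-- Illustrative only: date.vxml and its return variable "date"
     are hypothetical, standing in for a reusable date module. -->
<form id="travel">
  <subdialog name="departure" src="date.vxml">
    <filled>
      <prompt>Leaving on <value expr="departure.date"/>.</prompt>
    </filled>
  </subdialog>
</form>
```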
Speech Grammar ML
• Specifies the words and patterns of words for which a speaker-independent recognizer can listen
• May be specified
  – Inline, as part of a VoiceXML page
  – Referenced and stored separately on Web servers
• Three variants: XML, ABNF, N-gram
• Action tags for "semantic processing"
Three Forms of the Grammar ML
• XML
  – Modeled after the Java Speech Grammar Format
  – Mandatory for Dialog ML interpreters
  – Manually specified by the developer
• Augmented BNF syntax (ABNF)
  – Modeled after the Java Speech Grammar Format
  – Optional for Dialog ML interpreters
  – May be mapped to and from XML grammars
  – Manually specified by the developer
• N-grams
  – Optional for Dialog ML interpreters
  – Used for larger vocabularies
  – Generated statistically
<rule id="state" scope="public">
<one-of>
<item> Oregon </item>
<item>Maine </item>
</one-of>
</rule>
public $state = Oregon | Maine
Action Tags
• Specify which VoiceXML variables to set when grammar rules are matched to user input
• Based upon a subset of ECMAScript

$drink = coke | pepsi | coca cola {"coke"};

// medium is the default if nothing is said
$size = {"medium"} [small | medium | large | regular {"medium"}]
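In the XML form of the grammar, the same kind of semantic action can be attached with a tag element; a sketch of the $drink rule above (element names follow the era's drafts and may differ from the final specification):

```xml
<!-- Hedged sketch: XML form of the ABNF $drink rule above;
     the <tag> content supplies the value when "coca cola" matches. -->
<rule id="drink" scope="public">
  <one-of>
    <item> coke </item>
    <item> pepsi </item>
    <item> coca cola <tag>"coke"</tag> </item>
  </one-of>
</rule>
```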
N-Gram Language Models
• The likelihood of a given word following certain others
• Used as a linguistic model to identify the most likely sequence of words matching the spoken input
• N-grams are computed automatically from a corpus of many inputs
• The N-Gram Markup Language is used as an interchange format for supplying word and phrase statistics to a dictation ASR engine
Speech synthesis process
Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

IN: Dr. Jones lives at 175 Park Dr. He weighs 175 lb. He plays bass in a blues band. He also likes to fish; last week he caught a 20 lb. bass.
OUT: Doctor Jones lives at one seventy-five Park Drive. He weighs one hundred and seventy-five pounds. He plays bass in a blues band. He likes to fish; last week he caught a twenty-pound bass.

(Modeled after Sun's Java Speech Markup Language)
Speech Synthesis ML
Pipeline stage: Structure Analysis
Non-markup behavior: infer structure by automated text analysis
Markup support: paragraph, sentence

<paragraph>
  <sentence> This is the first sentence. </sentence>
  <sentence> This is the second sentence. </sentence>
</paragraph>
Speech Synthesis ML
Pipeline stage: Text Normalization
Non-markup behavior: automatically identify and convert constructs
Markup support: sayas for dates, times, etc.

Examples:
<sayas sub="World Wide Web Consortium"> W3C </sayas>
<sayas type="number:digits"> 175 </sayas>
Speech Synthesis ML
Pipeline stage: Text-to-Phoneme Conversion
Non-markup behavior: look up in a pronunciation dictionary
Markup support: phoneme, sayas

Example:
<phoneme alphabet="ipa" ph="tɒmɑtoʊ"> tomato </phoneme>
(International Phonetic Alphabet (IPA) using character entities)

Phonetic alphabets: International Phonetic Alphabet, Worldbet, X-SAMPA
Speech Synthesis ML
Pipeline stage: Prosody Analysis
Non-markup behavior: automatically generates prosody through analysis of document structure and sentence syntax
Markup support: emphasis, break, prosody

Examples:
<emphasis> Hi </emphasis>
<break time="3s"/>
<prosody rate="slow"/>

Prosody element attributes:
pitch: high, medium, low, default
contour
range: high, medium, low, default
rate: fast, medium, slow, default
volume: silent, soft, medium, loud, default
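The prosody attributes listed above can also be combined on a single element; a minimal sketch:

```xml
<!-- Sketch combining rate, pitch and volume from the attribute
     list above on a single <prosody> element. -->
<prosody rate="slow" pitch="low" volume="loud">
  Please listen carefully.
</prosody>
```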
Speech Synthesis ML
Pipeline stage: Waveform Production
Markup support: voice, audio

Examples:
<audio src="laughter.wav"> [laughter] </audio>
<voice age="child"> Mary had a little lamb </voice>

voice attributes:
gender: male, female, neutral
age: child, teenager, adult, elder, (integer)
variant: different, (integer)
name: default, (voice-name)
LexiconML - Why?
[Diagram: the voice application developer supplies a pronunciation lexicon, e.g.
<lexicon>
  either /iy th r/
  either /ay th r/
</lexicon>
The TTS engine picks one pronunciation of "either" (/ay th r/), while the ASR engine accepts both /ay th r/ and /iy th r/.]
• Accurate pronunciations are essential in EVERY speech application
• Platform default lexicons do not give 100% coverage of user speech
LexiconML - Key Requirements
• Meets both synthesis and recognition requirements
• Pronunciations for any language (including tonal)
  – Reuse standard alphabets, support for suprasegmentals
• Multiple pronunciations per word
• Alternate orthographies
  – Spelling variations: "colour" and "color"
  – Alternative writing systems: Japanese Kanji and Kana
  – Abbreviations and acronyms, e.g. Dr., BT
• Homophones, e.g. "read" and "reed" (same sound)
• Homographs, e.g. "read" and "read" (same spelling)
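An entry meeting the multiple-pronunciation and homograph requirements might look like this; purely illustrative, since the LexiconML format was still being defined at the time (all element names here are hypothetical):

```xml
<!-- Hypothetical LexiconML sketch: element names are invented to
     illustrate multiple pronunciations for the homograph "read". -->
<lexicon>
  <entry orthography="read">
    <pronunciation alphabet="ipa" ph="riːd"/> <!-- present tense -->
    <pronunciation alphabet="ipa" ph="rɛd"/>  <!-- past tense -->
  </entry>
</lexicon>
```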
Interaction Style
• Voice user interfaces needn't be dull
• Choose prompts to reflect an explicit choice of personality
• Introduce variety in prompts rather than always repeating the same thing
• Politeness, helpfulness and a sense of humor
• Target different groups of users, e.g. Gen Y
• Allow users to select a personality (skin)
(Personality Demo)
Call Control
[Diagram: the voice application developer supplies both VoiceXML for the dialog manager and call control instructions; together they manage the user's telephone connection.]
(Call control Demo)
Call Control Requirements
• Call management – place outbound calls, conditionally answer inbound calls, outbound fax
• Call leg management – create, redirect, interact while on hold
• Conference management – create, join, exit
• Intersession communication – asynchronous events
• Interpreter context – invoke, terminate
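A call control document covering these requirements might read as follows; this is entirely speculative, since no call control markup had been published at the time of the talk (every element and attribute name is invented):

```xml
<!-- Speculative sketch only: no W3C call control markup existed yet;
     all element and attribute names here are hypothetical. -->
<callcontrol>
  <answer condition="inbound"/>                   <!-- call management -->
  <createcall dest="tel:+18005551234" name="agent"/>
  <conference name="support">                     <!-- conference management -->
    <join leg="caller"/>
    <join leg="agent"/>
  </conference>
</callcontrol>
```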
Natural Language Semantics ML
[Diagram: ASR produces text, which flows through language understanding and context interpretation to yield NL Semantics; the voice application developer supplies the grammar and semantic tags.]
Natural Language Semantics ML
• Represents semantic interpretations of an utterance
  – Speech
  – Natural language text
  – Other forms (e.g. handwriting, OCR, DTMF)
• Used primarily as an interchange format among voice browser components
• Usually generated automatically, not authored directly by developers
• Goal is to use XForms as the data model
Result Interpretation
[Diagram: structure of an NL Semantics ML result. The interpretation element carries grammar, x-model and xmlns attributes plus a confidence score; the input is one of nomatch, noinput, or recognized text, annotated with mode, timestamp-start, timestamp-end and confidence. The meaning is held in an xf:model and xf:instance, whose application-specific elements are defined by an XForms data model, pairing the incoming data with its meaning.]
What toppings do you have?

<interpretation grammar="http://toppings"
                xmlns:xf="http://www.w3.org/xxx">
  <input mode="speech">what toppings do you have?</input>
  <xf:x-model>
    <xf:group xf:name="question">
      <xf:string xf:name="questioned_item"/>
      <xf:string xf:name="questioned_property"/>
    </xf:group>
  </xf:x-model>
  <xf:instance>
    <app:question>
      <app:questioned_item>toppings</app:questioned_item>
      <app:questioned_property>availability</app:questioned_property>
    </app:question>
  </xf:instance>
</interpretation>
Richer Natural Language
• Most current voice apps restrict users to keywords or short phrases
• The application does most of the talking
• The alternative is to use open grammars with word spotting and let the user do the talking
• Rules for figuring out what the user said, and why, form the basis for asking the next question
(GM/AskJeeves Demo)
Multimodal = Voice + Displays
• Say which city you want weather for and see the information on your phone
• Say which bands/CDs you want to buy and confirm the choices visually

"What is the weather in San Francisco?"
"I want to place an order for 'Hotshot' by Shaggy."
Multimodal Interaction
• Multimodal applications
  – Voice + display + keypad + stylus, etc.
  – The user is free to switch between voice interaction and use of display/keypad/clicking/handwriting
• July 2000: published the Multimodal Requirements draft
• Demonstrations of multimodal prototypes at the Paris face-to-face meeting of the Voice Browser WG
• September 2000: joint W3C/WAP Forum workshop on multimodal, Hong Kong
• February 2001: W3C publishes a multimodal Request for Proposals
• Plan to set up a Multimodal Working Group later this year, assuming we get appropriate submission(s)
Multimodal Interaction
• Primary market is mobile wireless
  – cell phones, personal digital assistants and cars
• Timescale is driven by the deployment of 3G networks
• Input modes: speech, keypads, pointing devices, and electronic ink
• Output modes: speech, audio, and bitmapped or character-cell displays
• Architecture should allow for both local and remote speech processing
Some Ideas …
• Speech-enabling XHTML (and WML) without requiring changes to the markup language
  – New ECMAScript speech object?
• Loose coupling of VoiceXML with externally defined pages written in XHTML, SMIL, etc.
  – Turn-driven synchronization protocol based on SIP?
• Distributed speech processing
  – Reduce load on the wireless network and speech servers
  – Increase recognition accuracy in the presence of noise
  – ETSI work on Aurora
• Using pen-based gestures to constrain ASR (click and speak)
W3C is seeking detailed proposals with broad industry support as the basis for chartering a multimodal working group.
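The idea of speech-enabling XHTML without markup changes could be sketched with a hypothetical ECMAScript speech object; nothing here is from any published API:

```xml
<!-- Hypothetical sketch: the "Speech" object and its methods are
     invented to illustrate the idea, not a real API. -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <script type="text/ecmascript">
      function listen(field) {
        // Hypothetical: recognize against a grammar, fill the field
        var result = Speech.recognize("city.gram");
        document.getElementById(field).value = result.text;
      }
    </script>
  </head>
  <body>
    <input type="text" id="city" onfocus="listen('city')"/>
  </body>
</html>
```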
VoiceXML IP Issues
• Technical work on VoiceXML 2.0 is proceeding well
• Publication of the VoiceXML 2.0 working draft is held up over IP issues (although an internal version is accessible to W3C Members)
• Related specifications for grammar, speech synthesis, natural language semantics, lexicon, and call control have been, or shortly will be, published
• W3C and VoiceXML Forum management are in the process of developing a formal Memorandum of Understanding
• W3C is convening a Patent Advisory Group to recommend an IP policy for re-chartering the Voice Browser Activity
  – Draw inspiration from IETF, ECTF, ETSI and other bodies, e.g. require all WG members to license essential IP under openly specified RAND terms, with operational criteria for effective terms expressed as exit criteria for the Candidate Recommendation phase; no requirement for advance disclosure of IP
Discussion?