Voice Browsing And Multimodal Interaction In 2009

Transcript of Voice Browsing And Multimodal Interaction In 2009

Page 1: Voice Browsing And Multimodal Interaction In 2009

March 6th, 2009

Voice Browser and Multimodal Interaction In 2009

Paolo Baggia, Director of International Standards

Google TechTalk

Page 2: Voice Browsing And Multimodal Interaction In 2009

Overview

A Bit of History

W3C Speech Interaction Framework Today
- ASR/DTMF
- TTS
- Lexicons
- Voice Dialog and Call Control
- Voice Platforms and Next Evolutions

W3C Multimodal Interaction Today
- MMI Architecture
- EMMA and InkML
- A Language for Emotions

The Near Future

Page 3: Voice Browsing And Multimodal Interaction In 2009

Company Profile

- Privately held company (fully owned by Telecom Italia), founded in 2001 as a spin-off from Telecom Italia Labs, capitalizing on 30 years of experience and expertise in voice processing.

- Global company, leader in Europe and South America for award-winning, high-quality voice technologies (synthesis, recognition, authentication and identification), available in 26 languages and 62 voices.

- Multilingual: proprietary technologies protected by over 100 patents worldwide.

- Financially robust: break-even reached in 2004, revenues and earnings growing year on year.

- Growth-plan investment approved for the evolution of products and services.

- Offices in New York; headquarters in Torino; local representative sales offices in Rome, Madrid, Paris, London and Munich.

- Flexible: about 100 employees, plus a vibrant ecosystem of local freelancers.


Page 4: Voice Browsing And Multimodal Interaction In 2009

International Awards

"Best Innovation in Automotive Speech Synthesis" Prize, AVIOS-SpeechTEK West 2007

"Best Innovation in Expressive Speech Synthesis" Prize, AVIOS-SpeechTEK West 2006

"Best Innovation in Multi-Lingual Speech Synthesis" Prize, AVIOS-SpeechTEK West 2005

"2008 Frost & Sullivan European Telematics and Infotainment Emerging Company of the Year" Award

Winner of "Market Leader - Best Speech Engine", Speech Industry Award 2007 and 2008

Loquendo MRCP Server: winner of the 2008 IP Contact Center Technology Pioneer Award

Page 5: Voice Browsing And Multimodal Interaction In 2009

A Bit of History

Page 6: Voice Browsing And Multimodal Interaction In 2009

Standard Bodies

Two main standard bodies:

W3C - World Wide Web Consortium
Founded in 1994 by Tim Berners-Lee, with a mission to lead the Web to its full potential. Staff based at MIT (USA), ERCIM (France) and Keio Univ. (Japan). 400 members all over the world; 50 Working, Interest and Coordination Groups. W3C is where the framework of today's Web is developed (HTML, CSS, XML, DOM, SOAP, RDF, OWL, VoiceXML, SVG, XSLT, P3P, Internationalization, Web Accessibility, Device Independence).

IETF - Internet Engineering Task Force
Founded in 1986, but grown since 1991 under the Internet Society. 1300 members. HTTP, SIP, RTP and many other protocols. The Media Resource Control Protocol (MRCP) is very relevant for speech platforms.

Two industrial forums:

VoiceXML Forum (www.voicexml.org)
Inventors of VoiceXML 1.0, then submitted to W3C for standardization. Its current goal is to promote, disseminate and support VoiceXML and related standards.

SALT Forum (www.saltforum.org)
Supported by Microsoft to define a lightweight markup for telephony and multimodal applications.

Other relevant bodies: 3GPP, OMA, ETSI, NIST

Page 7: Voice Browsing And Multimodal Interaction In 2009

The (r)evolution of VoiceXML, 1998-2009:

- 1998: W3C Voice Browser Workshop
- 1999: VoiceXML Forum born, by AT&T, IBM, Lucent and Motorola; W3C charters the Voice Browser WG
- 2000: VoiceXML 1.0 released
- 2002: W3C charters the Multimodal Interaction WG; SALT Forum born, by Cisco, Comverse, Intel, Microsoft, Philips and SpeechWorks
- 2004: VoiceXML 2.0 W3C Rec; SSML 1.0 W3C Rec; SRGS 1.0 W3C Rec
- 2007: SISR 1.0 W3C Rec; VoiceXML 2.1 W3C Rec
- 2008: PLS 1.0 W3C Rec
- 2009: EMMA 1.0 W3C Rec

[Photo: Preparing to announce VoiceXML 1.0, Friday Feb. 25th, 2000, Lucent, Naperville, Illinois. Left to right: Gerald Karam (AT&T), Linda Boyer (IBM), Ken Rehor (Lucent), Bruce Lucas (IBM), Pete Danielsen (Lucent), Jim Ferrans (Motorola), Dave Ladd (Motorola).]

Page 8: Voice Browsing And Multimodal Interaction In 2009

Speech Interface Framework in 2000 (by Jim Larson)

[Diagram: the user interacts through ASR, a DTMF tone recognizer, language understanding, TTS and a pre-recorded audio player; context interpretation feeds a dialog manager connected to the World Wide Web and the telephone system, with media planning and language generation on the output side. Associated languages: Speech Recognition Grammar Spec. (SRGS), N-gram Grammar ML, Natural Language Semantics ML, Semantic Interpretation for Speech Recognition (SISR), Speech Synthesis Markup Language (SSML), Pronunciation Lexicon Specification (PLS), Call Control XML (CCXML) and reusable components, VoiceXML 2.0, VoiceXML 2.1, EMMA.]

Page 9: Voice Browsing And Multimodal Interaction In 2009

[Diagram: the same Speech Interface Framework, updated: Natural Language Semantics ML is replaced by EMMA 1.0 alongside SRGS, SISR, SSML, PLS, CCXML, VoiceXML 2.0 and 2.1; Language Understanding and Context Interpretation are highlighted as the components still without W3C languages.]

Speech Interface Framework - Today (by Jim Larson)

Page 10: Voice Browsing And Multimodal Interaction In 2009

[Diagram: the same framework projected to the end of 2009, with EMMA 1.0 now a W3C Recommendation and SRGS, SISR, SSML, PLS, CCXML and VoiceXML 2.0/2.1 covering the path from user input to output.]

Speech Interface Framework - End of 2009 (by Jim Larson)

Page 11: Voice Browsing And Multimodal Interaction In 2009

W3C Process

Page 12: Voice Browsing And Multimodal Interaction In 2009

Architectural Changes

Traditional (proprietary) architecture:
User → speech application on a proprietary platform (ASR/DTMF in, TTS/audio out), built with a proprietary Service Creation Environment (SCE).

VoiceXML architecture:
User → VoiceXML browser on a VoiceXML platform (ASR/DTMF in, TTS/audio out), fetching resources from a Web application over HTTP: .vxml pages, .grxml/.gram grammars, .ssml prompts, .wav/.mp3 audio, .pls lexicons.

Page 13: Voice Browsing And Multimodal Interaction In 2009

The VoiceXML Impact

VoiceXML changed the landscape of IVRs and speech application creation: from proprietary to standards-based speech applications.

Before:
- Proprietary platforms (HW & SW)
- Proprietary applications (built with a proprietary SCE)
- Mainly DTMF and pre-recorded prompts
- First attempts to add speech into IVR

After:
- Standard VoiceXML platforms
- Standards for speech technologies
- Standard tools for VoiceXML applications
- Integration of DTMF and ASR
- Still a predominance of DTMF, but more and more speech applications

Page 14: Voice Browsing And Multimodal Interaction In 2009

Overview

A Bit of History

W3C Speech Interaction Framework Today
- ASR/DTMF
- TTS
- Lexicons
- Voice Dialog and Call Control
- Voice Platforms and Next Evolutions

W3C Multimodal Interaction Today
- MMI Architecture
- EMMA and InkML
- A Language for Emotions

The Near Future

Page 15: Voice Browsing And Multimodal Interaction In 2009

Standards for ASR and DTMF: SRGS 1.0, SISR 1.0

Page 16: Voice Browsing And Multimodal Interaction In 2009

W3C Standards for Speech/DTMF Grammars

SYNTAX (SRGS): defines constraints on the admissible sentences for a specific recognition turn. Two input modes (voice and DTMF) and two formats (ABNF and XML).

SEMANTICS (SISR): describes how to produce results after an utterance is recognized. Two flavors: literal and script.

http://www.w3.org/TR/speech-grammar/
http://www.w3.org/TR/semantic-interpretation/

Page 17: Voice Browsing And Multimodal Interaction In 2009

SRGS/SISR Grammars for “Torino”

SRGS ABNF, SISR script:

  #ABNF 1.0 iso-8859-1;
  mode voice;
  tag-format <semantics/1.0>;
  {var unused=7;};
  public $main = Torino {out="10100";};

SRGS XML, SISR script:

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xml:lang="en-US" version="1.0"
      xmlns="http://www.w3.org/2001/06/grammar"
      tag-format="semantics/1.0">
    <tag>var unused=7;</tag>
    <rule id="main" scope="public">
      <token>Torino</token><tag>out="10100";</tag>
    </rule>
  </grammar>

SRGS ABNF, SISR literal:

  #ABNF 1.0 iso-8859-1;
  mode voice;
  tag-format <semantics/1.0-literals>;
  public $main = Torino {10100};

SRGS XML, SISR literal:

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xml:lang="en-US" version="1.0"
      xmlns="http://www.w3.org/2001/06/grammar"
      tag-format="semantics/1.0-literals">
    <rule id="main" scope="public">
      <token>Torino</token><tag>10100</tag>
    </rule>
  </grammar>

Page 18: Voice Browsing And Multimodal Interaction In 2009

SRGS/SISR Standards – Pros

Powerful syntax (CFG) and very powerful semantics (ECMAScript). DTMF and voice input are transparent to the application. Wide and consistent adoption among technology vendors.

The two syntaxes, XML and ABNF, are great:
- Developers can choose (XML validation vs. compact format)
- Transformations are possible: XML to ABNF (easy, a simple XSLT); ABNF to XML (requires an ABNF parser)
- Open-source tools might be created to:
  - Validate grammar syntax
  - Transform grammars
  - Debug grammars on written input
  - Run coverage tests: explode covered sentences, GenSem, SemTester, etc.

Page 19: Voice Browsing And Multimodal Interaction In 2009

SRGS/SISR Standards – Small Issues

Semantics declaration, the tag-format attribute:
- Value "semantics/1.0": mandates SISR script semantics inside semantic tags
- Value "semantics/1.0-literals": mandates SISR literal semantics inside semantic tags
- Missing: unclear! Risk of interoperability troubles

SISR script semantics:
- Clumsy default assignment: returns the last referenced rule only
- The developer must properly pop up results
- Be careful when redefining "out": assigning a scalar value might result in errors

SISR literal semantics:
- Only useful for very simple word-list rules
- No support for encapsulating rules
- SISR literal grammars as external references ONLY!
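As a minimal sketch of popping up results (not from the talk; the $city subrule and its values are hypothetical), an explicit assignment in the parent rule avoids relying on the clumsy default:

  #ABNF 1.0 iso-8859-1;
  mode voice;
  tag-format <semantics/1.0>;

  // Without the tag below, $main would return only the default
  // result of the last referenced rule.
  public $main = from $city {out.origin = rules.city;};
  $city = Torino {out="10100";} | Milano {out="20121";};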

Page 20: Voice Browsing And Multimodal Interaction In 2009

SRGS/SISR – Encapsulated Grammars

[Diagram: a tree of grammar files referencing one another, mixing formats and semantics: Gr1.grxml (script) referencing Gr2.gram (literal) and Gr3.grxml (script), which in turn references Gr41.grxml (literal) and Gr42.gram (script).]

Page 21: Voice Browsing And Multimodal Interaction In 2009

SRGS/SISR Standards – Rich XML Results

Section 7 of the SISR 1.0 specification:
http://www.w3.org/TR/semantic-interpretation/#SI7

Serialization rules from SISR ECMAScript results into XML. Edge cases: arrays; the special variables "_attributes" and "_value"; creation of namespaces and prefixes:

  {
    drink: {
      _nsdecl: {
        _prefix: "n1",
        _name: "http://www.example.com/n1"
      },
      _nsprefix: "n1",
      liquid: {
        _nsdecl: {
          _prefix: "n2",
          _name: "http://www.example.com/n2"
        },
        _attributes: {
          color: {
            _nsprefix: "n2",
            _value: "black"
          }
        },
        _value: "coke"
      },
      size: "medium"
    }
  }

  <n1:drink xmlns:n1="http://www.example.com/n1">
    <liquid n2:color="black"
        xmlns:n2="http://www.example.com/n2">coke</liquid>
    <size>medium</size>
  </n1:drink>

Page 22: Voice Browsing And Multimodal Interaction In 2009

SRGS/SISR Standards – Next Steps

Adoption of the PLS 1.0 lexicon:
- Clear entry point into PLS lexicons: the <token> element
- A role attribute is still missing on <token> to allow homograph disambiguation

Next extensions via errata:
- XML 1.1 support and the Implementation Report
- Update normative references

→ No major extensions are needed!

Page 23: Voice Browsing And Multimodal Interaction In 2009

Speech Synthesis: SSML 1.0/1.1

Page 24: Voice Browsing And Multimodal Interaction In 2009

TTS – Functional Architecture and Markup/Non-Markup support

Structure analysis
  Markup support: <p>, <s>. Non-markup: infer the structure by automatic text analysis.

Text normalization
  Markup support: <say-as> for dates, times, phone numbers, numbers; <sub> for acronyms and transliterations. Non-markup: automatically identify and convert constructs.

Text-to-phoneme conversion
  Markup support: <phoneme>, <lexicon>. Non-markup: look up in a pronunciation dictionary.

Prosody analysis
  Markup support: <emphasis>, <break>, <prosody>. Non-markup: automatically generate prosody through analysis of document structure and sentence syntax.

Waveform production
  Markup support: <voice>, <audio>.

http://www.w3.org/TR/speech-synthesis/

Page 25: Voice Browsing And Multimodal Interaction In 2009

SSML 1.0 – Language description (I)

Document structure: the <speak> root element, the version attribute, the SSML namespace attribute, languages via xml:lang:

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
      xml:lang="en-US">
    <p>I don't speak Japanese.</p>
    <p xml:lang="ja">Nihongo-ga wakarimasen.</p>
  </speak>

Processing and pronunciation:
- <p> and <s> (paragraph and sentence): give a structure to the text
- <say-as>: indicates the type of text construct contained within the element (e.g. dates, numbers)
- <phoneme>: provides a phonetic pronunciation, in IPA, for the contained text
- <sub>: provides substitutions, e.g. for expanding acronyms into sequences of words

http://www.w3.org/TR/speech-synthesis/
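As a quick illustration (a minimal sketch, not from the talk; the interpret-as values follow the separate W3C say-as note rather than SSML 1.0 itself):

  <?xml version="1.0" encoding="UTF-8"?>
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
      xml:lang="en-US">
    <p>
      <s>The talk was given on
        <say-as interpret-as="date" format="mdy">3/6/2009</say-as>.</s>
      <s><sub alias="World Wide Web Consortium">W3C</sub> publishes
        <phoneme alphabet="ipa" ph="spiːtʃ">speech</phoneme> standards.</s>
    </p>
  </speak>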

Page 26: Voice Browsing And Multimodal Interaction In 2009

SSML 1.0 – Language description (II)

Style:
- <voice> element

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <speak version="1.0"
      xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    The moon is rising on the beach, when John says, looking Mary
    in the eyes: <voice name="simon">I love you!</voice>
    but she suddenly replies:
    <voice name="susan">Please, be serious!</voice>
  </speak>

  The voice selection attributes are: name, xml:lang, gender, age and variant.

- <emphasis> element: requests that the contained text be spoken with emphasis; the level attribute can be set to strong, moderate, reduced or none.

- <break> element: controls the pausing between words; the time attribute takes time expressions ("5s", "20ms"), while the strength attribute takes one of none, x-weak, weak, medium (the default), strong or x-strong.

http://www.w3.org/TR/speech-synthesis/

Page 27: Voice Browsing And Multimodal Interaction In 2009

SSML 1.0 – Language description (III)

Prosody: the <prosody> element permits control of the pitch, speaking rate and volume of the speech output. Its attributes are:
- volume: the volume for the contained text
- rate: the speaking rate, in words per minute, for the contained text
- duration: a value in seconds or milliseconds for the desired time to take to read the element contents
- pitch: the baseline pitch for the contained text
- range: the pitch range (variability) for the contained text, in Hertz
- contour: sets the actual pitch contour for the contained text

Other elements:
- <audio>: plays an audio file
- <mark>: places a marker into the text/tag sequence
- <desc>: provides a description of a non-speech audio source in <audio>

http://www.w3.org/TR/speech-synthesis/
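A minimal sketch combining these elements (not from the talk; the audio URL is hypothetical):

  <?xml version="1.0" encoding="UTF-8"?>
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
      xml:lang="en-US">
    <prosody rate="slow" volume="loud" pitch="+10%">
      This sentence is slower, louder and slightly higher than the default.
    </prosody>
    <break strength="x-strong"/>
    <audio src="http://www.example.com/chime.wav">
      <desc>alert chime</desc>
      Ding! <!-- fallback text, spoken if the audio cannot be played -->
    </audio>
    <mark name="after-chime"/>
  </speak>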

Page 28: Voice Browsing And Multimodal Interaction In 2009

Towards SSML 1.1 – Motivations

Internationalization needs:
- Three workshops: Beijing (Nov '05), Crete (May '06), Hyderabad (Jan '07)
- Results:
  - No major needs for Eastern and Western European languages
  - Many issues for Far East languages (Mandarin, Japanese, Korean)
  - Some specific issues for Semitic languages (Arabic, Hebrew), Farsi and many Indian languages: mark input with or without vowels; mark the transliteration scheme used for input

Extensions required by the Voice Browser WG:
- More powerful error handling, selection of fall-back strategies
- Trimming attributes
- Volume attribute moved to a logarithmic scale (before it was linear)

Alignment with the PLS 1.0 specification for user lexicons.

http://www.w3.org/TR/speech-synthesis11/

Page 29: Voice Browsing And Multimodal Interaction In 2009

SSML 1.1 – Language Changes

Lexicon extensions:
- <w> element, to delimit tokens/words in the input text
- <lookup> element, to scope the application of a referenced lexicon to the contained text

Phonetic Alphabet Registry creation and adoption:
- "ipa" for the International Phonetic Alphabet
- A registration policy for other phonetic alphabets, similar to LTRU for language tags
- Candidates: Pinyin for Mandarin Chinese; JEITA for Japanese; X-SAMPA, an ASCII transliteration of IPA codes

http://www.w3.org/TR/speech-synthesis11/

Page 30: Voice Browsing And Multimodal Interaction In 2009

Pronunciation Lexicon: PLS 1.0

Page 31: Voice Browsing And Multimodal Interaction In 2009

Pronunciation Lexicons

A pronunciation lexicon is a mapping between words (or short phrases), their written representations, and their pronunciations, suitable for use by an ASR engine or a TTS engine.

Pronunciation lexicons are not only useful for voice browsers. They have also proven an effective mechanism to support accessibility for the differently abled, as well as greater usability for all users. They are used to good effect in screen readers and in user agents supporting multimodal interfaces.

The W3C Pronunciation Lexicon Specification (PLS) Version 1.0 is designed to enable the interoperable specification of pronunciation lexicons.

http://www.w3.org/TR/pronunciation-lexicon/

Page 32: Voice Browsing And Multimodal Interaction In 2009

PLS 1.0 – Language Overview

A PLS document is a container (<lexicon>) of several lexical entries (<lexeme>).

Each lexical entry contains one or more spellings (<grapheme>) and one or more pronunciations (<phoneme>) or substitutions (<alias>).

Each PLS document is related to a single language (xml:lang).

SSML 1.0 and SRGS 1.0 documents can reference one or more PLS documents.

The current version doesn't include morphological, syntactic or semantic information associated with pronunciations.

http://www.w3.org/TR/pronunciation-lexicon/

Page 33: Voice Browsing And Multimodal Interaction In 2009

PLS 1.0 – An Example

<?xml version="1.0" encoding="UTF-8"?><lexicon version="1.0"

xmlns=" http://www.w3.org/2005/01/pronunciation-lexicon "xmlns:xsi=" http://www.w3.org/2001/XMLSchema-instance " xsi:schemaLocation=" http://www.w3.org/2005/01/pronunciation-lexicon

http://www.w3.org/TR/pronunciation-lexicon/pls.xsd "alphabet =" ipa " xml:lang =" en-US ">

<lexeme ><grapheme >Sepulveda</ grapheme ><phoneme>sə ˈ̍̍̍pȜȜȜȜlv ǺǺǺǺdə</ phoneme>

</ lexeme >

<lexeme ><grapheme >W3C</grapheme ><alias >World Wide Web Consortium</ alias >

</ lexeme >

</ lexicon >

http://www.w3.org/TR/pronunciation-lexicon/

Page 34: Voice Browsing And Multimodal Interaction In 2009

PLS 1.0 – Used for TTS

SSML 1.0:

  <?xml version="1.0" encoding="UTF-8"?>
  <speak version="1.0" … xml:lang="en-US">
    <lexicon uri="http://www.example.com/SSMLexample.pls"/>
    The title of the movie is: "La vita è bella"
    (Life is beautiful), which is directed by Benigni.
  </speak>

PLS 1.0:

  <?xml version="1.0" encoding="UTF-8"?>
  <lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
    <lexeme>
      <grapheme>La vita è bella</grapheme>
      <phoneme>ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</phoneme>
    </lexeme>
    <lexeme>
      <grapheme>Benigni</grapheme>
      <phoneme>bɛˈniːnji</phoneme>
    </lexeme>
  </lexicon>

http://www.w3.org/TR/pronunciation-lexicon/

Page 35: Voice Browsing And Multimodal Interaction In 2009

PLS 1.0 – Used for ASR

SRGS 1.0:

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar version="1.0" xml:lang="en-US" root="movies" mode="voice">
    <lexicon uri="http://www.example.com/SRGSexample.pls"/>
    <rule id="movies" scope="public">
      <one-of>
        <item>Terminator 2: Judgment Day</item>
        <item>Pluto's Judgement Day</item>
      </one-of>
    </rule>
  </grammar>

PLS 1.0:

  <?xml version="1.0" encoding="UTF-8"?>
  <lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
    <lexeme>
      <grapheme>judgment</grapheme>
      <grapheme>judgement</grapheme>
      <phoneme>ˈdʒʌdʒ.mənt</phoneme>
    </lexeme>
  </lexicon>

http://www.w3.org/TR/pronunciation-lexicon/

Page 36: Voice Browsing And Multimodal Interaction In 2009

Examples of Use

Multiple pronunciations for the same orthography

Multiple orthographies

Homophones

Homographs

Acronyms, Abbreviations, etc.

Detailed descriptions can be found in: the W3C specification, Wikipedia, and Paolo Baggia's SpeechTEK 2008 & Voice Search 2009 talks.
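A minimal sketch of one of these cases (not from the talk): the homograph "bass", split by part of speech via the role attribute, assuming a hypothetical pos namespace for the role values:

  <?xml version="1.0" encoding="UTF-8"?>
  <lexicon version="1.0" alphabet="ipa" xml:lang="en-US"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:pos="http://www.example.com/part-of-speech">
    <!-- same spelling, two pronunciations, disambiguated by role -->
    <lexeme role="pos:noun-music">
      <grapheme>bass</grapheme>
      <phoneme>beɪs</phoneme>
    </lexeme>
    <lexeme role="pos:noun-fish">
      <grapheme>bass</grapheme>
      <phoneme>bæs</phoneme>
    </lexeme>
  </lexicon>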

Page 37: Voice Browsing And Multimodal Interaction In 2009

PLS 1.0 – Open Issues

No wide support of IPA in speech engines:
- Changes are slowly under way
- The Phonetic Alphabet Registry will open the door to other alphabets in a controlled and interoperable way

Integration in ASR/TTS:
- SSML 1.1 will interoperate with PLS 1.0
- SRGS 1.0 is still missing support for a role attribute for PLS 1.0

No matching algorithm inside PLS, because it is mainly a data format.

http://www.w3.org/TR/pronunciation-lexicon/

Page 38: Voice Browsing And Multimodal Interaction In 2009

Pronunciation Alphabets: IPA, SAMPA

Page 39: Voice Browsing And Multimodal Interaction In 2009

International Phonetic Alphabet

Pronunciation is represented by a phonetic alphabet:
- Standard phonetic alphabet: the International Phonetic Alphabet (IPA)
- Well-known phonetic alphabets: SAMPA (ASCII-based, simple to write), Pinyin (Mandarin Chinese), JEITA (Japanese), etc.
- Proprietary phonetic alphabets

International Phonetic Alphabet (IPA):
- Created by the International Phonetic Association (active since 1886), a collaborative effort by all the major phoneticians around the world
- A universally agreed system of notation for the sounds of languages
- Covers all languages
- Requires Unicode to write it
- Normatively referenced by PLS

Page 40: Voice Browsing And Multimodal Interaction In 2009

IPA – Chart

The IPA was founded in 1886. It is the major international association of phoneticians. The IPA alphabet provides symbols making possible the phonemic transcription of all known languages.

IPA characters can be encoded in Unicode by supplementing ASCII with characters from other ranges, particularly:
- IPA Extensions (0250–02AF)
- Latin Extended-A (0100–017F)

See the detailed charts: http://www.unicode.org/charts

[Figure: the IPA chart.]

Page 41: Voice Browsing And Multimodal Interaction In 2009

Phonetic Alphabets – Issues

The real problem is how to write pronunciations reliably, unless you are a trained phonetician.

There are issues with fonts, authoring and browsers, but Unicode fonts today support the IPA extensions; see:
http://www.phon.ucl.ac.uk/home/wells/phoneticsymbols.htm

There are very few tools to help write pronunciations and to let you listen to what you have written.

→ Make pronunciations available in IPA or other general phonetic alphabets.

Page 42: Voice Browsing And Multimodal Interaction In 2009

Voice Dialog Languages: VoiceXML 2.0, VoiceXML 2.1

Page 43: Voice Browsing And Multimodal Interaction In 2009

VoiceXML 2.0 – Features, Elements

Menus, forms, sub-dialogs: <menu>, <form>, <subdialog>

Input:
- Speech recognition: <grammar>
- Recording: <record>
- Keypad: <grammar mode="dtmf">

Output:
- Audio files: <audio>
- Text-to-speech: <prompt>

Variables (ECMA-262): <var>, <assign>, <script>, with scoping rules

Events: <nomatch>, <noinput>, <help>, <catch>, <throw>

Transition and submission: <goto>, <submit>

Telephony:
- Connection control: <transfer>, <disconnect>
- Telephony information

Platform specifics: <object>

Performance: fetch properties

http://www.w3.org/TR/voicexml20/

Page 44: Voice Browsing And Multimodal Interaction In 2009

VoiceXML 2.0 – Execution Model

Execution is synchronous:
- Only the disconnect event is handled (somewhat) asynchronously

Execution is always in a single dialog, a <form> or <menu>:
- The Form Interpretation Algorithm (FIA) selects the next <field>

Prompts are queued:
- Played only when a waiting state is encountered
- Played before a fetchaudio is started

Processing is always in one of two states:
- Waiting for input in an input item: <field>, <record>, <transfer>, etc.
- Transitioning between input items in response to an input

Event-driven:
- <nomatch>, <noinput>: user input event handling
- <catch>, <throw>: generalized event mechanism
- connection.*: call event handling
- error.*: error event handling

http://www.w3.org/TR/voicexml20/
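A minimal sketch of this model (not from the talk; cities.grxml and the submit URL are hypothetical): the FIA visits the unfilled field, queues its prompt, waits for input, and the catch handlers react to noinput/nomatch:

  <?xml version="1.0" encoding="UTF-8"?>
  <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
    <form id="flight">
      <field name="city">
        <prompt>Which city do you want to fly to?</prompt>
        <grammar src="cities.grxml" type="application/srgs+xml"/>
        <noinput>Sorry, I didn't hear you. <reprompt/></noinput>
        <nomatch>Sorry, I didn't get that. <reprompt/></nomatch>
        <filled>
          <!-- the transition returns the collected data to the server -->
          <submit next="http://www.example.com/book" namelist="city"/>
        </filled>
      </field>
    </form>
  </vxml>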

Page 45: Voice Browsing And Multimodal Interaction In 2009

VoiceXML 2.1 – Extended Features

Dynamically reference grammars and scripts:
- <grammar srcexpr="…">, <script srcexpr="…">

Record the user's utterance during form filling:
- recordutterance property; new shadow variables: recording, recordingsize, recordingduration

Detect barge-in during prompt playback (SSML <mark>):
- markexpr attribute; new shadow variables: markname and marktime

Fetch XML data without transition:
- <data> element, with a read-only subset of the DOM

Dynamically concatenate prompts:
- <foreach>, to iterate through ECMAScript arrays and execute content

Send data upon disconnect:
- <disconnect namelist="…">

Additional transfer type:
- <transfer type="consultation">

http://www.w3.org/TR/voicexml21/
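For example, a minimal sketch of utterance recording (not from the talk; the grammar and verification URL are hypothetical):

  <?xml version="1.0" encoding="UTF-8"?>
  <vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
    <form>
      <property name="recordutterance" value="true"/>
      <field name="account">
        <prompt>Say your account name.</prompt>
        <grammar src="accounts.grxml" type="application/srgs+xml"/>
        <filled>
          <!-- ship the captured audio to the server for later analysis -->
          <submit next="http://www.example.com/verify"
              namelist="application.lastresult$.recording"
              method="post" enctype="multipart/form-data"/>
        </filled>
      </field>
    </form>
  </vxml>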

Page 46: Voice Browsing And Multimodal Interaction In 2009

VoiceXML Applications

Static VoiceXML applications:
- The VoiceXML page is always the same, and so is the user experience
- No personalization or customization

Dynamic VoiceXML applications:
- The user experience is customized: after authentication (PIN), or using the caller-id or SIP-id
- Data driven
- Dynamic pages generated at runtime, e.g. by JSP, ASP, etc.

http://www.w3.org/TR/voicexml20/
http://www.w3.org/TR/voicexml21/

Page 47: Voice Browsing And Multimodal Interaction In 2009

A Drawback of VoiceXML 2.0

A drawback of VoiceXML is that the transition from one VoiceXML page to another is a costly activity:
- Fetch the new page, if not cached
- Parse the page
- Initialize the context, possibly loading and initializing a new application root document
- Load or pre-compile scripts

Transitions are the only way to return data to the Web application (if the VoiceXML is dynamic), and pages must be created to include dynamic data.

→ VoiceXML 2.1 addresses part of this drawback by feeding dynamic data to a running VoiceXML page.

http://www.w3.org/TR/voicexml20/
http://www.w3.org/TR/voicexml21/

Page 48: Voice Browsing And Multimodal Interaction In 2009

Advantages of VoiceXML 2.1 - AJAX

Two of the eight new features in VoiceXML 2.1 help create more dynamic VoiceXML applications: the <data> element and the <foreach> element.

A static VoiceXML document can fetch user-specific data at runtime, without changing the VoiceXML document:
- <data> allows the retrieval of arbitrary XML data without a VoiceXML document transition; the returned XML data are accessible through a subset of the DOM primitives
- <foreach> extends prompts to iterate over a dynamic array of information, building a dynamic prompt

This is similar to AJAX programming for HTML services: it decouples the presentation layer (VoiceXML) from the business logic (accessed via <data>).

http://www.w3.org/TR/voicexml21/

Page 49: Voice Browsing And Multimodal Interaction In 2009

VoiceXML 2.1 – <data> Element

Attributes:
- name: the variable to be filled with the DOM of the retrieved data
- src or srcexpr: the URI of the location of the XML data
- namelist: the list of variables to be submitted
- method: either "get" or "post"
- enctype: media encoding
- fetch and caching attributes

Like <var>, it may appear in executable content and as a child of <form> and <vxml>.

The value of name must be a declared variable; the platform fills that variable with the DOM of the fetched XML data. The <data> element is synchronous (the service stops to get the data).

http://www.w3.org/TR/voicexml21/
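A minimal sketch (not from the talk): a hypothetical getquote.xml service returns, say, <quote>42</quote>, which the page reads through the read-only DOM subset without any transition:

  <?xml version="1.0" encoding="UTF-8"?>
  <vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
    <form>
      <var name="quote"/>
      <block>
        <!-- fetch XML data without leaving the running page -->
        <data name="quote" src="http://www.example.com/getquote.xml"/>
        <prompt>
          The current price is
          <value expr="quote.documentElement.firstChild.nodeValue"/> dollars.
        </prompt>
      </block>
    </form>
  </vxml>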

Page 50: Voice Browsing And Multimodal Interaction In 2009

VoiceXML 2.1 – <foreach> Element

Attributes:
- array: an ECMAScript expression that must evaluate to an ECMAScript array
- item: the variable that stores each element to be processed

<foreach> allows the application to iterate over an ECMAScript array and execute the contained content. It may appear:
- In executable content (all executable-content elements may appear as content of <foreach>)
- In <prompt> (with restrictions on the content)

<foreach> allows sophisticated concatenation of prompts.

http://www.w3.org/TR/voicexml21/
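A minimal sketch (not from the talk), building a prompt from a hypothetical flights array:

  <?xml version="1.0" encoding="UTF-8"?>
  <vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
    <form>
      <script>
        var flights = [ {time: "9 A M", dest: "Boston"},
                        {time: "noon",  dest: "Denver"} ];
      </script>
      <block>
        <prompt>
          <!-- one sentence per array element -->
          <foreach item="flight" array="flights">
            There is a flight at <value expr="flight.time"/>
            to <value expr="flight.dest"/>.
          </foreach>
        </prompt>
      </block>
    </form>
  </vxml>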

Page 51: Voice Browsing And Multimodal Interaction In 2009

VoiceXML – Final Remarks

The changed landscape for speech application development:
- Virtually all IVRs today support VoiceXML
- New options related to VoiceXML:
  - SIP-based VoiceXML platforms (Loquendo, Voxpilot, Voxeo, VoiceGenie)
  - Large-scale hosting of speech applications (TellMe, Voxeo)
  - Development tools (VoiceObjects, Audium, SpeechVillage, Syntellect, etc.)
- Further changes may come from CCXML adoption

… but:
- Mainly system-driven applications are actually deployed
- New challenges, incorporating more powerful dialog strategies and mixed initiative, are under discussion

http://www.w3.org/TR/voicexml20/
http://www.w3.org/TR/voicexml21/

Page 52: Voice Browsing And Multimodal Interaction In 2009

VoiceXML Resources

Voice Browser Working Group (specs, FAQ, implementations, resources):
http://www.w3.org/Voice/

VoiceXML Forum site (resources, education, interest groups):
http://www.voicexml.org/

VoiceXML Forum Review:
http://www.voicexmlreview.org/
Interesting articles related to VoiceXML and more; example code in the sections "First Words" and "Speak & Listen"

Ken Rehor's World of VoiceXML:
http://www.kenrehor.com/voicexml

Online documentation related to VoiceXML platforms:
Loquendo Café, Voxeo (http://www.vxml.org/), TellMe, VoiceGenie

Many books on VoiceXML, e.g.:
- Jim Larson, "VoiceXML: Introduction to Developing Speech Applications", Prentice-Hall, 2002
- A. Hocek, D. Cuddihy, "Definitive VoiceXML", Prentice-Hall, 2002

Page 53: Voice Browsing And Multimodal Interaction In 2009

Call Control: CCXML 1.0

Page 54: Voice Browsing And Multimodal Interaction In 2009

CCXML 1.0 – Highlights

- Asynchronous event processing
- Acceptance or refusal of an incoming call
- Management of different types of call transfer
- Outbound call activation (interaction with an external entity)
- Use of ECMAScript, adding scripting capabilities to call control applications
- VoiceXML modularization
- Conferencing management

Page 55: Voice Browsing And Multimodal Interaction In 2009

CCXML 1.0 – Elements Relationship

Page 56: Voice Browsing And Multimodal Interaction In 2009

CCXML 1.0 – Incoming Call

Event catching and processing: the platform delivers a connection.alerting event to the CCXML interpreter, which matches it against the <transition> elements of the document.

  event$: { name: 'connection.alerting';
            connectionid: '0239023901903993';
            eventid: '00001'; … }

CCXML document:

  <?xml version="1.0" encoding="UTF-8"?>
  <ccxml version="1.0">
    […]
    <transition event="connection.alerting">
      […]
    </transition>

    <transition event="connection.disconnected">
      […]
    </transition>
  </ccxml>

http://www.w3.org/TR/ccxml

Page 57: Voice Browsing And Multimodal Interaction In 2009

CCXML 1.0 – connection.alerting Event

Basic telephony information is retrieved from the alerting event and is available in the CCXML document: local URI, remote URI, protocol used, redirection info, etc.

Based on this information, CCXML can accept or refuse the incoming call, even before contacting the dialog server.

Any error that occurs during the phone call can be managed by the CCXML service (connection.failed, error.connection events).

[Diagram: the Call Control Adapter delivers connection.alerting to the CCXML interpreter, which analyzes the event$ content and replies with <accept/> or <reject/> before involving the VoiceXML interpreter.]

http://www.w3.org/TR/ccxml

Page 58: Voice Browsing And Multimodal Interaction In 2009

CCXML 1.0 – How to activate a new dialog

CCXML actions:
- Receive the alerting event from the Call Control Adapter
- Ask the dialog server to prepare a new dialog
- Wait for the preparation
- If the dialog has been successfully prepared, accept the call
- Ask the dialog server to start the prepared dialog

[Sequence: Call Control Adapter → CCXML: alerting; CCXML → VoiceXML interpreter: prepare a new dialog; VoiceXML → CCXML: dialog prepared; CCXML → Adapter: call accepted; Adapter → CCXML: connected; CCXML → VoiceXML: start the prepared dialog; VoiceXML → CCXML: dialog started.]

Page 59: Voice Browsing And Multimodal Interaction In 2009

Call transfer

CCXML supports call transfer in different modalities: "bridge", "blind", "consultation".

Based on the features of each modality, the CCXML language allows the expected interaction with the Call Control Adapter to correctly perform the transfer.

During the different phases of a call transfer, CCXML can receive any asynchronous event and correctly manage it, interrupting the call if requested.

[Sequence: while performing a transfer, the CCXML interpreter and the Call Control Adapter exchange commands and answers (command1/answer1, …) until the transfer completes; the VoiceXML interpreter is notified.]

Page 60: Voice Browsing And Multimodal Interaction In 2009

External Events

The CCXML Interpreter Context can receive events from any external entity able to use the HTTP protocol. Events generated in this way must be sent to the CCXML interpreter via an HTTP POST command.

The delivered event:
- can be addressed to a new session, whose creation must be requested, or
- can be addressed to an existing session, by specifying its ID in the request

[Diagram: an external entity posts a "basic http" event to the CCXML interpreter, which manages it and returns the result of the event management.]

http://www.w3.org/TR/ccxml

Page 61: Voice Browsing And Multimodal Interaction In 2009

External event on a new session: the Outbound Call

A request arrives at the Call Control from an external entity. The CCXML service associated with the received event is started, and a set of operations between the Call Control Adapter, the Call Control and the Dialog Server is activated: the outbound call is thus placed.

[Sequence: external entity → CCXML: outbound call request; CCXML → Call Control Adapter: create a call; Adapter → CCXML: connection progressing … connection connected; CCXML → VoiceXML interpreter: prepare a dialog; VoiceXML → CCXML: prepared; CCXML → VoiceXML: start the prepared dialog.]

Page 62: Voice Browsing And Multimodal Interaction In 2009

External event on a session: dialog termination request

An external entity performs an HTTP POST request towards the CCXML Interpreter Context, specifying a session id and requesting the termination of a particular dialog. The CCXML interpreter checks the session id and, if it is valid, injects the received event into that session. The CCXML service has a transition on that event and performs the dialog termination on a particular dialog identifier.

[Sequence: external entity → CCXML: dialog termination request; CCXML → VoiceXML interpreter: dialogterminate (dialogid); VoiceXML → CCXML: dialog.exit; then, depending on the dialog.exit event management, CCXML → Call Control Adapter: disconnect (connectionid).]

Page 63: Voice Browsing And Multimodal Interaction In 2009

Loading different CCXML documents: the <fetch> and <goto> elements

The <fetch> and <goto> elements are used, respectively, to asynchronously fetch the content identified by the attributes of <fetch>, and to move into a fetched document once it has loaded successfully:

  <fetch next="'http://../Fetch/doc1.ccxml'"
      type="'application/ccxml+xml'" fetchid="result"/>

The platform answers with fetch.done or error.fetch; the application can then <goto> the new document, or continue to work on the same one. The first event that occurs in a new document is ccxml.loaded.

Benefits: modularization, simpler sources, more readability.

http://www.w3.org/TR/ccxml

Page 64: Voice Browsing And Multimodal Interaction In 2009

Simple CCXML Document

<?xml version="1.0" encoding="UTF-8"?><ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxm l">

<var name="currentState"/><var name="myDialogId"/><var name="myConnId"/><eventprocessor statevariable="currentState">

<transition event=" connection.alerting "><assign name="myConnId" expr="event$.connectionid"/ ><accept connectionid="event$.connectionid"/>

</transition><transition event=" connection.connected ">

<dialogstart src="'http://www.example.com/flight.vxml'"connectionid="myConnId" dialogid="myDialogId"/>

</transition><transition event=" dialog.started ">

<log expr="’VoiceXML appl is running now’"/></transition><transition event=" connection.disconnected ">

<dialogterminate dialogid="myDialogId"/></transition><transition event=" dialog.exit ">

<disconnect connectionid="myConnId"/></transition><transition event="*">

<log expr="'Closing, unexpected:'+ event$.name"/><exit />

</transition></eventprocessor>

</ccxml>

Page 65: Voice Browsing And Multimodal Interaction In 2009

CCXML 1.0 – Next Steps

The CCXML specification is a Last Call Working Draft; all feature requests and clarifications have been addressed.

An Implementation Report test suite is under development.

It is very close to being published as a W3C Candidate Recommendation. Companies, inside or outside the group, will then be invited to submit implementation reports on their CCXML platforms.

After that, the CCXML 1.0 specification will be able to become a Proposed Recommendation and then a W3C Recommendation.

http://www.w3.org/TR/ccxml

Page 66: Voice Browsing And Multimodal Interaction In 2009

Speech Interface Framework: Tour Complete!

Page 67: Voice Browsing And Multimodal Interaction In 2009

[Diagram: the Speech Interface Framework, end of 2009, as shown earlier: all components from ASR/DTMF input through dialog management to TTS output are covered by W3C languages (SRGS, SISR, SSML, PLS, CCXML, VoiceXML 2.0/2.1, EMMA 1.0).]

Speech Interface Framework - End of 2009 (by Jim Larson)

Page 68: Voice Browsing And Multimodal Interaction In 2009

Architectural Changes

VoiceXML architecture (recap):
User → VoiceXML browser on a VoiceXML platform (ASR/DTMF in, TTS/audio out), fetching resources from a Web application over HTTP: .vxml pages, .grxml/.gram grammars, .ssml prompts, .wav/.mp3 audio, .pls lexicons.

Page 69: Voice Browsing And Multimodal Interaction In 2009

VoxNauta – Internal Architecture

Page 70: Voice Browsing And Multimodal Interaction In 2009

Loquendo MRCP Server/LSS 7.0 Architecture

[Diagram: the MRCP v1/v2 server accepts RTSP (MRCPv1) and SIP/SDP (MRCPv2) sessions, each with a dedicated parser, plus a load balancer; an audio provider handles RTP; a TTS & ASR interface connects the Loquendo engines (LTTS, LASR, LASR-SV) through the TTS and ASR APIs, returning NLSML/EMMA results; supporting blocks include configuration files, a logger, SNMP management and a graphic management console, running on Win32/Linux.]

Page 71: Voice Browsing And Multimodal Interaction In 2009

IETF MRCP Protocols

The Media Resource Control Protocols (MRCP) are IETF standards:
- MRCPv1 is RFC 4463, http://www.ietf.org/rfc/rfc4463.txt, based on RTSP/RTP
- MRCPv2 is an Internet Draft, http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-17, based on SIP/RTP, offering new audio recording and speaker verification functionalities

An optimized client-server solution for the large-scale deployment of speech technologies in the telephony field, such as call centers, CRM, news and email reading, and self-service applications. It allows a standard interfacing of speech technologies in all IVR platforms.

For more information read: Dave Burke, "Speech Processing for IP Networks: Media Resource Control Protocol (MRCP)", Wiley.

Page 72: Voice Browsing And Multimodal Interaction In 2009

VoiceXML in a Call Center

[Diagram: calls from the fixed/mobile network reach a PBX/ACD (with an optional voice gateway for non-SIP PBXs) and the VoxNauta IVR, which talks to a CTI server, a data server and a Web server; operators handle escalated calls.]

Page 73: Voice Browsing And Multimodal Interaction In 2009

VoiceXML in the IMS Architecture

[Diagram: the fixed/mobile network connects through a voice gateway (TDM protocols) to the IP network; the VoxNauta MRF exchanges SIP signaling and RTP media, and retrieves VoiceXML over HTTPS from an application server.]

Page 74: Voice Browsing And Multimodal Interaction In 2009

Overview

A Bit of History

W3C Speech Interaction Framework Today
- ASR/DTMF
- TTS
- Lexicons
- Voice Dialog and Call Control
- Voice Platforms and Next Evolutions

W3C Multimodal Interaction Today
- MMI Architecture
- EMMA and InkML
- A Language for Emotions

The Near Future

Page 75: Voice Browsing And Multimodal Interaction In 2009

Modes, Modalities and Technologies

- Speech
- Audio
- Stylus
- Touch
- Accelerometer
- Keyboard/keypad
- Mouse/touchpad
- Camera
- Geolocation
- Handwriting recognition
- Speaker verification
- Signature verification
- Fingerprint identification
- …

Page 76: Voice Browsing And Multimodal Interaction In 2009

Complement and Supplement

Speech: transient, linear, hands- and eyes-free, suffers from noise.
Visual: persistent, spatial, engages the eyes, suffers from light conditions.

→ Enables choosing among different modalities, or mixing them
→ Adaptable to different social and environmental conditions, or to user preference

Page 77: Voice Browsing And Multimodal Interaction In 2009

GUI + VUI → MUI (or MMUI), the multimodal user interface

Page 78: Voice Browsing And Multimodal Interaction In 2009

MMI has an Intrinsic Complexity

[Diagram: an Interaction Manager coordinating many inputs, grouped by purpose: user intent (speech, text, mouse, handwriting, accelerometer), sensors (geolocation, vital signs), recording (audio recording, photograph, video, drawing) and identification (speaker verification, face identification, fingerprint).]

Deborah Dahl, Voice Search 2009

Page 79: Voice Browsing And Multimodal Interaction In 2009

MMI can Include Many Different Technologies

[Diagram: an Interaction Manager connected to speech recognition, handwriting recognition, accelerometer, geolocation, touchscreen, keypad and fingerprint recognition.]

Deborah Dahl, Voice Search 2009

Page 80: Voice Browsing And Multimodal Interaction In 2009

Uniform Representation for MMI

Getting everything to work together is complicated. One simplification is to represent the same information from different modalities in the same format.

We need a common language for representing the same information coming from different modalities:

→ EMMA (Extensible MultiModal Annotation) 1.0, a uniform representation for multimodal information

Page 81: Voice Browsing And Multimodal Interaction In 2009

[Diagram: the same Interaction Manager, with every modality component (speech recognition, handwriting recognition, accelerometer, geolocation, touchscreen, keypad, fingerprint recognition) delivering its results as EMMA.]

Deborah Dahl, Voice Search 2009

Page 82: Voice Browsing And Multimodal Interaction In 2009

EMMA Structural Elements

Provide containers for the application semantics and for multimodal annotations.

EMMA elements: emma:emma, emma:interpretation, emma:one-of, emma:sequence, emma:group, emma:lattice

  <emma:emma …>
    <emma:one-of>
      <emma:interpretation>
        …
      </emma:interpretation>
      <emma:interpretation>
        …
      </emma:interpretation>
    </emma:one-of>
  </emma:emma>

http://www.w3.org/TR/emma/

Page 83: Voice Browsing And Multimodal Interaction In 2009

EMMA Annotations

Characteristics and processing of the input, e.g.:

- emma:medium, emma:mode, emma:function: medium, mode and function of the input
- emma:start, emma:end: timestamps (absolute/relative)
- emma:source: annotation of the input source
- emma:confidence: confidence scores
- emma:media-type: media type
- emma:signal: reference to the signal
- emma:lang: human language of the input
- emma:uninterpreted: uninterpretable input
- emma:no-input: lack of input
- emma:process: reference to the processing
- emma:tokens: tokens of the input
- emma:hook: hook (a point where input from another modality is to be integrated)

http://www.w3.org/TR/emma/

Page 84: Voice Browsing And Multimodal Interaction In 2009

EMMA 1.0 – Example Travel Application

INPUT:"I want to go from Boston to Denver on March 11"

http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009

Page 85: Voice Browsing And Multimodal Interaction In 2009

EMMA 1.0 – Same meaning

<emma:interpretation medium="acoustic" mode="voice" id="int1">

<origin>Boston</origin>

<destination>Denver</destination>

<date>11032009</date>

</emma:interpretation>

<emma:interpretation medium="tactile" mode="gui“id="int1">

<origin>Boston</origin>

<destination>Denver</destination>

<date>11032009</date>

</emma:interpretation>

Speech

Mouse

http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009

Page 86: Voice Browsing And Multimodal Interaction In 2009

EMMA 1.0 – Handwriting Input

<emma:interpretation medium="tactile" mode="ink" id="int1">

<origin>Boston</origin>

<destination>Denver</destination>

<date>11032009</date>

</emma:interpretation>

http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009

Page 87: Voice Browsing And Multimodal Interaction In 2009

EMMA 1.0 – Biometrics Input

<emma:emma version="1.0"> <emma:interpretation

id="int1"emma:confidence=".75"emma:medium="visual" emma:mode="photograph" emma:verbal="false" emma:function="identification">

<person>12345</person> <name>Mary Smith</name>

</emma:interpretation> </emma:emma>

<emma:emma version="1.0"> <emma:interpretation

id="int1"emma:confidence=".80"emma:medium="acoustic" emma:mode="voice" emma:verbal="false"

emma:function="identification"> <person>12345</person> <name>Mary Smith</name>

</emma:interpretation> </emma:emma>

http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009

Page 88: Voice Browsing And Multimodal Interaction In 2009

EMMA 1.0 – Representing Lattices

Speech recognizers, handwriting recognizers and other input processing components may provide lattice output: a graph encoding a range of possible recognition results or interpretations.

[Figure: a word lattice over nodes 1 to 8, with arcs: flights (1-2), to (2-3), boston or austin (3-4), from (4-5), portland or oakland (5-6), today (6-7), please (7-8), tomorrow (6-8).]

http://www.w3.org/TR/emma/ From Michael Johnston, AT&T Research

Page 89: Voice Browsing And Multimodal Interaction In 2009

Lattices can be represented using the EMMA elements <emma:lattice emma:initial="?" emma:final="?"> and <emma:arc emma:from="?" emma:to="?">:

  <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
    <emma:interpretation>
      <emma:lattice emma:initial="1" emma:final="8">
        <emma:arc emma:from="1" emma:to="2">flights</emma:arc>
        <emma:arc emma:from="2" emma:to="3">to</emma:arc>
        <emma:arc emma:from="3" emma:to="4">boston</emma:arc>
        <emma:arc emma:from="3" emma:to="4">austin</emma:arc>
        <emma:arc emma:from="4" emma:to="5">from</emma:arc>
        <emma:arc emma:from="5" emma:to="6">portland</emma:arc>
        <emma:arc emma:from="5" emma:to="6">oakland</emma:arc>
        <emma:arc emma:from="6" emma:to="7">today</emma:arc>
        <emma:arc emma:from="7" emma:to="8">please</emma:arc>
        <emma:arc emma:from="6" emma:to="8">tomorrow</emma:arc>
      </emma:lattice>
    </emma:interpretation>
  </emma:emma>

http://www.w3.org/TR/emma/ From Michael Johnston, AT&T Research

Page 90: Voice Browsing And Multimodal Interaction In 2009

EMMA in the Multimodal Framework
http://www.w3.org/TR/mmi-framework

[Diagram: the W3C Multimodal Interaction Framework, with EMMA carrying the input from the modality components to the interaction manager.]

Page 91: Voice Browsing And Multimodal Interaction In 2009

InkML 1.0 – Digital Ink

Ink Markup Language (InkML), http://www.w3.org/TR/InkML

A data format for representing digital ink (pen, stylus, etc.). It allows the input and processing of handwriting, gestures, sketches, music, etc.

  <ink>
    <trace>
      10 0, 9 14, 8 28, 7 42, 6 56, 6 70, 8 84, 8 98, 8 112, 9 126,
      10 140, 13 154, 14 168, 17 182, 18 188, 23 174, 30 160, 38 147,
      49 135, 58 124, 72 121, 77 135, 80 149, 82 163, 84 177, 87 191,
      93 205
    </trace>
    <trace>
      130 155, 144 159, 158 160, 170 154, 179 143, 179 129, 166 125,
      152 128, 140 136, 131 149, 126 163, 124 177, 128 190, 137 200,
      150 208, 163 210, 178 208, 192 201, 205 192, 214 180
    </trace>
    <trace>
      227 50, 226 64, 225 78, 227 92, 228 106, 228 120, 229 134,
      230 148, 234 162, 235 176, 238 190, 241 204
    </trace>
    <trace>
      282 45, 281 59, 284 73, 285 87, 287 101, 288 115, 290 129,
      291 143, 294 157, 294 171, 294 185, 296 199, 300 213
    </trace>
    <trace>
      366 130, 359 143, 354 157, 349 171, 352 185, 359 197,
      371 204, 385 205, 398 202, 408 191, 413 177, 413 163,
      405 150, 392 143, 378 141, 365 150
    </trace>
  </ink>

http://www.w3.org/TR/InkML/

Page 92: Voice Browsing And Multimodal Interaction In 2009

InkML 1.0 – Status and Advances

Rich annotation for ink:
- Traces, trace formats and trace collections
- Contextual information
- Canvases
- Etc.

The result of a classification of InkML traces may be a semantic representation in EMMA 1.0.

The current status is Last Call Working Draft; the next step is Candidate Recommendation, with the release of an Implementation Report test suite. There is rising interest from major industries.

http://www.w3.org/TR/InkML/

Page 93: Voice Browsing And Multimodal Interaction In 2009

MMI Architecture Specification

"Multimodal Architecture and Interfaces", W3C Working Draft:
http://www.w3.org/TR/mmi-arch/

The Runtime Framework provides the basic infrastructure and controls communication among the constituents. The Interaction Manager (IM) coordinates the Modality Components (MCs) through life-cycle events and contains the shared data (context). Communication between the IM and the MCs is event-based.

[Diagram: the Runtime Framework hosts the Interaction Manager, a Data Component and a Delivery Context component; Modality Components 1…N plug in through the Modality Component API.]

Ingmar Kliche, SpeechTEK 2008
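As a sketch of one life-cycle event (not from the talk; the element and attribute names follow later drafts of the specification and may differ across versions, and the URIs are hypothetical), the IM asks a modality component to start running some markup:

  <mmi:mmi xmlns:mmi="http://www.w3.org/2008/04/mmi-arch" version="1.0">
    <mmi:startRequest source="IManager" target="voiceMC"
        context="ctx-42" requestID="req-7">
      <!-- the markup the modality component should execute -->
      <mmi:contentURL href="http://www.example.com/dialog.vxml"/>
    </mmi:startRequest>
  </mmi:mmi>

The MC would answer with a startResponse and, on completion, a doneNotification.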

Page 94: Voice Browsing And Multimodal Interaction In 2009

MMI Arch – Laboratory Implementation

An implementation of the components using W3C markup languages.

[Diagram: the same Runtime Framework, with SCXML as the Interaction Manager, HTML as the GUI modality component and VoiceXML as the VUI modality component, connected through the Modality Component API.]

Ingmar Kliche, SpeechTEK 2008
http://www.w3.org/TR/mmi-arch/

Page 95: Voice Browsing And Multimodal Interaction In 2009

MMI Arch – Laboratory Implementation

An SCXML-based Interaction Manager, with VoiceXML and HTML modality components.

[Diagram: on the client side, an HTML browser (GUI modality) and a phone; on the server side, an SCXML interpreter with an HTTP I/O processor, and a CCXML/VoiceXML browser with a telephony interface (voice modality). The Modality Component API runs over HTTP + XML: AJAX towards the GUI, EMMA towards the voice component.]

Ingmar Kliche, SpeechTEK 2008
http://www.w3.org/TR/mmi-arch/

Page 96: Voice Browsing And Multimodal Interaction In 2009

MMI Architecture – Open Issues

- Profiles
- Start-up, registration and delegation in a distributed environment
- Transport of events
- Extensibility of events

http://www.w3.org/TR/mmi-arch/

Page 97: Voice Browsing And Multimodal Interaction In 2009

Emotion in Wikipedia

From the Wikipedia definition:

"An emotion is a mental and physiological state associated with a wide variety of feelings, thoughts, and behaviours. It is a prime determinant of the sense of subjective well-being and appears to play a central role in many human activities. As a result of this generality, the subject has been explored in many, if not all of the human sciences and art forms. There is much controversy concerning how emotions are defined and classified."

General goal: make the interaction between humans and machines more natural for the humans.

Machines should become able:
- to register human emotions (and related states)
- to convey emotions (and related states)
- to "understand" the emotional relevance of events

Page 98: Voice Browsing And Multimodal Interaction In 2009

Emotional States are Numerous

[Figure: a circumplex of dozens of emotion terms (adventurous, triumphant, enthusiastic, joyous, excited, hostile, angry, bored, suspicious, worried, ashamed, melancholic, lonely, sad, relaxed, content, hopeful, friendly, peaceful, …) arranged along Active/Passive and Positive/Negative axes, with the additional dimensions Hi/Lo Power-Control and Conducive/Obstructive; landmark states include AROUSED, ASTONISHED, EXCITED, DELIGHTED, HAPPY, PLEASED, GLAD, SERENE, CONTENT, AT EASE, SATISFIED, RELAXED, CALM, SLEEPY, TIRED, BORED, DROOPY, GLOOMY, SAD, MISERABLE, DEPRESSED, FRUSTRATED, DISTRESSED, ANNOYED, AFRAID, ANGRY, ALARMED, TENSE.]

Scherer et al., Univ. Geneva

Page 99: Voice Browsing And Multimodal Interaction In 2009


HUMAINE project

HUMAINE project: a European Network of Excellence
Activity: 01/2004 - 12/2007
33 partner institutions from many disciplines

Today: the HUMAINE Association (since June 2007), 125 members
Web site: http://emotion-research.net

Page 100: Voice Browsing And Multimodal Interaction In 2009


Online Speaker Classification – Classification Techniques

• Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA): preprocessing step to reduce the feature-vector dimension
• K-nearest Neighbor
• Gaussian Mixture Models (GMM): model the training data as Gaussian densities
• Artificial Neural Networks (ANN), e.g. MLP: interesting training algorithms
• Support Vector Machines (SVM): use “kernel functions” to separate non-linear decision boundaries
• Classification and Regression Trees (CART)
• Hidden Markov Models (HMM): used to model temporal structure

Felix Burkhardt, Colloquium Hochschule Zittau/Görlitz, 4.8.2008, page 1

Page 101: Voice Browsing And Multimodal Interaction In 2009


Expressive TTS – Two Approaches

1. Different speech databases, one for each expressive style
[Diagram: text + expressive tags drive unit selection over style-specific databases (style 1, style 2, ... style n) to produce the waveform.]
• Effective solution, feasible only for a very limited range of emotions

2. Speech signal manipulation according to style-dependent prosodic models
[Diagram: text + expressive tags drive selection over a neutral-style database; a prosodic model then guides signal processing of the neutral waveform.]
• Flexible solution, but requires accurate models and effective signal-processing capabilities

From Enrico Zovato, Loquendo

Page 102: Voice Browsing And Multimodal Interaction In 2009


Expressive TTS – Example Prosodic Patterns

[Figure: F0 contours (0-500 Hz, over 1.8 s) for the same utterance synthesized in a positive style, POS (“happy”), and a negative style, NEG (“sad”), for a Male-UK and a Female-UK voice.]

Synthesis of two basic emotional styles through prosodic modification:
• different intonation contours
• different acoustic unit durations

From Enrico Zovato, Loquendo

Page 103: Voice Browsing And Multimodal Interaction In 2009


Emotions in ECAs (Embodied Conversational Agents)

From Piero Cosi, CNR, Padova

Page 104: Voice Browsing And Multimodal Interaction In 2009


W3C Emotion Incubator

“The W3C Incubator Activity fosters rapid development, on a time scale of a year or less, of new Web-related concepts. Target concepts include innovative ideas for specifications, guidelines, and applications that are not (or not yet) clear candidates as Web standards developed through the more thorough process afforded by the W3C Recommendation Track.”

W3C Emotion Incubator aims:

First Charter XG (2006-2007):
“...to investigate the prospects of defining a general-purpose Emotion annotation and representation language...”
“...which should be usable in a large variety of technological contexts where emotions need to be represented.”

Second Charter XG (Nov. 2007 - Nov. 2008):
Prioritize the requirements; release a first specification draft; illustrate how to combine the Emotion Markup Language with existing markup languages.

Page 105: Voice Browsing And Multimodal Interaction In 2009


W3C Emotion Incubator – Members

W3C Members: DFKI, Loquendo, Deutsche Telekom, SRI International, NTUA, Fraunhofer, Chinese Acad. of Sciences

Invited Experts: Emotion AI, Univ. Paris 8, Univ. of the Basque Country, Univ. College Cork, OFAI (Austria), IPCA (Portugal), Tech. Univ. Munich

Web space: http://www.w3.org/2005/Incubator/emotion

Results:
• Use case description document
• Requirements document
• Final Report (20 Nov 2008): Elements of an EmotionML 1.0, http://www.w3.org/2005/Incubator/emotion/XGR-emotionml/

Chairman: Marc Schröder, DFKI

Page 106: Voice Browsing And Multimodal Interaction In 2009


W3C Emotion Incubator – EmotionML 1.0

Document structure: container element (<emotionml>), single emotion annotation (<emotion>)

Representation of emotions: <category> element, <dimensions> element, <appraisals> element, <action-tendency> element, <intensity> element

Meta information: confidence attribute, <modality> element, <metadata> element

Links and time: <link> element, <timing> element

Scale values: value attribute, <traces> element
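Putting these pieces together, a standalone EmotionML document might look like the sketch below, based on the XG report; the category set and name mirror the examples on the next slide, while the mode attribute on <modality> is an assumption for illustration:

<emotionml xmlns="http://www.w3.org/2008/11/emotionml">
  <!-- a single annotation: mild doubt, observed in the voice modality -->
  <emotion>
    <category set="everydayEmotions" name="doubt"/>
    <intensity value="0.4"/>
    <modality mode="voice"/> <!-- attribute name assumed for illustration -->
  </emotion>
</emotionml>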

Page 107: Voice Browsing And Multimodal Interaction In 2009


EmotionML 1.0 – Examples

Expression of emotions in SSML 1.1:

<?xml version="1.0"?><speak version="1.1" xmlns="http://www.w3.org/2001/ 10/synthesis"

xmlns:emo="http://www.w3.org/2008/11/emotionml"xml:lang="en-US">

<s><emo:emotion>

<emo:category set="everydayEmotions" name="doubt"/><emo:intensity value="0.4"/>

</emo:emotion>

Do you need help?</s>

</speak>

Detection of emotions in EMMA 1.0:

<emma:emma version="1.0" xmlns:emma="http://www.w3.o rg/2003/04/emma"xmlns="http://www.w3.org/2008/11/emotionml" >

<emma:interpretation start="12457990" end="12457995" mode="voice" verbal="false">

<emotion><intensity value="0.1" confidence="0.8"/><category set="everydayEmotions" name="boredom" con fidence="0.1"/>

</emotion>

</emma:interpretation></emma:emma>

Page 108: Voice Browsing And Multimodal Interaction In 2009


Overview

• A Bit of History

• W3C Speech Interaction Framework Today
  - ASR/DTMF
  - TTS
  - Lexicons
  - Voice Dialog and Call Control
  - Voice Platforms and Next Evolutions

• W3C Multimodal Interaction Today
  - MMI Architecture
  - EMMA and InkML
  - A Language for Emotions

• Next Future

Page 109: Voice Browsing And Multimodal Interaction In 2009


W3C VBWG/MMIWG – Next Future

Specs for the next generation of voice browsing:

SCXML 1.0

VoiceXML 3.0

Page 110: Voice Browsing And Multimodal Interaction In 2009


State Charts - SCXML

State Chart XML (SCXML): http://www.w3.org/TR/2008/WD-scxml-20080516/

A powerful state-machine language, based on David Harel’s Statecharts
Adopted in UML; the standard is under development by the W3C VBWG
http://www.w3.org/TR/scxml/

States, transitions, events
- Data model extends the basic finite-state automaton
- Conditions on transitions

Nested states
- Represent task decomposition
- The machine can be in multiple dependent states at the same time

Parallel states
- Represent fork/join logic

Wide interest: VBWG, MMI WG, other W3C groups, universities, industry
Open-source implementations are already available (a minimal sketch follows below)
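To make these features concrete, here is a minimal hand-written sketch of an SCXML document in the style of the Working Draft; the event names and the retries data item are illustrative, not taken from the spec:

<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="idle">
  <datamodel>
    <!-- the data model extends the bare finite-state automaton -->
    <data id="retries" expr="0"/>
  </datamodel>
  <state id="idle">
    <transition event="call.incoming" target="dialog"/>
  </state>
  <!-- nested states: 'dialog' decomposes into prompting and listening -->
  <state id="dialog" initial="prompting">
    <state id="prompting">
      <transition event="prompt.done" target="listening"/>
    </state>
    <state id="listening">
      <!-- a condition on a transition: re-prompt at most three times -->
      <transition event="noinput" cond="retries &lt; 3" target="prompting">
        <assign location="retries" expr="retries + 1"/>
      </transition>
      <transition event="noinput" target="idle"/>
      <transition event="result" target="idle"/>
    </state>
  </state>
</scxml>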

Page 111: Voice Browsing And Multimodal Interaction In 2009


SCXML 1.0 – Parallel State Charts
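The original slide showed a parallel state chart as a diagram; the sketch below expresses the same fork/join idea in SCXML markup, with illustrative state and event names. Both regions run at the same time, and the joining transition fires only once each region has reached a final state, following the draft's done.state.{id} convention:

<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="session">
  <parallel id="session">
    <!-- region 1: the voice channel -->
    <state id="voiceChannel" initial="speaking">
      <state id="speaking">
        <transition event="tts.done" target="voiceDone"/>
      </state>
      <final id="voiceDone"/>
    </state>
    <!-- region 2: the GUI channel, active in parallel -->
    <state id="guiChannel" initial="rendering">
      <state id="rendering">
        <transition event="page.loaded" target="guiDone"/>
      </state>
      <final id="guiDone"/>
    </state>
    <!-- join: raised once every region has reached a final state -->
    <transition event="done.state.session" target="end"/>
  </parallel>
  <final id="end"/>
</scxml>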

Page 112: Voice Browsing And Multimodal Interaction In 2009


SCXML as MMI Interaction Manager

[Diagram: an SCXML Interaction Manager coordinating Voice, Visual and Gesture Modality components.]
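In this role, states in the Interaction Manager can emit events to the modality components with SCXML's <send> element. A hedged sketch follows; the target URI, targettype value and event names are assumptions for illustration, not prescribed by the specs:

<state id="collecting">
  <onentry>
    <!-- ask the voice modality component to start; 'basichttp' and the
         target URI are illustrative assumptions -->
    <send event="mmi.startRequest" targettype="basichttp"
          target="http://voice-mc.example.com/mmi"/>
  </onentry>
  <transition event="mmi.done" target="integrating"/>
</state>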

Page 113: Voice Browsing And Multimodal Interaction In 2009


SCXML for VoiceXML 3.0

[Diagram: the same architecture, with the Voice Modality realized by VoiceXML 3.0 under the control of the SCXML Interaction Manager.]

Page 114: Voice Browsing And Multimodal Interaction In 2009


SCXML 1.0 – Open Issues

Data model: ECMAScript (ECMA-262) or other formats?

Definition of Profiles

Other

Page 115: Voice Browsing And Multimodal Interaction In 2009


Re-Thinking VoiceXML – VoiceXML 3.0

Well-founded: from a syntactic description to a semantic model

Extensible: SIV (Speaker Identification and Verification), EMMA support, rich media, VCR control, etc.

Profiled: light profile (mobile?), media profile (scalability), VoiceXML 2.1 profile (interoperability), etc.

Flexible: customization of the FIA (Form Interpretation Algorithm)

Page 116: Voice Browsing And Multimodal Interaction In 2009


VoiceXML 3.0 – Separation of Concerns

SCXML 1.0: application and interaction logic

VoiceXML 3.0: voice interaction only, under the control of SCXML (see the sketch below)

VoiceXML 3.0 has been published as a First Working Draft: http://www.w3.org/TR/2008/WD-voicexml30-20081219/
• Send public comments
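One plausible shape for this separation is SCXML's <invoke> mechanism, where a state delegates to an external dialog and resumes when it completes. A speculative sketch; the targettype value and file name are assumptions, since the SCXML/VoiceXML 3.0 binding was still being defined at the time:

<state id="askDestination">
  <!-- run a VoiceXML 3.0 dialog for the duration of this state -->
  <invoke targettype="vxml3" src="destination.vxml"/>
  <!-- control returns to the SCXML layer when the dialog completes -->
  <transition event="done.invoke" target="confirmDestination"/>
</state>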

Page 117: Voice Browsing And Multimodal Interaction In 2009


THANK YOU

For clarifications or questions:

[email protected]

Page 118: Voice Browsing And Multimodal Interaction In 2009


For more information, keep an eye on: www.loquendo.com

Contact: [email protected]

THANK YOU

Loquendo S.p.A.
745 Fifth Ave, 27th Floor, New York, NY 10151, USA
Tel. +1 212.310.9075 / Fax +1 212.310.9001
www.loquendo.com

Loquendo S.p.A.
Via Olivetti, 6, 10148 Torino, Italy
Tel. +39 011 291 3111 / Fax +39 011 291 3199
www.loquendo.com

Keep in touch with Loquendo news, subscribe to the Loquendo Newsletter

Try our interactive TTS demo: insert your text, choose a language, and listen

The latest News at a click

Consult the Loquendo Newsletter online

Keep up to date on events and initiatives

For further information, fill in our Contacts Form