Bridging the Gap Between Speech Recognition and Business Logic

Post on 11-Apr-2017


@wolfpaulus

Wolf Paulus wolf@wolfpaulus.com

History

[Timeline figure: 1975 to 2015, with the following years marked]

1977
1981 (4 Years)
1984 (3 Years)
1995 (11 Years)
2007 (12 Years)
2011 (4 Years)
2013 (2 Years): Emotion and Voice enabled Avatar
2014 (30 Years)
2015: Amazon Alexa powers Echo and is designed around your voice. It’s always on - just ask for information, news, weather, and more.

38 Years

From standalone, text-based .. to a globally connected, conversational Voice User Interface

Classification

[Chart: Customer Benefit vs. Integration Complexity]

Recognize who is speaking

Recognize what is said and write it down

Execute simple commands

Intelligently respond to natural language input

Classification

[Chart: Customer Benefit vs. Integration Complexity]

Bouncer

Secretary

Gopher

Assistant

?

Speech Input (encoded compressed sound file)

Speech Recognition returns a text

Mapping recognized words to pre-programmed actions

Digital art is an artistic work that uses digital technology as an essential part of the creative process.

When your phone’s screen is a canvas, your finger becomes a paint brush.

All paintings were created with Artist on Android.

Tool

Style

Action Bar

Color

Tasks: about, explore, publish, share, print, save, reset …

Tools: pen, brush, roll

Styles: emboss, blur, blend, erase, normal

Colors: 216 named colors …


[Diagram: commands built from Task, Tool, Style, and Color, alone or combined (e.g. Tool Style Color)]

GrXML - Speech Recognition Grammar
http://www.w3.org/TR/speech-grammar/

<?xml version="1.0"?>
<grammar version="1.0" xml:lang="en-US"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/06/grammar http://www.w3.org/TR/speech-grammar/grammar.xsd"
         xmlns="http://www.w3.org/2001/06/grammar">
  ..
  <rule id="tool">
    <one-of>
      <item>pen</item>
      <item>brush</item>
      <item>roll</item>
    </one-of>
  </rule>
  ..
  <rule id="command">
    <one-of> <ruleref uri="#tool"/> </one-of>
    <one-of> <ruleref uri="#color"/> </one-of>
    <one-of> <ruleref uri="#style"/> </one-of>
    <one-of>
      <ruleref uri="#color"/>
      <ruleref uri="#tool"/>
    </one-of>
    ..
  </rule>
  ..

[Slide annotations on the grammar: values, variables; multipliers 2x, 6x, 216x, 3x, 5x, 8x]

Request …

private void startVoiceRecognitionActivity() {
    final Intent intent = new Intent( RecognizerIntent.ACTION_RECOGNIZE_SPEECH );

    // Specify the calling package to identify your application.
    intent.putExtra( RecognizerIntent.EXTRA_CALLING_PACKAGE, getClass().getPackage().getName() );

    // Display a hint to the user about what to say.
    intent.putExtra( RecognizerIntent.EXTRA_PROMPT, getResources().getString( R.string.speakPROMPT ) );

    // Give a hint to the recognizer about what the user is going to say.
    intent.putExtra( RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM );

    // Specify how many results you want to receive. The results are sorted,
    // with the highest-confidence result first.
    intent.putExtra( RecognizerIntent.EXTRA_MAX_RESULTS, 1 );

    intent.putExtra( RecognizerIntent.EXTRA_LANGUAGE, new Locale("en").getLanguage() );
    startActivityForResult( intent, SPEECH_RECOGNITION_REQUEST_CODE );
}

… Response

@Override
protected void onActivityResult( final int requestCode, final int resultCode, final Intent data ) {
    switch (requestCode) {
        case SPEECH_RECOGNITION_REQUEST_CODE:
            if (resultCode == Activity.RESULT_OK) {
                final ArrayList<String> matches = data.getStringArrayListExtra( RecognizerIntent.EXTRA_RESULTS );
                // Guard explicitly: Java assert statements are typically disabled on Android.
                if (matches != null && !matches.isEmpty()) {
                    do_something_with_the_result( matches.get( 0 ) );
                }
            } else {
                mTV_STT.setText( R.string.tapScreen );
            }
            break;
        default:
            mTV_STT.setText("");
            break;
    }
}

Speech Input (encoded compressed sound file)

Speech Recognition returns a text, based mainly on statistical models that prioritize frequently used words and words that are frequently used together. As a result, an utterance like “blue blur brush” is unlikely to be recognized correctly.

Mapping recognized words to actions. E.g. “What’s my schedule for tomorrow” opens the calendar app

Bobby "Blue" Bland, January 27, 1930 – June 23, 2013

What was said ‣ what was recognized

blue blend ‣ blue bland

powder blue brush ‣ powder blue bra

turquoise blur brush ‣ turquoise player price

cyan brush ‣ Diane Brush

sky blue emboss pen ‣ sky blue in Boston

Homophones: pronounced the same, differ in meaning, and may differ in spelling. E.g.: to, too, two; there, their, they’re

Synophones: similar pronunciations, different meanings. E.g.: sheep, Jeep, cheap

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions, or substitutions) required to change one word into the other.

Dr. Vladimir Levenshtein
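The definition above maps directly onto a small dynamic-programming routine. The following is a minimal sketch, not code from the talk; the class name `EditDistance` is hypothetical, and only two rows of the distance table are kept in memory.

```java
// Hypothetical helper, illustrating the definition of the Levenshtein distance.
final class EditDistance {
    private EditDistance() {}

    /** Minimum number of single-character insertions, deletions, or substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous prefix of a
        int[] curr = new int[b.length() + 1]; // distances for the current prefix of a
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For the misrecognitions shown earlier, `EditDistance.levenshtein("bland", "blend")` is 1, while `EditDistance.levenshtein("bra", "brush")` is 3.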

Phonetic Algorithms

Soundex is a phonetic algorithm for indexing names by their pronunciation.

For instance, both "Robert" and "Rupert" return the same encoding, but not "Rubin". Also, "Ashcraft" and "Ashcroft" return the same encoding.

Soundex Algorithm

1. Retain the 1st letter.
2. Drop all other occurrences of a, e, i, o, u, h, w, y.
3. Replace consonants with digits as follows:
   • b, f, p, v → 1
   • c, g, j, k, q, s, x, z → 2
   • d, t → 3
   • l → 4
   • m, n → 5
   • r → 6
4. Remove duplicates:
   1. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter.
   2. Two letters with the same number separated by 'h' or 'w' are coded as a single number.
5. Continue until you have one letter and three digits; if there are fewer than three digits, append ‘0’.

Soundex Coding Guide

Number  Represents the Letters
1       b, f, p, v
2       c, g, j, k, q, s, x, z
3       d, t
4       l
5       m, n
6       r

Disregard the letters: a, e, i, o, u, h, w, y

Example: “Robert”

1. Retain 1st letter: Robert
2. Drop all occurrences of a, e, i, o, u, h, w, y: Rbrt
3. Replace consonants with digits: R163
4. Remove duplicates
5. Iterate until you have one letter and three numbers, or append ‘0’

Robert : R163
Rupert : R163
Rubin : R150
Ashcraft : A261
Ashcroft : A261
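The steps and the coding guide can be condensed into a few lines of Java. This is a sketch under the rules listed on the slides, not the Apache Commons Codec implementation; the class name `SoundexSketch` is hypothetical.

```java
// Hypothetical, self-contained Soundex sketch following the coding guide above.
final class SoundexSketch {
    // Digit for each letter A..Z; '0' marks the disregarded letters a, e, i, o, u, h, w, y.
    private static final String CODES = "01230120022455012623010202";

    private SoundexSketch() {}

    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        char prevCode = '0';
        for (char raw : name.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not a plain letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // step 1: retain the first letter
                prevCode = code;
                continue;
            }
            if (c == 'H' || c == 'W') {
                continue; // h and w do not separate letters with the same number
            }
            if (code != '0' && code != prevCode) {
                out.append(code); // steps 3 and 4: emit the digit, skipping duplicates
            }
            prevCode = code; // a vowel resets prevCode, so a repeated digit is allowed after it
            if (out.length() == 4) {
                break; // one letter and three digits
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0'); // step 5: pad with '0'
        }
        return out.toString();
    }
}
```

With this sketch, `SoundexSketch.encode("Robert")` and `SoundexSketch.encode("Rupert")` both give `R163`, and `SoundexSketch.encode("Rubin")` gives `R150`, matching the values on the slide.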

Rank  Algorithm                        Correction Rate
1     RefinedSoundex                   64%
2     Soundex                          55%
3     StringUtils.LevenshteinDistance  46%
4     DoubleMetaphone                  37%
5     Metaphone                        28%
6     Nysiis                           28%
7     StringUtils.FuzzyDistance        28%
8     StringUtils.JaroWinklerDistance  28%
9     MatchRating                      19%
10    Caverphone1                      19%
11    Caverphone2                      10%
12    BeiderMorse                      0%
13    String.compareTo                 0%

Vocabulary: 126 Words

final String[][] mTuples = new String[][]{
    {"Bland", "blend"}, {"Bra", "brush"}, {"Player", "blur"}, {"Price", "brush"},
    {"Diane", "cyan"}, {"Boston", "emboss"}, {"in Boston", "emboss pen"},
    {"sheep", "jeep"}, {"cheap", "jeep"}, {"their", "there"}, {"they are", "there"}};

Apache Commons Codec - Language Encoders
http://commons.apache.org/proper/commons-codec/

https://github.com/wolfpaulus/phonetic-alg-compare.git

Phonetic Algorithms

Rank  Algorithm       Correction Rate
1     RefinedSoundex  64%
2     Soundex         55%

Soundex Coding Guide

Number  Represents the Letters
1       b, f, p, v
2       c, g, j, k, q, s, x, z
3       d, t
4       l
5       m, n
6       r

Disregard the letters: a, e, i, o, u, h, w, y

Refined Soundex Guide

Number  Represents the Letters
1       b, p
2       f, v
3       c, k, s
4       g, j
5       q, x, z
6       d, t
7       l
8       m, n
9       r

Disregard the letters: a, e, i, o, u, h, w, y

Build a Dictionary of Words/Sentences and Actions

Phonetically encode the Words in the Dictionary

Capture Speech Input (encoded sound file)

Speech Recognition returns a text

Phonetically encode the recognition result

Find the best match in the encoded dictionary

Launch the action mapped to the best match
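One way to read these steps as code is sketched below. It is an illustration only: the class `PhoneticCommandMatcher`, the action names, and the choice of plain Soundex plus Levenshtein distance on the encoded phrases are assumptions, not the implementation behind the talk. Actions are represented as simple strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: dictionary of phonetically encoded phrases mapped to action names.
final class PhoneticCommandMatcher {
    private static final String CODES = "01230120022455012623010202"; // Soundex digits for A..Z
    private final Map<String, String> actionsByKey = new LinkedHashMap<>();

    /** Build the dictionary: store the action under the phonetic encoding of its phrase. */
    void register(String phrase, String action) {
        actionsByKey.put(encodePhrase(phrase), action);
    }

    /** Encode the recognition result and return the action of the closest dictionary key (null if empty). */
    String bestAction(String recognizedText) {
        String key = encodePhrase(recognizedText);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, String> entry : actionsByKey.entrySet()) {
            int distance = levenshtein(key, entry.getKey());
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getValue();
            }
        }
        return best;
    }

    /** Encodes every word of the phrase and joins the codes with single spaces. */
    static String encodePhrase(String phrase) {
        StringBuilder sb = new StringBuilder();
        for (String word : phrase.trim().split("\\s+")) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(soundex(word));
        }
        return sb.toString();
    }

    static String soundex(String word) {
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (char raw : word.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') continue;
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) { out.append(c); prev = code; continue; }
            if (c == 'H' || c == 'W') continue;
            if (code != '0' && code != prev) out.append(code);
            prev = code;
            if (out.length() == 4) break;
        }
        while (out.length() > 0 && out.length() < 4) out.append('0');
        return out.toString();
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(prev[j - 1] + cost, Math.min(curr[j - 1] + 1, prev[j] + 1));
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```

After registering "blue blend", "cyan brush", and "powder blue brush", the misrecognized inputs "blue bland", "Diane Brush", and "powder blue bra" resolve to the intended entries. A real app would likely add a distance threshold so that unrelated input is rejected instead of mapped to the nearest command.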

Classification

[Chart: Customer Benefit vs. Integration Complexity]

Recognize who is speaking

Recognize what is said and record it

Execute simple commands

Intelligently respond to natural language input

Wit.ai

• A web service for building apps and devices that users can talk or text to.
• Provides an open and extensible natural language platform.
• “Learns” from every interaction and leverages the community: what’s learned is shared across developers/apps.

declare

teach

Request

curl -H 'Authorization: Bearer 4FPMGMHNMWDFV3E2BM5QQ4XVQA3A3W23' 'https://api.wit.ai/message?v=20150406&q=blue+blur+brush'

Response

{
  "msg_id" : "befcd276-e6e0-4f53-a075-d26c991786e4",
  "_text" : "blue blur brush",
  "outcomes" : [ {
    "_text" : "blue blur brush",
    "intent" : "setup",
    "entities" : {
      "style" : [ { "value" : "blur" } ],
      "color" : [ { "value" : "blue" } ],
      "tool" : [ { "value" : "brush" } ]
    },
    "confidence" : 0.515
  } ]
}
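The same request could be prepared from Java. The sketch below assumes Java 11 or later (for `java.net.http`) and only builds the request; the class name `WitMessageRequest` and the placeholder token are hypothetical, while the endpoint, the `v` and `q` parameters, and the bearer header are taken from the curl call.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper mirroring the curl request; it does not send anything.
final class WitMessageRequest {
    static final String ENDPOINT = "https://api.wit.ai/message";

    private WitMessageRequest() {}

    /** Builds the GET request for one utterance, URL-encoding the query values. */
    static HttpRequest build(String accessToken, String apiVersion, String utterance) {
        String query = "v=" + URLEncoder.encode(apiVersion, StandardCharsets.UTF_8)
                + "&q=" + URLEncoder.encode(utterance, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(ENDPOINT + "?" + query))
                .header("Authorization", "Bearer " + accessToken)
                .GET()
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, and the JSON body parsed with whichever JSON library the app already uses.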

Alexa, ask Mindy, What is the Balance of my Bank of America checking account

• Intent: GetAccountBalance
• Institution: Bank of America
• Account Type: Checking

Intent Schema

{
  "intents": [
    {
      "intent": "GetAccountBalance",
      "slots": [
        { "name": "Bank", "type": "LITERAL" },
        { "name": "Type", "type": "LITERAL" }
      ]
    }
  ],
  ..
}

TYPES: LITERAL NUMBER DATE TIME DURATION

Sample Utterances … provide as many as you can

intent …. {value|name} …. {value|name} .…

GetAccountBalance What is my balance

GetAccountBalance balance of my {Bank of America|Bank} account
GetAccountBalance balance of my {Wells Fargo|Bank} account
GetAccountBalance balance of my {PayPal|Bank} account
...

GetAccountBalance balance of my {savings|Type} account
GetAccountBalance balance of my {bank|Type} account
GetAccountBalance balance of my {credit|Type} account
GetAccountBalance balance of my {investment|Type} account

GetAccountBalance balance of my {Bank of America|Bank} {investment|Type} account
GetAccountBalance balance of my {PayPal|Bank} {checking|Type} account

GetAccountBalance balance of my {savings|Type} account at {Citi|Bank}
GetAccountBalance balance of my {credit|Type} account at {Credit Union|Bank}
...
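Because the advice is to provide as many sample utterances as possible, it may be easier to generate them than to type them. The following sketch expands lists of banks and account types into lines of the `intent … {value|name}` form; the class `SampleUtterances` and the four phrasings it emits are assumptions modeled on the examples above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generator for sample-utterance lines in the "intent text {value|Slot}" format.
final class SampleUtterances {
    private SampleUtterances() {}

    /** Expands bank and account-type values into one utterance line per combination. */
    static List<String> generate(String intent, List<String> banks, List<String> types) {
        List<String> lines = new ArrayList<>();
        for (String bank : banks) {
            lines.add(intent + " balance of my {" + bank + "|Bank} account");
        }
        for (String type : types) {
            lines.add(intent + " balance of my {" + type + "|Type} account");
        }
        for (String bank : banks) {
            for (String type : types) {
                lines.add(intent + " balance of my {" + bank + "|Bank} {" + type + "|Type} account");
                lines.add(intent + " balance of my {" + type + "|Type} account at {" + bank + "|Bank}");
            }
        }
        return lines;
    }
}
```

For two banks and two account types this yields 12 lines, which can be written to the sample utterances file one per line.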

AIML (Artificial Intelligence Markup Language) is an XML dialect for creating natural language software agents.

Dr. Richard Wallace

<?xml version="1.0" encoding="utf-8" ?>
<aiml version="2.0">
  <category>
    <pattern>What is *</pattern>
    <template>I don’t know what <star/> is</template>
  </category>
</aiml>

<?xml version="1.0" encoding="utf-8" ?>
<aiml version="2.0">
  <category>
    <pattern>What is *</pattern>
    <template>I don’t know what <star/> is</template>
  </category>
</aiml>

<?xml version="1.0" encoding="utf-8" ?>
<aiml version="2.0">
  <category>
    <pattern>I like *</pattern>
    <template>I see, you like <person><star/></person></template>
  </category>
</aiml>

What is cheese?
I don’t know what cheese is.

What is chocolate ?
I don’t know what chocolate is.

I like coffee.
I see, you like coffee.

I like your voice.
I see, you like my voice.

I like her dress.
I see, you like her dress.
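To make the behavior of `*`, `<star/>`, and `<person>` concrete, here is a toy matcher in Java. It is a rough sketch of the idea, not an AIML interpreter: the class `TinyAiml`, the punctuation stripping, and the small pronoun table are all assumptions, and real AIML normalization and `<person>` substitution are considerably richer.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy matcher for a single AIML-like category (one pattern, one template).
final class TinyAiml {
    // A small, assumed subset of the first/second person swaps that <person> performs.
    private static final Map<String, String> PERSON = Map.of(
            "your", "my", "my", "your", "you", "me", "me", "you", "i", "you");

    private TinyAiml() {}

    /** Returns the filled-in template, or null if the input does not match the pattern. */
    static String respond(String pattern, String template, String input) {
        // Drop trailing sentence punctuation, so "What is cheese?" matches "What is *".
        String normalized = input.trim().replaceAll("\\s*[.?!]+$", "");
        StringBuilder regex = new StringBuilder();
        for (String token : pattern.split(" ")) {
            if (regex.length() > 0) regex.append(' ');
            regex.append(token.equals("*") ? "(.+)" : Pattern.quote(token));
        }
        Matcher m = Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE).matcher(normalized);
        if (!m.matches()) return null;
        String star = m.groupCount() > 0 ? m.group(1) : "";
        return template
                .replace("<person><star/></person>", swapPerson(star))
                .replace("<star/>", star);
    }

    /** Swaps first and second person pronouns word by word. */
    static String swapPerson(String text) {
        StringBuilder sb = new StringBuilder();
        for (String word : text.split(" ")) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(PERSON.getOrDefault(word.toLowerCase(), word));
        }
        return sb.toString();
    }
}
```

Calling `TinyAiml.respond("I like *", "I see, you like <person><star/></person>", "I like your voice.")` returns "I see, you like my voice", mirroring the sample dialog.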

<?xml version="1.0" encoding="utf-8" ?>
<aiml version="2.0">

  <category>
    <pattern><set>colors</set> <set>styles</set> <set>tools</set></pattern>
    <template>Color=<star/> Style=<star index="2"/> Tool=<star index="3"/></template>
  </category>

  <category>
    <pattern><set>colors</set> <set>styles</set> ^</pattern>
    <template>Color=<star/> Style=<star index="2"/></template>
  </category>

  <category>
    <pattern><set>colors</set> <set>tools</set> ^</pattern>
    <template>Color=<star/> Tool=<star index="2"/></template>
  </category>

  <category>
    <pattern><set>colors</set> ^</pattern>
    <template>Color=<star/></template>
  </category>

  <category>
    <pattern><set>styles</set></pattern>
    <template>Style=<star/></template>
  </category>

  <category>
    <pattern><set>tools</set></pattern>
    <template>Tool=<star/></template>
  </category>

</aiml>

TOOLS
[["Roll"], ["Brush"], ["Pen"]]

STYLES
[["Blur"], ["Emboss"], ["Blend"]]

COLORS
.. ["Dark", "blue"], ["Dark", "brown"], ["Dark", "byzantium"], ["Dark", "cerulean"], ["Dark", "chestnut"], ["Dark", "coral"], ["Dark", "cyan"], ["Dark", "goldenrod"], ["Dark", "gray"], ["Dark", "green"], ["Dark", "khaki"], ["Dark", "lava"], ["Dark", "lavender"], ["Dark", "magenta"], …
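The effect of the set-based patterns can be approximated outside of AIML with a small slot parser. The sketch below is a simplification under stated assumptions: the class `SlotParser` is hypothetical, each slot is treated as optional, slots are expected in the order of the supplied map (Color, Style, Tool), and trailing words are ignored much like the `^` wildcard. Multi-word set entries such as "Dark blue" are matched by preferring the longest entry.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: reads Color, Style, and Tool slots from the start of an utterance.
final class SlotParser {
    private SlotParser() {}

    /**
     * For each set, in the iteration order of the map, consumes the longest entry that
     * starts the remaining text. Sets without a match are skipped.
     */
    static Map<String, String> parse(String utterance, Map<String, List<String>> sets) {
        Map<String, String> slots = new LinkedHashMap<>();
        String rest = utterance.trim().toLowerCase(Locale.ROOT);
        for (Map.Entry<String, List<String>> set : sets.entrySet()) {
            String best = null;
            for (String entry : set.getValue()) {
                String candidate = entry.toLowerCase(Locale.ROOT);
                boolean matches = rest.equals(candidate) || rest.startsWith(candidate + " ");
                if (matches && (best == null || candidate.length() > best.length())) {
                    best = candidate;
                }
            }
            if (best != null) {
                slots.put(set.getKey(), best);
                rest = rest.substring(best.length()).trim();
            }
        }
        return slots;
    }
}
```

With an ordered map of Color, Style, and Tool sets, "dark blue blur brush" yields Color "dark blue", Style "blur", and Tool "brush", while "blue brush" yields only Color and Tool.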

pandorabots

[Timeline figure: 1975 to 2015, with the following events marked]

Wit.ai

FB acq. Wit.ai

SpeakToIt

SpeakToIt Api.ai

AIML 1.0

AIML 2.0

GrXML 1.0

PandoraBots

AMZN acq. Ivona

AMZN acq. Yap

Emotional Prosody
Wolf Paulus

https://speakerdeck.com/wolfpaulus/emotional-prosody

Emotion and Voice enabled Avatar, Emotional Prosody

20??

Summary

The adoption of context-free recognizers may benefit from post-recognition processing, using text-to-sound language encoders, dictionaries, etc.

Gopher-VUIs are still hard to find, while Apple, Google, Amazon, Microsoft, and Facebook are building Assistant-VUIs. For domain-specific information, or where information is protected, those Assistants will require 3rd-party substitutes. It will be interesting to see how much control (i.e. personality) those substitutes will be granted.

Basic NLU solutions from Wit.ai, Api.ai, and Amazon/Alexa reach their limits fast and are hardly powerful enough to implement those substitute assistants. AIML is the oldest, but still seems to be the most powerful declarative approach for simple VUI-related tasks.

Thanks for listening