Bridging the Gap Between Speech Recognition and Business Logic

Wolf Paulus (@wolfpaulus)


Page 17: Bridging the Gap Between Speech Recognition and Business Logic




Emotion and Voice enabled Avatar


Page 18: Bridging the Gap Between Speech Recognition and Business Logic



Emotion and Voice enabled Avatar 2013

Page 20: Bridging the Gap Between Speech Recognition and Business Logic




30 Years

Page 21: Bridging the Gap Between Speech Recognition and Business Logic




Amazon Alexa powers Echo and is designed around your voice.

It’s always on - just ask for information, news, weather, and more.

Page 22: Bridging the Gap Between Speech Recognition and Business Logic


38 Years


Page 23: Bridging the Gap Between Speech Recognition and Business Logic


38 Years


Standalone, text-based ..

.. globally connected, conversational Voice User Interface

Page 24: Bridging the Gap Between Speech Recognition and Business Logic


Customer Benefit







Recognize who is speaking

Execute simple commands

Intelligently respond to natural language input


Recognize what is said and write it down

Page 25: Bridging the Gap Between Speech Recognition and Business Logic





Customer Benefit?








Page 26: Bridging the Gap Between Speech Recognition and Business Logic


Speech Input (encoded compressed sound file)

Speech Recognition returns a text

Mapping recognized words to pre-programmed actions

Page 27: Bridging the Gap Between Speech Recognition and Business Logic


Digital art is an artistic work, using digital technology as an essential part of the creative process

Page 28: Bridging the Gap Between Speech Recognition and Business Logic


When your phone’s screen is a canvas, your finger becomes a paint brush

Page 36: Bridging the Gap Between Speech Recognition and Business Logic


All paintings were created with Artist on Android

Page 38: Bridging the Gap Between Speech Recognition and Business Logic




Action Bar


Page 39: Bridging the Gap Between Speech Recognition and Business Logic


Tasks: about, explore, publish, share, print, save, reset …

Tools: pen, brush, roll

Styles: emboss, blur, blend, erase, normal

Colors: 216 named colors …


Page 41: Bridging the Gap Between Speech Recognition and Business Logic












GrXML - Speech Recognition Grammar

<?xml version="1.0"?>
<grammar version="1.0" xml:lang="en-US" xmlns:xsi="" xsi:schemaLocation="" xmlns="">
  ..
  <rule id="tool">
    <one-of>
      <item>pen</item>
      <item>brush</item>
      <item>roll</item>
    </one-of>
  </rule>
  ..
  <rule id="command">
    <one-of> <ruleref uri="#tool"/> </one-of>
    <one-of> <ruleref uri="#color"/> </one-of>
    <one-of> <ruleref uri="#style"/> </one-of>
    <one-of>
      <ruleref uri="#color"/>
      <ruleref uri="#tool"/>
    </one-of>
  </rule>
  ..









Page 42: Bridging the Gap Between Speech Recognition and Business Logic


private void startVoiceRecognitionActivity() {
    final Intent intent = new Intent( RecognizerIntent.ACTION_RECOGNIZE_SPEECH );

    // Specify the calling package to identify your application.
    intent.putExtra( RecognizerIntent.EXTRA_CALLING_PACKAGE, getClass().getPackage().getName() );

    // Display a hint to the user about what they should say.
    intent.putExtra( RecognizerIntent.EXTRA_PROMPT, getResources().getString( R.string.speakPROMPT ) );

    // Give the recognizer a hint about what the user is going to say.
    intent.putExtra( RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM );

    // Specify how many results you want to receive. The results are sorted
    // so that the first result is the one with the highest confidence.
    intent.putExtra( RecognizerIntent.EXTRA_MAX_RESULTS, 1 );

    intent.putExtra( RecognizerIntent.EXTRA_LANGUAGE, new Locale( "en" ).getLanguage() );
    startActivityForResult( intent, SPEECH_RECOGNITION_REQUEST_CODE );
}

Request …

Page 43: Bridging the Gap Between Speech Recognition and Business Logic


… Response

@Override
protected void onActivityResult( final int requestCode, final int resultCode, final Intent data ) {
    switch (requestCode) {
        case SPEECH_RECOGNITION_REQUEST_CODE:
            if (resultCode == Activity.RESULT_OK) {
                final ArrayList<String> matches = data.getStringArrayListExtra( RecognizerIntent.EXTRA_RESULTS );
                assert matches != null;
                do_something_with_the_result( matches.get( 0 ) );
            } else {
                mTV_STT.setText( R.string.tapScreen );
            }
            break;
        default:
            mTV_STT.setText( "" );
            break;
    }
}

Page 44: Bridging the Gap Between Speech Recognition and Business Logic


Speech Input (encoded compressed sound file)

Speech Recognition returns a text, mainly based on statistical models, prioritizing frequently used words and words that are frequently used together. I.e., it’s unlikely to get an utterance like “blue blur brush” correctly recognized.

Mapping recognized words to actions. E.g. “What’s my schedule for tomorrow” opens the calendar app

Page 45: Bridging the Gap Between Speech Recognition and Business Logic


Bobby "Blue" Bland January 27, 1930 – June 23, 2013

Page 46: Bridging the Gap Between Speech Recognition and Business Logic


blue blend ‣ blue bland

powder blue brush ‣ powder blue bra

turquoise blur brush ‣ turquoise player price

cyan brush ‣ Diane Brush

sky blue emboss pen ‣ sky blue in Boston


Page 47: Bridging the Gap Between Speech Recognition and Business Logic


Homophones: pronounced the same, differ in meaning, and may differ in spelling. E.g.: to, too, two, and there, their, they’re

Synophones: similar pronunciations, different meanings. E.g.: sheep, Jeep, cheap

Page 48: Bridging the Gap Between Speech Recognition and Business Logic


Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions, or substitutions) required to change one word into the other.

Dr. Vladimir Levenshtein
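
The definition above can be sketched as a small dynamic-programming routine. This is a minimal illustration, not the StringUtils implementation benchmarked later in the deck; the class name LevenshteinSketch and the method name distance are made up for this example.

```java
final class LevenshteinSketch {

    /** Minimum number of single-character insertions, deletions, or substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("bland", "blend")); // one substitution
        System.out.println(distance("bra", "brush"));   // one substitution, two insertions
    }
}
```

For the misrecognitions shown earlier, "bland" is one edit away from "blend", while "bra" is three edits away from "brush", so a plain edit-distance threshold would likely catch the first but not the second.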

Page 49: Bridging the Gap Between Speech Recognition and Business Logic


Soundex is a phonetic algorithm for indexing names by their pronunciation.

For instance both "Robert" and "Rupert" return the same encoding, but not "Rubin". Also "Ashcraft" and "Ashcroft" return the same encoding.

Phonetic Algorithms

Page 50: Bridging the Gap Between Speech Recognition and Business Logic


Soundex Algorithm

1. Retain the 1st letter.
2. Drop all occurrences of a, e, i, o, u, h, w, y.
3. Replace consonants with digits as follows:
   • b, f, p, v → 1
   • c, g, j, k, q, s, x, z → 2
   • d, t → 3
   • l → 4
   • m, n → 5
   • r → 6
4. Remove duplicates:
   1. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter.
   2. Two letters with the same number separated by 'h' or 'w' are coded as a single number.
5. Iterate until you have one letter and three numbers, or append ‘0’ until you do.

Soundex Coding Guide

Number  Represents the Letters
1       b, f, p, v
2       c, g, j, k, q, s, x, z
3       d, t
4       l
5       m, n
6       r

Disregard the letters: a, e, i, o, u, h, w, y

Page 51: Bridging the Gap Between Speech Recognition and Business Logic


Example: “Robert”

1. Retain 1st letter: Robert
2. Drop all occurrences of a, e, i, o, u, h, w, y: Rbrt
3. Replace consonants with digits (b, f, p, v → 1; c, g, j, k, q, s, x, z → 2; d, t → 3; l → 4; m, n → 5; r → 6): R163
4. Remove duplicates.
5. Iterate until you have one letter and three numbers, or append ‘0’.

Soundex Coding Guide

Number  Represents the Letters
1       b, f, p, v
2       c, g, j, k, q, s, x, z
3       d, t
4       l
5       m, n
6       r

Disregard the letters: a, e, i, o, u, h, w, y

Robert : R163
Rupert : R163
Rubin : R150
Ashcraft : A261
Ashcroft : A261
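
The Soundex steps can be sketched in Java roughly as follows. This is a minimal hand-written version for illustration, not the Apache Commons Codec encoder referenced elsewhere in this deck; the class name SoundexSketch and the method name encode are assumptions made for this example.

```java
import java.util.Locale;

final class SoundexSketch {

    // Digit for each letter A..Z; '0' marks the disregarded letters a, e, i, o, u, h, w, y.
    private static final String CODES = "01230120022455012623010202";

    /** Encodes a word as one letter followed by three digits, e.g. "Robert" becomes "R163". */
    static String encode(String word) {
        StringBuilder out = new StringBuilder(4);
        char last = '0';
        for (char raw : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // ignore anything that is not a plain letter
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // step 1: retain the first letter
                last = code;     // remember its digit so an adjacent duplicate is dropped
                continue;
            }
            if (raw == 'H' || raw == 'W') {
                continue; // step 4.2: h and w do not separate letters with the same digit
            }
            if (code != '0' && code != last) {
                out.append(code); // steps 3 and 4.1: emit the digit unless it repeats
            }
            last = code; // a vowel resets 'last', so equal digits across a vowel are both kept
            if (out.length() == 4) {
                break; // step 5: one letter and three digits
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0'); // step 5: pad short codes with '0'
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Robert", "Rupert", "Rubin", "Ashcraft", "Ashcroft"}) {
            System.out.println(name + " : " + encode(name));
        }
    }
}
```

Running main should reproduce the five encodings listed on the slide. Note that "Ashcraft" only yields A261 because of the h/w rule: the s and c both map to 2 and are separated by an h, so they count once.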

Page 53: Bridging the Gap Between Speech Recognition and Business Logic


Apache Commons Codec - Language Encoders

#   Algorithm                          Correction Rate
1   RefinedSoundex                     64%
2   Soundex                            55%
3   StringUtils.LevenshteinDistance    46%
4   DoubleMetaphone                    37%
5   Metaphone                          28%
6   Nysiis                             28%
7   StringUtils.FuzzyDistance          28%
8   StringUtils.JaroWinklerDistance    28%
9   MatchRating                        19%
10  Caverphone1                        19%
11  Caverphone2                        10%
12  BeiderMorse                        0%
13  String.compareTo                   0%

Vocabulary: 126 Words

final String[][] mTuples = new String[][]{
    {"Bland", "blend"}, {"Bra", "brush"}, {"Player", "blur"}, {"Price", "brush"},
    {"Diane", "cyan"}, {"Boston", "emboss"}, {"in Boston", "emboss pen"},
    {"sheep", "jeep"}, {"cheap", "jeep"}, {"their", "there"}, {"they are", "there"}};

Page 54: Bridging the Gap Between Speech Recognition and Business Logic


Phonetic Algorithms

#   Algorithm         Correction Rate
1   RefinedSoundex    64%
2   Soundex           55%

Soundex Coding Guide

Number  Represents the Letters
1       b, f, p, v
2       c, g, j, k, q, s, x, z
3       d, t
4       l
5       m, n
6       r

Disregard the letters: a, e, i, o, u, h, w, y

Refined Soundex Guide

Number  Represents the Letters
1       b, p
2       f, v
3       c, k, s
4       g, j
5       q, x, z
6       d, t
7       l
8       m, n
9       r

Disregard the letters: a, e, i, o, u, h, w, y

Page 55: Bridging the Gap Between Speech Recognition and Business Logic


Build a Dictionary of Words/Sentences and Actions
Phonetically encode the Words in the Dictionary

Capture Speech Input (encoded sound file)

Speech Recognition returns a text

Phonetically encode the recognition result
Find the best match in the encoded dictionary

Launch the action mapped to the best match
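
A minimal sketch of this flow, assuming plain Soundex as the encoder and "same code" as the definition of best match. The class PhoneticDispatchSketch, its methods, and the action names such as "style:blend" are hypothetical and only serve the illustration; a real app would map codes to callbacks and would likely rank near misses instead of requiring equality.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class PhoneticDispatchSketch {

    private static final String CODES = "01230120022455012623010202";

    // Encoded dictionary: phonetic code of a vocabulary word -> action name.
    private final Map<String, String> actionByCode = new HashMap<>();

    /** Dictionary build: store the action under the phonetic code of its trigger word. */
    void register(String word, String action) {
        actionByCode.put(soundex(word), action);
    }

    /** Lookup: encode the recognizer output and return the mapped action, if any. */
    Optional<String> actionFor(String recognizedWord) {
        return Optional.ofNullable(actionByCode.get(soundex(recognizedWord)));
    }

    /** Hand-written Soundex, used here only as a stand-in for a library encoder. */
    static String soundex(String word) {
        StringBuilder out = new StringBuilder(4);
        char last = '0';
        for (char raw : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') continue;
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw);
                last = code;
                continue;
            }
            if (raw == 'H' || raw == 'W') continue;
            if (code != '0' && code != last) out.append(code);
            last = code;
            if (out.length() == 4) break;
        }
        while (out.length() > 0 && out.length() < 4) out.append('0');
        return out.toString();
    }

    public static void main(String[] args) {
        PhoneticDispatchSketch dispatch = new PhoneticDispatchSketch();
        dispatch.register("blend", "style:blend");
        dispatch.register("brush", "tool:brush");
        dispatch.register("cyan", "color:cyan");

        // "bland" and "blend" share the code B453, so the misrecognition is corrected.
        System.out.println(dispatch.actionFor("bland"));
        // "Diane" (D500) and "cyan" (C500) differ in the first letter, so nothing matches.
        System.out.println(dispatch.actionFor("Diane"));
    }
}
```

The second lookup shows a limit of this simple scheme: because Soundex keeps the first letter, a misrecognition that changes the initial sound is not recovered. Multi-word results such as "in Boston" would also need to be split or encoded as a phrase, which this sketch does not attempt.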

Page 56: Bridging the Gap Between Speech Recognition and Business Logic


Recognize who is speaking

Recognize what is said and record it

Execute simple commands

Intelligently respond to natural language input


Customer Benefit







Page 57: Bridging the Gap Between Speech Recognition and Business Logic


A web service for building apps and devices that users can talk or text to.

Provides an open and extensible natural language platform.

“Learns” from every interaction and leverages the community: what’s learned is shared across developers/apps.



Page 58: Bridging the Gap Between Speech Recognition and Business Logic


Request

curl -H 'Authorization: Bearer 4FPMGMHNMWDFV3E2BM5QQ4XVQA3A3W23' ''

Response

{
  "msg_id" : "befcd276-e6e0-4f53-a075-d26c991786e4",
  "_text" : "blue blur brush",
  "outcomes" : [ {
    "_text" : "blue blur brush",
    "intent" : "setup",
    "entities" : {
      "style" : [ { "value" : "blur" } ],
      "color" : [ { "value" : "blue" } ],
      "tool" : [ { "value" : "brush" } ]
    },
    "confidence" : 0.515
  } ]
}

Page 62: Bridging the Gap Between Speech Recognition and Business Logic


Alexa, ask Mindy, What is the Balance of my Bank of America checking account

• Intent: GetAccountBalance
• Institution: Bank of America
• Account Type: Checking

Intent Schema

{
  "intents": [
    {
      "intent": "GetAccountBalance",
      "slots": [
        { "name": "Bank", "type": "LITERAL" },
        { "name": "Type", "type": "LITERAL" }
      ]
    }
  ],
  ..
}


Page 63: Bridging the Gap Between Speech Recognition and Business Logic


• Intent: GetAccountBalance
• Institution: Bank of America
• Account Type: Checking

GetAccountBalance What is my balance

GetAccountBalance balance of my {Bank of America|Bank} account
GetAccountBalance balance of my {Wells Fargo|Bank} account
GetAccountBalance balance of my {PayPal|Bank} account
...

GetAccountBalance balance of my {savings|Type} account
GetAccountBalance balance of my {bank|Type} account
GetAccountBalance balance of my {credit|Type} account
GetAccountBalance balance of my {investment|Type} account

GetAccountBalance balance of my {Bank of America|Bank} {investment|Type} account
GetAccountBalance balance of my {PayPal|Bank} {checking|Type} account

GetAccountBalance balance of my {savings|Type} account at {Citi|Bank}
GetAccountBalance balance of my {credit|Type} account at {Credit Union|Bank}
...

Sample Utterances … provide as many as you can

intent …. {value|name} …. {value|name} ….
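
Since the advice is to provide as many sample utterances as possible, one option is to generate them from lists of slot values instead of typing each line. The sketch below is only an illustration of that idea: the class UtteranceSketch, the method generate, and the four sentence shapes are assumptions modeled on the lines above, not part of any Alexa tooling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class UtteranceSketch {

    /** Builds "intent ... {value|name} ..." lines for every bank, type, and bank/type pair. */
    static List<String> generate(String intent, List<String> banks, List<String> types) {
        List<String> lines = new ArrayList<>();
        String prefix = intent + " balance of my ";
        for (String bank : banks) {
            lines.add(prefix + slot(bank, "Bank") + " account");
        }
        for (String type : types) {
            lines.add(prefix + slot(type, "Type") + " account");
        }
        for (String bank : banks) {
            for (String type : types) {
                lines.add(prefix + slot(bank, "Bank") + " " + slot(type, "Type") + " account");
                lines.add(prefix + slot(type, "Type") + " account at " + slot(bank, "Bank"));
            }
        }
        return lines;
    }

    /** Formats one slot in the {value|name} notation used by the sample utterances. */
    private static String slot(String value, String name) {
        return "{" + value + "|" + name + "}";
    }

    public static void main(String[] args) {
        List<String> banks = Arrays.asList("Bank of America", "Wells Fargo", "PayPal");
        List<String> types = Arrays.asList("savings", "checking", "credit", "investment");
        for (String line : generate("GetAccountBalance", banks, types)) {
            System.out.println(line);
        }
    }
}
```

With three banks and four account types this produces 3 + 4 + 2 × 12 = 31 lines, which can be pasted into the sample utterances file alongside hand-written variations.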

Page 65: Bridging the Gap Between Speech Recognition and Business Logic


AIML (Artificial Intelligence Markup Language) is an XML dialect for creating natural language software agents.

Dr. Richard Wallace

<?xml version="1.0" encoding="utf-8" ?>
<aiml version="2.0">
  <category>
    <pattern>What is *</pattern>
    <template>I don’t know what <star/> is</template>
  </category>
</aiml>

Page 66: Bridging the Gap Between Speech Recognition and Business Logic


<?xml version="1.0" encoding="utf-8" ?>
<aiml version="2.0">
  <category>
    <pattern>What is *</pattern>
    <template>
      I don’t know what <star/> is
    </template>
  </category>
</aiml>

What is cheese?
I don’t know what cheese is.

What is chocolate?
I don’t know what chocolate is.

<?xml version="1.0" encoding="utf-8" ?>
<aiml version="2.0">
  <category>
    <pattern>I like *</pattern>
    <template>
      I see, you like <person><star/></person>
    </template>
  </category>
</aiml>

I like coffee.
I see, you like coffee.

I like your voice.
I see, you like my voice.

I like her dress.
I see, you like her dress.
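
To make the wildcard mechanics concrete, here is a rough Java sketch of how a single category could be matched: the * in the pattern captures one or more characters, and <star/> in the template is replaced by the captured text. This is not how an AIML interpreter is actually implemented; the class AimlSketch and the method respond are invented for this example, only the first wildcard is substituted, and the pronoun swapping done by <person> is left out.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AimlSketch {

    /** Returns the filled-in template if the input matches the pattern, otherwise empty. */
    static Optional<String> respond(String pattern, String template, String input) {
        // Simplified normalization: drop trailing punctuation and whitespace.
        String normalized = input.trim().replaceAll("[.?!\\s]+$", "");

        // Turn "What is *" into a regex: literal parts are quoted, each * becomes (.+).
        String[] parts = pattern.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append("(.+)");
            }
            regex.append(Pattern.quote(parts[i]));
        }

        Matcher m = Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE).matcher(normalized);
        if (!m.matches()) {
            return Optional.empty();
        }
        // Substitute the first captured wildcard, if the pattern had one.
        String reply = m.groupCount() >= 1 ? template.replace("<star/>", m.group(1)) : template;
        return Optional.of(reply);
    }

    public static void main(String[] args) {
        System.out.println(respond("What is *", "I don't know what <star/> is", "What is cheese?"));
        System.out.println(respond("I like *", "I see, you like <star/>", "I like coffee."));
        System.out.println(respond("What is *", "I don't know what <star/> is", "I like coffee."));
    }
}
```

The third call returns an empty result because the input does not fit the pattern; a full interpreter would instead try the remaining categories and pick the most specific match.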

Page 67: Bridging the Gap Between Speech Recognition and Business Logic


<?xml version="1.0" encoding="utf-8" ?>
<aiml version="2.0">

  <category>
    <pattern><set>colors</set> <set>styles</set> <set>tools</set></pattern>
    <template>Color=<star/> Style=<star index="2"/> Tool=<star index="3"/></template>
  </category>

  <category>
    <pattern><set>colors</set> <set>styles</set> ^</pattern>
    <template>Color=<star/> Style=<star index="2"/></template>
  </category>

  <category>
    <pattern><set>colors</set> <set>tools</set> ^</pattern>
    <template>Color=<star/> Tool=<star index="2"/></template>
  </category>

  <category>
    <pattern><set>colors</set> ^</pattern>
    <template>Color=<star/></template>
  </category>

  <category>
    <pattern><set>styles</set></pattern>
    <template>Style=<star/></template>
  </category>

  <category>
    <pattern><set>tools</set></pattern>
    <template>Tool=<star/></template>
  </category>

</aiml>

TOOLS
[["Roll"], ["Brush"], ["Pen"]]

STYLES
[["Blur"], ["Emboss"], ["Blend"]]

COLORS
.. ["Dark", "blue"], ["Dark", "brown"], ["Dark", "byzantium"], ["Dark", "cerulean"], ["Dark", "chestnut"], ["Dark", "coral"], ["Dark", "cyan"], ["Dark", "goldenrod"], ["Dark", "gray"], ["Dark", "green"], ["Dark", "khaki"], ["Dark", "lava"], ["Dark", "lavender"], ["Dark", "magenta"], …

Page 69: Bridging the Gap Between Speech Recognition and Business Logic



FB acq.



AIML 1.0

AIML 2.0

GrXML 1.0


AMZN acq. Ivona

AMZN acq. Yap

Page 71: Bridging the Gap Between Speech Recognition and Business Logic


Emotional Prosody
Wolf Paulus

Page 72: Bridging the Gap Between Speech Recognition and Business Logic


Emotion and Voice enabled Avatar
Emotional Prosody


Page 74: Bridging the Gap Between Speech Recognition and Business Logic


Summary

The adoption of context-free recognizers may benefit from post-recognition processing, using text-to-sound language encoders, dictionaries, etc.

Gopher-VUIs are still hard to find, while Apple, Google, Amazon, Microsoft, and Facebook are building Assistant-VUIs. For domain-specific information, or where information is protected, those Assistants will require 3rd-party substitutes. It will be interesting to see how much control (i.e. personality) those substitutes will be granted.

Basic NLU solutions from,, and Amazon/Alexa reach their limits fast and are hardly powerful enough to implement those substitute-assistants. AIML is the oldest, but it still seems to be the most powerful declarative approach for simple VUI-related tasks.

Page 75: Bridging the Gap Between Speech Recognition and Business Logic


Thanks for listening