Post on 18-Aug-2020

Developing Your Own Wake Word Engine Just Like “Alexa” and “OK Google”

Xuchen Yao, CEO, KITT.AI

Guoguo Chen, CTO, KITT.AI

What’s a “wake word”?

• Wake word

• Hot word

• Offline

• Code runs on CPU/DSP/MCU

• 7x24

• Always listening

• One shot understanding

• Online

• Code runs on cloud

• On Demand

• Explicit permission

Alexa

OK Google

Hey Siri

what’s the weather today?

Conversational UI Pipeline

[wake up device] → [speech → text] → [text understanding] → [dialogue management] → [text → speech]

Diagram labels: text, voice

a customizable hotword detection engine

a.k.a: deep neural network in 2MB of RAM

hotword.io video blog

Who’s using it (released 5/2016)

10,000+ developers, 7,000+ unique hotwords

Dominating developer community for hotword detection

Use Cases

#1 Hotword: Smart Mirror (credits to Evan Cohen)

https://github.com/evancohen/smart-mirror

Command & Control: GoPiGo (credits to Paul Matz)

Project RePL (credits to Chris Burns)

Speech Pipeline

Voice → Microphone Array → Wake Word Detection (local) → Speech Recognition (cloud/local)

Voice
• Telephone (8 kHz sampling)
• Others (16 kHz)
• Noises: TV, radio, street, café, car, music
• Pitch: children, adults, senior
• Accent: US/UK/Europe/Asian…

Microphone Array
• Close talking
• Far field (3-9 feet)
• 2, 4, or 6 microphones
• Linear/circular
• Adaptive Echo Cancellation
• Beam forming

Wake Word Detection (local)
• Voice Activity Detection
• Auto Gain Control
• Fast response (0.1 second)
• High accuracy

Speech Recognition (cloud/local)
• IBM/Microsoft/Nuance/Google
• Alexa Voice Service
• Kaldi
• PocketSphinx
• HTK
• Command & Control
• Language Understanding
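
The speech pipeline lists Voice Activity Detection and Auto Gain Control without saying how they are built. As a rough illustration only, the sketch below uses a plain energy threshold for VAD and a capped RMS normalizer for gain. The function names, the 10 ms frame size, and the threshold values are all assumptions chosen for the example, not details from the slides, and production systems typically use more robust methods.

```python
import math

def frame_rms(samples):
    """Root-mean-square level of one frame of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def energy_vad(signal, sample_rate=16000, frame_ms=10, threshold=0.02):
    """Return one boolean per full frame: True where RMS exceeds threshold.

    threshold is a hypothetical value; it would be tuned per microphone.
    """
    frame_len = sample_rate * frame_ms // 1000
    flags = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        flags.append(frame_rms(frame) > threshold)
    return flags

def apply_gain(samples, target_rms=0.1, max_gain=10.0):
    """Scale a frame toward target_rms, capping the gain so that
    near-silent frames are not amplified without limit."""
    level = frame_rms(samples)
    if level == 0.0:
        return list(samples)
    gain = min(target_rms / level, max_gain)
    # Clamp to the valid sample range after scaling.
    return [max(-1.0, min(1.0, s * gain)) for s in samples]

# 0.1 s of silence followed by 0.1 s of a 440 Hz tone, sampled at 16 kHz.
rate = 16000
silence = [0.0] * 1600
tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(1600)]
flags = energy_vad(silence + tone, sample_rate=rate)
leveled = apply_gain(tone[:160])
```

In this toy input the first ten frames are flagged as non-speech and the last ten as speech, so a wake word detector would only need to run on the second half.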

Supported Platforms and Wrappers

• Raspberry Pi

• Mac OS X

• iPhone/iPad/iPod

• x86/64bit Ubuntu

• Android

• Pine 64

• Intel Edison

• Samsung Artik

• Allwinner R-series

• Ingenic X1000

• Rockchip

Personal vs. Universal models

                        Personal       Universal
Voice samples needed    3              At least 1500
Speaker-independent     No             Yes
Speaker-specific        Sort of        No
Robust against noise    No             Yes
Free                    Yes            No
Time needed             Immediately    2 weeks

Customizing a universal model

define hotword → collect voice → train a model → deliver & evaluate → deploy to beta users → ship & success

collect voice from device

hotword web API

Iterate & Improve

desired performance:

>90% detection rate

<= 3 false alarms in 24 hours
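
The two targets above can be checked mechanically once an evaluation run has been counted. The sketch below is one possible way to do that; the function names and the example counts are hypothetical, and it assumes false alarms are measured on audio that contains no hotword.

```python
def detection_rate(detected, total_positive):
    """Fraction of true hotword utterances that triggered the detector."""
    if total_positive <= 0:
        raise ValueError("total_positive must be positive")
    return detected / total_positive

def false_alarms_per_24h(false_alarms, negative_audio_hours):
    """Normalize a false alarm count to a 24-hour period."""
    if negative_audio_hours <= 0:
        raise ValueError("negative_audio_hours must be positive")
    return false_alarms * 24.0 / negative_audio_hours

def meets_target(detected, total_positive, false_alarms, negative_audio_hours,
                 min_rate=0.90, max_fa_per_24h=3.0):
    """True when detection rate is above min_rate and the normalized
    false alarm count is at most max_fa_per_24h."""
    rate = detection_rate(detected, total_positive)
    fa = false_alarms_per_24h(false_alarms, negative_audio_hours)
    return rate > min_rate and fa <= max_fa_per_24h

# Hypothetical evaluation: 460 of 500 hotword utterances detected,
# 5 false alarms over 48 hours of hotword-free audio.
ok = meets_target(detected=460, total_positive=500,
                  false_alarms=5, negative_audio_hours=48)
```

With these made-up counts the detection rate is 92% and the false alarm rate is 2.5 per 24 hours, so the model would pass both targets.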

Science behind wake word

Challenges

• High detection rate

• Low false alarm

• Efficient: detect every 0.1 second

• Small RAM: <2MB

• Too much ambiguity, not much context

Is this “Alexa”?

[Figure: short window vs. longer window]
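
To make the efficiency and RAM challenges concrete, the arithmetic below works out how much audio arrives per 0.1 second step and how many model parameters would fit in 2 MB at different precisions. It is a back-of-the-envelope sketch: it reads 2 MB as 2 MiB and assumes the whole budget goes to parameters, which overstates what is available once audio buffers and activations are counted.

```python
def samples_per_step(sample_rate_hz, step_seconds):
    """Number of audio samples that arrive during one detection step."""
    return round(sample_rate_hz * step_seconds)

def max_parameters(ram_bytes, bytes_per_parameter):
    """Upper bound on parameter count if all RAM held parameters."""
    return ram_bytes // bytes_per_parameter

RAM_BUDGET = 2 * 1024 * 1024  # 2 MB read as 2 MiB (assumption)

print(samples_per_step(16000, 0.1))  # 1600
print(samples_per_step(8000, 0.1))   # 800

for name, size in [("32-bit float", 4), ("16-bit float", 2), ("8-bit integer", 1)]:
    print(name, max_parameters(RAM_BUDGET, size))
# 32-bit float 524288
# 16-bit float 1048576
# 8-bit integer 2097152
```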

Existing Algorithm

• Advantage:

–Simplified pipeline

–Simplified decoder

• Disadvantage:

–Massive hotword specific training data

Possible Ways to Improve

• Data augmentation

– Adding noise

– Adding reverberation

– And so on…

[Figure: original; add noise; add noise and reverberation]
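
The two augmentations named above can be sketched in a few lines. This is a minimal illustration, not the pipeline used for training: noise is mixed in at a chosen signal-to-noise ratio, and reverberation is simulated by convolving with an impulse response. The function names are hypothetical, and the decaying impulse response here is a toy stand-in for a measured or simulated room response.

```python
import math
import random

def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x)) if x else 0.0

def add_noise(clean, noise, snr_db):
    """Mix noise into clean so the signal-to-noise ratio equals snr_db."""
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as clean")
    noise = noise[:len(clean)]
    clean_rms, noise_rms = rms(clean), rms(noise)
    if clean_rms == 0.0 or noise_rms == 0.0:
        return list(clean)
    # Scale the noise so that clean_rms / (scale * noise_rms) = 10^(snr_db/20).
    scale = clean_rms / (noise_rms * 10 ** (snr_db / 20.0))
    return [c + scale * n for c, n in zip(clean, noise)]

def add_reverb(signal, impulse_response):
    """Convolve signal with an impulse response, truncated to input length."""
    out = [0.0] * len(signal)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for j, h in enumerate(impulse_response):
            if i + j >= len(signal):
                break
            out[i + j] += s * h
    return out

rng = random.Random(0)  # fixed seed so the example is repeatable
rate = 16000
clean = [0.5 * math.sin(2 * math.pi * 300 * n / rate) for n in range(rate // 10)]
noise = [rng.gauss(0.0, 1.0) for _ in clean]
# Toy impulse response: direct path plus an exponentially decaying tail.
impulse = [1.0] + [0.3 * math.exp(-n / 200.0) for n in range(1, 800)]

noisy = add_noise(clean, noise, snr_db=10.0)
noisy_reverb = add_reverb(noisy, impulse)
```

Each original recording can be expanded this way into several training samples by varying the noise source, the SNR, and the impulse response.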

Possible Ways to Improve

• Network models

– Model selection

• Feedforward models? Recurrent models?

– Model compression

• 32-bit float → 16-bit float → 8-bit integer

• Parameters with small absolute value
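
The compression ideas above can be illustrated on a plain list of weights. The sketch below is one common reading of the slide rather than the method actually used: half-precision rounding via the standard `struct` module, symmetric linear quantization to 8-bit integers, and zeroing parameters whose absolute value falls below a threshold. The function names, the example weights, and the threshold are hypothetical.

```python
import struct

def to_float16(w):
    """Round a value to IEEE 754 half precision and back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", w))[0]

def quantize_int8(weights):
    """Symmetric linear quantization: map floats to integers in [-127, 127].

    Returns the integer codes and the scale needed to recover the values.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Approximate the original floats from integer codes."""
    return [c * scale for c in codes]

def prune_small(weights, threshold):
    """Zero out parameters whose absolute value is below threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]

weights = [0.50, -0.25, 0.01, -1.00, 0.003]  # made-up example values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
pruned = prune_small(weights, threshold=0.05)
```

After quantization each restored weight differs from the original by at most half a quantization step, and the pruned list keeps only the three larger weights, which is what makes sparse storage possible.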

Possible Ways to Improve

• Decoder redesigning

– Modeling smaller units

• Syllables, phones, etc.

– False alarm suppression

• Additional classifier?
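
The slide leaves open how false alarms would be suppressed, asking only whether an additional classifier could help. As a simpler stand-in, not that classifier, the sketch below smooths per-frame confidence scores with a moving average and enforces a hold-off period after each detection, so a single spurious spike does not fire and one utterance does not fire twice. All names and default values are assumptions for illustration.

```python
from collections import deque

def detect_hotword(scores, threshold=0.8, smooth_frames=3, holdoff_frames=10):
    """Return the frame indices at which a detection fires.

    scores: per-frame hotword confidence in [0, 1] from an upstream model.
    A detection needs a full window whose average reaches threshold, and
    must be at least holdoff_frames after the previous detection.
    """
    window = deque(maxlen=smooth_frames)
    detections = []
    last_fire = None
    for i, score in enumerate(scores):
        window.append(score)
        smoothed = sum(window) / len(window)
        in_holdoff = last_fire is not None and i - last_fire < holdoff_frames
        if len(window) == smooth_frames and smoothed >= threshold and not in_holdoff:
            detections.append(i)
            last_fire = i
    return detections

spike = [0.1, 0.1, 0.95, 0.1, 0.1]            # one noisy frame
sustained = [0.1, 0.9, 0.9, 0.9, 0.9, 0.1]    # a real utterance

print(detect_hotword(spike))      # []
print(detect_hotword(sustained))  # [3]
```

The trade-off is latency: a longer smoothing window rejects more spikes but delays the detection by the same number of frames.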

Training with Tesla K20/K80

• Positive data

– 1,500 hotword samples

• Negative data

– Thousands of hours of speech

• Training time

– Half a day with 4 K80 GPUs

Software Architecture

[Architecture diagram] Frontend; Backend; Devices; Production Cloud; KITT.AI Scientific Computing Deep Learning Cloud; Traffic; ELB; Content; Websocket (audio, msg); HTTPs; Message Queue

Data → Training → Model → Deploy

Running Your First Snowboy Demo