Final Project Presentation
By Amritaansh Verma

Introduction
  I am making a virtual voice assistant that understands and reacts to emotions.
  The emotions I am targeting are Sarcasm, Happiness, Anger/Aggression, and Sadness/Boredom.

Why is it interesting?
  Most current virtual assistants, such as Apple's Siri, Samsung's S-Voice, and Vlingo, disregard the prosodic information in the user's speech.
  Including this capability would make virtual assistants more life-like and help them gain more widespread acceptance.

Is it really required?
  Do users really direct anger or sarcasm towards virtual assistants?
  Sometimes it is an inevitable, spontaneous, natural human reaction.

Project Description
  Two main components:
  1. An emotion detection module (openEAR)
  2. A simple voice assistant

Emotion Detection Module: openEAR Toolkit
  Munich Open-Source Emotion and Affect Recognition Toolkit
  Open source and free
  Provides efficient feature extraction algorithms implemented in C++, classifiers, and pre-trained models on well-known emotion databases
  Behaves as an API: it takes user utterances as input and gives their classification into the four basic emotional categories as output.
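Since the assistant itself is a Java application (see the Demonstration slide) and openEAR is driven through its SMILExtract feature extractor, one way to use it "as an API" is to run it as an external process per utterance. Below is a minimal sketch of that wrapper; the config file name emo_config.conf is a placeholder, and the idea of parsing the predicted class from the tool's console output is my assumption, not something stated in the slides.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

/**
 * Sketch of wrapping openEAR as an external emotion classifier.
 * Assumptions: SMILExtract is on the PATH, "emo_config.conf" is a
 * pre-trained emotion model configuration (placeholder name), and the
 * predicted class shows up on a console line containing "result".
 */
public class EmotionDetector {

    public String classify(String wavPath) throws Exception {
        // Run openEAR's feature extraction + classification on one utterance.
        Process p = new ProcessBuilder(
                "SMILExtract", "-C", "emo_config.conf", "-I", wavPath)
                .redirectErrorStream(true)
                .start();

        String emotion = "neutral"; // fallback if no label is reported
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                // Hypothetical parse: keep the last reported class label.
                if (line.toLowerCase().contains("result")) {
                    emotion = line.substring(line.lastIndexOf(':') + 1).trim();
                }
            }
        }
        p.waitFor();
        return emotion; // e.g. "anger", "happiness", "sadness", "sarcasm"
    }
}
```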

openEAR is ready to use
  Four ready-to-use classifier model sets are provided for recognition of basic emotion categories and interest level.
  I am also planning to collect some speech data myself, covering the four emotional categories that arise in a typical interaction with a virtual assistant (Anger, Happiness, Sadness, and Sarcasm), to train and test my module on.

Built-in classifier model sets
  Berlin Speech Emotion Database (EMO-DB), containing seven classes of basic emotions (Anger, Fear, Happiness, Disgust, Boredom, Sadness, Neutral)
  eNTERFACE corpus with six emotion categories (Anger, Disgust, Fear, Happiness, Sadness, and Surprise)
  ABC corpus with the classes Aggressive, Cheerful, Intoxicated, Nervous, Neutral, and Tired
  Audio Visual Interest Corpus (AVIC) with labels for three levels of interest (-1: disinterest, 0: normal, 1: high interest)

How does it work?
  The voice assistant records user speech (or pipelines real-time speech data for faster incremental processing) and forwards it to the openEAR API to detect its emotional content.
  Upon getting back the prosody information, the virtual assistant responds appropriately based on both the user utterance and the prosody information. (A sketch of the recording step follows the diagram below.)

How does it work? (diagram)
  [Flow diagram: the user utterance goes to the voice assistant, which forwards the utterance to openEAR; openEAR returns the prosody information, and the assistant then produces its response.]
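Here is a minimal sketch of the recording half of this pipeline, using Java's standard javax.sound.sampled API to capture one utterance from the microphone into a WAV file that can then be handed to the emotion detector sketched earlier. The 16 kHz, 16-bit mono format and the fixed-length recording window are illustrative assumptions; the real assistant could instead pipeline audio incrementally as noted above.

```java
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;
import java.io.File;

/**
 * Sketch of the recording step: capture one utterance from the microphone
 * and write it to a WAV file. The audio format and the fixed recording
 * window are assumptions, not values from the presentation.
 */
public class UtteranceRecorder {

    public File recordUtterance(String path, int seconds) throws Exception {
        AudioFormat format = new AudioFormat(16000f, 16, 1, true, false);
        TargetDataLine mic = AudioSystem.getTargetDataLine(format);
        mic.open(format);
        mic.start();

        File wav = new File(path);
        // Write in a background thread; AudioSystem.write blocks until the line is closed.
        Thread writer = new Thread(() -> {
            try (AudioInputStream in = new AudioInputStream(mic)) {
                AudioSystem.write(in, AudioFileFormat.Type.WAVE, wav);
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        writer.start();

        Thread.sleep(seconds * 1000L); // fixed-length recording window
        mic.stop();
        mic.close();
        writer.join();
        return wav;
    }
}
```

In the full flow, the resulting WAV file would be sent both to the speech recognizer (for the utterance text) and to the EmotionDetector sketch (for the prosody label).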

Some Use Cases
  Sarcasm example:
  Virtual Assistant: Did you mean (…says something it interpreted incorrectly…)?
  User: (sarcastically/angrily) Yes, that's exactly what I meant!
  Virtual Assistant (appropriate response): I am sorry that I misunderstood. Can you please provide me with that information again?

  Happiness example:
  User: (happily) Remind me to meet Roy tomorrow at 6.
  Virtual Assistant: I have set a reminder for 6 tomorrow. By the way, you seem happy today; what's the good news?

  Anger example:
  Virtual Assistant: Would you like me to call (wrong person's name)?
  User: (angrily) NO!
  Virtual Assistant: (opens the phonebook for manual selection rather than repeating the question, as a response to the detected anger)

  Sadness example:
  User: (sad/bored) What's the weather like today? (or any other task)
  Virtual Assistant: The weather is … You sound sad today; do you want me to tell you a joke to cheer you up?
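All four use cases above follow the same pattern: the reply depends on both the recognized request and the detected emotion. The sketch below illustrates that response logic; the intent labels and the plain if/else policy are illustrative assumptions, not the project's actual dialogue manager.

```java
/**
 * Sketch of emotion-aware response selection covering the four example
 * use cases from the slides. Intent labels and wording are illustrative.
 */
public class ResponsePolicy {

    public String respond(String intent, String emotion, String defaultReply) {
        // Sarcasm/anger on a "Did you mean ...?" confirmation: apologize and re-ask.
        if (intent.equals("confirmation")
                && (emotion.equals("sarcasm") || emotion.equals("anger"))) {
            return "I am sorry that I misunderstood. "
                 + "Can you please provide me with that information again?";
        }
        // Anger on a wrong-contact prompt: fall back to manual selection.
        if (intent.equals("call_contact") && emotion.equals("anger")) {
            return "Opening your phonebook so you can pick the contact yourself.";
        }
        // Happiness: complete the task, then acknowledge the user's mood.
        if (emotion.equals("happiness")) {
            return defaultReply + " By the way, you seem happy today. What's the good news?";
        }
        // Sadness/boredom: complete the task, then offer to cheer the user up.
        if (emotion.equals("sadness")) {
            return defaultReply + " You sound sad today. Do you want me to tell you a joke?";
        }
        return defaultReply; // neutral case: no prosody-driven change
    }
}
```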

Evaluation
  I am planning to take a hands-on approach: testing the application on a number of utterances (from a corpus of real users) belonging to different categories of emotions, and seeing whether it can properly classify and respond to them.
  The users would also be given a usability questionnaire to restrict the domain of utterances to more relevant ones.

Demonstration
  openEAR toolkit
  Virtual assistant (Java application)

Lessons Learnt from the Course
  The papers on recognition and understanding of prosody helped me with the basics.
  Current systems are lacking in prosody recognition and response, hence the motivation for this project.

Future Work
  Fine-tune and debug the integration of my voice assistant with openEAR.
  Figure out the right platform for the virtual assistant: NPCEditor, an AIML chatbot, or VoiceXML; right now I am testing with a simple Java application.
  Collect data on typical human interactions with virtual assistants and train openEAR on it for better accuracy.
  Develop and implement more use cases.
  Port the application to an Android device (tricky?).

Open Questions
  Some good follow-up projects might be:
  Extending the openEAR implementation itself to detect new emotions
  Making this a pluggable module that can be integrated into existing virtual assistants to give them the capability to recognize emotions