Final Project Presentation - University of Southern...
Introduction I am making a virtual voice assistant that
understands and reacts to emotions. The emotions I am targeting are Sarcasm,
Happiness, Anger/Aggression, and Sadness/Boredom.
Why is it interesting? Most current virtual assistants, such as Apple's
Siri, Samsung's S Voice, and Vlingo, disregard the prosodic information in the user's speech.
Including this capability in virtual assistants will make them more life-like and would help them gain more widespread acceptance.
Is it really required? Do users really direct anger or sarcasm
towards virtual assistants? Sometimes it is an inevitable, spontaneous,
natural human reaction.
Project Description Two Main Components: 1. An emotion detection module (openEAR) 2. A simple voice assistant
Emotion Detection Module The openEAR Toolkit (Munich Open-Source Emotion and Affect
Recognition Toolkit) is open source and free. It provides efficient feature-extraction
algorithms implemented in C++, classifiers, and models pre-trained on well-known emotion corpora.
It behaves as an API: it takes user utterances as input and gives its classification into the four basic emotional categories as output.
openEAR is ready to use Four ready-to-use classifier model sets are
provided for recognition of basic emotion categories and interest level.
I am planning to collect some speech data myself, pertaining to the four basic emotional categories that arise in a typical interaction with a virtual assistant (Anger, Happiness, Sadness, and Sarcasm), to train and test my module on.
Built-in classifier model sets Berlin Speech Emotion Database (EMO-DB),
containing seven classes of basic emotions (Anger, Fear, Happiness, Disgust, Boredom, Sadness, Neutral)
eNTERFACE corpus with six emotion categories (Anger, Disgust, Fear, Happiness, Sadness, and Surprise)
ABC corpus with the classes (Aggressive, Cheerful, Intoxicated, Nervous, Neutral, Tired)
Audio Visual Interest Corpus (AVIC) with labels for three levels of interest (-1: disinterest, 0: normal, and 1: high interest).
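Since none of the built-in model sets uses exactly this project's four target categories, reusing them requires mapping their labels onto the targets. The sketch below shows one such mapping from the EMO-DB labels; the pairings (and the fallback to Neutral for labels with no target category) are my own illustrative choices, not something openEAR prescribes.

```java
import java.util.Map;

// Hypothetical mapping from EMO-DB's seven labels to this project's
// four target categories. Note that Sarcasm appears in no built-in
// model set, which is one motivation for collecting custom data.
public class LabelMapper {
    private static final Map<String, String> EMO_DB_TO_TARGET = Map.of(
            "Anger", "Anger/Aggression",
            "Happiness", "Happiness",
            "Sadness", "Sadness/Boredom",
            "Boredom", "Sadness/Boredom",
            "Neutral", "Neutral",
            "Fear", "Neutral",     // no corresponding target category
            "Disgust", "Neutral"); // no corresponding target category

    public static String toTarget(String emoDbLabel) {
        return EMO_DB_TO_TARGET.getOrDefault(emoDbLabel, "Neutral");
    }

    public static void main(String[] args) {
        System.out.println(toTarget("Boredom")); // prints Sadness/Boredom
    }
}
```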
How does it work? The voice assistant will record user speech
(or pipeline real-time speech data for faster incremental processing) and forward it to the openEAR API to detect its emotional content.
Upon getting back the prosody information, the virtual assistant responds appropriately, based on both the user utterance and the prosody information.
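The record-and-classify step above could be sketched as follows. Here openEAR is driven as an external process; the SMILExtract command line, the config file name emo_live.conf, and the "RESULT:" output format are all assumptions made for illustration, not openEAR's documented interface.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch of the "record, classify, respond" pipeline: the assistant
// hands a recorded utterance to openEAR and reads back an emotion label.
public class EmotionPipeline {

    // Parse an assumed classifier output line such as "RESULT: Anger".
    // Returns null if the line carries no result.
    static String parseLabel(String line) {
        if (line != null && line.startsWith("RESULT:")) {
            return line.substring("RESULT:".length()).trim();
        }
        return null;
    }

    // Run the (hypothetical) openEAR invocation on a recorded utterance
    // and return the last emotion label found in its output.
    static String classifyEmotion(String wavPath) throws Exception {
        Process p = new ProcessBuilder(
                "SMILExtract", "-C", "emo_live.conf", "-I", wavPath)
                .redirectErrorStream(true)
                .start();
        String label = "Neutral"; // fallback when nothing is detected
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                String parsed = parseLabel(line);
                if (parsed != null) label = parsed;
            }
        }
        p.waitFor();
        return label;
    }
}
```

Running openEAR per utterance keeps the integration simple; the pipelined real-time variant would instead stream audio into a long-running openEAR process.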
Some Use Cases Sarcasm Example: Virtual Assistant: Did you mean (…says something which it interpreted incorrectly…)? User: (Sarcastically/Angrily) Yes, that's exactly what I meant! Virtual Assistant (appropriate response): I am sorry that I misunderstood. Can you please provide me with that information again?
Some Use Cases Happiness Example: User: (Happily) Remind me to meet Roy tomorrow at 6. Virtual Assistant: I have set a reminder for 6 tomorrow. By the way, you seem happy today; what's the good news?
Some Use Cases Anger Example: Virtual Assistant: Would you like me to call (Wrong person's name)? User: (Angrily) NO! Virtual Assistant: (Opens up the phonebook for manual selection, rather than repeating the question, as a response to anger detection)
Some Use Cases Sadness Example: User: (Sad/Bored) What's the weather like today? (Or any other task) Virtual Assistant: The weather is … You sound sad today; do you want me to tell you a joke to cheer you up?
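The four use cases amount to a dispatch on the detected emotion. A minimal sketch of that response policy is below; the category names follow the project's targets, while the method name and the "ACTION:" return convention for non-verbal responses are my own illustration.

```java
// Emotion-conditioned response selection covering the four use cases:
// given the detected emotion and the assistant's default reply, return
// either an adjusted reply or an action tag for the UI layer.
public class ResponsePolicy {
    public static String respond(String emotion, String defaultReply) {
        switch (emotion) {
            case "Sarcasm":
                // Treat the sarcastic "yes" as a no: apologize and re-ask.
                return "I am sorry that I misunderstood. "
                        + "Can you please provide me with that information again?";
            case "Anger/Aggression":
                // Avoid repeating the question; hand control to the user.
                return "ACTION:open_phonebook";
            case "Happiness":
                return defaultReply
                        + " By the way, you seem happy today. What's the good news?";
            case "Sadness/Boredom":
                return defaultReply
                        + " You sound sad today. Do you want me to tell you"
                        + " a joke to cheer you up?";
            default:
                return defaultReply;
        }
    }
}
```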
Evaluation I am planning a hands-on
approach, which involves testing the application on a number of utterances (from a corpus of real users) belonging to different categories of emotions and seeing whether it can properly classify and respond to them.
The users would be given a usability questionnaire, to restrict the domain of utterances to those that are more relevant.
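The classification half of this evaluation reduces to comparing predicted labels against gold labels over the test utterances. A minimal sketch, assuming labels are collected as parallel lists (the helper name and signature are my own):

```java
import java.util.List;

// Minimal accuracy computation for the planned evaluation: given gold
// emotion labels and the labels predicted by the detection module for
// the same utterances, report the fraction classified correctly.
public class Evaluator {
    public static double accuracy(List<String> gold, List<String> predicted) {
        if (gold.isEmpty() || gold.size() != predicted.size()) {
            throw new IllegalArgumentException(
                    "label lists must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            if (gold.get(i).equals(predicted.get(i))) correct++;
        }
        return (double) correct / gold.size();
    }
}
```

Per-category accuracy (e.g. how often Sarcasm specifically is recognized) would follow the same pattern, filtered by gold label.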
Lessons Learnt from the Course The papers on recognizing and
understanding prosody helped me with the basics.
Current systems are lacking in prosody recognition and response, hence the motivation.
Future Work: I have to fine-tune and debug the integration of my
voice assistant with openEAR. Figure out the right choice for the virtual assistant:
NPCEditor, an AIML chatbot, or VoiceXML; right now I am testing with a simple Java application.
Collect data pertaining to typical interactions of humans with virtual assistants and train openEAR on it for better accuracy.
Develop more use cases and implement them. Port the application to an Android device
(tricky?).