Speech Recognition Yonglei Tao. Voice-Activated GPS.
Speech Recognition
Yonglei Tao
Voice-Activated GPS
Voice User Interface (VUI)
  A VUI allows human interaction with computers through a voice/speech platform
  Basic components
    Speech recognition
    Meaning extraction
    Response generation
    Speech output
  Benefits
    Loosens some physical constraints
    Provides tools for universal design (disability and situational impairments)
    Intuitive and efficient
System Architecture
  Components
    Endpointing: speech to endpointed utterance
    Feature extraction: endpointed utterance to feature vectors
    Recognition: feature vectors to word string(s)
    Natural language understanding: word string(s) to meaning(s)
    Dialog management: meaning to actions
Typical Recognition Components
Examples
  Book, boot
  Write, right
  Flew, flu, flue
  Eight books / Ate books
  I scream / Ice cream
Components
  Acoustic models
    Internal representation of each basic sound
  Dictionary
    A list of words and pronunciations
  Grammar
    Defines all possible strings of words the recognizer can handle
    Allows a meaning to be associated with those strings
    Either rule-based or statistical
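The dictionary component can be pictured as a lookup from pronunciations to words, which also shows why pairs such as "write" and "right" cannot be separated by sound alone. The following is a minimal sketch, not any engine's actual data structure; the class name and the ARPAbet-style phoneme strings are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy pronunciation dictionary: finds all words that share a phoneme string. */
final class HomophoneDemo {
    // Word -> pronunciation. The phoneme strings are illustrative assumptions.
    static Map<String, String> dictionary() {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("write", "R AY T");
        d.put("right", "R AY T");
        d.put("flew", "F L UW");
        d.put("flu", "F L UW");
        d.put("flue", "F L UW");
        d.put("book", "B UH K");
        d.put("boot", "B UW T");
        return d;
    }

    // Returns every dictionary word whose pronunciation equals the given phoneme string.
    static List<String> wordsFor(String phonemes) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> e : dictionary().entrySet()) {
            if (e.getValue().equals(phonemes)) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(wordsFor("R AY T"));  // [write, right]
        System.out.println(wordsFor("F L UW"));  // [flew, flu, flue]
    }
}
```

When a lookup returns more than one word, the acoustic evidence is identical for all of them, so the grammar (or a statistical language model) has to make the choice.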
Recognition
  Recognition search
    A recognizer searches the recognition model to find the best-matching word string
  Confidence measures
    A quantitative measure of how confident the recognizer is in the best-matching string
    VUI developers can use those measures in several ways
  N-Best processing
    A recognizer returns several results with a confidence measure for each
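One common way to use confidence measures with an N-best list is to accept a high-confidence result outright, ask the user to confirm a mid-confidence one, and reject the rest. The sketch below assumes confidences in the range 0 to 1; the class names and both thresholds are hypothetical and not taken from any particular engine.

```java
import java.util.Arrays;
import java.util.List;

final class NBestDemo {
    /** One recognition hypothesis with its confidence, assumed to lie in [0, 1]. */
    static final class Hypothesis {
        final String text;
        final double confidence;

        Hypothesis(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    enum Decision { ACCEPT, CONFIRM, REJECT }

    // Illustrative thresholds (assumptions); real values are tuned per application.
    static final double ACCEPT_AT = 0.80;
    static final double REJECT_BELOW = 0.40;

    // Returns the hypothesis with the highest confidence, or null for an empty list.
    static Hypothesis best(List<Hypothesis> nBest) {
        Hypothesis top = null;
        for (Hypothesis h : nBest) {
            if (top == null || h.confidence > top.confidence) {
                top = h;
            }
        }
        return top;
    }

    // Maps the top hypothesis to an action for the dialog manager.
    static Decision decide(List<Hypothesis> nBest) {
        Hypothesis top = best(nBest);
        if (top == null || top.confidence < REJECT_BELOW) {
            return Decision.REJECT;
        }
        return top.confidence >= ACCEPT_AT ? Decision.ACCEPT : Decision.CONFIRM;
    }

    public static void main(String[] args) {
        List<Hypothesis> nBest = Arrays.asList(
                new Hypothesis("ice cream", 0.62),
                new Hypothesis("I scream", 0.35));
        System.out.println(best(nBest).text + " -> " + decide(nBest));  // ice cream -> CONFIRM
    }
}
```

With a CONFIRM decision the application could prompt "Did you say ice cream?" and fall back to the next entry in the list if the user says no.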
Speech Recognition Engines
  Microsoft Visual Studio & CMU Sphinx
    Grammar
  Android
    Language model: free form for dictation, or web search for short phrases
  Google Web Speech API for Web Applications
BNF (Backus-Naur Form)
  Notation for context-free grammars
  Often used to describe the syntax of programming languages
  Can also specify the words and patterns of words that a speech recognizer listens for
  EBNF (Extended Backus-Naur Form)
  ABNF (Augmented Backus-Naur Form)
  Basis for speech grammar specifications
    ABNF for .Net
    Regular grammar for Java
Basics
  ::=   means "is defined as"
  |     means "or"
  < >   enclose a category name (a nonterminal)
  A terminal is a basic component
  <X> ::= a b c        a sequence
  <Y> ::= a | b | c    alternatives (choose one)
  <Z> ::= a | a <Z>    one or more (recursion)
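The three rules above can each be read as a small recognizer over a list of tokens. This is a minimal sketch meant only to show how sequence, alternation and recursion behave; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class BnfBasics {
    // Splits an utterance into whitespace-separated tokens.
    static List<String> tokens(String s) {
        return Arrays.asList(s.trim().split("\\s+"));
    }

    // <X> ::= a b c        exactly the sequence a, b, c
    static boolean matchesX(List<String> t) {
        return t.equals(Arrays.asList("a", "b", "c"));
    }

    // <Y> ::= a | b | c    exactly one token, chosen from the alternatives
    static boolean matchesY(List<String> t) {
        return t.size() == 1 && Arrays.asList("a", "b", "c").contains(t.get(0));
    }

    // <Z> ::= a | a <Z>    one "a", optionally followed by another <Z>
    static boolean matchesZ(List<String> t) {
        if (t.isEmpty() || !t.get(0).equals("a")) {
            return false;
        }
        return t.size() == 1 || matchesZ(t.subList(1, t.size()));
    }

    public static void main(String[] args) {
        System.out.println(matchesX(tokens("a b c")));  // true
        System.out.println(matchesY(tokens("b")));      // true
        System.out.println(matchesZ(tokens("a a a")));  // true
        System.out.println(matchesZ(tokens("a b")));    // false
    }
}
```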
An Example
  Grammar for a speech recognition calculator
  Reference: Grammar creation in C#
    https://msdn.microsoft.com/en-us/library/hh538495%28v=office.14%29.aspx
Speech to Text in C#

    using System;
    using System.Speech.Recognition;
    using System.Speech.Synthesis;
    using System.Threading;

    // Signaled by the "exit" command so that Main can shut down.
    static ManualResetEvent _completed = null;

    static void Main(string[] args) {
        _completed = new ManualResetEvent(false);
        SpeechRecognitionEngine _recognizer = new SpeechRecognitionEngine();

        // Load one single-phrase grammar per command.
        _recognizer.LoadGrammar(new Grammar(new GrammarBuilder("test")) { Name = "testGrammar" });
        _recognizer.LoadGrammar(new Grammar(new GrammarBuilder("exit")) { Name = "exitGrammar" });

        // Add event handlers for accepted and rejected input.
        _recognizer.SpeechRecognized += _recognizer_SpeechRecognized;
        _recognizer.SpeechRecognitionRejected += _recognizer_SpeechRecognitionRejected;

        _recognizer.SetInputToDefaultAudioDevice();
        _recognizer.RecognizeAsync(RecognizeMode.Multiple); // keep listening after each result
        …
        _completed.WaitOne();   // wait until speech recognition is completed
        _recognizer.Dispose();  // dispose the speech recognition engine
    }
Speech to Text in C# (Cont.)

    // Called each time the recognizer accepts an utterance.
    static void _recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e) {
        if (e.Result.Text == "test") {
            Console.WriteLine("The test was successful!");
        }
        else if (e.Result.Text == "exit") {
            _completed.Set();  // release the wait in Main
        }
    }

    // Called when the recognizer rejects the input; lists any candidate phrases.
    static void _recognizer_SpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e) {
        if (e.Result.Alternates.Count == 0) {
            Console.WriteLine("Speech rejected. No candidate phrases found.");
            return;
        }
        Console.WriteLine("Speech rejected. Did you mean:");
        foreach (RecognizedPhrase r in e.Result.Alternates) {
            Console.WriteLine(" " + r.Text);
        }
    }
Text to Speech in C#

    SpeechSynthesizer synthesizer = new SpeechSynthesizer();
    synthesizer.Speak("Now the computer is speaking to you.");
    ...
    synthesizer.Dispose(); // dispose the SpeechSynthesizer
References
  SpeechRecognitionEngine Class
    https://msdn.microsoft.com/en-us/library/system.speech.recognition.speechrecognitionengine%28v=vs.110%29.aspx?cs-save-lang=1&cs-lang=vb#code-snippet-1
  Speech recognition, speech to text, text to speech, and speech synthesis in C#
    http://www.codeproject.com/Articles/483347/Speech-recognition-speech-to-text-text-to-speech-a
Visual Studio Speech Recognizer
  Speech Recognition with Visual Studio Examples
    http://www.phon.ucl.ac.uk/courses/spsci/compmeth/speech/recognition.html
    http://blogs.msdn.com/b/devschool/archive/2012/02/06/speech-recognition-using-visual-studio-determining-the-bna.aspx
  Grammar Class
    http://msdn.microsoft.com/en-us/library/system.speech.recognition.grammar.aspx
  GrammarBuilder Class
    http://msdn.microsoft.com/en-us/library/system.speech.recognition.grammarbuilder.aspx
Speech Recognition for Java
  Sphinx 4
    A speech recognition engine written entirely in Java
    Created by CMU, Sun, Mitsubishi, HP, …
    Open source
    Compliant with JSpeech Grammar Format
    Platform- and vendor-independent
  Programmer's guide
    http://cmusphinx.sourceforge.net/sphinx4/
  An example
    https://www.assembla.com/code/sonido/subversion/nodes/4/sphinx4/src/apps/edu/cmu/sphinx/demo/helloworld
A Sample Grammar in Java

    #JSGF V1.0;
    public <workProgram> = <ask> <action> <program>;
    <ask> = please | could you;
    <action> = start | open | stop | close | kill | shut down;
    <program> = word | excel | out look | note pad;
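To see which utterances the sample grammar covers, its three rules can be hand-translated into a small matcher. This is a sketch for reasoning about coverage only; it does not use Sphinx 4 or any JSGF parser, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hand-written matcher for: <workProgram> = <ask> <action> <program>. */
final class WorkProgramGrammar {
    static final List<String> ASK = Arrays.asList("please", "could you");
    static final List<String> ACTION =
            Arrays.asList("start", "open", "stop", "close", "kill", "shut down");
    static final List<String> PROGRAM =
            Arrays.asList("word", "excel", "out look", "note pad");

    // The public rule is the three sub-rules in this order.
    static final List<List<String>> RULES = Arrays.asList(ASK, ACTION, PROGRAM);

    // Returns true if the whole utterance is a phrase of the grammar.
    static boolean accepts(String utterance) {
        String normalized = utterance.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        return matchFrom(normalized, 0);
    }

    // Tries each alternative of rule ruleIndex as a whole-word prefix of the remaining text.
    private static boolean matchFrom(String rest, int ruleIndex) {
        if (ruleIndex == RULES.size()) {
            return rest.isEmpty();
        }
        for (String alt : RULES.get(ruleIndex)) {
            String tail;
            if (rest.equals(alt)) {
                tail = "";
            } else if (rest.startsWith(alt + " ")) {
                tail = rest.substring(alt.length() + 1);
            } else {
                continue;
            }
            if (matchFrom(tail, ruleIndex + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts("Please start word"));            // true
        System.out.println(accepts("Could you shut down note pad")); // true
        System.out.println(accepts("Start word"));                   // false
    }
}
```

"Start word" is rejected because `<ask>` is required as written; JSGF marks optional parts with square brackets, so `[<ask>] <action> <program>` would also admit the bare command.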
Android Speech Recognition

    import android.app.Activity;
    import android.content.Intent;
    import android.os.Bundle;
    import android.speech.RecognizerIntent;
    import android.view.Menu;
    import android.view.View;
    import android.widget.Button;
    import android.widget.TextView;
    import android.widget.Toast;
    import java.util.ArrayList;
    import java.util.Locale;

    public class MainActivity extends Activity {
        private static final int VOICE_RECOGNITION = 1;
        Button speakButton;
        TextView spokenWords;

        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            setContentView(R.layout.activity_main);
            speakButton = (Button) findViewById(R.id.button1);
            spokenWords = (TextView) findViewById(R.id.textView1);
        }

        @Override
        public boolean onCreateOptionsMenu(Menu menu) {
            // Inflate the menu; this adds items to the action bar if it is present.
            getMenuInflater().inflate(R.menu.main, menu);
            return true;
        }

        // Starts the system speech recognition activity.
        public void btnSpeak(View view) {
            Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
            // Specify free form input
            intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                    RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
            intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Please start speaking");
            intent.putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 1);
            intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.ENGLISH);
            startActivityForResult(intent, VOICE_RECOGNITION);
        }

        // Receives the recognized strings when the recognition activity finishes.
        @Override
        protected void onActivityResult(int requestCode, int resultCode, Intent data) {
            if (requestCode == VOICE_RECOGNITION && resultCode == RESULT_OK) {
                ArrayList<String> results;
                results = data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
                // TODO Do something with the recognized voice strings
                if (results != null && !results.isEmpty()) {
                    Toast.makeText(this, results.get(0), Toast.LENGTH_SHORT).show();
                    spokenWords.setText(results.get(0));
                }
            }
            super.onActivityResult(requestCode, resultCode, data);
        }
    }
Android and Web Speech Recognition
  Android Voice Recognition Tutorial
    http://www.javacodegeeks.com/2012/08/android-voice-recognition-tutorial.html
    http://code4reference.com/2012/07/tutorial-android-voice-recognition/
  Google Web Speech Recognition Examples
    http://stiltsoft.com/blog/2013/05/google-chrome-how-to-use-the-web-speech-api/
    http://stackoverflow.com/questions/17635354/developing-a-simple-voice-driven-web-app-using-web-speech-api
    http://apprentice.craic.com/tutorials/37
Challenges for VUI Design
  People have very little patience for a "machine that does not understand"
  VUIs need to respond to input reliably, or they will be rejected by their users
  Designing a usable VUI requires interdisciplinary talents of computer science, linguistics and human factors
  The closer the VUI matches the user's mental model of the task, the easier it will be to use with little or no training, resulting in both higher efficiency and higher user satisfaction
Natural Language Understanding
  Ambiguity
    Refers to phrases that look distinct in print but sound similar when spoken, for example, “Wreck a nice beach” and “Recognize speech”
    As the vocabulary and grammar get larger, the potential for ambiguity increases
    Short words and phrases are harder to recognize than longer ones
Language Understanding (Cont.)
  Deviation
    Users deviate from what the developer expects
    For example, an issue with the question “Is that correct?”
      The developer expects a simple response like “Yes”, “No”, or “Correct”
      Southern speakers might respond with “Yes, ma’am” or “No, ma’am”
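One way to tolerate such deviation is to look for a yes/no word anywhere in the response instead of requiring an exact match. The sketch below is an assumption about how an application might do this, not a feature of any recognizer; the class name and word lists are hypothetical, and a deployed system would build its lists from real user data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class YesNoNormalizer {
    enum Answer { YES, NO, UNKNOWN }

    // Illustrative word lists (assumptions). "not" is treated as negative so that
    // "that's not correct" is read as NO.
    static final Set<String> YES_WORDS =
            new HashSet<>(Arrays.asList("yes", "yeah", "yep", "correct", "right"));
    static final Set<String> NO_WORDS =
            new HashSet<>(Arrays.asList("no", "nope", "not", "incorrect", "wrong"));

    // Decides on the first yes/no word found and ignores everything else, e.g. "ma'am".
    static Answer interpret(String utterance) {
        String cleaned = utterance.toLowerCase(Locale.ROOT).replaceAll("[^a-z' ]", " ").trim();
        for (String word : cleaned.split("\\s+")) {
            if (YES_WORDS.contains(word)) {
                return Answer.YES;
            }
            if (NO_WORDS.contains(word)) {
                return Answer.NO;
            }
        }
        return Answer.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(interpret("Yes, ma'am"));   // YES
        System.out.println(interpret("No, ma'am"));    // NO
        System.out.println(interpret("Correct"));      // YES
        System.out.println(interpret("maybe later"));  // UNKNOWN
    }
}
```

An UNKNOWN result is a natural point to re-prompt with the expected answers spelled out, for example "Please say yes or no."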
Discussion
  What would you expect if the user asks to start Microsoft Word?
    Please start word
    Could you start word
    Start word
    Please open word
    Could you open word
    Open word
Language Understanding (Cont.)
  Keyword Extraction
    Important for applications built with a speech recognizer that returns a string containing the actual words spoken by the user, leaving the application to interpret their semantic meaning
    One might say “Computer, find me some information about the flooding in Detroit recently”
    Keywords like “find”, “flooding”, and “Detroit” are crucial for an accurate response from the VUI
    Others are filler words
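A simple form of keyword extraction drops every word that appears in a filler list and keeps the rest. This is a minimal sketch under that assumption; the class name is hypothetical, and the filler list is hand-picked for the example sentence and is not a general stop-word list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class KeywordExtractor {
    // Filler words to discard (an illustrative assumption, sized for the example sentence).
    static final Set<String> FILLERS = new HashSet<>(Arrays.asList(
            "computer", "please", "me", "some", "information", "about",
            "the", "a", "an", "of", "in", "recently"));

    // Returns the non-filler words of the utterance, lowercased, in spoken order.
    static List<String> keywords(String utterance) {
        List<String> kept = new ArrayList<>();
        for (String word : utterance.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!word.isEmpty() && !FILLERS.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(keywords(
                "Computer, find me some information about the flooding in Detroit recently"));
        // [find, flooding, detroit]
    }
}
```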
Dialog Management
  Multimodality
    Interaction can occur through different mediums
    Need to consider when, and in which parts of the application, multimodal interaction is allowed
  Grammar
    There is a close relationship between what a prompt says and what the caller ends up saying to the system, especially the words used
  Configuration files
    You may choose the confidence level below which the recognizer will reject the input rather than return an answer
    You may also choose parameters for the endpointer, that is, how long it should listen before timing out
Dialog Management (Cont.)
  Error handling
    Allow the user to recover after errors and get the dialog back on track
    Recognition does not always succeed. When it fails, there are a number of messages the recognizer may return to the application.
  Voice recognition accuracy
    In-grammar data
    Out-of-grammar data
Error Handling
  In-grammar data
    Correct Accept: the recognizer returned the correct answer
    False Accept: the recognizer returned the wrong answer
    False Reject: the recognizer could not find a match and gave up
  Out-of-grammar data
    Correct Reject: the recognizer correctly rejected the input
    False Accept: the recognizer returned a value that is wrong because the input is not in the grammar
  How should each category be handled?
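When measuring accuracy on test recordings, each result can be sorted into one of these categories by comparing what was spoken with what the recognizer returned. The sketch below assumes a rejection is represented as `null`; the class and method names are hypothetical.

```java
final class RecognitionOutcome {
    enum Outcome { CORRECT_ACCEPT, FALSE_ACCEPT, FALSE_REJECT, CORRECT_REJECT }

    /**
     * Classifies one test utterance.
     *
     * @param inGrammar whether the spoken phrase is covered by the grammar
     * @param spoken    what the user actually said
     * @param returned  what the recognizer returned, or null if it rejected the input
     */
    static Outcome classify(boolean inGrammar, String spoken, String returned) {
        if (returned == null) {
            // Rejecting is right only when the input was outside the grammar.
            return inGrammar ? Outcome.FALSE_REJECT : Outcome.CORRECT_REJECT;
        }
        if (inGrammar && returned.equals(spoken)) {
            return Outcome.CORRECT_ACCEPT;
        }
        // Wrong answer for in-grammar input, or any answer for out-of-grammar input.
        return Outcome.FALSE_ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, "open word", "open word"));    // CORRECT_ACCEPT
        System.out.println(classify(true, "open word", "open excel"));   // FALSE_ACCEPT
        System.out.println(classify(true, "open word", null));           // FALSE_REJECT
        System.out.println(classify(false, "play music", null));         // CORRECT_REJECT
        System.out.println(classify(false, "play music", "open word"));  // FALSE_ACCEPT
    }
}
```

Counting outcomes per category over a test set gives a starting point for choosing the rejection threshold: raising it trades false accepts for false rejects.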
Error Handling in Android