S Legrand
Snack for Ruby
Talk Objectives
Tour of API
Learn the walk and talk
Have Fun
Snack
The Snack library is a tool to aid in learning about sound, voice, and ASR, and is hopefully a fun way to experiment
Snack is a Tcl-based API
Snack has been adapted to Python and included in the Standard Python Distribution
Snack
Snack is Swedish for “talk” or “chat”
Kåre Sjölander is the principal investigator for Tcl-based Snack
Tcl Snack is available at http://www.speech.kth.se/snack/
Snack for Ruby
rbSnack is a Ruby wrapper around Tcl Snack
rbSnack has additional Ruby-based utilities
rbSnack has HTML-based help (rdoc+rbTeX)
rbSnack can be found at http://rbsnack.sourceforge.net/
Snack Toolkit Includes
Recording, Playback
Waveform display
Spectrogram: Fourier, LPC
Formant analysis
Power analysis
Filters
(will demo)
The Speech Signal
Continuous speech is discretely sampled
The signal consists of rapidly changing data points
The display of the sampled signal is called the waveform
Snack can display the waveform in real time
Analysis uses frames
The signal is broken into frames
Frames may overlap
Characteristics of the signal are analyzed using Fourier and LPC analysis on a per-frame basis
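The framing step can be sketched in plain Ruby. This is a minimal illustration, not rbSnack code: the method names `frames` and `energy` and the parameters `frame_size` and `hop` are hypothetical, and the energy measure (sum of squared samples) is one common choice assumed here.

```ruby
# Split a sampled signal into fixed-length frames.
# hop < frame_size makes consecutive frames overlap.
# (frame_size and hop are hypothetical names, not rbSnack parameters.)
def frames(signal, frame_size, hop)
  return [] if signal.length < frame_size
  (0..signal.length - frame_size).step(hop).map { |start| signal[start, frame_size] }
end

# Short-time energy of one frame, assumed here to be the sum of squared samples.
def energy(frame)
  frame.sum { |x| x * x }
end

signal = [0, 1, 2, 3, 4, 5, 6, 7]
frames(signal, 4, 2).each do |frame|
  puts "#{frame.inspect} energy=#{energy(frame)}"
end
```

With a frame size of 4 and a hop of 2, each frame shares half of its samples with the next one. Any trailing samples that do not fill a whole frame are dropped in this sketch.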
Going in Circles
Complex numbers are just a funny way of multiplying: multiplying two complex numbers adds their angles.
Euler's formula: e^(iθ) = cos θ + i sin θ
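Ruby's built-in Complex class is enough to check both ideas numerically. The angles below are arbitrary values chosen for illustration.

```ruby
# Euler's formula: the point at angle theta on the unit circle
# is cos(theta) + i*sin(theta), which Complex.polar(1.0, theta) also builds.
theta = Math::PI / 6
euler = Complex(Math.cos(theta), Math.sin(theta))
polar = Complex.polar(1.0, theta)
puts((euler - polar).abs < 1e-12)

# Multiplying two unit complex numbers adds their angles.
a = Complex.polar(1.0, Math::PI / 6)   # 30 degrees
b = Complex.polar(1.0, Math::PI / 3)   # 60 degrees
product = a * b
puts product.arg          # close to Math::PI / 2
puts a.arg + b.arg        # also close to Math::PI / 2
```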
Fourier Analysis
The Fourier matrix is a unitary matrix
Multiplication by the Fourier matrix returns the frequency components of the signal, called the Fourier coefficients
Easy to compute the inverse, called the Fourier inverse
The Fourier Matrix Looks Like
Spinning disks
Multiplication by signal produces Fourier coefficients (frequency components)
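A small sketch of the Fourier matrix in plain Ruby follows. The slide's own matrix is not reproduced in this transcript, so the sign convention (negative exponent) and the 1/sqrt(N) scaling that makes the matrix unitary are assumptions, and the method names are hypothetical. A real application would use an FFT rather than an explicit matrix.

```ruby
# N x N Fourier matrix: entry (j, k) is e^(-2*pi*i*j*k/N) / sqrt(N).
# Each row is a "spinning disk" turning at a different rate.
def fourier_matrix(n)
  scale = 1.0 / Math.sqrt(n)
  Array.new(n) do |j|
    Array.new(n) { |k| Complex.polar(scale, -2.0 * Math::PI * j * k / n) }
  end
end

# For a unitary matrix the inverse is the conjugate transpose.
def inverse_fourier_matrix(n)
  fourier_matrix(n).transpose.map { |row| row.map(&:conjugate) }
end

# Matrix-vector product.
def apply(matrix, vector)
  matrix.map { |row| row.zip(vector).sum(Complex(0, 0)) { |w, x| w * x } }
end

signal = [1.0, 0.0, -1.0, 0.0]                 # one cosine cycle over 4 samples
coefficients = apply(fourier_matrix(4), signal)
restored = apply(inverse_fourier_matrix(4), coefficients)
puts coefficients.map { |c| c.abs.round(6) }.inspect
puts restored.map { |c| c.real.round(6) }.inspect
```

The nonzero coefficients sit at the frequency of the cosine (and its mirror image), and applying the inverse matrix recovers the original samples up to rounding.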
Examining Fourier components
A spectrogram gives a picture of the Fourier components (coefficients) as they evolve over time
Snack can display it in real time
It looks like an X-ray
Bands of high activity correspond to formants
Linear Filters
Useful to understand the nature of speech signals
Generators: generate square waves, sine waves, sawtooth waves, etc.
Composers: compose several filters
FIR: Finite impulse response
IIR: Infinite impulse response
FIR Filter
Determined completely by its response to a unit impulse
Response is finite in duration
y(t) = b0 x(t) + b1 x(t-1) + b2 x(t-2) + … + bn x(t-n)
(We will demo FIR using rbSnack)
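The FIR equation can be written directly in plain Ruby. This sketch does not use the rbSnack filter classes; the method name `fir` is hypothetical, and samples before t = 0 are assumed to be zero.

```ruby
# y(t) = b0 x(t) + b1 x(t-1) + ... + bn x(t-n)
# b holds the coefficients b0..bn; x is the input signal.
# Inputs before the start of the signal are treated as 0.
def fir(b, x)
  x.each_index.map do |t|
    b.each_with_index.sum(0.0) { |coeff, k| t - k >= 0 ? coeff * x[t - k] : 0.0 }
  end
end

impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
puts fir([0.5, 0.3, 0.2], impulse).inspect
```

Feeding in a unit impulse returns the coefficients themselves followed by zeros, which is why the filter is determined completely by its impulse response and why that response is finite.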
IIR Filter
Also called a recursive filter
Response is infinite in duration
y(t) = b0 x(t) + b1 x(t-1) + b2 x(t-2) + … + bn x(t-n) + a1 y(t-1) + a2 y(t-2) + … + an y(t-n)
(We will demo IIR using rbSnack)
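The recursive form can be sketched the same way in plain Ruby. Again this is not the rbSnack filter API; the method name `iir` is hypothetical, and both inputs and outputs before t = 0 are assumed to be zero.

```ruby
# y(t) = b0 x(t) + ... + bn x(t-n) + a1 y(t-1) + ... + an y(t-n)
# b holds b0..bn; a holds a1..an, so a[0] multiplies y(t-1).
def iir(b, a, x)
  y = []
  x.each_index do |t|
    feedforward = b.each_with_index.sum(0.0) do |coeff, k|
      t - k >= 0 ? coeff * x[t - k] : 0.0
    end
    feedback = a.each_with_index.sum(0.0) do |coeff, k|
      t - k - 1 >= 0 ? coeff * y[t - k - 1] : 0.0
    end
    y << feedforward + feedback
  end
  y
end

impulse = [1.0, 0.0, 0.0, 0.0]
puts iir([1.0], [0.5], impulse).inspect
```

With a single feedback coefficient of 0.5, the impulse response halves at every step and never reaches exactly zero, which is the sense in which the response is infinite in duration.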
Linear Predictive Analysis
Analogous to Fourier analysis
Assumption: for each frame, the signal is predicted by
y(t) = a1 y(t-1) + a2 y(t-2) + … + ap y(t-p)
The LPC coefficients are the best least-squares approximation
Can also be used to predict formants
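One standard way to find least-squares LPC coefficients for a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The slides do not say which method Snack or rbSnack uses, so the plain-Ruby sketch below is an assumption, and the method names are hypothetical. It expects a frame that is not all zeros.

```ruby
# Autocorrelation of a frame for lags 0..max_lag.
def autocorrelation(frame, max_lag)
  (0..max_lag).map do |lag|
    (lag...frame.length).sum(0.0) { |t| frame[t] * frame[t - lag] }
  end
end

# Levinson-Durbin recursion.
# Returns [a1, ..., ap] such that y(t) is approximated by
# a1 y(t-1) + a2 y(t-2) + ... + ap y(t-p).
def lpc(frame, order)
  r = autocorrelation(frame, order)
  a = []          # a[k - 1] multiplies y(t-k)
  error = r[0]    # prediction error energy; zero for a silent frame
  (1..order).each do |i|
    acc = r[i] - (1...i).sum(0.0) { |k| a[k - 1] * r[i - k] }
    reflection = acc / error
    a = (1...i).map { |k| a[k - 1] - reflection * a[i - k - 1] } + [reflection]
    error *= 1.0 - reflection * reflection
  end
  a
end

# A decaying signal that obeys y(t) = 0.9 y(t-1) exactly.
frame = (0...200).map { |t| 0.9**t }
puts lpc(frame, 2).map { |c| c.round(4) }.inspect
```

For this test signal the first coefficient comes out close to 0.9 and the second close to 0, matching the rule that generated the frame.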
What is Sound? What is Speech?
Sound is the signal created by longitudinal waves in some medium like air
Sound waves are continuous
Can be decomposed into a linear combination of sine waves
Speech is a special noise made by humans
It’s Just Tubing…
The simplest model of speech is to consider the lungs and trachea as one long tube
Resonance frequencies are called formants
F1 F2
Some Speech Recognition Features
Formants
Pitch
Voiced/Unvoiced
Nasality
Frication
Energy
Our current work only uses Formants and Energy
Basic Utterances
A basic unit of speech is called a phone
Vowels are utterances with constant formants
A diphthong is the transition from one vowel to another
Vowels and diphthongs are essentially characterized by the first and second formants
Other Phones: The Consonants
Plosive: closure in oral cavity /p/
Nasal: closure of nasal cavity /m/
Fricative: turbulent airstream noise /s/
Retroflex liquid: vowel-like, tongue high and curled back /r/
Lateral liquid: vowel-like, tongue central, side airstream /l/
Glide: vowel-like /y/
Some Problems with Speech Signals
Segmentation: when does a word begin and end? (Noise?)
Wetware: speaker's internal configuration, lip smacks, breathing, etc.
SegmentationWorkshop demos one approach.
Code Books
A code book consists of code words
The idea is to search through the code book to find the code word that best matches the feature sequence
rbSnack uses the code book approach in word recognition
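The search step can be sketched in plain Ruby as a nearest-neighbor lookup. This is not rbSnack's implementation: it matches a single feature vector rather than a whole sequence, it assumes Euclidean distance, and the labels and formant values (F1, F2 in Hz) are illustrative numbers made up for the example.

```ruby
# Euclidean distance between two feature vectors of equal length.
def distance(u, v)
  Math.sqrt(u.zip(v).sum { |p, q| (p - q)**2 })
end

# Return the label of the code word closest to the observed features.
def nearest_code_word(codebook, features)
  codebook.min_by { |_label, code_word| distance(code_word, features) }.first
end

# Hypothetical code book: label => [F1, F2].
codebook = {
  "ee" => [270.0, 2290.0],
  "ah" => [730.0, 1090.0],
  "oo" => [300.0, 870.0]
}

puts nearest_code_word(codebook, [700.0, 1100.0])
```

The lookup visits every code word, which is easy to implement and workable for small vocabularies, but the cost grows with the size of the code book.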
Code Book Approach
++ Easy to implement
+ Good for isolated words
+- Works best on small vocabularies
-- Is insensitive to context, prone to errors
Code Book Approach
WhichWay is a simple demo of this approach
More Problems with Speech Signals
Accent: Southern vs. New England vs. California Valley vs. other
Variation in rate of speech makes it hard to compare words
Dynamic Time Warping
A pattern comparison technique
A way of stretching or compressing one sequence to match another
Evaluated using dynamic programming
Dynamic Programming
Form a grid, with start at lower left and end at upper right
Label each node with the difference (error) between pattern 1 at time i and pattern 2 at time j
Find the minimal distance from start to end using dynamic programming
Dynamic Programming
A possible path
Basic Assumption:
If best path P(S,E) passes through node N, then P(S,E) is the concatenation of P(S,N) (best from S to N) and P(N,E) (best from N to E)
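The grid and the best-path assumption can be sketched in plain Ruby. This is a minimal dynamic time warping example, not the rbSnack code: it compares sequences of single numbers, uses the absolute difference as the node error, and allows only the three simplest moves (one step in either pattern, or one step in both). Both sequences are assumed to be non-empty.

```ruby
# cost[i][j] is the smallest total error of any path from the start node
# (0, 0) to node (i, j). Because a best path through a node is made of a
# best path into that node, each cell only needs its three predecessors.
def dtw(a, b)
  inf = Float::INFINITY
  cost = Array.new(a.length) { Array.new(b.length, inf) }
  a.each_index do |i|
    b.each_index do |j|
      local = (a[i] - b[j]).abs
      best_previous =
        if i.zero? && j.zero?
          0.0
        else
          [
            i > 0 ? cost[i - 1][j] : inf,
            j > 0 ? cost[i][j - 1] : inf,
            i > 0 && j > 0 ? cost[i - 1][j - 1] : inf
          ].min
        end
      cost[i][j] = local + best_previous
    end
  end
  cost[a.length - 1][b.length - 1]
end

puts dtw([1, 2, 3], [1, 2, 2, 3])   # same shape, spoken more slowly
puts dtw([1, 2, 3], [1, 2, 4])      # differs in the last value
```

The first pair differs only in timing, so stretching one sequence gives a total error of zero; the second pair keeps an error of 1 no matter how it is aligned.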
Dynamic Programming
RbSnack includes examples for various time alignment approaches
[Figures: time alignment path types: Type I, Type III, Itakura, Type IV]
Hidden Markov Models
Sometimes the second (or third) best match is the right word
Use HMMs to ascertain the correct word in the context of the sentence (ditto for phones within a word)
HMMs are similar to non-deterministic finite state machines, except that they also have non-deterministic output
Hidden Markov Models
Dynamic programming is used to compute weights
HMMs look like
[Figure: state diagram with states 1 to 4, transition probabilities .4, .2, .4, and output probabilities P(/i/)=.5 P(/a/)=.2 P(/o/)=.3]
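To make the dynamic programming step concrete, here is a plain-Ruby sketch of the Viterbi algorithm, which finds the most likely state sequence for a series of observed phones. The two-state model is hypothetical: only the output probabilities of state :s1 are taken from the slide, and every other number is made up for illustration. It needs Ruby 2.6 or later for `to_h` with a block.

```ruby
# best[s] = [probability of the best path ending in state s, that path]
def viterbi(observations, states, start_p, trans_p, emit_p)
  best = states.to_h do |s|
    [s, [start_p[s] * emit_p[s][observations.first], [s]]]
  end
  observations.drop(1).each do |obs|
    best = states.to_h do |s|
      candidates = states.map do |prev|
        [best[prev][0] * trans_p[prev][s], best[prev][1]]
      end
      prob, path = candidates.max_by(&:first)
      [s, [prob * emit_p[s][obs], path + [s]]]
    end
  end
  best.values.max_by(&:first)
end

states  = [:s1, :s2]
start_p = { s1: 0.6, s2: 0.4 }
trans_p = { s1: { s1: 0.6, s2: 0.4 }, s2: { s1: 0.2, s2: 0.8 } }
emit_p  = {
  s1: { "i" => 0.5, "a" => 0.2, "o" => 0.3 },   # from the slide
  s2: { "i" => 0.1, "a" => 0.6, "o" => 0.3 }    # hypothetical
}

probability, path = viterbi(["i", "a", "a"], states, start_p, trans_p, emit_p)
puts path.inspect
puts probability
```

The states themselves are never observed, only the phones they emit, which is what makes the model "hidden". Real recognizers work with log probabilities to avoid underflow on long sequences.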
Possible Future Directions
Examine other features (pitch?)
Incorporate other libraries (do the computationally hard work in C)
Add more signal processing routines
Add more examples
Use Hidden Markov Models
Lessons Learned/to be learned
Document everything
Nothing's perfect
Automate everything
Project is never done
What’s next?
Try it out.