Distinctive Feature Detection For Automatic Speech Recognition
Jun Hou
Prof. Lawrence Rabiner
Dr. Sorin Dusan
CAIP, ECE Dept., Rutgers University
Sep. 13, 2004
Outline
  The history of Automatic Speech Recognition
  Current Feature Detection Technologies
  ASAT – Automatic Speech Attribute Transcription
  Distinctive Feature Detection, as a part of ASAT
  Proposed Work schedule
The Evolution of Speech Recognition
Data-driven (1980’s, 1990’s and 2000’s) vs. knowledge-driven (1960’s, 1970’s)
Figure 1 S-curve limits ASR technology advances (C.-H. Lee)
The gap between Human Speech Recognition (HSR) and Automatic Speech Recognition (ASR) is still very large
Is HMM the end of the line? Or is there somewhere else to go?
Problems with Signals To Be Recognized
No two utterances of the same linguistic content are ever the same (often they are not even close in their waveforms or spectral characteristics)
  Speaker variation
  Speaking style
  Background environment
  etc.
Statistical Methods
Figure 2 State-of-the-art HMM-based systems (C.-H. Lee)
Typical approaches: HMM and ANN
Statistical Methods
  Top-down approach. Higher level knowledge guides the processing primarily at the lower levels.
  Incremental discrimination to get refined results (e.g., better stop consonant discrimination)
  Utterance verification – Confidence measures to approximately estimate the reliability of the result, often on a word-by-word basis
  Errors are inevitable, mainly when the measured features fall into the overlapped region of the different pdfs
  Data driven => Sensitive to training data, both the amount and type
  Robustness problem – Sensitive to speaking environment and transmission characteristics of the medium
  No explicit use of acoustic or phonetic knowledge
  No clear calculation of the required size of the training data set
  High computational cost when the size of statistical patterns is large
HMM Issues
  Sequential model
  Assumes frame independence – blindly treats frames with equal importance; more or less okay when using cepstral features
  No higher level (linguistic) knowledge used in acoustic modeling
  Etc.
Figure 3 HMM diagram (frames at times t_{i-1}, t_i, t_{i+1}, t_{i+2})
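For reference, the frame-independence assumption mentioned above can be written out. This is the standard HMM factorization rather than anything stated on the slides; the symbols (observations o_t, states q_t, initial probabilities \pi, transition probabilities a, emission densities b, model \lambda) are notation assumed here.

```latex
P(o_1,\dots,o_T,\; q_1,\dots,q_T \mid \lambda)
  = \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)
```

Each frame o_t contributes exactly one emission factor that depends only on the current state, which is why every frame carries the same weight regardless of how informative it is.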
ANN Issues
  No meaningful representation of the internal nodes
  Lots of uncertainty as to what processing is happening
  Computationally expensive
  Hard to train; virtually impossible to guarantee convergence at true minimum solution
  Etc.
Figure 4 ANN diagram (input layer, hidden layer(s), output layer)
Knowledge Based Methods
  Bottom-up approach. Uses acoustic-phonetic knowledge at all levels of processing.
  Temporal features are critical in discriminating some speech sounds, e.g., VOT in stop detection
  Spectral features are critical in discriminating other speech sounds, e.g., fricatives from spectral energy concentrations
  Learn information in temporal and spectral domains using both static and dynamic features
Problems with Knowledge-Based Methods
  The knowledge of the acoustic properties of phonetic units is not complete. Hard to cover all the rules.
  The knowledge of phonetic properties of acoustic units is not complete.
  Pronunciation models explain the formation of waveforms from vocal tract shapes, but no clear reverse knowledge exists.
  The choice of features is not optimal in a well defined and meaningful sense.
  The design of sound classifiers is not optimal. No well-defined automatic tuning methods exist.
Feature Extraction – Ali et al.
Feature Extraction (Jakobson)
1. Total energy
2. Spectral Center of Gravity (SCG)
3. Duration
4. Low, medium and high frequency energy
5. Formant transitions
6. Silence detection
7. Voicing detection
8. Rate of change of energy in various frequency bands
9. Rate of change of SCG
10. Most prominent peak frequency
11. Rate of change of the most prominent peak frequency
12. Zero-crossing rate
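Three of the listed measurements (total energy, spectral center of gravity, zero-crossing rate) are simple enough to sketch for a single frame. The code below is only an illustration under assumptions of ours: plain-Python definitions, a direct DFT, and a synthetic sine frame. It is not the auditory-based front end described in the slides, and all function names are hypothetical.

```python
import math

def total_energy(frame):
    """Sum of squared sample values in the frame."""
    return sum(x * x for x in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def power_spectrum(frame):
    """Power of DFT bins 0..N/2, computed directly (O(N^2), fine for a demo)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        spectrum.append(re * re + im * im)
    return spectrum

def spectral_center_of_gravity(frame, sample_rate):
    """Power-weighted mean frequency of the frame, in Hz."""
    power = power_spectrum(frame)
    n = len(frame)
    freqs = [k * sample_rate / n for k in range(len(power))]
    total = sum(power)
    if total == 0:
        return 0.0
    return sum(f * p for f, p in zip(freqs, power)) / total

# Synthetic test frame (an assumption for the demo): a 1 kHz sine
# sampled at 8 kHz, 64 samples, i.e. exactly 8 periods.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(64)]

energy = total_energy(frame)
zcr = zero_crossing_rate(frame)
scg = spectral_center_of_gravity(frame, SAMPLE_RATE)
```

For this frame the center of gravity sits at the sine frequency, and the rate-of-change features in the list (items 8, 9, 11) would follow by differencing these values across successive frames.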
Auditory-Based Front End Processing
Feature Extraction
  Utterance Segmentation (silence, obstruents, sonorants)
  Fine Utterance Classification into Four categories
    Sonorants – fine identification
    Stops – voiced and unvoiced
    Fricatives – voiced and unvoiced
    Silence
  Excellent performance for stops and fricatives
Feature Extraction
Figure 5 Block diagram of the System
Figure 6 Block diagram of the front-end
Feature Extraction
Fricative classification
  Voicing detection
    DUP – The Duration of the Unvoiced Portion
  Place of articulation detection
    MDP – The Most Dominant Peak from the synchrony detector
    MNSS – The Maximum Normalized Spectral Slope
    SCG – The Spectral Center of Gravity
    MDSS – The Most Dominant Spectral Slope
    DRHF – The Dominant Relative to the Highest Filters
Feature Extraction
Stop detection
  Voicing detection
    Prevoicing
    VOT
    Closure duration
  Place of articulation detection
    BF – Burst Frequency
    The second formant of the following vowel
    MNSS
    DRHF, LINP (most prominent peak of the synchrony response after being laterally inhibited by the higher 10 filters)
    Formant transitions before and after the stop
    The voicing decision
Landmark Detection
Landmark Detection – Juneja, et al., PhD Thesis Proposal
  Manner landmarks are used, whereas place and voicing are extracted using the locations provided by the manner landmarks
  Three steps:
    Location of manner landmarks
    Analysis of landmarks for place and voicing phonetic features
    Matching phonetic features to features of words or sentence representations
  Two manner landmarks
    Defined by abrupt change, e.g., burst landmark for stop consonants, vowel onset point
    Defined by the most prominent manifestation of a manner phonetic feature, e.g., a point of maximum low energy in a vowel
Landmark Detection
Recognition of 5 broad classes
  Vowel
  Stop
  Fricative
  Sonorant consonant
  Silence
Table 1 Broad manner classification of English phonemes
Use Support Vector Machines (SVM) to segment TIMIT data into binary classes
Results of 2 different feature organizations are reported:
Parallel – discriminate each feature against all other features
Hierarchical – distinguish the features using a probabilistic hierarchy
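The two organizations can be contrasted with a small sketch. Everything below is assumed for illustration: the detector outputs are made-up numbers standing in for SVM outputs, and the particular tree (speech, sonorant, syllabic, continuant) is one plausible hierarchy over the five broad classes, not necessarily the one used in the cited work.

```python
def parallel_decision(scores):
    """One-vs-rest: each class has its own binary detector; pick the top score."""
    return max(scores, key=scores.get)

def hierarchical_posteriors(p_speech, p_sonorant, p_syllabic, p_continuant):
    """Multiply binary-decision probabilities along each root-to-leaf path.

    Assumed tree: speech? -> sonorant? -> syllabic? (vowel / sonorant consonant)
                                      -> continuant? (fricative / stop)
    """
    return {
        "silence": 1.0 - p_speech,
        "vowel": p_speech * p_sonorant * p_syllabic,
        "sonorant consonant": p_speech * p_sonorant * (1.0 - p_syllabic),
        "fricative": p_speech * (1.0 - p_sonorant) * p_continuant,
        "stop": p_speech * (1.0 - p_sonorant) * (1.0 - p_continuant),
    }

# Hypothetical outputs for one frame.
parallel_scores = {
    "silence": -0.8,
    "vowel": -0.3,
    "sonorant consonant": -0.5,
    "fricative": 0.6,
    "stop": 0.1,
}
parallel_choice = parallel_decision(parallel_scores)

posteriors = hierarchical_posteriors(
    p_speech=0.9, p_sonorant=0.2, p_syllabic=0.5, p_continuant=0.7
)
hierarchical_choice = max(posteriors, key=posteriors.get)
```

In the hierarchical form every leaf probability is scaled by the decisions above it, so a poor estimate near the root affects all classes beneath it.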
Landmark Detection
Table 2 Landmarks extracted for each of the manner classes and knowledge based acoustic measurements
Landmark Detection
Table 3 Acoustic Parameters used in broad class segmentation
Landmark Detection
Compare the organizations of SVMs
Figure 7 Parallel SVM organization
Figure 8 Hierarchical SVM organization
Landmark Detection
Compare classification results
Table 4 Results of parallel SVM organization
Table 5 Results of hierarchical SVM organization
Landmark Detection
Discussion
1. Combine landmarks with acoustic parameters
2. The gap between correctness and accuracy is due to the insertions mainly of sonorant consonants and stops
3. Performance gap between hierarchical SVM and parallel SVM architectures is due to ??? – possibly: wrong classification in the upper level in the hierarchical architecture causes error propagation to the lower level
4. Isolated or connected word recognition
  Use Finite State Automata (FSA) to constrain the segmentation paths
  Doesn’t allow the use of a probabilistic language model
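Point 2 above attributes the gap between correctness and accuracy to insertions. A short sketch makes the arithmetic explicit, assuming the usual HTK-style definitions (the slides do not define the two scores) and entirely hypothetical error counts.

```python
def percent_correct(n_ref, deletions, substitutions):
    """Percent correct: ignores insertions."""
    return 100.0 * (n_ref - deletions - substitutions) / n_ref

def percent_accuracy(n_ref, deletions, substitutions, insertions):
    """Accuracy: additionally penalizes insertions."""
    return 100.0 * (n_ref - deletions - substitutions - insertions) / n_ref

# Hypothetical counts, not results from the slides.
N_REF, DELETIONS, SUBSTITUTIONS, INSERTIONS = 1000, 50, 100, 80

correct = percent_correct(N_REF, DELETIONS, SUBSTITUTIONS)
accuracy = percent_accuracy(N_REF, DELETIONS, SUBSTITUTIONS, INSERTIONS)
gap = correct - accuracy  # equals 100 * insertions / n_ref
```

Under these definitions the gap depends only on the insertion count, which is consistent with the observation that inserted sonorant consonants and stops account for the difference.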
Landmark Detection – ANN
Benoit Launay, et al.
  Train Artificial Neural Network to map short-term spectral features to the posterior probability of some distinctive features
  Feed features into HMM
ASAT – Automatic Speech Attribute Transcription
Knowledge-based, data driven approach
Figure 9 Bottom-up ASAT based on speech attribute detection, event merging and evidence verification (C.-H. Lee)
NEW!
Distinctive Feature Detection
Figure 10 Distinctive Feature Detection (Speech Signal -> Attribute Detectors 1 to M -> Feature Detectors 1 to N -> Features 1 to N; attributes combination: linear, ANN, K-L, etc.)
Questions annotated on the diagram:
1. What attributes?
2. How to measure them?
3. What features?
4. How to combine the attributes to form features?
5. What outputs?
6. How to compute them?
Attributes and Features in ASAT – Issues to be Resolved
  Q1: What attributes?
  Q2: How to measure them?
  Q3: What features?
  Q4: How to combine the attributes to form features?
  Q5: What outputs?
  Q6: How to compute the outputs?
  Q7: Why use them?
Q1: What attributes?
Different set of attributes for each feature
MFCC and their derivatives, Energy in specific spectral ranges, Zero Crossing Rate, Formant Frequency, ratio of spectral peaks, etc.
VOT, energy onset, energy offset, etc.
Refer to those attributes in Ali’s paper
Find other indicative attributes in spectral graph, cepstral graph, etc.
Find other significant characteristics in waveforms
Find characteristics inside/between the time and frequency domains
Q2: How to measure them?
Observe and analyze the speech signal in both time and frequency domain, e.g., filter bank analysis
Data mining of meaningful “patterns”
Enhance distinctive attributes, eliminate confusing attributes – better ways to measure things
Find the relations of attributes inside a frame, e.g., between prominent attributes, weak attributes.
Experiments needed to find distinguishing attributes for each acoustic feature
Calculate correlation between attributes in succeeding frames
Calculate information redundancy for different attributes
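The correlation between attributes in succeeding frames could be estimated along the following lines. This is a sketch under assumptions: Pearson correlation is our choice of measure (the slides do not name one), and the attribute tracks are toy values rather than measurements.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant track
    return cov / math.sqrt(var_x * var_y)

def lagged_correlation(track_a, track_b, lag=1):
    """Correlate attribute A at frame t with attribute B at frame t + lag.

    Both tracks must have the same number of frames.
    """
    return pearson(track_a[: len(track_a) - lag], track_b[lag:])

# Toy per-frame attribute tracks (hypothetical values).
energy_track = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
alternating_track = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

r_energy = lagged_correlation(energy_track, energy_track, lag=1)
r_alternating = lagged_correlation(alternating_track, alternating_track, lag=1)
```

Passing two different tracks gives the cross-attribute case; values near +1 or -1 suggest that one of the two measurements adds little new information.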
Q2: How to measure them?
Topology of attribute organization
  Parallel Organization – ASAT Organization
  Graph Organization
    Hierarchical – Juneja et al. (features)
    Eliminate redundancy in computation
    One attribute may trigger the test of existence of other attributes
  Combined organization, i.e., sequential and graph methods combined
Q3: What features?
Features available in current acoustic-phonetic area: binary distinctive features
Distinctive features are related to:
  Voicing
    Whether the vocal folds vibrate or not
  Place of articulation
    The particular articulator that is used (glottis, soft palate, lips, etc.)
  Manner of articulation
    How that articulator is used to produce the sound
Q3: What features?
Initial list of twelve pairs of distinctive features
1. Vocalic/non-vocalic
2. Consonantal/non-consonantal
3. Interrupted/continuant
4. Checked/unchecked
5. Strident/mellow
6. Voiced/unvoiced
7. Compact/diffuse
8. Grave/acute
9. Flat/plain
10. Sharp/plain
11. Tense/lax
12. Nasal/oral
English is characterized by 9 pairs of these features
Q3: What features?
Need to detect all relevant features to perform automatic speech recognition at the phonetic level
Acoustic-phonetic features are intuitively plausible, but there might exist other good features obtained from data mining and/or clustering techniques
We can optimize (how we do it is unclear) and obtain the minimum necessary set of speech distinctive features
May use attributes directly and together with features when calculating the outputs from the detectors
Q4: How to compute or estimate the features?
Develop combination methods and optimize them to get better combination of attributes to form meaningful features, and select the best features for phonemes and possibly larger acoustic units
Possible combination algorithms:
  Linearly weighted average
  ANN
  K-L
  Fuzzy integral seems promising, compared with ANN (cf. Chang & Greenberg’s paper)
Prominent attributes characterize features. The existence of some particular attributes may help to further define the feature or features.
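The simplest of the combination algorithms listed above, the linearly weighted average, can be sketched as follows. The attribute names, scores, weights and the 0.5 threshold are all hypothetical; in practice the weights would have to be tuned or learned.

```python
def weighted_average(attribute_scores, weights):
    """Linearly weighted average of attribute detector scores."""
    if len(attribute_scores) != len(weights):
        raise ValueError("one weight is needed per attribute score")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * a for w, a in zip(weights, attribute_scores)) / total_weight

def detect_feature(attribute_scores, weights, threshold=0.5):
    """Binary feature decision: present if the combined score reaches the threshold."""
    return weighted_average(attribute_scores, weights) >= threshold

# Hypothetical attribute scores in [0, 1] for a "voiced" feature detector:
# low-frequency energy, periodicity, and an inverted zero-crossing-rate score.
scores = [0.9, 0.8, 0.3]
weights = [2.0, 2.0, 1.0]

voiced_score = weighted_average(scores, weights)
is_voiced = detect_feature(scores, weights)
```

Giving the prominent attributes larger weights is one way to let them characterize the feature, with weaker attributes only nudging the combined score.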
Q5: What outputs?
Study the acoustic-phonetic theories and establish models that best describe the production of sound signals
Study each acoustic class and find their differences and relations
Modified features? Phonemes? Phoneme-like units?
Q6: How to compute the outputs?
Study acoustical variation during pronunciation, find common characteristics and distinguishing characteristics for acoustic-phonetic variations
Score the outputs of the feature detectors using probabilities or likelihood measures of the presence of these distinctive features
Other methods???
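One way to score a feature detector with likelihood measures is a log-likelihood ratio between "feature present" and "feature absent" models. The sketch below assumes one-dimensional Gaussian models with made-up parameters and an equal prior; none of this is specified in the slides.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def log_likelihood_ratio(x, present, absent):
    """log p(x | present) - log p(x | absent); each model is a (mean, std) pair."""
    return gaussian_log_pdf(x, *present) - gaussian_log_pdf(x, *absent)

def posterior_present(x, present, absent, prior=0.5):
    """Posterior probability that the feature is present, by Bayes' rule."""
    log_odds = log_likelihood_ratio(x, present, absent) + math.log(prior / (1 - prior))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical class-conditional models for a detector's raw score.
PRESENT = (1.0, 0.5)  # (mean, std) when the feature is present
ABSENT = (0.0, 0.5)   # (mean, std) when the feature is absent

llr_midpoint = log_likelihood_ratio(0.5, PRESENT, ABSENT)
llr_at_mean = log_likelihood_ratio(1.0, PRESENT, ABSENT)
posterior_at_mean = posterior_present(1.0, PRESENT, ABSENT)
```

A score halfway between the two means gives a ratio of zero, which corresponds to the overlapped region of the pdfs where errors are hardest to avoid.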
Q7: Why use them?
We have no other choice at this time
These attributes and features may be far from optimal, but they are well motivated by acoustic-phonetic theories
Will consider other ideas, as they are developed
Evaluation
Evaluation criteria for attributes, features
  Mutual information (cf. Hasegawa-Johnson’s paper)
  Entropy (e.g., traditional Shannon Entropy, Rényi Entropy, cf. Cachin’s paper)
  Perplexity, like that used in language modeling
  False acceptance rate, false rejection rate
  Other criteria???
Use those criteria to find correlations between attributes, as well as between features
Gradually minimize the mutual information between attributes/features, e.g., Gradient Descent, and get the minimum sets of attributes and features
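Several of the criteria above have compact textbook definitions, sketched here for discrete distributions. The probability tables are toy inputs chosen so the results are easy to check by hand; estimating such tables from real detector outputs is a separate problem not addressed here.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def renyi_entropy(probs, alpha):
    """Renyi entropy of order alpha in bits (alpha >= 0; alpha == 1 gives Shannon)."""
    if alpha == 1:
        return shannon_entropy(probs)
    return math.log2(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)

def perplexity(probs):
    """Perplexity: 2 raised to the Shannon entropy."""
    return 2 ** shannon_entropy(probs)

def mutual_information(joint):
    """Mutual information in bits, where joint[i][j] = P(X = i, Y = j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi

uniform = [0.25, 0.25, 0.25, 0.25]
h_shannon = shannon_entropy(uniform)
h_renyi2 = renyi_entropy(uniform, alpha=2)
ppl = perplexity(uniform)

# Two binary attributes: statistically independent vs. exact copies.
mi_independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
mi_redundant = mutual_information([[0.5, 0.0], [0.0, 0.5]])
```

A pair of attributes with mutual information close to the entropy of either one is redundant, which is the situation the minimization step above aims to remove.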
Segmentation of Speech
Study how humans segment different portions of speech, e.g., spectrum reading
Multiple segmentations are possible, and thus we might want to search through a range of segmentation candidates to find the best result
Collect the segments with high confidence scores
Use other knowledge sources to help clarify the segments with poor scores
Training and Testing
Database – TIMIT and/or Vic corpus
Divide the database into separate training and testing sets
Training
  (1) On the training set
  (2) On the training set + testing set – is this meaningful or proper?
  Find the difference between (1) and (2), and the generalization ability of the features to out of task data
Test performance on the testing set
Training and Testing
Training
  Study differences between isolated words, connected words, continuous and spontaneous speech
  Try not to depend solely on the training data, but instead find rules that adapt to the data and can be applied to more general environments
  Try not to diffuse the model as more data is added
Training and Testing
Testing
  Find reasons why the detectors failed
  Observe error patterns
  Did the error patterns emerge due to different reasons? If so, re-examine previous steps, and combine the different information sources in ways that are less sensitive to the observed error patterns
Work Schedule
First year:
  Set up the structure for the ASAT system
  Define the most reasonable starting set of acoustic attributes and phonetic features
  Look at a range of ways of combining evidence from the acoustic attributes to create the phonetic features
  Evaluate the baseline performance of the system on a given training and testing set of data – most probably using TIMIT
  Baseline alternative approaches, especially front ends, including auditory models and standard speech recognition features