Harmonic-Temporal Clustering of Speech
Jonathan Le Roux, Hirokazu Kameoka, Nobutaka Ono, Alain de Cheveigné,
Shigeki Sagayama
Motivation and Approach
Precise and robust F0 analysis
Analysis of complex and varied acoustical scenes
For speech: applications in speech recognition, prosody analysis, speech enhancement, speaker identification…
Desirable features of a new pitch determination algorithm (PDA)
Performance should stay high across a wide range of background noises (white noise, pink noise, noise bursts, music, other speech)
It should be possible to extract the pitch contours of several concurrent voices simultaneously
Overall speech model: spectro-temporal model with constraints
Several existing multi-pitch tracking algorithms perform an initial frame-by-frame analysis, then post-process the result to reduce errors and obtain a smooth pitch contour (for example using HMMs)
We propose to perform estimation and model-based interpolation simultaneously:
Parametric model of the voiced parts of the power spectrum of speech
Introduction of a noise model to extract harmonically structured “islands” within a “sea” of unstructured noise
Overview of the method
[Figure: time on the horizontal axis, log-frequency on the vertical axis]
Simultaneous optimization of the parameters
Characteristic: Through the harmonicity
assumption, the method models the voiced parts of speech
$W(x,t) \approx \sum_k q_k(x,t;\Theta)$
Express the whole pitch contour as a smooth curve → cubic spline
Distribute audio objects with different acoustical properties
Express the harmonic structure as a parametric function: GMM
Express the power envelope in time direction as a parametric function: GMM
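The parametric pieces listed above can be put together in a small sketch. This is an illustrative simplification, not the authors' implementation: it assumes a natural cubic spline for the log-F0 contour, Gaussians of a shared width placed at log-F0 plus log n for the n-th harmonic, and one temporal Gaussian mixture shared by all harmonics. Every function name, parameter name, and numeric value below is hypothetical.

```python
import math

def natural_cubic_spline(knots_t, knots_y):
    """Return a function evaluating the natural cubic spline through the knots."""
    n = len(knots_t) - 1
    h = [knots_t[i + 1] - knots_t[i] for i in range(n)]
    # Tridiagonal system for the second derivatives m[0..n], with m[0] = m[n] = 0.
    a, b = [0.0] * (n + 1), [1.0] * (n + 1)
    c, d = [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n):
        a[i], b[i], c[i] = h[i - 1], 2.0 * (h[i - 1] + h[i]), h[i]
        d[i] = 6.0 * ((knots_y[i + 1] - knots_y[i]) / h[i]
                      - (knots_y[i] - knots_y[i - 1]) / h[i - 1])
    for i in range(1, n + 1):  # forward elimination (Thomas algorithm)
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    m = [0.0] * (n + 1)
    m[n] = d[n] / b[n]
    for i in range(n - 1, -1, -1):  # back substitution
        m[i] = (d[i] - c[i] * m[i + 1]) / b[i]

    def evaluate(t):
        # Index of the knot interval containing t, clamped to the valid range.
        i = max(0, min(n - 1, sum(1 for k in knots_t[1:-1] if t >= k)))
        dt0, dt1 = t - knots_t[i], knots_t[i + 1] - t
        return ((m[i] * dt1 ** 3 + m[i + 1] * dt0 ** 3) / (6.0 * h[i])
                + (knots_y[i] / h[i] - m[i] * h[i] / 6.0) * dt1
                + (knots_y[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * dt0)

    return evaluate

def gaussian(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def source_model(x, t, f0_contour, harmonic_weights, sigma_x, envelope):
    """One source: harmonic GMM along log-frequency times a GMM along time."""
    mu = f0_contour(t)  # log-F0 at time t, read off the spline
    spectral = sum(v * gaussian(x, mu + math.log(n), sigma_x)
                   for n, v in enumerate(harmonic_weights, start=1))
    # Assumption: a single temporal envelope shared by every harmonic.
    temporal = sum(u * gaussian(t, center, width) for u, center, width in envelope)
    return spectral * temporal

def model_spectrum(x, t, sources):
    """Sum of the weighted source models at log-frequency x and time t."""
    return sum(w * source_model(x, t, **params) for w, params in sources)

# Hypothetical example: one source whose F0 moves 120 Hz -> 150 Hz -> 130 Hz over 1 s.
contour = natural_cubic_spline([0.0, 0.5, 1.0],
                               [math.log(120.0), math.log(150.0), math.log(130.0)])
sources = [(1.0, {"f0_contour": contour,
                  "harmonic_weights": [0.5, 0.3, 0.2],
                  "sigma_x": 0.03,
                  "envelope": [(0.6, 0.3, 0.2), (0.4, 0.7, 0.2)]})]
peak = model_spectrum(math.log(150.0), 0.5, sources)  # on the first harmonic
valley = model_spectrum(math.log(150.0) + 0.5 * math.log(2.0), 0.5, sources)  # between harmonics 1 and 2
```

Because the harmonics sit at fixed offsets log n above the contour, moving the spline knots shifts the whole harmonic stack at once, which is what lets a single smooth curve describe the pitch of a source.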
F0 estimation in noisy environments
Speech mixed with broadband background noise:
Voiced speech with several types of interference:
Accuracy (%) of the F0 estimation:
[Figure: time axis from 0 s to 1.3 s]
Multi-pitch estimation
[Figure: time axis from 0 s to 1.3 s, frequency axis from 50 Hz to 8 kHz; annotation: "No second sound here"]
「a-o-i」 「o-i-o-o-u」
Co-channel speech of two speakers speaking simultaneously with equal average power.
Test data: Bagshaw database, 150 mixtures, 16 kHz, monaural signal
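One way such equal-power co-channel mixtures could be built is sketched below. This is an assumption about the procedure, not a description of how the test set was actually prepared: the second signal is scaled so that its mean power over the overlapping samples matches that of the first, and the two are then summed. The function names and the synthetic tones standing in for the two speakers are hypothetical.

```python
import math

def mean_power(signal):
    """Average of the squared samples."""
    return sum(s * s for s in signal) / len(signal)

def mix_equal_power(a, b):
    """Truncate to the common length, scale b to the mean power of a, and sum."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    power_b = mean_power(b)
    if power_b == 0.0:
        raise ValueError("second signal is silent; cannot match its power")
    gain = math.sqrt(mean_power(a) / power_b)
    return [x + gain * y for x, y in zip(a, b)], gain

# Hypothetical stand-ins for two speakers: pure tones sampled at 16 kHz.
SAMPLE_RATE = 16000
speaker_a = [math.sin(2.0 * math.pi * 120.0 * n / SAMPLE_RATE) for n in range(1600)]
speaker_b = [0.2 * math.sin(2.0 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(2000)]
mixture, gain = mix_equal_power(speaker_a, speaker_b)
```

Matching power over the overlap only is a design choice made here for simplicity; measuring power over the voiced regions alone would be another reasonable option.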
Results