Variational Bayesian Methods for Audio Indexing
Fabio Valente, Christian Wellekens
Institut Eurecom
Outline
- Generalities on speaker clustering
- Model selection / BIC
- Variational learning
- Variational model selection
- Results
Speaker clustering
Many applications (speaker indexing, speech recognition) require clustering segments with the same characteristics, e.g. speech from the same speaker.
Goal: group together speech segments from the same speaker.
Fully connected (ergodic) HMM topology with a duration constraint; each state represents a speaker.
When the speaker number is not known, it must be estimated with a model selection criterion (e.g. BIC, ...).
Model selection
Given data Y and model m, the optimal model maximizes:

    p(m|Y) = \frac{p(Y|m)\, p(m)}{p(Y)}

If the prior is uniform, the decision depends only on p(Y|m) (a.k.a. the marginal likelihood).
Prohibitive to compute for some models (HMM, GMM).
Bayesian modeling assumes distributions over the parameters \theta.
The criterion is thus the marginal likelihood:

    p(Y|m) = \int p(Y|\theta, m)\, p(\theta|m)\, d\theta
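As an illustration (not part of the original slides), this integral can be approximated by Monte Carlo sampling from the prior. The toy model below, a single Gaussian mean with a standard normal prior, is an assumption chosen for this sketch; it also shows why the approach does not scale to HMMs or GMMs:

```python
import numpy as np

def log_marginal_mc(y, n_samples=100_000, seed=0):
    """Monte Carlo estimate of ln p(Y|m) = ln E_prior[p(Y|theta)].

    Toy model (an assumption for this sketch, not the slides' HMM/GMM):
    y_i ~ N(theta, 1) with prior theta ~ N(0, 1).
    """
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(n_samples)            # samples from the prior
    # log p(Y|theta) for each prior sample
    ll = -0.5 * ((y[:, None] - theta[None, :]) ** 2).sum(axis=0)
    ll -= 0.5 * len(y) * np.log(2 * np.pi)
    # log-mean-exp for numerical stability
    m = ll.max()
    return m + np.log(np.mean(np.exp(ll - m)))

y = np.array([0.8, 1.2, 0.5])
print(log_marginal_mc(y))
```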
Bayesian information criterion (BIC)
First-order approximation obtained from the Laplace approximation of the marginal likelihood (Schwarz, 1978):

    BIC(Y, m) = \log p(Y|\hat{\theta}, m) - \frac{d}{2} \log n

where \hat{\theta} is the maximum-likelihood estimate, d the number of free parameters of m, and n the number of data points.
Generally, the penalty is multiplied by a constant (threshold) \lambda:

    BIC(Y, m) = \log p(Y|\hat{\theta}, m) - \lambda\, \frac{d}{2} \log n

BIC does not depend on parameter distributions!
Asymptotically (large n), BIC converges to the log-marginal likelihood.
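A minimal sketch of the BIC score as defined above (the function name and generic signature are mine; log_likelihood, d and n must come from the fitted model):

```python
import numpy as np

def bic_score(log_likelihood, d, n, lam=1.0):
    """BIC(Y, m) = log p(Y|theta_hat, m) - lam * (d/2) * log n.

    log_likelihood: maximized log-likelihood log p(Y|theta_hat, m)
    d: number of free parameters of model m
    n: number of data points
    lam: penalty weight (the 'threshold' mentioned on the slide)
    The best model is the one that maximizes this score.
    """
    return log_likelihood - lam * 0.5 * d * np.log(n)
```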
Variational Learning
Introduce an approximate variational distribution q(\theta) and apply Jensen's inequality:

    \ln p(Y|m) = \ln \int p(Y, \theta|m)\, d\theta
               = \ln \int q(\theta)\, \frac{p(Y, \theta|m)}{q(\theta)}\, d\theta
               \ge \int q(\theta) \ln \frac{p(Y, \theta|m)}{q(\theta)}\, d\theta \equiv F_m

Maximization of \ln p(Y|m) is then replaced by maximization of the free energy F_m.
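As a numeric sanity check of the bound (my own toy example, not from the slides): for a Beta-Bernoulli model both \ln p(Y|m) and F_m are available in closed form, so we can verify that F_m never exceeds the log-marginal likelihood and becomes tight when q(\theta) equals the exact posterior:

```python
import numpy as np
from scipy.special import betaln, digamma

def kl_beta(a, b, alpha, beta):
    """KL( Beta(a,b) || Beta(alpha,beta) )."""
    return (betaln(alpha, beta) - betaln(a, b)
            + (a - alpha) * digamma(a)
            + (b - beta) * digamma(b)
            + (alpha - a + beta - b) * digamma(a + b))

def free_energy(h, t, a, b, alpha=1.0, beta=1.0):
    """F_m = E_q[ln p(Y|theta)] - KL(q||prior), with q(theta) = Beta(a,b)."""
    e_log_th = digamma(a) - digamma(a + b)        # E_q[ln theta]
    e_log_1mth = digamma(b) - digamma(a + b)      # E_q[ln (1-theta)]
    return h * e_log_th + t * e_log_1mth - kl_beta(a, b, alpha, beta)

h, t = 7, 3                                       # 7 heads, 3 tails
log_marg = betaln(1 + h, 1 + t) - betaln(1, 1)    # exact ln p(Y|m)
print(free_energy(h, t, a=2.0, b=2.0), "<=", log_marg)      # strict bound
print(free_energy(h, t, a=1 + h, b=1 + t), "==", log_marg)  # tight at the posterior
```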
Variational Learning with hidden variables
Sometimes model optimization requires hidden variables (e.g. the state sequence in the EM algorithm).
If x is the hidden variable, we can write:

    F_m = \iint q(x, \theta) \ln \frac{p(Y, x, \theta|m)}{q(x, \theta)}\, dx\, d\theta

Under the independence hypothesis q(x, \theta) = q(x)\, q(\theta):

    F_m = \iint q(x)\, q(\theta) \ln \frac{p(Y, x|\theta, m)}{q(x)}\, dx\, d\theta - KL\big(q(\theta)\,\|\,p(\theta|m)\big)
EM-like algorithm
Under the hypothesis q(x, \theta|m) = q(x|m)\, q(\theta|m), F_m is maximized by alternating:

    E-step:  q(x|m) \propto \exp\Big[ \int q(\theta|m) \ln p(Y, x|\theta, m)\, d\theta \Big]
    M-step:  q(\theta|m) \propto \exp\Big[ \int q(x|m) \ln p(Y, x|\theta, m)\, dx \Big]\; p(\theta|m)
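This E-step/M-step alternation is what generic VB implementations run; for instance, scikit-learn's BayesianGaussianMixture performs this kind of variational EM for a GMM. The snippet below is a generic illustration, not the authors' speaker-clustering system:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# two well-separated 1-D clusters, 200 points each
X = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])[:, None]

# VB-EM over a GMM: q(theta) and q(x) are updated in alternation,
# and unneeded components are driven to ~zero weight by the prior
vb = BayesianGaussianMixture(n_components=10, max_iter=500, random_state=0)
vb.fit(X)
print(np.round(vb.weights_, 3))   # most of the 10 components get ~0 weight
```

The prior on the mixture weights prunes superfluous components, which is one way VB realizes in practice the model selection described next.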
VB Model selection
In the same way, an approximate posterior distribution over models q(m) can be defined:

    \ln p(Y) = \ln \sum_m p(m)\, p(Y|m) \ge \sum_m q(m) \Big[ F_m + \ln \frac{p(m)}{q(m)} \Big]

Maximizing with respect to q(m) yields:

    q(m) \propto \exp\{F_m\}\, p(m)

The best model maximizes q(m); model selection is thus based on F_m.
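A minimal sketch of this selection rule (the free-energy values below are hypothetical placeholders; a uniform prior p(m) is assumed when log_prior is omitted):

```python
import numpy as np

def select_model(free_energies, log_prior=None):
    """q(m) proportional to exp{F_m} p(m); returns (q, index of best model)."""
    f = np.asarray(free_energies, dtype=float)
    if log_prior is not None:
        f = f + np.asarray(log_prior)   # uniform p(m) if omitted
    q = np.exp(f - f.max())             # normalize stably
    q /= q.sum()
    return q, int(np.argmax(q))

# hypothetical free energies for models with 1..5 speakers
q, best = select_model([-1200.0, -1150.0, -1130.0, -1135.0, -1160.0])
print(best + 1, "speakers, q(m) =", np.round(q, 3))
```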
Experimental framework
BN-96 Hub4 evaluation data set.
- Initialize a model with N speakers (states) and train the system using VB and ML (or VB and MAP with a UBM).
- Reduce the speaker number from N-1 down to 1 and train using VB and ML (or MAP).
- Score the N models with VB and BIC and choose the best one.
Three scores are reported:
- Best score
- Selected score (with VB or BIC)
- Score obtained with the known speaker number
Results are given in terms of acp (average cluster purity), asp (average speaker purity), and

    K = \sqrt{acp \cdot asp}
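A short sketch of these measures (the confusion-matrix convention n[i, j] is my assumption; the formulas follow the standard definitions of average cluster and speaker purity):

```python
import numpy as np

def purity_scores(n):
    """n[i, j] = number of frames of true speaker j assigned to cluster i.

    Returns acp (average cluster purity), asp (average speaker purity),
    and K = sqrt(acp * asp), the summary score used on the slides.
    """
    n = np.asarray(n, dtype=float)
    N = n.sum()
    acp = (n**2 / n.sum(axis=1, keepdims=True)).sum() / N
    asp = (n**2 / n.sum(axis=0, keepdims=True)).sum() / N
    return acp, asp, np.sqrt(acp * asp)

# toy example: 3 clusters vs. 2 true speakers
acp, asp, K = purity_scores([[90, 10],
                             [5, 80],
                             [5, 10]])
print(round(acp, 2), round(asp, 2), round(K, 2))
```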
Experiments I
File 1
            N   acp   asp    K
ML-known    8   0.60  0.84  0.71
ML-best    10   0.80  0.86  0.83
ML/BIC     13   0.80  0.86  0.83

File 1
            N   acp   asp    K
VB-known    8   0.70  0.91  0.80
VB-best    12   0.85  0.89  0.87
VB         15   0.85  0.89  0.87

File 2
            N   acp   asp    K
ML-known   14   0.76  0.67  0.72
ML-best     9   0.72  0.77  0.74
ML/BIC     13   0.84  0.63  0.73

File 2
            N   acp   asp    K
VB-known   14   0.75  0.82  0.78
VB-best    14   0.84  0.81  0.82
VB         14   0.84  0.81  0.82

File 3
            N   acp   asp    K
ML-known   16   0.75  0.74  0.75
ML-best    15   0.77  0.83  0.80
ML/BIC     15   0.77  0.83  0.80

File 3
            N   acp   asp    K
VB-known   16   0.68  0.86  0.76
VB-best    14   0.75  0.90  0.82
VB         14   0.75  0.90  0.82

File 4
            N   acp   asp    K
ML-known   21   0.72  0.65  0.68
ML-best    12   0.63  0.80  0.71
ML/BIC     21   0.76  0.60  0.68

File 4
            N   acp   asp    K
VB-known   21   0.72  0.65  0.68
VB-best    13   0.63  0.80  0.71
VB         13   0.64  0.72  0.68
Experiments II
Dependence on threshold
[Figure: K as a function of the threshold; selected speaker number as a function of the threshold]
[Figure: free energy vs. BIC]
Experiments III
File 1
            N   acp   asp    K
MAP-known   8   0.52  0.72  0.62
MAP-best   15   0.81  0.84  0.83
MAP/BIC    13   0.80  0.81  0.81

File 1
            N   acp   asp    K
VB-known    8   0.68  0.88  0.77
VB-best    22   0.83  0.85  0.84
VB         22   0.83  0.85  0.84

File 2
            N   acp   asp    K
MAP-known  14   0.68  0.78  0.73
MAP-best   22   0.84  0.80  0.82
MAP/BIC    18   0.68  0.85  0.81

File 2
            N   acp   asp    K
VB-known   14   0.69  0.80  0.74
VB-best    18   0.85  0.87  0.86
VB         19   0.87  0.80  0.83
Experiments IV
File 3
            N   acp   asp    K
MAP-known  16   0.71  0.77  0.74
MAP-best   29   0.78  0.74  0.76
MAP/BIC    16   0.69  0.77  0.73

File 3
            N   acp   asp    K
VB-known   16   0.74  0.83  0.78
VB-best    22   0.82  0.82  0.82
VB         16   0.78  0.79  0.79

File 4
            N   acp   asp    K
MAP-known  18   0.65  0.69  0.67
MAP-best   18   0.65  0.69  0.67
MAP/BIC    20   0.63  0.64  0.64

File 4
            N   acp   asp    K
VB-known   21   0.67  0.73  0.70
VB-best    20   0.69  0.72  0.70
VB         19   0.67  0.73  0.70
Conclusions and Future Work
- VB uses the free energy for both parameter learning and model selection.
- VB generalizes both the ML and MAP learning frameworks.
- VB outperforms ML/BIC on 3 of the 4 BN files.
- VB outperforms MAP/BIC on 4 of the 4 BN files.
- Future work: repeat the experiments on other databases (e.g. NIST speaker diarization).
Thanks for your attention!
Data vs. Gaussian components
[Figure: final number of Gaussian components as a function of the amount of data for each speaker]
Experiments (file 1)
           Real   VB   ML/BIC
Speakers     8    15     13

Experiments (file 2)
           Real   VB   ML/BIC
Speakers    14    14     16

Experiments (file 3)
           Real   VB   ML/BIC
Speakers    16    14     15

Experiments (file 4)
           Real   VB   ML/BIC
Speakers    21    13     12