Institute of Information Science, Academia Sinica, Taiwan
Introduction to Speaker Diarization
Date: 2007/08/16
Speaker: Shih-Sian Cheng
Outline
• Speaker diarization
  • Problem formulation
  • A prototypical speaker diarization system
• Speaker segmentation
  • Problem formulation
  • Speaker segmentation using a fixed-size analysis window
  • Speaker segmentation using a variable-size analysis window
    • Bottom-up segmentation using BIC
    • Top-down segmentation using BIC
• Speaker clustering
  • Problem formulation
  • Hierarchical agglomerative clustering
  • Optimization-oriented approaches
• Two leading speaker diarization systems
  • LIMSI’s system
  • Cambridge’s system
Speaker diarization (Problem formulation)
Problem formulation: the “who spoke when” task on a continuous audio stream (NIST RT03 Spring Eval.)
[Figure: the audio stream is first divided by speaker segmentation; speaker clustering then groups the segments into Speaker 1, Speaker 2, and Speaker 3.]
Speaker diarization (Problem formulation)
Performance measure of the speaker diarization task (C. Barras et al., 2006; NIST RT03 Spring Eval.)
Find the mapping between reference speakers and hypothesis speakers such that their overlap in time is largest. In this case, S1->A and S3->B.

Speaker diarization (Problem formulation)
Applications
Example: Automatic transcription for a broadcast news show
• By speaker recognition
• Speaker adaptation + speech recognition
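For the performance measure above, the reference-to-hypothesis speaker mapping can be sketched in Python. This is a minimal illustration, not the NIST scoring tool: the `(start, end, speaker)` tuple format, the function names, and the segment times are all assumptions made for the example, and the brute-force search is only practical for a handful of speakers.

```python
from itertools import permutations

def overlap(a_start, a_end, b_start, b_end):
    """Length of the time overlap between two intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def best_mapping(reference, hypothesis):
    """One-to-one reference->hypothesis mapping with the largest total overlap."""
    ref_spk = sorted({spk for _, _, spk in reference})
    hyp_spk = sorted({spk for _, _, spk in hypothesis})
    # Total overlap time for every (reference speaker, hypothesis speaker) pair.
    table = {(r, h): 0.0 for r in ref_spk for h in hyp_spk}
    for r_start, r_end, r in reference:
        for h_start, h_end, h in hypothesis:
            table[(r, h)] += overlap(r_start, r_end, h_start, h_end)
    # None slots let a reference speaker stay unmapped.
    slots = hyp_spk + [None] * len(ref_spk)
    best, best_score = {}, -1.0
    for perm in permutations(slots, len(ref_spk)):
        pairs = [(r, h) for r, h in zip(ref_spk, perm) if h is not None]
        score = sum(table[p] for p in pairs)
        if score > best_score:
            best, best_score = dict(pairs), score
    return best, best_score

# Made-up segments (seconds) that reproduce the S1->A, S3->B outcome.
reference = [(0, 10, "S1"), (10, 13, "S2"), (13, 20, "S3")]
hypothesis = [(0, 11, "A"), (11, 20, "B")]
print(best_mapping(reference, hypothesis))  # ({'S1': 'A', 'S3': 'B'}, 17.0)
```

Here S2 stays unmapped because both hypothesis speakers are better spent on S1 and S3.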
Speaker diarization (A prototypical system)
The prototypical speaker diarization system (S. E. Tranter & D. A. Reynolds, 2006)
[Figure: block diagram of the system. Annotations on the diagram:]
• To filter out non-speech data
• Speaker segmentation (usually over-segmentation)
• Speaker clustering
• Change boundary refinement
Speaker segmentation (Problem formulation)
Problem formulation: detect the speaker change boundaries
Performance measure
[Figure: target changes compared with hypothesized changes along the time axis, marking a false alarm and a miss detection.]
• Error types: miss detection and false alarm
• Performance metrics: ROC curve; F-score:
$$F = \frac{2PR}{P + R}$$
where P is the precision rate and R is the recall rate.
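The F-score above can be computed from lists of change times as in the following sketch. The matching tolerance (0.5 s), the greedy nearest-neighbour matching, and the function names are assumptions for illustration; the slides do not specify how hypothesized changes are paired with target changes.

```python
def match_changes(target, hypothesized, tolerance=0.5):
    """Pair each hypothesized change with the nearest unused target change
    within `tolerance` seconds; return the number of matched changes."""
    unused = sorted(target)
    hits = 0
    for h in sorted(hypothesized):
        close = [t for t in unused if abs(t - h) <= tolerance]
        if close:
            unused.remove(min(close, key=lambda t: abs(t - h)))
            hits += 1
    return hits

def f_score(target, hypothesized, tolerance=0.5):
    """Return (precision, recall, F, false alarms, miss detections)."""
    hits = match_changes(target, hypothesized, tolerance)
    false_alarms = len(hypothesized) - hits
    misses = len(target) - hits
    precision = hits / len(hypothesized) if hypothesized else 0.0
    recall = hits / len(target) if target else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f, false_alarms, misses

# Made-up change times in seconds: 2 hits, 2 false alarms, 1 miss, F = 4/7.
p, r, f, fa, miss = f_score([5.0, 12.0, 20.0], [5.2, 9.0, 19.8, 25.0])
print(p, r, f, fa, miss)
```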
Speaker segmentation (Fixed-size analysis window approach)
Speaker segmentation using a fixed-size analysis window (Siegler et al., 1997)
[Figure: two adjacent sliding windows move along the data stream; a distance is computed between them at each position, giving a distance curve.]
Distance measure of two segments
• Kullback-Leibler (KL) distance (Siegler et al., 1997)
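A minimal sketch of the fixed-size window approach with a KL-based distance follows. It assumes scalar features modelled by one Gaussian per window and uses the symmetric form KL(p||q) + KL(q||p); real systems use multi-dimensional features, and the window length, function names, and toy stream are illustrative choices, not taken from Siegler et al.

```python
def gaussian_stats(frames, floor=1e-6):
    """Maximum-likelihood mean and (floored) variance of scalar features."""
    mean = sum(frames) / len(frames)
    var = sum((x - mean) ** 2 for x in frames) / len(frames)
    return mean, max(var, floor)

def symmetric_kl(x, y):
    """KL(p||q) + KL(q||p) between 1-D Gaussians fitted to x and y."""
    mx, vx = gaussian_stats(x)
    my, vy = gaussian_stats(y)
    return (0.5 * (vx / vy + vy / vx) - 1.0
            + 0.5 * (mx - my) ** 2 * (1.0 / vx + 1.0 / vy))

def distance_curve(stream, window):
    """Distance between the two windows on either side of each frame t."""
    return [(t, symmetric_kl(stream[t - window:t], stream[t:t + window]))
            for t in range(window, len(stream) - window + 1)]

# Toy stream: 40 frames of one condition followed by 40 frames of another.
stream = [0.0, 1.0] * 20 + [10.0, 11.0] * 20
curve = distance_curve(stream, window=10)
peak_frame, peak_value = max(curve, key=lambda point: point[1])
print(peak_frame)  # 40: the distance curve peaks at the true change
```

In practice the change points are taken at the significant local maxima of the distance curve, which requires a threshold that this sketch leaves out.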
Speaker segmentation (Fixed-size analysis window approach)
• SVM training error (王駿發 et al., 2005)
[Figure: feature vectors of segments X and Y in two cases.]
More overlap between X and Y gives a larger training error; less overlap gives a smaller training error, which indicates a larger distance and less similarity.
Speaker segmentation (Fixed-size analysis window approach)
ΔBIC (S. Chen et al., 1998; P. Delacourt et al., 2000)
Bayesian information criterion (BIC) for model selection:
• Data set: $X = \{x_1, x_2, \ldots, x_n\} \subset R^d$
• Candidate models: $M = \{M_1, M_2, \ldots, M_k\}$
• Model selection by BIC:
$$BIC(M_i) = \log pr(X \mid \hat{\theta}_i) - \frac{1}{2}\,\lambda\,\#(M_i)\,\log n$$
where $pr(X \mid \hat{\theta}_i)$ is the maximum likelihood of X for model $M_i$, $\#(M_i)$ is the number of parameters of $M_i$, and λ = 1 in the BIC theory but is usually tuned for the trade-off between error types. The model with the largest BIC value is selected.
Speaker segmentation (Fixed-size analysis window approach)
Use BIC as an inter-segment distance measure
Given two audio segments represented by feature vectors $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, these two segments can be judged as being under the same or different acoustic conditions via the following hypothesis test:
$$H_0: x_1, \ldots, x_m, y_1, \ldots, y_n \sim N(\mu, \Sigma)$$
$$H_1: x_1, \ldots, x_m \sim N(\mu_X, \Sigma_X); \quad y_1, \ldots, y_n \sim N(\mu_Y, \Sigma_Y)$$
$$\Delta BIC = BIC(H_1) - BIC(H_0) = \frac{m+n}{2}\log|\Sigma| - \frac{m}{2}\log|\Sigma_X| - \frac{n}{2}\log|\Sigma_Y| - \frac{1}{2}\,\lambda\,\#(H_0)\,\log(m+n)$$
The penalty uses $\#(H_0)$ because $H_1$ has twice as many parameters as $H_0$, so $\#(H_1) - \#(H_0) = \#(H_0)$; for d-dimensional full-covariance Gaussians, $\#(H_0) = d + d(d+1)/2$.
X and Y are judged as from the same acoustic condition if ΔBIC < 0.
Ex: [Figure: Seg X and Seg Y modelled under H0 and under H1, in two cases.]
• X and Y are from different acoustic conditions: ΔBIC ≥ 0
• X and Y are from the same acoustic condition: ΔBIC < 0
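The ΔBIC test can be tried out with a small sketch. To stay dependency-free it assumes scalar features (d = 1), so each Gaussian has 2 parameters and the determinants reduce to variances; the function names and toy segments are illustrative assumptions.

```python
import math

def ml_variance(frames, floor=1e-6):
    """Maximum-likelihood variance of scalar features, floored for stability."""
    mean = sum(frames) / len(frames)
    return max(sum((v - mean) ** 2 for v in frames) / len(frames), floor)

def delta_bic(x, y, lam=1.0):
    """Delta BIC = BIC(H1) - BIC(H0) for two lists of scalar features."""
    m, n = len(x), len(y)
    n_params = 2  # #(H0) for d = 1: one mean and one variance
    gain = ((m + n) / 2 * math.log(ml_variance(x + y))
            - m / 2 * math.log(ml_variance(x))
            - n / 2 * math.log(ml_variance(y)))
    return gain - 0.5 * lam * n_params * math.log(m + n)

seg_a = [-1.0, 1.0] * 50   # toy segment, mean 0
seg_b = [-1.0, 1.0] * 50   # same condition as seg_a
seg_c = [9.0, 11.0] * 50   # different condition, mean 10

print(delta_bic(seg_a, seg_b) < 0)  # True: judged as the same condition
print(delta_bic(seg_a, seg_c) > 0)  # True: judged as different conditions
```

Raising `lam` makes the test more conservative (fewer detected changes), which is the trade-off between miss detections and false alarms mentioned for λ.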
Speaker segmentation (Variable-size analysis window approach)
Speaker segmentation using a variable-size analysis window
Bottom-up detection using BIC (S. Chen and P. Gopalakrishnan, 1998; M. Cettolo et al., 2005)
The bottom-up detection process on an audio stream
[Figure: an audio stream consisting of Seg 1, Seg 2, Seg 3, and Seg 4; one-change-point detection is applied to an analysis window on the stream to locate each change point.]
Speaker segmentation (Variable-size analysis window approach)
One-change-point detection using BIC
[Figure: the feature vectors in the analysis window are split into X and Y at each candidate position.]
Calculate ΔBIC at each feature vector.
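One-change-point detection and the bottom-up process around it can be sketched as below. The sketch assumes scalar features and declares a change at the position with the largest ΔBIC, provided that value is positive; the initial window size, growth step, margin, and function names are illustrative assumptions rather than values from the cited papers.

```python
import math

def _var(frames, floor=1e-6):
    mean = sum(frames) / len(frames)
    return max(sum((v - mean) ** 2 for v in frames) / len(frames), floor)

def delta_bic(x, y, lam=1.0):
    """Delta BIC for scalar features (2 parameters per Gaussian)."""
    m, n = len(x), len(y)
    gain = ((m + n) * math.log(_var(x + y))
            - m * math.log(_var(x)) - n * math.log(_var(y))) / 2
    return gain - 0.5 * lam * 2 * math.log(m + n)

def one_change_point(window, margin=5, lam=1.0):
    """Split index with the largest Delta BIC, or None if none is positive."""
    best_t, best_val = None, 0.0
    for t in range(margin, len(window) - margin + 1):
        val = delta_bic(window[:t], window[t:], lam)
        if val > best_val:
            best_t, best_val = t, val
    return best_t

def bottom_up(stream, init=20, grow=10, margin=5, lam=1.0):
    """Grow the analysis window until a change is found, then restart there."""
    changes, start, end = [], 0, init
    while start + 2 * margin <= len(stream):
        end = min(end, len(stream))
        t = one_change_point(stream[start:end], margin, lam)
        if t is not None:
            changes.append(start + t)
            start, end = start + t, start + t + init
        elif end == len(stream):
            break
        else:
            end += grow
    return changes

# Toy stream with changes at frames 40 and 80.
stream = [-1.0, 1.0] * 20 + [9.0, 11.0] * 20 + [-1.0, 1.0] * 20
print(bottom_up(stream))  # [40, 80]
```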
Speaker segmentation (Variable-size analysis window approach)
Top-down detection using BIC (C. H. Wu and C. H. Hsieh, 2006; M. Cettolo et al., 2005)
The top-down detection process for an audio stream
[Figure: multiple-change-detection is applied to the whole audio stream, dividing it into Seg 1, Seg 2, Seg 3, and Seg 4.]
Speaker segmentation (Variable-size analysis window approach)
Multiple-change-detection using BIC
[Figure: the audio stream X (Seg 1 to Seg 4) under four hypothesized segmentations H0, H1, H2, and H3.]
Assumption: different segments arise from different Gaussian processes
Intuitively,
pr(X|H0) < pr(X|H1) < pr(X|H2) < pr(X|H3)
but
BIC(X|H2) > BIC(X|H3) > BIC(X|H1) > BIC(X|H0)
Multiple-change-detection: search for the H that has the largest BIC value in the solution space
• Exhaustive search
Speaker segmentation (Variable-size analysis window approach)
• Top-down, hierarchical search (C. H. Wu and C. H. Hsieh, 2006): a sub-optimal search
[Figure: on the audio stream X (Seg 1 to Seg 4), Pass 1 and Pass 2 split the stream in turn, after which the search terminates.]
• Dynamic programming (M. Cettolo et al., 2005): an optimal search
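The top-down, hierarchical search can be illustrated with a recursive sketch: find the best single split of the current segment, and if its ΔBIC is positive, split and recurse on both halves. This is a generic ΔBIC-based illustration on scalar features, not the MDL-based Gaussian model of Wu and Hsieh; the margin and function names are assumptions.

```python
import math

def _var(frames, floor=1e-6):
    mean = sum(frames) / len(frames)
    return max(sum((v - mean) ** 2 for v in frames) / len(frames), floor)

def delta_bic(x, y, lam=1.0):
    """Delta BIC for scalar features (2 parameters per Gaussian)."""
    m, n = len(x), len(y)
    gain = ((m + n) * math.log(_var(x + y))
            - m * math.log(_var(x)) - n * math.log(_var(y))) / 2
    return gain - 0.5 * lam * 2 * math.log(m + n)

def best_split(segment, margin=5, lam=1.0):
    """Index of the best single split, or None if no split has Delta BIC > 0."""
    scored = [(delta_bic(segment[:t], segment[t:], lam), t)
              for t in range(margin, len(segment) - margin + 1)]
    if not scored:
        return None
    val, t = max(scored)
    return t if val > 0 else None

def top_down(stream, offset=0, margin=5, lam=1.0):
    """Recursively split at the best change point until no split is accepted."""
    t = best_split(stream, margin, lam)
    if t is None:
        return []  # terminate: the segment is judged homogeneous
    left = top_down(stream[:t], offset, margin, lam)
    right = top_down(stream[t:], offset + t, margin, lam)
    return left + [offset + t] + right

# Toy stream with changes at frames 40 and 80.
stream = [-1.0, 1.0] * 20 + [9.0, 11.0] * 20 + [-1.0, 1.0] * 20
print(top_down(stream))  # [40, 80]
```

Each pass commits to one split before looking for the next, which is why the search is sub-optimal compared with dynamic programming over all segmentations.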
Speaker clustering (Problem formulation)
Problem formulation
Given N speech utterances from P unknown speakers, partition these utterances into M clusters, such that M = P and each cluster consists exclusively of utterances from only one speaker.
[Figure: partitioning a set of speech utterances into clusters, one each for Speaker 1, Speaker 2, Speaker 3, and Speaker 4.]
Speaker clustering (Problem formulation)
Cluster purity: the probability that if we pick any utterance from a cluster twice at random, with replacement, both of the selected utterances are from the same speaker.
$$\rho_m = \sum_{p=1}^{P} \frac{n_{mp}^2}{n_{m*}^2}, \qquad \bar{\rho} = \frac{1}{N}\sum_{m=1}^{M} n_{m*}\,\rho_m$$
• P: total no. of speakers involved
• M: total no. of clusters
• N: total no. of utterances
• $\rho_m$: purity of the m-th cluster
• $n_{m*}$: no. of utterances in the m-th cluster
• $n_{*p}$: no. of utterances from the p-th speaker
• $n_{mp}$: no. of utterances in the m-th cluster that are from the p-th speaker
Examples [Figure: three example clusters]:
$$\rho = \frac{6^2 + 2^2 + 1^2 + 1^2}{10^2} = 0.42, \qquad \rho = \frac{6^2}{6^2} = 1.0, \qquad \rho = \frac{1^2 + 1^2 + 1^2 + 1^2}{4^2} = 0.25$$
The average purity $\bar{\rho}$ increases as the number of clusters increases.
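The purity figures above can be reproduced with a few lines of Python. The function names and the list-of-counts layout are assumptions for illustration; combining the three example clusters into one clustering is also hypothetical, since the slide presents them as separate examples.

```python
def cluster_purity(counts):
    """Purity of one cluster; counts[p] is n_mp for speaker p."""
    size = sum(counts)
    return sum(c * c for c in counts) / (size * size)

def average_purity(table):
    """Average purity; table[m][p] is n_mp, clusters weighted by their size."""
    total = sum(sum(row) for row in table)
    return sum(sum(row) * cluster_purity(row) for row in table) / total

print(cluster_purity([6, 2, 1, 1]))  # 0.42
print(cluster_purity([6, 0, 0, 0]))  # 1.0
print(cluster_purity([1, 1, 1, 1]))  # 0.25

# Putting every utterance in its own cluster gives purity 1.0, which shows
# why average purity alone cannot be used to choose the number of clusters.
print(average_purity([[1, 0], [1, 0], [0, 1]]))  # 1.0
```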
Speaker clustering (Problem formulation)
Rand Index

speaker | cluster 1 | cluster 2 | … | cluster M | Sum
1       | n_11      | n_21      | … | n_M1      | n_*1
2       | n_12      | n_22      | … | n_M2      | n_*2
…       | …         | …         | … | …         | …
P       | n_1P      | n_2P      | … | n_MP      | n_*P
Sum     | n_1*      | n_2*      | … | n_M*      | N

$$R(M) = \sum_{m=1}^{M} n_{m*}^2 + \sum_{p=1}^{P} n_{*p}^2 - 2\sum_{m=1}^{M}\sum_{p=1}^{P} n_{mp}^2 = \left(\sum_{m=1}^{M} n_{m*}^2 - \sum_{m=1}^{M}\sum_{p=1}^{P} n_{mp}^2\right) + \left(\sum_{p=1}^{P} n_{*p}^2 - \sum_{m=1}^{M}\sum_{p=1}^{P} n_{mp}^2\right)$$
Two error types:
• I: the number of utterance pairs (with replacement) in the same cluster but from different speakers
• II: the number of utterance pairs (with replacement) from the same speaker but in different clusters
Type I error:
$$\sum_{m=1}^{M} n_{m*}^2 - \sum_{m=1}^{M}\sum_{p=1}^{P} n_{mp}^2$$
The first term is the number of utterance pairs in the same cluster; the second is the number of utterance pairs in the same cluster that are from the same speaker.
Type II error:
$$\sum_{p=1}^{P} n_{*p}^2 - \sum_{m=1}^{M}\sum_{p=1}^{P} n_{mp}^2$$
The first term is the number of utterance pairs from the same speaker; the second is the number of utterance pairs from the same speaker that are in the same cluster.
The Rand index reaches its minimum only when M = P.
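A short sketch makes the two error types concrete. It computes the Rand index from the cluster-by-speaker count table; the function name and the `table[m][p]` layout are assumptions for illustration, and the count tables are made up.

```python
def rand_index(table):
    """table[m][p] is n_mp. Returns (R, type I error, type II error)."""
    same_cluster = sum(sum(row) ** 2 for row in table)        # sum_m n_m*^2
    same_speaker = sum(sum(col) ** 2 for col in zip(*table))  # sum_p n_*p^2
    both = sum(c * c for row in table for c in row)           # sum_m sum_p n_mp^2
    type_1 = same_cluster - both  # same cluster, different speakers
    type_2 = same_speaker - both  # same speaker, different clusters
    return type_1 + type_2, type_1, type_2

# Two speakers with 3 and 2 utterances.
print(rand_index([[3, 0], [0, 2]]))          # (0, 0, 0): perfect, M = P
print(rand_index([[3, 0], [0, 1], [0, 1]]))  # (2, 0, 2): over-split, M > P
print(rand_index([[3, 2]]))                  # (12, 12, 0): under-split, M < P
```

Too many clusters raise the type II error and too few raise the type I error, so the index penalizes both directions, unlike average purity.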
Speaker clustering (Hierarchical agglomerative clustering)
Hierarchical agglomerative clustering (S. Chen and P. Gopalakrishnan, 1998; Barras et al., 2006)
[Figure: utterances X1, X2, ..., XN are merged step by step into clusters, e.g. X2 with X13 and X1 with X19.]
• Distance of two clusters: ΔBIC
• Stopping criteria: local BIC, global BIC
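Agglomerative clustering with ΔBIC as the inter-cluster distance can be sketched as follows. The sketch assumes scalar features and uses only the local stopping criterion (stop when the closest pair has ΔBIC ≥ 0); the global BIC criterion is not shown, and the function names and toy utterances are illustrative.

```python
import math

def _var(frames, floor=1e-6):
    mean = sum(frames) / len(frames)
    return max(sum((v - mean) ** 2 for v in frames) / len(frames), floor)

def delta_bic(x, y, lam=1.0):
    """Delta BIC for scalar features (2 parameters per Gaussian)."""
    m, n = len(x), len(y)
    gain = ((m + n) * math.log(_var(x + y))
            - m * math.log(_var(x)) - n * math.log(_var(y))) / 2
    return gain - 0.5 * lam * 2 * math.log(m + n)

def agglomerative_clustering(utterances, lam=1.0):
    """Merge the closest pair of clusters until no pair has Delta BIC < 0."""
    clusters = [list(u) for u in utterances]          # pooled feature vectors
    members = [[i] for i in range(len(utterances))]   # utterance indices
    while len(clusters) > 1:
        val, a, b = min((delta_bic(clusters[a], clusters[b], lam), a, b)
                        for a in range(len(clusters))
                        for b in range(a + 1, len(clusters)))
        if val >= 0:  # local BIC stop: the closest pair looks like two speakers
            break
        clusters[a] += clusters[b]
        members[a] += members[b]
        del clusters[b], members[b]
    return members

# Four toy utterances from two "speakers" (means 0 and 10).
utterances = [[-1.0, 1.0] * 10, [9.0, 11.0] * 10,
              [-1.0, 1.0] * 10, [9.0, 11.0] * 10]
print(agglomerative_clustering(utterances))  # [[0, 2], [1, 3]]
```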
Speaker clustering (Optimization-oriented approaches)
Optimization-oriented approaches
Maximum purity clustering (W. H. Tsai et al., IEEE Trans. ASLP, 2007)
For a given number of clusters M and a set of cluster indices $H = [h_1, h_2, \ldots, h_N]$ for N utterances $X_1, X_2, \ldots, X_N$, the average cluster purity is
$$\bar{\rho}(H) = \frac{1}{N}\sum_{m=1}^{M}\sum_{p=1}^{P}\frac{n_{mp}^2}{n_{m*}} = \frac{1}{N}\sum_{m=1}^{M}\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}\delta(h_i, m)\,\delta(h_j, m)\,\delta(o_i, o_j)}{\sum_{i=1}^{N}\delta(h_i, m)}$$
where δ(·,·) is the Kronecker delta and $o_i$ is the true speaker index of utterance $X_i$ ($1 \le o_i \le P$).
$\delta(o_i, o_j)$ (the ground truth) is unknown and needs to be estimated. It is approximated by
$$\hat{\delta}(o_i, o_j) = \begin{cases} 1, & \text{if } i = j \\ \dfrac{S(X_i, X_j)}{S(X_i, X_{i'})}, & \text{if } i \ne j \text{ and } R[S(X_i, X_j)] \le n_m \\ 0, & \text{otherwise} \end{cases}$$
• $S(X_i, X_j)$: similarity between utterances $X_i$ and $X_j$
• $R[S(X_i, X_j)]$: rank of the inter-utterance similarity $S(X_i, X_j)$ among $S(X_i, X_1), S(X_i, X_2), \ldots, S(X_i, X_N)$ in descending order
• $X_{i'}$: utterance most similar to $X_i$, i.e., $R[S(X_i, X_{i'})] = 2$
[Figure: utterances $X_i$ and $X_j$ in the m-th cluster, with $n_m = 4$.]
Speaker clustering (Optimization-oriented approaches)
• Use BIC to determine the cluster number
• Let $\hat{\rho}(H)$ denote the estimated purity. Use a genetic algorithm to find H* such that
$$H^* = \arg\max_{H} \hat{\rho}(H)$$
Minimum Rand index clustering (W. H. Tsai and H. M. Wang, Proc. ICASSP, 2007): performs the grouping of utterances and determines the number of groups within the optimization process.
$$R(M) = \sum_{m=1}^{M} n_{m*}^2 + \sum_{p=1}^{P} n_{*p}^2 - 2\sum_{m=1}^{M}\sum_{p=1}^{P} n_{mp}^2 = \sum_{i=1}^{N}\sum_{j=1}^{N}\delta\big(h_i^{(M)}, h_j^{(M)}\big) + \underbrace{\sum_{p=1}^{P} n_{*p}^2}_{\text{constant}} - 2\sum_{i=1}^{N}\sum_{j=1}^{N}\delta\big(h_i^{(M)}, h_j^{(M)}\big)\,\delta(o_i, o_j)$$
$\delta(o_i, o_j)$ (the ground truth) is unknown and needs to be estimated.
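Speaker clustering (Optimization-oriented approaches)
Use a genetic algorithm to find H* such that
$$H^* = \arg\min_{H^{(M)},\; 1 \le M \le N} \hat{R}\big(H^{(M)}\big)$$
$$\hat{R}\big(H^{(M)}\big) = \sum_{i=1}^{N}\sum_{j=1}^{N}\delta\big(h_i^{(M)}, h_j^{(M)}\big) - 2\sum_{i=1}^{N}\sum_{j=1}^{N}\delta\big(h_i^{(M)}, h_j^{(M)}\big)\,\hat{\delta}(o_i, o_j)$$
$\delta(o_i, o_j)$ is approximated by a normalized inter-utterance similarity:
$$\hat{\delta}(o_i, o_j) = \begin{cases} 1, & \text{if } i = j \\ \dfrac{S(X_i, X_j)}{S_{max}}, & \text{if } i \ne j \end{cases}$$
where $S_{max}$ is the maximum among the similarities $S(X_i, X_j)$, $i \ne j$, and the similarity is the generalized likelihood ratio
$$S(X_i, X_j) = \frac{\Pr(X_{ij} \mid \theta_{ij})}{\Pr(X_i \mid \theta_i)\,\Pr(X_j \mid \theta_j)}$$
with $\theta_i$, $\theta_j$, and $\theta_{ij}$ the models estimated from $X_i$, $X_j$, and their concatenation $X_{ij}$.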
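The estimated Rand index and its minimization can be illustrated on a tiny example. The sketch below replaces the genetic algorithm with an exhaustive search over all labellings, which is only feasible for a few utterances, and it takes a made-up positive similarity matrix in place of generalized likelihood ratios; the function names are assumptions.

```python
from itertools import product

def estimated_rand_index(labels, similarity):
    """R_hat(H) = sum_ij d(h_i,h_j) - 2 sum_ij d(h_i,h_j) d_hat(o_i,o_j)."""
    n = len(labels)
    s_max = max(similarity[i][j] for i in range(n) for j in range(n) if i != j)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                d_hat = 1.0 if i == j else similarity[i][j] / s_max
                total += 1.0 - 2.0 * d_hat
    return total

def minimum_rand_index_clustering(similarity):
    """Exhaustive stand-in for the genetic algorithm: try every labelling.
    Using up to N labels lets the search pick the number of clusters itself."""
    n = len(similarity)
    return min(product(range(n), repeat=n),
               key=lambda labels: estimated_rand_index(labels, similarity))

# Hypothetical similarities: utterances 0 and 1 are alike, as are 2 and 3.
S = [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.8],
     [0.2, 0.1, 0.8, 1.0]]
print(minimum_rand_index_clustering(S))  # (0, 0, 1, 1): two clusters found
```

A pair placed in the same cluster lowers the estimate only when its normalized similarity exceeds 0.5, so the minimization settles on two clusters here without being told M.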
Two leading systems
LIMSI’s system (Barras et al., 2006)
[Figure: block diagram of the system. Annotations on the diagram:]
• Fixed-size sliding window segmentation
• Boundary refinement
• Use ΔBIC to measure the inter-cluster similarity
• To filter out short-duration silence segments that were not removed in the initial speech detection step
• To remove only long regions without speech such as silence, music, and noise using GMM
• Use the cross-likelihood ratio to measure the inter-cluster similarity. M_i is a MAP-adapted GMM.
• Boundary refinement; align the change boundaries to silence portions
Two leading systems
Cambridge’s system (Sinha et al., 2005)
[Figure: block diagram of the system. Components:]
• SD: speech detection
• CPD: change point detection
• IAC: iterative agglomerative clustering
• Speaker identification (SID) clustering: MAP adaptation (mean-only) is applied towards each cluster from the appropriate gender/bandwidth UBM, and the cross likelihood ratio (CLR) between any two given clusters is used.
Reference
• C. Barras, X. Zhu, S. Meignier, and J.-L. Gauvain, “Multistage Speaker Diarization of Broadcast News,” IEEE Trans. Audio, Speech, and Language Processing, Special Issue on Rich Transcription, vol. 14, no. 5, pp. 1505-1512, 2006.
• NIST 2003 Spring, http://www.nist.gov/speech/tests/rt/rt2003/spring/
• R. Sinha, S. E. Tranter, M. J. F. Gales, and P. C. Woodland, “The Cambridge University March 2005 Speaker Diarization System,” INTERSPEECH 2005.
• S. E. Tranter and D. A. Reynolds, “An Overview of Automatic Speaker Diarisation Systems,” IEEE Trans. Audio, Speech, and Language Processing, Special Issue on Rich Transcription, 2006.
• S. Chen and P. Gopalakrishnan, “Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion,” in Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998.
• C. H. Wu and C. H. Hsieh, “Multiple Change-Point Audio Segmentation and Classification Using an MDL-based Gaussian Model,” IEEE Trans. Audio, Speech, and Language Processing, 2006.
• M. Cettolo, M. Vescovi, and R. Rizzi, “Evaluation of BIC-based algorithms for audio segmentation,” Computer Speech and Language, 2005.
• M. Siegler, U. Jain, B. Raj, and R. Stern, “Automatic Segmentation, Classification and Clustering of Broadcast News Audio,” in Proc. DARPA Speech Recognition Workshop, 1997.
• P. Delacourt and C. J. Wellekens, “DISTBIC: A Speaker-based Segmentation for Audio Data Indexing,” Speech Communication, vol. 32, pp. 111-126, 2000.
• 王駿發, 林博川, 王家慶, 宋豪靜, “以支援向量機為基礎之新穎語者切換偵測演算法” (A Novel Speaker Change Detection Algorithm Based on Support Vector Machines), in Proc. ROCLING 2005.
• Wei-Ho Tsai, Shih-Sian Cheng, and Hsin-Min Wang, “Automatic Speaker Clustering Using a Voice Characteristic Reference Space and Maximum Purity Estimation,” IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1461-1474, May 2007.
• Wei-Ho Tsai and Hsin-Min Wang, “Speaker Clustering Based on Minimum Rand Index,” IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP 2007), April 2007.