Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and...

37
Vector Quantization and Clustering Introduction K -means clustering Clustering issues Hierarchical clustering Divisive (top-down) clustering Agglomerative (bottom-up) clustering Applications to speech recognition 6.345 Automatic Speech Recognition Vector Quantization & Clustering 1

Transcript of Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and...

Page 1: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Vector Quantization and Clustering

• Introduction

• K-means clustering

• Clustering issues

• Hierarchical clustering

– Divisive (top-down) clustering

– Agglomerative (bottom-up) clustering

• Applications to speech recognition

6.345 Automatic Speech Recognition Vector Quantization & Clustering 1

rahkuma
Lecture # 6 Session 2003
Page 2: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Acoustic Modelling

Waveform

- SignalRepresentation

-

Feature Vectors

VectorQuantization

-

Symbols

• Signal representation produces feature vector sequence

• Multi-dimensional sequence can be processed by:

– Methods that directly model continuous space

– Quantizing and modelling of discrete symbols

• Main advantages and disadvantages of quantization:

– Reduced storage and computation costs

– Potential loss of information due to quantization

6.345 Automatic Speech Recognition Vector Quantization & Clustering 2

Page 3: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Vector Quantization (VQ)

• Used in signal compression, speech and image coding

• More efficient information transmission than scalar quantization(can achieve less that 1 bit/parameter)

• Used for discrete acoustic modelling since early 1980s

• Based on standard clustering algorithms:

– Individual cluster centroids are called codewords

– Set of cluster centroids is called a codebook

– Basic VQ is K-means clustering

– Binary VQ is a form of top-down clustering(used for efficient quantization)

6.345 Automatic Speech Recognition Vector Quantization & Clustering 3

Page 4: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

VQ & Clustering

• Clustering is an example of unsupervised learning

– Number and form of classes {Ci} unknown

– Available data samples {xi} are unlabeled

– Useful for discovery of data structure before classification ortuning or adaptation of classifiers

• Results strongly depend on the clustering algorithm

6.345 Automatic Speech Recognition Vector Quantization & Clustering 4

Page 5: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Acoustic Modelling Example

[I]

[n]

-

[t]

6.345 Automatic Speech Recognition Vector Quantization & Clustering 5

Page 6: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Clustering Issues

• What defines a cluster?

– Is there a prototype representing each cluster?

• What defines membership in a cluster?

– What is the distance metric, d(x, y)?

• How many clusters are there?

– Is the number of clusters picked before clustering?

• How well do the clusters represent unseen data?

– How is a new data point assigned to a cluster?

6.345 Automatic Speech Recognition Vector Quantization & Clustering 6

Page 7: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

K-Means Clustering

• Used to group data into K clusters, {C1, . . . , CK}• Each cluster is represented by mean of assigned data

• Iterative algorithm converges to a local optimum:

– Select K initial cluster means, {µµµ1, . . . ,µµµK}– Iterate until stopping criterion is satisfied:

1. Assign each data sample to the closest cluster

xεCi, d(x,µµµi) ≤ d(x,µµµj), ∀i 6= j

2. Update K means from assigned samples

µµµi = E(x), xεCi, 1 ≤ i ≤ K

• Nearest neighbor quantizer used for unseen data

6.345 Automatic Speech Recognition Vector Quantization & Clustering 7

Page 8: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

K-Means Example: K = 3

• Random selection of 3 data samples for initial means

• Euclidean distance metric between means and samples

0 3

1 4

2 5

6.345 Automatic Speech Recognition Vector Quantization & Clustering 8

Page 9: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

K-Means Properties

• Usually used with a Euclidean distance metric

d(x,µµµi) = ‖x − µµµi‖2 = (x − µµµi)t(x − µµµi)

• The total distortion, D, is the sum of squared error

D =K∑

i=1

∑xεCi

‖x − µµµi‖2

• D decreases between nth and n + 1st iteration

D(n + 1) ≤ D(n)

• Also known as Isodata, or generalized Lloyd algorithm

• Similarities with Expectation-Maximization (EM) algorithm forlearning parameters from unlabeled data

6.345 Automatic Speech Recognition Vector Quantization & Clustering 9

Page 10: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

K-Means Clustering: Initialization

• K-means converges to a local optimum

– Global optimum is not guaranteed

– Initial choices can influence final result

K = 3

• Initial K-means can be chosen randomly

– Clustering can be repeated multiple times

• Hierarchical strategies often used to seed clusters

– Top-down (divisive) (e.g., binary VQ)

– Bottom-up (agglomerative)

6.345 Automatic Speech Recognition Vector Quantization & Clustering 10

Page 11: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

K-Means Clustering: Stopping Criterion

Many criterion can be used to terminate K-means :

• No changes in sample assignments

• Maximum number of iterations exceeded

• Change in total distortion, D, falls below a threshold

1 − D(n + 1)D(n)

< T

6.345 Automatic Speech Recognition Vector Quantization & Clustering 11

Page 12: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Acoustic Clustering Example

• 12 clusters, seeded with agglomerative clustering

• Spectral representation based on auditory-model

6.345 Automatic Speech Recognition Vector Quantization & Clustering 12

Page 13: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Clustering Issues: Number of Clusters

• In general, the number of clusters is unknown

K = 2

K = 4

• Dependent on clustering criterion, space, computation ordistortion requirements, or on recognition metric

6.345 Automatic Speech Recognition Vector Quantization & Clustering 14

Page 14: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Clustering Issues: Clustering Criterion

The criterion used to partition data into clusters plays a strong rolein determining the final results

6.345 Automatic Speech Recognition Vector Quantization & Clustering 15

Page 15: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Clustering Issues: Distance Metrics

• A distance metric usually has the properties:

1. 0 ≤ d(x, y) ≤ ∞2. d(x, y) = 0 iff x = y

3. d(x, y) = d(y, x)

4. d(x, y) ≤ d(x, z) + d(y, z)

5. d(x + z, y + z) = d(x, y) (invariant)

• In practice, distance metrics may not obey some of theseproperties but are a measure of dissimilarity

6.345 Automatic Speech Recognition Vector Quantization & Clustering 16

Page 16: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Clustering Issues: Distance Metrics

Distance metrics strongly influence cluster shapes:

• Normalized dot-product:xty

‖x‖‖y‖• Euclidean: ‖x − µµµi‖2 = (x − µµµi)t(x − µµµi)

• Weighted Euclidean: (x − µµµi)tW (x − µµµi) (e.g., W = ΣΣΣ−1)

• Minimum distance (chain): min d(x, xi), xiεCi

• Representation specific . . .

6.345 Automatic Speech Recognition Vector Quantization & Clustering 17

Page 17: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Clustering Issues: Impact of Scaling

Scaling feature vector dimensions can significantly impactclustering results

Scaling can be used to normalize dimensions so a simple distancemetric is a reasonable criterion for similarity

6.345 Automatic Speech Recognition Vector Quantization & Clustering 18

Page 18: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Clustering Issues: Training and Test Data

• Training data performance can be arbitrarily good e.g.,

limK→∞

DK = 0

• Independent test data needed to measure performance

– Performance can be measured by distortion, D, or some morerelevant speech recognition metric

– Robust training will degrade minimally during testing

– Good training data closely matches test conditions

• Development data are often used for refinements, since throughiterative testing they can implicitly become a form of training data

6.345 Automatic Speech Recognition Vector Quantization & Clustering 19

Page 19: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Alternative Evaluation Criterion:LPC VQ Example

Autumn

kHz kHz

Wide Band Spectrogram

kHz kHz

0

1

2

3

4

0

1

2

3

4

Time (seconds)0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

Autumn LPC

kHz kHz

Wide Band Spectrogram

kHz kHz

0

1

2

3

4

0

1

2

3

4

Time (seconds)0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

(codebook size = 1024)

6.345 Automatic Speech Recognition Vector Quantization & Clustering 21

Page 20: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Hierarchical Clustering

• Clusters data into a hierarchical class structure

• Top-down (divisive) or bottom-up (agglomerative)

• Often based on stepwise-optimal, or greedy, formulation

• Hierarchical structure useful for hypothesizing classes

• Used to seed clustering algorithms such as K-means

6.345 Automatic Speech Recognition Vector Quantization & Clustering 22

Page 21: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Divisive Clustering

• Creates hierarchy by successively splitting clusters into smallergroups

• On each iteration, one or more of the existing clusters are splitapart to form new clusters

• The process repeats until a stopping criterion is met

• Divisive techniques can incorporate pruning and mergingheuristics which can improve the final result

6.345 Automatic Speech Recognition Vector Quantization & Clustering 23

Page 22: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Example of Non-Uniform Divisive Clustering

0

1

2

6.345 Automatic Speech Recognition Vector Quantization & Clustering 24

Page 23: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Example of Uniform Divisive Clustering

0

1

2

6.345 Automatic Speech Recognition Vector Quantization & Clustering 25

Page 24: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Divisive Clustering Issues

• Initialization of new clusters

– Random selection from cluster samples

– Selection of member samples far from center

– Perturb dimension of maximum variance

– Perturb all dimensions slightly

• Uniform or non-uniform tree structures

• Cluster pruning (due to poor expansion)

• Cluster assignment (distance metric)

• Stopping criterion

– Rate of distortion decrease

– Cannot increase cluster size

6.345 Automatic Speech Recognition Vector Quantization & Clustering 26

Page 25: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Divisive Clustering Example: Binary VQ

• Often used to create M = 2B size codebook(B bit codebook, codebook size M)

• Uniform binary divisive clustering used

• On each iteration each cluster is divided in two

µµµ+i = µµµi(1 + ε)

µµµ−i = µµµi(1 − ε)

• K-means used to determine cluster centroids

• Also known as LBG (Linde, Buzo, Gray) algorithm

• A more efficient version does K-means only within each binarysplit, and retains tree for efficient lookup

6.345 Automatic Speech Recognition Vector Quantization & Clustering 27

Page 26: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Agglomerative Clustering

• Structures N samples or seed clusters into a hierarchy

• On each iteration, the two most similar clusters are mergedtogether to form a new cluster

• After N − 1 iterations, the hierarchy is complete

• Structure displayed in the form of a dendrogram

• By keeping track of the similarity score when new clusters arecreated, the dendrogram can often yield insights into the naturalgrouping of the data

6.345 Automatic Speech Recognition Vector Quantization & Clustering 28

Page 27: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Dendrogram Example (One Dimension)D

ista

nce

��������

����������������

��������

����������������

6.345 Automatic Speech Recognition Vector Quantization & Clustering 29

Page 28: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Agglomerative Clustering Issues

• Measuring distances between clusters Ci and Cj with respectivenumber of tokens ni and nj

– Average distance:1

ninj

∑i,j

d(xi , xj)

– Maximum distance (compact): maxi,j

d(xi, xj)

– Minimum distance (chain): mini,j

d(xi , xj)

– Distance between two representative vectors of each clustersuch as their means: d(µµµi,µµµj)

6.345 Automatic Speech Recognition Vector Quantization & Clustering 30

Page 29: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Stepwise-Optimal Clustering

• Common to minimize increase in total distortion on each mergingiteration: stepwise-optimal or greedy

• On each iteration, merge the two clusters which produce thesmallest increase in distortion

• Distance metric for minimizing distortion, D, is:√

ninj

ni + nj‖µµµi − µµµj‖

• Tends to combine small clusters with large clusters beforemerging clusters of similar sizes

6.345 Automatic Speech Recognition Vector Quantization & Clustering 31

Page 30: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Clustering for Segmentation

6.345 Automatic Speech Recognition Vector Quantization & Clustering 32

Page 31: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Speaker Clustering

• 23 female and 53 male speakers from TIMIT corpus

• Vector based on F1 and F2 averages for 9 vowels

• Distance d(Ci, Cj) is average of distances between members

24

6

6.345 Automatic Speech Recognition Vector Quantization & Clustering 33

Page 32: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Velar Stop Allophones

6.345 Automatic Speech Recognition Vector Quantization & Clustering 34

Page 33: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Velar Stop Allophones (con’t)

kHz

Wide Band Spectrogram

kHz

0

1

2

3

4

5

6

7

8

0

1

2

3

4

5

6

7

8

Time (seconds)0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

Waveform

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

kHz

Wide Band Spectrogram

kHz

0

1

2

3

4

5

6

7

8

0

1

2

3

4

5

6

7

8

Time (seconds)0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

Waveform

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

kHz kHz

Wide Band Spectrogram

kHz kHz

0

1

2

3

4

5

6

7

8

0

1

2

3

4

5

6

7

8

Time (seconds)0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

Waveform

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

keep cot quick

6.345 Automatic Speech Recognition Vector Quantization & Clustering 35

Page 34: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Acoustic-Phonetic Hierarchy

Clustering of phonetic distributions across 12 clusters

6.345 Automatic Speech Recognition Vector Quantization & Clustering 36

Page 35: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

Word Clustering

A

A_M

AF

TE

RN

OO

N

AIR

CR

AF

TA

IRP

LAN

E

AM

ER

ICA

N

AN

AN

Y

AT

LAN

TA

AU

GU

ST A

VA

ILA

BLE

BA

LTIM

OR

E

BE

BO

OK

BO

ST

ON

CH

EA

PE

ST

CIT

Y

CLA

SS

CO

AC

H

CO

NT

INE

NT

AL

CO

ST

DA

LLA

S

DA

LLA

S_F

OR

T_W

OR

TH

DA

Y

DE

LTA

DE

NV

ER

DO

LLA

RS

DO

WN

TO

WN

EA

RLI

ES

T

EA

ST

ER

N

EC

ON

OM

Y

EIG

HT

EIG

HT

Y

EV

EN

ING

FA

RE

FA

RE

S

FIF

TE

EN

FIF

TY

FIN

D

FIR

ST

_CLA

SS

FIV

EFLY

FO

RT

Y

FO

UR

FR

IDA

Y

GE

T

GIV

E

GO

GR

OU

ND

HU

ND

RE

D

INF

OR

MA

TIO

N

IT

JULY

KIN

D

KN

OW

LAT

ES

TLEA

ST

LOW

ES

T

MA

KE

MA

Y

ME

AL

ME

ALS

MO

ND

AY

MO

RN

ING

MO

ST

NE

ED

NIN

E

NIN

ET

Y

NO

NS

TO

P

NO

VE

MB

ER

O+

CLO

CK

OA

KLA

ND

OH

ON

E

ON

E_W

AY

P_M

PH

ILA

DE

LPH

IA

PIT

TS

BU

RG

H

PLA

NE

RE

TU

RN

RO

UN

D_T

RIP

SA

N_F

RA

NC

ISC

O

SA

TU

RD

AY

SC

EN

AR

IO

SE

RV

ED

SE

RV

ICE

SE

VE

N

SE

VE

NT

Y

SIX

SIX

TY

ST

OP

OV

ER

SU

ND

AY

TA

KE

TE

LL

TH

ER

E

TH

ES

E

TH

IRT

Y

TH

IS

TH

OS

E

TH

RE

E

TH

UR

SD

AY

TIC

KE

T

TIM

E

TIM

ES

TR

AN

SP

OR

TA

TIO

N

TR

AV

EL

TU

ES

DA

Y

TW

EN

TY

TW

O

TY

PE U

_S_A

IR

UN

ITE

D

US

ED

WA

NT

WA

SH

ING

TO

N

WE

DN

ES

DA

Y

WIL

LW

OU

LD

ZE

RO

NIL

02

46

6.345 Automatic Speech Recognition Vector Quantization & Clustering 37

Page 36: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

VQ Applications

• Usually used to reduce computation

• Can be used alone for classification

• Used in dynamic time warping (DTW) and discrete hidden Markovmodels (HMMs)

• Multiple codebooks are used when spaces are statisticallyindependent (product codebooks)

• Matrix codebooks are sometimes used to capture correlationbetween succesive frames

• Used for semi-parametric density estimation(e.g., semi-continuous mixtures)

6.345 Automatic Speech Recognition Vector Quantization & Clustering 38

Page 37: Hierarchical clustering Introduction - MIT OpenCourseWare · PDF fileVector Quantization and Clustering Introduction K ... Hierarchical clustering ... 6.345 Automatic Speech Recognition

References

• Huang, Acero, and Hon, Spoken Language Processing,Prentice-Hall, 2001.

• Duda, Hart and Stork, Pattern Classification, John Wiley & Sons,2001.

• A. Gersho and R. Gray, Vector Quantization and SignalCompression, Kluwer Academic Press, 1992.

• R. Gray, Vector Quantization, IEEE ASSP Magazine, 1(2), 1984.

• B. Juang, D. Wang, A. Gray, Distortion Performance of VectorQuantization for LPC Voice Coding, IEEE Trans ASSP, 30(2), 1982.

• J. Makhoul, S. Roucos, H. Gish, Vector Quantization in SpeechCoding, Proc. IEEE, 73(11), 1985.

• L. Rabiner and B. Juang, Fundamentals of Speech Recognition,Prentice-Hall, 1993.

6.345 Automatic Speech Recognition Vector Quantization & Clustering 39