DT2118 Speech and Speaker Recognition - Introduction · DT2118 Speech and Speaker Recognition...

DT2118

Speech and Speaker RecognitionIntroduction

Giampiero Salvi

KTH/CSC/TMH [email protected]

VT 2012

1 / 55

Contact Info

Giampiero Salvi ([email protected])

Mats Blomberg([email protected])

Course mailing list: dt2118 [email protected]

4 / 55

Course Objectives

after the course you should be able to:

I implement simple training and evaluation methods for HiddenMarkov Models

I train and evaluate a speech recogniser using the HTKsoftware package

I compare different feature extraction and training methods

I document and discuss specific aspects related to speech andspeaker recognition

I with the help of the literature, review and criticise otherstudents’ work in the subject

5 / 55

Schedule

GS = Giampiero Salvi MB = Mats Blomberg

Part 1 Introduction, Speech Signal, Features,Statistics (ca 4 hours, GS)

Part 2 Hidden Markov Models, Training and Decoding,HTK tutorial (ca 4 hours, MB, GS)

Part 3 Decoding and Search Algorithms(ca 2 hours, MB)

Part 4 Language Models (Grammars)(ca 2 hours, GS)

Part 5 Noise robustness and Speaker Recognition (ca 2-4hours, MB)

6 / 55

Notes

Notes

Notes

Notes

http://www.kth.se

http://www.csc.kth.se

http://www.speech.kth.se

mailto:[email protected]




LiteratureI Spoken Language Processing: A Guide to Theory,

Algorithm, and System DevelopmentXuedong Huang, Alex Acero, Hsiao-Wuen Hon, PrenticeHall

I 3 (2) at KTH library,I 9 (9) at TMH library (against 300 SEK deposit)

I A Tutorial on Text-Independent SpeakerVerificationBimbot et al. EURASIP J. on Applied Signal Processing2004:4, 430-451

I An overview of text-independent speakerrecognition: From features to supervectorsKinnunen and Li, Speech Communication 52 (2010) 12-40

I HTK manual version 3.4

I . . .7 / 55

Reading Instructions (course book)

pages # pagesPart 1 (Spoken Language Structure) (19–71) (52)

Probability, Statistics and Inform. Theory 73–131 59Pattern Recognition 133–197 65Speech Signal Representations 275–336 62

Part 2 Hidden Markov Models 377–413 37Acoustic Modeling 415–475 61Environmental Robustness 477–544 68HTK tutorial (HTK book)

Part 3 Basic Search Algorithms 591–643 53(Large-Vocabulary Search Algorithms) (645–685) (41)(Applications and User Interfaces) (919–956) (38)

Part 4 Language Modeling 545–590 46Part 5 Speaker Recognition literature

(Optional chapters in parentheses)

8 / 55

Requirements/Activities

Grades: Pass/Fail In order to pass you have to:

1. attend most of the lectures

2. (read the literature)

3. solve computational exercises and hand in solutions

4. carry out ASR lab and hand in lab report

5. write term paper or carry out mini-project in groupsand present results at final seminar

6. act as reviewer and opponent for two course participants’paper/report at final seminar

9 / 55

Computational Exercises

I solve problems

I implement algorithms (write code preferably in Matlab)

I Topics:

1. Vector Quantisation2. Decision Trees3. Gaussian Mixtures4. Viterbi decoding and Forward Probabilities5. Baum-Welch algorithm

I to be performed individually

I hand in both solutions and code

I depending on time-constraints I will present the results orsit with each of you singularly and discuss them.

10 / 55

Notes

Notes

Notes

Notes

ASR Lab

I record a small database of spoken digits

I use HTK to build a simple digit recogniser

I test the recogniser in different conditions

I hand in report and lab files

11 / 55

Term Paper/Project

I Suggest a title or choose a topic from a list

I Term Paper: around 6 pages (max 10)

I Suggested topics:

Own work and experiments after discussion with the teacherLimitations in standard HMM and a survey of alternativesPronunciation variation and its importance for speech recogni-tionLanguage models for speech recognitionNew search methodsTechniques for robust recognition of speechConfidence measures in speech recognitionThe role of prosody for speech recognitionSpeaker variability and methods for adaptation

12 / 55

Important dates1. Fri 13 April: decide project/title of term paper

2. Fri 20 April: hand-in exercise solutions

3. Tue 24 April: discuss exercise solutions

4. Fri 4 May: hand-in lab report

5. Wed 16 May: hand-in term paper (draft). Needed for thepeer review.

6. Fri 25 May: Final seminar: present project/term paperresults, with opposition

7. Thu 31 May: Final report

13 14 15 16 17 18 19 20 21 22

lect

.1

lect

.2

lect

.3/

HT

Ktu

tori

al

lect

.4

lect

.5

lect

.6

lect

.7

lect

.8

dead

line

sele

ctpr

ojec

t

dead

line

ex.

solu

tion

s

disc

uss

solu

tion

s

dead

line

lab

rep

ort

dead

line

proj

ect

draf

t

fina

lse

min

ar

fina

lre

por

t

13 / 55

Part 1

14 / 55

Notes

Notes

Notes

Notes

The dream of Artificial Intelligence

2001: A space odyssey (1968)

16 / 55

Today’s Reality

I Now Pronounce You Chuck & Larry (2007)

17 / 55

The ASR Goal (for this course)

Convert speech into text

AutomaticSpeech

Recognition“My name is . . . ”

CC Please tell me your name

LV Larry Valentine

CC I’m sorry, I didn’t quite get that

LV Larry Valentine

CC You said “Berry Schmallenpine”. . . is that right?

LV Schmallenpine?!?!

CC You said “Schmallenpine”. . . is that right?

18 / 55

The Speech Chain

musclesVocal

Feedbacklink

Sensorynerves

Motornerves

Sensorynerves

levelLinguistic

levelPhysiological

levelAcoustic

levelPhysiological

levelLinguistic

EarBrain

BrainSound waves

Ear

SPEAKER LISTENER

Peter Denes, Elliot Pinson, 1963

19 / 55

Notes

Notes

Notes

Notes

The Speech Chain (functional)

20 / 55

The Speech Chain (Limitations)

Not covered in this course:

I multimodality

I interaction (bi-directional)

I incrementality

I non-verbal communication

musclesVocal

Feedbacklink

Sensorynerves

Motornerves

Sensorynerves

levelLinguistic

levelPhysiological

levelAcoustic

levelPhysiological

levelLinguistic

EarBrain

BrainSound waves

Ear

SPEAKER LISTENER

21 / 55

Speech Examples

TIMIT database (American English)

example of “clean” speech

22 / 55

Speech: Variation

I speaker variability (dialect, . . . )

I environmental noise

I speaking style

I degree of spontaneity (read speech vs dialogue)

I size of vocabulary

23 / 55

Notes

Notes

Notes

Notes

http://www.itl.nist.gov/iad/mig/publications/ASRhistory/

24 / 55

Physiology

Esophagus

Larynx

Soft palate(velum)

Hard palate

Lung

Trachea

Jaw

Oral cavity

Teeth

TongueLip

Nostril

Nasal cavity

Diaphragm

cavityPharyngeal

��

Trachea

Muscle Force and Relaxation

Lungs

FoldsVocal

Glottis

PharyngealCavity Cavity

Cavity

Oral

Nasal

Velum

NoseOutput

MouthOutput

26 / 55

Source/Filter Model, Vowel-like sounds

Vowels

Esophagus

Larynx

Soft palate(velum)

Hard palate

Lung

Trachea

Jaw

Oral cavity

Teeth

TongueLip

Nostril

Nasal cavity

Diaphragm

cavityPharyngeal

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

� Source (periodic)� Front Cavity� Back Cavity� Back Cavity (2nd approx.)

27 / 55

Glottal Flow

��

Trachea


Lungs

FoldsVocal

Glottis


Cavity

Oral

Nasal

Velum

NoseOutput

MouthOutput

0 5 10 15

glo

tta

l flo

w

Liljencrants−Fant glottal model

0 5 10 15

de

riva

tive

time (msec)

G (z) =1

(1− βz)2, β < 1

28 / 55

Notes

Notes

Notes

Notes

http://www.itl.nist.gov/iad/mig/publications/ASRhistory/

Radiation form the Lips/Nose

��

Trachea


Lungs

FoldsVocal

Glottis


Cavity

Oral

Nasal

Velum

NoseOutput

MouthOutput

Problem of radiation at the lipsplus diffraction about the head toocomplicated. Approx. with a pis-ton in a rigid sphere: solved butnot in closed form

2nd approx: piston in an infinitewall

R(z) ≈ 1− αz−1

29 / 55

Tube Model of the Vocal Tract


Cavity

Oral

Nasal

Velum

NoseOutput

MouthOutput

30 / 55

Tube Model (cntd.)

k k+1

0 1 2 3 4 5 6 7 8

all−pole transfer function

freqency (kHz)

I assume planar wave propagation and lossless tubes

I solve pressure p(x , t) and velocity u(x , t) in each tubeaccording to wave equation

I impose continuity of pressure and velocity at the junctions

⇒ all-pole transfer function (N = number of tubes)

V (z) =Az−N/2

1−∑N

k=1 akz−k

31 / 55

Source/Filter Model: vowel-like sounds

0 5 10 15

waveform

0 2 4 6 8

spectrum (log)

0 5 10 15 0 2 4 6 8

0 5 10 15 0 2 4 6 8

0 5 10 15

time (msec)

0 2 4 6 8

freqency (kHz)

← p[n]

← p[n] ∗ g [n]

← p[n] ∗ g [n] ∗ r [n]

← p[n]∗g [n]∗r [n]∗v [n]

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

32 / 55

Notes

Notes

Notes

Notes

F0 and Formants

I Varying F0 (vocal fold oscillation rate)

0 2 4 6 8

spectrum (log) f0 = 100Hz

freqency (kHz)

0 2 4 6 8


freqency (kHz)

I Varying Formants (vocal tract shape)

0 2 4 6 8

spectrum (log) vowel [ε]

freqency (kHz)

0 2 4 6 8

spectrum (log) vowel [u]

freqency (kHz)

33 / 55

Source/Filter Model, General Case

Vowels

Esophagus

Larynx

Soft palate(velum)

Hard palate

Lung

Trachea

Jaw

Oral cavity

Teeth

TongueLip

Nostril

Nasal cavity

Diaphragm

cavityPharyngeal

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��


34 / 55


Fricatives (e.g. sh) or Plosive (e.g. k)

Esophagus

Larynx

Soft palate(velum)

Hard palate

Lung

Trachea

Jaw

Oral cavity

Teeth

TongueLip

Nostril

Nasal cavity

Diaphragm

cavityPharyngeal

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

� Source (noise or im-pulsive)� Front Cavity� Back Cavity� Back Cavity (2nd approx.)

34 / 55


Fricatives (e.g. s) or Plosive (e.g. t)

Esophagus

Larynx

Soft palate(velum)

Hard palate

Lung

Trachea

Jaw

Oral cavity

Teeth

TongueLip

Nostril

Nasal cavity

Diaphragm

cavityPharyngeal

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

� Source (noise or impul-sive)� Front Cavity� Back Cavity� Back Cavity (2nd approx.)

34 / 55

Notes

Notes

Notes

Notes


Nasalised Vowels

Esophagus

Larynx

Soft palate(velum)

Hard palate

Lung

Trachea

Jaw

Oral cavity

Teeth

TongueLip

Nostril

Nasal cavity

Diaphragm

cavityPharyngeal

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��


34 / 55

Source/Filter Model: fricative sounds

0 5 10 15

waveform

0 2 4 6 8

spectrum (log)

0 5 10 15 0 2 4 6 8

0 5 10 15

time (msec)

0 2 4 6 8

freqency (kHz)

← p[n]

← p[n] ∗ r [n]

← p[n] ∗ r [n] ∗ v [n]

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

35 / 55

Complete Source/Filter Model

Ap

A f

AvωG( )

ωV( ) ωR( )fricative

plosive

Source

voiced

Filter

36 / 55

IPA Chart: Consonants

37 / 55

Notes

Notes

Notes

Notes

IPA Chart: Vowels

38 / 55

Phonology vs Phonetics

Phonemes

co-articulation, speakingstyle, dialogue, reduction,

assimilation, speakerdifferences, environment

(loudness, channel,room acoustics, noise)

Phones

Words

Sounds

39 / 55

Quasi-Stationarity

I speech is time varying

I short segment are quasi-stationary

I use short time analysis

41 / 55

Short-Time Fourier Analysis

7500 8000 8500 9000 9500 10000−0.1

−0.05

0

0.05

0.1

am

plit

ude

samples

0 1000 2000 3000 4000 5000 6000 7000 8000−15

−10

−5

0

5

pow

er

spectr

um

(dB

)

frequency (Hz)

42 / 55

Notes

Notes

Notes

Notes


7500 8000 8500 9000 9500 10000−0.1

−0.05

0

0.05

0.1

am

plit

ude

samples

0 1000 2000 3000 4000 5000 6000 7000 8000−16

−14

−12

−10

−8

−6

−4

−2

pow

er

spectr

um

(dB

)

frequency (Hz)

42 / 55


42 / 55

Acoustic Features

I capture relevant information

I disregard irrelevant information

I computationally efficient

43 / 55

F0 and Formants

I Varying F0 (vocal fold oscillation rate)

0 2 4 6 8


freqency (kHz)

0 2 4 6 8


freqency (kHz)

I Varying Formants (vocal tract shape)

0 2 4 6 8

spectrum (log) vowel [ε]

freqency (kHz)

0 2 4 6 8

spectrum (log) vowel [u]

freqency (kHz)

44 / 55

Notes

Notes

Notes

Notes

Linear Prediction Coefficients (LPC)

I assume all-pole model:

H(z) =S(z)

Ug (z)= AG (z)V (z)R(z) ,

A

1−∑p

k=1 akz−k

I the output signal s[n] can be expressed as the sum of theinput ug [n] and a number of previous samples aks[n − k]:

s[n] =

p∑k=1

aks[n − k] + Aug [n]

I Better match at spectral peaks than at valleys

45 / 55

LPC Example

s[n] =

p∑k=1

aks[n − k] + Aug [n]

46 / 55

Perceptual Linear Prediction

I Transform to the Bark frequency scale before computingthe LPC coefficients

I Cubic root of energy instead of logarithm

0 2 4 6 8 10 12 14 160

5

10

15

20

25

frequency (kHz)

Bark

scale

47 / 55

LPC Limitations

I better match at spectral peaks than at valleys

I not accurate if transfer function contain zeros (nasals,fricatives. . . )

48 / 55

Notes

Notes

Notes

Notes

Mel Frequency Cepstral Coefficients

0 2 4 6 8

freqency (kHz) Linear to Mel frequency

0 1000 2000 3000

freqency (mel) log() + Filterbank (∼ 20-25 filters)

0 1000 2000 3000

freqency (mel) Discrete Cosine Transform

0 5 10 15 20

quefrency

49 / 55

MFCC Rationale

I signals combined in a convolutive way: a[n] ∗ b[n] ∗ c[n]

I in the spectral domain: A(z)B(z)C (z)

I taking the log: log(A(z)) + log(B(z)) + log(C (z))

I to analise the different contribution perfor Fouriertransform (DCT if not interested in phase information).

50 / 55

MFCC Advantages

I fairly uncorrelated coefficients (simpler statistical models)

I high phonetic discrimination (empirically shown)

I do not assume all-pole model

I low number of coeff. enough to capture coarse structureof spectrum

Bogert, Healy & Tukey: “The Quefrency Alanysis of Time Series for

Echoes: Cepstrum, Pseudo-autocovariance, Cross-Cepstrum and Saphe

Cracking” Proc. Symp. Time Series Analysis, J. Wiley & Sons, 1963

51 / 55

MFCC Examples

52 / 55

Notes

Notes

Notes

Notes

Segment-Based Processing

File: sx352.WAV Page: 1 of 1 Printed: Mon Dec 05 09:01:39

1

2

3

4

5

6

7

kHz

0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

time

ttclnixaymaxttclniydxraokkclixh#

mytoaccording

53 / 55

Landmark-Based Processing


1

2

3

4

5

6

7

kHz

0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

time


mytoaccording

54 / 55

Frame-Based Processing


1

2

3

4

5

6

7

kHz

0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

time


mytoaccording

55 / 55

Notes

Notes

Notes

Notes

DT2118 Speech and Speaker Recognition - Introduction · DT2118 Speech and Speaker Recognition...

Documents

Transcript of DT2118 Speech and Speaker Recognition - Introduction · DT2118 Speech and Speaker Recognition...