VIU Oct 2007: Speaker Recognition (F. Schiel, Venice International University)


2002

VIU Oct 2007 : Speaker Recognition 1 F. Schiel

Florian Schiel

Venice International University

Oct 2007

Speaker Recognition = Speaker Identification, Speaker Verification


Agenda

• See the Context

• Speech Recognition vs. Speaker Recognition

• Speaker Identification vs. Speaker Verification

• Speaker Recognition: Basics

• Speaker Verification using HMM

• Discussion

• and then ...


General Approach to Authentication

• Three general ways to perform authentication:
- proof of knowledge (e.g. password),
- proof of possession (e.g. chip card),
- proof of property (biometrics),
and their combinations

• Biometrics: physiologically based vs. behaviourally based

• Biometric features: fingerprint, iris scan, facial scan, hand geometry, signature, voice

from U. Türk 2007


Biometric Features: General Requirements

• universal: can be found in any user
• unique: even for identical twins
• measurable: does not require human evaluation
• robust to short-term and long-term variability
• low dimensionality
• robust to changing environment
• robust to impersonation

from U. Türk 2007

[Rating symbols from the slide: ++++++ooo+]


Taxonomy of Speech Processing

[Diagram: fields of Natural Language Processing (NLP) and Spoken Language Processing (SLP): lexica, syntax, parsing, spellers, search / indexing, semantics, terminology, thesaurus, dialogue systems, speech identification, speech synthesis, speaker recognition, speech recognition, forensics.]


Speech Recognition

"Decode the spoken content from the acoustic signal"

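Speaker Recognition

"Determine the identity of a speaker from the acoustic signal"

[Diagram: ASR uses speech models and outputs the spoken text, e.g. "Sehr geehrter .." ("Dear .."). SI/SV uses speaker characteristics and, for verification, a claimed identity; it outputs an ID (identification) or Accepted/Rejected (verification).]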


Speaker Recognition

Speaker Verification
• Authentication according to claimed identity
• Result is binary: "accept" / "reject"
• Scaling: effort independent of number of participants
• Accuracy: dependent on size of enrolment data

Speaker Identification
• Identification from a limited number of participants
• Result is speaker identity
• Scaling: effort increases linearly with number of participants
• Accuracy: dependent on
+ size of enrolment data
+ number of participants

[Figure: verification outcomes. Identity ok and accept: correct. Identity ok and reject: false reject. Identity wrong and accept: false accept. Identity wrong and reject: correct.]

[Figure: plot with the axes "Correctness" (up to 100) and N.]


Speaker Verification

• Applications:
– Access control
– Verification of identity via the phone
– Automatic teller machines
– Password resetting
– Banking: identity for new accounts etc.
– Protection against theft (cars ...)

Speaker Identification

• Applications:
– Forensics
– Police work
– Automatic user settings
– Speaker classification: advertising


Speaker Verification: Doddington's Zoo (1)

User = registered speaker, Impostor = non-registered speaker

• Goats: users that are often wrongly rejected (increasing 'false reject' errors)

• Lambs: users that are easily imitated (increasing 'false accept' errors)

• Sheep: users that 'behave' (neither goats nor lambs)

• Wolves: particularly successful impostors (increasing 'false accept' errors)

from Doddington 1998


Speaker Verification: Doddington's Zoo (2)

Wolves may perform zero-effort or active impostor attempts to break into an SV system.

Problem: Speaker verification databases do not contain data from active impostor attempts by wolves -> most technical evaluations are based on non-realistic data!


Technical Speech Processing

[Diagram: processing chain. Analog signal -> anti-aliasing filter -> A / D -> digital signal -> high-pass -> feature detection -> vectors (m1 ... mN at positions 10, 20, ...) -> decoder -> symbols.]

Symbols:
• Text
• Action
• Semantics

Examples: "Call Richard!", "Radio off!", "216"


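Speaker Verification: Basics (1)

[Diagram: processing chain. Anti-aliasing filter -> A / D -> high-pass -> feature detection -> verification -> "Accept" / "Reject". The claimed identity (given by PIN, fingerprint or ASR) selects the ID whose model is taken from the speaker models.]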


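Speaker Verification: Basics (2)

Anti-aliasing filter + A / D: analog low-pass filter with its cut-off at half the sampling frequency (f_sam / 2) to avoid aliasing effects, followed by the analog-to-digital converter.

[Diagram: processing chain. Anti-aliasing filter -> A / D -> high-pass -> feature detection -> verification -> "Accept" / "Reject".]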
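Why the low-pass filter has to sit in front of the A / D converter can be illustrated with a small sketch. It assumes a sampling frequency of 8000 Hz and a 5000 Hz tone (both values are illustrative, not from the slides) and shows that, once sampled, the tone is indistinguishable from a 3000 Hz tone.

```python
import math

def alias_frequency(f_hz, f_sam_hz):
    """Frequency at which a tone of f_hz appears after sampling at f_sam_hz."""
    f = f_hz % f_sam_hz           # sampling cannot tell f from f + k * f_sam
    return min(f, f_sam_hz - f)   # fold into the range 0 .. f_sam / 2

def sample_cosine(f_hz, f_sam_hz, n):
    """First n samples of a cosine of frequency f_hz sampled at f_sam_hz."""
    return [math.cos(2 * math.pi * f_hz * k / f_sam_hz) for k in range(n)]

F_SAM = 8000   # assumed sampling frequency in Hz
TONE = 5000    # assumed input tone above F_SAM / 2

alias = alias_frequency(TONE, F_SAM)
original = sample_cosine(TONE, F_SAM, 16)
folded = sample_cosine(alias, F_SAM, 16)
max_diff = max(abs(a - b) for a, b in zip(original, folded))

print(alias, max_diff)
```

The two sample sequences agree up to rounding error, so anything above f_sam / 2 has to be removed in the analog domain; it cannot be separated afterwards.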


Speaker Verification: Basics (3)

Features:
• speaker specific
• robust against noise
• partly long term

[Diagram: extraction of speaker characteristics (feature computation). A 25 ms window is shifted over the signal and yields one feature vector (m1 ... mN) at each of the positions 10, 20, 30, 40, ... Processing chain: anti-aliasing filter -> A / D -> high-pass -> feature detection -> verification -> "Accept" / "Reject".]
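The windowing step can be sketched as follows. The 25 ms window is taken from the slide; the 10 ms shift is an assumption (the slide shows positions 10, 20, 30, 40 without a unit), and the log energy stands in for a real feature vector m1 ... mN.

```python
import math

def frame_signal(samples, sample_rate_hz, window_ms=25, shift_ms=10):
    """Cut the signal into overlapping windows (25 ms long, shifted by 10 ms)."""
    window = int(sample_rate_hz * window_ms / 1000)
    shift = int(sample_rate_hz * shift_ms / 1000)
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += shift
    return frames

def log_energy(frame):
    """Toy one-dimensional feature; a small constant avoids log(0)."""
    return math.log(sum(x * x for x in frame) + 1e-12)

# One second of a 200 Hz sine at an assumed sampling rate of 8000 Hz.
RATE = 8000
signal = [math.sin(2 * math.pi * 200 * k / RATE) for k in range(RATE)]

frames = frame_signal(signal, RATE)
features = [log_energy(f) for f in frames]
print(len(frames), len(frames[0]))
```

Because the shift is shorter than the window, neighbouring frames overlap, which keeps the feature sequence smooth over time.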


Speaker Verification: Basics (4)

The vector sequence S (vectors m1 ... mN at positions 10, 20, ...) is scored against the speaker model of the claimed ID, followed by a decision:

p(S | ID) > threshold -> "Accept"

p(S | ID) < threshold -> "Reject"

[Diagram: processing chain. Anti-aliasing filter -> A / D -> high-pass -> feature detection -> verification -> "Accept" / "Reject".]
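The threshold decision can be sketched in a few lines. Note the swap: instead of an HMM, this sketch scores the vector sequence with a single diagonal Gaussian per speaker and works with the average log of p(S | ID). The model values, the test vectors and the threshold are made up for illustration.

```python
import math

def log_likelihood(sequence, means, variances):
    """Average per-frame log p(vector | model) for a diagonal Gaussian model."""
    total = 0.0
    for vector in sequence:
        for x, mu, var in zip(vector, means, variances):
            total += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return total / len(sequence)

def verify(sequence, model, threshold):
    """Accept if the score of the claimed speaker's model exceeds the threshold."""
    score = log_likelihood(sequence, model["means"], model["variances"])
    return "Accept" if score > threshold else "Reject"

# Hypothetical model of the claimed ID and two test utterances.
claimed_model = {"means": [0.0, 1.0], "variances": [1.0, 1.0]}
genuine = [[0.1, 0.9], [-0.2, 1.1]]
impostor = [[3.0, -2.0], [2.5, -1.5]]
THRESHOLD = -3.0   # assumed value; in practice it is tuned on held-out data

print(verify(genuine, claimed_model, THRESHOLD))
print(verify(impostor, claimed_model, THRESHOLD))
```

Averaging over frames keeps the score comparable between short and long utterances, which a raw product of probabilities would not be.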


Speaker Verification: Tuning

• Error types highly dependent on threshold

high security -> false accept low, false reject high
user friendly -> false reject low, false accept high

[Figure: false accept and false reject rates plotted against the threshold; the point where both curves cross is the Equal Error Rate.]

• Both errors are increased by:
- channel disturbance
- crosstalk
- noise
- room acoustics

• Solution:
- multiple enrolments
- adaptive learning
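The trade-off between the two error types can be made concrete with a small sketch. The score lists are invented for illustration, and the Equal Error Rate is approximated by trying every observed score as a threshold.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false accept rate, false reject rate) for one threshold."""
    false_reject = sum(s <= threshold for s in genuine_scores) / len(genuine_scores)
    false_accept = sum(s > threshold for s in impostor_scores) / len(impostor_scores)
    return false_accept, false_reject

def equal_error_rate(genuine_scores, impostor_scores):
    """Threshold (among the observed scores) where both error rates are closest."""
    best = None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        fa, fr = error_rates(genuine_scores, impostor_scores, threshold)
        gap = abs(fa - fr)
        if best is None or gap < best[0]:
            best = (gap, threshold, (fa + fr) / 2)
    return best[1], best[2]

# Hypothetical verification scores (higher means more similar to the claimed ID).
genuine = [2.1, 1.8, 2.5, 0.9, 1.7]
impostor = [0.2, 0.5, 1.0, -0.3, 0.8]

print(error_rates(genuine, impostor, 2.0))   # high security setting
print(error_rates(genuine, impostor, 0.0))   # user friendly setting
print(equal_error_rate(genuine, impostor))
```

A high threshold drives false accepts to zero at the price of many false rejects; a low threshold does the opposite.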


Speaker Verification: Score Normalisation (1)

Problem: How to set the optimal threshold?

HMMs generate a priori probabilities $p(O \mid \lambda)$:
O : observation = sequence of features
λ : speaker model

Bayes:

$$P(\lambda \mid O) = \frac{p(O \mid \lambda)\, P(\lambda)}{P(O)}$$

but $P(O)$ is dependent on various factors


Speaker Verification: Score Normalisation (2)

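Solution: Bayesian Decision Rule:

$$C_{FR}\, P(\lambda \mid O) > C_{FA}\, P(\bar{\lambda} \mid O)$$

with Bayes,

$$P(\lambda \mid O) = \frac{p(O \mid \lambda)\, P(\lambda)}{P(O)}$$

and log to both sides this leads to:

$$\log p(O \mid \lambda) - \log p(O \mid \bar{\lambda}) > \log \frac{C_{FA}\, P(\bar{\lambda})}{C_{FR}\, P(\lambda)} = \text{threshold}$$

$C_{FR}$, $C_{FA}$ : cost functions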
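The intermediate steps of this derivation can be written out as below. This is a sketch under two assumptions: the rule accepts when the cost-weighted posterior of the claimed speaker model $\lambda$ exceeds that of the competing model, and the competing model is written $\bar{\lambda}$ (a notation reconstructed here, since the overline did not survive on the slide).

```latex
\begin{align*}
C_{FR}\, P(\lambda \mid O)
  &> C_{FA}\, P(\bar{\lambda} \mid O) \\
C_{FR}\, \frac{p(O \mid \lambda)\, P(\lambda)}{P(O)}
  &> C_{FA}\, \frac{p(O \mid \bar{\lambda})\, P(\bar{\lambda})}{P(O)} \\
\frac{p(O \mid \lambda)}{p(O \mid \bar{\lambda})}
  &> \frac{C_{FA}\, P(\bar{\lambda})}{C_{FR}\, P(\lambda)} \\
\log p(O \mid \lambda) - \log p(O \mid \bar{\lambda})
  &> \log \frac{C_{FA}\, P(\bar{\lambda})}{C_{FR}\, P(\lambda)}
\end{align*}
```

$P(O)$ appears on both sides and cancels in the second step, so the quantity that depends on various factors never has to be estimated.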


Speaker Verification: Score Normalisation (3)

Often assumed: costs are equal and speakers occur equally distributed

$$\log p(O \mid \lambda) - \log p(O \mid \bar{\lambda}) > \log (N - 1)$$

N : number of users and impostors

$p(O \mid \bar{\lambda})$ is estimated using a world or cohort model

world model : speaker model trained on all speakers

cohort model : speaker model trained on a group of the most competing models (wolves)
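The normalised decision can be sketched as follows. The log-likelihood values are invented, and estimating the competing model from a cohort as the log of the average cohort likelihood is one common choice assumed here, not something the slides specify.

```python
import math

def cohort_log_likelihood(cohort_logliks):
    """Log of the average likelihood over the cohort models (log-mean-exp)."""
    peak = max(cohort_logliks)   # subtract the maximum for numerical stability
    mean = sum(math.exp(v - peak) for v in cohort_logliks) / len(cohort_logliks)
    return peak + math.log(mean)

def accept(loglik_claimed, loglik_competing, n_speakers):
    """Equal costs and equally likely speakers: threshold is log(N - 1)."""
    return loglik_claimed - loglik_competing > math.log(n_speakers - 1)

# Hypothetical scores of one utterance against three cohort models.
competing = cohort_log_likelihood([-46.0, -47.5, -45.0])
N = 20   # assumed number of users and impostors

print(accept(-40.0, competing, N))
print(accept(-44.0, competing, N))
```

A world model would simply supply `loglik_competing` directly from a single model trained on all speakers.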


Speaker Verification: Enrolment

| Method | Enrolment | Remarks |
|---|---|---|
| Fixed, pre-specified sentence: e.g. "My voice is my password" | Speak sentence 3-5 times | Sentence may be intercepted and played back |
| Fixed, selectable sentence: e.g. maiden name of grandmother | Speak sentence 3-5 times | Additional security by content |
| Changing number triplets: e.g. fifteen, thirty-nine, seventy-three | Speak each number 3-5 times | High security by many possible combinations |
| System generates a new sentence for each verification | Speak each phoneme 3-5 times | Elaborate enrolment, high processing effort, very high security |
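The "many possible combinations" of the number-triplet method can be illustrated with a short sketch. It assumes that the prompt consists of three different two-digit numbers from 10 to 99; the slides only give the example "fifteen, thirty-nine, seventy-three", so the range is an assumption.

```python
import secrets

def number_triplet(low=10, high=99):
    """Draw three different numbers as an unpredictable verification prompt."""
    pool = list(range(low, high + 1))
    picked = []
    for _ in range(3):
        # secrets.randbelow gives a cryptographically strong random index
        picked.append(pool.pop(secrets.randbelow(len(pool))))
    return picked

def count_triplets(low=10, high=99):
    """Number of ordered triplets of different numbers in the range."""
    n = high - low + 1
    return n * (n - 1) * (n - 2)

prompt = number_triplet()
print(prompt, count_triplets())
```

With this many possible prompts, a recording of an earlier session is unlikely to match the next one, which is what protects against simple playback.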


Speaker Verification: HMM types

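| Method | Model |
|---|---|
| pre-specified sentence | linear |
| recombination of segments taken from enrolment data | piecewise linear |
| modeling without time structure | ergodic |

[The "Security" and "Accuracy" columns of the slide's table did not survive extraction; only a single rating "o" remains.]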
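The difference between a linear and an ergodic model shows up in the HMM transition matrix. The sketch below builds both for a small number of states; the probabilities are placeholders (in a real system they are trained), and the piecewise linear case is left out because the slides do not define its structure.

```python
def linear_transitions(n_states, stay=0.5):
    """Left-to-right chain: each state either repeats or moves to the next one."""
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            matrix[i][i] = stay
            matrix[i][i + 1] = 1.0 - stay
        else:
            matrix[i][i] = 1.0   # the last state can only repeat
    return matrix

def ergodic_transitions(n_states):
    """Fully connected model: every state can follow every state."""
    return [[1.0 / n_states] * n_states for _ in range(n_states)]

linear = linear_transitions(3)
ergodic = ergodic_transitions(3)

for row in linear:
    print(row)
for row in ergodic:
    print(row)
```

The zeros in the linear matrix enforce the time order of a fixed sentence, while the ergodic matrix has no zeros and therefore models no time structure.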


Speaker Verification: Features (1)

Variable signal characteristics
• often required: telephone band 300-3300 Hz (higher resonances cut off)
• changing channel characteristics, caused by transmission line, handset, distance to mouth
• static and intermittent noise
• user: health, intoxication, fatigue


Speaker Verification: Features (2)

Candidates determined by physiology:
• fundamental frequency, average
• waveform of vocal folds, shimmer, jitter, irregularities
• formants: average and dynamics
• places of articulation: fricatives, plosives
• nasal cavity resonance
• sub-glottal resonance
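Two of these candidates, average fundamental frequency and jitter, can be computed from a list of pitch periods. The sketch assumes the periods have already been measured (the values are invented) and uses the usual definition of local jitter: mean absolute difference of consecutive periods divided by the mean period.

```python
def mean_f0(periods_s):
    """Average fundamental frequency in Hz, taken as 1 / mean pitch period."""
    return 1.0 / (sum(periods_s) / len(periods_s))

def local_jitter(periods_s):
    """Mean absolute difference of consecutive periods, relative to the mean period."""
    diffs = [abs(b - a) for a, b in zip(periods_s, periods_s[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods_s) / len(periods_s))

# Hypothetical pitch periods in seconds (around 100 Hz).
periods = [0.0100, 0.0102, 0.0098, 0.0100]

print(mean_f0(periods), local_jitter(periods))
```

Shimmer follows the same pattern with cycle peak amplitudes in place of the periods.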


Speaker Verification: Features (3)

Candidates determined by behaviour:
• voiced/unvoiced ratio
• fundamental frequency, dynamics
• syllable rate, pause/speech ratio
• dialectal features: vowel quality

Candidates determined by speech technology:
• Linear Predictor Coefficients (LPC)
• filter bank, Bark filter bank, Mel filter bank
• Cepstrum, Mel-Cepstrum
• (derivatives with respect to time)
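The derivatives with respect to time can be approximated directly on a sequence of feature vectors. The sketch uses central differences and repeats the first and last frame at the edges; both are assumptions of this example (regression over a wider window is a common alternative), and the feature values are invented.

```python
def delta_features(frames):
    """First-order time derivative of each feature, by central differences."""
    n = len(frames)
    deltas = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]       # repeat the first frame at the start
        nxt = frames[min(t + 1, n - 1)]    # repeat the last frame at the end
        deltas.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return deltas

# Three hypothetical two-dimensional feature vectors (e.g. cepstral coefficients).
frames = [[1.0, 0.0], [2.0, 0.5], [4.0, 1.0]]
deltas = delta_features(frames)
print(deltas)
```

The delta vectors are usually appended to the static features, so that each frame also carries information about how the spectrum is changing.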


Speaker Verification: Road Map

Timeline: 1990, today, 2010, 2020

• Access control for security areas
• Authentication via telephone
• Devices "recognise" their user
• Speaker profiles on chip cards
• Access control for keyboardless PDAs
• Authentication in the background
• Public speaker profiles
• Automatic alcohol test in the vehicle


Thank You!