MIT Lincoln Laboratory Nuance Communications Automatic Speaker Recognition Recent Progress, Current...

31
MIT Lincoln Laboratory Nuance Communications Automatic Speaker Recognition Recent Progress, Current Applications, and Future Trends Douglas A. Reynolds, PhD Senior Member of Technical Staff M.I.T. Lincoln Laboratory Larry P. Heck, PhD Speaker Verification R&D Nuance Communications

Transcript of MIT Lincoln Laboratory Nuance Communications Automatic Speaker Recognition Recent Progress, Current...

MIT Lincoln Laboratory Nuance Communications

Automatic Speaker RecognitionRecent Progress, Current Applications,

and Future Trends

Douglas A. Reynolds, PhDSenior Member of Technical

StaffM.I.T. Lincoln Laboratory

Larry P. Heck, PhDSpeaker Verification R&DNuance Communications

MIT Lincoln Laboratory Nuance Communications

Outline

• Introduction and applications

• General theory

• Performance

• Conclusion and future directions

MIT Lincoln Laboratory Nuance Communications

Extracting Information from Speech

SpeechRecognition

LanguageRecognition

SpeakerRecognition

Words

Language Name

Speaker Name

“How are you?”

English

James Wilson

Speech Signal

Goal: Automatically extract information transmitted in speech signal

MIT Lincoln Laboratory Nuance Communications

IntroductionIdentification

• Determines who is talking from set of known voices

• No identity claim from user (many to one mapping)

• Often assumed that unknown voice must come from set of known speakers - referred to as closed-set identification

?

?

?

?

Whose voice is this?

MIT Lincoln Laboratory Nuance Communications

IntroductionVerification/Authentication/Detection

• Determine whether person is who they claim to be

• User makes identity claim: one to one mapping

• Unknown voice could come from large set of unknown speakers - referred to as open-set verification

• Adding “none-of-the-above” option to closed-set identification gives open-set identification

?

Is this Bob’s voice?

MIT Lincoln Laboratory Nuance Communications

IntroductionSpeech Modalities

• Text-dependent recognition

– Recognition system knows text spoken by person

– Examples: fixed phrase, prompted phrase

– Used for applications with strong control over user input

– Knowledge of spoken text can improve system performance

Application dictates different speech modalities:

• Text-independent recognition

– Recognition system does not know text spoken by person

– Examples: User selected phrase, conversational speech

– Used for applications with less control over user input

– More flexible system but also more difficult problem

– Speech recognition can provide knowledge of spoken text

MIT Lincoln Laboratory Nuance Communications

IntroductionVoice as a Biometric

Strongest security

• Biometric: a human generated signal or attribute for authenticating a person’s identity

• Voice is a popular biometric:

– natural signal to produce

– does not require a specialized input device

– ubiquitous: telephones and microphone equipped PC

• Voice biometric with other forms of security

– Something you have - e.g., badge

– Something you know - e.g., password

– Something you are - e.g., voice

HaveKnow

Are

MIT Lincoln Laboratory Nuance Communications

IntroductionApplications

• Access control– Physical facilities– Data and data networks

• Transaction authentication– Toll fraud prevention– Telephone credit card purchases– Bank wire transfers

• Monitoring– Remote time and attendance logging– Home parole verification– Prison telephone usage

• Information retrieval– Customer information for call centers– Audio indexing (speech skimming device)

• Forensics– Voice sample matching

MIT Lincoln Laboratory Nuance Communications

Outline

• Introduction and applications

• General theory

• Performance

• Conclusion and future directions

MIT Lincoln Laboratory Nuance Communications

ACCEPT

General TheoryComponents of Speaker Verification System

Feature extraction

Feature extraction

SpeakerModel

SpeakerModel

Bob’s “Voiceprint”

“My Name is Bob”

ACCEPT

Bob

ImpostorModel

ImpostorModel

Identity Claim

DecisionDecision

REJECTInput Speech

Impostor “Voiceprints”

MIT Lincoln Laboratory Nuance Communications

General TheoryPhases of Speaker Verification System

Two distinct phases to any speaker verification system

Feature extraction

Feature extraction

Model training

Model training

Enrollment speech for each speaker

Bob

Sally

Voiceprints (models) for each speaker

Sally

Bob

Enrollment Enrollment PhasePhase

Model training

Model training

Accepted!Feature extraction

Feature extraction

Verificationdecision

Verificationdecision

Claimed identity: Sally

Verification Verification PhasePhase

Verificationdecision

Verificationdecision

MIT Lincoln Laboratory Nuance Communications

General TheoryFeatures for Speaker Recognition

• Humans use several levels of perceptual cues for speaker recognition

Semantics, diction,pronunciations,idiosyncrasies

Socio-economicstatus, education,place of birth

Prosodics, rhythm,speed intonation,volume modulation

Personality type,parental influence

Acoustic aspect ofspeech, nasal,deep, breathy,rough

Anatomical structureof vocal apparatus

Semantics, diction,pronunciations,idiosyncrasies

Socio-economicstatus, education,place of birth

Prosodics, rhythm,speed intonation,volume modulation

Personality type,parental influence

Acoustic aspect ofspeech, nasal,deep, breathy,rough

Anatomical structureof vocal apparatus

High-level cues (learned traits)

Low-level cues (physical traits)

Easy to automatically extract

Difficult to automatically extract

Hierarchy of Perceptual Cues

• There are no exclusive speaker identifiably cues

• Low-level acoustic cues most applicable for automatic systems

MIT Lincoln Laboratory Nuance Communications

General TheoryFeatures for Speaker Recognition

• Desirable attributes of features for an automatic system (Wolf ‘72)

• Occur naturally and frequently in speech

• Easily measurable

• Not change over time or be affected by speaker’s health

• Not be affected by reasonable background noise nor depend on specific transmission characteristics

• Not be subject to mimicry

• Occur naturally and frequently in speech

• Easily measurable

• Not change over time or be affected by speaker’s health

• Not be affected by reasonable background noise nor depend on specific transmission characteristics

• Not be subject to mimicry

Practical

Robust

Secure

• No feature has all these attributes

• Features derived from spectrum of speech have proven to be the most effective in automatic systems

MIT Lincoln Laboratory Nuance Communications

General TheorySpeech Production

• Speech production model: source-filter interaction– Anatomical structure (vocal tract/glottis) conveyed in speech spectrum

Glottal pulses Vocal tract Speech signal

MIT Lincoln Laboratory Nuance Communications

General TheoryFeatures for Speaker Recognition

• Speech is a continuous evolution of the vocal tract – Need to extract time series of spectra– Use a sliding window - 20 ms window, 10 ms shift

...

Fourier Transform

Fourier Transform MagnitudeMagnitude

• Produces time-frequency evolution of the spectrum

MIT Lincoln Laboratory Nuance Communications

General TheorySpeaker Models

SpeakerModel

SpeakerModel

Bob’s “Voiceprint”

Bob

ACCEPT

Feature extraction

Feature extraction

“My Name is Bob”

ACCEPT

ImpostorModel

ImpostorModel

Identity Claim

DecisionDecision

REJECT

Impostor “Voiceprints”

SpeakerModel

SpeakerModel

Bob’s “Voiceprint”

Bob

General Theory Components of Speaker Verification System

MIT Lincoln Laboratory Nuance Communications

General TheorySpeaker Models

• Speaker models (voiceprints) represent voice biometric in compact and generalizable form

h-a-d

• Modern speaker verification systems use Hidden Markov Models (HMMs)

– HMMs are statistical models of how a speaker produces sounds

– HMMs represent underlying statistical variations in the speech state (e.g., phoneme) and temporal changes of speech between the states.

– Fast training algorithms (EM) exist for HMMs with guaranteed convergence properties.

MIT Lincoln Laboratory Nuance Communications

General TheorySpeaker Models

Form of HMM depends on the application

“Open sesame”

Fixed Phrase Word/phrase models

/s/ /i/ /x/

Prompted phrases/passwords Phoneme models

General speech

Text-independent single state HMM

MIT Lincoln Laboratory Nuance Communications

General TheoryVerification Decision

General Theory Components of Speaker Verification System

Bob’s “Voiceprint”

Bob

ACCEPT

Feature extraction

Feature extraction

“My Name is Bob”

Identity Claim

SpeakerModel

SpeakerModel

ImpostorModel

ImpostorModel

DecisionDecision

REJECT

Impostor “Voiceprints”

ACCEPT

SpeakerModel

SpeakerModel

Bob’s “Voiceprint”

Bob

ImpostorModel

ImpostorModel

DecisionDecision

REJECT

Impostor “Voiceprints”

ACCEPT

MIT Lincoln Laboratory Nuance Communications

General TheoryVerification Decision

Verification decision approaches have roots in signal detection theory

• 2-class Hypothesis test: H0: the speaker is an impostor

H1: the speaker is indeed the claimed speaker.

• Statistic computed on test utterance S as likelihood ratio:

Likelihood S came from speaker HMMLikelihood S did not come from speaker HMM

log

reject

Feature extraction

Feature extraction

SpeakerModel

SpeakerModel

ImpostorModel

ImpostorModel

DecisionDecision+

-

accept

MIT Lincoln Laboratory Nuance Communications

Outline

• Introduction and application

• General theory

• Performance

• Conclusion and future directions

MIT Lincoln Laboratory Nuance Communications

Verification PerformanceEvaluating Speaker Verification Systems

• There are many factors to consider in design of an evaluation of a speaker verification system

Speech quality – Channel and microphone characteristics– Noise level and type– Variability between enrollment and verification

speech

Speech modality – Fixed/prompted/user-selected phrases– Free text

Speech duration – Duration and number of sessions of enrollment and verification speech

Speaker population – Size and composition

• Most importantly: The evaluation data and design should match the target application domain of interest

MIT Lincoln Laboratory Nuance Communications

Verification PerformanceEvaluating Speaker Verification Systems

PROBABILITY OF FALSE ACCEPT (in %)

PR

OB

AB

ILIT

Y O

F F

AL

SE

RE

JEC

T (

in %

)

Equal Error Rate (EER) = 1 %

Wire Transfer:

False acceptance is very costly

Users may tolerate rejections for security

Toll Fraud:

False rejections alienate customers

Any fraud rejection is beneficial

Application operating point depends on relative costs of the two error types

High Convenience

High Security

Balance

Example Performance Curve

MIT Lincoln Laboratory Nuance Communications

Verification PerformanceNIST Speaker Verification Evaluations

• Annual NIST evaluations of speaker verification technology (since 1995)

• Aim: Provide a common paradigm for comparing technologies

• Focus: Conversational telephone speech (text-independent)

Evaluation Coordinator

Linguistic Data Consortium

Data Provider

Technology Developers

Comparison of technologies on common task

Evaluate

Improve

MIT Lincoln Laboratory Nuance Communications

Verification PerformanceRange of Performance

Probability of False Accept (in %)

Pro

bab

ilit

y o

f F

alse

Rej

ect

(in

%)

Text-dependent (Combinations)

Clean Data

Single microphone

Large amount of train/test speech

Text-dependent (Combinations)

Clean Data

Single microphone

Large amount of train/test speech

Text-independent (Conversational)

Telephone Data

Multiple microphones

Moderate amount of training data

Text-independent (Conversational)

Telephone Data

Multiple microphones

Moderate amount of training data

Text-dependent (Digit strings)

Telephone Data

Multiple microphones

Small amount of training data

Text-dependent (Digit strings)

Telephone Data

Multiple microphones

Small amount of training data

Text-independent (Read sentences)

Military radio Data

Multiple radios & microphones

Moderate amount of training data

Text-independent (Read sentences)

Military radio Data

Multiple radios & microphones

Moderate amount of training data

Incre

asing constra

ints

MIT Lincoln Laboratory Nuance Communications

Verification PerformanceHuman vs. Machine

• Motivation for comparing human to machine

– Evaluating speech coders and potential forensic applications

• Schmidt-Nielsen and Crystal used NIST evaluation (DSP Journal, January 2000)

– Same amount of training data

– Matched Handset-type tests

– Mismatched Handset-type tests

– Used 3-sec conversational

utterances from telephone speech Mat

ched

Mis

mat

ched

Matched(COMPUTER)

Matched(HUMAN)

Mismatched(COMPUTER)

Mismatched(HUMAN)

ErrorRates

Humans44%

betterHumans

15%worse

MIT Lincoln Laboratory Nuance Communications

Verification PerformanceApplication Deployments

ApplicationApplication• Voice authentication based Voice authentication based

on spoken phone numberon spoken phone number• Provides secure access to Provides secure access to

customer record & credit customer record & credit card informationcard information

ImplementationImplementation• Edify telephony platformEdify telephony platform• Performance @1% EERPerformance @1% EER

BenefitsBenefits• SecuritySecurity• PersonalizationPersonalization

VolumeVolume• 250k customers 250k customers

enrolled currentlyenrolled currently@20K calls/day@20K calls/day

• 5 million customers 5 million customers will enroll by Q2 ‘00 will enroll by Q2 ‘00 @150K calls/day@150K calls/day

MIT Lincoln Laboratory Nuance Communications

Verification PerformanceSpeaker + Knowledge Verification

AuthenticateAuthenticateKnowledgeKnowledge

AuthenticateAuthenticateKnowledgeKnowledge

AcceptAccept

RejectReject

DataDataDataData

AuthenticateAuthenticateVoiceVoice

AuthenticateAuthenticateVoiceVoice

VoiceVoicePrintsPrints

VoiceVoicePrintsPrints

Please enter your account number““5551234”5551234”Say your date of

birth““October 13, October 13,

1964”1964”You’re accepted by the system

BiometricBiometricBiometricBiometric

KnowledgeKnowledge VoiceVoiceover over

TelephoneTelephone

MIT Lincoln Laboratory Nuance Communications

Outline

• Introduction

• General theory

• Performance

• Conclusion and future directions

MIT Lincoln Laboratory Nuance Communications

Conclusions

• Speaker verification is one of the few recognition areas where machines can outperform humans

• Speaker verification technology is a viable technique currently available for applications

• Speaker verification can be augmented with other authentication techniques to add further security

MIT Lincoln Laboratory Nuance Communications

Future Directions

• Research will focus on using speaker verification techniques for more unconstrained, uncontrolled situations

– Audio search and retrieval– Increasing robustness to channel variabilities– Incorporating higher-levels of knowledge into decisions

• Speaker recognition technology will become an integral part of speech interfaces

– Personalization of services and devices– Unobtrusive protection of transactions and information