Spoken Arabic Dialect ID - Columbia University, dpwe/e6820/proposals/fadi.pdf

Spoken Arabic Dialect ID

Speech & Audio Processing & Recognition

Fadi Biadsy

March 13, 2008

Background

Modern Standard Arabic (MSA): standard language throughout the Arab world (Literary Arabic)

MSA is the native language of nobody

Colloquial Arabic: collective term for all dialects of Arabic

Dialect groups: Maghrebi, Egyptian, Sudanese, Levantine, Iraqi, Arabian

Dialect ID

Given a speech segment, as short as possible, output its dialect ID

Why Study Dialect ID

Interesting problem

Phonetic cues?

Prosodic cues? (e.g., intonational contours, phrase accents, durational features...)

Lexical and syntactic features?

ASR fails when an Arabic speaker code-switches to her regional dialect

Identifying the dialect prior to recognition enables the ASR to adapt its pronunciation model, acoustic models, morphological model, and language model

Speaker annotation

Dialect ID – Our Approach

Phonotactic modeling

Hypothesis: every Arabic dialect has its own phonetic distribution

This approach has been used successfully in language ID

Dialect ID - TRAIN

First, train an MSA “phone” recognizer

Now, given K dialects:

For dialect i, run the phone recognizer on its speech to obtain phone sequences, e.g.:

dh uw z hh ih n d uw w ay ey d y aw ao uh jh y eh k oh aa k v hh aw ao n

f uw v ow z l iy g s m p l k dh n eh g f ey m p l ay ae

dh iy jh sh p eh ae ey d p sh ua r m ey f ay n z

Train an n-gram model λi on these phone sequences
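
The training step can be sketched roughly as follows. This is only an illustration under stated assumptions: the slides do not give the n-gram order or the smoothing method, so a bigram model with add-one smoothing is assumed here, and `train_bigram_model`, `bigram_prob`, and the toy phone strings are hypothetical.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # boundary symbols (an assumption, not from the slides)


def train_bigram_model(phone_sequences):
    """Count phone bigrams over recognizer output for one dialect.

    Returns (bigram_counts, context_counts, vocabulary).
    """
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for seq in phone_sequences:
        phones = [BOS] + seq.split() + [EOS]
        vocab.update(phones[1:])  # every symbol that can be predicted
        for prev, cur in zip(phones, phones[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def bigram_prob(model, prev, cur):
    """P(cur | prev) with add-one smoothing (the smoothing choice is assumed)."""
    bigrams, contexts, vocab = model
    return (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))


# Toy phone strings standing in for recognizer output on dialect i.
dialect_i_phones = [
    "dh uw z hh ih n d uw",
    "f uw v ow z l iy g",
]
lambda_i = train_bigram_model(dialect_i_phones)
print(bigram_prob(lambda_i, "dh", "uw"))
```

One such model would be trained per dialect, each on the phone strings produced by the same MSA recognizer.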

Dialect ID - TEST

Given a speech segment S from an unknown dialect:

Run the phone recognizer on S to obtain its phone sequence PS:

uw hh ih n d uw w ay ey uh jh y eh k oh v hh aw ao n hh aa m
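
The transcript does not preserve the decision rule. In a typical phonotactic setup, the phone sequence decoded from S is scored under each dialect's n-gram model and the highest-scoring dialect is chosen. The sketch below assumes that rule, bigram models with add-one smoothing, and made-up training strings; `train`, `log_likelihood`, and `classify` are hypothetical names.

```python
import math
from collections import Counter


def train(seqs):
    """Bigram counts over phone strings; '<s>' and '</s>' mark boundaries."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for seq in seqs:
        phones = ["<s>"] + seq.split() + ["</s>"]
        vocab.update(phones[1:])
        for prev, cur in zip(phones, phones[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def log_likelihood(model, seq, vocab_size):
    """Add-one smoothed log score of a phone string under one dialect model."""
    bigrams, contexts, _ = model
    phones = ["<s>"] + seq.split() + ["</s>"]
    return sum(
        math.log((bigrams[(p, c)] + 1) / (contexts[p] + vocab_size))
        for p, c in zip(phones, phones[1:])
    )


def classify(models, seq):
    """Return the dialect whose model scores the phone string highest."""
    # Use one shared phone inventory so scores are comparable across models.
    vocab_size = len(set().union(*(m[2] for m in models.values())))
    return max(models, key=lambda d: log_likelihood(models[d], seq, vocab_size))


# Toy data: made-up phone strings, not real recognizer output.
models = {
    "dialect_A": train(["uw hh ih n d uw", "hh ih n d uw w ay"]),
    "dialect_B": train(["f uw v ow z l", "v ow z l iy g"]),
}
print(classify(models, "hh ih n d"))
```

Because the same recognizer decodes every dialect, the models differ only in which phone sequences they consider likely, which is what the phonotactic hypothesis relies on.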

Experiment

Train an MSA “phone” recognizer on ~37 hours of speech from TDT4 Broadcast News

Corpora – Levantine

Arabic CTS Levantine Fisher Training Data Set 1, 2, 3 Speech

762 dialogues, 1524 speakers

Each dialogue is 10 minutes: 127 hours of speech

Annotated: LEB=547, JOR=393, PAL=187, SYR=72

Silence-based segmentation; every segment shorter than 0.5 s is removed
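
The segmentation step could look roughly like the sketch below. The slides do not say how silence was detected, so a simple per-frame energy threshold is assumed; the 10 ms frame step, the threshold value, and the function name `segment_by_silence` are all hypothetical. Only the 0.5 s minimum duration comes from the slides.

```python
def segment_by_silence(frame_energies, frame_sec=0.01, threshold=0.1, min_sec=0.5):
    """Split a frame-energy track at silent frames and drop short segments.

    Returns a list of (start_sec, end_sec) for runs of non-silent frames
    lasting at least min_sec.
    """
    segments, start = [], None
    # A -inf sentinel guarantees that a run reaching the end is closed.
    for i, energy in enumerate(list(frame_energies) + [float("-inf")]):
        if energy > threshold and start is None:
            start = i  # a speech run begins
        elif energy <= threshold and start is not None:
            if (i - start) * frame_sec >= min_sec:
                segments.append((start * frame_sec, i * frame_sec))
            start = None  # the run ends at a silent frame
    return segments


# Toy energy track: 0.6 s of speech, then a 0.3 s burst that gets dropped.
energies = [0.0] * 10 + [1.0] * 60 + [0.0] * 10 + [1.0] * 30 + [0.0] * 5
segments = segment_by_silence(energies)
print(len(segments))
```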

Corpora – Egyptian

CALLHOME Egyptian Arabic Speech

120 dialogues, 240 speakers

Each dialogue is 30 minutes: 60 hours of speech

Silence-based segmentation; every segment shorter than 0.5 s is removed

Experiment

Egyptian corpus: 20 of 240 speakers held out

Run the Arabic phone recognizer on the other 220 files: ~18.3 million phones

Levantine corpus: 757 of 1524 speakers held out

Run the Arabic phone recognizer on the remaining files: ~19.4 million phones

Results on the Held-Out Data

Levantine: 98.3% (744/757 were correctly classified as Levantine)

Egyptian: 95% (19/20 were correctly classified as Egyptian)

Results on a different corpus

Babylon Levantine corpus

Microphone recordings

164 speakers

~60 hours of speech

Accuracy: 96.3% of speakers correctly classified

TODO

Test on a different corpus for Egyptian

Try to identify “sub” dialects (from the same corpus)

Identify Gulf and Iraqi Arabic

Incorporate an English phone recognizer

Important issue (TODO)

We use all the speech of a speaker: on average ~5 minutes for Levantine and ~15 minutes for Egyptian

Will this approach work if we use less than 30 s of speech?

Thank you!