
CS 551/651: Structure of Spoken Language

Lecture 5: Characteristics of Place of Articulation; Phonetic Transcription

John-Paul Hosom, Fall 2010

2

Acoustic-Phonetic Features: Manner of Articulation

Approximately 8 manners of articulation:

Name          Sub-Types           Examples
Vowel         vowel, diphthong    aa, iy, uw, eh, ow, …
Approximant   liquid, glide       l, r, w, y
Nasal                             m, n, ng
Stop          unvoiced, voiced    p, t, k, b, d, g
Fricative     unvoiced, voiced    f, th, s, sh, v, dh, z, zh
Affricate     unvoiced, voiced    ch, jh
Aspiration                        h
Flap                              dx, nx

Change in manner of articulation is usually abrupt and visible; manner provides much information about the location of phonemes.

3

Acoustic-Phonetic Features: Place of Articulation

Approximately 8 places of articulation for consonants:

Name              Examples
Labial            p, b, m, (w)
Labio-Dental      f, v
Dental            th, dh
Alveolar          t, d, s, z, n, l*
Palato-Alveolar   sh, zh, ch**, jh**, r***
Palatal           y
Velar             k, g, ng, (w)
Glottal           h

*   /l/ doesn’t have the same coarticulatory properties as other alveolars
**  starts as alveolar (/t/, /d/), then becomes palato-alveolar
*** /r/ can have a complex place of articulation

Place of articulation is more subject to coarticulation than manner; the F2 trajectory is important for identifying place of articulation.

4

Acoustic-Phonetic Features: Place of Articulation

Labial (/p/, /b/, /m/, /w/):
• constriction (or complete closure) at lips
• the only unvoiced labial is /p/
• the only nasal labial is /m/
• characterized by F1, F2, (even) F3 of adjacent vowel(s) rapidly and briefly decreasing at border with labial

5

Acoustic-Phonetic Features: Place of Articulation

Labio-Dental (/f/, /v/):
• produced by constriction between lower lip and upper teeth
• in English, all labio-dental phonemes are fricatives
• can be characterized by formants of adjacent vowel(s) decreasing at border with the labio-dental (similar to characteristics of labials)

Dental (/th/, /dh/):
• produced by constriction between tongue tip and upper teeth (sometimes tongue tip is closer to alveolar ridge)
• in English, all dental phonemes are fricatives
• may be characterized by stronger energy above 6 kHz, but weaker than /sh/, /zh/ fricatives

6

Acoustic-Phonetic Features: Place of Articulation

Alveolar (/t/, /d/, /s/, /z/, /n/, /l/):
• tongue tip is at or near alveolar ridge
• a large number of English consonants are alveolar
• primary cue to alveolars: F2 of neighboring vowel(s) is around 1800 Hz, except for /l/
• /l/ has low F1 (about 400 Hz) and F2 (about 1000 Hz), high F3
• /l/ before vowel is “light” /l/, after vowel is “dark” /l/.

7

Acoustic-Phonetic Features: Place of Articulation

Palato-Alveolar (/sh/, /zh/, /ch/, /jh/, /r/):
• tongue is between alveolar ridge and hard palate
• 2 fricatives, 2 affricates, 1 rhotic consonant (r sound)
• retroflex has “depression” midway along tongue
• the palato-alveolar fricatives tend to have strong energy due to weak constriction allowing large airflow
• /r/ (and /er/) most easily identified by F3 below 2000 Hz
• /r/ sometimes considered alveolar approximant

Palatal (/y/):
• produced with tongue close to hard palate
• “extreme” production of /iy/
• F1-F2 tend to be more spread than /iy/, F1 is lower than /iy/

8

Acoustic-Phonetic Features: Place of Articulation

Velar (/k/, /g/, /ng/):
• produced with constriction against velum (soft palate)
• only plosives /k/ and /g/, and nasal /ng/
• characteristic of velars is the “velar pinch”, in which F2 and F3 of neighboring vowel become very close at boundary with velar. Most visible in front vowel /ih/

9

Acoustic-Phonetic Features: Place of Articulation

Glottal (/h/):

• /h/ is the nominal glottal phoneme in English; in reality, the tongue can be in any vowel-like position

• the primary cue for /h/ is formant structure without voicing, an energy dip, and/or an increase in aspiration noise in higher frequencies.

10

Distinctive Phonetic Features: Summary

• Distinctive features may be used to categorize phonetic sub-classes and show relationships between phonemes

• There is often not a one-to-one correspondence between a feature value and a particular trait in the speech signal

• A variety of context-dependent and context-independent cues (sometimes conflicting, sometimes complementary) serve to identify features

• Speech is highly variable and highly context-dependent, and cues to phonemic identity are spread in both the spectral and time domains. The diffusion of features makes automatic speech recognition difficult, but human speech recognition is able to use this diffusion for robustness.

11

Redundancy

• Distinctive features are not always independent; some redundancy may be implied (especially binary features)

• Example: Spanish

         i   e   a   o   u
High     +   −   −   −   +
Low      −   −   +   −   −
Back     −   −   +   +   +
Round    −   −   −   +   +

+high → −low       +low → −high
−back → −round     +round → +back
+low → +back       +low → −round
−back → −low       +round → −low

These relationships are language and feature-set specific.

(from Schane, p. 35-38)
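A minimal sketch of how implications of this kind can be checked against the Spanish vowel table. It assumes that a cell without "+" means "−" and that each relationship reads as "if feature A has this value, then feature B has that value". The names `VOWELS`, `RULES`, and `rule_holds` are illustrative and do not come from Schane.

```python
# Spanish vowel features as in the table above (True = "+", False = "-").
# Assumption: every cell that is not marked "+" is "-".
VOWELS = {
    "i": {"high": True,  "low": False, "back": False, "round": False},
    "e": {"high": False, "low": False, "back": False, "round": False},
    "a": {"high": False, "low": True,  "back": True,  "round": False},
    "o": {"high": False, "low": False, "back": True,  "round": True},
    "u": {"high": True,  "low": False, "back": True,  "round": True},
}

# Each rule (f1, v1, f2, v2) reads: if feature f1 has value v1,
# then feature f2 has value v2.
RULES = [
    ("high",  True,  "low",   False),
    ("low",   True,  "high",  False),
    ("back",  False, "round", False),
    ("round", True,  "back",  True),
    ("low",   True,  "back",  True),
    ("low",   True,  "round", False),
    ("back",  False, "low",   False),
    ("round", True,  "low",   False),
]


def rule_holds(rule, vowels=VOWELS):
    """Return True if the implication holds for every vowel in the table."""
    f1, v1, f2, v2 = rule
    return all(feats[f2] == v2
               for feats in vowels.values()
               if feats[f1] == v1)


# Rules that some vowel in the table contradicts (none are expected here).
violations = [rule for rule in RULES if not rule_holds(rule)]
```

A candidate rule that is not a redundancy of this table, such as "+high implies +back", fails the check because /i/ is +high but −back.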

12

Redundancy

• Redundant information can be indicated by circling redundant features:

         i   e   a   o   u
High     +   −   −   −   +
Low      −   −   +   −   −
Back     −   −   +   +   +
Round    −   −   −   +   +

• Some redundancies are universal (can’t be +high and +low)

• Phonetic sequences also have constraints (redundant info.): English has no more than 3 word-initial consonants; when there are 3, the first consonant is always /s/, the next is always /p/, /t/, or /k/, and the third is always /r/ or /l/ (from Schane, p. 36-40)

13

Phonetic Transcription

Given a corpus of speech data, it’s often necessary to create a transcription:

• word level
• phoneme level
• time-aligned phoneme level
• time-aligned detailed phoneme level (with diacritics)
• other information: phonetic stress, emotion, syntax, repairs

Most common are word-level and time-aligned phoneme level.

Time-aligned phonetic transcription examples:

0    110  .pau
110  180  h
180  240  eh
240  280  l
280  390  ow
390  540  .pau

t uw .br
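A time-aligned transcription of this kind is easy to load and sanity-check. The sketch below assumes one "start end label" triple per line and that the times are in milliseconds (the unit is not stated above); `Segment`, `parse_labels`, and `is_contiguous` are illustrative names, not part of any labeling toolkit.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: int   # segment start time (assumed to be msec)
    end: int     # segment end time (assumed to be msec)
    label: str   # phonetic label, e.g. "eh" or ".pau"


def parse_labels(text: str) -> List[Segment]:
    """Parse lines of the form 'start end label' into Segment records."""
    segments = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not a start/end/label triple
        start, end, label = int(parts[0]), int(parts[1]), parts[2]
        if end <= start:
            raise ValueError(f"segment has non-positive duration: {line!r}")
        segments.append(Segment(start, end, label))
    return segments


def is_contiguous(segments: List[Segment]) -> bool:
    """True if every segment starts exactly where the previous one ends."""
    return all(a.end == b.start for a, b in zip(segments, segments[1:]))


example = """
0 110 .pau
110 180 h
180 240 eh
240 280 l
280 390 ow
390 540 .pau
"""

segments = parse_labels(example)
durations = [(s.label, s.end - s.start) for s in segments]
```

Checking contiguity right after parsing catches gaps or overlaps between neighboring labels before they affect any boundary statistics.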

14

Phonetic Transcription

Are phonemes precise quantities with exact boundaries? No… humans disagree on phonetic labels and boundary positions; disagreement may be a matter of interpretation of the utterance.

Phonetic label agreement between humans:

            Full Labels   Base Labels   Broad Categories
English     70%           71%           89%
German      61%           65%           81%
Mandarin    66%           78%           87%
Spanish     74%           82%           90%

Full, Base Label Set: 55 (English), 62 (German), 50 (Mandarin), 42 (Spanish)

Broad Categories: 7 corresponding to manner of articulation

*From Cole, Oshika, et al., ICSLP’94
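The gap between full-label and broad-category agreement can be illustrated with a small sketch. It assumes the two label sequences are already aligned one-to-one (a real comparison needs an alignment step first), and it uses the eight manner classes listed earlier in this lecture as a stand-in for the seven broad categories of Cole et al., whose exact definition is not given here. The two label sequences are made-up examples, not data from the study.

```python
# Manner class for each phoneme, taken from the manner-of-articulation
# table earlier in the lecture (a stand-in for the study's 7 categories).
MANNER = {}
for manner, phones in {
    "vowel": "aa iy uw eh ow",
    "approximant": "l r w y",
    "nasal": "m n ng",
    "stop": "p t k b d g",
    "fricative": "f th s sh v dh z zh",
    "affricate": "ch jh",
    "aspiration": "h",
    "flap": "dx nx",
}.items():
    for phone in phones.split():
        MANNER[phone] = manner


def agreement(labels_a, labels_b, mapping=None):
    """Percent of positions where two pre-aligned label sequences agree.

    If `mapping` is given, labels are first mapped to broader categories.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("sequences must be non-empty and of equal length")
    if mapping is not None:
        labels_a = [mapping[x] for x in labels_a]
        labels_b = [mapping[x] for x in labels_b]
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Hypothetical labelings of the same utterance by two transcribers.
labeler_a = ["h", "eh", "l", "ow", "t", "uw"]
labeler_b = ["h", "aa", "l", "ow", "d", "uw"]

full = agreement(labeler_a, labeler_b)           # exact phoneme labels
broad = agreement(labeler_a, labeler_b, MANNER)  # manner classes only
```

In this toy case the two disagreements (eh/aa and t/d) stay within a manner class, so they disappear at the broad-category level.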

15

Phonetic Transcription

70% agreement on 55 phonemes, 89% agreement on 7 categories

Best phoneme-level automatic speech recognition results on TIMIT, with a 39-phoneme symbol set: 75.8% (Antoniou, 2001; Reynolds and Antoniou, 2003?)

Differences:
1. Human agreement evaluated on spontaneous speech (stories), TIMIT is read speech
2. Humans used 55 phonemes; 39 phonemes for evaluating TIMIT

Phoneme agreement doesn’t translate into word accuracy… human word error rate is typically an order of magnitude lower than that of the best automatic speech recognition system.

16

Phonetic Transcription

Phonetic label boundary agreement between humans:

Agreement measured by comparing two manual labelings, A and B, and computing the percentage of cases in which B labels are within some threshold (20 msec) of A labels.

[Figure: agreement (%) versus threshold (msec), thresholds 0 to 45 msec, agreement axis 50 to 100%; curves labeled Cosi, Ljolje, Wesenick, Cole, Leung, Hosom]

Average agreement of 93.8% within 20 msec threshold; maximum agreement of 96% within 20 msec
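The agreement measure described above can be sketched as follows. The sketch assumes both labelings contain the same boundaries in the same order, and it counts a boundary as agreeing when the difference is less than or equal to the threshold (whether the comparison is strict is an assumption). The boundary times are made-up values in msec.

```python
def boundary_agreement(labels_a, labels_b, threshold_ms=20.0):
    """Percent of B boundaries within threshold_ms of the matching A boundary.

    Assumes labels_a[i] and labels_b[i] mark the same phonetic boundary.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty boundary lists of equal length")
    within = sum(abs(a - b) <= threshold_ms
                 for a, b in zip(labels_a, labels_b))
    return 100.0 * within / len(labels_a)


# Hypothetical boundary times (msec) from two human labelers.
labels_a = [110, 180, 240, 280, 390]
labels_b = [105, 185, 265, 282, 400]

# Sweeping the threshold gives one curve of the kind plotted
# as agreement (%) versus threshold (msec).
curve = {t: boundary_agreement(labels_a, labels_b, t)
         for t in range(5, 50, 5)}
```

Reporting the whole curve rather than the single 20 msec value shows how quickly agreement saturates as the tolerance grows.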

17

Phonetic Transcription

Is there a “correct” answer? No; inherently subjective, although semi-arbitrary guidelines can be imposed.

Is measuring accuracy meaningless? No; phonemes do have identity and order, although details may be subjective.

Sometimes very precise (if semi-arbitrary) labels and boundaries are extremely important (e.g. concatenative text-to-speech databases).

What about getting a computer to generate transcriptions, or at least phonetic boundaries?

Advantages: consistent, fast
Disadvantages: not accurate, compared to human transcription; not robust to different speakers, environments

18

Phonetic Transcription

Automatic Phonetic Alignment (assume phonetic identity is known):

Two common methods:

(1) “Forced Alignment”: Use existing speech recognizer, constrained to recognize only the “correct” phoneme sequence. The search process used by HMM recognizers returns both phoneme identity and location. Location information is boundary information.

(2) Dynamic Time Warping:
(a) Use text-to-speech or utterance “templates” to generate same speech content with known boundaries.
(b) Warp time scale of reference (TTS or template) with input speech to minimize spectral error.
(c) Convert known boundary locations to original time scale.
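Steps (b) and (c) of the Dynamic Time Warping method can be sketched as below. This is a toy version under stated assumptions: each frame is a short feature vector, Euclidean distance stands in for the spectral error, the warp allows the three standard steps (diagonal, vertical, horizontal), and boundaries are expressed as frame indices. The function names and the one-dimensional example frames are illustrative only.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature frames (assumed error measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dtw_path(ref: List[Frame], inp: List[Frame]) -> List[Tuple[int, int]]:
    """Return the minimum-cost warping path as (ref_frame, input_frame) pairs."""
    n, m = len(ref), len(inp)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(ref[i - 1], inp[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],   # diagonal step
                                 cost[i - 1][j],       # advance reference only
                                 cost[i][j - 1])       # advance input only
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def map_boundaries(path: List[Tuple[int, int]],
                   ref_boundaries: List[int]) -> List[int]:
    """Map each reference boundary frame to the first input frame aligned to it."""
    first_input = {}
    for r, t in path:
        first_input.setdefault(r, t)
    return [first_input[b] for b in ref_boundaries]


# Toy reference with known boundaries at frames 2 and 4 (three "phonemes").
ref = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
# Toy input: same content, but with different segment durations.
inp = [[0.0], [0.0], [0.0], [1.0], [2.0], [2.0], [2.0]]

path = dtw_path(ref, inp)
input_boundaries = map_boundaries(path, [2, 4])
```

Multiplying the resulting frame indices by the analysis frame step would convert them to times on the input's original time scale.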

19

Phonetic Transcription

Accuracy of automatic alignment

Speaker-independent alignment using Forced Alignment:

[Figure: agreement (%) versus threshold (msec), thresholds 0 to 45 msec, agreement axis 30 to 100%; curves labeled Brugnara, Ljolje1, Pellom, Wightman, Ljolje2, Rapp, Svendsen1, Pauws, Dalsgaard, Malfrere1, Kipp, Stober1, Stober2, Hosom]

20

Phonetic Transcription

Comparing manual and automatic alignment of TIMIT corpus:

[Figure: two curves, Automatic and Manual; horizontal axis 0 to 45, vertical axis 50 to 100]

• Automatic method still makes “stupid” mistakes.
• Manual labeling criteria not rigorously defined.
• Performance degrades significantly in presence of noise.
• Assumes correct phonetic sequence is known…