Local and Global Adaptation in Hyperarticulation Amanda Stent, Susan Brennan, Marie Huffman.


Outline
  Introduction: Adaptation
  User adaptation: Hyperarticulation
  Current and future work

Adaptation in Spoken Dialog
  There is considerable variation in spoken dialog.
  Much of this variation is designed to be adaptive.
  Speakers may converge (e.g. Brennan and Clark 96) or complement each other (e.g. Oviatt 95, Brennan 90).
  Adaptation may be partner-specific or generic (Brown and Dell 87).
  Adaptation may be local or global.

Adaptation in Spoken Dialog
  Interesting questions include:
  How do humans adapt to each other in spoken dialog?
    Speaking style, e.g. dialect, speaking rate
    Lexical and syntactic choices
    Initiative
  Are these adaptations partner-specific or generic, local or global?
  How does adaptation in human-computer dialog differ from adaptation in human-human dialog?
  Can we use adaptation in human-computer dialog to improve dialog outcomes?

Adaptation: The User
  Humans adapt to their dialog partners, including computers, at many levels:
    Phonetic (e.g. hyperarticulation)
    Lexical/syntactic (e.g. producing simpler utterances, rephrasing, mirroring system’s choice of words)
    Dialog and task (e.g. skipping acknowledgments, following system initiative)
  Some of these adaptations reflect incorrect models of the conversational partner, and/or are known to be maladaptive (e.g. hyperarticulation, some rephrasing).

Adaptation: The System
  Systems can adapt to make the user feel more comfortable or to mimic human adaptations (responsive generation).
    Converging on the user’s choice of referring expression.
    Following the user’s topic shifts.
  Systems can construct interactions that guide the user to useful forms of adaptation (directive generation).
    Using words that can be recognized/parsed.
    Suggesting rephrases on misrecognition.
    Presenting their capabilities accurately.

Outline
  Introduction: Adaptation
  User adaptation: Hyperarticulation
  Current and future work

Experiment:
  The problem
  Hypotheses
  Experiment design
  Experiment results
  Discussion

The Problem
  When users experience speech recognition errors, they try to adapt in ways that do not lead to performance improvements:
    Hyperarticulation (Soltau and Waibel 98, Wade et al. 92)
    Rephrasing to out of grammar (Fischer 99, Choularton and Dale 04)
  Our question:
    Considered as a form of adaptation, how exactly does hyperarticulation function?

Hypotheses
  In repairs of misrecognitions, subjects will exhibit hyperarticulation.
    Slower speaking rate, longer pauses, more careful speech (Oviatt et al. 98; Levow 98, 99; Hirschberg et al. 99, 00)
  (Local impact) Hyperarticulation will be more likely to appear around the actual misrecognition than elsewhere in the utterance.
  (Global impact) Once users start hyperarticulating, this behaviour will persist even if errors stop occurring.

Experiment Design
  Wizard-of-Oz procedure
  Subjects answered prerecorded questions about a children’s softball team database.
  Subjects were told to answer in complete sentences and to repeat until heard correctly.
  System feedback was provided in text.
    Usually “I heard you say …”
    For unplanned errors by subjects (e.g. disfluencies, use of pronouns or ellipsis, incomplete utterances), other feedback was provided.
    For selected planned error utterances, system feedback contained misrecognitions.

Example: Unplanned Error
  Q. What is Ryan Dade bringing to the food sale?
  U. Ryan Dade is bringing cat collars, and a basket, and pet toys to the foo, to the garage sale, oops   [unplanned error]
  S. Please repeat
  U. Ryan Dade is bringing cat collars, a basket, and pet toys to the garage sale   [repair]

Example: Planned Error
  Q. What is Kate Tolstoy bringing to the food sale?
  U. Kate Tolstoy is bringing some cookie dough and a picnic table to the food sale   [planned error]
  S. You said: Kate Tolstoy is bringing some cooking label in a pickle to the food sale
  U. Kate Tolstoy is bringing some cookie dough and a picnic table to the food sale   [repair]
  S. You said: Kate Tolstoy is bringing some cookie dough and a picnic table to the food sale

Measurements
  Speaking rate (syllables/sec.)
  Average pause length (ms.)
  Phonetic features indicating careful speech:
    Mid-word /t/ tapping vs. flap /D/, e.g. Peter, tutor, party, forty, writer
    Word-final /t/ release vs. non-release, e.g. Kate, scientist, peat, flute, dart
    /t/ release after /n/ vs. non-release, e.g. Kanter, scientist, Planters, dentist, Santa
    Tense “a” in indefinite articles
    /d/ in “and”
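The two timing measures above can be sketched as follows. This is a minimal illustration, not the authors' scripts: the `Utterance` record and its field names are hypothetical, and it assumes speaking rate is computed over the whole utterance duration, pauses included.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    """Hypothetical record for one coded utterance."""
    n_syllables: int        # syllable count for the utterance
    duration_s: float       # total utterance duration in seconds (assumed to include pauses)
    pauses_ms: List[float]  # lengths of utterance-internal pauses in milliseconds


def speaking_rate(u: Utterance) -> float:
    """Speaking rate in syllables per second."""
    if u.duration_s <= 0:
        raise ValueError("duration must be positive")
    return u.n_syllables / u.duration_s


def mean_pause_ms(u: Utterance) -> float:
    """Average pause length in milliseconds (0.0 when there are no pauses)."""
    if not u.pauses_ms:
        return 0.0
    return sum(u.pauses_ms) / len(u.pauses_ms)


example = Utterance(n_syllables=20, duration_s=5.0, pauses_ms=[100.0, 300.0])
print(speaking_rate(example), mean_pause_ms(example))
```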

Measurements
  Local impact of hyperarticulation
    Target phonemes were coded for all planned and unplanned errors and all repairs
  Global impact of hyperarticulation
    Planned errors were placed so that very few errors occurred in the 1st third of each dialog, errors occurred every 1-3 utterances in the 2nd third, and a run of 5 errors occurred in the last third
  Impact of hyperarticulation on SR:
    All utterances were run through two speech recognizers, one grammar-based and one statistical
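Recognizer output is usually scored by word error rate (WER): the word-level edit distance between what was said and what was recognized, divided by the number of words said. The sketch below is a generic implementation, not the scoring tool used in the study. For illustration only, it is applied to the planned-error example from the earlier slide, which was a scripted misrecognition and not real recognizer output.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


said = "Kate Tolstoy is bringing some cookie dough and a picnic table to the food sale"
heard = "Kate Tolstoy is bringing some cooking label in a pickle to the food sale"
print(word_error_rate(said, heard))
```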

Experiment
  16 subjects (9 women, 7 men, mean age 22 years) participated in the experiment
    All native speakers of English
    10 monolingual, 6 bilingual but English-dominant
    Each answered 66 questions
  2 additional subjects’ data were discarded due to equipment failure
  Some utterances were discarded due to major disfluencies or being cut off
  Result: 1202 utterances -- 373 planned errors and repairs

Data Coding
  Utterance length, number of words, number of syllables were computed automatically
  PRAAT was used to measure number and length of utterance-internal pauses greater than 10 ms. in length
  Phonetic annotation of target words in errors and repairs was done by hand

Measures of Hyperarticulation: Speaking Rate
  Speaking rate and clear speech are reliably correlated (r = -.239, p < .001).
  Speakers spoke more slowly in a repair than in a planned error, 3.62 syl./sec to 4.12 syl./sec. (p ≤ .001).
  For all paired utterances taken together, repairs were slower than errors, 3.67 to 4.17 syl./sec. (p < .001)

Measures of Hyperarticulation: Careful Speech
  On average, speakers produced more clear forms in repairs than in errors, 38% to 30% (p < .001).
  Of the 5 phonetic features coded for the paired utterances:
    3 were more likely to be pronounced in their clear forms in the repair than in the error: /t/ tapping vs. flap /D/, word-final /t/ release vs. non-release, /t/ release vs. non-release after /n/.
    2 were not: tense “a” in indefinite articles, and /d/ in “and”.

Measures of Hyperarticulation: Careful Speech
  Content words were produced in clear form 13% more often in a repair than in an error (p = .002).
  Function words were produced in clear form only 4% more often in a repair than in an error (p = .002).

Local Impact of Hyperarticulation

Do speakers hyperarticulate as a precise form of correction aimed at repairing the most troublesome part of the utterance?

The percentage of clear forms increased 12% for the misunderstood portion during the repair, significantly greater than the before and after portions (only 4.3% and 4.7%, respectively).
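A per-region comparison of this kind could be computed along the following lines. This is only a sketch under stated assumptions: the `(region, is_clear)` token representation and both function names are hypothetical, and the gain is taken as a difference in percentage points, which the slide does not specify.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Token = Tuple[str, bool]  # (region, is_clear); region is "before", "during" or "after"


def clear_form_rate_by_region(tokens: Iterable[Token]) -> Dict[str, float]:
    """Percentage of coded tokens produced in clear form, per region."""
    counts = defaultdict(lambda: [0, 0])  # region -> [clear count, total count]
    for region, is_clear in tokens:
        counts[region][0] += int(is_clear)
        counts[region][1] += 1
    return {region: 100.0 * clear / total for region, (clear, total) in counts.items()}


def clear_form_gain(error_tokens: Iterable[Token],
                    repair_tokens: Iterable[Token]) -> Dict[str, float]:
    """Change in clear-form percentage from error to repair, per region (assumed: percentage points)."""
    error_rates = clear_form_rate_by_region(error_tokens)
    repair_rates = clear_form_rate_by_region(repair_tokens)
    return {region: repair_rates[region] - error_rates[region]
            for region in error_rates if region in repair_rates}


# Made-up toy data, not values from the study.
toy_error = [("before", True), ("before", False), ("during", False),
             ("during", False), ("after", True), ("after", False)]
toy_repair = [("before", True), ("before", False), ("during", True),
              ("during", False), ("after", True), ("after", False)]
print(clear_form_gain(toy_error, toy_repair))
```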

Global Impact of Hyperarticulation

Is hyperarticulation a “switch” or a “dial”?

The closer an utterance was to the most recent previous error, the more carefully it was produced (speaking rate, clear forms) (p < .005).

Speakers gradually return to relaxed speech about 4-7 utterances after seeing evidence of misrecognition.
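The predictor in this analysis, distance from the most recent previous error, can be derived from the sequence of utterances in a dialog. A minimal sketch, assuming each utterance is simply flagged as an error or not (the function name and representation are illustrative, not from the study):

```python
from typing import List, Optional


def distance_from_last_error(is_error: List[bool]) -> List[Optional[int]]:
    """For each utterance, the number of utterances since the most recent
    previous error, or None if no error has occurred yet."""
    distances: List[Optional[int]] = []
    last_error_index: Optional[int] = None
    for i, err in enumerate(is_error):
        distances.append(None if last_error_index is None else i - last_error_index)
        if err:
            # Later utterances measure their distance from this error.
            last_error_index = i
    return distances


print(distance_from_last_error([False, True, False, False, True, False]))
```

Each distance can then be paired with that utterance's speaking rate or proportion of clear forms to look for the gradual return to relaxed speech described above.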

Individual Differences
  Individual speakers displayed substantial variability in average speaking rate (2.43 to 5.27 syl./sec).
  BUT all speakers slowed their speaking rate during repairs, relative to before repairs (by .04 to 1.33 syl./sec).

Individual Differences
  All but 3 speakers produced more clear speech during repairs than before repairs.
  Speaking rate and careful speech were correlated across speakers; that is, those who spoke rapidly tended to produce more relaxed forms and those who spoke slowly tended to produce more clear forms.

Individual Differences
  A few speakers adopted a hyperarticulate style of speaking throughout the experiment; those who experienced the most unplanned errors spoke the slowest during non-repairs.
  Both monolingual and bilingual speakers slowed their speaking rate equally during repairs (and there was no difference in average speaking rates of monolinguals versus bilinguals). However, monolinguals increased their proportion of clear speech marginally more than did bilinguals.

Impact on Speech Recognition
  For the statistical speech recognizer, higher word error rates were associated with slower speech (p < .001) but not with more careful speech.
  For the grammar-based recognizer, higher word error rates were correlated with faster speech (p < .001), and with more careful speech (p = .05).
  For both recognizers, the effect sizes (by Cohen’s 88 standards) are rather small.
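For reference, Cohen's (1988) conventional benchmarks for a correlation coefficient are roughly .10 (small), .30 (medium) and .50 (large). The sketch below applies them, assuming the effect sizes in question are correlations; the slides do not say which effect size measure was used or give the recognizer values, so the example uses the speaking rate/clear speech correlation reported earlier (r = -.239). The "negligible" label for values under .10 is an added convenience, not one of Cohen's terms.

```python
def cohen_r_label(r: float) -> str:
    """Label the magnitude of a correlation using Cohen's (1988) benchmarks."""
    magnitude = abs(r)
    if magnitude >= 0.5:
        return "large"
    if magnitude >= 0.3:
        return "medium"
    if magnitude >= 0.1:
        return "small"
    return "negligible"  # assumption: label chosen here for |r| below .10


print(cohen_r_label(-0.239))
```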

Impact on Speech Recognition

As Wade et al. (92) found, not all aspects of hyperarticulation cause problems, and any effects depend a great deal on how the acoustic model was trained.

Misrecognition errors may cause more problems due to users’ rephrasing than to users’ switching to hyperarticulate speech.

Discussion
  Hyperarticulation varies both by location within the utterance and over time.
  The type and degree of hyperarticulation depend somewhat on the individual speaker.
  Once hyperarticulation has been detected, the system can try to guide the user away from hyperarticulation by modifying its behaviours (Hockey et al. 03).
  However, hyperarticulation is not as maladaptive as rephrasing to out-of-grammar.

Outline
  Introduction: Adaptation
  User adaptation: Hyperarticulation
  Current and future work

Models of System (Weaver et al.)
  The problem
  Experiment design
  Preliminary results

The Problem
  Users may develop inaccurate models of dialog systems, leading to maladaptive interactions.
  Our question:
    How can we construct system behaviors that reduce user maladaptation?

Experiment Design
  Same as experiment 1, except:
    Questions and system feedback provided using TTS.
    Planned errors appear throughout the dialog: each phonetic category is represented in each quarter of the dialog, and in each location (before, during and after error).
    Subjects assigned to one of two conditions:
      (Graceful) System model is one of a system that understands human language.
      (Nongraceful) System model is one of a system that recognizes but does not understand speech.

Experiment Design
  System model is presented to subjects in experiment setup, through choice of TTS voice, and through construction of planned errors. For example:
    (True) Hunter Mariano plays #center#
    (Graceful) Hunter Mariano plays #better#
      Semantically and syntactically meaningful
    (Nongraceful) Hunter Mariano plays #venture#
      Phonetically similar, syntactically nonsensical

Preliminary Results
  Subjects hyperarticulate in repairs regardless of condition; however, there is a trend to clearer speech in the nongraceful condition before errors.
  Subjects in the graceful condition use less clear speech initially (26%, increasing to 44% on repairs). Their speaking rate slows down an average of .25 syl./sec on repairs.
  Subjects in the nongraceful condition use more clear speech initially (38%, increasing to 49% on repairs). Their speaking rate slows down an average of .52 syl./sec on repairs.

System Adaptation (Marge, Gerrig, Stent et al.)
  Experiment design:
    Subjects interact with a spoken dialog system to fill out a survey.
    Two variables: initiative and lexical choice.
  Initiative:
    System chooses topics and their order (directive)
    System chooses topics, user chooses order (mixed)
    User chooses topics and their order (nondirective)
  Lexical choice:
    System does not adapt to user’s choice of topic labels, choice of tense (directive)
    System does adapt to user’s choice of topic labels, choice of tense (adaptive)

Directive Generation
  Measures:
  Initiative:
    Topic choice, order
    Requests for help, prompt repetition
    Length of user responses
    Number of hangups
    Match between system’s and user’s estimate of user’s overall opinion of course
  Lexical choice:
    Number of misrecognitions
    Pause length between prompt and response

Conclusions
  Variation in human-human dialog is omnipresent.
  Much of it is purposeful or adaptive.
  We do not know enough about adaptation in human-computer dialog.
  We may be able to use humans’ tendencies to adapt to improve outcomes for spoken dialog systems.