ITCS 6010 VUI Evaluation Paradise & SUM. PARADISE Paradigm for Dialogue System Evaluation Goal:...

Date posted: 20-Dec-2015

Page 1: ITCS 6010 VUI Evaluation Paradise & SUM. PARADISE Paradigm for Dialogue System Evaluation Goal: Maximize User Satisfaction.

ITCS 6010

VUI Evaluation: PARADISE & SUM

PARADISE Paradigm for Dialogue System Evaluation

Goal: Maximize User Satisfaction

PARADISE Paradigm for Dialogue System Evaluation

Performance is modeled as a weighted function of a task-based success measure and dialogue-based cost measures, where weights are computed by correlating user satisfaction with performance.

Dialogue tasks are represented as Attribute Value Matrices (AVMs) of attribute-value pairs.
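The weighted function described above can be sketched in Python. This is a minimal illustration, not the slides' own formulation: it assumes the commonly published PARADISE form, in which the success measure and each cost measure are z-score normalized before weighting. The function names, the single "turns" cost measure, and the weight values are all hypothetical; in practice the weights would be fitted against user satisfaction ratings, which is not shown here.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Normalize values to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: no dialogue stands out on this measure.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def paradise_performance(success, costs, alpha, weights):
    """Return one performance score per dialogue.

    success: task-success value per dialogue
    costs:   dict mapping a cost-measure name to its value per dialogue
    alpha:   weight on task success (assumed to be given)
    weights: dict mapping each cost-measure name to its weight (assumed given)
    """
    n_success = z_scores(success)
    n_costs = {name: z_scores(vals) for name, vals in costs.items()}
    return [
        alpha * n_success[i]
        - sum(weights[name] * n_costs[name][i] for name in costs)
        for i in range(len(success))
    ]


# Hypothetical data for three dialogues: success in [0, 1], cost = number of turns.
scores = paradise_performance(
    success=[1.0, 0.5, 0.0],
    costs={"turns": [4, 9, 14]},
    alpha=0.5,
    weights={"turns": 0.5},
)
print(scores)
```

With these made-up numbers the first dialogue (high success, few turns) scores highest and the third scores lowest.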

Attribute Value Matrix (AVM)

AVM represents dialogue task
  • Information exchange between agent and user
  • Represented as ordered pairs of attribute and possible values

AVM instantiation of task
  • Same regardless of dialogue

AVM (cont’d)

Each utterance is tagged with the attributes it addresses (DC = depart-city, AC = arrival-city, DR = depart-range, DT = depart-time).

Dialogue with Agent A:

A1: Hello this is Train Enquiry Service. Please speak after the tone. Which information do you need? [DC, AC, DR, DT]
U1: I want to go from Torino to Milano. [DC, AC]
A2: Do you want to go from Trento to Milano? Yes or No? [DC, AC]
U2: No. [DC, AC]
A3: Do you want to leave from Trento? Yes or No? [DC]
U3: No. [DC]
A4: Where do you want to leave from? [DC]
U4: I want to leave from Torino. [DC]
A5: Do you want to leave from Torino? Yes or No? [DC]
U5: Yes. [DC]
A6: Do you want to go to Milano? [AC]
U6: Yes. [AC]
A7: At which time do you want to leave? [DR]
U7: I want to travel in the evening. [DR]
A8: Do you want to leave between 6 and 9pm? Yes or No? [DR]
U8: Yes. [DR]
A9: There is a train leaving at 8:00pm. [DT]
….

Dialogue with Agent B:

B1: Hello this is Train Enquiry Service. Please speak after the tone. Which information do you need? [DC, AC, DR, DT]
U1: I want to travel from Torino to Milano. [DC, AC]
B2: At which time do you want to leave from Merano to Milano? [DC, AC, DR]
U2: No, I want to leave from Torino in the evening. [DC, DR]
B3: Do you want to leave from Torino between 6 and 11pm? Please answer Yes or No. [DC, DR]
U3: Yes. [DC, DR]
B4: A train leaves at 8:00pm. [DT]
….

attribute       actual value
depart-city     Torino
arrival-city    Milano
depart-range    Evening
depart-time     8pm
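As a rough illustration of how the AVM above could be used, the sketch below stores it as a Python dictionary and compares it with what a system understood. The helper names and the `first_guess` value are hypothetical, and the plain match rate shown is a simplification chosen for this example, not the chance-corrected agreement statistic used in the PARADISE literature.

```python
# Key AVM for the train-enquiry task, taken from the table above.
KEY_AVM = {
    "depart-city": "Torino",
    "arrival-city": "Milano",
    "depart-range": "Evening",
    "depart-time": "8pm",
}


def attribute_matches(key, observed):
    """Map each attribute to True if the observed value equals the key value."""
    return {attr: observed.get(attr) == value for attr, value in key.items()}


def match_rate(key, observed):
    """Fraction of key attributes whose observed value is correct."""
    matches = attribute_matches(key, observed)
    return sum(matches.values()) / len(matches)


# Hypothetical snapshot: the system has misrecognized the departure city
# as Trento, as Agent A does in turn A2.
first_guess = {**KEY_AVM, "depart-city": "Trento"}

print(attribute_matches(KEY_AVM, first_guess))
print(match_rate(KEY_AVM, first_guess))
```

Because the key AVM is the same regardless of the dialogue, the same comparison applies to both Agent A and Agent B.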

PARADISE Paradigm for Dialogue System Evaluation

Advantages
  • PARADISE approach addresses performance and user satisfaction

Disadvantages
  • Too complex to compute
  • Need a large sample size up front

Alternative Approaches

What’s important?
  • Maximize User Satisfaction
  • Maximize Task Success

User Satisfaction

How do we measure user satisfaction?

Questionnaires

Interviews

Focus Groups

Task Success

How do we measure task success?

Logging Actual Use

Performance Measurement

Walkthroughs

Pilot Testing

Task Success

For each dialogue and the entire conversation establish AVMs.

Measure task success with respect to:
  • Task completion time
  • Accuracy or errors (e.g. misinterpretations)

Conclusions

PARADISE is good, but too complex!

Measure user satisfaction and task success.

What if user satisfaction is not the most relevant aspect?

Speech Usability Metric (SUM)

Uses 3 metrics:
  • User satisfaction
  • Accuracy
  • Task completion time

Removes the restriction of relying on a single factor to determine usability

Speech Usability Metric (SUM)

SUM = X * User Satisfaction + Y * Accuracy + Z * Completion Time
where X + Y + Z = 1 and X, Y, Z > 0

Weights determined by evaluator
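The SUM formula can be sketched as a small Python function. The slides do not say how the three metrics are scaled, so this sketch assumes each has already been converted to a score between 0 and 1 where higher is better (for completion time, that means a shorter time gives a higher score). The function name and the example weights are hypothetical.

```python
import math


def sum_score(satisfaction, accuracy, time_score, x, y, z):
    """Weighted SUM of three metric scores, each assumed to lie in [0, 1].

    x, y, z are the evaluator-chosen weights; they must be positive
    and add up to 1, as the formula requires.
    """
    if min(x, y, z) <= 0:
        raise ValueError("weights X, Y, Z must all be greater than 0")
    if not math.isclose(x + y + z, 1.0):
        raise ValueError("weights X, Y, Z must sum to 1")
    return x * satisfaction + y * accuracy + z * time_score


# Hypothetical evaluation that weights user satisfaction most heavily.
example = sum_score(0.8, 0.9, 0.5, x=0.5, y=0.3, z=0.2)
print(round(example, 2))
```

An evaluator who cares more about speed than satisfaction would simply shift weight from X to Z and recompute.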

User Satisfaction

Surveys

Questionnaires

Interviews

Accuracy

Misinterpretations: system recognizes the wrong word

Out-of-vocabulary errors: words not in the system grammar

Wrong choice: correct word recognized, wrong path chosen
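One way to turn these three error types into a single accuracy figure is sketched below. The slides do not give a formula, so the definition used here (one minus the share of user turns that produced any of the three errors) is an assumption, as are the label strings and the function name.

```python
from collections import Counter

# Labels for the three error types listed above (the strings are hypothetical).
ERROR_TYPES = ("misinterpretation", "out_of_vocabulary", "wrong_choice")


def accuracy(turn_labels):
    """Return 1 minus the fraction of turns labeled with an error type.

    turn_labels: one label per user turn, either "ok" or one of ERROR_TYPES.
    """
    if not turn_labels:
        raise ValueError("need at least one labeled turn")
    counts = Counter(turn_labels)
    errors = sum(counts[error_type] for error_type in ERROR_TYPES)
    return 1 - errors / len(turn_labels)


# Hypothetical log of five user turns, two of which went wrong.
labels = ["ok", "misinterpretation", "ok", "ok", "out_of_vocabulary"]
print(accuracy(labels))
```

Keeping the per-type counts around, rather than only the total, also shows whether the recognizer or the grammar is the main source of trouble.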

Task Completion Time

Time to complete task
  • Time for expert to complete task (ETCT)
  • Maximum time to complete task (MTCT)
  • Expected time to complete task (ExTCT)
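The slides name these reference times but do not say how they combine into a score. The sketch below shows one possible scaling, offered purely as an assumption: a user who matches the expert time (ETCT) scores 1, a user who reaches the maximum time (MTCT) scores 0, and times in between are interpolated linearly. ExTCT is not used in this sketch, and the function name is hypothetical.

```python
def time_score(actual, expert, maximum):
    """Scale a completion time to [0, 1], where 1 is expert speed.

    actual:  measured time to complete the task
    expert:  ETCT, time for an expert to complete the task
    maximum: MTCT, maximum time allowed to complete the task
    All three must use the same unit (seconds, for example).
    """
    if not expert < maximum:
        raise ValueError("ETCT must be smaller than MTCT")
    raw = (maximum - actual) / (maximum - expert)
    # Clamp so faster-than-expert and slower-than-maximum times stay in range.
    return max(0.0, min(1.0, raw))


# Hypothetical task: expert needs 60 s, the limit is 180 s, the user took 120 s.
print(time_score(120, expert=60, maximum=180))
```

A score on this scale is higher for shorter times, so it can be weighted alongside satisfaction and accuracy without flipping its sign.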

Conclusion

SUM determines usability of a speech application

Utilizes 3 pre-defined metrics

Allows for greater flexibility