HCI460: Week 10 Lecture
November 11, 2009


Project 3: Questions?

Recap: Step-by-step stats for A-B comparisons
– For within-subjects design
– For between-subjects design

Summative testing
– Case study & exercise: Pick best-in-class MP3 player

Other usability evaluation methods
– Longitudinal
– Focus groups
– Remote unmoderated

Evaluating expert user interfaces

Review of Project 2

Grades

Outline


Project 3


Deliverable:
– Final Report only

Due date:
– In-class students: Nov 13th, 11:59pm (Friday)
– Distance learning: Nov 17th, 11:59pm (Tuesday)

Report for Project 3


Executive Summary
– Study background: what, when, how, etc.
– Summary of findings

Introduction (incl. objectives)

Method
– Participants
– Materials (or Stimuli)
– Procedure

Findings
– Quantitative results (with statistics) + “story” (i.e., interpretation of what the results mean)

Recommendations (if applicable)

Report Sections


This is a good time to ask.

Questions about Project 3?


Recap: Step-by-Step Stats for A-B Comparisons


Summary

For time on task, ratings, and number of errors data:
– If you have a between-subjects design, use an unpaired t-test (aka “independent samples t-test”)
  • DF (degrees of freedom) = number of participants in the two groups combined, minus 2
– If you have a within-subjects design, use a paired t-test (aka “dependent samples t-test” or “repeated measures t-test”)
  • DF (degrees of freedom) = number of participants, minus 1

For task completion (success/fail) data:
– Use Fisher’s Exact Test (a code sketch of all three tests follows)
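For anyone running these in code instead of the Stats Usability Pak spreadsheet, here is a minimal sketch in Python with scipy.stats; all data values are invented for illustration, not from the study:

    # Minimal sketch of the three tests above using scipy.stats.
    from scipy import stats

    # Time on task (seconds) for versions A and B (invented data).
    time_a = [31, 28, 40, 35, 29]
    time_b = [24, 22, 30, 27, 21]

    # Between-subjects design: two independent groups -> unpaired t-test.
    # DF = n_A + n_B - 2
    t_unpaired, p_unpaired = stats.ttest_ind(time_a, time_b)

    # Within-subjects design: the same participants saw both versions
    # -> paired t-test. DF = number of participants - 1
    t_paired, p_paired = stats.ttest_rel(time_a, time_b)

    # Task completion (success/fail): 2x2 counts -> Fisher's Exact Test.
    table = [[20, 9],   # version A: 20 succeeded, 9 failed (invented)
             [25, 4]]   # version B: 25 succeeded, 4 failed (invented)
    odds_ratio, p_fisher = stats.fisher_exact(table)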


Recap: Step-by-Step Stats for A-B Comparisons (Within-Subjects Design)


New vs. old package inserts

Search tasks, e.g.:
– Question: How many drops of control are used in the sample cup?
– Correct answer: 3 drops

Study design
– Within-subjects (all participants saw both versions)

Measures:
– Task completion (success/fail)
– Time on task (per task)
– Ease of use ratings (per task)

Within-Subjects Design: Study Background

[Images: the old insert (left) and the new insert (right)]


Within-Subjects Design: The Data

These findings now need to make it to your report.

What are your next steps?


Within-Subjects Design: Next Steps

1. ?

2. ?

3. ?

4. ?

5. ?

6. ?

7. ?

8. ?

9. ?

10. ?

Form groups of 2 – 4 people

Write down your steps (plan of action) – “What are you going to do with these findings before you hand them in to the stakeholders (in a report)?”

No need to compute anything at this point

You have 5 - 7 minutes



Within-Subjects Design: Step 1

1. Calculate the averages to get a sense for what the data is telling you.
– But…


Within-Subjects Design: Step 2

2. Delete time on task and ease of use data for participants who failed.
– If within-subjects design, delete all data for participants who failed A and/or B (at least one).
– [If between-subjects design, keep all correct data.]


Within-Subjects Design: Steps 3 and 4

3. Open Stats Usability Pak spreadsheet (or another statistical package).

4. Decide on the alpha level (before running any statistics).
– .01?
– .05? (most common)
– .1?


Within-Subjects Design: Step 5

5. Select an appropriate test for task completion (success rate).
– Run Fisher’s Exact Test.
– Write down the important number(s).
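A minimal sketch of this step with scipy.stats.fisher_exact, assuming hypothetical counts of 20/29 and 25/29 picked only to match the 69% and 86% rates shown on the slide:

    from scipy.stats import fisher_exact

    #            success  fail
    table = [[20, 9],   # old insert: 20/29 ~ 69% completion (hypothetical counts)
             [25, 4]]   # new insert: 25/29 ~ 86% completion (hypothetical counts)
    odds_ratio, p_value = fisher_exact(table)
    print(f"Fisher's Exact Test: p = {p_value:.3f}")  # compare against alpha = .05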


Within-Subjects Design: Step 6

6. Select an appropriate test for the time on task measure.
– Run a paired t-test.
– Write down the important number(s).

Why is the sample size only 17?

Why do we need this number?

Degrees of freedom!
DF for within-subjects design = n – 1
DF = 17 – 1 = 16
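A sketch of this step with scipy.stats.ttest_rel; the time values below are invented, but there are 17 per condition so that DF = 17 – 1 = 16 as above:

    from scipy.stats import ttest_rel

    # One time-on-task value per participant, per insert version (invented data).
    old_times = [33, 41, 25, 38, 29, 47, 31, 36, 22, 40, 35, 28, 44, 30, 26, 39, 32]
    new_times = [24, 30, 19, 27, 21, 35, 22, 26, 17, 29, 25, 20, 33, 22, 18, 28, 23]

    t_stat, p_value = ttest_rel(old_times, new_times)
    df = len(old_times) - 1  # paired t-test: DF = n - 1 = 16
    print(f"t({df}) = {t_stat:.2f}, p = {p_value:.3f}")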


Within-Subjects Design: Step 7

7. Select an appropriate test for the ease of use ratings.
– Run a paired t-test.
– Write down the important number(s).


DF = 16


Within-Subjects Design: Step 8

8. Describe your findings for task completion (success/fail).
– We conducted Fisher’s Exact Test to compare the task completion rate with the old insert (69%) to the task completion rate with the new insert (86%) and found no statistically significant difference at alpha level .05.

But what if there was a difference?
• We conducted Fisher’s Exact Test to compare the task completion rate with the old insert (69%) to the task completion rate with the new insert (93%). The completion rates were significantly different (p < .05), such that more participants successfully completed the task when using the new insert than when using the old insert.


Within-Subjects Design: Step 9

9. Describe your findings for time on task.
– We conducted a paired t-test to compare the time on task between the two insert versions. We found a significant difference (t(16) = 2.19, p < .05), such that participants took longer to complete the task when using the old insert (M = 33.1 s, SD = 12.8 s) than when using the new insert (M = 23.6 s, SD = 12.5 s).


Within-Subjects Design: Step 10

10. Describe your findings for ease of use ratings.
– We also conducted a paired t-test to compare the ease of use ratings participants assigned to the two insert versions. There was no significant difference between the ease of use ratings for the old insert (M = 5.6, SD = 1.2) and the new insert (M = 6.0, SD = 0.9) at alpha level .05.


Recap: Step-by-Step Stats for A-B Comparisons (Between-Subjects Design)


Between-Subjects Design: The Data

Now let’s pretend that we had a between-subjects design:
– One group of participants used the old insert and another group used the new insert:


Between-Subjects Design: Step 1

1. Calculate the averages to get a sense for what the data is telling you.
– But…


Between-Subjects Design: Step 2

2. Delete time on task and ease of use data for participants who failed.


Between-Subjects Design: Steps 3 and 4

3. Open Stats Usability Pak spreadsheet (or another statistical package).

4. Decide on the alpha level (before running any statistics).
– .01?
– .05? (most common)
– .1?


Between-Subjects Design: Step 5

5. Select an appropriate test for task completion (success rate).
– Run Fisher’s Exact Test.
– Write down the important number(s).


Between-Subjects Design: Step 6

6. Select an appropriate test for the time on task measure.
– Run an independent samples t-test.
– Write down the important number(s).


DF = (20 + 25) – 2 = 43
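The same comparison done between-subjects with scipy.stats.ttest_ind; the data here are synthesized with numpy to mirror the slide's group sizes (20 and 25), which is what yields DF = 43:

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    # Invented data sized like the example: 20 old-insert users, 25 new-insert users.
    old_times = rng.normal(loc=33.5, scale=12.6, size=20)
    new_times = rng.normal(loc=24.3, scale=11.7, size=25)

    t_stat, p_value = ttest_ind(old_times, new_times)
    df = len(old_times) + len(new_times) - 2  # (20 + 25) - 2 = 43
    print(f"t({df}) = {t_stat:.2f}, p = {p_value:.3f}")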


Between-Subjects Design: Step 7


7. Select an appropriate test for the ease of use ratings.
– Run an independent samples t-test.
– Write down the important number(s).


DF = (20 + 25) – 2 = 43


Between-Subjects Design: Step 8

8. Describe your findings for task completion (success/fail).
– We conducted Fisher’s Exact Test to compare the task completion rate with the old insert (69%) to the task completion rate with the new insert (86%) and found no statistically significant difference at alpha level .05.

But what if there was a difference?
• We conducted Fisher’s Exact Test to compare the task completion rate with the old insert (69%) to the task completion rate with the new insert (93%). The completion rates were significantly different (p < .05), such that more participants successfully completed the task when using the new insert than when using the old insert.


Between-Subjects Design: Step 9

9. Describe your findings for time on task.
– We conducted an independent samples t-test to compare the time on task between the two insert versions. We found a significant difference (t(43) = -2.50, p < .05), such that participants who used the old insert took longer to complete the task (M = 33.5 s, SD = 12.6 s) than participants who used the new insert (M = 24.3 s, SD = 11.7 s).


Between-Subjects Design: Step 10

10. Describe your findings for ease of use ratings.
– We also conducted an independent samples t-test to compare the ease of use ratings participants assigned to the two insert versions. There was no significant difference between the ease of use ratings for the old insert (M = 5.6, SD = 1.1) and the new insert (M = 6.0, SD = 0.8) at alpha level .05.


Case Study: Pick Best-In-Class MP3 Player


Pick a Device, Any Device


Understand the user experience related to popular MP3 devices

Identify usability issues

Recommend possible solutions to improve the UI

Any concerns?

Seemed reasonable to us.

Research Objectives


What should we do?

What are we measuring?

Who are the participants?

How long is each session?

How many participants?

What Makes Sense?


What should we do?
– Formative research, but on several competitive products
– Iteratively usability test a UI

What are we measuring?
– Success/fail, ease of use, usability issues, etc.

Who are the participants?
– Depends on market demographics

How long is each session?
– Depends on what we test

How many participants?
– Probably small samples, because this is formative

Our Thoughts…


On market at the time…

Five MP3 Players


21 Tasks of Interest were selected
– High Frequency of Use (“Play Song”)
– Priority (“Create Playlist”)

5 User Interfaces
– Four Competitors
– One Client Design (Echo)

Stakeholder Had More Information…

Alpha Bravo Charlie Delta Echo


Task x Device Matrix

What do you make of this?


Ensure that the new design will be best-of-breed relative to the competition

Oh, we failed to mention:
– This is extremely high profile
– Data will drive strategy for entire organization and all products
– Report will go directly to C-level executives…

And oh, did I mention that we’re gonna need the results to be statistically significant…?

During Kickoff Meeting, More Was Revealed…


21 Tasks x 5 Designs = 105 Combinations to test

Can we run a within-subjects design methodology?
– Why or why not?
  • Learning
  • Fatigue

Can we run a between-subjects design methodology?

What Concerns You?


Found out that not all tasks were possible on each device!!!

When We Looked Deeper…


They analyzed the task flows and wanted to add another variant UI

Goal Is Best of Breed, So One More Change


What Else Can I Say?


Our Approach

Now, we did convince the client that:
– UC would design the study such that it would be sensitive enough to detect statistical significance if it did indeed exist

Thus, there would be NO a priori assurances of finding significant differences!

We were charged with:
– Research activities
– Methodology that would provide data to justify design direction

A device was going to be built…


Core Elements to Test

Core elements to test
– Access points (navigating to a feature is easy)
– Feature task flows (completing task may be hard)
– Design look and feel
– Iconography
– Verbiage


Research Program Involved

Testing functional task flows

Testing verbiage (navigation and features)

Testing access

Testing iconography

Testing graphic design treatments

Re-test of updated user interface (all elements)


Focusing on Testing Functional Task Flows

Thoughts on considerations for 21 Tasks x 6 Designs?

Areas for bias?
– Order
  • Task
  • Device
  • Combination
– Device hardware


Minimize Bias

Realistically, participants could
– Only interact with 2-3 designs effectively (not 5-7)
– Complete roughly 6 tasks each (our assumption)

Create prototypes on a computer
– Level the “playing field” to core task flow elements

Usability testing / quantitative data collection
– Recruited target demographic (incl. high schools)
– Needed simultaneous test teams
– In the end, we really needed to know the “story”


Approach: Create Blocks

Sheer size of all possible combinations required blocking into groups
– Designs x Tasks combinations

Within each block…

Order of task presentation
– Tasks were systematically counterbalanced to reduce learning and order effects

Order of device presentation
– For each participant, devices were randomized within each task to reduce learning and order effects (a sketch of this follows)
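The deck does not show how the counterbalancing and randomization were implemented; the following is a hypothetical Python sketch of the idea, with placeholder task IDs and device names:

    import random

    tasks = ["4", "6", "11", "15", "17"]     # one block's tasks (placeholder IDs)
    devices = ["Alpha", "Bravo", "Foxtrot"]  # this participant's three devices

    def schedule_for(participant_index):
        # Rotate the task list so each participant starts at a different task
        # (one simple counterbalancing scheme; the study's actual scheme may differ).
        offset = participant_index % len(tasks)
        ordered_tasks = tasks[offset:] + tasks[:offset]
        plan = []
        for task in ordered_tasks:
            order = devices[:]
            random.shuffle(order)  # device order re-randomized within each task
            plan.extend((task, device) for device in order)
        return plan

Calling schedule_for(0) would list that participant's 15 task x device trials (5 tasks x 3 devices) in presentation order.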


Use of Blocks

Participants assigned to blocks
– Individual participant device x task combinations were
  • Randomized
  • Assigned to participants to reduce learning effects
– Constraint: Familiarity biases were avoided
  • In this sanitized example, iPod owners did not test their iPods

Blocks of four to six tasks were formed to roughly control for total number of steps to complete


Experimental Design

Design
– Each participant received five or six (of the 21) tasks by three (of the six) devices
  • Six devices (A/B/C/D/E/F) taken three at a time yields a total of 20 unique device combinations per task (e.g., ABC, ABD, ABE … DEF)
  • A participant was matched to each unique three-device combination
  • Hence, to complete a full block, 20 participants were needed (see the sketch below)

Each device by task cell was represented with at least 10 participants (N = 80 to have four blocks of 20)
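The 20-combination figure is simply 6 choose 3; a short itertools sketch both verifies the count and enumerates one block's participant-to-combination assignments (device names follow the deck's aliases):

    from itertools import combinations

    devices = ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot"]

    # Six devices taken three at a time: C(6, 3) = 20 unique combinations,
    # i.e., one per participant in a full block of 20.
    combos = list(combinations(devices, 3))
    assert len(combos) == 20

    for participant, combo in enumerate(combos, start=1):
        print(participant, combo)  # e.g., 1 ('Alpha', 'Bravo', 'Charlie')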


Example of Participants #1 and #2

Participant #1: Julian (16-24 years)    Participant #2: Tracy O. (25-44 years)
task 11, device Bravo                   task 16, device Bravo
task 11, device Foxtrot                 task 16, device Delta
(n/a)                                   task 16, device Charlie
task 6, device Alpha                    task 13, device Charlie
task 6, device Foxtrot                  task 13, device Delta
task 6, device Bravo                    task 13, device Bravo
task 4, device Bravo                    task 8, device Delta
task 4, device Foxtrot                  task 8, device Bravo
task 4, device Alpha                    task 8, device Charlie
task 15, device Foxtrot                 task 18, device Bravo
task 15, device Bravo                   task 18, device Delta
(n/a)                                   (n/a)
task 17, device Foxtrot                 task 9, device Charlie
task 17, device Bravo                   task 9, device Bravo
task 17, device Alpha                   task 9, device Delta


Measures

Time-on-Task

Efficiency (Deviation from Optimal Path)
– Total screens viewed / optimal path for the task (see the sketch below)
  • More incorrect “steps” increase this metric

Success
– % of participants in each cell (device x task) who successfully completed the task

Preference
– Pair-wise device preferences for a particular task with a magnitude judgment
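A minimal sketch of the efficiency and success measures as defined above; the function names and inputs are illustrative, not the study's actual logging:

    def efficiency(screens_viewed, optimal_path_screens):
        # Deviation from optimal path: 1.0 means the participant took the
        # optimal path; incorrect "steps" push the ratio above 1.0.
        return screens_viewed / optimal_path_screens

    def success_rate(outcomes):
        # % of participants in a device x task cell who completed the task,
        # where outcomes is a list of booleans (True = success).
        return 100 * sum(outcomes) / len(outcomes)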


Task-by-Task Analysis


Need More Than Simply Quant

We anticipated that the results would not be so clear cut as to identify a single design and declare a winner

On a particular task, we might find that Device Foxtrot was #1
– But, what made Device Echo #2?

We need to know the “why”
– Partitioned participants (N = 16) for qualitative UT
– Other participants (N = 80) were sufficient for quantitative


Qualitative Data: User Comments


Interesting Points

Start with the high frequency / high priority tasks…
– Why did the Foxtrot design win?
– Why did the Alpha design lose?
– Compare quantitative with qualitative

Complex tasks
– The fastest time was not the best
– More clicks, fewer errors, high satisfaction

Sometimes winners would emerge for different reasons…
– How do you weigh different UI conventions?


Lessons Learned

“Know the story”
– Benefit of qualitative data
– Absolute “must have”
– Asked for N = 100, and pushed some to qualitative

Learning
– Counterbalancing was sufficient
– But, these are not walk-up-and-use devices…

Avoid “Frankenstein” design
– Do not simply pick the winning task flow and implement
– Consistency matters!


Other Usability Evaluation Methods


FGs are structured group interviews that quickly and inexpensively reveal a target audience’s desires, experiences, reactions, attitudes...
– Mostly used in market research.
– Capture opinions rather than behavior.

However, FGs can sometimes be used for usability evaluation.
– But the procedure and questions should be modified.

Focus Groups (FGs)


If it is a new product:
– Participants interact with a product and then discuss their difficulties as a group.
  • You can have a few simultaneous one-on-one sessions followed by a FG.
  • You can also have activities during FGs.

Benefits:
– Participants are not just watching the moderator use the product (as they do in traditional FGs) but get to experience it first hand.
– Group discussion can provide more feedback and richer feedback in less time than several individual UTs.

Focus Groups


If it is an existing product:
– Participants who have experience interacting with the product are invited to discuss their difficulties in a FG.
  • E.g., pharmacists come to discuss their issues with particular types of blister packs. They also get to interact with the packs (and a few prototype packs) during the session.

Benefits:
– Group discussion can provide more feedback and richer feedback in less time than several individual UTs.

Focus Groups


Usability testing assesses walk-up-and-use usability, not actual use
– Assess learning, usage, and satisfaction over time

“Drop and Soak” studies
– Provide device and let soak
– Transition from initial use to familiar/experienced use

Repeat user testing studies
– Bring users back multiple times to assess evolution
– Multiple usability tests with same participants
– Iterations and evolution

User group
– Rinse, wash, repeat

Sidebar: Why longitudinal research is rarely seen

Longitudinal


Gap Between Web Analytics and Surveys

Web Analytics: what customers do

Surveys: who customers are; what customers say

Automated Testing: combines analytics and surveys


Traditional online surveys measure satisfaction.
– Collect quantitative measures.
– Obtain attitudinal scores.

But, there is no connection to the experience.

Interactive surveys (remote unmoderated testing)
– Task-based research using online panels (Keynote, UserZoom, etc.)
– True-intent research using actual site visitors (LEOtrace)

Remote Unmoderated Testing


Task-Based Web Survey


Heatmaps of Clicks


Scorecards

Navigation menu through the modules defined

The homepage highlights the most important information from each module

Different colors illustrate different results

Comparative analysis


Evaluating Expert User Interfaces


The user of the interface is an expert
– User has extensive experience with the application
– User uses the application a lot
  • One could argue MS Word is an example

Exercise
– Break into groups
– Evaluate usability of this expert interface
– Customer relationship management (CRM) system

Not Walk Up and Use


Customer care information
– Details on a single customer for residential service

Exercise: Review This Interface…


As with any customer, knowing the “history” can help
– User clicks Notes on left to open Interaction Notes

Example: History


From Interaction Notes, individual notes can be reviewed

Example: Notes


Call centers
– Approx. 100 calls per day
– Typical environment involves cubicle farms
– Stand and ask questions
– Multi-tasking
– Time pressure
– Possible sales incentives in effect
– Rapid consumption of screens

Does this change anything?!?!

Does the evaluation differ?

Example: Context


Quite a unique user group

Demonstrate over-learned behaviors
– High number of transactions, high volume of calls
– Rote memorization of commands and actions

Emphasis is all about their workflow
– They make transactions quickly across multiple systems
– In most cases, they do not need to look at the entire screen
  • Consider adding a new service: the first few screens may be irrelevant, as users are observed clicking for navigation and not for information
– Introduction of any new system can have grave impacts

Reality:
– Traditional walk-up-and-use methods may be insufficient for this audience and this type of interface

Expert Users


Take Another 5 Minutes…


Findings…?

Main Screen


Findings…?

Interaction Notes


Emphasis on pretty and “easily trainable”
– But, this is not a walk-up-and-use application

Design issues
– Fitts’ Law

Workflow
– More effort to get the same information

What is the impact?
– Will users even click?
– Poor knowledge of user history
  • Impact?

Core Issues


Change can be hard
– Especially if this represents a paradigm shift from accessing features and sections to a single view of the customer
  • It seems logical that a single point of view is better, but this assumes a thorough understanding of the workflow across a day
  • However, theoretical savings on a spreadsheet do not often equate to benefits in the real world
– Users multi-task

Adoption and usage are key drivers…
– How should you test these interfaces?!?!

Introduction of a New Application


Usability tests uncovered issues on a page-by-page and task-level basis
– Areas such as time, content comprehension, navigation, layout, labeling, and functionality
– UT is exceptional at targeting page-level details

However, usage adoption requires more than just page-level data
– How this system fits into workflow, or “I won’t use it”
– UT participants were so task-focused
– No mention of high-level impacts to their job and daily tasks

Focus groups for interactive applications lead to “spoon feeding”

Try two-person UTs
– Workflow becomes an issue as one watches the other work…

Discussion of Methods


Review of Project 2


Executive summary
– Context + main themes / key findings
– Don’t forget to mention the positives
– Issues: too much detail or too high-level

Objectives, Methods, Findings, Recommendations
– “Dangling objectives” (no findings address them)
– “Dangling methods” (no corresponding findings or objectives)
– “Dangling findings” (no corresponding objectives or method)

Describing findings
– Description of a participant difficulty/error (task/user-oriented).
– Description of the source of the difficulty (product-oriented).
  • What in the interface caused the difficulty?

Themes


Within-subjects design?

“Participants” vs. “users”

“Formative study”

Visual references
– Video clips

Themes


Scores

[Chart: score distribution across grades A, B, B-, C+, C, D, F]


Grades


15% Project 1: Expert evaluation

25% Project 2: Formative usability study

20% Project 3: Quantitative comparison study

10% Take-home midterm quiz

20% Final exam

10% Individual contribution to projects

Grading


Project 1

[Chart: Project 1 grade distribution across A, A-, B, B-, C+, C, D, F]


Project 2

[Chart: Project 2 grade distribution across A, B, B-, C+, C, D, F]


A: 10 and 9.5
A-: 9
B+: 8.5
B: 8
B-: 7.5
C+: 7
C: 6.5
C-: 6
D+: 5.5
D: 5

Midterm