Evaluation

Eyal Ophir

CS 376

4/28/09

Readings

• Methodology Matters (McGrath, 1994)

• Practical Guide to Controlled Experiments on the Web (Kohavi et al., 2007)

Methodology Matters

Methods for Research in the Behavioral and Social Sciences

Different methods have strengths and weaknesses

Tradeoff between generalizability, precision, and realism

Credibility requires consistency, convergence across methods

Study Design

• Find baserates, correlations, or differences

• Randomization of selection, assignment to conditions

• Statistical significance

• Validity (internal, statistical, construct, external)
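
Random assignment to conditions can be sketched in a few lines. This is a minimal illustration, not a procedure from the reading; the function name, the participant IDs, and the condition labels are all assumptions made for the example.

```python
import random

def assign_conditions(participants, conditions, seed=None):
    """Shuffle participants, then deal them round-robin into conditions.

    Shuffling first means no participant trait can systematically
    determine the condition; dealing round-robin keeps group sizes
    within one of each other.
    """
    rng = random.Random(seed)        # seed only to make the example repeatable
    shuffled = list(participants)    # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for index, participant in enumerate(shuffled):
        groups[conditions[index % len(conditions)]].append(participant)
    return groups

# Hypothetical participant IDs and condition labels.
participants = [f"P{i}" for i in range(1, 13)]
groups = assign_conditions(participants, ["control", "treatment"], seed=376)
for condition, members in groups.items():
    print(condition, members)
```

Balanced groups are a design choice here; simple coin-flip assignment is also random but can leave groups unequal in small samples.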

Measures

• Self report

• Trace measures

• Observation (by a visible or hidden observer)

• Archival records (public or private)

Manipulation

• Selection

• Direct intervention

• Induction (indirect intervention: confederates, deception)

Case Study: Multitasking UI

Users play two simultaneous instantiations of a game

Does making the two instantiations visually different make it easier to switch back and forth?

Case Study

• Tradeoffs: Generalizability, Precision, Realism

• Design: baserates, correlations, differences

• Random selection, assignment

• Validity: internal, statistical, construct, external

• Measures: self-report, trace measures, observation, archival records

• Manipulation: selection, intervention, induction

General Question

Has social psychology resisted formal theory, and if so, why?

Practical Guide to Controlled Experiments on the Web

Web Experiments

OEC: Overall Evaluation Criterion

Web Experiments

Hypothesis testing and sample size

• Confidence, power

• Reducing the standard error: sufficiently large sample size, OEC with inherently low variability, reduce variability by excluding irrelevant cases
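
The link between variability and sample size can be made concrete with a common rule of thumb for roughly 95% confidence and 80% power: n = 16σ²/Δ² users per variant, where σ² is the variance of the OEC and Δ is the smallest difference worth detecting. The sketch below assumes a conversion-rate OEC, so σ² = p(1 − p); the function name and the example numbers are illustrative assumptions.

```python
import math

def users_per_variant(variance, min_detectable_diff):
    """Rule-of-thumb sample size per variant: n = 16 * sigma^2 / delta^2.

    Approximates a two-sided test at about 95% confidence and 80% power.
    """
    if min_detectable_diff <= 0:
        raise ValueError("min_detectable_diff must be positive")
    return math.ceil(16 * variance / min_detectable_diff ** 2)

# Assumed example: 5% baseline conversion, detect a 5% relative change.
baseline = 0.05
variance = baseline * (1 - baseline)   # Bernoulli variance p(1 - p)
delta = 0.05 * baseline                # absolute difference of 0.0025
n = users_per_variant(variance, delta)
print(n)                               # about 121,600 users per variant
```

Because Δ is squared, halving the detectable difference roughly quadruples the required sample, which is why a low-variance OEC and the exclusion of irrelevant cases pay off so quickly.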

Web Experiments

Extensions for Online Experiments

• Treatment ramp-up

• Automation

• Software migration

Web Experiments

Limitations of web experiments

• No explanation of mechanism

• Focus on short-term effects

• Primacy/newness

• Must implement treatments

Web Experiments

Implementation

• Randomization: pseudorandom with caching, hash and partition

• Assignment: traffic splitting, server-side, client-side
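
Hash and partition can be sketched as follows: hash the user ID together with an experiment ID, map the hash to a bucket, and split the bucket range between variants. This is a minimal sketch under assumptions; the function name, the ID formats, the bucket count, and the choice of SHA-256 are illustrative, not prescribed by the reading.

```python
import hashlib

def assign_variant(user_id, experiment_id, treatment_fraction=0.5, buckets=1000):
    """Deterministically map a user to 'treatment' or 'control'.

    Including the experiment ID in the hashed key keeps assignments in
    different experiments independent of one another.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % buckets          # bucket in [0, buckets)
    return "treatment" if bucket < treatment_fraction * buckets else "control"

# Hypothetical IDs: the same user always lands in the same variant.
print(assign_variant("user-42", "two-game-ui"))
print(assign_variant("user-42", "two-game-ui"))
```

Unlike pseudorandom assignment with caching, this approach needs no stored state: any server can recompute a user's variant from the IDs alone.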

Lessons Learned (i.e., tips for the researcher): Analysis

• Mine the data

• Time matters

• Multi-factor experiments

Lessons Learned

Trust and Execution

• Run A/A tests (test your system)

• Ramp-up and abort

• Correct sample size

• Assign 50% to treatment

• Beware day of week effects
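
An A/A test gives both groups the same experience, so a sound system should report a significant difference only about as often as the significance level allows (about 5% of runs at α = 0.05). The simulation below is a sketch under assumptions: the two-proportion z-test, the run count, the group size, and the conversion rate are all chosen for illustration.

```python
import math
import random

def two_proportion_p_value(x_a, n_a, x_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0                      # no variation, so no evidence of a difference
    z = (x_a / n_a - x_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

def aa_false_positive_rate(runs=400, n=500, rate=0.1, alpha=0.05, seed=0):
    """Simulate repeated A/A tests and return the share flagged significant."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(runs):
        # Both groups draw conversions from the same underlying rate.
        x_a = sum(rng.random() < rate for _ in range(n))
        x_b = sum(rng.random() < rate for _ in range(n))
        if two_proportion_p_value(x_a, n, x_b, n) < alpha:
            flagged += 1
    return flagged / runs

print(aa_false_positive_rate())
```

A rate far above the significance level on real traffic would point to a problem in randomization, assignment, or logging rather than to a real effect.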

Lessons Learned

Culture and Business

• Agree on OEC upfront

• Beware “harmless” features

• Weigh performance vs. maintenance cost

• Data-driven (vs. opinion-driven) culture

Extended Case Study

Assume the game UI from the first case study was an actual gaming site

The website is interested in promoting multiple simultaneous games between users, but users complain that it’s difficult to manage multiple games

Design a web-based study informed by the reading to test the new design

Case Study

• OEC

• Sample size, reducing error

• Ramp-up, automation

• Mechanism explanation, short vs. long-term effects, primacy/newness

• Randomization/assignment

• Mine the data, multi-factor experiments

• A/A tests, sample size, day of week effects

Data-Oriented Culture

• Pros? Cons?

• How can we best use user tests to inform design and innovation?

• Trade-offs of experimentation vs. intuition

• Why the OEC? What are good measures for non-commerce sites?

• Do online tests maximize all of McGrath’s parameters?
