Focus the mining beacon: lessons and challenges


Transcript of Focus the mining beacon: lessons and challenges

Page 1: Focus the mining beacon: lessons and challenges

Ronny Kohavi, General Manager, Experimentation Platform, Microsoft

Partly based on joint work with Llew Mason, Rajesh Parekh, and Zijian Zheng, Machine Learning, vol. 57, 2004

Focus the Mining Beacon: Lessons and Challenges from the World of E-Commerce

SF Bay ACM Data Mining SIG, 6/13/2006

Page 2: Focus the mining beacon: lessons and challenges


Overview

Background/experience
Business lessons and controlled experiments
Simpson's paradox
Technical lessons
Challenges
Q&A

Page 3: Focus the mining beacon: lessons and challenges


Background (I)

1993-1995: Led development of MLC++, the Machine Learning library in C++ (Stanford University). Implemented or interfaced many ML algorithms. Source code is public domain, used for algorithm comparisons.

1995-1998: Developed and managed MineSet. MineSet™ was a "horizontal" data mining and visualization product at Silicon Graphics, Inc. (SGI). Utilized MLC++. Now owned by Purple Insight. Key insight: customers want simple stuff: Naïve Bayes + visualization.

ICML 1998 keynote: claimed that to be successful, data mining needs to be part of a complete solution in a vertical market. I followed this vision to Blue Martini Software.

A consultant is someone who borrows your razor, charges you by the hour, and learns to shave on your face.

Page 4: Focus the mining beacon: lessons and challenges


Background (II)

1998-2003: Director of Data Mining, then VP of Business Intelligence at Blue Martini Software. Developed an end-to-end e-commerce platform with integrated business intelligence, from collection and extract-transform-load (ETL) to data warehouse, reporting, mining, and visualizations. Analyzed data from over 20 clients. Key insight: collection and ETL worked great, and we found many insights. However, customers mostly just ran the reports/analyses we provided.

2003-2005: Director, Data Mining and Personalization, Amazon. Key insights: (i) simple things work, and (ii) human insight is key.

2005: Microsoft Assistance Platform. Started the Experimentation Platform group 3/2006.

Page 5: Focus the mining beacon: lessons and challenges


Business-level Lessons (I)

Auto-creation of the data warehouse worked very well. At Blue Martini we owned the operational side as well as the analysis; we had a 'DSSGen' process that auto-generated a star-schema data warehouse. This worked very well: for example, if a new customer attribute was added on the operational side, it automatically became available in the data warehouse.

Clients are reluctant to list specific questions. Conduct an interim meeting with basic findings: clients often came up with a long list of questions when faced with basic statistics about their data.

Page 6: Focus the mining beacon: lessons and challenges


Business-level Lessons (II)

Collect business-level data from the operational side. Many things are not observable in weblogs (search information, shopping cart events, registration forms, time to return results). Log more at the app server.

External events: marketing promotions, advertisements, site changes.

Choose to collect as much data as you realistically can, because you do not know what might be relevant for a future question. (Subject to privacy issues, but aggregated/anonymous data is usually OK.)

Page 7: Focus the mining beacon: lessons and challenges


Collection example – Form Errors

Here is a good example of data collection that we introduced without knowing a priori whether it would help: form errors.

If a web form was filled in and a field did not pass validation, we logged the field and the value filled in.

This was the Bluefly home page when they went live

Looking at form errors, we saw thousands of errors every day on this page

Any guesses?

Page 8: Focus the mining beacon: lessons and challenges


Business-level Lessons (III)

Crawl, Walk, Run: do basic reporting first, generate univariate statistics, then use OLAP for hypothesis testing, and only then start asking characterization questions and use data mining algorithms.

Agree on terminology. What is the difference between a visit and a session? How do you define a customer (e.g., did every customer purchase)? How is "top seller" defined when showing best sellers?

Why are the lists from Amazon (left) and Barnes & Noble (right) so different? The answer: no agreed-upon definition of sales rank.

Page 9: Focus the mining beacon: lessons and challenges


Human Intuition is Poor

Many explanations we give for "success" are backwards-looking; hindsight is 20/20. Example: per-capita sales of sunglasses in Seattle vs. LA.

Our intuition at assessing new ideas is usually very poor. We are especially bad at assessing ideas that are not incremental, i.e., radical changes. We commonly confuse ourselves with the target audience. Discoveries that contradict our prior thinking are usually the most interesting.

The next set of slides is a series of examples where you can test your intuition, or your "prior probabilities."

Do you believe in intuition? No, but I have a feeling I might someday.

Page 10: Focus the mining beacon: lessons and challenges


How Priors Fail Us

Warning: graphic image may be disturbing to some people.

We tend to interpret the picture to the left as a serious problem. However, it's just your priors.

Page 11: Focus the mining beacon: lessons and challenges


We are not Used to Seeing Pacifiers with Teeth

Page 12: Focus the mining beacon: lessons and challenges


Checkout Page

Example from Bryan Eisenberg’s article on clickz.com

The conversion rate is the percentage of visits to the website that include a purchase

Which version has a higher conversion rate? Why?

[Two checkout page versions, labeled A and B]

Page 13: Focus the mining beacon: lessons and challenges


Graphics / Color

Which one converts (to search) better?

[Two page versions, labeled A and B]

Source: Marketing Experiments, http://www.marketingexperiments.com

Page 14: Focus the mining beacon: lessons and challenges


Amazon Shopping Cart Recs

Add an item to your shopping cart at a website; most sites then show the cart. At Amazon, Greg Linden had the idea of showing recommendations based on cart items.

Evaluation. Pro: cross-sell more items. Con: distract people from checking out; a VP asked that work on this idea be stopped. As with many new things, it was hard to decide, so an A/B test was run.

From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

The idea was great. As many of you know from experience, this feature is live on the site.

Page 15: Focus the mining beacon: lessons and challenges


Office Online

Small UI changes can make a big difference. Example from Microsoft Help: when reading help (from the product or the web), you have an option to give feedback.

Page 16: Focus the mining beacon: lessons and challenges


Office Online Feedback

[Two feedback widgets, labeled A and B]

Feedback A puts everything together, whereas feedback B is two-stage: the question follows the rating.

Feedback A just has 5 stars, whereas B annotates the stars from "Not helpful" to "Very helpful" and makes them lighter.

Which one has a higher response rate? By how much?

Page 17: Focus the mining beacon: lessons and challenges


Another Feedback Variant

Call this variant C. Like B, it is also two-stage. Which one has a higher response rate, B or C?

[Feedback widget, labeled C]

Page 18: Focus the mining beacon: lessons and challenges


Twyman’s Law

Any statistic that appears interesting is almost certainly a mistake.

Validate "amazing" discoveries in different ways; they are usually the result of a business process. Example: 5% of customers were born on the exact same day (including year).
o 11/11/11 is the easiest way to satisfy the mandatory birth date field

For US and European web sites, there will be a small sales increase on Oct 29th, 2006 (daylight saving time ends that day in both regions, giving the day an extra hour).

Page 19: Focus the mining beacon: lessons and challenges


Twyman’s Law (II)

KDD Cup 2000: customers who were willing to receive e-mail correlated with heavy spenders (the target variable).
o The default for the registration question was changed from "yes" to "no" on 2/28
o When it was realized that few were opting in, the default was changed back
o This coincided with a $10 discount off every purchase
o Lots of participants found this spurious correlation, but it was terrible for predictions on the test set

Sites go through phases (launches) and multiple things change together.

[Chart: Percentage of Customers (0%-100%) by date, 2/1 through 3/28, with two series: Heavy Spenders and Accepts Email]

Page 20: Focus the mining beacon: lessons and challenges


Interrupt: Key Takeaways

Every talk (hopefully) has a few key points to take away. Here are two from this talk:

Encourage controlled experiments (A/B tests)
o The previous examples should have convinced you that our intuition is poor and we need to experiment to get data

Simpson's paradox
o Lack of awareness of the phenomenon can lead to mistaken conclusions
o Unlike esoteric brain teasers, it happens in real life
o In the next few slides I'll share examples that seem "impossible"
o We'll then explain why they are possible and do happen
o We'll discuss implications/warnings

Page 21: Focus the mining beacon: lessons and challenges


Example 1: Drug Treatment

A real-life example on kidney stone treatments. Overall success rates: Treatment A succeeded 78%, Treatment B succeeded 83% (better).

Further analysis splits the population by stone size:
For small stones, Treatment A succeeded 93% (better), Treatment B succeeded 83%.
For large stones, Treatment A succeeded 73% (better), Treatment B succeeded 69%.

Hence Treatment A is better in both cases, yet was worse in total. People going into treatment have either small stones or large stones.

A similar real-life example happened when the two population segments were cities (A was better in each city, but worse overall).

Adapted from the Wikipedia article on Simpson's paradox.

Page 22: Focus the mining beacon: lessons and challenges


Example 2: Sex Bias?

Adapted from real data for UC Berkeley admissions, where women claimed sex discrimination: only 34% of women were accepted, while 44% of men were accepted.

Segmenting by department to isolate the bias, they found that all departments accepted a higher percentage of women applicants than men applicants. (If anything, there is a slight bias in favor of women!)

There is no conflict in the above statements. It's possible, and it happened.

Bickel, P. J., Hammel, E. A., and O'Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187, 398-404.

Page 23: Focus the mining beacon: lessons and challenges


Example 3: Purchase Channels

A real example from a Blue Martini customer. We plotted the average customer spending for customers purchasing on the web only versus "on the web and offline (POS)" (multi-channel), segmented by the number of purchases per customer.

In all segments, multi-channel customers spent less. However, just as shop.org predicted, when ignoring the segments, multi-channel customers spent more on average.

"Multichannel customers spend 72% more per year than single channel customers" -- State of Retailing Online, shop.org

[Chart: Customer Average Spending ($0-$2000) by number of purchases (1, 2, 3, 4, 5, >5), comparing multi-channel and web-channel-only customers]

Page 24: Focus the mining beacon: lessons and challenges


Last Example: Batting Average

A baseball example. (For those not familiar with baseball, batting average is the percentage of hits.) One player can hit for a higher batting average than another player during the first half of the year, do so again during the second half, but have a lower batting average for the entire year.

Example:

      First Half         Second Half        Total Season
A     4/10   = 0.400     25/100 = 0.250     29/110 = 0.264
B     35/100 = 0.350     2/10   = 0.200     37/110 = 0.336

The key to the "paradox" is that the segmenting variable (e.g., half of the year) interacts with "success" and with the counts. E.g., "A" was sick and rarely played in the 1st half, then "B" was sick in the 2nd half, but the 1st half was "easier" overall.

Page 25: Focus the mining beacon: lessons and challenges


Not Really a Paradox, Yet Non-Intuitive

If a/b < A/B and c/d < C/D, it is still possible that (a+c)/(b+d) > (A+C)/(B+D).

We are essentially dealing with weighted averages when we combine segments.

Here is a simple example with two treatments. Each cell shows Success / Total = Percent Success. T1 is superior in both segment C1 and segment C2, yet loses overall: C1 is "harder" (lower success for both treatments), and T1 gets tested more in C1.

        T1             T2
C1      2/8  = 25%     1/5  = 20%
C2      4/5  = 80%     6/8  = 75%
Both    6/13 = 46%     7/13 = 54%
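
The arithmetic is easy to verify. Below is a minimal Python sketch (not from the original talk) that recomputes the table above and confirms that T1 wins in each segment yet loses when the segments are pooled:

```python
from fractions import Fraction

# Success/total counts per treatment and segment, from the table above
data = {
    "T1": {"C1": (2, 8), "C2": (4, 5)},
    "T2": {"C1": (1, 5), "C2": (6, 8)},
}

for seg in ("C1", "C2"):
    r1, r2 = Fraction(*data["T1"][seg]), Fraction(*data["T2"][seg])
    print(f"{seg}: T1={float(r1):.0%} T2={float(r2):.0%} T1 better: {r1 > r2}")

# Pooling the segments reverses the conclusion (Simpson's paradox)
t1 = Fraction(sum(s for s, n in data["T1"].values()),
              sum(n for s, n in data["T1"].values()))
t2 = Fraction(sum(s for s, n in data["T2"].values()),
              sum(n for s, n in data["T2"].values()))
print(f"Both: T1={float(t1):.0%} T2={float(t2):.0%} T1 better: {t1 > t2}")
```

Using exact fractions avoids any floating-point ambiguity in the comparisons.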

Page 26: Focus the mining beacon: lessons and challenges


Important, not Just Cool

Why is this so important? In knowledge discovery, we state probabilities (correlations) and associate them with causality: "Treatment T1 works better," "Berkeley discriminates against women."

We must be careful to check for confounding variables. Confounding variables may not be ones we are collecting (e.g., latent/hidden).

Page 27: Focus the mining beacon: lessons and challenges


Controlled Experiments

Multiple names for the same concept: A/B tests, Control/Treatment, controlled experiments, randomized experimental design.

The concept is trivial: randomly split traffic between two versions.
o Control: usually the current live version
o Treatment: the new idea (or multiple)
Collect metrics of interest and analyze (statistical tests, data mining); a minimal analysis sketch follows the diagram below.

The first known controlled experiment was in the 1700s: a British captain noticed a lack of scurvy on Mediterranean ships. He had half the sailors eat limes (treatment) while the other half did not (control). The experiment was so successful that British sailors are still called limeys. Note: success despite no understanding of vitamin C deficiency.

[Diagram: 100% of users are randomly split into 50% Control (existing system) and 50% Treatment (existing system with feature X); user interactions are instrumented, analyzed, and compared at the end of the experiment]
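
To make the analysis step concrete, here is a minimal sketch of how control and treatment metrics might be compared at the end of an experiment. It is an illustration, not the Experimentation Platform's actual code; the counts are hypothetical, and a two-proportion z-test stands in for whatever statistical test is appropriate:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_c, n_c, conv_t, n_t):
    """Two-sided z-test for a difference in conversion rates
    between control and treatment."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_t - p_c, z, p_value

# Hypothetical counts from a 50/50 split
lift, z, p = two_proportion_ztest(conv_c=300, n_c=10000, conv_t=360, n_t=10000)
print(f"lift={lift:.2%}, z={z:.2f}, p={p:.3f}")
```

With 10,000 users per group, a 0.6-point lift in conversion is significant at the usual 5% level; smaller samples would need larger effects.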

Page 28: Focus the mining beacon: lessons and challenges


Advantages of Controlled Experiments

Controlled experiments test for causal relationships, not simply correlations.

They insulate against external factors: problems that plague interrupted time series, such as history, seasonality, and regression (to the mean), impact both versions equally.

They are the standard in FDA drug tests.

But like most great things, there are problems, and it's important to recognize them...

Page 29: Focus the mining beacon: lessons and challenges


Issues with Controlled Experiments (1 of 4)

The org has to agree on key metric(s) to improve. While it may seem obvious that we need to know if we're improving, it's not easy to get clear agreement. If nothing else, bringing this question to the surface is a great benefit to the org!

If you don't know where you are going, any road will take you there. —Lewis Carroll

Page 30: Focus the mining beacon: lessons and challenges


Issues with Controlled Experiments (2 of 4)

Quantitative metrics, not always explanations of "why." For example, we may know that lemons work against scurvy, but not why; it may take a while to understand vitamin C deficiency.

Data mining may help identify segments where the difference is large, leading to better understanding. Usability studies are also useful for explaining.

Short-term vs. long-term: it is hard to assess long-term effects, such as customer abandonment. Example: if you optimize ads for clickthrough revenue, you might plaster the site with ads. Long-term concerns should be part of the metric (e.g., revenue per pixel of real estate on the window).

Page 31: Focus the mining beacon: lessons and challenges


Issues with Controlled Experiments (3 of 4)

Primacy effect: changing navigation in a website may degrade the customer experience (temporarily), even if the new navigation is better. Evaluation may need to focus on new users, or run for a long period.

Multiple experiments: even though the methodology shields an experiment from other changes, statistical variance increases, making it harder to get significant results. It is useful to avoid multiple changes to the same "area." QA also becomes harder when tests interact.

Consistency/contamination: on the web, assignment is usually cookie-based, but people may use multiple computers, erase cookies, etc. Typically a small issue.

Launch events / media announcements sometimes preclude controlled experiments: the journalists need to be shown the "new" version.

Page 32: Focus the mining beacon: lessons and challenges


Issues with Controlled Experiments (4 of 4)

Statistical tests: distributions are far from normal. 97% of sessions do not purchase, so there is a large mass at zero spending.

Proper randomization is required. You cannot run option A on day 1 and option B on day 2; you have to run them in parallel. When running in parallel, you cannot randomize based on IP (e.g., load-balancer randomization), because all of AOL's traffic comes from a few proxy servers. Every customer must have an equal chance of falling into control or treatment and must stick to that group, as in the sketch below.
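
One common way to satisfy both requirements (equal chance, sticky assignment) is to hash a stable user identifier, such as a cookie ID, together with an experiment name. This is a hypothetical sketch of that idea, not the actual platform implementation:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, num_variants: int = 2) -> int:
    """Deterministically map a user to a variant.

    Hashing (experiment, user_id) gives every user an equal chance of
    each variant, keeps the assignment sticky across visits, and avoids
    the pitfalls of day-based or IP/load-balancer splits."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_variants

# The same cookie always lands in the same group for a given experiment
print(assign_variant("cookie-12345", "checkout-redesign"))  # 0=control, 1=treatment
```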

Page 33: Focus the mining beacon: lessons and challenges


Technical Lessons – Cleansing (I)

Auditing data: make sure time-series data exists for the whole period. It is very easy to conclude that this week was bad relative to last week because some data is missing (e.g., a collection bug).

Synchronize clocks across all data collection points. In one example, some servers were set to GMT and others to EST, leading to strange anomalies. Even being a few minutes off can cause add-to-carts to appear "prior" to the search.

Page 34: Focus the mining beacon: lessons and challenges


Technical Lessons – Cleansing (II)

Auditing data (continued): remove test data. QA organizations constantly test the system; make sure that data can be identified and removed from analysis.

Remove robots/bots/spiders: 5-40% of e-commerce site traffic is generated by crawlers, from search engines to students learning Perl. These significantly skew results unless removed; a simple filtering sketch follows.
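
A first-pass filter can be as simple as matching known crawler signatures in the user-agent string. The sketch below is illustrative only (the session fields are hypothetical), and as noted later in this talk, robust detection needs additional heuristics such as IP and JavaScript checks:

```python
import re

# A few common crawler signatures; real lists are much longer
BOT_PATTERN = re.compile(r"bot|crawler|spider|slurp|libwww|perl", re.IGNORECASE)

def is_probable_bot(user_agent: str) -> bool:
    return bool(BOT_PATTERN.search(user_agent or ""))

sessions = [
    {"id": 1, "user_agent": "Mozilla/5.0 (Windows NT 5.1)"},
    {"id": 2, "user_agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
    {"id": 3, "user_agent": "libwww-perl/5.805"},
]
human = [s["id"] for s in sessions if not is_probable_bot(s["user_agent"])]
print(human)  # [1]
```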

Page 35: Focus the mining beacon: lessons and challenges


Data Processing

Utilize hierarchies: generalizations are hard to find when there are many attribute values (e.g., every product has a Stock Keeping Unit number). Collapse such attribute values based on hierarchies.

Remember date/time attributes: they are often ignored, but contain information. Convert them into cyclical attributes, such as hour of day or morning/afternoon/evening, day of week, etc. Compute deltas between such attributes (e.g., ship date minus order date), as in the sketch below.
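
As an illustration of both ideas, the following pandas sketch (column names are hypothetical) derives cyclical attributes and a ship-minus-order delta from raw timestamps:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_ts": pd.to_datetime(["2006-06-09 08:15", "2006-06-10 21:40"]),
    "ship_ts":  pd.to_datetime(["2006-06-12 10:00", "2006-06-13 09:30"]),
})

# Cyclical attributes: hour of day, day of week, coarse day part
orders["hour"] = orders["order_ts"].dt.hour
orders["day_of_week"] = orders["order_ts"].dt.day_name()
orders["day_part"] = pd.cut(orders["hour"], bins=[0, 12, 18, 24], right=False,
                            labels=["morning", "afternoon", "evening"])

# Delta between date/time attributes: ship date minus order date
orders["days_to_ship"] = (orders["ship_ts"] - orders["order_ts"]).dt.days
print(orders[["hour", "day_of_week", "day_part", "days_to_ship"]])
```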

Page 36: Focus the mining beacon: lessons and challenges


Analysis / Model Building

Mine at the right granularity level: to answer questions about customers, we must aggregate clickstreams, purchases, and other information to the customer level. Defining the right transformations and creating summary attributes is the key to success.

Phrase the problem to avoid leaks. A leak is an attribute that "gives away" the label; e.g., heavy spenders pay more sales tax (VAT). Phrasing the problem to avoid leaks is a key insight: instead of asking who is a heavy spender, ask which customers migrate from spending a small amount in period 1 to a large amount in period 2 (see the sketch below).
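
A sketch of this reformulation, using hypothetical column names: aggregate transactions to the customer grain with spending per period, then define the label from the period-1-to-period-2 migration, so features can be restricted to period 1 and no period-2 attribute (like sales tax paid) can leak the label:

```python
import pandas as pd

purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "period":      [1, 2, 1, 1, 2],
    "amount":      [20.0, 450.0, 15.0, 30.0, 25.0],
})

# Aggregate transactions to the customer grain, one spending column per period
spend = purchases.pivot_table(index="customer_id", columns="period",
                              values="amount", aggfunc="sum", fill_value=0)

# Label: migrated from small spending in period 1 to large in period 2.
# Predictive features should be built from period-1 data only.
spend["migrated"] = (spend[1] < 100) & (spend[2] >= 100)
print(spend)
```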

Page 37: Focus the mining beacon: lessons and challenges


Data Visualizations

Picking the right visualization is key to seeing patterns. On the left is traffic by day; note the weekends (but it is hard to see patterns). On the right is a heatmap showing traffic colored from green to yellow to red, utilizing the cyclical nature of the week (going up in columns). It is easy to see the weekends, Labor Day on Sept 3, and the effect of Sept 11. A small sketch reproducing the heatmap idea follows.
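
The heatmap idea is easy to reproduce; here is a small matplotlib sketch on simulated traffic, under assumed layout choices (one column per day, hours going up each column, green-to-red coloring):

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated hourly hit counts for six weeks (42 days x 24 hours)
rng = np.random.default_rng(0)
traffic = rng.poisson(lam=100, size=(42, 24))

# One column per day, hours going up each column; low traffic is green,
# high traffic red, so weekly cycles appear as repeating stripes
plt.imshow(traffic.T, aspect="auto", origin="lower", cmap="RdYlGn_r")
plt.xlabel("Day")
plt.ylabel("Hour of day")
plt.colorbar(label="Hits")
plt.show()
```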

Page 38: Focus the mining beacon: lessons and challenges


Model Visualizations

When we build models for prediction, it is sometimes important to understand them

For MineSet™, we built visualizations for all models

Here is one: Naïve-Bayes / Evidence model (movie)

Page 39: Focus the mining beacon: lessons and challenges


A Real Technical Lesson: Computing Confidence Intervals

In many situations we need to compute confidence intervals, which are simply estimated as: acc_h +- z*stdDev, where acc_h is the estimated mean accuracy, stdDev is the estimated standard deviation, and z is usually 1.96 (for a 95% confidence interval).

This fails miserably for small amounts of data. For example: if you see three coin tosses that all come up heads, the confidence interval for the probability of heads would be [1, 1].

Use a more accurate formula that does not require using stdDev (but still assumes normality). The formula described here is the Wilson score interval:

  ( acc_h + z^2/(2n) +- z * sqrt( acc_h*(1 - acc_h)/n + z^2/(4n^2) ) ) / ( 1 + z^2/n )

It's not used often because it's more complex, but that's what computers are for (see the sketch below). See Kohavi, "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," IJCAI-95.
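
For completeness, a small sketch comparing the two approaches on the three-heads example (the Wilson formula matches the expression given above):

```python
from math import sqrt

def naive_interval(successes, n, z=1.96):
    """acc_h +- z*stdDev; collapses to zero width on extreme samples."""
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval; stays sensible for small n."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Three coin tosses, all heads
print(naive_interval(3, 3))   # (1.0, 1.0) -- useless
print(wilson_interval(3, 3))  # about (0.44, 1.0)
```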

Page 40: Focus the mining beacon: lessons and challenges


Challenges (I)

Finding a way to map business questions to data transformations. Don Chamberlin wrote on the design of SQL: "What we thought we were doing was making it possible for non-programmers to interact with databases." The SQL:1999 standard is now about 1,000 pages. Many operations that are needed for mining are not easy to write in SQL.

Explaining models to users: what are ways to make models more comprehensible? How can association rules be visualized/summarized?

Page 41: Focus the mining beacon: lessons and challenges


Challenges (II)

Dealing with "slowly changing dimensions": customer attributes change (people get married, their children grow, and we need to change recommendations). Product attributes change, or products are packaged differently; new editions of books come out.

Supporting hierarchical attributes.

Deploying models: models are built on constructed attributes in the data warehouse. Translating them back to attributes available on the operational side is an open problem.

For web sites, detecting bots/robots/spiders: detection is based on heuristics (user agent, IP, JavaScript).

Page 42: Focus the mining beacon: lessons and challenges


Challenges (III)

Analyzing and measuring the long-term impact of changes: control/treatment experiments give us short-term value. How do we address the long-term impact of changes?

For non-commerce sites, how do we measure user satisfaction? Example: users hit F1 for help in Microsoft Office and execute a series of queries, browsing through documents. How do we measure satisfaction other than through surveys?

Page 43: Focus the mining beacon: lessons and challenges


Summary

The lessons and challenges here are from e-commerce, but they are likely applicable in other domains.

Think about the problem end-to-end: from collection, transformations, reporting, visualizations, and modeling to taking action.

Beware of hidden variables when concluding causality; think about Simpson's paradox.

Conduct many controlled experiments (A/B tests), because our intuition is poor. Build infrastructure for controlled experiments (this is what my team is now doing at Microsoft).

A copy of this talk is at http://exp-platform.com