CSEP 546: Data Mining

55

Click here to load reader

description

CSEP 546: Data Mining. Instructor: Pedro Domingos. Program for Today. Rule induction Propositional First-order First project. Rule Induction. First Project: Clickstream Mining. Overview. The Gazelle site Data collection Data pre-processing KDD Cup Hints and findings. - PowerPoint PPT Presentation

Transcript of CSEP 546: Data Mining

Page 1: CSEP 546: Data Mining

CSEP 546: Data Mining

Instructor: Pedro Domingos

Page 2: CSEP 546: Data Mining

Program for Today

• Rule induction– Propositional– First-order

• First project

Page 3: CSEP 546: Data Mining

Rule Induction

Page 4: CSEP 546: Data Mining
Page 5: CSEP 546: Data Mining
Page 6: CSEP 546: Data Mining
Page 7: CSEP 546: Data Mining
Page 8: CSEP 546: Data Mining
Page 9: CSEP 546: Data Mining
Page 10: CSEP 546: Data Mining
Page 11: CSEP 546: Data Mining
Page 12: CSEP 546: Data Mining
Page 13: CSEP 546: Data Mining
Page 14: CSEP 546: Data Mining
Page 15: CSEP 546: Data Mining
Page 16: CSEP 546: Data Mining
Page 17: CSEP 546: Data Mining
Page 18: CSEP 546: Data Mining
Page 19: CSEP 546: Data Mining
Page 20: CSEP 546: Data Mining
Page 21: CSEP 546: Data Mining
Page 22: CSEP 546: Data Mining
Page 23: CSEP 546: Data Mining
Page 24: CSEP 546: Data Mining
Page 25: CSEP 546: Data Mining
Page 26: CSEP 546: Data Mining
Page 27: CSEP 546: Data Mining
Page 28: CSEP 546: Data Mining
Page 29: CSEP 546: Data Mining
Page 30: CSEP 546: Data Mining
Page 31: CSEP 546: Data Mining
Page 32: CSEP 546: Data Mining
Page 33: CSEP 546: Data Mining
Page 34: CSEP 546: Data Mining

First Project:Clickstream Mining

Page 35: CSEP 546: Data Mining

Overview

• The Gazelle site

• Data collection

• Data pre-processing

• KDD Cup

• Hints and findings

Page 36: CSEP 546: Data Mining

The Gazelle Site

• Gazelle.com was a legwear and legcareweb retailer.

• Soft-launch: Jan 30, 2000• Hard-launch: Feb 29, 2000

with an Ally McBeal TV ad on 28thand strong $10 off promotion

• Training set: 2 months• Test sets: one month

(split into two test sets)

Page 37: CSEP 546: Data Mining

Data Collection

• Site was running Blue Martini’s Customer Interaction System version 2.0

• Data collected includes:– Clickstreams

• Session: date/time, cookie, browser, visit count, referrer• Page views: URL, processing time, product, assortment

(assortment is a collection of products, such as back to school)

– Order information• Order header: customer, date/time, discount, tax, shipping.• Order line: quantity, price, assortment

– Registration form: questionnaire responses

Page 38: CSEP 546: Data Mining

Data Pre-Processing

• Acxiom enhancements: age, gender, marital status, vehicle type, own/rent home, etc.

• Keynote records (about 250,000) removed. They hit the home page 3 times a minute, 24 hours.

• Personal information removed, including: Names, addresses, login, credit card, phones, host name/IP, verification question/answer. Cookie, e-mail obfuscated.

• Test users removed based on multiple criteria (e.g., credit card) not available to participants

• Original data and aggregated data (to session level) were provided

Page 39: CSEP 546: Data Mining

KDD Cup Questions

1. Will visitor leave after this page?

2. Which brands will visitor view?

3. Who are the heavy spenders?

4. Insights on Question 1

5. Insights on Question 2

Page 40: CSEP 546: Data Mining

KDD Cup Statistics

• 170 requests for data

• 31 submissions

• 200 person/hours per submission (max 900)

• Teams of 1-13 people (typically 2-3)

Page 41: CSEP 546: Data Mining

Algorithms Tried vs Submitted

0

2

4

6

8

10

12

14

16

18

20

Decisi

on T

rees

Neare

st Neig

hbor

Assoc

iatio

n Rule

s

Decisi

on R

ules

Boost

ing

Naïve

Bay

es

Seque

nce

Analys

is

Neura

l Net

work

SVM

Logis

tic R

egre

ssio

n

Linea

r Reg

ress

ion

Genet

ic Pro

gram

min

g

Cluste

ring

Baggi

ng

Bayes

ion

Belief

Net

Decisi

on T

able

Mar

kov

Mod

els

Algorithm

En

trie

s

Tried

Submitted

Decision trees most widely tried and by far themost commonly submitted

Note: statistics from final submitters only

Page 42: CSEP 546: Data Mining

Evaluation Criteria

• Accuracy (or score) was measured for the two questions with test sets

• Insight questions judged with help of retail experts from Gazelle and Blue Martini

• Created a list of insights from all participants– Each insight was given a weight

– Each participant was scored on all insights

– Additional factors: presentation quality, correctness

Page 43: CSEP 546: Data Mining

Question: Who Will Leave• Given set of page views, will visitor view

another page on site or leave?Hard prediction task because most sessions are of length 1. Gains chart for sessions longer than 5 is excellent.

Cumulative Gains Chart for Sessions >= 5 Clicks

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

90.00%

100.00%

0%

10

%

20

%

30

%

40

%

50

%

60

%

70

%

80

%

90

%

10

0%

X

% c

on

tin

ue 1st

2nd

Random

Optimal

The 10% highest scored sessions account for 43%of target. Lift=4.2

Page 44: CSEP 546: Data Mining

Insight: Who Leaves

• Crawlers, bots, and Gazelle testers– Crawlers hitting single pages were 16% of sessions – Gazelle testers: distinct patterns, referrer file://c:\...

• Referring sites: mycoupons have long sessions, shopnow.com are prone to exit quickly

• Returning visitors' prob. of continuing is double• View of specific products (Oroblue,Levante) causes

abandonment - Actionable• Replenishment pages discourage customers. 32%

leave the site after viewing them - Actionable

Page 45: CSEP 546: Data Mining

Insight: Who Leaves (II)• Probability of leaving decreases with page views

Many many “discoveries” are simply explained by this.E.g.: “viewing 3 different products implies low abandonment”

• Aggregated training set contains clipped sessionsMany competitors computed incorrect statistics

Abandonment ratio

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

90.00%

100.00%

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49

Session length

Per

cen

t ab

and

on

men

t

Unclipped

Training Set

Page 46: CSEP 546: Data Mining

Insight: Who Leaves (III)

• People who register see 22.2 pages on average compared to 3.3 (3.7 without crawlers)

• Free Gift and Welcome templates on first three pages encouraged visitors to stay at site

• Long processing time (> 12 seconds) implies high abandonment - Actionable

• Users who spend less time on the first few pages (session time) tend to have longer session lengths

Page 47: CSEP 546: Data Mining

Question: “Heavy” Spenders

• Characterize visitors who spend more than $12 on an average order at the site

• Small dataset of 3,465 purchases /1,831 customers• Insight question - no test set• Submission requirement:

– Report of up to 1,000 words and 10 graphs– Business users should be able to understand report– Observations should be correct and interesting

average order tax > $2 implies heavy spender

is not interesting nor actionable

Page 48: CSEP 546: Data Mining

Time is a major factor

Total Sales, Discounts, and "Heavy Spenders"

0

500

1000

1500

2000

2500

3000

3500

4000

4500

50001/

27/

00

2/3

/00

2/1

0/00

2/1

7/00

2/2

4/00

3/2

/00

3/9

/00

3/1

6/00

3/2

3/00

3/3

0/00

Order date

$

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

90.00%

100.00%

Percent heavy Discount Order amount

No data

Discounts greaterthan order amount(after discount)

1. Soft Launch

2. Ally McBealad &$10 offpromotion

3. Steady state

Page 49: CSEP 546: Data Mining

Insights (II)• Factors correlating with heavy purchasers:

– Not an AOL user (defined by browser) (browser window too small for layout - poor site design)

– Came to site from print-ad or news, not friends & family(broadcast ads vs. viral marketing)

– Very high and very low income

– Older customers (Acxiom)

– High home market value, owners of luxury vehicles (Acxiom)

– Geographic: Northeast U.S. states

– Repeat visitors (four or more times) - loyalty, replenishment

– Visits to areas of site - personalize differently (lifestyle assortments, leg-care vs. leg-ware)

Ta

r ge

ts e

gm

en

t

Page 50: CSEP 546: Data Mining

Insights (III)Referring site traffic changed dramatically over time.

Graph of relative percentages of top 5 sites

Note spikein traffic

Top Referrers

0%

20%

40%

60%

80%

100%

2/2/

00

2/4/

00

2/6/

00

2/8/

00

2/10

/00

2/12

/00

2/14

/00

2/16

/00

2/18

/00

2/20

/00

2/22

/00

2/24

/00

2/26

/00

2/28

/00

3/1/

00

3/3/

00

3/5/

00

3/7/

00

3/9/

00

3/11

/00

3/13

/00

3/15

/00

3/17

/00

3/19

/00

3/21

/00

3/23

/00

3/25

/00

3/27

/00

3/29

/00

3/31

/00

Session date

Per

cen

t o

f to

p r

efer

rers

0

1000

2000

3000

4000

5000

6000

Fashion Mall Yahoo ShopNow MyCoupons Winnie-cooper Total from top referrers

Yahoo searches for THONGS and Companies/Apparel/Lingerie

FashionMall.com

ShopNow.com

Winnie-Cooper

MyCoupons.com

Page 51: CSEP 546: Data Mining

Insights (IV)• Referrers - establish ad policy based on conversion

rates, not clickthroughs– Overall conversion rate: 0.8% (relatively low)– MyCoupons had 8.2% conversion rate, but low spenders– FashionMall and ShopNow brought 35,000 visitors

Only 23 purchased (0.07% conversion rate!)– What about Winnie-Cooper?

Winnie Cooper is a 31-year-old guy who wears pantyhose and has a pantyhose site.

8,700 visitors came from his site (!).

Actions:

• Make him a celebrity, interview him about

how hard it is for men to buy in stores

• Personalize for XL sizes

Page 52: CSEP 546: Data Mining

Common Mistakes• Insights need support

Rules with high confidence are meaningless when they apply to 4 people

• Dig deeperMany “interesting” insights with interesting explanations were simply identifying periods of the site. For example:– “93% of people who responded that they are purchasing

for others are heavy purchasers.”True, but simply identifying people who registered prior to 2/28, before the form was changed.

– Similarly, “presence of children" (registration form) implies heavy spender.

Page 53: CSEP 546: Data Mining

Example• Agreeing to get e-mail in registration was claimed

to be predictive of heavy spender• It was mostly an indirect predictor of time

(Gazelle changed default for on 2/28 and back on 3/16)Send-email versus heavy-spender

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

90.00%

100.00%

1/31

/00

2/7/

00

2/14

/00

2/21

/00

2/28

/00

3/6/

00

3/13

/00

3/20

/00

3/27

/00

Percent heavy

Percent e-mail

Page 54: CSEP 546: Data Mining

Question: Brand View• Given set of page views, which product brand

will visitor view in remainder of the session? (Hanes, Donna Karan, American Essentials, or none)

• Good gains curves for long sessions (lift of 3.9, 3.4, and 1.3 for three brands at 10% of data).

• Referrer URL is great predictor– FashionMall, Winnie-Cooper are referrers for Hanes, Donna

Karan - different population segments reach these sites– MyCoupons, Tripod, DealFinder are referrers for American

Essentials - AE contains socks, excellent for coupon users

• Previous views of a product imply later views• Few realized Donna Karan only available > Feb 26

Page 55: CSEP 546: Data Mining

Project

• Use C4.5 decision tree learner

• Apply to first question (Who leaves?)

• Improve accuracy by refining data

• Report insights

• Good luck and have fun!