A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse...

65
A Practical Look at Data Preparation Jason Brown Jason Brown Cognicase Inc. Cognicase Inc. Data Mining Data Mining IRMAC: Data Warehouse SIG November 5, 2002

Transcript of A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse...

Page 1: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

A Practical Look at Data Preparation

Jason BrownJason BrownCognicase Inc.Cognicase Inc.

Data MiningData Mining

IRMAC: Data Warehouse SIGNovember 5, 2002

Page 2: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

2

Agenda

• Crash Course in Data Mining– What– Why– How

• The virtuous cycle– Data Preparation

• Case Study– Background– Going through the cycle

• Data Preparation

• Q&A

Page 3: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

3

The Crash Course

•What

•Why

•How

Page 4: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

4

Definitions

Data Mining:The process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.

Knowledge Discovery

Data Mining is not Data Warehousing, OLAP etc.

Page 5: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

5

Definitions

Modeling:• Not an ER type Data Model• A data mining model is computational, full of

algorithms• A model can be descriptive or predictive.

– A descriptive model helps in understanding underlying processes or behavior.

– A predictive model uses known values (input) to predict an unknown value (output)

Page 6: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

6

Two Types of Data Mining

• Directed– Know specifically what we are looking for

• Who is likely to respond to our offer?• What our customers going to be worth to us over their lifetime?

– Model is a Black Box

Input Output

Page 7: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

7

Two Types of Data Mining

• Undirected– Not exactly sure what we are looking for

• How should we define our Customer Segments?• What is interesting about all of our point of sale data?

– Model is a Transparent Box

Input Output

Page 8: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

8

Modeling Techniques

Decision Trees

If …….. Then ……..

Rule Induction

Neural Networks

Nearest Neighbour

Clustering

Page 9: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

9

Modeling TechniquesDecision Trees

• The tree is built based on the input of a training data set– Training Data Set is based on historical data– Over sample the data that reflects your question

• Each record of the Model Set is run through the branches of the tree until the record reaches a leaf

Page 10: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

10

Modeling TechniquesDecision Trees

Age < 23

YNIncome < 12 000

N

N

Y

Y

Male ?

28% 37%

Page 11: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

11

Modeling TechniquesNeural Networks

• Neural networks are a nonlinear model -similar to a ‘brain’.

• The network is built based on the input of a training set.

• Model sets run through this network will return accurate results based on the patterns identified in the training set.

• Very Complex

Page 12: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

12

Modeling TechniquesClustering

• Clustering finds groups of records that are similar.

• For example, customers can be clustered by:– income

– age

– ytd revenue

Page 13: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

13

Modeling TechniquesClustering

MaleIncome < 12 000Age < 23Coke Buyers

MaleIncome < 12 000Age < 23Non Coke Buyers

Page 14: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

14

Modeling TechniquesNearest Neighbour

• Model is built based on the input of a training set.

• Classifies a record by calculating the distances between the record criteria and the training data set

• Then it assigns the record to the class that is most common among its nearest neighbours

Page 15: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

15

Modeling TechniquesNearest Neighbour

Records plotted based on:

IncomeGenderAge

Bought Coke

Bought Coke

Bought Coke

Did Not Did Not

Page 16: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

16

Modeling TechniquesRule Induction

• A technique that infers generalizations from the information in the data

IF age < 19 AND purchase is coke THEN 40% purchase chips

• Describes the data, allows us to visualize what is going on

Page 17: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

17

The Crash Course

•What

•Why

•How

Page 18: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

18

The Reasons to Mine Data

Expenses

IncreaseProfit

Revenues

Page 19: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

19

The Reasons to Mine Data

• For Marketing/CRM– Targeting prospects – Predicting future customer behaviour– Costs Revenues

• For Research– Identify drugs likely to be successful– Costs

• For Process Improvement– Identify causes of production failures– Costs

Page 20: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

20

The Crash Course

•What

•Why

•How

Page 21: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

21

Process

• Many different processes for Data Mining– Vendor Driven

• SAS - SEMMA– Sample, Explore, Modify, Model, Assess

• SPSS - 5 A’s– Assess, Access, Analyze, Act, Automate

– Consulting Companies– The Virtuous Cycle

• Michael Berry and Gordon Linoff

Page 22: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

22

Business Problem

Transform Data

Act

Measure

ProcessThe Virtuous Cycle

Page 23: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

23

• Define the business problem• Understand the business and the rules• Determine if Data Mining fits the need• Understand the value to the business of

solving the problem

Business Problem

Page 24: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

24

Data for Data Mining

• Type of Data Values– Categorical

• Defined set of values• Ontario, Quebec, PEI …

– Ranks• High, Medium, Low• 0 – 20 000, 20 001 – 35 000, 35 001 – 50 000

– Intervals• Date• Time • Temperature

– True Numeric• Values that support numeric operations

Page 25: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

25

Transform DataSteps

Identify Data1

Prepare Model Set6

Add Derived Variables5

Transpose to Right Granularity4

Validate & Clean3

Obtain Data2

Conduct Modeling7

Page 26: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

26

Transform Data

Identify Data1

Step 1 - Identify Data

• What data is required to meet the modeling need?

• What data is available?

Page 27: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

27

Transform DataStep2 - Obtain Data

Identify Data

Obtain Data

1

2

• OLTP• Data Warehouse• Data Marts and OLAP• Self Reported• External

Page 28: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

28

Transform DataStep 3: Validate & Clean

Identify Data

Obtain Data

Validate & Clean

1

3

2

• Data Issues:– Missing– Fuzzy– Incorrect– Outliers

• Solutions:– Change Source– Filter Out– Ignore– Integrate– Predict– Derive a New Variable

Page 29: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

29

Transform DataStep 4: Transpose to right granularity

Identify Data

Obtain Data

Transpose to Right Granularity

Validate & Clean

1

4

3

2

• Data sets for Data Mining need one view, one record

• Grain must be consistent throughout– Aggregates can be problematic– Atomic data is often required to build data set

• Training data sets cast from point in time of event looking back

Page 30: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

30

Transform DataStep 5: Add Derived Variables

Identify Data

Obtain Data

Transpose to Right Granularity

Validate & Clean

Add Derived Variables

1

5

4

3

2

• Combined Columns• Summarizations• Features from Columns• Time Series

Page 31: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

31

Transform DataStep 6: Prepare Model Set

Identify Data

Obtain Data

Transpose to Right Granularity

Validate & Clean

Add Derived Variables

Prepare Model Set

1

6

5

4

3

2

• The Actual Input to the modeling

Page 32: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

32

Transform DataStep 7: Conduct Modeling

Identify Data

Obtain Data

Transpose to Right Granularity

Validate & Clean

Add Derived Variables

Prepare Model Set

Conduct Modeling

1

6

5

4

3

2

7

• Get our result– Decision Trees– Neural Networks– Clustering – Nearest Neighbour– Rule Induction

Page 33: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

33

Act

The Business has to actually do something with the results or what was the point?

Marketing or Retention Campaigns

Business Changes

Page 34: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

34

Measure

• Answer 2 Questions– Was the Data Mining effort accurate? – Were the Business Actions successful?

• Use different sets of data to compare real results– Actioned Customers vs. Non Actioned

• Accuracy Types– Absolute

• Our prediction was 80% of Group D would buy Coke and 78% really did

– Relative• Our prediction was 80% of Group D would buy Coke but

57 % really did, however Group C which we predicted had a 60% propensity to buy Coke actually bought Coke 42% of the time

Page 35: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

35

And back around

Business Problem

Transform Data

Act

Measure

Page 36: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

36

The Case Study

Page 37: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

37

The Case Study

• Background– The Business– Data Warehouse Overview– Strengths and Challenges

• The Project– Business Problem– Transform Data– Act– Measure

Page 38: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

38

The Case Study

• Background– The Business– Data Warehouse Overview– Strengths and Challenges

• The Project– Business Problem– Transform Data– Act– Measure

Page 39: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

39

The Business

• One of the top 3 (4?) cellular phone providers in Canada

• Recent Acquisitions– Clearnet– Quebectel

• Important Business Concepts– Handset– Subscriber– Client– Activity - Activations, Deactivations– Churn– Usage

Page 40: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

40

Cubes

Catalogues

Sources

Staging

Integration (3N F)

Reports

Data Marts(Dimensional)

DW Environment

Page 41: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

41

• Commitment to Data Warehousing• Prior Experience in Data Mining• Tools already Established• Strong business support for the outcomes Data

Mining would provide

Strengths at

Page 42: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

42

• Data Warehouse still in midst of major re-architecture effort

• Ongoing billing system integration projects• A data mart for data mining had existed

(Clearnet) but it was a victim of both of the above– Successful at Churn Prediction

Challenges at

Page 43: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

43

The Case Study

• Background– The Business– Data Warehouse Overview– Strengths and Challenges

• The Project– Business Problem– Transform Data– Act– Measure

Page 44: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

44

Using the Virtuous Cycle

Business Problem

Transform Data

Act

Measure

Page 45: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

45

Business Problems

• Churn Modeling– predict which subscriber is likely to leave

• Behavioural Segmentation– clustering subscribers into subgroups based on some

commonality– revenue, usage, demographic

• Client Value Estimation– the present value of all future profits generated throughout

the lifetime of that client

Page 46: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

46

Transform Data Steps

Identify Data

Obtain Data

Transpose to Right Granularity

Validate & Clean

Add Derived Variables

Prepare Model Set

Conduct Modeling

1

6

5

4

3

2

7

Page 47: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

47

Identify Data

EVERY POSSIBLE VARIABLE RELATED TO A SUBSCRIBER!

• Business Wanted:

1

• They provided a detailed list, by subject area, of the variables that they believed were required to conduct the kind of Data Mining Activities desired.

Page 48: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

48

• 19 Subject Areas identified with up to 75 variables each - What is the Priority?

• Avoid the big bang - How much can we actually do?

• Where to source the data from?• Resources - Who is going to do it?

Identify DataIT Challenges

1

Page 49: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

49

• First we asked the business to rate each variable as H, M or L priority– Almost everything was given an H

• Then we asked the business to rank the subject areas in order of importance– Hard to convince them of the value– Hard to find consensus– Necessary for determining a release strategy

Identify DataPrioritizing

1

Page 50: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

50

Obtain Data

• For each Subject Area and each variable we assessed and documented the following:– Where can it be sourced from (and when)?– What are the known issues?– Q&A back and forth on the variables with business– Identified possible additional variables

2

Page 51: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

51

+ =Release Strategy:•1 Release a Quarter

•3 planned releases

•15 of 19 Subject Areas

Obtain Data

2

Page 52: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

52

Obtain Data

• Data Mining Access tool against sources– Data Warehouse– OLTP– External Data

• Creation of flat files to feed to Data Mining team– Multi source– Validate and Clean

• Build a Data Mart for Data Mining– Part of the Data Warehouse Architecture

2

Page 53: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

53

Obtain Data

Sources

Staging

Integration (3N F)

Data Marts(Dimensional)

Data Mart forData Mining

Page 54: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

54

Validate & Clean Data

Validate & Clean3

Staging

Integration (3N F)

Data Marts

Data Mart forData Mining

• Clean?– Not Fuzzy– Correct - mostly– Missing and Outliers

Page 55: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

55

Transpose to Right Granularity

• The Grain is Subscriber by Bill Cycle• Monthly Snapshots of subscriber Data• Fact Data Cast to Bill Cycle

Transpose to Right Granularity4

BILLED_AMOUNTS

SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_BILLED_REVENUE_FACTSUPDATE_DTLOAD_DT

PRODUCT_SKU

PRODUCT_SKU_KEY

PRODUCT_INFOUPDATE_DTLOAD_DT

SUBSCRIBER_BILL_CYCLE_SNAPSHOT

SUBSCRIBER_IDMONTH

CURRENT_HANDSET_KEY (FK)FIRST_HANDSET_KEY (FK)SUBSCRIBER_PERSONAL_INFOSUBSCRIBER_DEMOGRAPHIC_INFOCONTRACT_INFORATE_PLAN_INFOMUCH_MORE_INFOLOAD_DTUPDATE_DT

VALUE_ADDED_SERVICES

VALUE_ADDED_SERVICES_KEY

VALUE_ADDED_SERVICES_DESVALUE_ADDED_SERVICES_CDVALUE_ADDED_SERVICES_IDUPDATE_DTLOAD_DT

VALUE_ADDED_SERVICES_BILLED

MONTH (FK)SUBSCRIBER_ID (FK)VALUE_ADDED_SERVICES_KEY (FK)

VALUE_ADDED_SERVICES_AMTUPDATE_DTLOAD_DT

FACTS_5SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

FACTS_4SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

FACTS_3SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

FACTS_2SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DTSUBSCRIBER_BILL_CYCLE_SNAPSHOT

SUBSCRIBER_IDMONTH

CLIENT_ACCOUNT_ID (FK)CLIENT_SNAPSHOT_MONTH (FK)CURRENT_HANDSET_KEY (FK)FIRST_HANDSET_KEY (FK)SUBSCRIBER_PERSONAL_INFOSUBSCRIBER_DEMOGRAPHIC_INFOCONTRACT_INFORATE_PLAN_INFOMUCH_MORE_INFOLOAD_DTUPDATE_DT

CLIENT_ACCOUNT_BILL_CYCLE_SNAPSHOTCLIENT_ACCOUNT_IDCLIENT_SNAPSHOT_MONTH

INFO_ON_THE_ACCOUNTACCOUNT_NO_OF_CANCELLED_SUBSACCOUNT_NO_OF_SUSPENDED_SUBSACCOUNT_NO_OF_ACTIVE_SUBSLOAD_DTUPDATE_DT

Page 56: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

56

Add Derived Variables

• Combined Columns– PERCENT_BUCKET_USED

• IN_BUCKET_CALLS / RATE_PLAN_BUCKET_MINUTES)

• Summarizations– Sphere of Influence

• UNIQUE_CALLED_NUMBER_CNT• UNIQUE_CALLING_NUMBER_CNT

Add Derived Variables5

Page 57: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

57

Add Derived Variables

• Features from Columns– CONTRACT_INDICATOR (Y/N Flag)

• Was the the date of snapshot for that subscriber between COMMIT_START_DATE and COMMIT_END_DATE

• Time Series

Add Derived Variables5

BLOCKED_USAGE_WORK

SUBSCRIBER_ID: NUMBER(22)USAGE_DT: DATE

BLOCKED_CNT: NUMBER(22)UPDATE_DT: DATELOAD_DT: DATE

BLOCKED_USAGE

MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)

BLOCKED_CNT: NUMBER(22)UPDATE_DT: DATELOAD_DT: DATE

VALUE_ADDED_SERVICES_BILLED

MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)VALUE_ADDED_SERVICES_KEY: NUMBER

VALUE_ADDED_SERVICES_AMT: NUMBER(22,2)UPDATE_DT: DATELOAD_DT: DATE

SUBSCRIBER_BILL_CYCLE_SNAPSHOT

SUBSCRIBER_ID: NUMBER(22)MONTH: NUMBER(8)

CLIENT_ACCOUNT_ID: NUMBER(22)CLIENT_SNAPSHOT_MONTH: NUMBER(8)SUBSCRIBER_CD: VARCHAR2(50)TIME_REMAINING_ON_CONTRACT: NUMBERCONTRACT_END_DT: DATECONTRACT_START_DT: DATECONTRACT_TERM: NUMBERCONTRACT_IND: VARCHAR2(1)SUBSCRIBER_STATUS_CD: VARCHAR2(50)ACTIVATION_CHANNEL_ORG_CD: VARCHAR2(50)ACTIVATION_REASON_CD: VARCHAR2(50)ACTIVATION_DT: DATEPOSTAL_CODE: VARCHAR2(20)PROVINCE_STATE_CD: VARCHAR2(50)LAST_NM: VARCHAR2(60)LANGUAGE_CD: VARCHAR2(50)BIRTHDATE: DATEGENDER: VARCHAR2(1)DRIVERS_LICENSE: VARCHAR2(20)FIRST_NM: VARCHAR2(50)SOCIAL_INSURANCE_NO: VARCHAR2(20)MOBILE_PHONE_NO: VARCHAR2(50)DEACTIVATION_REASON_CD: VARCHAR2(50)DEACTIVATION_DT: DATEREVENUE_BAND_CD: VARCHAR2(50)SEGMENT_CD: VARCHAR2(50)SUSPEND_REASON_CD: VARCHAR2(50)SUSPEND_DT: DATECURRENT_HANDSET_KEY: NUMBERFIRST_HANDSET_KEY: NUMBERUPDATE_DT: DATELOAD_DT: DATEAGE_OF_CURRENT_HANDSET: NUMBERPREPAID_IND: VARCHAR2(1)REVENUE_IND: VARCHAR2(1)RATE_PLAN_BUCKET_MINUTES: NUMBER(9,2)RATE_PLAN_GROUP_CD: VARCHAR2(50)RATE_PLAN_CD: VARCHAR2(50)SUBSCRIBER_TENURE_DAYS: NUMBERSUBSCRIPTION_ID: VARCHAR2(50)TECHNOLOGY_TYPE_CD: VARCHAR2(50)AGE: NUMBER(4,1)

PRODUCT_SKU_KD2

PRODUCT_SKU_KEY: NUMBER

PRODUCT_GROUP_DES: VARCHAR2(240)PRODUCT_GROUP_CD: VARCHAR2(50)PRODUCT_GROUP_ID: VARCHAR2(50)ENGLISH_PRODUCT_DES: VARCHAR2(240)PRODUCT_CD: VARCHAR2(50)REFURBISHED_HANDSET_IND: VARCHAR2(1)WEB_READY_IND: VARCHAR2(1)SIM_CARD_IND: VARCHAR2(1)PRODUCT_ID: VARCHAR2(50)UPDATE_DT: DATELOAD_DT: DATE

DROPPED_USAGE_WORK

SUBSCRIBER_ID: NUMBER(22)USAGE_DT: DATE

DROPPED_CNT: NUMBER(22)UPDATE_DT: DATELOAD_DT: DATE

DROPPED_USAGE

MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)

DROPPED_CNT: NUMBER(22)UPDATE_DT: DATELOAD_DT: DATE

BUCKET_USAGE

MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)

UPDATE_DT: DATELOAD_DT: DATEPERCENT_BUCKET_USED: NUMBER

BILLED_USAGE_3_MTH_AVG

MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)

TOTAL_CALLS_MOU_3_MTH_AVG: NUMBER(22,5)LONG_DISTANCE_CALLS_MOU_3_MTH_: NUMBER(22,5)WEEKEND_CALLS_MOU_3_MTH_AVG: NUMBER(22,5)EVENING_CALLS_MOU_3_MTH_AVG: NUMBER(22,5)PEAK_CALLS_MOU_3_MTH_AVG: NUMBER(22,5)UPDATE_DT: DATELOAD_DT: DATE

BILLED_USAGE

MONTH: NUMBER(8)SUBSCRIBER_ID: NUMBER(22)

TOTAL_CALLS_MOU: NUMBER(22,5)TOTAL_CALLS_CNT: NUMBERLONG_DISTANCE_CALLS_MOU: NUMBER(22,5)LONG_DISTANCE_CALLS_CNT: NUMBERWEEKEND_CALLS_MOU: NUMBER(22,5)WEEKEND_CALLS_CNT: NUMBEREVENING_CALLS_MOU: NUMBER(22,5)EVENING_CALLS_CNT: NUMBERPEAK_CALLS_MOU: NUMBER(22,5)PEAK_CALLS_CNT: NUMBERUPDATE_DT: DATELOAD_DT: DATEOUT_BUCKET_CALLS_MOU: NUMBER(22,5)IN_BUCKET_CALLS_MOU: NUMBER(22,5)OUT_BUCKET_CALLS_CNT: NUMBERIN_BUCKET_CALLS_CNT: NUMBERROAMING_CALLS_MOU: NUMBER(22,5)ROAMING_CALLS_CNT: NUMBER

SUBSCRIBER_ID_KD2_TEMP

SUBSCRIBER_ID: NUMBER(22)

ACCOUNT_CYCLE_BILL_DAY: NUMBER(2)

Page 58: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

58

Prepare Model Set

Prepare Model Set6

FACTS_5

SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

FACTS_4

SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

FACTS_3

SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

FACTS_2

SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

FACTS_1

SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_FACTSUPDATE_DTLOAD_DT

BILLED_AMOUNTS

SUBSCRIBER_ID (FK)MONTH (FK)

LOTS_OF_BILLED_REVENUE_FACTSUPDATE_DTLOAD_DT

DROPPED_AND_BLOCKED_USAGE

DROPPED_CALLS_CNTUPDATE_DTLOAD_DT

EQUIPMENT_ACTIVITY_SNAPSHOT

SUBSCRIBER_ID (FK)MONTH (FK)

HANDSET_COUNTABLE_DATAUPDATE_DTLOAD_DT

PRODUCT_SKU

PRODUCT_SKU_KEY

PRODUCT_INFOUPDATE_DTLOAD_DT

SUBSCRIBER_BILL_CYCLE_SNAPSHOT

SUBSCRIBER_IDMONTH

CLIENT_ACCOUNT_ID (FK)CLIENT_SNAPSHOT_MONTH (FK)CURRENT_HANDSET_KEY (FK)FIRST_HANDSET_KEY (FK)SUBSCRIBER_PERSONAL_INFOSUBSCRIBER_DEMOGRAPHIC_INFOCONTRACT_INFORATE_PLAN_INFOMUCH_MORE_INFOLOAD_DTUPDATE_DT

VALUE_ADDED_SERVICES

VALUE_ADDED_SERVICES_KEY

VALUE_ADDED_SERVICES_DESVALUE_ADDED_SERVICES_CDVALUE_ADDED_SERVICES_IDUPDATE_DTLOAD_DT

VALUE_ADDED_SERVICES_BILLED

MONTH (FK)SUBSCRIBER_ID (FK)VALUE_ADDED_SERVICES_KEY (FK)

VALUE_ADDED_SERVICES_AMTUPDATE_DTLOAD_DT

BLOCKED_USAGE

BLOCKED_CALLS_CNTUPDATE_DTLOAD_DT

CLIENT_ACCOUNT_BILL_CYCLE_SNAPSHOT

CLIENT_ACCOUNT_IDCLIENT_SNAPSHOT_MONTH

INFO_ON_THE_ACCOUNTACCOUNT_NO_OF_CANCELLED_SUBSACCOUNT_NO_OF_SUSPENDED_SUBSACCOUNT_NO_OF_ACTIVE_SUBSLOAD_DTUPDATE_DT

08/01/2002 1,735,436.26 281,032.21 388,970.91 0.00 516,911.31 462,667.55 749,330.77 6,371.95 500,183.42 08/14/2002

• Select Sample– Population– Data for given

modeling effort

• Denormalize completely

Page 59: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

59

Prepare Model Set

Prepare Model Set6

Denormalize completely

Example of Denormalization for Data Mining

Handset 1 Handset 2 Handset 3 Handset 40 1 0 0

BILLED_USAGE

MONTH (FK)SUBSCRIBER_ID (FK)

TOTAL_CALLS_MOUTOTAL_CALLS_CNTLONG_DISTANCE_CALLS_MOULONG_DISTANCE_CALLS_CNTWEEKEND_CALLS_MOUWEEKEND_CALLS_CNTEVENING_CALLS_MOUEVENING_CALLS_CNTPEAK_CALLS_MOUPEAK_CALLS_CNTUPDATE_DTLOAD_DTOUT_BUCKET_CALLS_MOUIN_BUCKET_CALLS_MOUOUT_BUCKET_CALLS_CNTIN_BUCKET_CALLS_CNTROAMING_CALLS_MOUROAMING_CALLS_CNT

BILLED_USAGE_3_MTH_AVG

MONTH (FK)SUBSCRIBER_ID (FK)

TOTAL_CALLS_MOU_3_MTH_AVGLONG_DISTANCE_CALLS_MOU_3_MTH_WEEKEND_CALLS_MOU_3_MTH_AVGEVENING_CALLS_MOU_3_MTH_AVGPEAK_CALLS_MOU_3_MTH_AVGUPDATE_DTLOAD_DT

BUCKET_USAGE

MONTH (FK)SUBSCRIBER_ID (FK)

UPDATE_DTLOAD_DTPERCENT_BUCKET_USED

PRODUCT_SKU

PRODUCT_SKU_KEY

PRODUCT_INFOUPDATE_DTLOAD_DT

SUBSCRIBER_BILL_CYCLE_SNAPSHOT

SUBSCRIBER_IDMONTH

CLIENT_ACCOUNT_ID (FK)CLIENT_SNAPSHOT_MONTH (FK)CURRENT_HANDSET_KEY (FK)FIRST_HANDSET_KEY (FK)SUBSCRIBER_PERSONAL_INFOSUBSCRIBER_DEMOGRAPHIC_INFOCONTRACT_INFORATE_PLAN_INFOMUCH_MORE_INFOLOAD_DTUPDATE_DT

VALUE_ADDED_SERVICES

VALUE_ADDED_SERVICES_KEY

VALUE_ADDED_SERVICES_DESVALUE_ADDED_SERVICES_CDVALUE_ADDED_SERVICES_IDUPDATE_DTLOAD_DT

VALUE_ADDED_SERVICES_BILLED

MONTH (FK)SUBSCRIBER_ID (FK)VALUE_ADDED_SERVICES_KEY (FK)

VALUE_ADDED_SERVICES_AMTUPDATE_DTLOAD_DT

Page 60: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

60

Prepare Model Set

Prepare Model Set6

Example of Casting a a Test Set for Churn Modeling

• Sliding Windows Concept

O c t N o v D e c J a n F e b M a r A p r M a y J u n J u l

M o d e l S e t F e b 3 2 1 X PM o d e l S e t M a r 3 2 1 X PM o d e l S e t A p r 3 2 1 X P

S c o r e S e t 3 2 1 X P P

Page 61: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

61

Conduct Modeling

7

• Done by business team with help from Vendor (SAS)– Decision Trees and Neural Networks (Churn)– Clustering (Segmentation)

Decision Trees

Neural Networks

Clustering

Page 62: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

62

Various Marketing and CRM Activities

The Action

Page 63: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

63

Measuring

• Not there yet– Intend to compare actioned vs. non-actioned results– 50 % to be actioned

Page 64: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

64

And the cycle continues ...

Business Problem

Transform Data

Act

Measure

Page 65: A Practical Look at Data Preparation Jason Brown Cognicase Inc. Data Mining IRMAC: Data Warehouse SIG November 5, 2002.

65

AQ&Q U E S T I O N SQ U E S T I O N SA N S W E R SA N S W E R S

Jason [email protected]