A Practical Look at Data Preparation
Jason Brown
Cognicase Inc.
Data Mining
IRMAC: Data Warehouse SIG
November 5, 2002
2
Agenda
• Crash Course in Data Mining
  – What
  – Why
  – How
• The virtuous cycle
  – Data Preparation
• Case Study
  – Background
  – Going through the cycle
• Data Preparation
• Q&A
3
The Crash Course
•What
•Why
•How
4
Definitions
Data Mining: The process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.
Knowledge Discovery
Data Mining is not Data Warehousing, OLAP etc.
5
Definitions
Modeling:
• Not an ER-type Data Model
• A data mining model is computational, full of algorithms
• A model can be descriptive or predictive.
  – A descriptive model helps in understanding underlying processes or behavior.
  – A predictive model uses known values (input) to predict an unknown value (output).
6
Two Types of Data Mining
• Directed
  – Know specifically what we are looking for
    • Who is likely to respond to our offer?
    • What are our customers going to be worth to us over their lifetime?
  – Model is a Black Box
Input → [Black Box] → Output
7
Two Types of Data Mining
• Undirected
  – Not exactly sure what we are looking for
    • How should we define our Customer Segments?
    • What is interesting about all of our point of sale data?
  – Model is a Transparent Box
Input → [Transparent Box] → Output
8
Modeling Techniques
• Decision Trees
• Rule Induction (If … Then …)
• Neural Networks
• Nearest Neighbour
• Clustering
9
Modeling Techniques: Decision Trees
• The tree is built based on the input of a training data set
  – Training Data Set is based on historical data
  – Oversample the data that reflects your question
• Each record of the Model Set is run through the branches of the tree until the record reaches a leaf
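Running a record through the branches until it reaches a leaf can be sketched as follows. This is a minimal illustration, not the model from the talk: the tree is hand-written rather than learned from a training set, and the splits, field names (`age`, `income`) and leaf values are assumptions.

```python
# Hypothetical, hand-built tree: splits and leaf values are illustrative only.
TREE = {
    "test": lambda r: r["age"] < 23,
    "yes": {
        "test": lambda r: r["income"] < 12000,
        "yes": {"leaf": 0.37},
        "no": {"leaf": 0.28},
    },
    "no": {"leaf": 0.10},
}


def score(record, node=TREE):
    """Follow the yes/no branches until a leaf is reached; return its value."""
    while "leaf" not in node:
        node = node["yes"] if node["test"](record) else node["no"]
    return node["leaf"]


young_low_income = score({"age": 21, "income": 9000})
older_customer = score({"age": 40, "income": 50000})
```

In a real project the tree would be induced from the training data set by a modeling tool; only the traversal step is shown here.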
10
Modeling Techniques: Decision Trees
[Diagram: example decision tree with Y/N branches on Age < 23, Income < 12 000 and Male?, ending in leaves of 28% and 37%]
11
Modeling Techniques: Neural Networks
• Neural networks are nonlinear models, loosely similar to a ‘brain’.
• The network is built based on the input of a training set.
• Model sets run through this network return results based on the patterns identified in the training set.
• Very Complex
12
Modeling Techniques: Clustering
• Clustering finds groups of records that are similar.
• For example, customers can be clustered by:
  – income
  – age
  – ytd revenue
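One common way to find such groups is k-means, sketched below. The slides do not say which clustering algorithm was used, so k-means is an assumption here, as are the customer values (income in thousands, age, ytd revenue in hundreds) and the choice of starting centroids.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (income in $000s, age, ytd revenue in $00s).
customers = [(10, 20, 3), (12, 22, 4), (11, 21, 3),
             (80, 45, 20), (85, 50, 22), (90, 48, 21)]
centroids, clusters = kmeans(customers, [customers[0], customers[3]])
```

Distance-based clustering is sensitive to scale, so in practice the columns would usually be normalized first.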
13
Modeling Techniques: Clustering
[Diagram: two clusters of Male, Income < 12 000, Age < 23 customers: Coke Buyers and Non Coke Buyers]
14
Modeling Techniques: Nearest Neighbour
• Model is built based on the input of a training set.
• Classifies a record by calculating the distances between the record's attributes and the records in the training data set
• Then it assigns the record to the class that is most common among its nearest neighbours
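The two steps above (measure distances, then take the most common class among the nearest neighbours) can be sketched like this. The training records are made up, only two numeric attributes (income in thousands, age) are used, and Euclidean distance with k = 3 is an assumption.

```python
import math
from collections import Counter


def classify(record, training, k=3):
    """Return the most common label among the k training rows nearest to record."""
    nearest = sorted(training, key=lambda row: math.dist(record, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: ((income in $000s, age), label).
training = [
    ((25, 20), "Bought Coke"),
    ((30, 22), "Bought Coke"),
    ((28, 19), "Bought Coke"),
    ((70, 45), "Did Not"),
    ((65, 50), "Did Not"),
]

label = classify((27, 21), training)
```

A categorical attribute such as gender would need to be encoded as a number before it could take part in the distance calculation.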
15
Modeling Techniques: Nearest Neighbour
Records plotted based on: Income, Gender, Age
[Plot: records labelled "Bought Coke" or "Did Not"]
16
Modeling Techniques: Rule Induction
• A technique that infers generalizations from the information in the data
IF age < 19 AND purchase is coke THEN 40% purchase chips
• Describes the data, allows us to visualize what is going on
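The percentage in a rule like the one above is the share of records matching the IF part that also satisfy the THEN part. A small sketch of that calculation, using invented purchase records and hypothetical field names:

```python
def rule_confidence(records, condition, outcome):
    """Share of records meeting the condition that also show the outcome."""
    matching = [r for r in records if condition(r)]
    if not matching:
        return None  # the rule never fires on this data
    return sum(1 for r in matching if outcome(r)) / len(matching)


# Hypothetical purchases, for illustration only.
baskets = [
    {"age": 17, "items": {"coke", "chips"}},
    {"age": 18, "items": {"coke", "chips"}},
    {"age": 16, "items": {"coke"}},
    {"age": 18, "items": {"coke"}},
    {"age": 15, "items": {"coke"}},
    {"age": 30, "items": {"coke", "chips"}},
]

confidence = rule_confidence(
    baskets,
    condition=lambda r: r["age"] < 19 and "coke" in r["items"],
    outcome=lambda r: "chips" in r["items"],
)
```

Rule induction tools search for such rules automatically; this only shows how one candidate rule would be scored.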
17
The Crash Course
•What
•Why
•How
18
The Reasons to Mine Data
Expenses
Increase Profit
Revenues
19
The Reasons to Mine Data
• For Marketing/CRM
  – Targeting prospects
  – Predicting future customer behaviour
  – Costs, Revenues
• For Research
  – Identify drugs likely to be successful
  – Costs
• For Process Improvement
  – Identify causes of production failures
  – Costs
20
The Crash Course
•What
•Why
•How
21
Process
• Many different processes for Data Mining
  – Vendor Driven
    • SAS - SEMMA: Sample, Explore, Modify, Model, Assess
    • SPSS - 5 A’s: Assess, Access, Analyze, Act, Automate
  – Consulting Companies
  – The Virtuous Cycle
    • Michael Berry and Gordon Linoff
22
Process: The Virtuous Cycle
Business Problem → Transform Data → Act → Measure
23
Business Problem
• Define the business problem
• Understand the business and the rules
• Determine if Data Mining fits the need
• Understand the value to the business of solving the problem
24
Data for Data Mining
• Type of Data Values
  – Categorical
    • Defined set of values
    • Ontario, Quebec, PEI …
  – Ranks
    • High, Medium, Low
    • 0 – 20 000, 20 001 – 35 000, 35 001 – 50 000
  – Intervals
    • Date
    • Time
    • Temperature
  – True Numeric
    • Values that support numeric operations
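As a small illustration of turning a true numeric value into a rank, the income bands listed above can be applied like this. Whole-dollar incomes and the `None` result for values outside the listed bands are assumptions.

```python
# Bands taken from the slide; incomes are assumed to be whole dollars.
BANDS = [(0, 20000), (20001, 35000), (35001, 50000)]


def income_band(income):
    """Map a numeric income to its rank band, or None if it falls outside them."""
    for low, high in BANDS:
        if low <= income <= high:
            return f"{low} - {high}"
    return None


band = income_band(27500)
```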
25
Transform Data: Steps
1. Identify Data
2. Obtain Data
3. Validate & Clean
4. Transpose to Right Granularity
5. Add Derived Variables
6. Prepare Model Set
7. Conduct Modeling
26
Transform Data: Step 1 - Identify Data
• What data is required to meet the modeling need?
• What data is available?
27
Transform Data: Step 2 - Obtain Data
• OLTP
• Data Warehouse
• Data Marts and OLAP
• Self Reported
• External
28
Transform Data: Step 3 - Validate & Clean
• Data Issues:
  – Missing
  – Fuzzy
  – Incorrect
  – Outliers
• Solutions:
  – Change Source
  – Filter Out
  – Ignore
  – Integrate
  – Predict
  – Derive a New Variable
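Three of the listed solutions can be sketched on a single column: filtering out implausible values, filling in a missing value with a simple estimate, and deriving a new variable that records the value was missing. The `age` field, the 0 to 120 range and the use of the median as the estimate are all assumptions made for illustration.

```python
from statistics import median


def clean_ages(rows, low=0, high=120):
    """Drop rows with out-of-range ages, fill missing ages with the median
    of the remaining known ages, and flag which rows were filled."""
    # Filter Out: keep rows whose age is missing or within the plausible range.
    kept = [r for r in rows if r["age"] is None or low <= r["age"] <= high]
    known = [r["age"] for r in kept if r["age"] is not None]
    fill = median(known)  # raises StatisticsError if no age is known
    cleaned = []
    for r in kept:
        missing = r["age"] is None
        # Derive a New Variable: age_missing keeps track of the substitution.
        cleaned.append({**r, "age": fill if missing else r["age"],
                        "age_missing": missing})
    return cleaned


rows = [{"id": 1, "age": 34}, {"id": 2, "age": None},
        {"id": 3, "age": 250}, {"id": 4, "age": 40}]
cleaned = clean_ages(rows)
```

Whether to filter, ignore or estimate depends on the variable and the model; the flag column lets the model see that a value was imputed.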
29
Transform Data: Step 4 - Transpose to Right Granularity
• Data sets for Data Mining need one view, one record
• Grain must be consistent throughout
  – Aggregates can be problematic
  – Atomic data is often required to build data set
• Training data sets are cast from the point in time of the event, looking back
Transform Data: Step 5 - Add Derived Variables
• Combined Columns
• Summarizations
• Features from Columns
• Time Series
31
Transform Data: Step 6 - Prepare Model Set
• The actual input to the modeling
32
Transform Data: Step 7 - Conduct Modeling
• Get our result
  – Decision Trees
  – Neural Networks
  – Clustering
  – Nearest Neighbour
  – Rule Induction
33
Act
The Business has to actually do something with the results, or what was the point?
Marketing or Retention Campaigns
Business Changes
34
Measure
• Answer 2 Questions
  – Was the Data Mining effort accurate?
  – Were the Business Actions successful?
• Use different sets of data to compare real results
  – Actioned Customers vs. Non Actioned
• Accuracy Types
  – Absolute
    • Our prediction was 80% of Group D would buy Coke and 78% really did
  – Relative
    • Our prediction was 80% of Group D would buy Coke but 57% really did; however, Group C, which we predicted had a 60% propensity to buy Coke, actually bought Coke 42% of the time, so the predicted ordering of the groups still held
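The two accuracy types can be checked with a few lines, using the Group C and Group D figures from the relative example. Treating "relative" accuracy as "the groups rank in the same order as predicted" is this sketch's reading of the slide.

```python
def absolute_error(predicted, actual):
    """Gap between predicted and observed rate, per group."""
    return {group: abs(predicted[group] - actual[group]) for group in predicted}


def same_ranking(predicted, actual):
    """True if the groups rank in the same order by predicted and actual rate."""
    def rank(rates):
        return sorted(rates, key=rates.get, reverse=True)
    return rank(predicted) == rank(actual)


# Figures from the relative-accuracy example on the slide.
predicted = {"D": 0.80, "C": 0.60}
actual = {"D": 0.57, "C": 0.42}

errors = absolute_error(predicted, actual)
ordering_held = same_ranking(predicted, actual)
```

Here the absolute errors are large, but the ranking holds, which is still useful when the goal is to pick the best groups to action.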
35
And back around
Business Problem
Transform Data
Act
Measure
36
The Case Study
37
The Case Study
• Background
  – The Business
  – Data Warehouse Overview
  – Strengths and Challenges
• The Project
  – Business Problem
  – Transform Data
  – Act
  – Measure
39
The Business
• One of the top 3 (4?) cellular phone providers in Canada
• Recent Acquisitions
  – Clearnet
  – Quebectel
• Important Business Concepts
  – Handset
  – Subscriber
  – Client
  – Activity - Activations, Deactivations
  – Churn
  – Usage
40
DW Environment
[Diagram: Sources, Staging, Integration (3NF), Data Marts (Dimensional), Cubes, Catalogues, Reports]
41
Strengths at
• Commitment to Data Warehousing
• Prior Experience in Data Mining
• Tools already Established
• Strong business support for the outcomes Data Mining would provide
42
Challenges at
• Data Warehouse still in midst of major re-architecture effort
• Ongoing billing system integration projects
• A data mart for data mining had existed (Clearnet) but it was a victim of both of the above
  – Successful at Churn Prediction
43
The Case Study
• Background
  – The Business
  – Data Warehouse Overview
  – Strengths and Challenges
• The Project
  – Business Problem
  – Transform Data
  – Act
  – Measure
44
Using the Virtuous Cycle
Business Problem
Transform Data
Act
Measure
45
Business Problems
• Churn Modeling– predict which subscriber is likely to leave
• Behavioural Segmentation– clustering subscribers into subgroups based on some
commonality– revenue, usage, demographic
• Client Value Estimation– the present value of all future profits generated throughout
the lifetime of that client
46
Transform Data Steps
1. Identify Data
2. Obtain Data
3. Validate & Clean
4. Transpose to Right Granularity
5. Add Derived Variables
6. Prepare Model Set
7. Conduct Modeling
47
Step 1 - Identify Data
• Business Wanted: EVERY POSSIBLE VARIABLE RELATED TO A SUBSCRIBER!
• They provided a detailed list, by subject area, of the variables that they believed were required to conduct the kind of Data Mining activities desired.
48
Step 1 - Identify Data: IT Challenges
• 19 Subject Areas identified with up to 75 variables each - What is the Priority?
• Avoid the big bang - How much can we actually do?
• Where to source the data from?
• Resources - Who is going to do it?
49
Step 1 - Identify Data: Prioritizing
• First we asked the business to rate each variable as H, M or L priority
  – Almost everything was given an H
• Then we asked the business to rank the subject areas in order of importance
  – Hard to convince them of the value
  – Hard to find consensus
  – Necessary for determining a release strategy
50
Step 2 - Obtain Data
• For each Subject Area and each variable we assessed and documented the following:
  – Where can it be sourced from (and when)?
  – What are the known issues?
  – Q&A back and forth on the variables with business
  – Identified possible additional variables
51
Step 2 - Obtain Data
Release Strategy:
• 1 Release a Quarter
• 3 planned releases
• 15 of 19 Subject Areas
52
Step 2 - Obtain Data
• Data Mining Access tool against sources
  – Data Warehouse
  – OLTP
  – External Data
• Creation of flat files to feed to Data Mining team
  – Multi source
  – Validate and Clean
• Build a Data Mart for Data Mining
  – Part of the Data Warehouse Architecture
53
Obtain Data
[Diagram: Sources, Staging, Integration (3NF), Data Marts (Dimensional), Data Mart for Data Mining]
54
Step 3 - Validate & Clean Data
[Diagram: Staging, Integration (3NF), Data Marts, Data Mart for Data Mining]
• Clean?
  – Not Fuzzy
  – Correct - mostly
  – Missing and Outliers
55
Step 4 - Transpose to Right Granularity
• The Grain is Subscriber by Bill Cycle
• Monthly Snapshots of subscriber Data
• Fact Data Cast to Bill Cycle
[Diagram: data mart tables, listed as table name, key columns, other columns]
SUBSCRIBER_BILL_CYCLE_SNAPSHOT. Key: SUBSCRIBER_ID, MONTH. Columns: CLIENT_ACCOUNT_ID (FK), CLIENT_SNAPSHOT_MONTH (FK), CURRENT_HANDSET_KEY (FK), FIRST_HANDSET_KEY (FK), SUBSCRIBER_PERSONAL_INFO, SUBSCRIBER_DEMOGRAPHIC_INFO, CONTRACT_INFO, RATE_PLAN_INFO, MUCH_MORE_INFO, LOAD_DT, UPDATE_DT
CLIENT_ACCOUNT_BILL_CYCLE_SNAPSHOT. Key: CLIENT_ACCOUNT_ID, CLIENT_SNAPSHOT_MONTH. Columns: INFO_ON_THE_ACCOUNT, ACCOUNT_NO_OF_CANCELLED_SUBS, ACCOUNT_NO_OF_SUSPENDED_SUBS, ACCOUNT_NO_OF_ACTIVE_SUBS, LOAD_DT, UPDATE_DT
BILLED_AMOUNTS. Key: SUBSCRIBER_ID (FK), MONTH (FK). Columns: LOTS_OF_BILLED_REVENUE_FACTS, UPDATE_DT, LOAD_DT
PRODUCT_SKU. Key: PRODUCT_SKU_KEY. Columns: PRODUCT_INFO, UPDATE_DT, LOAD_DT
VALUE_ADDED_SERVICES. Key: VALUE_ADDED_SERVICES_KEY. Columns: VALUE_ADDED_SERVICES_DES, VALUE_ADDED_SERVICES_CD, VALUE_ADDED_SERVICES_ID, UPDATE_DT, LOAD_DT
VALUE_ADDED_SERVICES_BILLED. Key: MONTH (FK), SUBSCRIBER_ID (FK), VALUE_ADDED_SERVICES_KEY (FK). Columns: VALUE_ADDED_SERVICES_AMT, UPDATE_DT, LOAD_DT
FACTS_2, FACTS_3, FACTS_4, FACTS_5. Key (each): SUBSCRIBER_ID (FK), MONTH (FK). Columns (each): LOTS_OF_FACTS, UPDATE_DT, LOAD_DT
56
Step 5 - Add Derived Variables
• Combined Columns
  – PERCENT_BUCKET_USED
    • (IN_BUCKET_CALLS / RATE_PLAN_BUCKET_MINUTES)
• Summarizations
  – Sphere of Influence
    • UNIQUE_CALLED_NUMBER_CNT
    • UNIQUE_CALLING_NUMBER_CNT
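The two derived variables named above could be computed along these lines. This is a sketch only: it assumes the numerator of PERCENT_BUCKET_USED is in-bucket minutes of use, and the call-record layout (`from`, `to`) is hypothetical.

```python
def percent_bucket_used(in_bucket_minutes, rate_plan_bucket_minutes):
    """Combined column: share of the rate plan's bucket minutes actually used."""
    if not rate_plan_bucket_minutes:
        return None  # no bucket on the plan, so the ratio is undefined
    return in_bucket_minutes / rate_plan_bucket_minutes


def sphere_of_influence(calls, subscriber):
    """Summarization: distinct numbers the subscriber called and was called by."""
    called = {c["to"] for c in calls if c["from"] == subscriber}
    calling = {c["from"] for c in calls if c["to"] == subscriber}
    return {"UNIQUE_CALLED_NUMBER_CNT": len(called),
            "UNIQUE_CALLING_NUMBER_CNT": len(calling)}


# Hypothetical call records.
calls = [{"from": "A", "to": "B"}, {"from": "A", "to": "B"},
         {"from": "A", "to": "C"}, {"from": "D", "to": "A"}]
usage = percent_bucket_used(150, 200)
sphere = sphere_of_influence(calls, "A")
```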
57
Step 5 - Add Derived Variables
• Features from Columns
  – CONTRACT_INDICATOR (Y/N Flag)
    • Was the date of the snapshot for that subscriber between COMMIT_START_DATE and COMMIT_END_DATE?
• Time Series
[Diagram: physical tables, listed as table name, key columns, other columns]
BLOCKED_USAGE_WORK. Key: SUBSCRIBER_ID: NUMBER(22), USAGE_DT: DATE. Columns: BLOCKED_CNT: NUMBER(22), UPDATE_DT: DATE, LOAD_DT: DATE
BLOCKED_USAGE. Key: MONTH: NUMBER(8), SUBSCRIBER_ID: NUMBER(22). Columns: BLOCKED_CNT: NUMBER(22), UPDATE_DT: DATE, LOAD_DT: DATE
DROPPED_USAGE_WORK. Key: SUBSCRIBER_ID: NUMBER(22), USAGE_DT: DATE. Columns: DROPPED_CNT: NUMBER(22), UPDATE_DT: DATE, LOAD_DT: DATE
DROPPED_USAGE. Key: MONTH: NUMBER(8), SUBSCRIBER_ID: NUMBER(22). Columns: DROPPED_CNT: NUMBER(22), UPDATE_DT: DATE, LOAD_DT: DATE
VALUE_ADDED_SERVICES_BILLED. Key: MONTH: NUMBER(8), SUBSCRIBER_ID: NUMBER(22), VALUE_ADDED_SERVICES_KEY: NUMBER. Columns: VALUE_ADDED_SERVICES_AMT: NUMBER(22,2), UPDATE_DT: DATE, LOAD_DT: DATE
SUBSCRIBER_BILL_CYCLE_SNAPSHOT. Key: SUBSCRIBER_ID: NUMBER(22), MONTH: NUMBER(8). Columns: CLIENT_ACCOUNT_ID: NUMBER(22), CLIENT_SNAPSHOT_MONTH: NUMBER(8), SUBSCRIBER_CD: VARCHAR2(50), TIME_REMAINING_ON_CONTRACT: NUMBER, CONTRACT_END_DT: DATE, CONTRACT_START_DT: DATE, CONTRACT_TERM: NUMBER, CONTRACT_IND: VARCHAR2(1), SUBSCRIBER_STATUS_CD: VARCHAR2(50), ACTIVATION_CHANNEL_ORG_CD: VARCHAR2(50), ACTIVATION_REASON_CD: VARCHAR2(50), ACTIVATION_DT: DATE, POSTAL_CODE: VARCHAR2(20), PROVINCE_STATE_CD: VARCHAR2(50), LAST_NM: VARCHAR2(60), LANGUAGE_CD: VARCHAR2(50), BIRTHDATE: DATE, GENDER: VARCHAR2(1), DRIVERS_LICENSE: VARCHAR2(20), FIRST_NM: VARCHAR2(50), SOCIAL_INSURANCE_NO: VARCHAR2(20), MOBILE_PHONE_NO: VARCHAR2(50), DEACTIVATION_REASON_CD: VARCHAR2(50), DEACTIVATION_DT: DATE, REVENUE_BAND_CD: VARCHAR2(50), SEGMENT_CD: VARCHAR2(50), SUSPEND_REASON_CD: VARCHAR2(50), SUSPEND_DT: DATE, CURRENT_HANDSET_KEY: NUMBER, FIRST_HANDSET_KEY: NUMBER, UPDATE_DT: DATE, LOAD_DT: DATE, AGE_OF_CURRENT_HANDSET: NUMBER, PREPAID_IND: VARCHAR2(1), REVENUE_IND: VARCHAR2(1), RATE_PLAN_BUCKET_MINUTES: NUMBER(9,2), RATE_PLAN_GROUP_CD: VARCHAR2(50), RATE_PLAN_CD: VARCHAR2(50), SUBSCRIBER_TENURE_DAYS: NUMBER, SUBSCRIPTION_ID: VARCHAR2(50), TECHNOLOGY_TYPE_CD: VARCHAR2(50), AGE: NUMBER(4,1)
PRODUCT_SKU_KD2. Key: PRODUCT_SKU_KEY: NUMBER. Columns: PRODUCT_GROUP_DES: VARCHAR2(240), PRODUCT_GROUP_CD: VARCHAR2(50), PRODUCT_GROUP_ID: VARCHAR2(50), ENGLISH_PRODUCT_DES: VARCHAR2(240), PRODUCT_CD: VARCHAR2(50), REFURBISHED_HANDSET_IND: VARCHAR2(1), WEB_READY_IND: VARCHAR2(1), SIM_CARD_IND: VARCHAR2(1), PRODUCT_ID: VARCHAR2(50), UPDATE_DT: DATE, LOAD_DT: DATE
BUCKET_USAGE. Key: MONTH: NUMBER(8), SUBSCRIBER_ID: NUMBER(22). Columns: UPDATE_DT: DATE, LOAD_DT: DATE, PERCENT_BUCKET_USED: NUMBER
BILLED_USAGE_3_MTH_AVG. Key: MONTH: NUMBER(8), SUBSCRIBER_ID: NUMBER(22). Columns: TOTAL_CALLS_MOU_3_MTH_AVG: NUMBER(22,5), LONG_DISTANCE_CALLS_MOU_3_MTH_: NUMBER(22,5), WEEKEND_CALLS_MOU_3_MTH_AVG: NUMBER(22,5), EVENING_CALLS_MOU_3_MTH_AVG: NUMBER(22,5), PEAK_CALLS_MOU_3_MTH_AVG: NUMBER(22,5), UPDATE_DT: DATE, LOAD_DT: DATE
BILLED_USAGE. Key: MONTH: NUMBER(8), SUBSCRIBER_ID: NUMBER(22). Columns: TOTAL_CALLS_MOU: NUMBER(22,5), TOTAL_CALLS_CNT: NUMBER, LONG_DISTANCE_CALLS_MOU: NUMBER(22,5), LONG_DISTANCE_CALLS_CNT: NUMBER, WEEKEND_CALLS_MOU: NUMBER(22,5), WEEKEND_CALLS_CNT: NUMBER, EVENING_CALLS_MOU: NUMBER(22,5), EVENING_CALLS_CNT: NUMBER, PEAK_CALLS_MOU: NUMBER(22,5), PEAK_CALLS_CNT: NUMBER, UPDATE_DT: DATE, LOAD_DT: DATE, OUT_BUCKET_CALLS_MOU: NUMBER(22,5), IN_BUCKET_CALLS_MOU: NUMBER(22,5), OUT_BUCKET_CALLS_CNT: NUMBER, IN_BUCKET_CALLS_CNT: NUMBER, ROAMING_CALLS_MOU: NUMBER(22,5), ROAMING_CALLS_CNT: NUMBER
SUBSCRIBER_ID_KD2_TEMP. Key: SUBSCRIBER_ID: NUMBER(22). Columns: ACCOUNT_CYCLE_BILL_DAY: NUMBER(2)
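The contract flag and the 3-month averages on this slide could be derived roughly as follows. The inclusive date comparison, and a trailing window that includes the current month, are assumptions; the slides do not state either.

```python
from datetime import date


def contract_indicator(snapshot_dt, commit_start, commit_end):
    """Feature from columns: Y if the snapshot date falls within the commitment."""
    return "Y" if commit_start <= snapshot_dt <= commit_end else "N"


def trailing_average(monthly_values, window=3):
    """Time series: average of the current and previous (window - 1) months.
    Months without enough history get None."""
    averages = []
    for i in range(len(monthly_values)):
        if i + 1 < window:
            averages.append(None)
        else:
            chunk = monthly_values[i + 1 - window:i + 1]
            averages.append(sum(chunk) / window)
    return averages


# Hypothetical values for one subscriber.
flag = contract_indicator(date(2002, 8, 1), date(2002, 1, 1), date(2003, 12, 31))
mou_3_mth_avg = trailing_average([100, 200, 300, 400])
```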
58
Step 6 - Prepare Model Set
[Diagram: data mart tables, listed as table name, key columns, other columns]
FACTS_1, FACTS_2, FACTS_3, FACTS_4, FACTS_5. Key (each): SUBSCRIBER_ID (FK), MONTH (FK). Columns (each): LOTS_OF_FACTS, UPDATE_DT, LOAD_DT
BILLED_AMOUNTS. Key: SUBSCRIBER_ID (FK), MONTH (FK). Columns: LOTS_OF_BILLED_REVENUE_FACTS, UPDATE_DT, LOAD_DT
DROPPED_AND_BLOCKED_USAGE. Columns: DROPPED_CALLS_CNT, UPDATE_DT, LOAD_DT
BLOCKED_USAGE. Columns: BLOCKED_CALLS_CNT, UPDATE_DT, LOAD_DT
EQUIPMENT_ACTIVITY_SNAPSHOT. Key: SUBSCRIBER_ID (FK), MONTH (FK). Columns: HANDSET_COUNTABLE_DATA, UPDATE_DT, LOAD_DT
PRODUCT_SKU. Key: PRODUCT_SKU_KEY. Columns: PRODUCT_INFO, UPDATE_DT, LOAD_DT
SUBSCRIBER_BILL_CYCLE_SNAPSHOT. Key: SUBSCRIBER_ID, MONTH. Columns: CLIENT_ACCOUNT_ID (FK), CLIENT_SNAPSHOT_MONTH (FK), CURRENT_HANDSET_KEY (FK), FIRST_HANDSET_KEY (FK), SUBSCRIBER_PERSONAL_INFO, SUBSCRIBER_DEMOGRAPHIC_INFO, CONTRACT_INFO, RATE_PLAN_INFO, MUCH_MORE_INFO, LOAD_DT, UPDATE_DT
VALUE_ADDED_SERVICES. Key: VALUE_ADDED_SERVICES_KEY. Columns: VALUE_ADDED_SERVICES_DES, VALUE_ADDED_SERVICES_CD, VALUE_ADDED_SERVICES_ID, UPDATE_DT, LOAD_DT
VALUE_ADDED_SERVICES_BILLED. Key: MONTH (FK), SUBSCRIBER_ID (FK), VALUE_ADDED_SERVICES_KEY (FK). Columns: VALUE_ADDED_SERVICES_AMT, UPDATE_DT, LOAD_DT
CLIENT_ACCOUNT_BILL_CYCLE_SNAPSHOT. Key: CLIENT_ACCOUNT_ID, CLIENT_SNAPSHOT_MONTH. Columns: INFO_ON_THE_ACCOUNT, ACCOUNT_NO_OF_CANCELLED_SUBS, ACCOUNT_NO_OF_SUSPENDED_SUBS, ACCOUNT_NO_OF_ACTIVE_SUBS, LOAD_DT, UPDATE_DT
08/01/2002 1,735,436.26 281,032.21 388,970.91 0.00 516,911.31 462,667.55 749,330.77 6,371.95 500,183.42 08/14/2002
• Select Sample
  – Population
  – Data for given modeling effort
• Denormalize completely
59
Step 6 - Prepare Model Set
Denormalize completely
Example of Denormalization for Data Mining
Handset 1  Handset 2  Handset 3  Handset 4
0          1          0          0
[Diagram: data mart tables, listed as table name, key columns, other columns]
BILLED_USAGE. Key: MONTH (FK), SUBSCRIBER_ID (FK). Columns: TOTAL_CALLS_MOU, TOTAL_CALLS_CNT, LONG_DISTANCE_CALLS_MOU, LONG_DISTANCE_CALLS_CNT, WEEKEND_CALLS_MOU, WEEKEND_CALLS_CNT, EVENING_CALLS_MOU, EVENING_CALLS_CNT, PEAK_CALLS_MOU, PEAK_CALLS_CNT, UPDATE_DT, LOAD_DT, OUT_BUCKET_CALLS_MOU, IN_BUCKET_CALLS_MOU, OUT_BUCKET_CALLS_CNT, IN_BUCKET_CALLS_CNT, ROAMING_CALLS_MOU, ROAMING_CALLS_CNT
BILLED_USAGE_3_MTH_AVG. Key: MONTH (FK), SUBSCRIBER_ID (FK). Columns: TOTAL_CALLS_MOU_3_MTH_AVG, LONG_DISTANCE_CALLS_MOU_3_MTH_, WEEKEND_CALLS_MOU_3_MTH_AVG, EVENING_CALLS_MOU_3_MTH_AVG, PEAK_CALLS_MOU_3_MTH_AVG, UPDATE_DT, LOAD_DT
BUCKET_USAGE. Key: MONTH (FK), SUBSCRIBER_ID (FK). Columns: UPDATE_DT, LOAD_DT, PERCENT_BUCKET_USED
PRODUCT_SKU. Key: PRODUCT_SKU_KEY. Columns: PRODUCT_INFO, UPDATE_DT, LOAD_DT
SUBSCRIBER_BILL_CYCLE_SNAPSHOT. Key: SUBSCRIBER_ID, MONTH. Columns: CLIENT_ACCOUNT_ID (FK), CLIENT_SNAPSHOT_MONTH (FK), CURRENT_HANDSET_KEY (FK), FIRST_HANDSET_KEY (FK), SUBSCRIBER_PERSONAL_INFO, SUBSCRIBER_DEMOGRAPHIC_INFO, CONTRACT_INFO, RATE_PLAN_INFO, MUCH_MORE_INFO, LOAD_DT, UPDATE_DT
VALUE_ADDED_SERVICES. Key: VALUE_ADDED_SERVICES_KEY. Columns: VALUE_ADDED_SERVICES_DES, VALUE_ADDED_SERVICES_CD, VALUE_ADDED_SERVICES_ID, UPDATE_DT, LOAD_DT
VALUE_ADDED_SERVICES_BILLED. Key: MONTH (FK), SUBSCRIBER_ID (FK), VALUE_ADDED_SERVICES_KEY (FK). Columns: VALUE_ADDED_SERVICES_AMT, UPDATE_DT, LOAD_DT
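The handset example on this slide, where one categorical value becomes a row of 0/1 columns, can be sketched as below. The function name and the use of a dict for the flattened row are illustrative choices.

```python
def one_hot(value, categories):
    """Turn one categorical value into a 0/1 column per category."""
    return {category: 1 if value == category else 0 for category in categories}


HANDSETS = ["Handset 1", "Handset 2", "Handset 3", "Handset 4"]

# A subscriber whose current handset is "Handset 2".
row = one_hot("Handset 2", HANDSETS)
```

Complete denormalization repeats this for every lookup table so that each subscriber ends up as a single flat record.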
60
Step 6 - Prepare Model Set
Example of Casting a Test Set for Churn Modeling
• Sliding Windows Concept
Oct Nov Dec Jan Feb Mar Apr May Jun Jul
Model Set Feb: 3 2 1 X P
Model Set Mar: 3 2 1 X P
Model Set Apr: 3 2 1 X P
Score Set: 3 2 1 X P P
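A sliding window like the one above can be generated programmatically. The reading used here is an assumption, since the column alignment of the original chart is not recoverable: each model set is named for its predicted month (P), preceded by one skipped month (X) and three input months (3, 2, 1).

```python
MONTHS = ["Oct", "Nov", "Dec", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"]


def cast_window(predict_month, inputs=3, latency=1):
    """Label the months feeding one model set: input months counted down to 1,
    then the skipped month(s) X, then the predicted month P."""
    p = MONTHS.index(predict_month)
    start = p - latency - inputs
    if start < 0:
        raise ValueError("not enough history before " + predict_month)
    labels = [str(n) for n in range(inputs, 0, -1)] + ["X"] * latency + ["P"]
    return dict(zip(MONTHS[start:p + 1], labels))


model_set_feb = cast_window("Feb")
model_set_mar = cast_window("Mar")
```

Casting several model sets from different months, as the slide does, helps keep the model from learning effects specific to one calendar month.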
61
Step 7 - Conduct Modeling
• Done by business team with help from Vendor (SAS)
  – Decision Trees and Neural Networks (Churn)
  – Clustering (Segmentation)
62
The Action
Various Marketing and CRM Activities
63
Measuring
• Not there yet
  – Intend to compare actioned vs. non-actioned results
  – 50% to be actioned
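One way to set up the actioned vs. non-actioned comparison is to split the scored subscribers at random, as sketched below. Random assignment and the fixed seed are assumptions; the slides only state that 50% are to be actioned.

```python
import random


def split_actioned(subscriber_ids, fraction=0.5, seed=0):
    """Randomly split subscribers into an actioned group and a hold-out group."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(subscriber_ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


actioned, held_out = split_actioned(range(10))
```

Comparing churn between the two groups afterwards separates the effect of the campaign from the accuracy of the model.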
64
And the cycle continues ...
Business Problem
Transform Data
Act
Measure