Lecture 4 MARK2039 Winter 2006 George Brown College Wednesday 9-12.

Transcript of Lecture 4.

Page 1

Lecture 4

MARK2039

Winter 2006

George Brown College

Wednesday 9-12

Page 2

Assignment

4 marks
1) You have been asked to find the number of Quebecers that have spent over $100 in the last two months. There is a customer file containing 5 million customers and a transaction file containing 500 million transactions. You are running SAS software against this system and realize that it will take a full day of processing time to get your answer.

The customer file contains the following fields: Account ID, Name, Address, Postal Code, Income, Age, Household Size, and Start Date.

The transaction file contains the following fields: Account ID, Amount, Quantity, Date, and Transaction Type.

Answer the following: What specific fields of information from each file would you pull in order to answer this request?
– Postal code (1st digit = ‘G’, ‘H’, ‘J’) from the customer file
– Amount from the transaction file
– Date from the transaction file
– Account ID on both the customer file and the transaction file
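The pull described above can be sketched in a few lines. This is illustrative Python rather than the SAS the course assumes, and the sample records are hypothetical; only the fields listed in the answer are touched.

```python
from datetime import date, timedelta

# Hypothetical extracts: only the needed fields from each file
customers = [
    {"account_id": 1, "postal_code": "H3A2B4"},  # 'H' = Quebec
    {"account_id": 2, "postal_code": "M5A3S6"},  # 'M' = Ontario
]
transactions = [
    {"account_id": 1, "amount": 60.0, "date": date(2006, 1, 20)},
    {"account_id": 1, "amount": 55.0, "date": date(2006, 2, 1)},
    {"account_id": 2, "amount": 200.0, "date": date(2006, 2, 1)},
]

today = date(2006, 2, 15)          # assumed "as of" date
cutoff = today - timedelta(days=60)  # roughly the last two months

# Quebec accounts: 1st digit of postal code is G, H, or J
quebec_ids = {c["account_id"] for c in customers
              if c["postal_code"][:1] in ("G", "H", "J")}

# Sum recent spend per Quebec account
spend = {}
for t in transactions:
    if t["account_id"] in quebec_ids and t["date"] >= cutoff:
        spend[t["account_id"]] = spend.get(t["account_id"], 0.0) + t["amount"]

num_quebecers = sum(1 for total in spend.values() if total > 100)
print(num_quebecers)  # 1 (account 1 spent $115 in the window)
```

Pulling only these four fields, rather than entire records, is exactly what keeps the 500-million-row pass manageable.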

Page 3

Assignment

2 marks
2) If you were to use the above information to develop a model, how would the information finally need to be structured? Hint: you need to think of the data itself and also how it is finally organized as an analytical file. One record per customer with all information summarized to the customer level.

4 marks
3) Listed below are a set of variables. For each variable, indicate whether they are nominal, ordinal, or interval.

1st 3 digits of postal code: nominal
Household size: interval
Credit Score: interval
Model Rank: ordinal
Product Code: nominal
Median Income: interval

Variables

Data Field                             # of records  Format     # of unique values  # of missing values
1st 3 digits of postal code            100000        character  1000                0
Household size                         100000        numeric    10                  20000
Credit score                           100000        numeric    90000               5000
Model rank                             100000        character  100                 0
Product code                           100000        character  5000                0
Median Income of Postal Code of record 100000        numeric    20000               0
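A profile like the table above can be produced with a small helper. A minimal sketch with hypothetical values; the convention that a missing value arrives as None or an empty string is an assumption.

```python
def profile(values):
    """Profile one data field: record count, unique values, missing values."""
    missing = sum(1 for v in values if v is None or v == "")
    present = [v for v in values if v is not None and v != ""]
    return {
        "records": len(values),
        "unique": len(set(present)),
        "missing": missing,
    }

# Hypothetical household-size field with two missing values
household_size = [2, 3, None, 2, 5, None, 1]
print(profile(household_size))  # {'records': 7, 'unique': 4, 'missing': 2}
```

Numbers like "90000 unique values out of 100000" (credit score) or "20000 missing" (household size) come straight out of a pass like this and drive the decision of whether a variable is usable.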

Page 4

Recap

• Where are we within the data mining phase?

• Types of Data
– Nominal
– Ordinal
– Interval

• What are some key things to look for in determining whether or not a variable is good for data mining analysis?

Databases

• Why do we need to have some understanding of databases?

• How does a database facilitate the data miner’s work?

Page 5

Database Structure

• In database design, most databases are relational
– Creates a key which becomes a database index
– This index or key becomes the link between different files

[Diagram: Customer, Transaction, and Promotion tables]

Customer ID is the link between all the tables

Why do we need to think about the notion of a relational database?

Page 6

Database Structure

• Relational DB
– Database indexes allow very quick processing of data when joining and merging files together
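The speed-up an index provides can be shown in miniature: instead of scanning the customer table once per transaction, build a lookup keyed on Customer ID one time, then each join probe is a constant-time hash lookup. The tables and field names below are hypothetical.

```python
customers = [
    {"cust_id": 100, "income": 30000},
    {"cust_id": 200, "income": 42500},
]
transactions = [
    {"cust_id": 100, "amount": 50},
    {"cust_id": 200, "amount": 75},
    {"cust_id": 100, "amount": 20},
]

# The "index": one pass to build a lookup keyed on Customer ID
index = {c["cust_id"]: c for c in customers}

# Join: each transaction is matched to its customer in constant time
joined = [{**index[t["cust_id"]], **t} for t in transactions
          if t["cust_id"] in index]
print(len(joined))  # 3 joined records
```

Without the lookup, the same join would scan all customers for every transaction, which is what makes un-indexed merges of large files so slow.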

• The key in all database design is to create a database that optimizes processing of all information.

• In database design, you want the right data to be stored which is useful from a data mining perspective

• From a marketing standpoint, can you think of some examples? Why is this important from a data mining standpoint?

Page 7

Database Structure

• Other approaches used in speeding up database processing
– Inverted flat files

• This technology allows each field to be indexed

• Very common amongst the leading-edge DB suppliers today.

• Is much faster at processing data than traditional relational DB technology

• Again, why is this relevant from a data mining perspective?

Page 8

Databases

• Analytical File
– For most data mining applications, your analytical file needs to be in the format of one record per customer with all known attributes
– Generally, the database is not in that format
– ECTL – extraction, cleaning, transformation, and loading – is the process/methodology for preparing data for data mining
– Typically a flat file is used for analysis

What do you think is the most important concept for data mining: databases or the analytical file?

How do they work together?
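The ECTL flow can be pictured as a rough sketch, with hypothetical tables: extract the fields, clean them, transform the transactions down to the customer level, and load one row per customer into the flat analytical file.

```python
# Extract: raw fields pulled from the database (hypothetical)
raw_customers = [
    {"cust_id": 1, "income": "30000", "age": "45"},
    {"cust_id": 2, "income": "", "age": "39"},  # blank income
]
raw_transactions = [
    {"cust_id": 1, "amount": 50.0},
    {"cust_id": 1, "amount": 25.0},
    {"cust_id": 2, "amount": 90.0},
]

def clean_int(value, default=0):
    # Clean: coerce to int, treating blanks as a default value
    return int(value) if str(value).strip() else default

analytical_file = []
for cust in raw_customers:
    # Transform: summarize transactions to the customer level
    total = sum(t["amount"] for t in raw_transactions
                if t["cust_id"] == cust["cust_id"])
    # Load: one record per customer with all known attributes
    analytical_file.append({
        "cust_id": cust["cust_id"],
        "income": clean_int(cust["income"]),
        "age": clean_int(cust["age"]),
        "total_spend": total,
    })

print(analytical_file[0]["total_spend"])  # 75.0
```

The end result is the flat, one-row-per-customer structure the modelling algorithms expect, which the database itself rarely provides directly.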

Page 9

Databases

File 1
- Cust ID
- Income
- Age
- Household Size

File 2
- Cust ID
- Trans. Type
- Trans. Date
- Trans. Amt

Page 10

Databases

• In building databases, the notion of continuity management is important

• In the context of household or customers on a database, continuity management is the process by which you are able to track customers through events in time.

• Why is this important?

Page 11

Analytical file

• All data mining algorithms want their input in tabular form – rows & columns as in a spreadsheet or database table

If we saw data like this, what typically needs to be done? Assume the reference number is the customer I.D. What does continuity mean here?

Page 12

What the Data Should Look Like

• A customer “snapshot” = a single row

Each row represents the customer and whatever might be useful for data mining

Page 13

What the Data Should Look Like

• The columns
– Contain data that describe aspects of the customer (e.g., sales $ and quantity for each of product A, B, C)
– Contain the results of calculations referred to as derived variables (e.g., total sales $)

Cust Id  Date Of Purchase  # of Months since last purchase
123      jan 4/2006        4
123      dec 6/2006        5
123      mar 4/2006        2
456      apr 6/2006        1
456      feb 6/2006        3

Derived variables are Total Price in the 1st chart and # of months since last purchase in the 2nd chart.
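A derived variable such as "# of months since last purchase" is a simple calculation once a snapshot date is fixed. A sketch; the snapshot date and the whole-month counting rule are assumptions for illustration.

```python
from datetime import date

def months_between(earlier, later):
    # Whole-month difference, ignoring the day of month
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

snapshot = date(2006, 5, 1)       # assumed "as of" date for the snapshot
last_purchase = date(2006, 1, 4)  # e.g. customer 123's purchase
print(months_between(last_purchase, snapshot))  # 4
```

Recency variables like this one are computed the same way for every customer at file-build time, so every row carries a comparable value.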

Page 14

Sourcing the Data from External Data Sources

Typical Data Sources - External
• Geo-demographic information
– Statistics Canada (aggregated level data)
• Census data
• Taxfiler data

• Geo-demographic Cluster Codes

– Generation 5 – Mosaic

– Equifax -Psyte

• Survey Data

– ICOM

Page 15

Sourcing the Data (Extraction)

Census data

Data collected every 5 years.

Enumeration Area level.

~ 250 households on average.

~ 440 households in large urban areas.

~125 households in rural areas.

~ 50,000 EA’s in Canada

Can be converted to postal code level and appended to your file.

Type of data

-immigration/ethnicity/language patterns

-occupation

-education

-income/gender/age/employment

-religion

Page 16

Sourcing the Data (Extraction)

Taxfiler data.

Data collected every year.

Postal walk level.

~ 450 households on average.

~ 26,000 Postal Walks in Canada.

• Contains data from previous year's tax returns.

• Income by source and type: employment, investment.

• RRSP contributions and room, etc.

Can also be appended to your files at postal code level.

Page 17

Sourcing the Data (Extraction)

Geo-Demographic cluster codes.

• Uses Stats Can data in most cases plus other external data overlays to determine postal code cluster groups

– Quebec farm families

– Young and Struggling

– Empty Nesters

– Upper Income Family-Oriented

• Equifax

– High credit risk

– Medium credit Risk

– Etc.

Page 18

Sourcing the Data-Stats Can Type Table

Postal Area  Median Income  Avg. Age  Avg. Household Size  % French
Area 1       42000          40        2                    10.00%
Area 1       42000          40        2                    10.00%
Area 1       42000          40        2                    10.00%
Area 2       50000          35        1                    85.00%
Area 2       50000          35        1                    85.00%
Area 3       37000          43        3                    5.00%
Area 3       37000          43        3                    5.00%
Area 3       37000          43        3                    5.00%

Page 19

Sourcing the Data (Extraction)

Typical Data Sources - External
• Business-to-Business “Firmographics”

– SIC, Number of Employees, Revenue etc.

– Sources: D&B, CBI / InfoCanada, Scott’s

Company  Employee Size  Industry Classification  Sales Size  Yrs In Business
XYZ      1-4            retail                   <1 million  10
…        …              …                        …           …

Page 20

Sourcing the Data (Extraction)

Typical Data Sources - Survey
• Attitudinal – needs, preferences, social values, opinions
• Behavioural – buying habits, lifestyle, brand usage

For most data mining projects, we want to assign a value to all customers; therefore the information used must be available for all customers
– survey-based information generally cannot be used, as it typically can only be applied to a small portion of the database

Page 21

Sourcing the Data (Extraction)

Typical Data Sources - Survey
• ICOM
– Surveys approx. 10MM Canadians
– Fully updated every 2 years
– Contains attitudinal and purchase behaviours across all industry sectors

• What do you think the value is here?

Page 22

Examples

• A marketer wants to target high-risk cancels for a retention campaign for a Telco. Information is contained in legacy database systems containing a customer file, transaction file, and call detail file. As a marketer and analyst, answer the following requirements:

– 5 key data fields from the above files that should be created in the analytical exercise

– Create a diagram or schema of how this data would be linked into an analytical file

– What resources would you need and why?• People• Software

Page 23

Examples

• How would the previous example change if the information was available in a data mart or warehouse?

Page 24

Examples

• A university is conducting a fund-raising campaign to its alumni (100,000 members). On its database, it has the following information:

– Age of alumnus

– Year graduated

– Degree and specialization

– Donation value

– Current Address

• It has also collected information from a survey. 10% of members have responded to the survey, with the following %'s of members providing each piece of information:

– Current Occupation-5%

– Current Income-8%

– Why they give – 7%

– How much they give

• As a marketer and analyst, how would you use the information to conduct a campaign to its high-value donors?

Page 25

Examples

• A computer company collects information from all customers who purchase a new product. This new product information is collected through a product registration form which the customer fills in at point of purchase. This information relates to the following:

– Product preferences, income, household size, and hobbies

All customer tombstone information as well as purchase information related to products bought has been summarized and stored onto a data mart.

As a marketer and analyst, how would you use the information to develop a cross-sell campaign?

Page 26

Examples

• A credit card company has 100000 customers containing tombstone information and detailed transactional information on their database. 50000 customers have email addresses. 10% of 50000 customers have responded to a survey in which 5% have indicated that they consider themselves loyal customers. Web activity of these loyal customers indicate that many of them have clicked on travel-related packages.

• Database information contains
– Age, gender, income, where they spend, recency of spend, frequency of spend, and amount of spend.

• As a marketer and analyst, how would you use this information to sell travel-related insurance?

Page 27

Creating the Analytical File-Reviewing Data Dumps

Initial dump of 1st few records

Account  Postal  Birth  Start  Behav.  Income  # in
Number   Code    Date   Date   Score           House
123456   M5A3S6  07/49  03/91  500     30000   6
345231   H3A2B4  08/54  04/92  550     42500   1
543236   T5A3S7  06/92  600    35000   3
543210   etc…

Missing values in data are not properly being treated.

Page 28

Creating the Analytical File-Reviewing Data Dumps

Proper treatment of missing values results in the following dump:

Account  Postal  Birth  Start  Behav.  Income  # in
Number   Code    Date   Date   Score           House
123456   M5A3S6  07/49  03/91  500     30000   6
345231   H3A2B4  08/54  04/92  550     42500   1
543236   T5A3S7  .      06/92  600     35000   3
543210   etc…

Effective programming can ensure that records are being properly loaded into the system.

Initial dump of 1st few records

Page 29

Creating the Analytical File-Reviewing Data Dumps

A dump of a few records from a billing file revealed the following after sorting by account number

Account  Purchase  Product   Date of
         Amount    Category  Purchase
123460   $50       ABC123    19980630
123460   $75       DEF789    19980703
456720   $90       GHI123    19980701
456720   $100      ABC456    19980715
333121   $25       JKL432    19980315
333121   $40       GHI342    19980401
789232   $30       GHI261    19980228
789232   $20       236phi    19980307

View of the Transaction File

Page 30

Creating the Analytical File-Reviewing Data Dumps

A dump of a few promotion history records revealed the following after sorting by account number:

Account No.  Promotion ID  Promotion Date
123460       ABA123        19970115
123460       ACB431        19970315
123460       AAC221        19970618
456720       BAA123        19970115
456720       BBA321        19980115
456720       BCB330        19980315
456720       BAC112        19980618
333121       CBA321        19980115
789232       BAD333        19980415

View of the Promo History File

Page 31

Creating the Analytical File-Reviewing Data Dumps

• Using your marketing knowledge, give me examples of variables that we might create from the last three slides
– Slide 14
– Slide 15
– Slide 16

Page 32

Creating the Analytical File-Data Hygiene and Cleansing

• Once the data has been dumped in order to view records, typically data hygiene and cleansing have to take place

• Two key deliverables
– Clean name and address information
– Standard rules for coding of data values

Page 33

Creating the Analytical File-Data Hygiene and Cleansing

• Clean Name and Address Information
– Market to the right individual
– Create match keys

Page 34

• Clean Name and Address Information
– Market to the right individual
– Create match keys
– Name and Address Standardization

BankID 987654321
Name JONH SMITH JR.
Address1 123 WILLIAMS STRET
Address2 2ND FLOOR
Address3 TRT., O.N. M5G-1F3
Country CDN
UnIndivID 123456789

BankID 987654321
PreName
FirstName
Surname JONH SMITH JR.
PostName
Street1 123 WILLIAMS STRET
Street2 2ND FLOOR
City TRT
Province O.N.
Postal Code M5G-1F3
Country CANADA
UnIndivID 123456789
Origin Bank

Creating the Analytical File – Name and Address Standardization

Page 35

DATA CLEANING
• Address correction
• Name parsing
• Genderizing
• Casing

After cleaning:
BankID 987654321
PreName Mr.
FirstName John
Surname Smith
PostName Jr.
Street1 200-123 Williams Street
Street2
City Toronto
Province ON
Postal Code M5G 1F3
Country Canada
UnIndivID 123456789
Origin Bank

Before cleaning:
BankID 987654321
PreName
FirstName
Surname JONH SMITH JR.
PostName
Street1 123 WILLIAMS STRET
Street2 2ND FLOOR
City TRT
Province O.N.
Postal Code M5G-1F3
Country CANADA
UnIndivID 123456789
Origin Bank

Creating the Analytical File – Name and Address Standardization

Page 36

Creating the Analytical File-Merge Purge of Names

• What are the reasons for creating unique customer match keys?
– Generating a marketing list
– Conducting analysis

Should the match keys be the same for both above scenarios?

What are the situations in which match keys are numeric?

Page 37

Creating the Analytical File-Merge Purge of Names

Common fields to use in creating Match keys

• First Name;

• Surname;

• Unique Individual ID;

• Postal Code

• Credit Card Number

• Duns Number for Businesses

• Phone Number

Unique IDs or numeric IDs are the preferred choice when creating match keys

• Let’s take a closer look at creating match keys using name and address

Page 38

Creating the Analytical File-Merge Purge of Names

• Let’s take a look at 6 records and see what this means.

Surname  First Name  Address           Postal Code  Match Key
Smith    John        12345 Elm Street  L1A2A1       L1A2A1SMITHJ
Smith    James       45678 Elm Street  L1A2A1       L1A2A1SMITHJ
Brown    Tim         5678 Oak          M5A3A2       M5A3A2BROWNT
Brown    T.          5678 Oak Road     M5A3A2       M5A3A2BROWNT
Green    Ted         3478 Pine         V6A2A1       V6A2A1GREENT
Green    Tanya       3478 Pine         V6A2A2       V6A2A1GREENT
Filler   Robert      2345 Nurr         M5A3A2       M5A3A2FILLERR
Filler   Larry       5672 Bolton Dr.   M6A2A1       M6A2A1FILLERL
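The keys in the table above appear to follow the rule postal code + surname + first initial, upper-cased. A sketch of that rule (the exact construction used in practice may differ):

```python
def match_key(surname, first_name, postal_code):
    # Postal code, then surname, then the first initial, all upper-cased
    return (postal_code + surname + first_name[:1]).upper()

print(match_key("Smith", "John", "L1A2A1"))  # L1A2A1SMITHJ
print(match_key("Brown", "T.", "M5A3A2"))    # M5A3A2BROWNT
```

Note how "Tim Brown" and "T. Brown" collapse to the same key (a correct merge), while John and James Smith at the same postal code also collide (a false merge); trade-offs like these are exactly what merge-purge design has to manage.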

Page 39

Creating the Analytical File-Merge Purge of Names

• Example: You have one record here: – Richard Boire-4628 Mayfair Ave. H4B2E5

– How would you use the above information for a backend analysis if I were a responder to an acquisition campaign?

– What about if you were conducting analysis on me as an existing customer who responded to a cross-sell campaign.

– How about if you wanted to send me a direct mail piece?

Page 40

Creating the Analytical File- Data standardization

• Refers to a process where values from a common variable from different files are mapped to the same value. Some common examples:

• SIC Code Industry Classification Table
– Industry categories have a common set of codes

• Postal Code Variable
– A postal code has to have 6 characters in the pattern alpha, numeric, alpha, numeric, alpha, numeric, excluding the following alphas: D, F, O, Q, U, and Z.

• Give me examples of bad postal codes vs. good postal codes.
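A standardization check for the rules above can be sketched as follows. The letter exclusions and valid first letters are taken from these slides, and the sample codes are hypothetical.

```python
import re

FIRST_LETTERS = set("ABCEGHJKLMNPRSTVXY")  # valid 1st letters per the slides
BAD_LETTERS = set("DFOQUZ")                # excluded alphas per the slides

def is_valid_postal_code(code):
    code = code.replace(" ", "").upper()
    # Pattern: alpha, numeric, alpha, numeric, alpha, numeric
    if not re.fullmatch(r"[A-Z]\d[A-Z]\d[A-Z]\d", code):
        return False
    if code[0] not in FIRST_LETTERS:
        return False
    # Excluded letters may not appear in any alpha position
    return not any(ch in BAD_LETTERS for ch in code[::2])

print(is_valid_postal_code("L1A 2A1"))  # True
print(is_valid_postal_code("D5G1A3"))   # False: D is not a valid letter
print(is_valid_postal_code("M5G1A"))    # False: only 5 characters
```

A check like this is typically run over the whole file, with failures flagged for correction or treated as missing.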

Page 41

Creating the Analytical File- Data Standardization

• Here is an example of how disposition codes for telemarketing outcomes might be handled

Code  Description
21    Do Not Call
21    Do Not Call
21    Do Not Call
32    Do Not Call
9     Do Not Call – place on “Do Not Call” list permanently
20    Do Not Solicit – do not call, mail, email or attempt any other form of solicitation to this customer
22    Do Not Mail – place permanently on “Do Not Mail” list; future calling solicitations OK
U28   No sale – do not solicit
B22   Never call again, <<Client>>
B23   Never call again, general
C08   Scrubbed Vendor DNS

Page 42

Creating the Analytical File- Data Standardization

• Postal Code Standardization
– Six-character code comprising alpha, numeric, alpha, numeric, alpha, numeric
– 1st letters: A, B, C, E, G, H, J, K, L, M, N, P, R, S, T, V, X, Y

• SIC (Standard Industry Code Classification)
– 4-digit code used to classify all companies into a standard set of industries

Page 43

Creating the Analytical File- Data standardization

• Example:
– You have been asked to build a retention model. You have two years' worth of transaction data. Changes in the product category codes occurred six months ago. Key information that you would look at would be as follows:
• Income category
• Product Category
• Transaction Codes
• Transaction Amount
• Postal Code
• Transaction Date
• Gender

What would you need to do?

Page 44

• Geocoding is the process that assigns a latitude-longitude coordinate to an address. Once a latitude-longitude coordinate is assigned, the address can be displayed on a map or used in a spatial search.

• Data miners often use these coordinates to calculate such things as “distance to the nearest store”

Creating the Analytical File- Geo-Coding
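Once addresses are geocoded, "distance to the nearest store" reduces to a distance formula over latitude-longitude pairs. A sketch using the haversine great-circle formula; all coordinates below are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/long points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

customer = (43.65, -79.38)                   # hypothetical geocoded customer
stores = [(43.70, -79.40), (45.50, -73.57)]  # hypothetical store locations

# Derived variable: distance to the nearest store
nearest = min(haversine_km(*customer, *s) for s in stores)
print(round(nearest, 1))  # distance in km to the closest store
```

In a real build this calculation runs once per customer against the full store list, producing a single numeric column on the analytical file.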

Page 45

Demographic Analysis

[Diagram: Geo Profile linking Store Location, Population Count, Age Distribution, and Average Age]

Page 46

Creating the Analytical File-What is Geocoding?

• Let's look at a sample of what some data might look like:

Postal Code  Latitude  Longitude
A1A5A2       5         10
B5V1A2       7         20
M6B2A2       10        30
T4B1A2       6         40
V4H2B5       11        50

How do we use this data to create meaningful variables?

Page 47

Creating the Analytical File-What is Geocoding

• Example:– A retailer has the following information:

• Name and address of its customers

• Address of its stores

• Stats Can Information

– As a marketer, how would you intelligently use this information

Page 48

Region             # of Customers  % of Total
Prairie Provinces  25 M            2.5%
Quebec             100 M           10%
Ontario            350 M           35%
West               25 M            2.5%
Missing Values     500 M           50%
Total              1 MM            100%

Frequency Distribution

• The report below uses first digit of postal code to assign customers to region.

• For example, postal codes beginning with ‘G’, ‘H’, or ’J’ represent the Quebec region.

Customer Profiling
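The region report above can be sketched as follows. Only the 'G'/'H'/'J' → Quebec mapping comes from the slide; the rest of the first-letter mapping and the sample postal codes are assumptions for illustration.

```python
# Assumed (partial) mapping from the 1st character of the postal code to region
REGIONS = {
    "G": "Quebec", "H": "Quebec", "J": "Quebec",
    "K": "Ontario", "L": "Ontario", "M": "Ontario", "N": "Ontario", "P": "Ontario",
    "R": "Prairie Provinces", "S": "Prairie Provinces", "T": "Prairie Provinces",
    "V": "West",
}

def region_of(postal_code):
    if not postal_code:
        return "Missing Values"
    return REGIONS.get(postal_code[0].upper(), "Other")

postal_codes = ["H3A2B4", "M5A3S6", "", "V6A2A1", "J4B5C6"]  # hypothetical

# Tabulate the frequency distribution by region
counts = {}
for pc in postal_codes:
    region = region_of(pc)
    counts[region] = counts.get(region, 0) + 1

for region, n in counts.items():
    print(region, n, f"{100 * n / len(postal_codes):.1f}%")
```

The "Missing Values" bucket falls out naturally, which is what surfaces data-quality problems like the 50% missing rate in the report above.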

Page 49

Frequency Distribution

Tenure   # of Customers  % of Customers
1998     9800            14%
1999     10000           14%
2000     12000           17%
2001     8000            11%
Missing  30000           43%
Total    69800           100%

This tenure report would tell us that the tenure field was not on this database prior to 1998 and that 30,000 customers began prior to that date. Given the high percentage of customers with missing values, we would need to determine whether we could capture tenure from another field in the database or not use this variable at all.

Page 50

Frequency Distribution

Type of Product/Service Purchased  # of Customers  % of Customers
Product A                          35000           29.66%
Product B                          40000           33.90%
Product C                          25000           21.19%
Product D                          15000           12.71%
Other                              3000            2.54%
Total                              118000          100.00%

The Product/Service field has good coverage and shows that Product B has been the best-selling product, followed closely by Product A.