Data Mining Process Source: CRISP-DM (SPSS.com website)


Data Cleaning

MIS Issues (Source: Article by Ralph Kimball)

Analyst Issues

MIS Issues

Elementizing (Parsing)

Standardizing

Verifying

Matching, Householding

Documenting

Elementizing

Ralph B and Julianne Kimball Trustees for Kimball Fred C
Ste. 116
13150 Hiway 9
Box 1234 Boulder Crk
Colo 95006

Addressee First Name(1): Ralph
Addressee Middle Initial(1): B
Addressee Last Name(1): Kimball
Addressee First Name(2): Julianne
Addressee Last Name(2): Kimball
Addressee Relationship: Trustees for
Relationship Person First Name: Fred
Relationship Person Middle Name: C
Relationship Person Last Name: Kimball
Street Address Number: 13150
Street Name: Hiway 9
Suite Number: 116
Post Office Box Number: 1234
City: Boulder Crk
State: Colo
Five Digit Zip: 95006
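A minimal sketch of elementizing in Python, assuming the address block arrives as separate lines and that each line fits one of four simple patterns. The regular expressions and field names are illustrative assumptions, not the method from the Kimball article; real parsers use far richer rule sets and would also handle the name line.

```python
import re

# Hypothetical line patterns; each maps one address line to named elements.
# Order matters: the more specific patterns are tried first.
PATTERNS = [
    re.compile(r"Ste\.?\s+(?P<suite_number>\d+)"),
    re.compile(r"Box\s+(?P<po_box_number>\d+)\s+(?P<city>.+)"),
    re.compile(r"(?P<street_number>\d+)\s+(?P<street_name>.+)"),
    re.compile(r"(?P<state>[A-Za-z]+)\s+(?P<zip5>\d{5})"),
]

def elementize(lines):
    """Split free-form address lines into named elements."""
    elements = {}
    for line in lines:
        for pattern in PATTERNS:
            match = pattern.fullmatch(line.strip())
            if match:
                elements.update(match.groupdict())
                break  # first matching pattern wins
    return elements

address = ["Ste. 116", "13150 Hiway 9", "Box 1234 Boulder Crk", "Colo 95006"]
parsed = elementize(address)
print(parsed["street_number"], parsed["street_name"])  # 13150 Hiway 9
```

Note that the elements still carry their nonstandard spellings ("Hiway", "Colo"); fixing those is the job of the later steps.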

Standardizing

Ste = Suite

Hiway 9 = Highway 9

Another example: grade “D” means Distinction in Australia
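Standardizing can be sketched as a token lookup. The table below holds only the two mappings from the example; the names STANDARD_FORMS and standardize are hypothetical, and a real table would be much larger and context-aware.

```python
# Hypothetical lookup of nonstandard tokens and their standard forms.
STANDARD_FORMS = {
    "Ste": "Suite",
    "Ste.": "Suite",
    "Hiway": "Highway",
}

def standardize(text):
    """Replace each known nonstandard token with its standard form."""
    tokens = text.split()
    return " ".join(STANDARD_FORMS.get(token, token) for token in tokens)

print(standardize("Ste. 116"))       # Suite 116
print(standardize("13150 Hiway 9"))  # 13150 Highway 9
```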

Verification

Zip code 95006 is CA, not Colorado
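A rough sketch of this kind of check, assuming a lookup from three-digit ZIP prefix to state. Only two states are listed and the ranges are approximate and illustrative; a real verification step would use a complete postal reference file.

```python
# Illustrative subset of three-digit ZIP prefix ranges (assumed, approximate).
ZIP_PREFIX_RANGES = {
    "CA": (900, 961),
    "CO": (800, 816),
}

def state_for_zip(zip5):
    """Return the state whose prefix range contains this ZIP, or None."""
    prefix = int(zip5[:3])
    for state, (low, high) in ZIP_PREFIX_RANGES.items():
        if low <= prefix <= high:
            return state
    return None

def verify_state(zip5, claimed_state):
    """Return (is_consistent, state_implied_by_zip)."""
    implied = state_for_zip(zip5)
    return implied == claimed_state, implied

print(verify_state("95006", "CO"))  # (False, 'CA')
```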

Matching/Householding

Match record with other customer records containing Ralph and Julianne Kimball

Establish that they are part of the same household
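One simple way to sketch householding is to group customer records on a key built from last name and standardized address. The key definition and the sample records (including the unrelated third record) are made up for illustration; real matching usually needs fuzzy comparison rather than exact keys.

```python
from collections import defaultdict

def household_key(record):
    """Build a crude household key from last name and address."""
    return (
        record["last_name"].strip().lower(),
        record["address"].strip().lower(),
    )

def group_households(records):
    """Map each household key to the first names found under it."""
    households = defaultdict(list)
    for record in records:
        households[household_key(record)].append(record["first_name"])
    return dict(households)

# Hypothetical sample records.
customers = [
    {"first_name": "Ralph", "last_name": "Kimball", "address": "13150 Highway 9"},
    {"first_name": "Julianne", "last_name": "KIMBALL", "address": "13150 highway 9 "},
    {"first_name": "Pat", "last_name": "Smith", "address": "1 Main St"},
]
households = group_households(customers)
print(households[("kimball", "13150 highway 9")])  # ['Ralph', 'Julianne']
```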

Analyst Issues

Physical data problems

Data Dictionaries

Validation (Frequencies)

Missing Data

The “zero” value problem

Inappropriate (Future) data for modeling

Unavailable data

Physical

Cannot access data

ASCII vs EBCDIC

On a medium that you can’t use (a certain type of tape, for instance)
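The ASCII vs EBCDIC problem can be illustrated in Python, which ships codecs for several EBCDIC code pages. The sketch assumes the file uses code page 037 (one common US variant); the actual code page of a given mainframe extract has to be confirmed with whoever produced it.

```python
# The same five letters in EBCDIC code page 037 (assumed code page).
ebcdic_bytes = b"\xc8\xc5\xd3\xd3\xd6"

# Reading the bytes with the wrong encoding produces unusable text...
misread = ebcdic_bytes.decode("latin-1")

# ...while the matching codec recovers the original characters.
decoded = ebcdic_bytes.decode("cp037")
print(decoded)  # HELLO
```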

Data Dictionaries

What are the fields?

Where are they located?

What format are they stored in?

Missing Data

Ignore

Find the right values if you can

Use the average for that variable

Replace with a number that matches its characteristics (What do the missing people look like in terms of the dependent variable? Who else looks like that?)
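Two of these options can be sketched in plain Python. The first fills missing values with the overall average. The second is one possible reading of "matches its characteristics": fill with the average of records that share the same value of the dependent variable. The field names ("bad", "income") and the sample rows are hypothetical.

```python
from statistics import mean

def impute_with_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_by_group(rows, group_field, value_field):
    """Replace None with the mean of observed values in the same group."""
    observed = {}
    for row in rows:
        if row[value_field] is not None:
            observed.setdefault(row[group_field], []).append(row[value_field])
    filled = []
    for row in rows:
        group_values = observed.get(row[group_field])
        if row[value_field] is None and group_values:
            # Copy the row so the input data is left untouched.
            row = {**row, value_field: mean(group_values)}
        filled.append(row)
    return filled

# Hypothetical sample: "bad" is the dependent variable.
rows = [
    {"bad": 1, "income": 30},
    {"bad": 1, "income": None},
    {"bad": 0, "income": 50},
    {"bad": 0, "income": 70},
    {"bad": 1, "income": 40},
]
filled_rows = impute_by_group(rows, "bad", "income")
print(impute_with_mean([10, None, 20]))  # [10, 15, 20]
print(filled_rows[1]["income"])          # 35
```

A group with no observed values is left as missing rather than guessed.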

The zero problem

What does 0 mean?

If “Number of Revolving Bankcard Trades Currently Past Due” = 0, what does that mean?

# of Bank Rev. Trds Currently Past Due

                            Cumulative  Cumulative
BRPSTD  Frequency  Percent   Frequency     Percent
--------------------------------------------------
0           13485     96.0       13485        96.0
1             486      3.5       13971        99.5
2              57      0.4       14028        99.9
3              12      0.1       14040       100.0
4               2      0.0       14042       100.0

# of Trds

                            Cumulative  Cumulative
TRADES  Frequency  Percent   Frequency     Percent
--------------------------------------------------
0            1606     11.4        1606        11.4
1            1080      7.7        2686        19.1
2            1056      7.5        3742        26.6
3            1007      7.2        4749        33.8
4             949      6.8        5698        40.6
5             911      6.5        6609        47.1
6             849      6.0        7458        53.1
7             793      5.6        8251        58.8
8             682      4.9        8933        63.6
9             622      4.4        9555        68.0
10+          4487     32.0       14042       100.0

# of Bank Rev. Trds

                                     Cumulative  Cumulative
BRTRDS           Frequency  Percent   Frequency     Percent
-----------------------------------------------------------
INQS. & PR ONLY         64      0.5          64         0.5
PR ONLY                 22      0.2          86         0.6
INQS. ONLY             960      6.8        1046         7.4
NO RECORD              560      4.0        1606        11.4
0                     6183     44.0        7789        55.5
1                     2616     18.6       10405        74.1
2                     1427     10.2       11832        84.3
3                      831      5.9       12663        90.2
4                      496      3.5       13159        93.7
5                      287      2.0       13446        95.8
6                      188      1.3       13634        97.1
7                      142      1.0       13776        98.1
8                       92      0.7       13868        98.8
9                       60      0.4       13928        99.2
10+                    114      0.8       14042       100.0

# of Bank Rev. Trds Currently Past Due

                                            Cumulative  Cumulative
BRPSTD                  Frequency  Percent   Frequency     Percent
------------------------------------------------------------------
NO TRADES OF THIS TYPE       6183     44.0        6183        44.0
INQS. & PR ONLY                64      0.5        6247        44.5
PR ONLY                        22      0.2        6269        44.6
INQS. ONLY                    960      6.8        7229        51.5
NO RECORD                     560      4.0        7789        55.5
MISSING                      3475     24.7       11264        80.2
0                            2221     15.8       13485        96.0
1                             486      3.5       13971        99.5
2                              57      0.4       14028        99.9
3                              12      0.1       14040       100.0
4                               2      0.0       14042       100.0
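The frequency tables above can be tied together with a little arithmetic: the 13,485 records reported as 0 in the first BRPSTD table break down exactly into the categories of the last one. The counts are copied from the tables; the variable names, and the reading of the remaining 0 group as "has bankcard trades, none past due", are assumptions.

```python
# Counts copied from the frequency tables.
reported_zero = 13485           # BRPSTD = 0 before recoding

no_trades_of_this_type = 6183   # BRTRDS = 0
inqs_and_pr_only = 64
pr_only = 22
inqs_only = 960
no_record = 560
missing = 3475
true_zero = 2221                # presumably: has bankcard trades, none past due

components = [
    no_trades_of_this_type, inqs_and_pr_only, pr_only,
    inqs_only, no_record, missing, true_zero,
]

# The recoded categories account for every record that was originally 0.
assert sum(components) == reported_zero

true_zero_share = true_zero / reported_zero
print(round(true_zero_share, 3))  # 0.165
```

Only about one in six of the original zeros is a genuine zero in this reading, which is why the meaning of 0 has to be settled before modeling.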

Inappropriate Data Used

Future data used to build a great-looking model.

Used payments until month end instead of payments until cycle date.
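A small sketch of how to guard against this, assuming each payment carries a date and that only payments dated on or before the cycle date would have been known when the model is applied. The field names, dates, and amounts are hypothetical.

```python
from datetime import date

def payments_known_at_cycle(payments, cycle_date):
    """Keep only payments dated on or before the cycle date."""
    return [p for p in payments if p["date"] <= cycle_date]

# Hypothetical payments for one account in one month.
payments = [
    {"date": date(2000, 3, 10), "amount": 100.0},
    {"date": date(2000, 3, 28), "amount": 50.0},  # after the cycle date
]
cycle_date = date(2000, 3, 15)

usable = payments_known_at_cycle(payments, cycle_date)
total_at_cycle = sum(p["amount"] for p in usable)
total_at_month_end = sum(p["amount"] for p in payments)
print(total_at_cycle, total_at_month_end)  # 100.0 150.0
```

Building the model on the month-end total would use information that is not available at scoring time.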

Unavailable Data

Data on rejected applicants

Would they have been Good or Bad had they been accepted?

Use “Reject Inferencing” techniques.