Basic Data Mining Techniques

Transcript of Basic Data Mining Techniques

Page 1: Basic Data Mining Techniques

Basic Data Mining Techniques

Decision Trees

1

Page 2: Basic Data Mining Techniques

Basic concepts

• Decision trees are constructed using only those attributes best able to differentiate the concepts to be learned

• A decision tree is built by initially selecting a subset of instances from a training set

• This subset is then used to construct a decision tree

• The remaining training set instances are used to test the accuracy of the constructed tree (a minimal split sketch follows below)
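
As a minimal illustration of the last bullet (not from the slides; the 70/30 split fraction and the function name are assumptions), the training data might be divided into a tree-building subset and a hold-out test subset like this:

```python
import random

def split_training_set(instances, build_fraction=0.7, seed=0):
    """Select a subset to build the tree; hold out the rest to test its accuracy."""
    rows = list(instances)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * build_fraction)
    return rows[:cut], rows[cut:]   # (instances for building, instances for testing)
```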

2

Page 3: Basic Data Mining Techniques

The Accuracy Score and the Goodness Score

• The accuracy score is the ratio (usually expressed as a percentage) of the number of correctly classified samples to the total number of samples in the training set.

• The goodness score is the ratio of the accuracy score to the total number of branches added to the tree by the attribute used to make the decision.

• The tree with the higher accuracy and goodness scores is the better one (a small computational sketch follows below)
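
A minimal Python sketch of the two scores, assuming each branch of a candidate one-attribute split predicts its majority class; the function and argument names are illustrative, not from the slides:

```python
from collections import Counter

def accuracy_and_goodness(instances, split_attr, class_attr):
    """Accuracy score: correctly classified samples / total samples, when every
    branch of the split predicts its majority class.
    Goodness score: accuracy score / number of branches added by the attribute."""
    branches = {}
    for row in instances:
        branches.setdefault(row[split_attr], []).append(row[class_attr])
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in branches.values())
    accuracy = correct / len(instances)
    return accuracy, accuracy / len(branches)
```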

3

Page 4: Basic Data Mining Techniques

An Algorithm for Building Decision Trees

1. Let T be the set of training instances.

2. Choose an attribute that best differentiates the instances in T.

3. Create a tree node whose value is the chosen attribute.
   - Create child links from this node, where each link represents a unique value for the chosen attribute.
   - Use the child link values to further subdivide the instances into subclasses.

4. For each subclass created in step 3:
   - If the instances in the subclass satisfy predefined criteria (a minimum training set classification accuracy), or if the set of remaining attribute choices for this path is empty, verify the classification for the remaining training set instances following this decision path and STOP.
   - If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
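
A hedged Python sketch of steps 1–4. The goodness-based attribute choice and the 0.8 stopping threshold are assumptions made only to keep the sketch concrete; any attribute-selection measure could be substituted in step 2.

```python
from collections import Counter

def goodness(instances, attr, class_attr):
    """Accuracy of a one-level split on attr divided by its number of branches
    (the goodness score from the previous slides)."""
    branches = {}
    for row in instances:
        branches.setdefault(row[attr], []).append(row[class_attr])
    correct = sum(Counter(v).most_common(1)[0][1] for v in branches.values())
    return (correct / len(instances)) / len(branches)

def build_tree(instances, attributes, class_attr, min_accuracy=0.8):
    """Steps 1-4: choose the attribute that best differentiates T, create a node
    with one child link per value, and recurse on each subclass."""
    labels = [row[class_attr] for row in instances]
    majority, count = Counter(labels).most_common(1)[0]
    # Step 4: stop when the subclass meets the accuracy criterion
    # or no attributes remain for this path.
    if count / len(labels) >= min_accuracy or not attributes:
        return majority                              # leaf: predicted class
    # Step 2: attribute that best differentiates the instances in T.
    best = max(attributes, key=lambda a: goodness(instances, a, class_attr))
    # Step 3: tree node for the chosen attribute, one child link per unique value.
    node = {"attribute": best, "children": {}}
    rest = [a for a in attributes if a != best]
    for value in {row[best] for row in instances}:
        subclass = [row for row in instances if row[best] == value]
        node["children"][value] = build_tree(subclass, rest, class_attr, min_accuracy)
    return node
```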

4

Page 5: Basic Data Mining Techniques

Attribute Selection

• The attribute choice made when building a decision tree determines the size of the constructed tree

• A main goal is to minimize the number of tree levels and tree nodes and to maximize data generalization

5

Page 6: Basic Data Mining Techniques

Example: The Credit Card Promotion Database

• The life insurance promotion is designated as the output attribute

• Our input attributes are: income range, credit card insurance, sex, and age

6

Page 7: Basic Data Mining Techniques

• The Credit Card Promotion Database

Income Range    Life Insurance Promotion    Credit Card Insurance    Sex       Age
40–50K          No                          No                       Male      45
30–40K          Yes                         No                       Female    40
40–50K          No                          No                       Male      42
30–40K          Yes                         Yes                      Male      43
50–60K          Yes                         No                       Female    38
20–30K          No                          No                       Female    55
30–40K          Yes                         Yes                      Male      35
20–30K          No                          No                       Male      27
30–40K          No                          No                       Male      43
30–40K          Yes                         No                       Female    41
40–50K          Yes                         No                       Female    43
20–30K          Yes                         No                       Male      29
50–60K          Yes                         No                       Female    39
40–50K          No                          No                       Male      55
20–30K          Yes                         Yes                      Female    19

7

Page 8: Basic Data Mining Techniques

Partial Decision Trees for the Credit Card Promotion Database

8

Page 9: Basic Data Mining Techniques

[Partial decision tree with root node Income Range]

20–30K: 2 Yes, 2 No
30–40K: 4 Yes, 1 No
40–50K: 1 Yes, 3 No
50–60K: 2 Yes, 0 No

Table 3.1 • The Credit Card Promotion Database (see Page 7)

Accuracy score: 11/15 ≈ 0.73 (73%). Goodness score: 0.73/4 ≈ 0.18.
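
The 11/15 and 0.18 figures can be reproduced directly from Table 3.1; a small check in Python (the pairs below are the Income Range and Life Insurance Promotion columns of the table):

```python
from collections import Counter

# (Income Range, Life Insurance Promotion) pairs from Table 3.1
rows = [("40-50K", "No"), ("30-40K", "Yes"), ("40-50K", "No"), ("30-40K", "Yes"),
        ("50-60K", "Yes"), ("20-30K", "No"), ("30-40K", "Yes"), ("20-30K", "No"),
        ("30-40K", "No"), ("30-40K", "Yes"), ("40-50K", "Yes"), ("20-30K", "Yes"),
        ("50-60K", "Yes"), ("40-50K", "No"), ("20-30K", "Yes")]

branches = {}
for income, label in rows:
    branches.setdefault(income, []).append(label)

# Each branch predicts its majority class.
correct = sum(Counter(labels).most_common(1)[0][1] for labels in branches.values())
accuracy = correct / len(rows)          # 11/15 ≈ 0.73
goodness = accuracy / len(branches)     # 0.73 / 4 ≈ 0.18
print(accuracy, goodness)
```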

9

Page 10: Basic Data Mining Techniques

[Partial decision tree with root node Credit Card Insurance]

No:  6 Yes, 6 No
Yes: 3 Yes, 0 No

Table 3.1 • The Credit Card Promotion Database (see Page 7)

Accuracy score: 9/15 = 0.60 (60%). Goodness score: 0.6/2 = 0.3.

10

Page 11: Basic Data Mining Techniques

[Partial decision tree with root node Age, split at 43]

<= 43: 9 Yes, 3 No
>  43: 0 Yes, 3 No

Table 3.1 • The Credit Card Promotion Database (see Page 7)

Accuracy score: 12/15 = 0.80 (80%). Goodness score: 0.8/2 = 0.4.
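
For the numeric attribute Age the slide uses a binary split at 43; the 12/15 accuracy can be checked the same way (ages and labels copied from Table 3.1):

```python
# (Age, Life Insurance Promotion) pairs from Table 3.1
rows = [(45, "No"), (40, "Yes"), (42, "No"), (43, "Yes"), (38, "Yes"),
        (55, "No"), (35, "Yes"), (27, "No"), (43, "No"), (41, "Yes"),
        (43, "Yes"), (29, "Yes"), (39, "Yes"), (55, "No"), (19, "Yes")]

left  = [label for age, label in rows if age <= 43]   # 9 Yes, 3 No -> predict Yes
right = [label for age, label in rows if age > 43]    # 0 Yes, 3 No -> predict No

correct = left.count("Yes") + right.count("No")        # 9 + 3 = 12
print(correct / len(rows), (correct / len(rows)) / 2)  # 0.8 and 0.8/2 = 0.4
```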

11

Page 12: Basic Data Mining Techniques

Multiple-Node Decision Trees for the Credit Card Promotion Database

12

Page 13: Basic Data Mining Techniques

Table 3.1 • The Credit Card Promotion Database (see Page 7)

Accuracy score: 14/15 ≈ 0.93 (93%). Goodness score: 0.93/6 ≈ 0.16.

13

[Multiple-node decision tree (the tree of Figure 3.4), root node Age; leaf labels show (instances following the path / misclassified instances)]

Age > 43: No (3/0)
Age <= 43 → Sex
    Sex = Female: Yes (6/0)
    Sex = Male → Credit Card Insurance
        Credit Card Insurance = Yes: Yes (2/0)
        Credit Card Insurance = No:  No (4/1)
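
One way to verify the 14/15 figure is to code the tree above as nested conditions and apply it to the 15 rows of Table 3.1; the dictionary keys are assumed to match the table's column headings.

```python
def predict_life_insurance_promotion(row):
    """Decision path of the multiple-node tree above (the tree of Figure 3.4)."""
    if row["Age"] > 43:
        return "No"                               # leaf: No (3/0)
    if row["Sex"] == "Female":
        return "Yes"                              # leaf: Yes (6/0)
    if row["Credit Card Insurance"] == "Yes":
        return "Yes"                              # leaf: Yes (2/0)
    return "No"                                   # leaf: No (4/1)
```

Applied to the training data, this tree misclassifies only the 29-year-old male with Credit Card Insurance = No (actual Life Insurance Promotion = Yes), giving 14/15 ≈ 0.93.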

Page 14: Basic Data Mining Techniques

[Multiple-node decision tree, root node Credit Card Insurance]

Credit Card Insurance = Yes: Yes (3/0)
Credit Card Insurance = No → Sex
    Sex = Female: Yes (6/1)
    Sex = Male:   No (6/1)

Table 3.1 • The Credit Card Promotion Database (see Page 7)

Accuracy score: 13/15 ≈ 0.87 (87%). Goodness score: 0.87/4 ≈ 0.22.

14

Page 15: Basic Data Mining Techniques

Table 3.2 • Training Data Instances Following the Path in Figure 3.4 to Credit Card Insurance = No

Income Range    Life Insurance Promotion    Credit Card Insurance    Sex     Age
40–50K          No                          No                       Male    42
20–30K          No                          No                       Male    27
30–40K          No                          No                       Male    43
20–30K          Yes                         No                       Male    29

15

Page 16: Basic Data Mining Techniques

Decision Tree Rules

16

Page 17: Basic Data Mining Techniques

A Rule for the Tree in Figure 3.4

IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

17

Page 18: Basic Data Mining Techniques

A Simplified Rule Obtained by Removing Attribute Age

IF Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No

18

Page 19: Basic Data Mining Techniques

Advantages of Decision Trees

• Easy to understand.
• Map nicely to a set of production rules.
• Applied to real problems.
• Make no prior assumptions about the data.
• Able to process both numerical and categorical data.

19

Page 20: Basic Data Mining Techniques

Disadvantages of Decision Trees

• Output attribute must be categorical.
• Limited to one output attribute.
• Decision tree algorithms are unstable.
• Trees created from numeric datasets can be complex.

20

Page 21: Basic Data Mining Techniques

Generating Association Rules

21

Page 22: Basic Data Mining Techniques

Confidence and Support

• Traditional classification rules usually limit the consequent of a rule to a single attribute

• Association rule generators allow the consequent of a rule to contain one or several attribute values

22

Page 23: Basic Data Mining Techniques

Example

• Are there any interesting relationships to be found in customer purchasing trends among the following grocery store products?

• Milk
• Cheese
• Bread
• Eggs

23

Page 24: Basic Data Mining Techniques

Possible associations:

• If customers purchase milk they also purchase bread

• If customers purchase bread they also purchase milk

• If customers purchase milk and eggs they also purchase cheese and bread

• If customers purchase milk, cheese, and eggs they also purchase bread

24

Page 25: Basic Data Mining Techniques

Confidence

• Analyzing the first rule, we come to the natural question: "How likely is it that a milk purchase will lead to a bread purchase?"

• To answer this question, each rule has an associated confidence, which in our case is the conditional probability of a bread purchase given a milk purchase.

25

Page 26: Basic Data Mining Techniques

Rule Confidence

Given a rule of the form “If A then B”, rule confidence is the conditional probability that B is true when A is known to be true.

26

Page 27: Basic Data Mining Techniques

Rule Support

The minimum percentage of instances in the database that contain all items listed in a given association rule.
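
A minimal Python sketch of both measures over instances stored as dictionaries; the helper names are assumptions, and a rule is written as two dictionaries of attribute = value conditions (antecedent and consequent):

```python
def matches(row, conditions):
    """True if the row satisfies every attribute = value condition."""
    return all(row.get(attr) == value for attr, value in conditions.items())

def confidence(rows, antecedent, consequent):
    """Conditional probability that the consequent holds when the antecedent holds."""
    covered = [row for row in rows if matches(row, antecedent)]
    hits = sum(matches(row, consequent) for row in covered)
    return hits / len(covered) if covered else 0.0

def support(rows, antecedent, consequent):
    """Percentage of instances that contain all items listed in the rule."""
    both = {**antecedent, **consequent}
    return 100.0 * sum(matches(row, both) for row in rows) / len(rows)
```

On Table 3.3 (Page 31), for example, confidence(rows, {"Magazine Promotion": "Yes"}, {"Life Insurance Promotion": "Yes"}) returns 5/7, matching the rule confidence shown on Page 38.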

27

Page 28: Basic Data Mining Techniques

Mining Association Rules: An Example

28

Page 29: Basic Data Mining Techniques

Apriori Algorithm

• This algorithm generates item sets

• Item sets are attribute-value combinations that meet a specified coverage requirement

• Those attribute-value combinations that do not meet the coverage requirement are discarded

29

Page 30: Basic Data Mining Techniques

Apriori Algorithm

• The first step: item set generation

• The second step: creation of a set of association rules using the generated item sets

30

Page 31: Basic Data Mining Techniques

Table 3.3 • A Subset of the Credit Card Promotion Database

Magazine Promotion    Watch Promotion    Life Insurance Promotion    Credit Card Insurance    Sex
Yes                   No                 No                          No                       Male
Yes                   Yes                Yes                         No                       Female
No                    No                 No                          No                       Male
Yes                   Yes                Yes                         Yes                      Male
Yes                   No                 Yes                         No                       Female
No                    No                 No                          No                       Female
Yes                   No                 Yes                         Yes                      Male
No                    Yes                No                          No                       Male
Yes                   No                 No                          No                       Male
Yes                   Yes                Yes                         No                       Female

The “income range” and “age” attributes are eliminated

31

Page 32: Basic Data Mining Techniques

Generation of the item sets

• First, we will generate "single-item" sets

• Minimum attribute-value coverage requirement: four items

• Single-item sets represent individual attribute-value combinations extracted from the original data set (a generation sketch follows below)
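
A hedged sketch of the generation step over Table 3.3 with the coverage requirement of four; the row encoding and helper names are assumptions. Its output reproduces the item sets listed in Tables 3.4 and 3.5.

```python
from itertools import combinations

# Table 3.3, one dictionary per instance.
columns = ["Magazine Promotion", "Watch Promotion", "Life Insurance Promotion",
           "Credit Card Insurance", "Sex"]
data = [("Yes", "No",  "No",  "No",  "Male"),
        ("Yes", "Yes", "Yes", "No",  "Female"),
        ("No",  "No",  "No",  "No",  "Male"),
        ("Yes", "Yes", "Yes", "Yes", "Male"),
        ("Yes", "No",  "Yes", "No",  "Female"),
        ("No",  "No",  "No",  "No",  "Female"),
        ("Yes", "No",  "Yes", "Yes", "Male"),
        ("No",  "Yes", "No",  "No",  "Male"),
        ("Yes", "No",  "No",  "No",  "Male"),
        ("Yes", "Yes", "Yes", "No",  "Female")]
rows = [dict(zip(columns, values)) for values in data]

def coverage(item_set):
    """Number of instances containing every attribute = value pair in the item set."""
    return sum(all(row[attr] == value for attr, value in item_set) for row in rows)

# Single-item sets: individual attribute-value combinations with coverage >= 4.
single = [((attr, value),) for attr in columns for value in {row[attr] for row in rows}
          if coverage(((attr, value),)) >= 4]

# Two-item sets: combine single-item sets on different attributes, same coverage rule.
two = [tuple(sorted(a + b)) for a, b in combinations(single, 2)
       if a[0][0] != b[0][0] and coverage(a + b) >= 4]

for item_set in single + sorted(set(two)):
    print(item_set, coverage(item_set))
```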

32

Page 33: Basic Data Mining Techniques

Table 3.4 • Single-Item Sets

Single-Item Sets                    Number of Items
Magazine Promotion = Yes            7
Watch Promotion = Yes               4
Watch Promotion = No                6
Life Insurance Promotion = Yes      5
Life Insurance Promotion = No       5
Credit Card Insurance = No          8
Sex = Male                          6
Sex = Female                        4

33

Page 34: Basic Data Mining Techniques

Two-Item Sets and Multiple-Item Sets

• Two-item sets can be created by combining single-item sets (usually with the same coverage restriction)

• The next step is to use the attribute-value combinations from the two-item sets to create three-item sets, and so on

• The process is continued until an n is reached for which the n-item set contains a single instance

34

Page 35: Basic Data Mining Techniques

Table 3.5 • Two-Item Sets

Two-Item Sets                                                  Number of Items
Magazine Promotion = Yes & Watch Promotion = No                4
Magazine Promotion = Yes & Life Insurance Promotion = Yes      5
Magazine Promotion = Yes & Credit Card Insurance = No          5
Magazine Promotion = Yes & Sex = Male                          4
Watch Promotion = No & Life Insurance Promotion = No           4
Watch Promotion = No & Credit Card Insurance = No              5
Watch Promotion = No & Sex = Male                              4
Life Insurance Promotion = No & Credit Card Insurance = No     5
Life Insurance Promotion = No & Sex = Male                     4
Credit Card Insurance = No & Sex = Male                        4
Credit Card Insurance = No & Sex = Female                      4

35

Page 36: Basic Data Mining Techniques

Three-Item Set

• The only three-item set that satisfies the coverage criterion is:

• (Watch Promotion = No) & (Life Insurance Promotion = No) & (Credit Card Insurance = No)

36

Page 37: Basic Data Mining Techniques

Rule Creation

• The first step is to specify a minimum rule confidence

• Next, association rules are generated from the two- and three-item set tables

• Any rule not meeting the minimum confidence value is discarded (a rule-generation sketch follows below)
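
A minimal sketch of this step for a single item set; the rows argument is a list of instance dictionaries (such as the rows list built in the sketch on Page 32), the item set is passed as a dictionary of attribute = value pairs, and the 0.8 minimum confidence is an assumed value:

```python
from itertools import combinations

def rules_from_item_set(rows, item_set, min_confidence=0.8):
    """Generate every rule whose antecedent and consequent partition the item set,
    keeping only rules that meet the minimum confidence value.
    Assumes the item set already meets the coverage requirement (count > 0)."""
    def count(conditions):
        return sum(all(row[a] == v for a, v in conditions.items()) for row in rows)
    items = list(item_set.items())
    rules = []
    for r in range(1, len(items)):
        for antecedent_items in combinations(items, r):
            antecedent = dict(antecedent_items)
            consequent = {a: v for a, v in items if a not in antecedent}
            conf = count(item_set) / count(antecedent)
            if conf >= min_confidence:
                rules.append((antecedent, consequent, conf))
    return rules
```

For the three-item set on Page 36, the kept rules include the 100% rule shown on Page 39 (IF Watch Promotion = No & Life Insurance Promotion = No THEN Credit Card Insurance = No), while the 4/6 rule is discarded at this threshold.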

37

Page 38: Basic Data Mining Techniques

Two Possible Two-Item Set Rules

IF Magazine Promotion = Yes
THEN Life Insurance Promotion = Yes (5/7)
(Rule confidence is 5/7 × 100% ≈ 71%)

IF Life Insurance Promotion = Yes
THEN Magazine Promotion = Yes (5/5)
(Rule confidence is 5/5 × 100% = 100%)

38

Page 39: Basic Data Mining Techniques

Three-Item Set Rules

IF Watch Promotion = No & Life Insurance Promotion = No
THEN Credit Card Insurance = No (4/4)
(Rule confidence is 4/4 × 100% = 100%)

IF Watch Promotion = No
THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)
(Rule confidence is 4/6 × 100% = 66.6%)

39

Page 40: Basic Data Mining Techniques

General Considerations

• We are interested in association rules that show a lift in product sales where the lift is the result of the product’s association with one or more other products.

• We are also interested in association rules that show a lower than expected confidence for a particular association.

40

Page 41: Basic Data Mining Techniques

Homework

• Problems 2, 3 (p. 102 of the book)

41