• date post

07-Apr-2018

### Transcript of ch07 classific

• 8/4/2019 ch07 classific

1/75

1

Classification and Prediction

Data Mining

Modified from the slides by Prof. Han


Chapter 7. Classification and Prediction

What is classification? What is prediction?

Issues regarding classification and prediction

Classification by decision tree induction

Bayesian Classification

Classification by backpropagation

Classification based on concepts from association rule mining

Other Classification Methods

Prediction

Classification accuracy

Summary


Classification vs. Prediction

Classification:

Predicts categorical class labels

Classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data

Prediction (regression):

Models continuous-valued functions, i.e., predicts unknown or missing values

e.g., expenditure of potential customers given their income and occupation

Typical applications:

credit approval

target marketing

medical diagnosis


Classification: A Two-Step Process

Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction: the training set

The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: classifying future or unknown objects

Estimate the accuracy of the model

The known label of each test sample is compared with the classified result from the model

Accuracy rate is the percentage of test set samples that are correctly classified by the model

The test set is independent of the training set; otherwise over-fitting will occur


Classification Process (1): Model Construction

Training data:

| NAME | RANK           | YEARS | TENURED |
|------|----------------|-------|---------|
| Mike | Assistant Prof | 3     | no      |
| Mary | Assistant Prof | 7     | yes     |
| Bill | Professor      | 2     | yes     |
| Jim  | Associate Prof | 7     | yes     |
| Dave | Assistant Prof | 6     | no      |
| Anne | Associate Prof | 3     | no      |

The classification algorithm produces the classifier (model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
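The construction step can be sketched in Python (a minimal sketch: the rule is the one shown on the slide, hard-coded rather than induced by an actual learning algorithm):

```python
# Training set from the slide: (name, rank, years, tenured).
training_set = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def classify(rank, years):
    # The classifier (model) produced in the construction step:
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

# The model is consistent with every tuple in the training set:
for name, rank, years, tenured in training_set:
    assert classify(rank, years) == tenured
```

A real induction algorithm would derive the rule from the data; here the point is only that the learned model summarizes the training tuples.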


Classification Process (2): Use the Model in Prediction

Classifier

Testing data:

| NAME    | RANK           | YEARS | TENURED |
|---------|----------------|-------|---------|
| Tom     | Assistant Prof | 2     | no      |
| Merlisa | Associate Prof | 7     | no      |
| George  | Professor      | 5     | yes     |
| Joseph  | Assistant Prof | 7     | yes     |

Unseen data: (Jeff, Professor, 4) → Tenured?
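The usage step, sketched the same way (a minimal sketch; `classify` encodes the rule from the previous slide): known test-set labels are compared with the model's output, then the model is applied to the unseen tuple.

```python
def classify(rank, years):
    # Rule learned in the construction step:
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

test_set = [  # (name, rank, years, known label)
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(classify(rank, years) == label
              for _, rank, years, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy = {accuracy:.0%}")       # Merlisa (years > 6) is misclassified: 75%
print("Jeff:", classify("Professor", 4))  # unseen tuple -> yes
```

Note how the accuracy estimate comes only from the independent test set, exactly as the two-step process above requires.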


Supervised vs. Unsupervised Learning

Supervised learning (classification)

Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of the training data are unknown

Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


Issues Regarding Classification and Prediction (1): Data Preparation

Data cleaning

Preprocess data in order to reduce noise and handle missing values

e.g., smoothing techniques for noise and common-value replacement for missing values

Relevance analysis (feature selection in ML)

Remove irrelevant or redundant attributes

Relevance analysis plus learning on the reduced attribute set can cost less than learning on the original set

Data transformation

Generalize data (discretize)

E.g., income: low, medium, high

E.g., street → city

Normalize data (particularly for neural networks)

E.g., income → [-1..1] or [0..1]

Without this, what happens?
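As a minimal sketch of the normalization step (the income values are hypothetical), min-max scaling maps an attribute into [0..1]; without it, attributes with large numeric ranges dominate distance- and weight-based learners such as neural networks:

```python
incomes = [30_000, 45_000, 80_000, 120_000]  # hypothetical raw values

# Min-max scaling: (x - min) / (max - min) maps each value into [0, 1].
lo, hi = min(incomes), max(incomes)
scaled = [(x - lo) / (hi - lo) for x in incomes]
print(scaled)  # smallest income -> 0.0, largest -> 1.0
```

The same transform must later be applied, with the training set's min and max, to any new tuple being classified.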


Issues Regarding Classification and Prediction (2): Evaluating Classification Methods

Predictive accuracy

Speed

time to construct the model

time to use the model

Robustness

handling noise and missing values

Scalability

efficiency in disk-resident databases

Interpretability:

understanding and insight provided by the model


Classification by Decision Tree Induction

Decision tree

A flow-chart-like tree structure

Internal node denotes a test on an attribute

Branch represents an outcome of the test

Leaf nodes represent class labels or class distributions

Decision tree generation consists of two phases

Tree construction

At start, all the training examples are at the root

Partition examples recursively based on selected attributes

Tree pruning

Identify and remove branches that reflect noise or outliers

Use of decision tree: classifying an unknown sample

Test the attribute values of the sample against the decision tree


Training Dataset

| age    | income | student | credit_rating | buys_computer |
|--------|--------|---------|---------------|---------------|
| >40    | low    | yes     | fair          | yes           |
| >40    | low    | yes     | excellent     | no            |
| 31..40 | low    | yes     | excellent     | yes           |


A decision tree for buys_computer:

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  30..40 -> yes
  >40    -> credit rating?
              excellent -> no
              fair      -> yes


Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down, recursive, divide-and-conquer manner

At start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning

All samples for a given node belong to the same class

There are no remaining attributes for further partitioning: majority voting is employed for classifying the leaf

There are no samples left
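The basic algorithm, including all three stopping conditions, can be sketched as follows (a simplified sketch: `choose_attribute` is a placeholder that a real implementation would replace with an information-gain ranking):

```python
from collections import Counter

def choose_attribute(samples, attributes):
    # Placeholder for the heuristic/statistical selection (e.g. information gain).
    return attributes[0]

def build_tree(samples, attributes, default="no"):
    # samples: list of (feature_dict, class_label) pairs
    if not samples:                      # stop: no samples left
        return default
    labels = [label for _, label in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:            # stop: all samples in one class
        return labels[0]
    if not attributes:                   # stop: no attributes left -> majority vote
        return majority
    a = choose_attribute(samples, attributes)
    remaining = [x for x in attributes if x != a]
    branches = {}
    for value in {features[a] for features, _ in samples}:
        subset = [(f, c) for f, c in samples if f[a] == value]
        branches[value] = build_tree(subset, remaining, majority)
    return (a, branches)               # internal node: (test attribute, branches)

tree = build_tree(
    [({"rank": "Professor"}, "yes"), ({"rank": "Assistant Prof"}, "no")],
    ["rank"],
)
print(tree)  # a one-level tree that splits on rank
```

Each recursive call partitions the current sample set on the chosen attribute, exactly the top-down divide-and-conquer scheme described above.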


Attribute Selection Measure

Information gain (ID3/C4.5)

All attributes are assumed to be categorical

Can be modified for continuous-valued attributes

Gini index (IBM IntelligentMiner)

All attributes are assumed continuous-valued

Assume there exist several possible split values for each attribute

May need other tools, such as clustering, to get the possible split values

Can be modified for categorical attributes
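A minimal sketch of the Gini index for evaluating a candidate split (the class counts are hypothetical; [9, 5] echoes the set used in the information-gain example on the later slides):

```python
def gini(counts):
    # Gini impurity of a node with the given per-class counts:
    # gini = 1 - sum_i (p_i)^2
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Impurity of a 9-yes / 5-no set, and of one candidate binary split:
print(round(gini([9, 5]), 3))            # 0.459
left, right = [4, 0], [5, 5]             # class counts in each subset
n = sum(left) + sum(right)
gini_split = sum(left) / n * gini(left) + sum(right) / n * gini(right)
print(round(gini_split, 3))              # 0.357: lower impurity, so a useful split
```

The attribute and split value minimizing the weighted `gini_split` would be selected, analogous to maximizing information gain.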


Information Gain (ID3/C4.5)

Select the attribute with the highest information gain

Assume there are two classes, P and N

Let the set of examples S contain p elements of class P and n elements of class N

The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
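The definition of I(p, n) can be evaluated directly; a small sketch:

```python
from math import log2

def info(p, n):
    # I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)),
    # with the convention 0 * log2(0) = 0 for an empty class.
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            frac = count / total
            result -= frac * log2(frac)
    return result

print(round(info(9, 5), 3))   # the 9-positive / 5-negative set used later
print(info(7, 7))             # a 50/50 split carries exactly 1 bit
```

Pure sets (all P or all N) need no information to classify, so `info(p, 0)` and `info(0, n)` are both 0.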


Information Gain in Decision TreeInduction

Assume that, using attribute A, the set S will be partitioned into sets {S1, S2, …, Sv}

If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is

$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

The encoding information that would be gained by branching on A:

$$Gain(A) = I(p, n) - E(A)$$


Attribute Selection by Information Gain Computation

Class P: buys_computer = "yes"; Class N: buys_computer = "no"

I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

| age    | p_i | n_i | I(p_i, n_i) |
|--------|-----|-----|-------------|
| <=30   | 2   | 3   | 0.971       |
| 31..40 | 4   | 0   | 0           |
| >40    | 3   | 2   | 0.971       |

$$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

Hence

$$Gain(age) = I(p, n) - E(age) = 0.246$$

Similarly

Gain(income) = 0.029

Gain(student) = 0.151

Gain(credit_rating) = 0.048
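This computation can be reproduced numerically (a sketch; `info` implements the two-class I(p, n) defined earlier in the chapter):

```python
from math import log2

def info(p, n):
    # I(p, n): expected information for a two-class node.
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c)

# (p_i, n_i) counts for the three age partitions of the 14-tuple set:
partitions = {"<=30": (2, 3), "31..40": (4, 0), ">40": (3, 2)}

total = sum(p + n for p, n in partitions.values())               # 14
e_age = sum((p + n) / total * info(p, n)
            for p, n in partitions.values())
gain_age = info(9, 5) - e_age
print(f"E(age) = {e_age:.3f}")        # 0.694
print(f"Gain(age) = {gain_age:.3f}")  # 0.247 (0.246 with the slide's rounding)
```

Since 0.247 beats the gains for income, student, and credit_rating, age is chosen as the splitting attribute at the root.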


Gini Index (IBM IntelligentMiner)

If a data