
Transcript

    Classification and Prediction

    Data Mining

    Modified from the slides by Prof. Han


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Classification vs. Prediction

    Classification:

    predicts categorical class labels

    classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

    Prediction (regression): models continuous-valued functions, i.e., predicts unknown or missing values

    e.g., expenditure of potential customers given their income and occupation

    Typical applications

    credit approval

    target marketing

    medical diagnosis


    Classification: A Two-Step Process

    Model construction: describing a set of predetermined classes

    Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

    The set of tuples used for model construction: the training set

    The model is represented as classification rules, decision trees, or mathematical formulae

    Model usage: classifying future or unknown objects

    Estimate accuracy of the model

    The known label of a test sample is compared with the classified result from the model

    Accuracy rate is the percentage of test set samples that are correctly classified by the model

    The test set is independent of the training set; otherwise over-fitting will occur


    Classification Process (1): Model Construction

    Training Data

    NAME    RANK            YEARS  TENURED
    Mike    Assistant Prof  3      no
    Mary    Assistant Prof  7      yes
    Bill    Professor       2      yes
    Jim     Associate Prof  7      yes
    Dave    Assistant Prof  6      no
    Anne    Associate Prof  3      no

    Classification Algorithms

    IF rank = professor OR years > 6
    THEN tenured = yes

    Classifier (Model)


    Classification Process (2): Use the Model in Prediction

    Classifier

    Testing Data

    NAME     RANK            YEARS  TENURED
    Tom      Assistant Prof  2      no
    Merlisa  Associate Prof  7      no
    George   Professor       5      yes
    Joseph   Assistant Prof  7      yes

    Unseen Data

    (Jeff, Professor, 4)

    Tenured?


    Supervised vs. Unsupervised Learning

    Supervised learning (classification)

    Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

    New data is classified based on the training set

    Unsupervised learning (clustering)

    The class labels of training data are unknown

    Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Issues regarding classification and prediction (1): Data Preparation

    Data cleaning

    Preprocess data in order to reduce noise and handle missing values

    e.g., smoothing techniques for noise and common-value replacement for missing values

    Relevance analysis (feature selection in ML)

    Remove the irrelevant or redundant attributes

    Relevance analysis + learning on the reduced attribute set can cost less than learning on the original set

    Data transformation

    Generalize data (discretize)

    E.g., income: low, medium, high; e.g., street generalized to city

    Normalize data (particularly for NN), e.g., income scaled to [-1..1] or [0..1]

    Without this, what happens?


    Issues regarding classification and prediction (2): Evaluating Classification Methods

    Predictive accuracy

    Speed

    time to construct the model

    time to use the model

    Robustness

    handling noise and missing values

    Scalability

    efficiency in disk-resident databases

    Interpretability:

    understanding and insight provided by the model


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Classification by Decision Tree Induction

    Decision tree

    A flow-chart-like tree structure

    Internal node denotes a test on an attribute

    Branch represents an outcome of the test

    Leaf nodes represent class labels or class distribution

    Decision tree generation consists of two phases

    Tree construction

    At start, all the training examples are at the root

    Partition examples recursively based on selected attributes

    Tree pruning

    Identify and remove branches that reflect noise or outliers

    Use of decision tree: classifying an unknown sample

    Test the attribute values of the sample against the decision tree


    Training Dataset

    age     income  student  credit_rating  buys_computer
    >40     low     yes      fair           yes
    >40     low     yes      excellent      no
    31..40  low     yes      excellent      yes
    ...


    Output: A Decision Tree for buys_computer

    age?
      <=30:   student?
                no  -> buys_computer = no
                yes -> buys_computer = yes
      30..40: buys_computer = yes
      >40:    credit_rating?
                excellent -> buys_computer = no
                fair      -> buys_computer = yes


    Algorithm for Decision Tree Induction

    Basic algorithm (a greedy algorithm)

    Tree is constructed in a top-down recursive divide-and-conquer manner

    At start, all the training examples are at the root

    Attributes are categorical (if continuous-valued, they are discretized in advance)

    Examples are partitioned recursively based on selected attributes

    Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

    Conditions for stopping partitioning

    All samples for a given node belong to the same class

    There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)

    There are no samples left
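    The following is a minimal Python sketch, not from the slides, of the greedy, top-down, divide-and-conquer loop just described; the tuple representation (dicts of categorical attribute values plus a class label) and the pluggable choose_attribute heuristic are illustrative assumptions.

    from collections import Counter

    def build_tree(samples, attributes, label, choose_attribute):
        classes = [s[label] for s in samples]
        if len(set(classes)) == 1:                # all samples in one class
            return classes[0]
        if not attributes:                        # no attributes left: majority vote
            return Counter(classes).most_common(1)[0][0]
        best = choose_attribute(samples, attributes, label)   # heuristic measure
        tree = {best: {}}
        for value in {s[best] for s in samples}:  # partition examples recursively
            subset = [s for s in samples if s[best] == value]
            rest = [a for a in attributes if a != best]
            tree[best][value] = build_tree(subset, rest, label, choose_attribute)
        return tree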


    Attribute Selection Measure

    Information gain (ID3/C4.5)

    All attributes are assumed to be categorical

    Can be modified for continuous-valued attributes

    Gini index (IBM IntelligentMiner)

    All attributes are assumed continuous-valued

    Assume there exist several possible split values for each attribute

    May need other tools, such as clustering, to get the possible split values

    Can be modified for categorical attributes


    Information Gain (ID3/C4.5)

    Select the attribute with the highest information gain

    Assume there are two classes, P and N

    Let the set of examples S contain p elements of class P and n elements of class N

    The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

    I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}


    Information Gain in Decision Tree Induction

    Assume that, using attribute A, a set S will be partitioned into sets {S_1, S_2, ..., S_v}

    If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is

    E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)

    The encoding information that would be gained by branching on A is

    Gain(A) = I(p, n) - E(A)


    Attribute Selection by Information Gain Computation

    Class P: buys_computer = "yes"

    Class N: buys_computer = "no"

    I(p, n) = I(9, 5) = 0.940

    Compute the entropy for age:

    age      p_i  n_i  I(p_i, n_i)
    <=30     2    3    0.971
    31..40   4    0    0
    >40      3    2    0.971

    E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

    Hence

    Gain(age) = I(p, n) - E(age) = 0.246

    Similarly,

    Gain(income) = 0.029

    Gain(student) = 0.151

    Gain(credit_rating) = 0.048
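    As a check on the numbers above, here is a short Python sketch (names are illustrative, not from the slides) that recomputes I(9, 5), E(age), and Gain(age) from the per-branch counts in the table:

    from math import log2

    def info(p, n):
        # I(p, n) = -p/(p+n) log2 p/(p+n) - n/(p+n) log2 n/(p+n)
        total = p + n
        return -sum((c / total) * log2(c / total) for c in (p, n) if c)

    p, n = 9, 5                                   # 9 "yes" and 5 "no" tuples overall
    branches = {"<=30": (2, 3), "31..40": (4, 0), ">40": (3, 2)}

    e_age = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in branches.values())
    print(f"{info(p, n):.3f} {e_age:.3f} {info(p, n) - e_age:.3f}")
    # -> 0.940 0.694 0.247 (the slide's 0.246 comes from subtracting the rounded values)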


    Gini Index (IBM IntelligentMiner)

    If a data set T contains examples from n classes, the gini index gini(T) is defined as

    gini(T) = 1 - \sum_{j=1}^{n} p_j^2

    where p_j is the relative frequency of class j in T.

    If a data set T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively, the gini index of the split data is defined as

    gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)

    The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
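    A small Python sketch of the two measures just defined, assuming class labels are given as plain lists; it is illustrative only:

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def gini_split(t1_labels, t2_labels):
        n = len(t1_labels) + len(t2_labels)
        return (len(t1_labels) / n) * gini(t1_labels) + (len(t2_labels) / n) * gini(t2_labels)

    # The candidate split with the smallest gini_split value would be chosen.
    print(gini_split(["yes", "yes", "no"], ["no", "no", "no", "yes"]))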


    Extracting Classification Rules from Trees

    Represent the knowledge in the form of IF-THEN rules

    One rule is created for each path from the root to a leaf

    Each attribute-value pair along a path forms a conjunction

    The leaf node holds the class prediction

    Rules are easier for humans to understand

    Example: IF age = "<=30" AND student = "no" THEN buys_computer = "no"


    Avoid Overfitting in Classification

    The generated tree may overfit the training data

    Too many branches, some may reflect anomalies due to noise or outliers

    The result is poor accuracy for unseen samples

    Two approaches to avoid overfitting

    Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold

    Difficult to choose an appropriate threshold

    Postpruning: Remove branches from a fully grown tree, obtaining a sequence of progressively pruned trees

    Use a set of data different from the training data to decide which is the best pruned tree


    Approaches to Determine the Final Tree Size

    Separate training (2/3) and testing (1/3) sets

    Use cross-validation, e.g., 10-fold cross-validation

    Use all the data for training

    but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution

    Use the minimum description length (MDL) principle:

    halting growth of the tree when the encoding is minimized


    Enhancements to basic decision tree induction

    Allow for continuous-valued attributes

    Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals

    Handle missing attribute values

    Assign the most common value of the attribute

    Assign a probability to each of the possible values

    Attribute construction

    Create new attributes based on existing ones that are sparsely represented

    This reduces fragmentation, repetition, and replication


    Classification in Large Databases

    Classification is a classical problem extensively studied by statisticians and machine learning researchers

    Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed

    Why decision tree induction in data mining?

    relatively faster learning speed (than other classification methods)

    convertible to simple and easy-to-understand classification rules

    can use SQL queries for accessing databases

    comparable classification accuracy with other methods


    Scalable Decision Tree Induction Methods in Data Mining Studies

    SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory

    SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute list data structure

    PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning: stop growing the tree earlier

    RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree

    builds an AVC-list (attribute, value, class label)


    Data Cube-Based Decision-Tree Induction

    Integration of generalization with decision-tree induction (Kamber et al.'97).

    Classification at primitive concept levels

    E.g., precise temperature, humidity, outlook, etc.

    Low-level concepts, scattered classes, bushy classification trees

    Semantic interpretation problems.

    Cube-based multi-level classification

    Relevance analysis at multi-levels.

    Information-gain analysis with dimension + level.


    Presentation of Classification Results


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Bayesian Classification: Why?

    Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems

    Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.

    Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities

    Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured


    Bayesian Theorem

    Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:

    P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}

    MAP (maximum a posteriori) hypothesis:

    h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)\,P(h)

    Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost


    Bayesian classification

    The classification problem may be formalized using a-posteriori probabilities:

    P(C|X) = probability that the sample tuple X = <x1, ..., xk> is of class C.

    E.g., P(class=N | outlook=sunny, windy=true, ...)

    Idea: assign to sample X the class label C such that P(C|X) is maximal


    Estimating a-posteriori probabilities

    Bayes theorem:

    P(C|X) = P(X|C)P(C) / P(X)

    P(X) is constant for all classes

    P(C) = relative frequency of class C samples

    C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum

    Problem: computing P(X|C) is infeasible!


    Naïve Bayesian Classification

    Naïve assumption: attribute independence

    P(x1, ..., xk|C) = P(x1|C) · ... · P(xk|C)

    If the i-th attribute is categorical:

    P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C

    If the i-th attribute is continuous:

    P(xi|C) is estimated through a Gaussian density function

    Computationally easy in both cases


    Play-tennis example: estimating P(xi|C)

    Outlook   Temperature  Humidity  Windy  Class
    sunny     hot          high      false  N
    sunny     hot          high      true   N
    overcast  hot          high      false  P
    rain      mild         high      false  P
    rain      cool         normal    false  P
    rain      cool         normal    true   N
    overcast  cool         normal    true   P
    sunny     mild         high      false  N
    sunny     cool         normal    false  P
    rain      mild         normal    false  P
    sunny     mild         normal    true   P
    overcast  mild         high      true   P
    overcast  hot          normal    false  P
    rain      mild         high      true   N

    P(p) = 9/14, P(n) = 5/14

    outlook:      P(sunny|p) = 2/9     P(sunny|n) = 3/5
                  P(overcast|p) = 4/9  P(overcast|n) = 0
                  P(rain|p) = 3/9      P(rain|n) = 2/5

    temperature:  P(hot|p) = 2/9       P(hot|n) = 2/5
                  P(mild|p) = 4/9      P(mild|n) = 2/5
                  P(cool|p) = 3/9      P(cool|n) = 1/5

    humidity:     P(high|p) = 3/9      P(high|n) = 4/5
                  P(normal|p) = 6/9    P(normal|n) = 2/5

    windy:        P(true|p) = 3/9      P(true|n) = 3/5
                  P(false|p) = 6/9     P(false|n) = 2/5


    Play-tennis example: classifying X

    An unseen sample X = <rain, hot, high, false>

    P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582

    P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286

    Sample X is classified in class n (don't play)
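    The same computation can be written as a short naïve Bayes sketch in Python over the play-tennis table above (relative-frequency estimates, no smoothing, exactly as on the slide); the function name is illustrative:

    data = [  # (outlook, temperature, humidity, windy, class) -- the table above
        ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
        ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
        ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
        ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
        ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
        ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
        ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
    ]

    def nb_scores(x):
        scores = {}
        for c in {row[-1] for row in data}:
            rows = [r for r in data if r[-1] == c]
            score = len(rows) / len(data)                              # P(C)
            for i, value in enumerate(x):                              # naive independence
                score *= sum(r[i] == value for r in rows) / len(rows)  # P(xi|C)
            scores[c] = score
        return scores

    print(nb_scores(("rain", "hot", "high", "false")))
    # -> P: 0.010582..., N: 0.018285...; N wins, so "don't play"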


    The independence hypothesis

    makes computation possible

    yields optimal classifiers when satisfied

    but it is seldom satisfied in practice, as attributes (variables) are often correlated

    Attempts to overcome this limitation:

    Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes

    Decision trees, which reason on one attribute at a time, considering the most important attributes first


    Bayesian Belief Networks (I)

    [Figure: a Bayesian belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea]

    The conditional probability table for the variable LungCancer:

             (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
    LC        0.8      0.5       0.7       0.1
    ~LC       0.2      0.5       0.3       0.9


    Bayesian Belief Networks (II)

    A Bayesian belief network allows a subset of the variables to be conditionally independent

    A graphical model of causal relationships

    Several cases of learning Bayesian belief networks

    Given both the network structure and all the variables: easy

    Given the network structure but only some variables

    When the network structure is not known in advance


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Neural Networks

    Advantages

    prediction accuracy is generally high

    robust, works when training examples contain errors

    output may be discrete, real-valued, or a vector of several discrete or real-valued attributes

    fast evaluation of the learned target function

    Criticism

    long training time

    difficult to understand the learned function (weights)

    not easy to incorporate domain knowledge


    A Neuron

    The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping

    [Figure: a neuron computes a weighted sum of the inputs x_0, x_1, ..., x_n with weight vector w (w_0, w_1, ..., w_n), minus a bias, and passes it through an activation function f to produce the output y]

    In a hidden or output layer, the net input I_j to unit j is

    I_j = \sum_i w_{ij} O_i + \theta_j


    Network Training

    The ultimate objective of training

    obtain a set of weights that makes almost all the tuples in the training data classified correctly

    Steps

    Initialize weights with random values

    Feed the input tuples into the network one by one

    For each unit

    Compute the net input to the unit as a linear combination of all the inputs to the unit

    Compute the output value using the activation function

    Compute the error

    Update the weights and the bias


    Multi-Layer Perceptron

    [Figure: a feed-forward network with an input vector x_i feeding the input nodes, weights w_ij connecting to the hidden nodes, and hidden nodes feeding the output nodes that produce the output vector]

    Net input to unit j (O_i are the inputs to unit j):

    I_j = \sum_i w_{ij} O_i + \theta_j

    Output of unit j (sigmoid function, mapping I_j to [0..1]):

    O_j = \frac{1}{1 + e^{-I_j}}

    Error of the output layer, where O_j(1 - O_j) is the derivative of the sigmoid function:

    Err_j = O_j (1 - O_j)(T_j - O_j)

    Error of the hidden layer:

    Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}

    Weight and bias updates, with learning rate l in [0..1]:

    w_{ij} = w_{ij} + (l)\, Err_j\, O_i

    \theta_j = \theta_j + (l)\, Err_j

    l too small: the learning pace is too slow; too large: oscillation between wrong solutions

    Heuristic: l = 1/t (t: # iterations through the training set so far)
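    Below is a hedged Python sketch of one case update using exactly these formulas, for an assumed tiny topology (one hidden layer, a single sigmoid output unit); the dict-based parameter layout and the learning rate are illustrative choices, not part of the original slides.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def backprop_case_update(x, target, w_in, theta_hidden, w_out, theta_out, l=0.9):
        # x: dict input_name -> value; w_in: dict hidden_name -> dict input_name -> weight;
        # theta_hidden: dict hidden_name -> bias; w_out: dict hidden_name -> weight to output.
        # Forward pass: I_j = sum_i w_ij O_i + theta_j, O_j = sigmoid(I_j)
        o_hidden = {j: sigmoid(sum(w_in[j][i] * x[i] for i in x) + theta_hidden[j])
                    for j in w_in}
        o_out = sigmoid(sum(w_out[j] * o_hidden[j] for j in o_hidden) + theta_out)
        # Backward pass: errors
        err_out = o_out * (1 - o_out) * (target - o_out)                 # output layer
        err_hidden = {j: o_hidden[j] * (1 - o_hidden[j]) * err_out * w_out[j]
                      for j in o_hidden}                                 # hidden layer
        # Updates: w_ij += l * Err_j * O_i, theta_j += l * Err_j
        for j in w_out:
            w_out[j] += l * err_out * o_hidden[j]
            theta_hidden[j] += l * err_hidden[j]
            for i in x:
                w_in[j][i] += l * err_hidden[j] * x[i]
        theta_out += l * err_out
        return o_out, theta_out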


    Multi-Layer Perceptron

    Case updating vs. epoch updating

    Weights and biases are updated after the presentation of each sample, vs.

    Deltas are accumulated over the whole set of training examples and the update is applied once per epoch

    Case updating is more common (more accurate)

    Termination conditions

    Delta is too small (convergence)

    Accuracy of the current epoch is high enough

    Pre-specified number of epochs

    In practice, hundreds of thousands of epochs


    Example

    Class label = 1


    Example

    Using the formulas above:

    I_j = \sum_i w_{ij} O_i + \theta_j,   O_j = \frac{1}{1 + e^{-I_j}}

    Err_j = O_j (1 - O_j)(T_j - O_j)  (output layer),   Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}  (hidden layer)

    [Figure: the forward pass through the example network gives hidden-unit outputs O = 0.332 and O = 0.525, and an output-unit value O = 0.474]


    Example

    Using the update formulas above:

    w_{ij} = w_{ij} + (l)\, Err_j\, O_i,   \theta_j = \theta_j + (l)\, Err_j

    [Figure: the backward pass gives an error E = 0.1311 at the output unit, and errors E = -0.0087 (for the hidden unit with O = 0.332) and E = -0.0065 (for the hidden unit with O = 0.525)]


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Association-Based Classification

    Several methods for association-based classification

    ARCS: Quantitative association mining and clustering of association rules (Lent et al.'97)

    Associative classification (Liu et al.'98)

    CAEP (Classification by aggregating emerging patterns) (Dong et al.'99)


    ARCS: Quantitative Association Rules Revisited

    Example: age(X, "30-34") AND income(X, "24K-48K") => buys(X, "high resolution TV")

    Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized

    2-D quantitative association rules: A_quan1 AND A_quan2 => A_cat

    Cluster adjacent association rules to form general rules using a 2-D grid


    ARCS

    The clustered association rules were applied to classification

    Accuracy was compared to C4.5

    ARCS was slightly better than C4.5

    Scalability

    ARCS requires a constant amount of memory regardless of database size

    C4.5 has exponentially higher execution times


    Association-Based Classification

    Associative classification (Liu et al.'98)

    It mines high-support, high-confidence rules of the form cond_set => y, where y is a class label

    cond_set is a set of items

    Regard y as one item in association rule mining

    Support s%: s% of the samples contain cond_set and y

    Rules are accurate if c% of the samples that contain cond_set belong to y

    Possible rule (PR)

    If multiple rules have the same cond_set, choose the one with the highest confidence

    Two steps

    Step 1: find PRs

    Step 2: construct a classifier: sort the rules based on confidence and support

    Classifying a new sample: use the first rule matched

    Default rule: for a new sample with no matched rule

    Empirical study: more accurate than C4.5


    Association-Based Classification

    CAEP (Classification by aggregating emerging patterns) (Dong et al.'99)

    Emerging patterns (EPs): itemsets whose support increases significantly from one class to another

    C1: buys_computer = yes

    C2: buys_computer = no

    EP: {age


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Other Classification Methods

    k-nearest neighbor classifier

    Case-based reasoning

    Genetic algorithm

    Rough set approach

    Fuzzy set approaches


    Instance-Based Methods

    Instance-based learning:

    Store training examples and delay the processing (lazy evaluation) until a new instance must be classified

    Typical approaches

    k-nearest neighbor approach

    Instances represented as points in a Euclidean space

    Locally weighted regression

    Constructs a local approximation

    Case-based reasoning

    Uses symbolic representations and knowledge-based inference


    The k-Nearest Neighbor Algorithm

    All instances correspond to points in the n-D space.

    The nearest neighbors are defined in terms of Euclidean distance.

    The target function could be discrete- or real-valued.

    For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to x_q.

    Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

    [Figure: a query point x_q surrounded by positive (+) and negative (-) training instances, and the Voronoi-cell decision surface induced by 1-NN]


    Discussion on the k-NN Algorithm

    The k-NN algorithm for continuous-valued target functions

    Calculate the mean value of the k nearest neighbors

    Distance-weighted nearest neighbor algorithm

    Weight the contribution of each of the k neighbors according to their distance to the query point x_q, giving greater weight to closer neighbors:

    w \equiv \frac{1}{d(x_q, x_i)^2}

    Similarly, for real-valued target functions

    Robust to noisy data by averaging over the k nearest neighbors

    Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes

    Assigning equal weight to all attributes is a problem (vs. decision trees)

    To overcome it, stretch the axes or eliminate the least relevant attributes
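    A brief Python sketch of k-NN with this inverse-squared-distance weighting; the training-set representation and the small epsilon guarding division by zero are illustrative assumptions:

    import math
    from collections import defaultdict

    def knn_classify(query, training, k=3):
        # training: list of (point, label) pairs; points are numeric tuples
        dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        nearest = sorted(training, key=lambda t: dist(query, t[0]))[:k]
        votes = defaultdict(float)
        for point, label in nearest:
            votes[label] += 1.0 / (dist(query, point) ** 2 + 1e-9)   # w = 1 / d(x_q, x_i)^2
        return max(votes, key=votes.get)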


    Case-Based Reasoning

    Also uses lazy evaluation and analysis of similar instances

    Difference: instances are not points in a Euclidean space

    Example: the water faucet problem in CADET (Sycara et al.'92)

    Methodology

    Instances represented by rich symbolic descriptions (e.g., function graphs)

    Multiple retrieved cases may be combined

    Tight coupling between case retrieval, knowledge-based reasoning, and problem solving

    Research issues

    Indexing based on a syntactic similarity measure and, on failure, backtracking and adapting to additional cases


    Remarks on Lazy vs. Eager Learning

    Instance-based learning: lazy evaluation

    Decision-tree and Bayesian classification: eager evaluation

    Key differences

    A lazy method may consider the query instance x_q when deciding how to generalize beyond the training data D

    An eager method cannot, since it has already chosen its global approximation before seeing the query

    Efficiency: lazy methods need less time for training but more time for predicting

    Accuracy

    A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function

    An eager method must commit to a single hypothesis that covers the entire instance space


    Genetic Algorithms

    GA: based on an analogy to biological evolution

    Each rule is represented by a string of bits

    An initial population is created consisting of randomly generated rules

    e.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100

    Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring

    The fitness of a rule is represented by its classification accuracy on a set of training examples

    Offspring are generated by crossover and mutation


    Rough Set Approach

    Rough sets are used to approximately or "roughly" define equivalence classes

    A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)

    Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix is used to reduce the computation intensity


    Fuzzy Set Approaches

    Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph)

    Attribute values are converted to fuzzy values

    e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated

    For a given new sample, more than one fuzzy value may apply

    Each applicable rule contributes a vote for membership in the categories

    Typically, the truth values for each predicted category are summed


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    What Is Prediction?

    Prediction is similar to classification

    First, construct a model

    Second, use the model to predict unknown values

    The major method for prediction is regression

    Linear and multiple regression

    Non-linear regression

    Prediction is different from classification

    Classification refers to predicting a categorical class label

    Prediction models continuous-valued functions


    Predictive Modeling in Databases

    Predictive modeling: Predict data values or construct generalized linear models based on the database data.

    One can only predict value ranges or category distributions

    Method outline:

    Minimal generalization

    Attribute relevance analysis

    Generalized linear model construction

    Prediction

    Determine the major factors which influence the prediction

    Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.

    Multi-level prediction: drill-down and roll-up analysis


    Regression Analysis and Log-Linear Models in Prediction

    Linear regression: Y = α + β X

    The two parameters, α and β, specify the line and are to be estimated using the data at hand,

    applying the least squares criterion to the known values Y1, Y2, ..., X1, X2, ....

    Multiple regression: Y = b0 + b1 X1 + b2 X2

    Many nonlinear functions can be transformed into the above

    e.g., X1 = X, X2 = X^2
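    A minimal Python sketch of fitting Y = α + βX by the least squares criterion (here a and b stand in for α and β, and the sample values are made up for illustration):

    def fit_line(xs, ys):
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
        a = mean_y - b * mean_x
        return a, b                                # predict with a + b * x

    a, b = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])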


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Classification Accuracy: Estimating Error Rates

    Partition: training-and-testing

    use two independent data sets, e.g., training set (2/3), test set (1/3)

    used for data sets with a large number of samples

    Cross-validation

    divide the data set into k subsamples S1, S2, ..., Sk

    use k-1 subsamples as training data and one subsample as test data --- k-fold cross-validation

    1st iteration: S2, ..., Sk for training, S1 for test

    2nd iteration: S1, S3, ..., Sk for training, S2 for test

    Accuracy = correct classifications from the k iterations / # samples

    for data sets of moderate size

    Stratified cross-validation

    the class distribution in each fold is similar to that of the total sample

    Bootstrapping (leave-one-out)

    for small data sets
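    A compact Python sketch of k-fold cross-validation as described above; it assumes a train function that returns a classifier callable on a single sample:

    def cross_validate(samples, labels, train, k=10):
        folds = [list(range(i, len(samples), k)) for i in range(k)]    # k subsamples
        correct = 0
        for test_idx in folds:
            train_idx = [i for i in range(len(samples)) if i not in test_idx]
            model = train([samples[i] for i in train_idx], [labels[i] for i in train_idx])
            correct += sum(model(samples[i]) == labels[i] for i in test_idx)
        return correct / len(samples)              # accuracy over all k iterations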


    Bagging and Boosting

    Bagging (Fig. 7.17)

    Sample S_t from S with replacement, then build classifier C_t

    Given a symptom (test data), ask multiple doctors (classifiers)

    Voting
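    A minimal Python sketch of bagging as described: draw bootstrap samples S_t, train a classifier C_t on each, and classify new data by voting; the train callback is an assumption for illustration:

    import random
    from collections import Counter

    def bagging(samples, labels, train, t=10):
        models = []
        for _ in range(t):
            idx = [random.randrange(len(samples)) for _ in samples]    # bootstrap sample S_t
            models.append(train([samples[i] for i in idx], [labels[i] for i in idx]))
        def vote(x):                                                   # majority vote of the C_t
            return Counter(m(x) for m in models).most_common(1)[0][0]
        return vote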


    Bagging and Boosting

    Boosting (Fig. 7.17)

    Combining with weights instead of voting

    Boosting increases classification accuracy

    Applicable to decision trees or Bayesian classifiers

    Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor

    Boosting requires only linear time and constant space


    Boosting Technique (II): Algorithm

    Assign every example an equal weight 1/N

    For t = 1, 2, ..., T do

    1. Obtain a hypothesis (classifier) h(t) under w(t)

    2. Calculate the error of h(t) and re-weight the examples based on the error

    3. Normalize w(t+1) to sum to 1

    Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set

    Here the weights belong to the classifiers, not to the examples
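    A hedged Python sketch of this weighting loop in AdaBoost style; the slide does not fix the exact re-weighting rule, so the exponential update and the classifier weight alpha used here are one common choice, not necessarily the one intended:

    import math

    def boost(samples, labels, train, t_rounds=10):
        # train(samples, labels, weights) is assumed to return a classifier callable on one sample
        n = len(samples)
        w = [1.0 / n] * n                                   # equal initial weights 1/N
        hypotheses = []                                     # (alpha, h) pairs
        for _ in range(t_rounds):
            h = train(samples, labels, w)                   # obtain h(t) under w(t)
            err = sum(wi for wi, x, y in zip(w, samples, labels) if h(x) != y)
            if err <= 0 or err >= 0.5:                      # stop if perfect or too weak
                if err <= 0:
                    hypotheses.append((1.0, h))
                break
            alpha = 0.5 * math.log((1 - err) / err)         # weight of this classifier
            hypotheses.append((alpha, h))
            w = [wi * math.exp(-alpha if h(x) == y else alpha)
                 for wi, x, y in zip(w, samples, labels)]   # re-weight the examples
            total = sum(w)
            w = [wi / total for wi in w]                    # normalize w(t+1) to sum to 1
        def classify(x):                                    # weighted combination of hypotheses
            votes = {}
            for alpha, h in hypotheses:
                votes[h(x)] = votes.get(h(x), 0.0) + alpha
            return max(votes, key=votes.get)
        return classify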


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Summary

    Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)

    Classification is probably one of the most widely used data mining techniques, with a lot of extensions

    Scalability is still an important issue for database applications; thus combining classification with database techniques should be a promising topic

    Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.