class04.ppt

7/28/2019 class04.ppt

1/73

6/23/2013 Data Mining: Concepts and Techniques 1

Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation Summary

7/28/2019 class04.ppt

2/73


Data Reduction Strategies

Why data reduction? A database/data warehouse may store terabytes of data

Complex data analysis/mining may take a very long time to runon the complete data set

Data reduction

Obtain a reduced representation of the data set that is muchsmaller in volume but yet produce the same (or almost the same)analytical results

Data reduction strategies

Data cube aggregation: Dimensionality reductione.g.,remove unimportant attributes

Data Compression

Numerosity reductione.g.,fit data into models

Discretization and concept hierarchy generation

7/28/2019 class04.ppt

3/73


Data Cube Aggregation

The lowest level of a data cube (base cuboid)

The aggregated data for an individual entity of interest

E.g., a customer in a phone calling data warehouse

Multiple levels of aggregation in data cubes Further reduce the size of data to deal with

Reference appropriate levels

Use the smallest representation which is enough tosolve the task

Queries regarding aggregated information should be

answered using data cube, when possible

7/28/2019 class04.ppt

4/73


Attribute Subset Selection

Feature selection (i.e., attribute subset selection): Select a minimum set of features such that the

probability distribution of different classes given thevalues for those features is as close as possible to the

original distribution given the values of all features reduce # of patterns in the patterns, easier to

understand

Heuristic methods (due to exponential # of choices):

Step-wise forward selection Step-wise backward elimination

Combining forward selection and backward elimination

Decision-tree induction

7/28/2019 class04.ppt

5/73


Example of Decision Tree Induction

Initial attribute set:{A1, A2, A3, A4, A5, A6}

A4 ?

A1? A6?

Class 1 Class 2 Class 1 Class 2

> Reduced attribute set: {A1, A4, A6}

7/28/2019 class04.ppt

6/73


Heuristic Feature Selection Methods

There are 2dpossible sub-features ofdfeatures Several heuristic feature selection methods:

Best single features under the feature independenceassumption: choose by significance tests

Best step-wise feature selection: The best single-feature is picked first

Then next best feature condition to the first, ...

Step-wise feature elimination:

Repeatedly eliminate the worst feature Best combined feature selection and elimination

Optimal branch and bound:

Use feature elimination and backtracking

7/28/2019 class04.ppt

7/736/23/2013 Data Mining: Concepts and Techniques 7

Data Compression

String compression There are extensive theories and well-tuned algorithms

Typically lossless

But only limited manipulation is possible without

expansion

Audio/video compression

Typically lossy compression, with progressiverefinement

Sometimes small fragments of signal can bereconstructed without reconstructing the whole

Time sequence is not audio

Typically short and vary slowly with time

7/28/2019 class04.ppt


Data Compression

Original Data CompressedData

lossless

Original Data

Approximated

7/28/2019 class04.ppt


Dimensionality Reduction

Curse of dimensionality When dimensionality increases, data becomes increasingly sparse

Density and distance between points, which is critical to clustering,outlier analysis, becomes less meaningful

The possible combinations of subspaces will grow exponentially

Dimensionality reduction

Avoid the curse of dimensionality

Help eliminate irrelevant features and reduce noise

Reduce time and space required in data mining

Allow easier visualization Dimensionality reduction techniques

Principal component analysis

Singular value decomposition

Supervised and nonlinear techniques (e.g., feature selection)

7/28/2019 class04.ppt


x2

x1

e

Dimensionality Reduction: Principal ComponentAnalysis (PCA)

Find a projection that captures the largest amount ofvariation in data

Find the eigenvectors of the covariance matrix, and theseeigenvectors define the new space

7/28/2019 class04.ppt


Given Ndata vectors from n-dimensions, find k northogonal vectors(principal components) that can be best used to represent data

Normalize input data: Each attribute falls within the same range

Compute korthonormal (unit) vectors, i.e., principal components

Each input data (vector) is a linear combination of the kprincipalcomponent vectors

The principal components are sorted in order of decreasing

significance or strength

Since the components are sorted, the size of the data can bereduced by eliminating the weak components, i.e., those with low

variance (i.e., using the strongest principal components, it is

possible to reconstruct a good approximation of the original data)

Works for numeric data only

Principal Component Analysis (Steps)

7/28/2019 class04.ppt


Feature Subset Selection

Another way to reduce dimensionality of data

Redundant features

duplicate much or all of the information contained in

one or more other attributes

E.g., purchase price of a product and the amount of

sales tax paid

Irrelevant features

contain no information that is useful for the datamining task at hand

E.g., students' ID is often irrelevant to the task of

predicting students' GPA

7/28/2019 class04.ppt


Heuristic Search in Feature Selection

There are 2dpossible feature combinations ofd features

Typical heuristic feature selection methods:

Best single features under the feature independenceassumption: choose by significance tests

Best step-wise feature selection:

The best single-feature is picked first

Then next best feature condition to the first, ...

Step-wise feature elimination: Repeatedly eliminate the worst feature

Best combined feature selection and elimination

Optimal branch and bound:

Use feature elimination and backtracking

7/28/2019 class04.ppt


Feature Creation

Create new attributes that can capture the importantinformation in a data set much more efficiently than theoriginal attributes

Three general methodologies

Feature extraction

domain-specific

Mapping data to new space (see: data reduction)

E.g., Fourier transformation, wavelet transformation

Feature construction

Combining features

Data discretization

7/28/2019 class04.ppt


Mapping Data to a New Space

Two Sine Waves Two Sine Waves + Noise Frequency

Fourier transform

Wavelet transform

7/28/2019 class04.ppt


Numerosity (Data) Reduction

Reduce data volume by choosing alternative, smallerforms of data representation

Parametric methods (e.g., regression)

Assume the data fits some model, estimate model

parameters, store only the parameters, and discardthe data (except possible outliers)

Example: Log-linear modelsobtain value at a pointin m-D space as the product on appropriate marginal

subspaces Non-parametric methods

Do not assume models

Major families: histograms, clustering, sampling

7/28/2019 class04.ppt


Parametric Data Reduction: Regressionand Log-Linear Models

Linear regression: Data are modeled to fit a straight line

Often uses the least-square method to fit the line

Multiple regression: allows a response variable Y to be

modeled as a linear function of multidimensional feature

vector

Log-linear model: approximates discrete

multidimensional probability distributions

7/28/2019 class04.ppt

18/73

Linear regression: Y = w X + b

Two regression coefficients, wand b, specify the lineand are to be estimated by using the data at hand

Using the least squares criterion to the known valuesofY1, Y2, , X1, X2, .

Multiple regression: Y = b0+ b1X1+ b2X2.

Many nonlinear functions can be transformed into theabove

Log-linear models:

The multi-way table of joint probabilities isapproximated by a product of lower-order tables

Probability: p(a, b, c, d) =abacadbcd

Regress Analysis and Log-Linear Models

6/23/2013 18Data Mining: Concepts and Techniques

7/28/2019 class04.ppt


Data Reduction: Histograms

Divide data into buckets and storeaverage (sum) for each bucket

Partitioning rules:

Equal-width: equal bucket range

Equal-frequency (or equal-depth)

V-optimal: with the least

histogram variance(weighted

sum of the original values that

each bucket represents) MaxDiff: set bucket boundary

between each pair for pairs have

the 1 largest differences0

5

10

15

20

25

30

35

40

10000 30000 50000 70000 90000

7/28/2019 class04.ppt


Data Reduction Method: Clustering

Partition data set into clusters based on similarity, and

store cluster representation (e.g., centroid and diameter)

only

Can be very effective if data is clustered but not if datais smeared

Can have hierarchical clustering and be stored in multi-

dimensional index tree structures

There are many choices of clustering definitions and

clustering algorithms

Cluster analysis will be studied in depth in Chapter 7

7/28/2019 class04.ppt


Data Reduction Method: Sampling

Sampling: obtaining a small sample sto represent thewhole data set N

Allow a mining algorithm to run in complexity that is

potentially sub-linear to the size of the data

Key principle: Choose a representative subset of the data

Simple random sampling may have very poor

performance in the presence of skew

Develop adaptive sampling methods, e.g., stratified

sampling:

Note: Sampling may not reduce database I/Os (page at a

time)

7/28/2019 class04.ppt


Types of Sampling

Simple random sampling There is an equal probability of selecting any particular

item

Sampling without replacement

Once an object is selected, it is removed from thepopulation

Sampling with replacement

A selected object is not removed from the population

Stratified sampling: Partition the data set, and draw samples from each

partition (proportionally, i.e., approximately the samepercentage of the data)

Used in conjunction with skewed data

7/28/2019 class04.ppt

23/73


Sampling: With or without Replacement

Raw Data

l l f d

7/28/2019 class04.ppt

24/73


Sampling: Cluster or StratifiedSampling

Raw Data Cluster/Stratified Sample

7/28/2019 class04.ppt

25/73


Data Reduction: Discretization

Three types of attributes:

Nominal values from an unordered set, e.g., color, profession

Ordinal values from an ordered set, e.g., military or academic

rank Continuous real numbers, e.g., integer or real numbers

Discretization:

Divide the range of a continuous attribute into intervals

Some classification algorithms only accept categorical attributes.

Reduce data size by discretization

Prepare for further analysis

7/28/2019 class04.ppt

26/73


Discretization and Concept Hierarchy

Discretization

Reduce the number of values for a given continuous attribute by

dividing the range of the attribute into intervals

Interval labels can then be used to replace actual data values

Supervised vs. unsupervised

Split (top-down) vs. merge (bottom-up)

Discretization can be performed recursively on an attribute

Concept hierarchy formation Recursively reduce the data by collecting and replacing low level

concepts (such as numeric values for age) by higher level concepts

(such as young, middle-aged, or senior)

7/28/2019 class04.ppt

27/73


Discretization and Concept Hierarchy Generationfor Numeric Data

Typical methods: All the methods can be applied recursively

Binning (covered above)

Top-down split, unsupervised,

Histogram analysis (covered above)

Top-down split, unsupervised

Clustering analysis (covered above)

Either top-down split or bottom-up merge, unsupervised

Entropy-based discretization: supervised, top-down split

Interval merging by 2 Analysis: unsupervised, bottom-up merge

Segmentation by natural partitioning: top-down split, unsupervised

7/28/2019 class04.ppt

28/73


Entropy-based Discretization

Using Entropy-based measure for attributeselection

Ex. Decision tree construction

Generalize the idea for discretization numericalattributes

7/28/2019 class04.ppt

29/73


Definition of Entropy

Entropy

Example: Coin Flip

AX= {heads, tails}

P(heads) = P(tails) =

log2() = * - 1

H(X) = 1

What about a two-headed coin? Conditional Entropy:

2( ) ( ) log ( )Xx A

H X P x P x

( | ) ( ) ( | )Yy A

H X Y P y H X y

7/28/2019 class04.ppt

30/73


Entropy

Entropy measures the amount of information in arandom variable; its the average length of themessage needed to transmit an outcome of thatvariable using the optimal code

Uncertainty, Surprise, Information High Entropymeans X is from a uniform

(boring) distribution

Low Entropymeans X is from a varied (peaksand valleys) distribution

7/28/2019 class04.ppt

31/73


Playing Tennis

7/28/2019 class04.ppt

32/73


Choosing an Attribute

We want to make decisions based on one of theattributes

There are four attributes to choose from:

Outlook

Temperature

Humidity

Wind

7/28/2019 class04.ppt

33/73


Example:

What is Entropy of play?

-5/14*log2(5/14) 9/14*log2(9/14)= Entropy(5/14,9/14) = 0.9403

Now based on Outlook, divided the set into three subsets,compute the entropy for each subset

The expected conditional entropy is:5/14 * Entropy(3/5,2/5) +4/14 * Entropy(1,0) +

5/14 * Entropy(3/5,2/5) = 0.6935

7/28/2019 class04.ppt

34/73


Outlook Continued

The expected conditional entropy is:5/14 * Entropy(3/5,2/5) +4/14 * Entropy(1,0) +5/14 * Entropy(3/5,2/5) = 0.6935

So IG(Outlook) = 0.9403 0.6935 = 0.2468

We seek an attribute that makes partitions aspure as possible

7/28/2019 class04.ppt

35/73


Information Gain in a Nutshell

)(||

||)()(

)(

v

AValuesv

v SEntropyS

SSEntropyAnGainInformatio

Decisionsd

dpdpEntropy ))(log(*)(

typically yes/no

7/28/2019 class04.ppt

36/73


Temperature

Now let us look at the attribute Temperature The expected conditional entropy is:

4/14 * Entropy(2/4,2/4) +6/14 * Entropy(4/6,2/6) +

4/14 * Entropy(3/4,1/4) = 0.9111 So IG(Temperature) = 0.9403 0.9111 = 0.0292

7/28/2019 class04.ppt

37/73


Humidity

Now let us look at attribute Humidity What is the expected conditional entropy?

7/14 * Entropy(4/7,3/7) +

7/14 * Entropy(6/7,1/7) = 0.7885 So IG(Humidity) = 0.9403 0.7885

= 0.1518

7/28/2019 class04.ppt

38/73


Wind

What is the information gain for wind? Expected conditional entropy:

8/14 * Entropy(6/8,2/8) +

6/14 * Entropy(3/6,3/6) = 0.8922 IG(Wind) = 0.9403 0.8922 = 0.048

7/28/2019 class04.ppt

39/73


Information Gains

Outlook 0.2468 Temperature 0.0292

Humidity 0.1518

Wind 0.0481 We choose Outlook since it has the highest

information gain

Now Generalize the idea for discretizing

7/28/2019 class04.ppt

40/73


Now Generalize the idea for discretizingnumerical values

What are the procedures?

How to get intervals?

Recursively binary split

Binary splits

Temperature < 71

Temperature > 69

How to select position for binary split? Comparing to categorical attribute selection

When do stop?

7/28/2019 class04.ppt

41/73


General Description Disc

Given a set of samples S, if S is partitioned into two intervalsS1 and S2 using boundary T, the entropy after partitioning is

The boundary that minimizes the entropy function over allpossible boundaries is selected as a binary discretization. Maximize the information gain We seek a discretization that makes subintervals as pure

as possible

1 2

1 2

| | | |( , ) ( ) ( )

| | | |H S T H H

S S

S SS S

7/28/2019 class04.ppt

42/73


7/28/2019 class04.ppt

43/73


Entropy-based discretization

It is common to place numeric thresholds halfwaybetween the values that delimit the boundaries ofa concept

Fact

cut point minimizing info never appearsbetween two consequtive examples ofthe same class

we can reduce the # of candidatepoints

7/28/2019 class04.ppt

44/73


Procedure

Recursively split intervals

apply attribute selection method to find theinitial split

repeat this process in both parts

The process is recursively applied to partitionsobtained until some stopping criterion is met, e.g.,

( ) ( , )H S H T S

7/28/2019 class04.ppt

45/73


64 65 68 69 70 71 72 75 80 81 83 85Y N Y Y Y N N/Y Y/Y N Y Y N

F E D C B A66.5 70.5 73.5 77.5 80.5 84

7/28/2019 class04.ppt

46/73


When to stop recursion?

Setting Threshold Use MDL principle

no split: encode example classes

split

model = splitting point takes log(N-1) bits to encode(N = # of instances)

plus encode classes in both partitions

optimal situation all < are yes & all > are no

each instance costs 1 bit without splitting and almost 0bits with it

formula for MDL-based gain threshold the devil is in the details

7/28/2019 class04.ppt

47/73


Discretization Using Class Labels

Entropy based approach

3 categories for both x and y 5 categories for both x and y


7/28/2019 class04.ppt

48/73



Entropy based approach

3 categories for both x and y 5 categories for both x and y

7/28/2019 class04.ppt

49/73


Entropy-Based Discretization

Given a set of samples S, if S is partitioned into two intervals S1

and S2

using boundary T, the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the

set. Given mclasses, the entropy ofS1 is

where pi is the probability of class iin S1

The boundary that minimizes the entropy function over all possible

boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some

stopping criterion is met

Such a boundary may reduce data size and improve classification

accuracy

)(||

||)(

||

||),( 2

21

1SEntropy

S

SSEntropy

S

STSI

m

i

ii ppSEntropy1

21 )(log)(

7/28/2019 class04.ppt

50/73


Interval Merge by 2 Analysis

Merging-based (bottom-up) vs. splitting-based methods

Merge: Find the best neighboring intervals and merge them to form

larger intervals recursively

ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]

Initially, each distinct value of a numerical attr. A is considered to beone interval

2 tests are performed for every pair of adjacent intervals

Adjacent intervals with the least 2 values are merged together,

since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping

criterion is met (such as significance level, max-interval, max

inconsistency, etc.)

7/28/2019 class04.ppt

51/73


Segmentation by Natural Partitioning

A simply 3-4-5 rule can be used to segment numeric data

into relatively uniform, natural intervals.

If an interval covers 3, 6, 7 or 9 distinct values at the

most significant digit, partition the range into 3 equi-width intervals

If it covers 2, 4, or 8 distinct values at the most

significant digit, partition the range into 4 intervals If it covers 1, 5, or 10 distinct values at the most

significant digit, partition the range into 5 intervals

7/28/2019 class04.ppt

52/73


Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchygeneration

Summary

7/28/2019 class04.ppt

53/73


Summary

Data preparation or preprocessing is a big issue for bothdata warehousing and data mining

Discriptive data summarization is need for quality data

preprocessing

Data preparation includes

Data cleaning and data integration

Data reduction and feature selection

Discretization

A lot a methods have been developed but data

preprocessing still an active area of research

7/28/2019 class04.ppt

54/73


Feature Subset Selection Techniques

Brute-force approach: Try all possible feature subsets as input to data mining

algorithm

Embedded approaches:

Feature selection occurs naturally as part of the datamining algorithm

Filter approaches:

Features are selected before data mining algorithm is

run Wrapper approaches:

Use the data mining algorithm as a black box to findbest subset of attributes

7/28/2019 class04.ppt

55/73


References

D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications

of ACM, 42:73-78, 1999

T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003

T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk. Mining Database Structure; Or, How to Build

a Data Quality Browser. SIGMOD02.

H.V. Jagadish et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical

Committee on Data Engineering, 20(4), December 1997 D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999

E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the

Technical Committee on Data Engineering. Vol.23, No.4

V. Raman and J. Hellerstein. Potters Wheel: An Interactive Framework for Data Cleaning and

Transformation, VLDB2001

T. Redman. Data Quality: Management and Technology. Bantam Books, 1992

Y. Wand and R. Wang. Anchoring data quality dimensions ontological foundations. Communications of

ACM, 39:86-95, 1996

R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans.

Knowledge and Data Engineering, 7:623-640, 1995
http://www-courses.cs.uiuc.edu/~cs491han/papers/dasu02.pdfhttp://www-courses.cs.uiuc.edu/~cs491han/papers/dasu02.pdfhttp://www-courses.cs.uiuc.edu/~cs491han/papers/dasu02.pdfhttp://www-courses.cs.uiuc.edu/~cs491han/papers/dasu02.pdf

7/28/2019 class04.ppt

56/73


Backup Slides

7/28/2019 class04.ppt

57/73


PCA

Intuition: find the axisthat shows thegreatest variation, andproject all points into

this axis

f1

e1e2

f2

7/28/2019 class04.ppt

58/73


PCA: The mathematical formulation

Find the eigenvectors of thecovariance matrix

These define the new space

The eigenvalues sort them ingoodness

order

f1

e1e2

f2

7/28/2019 class04.ppt

59/73


7/28/2019 class04.ppt

60/73


7/28/2019 class04.ppt

61/73


7/28/2019 class04.ppt

62/73


P i i l C A l i

7/28/2019 class04.ppt

63/73


Principal Component Analysis

Given N data vectors from k-dimensions, find c k orthogonal vectors that can be best used torepresent data

The original data set is reduced to oneconsisting of N data vectors on c principalcomponents (reduced dimensions)

Each data vector is a linear combination of the c

principal component vectors

P i i l t

7/28/2019 class04.ppt

64/73


Principal components

1. principal component (PC1) the direction along which there is greatest

variation

2. principal component (PC2) the direction with maximum variation left in

data, orthogonal to the 1. PC

P i i l t

7/28/2019 class04.ppt

65/73


Principal components

7/28/2019 class04.ppt

66/73


Nows lets look at a detailed example

7/28/2019 class04.ppt

67/73


7/28/2019 class04.ppt

68/73


Compute the covariance matrixCov=(0.61656 0.61544;0.61544 0.71656)

Computer the eigenvalues/eigenvectors

Of the covariance matrixeigvalue1: 1.284eigenvalue2: 0.04908eigenvector1: -0.6778 -0.73517

eigenvector2: -0.73517 0.67787

7/28/2019 class04.ppt

69/73


7/28/2019 class04.ppt

70/73


U i l i l i t

7/28/2019 class04.ppt

71/73


Using only a single eigenvector

7/28/2019 class04.ppt

72/73


7/28/2019 class04.ppt

73/73

class04.ppt

Documents

Transcript of class04.ppt