Transcript of: Data Mining: Concepts and Techniques (uploaded Friday, November 6, 2015)
April 20, 2023 - Data Mining: Concepts and Techniques
Data Preprocessing
— Chapter 2 —
Data Preprocessing Step in KDD Process
Databases -> Relevant Data -> Data Preprocessing -> Data Mining -> Patterns -> Evaluation/Interpretation -> Knowledge
Main Steps of a KDD Process (Full)
- Domain knowledge acquisition: learning relevant prior knowledge and the goals of the application
- Data collection and preprocessing (may take 60% of the effort!)
  - Data selection and integration: creating a target data set
  - Data cleaning, data transformation, and data reduction
- Data mining
  - Choosing the data mining functions: association, classification, clustering, regression, summarization
  - Choosing the mining algorithm(s)
  - Searching for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of the discovered knowledge
Chapter 2: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Why Data Preprocessing?
Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- noisy: containing errors or outliers
- inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
- Quality decisions must be based on quality data
- Data quality is important to data mining
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view (people, events, time, place, things):
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility
Major Tasks in Data Preprocessing
- Data cleaning: process missing values and noisy data
- Data integration: integration of multiple files, tables, databases, or DBMSs' DBs
- Data transformation: normalization and aggregation
- Data reduction: reducing the data representation in volume while producing the same or similar analytical results (especially for cloud computing)
- Data discretization: part of data reduction; reducing the number of attribute values (especially for numerical attributes)
Forms of data preprocessing
Chapter 2: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data Cleaning
Data cleaning tasks mainly include
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Missing Data
Data is not always available
- e.g., many tuples may have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to:
- equipment malfunction
- inconsistency with other recorded data, leading to deletion
- data not entered due to misunderstanding
- data not considered important at the time of entry
- unregistered history or changes of the data
Missing data may need to be inferred.
How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification; not effective when the percentage of missing values per attribute varies considerably)
- Fill in the missing value manually: tedious + infeasible?
- Use a global constant to fill in the missing value, e.g., "unknown": a new value!
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
- Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula, decision tree, or linear regression
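The two mean-based options can be sketched in a few lines; the records and class labels below are hypothetical toy data, with None marking a missing income value:

```python
# Hypothetical toy data: each record is (class_label, income); None marks a missing value.
records = [("high", 90), ("high", None), ("low", 30), ("low", 34), ("high", 110)]

# Global mean imputation: fill with the mean of all observed values.
observed = [v for _, v in records if v is not None]
global_mean = sum(observed) / len(observed)          # (90 + 30 + 34 + 110) / 4 = 66.0

# Class-conditional mean imputation ("smarter"): fill with the mean of the record's own class.
grouped = {}
for label, v in records:
    if v is not None:
        grouped.setdefault(label, []).append(v)
class_means = {c: sum(vs) / len(vs) for c, vs in grouped.items()}

filled = [(c, v if v is not None else class_means[c]) for c, v in records]
print(filled[1])   # ("high", 100.0): the mean of the "high" class, not the global mean
```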
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions
Other data problems that require data cleaning:
- duplicate records
- incomplete data
- inconsistent data
How to Handle Noisy Data?
- Binning method: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, bin boundaries, etc.
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and have a human check them
- Regression: smooth by fitting the data to regression functions
Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
- The most straightforward method
- Divides the range of the attribute values into N intervals of equal size
- If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
- Outliers may dominate the presentation
- Skewed data cannot be handled well
Equal-depth (frequency) partitioning:
- Divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling
- Managing categorical attributes can be tricky
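Equal-width partitioning can be sketched directly from the W = (B - A)/N definition; the price list here is sample data:

```python
# Equal-width binning sketch: interval width W = (B - A) / N.
def equal_width_edges(values, n_bins):
    lo, hi = min(values), max(values)
    w = (hi - lo) / n_bins
    return [lo + i * w for i in range(n_bins + 1)]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
edges = equal_width_edges(prices, 3)   # W = (34 - 4) / 3 = 10.0
print(edges)                           # [4.0, 14.0, 24.0, 34.0]
```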
Data Smoothing for Equal-Depth Binning
1) Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
2) Partition into 3 (equi-depth) bins:
   - Bin 1: 4, 8, 9, 15
   - Bin 2: 21, 21, 24, 25
   - Bin 3: 26, 28, 29, 34
3) Smoothing
   * Smoothing by bin means:
   - Bin 1: 9, 9, 9, 9
   - Bin 2: 23, 23, 23, 23
   - Bin 3: 29, 29, 29, 29
   * Smoothing by bin boundaries:
   - Bin 1: 4, 4, 4, 15
   - Bin 2: 21, 21, 25, 25
   - Bin 3: 26, 26, 26, 34
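The equi-depth binning and both smoothings can be reproduced with a short sketch (assuming bins of exactly equal depth, and rounding bin means to integers as the slide does):

```python
# Equi-depth binning and smoothing for the price example.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
depth = len(prices) // n_bins
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the nearest bin boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```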
Detect Outliers by Cluster Analysis
1. Cluster analysis  2. Outlier detection
[Figure: clustered data set; points X, Y, and Z fall outside all clusters and are detected as outliers]
Data Smoothing by Regression
[Figure: scatter plot of x vs. y with fitted regression line y = x + 1]
1. Regression: fit the line y = x + 1 to the data
2. Detect outlier: the point (X1, Y1) lies far from the line
3. Outlier smoothing: replace (X1, Y1) with (X1, Y1'), where Y1' = X1 + 1 according to y = x + 1
Chapter 2: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data Integration
Data integration combines data from multiple sources into a coherent store (multiple files, databases, or DBMSs' DBs)
Schema (table) integration:
- integrate metadata from different sources
- Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id vs. B.cust-#
Detecting and resolving data value conflicts:
- for the same real-world entity, attribute values from different sources may differ
- possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundant Data in Data Integration
Redundant data occur often when multiple databases are integrated:
- The same attribute may have different names in different databases
- One attribute may be a "derived" attribute in another table, e.g., annual revenue, total price
Redundant data may be detected by correlation analysis
Careful integration of the data from multiple sources may help reduce/avoid data redundancies and inconsistencies and improve mining speed and quality
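Correlation analysis for redundancy detection can be sketched with the Pearson coefficient; the monthly/annual figures below are hypothetical toy data where one attribute is fully derived from the other:

```python
# Pearson correlation as a redundancy check between two numeric attributes:
# values near +1 or -1 suggest one attribute may be redundant.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

monthly = [10, 20, 30, 40]
annual = [120, 240, 360, 480]     # exactly 12 * monthly: a "derived" attribute
print(pearson(monthly, annual))   # 1.0 (up to floating-point error)
```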
Data Transformation
- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones
Data Transformation: Normalization
min-max normalization for attribute A:
    v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A

z-score normalization for attribute A:
    v' = (v - mean_A) / stand_dev_A

normalization by decimal scaling for attribute A:
    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
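The three normalizations can be sketched directly from the formulas above, applied to a list of values of attribute A:

```python
# Min-max normalization: map [min_A, max_A] onto [new_min, new_max].
def min_max(vals, new_min=0.0, new_max=1.0):
    lo, hi = min(vals), max(vals)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in vals]

# Z-score normalization: center on the mean, scale by the standard deviation.
def z_score(vals):
    n = len(vals)
    mean = sum(vals) / n
    std = (sum((v - mean) ** 2 for v in vals) / n) ** 0.5
    return [(v - mean) / std for v in vals]

# Decimal scaling: divide by the smallest power of 10 that brings max(|v'|) below 1.
def decimal_scaling(vals):
    j = 0
    while max(abs(v) for v in vals) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in vals]

print(min_max([200, 300, 400]))          # [0.0, 0.5, 1.0]
print(decimal_scaling([200, 300, 400]))  # [0.2, 0.3, 0.4]
```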
Chapter 2: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data Reduction Strategies
A data warehouse/cloud may store terabytes of data; complex data analysis/mining may take a very long time to run on a complete data set
Data reduction:
- Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies:
- Dimensionality reduction
- Numerosity reduction
- Discretization and concept hierarchy generation
Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
- Select a minimum set of features from the original features such that the class probability distribution using the minimal features is as close as possible to the distribution using the original features
Heuristic methods (due to the exponential number of choices):
- decision-tree induction
- step-wise forward selection
- step-wise backward elimination
- combining forward selection and backward elimination
- principal component analysis
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Tree: the root tests A4; its two branches test A1 and A6; the leaves are Class 1 and Class 2]
Reduced attribute set: {A1, A4, A6}
Other Heuristic Feature Selection Methods
There are 2^d possible subsets of d features. Several heuristic feature selection methods:
- Best/worst single feature: chosen by significance tests (under the feature independence assumption)
- Best step-wise feature selection: the best single feature is picked first; then pick the next best feature given the first feature, ...
- Step-wise feature elimination: repeatedly eliminate the worst feature
- Best combined feature selection and elimination
- Optimal branch and bound: use feature elimination and backtracking
Principal Component Analysis
- Given N data vectors in k dimensions, find c (c <= k) k-dimensional orthogonal vectors that can best be used to represent the data
- Each original data vector is represented by a linear combination of the c principal component vectors, i.e., transformed into a c-dimensional vector
- The original data set is reduced to a new data set consisting of N data vectors on the c principal components
- Space complexity of the representation: O(N × k) → O(N × c)
- Works for numeric data only
- Used when the number of dimensions is large
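A minimal PCA sketch via eigendecomposition of the covariance matrix, assuming numpy is available; this is one common way to compute principal components, not necessarily the procedure the book itself prescribes, and the 4-point data set is hypothetical:

```python
# PCA sketch: project an N x k data matrix onto its c strongest directions.
import numpy as np

def pca_reduce(X, c):
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:c]]  # c principal components
    return Xc @ top                         # N x c reduced representation

X = np.array([[2.0, 0.0], [0.0, 2.0], [3.0, 1.0], [1.0, 3.0]])
Z = pca_reduce(X, 1)
print(Z.shape)   # (4, 1): reduced from k=2 to c=1 dimensions
```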
Principal Component Analysis (data representation transformation: (X, Y) → (α, β))
- X, Y: old coordinates
- α, β: new coordinates
[Figure: points (a, b) and (c, d) in the old (X, Y) coordinates map to (a', b') and (c', d') in the new (α, β) coordinates]
Numerosity Reduction (Especially for Big Data Analysis)
Parametric methods:
- Assume the data fits a certain model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
- Example: in linear regression, if a set of points (x1, y1), (x2, y2), …, (xn, yn) is approximated by a line Y = α + βX, then α and β represent the given points
Non-parametric methods:
- Do not assume models
- Major families: histograms, clustering, sampling
Numerosity Reduction by Regression
Given n sets (classes) of data: S1, S2, …, Si, …, Sn
Linear regression:
- The i-th set of data is modeled to fit a straight line: Y = ai X + bi
- Therefore, Si is represented by <ai, bi>
- The n data sets are translated into <a1, b1>, <a2, b2>, …, <an, bn>
Multiple regression:
- Y = a10 + a11 x1 + a12 x2 + … + a1n xn = A1 • X, where A1 = <a10, a11, a12, …, a1n> and X = <1, x1, x2, …, xn>
- The i-th set of data is modeled to fit such a hyperplane; therefore, Si is represented by Ai
- The n data sets are translated into n data vectors: A1, A2, …, An
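The linear-regression case can be sketched with ordinary least squares, reducing one data set Si to its pair <ai, bi>; the points below are hypothetical toy data lying exactly on Y = 2X + 1:

```python
# Numerosity reduction by linear regression: replace a data set S_i with
# the fitted pair <a_i, b_i> for Y = a_i * X + b_i (ordinary least squares).
def fit_line(points):
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope a_i
    b = (sy - a * sx) / n                           # intercept b_i
    return a, b

S1 = [(0, 1), (1, 3), (2, 5), (3, 7)]   # lies exactly on Y = 2X + 1
print(fit_line(S1))                      # (2.0, 1.0): S1 stored as <2.0, 1.0>
```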
Numerosity Reduction by Histograms (Non-parametric methods)
- A popular data reduction technique
- Divide the data into buckets and store the average (or sum) and count for each bucket
- Related to quantization problems
[Figure: histogram of prices with equi-width buckets from $10,000 to $100,000 and bucket counts from 0 to 40]
Numerosity Reduction by Clustering (Non-parametric methods)
- Partition the data set into clusters, and store the cluster representations only
- Can be very effective if the data is clustered, but not if the data is "smeared"
- There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 7
Represent Data by Cluster Centers (for Numerosity Reduction)
[Figure: a data set partitioned into three clusters with centers A, B, and C]
- A, B, C are cluster representations of the data; use A, B, C to represent the data set
Sampling (Non-parametric methods)
- Allows a mining algorithm to run with a complexity that depends on the sample size, not the size of the data
- Choose a representative subset of the data; note that simple random sampling may perform very poorly in the presence of skew
- Adaptive sampling methods:
  - Stratified sampling: approximate the percentage of each class (or subpopulation) in the overall data; used in conjunction with skewed data
Simple Random Sampling
- SRSWOR (simple random sampling without replacement)
- SRSWR (simple random sampling with replacement): after a sample is drawn, it is placed back so that it may be drawn again
[Figure: raw data reduced by SRSWOR and by SRSWR]
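Both variants can be sketched with the standard library: `random.sample` draws without replacement (SRSWOR), while `random.choices` draws with replacement (SRSWR); the raw data here is a hypothetical list of 100 record ids:

```python
# SRSWOR vs. SRSWR using the standard library.
import random

random.seed(42)                  # fixed seed so the sketch is repeatable
raw = list(range(100))           # hypothetical record ids

srswor = random.sample(raw, 10)  # without replacement: 10 distinct records
srswr = random.choices(raw, k=10)  # with replacement: duplicates possible

print(len(set(srswor)))          # 10: no record drawn twice
```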
Example of Stratified Sampling
[Figure: raw data on the left; stratified samples on the right, drawn from each stratum in proportion]
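A stratified-sampling sketch that samples each class (stratum) in proportion to its share of the data; the age-group labels and counts are hypothetical:

```python
# Stratified sampling sketch: draw from each class in proportion to its size.
import random

def stratified_sample(records, frac, seed=0):
    rng = random.Random(seed)
    strata = {}
    for label, value in records:
        strata.setdefault(label, []).append((label, value))
    sample = []
    for group in strata.values():
        k = max(1, round(frac * len(group)))  # keep every stratum represented
        sample.extend(rng.sample(group, k))
    return sample

data = [("young", i) for i in range(80)] + [("senior", i) for i in range(20)]
s = stratified_sample(data, 0.10)
print(len(s))   # 10: 8 "young" + 2 "senior", matching the 80/20 split
```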
Chapter 2: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Discretization (Reduce the Number of Attribute Values)
Consider three types of attributes:
- Nominal: values from an unordered set
- Ordinal: values from an ordered set
- Continuous: real numbers
Discretization:
- Divide the range of a continuous attribute into intervals; each interval is denoted by a new categorical value
- Reduces the total number of attribute values
- Some classification algorithms only accept categorical attributes
- Prepares the data for further analysis
Discretization and Concept Hierarchy
Both reduce the number of different values for a given attribute
Discretization (for continuous attributes):
- reduce the number of different values of a continuous attribute by dividing its data range into intervals; interval labels then replace the actual data values
Concept hierarchies (for nominal attributes):
- reduce the number of different values of a nominal attribute by replacing low-level concepts (such as low-fat milk, 2% milk, coke, orange juice) with higher-level concepts (such as milk, drink)
Discretization and Concept Hierarchy Generation for Numeric Data
Binning (see sections before)
Histogram analysis (see sections before)
Clustering analysis (see sections before)
Entropy-based discretization
Segmentation by natural partitioning
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T on a continuous attribute A, the entropy after partitioning is

    E(S, T, A) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)

The boundary T that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
The process is recursively applied to the obtained partitions until some stopping criterion is met, e.g., the information gain Ent(S) - E(S, T, A) falls below a threshold δ.
Experiments show that it may reduce data size and improve classification accuracy.
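A minimal sketch of the boundary search for E(S, T, A), assuming samples arrive as (value, class label) pairs; the five-sample data set is hypothetical:

```python
# Entropy-based discretization sketch: pick the boundary T minimizing E(S, T, A).
import math

def ent(labels):
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def best_split(samples):
    samples = sorted(samples)
    values = sorted({v for v, _ in samples})
    best = None
    for t in values[1:]:                      # candidate boundaries T
        s1 = [c for v, c in samples if v < t]
        s2 = [c for v, c in samples if v >= t]
        e = (len(s1) * ent(s1) + len(s2) * ent(s2)) / len(samples)
        if best is None or e < best[0]:
            best = (e, t)
    return best[1]

data = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b")]
print(best_split(data))   # 10: this boundary separates the two classes perfectly
```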
Segmentation by natural partitioning
The 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
- If an interval covers 3, 6, 7, or 9 distinct values at the most significant decimal digit, partition the range into 3 equi-width intervals
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
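The partition-count choice can be sketched as a direct lookup of the three rules above:

```python
# 3-4-5 rule sketch: choose the number of partitions from the count of
# distinct values at the most significant digit.
def n_partitions(msd_distinct_values):
    if msd_distinct_values in (3, 6, 7, 9):
        return 3
    if msd_distinct_values in (2, 4, 8):
        return 4
    if msd_distinct_values in (1, 5, 10):
        return 5
    raise ValueError("the 3-4-5 rule does not cover this count")

# e.g., the range (-$1,000 - $2,000) covers 3 distinct values at the $1,000 digit
print(n_partitions(3))   # 3
```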
Example of the 3-4-5 rule (Generation of a Concept Hierarchy for Profits)

Step 1: profit ranges from Min = -$351 to Max = $4,700; Low (i.e., the 5%-tile) = -$159 and High (i.e., the 95%-tile) = $1,838
Step 2: msd = 1,000, so Low is rounded to -$1,000 and High to $2,000
Step 3: generate the concept hierarchy according to Low and High; (-$1,000 - $2,000) covers 3 distinct values at the msd, giving 3 intervals:
  (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
Step 4: adjust the left/right boundaries according to Min and Max; the full range becomes (-$400 - $5,000) with top-level intervals:
  (-$400 - 0), (0 - $1,000), ($1,000 - $2,000), ($2,000 - $5,000)
Each top-level interval is then subdivided:
  (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
  (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
  ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
  ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
Automatic Generation of Concept Hierarchy
Alternative Methods
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts, e.g., day -> month -> quarter -> year
- Specification of a portion of a hierarchy by explicit data grouping, e.g., day -> month, quarter -> year, …
- Specification of a set of attributes, but not of their partial ordering, in the hierarchy; the system can then try to automatically generate the concept hierarchy
Specification of a set of attributes
A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set; the attribute with the most distinct values is placed at the lowest level of the hierarchy.
- country: 15 distinct values
- province_or_state: 65 distinct values
- city: 3,567 distinct values
- street: 674,339 distinct values
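The distinct-value ordering can be sketched by sorting the attributes, fewest distinct values first (top of the hierarchy) to most (bottom), using the location counts above:

```python
# Order attributes into a concept hierarchy by their distinct-value counts:
# fewest distinct values at the top, most at the bottom.
distinct_counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3567,
    "street": 674339,
}
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(hierarchy)   # ['country', 'province_or_state', 'city', 'street']
```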
Summary
- Data preparation is a big issue for both warehousing and mining
- Data preparation includes:
  - Data cleaning, data integration, data transformation
  - Data reduction and feature selection
  - Discretization
- Many methods have been developed, but this is still an active area of research