1.6.data preprocessing

1

Data Preprocessing

2

Introduction

Real World databases Huge size Noisy, Missing and Inconsistent

Data Preprocessing Data Cleaning Data Integration Data Transformation Data Reduction

3

Need for Data Preprocessing

Incomplete Data Lacking certain attributes of interest May not have been considered important during data entry Equipment Malfunctioning Deleted due to clashes with other data Only aggregate data maybe present

4


Noisy data Incorrect attribute values Errors or outliers e.g., Salary=“-10” Data collection instruments – faulty Data entry errors Transmission errors – limited buffer size

Inconsistent Containing discrepancies in codes or names e.g., Age=“42”

Birthdate=“03/07/1997” Different data sources Functional dependency violation (e.g., modify some linked

data) Duplicate records also need data cleaning

5


Quality data Quality decisions must be based on quality data

e.g., duplicate or missing data may cause incorrect or even misleading statistics.

Data warehouse needs consistent integration of quality data

Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse

6

Major Tasks in Data Preprocessing

Data cleaning Fill in missing values Smooth noisy data Identify or remove outliers Resolve inconsistencies

Data integration Integration of multiple databases, data cubes, or files Attributes may have different names Eliminating redundancies

Some attributes can be inferred from others

7

Major Tasks in Data Preprocessing

Data transformation Normalization and aggregation

Data reduction Obtains reduced representation in volume but produces the same or

similar analytical results Data Aggregation Dimension Reduction Data Compression Numerosity Reduction Generalization

Data discretization Part of data reduction but with particular importance, especially for

numerical data

8

Forms of Data Preprocessing

9

Data Cleaning

Importance “Data cleaning is the number one problem in data

warehousing”—DCI survey

Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data Correct inconsistent data Resolve redundancy caused by data integration

10

Missing Data

Data is not always available E.g., many tuples have no recorded value for several

attributes, such as customer income in sales data

Missing data may need to be inferred.

11

Handling Missing Data

Ignore the tuple

Usually done when class label is missing (assuming the task is

classification)—not effective when the percentage of missing

values per attribute varies considerably.

Fill in the missing value manually

Tedious + infeasible

12

Handling Missing Data

Fill in it automatically with

a global constant : e.g., “unknown”, a new class

the attribute mean

the attribute mean for all samples belonging to the

same class: smarter

the most probable value: inference-based such as

Bayesian formula or decision tree

13

Noisy Data Noise: random error or variance in a measured

variable Binning

first sort data and partition into bins then one can smooth by bin means, smooth by bin median, smooth by bin

boundaries, etc.

Regression Smooth by fitting the data into regression functions

Clustering Detect and remove outliers

Combined computer and human inspection Detect suspicious values and check by human (e.g., deal with possible outliers)

14

Binning

Equal-width (distance) partitioning Divides the range into N intervals of equal size: uniform grid if A and B are the lowest and highest values of the attribute, the

width of intervals will be: W = (B –A)/N. The most straightforward, but outliers may dominate Skewed data is not handled well

Equal-depth (frequency) partitioning Divides the range into N intervals, each containing approximately

same number of samples Good data scaling Managing categorical attributes can be tricky

15

Binning Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,

28, 29, 34 Partition into equal-frequency (equi-depth) bins:

Bin 1: 4, 8, 9, 15 Bin 2: 21, 21, 24, 25 Bin 3: 26, 28, 29, 34

Smoothing by bin means: Bin 1: 9, 9, 9, 9 Bin 2: 23, 23, 23, 23 Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries: Bin 1: 4, 4, 4, 15 Bin 2: 21, 21, 25, 25 Bin 3: 26, 26, 26, 34

16

Regression

x

y

y = x + 1

X1

Y1

Y1’

17

Cluster Analysis

18

Human Inspection

Combined Computer and Human Inspection Outliers in Handwritten character recognition

Surprise value Garbage / Outliers Human identification

19

Inconsistent Data

Data discrepancy detection Use metadata (e.g., domain, range, dependency)

Check Functional dependency relationships

Use commercial tools Data scrubbing: use simple domain knowledge to detect errors and

make corrections

Data auditing: by analyzing data to discover rules and relationship to

detect violators (e.g., correlation and clustering to find outliers) Discrepancy detection and transformation

20

Data Integration Data integration:

Combines data from multiple sources into a coherent store

Schema integration: e.g., A.cust-id ≡ B.cust-# Integrate metadata from different sources

Entity identification problem: Identify real world entities from multiple data sources

Other Issues Detecting redundancy (tuple level and attribute level) Detecting and resolving data value conflicts

For the same real world entity, attribute values from different sources are different Possible reasons: different representations, different scales, e.g., metric vs.

British units

21

Handling Redundancy

Redundant data occur often during integration of multiple databases Object identification: The same attribute or object may have

different names in different databases Derivable data: One attribute may be a “derived” attribute in

another table, e.g., annual revenue

Redundant attributes may be detected by correlation analysis Careful integration - helps to reduce/avoid redundancies and

inconsistencies and improve mining speed and quality

22

Correlation Analysis

Correlation coefficient (also called Pearson’s product moment coefficient)

where n is the number of tuples, and are the respective means of A and B, σA and σB are the respective standard deviation of A and

B

If rA,B > 0, A and B are positively correlated (A’s values increase

as B’s). The higher the value, the stronger the correlation.

rA,B = 0: independent; rA,B < 0: negatively correlated

BAn

BBAAr BA σσ)1(

))((, −

−−= ∑

A B

23

Correlation Analysis Chi-square test

χ2 = ∑ i=1

c ∑ j=1

r (oij – eij)2 / eij

c – Number of possibilities for attribute 1; r – for attribute 2

eij = Expected Frequency : count(A=ai) x count(B=bj) / N

o – Observed Frequency

e – Expected Frequency

Compare χ2 value with table and decide

(r-1) x (c-1) : degrees of freedom

Chi-square test

Male Female Total

Fiction 250 (90) 200 (360) 450

Non-fiction 50 (210) 1000 (840) 1050

Total 300 1200 1500

24

χ2 = (250-90)2/90 + (200-360)2/360 + (50-210)2/210 + (1000-840)2/840 = 507.93

Hypothesis: Gender and preferred reading are independent507.83 > 10.83

So hypothesis is rejectedConclusion: Gender and preferred_reading are strongly correlated

Chi-square test

Degrees of freedom (df)

χ2 value

1 0.004 0.02 0.06 0.15 0.46 1.07 1.64 2.71 3.84 6.64 10.83

2 0.10 0.21 0.45 0.71 1.39 2.41 3.22 4.60 5.99 9.21 13.82

3 0.35 0.58 1.01 1.42 2.37 3.66 4.64 6.25 7.82 11.34 16.27

4 0.71 1.06 1.65 2.20 3.36 4.88 5.99 7.78 9.49 13.28 18.47

5 1.14 1.61 2.34 3.00 4.35 6.06 7.29 9.24 11.07 15.09 20.52

6 1.63 2.20 3.07 3.83 5.35 7.23 8.56 10.64 12.59 16.81 22.46

7 2.17 2.83 3.82 4.67 6.35 8.38 9.80 12.02 14.07 18.48 24.32

8 2.73 3.49 4.59 5.53 7.34 9.52 11.03 13.36 15.51 20.09 26.12

9 3.32 4.17 5.38 6.39 8.34 10.66 12.24 14.68 16.92 21.67 27.88

10 3.94 4.86 6.18 7.27 9.34 11.78 13.44 15.99 18.31 23.21 29.59

P value (Probability) 0.95 0.90 0.80 0.70 0.50 0.30 0.20 0.10 0.05 0.01 0.001

Non-significant Significant

25

26

Data Transformation

1. Smoothing: remove noise from data – Binning, Clustering, Regression

2. Normalization: scaled to fall within a small, specified range min-max normalization z-score normalization normalization by decimal scaling

1. Attribute/feature construction New attributes constructed from the given ones

1. Aggregation: summarization, data cube construction

2. Generalization: concept hierarchy climbing

27

1. Smoothing

Binning

Equal-width (distance) partitioning

Equal-depth (frequency) partitioning

Clustering

28

1. SmoothingRegression and Log-Linear Models

Linear regression: Data are modeled to fit a straight line

Often uses the least-square method to fit the line

Multiple regression: allows a response variable Y to be modeled

as a linear function of multidimensional feature vector

Log-linear model: approximates multidimensional probability

distributions

29

Linear regression: Y = w X + b Two regression coefficients, w and b, specify the line and are to be

estimated by using the data at hand Using the least squares criterion to the known values of Y1, Y2, …, X1,

X2, ….

Multiple regression: Y = b0 + b1 X1 + b2 X2. Many nonlinear functions can be transformed into the above

Log-linear models: Approximation of joint probabilities Estimate the probability of each point in a multi-dimensional space from a

smaller subset of dimensions

1. SmoothingRegression and Log-Linear Models

30

2. Normalization Min-max normalization: to [new_minA, new_maxA]

Ex. Let income range $12,000 to $98,000 normalized to [0.0,

1.0]. Then $73,600 is mapped to

Z-score normalization (μ: mean, σ: standard deviation)

Ex. Let μ = 54,000, σ = 16,000. Then for 73600

Normalization by decimal scaling

716.00)00.1(000,12000,98

000,12600,73 =+−−−

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(' +−

−−=

A

Avv

σµ−='

j

vv

10'= Where j is the smallest integer such that Max(|ν’|) < 1

225.1000,16

000,54600,73 =−

31

3. Attribute Construction

New attributes are constructed from given attributes To improve accuracy and understanding of structure Area from height and width Product operator and ‘and’

32

4. Aggregation

Data Cubes Store Multi-dimensional aggregated information Each cell holds an aggregate data value Concept hierarchies – allow analysis at multiple

levels Provide fast access to pre-computed summarized

data

33

A Sample 3D Data CubeTotal annual salesof TV in U.S.A.Date

Produ

ct

Cou

ntr

ysum

sum TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

34

4. Aggregation Data Cube Aggregation

The lowest level of abstraction

Base Cuboid

Highest level of abstraction

Apex Cuboid

Multiple levels

Lattice of Cuboids

35

Cube: A Lattice of Cuboids

time,item

time,item,location

time, item, location, supplier

all

time item location supplier

time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,supplier

time,location,supplier

item,location,supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

36

5. Generalization - Concept hierarchy

Concept hierarchy formation Recursively reduce the data by collecting and replacing low

level concepts (such as numeric values for age) by higher

level concepts (such as young, middle-aged, or senior) Detail lost but more meaningful Mining becomes easier Several concept hierarchies can be defined for the

same attribute

1.6.data preprocessing

Education

Transcript of 1.6.data preprocessing