Post on 04-Apr-2018
7/30/2019 DM Lecture 2 18-09-2012
1/36
November 14, 2012 Data Mining: Concepts and Techniques 1
Data Mining:Concepts and Techniques
Slides for Textbook
Chapter 2 &3
7/30/2019 DM Lecture 2 18-09-2012
2/36
November 14, 2012 Data Mining: Concepts and Techniques 2
Chapter 3: Data Preprocessing
Why preprocess the data?
Data cleaningData integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
7/30/2019 DM Lecture 2 18-09-2012
3/36
November 14, 2012 Data Mining: Concepts and Techniques 3
Why Data Preprocessing?
Data in the real world is dirtyincomplete : lacking attribute values, lacking certainattributes of interest, or containing only aggregate
datanoisy : containing errors or outliersinconsistent : containing discrepancies in codes ornames
No quality data, no quality mining results!Quality decisions must be based on quality dataData warehouse needs consistent integration of quality data
7/30/2019 DM Lecture 2 18-09-2012
4/36
November 14, 2012 Data Mining: Concepts and Techniques 4
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view: AccuracyCompletenessConsistencyTimelinessBelievability
Value addedInterpretability
Accessibility
7/30/2019 DM Lecture 2 18-09-2012
5/36
November 14, 2012 Data Mining: Concepts and Techniques 5
Major Tasks in Data Preprocessing
Data cleaningFill in missing values, smooth noisy data, identify or removeoutliers, and resolve inconsistencies
Data integrationIntegration of multiple databases, data cubes, or files
Data transformationNormalization and aggregation
Data reductionObtains reduced representation in volume but produces thesame or similar analytical results
Data discretizationPart of data reduction but with particular importance, especially
for numerical data
7/30/2019 DM Lecture 2 18-09-2012
6/36
November 14, 2012 Data Mining: Concepts and Techniques 6
Forms of data preprocessing
7/30/2019 DM Lecture 2 18-09-2012
7/36November 14, 2012 Data Mining: Concepts and Techniques 7
Chapter 3: Data Preprocessing
Why preprocess the data?
Data cleaning Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
7/30/2019 DM Lecture 2 18-09-2012
8/36November 14, 2012 Data Mining: Concepts and Techniques 8
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
7/30/2019 DM Lecture 2 18-09-2012
9/36November 14, 2012 Data Mining: Concepts and Techniques 9
Missing Data
Data is not always available
E.g., many tuples have no recorded value for severalattributes, such as customer income in sales data
Missing data may be due toequipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
not register history or changes of the data
Missing data may need to be inferred.
7/30/2019 DM Lecture 2 18-09-2012
10/36November 14, 2012 Data Mining: Concepts and Techniques 10
How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing (assuming
the tasks in classification not effective when the percentage of
missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?Use a global constant to fill in the missing value: e.g., unknown, a
new class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class
to fill in the missing value: smarter
Use the most probable value to fill in the missing value: inference-
based such as Bayesian formula or decision tree
7/30/2019 DM Lecture 2 18-09-2012
11/36November 14, 2012 Data Mining: Concepts and Techniques 11
Noisy Data
Noise: random error or variance in a measured variableIncorrect attribute values may due to
faulty data collection instruments
data entry problemsdata transmission problemstechnology limitationinconsistency in naming convention
Other data problems which requires data cleaningduplicate recordsincomplete data
inconsistent data
7/30/2019 DM Lecture 2 18-09-2012
12/36November 14, 2012 Data Mining: Concepts and Techniques 12
How to Handle Noisy Data?
Binning method:first sort data and partition into (equi-depth) binsthen one can smooth by bin means, smooth by bin
median, smooth by bin boundaries , etc.Clustering
detect and remove outliersCombined computer and human inspection
detect suspicious values and check by humanRegression
smooth by fitting the data into regression functions
7/30/2019 DM Lecture 2 18-09-2012
13/36November 14, 2012 Data Mining: Concepts and Techniques 13
Simple Discretization Methods: Binning
Equal-width (distance) partitioning:It divides the range into N intervals of equal size:uniform grid if A and B are the lowest and highest values of theattribute, the width of intervals will be: W = ( B - A )/ N. The most straightforwardBut outliers may dominate presentationSkewed data is not handled well.
Equal-depth (frequency) partitioning:It divides the range into N intervals, each containingapproximately same number of samplesGood data scaling
Managing categorical attributes can be tricky.
7/30/2019 DM Lecture 2 18-09-2012
14/36November 14, 2012 Data Mining: Concepts and Techniques 14
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,29, 34
* Partition into (equi-depth) bins:- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:- Bin 1: 9, 9, 9, 9- Bin 2: 23, 23, 23, 23- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:- Bin 1: 4, 4, 4, 15- Bin 2: 21, 21, 25, 25- Bin 3: 26, 26, 26, 34
7/30/2019 DM Lecture 2 18-09-2012
15/36November 14, 2012 Data Mining: Concepts and Techniques 15
Cluster Analysis
7/30/2019 DM Lecture 2 18-09-2012
16/36November 14, 2012 Data Mining: Concepts and Techniques 16
Chapter 3: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
7/30/2019 DM Lecture 2 18-09-2012
17/36November 14, 2012 Data Mining: Concepts and Techniques 17
Data Integration
Data integration:combines data from multiple sources into a coherentstore
Schema integrationintegrate metadata from different sourcesEntity identification problem: identify real world entitiesfrom multiple data sources, e.g., A.cust-id B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values fromdifferent sources are differentpossible reasons: different representations, differentscales, e.g., metric vs. British units
7/30/2019 DM Lecture 2 18-09-2012
18/36November 14, 2012 Data Mining: Concepts and Techniques 18
Handling Redundant Datain Data Integration
Redundant data occur often when integration of multipledatabases
The same attribute may have different names in
different databasesOne attribute may be a derived attribute in anothertable, e.g., annual revenue
Redundant data may be able to be detected by
correlational analysisCareful integration of the data from multiple sources mayhelp reduce/avoid redundancies and inconsistencies andimprove mining speed and quality
7/30/2019 DM Lecture 2 18-09-2012
19/36November 14, 2012 Data Mining: Concepts and Techniques 19
Data Transformation
Smoothing: remove noise from data Aggregation: summarization, data cube constructionGeneralization: concept hierarchy climbingNormalization: scaled to fall within a small, specifiedrange
min-max normalization
z-score normalizationnormalization by decimal scaling
Attribute/feature constructionNew attributes constructed from the given ones
7/30/2019 DM Lecture 2 18-09-2012
20/36November 14, 2012 Data Mining: Concepts and Techniques 20
Data Transformation:Normalization
min-max normalization
z-score normalization
normalization by decimal scaling
A A A
A A
A
minnewminnewmaxnewminmax
minvv _)__('
A
A
devstand
meanvv
_'
j
vv
10' Where j is the smallest integer such that Max(| |)
7/30/2019 DM Lecture 2 18-09-2012
21/36November 14, 2012 Data Mining: Concepts and Techniques 21
Chapter 3: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
7/30/2019 DM Lecture 2 18-09-2012
22/36November 14, 2012 Data Mining: Concepts and Techniques 22
Data Reduction Strategies
Warehouse may store terabytes of data: Complex dataanalysis/mining may take a very long time to run on thecomplete data setData reduction
Obtains a reduced representation of the data set that ismuch smaller in volume but yet produces the same (oralmost the same) analytical results
Data reduction strategiesData cube aggregationDimensionality reductionNumerosity reduction
Discretization and concept hierarchy generation
7/30/2019 DM Lecture 2 18-09-2012
23/36November 14, 2012 Data Mining: Concepts and Techniques 23
Example of Decision Tree Induction
Initial attribute set:{A1, A2, A3, A4, A5, A6}
A4 ?
A1? A6?
Class 1 Class 2 Class 1 Class 2
> Reduced attribute set: {A1, A4, A6}
7/30/2019 DM Lecture 2 18-09-2012
24/36
November 14, 2012 Data Mining: Concepts and Techniques 24
Data Compression
String compressionThere are extensive theories and well-tuned algorithmsTypically losslessBut only limited manipulation is possible withoutexpansion
Audio/video compressionTypically lossy compression, with progressive
refinementSometimes small fragments of signal can bereconstructed without reconstructing the whole
Time sequence is not audio
Typically short and vary slowly with time
7/30/2019 DM Lecture 2 18-09-2012
25/36
November 14, 2012 Data Mining: Concepts and Techniques 25
Data Compression
Original DataCompressed
Data
lossless
Original DataApproximated
7/30/2019 DM Lecture 2 18-09-2012
26/36
November 14, 2012 Data Mining: Concepts and Techniques 26
Histograms
A popular data reductiontechniqueDivide data into buckets
and store average (sum)for each bucketCan be constructedoptimally in one
dimension using dynamicprogrammingRelated to quantizationproblems. 0
510
15
20
25
30
35
40
1
0 0 0 0
2
0 0 0 0
3
0 0 0 0
4
0 0 0 0
5
0 0 0 0
6
0 0 0 0
7
0 0 0 0
8
0 0 0 0
9
0 0 0 0
1 0
0 0 0 0
7/30/2019 DM Lecture 2 18-09-2012
27/36
November 14, 2012 Data Mining: Concepts and Techniques 27
Clustering
Partition data set into clusters, and one can store cluster
representation only
Can be very effective if data is clustered but not if datais smeared
Can have hierarchical clustering and be stored in multi-
dimensional index tree structuresThere are many choices of clustering definitions and
clustering algorithms, further detailed in Chapter 8
7/30/2019 DM Lecture 2 18-09-2012
28/36
November 14, 2012 Data Mining: Concepts and Techniques 28
Sampling
Allow a mining algorithm to run in complexity that ispotentially sub-linear to the size of the dataChoose a representative subset of the data
Simple random sampling may have very poorperformance in the presence of skew
Develop adaptive sampling methodsStratified sampling:
Approximate the percentage of each class (orsubpopulation of interest) in the overall databaseUsed in conjunction with skewed data
Sampling may not reduce database I/Os (page at a time).
7/30/2019 DM Lecture 2 18-09-2012
29/36
November 14, 2012 Data Mining: Concepts and Techniques 29
Sampling
Raw Data
7/30/2019 DM Lecture 2 18-09-2012
30/36
November 14, 2012 Data Mining: Concepts and Techniques 30
Sampling
Raw Data Cluster/Stratified Sample
7/30/2019 DM Lecture 2 18-09-2012
31/36
November 14, 2012 Data Mining: Concepts and Techniques 31
Hierarchical Reduction
Use multi-resolution structure with different degrees of reductionHierarchical clustering is often performed but tends todefine partitions of data sets rather than clusters Parametric methods are usually not amenable tohierarchical representationHierarchical aggregation
An index tree hierarchically divides a data set intopartitions by value range of some attributesEach partition can be considered as a bucketThus an index tree with aggregates stored at eachnode is a hierarchical histogram
7/30/2019 DM Lecture 2 18-09-2012
32/36
November 14, 2012 Data Mining: Concepts and Techniques 32
Chapter 3: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
7/30/2019 DM Lecture 2 18-09-2012
33/36
November 14, 2012 Data Mining: Concepts and Techniques 33
Discretization
Three types of attributes:Nominal values from an unordered setOrdinal values from an ordered setContinuous real numbers
Discretization:divide the range of a continuous attribute intointervalsSome classification algorithms only accept categoricalattributes.Reduce data size by discretizationPrepare for further analysis
7/30/2019 DM Lecture 2 18-09-2012
34/36
November 14, 2012 Data Mining: Concepts and Techniques 34
Discretization and Concept hierachy
Discretization reduce the number of values for a given continuousattribute by dividing the range of the attribute intointervals. Interval labels can then be used to replaceactual data values.
Concept hierarchies
reduce the data by collecting and replacing low levelconcepts (such as numeric values for the attributeage) by higher level concepts (such as young,middle-aged, or senior).
7/30/2019 DM Lecture 2 18-09-2012
35/36
November 14, 2012 Data Mining: Concepts and Techniques 35
Specification of a set of attributes
Concept hierarchy can be automatically generated basedon the number of distinct values per attribute in thegiven attribute set. The attribute with the mostdistinct values is placed at the lowest level of thehierarchy.
country
province_or_ state
city
street
15 distinct values
65 distinct values
3567 distinct values
674,339 distinct values
7/30/2019 DM Lecture 2 18-09-2012
36/36
Chapter 3: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary