Overview of DATA PREPROCESS..
-
Upload
killerkarthic -
Category
Education
-
view
184 -
download
0
Transcript of Overview of DATA PREPROCESS..
![Page 1: Overview of DATA PREPROCESS..](https://reader035.fdocuments.in/reader035/viewer/2022080212/559771af1a28ab520e8b47ba/html5/thumbnails/1.jpg)
![Page 2: Overview of DATA PREPROCESS..](https://reader035.fdocuments.in/reader035/viewer/2022080212/559771af1a28ab520e8b47ba/html5/thumbnails/2.jpg)
![Page 3: Overview of DATA PREPROCESS..](https://reader035.fdocuments.in/reader035/viewer/2022080212/559771af1a28ab520e8b47ba/html5/thumbnails/3.jpg)
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
![Page 4: Overview of DATA PREPROCESS..](https://reader035.fdocuments.in/reader035/viewer/2022080212/559771af1a28ab520e8b47ba/html5/thumbnails/4.jpg)
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or
names
No quality data, no quality mining results
Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality
data
![Page 5: Overview of DATA PREPROCESS..](https://reader035.fdocuments.in/reader035/viewer/2022080212/559771af1a28ab520e8b47ba/html5/thumbnails/5.jpg)
O Data cleaning tasks
O Fill in missing values
O Identify outliers and smooth out noisy data
O Correct inconsistent data
duplicate records
incomplete data
inconsistent data
![Page 6: Overview of DATA PREPROCESS..](https://reader035.fdocuments.in/reader035/viewer/2022080212/559771af1a28ab520e8b47ba/html5/thumbnails/6.jpg)
O Data integration: O combines data from multiple sources into a coherent store
O Schema integrationO integrate metadata from different sources
O Entity identification problem: identify real world entities from multiple data sources,
O Detecting and resolving data value conflictsO for the same real world entity, attribute values from different
sources are different
O possible reasons: different representations, different scales,
e.g., metric vs. British units
![Page 7: Overview of DATA PREPROCESS..](https://reader035.fdocuments.in/reader035/viewer/2022080212/559771af1a28ab520e8b47ba/html5/thumbnails/7.jpg)
O Smoothing: remove noise from data
O Aggregation: summarization, data cube construction
O Generalization: concept hierarchy climbing
O Normalization: scaled to fall within a small, specified
range
O min-max normalization
O z-score normalization
O normalization by decimal scaling
![Page 8: Overview of DATA PREPROCESS..](https://reader035.fdocuments.in/reader035/viewer/2022080212/559771af1a28ab520e8b47ba/html5/thumbnails/8.jpg)
O Data reduction
O Obtains a reduced representation of the data set
that is much smaller in volume but yet produces
the same (or almost the same) analytical results
O Data reduction strategies
O Data cube aggregation
O Dimensionality reduction
O Numerosity reduction
O Discretization and concept hierarchy generation
![Page 9: Overview of DATA PREPROCESS..](https://reader035.fdocuments.in/reader035/viewer/2022080212/559771af1a28ab520e8b47ba/html5/thumbnails/9.jpg)
O Discretization: divide the range of a continuous attribute into intervals
O Some classification algorithms only accept categorical attributes.
O Reduce data size by discretization
O Prepare for further analysis
O reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
![Page 10: Overview of DATA PREPROCESS..](https://reader035.fdocuments.in/reader035/viewer/2022080212/559771af1a28ab520e8b47ba/html5/thumbnails/10.jpg)
OBinning
OHistogram analysis
OClustering analysis
![Page 11: Overview of DATA PREPROCESS..](https://reader035.fdocuments.in/reader035/viewer/2022080212/559771af1a28ab520e8b47ba/html5/thumbnails/11.jpg)