Elementary Concepts of Data Mining

Elementary Concepts Data Mining Technology Anjan.K II Sem M.Tech CSE M.S.R.I.T

description

Mathematical analysis of Graph and Huffman coding

Transcript of Elementary Concepts of Data Mining

Page 1: Elementary Concepts of Data Mining

Elementary Concepts Data Mining Technology

Anjan.K

II Sem M.Tech CSE

M.S.R.I.T

Page 2: Elementary Concepts of Data Mining

Agenda

Need for Dimensionality Reduction

PCA revisited

Data Mining elementary concepts

Hands-on problem: Q3

Potter’s Wheel: data cleaning tool

Page 3: Elementary Concepts of Data Mining

Need for Dimensionality Reduction

Data is easy to collect and accumulates at an unprecedented speed.

Data is not collected only for data mining.

Data preprocessing is an important part of effective machine learning and data mining.

Dimensionality reduction is an effective approach to downsizing data.

Page 4: Elementary Concepts of Data Mining

Dimensionality Reduction?

Learning and data mining techniques may not be effective for high-dimensional data because of its dimensionality (the curse of dimensionality).

Query accuracy and efficiency degrade rapidly as the dimension increases.

Visualization: projection of high-dimensional data onto 2D or 3D.

Data compression: efficient storage and retrieval.

Noise removal: positive effect on query accuracy.

Page 5: Elementary Concepts of Data Mining

Principal Component Analysis

PCA is a statistical technique used in face recognition and image compression; it is an unsupervised, linear algorithm.

A common technique for finding patterns in high-dimensional data, e.g. mining for the principal components of an image.

Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set, that retains most of the sample's information.

Example: a high-resolution image transformed to a low-resolution image.
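The reduction described on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenter's implementation: the function name `pca`, the synthetic data, and the choice of `k = 1` are all assumptions made for the example.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k directions of largest variance."""
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = np.cov(Xc, rowvar=False)           # sample covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    components = eigvecs[:, order]           # d x k matrix of principal directions
    explained = eigvals[order] / eigvals.sum()
    return Xc @ components, components, explained

# Synthetic 3-D data (an assumption for illustration): the second column is
# almost a multiple of the first, and the third is small noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t,
                     2 * t + 0.1 * rng.normal(size=200),
                     0.1 * rng.normal(size=200)])

Z, W, ratio = pca(X, k=1)
print(Z.shape)   # reduced data: one new variable per sample
print(ratio)     # fraction of total variance retained by that variable
```

Here a single new variable replaces three original ones while keeping nearly all of the variance, which is the sense in which PCA "retains most of the sample's information".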

Page 6: Elementary Concepts of Data Mining

Geometric Picture of Principal Components (PCs)

Page 7: Elementary Concepts of Data Mining

Algebraic Derivation of PCs
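One standard textbook route to the PCs, which may differ from the one the presenter used, is to maximize the variance of a linear combination of the variables under a unit-length constraint. The symbols are assumptions introduced here: $x$ is a centered data vector, $S$ its sample covariance matrix, $a_1$ the weight vector of the first PC, and $\lambda$ a Lagrange multiplier.

```latex
\begin{aligned}
&\text{maximize } \operatorname{Var}(a_1^{\top} x) = a_1^{\top} S a_1
  \quad \text{subject to } a_1^{\top} a_1 = 1 \\
&L(a_1, \lambda) = a_1^{\top} S a_1 - \lambda \,(a_1^{\top} a_1 - 1) \\
&\frac{\partial L}{\partial a_1} = 2 S a_1 - 2 \lambda a_1 = 0
  \;\Longrightarrow\; S a_1 = \lambda a_1 \\
&\operatorname{Var}(a_1^{\top} x) = a_1^{\top} S a_1 = \lambda \, a_1^{\top} a_1 = \lambda
\end{aligned}
```

So $a_1$ is an eigenvector of $S$, and the variance it captures equals its eigenvalue; the first PC therefore corresponds to the largest eigenvalue. Later PCs follow the same way, with the added constraint of being orthogonal to the earlier ones.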

Page 8: Elementary Concepts of Data Mining

Knowledge Discovery (KDD) Process

Data mining: the core of the knowledge discovery process

[Figure: the KDD process as a pipeline. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
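The stages named on this slide can be read as a chain of functions, each consuming the output of the previous one. The sketch below is a toy illustration only: the records, field names, and the "average GPA per course" pattern are invented for the example and do not come from the slides.

```python
from statistics import mean

# Two hypothetical source databases (illustrative data only).
college_db = [
    {"name": "Asha", "status": "graduate", "gpa": "3.8"},
    {"name": "Ravi", "status": "graduate", "gpa": None},       # missing value
    {"name": "Meena", "status": "graduate", "gpa": "3.2"},
    {"name": "Kiran", "status": "undergraduate", "gpa": "2.9"},
]
exam_db = [
    {"name": "Asha", "courses": ["DM", "ML"]},
    {"name": "Ravi", "courses": ["DB"]},
    {"name": "Meena", "courses": ["DM"]},
    {"name": "Kiran", "courses": ["DB"]},
]

def clean(records):
    """Data cleaning: drop records with a missing GPA, convert GPA to float."""
    return [{**r, "gpa": float(r["gpa"])} for r in records if r["gpa"] is not None]

def integrate(students, exams):
    """Data integration: join both sources on the student name (the 'warehouse')."""
    courses = {e["name"]: e["courses"] for e in exams}
    return [{**s, "courses": courses.get(s["name"], [])} for s in students]

def select(warehouse, status):
    """Selection: keep only the task-relevant records."""
    return [r for r in warehouse if r["status"] == status]

def mine(records):
    """Data mining (toy pattern): average GPA per course."""
    by_course = {}
    for r in records:
        for course in r["courses"]:
            by_course.setdefault(course, []).append(r["gpa"])
    return {course: mean(gpas) for course, gpas in by_course.items()}

def evaluate(patterns, min_gpa):
    """Pattern evaluation: keep patterns that pass an interestingness threshold."""
    return {c: g for c, g in patterns.items() if g >= min_gpa}

warehouse = integrate(clean(college_db), exam_db)
patterns = mine(select(warehouse, "graduate"))
interesting = evaluate(patterns, min_gpa=3.6)
print(patterns)
print(interesting)
```

Each function is deliberately trivial; the point is the order of the stages and the fact that mining is only one step between preparation and evaluation.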

Page 9: Elementary Concepts of Data Mining

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithms, Visualization, and Other Disciplines]

Page 10: Elementary Concepts of Data Mining

Question 3 (1.3 of Chap 1 of Han & Kamber)

Suppose your task as a software engineer at Big University is to design a data mining system to examine the university course database, which contains the following information: the name, address, status, courses taken, and cumulative grade point average (GPA) of each student. Describe the architecture you would choose. What is the purpose of each component of this architecture?

Page 11: Elementary Concepts of Data Mining

Proposed Data Mining Technology

Page 12: Elementary Concepts of Data Mining

Data Mining System

[Figure: proposed architecture. Components shown: College DB, University DB, Exam DB, University Warehouse, Response Attribution, Back Office Systems, Data Mining System, OLAP Tools, Pattern Evaluation, Graphical Interface]

Page 13: Elementary Concepts of Data Mining

Potter’s Wheel

Problems of conventional approaches:

Time-consuming (many iterations), long waiting periods

Users have to write complex transformation scripts

Separate tools for auditing and transformation

Potter’s Wheel approach:

Interactive system, instant feedback

Integration of both data auditing and transformation

Intuitive user interface: spreadsheet-like application

Page 14: Elementary Concepts of Data Mining

Potter’s Wheel

Page 15: Elementary Concepts of Data Mining

Potter’s Wheel: Features

Instead of writing complex transform specifications with regular expressions or custom programs, the user specifies transforms by example (e.g. splitting).

Data auditing is extensible with user-defined domains.

Parse "Tayler, Jane, JFK to ORD on April 23, 2000 Coach" as "[A-Za-z,]* <Airport> to <Airport> on <Date> <Class>" instead of "[A-Za-z,]* [A-Z]³ to [A-Z]³ on [A-Za-z]* [0-9]*, [0-9]* [A-Za-z]*".

This allows easier detection of, e.g., logical errors such as false airport codes.

Problem: trade-off between overfitting and underfitting the structure.

Potter’s Wheel uses the minimum description length (MDL) method to balance this trade-off and choose an appropriate structure.

Data auditing runs in the background on the fly (data streaming is also possible).

The reorderer allows sorting on the fly.

The user only works on a view; the real data isn’t changed until the user exports the set of transforms, e.g. as a C program, and runs it on the real data.

Undo without problems: just delete the unwanted transform from the sequence and redo everything else.
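The benefit of user-defined domains over a purely syntactic pattern can be illustrated with a short sketch. This is not Potter's Wheel code or its API: the domain table, the airport and class lists, the regular expression, and the `audit` function are all hypothetical, written only to show why `<Airport>` catches errors that `[A-Z]³` cannot.

```python
import re
from datetime import datetime

# Hypothetical domain definitions (illustrative subsets, not real reference data).
KNOWN_AIRPORTS = {"JFK", "ORD", "SFO", "LAX"}
KNOWN_CLASSES = {"Coach", "Business", "First"}

def is_date(text):
    """Accept dates written like 'April 23, 2000'."""
    try:
        datetime.strptime(text, "%B %d, %Y")
        return True
    except ValueError:
        return False

# Each domain is just a membership test, so users can add their own.
DOMAINS = {
    "Airport": lambda s: s in KNOWN_AIRPORTS,
    "Date": is_date,
    "Class": lambda s: s in KNOWN_CLASSES,
}

# Syntactic structure only: three capital letters pass as an airport here.
STRUCTURE = re.compile(
    r"^(?P<name>[A-Za-z, ]*), (?P<origin>[A-Z]{3}) to (?P<dest>[A-Z]{3}) "
    r"on (?P<date>[A-Za-z]+ [0-9]+, [0-9]+) (?P<cls>[A-Za-z]+)$"
)

def audit(line):
    """Return a list of discrepancies; an empty list means the line fits."""
    m = STRUCTURE.match(line)
    if m is None:
        return ["structure mismatch"]
    checks = [("Airport", m["origin"]), ("Airport", m["dest"]),
              ("Date", m["date"]), ("Class", m["cls"])]
    return [f"{value!r} is not a valid {domain}"
            for domain, value in checks if not DOMAINS[domain](value)]

print(audit("Tayler, Jane, JFK to ORD on April 23, 2000 Coach"))
print(audit("Tayler, Jane, JFK to XYZ on April 23, 2000 Coach"))
```

The second line matches the regular expression, because `XYZ` is three capital letters, but fails the `Airport` domain check. That is the kind of logical error the slide says domains make easier to detect.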

Page 16: Elementary Concepts of Data Mining

Potter’s Wheel: Conclusion

Problems:

Usability of the user interface

How does duplicate elimination work?

Kind of a black-box system

General open problems of data cleaning:

(Automatic) correction of wrong values

Mask wrong values but keep them

Keep several possible values at the same time (2*age, 2*birthday)

This leads to problems if other values depend on a certain alternative and it turns out to be wrong

Maintenance of cleaned data, especially if sources can’t be cleaned

A data cleaning framework is desirable