MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS...

53
MIT-652: DM 1: Introduction to Data Mining 1 Introduction to Data Mining MIT-652 Data Mining Applications Thimaporn Phetkaew School of Informatics, Walailak University

Transcript of MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS...

Page 1: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 1

Introduction to Data Mining

MIT-652 Data Mining Applications

Thimaporn Phetkaew

School of Informatics,Walailak University

Page 2: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 2

Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Classification of data mining systems

Data mining task primitives

Major issues in data mining

Summary

Page 3: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 3

Motivation: “Necessity is the Mother of Invention”

Data explosion problem

Automated data collection tools and mature database technology

lead to tremendous amounts of data stored in databases, data

warehouses and other information repositories

We are drowning in data, but starving for knowledge!

Solution: Data warehousing and data mining

Data warehousing and on-line analytical processing (OLAP)

Data Mining: extraction of interesting knowledge (rules,

regularities, patterns, constraints) from data in large databases

Page 4: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 4

Evolution of Database Technology

1960s: Data collection, database creation, IMS and network DBMS

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s: Data mining and data warehousing, multimedia databases, and Web databases

2000s: Data mining and its applications, Web technology (XML, data integration) and global information systems

Page 5: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 5

Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Classification of data mining systems

Data mining task primitives

Major issues in data mining

Summary

Page 6: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 6

What Is Data Mining?

Data mining (knowledge discovery in databases): Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databasesData mining: a misnomer?

Alternative names: Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Watch out: is everything “data mining”?Simple search and query processing Expert systems

Page 7: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 7

Potential Data Mining Applications

Database analysis and decision supportMarket analysis and management

Target marketing, customer relation management (CRM), market basket analysis, cross selling, market segmentation

Risk analysis and management

Forecasting, customer retention, improved underwriting, quality control, competitive analysis

Fraud detection and detection of unusual patterns (outliers)

Other ApplicationsText mining (news group, email, documents) and Web mining

Bioinformatics and bio-data analysis

Page 8: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 8

Data Mining: A KDD Process

Data mining: the core of knowledge discovery process

Data Cleaning

Data Integration

Databases

Data Warehouse

Task-relevant Data

Data Selection

Data Mining

Pattern Evaluation

Data Transformation

Page 9: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 9

Steps of a KDD Process

Data cleaning: to remove noise and incomplete data

Data integration: where multiple data sources may be combined

Data selection: where data relevant to the analysis task are retrieved from the database/repository

Data transformation: where data are transformed into forms appropriate for mining

Page 10: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 10

Steps of a KDD Process

Data mining: where intelligent methods are applied in order to extract data patterns

Pattern evaluation: to identify the truly interesting patterns representing knowledge

Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user

Page 11: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 11

Data Mining and Business Intelligence

Increasing potentialto supportbusiness decisions

End User

BusinessAnalyst

DataAnalyst

DBA

DecisionsMaking

Data PresentationVisualization Techniques

Data MiningInformation Discovery

Data ExplorationStatistical Summary, Querying and Reporting

Data Preprocessing/Integration, Data Warehouses

Data SourcesPaper, Files, Web documents, Scientific Experiments, Database Systems

Page 12: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 12

Architecture of a Typical Data Mining System

data cleaning, integration, and selection

Database or Data Warehouse Server

Data Mining Engine

Pattern Evaluation

Graphical User Interface

Knowledge-Base

Database Data Warehouse

World-WideWeb

Other InfoRepositories

Page 13: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 13

Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Classification of data mining systems

Data mining task primitives

Major issues in data mining

Summary

Page 14: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 14

Data Mining: On What Kind of Data?

Relational databasesTransactional databasesData warehousesAdvanced DB and information repositories

Object-oriented and object-relational databasesSpatial databasesTime-series data and temporal dataText databases Multimedia databasesHeterogeneous and legacy databasesWWW

Page 15: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 15

Relational Databases

A collection of tables consisting of a set of tuples(records or rows)Each tuple owns a set of attributes (columns or fields) A tuple is identified by a unique keySQL “Show me total sales of last year grouped by branch”Data mining can search for trends and data patterns e.g. detect deviations such as

Items whose sales are far from those expected compared to last year; Predict credit risk of new customers based on their income, age,credit history

Page 16: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 16

Transactional Databases

Consist of a file where each record represents a transaction (TransID + ItemList, such as items purchased in a store)Q: How many transactions include item #13?Data mining: Market basket analysis: identify frequent itemsets to enable maximizing sales

Knowing printers are commonly purchased together with computers, thus discounting an expensive model printers with a purchase of selected computers (hoping selling more expensive printers)Shelf diapers, beers and potato chips nearby to increase sales

Page 17: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 17

Data Warehouses

A repository of information collected from multiple sources, organized under a unified schema at a single site in order to facilitate management decision makingProvide a multidimensional view of data and allows precomputation and fast accessing of summarized dataData warehouse systems are well suited for On-line Analytical Processing (OLAP)Data mining uses data warehouse tools to support data analysis

OLAP operations such as drill-down and roll-up

Page 18: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 18

Object-Oriented Databases

Based on OO programming paradigmData and code encapsulated into a single unitEntity is considered as object associated with

A set of variables describe the object (attributes in E-R model)A set of messages the object uses to communicate with other objectA set of methods which return a value in response upon receiving a message

Superclass/subclass with inheritance benefits information sharing

Page 19: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 19

Object-Relational Databases

Extend basic relational model by adding power to handle complex data types, class hierarchies, and object inheritanceData mining in OO and OR systems share some similarities with relational Data mining

Techniques need to be developed for handling complex object structure, class and subclass hierarchies, property inheritance,and methods and procedures

Page 20: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 20

Spatial Databases

Include geographic (map) databases, medical image databases and satellite image databasesApplications: forestry and ecology planning, vehicle navigation, providing public service information regarding location of telephone and electric cables, pipes, and sewage systemsData mining uncover patterns describing

Characteristics of houses located near a parkClimate of mountainous areas located at various altitudes

Page 21: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 21

Time-series and Temporal Databases

Time-series database stores sequences of values that change with time e.g. stock exchangeTemporal database stores relational data that include time-related attributesData mining finds characteristics of object evolution or the trend of changes for objects in database e.g.

Aid in scheduling of bank tellers according to the volume of customer trafficUncover trends that could help investment strategy planning (when is the best time to purchase XXX stock?)

Page 22: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 22

Text Databases

Contain word descriptions for objects such as library databases, articles, product specifications, and bug reportsData mining may uncover keyword or content associations, clustering behavior of text objects e.g. search engines cluster documents according to the words contained

Page 23: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 23

Multimedia Databases

Store text, image, audio,video dataApplication: picture content-based retrieval, voice mail system, speech-based user interfacesData mining needs storage and search techniques to be integrated to support real-time retrieval of countinuousmedia data

Page 24: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 24

Heterogeneous and Legacy Databases

Heterogenous database consists of a set of interconnected, autonomous component databasesLegacy database is a group of heterogeneous databases that combined different kinds of data systems, such as relational or OO databases, spreadsheets, or file systemsData mining may provide an interesting solution to the information exchange problem by transforming the given data into higher, more generalized, conceptual levels

Page 25: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 25

A huge, widely distributed information repository made available by internetData mining: Mining path traversal patterns: understand user access patterns

Help providing efficient access between highly correlated objects (improve system desigh)Lead to better marketing decisions e.g. placing advertisementsProviding better customer classification and behavior analysis

World Wide Web

Page 26: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 26

Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Classification of data mining systems

Data mining task primitives

Major issues in data mining

Summary

Page 27: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 27

Data Mining Functionalities

Concept description: Characterization and discrimination

Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

Frequent patterns, association (correlation and causality)

Multi-dimensional vs. single-dimensional associationage(X, “20..29”) ^ income(X, “20..29K”) -> buys(X, “PC”) [support = 20%, confidence = 60%]contains(X, “computer”) -> contains(X, “software”) [10%, 75%]

Page 28: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 28

Data Mining Functionalities

Classification and PredictionFinding models (functions) that describe and distinguish classesor concepts for future prediction

E.g., classify countries based on climate, or classify cars based on gas mileage

Presentation: decision-tree, classification rule, neural network

Prediction: Predict some unknown or missing numerical values

Cluster analysisClass label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns

Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity

Page 29: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 29

Data Mining Functionalities

Outlier analysisOutlier: a data object that does not comply with the general

behavior of the data

It can be considered as noise or exception but is quite useful in

fraud detection, rare events analysis

Trend and evolution analysisTrend and deviation: regression analysis

Sequential pattern mining: e.g., digital camera large SD memory

Periodicity analysis

Similarity-based analysis

Page 30: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 30

Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Classification of data mining systems

Data mining task primitives

Major issues in data mining

Summary

Page 31: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 31

Are All the “Discovered” Patterns Interesting?

A data mining system/query may generate thousands of

patterns, not all of them are interesting.Suggested approach: Human-centered, query-based, focused

mining

Interestingness measures: A pattern is interesting if it is easily understood by humans,

valid on new or test data with some degree of certainty,

potentially useful, novel, or validates some hypothesis that a

user seeks to confirm

Page 32: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 32

Are All the “Discovered” Patterns Interesting?

Objective vs. subjective interestingness measures:

Objective: based on statistics and structures of

patterns, e.g., support, confidence, etc.

Subjective: based on user’s belief in the data, e.g.,

unexpectedness, novelty, actionability, etc.

Page 33: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 33

Can We Find All and Only Interesting Patterns?

Find all the interesting patterns: CompletenessCan a data mining system find all the interesting patterns?

Heuristic vs. exhaustive search

Association vs. classification vs. clustering

Search for only interesting patterns: OptimizationCan a data mining system find only the interesting patterns?

Approaches

First general all the patterns and then filter out the uninteresting ones

Generate only the interesting patterns—mining query optimization

Page 34: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 34

Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Classification of data mining systems

Data mining task primitives

Major issues in data mining

Summary

Page 35: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 35

Data Mining: Confluence of Multiple Disciplines

Data Mining

Database Technology Statistics

OtherDisciplines

InformationScience

MachineLearning Visualization

Algorithm

Page 36: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 36

Data Mining: Classification Schemes

General functionalityDescriptive data mining characterize general properties of data

in database e.g. association rules

Predictive data mining perform inference on current data in

order to make predictions

Different views, different classificationsData view: kinds of databases to be mined

Knowledge view: kinds of knowledge to be discovered

Method view: kinds of techniques utilized

Application view: kinds of applications adapted

Page 37: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 37

Different views, different classifications (1)Databases to be mined

Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.

Knowledge to be minedCharacterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.Multiple/integrated functions and mining at multiple levels

Data Mining: Classification Schemes

Page 38: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 38

Different views, different classifications (2)Techniques utilized

Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.

Applications adaptedRetail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

Data Mining: Classification Schemes

Page 39: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 39

Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Classification of data mining systems

Data mining task primitives

Major issues in data mining

Summary

Page 40: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 40

What Defines a Data Mining Task ?

Task-relevant data

Type of knowledge to be mined

Background knowledge

Pattern interestingness measurements

Visualization of discovered patterns

Page 41: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 41

Task-Relevant Data

Database or data warehouse name

Database tables or data cubes

Condition for data selection

Relevant attributes or dimensions

Data grouping criteria

Page 42: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 42

Types of knowledge to be mined

Characterization

Discrimination

Association

Classification/prediction

Clustering

Outlier analysis

Other data mining tasks

Page 43: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 43

Types of knowledge to be mined: Metapatterns

Metapatterns: all discovered patterns must match

For example, metarule for association rules

P(X:customer,W) ^ Q(X,Y) ⇒ buys(X,Z)

age(X,”30..39”) ^ income(X,”40K..49K”) ⇒ buys(X,”VCR”)

[2.2%, 60%]

occupation(X,”student”) ^ age(X,”20..29”) ⇒

buys(X,”Computer”) [1.4%, 70%]

Page 44: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 44

Background Knowledge: Concept Hierarchies

Schema hierarchyE.g., street < city < province_or_state < country

Set-grouping hierarchyE.g., {20-39} = young, {40-59} = middle_aged, {60-89} = senior

Operation-derived hierarchyE.g., email address: [email protected]

login-name < department < university < countryRule-based hierarchy

low_profit_margin (X) <= price(X, P1) and cost (X, P2) and (P1 - P2) < $50

Page 45: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 45

Measurements of Pattern Interestingness

Simplicitye.g., (association) rule length, (decision) tree size

Certaintye.g., confidence, P(A|B) = n(A and B)/ n (B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.

Utilitypotential usefulness, e.g., support (association), noise threshold (description)

Noveltynot previously known, contribute new information, e.g., exception

Page 46: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 46

Visualization of Discovered Patterns

Different backgrounds/usages may require different forms of representation

E.g., rules, tables, crosstabs, pie/bar chart etc.

Concept hierarchy is also important Discovered knowledge might be more understandable when represented at high level of abstraction

Interactive drill up/down, pivoting, slicing and dicing provide different perspective to data

Different kinds of knowledge require different representation: association, classification, clustering, etc.

Page 47: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 47

Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Classification of data mining systems

Data mining task primitives

Major issues in data mining

Summary

Page 48: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 48

Major Issues in Data Mining

Mining methodology and user interaction

Mining different kinds of knowledge in databases

Interactive mining of knowledge at multiple levels of abstraction

Incorporation of background knowledge

Data mining query languages and ad-hoc data mining

Expression and visualization of data mining results

Handling noise and incomplete data

Pattern evaluation: the interestingness problem

Page 49: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 49

Major Issues in Data Mining

Issues relating to the diversity of data typesHandling relational and complex types of dataMining information from heterogeneous databases and global information systems (WWW)

Performance and scalability

Efficiency and scalability of data mining algorithms

Parallel, distributed and incremental mining methods

Page 50: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 50

Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Classification of data mining systems

Data mining task primitives

Major issues in data mining

Summary

Page 51: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 51

Summary

Database technology: evolved from primitive file processing to the development of database management systems

Data mining: discovering interesting patterns from large amounts of data in a variety of information repositories

A KDD process: includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation

Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

Page 52: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 52

Summary

Pattern interestingness: easily understood, valid (with some degree of certainty), useful, novel

Measures of pattern interestingness, either objective or subjective can be used to guide the discovery process

Data mining systems: can be classified according to the kinds of databases mined, the kinds of knowledge mined, the techniques used, or the applications adapted

Page 53: MIT-652 Data Mining Applications · network DBMS 1970s: Relational data model, relational DBMS implementation ... Data mining and data warehousing, multimedia databases, and Web databases

MIT-652: DM 1: Introduction to Data Mining 53

Summary

Five primitives for specification of a data mining task

task-relevant data (i.e., the data set to be mined)

kind of knowledge to be mined

background knowledge

interestingness measures

knowledge presentation and visualization techniques to be used for displaying the discovered patterns