Data warehouse and Data Mining · Outliers . Sequential Pattern Mining • Sequential Pattern...

Naeem A. Mahoto

Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro

Email: [email protected]

Data warehouse and Data Mining Lecture No. 14

Data Mining and its techniques

Decision support progress to Data Mining

•  Early file-based systems: basic accounting data; no decision support
•  Database systems: operational systems data; primitive decision support
•  Data warehouse: data for decision support; true decision support
•  OLAP systems: data for multidimensional analysis; complex analysis & calculations
•  Data mining applications: selected and extracted data; knowledge discovery

Data Mining

•  A non-trivial extraction of novel, implicit, and actionable knowledge from large databases
•  Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind

Data Mining: A KDD Process

Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation

Steps of KDD Process

•  Learning the application domain
   –  relevant prior knowledge and goals of application
•  Creating a target data set: data selection
•  Data cleaning and preprocessing (may take 60% of effort!)
•  Data reduction and transformation
   –  find useful features, dimensionality/variable reduction, invariant representation
•  Choosing functions of data mining
   –  summarization, classification, regression, association, clustering
•  Choosing the mining algorithm(s)
•  Data mining: search for patterns of interest
•  Pattern evaluation and knowledge presentation
   –  visualization, transformation, removing redundant patterns, etc.
•  Use of discovered knowledge

Data Mining and Business Intelligence

Pyramid layers, from top to bottom (potential to support business decisions increases toward the top):

•  Making Decisions
•  Data Presentation: Visualization Techniques
•  Data Mining: Information Discovery
•  Data Exploration: Statistical Analysis, Querying and Reporting
•  Data Warehouses / Data Marts: OLAP, MDA
•  Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles shown alongside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA

OLAP versus Data Mining

OLAP
•  In an OLAP analysis session, the analyst has some prior knowledge of what to look for
•  OLAP helps the user to analyze the past and gain insights
•  In OLAP, the analyst drives the process while using OLAP tools
•  Complex queries

Data Mining
•  In data mining, the analyst has no prior knowledge of what the results are likely to be
•  Data mining helps the user predict the future
•  In data mining, the analyst prepares the data and "sits back" while the tools drive the process
•  No SQL queries

OLAP versus Data Mining

•  Motivation for information request
   –  OLAP: What is happening in the enterprise?
   –  Data Mining: Predict the future based on why this is happening
•  Data granularity
   –  OLAP: Summary data
   –  Data Mining: Detailed transaction-level data
•  Number of business dimensions
   –  OLAP: Limited number of dimensions
   –  Data Mining: Large number of dimensions
•  Number of dimension attributes
   –  OLAP: Small number of attributes
   –  Data Mining: Many dimension attributes
•  Sizes of datasets for the dimensions
   –  OLAP: Not large for each dimension
   –  Data Mining: Usually very large for each dimension
•  Analysis approach
   –  OLAP: User-driven interactive analysis
   –  Data Mining: Data-driven automatic knowledge discovery
•  Analysis techniques
   –  OLAP: Multidimensional, drill-down, and slice & dice
   –  Data Mining: Prepare data, launch mining tool & sit back
•  State of the technology
   –  OLAP: Mature & widely used
   –  Data Mining: Still emerging

Data Mining Applications

•  Database analysis and decision support
   –  Market analysis and management: target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
   –  Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
   –  Fraud detection and management
•  Other applications
   –  Text mining (news groups, email, documents)
   –  Stream data mining
   –  Web mining
   –  DNA data analysis

Data Mining Techniques

•  Data mining covers a broad range of techniques, including:
   –  Classification
   –  Clustering
   –  Sequential pattern mining
   –  Association rule mining
   –  Many more …
•  Each technique is implemented by specific algorithms

Association Rule Mining

•  Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
•  Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database
•  Motivation: finding regularities in data
   –  What products were often purchased together (beer and diapers?!)
   –  What are the subsequent purchases after buying a PC?
   –  What kinds of DNA are sensitive to this new drug?
   –  Can one automatically classify web documents?

Association Rule Mining

•  Itemset X = {x1, …, xk}
•  Find all the rules X ⇒ Y with min confidence and support
   –  support, s: probability that a transaction contains X ∪ Y
   –  confidence, c: conditional probability that a transaction having X also contains Y

Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Let min_support = 50%, min_conf = 50%:
•  A ⇒ C (50%, 66.7%)
•  C ⇒ A (50%, 100%)

Mining Association Rules: an Example

Min. support 50%, min. confidence 50%

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For example, for the rule A ⇒ C:
•  support = support({A} ∪ {C}) = 50%
•  confidence = support({A} ∪ {C}) / support({A}) = 50% / 75% = 66.7%
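The support and confidence figures in this example can be checked with a few lines of Python. This is a minimal sketch, not a production algorithm such as Apriori: it enumerates every candidate itemset by brute force, which only works for a toy table like this one. The function names (`support`, `confidence`, `frequent_itemsets`) are illustrative choices, not part of the lecture.

```python
from itertools import combinations

# Transactions copied from the example table (transaction-id -> items bought).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}


def support(itemset, db=transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)


def confidence(lhs, rhs, db=transactions):
    """Confidence of lhs => rhs: support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)


def frequent_itemsets(db=transactions, min_support=0.5):
    """Brute-force enumeration of all itemsets meeting min_support."""
    items = sorted(set().union(*db.values()))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, db)
            if s >= min_support:
                result[candidate] = s
    return result


if __name__ == "__main__":
    for itemset, s in frequent_itemsets().items():
        print(set(itemset), f"{s:.0%}")
    print("A => C:", f"{support('AC'):.0%}", f"{confidence('A', 'C'):.1%}")
    print("C => A:", f"{support('CA'):.0%}", f"{confidence('C', 'A'):.1%}")
```

With min_support = 50%, the only frequent itemsets are {A}, {B}, {C}, and {A, C}, matching the table above.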

Classification and Prediction

•  Finding models (functions) that describe and distinguish classes or concepts for future prediction
•  E.g., classify countries based on climate, or classify cars based on gas mileage
•  Presentation: decision tree, classification rule, neural network
•  Prediction: predict some unknown or missing numerical values

Classification Process: Model Construction

Training Data

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Training Data -> Classification Algorithms -> Classifier (Model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process: Use the Model in Prediction

The Classifier is first applied to the Testing Data, then to Unseen Data.

Testing Data

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data

(Jeff, Professor, 4) -> Tenured?
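The two phases can be traced in code: apply the learned rule to the training and testing rows, measure accuracy, then classify the unseen tuple. This is a minimal sketch. The rule on the slide states only the 'yes' case, so returning 'no' otherwise is an assumption, and the helper names `predict_tenured` and `accuracy` are illustrative.

```python
# Rows copied from the slides: (name, rank, years, tenured).
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def predict_tenured(rank, years):
    """Rule from the slide: IF rank = 'professor' OR years > 6 THEN 'yes'.

    Assumption: every tuple not covered by the rule is classified 'no'.
    """
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


def accuracy(rows):
    """Fraction of rows whose predicted label equals the recorded label."""
    hits = sum(predict_tenured(rank, years) == label
               for _, rank, years, label in rows)
    return hits / len(rows)


if __name__ == "__main__":
    print("training accuracy:", accuracy(training))
    print("testing accuracy:", accuracy(testing))
    print("Jeff tenured?", predict_tenured("Professor", 4))
```

Under that assumption the rule fits all six training rows but misclassifies Merlisa in the testing data (7 years, not tenured), which is why a model is evaluated on data it was not built from before it is used on unseen tuples such as (Jeff, Professor, 4).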

Decision Trees

Training set

age      income   student   credit_rating
<=30     high     no        fair
<=30     high     no        excellent
31..40   high     no        fair
>40      medium   no        fair
>40      low      yes       fair
>40      low      yes       excellent
31..40   low      yes       excellent
<=30     medium   no        fair
<=30     low      yes       fair
>40      medium   yes       fair
<=30     medium   yes       excellent
31..40   medium   no        excellent
31..40   high     yes       fair
>40      medium   no        excellent

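Decision Trees

Figure: decision tree built from the training set

•  Root test: age? with branches <=30, 31..40, and >40
•  age <=30: test student? with branches no and yes, each ending in a yes/no leaf
•  age 31..40: leaf yes
•  age >40: test credit rating? with branches fair and excellent, each ending in a yes/no leaf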

Cluster and outlier analysis

•  Cluster analysis
   –  Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
   –  Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
•  Outlier analysis
   –  Outlier: a data object that does not comply with the general behavior of the data
   –  It can be considered as noise or an exception, but is quite useful in fraud detection and rare events analysis

Clusters and Outliers

Figure: data points grouped into Clusters, with isolated points marked as Outliers
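To make "does not comply with the general behavior of the data" concrete, here is a minimal sketch of one common heuristic, the modified z-score based on the median and the median absolute deviation (MAD). The lecture does not prescribe this method. The function name `mad_outliers`, the threshold of 3.5, and the sample amounts are illustrative assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    Modified z-score: 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half the values equal the median; the score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


if __name__ == "__main__":
    # Hypothetical transaction amounts, invented for illustration only.
    amounts = [52, 48, 50, 51, 49, 47, 53, 500]
    print(mad_outliers(amounts))
```

In a fraud-detection setting the flagged value is the interesting one. The same point would be discarded as noise if the goal were to describe typical behavior.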

Sequential Pattern Mining

•  Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns in a sequence database
•  A sequence database stores a number of records, where all records are sequences of ordered events, with or without concrete notions of time
•  Sequential patterns are used for targeted marketing and customer retention

Terminology for Sequence Mining

•  Itemset: non-empty set of items
•  Sequence: ordered list of itemsets
•  Customer sequence: list of customer transactions ordered by increasing transaction time
   –  A customer supports a sequence if the sequence is contained in the customer sequence
•  Support for a sequence: fraction of total customers that support the sequence
•  Maximal sequence: a frequent sequence that is not contained in any other frequent sequence
•  Closed sequence: a frequent sequence that is not contained in any other sequence with the same support

Example: Sequence

A sequence database:

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

•  A sequence: <(ef)(ab)(df)cb>
•  An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
•  <a(bc)df> is a subsequence of <a(abc)(ac)d(cf)>
•  Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
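The subsequence and support claims in this example can be verified mechanically. The sketch below assumes a simple representation in which a sequence is a list of frozensets, one per element. The helper names `seq`, `is_subsequence`, and `support_count` are illustrative. It only checks containment and does not implement a mining algorithm such as GSP or PrefixSpan.

```python
def seq(*elements):
    """Build a sequence from strings: seq("a", "abc") stands for <a(abc)>."""
    return [frozenset(e) for e in elements]


# The sequence database from the example (SID -> sequence).
database = {
    10: seq("a", "abc", "ac", "d", "cf"),
    20: seq("ad", "c", "bc", "ae"),
    30: seq("ef", "ab", "df", "c", "b"),
    40: seq("e", "g", "af", "c", "b", "c"),
}


def is_subsequence(pattern, sequence):
    """True if each element of `pattern` is a subset of a distinct element
    of `sequence`, with the matched elements appearing in the same order.
    Greedy left-to-right matching is sufficient for this test."""
    position = 0
    for element in pattern:
        while position < len(sequence) and not element <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next pattern element must match a later element
    return True


def support_count(pattern, db=database):
    """Number of sequences in `db` that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in db.values())


if __name__ == "__main__":
    print(is_subsequence(seq("a", "bc", "d", "f"), database[10]))
    print(support_count(seq("ab", "c")))
```

<(ab)c> is contained in sequences 10 and 30 only, so its support count is 2 and it meets min_sup = 2.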

Terms

•  Data scrubbing: a process to upgrade the quality of data before it is moved into a data warehouse
•  Transient data: data in which changes to existing records cause the previous version of the records to be eliminated
