Data warehouse and Data Mining · Outliers . Sequential Pattern Mining • Sequential Pattern...
Naeem A. Mahoto
Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro
Email: [email protected]
Data warehouse and Data Mining Lecture No. 14
Data Mining and its techniques
Decision support progress to Data Mining
• Early file-based systems: basic accounting data; no decision support
• Database systems: operational systems data; primitive decision support
• Data warehouse: data for decision support; true decision support
• OLAP systems: data for multidimensional analysis; complex analysis and calculations
• Data mining applications: selected and extracted data; knowledge discovery
Data Mining
• A non-trivial extraction of novel, implicit, and actionable knowledge from large databases
• Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind
Data Mining: A KDD Process
Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
Steps of KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation:
– find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining:
– summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation:
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
Data Mining and Business Intelligence
Layers, from the highest potential to support business decisions down to the lowest:
• Making Decisions
• Data Presentation: visualization techniques
• Data Mining: information discovery
• Data Exploration: statistical analysis, querying and reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: paper, files, information providers, database systems, OLTP
Roles, from the top layers down: End User, Business Analyst, Data Analyst, DBA
OLAP versus Data Mining
OLAP
• In an OLAP analysis session, the analyst has some prior knowledge of what to look for
• OLAP helps the user analyze the past and gain insights
• The analyst drives the process while using OLAP tools
• Complex queries
Data Mining
• The analyst has no prior knowledge of what the results are likely to be
• Data mining helps the user predict the future
• The analyst prepares the data and "sits back" while the tools drive the process
• No SQL queries
OLAP versus Data Mining: Features
• Motivation for information request
– OLAP: What is happening in the enterprise?
– Data Mining: Predict the future based on why this is happening
• Data granularity
– OLAP: Summary data
– Data Mining: Detailed transaction-level data
• Number of business dimensions
– OLAP: Limited number of dimensions
– Data Mining: Large number of dimensions
• Number of dimension attributes
– OLAP: Small number of attributes
– Data Mining: Many dimension attributes
• Sizes of datasets for the dimensions
– OLAP: Not large for each dimension
– Data Mining: Usually very large for each dimension
• Analysis approach
– OLAP: User-driven interactive analysis
– Data Mining: Data-driven automatic knowledge discovery
• Analysis techniques
– OLAP: Multidimensional, drill-down, and slice & dice
– Data Mining: Prepare data, launch mining tool & sit back
• State of the technology
– OLAP: Mature & widely used
– Data Mining: Still emerging
Data Mining Applications
• Database analysis and decision support
– Market analysis and management: target marketing, customer relation management, market basket analysis, cross selling, market segmentation
– Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
– Fraud detection and management
• Other applications
– Text mining (news group, email, documents)
– Stream data mining
– Web mining
– DNA data analysis
Data Mining Techniques
• Data mining covers a broad range of techniques, including:
– Classification
– Clustering
– Sequential pattern mining
– Association rule mining
– Many more …
• Each technique is implemented by specific algorithms
Association Rule Mining
• Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
• Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database
• Motivation: finding regularities in data
– What products were often purchased together? (Beer and diapers?!)
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can one automatically classify web documents?
Association Rule Mining
• Itemset X = {x1, …, xk}
• Find all the rules X ⇒ Y with minimum confidence and support
– support, s: probability that a transaction contains X∪Y
– confidence, c: conditional probability that a transaction having X also contains Y
(Venn diagram: customers who buy diapers, customers who buy beer, and customers who buy both)

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Let min_support = 50%, min_conf = 50%:
A ⇒ C (50%, 66.7%)
C ⇒ A (50%, 100%)
Mining Association Rules: an Example
Min. support 50%, min. confidence 50%

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For example, for the rule A ⇒ C:
support = support({A}∪{C}) = 50%
confidence = support({A}∪{C})/support({A}) = 66.7%
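
The support and confidence figures in this example can be checked with a short Python sketch. It is only an illustration: the function names `support` and `confidence` are chosen here and are not part of the lecture; the transactions are the four from the table.

```python
# Transactions from the example (transaction id -> items bought).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """support(X ∪ Y) / support(X) for the rule X => Y."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)

print(support({"A", "C"}))        # support of A => C
print(confidence({"A"}, {"C"}))   # confidence of A => C
print(confidence({"C"}, {"A"}))   # confidence of C => A
```

A rule is reported only if both values reach the chosen thresholds (50% each in this example).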
Classification and Prediction
• Finding models (functions) that describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on climate, or classify cars based on gas mileage
• Presentation: decision tree, classification rule, neural network
• Prediction: predict some unknown or missing numerical values
Classification Process: Model Construction
Training Data → Classification Algorithms → Classifier (Model)

Training Data
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model)
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process: Use the Model in Prediction
The classifier is applied first to the testing data and then to unseen data.

Testing Data
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data
(Jeff, Professor, 4)
Tenured?
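
As a rough illustration, the rule produced during model construction can be written as a small Python function and applied to the testing data and to the unseen record. The function name `predict_tenured` and the accuracy calculation are assumptions added for this sketch; the data rows are those from the slide.

```python
def predict_tenured(rank, years):
    """Rule from the model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data: (name, rank, years, actual tenured label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Count how many testing records the rule labels correctly.
correct = sum(
    predict_tenured(rank, years) == actual
    for _, rank, years, actual in testing_data
)
accuracy = correct / len(testing_data)
print("accuracy on testing data:", accuracy)

# Unseen record from the slide: (Jeff, Professor, 4).
jeff_prediction = predict_tenured("Professor", 4)
print("Jeff tenured?", jeff_prediction)
```

The rule misclassifies Merlisa (7 years, not tenured), so it is right on three of the four testing records; for Jeff it predicts 'yes' because his rank is Professor.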
Decision Trees
Training set
age     income   student   credit_rating
<=30    high     no        fair
<=30    high     no        excellent
31…40   high     no        fair
>40     medium   no        fair
>40     low      yes       fair
>40     low      yes       excellent
31…40   low      yes       excellent
<=30    medium   no        fair
<=30    low      yes       fair
>40     medium   yes       fair
<=30    medium   yes       excellent
31…40   medium   no        excellent
31…40   high     yes       fair
>40     medium   no        excellent
Decision Trees
age?
• <=30: student?
– no: no
– yes: yes
• 30..40: yes
• >40: credit rating?
– excellent: no
– fair: yes
Cluster and outlier analysis
• Cluster Analysis
– Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
– Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
• Outlier Analysis
– Outlier: a data object that does not comply with the general behavior of the data
– It can be considered as noise or an exception, but is quite useful in fraud detection and rare events analysis
Clusters and Outliers
(Figure: data points labeled as clusters and as outliers)
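
One simple way to make "does not comply with the general behavior of the data" concrete is to flag values that lie far from the median. The sketch below does this using the median absolute deviation. This technique, the function name `find_outliers`, the threshold of 3.0, and the sample amounts are all assumptions chosen for illustration; the lecture does not prescribe a specific outlier-detection method.

```python
from statistics import median

def find_outliers(values, threshold=3.0):
    """Return values whose distance from the median exceeds
    `threshold` times the median absolute deviation (MAD)."""
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values are (nearly) identical; nothing can be ranked as unusual.
        return []
    return [v for v in values if abs(v - med) / mad > threshold]

# Hypothetical transaction amounts: one value is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 500]
print(find_outliers(amounts))
```

In a fraud-detection setting, the flagged value would be treated as a candidate for review rather than discarded as noise.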
Sequential Pattern Mining
• Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns in a sequence database
• A sequence database stores a number of records, where all records are sequences of ordered events, with or without concrete notions of time
• Sequential patterns are used for targeted marketing and customer retention
Terminology for Sequence Mining
• Itemset: non-empty set of items
• Sequence: ordered list of itemsets
• Customer sequence: list of customer transactions ordered by increasing transaction time
– A customer supports a sequence if the sequence is contained in the customer sequence
• Support for a sequence: fraction of total customers that support the sequence
• Maximal sequence: a sequence that is not contained in any other sequence
• Closed sequence: a sequence that is not contained in any longer sequence with the same support
Example: Sequence
A sequence database:

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

A sequence: <(ef)(ab)(df)cb>
An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
<a(bc)df> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
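
The two claims in this example can be verified with a small Python sketch. It represents each sequence as a list of item sets and tests containment with a greedy left-to-right scan. The names `is_subsequence` and `support_count` are assumptions for illustration, and the support is counted as a number of sequences to match min_sup = 2.

```python
# Sequence database from the example: each sequence is a list of elements (item sets).
sequence_db = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}

def is_subsequence(pattern, sequence):
    """True if every element of `pattern` is a subset of a distinct
    element of `sequence`, with the order of elements preserved."""
    pos = 0
    for element in pattern:
        # Advance to the first remaining element that contains this one.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later element
    return True

def support_count(pattern):
    """Number of sequences in the database that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in sequence_db.values())

# <a(bc)df> is a subsequence of <a(abc)(ac)d(cf)> (SID 10).
print(is_subsequence([{"a"}, {"b", "c"}, {"d"}, {"f"}], sequence_db[10]))

# <(ab)c> is contained in SID 10 and SID 30, so it meets min_sup = 2.
print(support_count([{"a", "b"}, {"c"}]))
```

Note that <(ab)c> is not supported by SID 20 or SID 40: both contain a, b, and c, but never a and b together in one element.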
Terms
• Data scrubbing: a process to upgrade the quality of data before it is moved into a data warehouse
• Transient data: data in which changes to existing records cause the previous version of the records to be eliminated