Date posted: 22-Dec-2015
Data Mining Concepts 1.1
COT5230 Data Mining
Week 1
Data Mining Concepts
Monash: Australia's International University
Data Mining Concepts 1.2
A Definition of Data Mining
Use of analytical tools to discover knowledge in a collection of data
The knowledge takes the form of patterns, relationships and facts which would not otherwise be immediately apparent
These analytical tools may be drawn from a number of disciplines, which include:
» machine learning
» pattern recognition
» machine discovery
» statistics
» artificial intelligence
» human-computer interaction
» information visualization
Data Mining Concepts 1.3
Data Mining
Why has the area appeared?
– Large volumes of data stored by organizations in a competitive environment, combined with advances in technologies that can be applied to the data
Background and evolution
– The failure of traditional approaches
The need for Data Mining
– Niche marketing, customer retention, the internet
The means to implement Data Mining
– The data warehouse, the available computing power, effective modeling approaches
Data Mining Concepts 1.4
A Case Study - Data Preparation (Cabena et al. page 106)
Health Insurance Commission Australia
– 550 GB online; 1300 GB in 5-year history DB
– Aim to prevent fraud and inappropriate practice
– Considered 6.8 million visits, each requesting up to 20 pathology tests, and 17,000 doctors
– Descriptive variables were added to the GP records
– Records were pivoted to create separate records for each pathology test
– Records were then aggregated by provider number (GP)
– An association discovery operation was carried out
Data Mining Concepts 1.5
An Association Rule
The Rule
– When a customer buys a shirt, in 70% of cases, he or she will also buy a tie
– The Confidence Factor is 70%
The Support Factor
– A shirt and a tie are bought together in 13.5% of all purchases
– The Support Factor is 13.5%
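The two factors can be sketched in a few lines of Python. This is only an illustration: the transaction list and the function names `support` and `confidence` are assumptions, and the toy data does not reproduce the 70% and 13.5% figures on the slide.

```python
# Toy transactions (hypothetical data): each set is one purchase.
transactions = [
    {"shirt", "tie"},
    {"shirt", "tie", "belt"},
    {"shirt"},
    {"socks"},
    {"tie", "socks"},
    {"shirt", "tie"},
    {"belt"},
    {"shirt", "socks"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    base = sum(1 for t in transactions if antecedent <= t)
    if base == 0:
        return 0.0  # the rule never fires, so no confidence can be measured
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return both / base

# Rule: "if shirt then tie"
rule_support = support({"shirt", "tie"}, transactions)
rule_confidence = confidence({"shirt"}, {"tie"}, transactions)
print(f"support={rule_support:.3f} confidence={rule_confidence:.3f}")
```

Support is measured against all purchases, while confidence is measured only against the purchases that contain the left-hand side of the rule.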
Data Mining Concepts 1.6
A Case Study - Modeling and Analysis
– Rules with a confidence factor greater than 50% were considered
– The software Intelligent Miner (IBM) was used
– The level of support was gradually reduced
» i.e. the number of records to which the rule applied was reduced
– Rules considered to be noise were excluded.
– Domain knowledge indicated that some tests should be excluded and more useful rules were revealed
– GP profiling was carried out
– The new segments were related back to existing classifications of GPs
– Some rules corresponded to expensive tests that could be substituted
Data Mining Concepts 1.7
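[Figure: case study workflow. Inputs: Episodes Database, GP Database. Steps: Data Preparation, Merge, Association Discovery, Database Segmentation.]
– Association Discovery output: rules at 1% support, e.g. if test A then test B will occur in 62% of cases
– Database Segmentation output: Segment 1 (97 GPs, Score = 1.8); Segment 2 (206 GPs, Score = 2.7)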
Data Mining Concepts 1.8
Data Mining for Business Decision Support (From Berry & Linoff 1997)
Identify the business problem
Use data mining techniques to transform the data into actionable information
Act on information
Measure the results
Data Mining Concepts 1.9
The Process of Knowledge Discovery
Pre-processing
– data selection
– cleaning
– coding
Data Mining
– select a model
– apply the model
Analysis of results and assimilation
– Take action and measure the results
Data Mining Concepts 1.10
The Process of Knowledge Discovery
[Figure: The Knowledge Discovery in Databases (KDD) process (Adriaans/Zantinge)]
Stages: Data selection, Cleaning & Enrichment, Coding, Data mining, Reporting
– Cleaning: domain consistency, de-duplication, disambiguation
– Data mining: clustering, segmentation, prediction
Inputs: Operational data, External data
Flow labels: Information Requirement, Action, Feedback
Data Mining Concepts 1.11
Data Selection
Identify the relevant data, both internal and external to the organization
Select the subset of the data appropriate for the particular data mining application
Store the data in a database separate from the operational systems
Data Mining Concepts 1.12
Data Preprocessing
Cleaning
– Domain consistency: replace values that are inconsistent with the field's domain with null
– De-duplication: customers are often added to the DB on each purchase transaction
– Disambiguation: highlighting ambiguities for a decision by the user
» e.g. if names differed slightly but addresses were the same
Enrichment
– Additional fields, which may be vital in establishing relationships, are added to records from external sources
Coding
» e.g. take addresses and replace them with regional codes
» e.g. transform birth dates into age ranges
– It is often necessary to convert continuous data into range data for categorization purposes.
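The cleaning and coding steps above might look like the following Python sketch. Everything here is an assumption made for illustration: the record fields, the plausible age limit of 95, the reference date, and the use of the first postcode digit as a regional code.

```python
from datetime import date

# Hypothetical customer records; field names are assumptions for illustration.
records = [
    {"name": "J. Smith", "address": "1 High St", "postcode": "3145", "born": date(1960, 5, 1)},
    {"name": "John Smith", "address": "1 High St", "postcode": "3145", "born": date(1960, 5, 1)},
    {"name": "A. Lee", "address": "9 Low Rd", "postcode": "3000", "born": date(1901, 1, 1)},
]

REFERENCE_DATE = date(1999, 1, 1)  # fixed so the example is reproducible

def approx_age(born):
    """Approximate age in whole years (ignores leap-day precision)."""
    return (REFERENCE_DATE - born).days // 365

def clean_domain(record):
    """Domain consistency: replace an implausible birth date with None (null)."""
    cleaned = dict(record)
    if not 0 <= approx_age(cleaned["born"]) <= 95:  # assumed plausible range
        cleaned["born"] = None
    return cleaned

def flag_possible_duplicates(records):
    """Disambiguation: index pairs with the same address but different names."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a, b = records[i], records[j]
            if a["address"] == b["address"] and a["name"] != b["name"]:
                pairs.append((i, j))  # left for the user to decide
    return pairs

def code_record(record):
    """Coding: postcode -> region code, birth date -> ten-year age range."""
    coded = {"region": record["postcode"][0]}  # assumed regional coding
    if record["born"] is None:
        coded["age_range"] = None
    else:
        low = (approx_age(record["born"]) // 10) * 10
        coded["age_range"] = f"{low}-{low + 9}"
    return coded

cleaned = [clean_domain(r) for r in records]
duplicates = flag_possible_duplicates(cleaned)
coded = [code_record(r) for r in cleaned]
print(duplicates)
print(coded)
```

The duplicate check only flags candidate pairs; as the slide notes, the final decision is left to the user.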
Data Mining Concepts 1.13
Data Mining
Preliminary Analysis
– Much interesting information can be found by querying the data set
– May be supported by a visualization of the data set.
Choose one or more modeling approaches
There are two styles of data mining
– Hypothesis testing
– Knowledge discovery
The styles and approaches are not mutually exclusive
Data Mining Concepts 1.14
Data Mining Tasks
Various taxonomies exist. Berry & Linoff define 6 tasks:
» Classification
» Estimation
» Prediction
» Affinity Grouping
» Clustering
» Description
The tasks are also referred to as operations. Cabena et al. define 4 operations:
» Predictive Modeling
» Database Segmentation
» Link Analysis
» Deviation Detection
Data Mining Concepts 1.15
Classification
Classification involves considering the features of some object and then assigning it to some pre-defined class, for example:
– Spotting fraudulent insurance claims
– Which phone numbers are fax numbers
– Which customers are high-value
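As one possible illustration, the fax-number example can be sketched with a nearest-neighbour classifier, a simple form of memory-based reasoning. The slides do not prescribe a technique here; the two features, the training examples and the function name `classify` are all hypothetical.

```python
import math

# Hypothetical labelled examples: (calls_per_day, avg_call_seconds) -> class
training = [
    ((40.0, 35.0), "fax"),
    ((55.0, 30.0), "fax"),
    ((6.0, 240.0), "voice"),
    ((9.0, 180.0), "voice"),
]

def classify(features, training):
    """Assign the pre-defined class of the closest training example.

    Uses plain Euclidean distance; in practice the features would
    usually be rescaled first so that no single one dominates.
    """
    nearest = min(training, key=lambda example: math.dist(features, example[0]))
    return nearest[1]

label_a = classify((50.0, 32.0), training)
label_b = classify((5.0, 200.0), training)
print(label_a, label_b)
```

The classes ("fax", "voice") are fixed in advance by the labelled examples, which is what distinguishes classification from clustering.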
Data Mining Concepts 1.16
Estimation
Estimation deals with numerically valued outcomes rather than the discrete categories used in classification.
– Estimating the number of children in a family
– Estimating family income
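A minimal sketch of estimation, assuming a single input variable and a straight-line fit by ordinary least squares. The slides do not name a method; the house-value and income figures and the names `fit_line` and `estimate_income` are invented for illustration.

```python
# Hypothetical data: house value and family income, both in thousands.
house_value = [100.0, 200.0, 300.0, 400.0]
family_income = [32.0, 48.0, 71.0, 89.0]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(house_value, family_income)

def estimate_income(value):
    """Return a numeric estimate rather than a class label."""
    return intercept + slope * value

print(round(estimate_income(300.0), 1))
```

Unlike the classification examples, the output is a number on a continuous scale.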
Data Mining Concepts 1.17
Prediction
Essentially the same as classification and estimation but involves future behaviour
Historical data is used to build a model explaining behaviour (outputs) for known inputs
The model developed is then applied to current inputs to predict future outputs
– Predict which customers will respond to a promotion
– Classifying loan applications
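The two-step pattern (build a model from historical data, then apply it to current inputs) can be sketched as follows. The model is deliberately trivial, a response rate per age range, and the data, the 0.5 threshold and the names `build_model` and `predict` are assumptions.

```python
from collections import defaultdict

# Hypothetical history: (age_range, responded_to_last_promotion)
history = [
    ("20-29", True), ("20-29", True), ("20-29", False),
    ("30-39", False), ("30-39", False), ("30-39", True),
    ("40-49", True), ("40-49", True),
]

def build_model(history):
    """Learn the observed response rate for each age range."""
    counts = defaultdict(lambda: [0, 0])  # age_range -> [responses, total]
    for age_range, responded in history:
        counts[age_range][0] += int(responded)
        counts[age_range][1] += 1
    return {key: responses / total for key, (responses, total) in counts.items()}

def predict(model, age_range, threshold=0.5):
    """Predict a response when the learned rate meets the threshold.

    Age ranges never seen in the history default to "no response".
    """
    return model.get(age_range, 0.0) >= threshold

model = build_model(history)
current_customers = ["20-29", "30-39", "50-59"]
predictions = {c: predict(model, c) for c in current_customers}
print(predictions)
```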
Data Mining Concepts 1.18
Affinity Grouping
Affinity grouping is also referred to as Market Basket Analysis
A common example is which items are bought together at the supermarket. Once this is known, decisions can be made on, for example:
– how to arrange items on the shelves
– which items should be promoted together
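A minimal sketch of finding items that are bought together, by counting how often each pair appears in the same basket. The baskets, the cut-off of two occurrences and the name `pair_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "milk"},
]

def pair_counts(baskets):
    """Count the baskets in which each unordered pair of items co-occurs."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical (alphabetical) form.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(baskets)
frequent_pairs = sorted(pair for pair, n in counts.items() if n >= 2)
print(frequent_pairs)
```

Pairs that pass the cut-off are candidates for shelf placement or joint promotion.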
Data Mining Concepts 1.19
Clustering
Clustering is also sometimes referred to as segmentation (though this has other meanings in other fields)
In clustering there are no pre-defined classes. Self-similarity is used to group records. The user must attach meaning to the clusters formed
Clustering often precedes some other data mining task, for example:
– once customers are separated into clusters, a promotion might be carried out based on market basket analysis of the resulting cluster
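One common way to group records by self-similarity is k-means clustering, sketched below on two-dimensional points. This is an illustration rather than the method the slides rely on; the points, the starting centroids and the name `k_means` are assumptions.

```python
import math

# Hypothetical customer records reduced to two numeric features.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Starting centroids are chosen by hand here to keep the run deterministic.
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
```

The algorithm returns groups but no labels; deciding what each cluster means is still up to the user.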
Data Mining Concepts 1.20
Description
A good description of data can provide understanding of behaviour
The description of the behaviour can suggest an explanation for it as well
Statistical measures can be useful in describing data, as can techniques that generate rules
Data Mining Concepts 1.21
Deviation Detection
Records whose attributes deviate from the norm by significant amounts are also called outliers
Application areas include:
– fraud detection
– quality control
– tracing defects.
Visualization techniques and statistical techniques are useful in finding outliers
A cluster which contains only a few records may in fact represent outliers
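A simple statistical check for outliers, assuming a single numeric attribute and a z-score cut-off of 2. The figures, the cut-off and the name `find_outliers` are hypothetical.

```python
import statistics

# Hypothetical: average pathology tests ordered per visit, one value per provider.
tests_per_visit = [2.1, 1.9, 2.3, 2.0, 2.2, 1.8, 2.0, 2.1, 6.5, 2.0]

def find_outliers(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold`
    sample standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing deviates
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

outliers = find_outliers(tests_per_visit)
print(outliers)
```

A flagged record is only a candidate for investigation; whether it indicates fraud or a defect needs domain knowledge.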
Data Mining Concepts 1.22
Data Mining Techniques
– Query tools
– Decision Trees
– Memory-Based Reasoning
– Artificial Neural Networks
– Genetic Algorithms
– Association and sequence detection
– Statistical Techniques
– Visualization
– Others (Logistic Regression, Generalized Additive Models (GAM), Multivariate Adaptive Regression Splines (MARS), K-Means Clustering, ...)
Data Mining Concepts 1.23
Data Mining and the Data Warehouse
Organizations realized that they had large amounts of data stored (especially of transactions) but it was not easily accessible
The data warehouse provides a convenient data source for data mining. Some data cleaning has usually occurred. It exists independently of the operational systems
– Data is retrieved rather than updated
– Indexed for efficient retrieval
– Data will often cover 5 to 10 years
A data warehouse is not a pre-requisite for data mining
Data Mining Concepts 1.24
Data Mining and OLAP
Online Analytical Processing (OLAP)
Tools that allow a powerful and efficient representation of the data
Makes use of a representation known as a cube
A cube can be sliced and diced
OLAP provides reporting with aggregation and summary information but does not reveal patterns, which is the purpose of data mining
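The cube idea can be sketched with plain Python: facts carry several dimensions plus a measure, and reporting means fixing some dimensions (slicing) and aggregating over others. The dimensions, the sales rows and the names `roll_up` and `slice_cube` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, year, sales)
facts = [
    ("North", "shirt", 1997, 100),
    ("North", "tie", 1997, 40),
    ("South", "shirt", 1997, 80),
    ("North", "shirt", 1998, 120),
    ("South", "tie", 1998, 60),
]
DIMENSIONS = ("region", "product", "year")

def roll_up(facts, keep):
    """Sum sales over every dimension not named in `keep`."""
    indexes = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in indexes)] += row[3]
    return dict(totals)

def slice_cube(facts, dimension, value):
    """Slice: keep only the rows where one dimension has a fixed value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in facts if row[i] == value]

by_region = roll_up(facts, keep=("region",))
sales_1997_by_product = roll_up(slice_cube(facts, "year", 1997), keep=("product",))
print(by_region)
print(sales_1997_by_product)
```

The output is a set of summaries chosen by the analyst; nothing here searches for patterns, which is where data mining takes over.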