1 Chapter I Introduction MIS 463 Fall 2011. 2 Chapter 1. Introduction Motivation: Why data mining?...

1

Chapter I: Introduction

MIS 463, Fall 2011

2

Chapter 1. Introduction

Motivation: Why data mining?

Methodology of Knowledge Discovery in Databases

Data mining functionalities

Are all the patterns interesting?

Business applications of data mining

3

Motivation: “Necessity is the Mother of Invention”

Data explosion problem
  Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
  Need to convert such data into knowledge and information

Applications
  Business management
  Production control
  Market analysis
  Engineering design
  Science exploration

4

Evolution of Database Technology (1)

Data collection, database creation
Data management
  data storage and retrieval
  database transaction processing
Data analysis and understanding
  Data mining and data warehousing

5

Evolution of Database Technology (2) (see Fig. 1.1 in Han)

1960s:
  Data collection, database creation, primitive file processing, hierarchical and network DBMS
1970s:
  Relational data model, relational DBMS implementation
  Query languages like SQL (Structured Query Language)
  Online transaction processing
1980s:
  Advanced DBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
  Data warehousing, data mining, OLAP, multimedia databases, and Web databases
1990s-2000s:
  Web-based database systems: XML-based database systems, web mining

6

Developments in computer hardware

Powerful and affordable computers
Data collection equipment
Storage media
Communication and networking

7

Data Warehouse

Repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision making.

Data warehouse technology includes:
  Data cleaning
  Data integration
  On-Line Analytical Processing (OLAP): techniques that support multidimensional analysis and decision making with the following functionalities
    summarization
    consolidation
    aggregation
    viewing information from different angles

But additional data analysis tools are needed for
  classification
  clustering
  characterization of data changing over time

8

Data-rich, information-poor state

Abundance of data AND need for powerful data analysis tools

“data tombs” - data archives seldom visited

Important decisions are made not on the information-rich data stored in databases but on a decision maker’s intuition

No tool to extract knowledge embedded in vast amounts of data

Current expert system technology
  Users or domain experts manually input knowledge, which is time-consuming, costly, and prone to biases and errors

9

What Is Data Mining?

Data mining (knowledge discovery in databases):
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

Alternative names and their “inside stories”:
  Data mining: a misnomer?
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

What is not data mining?
  Query processing
  Expert systems or small ML/statistical programs

Gold mining vs. sand mining

10

Data Mining vs. Data Query

Data query, e.g.:
  A list of all customers who used a credit card to buy a PC
  A list of all MIS students who have a GPA of 3.5 or higher and have studied four or fewer semesters

Data mining problems, e.g.:
  What is the likelihood that a customer purchases a PC with a credit card?
  Given the characteristics of an MIS student, predict her SPA in the coming term
  What are the characteristics of MIS undergrad students?
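The contrast between a query and a mining question can be sketched in a few lines. This is only an illustration under stated assumptions: the purchase records and field names are hypothetical, and a simple relative frequency stands in for a real predictive model. The query returns rows that are already stored, while the second function estimates a likelihood that is not stored anywhere.

```python
# Hypothetical purchase records; the field names are illustrative assumptions.
purchases = [
    {"customer": "c1", "item": "PC", "payment": "credit card"},
    {"customer": "c2", "item": "PC", "payment": "cash"},
    {"customer": "c3", "item": "printer", "payment": "credit card"},
    {"customer": "c4", "item": "PC", "payment": "credit card"},
]


def query_pc_by_credit_card(records):
    """Data query: list the customers who bought a PC with a credit card."""
    return [r["customer"] for r in records
            if r["item"] == "PC" and r["payment"] == "credit card"]


def likelihood_credit_card_given_pc(records):
    """Mining-style question: estimate P(credit card | PC) from past purchases."""
    pc_purchases = [r for r in records if r["item"] == "PC"]
    if not pc_purchases:
        return None  # no evidence to estimate from
    paid_by_card = sum(r["payment"] == "credit card" for r in pc_purchases)
    return paid_by_card / len(pc_purchases)


matching_customers = query_pc_by_credit_card(purchases)
estimated_likelihood = likelihood_credit_card_given_pc(purchases)
```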

11







12

Why Data Mining? Four questions to be answered

Can the problem be clearly defined?
Does potentially meaningful data exist?
Does the data contain hidden knowledge, or is it useful only for reporting purposes?
Will the cost of processing the data be less than the likely increase in profit from the knowledge gained by applying a data mining project?

13

Steps of a KDD Process (1)

1. Goal identification:
  Define the problem, relevant prior knowledge and goals of the application
2. Creating a target data set: data selection
3. Data preprocessing (may take 60%-80% of the effort!):
  removal of noise or outliers
  strategies for handling missing data fields
  accounting for time sequence information
4. Data reduction and transformation:
  Find useful features, dimensionality/variable reduction, invariant representation

14

Steps of a KDD Process (2)

5. Data mining:
  Choosing the functions of data mining: summarization, classification, regression, association, clustering
  Choosing the mining algorithm(s): which models or parameters
  Search for patterns of interest
6. Presentation and evaluation:
  visualization, transformation, removing redundant patterns, etc.
7. Taking action:
  incorporating into the performance system
  documenting
  reporting to interested parties

15

An example: Customer Segmentation

1. The marketing department wants to perform a segmentation study on the customers of AE Company
2. Decide on relevant variables from a data warehouse on customers, sales, promotions
  Customers: name, ID, income, age, education, ...
  Sales: history of sales
  Promotion: promotion types, durations, ...
3. Handle missing income, addresses, ...; determine outliers, if any
4. Generate new index variables representing the wealth of customers
  Wealth = a*income + b*#houses + c*#cars + ...
  Make necessary transformations (e.g., z-scores) so that some data mining algorithms work more efficiently

16

Example: Customer Segmentation cont.

5.a: Choose clustering as the data mining functionality, as it is the natural one for a segmentation study: it finds groups of customers with similar characteristics
5.b: Choose a clustering algorithm
  K-means, k-medoids, or any other one suitable for the problem
5.c: Apply the algorithm
  Find clusters or segments
6. Make reverse transformations, visualize the customer segments
7. Present the results in the form of a report to the marketing department
  Implement the segmentation as part of a DSS so that it can be applied repeatedly at certain intervals as new customers arrive
  Develop marketing strategies for each segment
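Steps 4 and 5 of the segmentation example can be sketched as follows. This is a minimal illustration, not the course's implementation: the customer records, the weights a, b, c of the wealth index, the second variable (age) and the initial centroids are all hypothetical, and a bare-bones k-means is written out with the standard library only.

```python
import math
import statistics


def z_scores(values):
    """Standardize numbers to mean 0 and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def k_means(points, centroids, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids


# Hypothetical customers: (income, number of houses, number of cars).
customers = [(800, 0, 0), (900, 0, 1), (1000, 1, 0),
             (4000, 2, 2), (4500, 3, 2), (5000, 2, 3)]
a, b, c = 1.0, 500.0, 200.0  # assumed weights of the wealth index
wealth = [a * inc + b * houses + c * cars for inc, houses, cars in customers]
ages = [25, 31, 28, 52, 47, 58]  # hypothetical second segmentation variable

# Step 4: z-score transformation; step 5: clustering into two segments.
points = list(zip(z_scores(wealth), z_scores(ages)))
labels, centroids = k_means(points, centroids=[points[0], points[3]])
```

Standardizing first keeps the wealth index, which is in the thousands, from dominating the distance computation. Step 6 (the reverse transformation) would map the centroids back to the original units before reporting them.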

17

Data Mining: A KDD Process

Data mining: the core of knowledge discovery process.

[Figure: the KDD process as a pipeline. Databases -> data cleaning and data integration -> data warehouse -> data selection and data transformation -> task-relevant data -> data mining -> pattern evaluation]

18

Architecture of a Typical Data Mining System

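[Figure: layered architecture of a typical data mining system, with these components: databases and data warehouse (fed through data cleaning, data integration and filtering), database or data warehouse server, data mining engine, pattern evaluation, graphical user interface, and knowledge base]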

19

Architecture of a Typical Data Mining System

Database, data warehouse
Database or data warehouse server
Knowledge base
  concept hierarchies
  user beliefs: used to assess a pattern’s interestingness
  other thresholds
Data mining engine
  functional modules: characterization, association, classification, cluster analysis, evolution and deviation analysis
Pattern evaluation module
Graphical user interface

20

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the center, drawing on database technology, statistics, machine learning, visualization, information science, and other disciplines]

21

Efficient and Scalable Techniques

For an algorithm to be efficient and scalable, its running time should be predictable and acceptable

How?
  Parallel and distributed algorithms
  Sampling from databases

22







23

Two Styles of Data Mining

Descriptive data mining
  characterizes the general properties of the data in the database
  finds patterns in data; the user determines which ones are important
Predictive data mining
  performs inference on the current data to make predictions
  we know what to predict
Not mutually exclusive
  used together: descriptive, then predictive
  E.g., customer segmentation (descriptive, by clustering) followed by a risk assignment model (predictive, by ANN)

24

Supervised vs. Unsupervised Learning

Supervised learning (classification, prediction)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set

Unsupervised learning (summarization, association, clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

25

Descriptive Data Mining (1)

Discovering new patterns inside the data
Used during the data exploration steps
Typical questions answered by descriptive data mining
  what is in the data?
  what does it look like?
  are there any unusual patterns?
  what does the data suggest for customer segmentation?
Users may have no idea which kinds of patterns may be interesting

26

Descriptive Data Mining (2)

Patterns at various granularities
  geography: country - city - region - street
  student: university - faculty - department - minor
Functionalities of descriptive data mining
  Clustering
    Ex: customer segmentation
  Summarization
  Visualization
  Association
    Ex: market basket analysis

27

A model is a black box

[Figure: inputs X1, X2, ... -> Model -> output Y]

X: vector of independent variables or inputs
Y = f(X): an unknown function
Y: dependent variable or output; a single variable or a vector

The user does not care what the model is doing: it is a black box. The user is interested in the accuracy of its predictions.

28

Predictive Data Mining (1)

Using known examples, the model is trained: the unknown function is learned from data
The more data with known outcomes is available, the better the predictive power of the model
Used to predict outcomes whose inputs are known but whose output values are not realized yet
Never 100% accurate

29

Predictive Data Mining (2)

The performance of a model on past data, where the outcomes are already known, is not the important measure

Its performance on unknown data is much more important

30

Typical questions answered by predictive models

Who is likely to respond to our next offer, based on the history of previous marketing campaigns?
Which customers are likely to leave in the next six months?
Which transactions are likely to be fraudulent, based on known examples of fraud?
What is the total amount a customer will spend in the next month?

31

Data Mining Functionalities (1)

Concept description: characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., big spenders vs. budget spenders

Association (correlation and causality)
  Multi-dimensional vs. single-dimensional association
  age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]
  contains(T, “computer”) => contains(T, “software”) [support = 1%, confidence = 75%]

32


Classification and prediction
  Finding models (functions) that describe and distinguish classes or concepts for future prediction
  E.g., classify people as healthy or sick, or classify transactions as fraudulent or not
  Methods: decision tree, classification rule, neural network
  Prediction: predict some unknown or missing numerical values

Cluster analysis
  Class label is unknown: group data to form new classes, e.g., cluster customers of a retail company to learn about the characteristics of different segments
  Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity

33


Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the data
  It can be considered as noise or exception but is quite useful in fraud detection and rare events analysis

Trend and evolution analysis
  Trend and deviation: regression analysis
  Sequential pattern mining: click stream analysis
  Similarity-based analysis

Other pattern-directed or statistical analyses

34

Concept Description: Characterization and Discrimination

Data classes or concepts
  classes of items for sale: computers, printers
  concepts of customers: bigSpenders, budgetSpenders

35

Data Characterization

Summarizing the data of the class under study (target class)
Methods
  SQL queries
  OLAP roll-up operation: user-controlled data summarization along a specified dimension
  attribute-oriented induction: without step-by-step user interaction
Output of characterization
  pie charts, bar charts, curves, multidimensional data cubes, or cross tabs
  in rule form, as characteristic rules

36

Characterization example

Description summarizing the characteristics of customers who spend more than $1000 a year at AllElectronics
  age, employment, income
Drill down on any dimension
  e.g., on occupation: view these customers according to their type of employment

37

Data Discrimination

Comparing the target class with one or a set of comparative classes (contrasting classes)
  these classes can be specified by the user through database queries
Methods and output
  similar to those used for characterization
  include comparative measures to distinguish between the target and contrasting classes

38

Discrimination examples

Example 1: Compare the general features of software products
  whose sales increased by 10% in the last year (target class)
  whose sales decreased by at least 30% during the same period (contrasting class)
Example 2: Compare two groups of AE customers
  I) those who shop for computer products regularly (target class): more than two times a month
  II) those who rarely shop for such products (contrasting class): less than three times a year
The resulting description:
  80% of group I customers: university education, ages 20-40
  60% of group II customers: seniors or young, no university degree

39

Multidimensional Data

Sales according to region, month and product type

[Figure: data cube with axes Product, Region, Month]

Dimensions: Product, Location, Time

Hierarchical summarization paths

  Product    Location   Time
  Industry   Region     Year
  Category   Country    Quarter
  Product    City       Month, Week
             Office     Day
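Summarizing along these hierarchies (a roll-up) can be sketched as a grouped sum. The sales facts, the city-to-country mapping and the function names below are hypothetical; the sketch only shows the idea of climbing from city to country on the Location dimension and from month to quarter on the Time dimension.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, city, month, amount).
sales = [
    ("printer", "Istanbul", "2011-01", 120.0),
    ("printer", "Istanbul", "2011-02", 80.0),
    ("printer", "Ankara", "2011-01", 50.0),
    ("computer", "Istanbul", "2011-04", 900.0),
]

# Assumed concept hierarchy for the Location dimension: city -> country.
CITY_TO_COUNTRY = {"Istanbul": "Turkey", "Ankara": "Turkey"}


def month_to_quarter(month):
    """Map 'YYYY-MM' to 'YYYY-Qn' (Time hierarchy: month -> quarter)."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"


def roll_up(facts):
    """Aggregate sales from (product, city, month) to (product, country, quarter)."""
    totals = defaultdict(float)
    for product, city, month, amount in facts:
        key = (product, CITY_TO_COUNTRY[city], month_to_quarter(month))
        totals[key] += amount
    return dict(totals)


cube = roll_up(sales)
```

Drilling down is the reverse move: regrouping the same facts by a finer key such as (product, city, month).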

40

Association Analysis

Discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data

Widely used for market basket and transaction data analysis

More formally, X => Y, that is, A1 ^ A2 ^ ... ^ Ak => B1 ^ B2 ^ ... ^ Bl

Ai, Bj are attribute-value pairs or predicates

41

Example: association analysis

From the AllElectronics (AE) database:

age(X, “20..29”) ^ income(X, “1,000..2,000”) => buys(X, “CD player”)   (support = 2%, confidence = 60%)

X is a variable representing a customer
2% of the AE customers are between 20 and 29 years old, have incomes ranging from 1 to 2 billion TL, and buy a CD player
There is a 60% probability that customers in those age and income groups will buy a CD player
A multidimensional association rule contains more than one attribute or predicate

42

Market basket analysis

Customers' buying behaviour is investigated

Based only on the transaction data
  no information about customer properties: age, income

Managers are interested in which products or product groups are sold together

43

Transactional Database

Transaction ID   Item List
10001            Computer, CD, printer
10002            Plotter, monitor, mouse
10003            Computer, DVD player
10004            Printer
10005            Plotter, UPS, modem

44

Example: basket analysis rule

buys(computer) => buys(printer)   (support = 1%, confidence = 60%)

1% of all transactions contain both computer and printer
If a transaction contains computer, there is a 60% chance that it contains printer as well
A single-dimensional association rule contains a single predicate
An association rule is interesting if
  its support exceeds a minimum threshold, and
  its confidence exceeds a minimum threshold
These minimum values are set by specialists
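Support and confidence can be computed directly from a transaction table. The sketch below uses the five transactions of the Transactional Database slide, so the numbers it produces (20% support, 50% confidence for computer => printer) differ from the 1% and 60% quoted for the full database. The function names and the thresholds in the last call are illustrative assumptions.

```python
# The five transactions from the transactional database (item names lower-cased).
transactions = {
    10001: {"computer", "cd", "printer"},
    10002: {"plotter", "monitor", "mouse"},
    10003: {"computer", "dvd player"},
    10004: {"printer"},
    10005: {"plotter", "ups", "modem"},
}


def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)


def confidence(antecedent, consequent, db):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    base = support(antecedent, db)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(set(antecedent) | set(consequent), db) / base


def is_interesting(antecedent, consequent, db, min_support, min_confidence):
    """A rule is kept only if it clears both user-set thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(both, db) >= min_support
            and confidence(antecedent, consequent, db) >= min_confidence)


rule_support = support({"computer", "printer"}, transactions)
rule_confidence = confidence({"computer"}, {"printer"}, transactions)
keep_rule = is_interesting({"computer"}, {"printer"}, transactions,
                           min_support=0.1, min_confidence=0.6)
```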

45

Classification

Learning is supervised
Dependent variable is categorical
Build a model able to assign new instances to one of a set of well-defined classes

46

Typical Classification Problems

Given the characteristics of individuals, differentiate those who have suffered a heart attack from those who have not

Determine if a credit card purchase is fraudulent

Classify a car loan applicant as a good or a poor credit risk

47

Methods of Classification

Decision trees
Artificial neural networks
Bayesian classification
  Naïve Bayes
  Belief networks
k-nearest neighbor
Regression
  Logistic (logit), probit
  Predicts the probability of each class when the dependent variable is categorical: good customer vs. bad customer, or employed vs. unemployed

48

Steps of the classification process

(1) Train the model
  using a training set: data objects whose class labels are known
(2) Test the model
  on a test sample whose class labels are known but were not used for training the model
(3) Use the model for classification
  on new data whose class labels are unknown

49

An example - classification

Historical data (training set): each customer's type is known, i.e., each customer has a label

  CustID  age  income  education  type
  1       35   800     undergrad  risky
  2       26   600     HighSch    risky
  3       48   1200    grad       normal
  8       52   2500    undergrad  good
  44      29   1700    HighSch    good

Testing set: labels are also known but were not used in training the model

  CustID  age  income  education  type
  17      43   550     Ph.D.      risky
  27      68   1650    grad       normal

New customers: their type has to be estimated; each new customer has to be classified as risky, normal or good

  CustID  age  income  education  type
  11      36   850     undergrad  ?
  27      28   1650    grad       ?

50

An example – classification cont.

Based on the historical data, develop a classification model
  decision tree, neural network, regression, ...
Test the performance of the model on a portion of the historical data
If the accuracy of the model is satisfactory, use the model on the new customers 11 and 27 to assign a type to these new customers
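The train, test and use steps can be walked through on the customer tables of this example. The sketch below swaps in a 1-nearest-neighbour classifier purely for brevity (the slides name decision trees, neural networks and regression instead), uses only age and income, and leaves the variables unscaled, so income dominates the distance. With two test customers the accuracy figure is illustrative only.

```python
import math

# (age, income) -> type, copied from the historical, testing and new-customer tables.
training = [((35, 800), "risky"), ((26, 600), "risky"), ((48, 1200), "normal"),
            ((52, 2500), "good"), ((29, 1700), "good")]
testing = [((43, 550), "risky"), ((68, 1650), "normal")]
new_customers = {11: (36, 850), 27: (28, 1650)}


def classify(features, labelled):
    """1-nearest-neighbour: return the label of the closest training example."""
    _, label = min(labelled, key=lambda example: math.dist(features, example[0]))
    return label


def accuracy(labelled_test, labelled_train):
    """Share of test examples whose predicted label matches the known label."""
    hits = sum(classify(x, labelled_train) == y for x, y in labelled_test)
    return hits / len(labelled_test)


# Step 1 (train) is implicit for nearest neighbour: the model is the training set.
test_accuracy = accuracy(testing, training)        # step 2: test
predictions = {cust_id: classify(x, training)      # step 3: use on new data
               for cust_id, x in new_customers.items()}
```

On this toy data one of the two test customers is misclassified, which in practice would be a reason to revisit the model (for example by scaling the variables or adding education) before using it on new customers.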

51

Example AE customers

[Figure: scatter plot of AE customers by yearly income and age, each point marked as good or risky]

52

Example AE customers

[Figure: the same scatter plot of yearly income against age, with good and risky customers and a new customer marked "?"]

Assign the new customer whose type is unknown to either * or +

53

Solution

rule: IF yearly income> and age> THEN good ELSE risky

[Figure: the plane of x1 (yearly income) and x2 (age), split into a good region and a risky region]

54

Credit Card Promotion Policy

Credit card companies
  Promotional offerings with their monthly credit card billing
  Offers provide the opportunity to purchase items such as magazines, …
A data mining study
  Predict individual behaviour
  What is the likelihood that an individual takes advantage of the promotions, based on individual characteristics, credit history, ...?
  Expected reduction in postage, paper and processing costs for the credit card company

55

Credit Card Promotion Database

  Income Range  Magazine Promotion  Watch Promotion  Life Insurance Promotion  Gender  Age  Credit Card Insurance
  40-50 K       Yes                 No               No                        Male    45   No
  30-40 K       Yes                 Yes              Yes                       Female  40   No
  40-50 K       No                  No               No                        Male    42   No
  30-40 K       Yes                 Yes              Yes                       Male    43   Yes
  50-60 K       Yes                 No               Yes                       Female  38   No
  20-30 K       No                  No               No                        Female  55   No
  30-40 K       Yes                 No               Yes                       Male    35   Yes
  20-30 K       No                  Yes              No                        Male    27   No
  30-40 K       Yes                 No               No                        Male    43   No
  40-50 K       No                  Yes              Yes                       Female  43   No
  20-30 K       No                  Yes              Yes                       Male    29   No
  40-50 K       No                  Yes              No                        Male    55   No
  20-30 K       No                  No               Yes                       Female  19   Yes

56

Decision Trees for Credit Card Insurance Database

Dependent variable: Life Insurance Promotion

age
  > 43: N 3, Y 0. Decision: No
  <= 43: Gender
    Female: N 0, Y 6. Decision: Yes
    Male: Cr Ins
      No: N 4, Y 1. Decision: No
      Yes: Y 2, N 0. Decision: Yes

The critical value of 43 is determined by the algorithm

A Production Rule from the Tree

IF (age <= 43) & (Sex = Male) & (Credit Card Ins = No) THEN Life Insurance Promotion = No
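A production rule can be checked against the data it was derived from. The sketch below applies the rule to the 13 rows listed in the Credit Card Promotion Database above; because that listing is shorter than the data set the tree counts refer to, the coverage found here (4 customers, 3 of them predicted correctly) differs slightly from the leaf counts on the tree. The function and variable names are illustrative.

```python
# Rows as listed in the Credit Card Promotion Database:
# (life insurance promotion, gender, age, credit card insurance).
rows = [
    ("No", "Male", 45, "No"), ("Yes", "Female", 40, "No"),
    ("No", "Male", 42, "No"), ("Yes", "Male", 43, "Yes"),
    ("Yes", "Female", 38, "No"), ("No", "Female", 55, "No"),
    ("Yes", "Male", 35, "Yes"), ("No", "Male", 27, "No"),
    ("No", "Male", 43, "No"), ("Yes", "Female", 43, "No"),
    ("Yes", "Male", 29, "No"), ("No", "Male", 55, "No"),
    ("Yes", "Female", 19, "Yes"),
]


def rule_applies(gender, age, cc_insurance):
    """Antecedent of the rule: age <= 43, male, no credit card insurance."""
    return age <= 43 and gender == "Male" and cc_insurance == "No"


# Customers covered by the rule, and those for which its conclusion
# (Life Insurance Promotion = No) is actually true.
covered = [r for r in rows if rule_applies(r[1], r[2], r[3])]
correct = [r for r in covered if r[0] == "No"]
rule_accuracy = len(correct) / len(covered)
```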

57

Artificial Neural Networks

Set of interconnected nodes designed to imitate the functioning of the human brain

Feed-forward network Supervised learner model

58

For the promotion example

Encode all variables
  Assign a numerical value even for qualitative variables such as sex
  Say X1 represents gender:
    Male: X1 = 1
    Female: X1 = 0

59

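[Figure: feed-forward network with an input layer (X1 = +1, X2 = 0, X3 = 0.5, X4 = -1), a hidden layer and an output layer; example weights W1,5 = 0.014 and W5,9 = -0.17]

(1 - 0.78)^2 is the squared error: 1 is the actual value of O9 for a particular data object, 0.78 is the predicted value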
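The forward computation of a single node and the squared error can be sketched as below. Several things here are assumptions rather than content of the slide: the sigmoid activation, the absence of a bias term, and every weight except W1,5 = 0.014. The inputs X1..X4 and the values 1 and 0.78 are taken from the figure.

```python
import math


def sigmoid(z):
    """Logistic activation; the slide does not say which activation is used."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights):
    """Output of one node: activation of the weighted sum of its inputs."""
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)))


def squared_error(actual, predicted):
    """Squared difference between the actual and the predicted output."""
    return (actual - predicted) ** 2


inputs = [1.0, 0.0, 0.5, -1.0]        # X1..X4 from the figure
# Weights into hidden node 5: only W1,5 = 0.014 is from the slide,
# the other three values are made up for illustration.
weights_into_5 = [0.014, 0.2, -0.1, 0.05]
hidden_5 = node_output(inputs, weights_into_5)

error = squared_error(actual=1.0, predicted=0.78)  # (1 - 0.78)^2
```

Training adjusts the weights so that this error, summed over the training objects, becomes smaller.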

60

Weights updating

Weights between nodes are adjusted so as to reduce error

Details of the training process for neural networks are not important for the time being

61

Estimation-Prediction

Similar to classification
Output is a continuous variable
Estimation: current value
Prediction: future outcome rather than current behavior

62

Typical Estimation-Prediction Problems

Estimate the salary of an individual who owns a sports car

Predict next week's closing price for the IMKB100 index

Forecast the next day's temperature

63

Prediction methods

Artificial neural networks
Linear regression
  Y_i = a_0 + a_1 X_{1,i} + a_2 X_{2,i} + ... + a_k X_{k,i} + u_i
Non-linear regression
  Y_i = f(X_{1,i}, X_{2,i}, ..., X_{k,i}; a_1, a_2, ..., a_k, u_i)
Generalized linear regression
  logistic (logit), probit
  Poisson regression for count variables
Regression trees
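For the linear regression case with a single input (k = 1), the coefficients a0 and a1 can be estimated by ordinary least squares in closed form. The sketch below is a minimal illustration on hypothetical data that lie exactly on the line Y = 1 + 2X, so the error term is zero; the function name is an assumption.

```python
import statistics


def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for Y = a0 + a1*X with one input variable."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx               # slope
    a0 = mean_y - a1 * mean_x    # intercept
    return a0, a1


incomes = [1.0, 2.0, 3.0, 4.0]     # hypothetical inputs X
expenses = [3.0, 5.0, 7.0, 9.0]    # hypothetical outputs Y, exactly 1 + 2*X
a0, a1 = fit_simple_linear_regression(incomes, expenses)
predicted = a0 + a1 * 5.0          # prediction for a new input X = 5
```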

64

Example: Prediction and Classification

Classification is used to classify customers applying for credit cards
  known class labels: risky, reliable
  when a new customer applies, her class is predicted by looking at her characteristics: income, age, education, wealth, region, ...
Prediction: the monthly expense of a new customer (a real, continuous variable) is predicted based on personal information
  independent variables: income, education, wealth, profession, ...
  some are numeric, some categorical

65

Cluster Analysis

Class label is unknown: group data to form new classes and assign class labels to each data object
  the unknown labels are generated by the clustering model
  e.g., cluster customers to find customer segments
Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
  objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters
There may be a hierarchy of classes

66

Example: Clustering

Can be performed on AE customer data to identify homogeneous subpopulations of customers

These may represent individual target groups for marketing

67

[Figure: scatter plot of customers by income and distance to the store, with groups labelled Type 1, Type 2 and Type 3]

Clustering according to income and distance to store: three clusters of data points are evident

68

Outlier Analysis

Outlier: a data object that does not comply with the general behavior of the data

It can be considered as noise or exception but is quite useful in fraud detection and rare events analysis

Detected using
  statistical tests
  distance measures
  visually inspecting the data

Examples:

69

Reasons for outliers

Measurement errors, coding errors
  age is entered as 999
Nature of the data
  the salary of the general manager is much higher than that of the other employees
  in a crisis, the interest rate was on the order of 1000s
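A coding error such as an age of 999 can be flagged with a simple robust rule. The sketch below uses the modified z-score based on the median absolute deviation (MAD), which is one common choice and not necessarily the test intended in the course; the sample ages, the function name and the cut-off of 3.5 are assumptions.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score, 0.6745 * |x - median| / MAD,
    exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so the score is undefined
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


ages = [23, 35, 41, 29, 999, 52, 38]   # hypothetical sample with one coding error
flagged = mad_outliers(ages)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value inflates the latter two and can hide itself in a small sample. Whether a flagged value is an error or a genuine extreme, such as the general manager's salary, still has to be judged from the context.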

70

Evolution Analysis

Describes and models regularities or trends for objects whose behavior changes over time

Distinct features include
  Trend and deviation: time-series data analysis
  Sequential pattern mining, periodicity analysis
  Similarity-based analysis

Example
  Stock market predictions: future stock prices for overall stocks (indexes) or individual company stocks

71

Sequential Pattern Analysis

Determine sequential patterns in data
Based on the time sequence of actions
Similar to associations, but the relationship is based on time
Example 1: buy CD player today => buy CD within one week
Example 2: In what sequence are the web pages of an e-business company accessed?
  70% of visitors follow A -> B -> C, or A -> D -> B -> C, or A -> E -> B -> C
  The site owner then decides to add a link directly from page A to page C
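The share of visitors who follow a given page sequence can be counted from click-stream sessions. The sessions below are hypothetical and were chosen so that 70% of them reach C from A via B, as in Example 2; the function names are illustrative. A pattern is counted when its pages occur in the session in the given order, with other pages allowed in between.

```python
# Hypothetical click-stream sessions: ordered page visits, one list per visitor.
sessions = [
    ["A", "B", "C"],
    ["A", "D", "B", "C"],
    ["A", "E", "B", "C"],
    ["A", "B", "C"],
    ["A", "D"],
    ["B", "C"],
    ["A", "B", "C"],
    ["A", "E", "B", "C"],
    ["C", "A"],
    ["A", "D", "B", "C"],
]


def contains_subsequence(session, pattern):
    """True if the pages of pattern occur in session in this order (gaps allowed)."""
    remaining = iter(session)
    # `page in remaining` advances the iterator, so the order is enforced.
    return all(page in remaining for page in pattern)


def sequence_support(db, pattern):
    """Fraction of sessions that contain the sequential pattern."""
    return sum(contains_subsequence(s, pattern) for s in db) / len(db)


support_a_b_c = sequence_support(sessions, ["A", "B", "C"])
support_a_c = sequence_support(sessions, ["A", "C"])
```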

72







73

Are All the “Discovered” Patterns Interesting?

A data mining system/query may generate thousands of patterns; not all of them are interesting.

Are all patterns interesting?
  Typically not: only a small fraction of patterns are interesting to any given user

Interestingness measures: a pattern is interesting if it is
  easily understood by humans,
  valid on new or test data with some degree of certainty,
  potentially useful,
  novel,
or if it validates some hypothesis that a user seeks to confirm

74

Objective vs. subjective interestingness measures:

Objective: based on statistics and structures of patterns, e.g.,
  support of X => Y: P(X and Y), the probability that a transaction contains both X and Y
  confidence: the degree of certainty of the detected association, P(Y | X), the conditional probability that a transaction containing X also contains Y
  thresholds: controlled by the user
    ex: rules that do not satisfy a confidence threshold of 50% are uninteresting
Subjective: based on the user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

75






Business Applications of data mining

76

Potential Business Applications

Market analysis and management
  target marketing, customer relation management, market basket analysis, cross selling, market segmentation
Risk analysis and management
  Banks assume a financial risk when they grant loans; risk models attempt to predict the probability of default, i.e., failure to pay back the borrowed amount
  Credit cards
  Insurance companies
Fraud detection and management
Other applications
  Text mining (news groups, email, documents) and Web analysis
  Intelligent query answering

77

Market Analysis and Management (1)

Where are the data sources for analysis?
  Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies, clickstreams
Customer profiling and segmentation
  Data mining can tell you what types of customers buy what products (clustering or classification)
Target marketing
  Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

78


Effectiveness of sales campaigns
  Advertisements, coupons, discounts and bonuses promote products and attract customers, which can help improve profits
  Compare the amount of sales and the number of transactions during the sales period with those before or after the sales campaign
  Association analysis: which items are likely to be purchased together with the items on sale

79


Customer retention: analysis of customer loyalty
  Sequences of purchases of particular customers: goods purchased at different periods by the same customers can be grouped into sequences
  Changes in customer consumption or loyalty suggest adjustments to the pricing and variety of goods, to retain old customers and attract new customers
Cross-selling and up-selling
  Associations from sales records: a customer who buys a PC is likely to buy a printer
  Purchase recommendations

80

Fraud Detection and Management

Applications
  widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach
  use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
Examples
  Credit card transactions: the FALCON fraud assessment system by HNC Inc. signals possibly fraudulent credit card transactions
  Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  Detecting telephone fraud: ASPECT European Research Group
    Unsupervised clustering to detect fraud in mobile phone networks
    Telephone call model: destination of the call, duration, time of day or week
    Analyze patterns that deviate from an expected norm

81

Health Care

Storing patients' records in electronic format and developments in medical information systems yield a large amount of clinical data
Regularities, trends and surprising events are extracted by data mining methods
  ANN, temporal reasoning
  assist clinicians in making informed decisions and improve health services
MERCK-MEDCO Managed Care (pharmaceutical insurance … company)
  Uncover less expensive but equally effective drug treatments

82

Financial Data Analysis

Financial data
  complete, reliable, high quality
Loan payment prediction and customer credit policy analysis

83

Loan payment prediction and customer credit policy analysis

Factors influencing loan payment performance
  loan-to-value ratio
  term of the loan
  debt ratio (total monthly debt / total monthly income)
  payment-to-income ratio
  income level
  education level
  residence region
  credit history
Analysis may find that the payment-to-income ratio is a dominant factor while education level and debt ratio are not

84

Risk Management and Insurance

Determine insurance rates
Manage investment portfolios
Differentiate between companies and/or individuals who are good and poor credit risks
Farmer's Group discovered a scenario:
  Someone who owns a sports car is not a higher accident risk
  Conditions: the sports car is a second car, and the family car is a station wagon or a sedan

85

Data Mining for the Telecommunication Industry

Telecommunication data are multidimensional
  calling time, duration, location of caller, location of callee, type of call
  used to identify and compare: data traffic, system workload, resource usage, user group behavior, profit
Fraudulent pattern analysis and identification of unusual patterns
To achieve customer loyalty
  characteristics of customers affecting line usage

86

Other Applications

Sports and gaming
  Predicting the outcome of football games
Text mining
  Spam detection
Internet: Web mining
  Web usage mining
    Improve link structure
    Recommender systems
  Web structure mining: mining the link structure of the Web

87

Other Applications

Educational data mining
  Clustering students
  Design of entrance exams, selection policies
Human resources
  How to select applicants
Online dating
  Recommendations to visitors

88

Summary

Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Classification of data mining systems
Major issues in data mining
