[PPT]DATA WAREHOUSING AND DATA MINING - Prince...


  • Data Mining Tools

    Overview & Tutorial

    Ahmed Sameh

    Prince Sultan University

    Department of Computer Science & Info Sys

    May 2010

    (Some slides belong to IBM)

    *

  • *

    Introduction Outline

    Define data mining
    Data mining vs. databases
    Basic data mining tasks
    Data mining development
    Data mining issues

    Goal: Provide an overview of data mining.

  • *

    Introduction

    Data is growing at a phenomenal rate
    Users expect more sophisticated information
    How?

    UNCOVER HIDDEN INFORMATION

    DATA MINING

  • *

    Data Mining Definition

    Finding hidden information in a database
    Fit data to a model
    Similar terms: exploratory data analysis, data driven discovery, deductive learning

  • *

    Data Mining Algorithm

    Objective: fit data to a model (descriptive or predictive)
    Preference: technique to choose the best model
    Search: technique to search the data ("query")

  • *

    Database Processing vs. Data Mining Processing

    Database processing:
    Query: well defined, SQL
    Data: operational data
    Output: precise, a subset of the database

    Data mining processing:
    Query: poorly defined, no precise query language
    Data: not operational data
    Output: fuzzy, not a subset of the database

  • *

    Query Examples

    Database

    Data Mining

    Find all customers who have purchased milk

    Find all items which are frequently purchased with milk. (association rules)

    Find all credit applicants with last name of Smith.

    Identify customers who have purchased more than $10,000 in the last month.

    Find all credit applicants who are poor credit risks. (classification)

    Identify customers with similar buying habits. (Clustering)
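    The contrast in these examples can be made concrete in code. Below is a minimal sketch (customer names and spend figures are invented) showing a database-style query that returns a precise subset alongside a crude mining-style grouping of customers with similar habits:

```python
# Toy contrast between a database query (precise, well-defined subset)
# and a mining task (discovering groups). Names and figures are invented.
customers = [
    {"name": "Smith", "monthly_spend": 12000},
    {"name": "Jones", "monthly_spend": 300},
    {"name": "Smith", "monthly_spend": 450},
    {"name": "Lee",   "monthly_spend": 11000},
]

# Database-style query: "customers who purchased more than $10,000".
big_spenders = [c["name"] for c in customers if c["monthly_spend"] > 10000]

# Mining-style task: group customers with similar buying habits
# (a crude one-dimensional clustering by a spend threshold).
def cluster_by_spend(custs, threshold=5000):
    clusters = {"high": [], "low": []}
    for c in custs:
        bucket = "high" if c["monthly_spend"] >= threshold else "low"
        clusters[bucket].append(c["name"])
    return clusters

clusters = cluster_by_spend(customers)   # fuzzy grouping, not a subset
```

    Real clustering algorithms learn the groups from the data rather than from a fixed threshold; the sketch only illustrates the difference in the kind of question being asked.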

  • *

    Related Fields

    Statistics

    Machine

    Learning

    Databases

    Visualization

    Data Mining and

    Knowledge Discovery

    *

  • *

    Statistics, Machine Learning and Data Mining

    Statistics: more theory-based; more focused on testing hypotheses
    Machine learning: more heuristic; focused on improving performance of a learning agent; also looks at real-time learning and robotics -- areas not part of data mining
    Data mining and knowledge discovery: integrates theory and heuristics; focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
    Distinctions are fuzzy

  • Definition

    A class of database applications that analyze data in a database using tools which look for trends or anomalies. IBM was an early pioneer of the field.

  • Purpose

    To look for hidden patterns or previously unknown relationships among the data in a group of data that can be used to predict future behavior. Ex: data mining software can help retail companies find customers with common interests.

  • Background Information

    Many of the techniques used by today's data mining tools have been around for many years, having originated in the artificial intelligence research of the 1980s and early 1990s. Data Mining tools are only now being applied to large-scale database systems.

  • The Need for Data Mining

    The amount of raw data stored in corporate data warehouses is growing rapidly. There is too much data, and too much complexity, to determine manually what is relevant to a specific problem. Data mining promises to bridge the analytical gap by giving knowledge workers the tools to navigate this complex analytical space.

  • The Need for Data Mining, cont

    The need for information has resulted in the proliferation of data warehouses that integrate information from multiple sources to support decision making. They often include data from external sources, such as customer demographics and household information.

  • Definition (Cont.)

    Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

    Valid: The patterns hold in general.

    Novel: We did not know the pattern beforehand.

    Useful: We can devise actions from the patterns.

    Understandable: We can interpret and comprehend the patterns.

  • Of laws, Monsters, and Giants

    Moore's law: processing capacity doubles every 18 months (CPU, cache, memory)
    Its more aggressive cousin: disk storage capacity doubles every 9 months

  • What is Data Mining?

    Finding interesting structure in data

    Structure: refers to statistical patterns, predictive models, hidden relationships

    Examples of tasks addressed by data mining:
    Predictive modeling (classification, regression)
    Segmentation (data clustering)
    Summarization
    Visualization

  • *

    Major Application Areas for
    Data Mining Solutions

    Advertising
    Bioinformatics
    Customer Relationship Management (CRM)
    Database Marketing
    Fraud Detection
    eCommerce
    Health Care
    Investment/Securities
    Manufacturing, Process Control
    Sports and Entertainment
    Telecommunications
    Web

  • *

    Data Mining

    The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets:
    Extremely large datasets
    Discovery of the non-obvious
    Useful knowledge that can improve processes
    Cannot be done manually
    Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind.
    Sophisticated data search capability that uses statistical algorithms to discover patterns and correlations in data.

  • *

    Data Mining (cont.)

  • *

    Data Mining (cont.)

    Data mining is one step of the Knowledge Discovery in Databases (KDD) process:
    Data warehousing
    Data selection
    Data preprocessing
    Data transformation
    Data mining
    Interpretation/evaluation
    Data mining is sometimes referred to as KDD; DM and KDD tend to be used as synonyms.

  • *

    Data Mining Evaluation

  • *

    Data Mining is Not

    Data warehousing
    SQL / ad hoc queries / reporting
    Software agents
    Online Analytical Processing (OLAP)
    Data visualization

  • *

    Data Mining Motivation

    Changes in the business environment:
    Customers becoming more demanding
    Markets are saturated
    Databases today are huge:
    More than 1,000,000 entities/records/rows
    From 10 to 10,000 fields/attributes/variables
    Gigabytes and terabytes
    Databases are growing at an unprecedented rate
    Decisions must be made rapidly
    Decisions must be made with maximum knowledge

  • Why Use Data Mining Today?

    Human analysis skills are inadequate:

    Volume and dimensionality of the data
    High data growth rate

    Availability of:

    Data
    Storage
    Computational power
    Off-the-shelf software
    Expertise

  • An Abundance of Data

    Supermarket scanners, POS data
    Preferred customer cards
    Credit card transactions
    Direct mail response
    Call center records
    ATM machines
    Demographic data
    Sensor networks
    Cameras
    Web server logs
    Customer web site trails

  • Evolution of Database Technology

    1960s: IMS, network model
    1970s: the relational data model, first relational DBMS implementations
    1980s: maturing RDBMS, application-specific DBMS (spatial data, scientific data, image data, etc.), OODBMS
    1990s: mature, high-performance RDBMS technology, parallel DBMS, terabyte data warehouses, object-relational DBMS, middleware and web technology
    2000s: high availability, zero-administration, seamless integration into business processes
    2010: sensor database systems, databases on embedded systems, P2P database systems, large-scale pub/sub systems, ???

  • Much Commercial Support

    Many data mining tools: http://www.kdnuggets.com/software
    Database systems with data mining support
    Visualization tools
    Data mining process support
    Consultants

  • Why Use Data Mining Today?

    Competitive pressure!

    The secret of success is to know something that nobody else knows.

    Aristotle Onassis

    Competition on service, not only on price (banks, phone companies, hotel chains, rental car companies)
    Personalization, CRM
    The real-time enterprise
    Systemic listening
    Security, homeland defense

  • The Knowledge Discovery Process

    Steps:

    Identify business problem

    Data mining

    Action

    Evaluation and measurement

    Deployment and integration into businesses processes

  • Data Mining Step in Detail

    2.1 Data preprocessing

    Data selection: identify target datasets and relevant fields
    Data cleaning: remove noise and outliers
    Data transformation: create common units; generate new fields

    2.2 Data mining model construction

    2.3 Model evaluation
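    The preprocessing sub-steps in 2.1 can be sketched as follows. The field name `kwh`, the readings, and the outlier rule (drop values above three times the median) are all invented for illustration; real pipelines use domain-appropriate rules:

```python
import statistics

# Toy raw records: one numeric field with a missing value and an outlier.
raw = [{"kwh": 10.0}, {"kwh": 12.0}, {"kwh": 11.0},
       {"kwh": 500.0}, {"kwh": None}]

# Data selection / cleaning: keep the relevant field, drop missing values.
values = [r["kwh"] for r in raw if r["kwh"] is not None]

# Remove outliers with a crude median-based cut (one simple choice).
med = statistics.median(values)
clean = [v for v in values if v <= 3 * med]

# Data transformation: rescale to [0, 1] so fields share common units.
lo, hi = min(clean), max(clean)
scaled = [(v - lo) / (hi - lo) for v in clean]
```

    A median-based cut is used instead of a mean/standard-deviation rule because a single extreme value inflates the standard deviation and can hide itself.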

  • Preprocessing and Mining

    Original Data

    Target
    Data

    Preprocessed
    Data

    Patterns

    Knowledge

    Data
    Integration
    and Selection

    Preprocessing

    Model
    Construction

    Interpretation

  • *

    Data Mining Techniques

  • *

    Data Mining Models and Tasks

  • *

    Basic Data Mining Tasks

    Classification maps data into predefined groups or classes (supervised learning, pattern recognition, prediction).
    Regression is used to map a data item to a real-valued prediction variable.
    Clustering groups similar data together into clusters (unsupervised learning, segmentation, partitioning).
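    As a concrete illustration of the classification task above, here is a minimal sketch of a 1-nearest-neighbour rule; the training points and the risk labels are invented toy data:

```python
# Minimal sketch of classification: a 1-nearest-neighbour rule mapping
# a point into predefined classes. Points and risk labels are invented.
train = [((1.0, 1.0), "low-risk"), ((1.2, 0.8), "low-risk"),
         ((6.0, 7.0), "high-risk"), ((6.5, 6.0), "high-risk")]

def classify(point):
    def dist2(a, b):
        # Squared Euclidean distance (no sqrt needed for comparison).
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Predict the label of the closest training example.
    return min(train, key=lambda example: dist2(example[0], point))[1]

label = classify((5.5, 6.5))    # lands near the high-risk examples
```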

  • *

    Basic Data Mining Tasks (contd)

    Summarization maps data into subsets with associated simple descriptions (characterization, generalization).
    Link analysis uncovers relationships among data (affinity analysis, association rules).
    Sequential analysis determines sequential patterns.
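    The association-rule flavour of link analysis rests on two measures, support and confidence, which can be sketched over invented market-basket transactions:

```python
# Minimal sketch of association-rule measures (support, confidence)
# over invented market-basket transactions.
transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"milk", "cereal"}, {"bread", "butter"}, {"milk", "bread"},
]

def support(itemset):
    # Fraction of baskets containing every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    # Of the baskets containing lhs, the fraction also containing rhs.
    return support(lhs | rhs) / support(lhs)

# Rule "milk => bread": how often it holds in the data.
rule_support = support({"milk", "bread"})         # 3 of 5 baskets
rule_confidence = confidence({"milk"}, {"bread"})
```

    Algorithms such as Apriori avoid scanning every candidate itemset like this; the sketch only shows what the measures mean.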

  • *

    Ex: Time Series Analysis

    Example: stock market
    Predict future values
    Determine similar patterns over time
    Classify behavior

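    The "predict future values" task can be sketched with the simplest possible model, a moving-average forecast; the price series is invented and real time-series models (ARIMA, exponential smoothing, etc.) are far more sophisticated:

```python
# Minimal sketch of time-series prediction: forecast the next value as
# the moving average of the last k observations. Prices are invented.
prices = [10, 11, 13, 12, 14, 15]

def forecast(series, k=3):
    window = series[-k:]          # the k most recent observations
    return sum(window) / k

next_value = forecast(prices)     # average of 12, 14, 15
```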

  • *

    Data Mining vs. KDD

    Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data.
    Data mining: the use of algorithms to extract the information and patterns derived by the KDD process.

  • *

    Data Mining Development

    Information retrieval: similarity measures, hierarchical clustering, IR systems, imprecise queries, textual data, web search engines

    Statistics: Bayes theorem, regression analysis, EM algorithm, K-means clustering, time series analysis

    Machine learning: neural networks, decision tree algorithms

    Algorithms: algorithm design techniques, algorithm analysis, data structures

    Databases: relational data model, SQL, association rule algorithms, data warehousing, scalability techniques

  • *

    KDD Issues

    Human interaction
    Overfitting
    Outliers
    Interpretation
    Visualization
    Large datasets
    High dimensionality

  • *

    KDD Issues (contd)

    Multimedia data
    Missing data
    Irrelevant data
    Noisy data
    Changing data
    Integration
    Application

  • *

    Visualization Techniques

    Graphical
    Geometric
    Icon-based
    Pixel-based
    Hierarchical
    Hybrid

  • *

    Data Mining Applications

  • *

    Data Mining Applications:
    Retail

    Performing basket analysis: which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions.
    Sales forecasting: examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?
    Database marketing: retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer-label clothing or those who attend sales. This information can be used to focus cost-effective promotions.
    Merchandise planning and allocation: when retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic characteristics. Retailers can also use data mining to determine the ideal layout for a specific store.

  • *

    Data Mining Applications:
    Banking

    Card marketing: by identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs, targeted product development, and customized pricing.
    Cardholder pricing and profitability: card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. Includes risk-based pricing.
    Fraud detection: fraud is enormously costly. By analyzing past transactions that were later determined to be fraudulent, banks can identify patterns.
    Predictive life-cycle management: DM helps banks predict each customer's lifetime value and service each segment appropriately (for example, offering special deals and discounts).

  • *

    Data Mining Applications:
    Telecommunication

    Call detail record analysis: telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.
    Customer loyalty: some customers repeatedly switch providers, or "churn", to take advantage of attractive incentives from competing companies. Companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling them to target their spending on customers who will produce the most profit.

  • *

    Data Mining Applications:
    Other Applications

    Customer segmentation: all industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis.
    Manufacturing: through choice boards, manufacturers are beginning to customize products for customers; therefore they must be able to predict which features should be bundled to meet customer demand.
    Warranties: manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims.
    Frequent flier incentives: airlines can identify groups of customers that can be given incentives to fly more.

  • *

    A producer wants to know.

  • *

    Data, Data everywhere
    yet ...

    I can't find the data I need: data is scattered over the network; many versions, subtle differences

    I can't get the data I need: need an expert to get the data

    I can't understand the data I found: available data poorly documented

    I can't use the data I found: results are unexpected; data needs to be transformed from one form to another

  • *

    What is a Data Warehouse?

    A single, complete, and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

    [Barry Devlin]

  • *

    What are the users saying...

    Data should be integrated across the enterprise
    Summary data has a real value to the organization
    Historical data holds the key to understanding data over time
    What-if capabilities are required

  • *

    What is Data Warehousing?

    A process of transforming data into information and making it available to users in a timely enough manner to make a difference

    [Forrester Research, April 1996]

  • *

    Very Large Data Bases

    Terabytes (10^12 bytes): Walmart -- 24 terabytes
    Petabytes (10^15 bytes): geographic information systems
    Exabytes (10^18 bytes): national medical records
    Zettabytes (10^21 bytes): weather images
    Yottabytes (10^24 bytes): intelligence agency videos

  • *

    Data Warehousing --
    It is a process

    A technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
    A decision support database maintained separately from the organization's operational database

  • *

    Data Warehouse

    A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.

    -- Bill Inmon, Building the Data Warehouse, 1996

  • Data Warehousing Concepts

    Decision support is key for companies wanting to turn their organizational data into an information asset

    Traditional database is transaction-oriented while data warehouse is data-retrieval optimized for decision-support

    Data Warehouse
    "A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process"

    OLAP (on-line analytical processing), Decision Support Systems (DSS), Executive Information Systems (EIS), and data mining applications

    *

  • What does data warehouse do?

    integrate diverse information from various systems which enable users to quickly produce powerful ad-hoc queries and perform complex analysis

    create an infrastructure for reusing the data in numerous ways

    create an open systems environment to make useful information easily accessible to authorized users

    help managers make informed decisions

    *

  • Benefits of Data Warehousing

    Potential high returns on investment
    Competitive advantage
    Increased productivity of corporate decision-makers

    *

  • Comparison of OLTP and Data Warehousing

    OLTP systems | Data warehousing systems
    Holds current data | Holds historic data
    Stores detailed data | Stores detailed, lightly, and highly summarized data
    Data is dynamic | Data is largely static
    Repetitive processing | Ad hoc, unstructured, and heuristic processing
    High level of transaction throughput | Medium to low transaction throughput
    Predictable pattern of usage | Unpredictable pattern of usage
    Transaction driven | Analysis driven
    Application oriented | Subject oriented
    Supports day-to-day decisions | Supports strategic decisions
    Serves large number of clerical/operational users | Serves relatively lower number of managerial users

    *

  • Data Warehouse Architecture

    Operational Data

    Load Manager

    Warehouse Manager

    Query Manager

    Detailed Data

    Lightly and Highly Summarized Data

    Archive / Backup Data

    Meta-Data

    End-user Access Tools

    *

  • End-user Access Tools

    Reporting and query tools
    Application development tools
    Executive Information System (EIS) tools
    Online Analytical Processing (OLAP) tools
    Data mining tools

    *

  • Data Warehousing Tools and Technologies

    Extraction, Cleansing, and Transformation Tools

    Data Warehouse DBMS

    Load performance

    Load processing

    Data quality management

    Query performance

    Terabyte scalability

    Networked data warehouse

    Warehouse administration

    Integrated dimensional tools

    Advanced query functionality

    *

  • Data Marts

    A subset of data warehouse that supports the requirements of a particular department or business function

    *

  • Online Analytical Processing (OLAP)

    OLAP: the dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data
    Multi-dimensional OLAP: cubes of data

    *

    (Figure: a data cube with dimensions City, Time, and Product type)
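    A cube over the three dimensions in the figure can be sketched as a fact table plus an aggregation function; the cities, quarters, products, and amounts below are invented:

```python
# Minimal sketch of a data cube over the three dimensions in the figure
# (city, time, product type); cities, quarters, and amounts are invented.
facts = [
    ("Pune",  "Q1", "Video", 100), ("Pune",  "Q2", "Video", 120),
    ("Delhi", "Q1", "Audio",  80), ("Delhi", "Q1", "Video",  90),
]

def total(city=None, time=None, product=None):
    # Fixing one dimension slices the cube; fixing two dices it;
    # leaving all free rolls everything up to a grand total.
    return sum(v for c, t, p, v in facts
               if city in (None, c) and time in (None, t) and product in (None, p))

grand  = total()                          # all facts
slice_ = total(product="Video")           # one dimension fixed
dice   = total(city="Pune", time="Q1")    # two dimensions fixed
```

    MOLAP servers precompute and index such aggregates instead of scanning the facts per query; the sketch only shows the cube's logical model.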

  • Problems of Data Warehousing

    Underestimation of resources for data loading
    Hidden problems with source systems
    Required data not captured
    Increased end-user demands
    Data homogenization
    High demand for resources
    Data ownership
    High maintenance
    Long-duration projects
    Complexity of integration

    *

  • Codd's Rules for OLAP

    Multi-dimensional conceptual view

    Transparency

    Accessibility

    Consistent reporting performance

    Client-server architecture

    Generic dimensionality

    Dynamic sparse matrix handling

    Multi-user support

    Unrestricted cross-dimensional operations

    Intuitive data manipulation

    Flexible reporting

    Unlimited dimensions and aggregation levels

    *

  • OLAP Tools

    Multi-dimensional OLAP (MOLAP): multi-dimensional DBMS (MDDBMS)
    Relational OLAP (ROLAP): creation of multiple multi-dimensional views of the two-dimensional relations
    Managed Query Environment (MQE): delivers selected data directly from the DBMS to the desktop in the form of a data cube, where it is stored, analyzed, and manipulated locally

    *

  • Data Mining

    Definition

    The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions

    Knowledge discovery

    Association rules

    Sequential patterns

    Classification trees

    Goals

    Prediction

    Identification

    Classification

    Optimization

    *

  • Data Mining Techniques

    Predictive modeling: supervised training with two phases
    Training phase: building a model using a large sample of historical data called the training set
    Testing phase: trying the model on new data
    Database segmentation
    Link analysis
    Deviation detection
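    The two-phase discipline can be sketched end to end. The "model" here is just a spend cutoff learned from labelled history; all numbers and labels are invented, and real predictive models (trees, neural networks) follow the same train-then-test pattern:

```python
# Minimal sketch of supervised training: build a model on a training
# set, then try it on held-out test data. Numbers/labels are invented.
history = [(200, "good"), (950, "bad"), (180, "good"), (1100, "bad"),
           (220, "good"), (990, "bad")]

train, test = history[:4], history[4:]        # simple split

# Training phase: learn a cutoff separating the two classes.
goods = [x for x, label in train if label == "good"]
bads  = [x for x, label in train if label == "bad"]
cutoff = (max(goods) + min(bads)) / 2         # midpoint between classes

# Testing phase: try the model on new data and measure accuracy.
def predict(x):
    return "good" if x < cutoff else "bad"

accuracy = sum(predict(x) == label for x, label in test) / len(test)
```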

    *

  • What are Data Mining Tasks?

    Classification
    Regression
    Clustering
    Summarization
    Dependency modeling
    Change and deviation detection

    *

  • What are Data Mining Discoveries?

    New purchase trends
    Plan investment strategies
    Detect unauthorized expenditure
    Fraudulent activities
    Crime trends
    Smugglers -- border crossing

    *

  • *

    Data Warehouse Architecture

  • *

    Data Warehouse for Decision Support & OLAP

    Putting information technology to work to help the knowledge worker make faster and better decisions:
    Which of my customers are most likely to go to the competition?
    What product promotions have the biggest impact on revenue?
    How did the share price of software companies correlate with profits over the last 10 years?

  • *

    Decision Support

    Used to manage and control business
    Data is historical or point-in-time
    Optimized for inquiry rather than update
    Use of the system is loosely defined and can be ad hoc
    Used by managers and end-users to understand the business and make judgements

  • *

    Data Mining works with Warehouse Data

    Data Warehousing provides the Enterprise with a memory

    Data Mining provides the Enterprise with intelligence

  • *

    We want to know ...

    Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
    Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
    If I raise the price of my product by Rs. 2, what is the effect on my ROI?
    If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
    If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
    Which of my customers are likely to be the most loyal?

    Data Mining helps extract such information

  • *

    Application Areas

    Industry -- Application

    Finance -- credit card analysis
    Insurance -- claims, fraud analysis
    Telecommunication -- call record analysis
    Transport -- logistics management
    Consumer goods -- promotion analysis
    Data service providers -- value added data
    Utilities -- power usage analysis

  • *

    Data Mining in Use

    The US Government uses data mining to track fraud
    A supermarket becomes an information broker
    Basketball teams use it to track game strategy
    Cross selling
    Warranty claims routing
    Holding on to good customers
    Weeding out bad customers

  • *

    What makes data mining possible?

    Advances in the following areas are making data mining deployable:
    data warehousing
    better and more data (i.e., operational, behavioral, and demographic)
    the emergence of easily deployed data mining tools
    the advent of new data mining techniques

    -- Gartner Group

  • *

    Why Separate Data Warehouse?

    Performance:
    Operational databases are designed and tuned for known transactions and workloads.
    Complex OLAP queries would degrade performance for operational transactions.
    Special data organization, access, and implementation methods are needed for multidimensional views and queries.

    Function:
    Missing data: decision support requires historical data, which operational databases do not typically maintain.
    Data consolidation: decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
    Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.

  • *

    What are Operational Systems?

    They are OLTP systems
    Run mission-critical applications
    Need to work with stringent performance requirements for routine tasks
    Used to run a business!

  • *

    RDBMS used for OLTP

    Database systems have traditionally been used for OLTP:
    clerical data processing tasks
    detailed, up-to-date data
    structured, repetitive tasks
    read/update a few records
    isolation, recovery, and integrity are critical

  • *

    Operational Systems

    Run the business in real time
    Based on up-to-the-second data
    Optimized to handle large numbers of simple read/write transactions
    Optimized for fast response to predefined transactions
    Used by people who deal with customers and products -- clerks, salespeople, etc.
    Increasingly used by customers themselves

  • *

    Examples of Operational Data

  • *

    Application-Orientation vs. Subject-Orientation

  • *

    OLTP vs. Data Warehouse

    OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse
    Special data organization, access methods, and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
    e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December

  • *

    OLTP vs Data Warehouse

    OLTP:
    Application oriented
    Used to run business
    Detailed data
    Current, up to date
    Isolated data
    Repetitive access
    Clerical user

    Warehouse (DSS):
    Subject oriented
    Used to analyze business
    Summarized and refined
    Snapshot data
    Integrated data
    Ad-hoc access
    Knowledge user (manager)

  • *

    OLTP vs Data Warehouse

    OLTP:
    Performance sensitive
    Few records accessed at a time (tens)
    Read/update access
    No data redundancy
    Database size 100 MB - 100 GB

    Data warehouse:
    Performance relaxed
    Large volumes accessed at a time (millions)
    Mostly read (batch update)
    Redundancy present
    Database size 100 GB - a few terabytes

  • *

    OLTP vs Data Warehouse

    OLTP:
    Transaction throughput is the performance metric
    Thousands of users
    Managed in entirety

    Data warehouse:
    Query throughput is the performance metric
    Hundreds of users
    Managed by subsets

  • *

    To summarize ...

    OLTP Systems are
    used to run a business

    The Data Warehouse helps to optimize the business

  • *

    Why Now?

    Data is being produced
    ERP provides clean data
    The computing power is available
    The computing power is affordable
    The competitive pressures are strong
    Commercial products are available

  • *

    Myths surrounding OLAP Servers and Data Marts

    Data marts and OLAP servers are departmental solutions supporting a handful of users
    Million-dollar massively parallel hardware is needed to deliver fast time for complex queries
    OLAP servers require massive and unwieldy indices
    Complex OLAP queries clog the network with data
    Data warehouses must be at least 100 GB to be effective

    Source -- Arbor Software Home Page

  • II. On-Line Analytical Processing (OLAP)

    Making Decision Support Possible

  • *

    Typical OLAP Queries

    Write a multi-table join to compare sales for each product line YTD this year vs. last year.
    Repeat the above process to find the top 5 product contributors to margin.
    Repeat the above process to find the sales of a product line to new vs. existing customers.
    Repeat the above process to find the customers that have had negative sales growth.

  • *

    What Is OLAP?

    Online Analytical Processing: coined by E.F. Codd in a 1994 paper contracted by Arbor Software*
    Generally synonymous with earlier terms such as decision support, business intelligence, executive information system
    OLAP = multidimensional database
    MOLAP: multidimensional OLAP (Arbor Essbase, Oracle Express)
    ROLAP: relational OLAP (Informix MetaCube, MicroStrategy DSS Agent)

    * Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html

  • *

    The OLAP Market

    Rapid growth in the enterprise market:
    1995: $700 million
    1997: $2.1 billion
    Significant consolidation activity among major DBMS vendors:
    10/94: Sybase acquires ExpressWay
    7/95: Oracle acquires Express
    11/95: Informix acquires Metacube
    10/96: Microsoft acquires Panorama
    1/97: Arbor partners with IBM
    Result: OLAP shifted from a small vertical niche to a mainstream DBMS category

  • *

    Strengths of OLAP

    It is a powerful visualization paradigm
    It provides fast, interactive response times
    It is good for analyzing time series
    It can be useful to find some clusters and outliers
    Many vendors offer OLAP tools

  • *

    OLAP Is FASMI

    Fast
    Analysis
    Shared
    Multidimensional
    Information

    -- Nigel Pendse, Richard Creeth, The OLAP Report

  • *

    Dimensions: Product, Region, Time

    Hierarchical summarization paths:

    Product: Industry -> Category -> Product
    Region: Country -> Region -> City -> Office
    Time: Year -> Quarter -> Month / Week -> Day

    Multi-dimensional data: "Hey, I sold $100M worth of goods!"

  • *

    A Visual Operation: Pivot (Rotate)

    (Figure: pivoting a sales cube -- products Juice, Cola, Milk, Cream by dates 3/1-3/4 for regions NY, LA, SF -- rotated so that different dimensions (Month, Region, Product) form the visible axes.)
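    The pivot (rotate) operation can be sketched as re-orienting the same facts so a different dimension forms the rows vs. the columns; the sales figures below are invented:

```python
# Minimal sketch of the pivot (rotate) operation: the same invented
# sales facts re-oriented around different row/column dimensions.
facts = [("Juice", "NY", 10), ("Cola", "NY", 47),
         ("Juice", "LA", 30), ("Cola", "LA", 12)]

def pivot(rows_dim, cols_dim):
    # rows_dim / cols_dim: field index (0 = product, 1 = city).
    table = {}
    for fact in facts:
        r, c, v = fact[rows_dim], fact[cols_dim], fact[2]
        row = table.setdefault(r, {})
        row[c] = row.get(c, 0) + v
    return table

by_product = pivot(0, 1)   # products as rows, cities as columns
by_city    = pivot(1, 0)   # rotated: cities as rows, products as columns
```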

  • *

    Slicing and Dicing

    (Figure: a cube with dimensions Product -- Household, Telecomm, Video, Audio; Sales Channel -- Retail, Direct, Special; and Region -- India, Far East, Europe. Fixing Product = Telecomm yields "the Telecomm slice".)

  • *

    Roll-up and Drill Down

    Drill-down path: Sales Channel -> Region -> Country -> State -> Location Address -> Sales Representative
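    Roll-up and drill-down along such a hierarchy can be sketched as aggregating the same facts at coarser or finer levels; the location hierarchy (state -> country -> region) and sales facts below are invented:

```python
# Minimal sketch of roll-up along an invented location hierarchy
# (state -> country -> region); drilling down is the reverse move.
facts = [("Asia", "India", "MH", 10), ("Asia", "India", "KA", 20),
         ("Asia", "Japan", "TK", 15), ("Europe", "France", "IDF", 25)]

def roll_up(level):
    # level 0 = region, 1 = country, 2 = state (higher index = finer grain).
    totals = {}
    for fact in facts:
        key = fact[level]
        totals[key] = totals.get(key, 0) + fact[3]
    return totals

by_country = roll_up(1)   # roll up states into country totals
by_region  = roll_up(0)   # roll up further into region totals
```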

  • Results of Data Mining Include:

    Forecasting what may happen in the future
    Classifying people or things into groups by recognizing patterns
    Clustering people or things into groups based on their attributes
    Associating what events are likely to occur together
    Sequencing what events are likely to lead to later events

  • Data mining is not

    Brute-force crunching of bulk data
    Blind application of algorithms
    Going to find relationships where none exist
    Presenting data in different ways
    A database-intensive task
    A difficult-to-understand technology requiring an advanced degree in computer science

  • Data Mining versus OLAP

    OLAP -- On-line Analytical Processing
    Provides a very good view of what is happening, but cannot predict what will happen in the future or why it is happening

  • Data Mining Versus Statistical Analysis

    Data Mining

    Originally developed to act as expert systems to solve problems

    Less interested in the mechanics of the technique

    If it makes sense, then let's use it

    Does not require assumptions to be made about data

    Can find patterns in very large amounts of data

    Requires understanding of data and business problem

    Data Analysis

    Tests for statistical correctness of models

    Are statistical assumptions of models correct?

    E.g., is the R-squared good?

    Hypothesis testing

    Is the relationship significant?

    Use a t-test to validate significance

    Tends to rely on sampling

    Techniques are not optimised for large amounts of data

    Requires strong statistical skills

  • Examples of What People are Doing with Data Mining:

    Fraud/Non-Compliance Anomaly detection

    Isolate the factors that lead to fraud, waste and abuse

    Target auditing and investigative efforts more effectively

    Credit/Risk Scoring

    Intrusion detection

    Parts failure prediction

    Recruiting/Attracting customers

    Maximizing profitability (cross selling, identifying profitable customers)

    Service Delivery and Customer Retention

    Build profiles of customers likely to use which services

    Web Mining

  • What data mining has done for...

    The US Internal Revenue Service needed to improve customer service and...

    scheduled its workforce to provide faster, more accurate answers to questions.

    By analyzing incoming requests for help and information, the IRS hopes to schedule its workforce to provide faster, more accurate answers to questions.

  • What data mining has done for...

    The US Drug Enforcement Agency needed to be more effective in its drug busts, and analyzed suspects' cell phone usage to focus investigations.

    The US DFAS needs to search through 2.5 million financial transactions that may indicate inaccurate charges. Instead of relying on tips to point out fraud, the DFAS is mining the data to identify suspicious transactions.

    Using Clementine, the agency examined credit card transactions and was able to identify purchases that did not match past patterns. Using this information, DFAS could focus investigations, finding fraud more cost-effectively.

  • What data mining has done for...

    HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher-yielding investments, and reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.

    Retail banking is a highly competitive business. In addition to competition from other banks, banks also see intense competition from financial services companies of all kinds, from stockbrokers to mortgage companies.

    With so many organizations working the same customer base, the value of customer retention is greater than ever before. As a result, HSBC Bank USA looks to enticing existing customers to "roll over" maturing products, or to cross-selling new ones.

    Using SPSS products, HSBC found that it could reduce direct mail costs by 30% while still bringing in 95% of the campaign's revenue. Because HSBC is sending out fewer mail pieces, customers are likely to be more loyal because they don't receive junk mail from the bank.

  • Suggestion:Predicting Washington

    C-SPAN has launched a digital archive of 500,000 hours of audio debates. Text mining or audio mining of these talks could reveal answers to certain questions.

  • Example Application: Sports

    IBM Advanced Scout analyzes
    NBA game statistics

    Shots blocked, assists, fouls

    Google: IBM Advanced Scout

  • Advanced Scout

    Example pattern: An analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that "when Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots."

    Pattern is interesting:
    The average shooting percentage for the Charlotte Hornets during that game was 54%.

  • Data Mining: Types of Data

    Relational data and transactional data
    Spatial and temporal data, spatio-temporal observations
    Time-series data
    Text
    Images, video
    Mixtures of data
    Sequence data

    Features from processing other data sources

  • Data Mining Techniques

    Supervised learning: classification and regression
    Unsupervised learning: clustering
    Dependency modeling: associations, summarization, causality
    Outlier and deviation detection
    Trend analysis and change detection

  • Different Types of Classifiers

    Linear discriminant analysis (LDA)
    Quadratic discriminant analysis (QDA)
    Density estimation methods
    Nearest neighbor methods
    Logistic regression
    Neural networks
    Fuzzy set theory
    Decision trees

  • Test Sample Estimate

    Divide D into D1 and D2
    Use D1 to construct the classifier d
    Then use the resubstitution estimate R(d,D2) to calculate the estimated misclassification error of d
    Unbiased and efficient, but removes D2 from the training dataset D

  • V-fold Cross Validation

    Procedure:

    Construct classifier d from D
    Partition D into V datasets D1, ..., DV
    Construct classifier di using D \ Di
    Calculate the estimated misclassification error R(di,Di) of di using test sample Di

    Final misclassification estimate:

    Weighted combination of individual misclassification errors:
    R(d,D) = 1/V Σi R(di,Di)
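As a sketch, the procedure above can be written out directly. The toy majority-class learner and the `train`/`classify` interfaces here are illustrative assumptions, not part of any particular tool:

```python
import random

def cross_val_error(train, classify, data, labels, v=5):
    """Estimate misclassification error of a learner by V-fold cross-validation."""
    idx = list(range(len(data)))
    random.seed(0)
    random.shuffle(idx)
    folds = [idx[i::v] for i in range(v)]                # partition D into D1..DV
    errors = []
    for fold in folds:
        hold = set(fold)
        tr_x = [data[i] for i in idx if i not in hold]   # D \ Di
        tr_y = [labels[i] for i in idx if i not in hold]
        model = train(tr_x, tr_y)                        # classifier di
        wrong = sum(classify(model, data[i]) != labels[i] for i in fold)
        errors.append(wrong / len(fold))                 # R(di, Di)
    return sum(errors) / v                               # averaged estimate

# toy learner: always predict the majority label of the training set
train = lambda x, y: max(set(y), key=y.count)
classify = lambda m, xi: m
data = list(range(10))
labels = [0] * 8 + [1] * 2
err = cross_val_error(train, classify, data, labels, v=5)   # 2 of 10 misclassified
```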

  • Cross-Validation: Example

    (Figure: classifier d is built from all of D; classifiers d1, d2, d3 are each built from D with one fold held out.)

  • Cross-Validation

    Misclassification estimate obtained through cross-validation is usually nearly unbiased
    Costly computation (we need to compute d, and d1, ..., dV); computation of di is nearly as expensive as computation of d
    Preferred method to estimate the quality of learning algorithms in the machine learning literature

  • Decision Tree Construction

    Three algorithmic components:

    Split selection (CART, C4.5, QUEST, CHAID, CRUISE, ...)

    Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping)

    Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator)


  • Goodness of a Split

    Consider node t with impurity phi(t)

    The reduction in impurity through splitting predicate s (t splits into children nodes tL with impurity phi(tL) and tR with impurity phi(tR)) is:
    phi(s,t) = phi(t) - pL phi(tL) - pR phi(tR)
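With Gini impurity as the concrete choice of phi (an assumption for illustration; entropy would work the same way), the goodness of a split can be computed as:

```python
def gini(labels):
    """Gini impurity phi(t) of the class labels at a node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_goodness(labels, left, right):
    """Impurity reduction phi(s,t) = phi(t) - pL*phi(tL) - pR*phi(tR)."""
    pL = len(left) / len(labels)
    pR = len(right) / len(labels)
    return gini(labels) - pL * gini(left) - pR * gini(right)

# a perfect split drives both children to purity, so the reduction
# equals the parent's impurity (0.5 for a balanced two-class node)
parent = ['a', 'a', 'b', 'b']
g = split_goodness(parent, ['a', 'a'], ['b', 'b'])
```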

  • Pruning Methods

    Test dataset pruning
    Direct stopping rule
    Cost-complexity pruning
    MDL pruning
    Pruning by randomization testing

  • Stopping Policies

    A stopping policy indicates when further growth of the tree at a node t is counterproductive.

    All records are of the same class
    The attribute values of all records are identical
    All records have missing values
    At most one class has a number of records larger than a user-specified number
    All records go to the same child node if t is split (only possible with some split selection methods)

  • Test Dataset Pruning

    Use an independent test sample D' to estimate the misclassification cost using the resubstitution estimate R(T,D') at each node
    Select the subtree T' of T with the smallest expected cost

  • Missing Values

    What is the problem?
    During computation of the splitting predicate, we can selectively ignore records with missing values (note that this has some problems)
    But if a record r misses the value of the splitting variable, r cannot participate further in tree construction

    Algorithms for missing values address this problem.

  • Mean and Mode Imputation

    Assume record r has missing value r.X, and splitting variable is X.

    Simplest algorithm: if X is numerical (categorical), impute the overall mean (mode)
    Improved algorithm: if X is numerical (categorical), impute the class-conditional mean(X | t.C) (mode(X | t.C))
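A minimal sketch of the simplest scheme; the dict-per-record representation is an assumption for illustration:

```python
from statistics import mean, mode

def impute(records, attr, numeric=True):
    """Fill missing (None) values of attr with the overall mean (numeric)
    or mode (categorical), per the simplest imputation algorithm above."""
    present = [r[attr] for r in records if r[attr] is not None]
    fill = mean(present) if numeric else mode(present)
    for r in records:
        if r[attr] is None:
            r[attr] = fill
    return records

rows = [{'age': 20}, {'age': 40}, {'age': None}]
impute(rows, 'age')        # missing age becomes the mean, 30
```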

  • Decision Trees: Summary

    Many applications of decision trees
    There are many algorithms available for: split selection, pruning, handling missing values, data access
    Decision tree construction is still an active research area (after 20+ years!)
    Challenges: performance, scalability, evolving datasets, new applications

  • Supervised vs. Unsupervised Learning

    Supervised

    y = F(x): true function
    D: labeled training set, D = {xi, F(xi)}
    Learn: G(x), a model trained to predict the labels of D
    Goal: E[(F(x) - G(x))^2] ≈ 0
    Well-defined criteria: accuracy, RMSE, ...

    Unsupervised

    Generator: true model
    D: unlabeled data sample, D = {xi}
    Learn:

    ??????????

    Goal:

    ??????????

    Well defined criteria:

    ??????????

  • Clustering: Unsupervised Learning

    Given: a data set D (training set) and a similarity/distance metric
    Find: a partitioning of the data into groups of similar/close items

  • Similarity?

    Groups of similar customers: similar demographics, similar buying behavior, similar health
    Similar products: similar cost, similar function, similar store
    Similarity usually is domain/problem specific

  • Clustering: Informal Problem Definition

    Input:

    A data set of N records each given as a d-dimensional data feature vector.

    Output:

    Determine a natural, useful partitioning of the data set into a number of (k) clusters and noise such that we have:
    High similarity of records within each cluster (intra-cluster similarity)
    Low similarity of records between clusters (inter-cluster similarity)

  • Types of Clustering

    Hard clustering: each object is in one and only one cluster
    Soft clustering: each object has a probability of being in each cluster

  • Clustering Algorithms

    Partitioning-based clustering: K-means, K-medoids, EM (expectation maximization)
    Hierarchical clustering: divisive (top down), agglomerative (bottom up)
    Density-based methods: regions of dense points separated by sparser regions of relatively low density

  • K-Means Clustering Algorithm

    Initialize k cluster centers

    Do

    Assignment step: Assign each data point to its closest cluster center

    Re-estimation step: Re-compute cluster centers

    While (there are still changes in the cluster centers)

    Visualization at:

    http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
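The loop above can be sketched in a few lines; the fixed seed and the toy 1-D points are illustrative assumptions:

```python
import random

def kmeans(points, k, iters=100):
    """Lloyd's K-means on points represented as tuples."""
    random.seed(1)
    centers = random.sample(points, k)            # initialize k cluster centers
    for _ in range(iters):
        # assignment step: each point goes to its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # re-estimation step: re-compute centers as cluster means
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]
        if new == centers:                        # stop when centers no longer change
            break
        centers = new
    return centers

pts = [(0.0,), (0.5,), (10.0,), (10.5,)]
centers = sorted(kmeans(pts, 2))                  # one center per tight group
```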

  • Issues

    Why is K-Means working:

    How does it find the cluster centers?
    Does it find an optimal clustering?
    What are good starting points for the algorithm?
    What is the right number of cluster centers?
    How do we know it will terminate?

  • Agglomerative Clustering

    Algorithm:

    Put each item in its own cluster (all singletons)
    Find all pairwise distances between clusters
    Merge the two closest clusters
    Repeat until everything is in one cluster

    Observations:

    Results in a hierarchical clustering
    Yields a clustering for each possible number of clusters
    Greedy clustering: result is not optimal for any cluster size
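A sketch of the algorithm with single-link distance; the 1-D numeric items and stopping at k clusters (rather than one) are illustrative choices:

```python
def single_link(points, k):
    """Bottom-up single-link clustering: repeatedly merge the two
    closest clusters until only k remain."""
    clusters = [[p] for p in points]                 # every item starts as a singleton
    while len(clusters) > k:
        # pair of clusters with the smallest point-to-point distance
        i, j = min(((i, j)
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(abs(a - b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)               # merge the two closest clusters
    return clusters

groups = single_link([1, 2, 9, 10, 30], 2)           # 30 is left on its own
```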

  • Density-Based Clustering

    A cluster is defined as a connected dense component
    Density is defined in terms of the number of neighbors of a point
    We can find clusters of arbitrary shape

  • Market Basket Analysis

    Consider a shopping cart filled with several items
    Market basket analysis tries to answer the following questions:
    Who makes purchases?
    What do customers buy together?
    In what order do customers purchase items?

  • Market Basket Analysis

    Given:

    A database of customer transactionsEach transaction is a set of items

    Example:
    Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

    TID  CID  Date    Item   Qty
    111  201  5/1/99  Pen    2
    111  201  5/1/99  Ink    1
    111  201  5/1/99  Milk   3
    111  201  5/1/99  Juice  6
    112  105  6/3/99  Pen    1
    112  105  6/3/99  Ink    1
    112  105  6/3/99  Milk   1
    113  106  6/5/99  Pen    1
    113  106  6/5/99  Milk   1
    114  201  7/1/99  Pen    2
    114  201  7/1/99  Ink    2
    114  201  7/1/99  Juice  4

  • Market Basket Analysis (Contd.)

    Co-occurrences: 80% of all customers purchase items X, Y and Z together.
    Association rules: 60% of all customers who purchase X and Y also buy Z.
    Sequential patterns: 60% of customers who first buy X also purchase Y within three weeks.


  • Confidence and Support

    We prune the set of all possible association rules using two interestingness measures:

    Confidence of a rule: X → Y has confidence c if P(Y|X) = c
    Support of a rule: X → Y has support s if P(X,Y) = s

    We can also define

    Support of an itemset (a co-occurrence) XY: XY has support s if P(X,Y) = s
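Using transactions built from the pen/ink/milk/juice items of the earlier table, support and confidence can be computed directly (the particular four baskets here are an illustrative assumption):

```python
def support(itemset, transactions):
    """P(itemset): fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

T = [{'pen', 'ink', 'milk', 'juice'},
     {'pen', 'ink', 'milk'},
     {'pen', 'milk'},
     {'pen', 'ink', 'juice'}]
s = support({'pen', 'ink'}, T)                 # 3 of 4 baskets -> 0.75
c = confidence({'pen', 'ink'}, {'juice'}, T)   # 2 of those 3 baskets -> 2/3
```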


  • Market Basket Analysis: Applications

    Sample applications:
    Direct marketing
    Fraud detection for medical insurance
    Floor/shelf planning
    Web site layout
    Cross-selling


  • Applications of Frequent Itemsets

    Market basket analysis
    Association rules
    Classification (especially: text, rare classes)
    Seeds for construction of Bayesian networks
    Web log analysis
    Collaborative filtering

  • Association Rule Algorithms

    More abstract problem redux
    Breadth-first search
    Depth-first search

  • Problem Redux

    Abstract:

    A set of items {1,2,...,k}
    A database of transactions (itemsets) D = {T1, T2, ..., Tn},
    Tj ⊆ {1,2,...,k}

    GOAL:

    Find all itemsets that appear in at least x transactions

    (appear in == are subsets of)

    I subset T: T supports I

    For an itemset I, the number of transactions it appears in is called the support of I.

    x is called the minimum support.

    Concrete:

    I = {milk, bread, cheese, ...}
    D = { {milk,bread,cheese}, {bread,cheese,juice}, ... }

    GOAL:

    Find all itemsets that appear in at least 1000 transactions

    {milk,bread,cheese} supports {milk,bread}

  • Problem Redux (Contd.)

    Definitions:

    An itemset is frequent if it is a subset of at least x transactions. (FI.)
    An itemset is maximally frequent if it is frequent and it does not have a frequent superset. (MFI.)

    GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)).

    Obvious relationship:
    MFI subset FI

    Example:

    D={ {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} }

    Minimum support x = 3

    {1,2} is frequent

    {1,2,3} is maximal frequent

    Support({1,2}) = 4

    All maximal frequent itemsets: {1,2,3}
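The small example above can be checked by brute-force enumeration; this is a sketch that tries every candidate itemset, which is only feasible for tiny item sets (Apriori-style algorithms prune this search in practice):

```python
from itertools import combinations

def frequent_itemsets(D, minsup):
    """All itemsets contained in at least minsup transactions (the FI)."""
    items = sorted(set().union(*D))
    FI = []
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            if sum(set(cand) <= t for t in D) >= minsup:
                FI.append(set(cand))
    return FI

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
FI = frequent_itemsets(D, minsup=3)
# maximal frequent itemsets: frequent sets with no frequent proper superset
MFI = [s for s in FI if not any(s < t for t in FI)]
```

On the slide's example this reproduces the stated answer: {1,2} is frequent with support 4, and {1,2,3} is the only maximal frequent itemset.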

  • Applications

    Spatial association rulesWeb miningMarket basket analysisUser/customer profiling

  • Extensions: Sequential Patterns

    Suggestion: In the market basket analysis, replace Milk, Pen, etc. with names of medications and use the idea in a new hospital data mining proposal. Add to the idea of swarm intelligence the extra analysis of the induction rules in this set of slides.

  • Kraft Foods: Direct Marketing

    Company maintains a large database of purchases by customers.

    Data mining

    1. Analysts identified associations among groups of products bought by particular segments of customers.

    2. Sent out 3 sets of coupons to various households.

    Better response rates: 50% increase in sales for one of its products

    Continues to use this approach

    Health Insurance Commission of Australia: Insurance Fraud

    Commission maintains a database of insurance claims, including laboratory tests ordered during the diagnosis of patients.

    Data mining

    1. Identified the practice of "up coding" to reflect more expensive tests than are necessary.

    2. Now monitors orders for lab tests.

    Commission expects to save US$1,000,000 / year by eliminating the practice of "up coding."

  • HNC Software: Credit Card Fraud

    Payment Fraud

    Large issuers of cards may lose

    $10 million / year due to fraud

    Difficult to identify the few transactions among thousands which reflect potential fraud

    Falcon software

    Mines data through neural networks

    Introduced in September 1992

    Models each cardholder's requested transaction against the customer's past spending history.

    processes several hundred requests per second

    compares current transaction with customer's history

    identifies the transactions most likely to be frauds

    enables bank to stop high-risk transactions before they are authorized

    Used by many retail banks: currently monitors

    160 million card accounts for fraud

  • New Account Fraud

    Fraudulent applications for credit cards are growing at 50 % per year

    Falcon Sentry software

    Mines data through neural networks and a rule base

    Introduced in September 1992

    Checks information on applications against data from credit bureaus

    Allows card issuers to simultaneously:

    increase the proportion of applications received

    reduce the proportion of fraudulent applications authorized


  • Quality Control

    IBM Microelectronics: Quality Control

    Analyzed manufacturing data on Dynamic Random Access Memory (DRAM) chips.

    Data mining

    1. Built predictive models of

    manufacturing yield (% non-defective)

    effects of production parameters on chip performance.

    2. Discovered critical factors behind

    production yield &

    product performance.

    3. Created a new design for the chip

    increased yield saved millions of dollars in direct manufacturing costs

    enhanced product performance by substantially lowering the memory cycle time

  • B & L Stores

    Belk and Leggett Stores =

    one of the largest retail chains

    280 stores in southeast U.S.

    data warehouse contains 100s of gigabytes (billion characters) of data

    data mining to:

    increase sales

    reduce costs

    Selected DSS Agent from MicroStrategy, Inc.

    analyze merchandising (patterns of sales)

    manage inventory

    Retail Sales

  • DSS Agent

    uses intelligent agents for data mining

    provides multiple functions

    recognizes sales patterns among stores

    discovers sales patterns by

    time of day

    day of year

    category of product

    etc.

    swiftly identifies trends & shifts in customer tastes

    performs Market Basket Analysis (MBA)

    analyzes Point-of-Sale or -Service (POS) data

    identifies relationships among products and/or services purchased

    E.g. A customer who buys Brand X slacks has a 35% chance of buying Brand Y shirts.

    Agent tool is also used by other Fortune 1000 firms

    average ROI > 300 %

    average payback in 1 ~ 2 years

    Market Basket Analysis

  • Case Based Reasoning

    (CBR)

    General scheme for a case based reasoning (CBR) model. The target case is

    matched against similar precedents in the historical database, such as cases A and B.




  • Case Based Reasoning (CBR)

    Learning through the accumulation of experience

    Key issues

    Indexing:
    storing cases for quick, effective access of precedents

    Retrieval:
    accessing the appropriate precedent cases

    Advantages

    Explicit knowledge form recognizable to humans

    No need to re-code knowledge for computer processing

    Limitations

    Retrieving precedents based on superficial features
    E.g. Matching Indonesia with U.S. because both have similar population size

    Traditional approach ignores the issue of generalizing knowledge


  • Genetic Algorithm

    Generation of candidate solutions using the procedures of biological evolution.

    Procedure

    0. Initialize.
    Create a population of potential solutions ("organisms").
    1. Evaluate.
    Determine the level of "fitness" for each solution.
    2. Cull.
    Discard the poor solutions.
    3. Breed.
    a. Select 2 "fit" solutions to serve as parents.
    b. From the 2 parents, generate offspring.
    * Crossover:
    Cut the parents at random and switch the 2 halves.
    * Mutation:
    Randomly change the value in a parent solution.
    4. Repeat.
    Go back to Step 1 above.
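A minimal sketch of these steps on the toy "one-max" problem (maximize the number of 1 bits); the population size, mutation rate, and bit-string encoding are illustrative assumptions:

```python
import random

def genetic_search(fitness, length=10, pop_size=20, generations=60):
    """Initialize, then repeatedly evaluate, cull, and breed with
    crossover and mutation, following the procedure above."""
    random.seed(0)
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]                   # 0. initialize
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)            # 1. evaluate
        pop = pop[:pop_size // 2]                      # 2. cull the poor half
        while len(pop) < pop_size:                     # 3. breed
            p1, p2 = random.sample(pop[:5], 2)         #    two fit parents
            cut = random.randrange(1, length)          #    crossover: cut and swap
            child = p1[:cut] + p2[cut:]
            if random.random() < 0.1:                  #    mutation: flip a bit
                i = random.randrange(length)
                child[i] ^= 1
            pop.append(child)
    return max(pop, key=fitness)

best = genetic_search(sum)    # fitness = number of 1 bits
```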


  • Genetic Algorithm (Cont.)

    Advantages

    Applicable to a wide range of problem domains.

    Robustness:
    can obtain solutions even when the performance

    function is highly irregular or input data are noisy.

    Implicit parallelism:
    can search in many directions concurrently.

    Limitations

    Slow, like neural networks.
    But: computation can be distributed

    over multiple processors

    (unlike neural networks)

    Source: www.pathology.washington.edu


  • Multistrategy Learning

    Every technique has advantages & limitations

    Multistrategy approach

    Take advantage of the strengths of diverse techniques

    Circumvent the limitations of each methodology

  • Types of Models

    Prediction Models for Predicting and Classifying

    Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)

    Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)

    Descriptive Models for Grouping and Finding Associations

    Clustering/Grouping algorithms: K-means, Kohonen

    Association algorithms: apriori, GRI

  • Neural Networks

    Description
    Difficult interpretation
    Tends to overfit the data
    Extensive amount of training time
    A lot of data preparation
    Works with all data types

  • Rule Induction

    Description

    Intuitive output
    Handles all forms of numeric data, as well as non-numeric (symbolic) data

    C5 algorithm: a special case of rule induction

    Target variable must be symbolic

  • Apriori

    Description

    Seeks association rules in dataset

    Market basket analysis

    Sequence discovery

  • Data Mining Is

    The automated process of finding relationships and patterns in stored data. It is different from the use of SQL queries and other business intelligence tools

  • Data Mining Is

    Motivated by business need, large amounts of available data, and humans' limited cognitive processing abilities
    Enabled by data warehousing, parallel processing, and data mining algorithms

  • Common Types of Information from Data Mining

    Associations -- identifies occurrences that are linked to a single event
    Sequences -- identifies events that are linked over time
    Classification -- recognizes patterns that describe the group to which an item belongs

  • Common Types of Information from Data Mining

    Clustering -- discovers different groupings within the data
    Forecasting -- estimates future values

  • Commonly Used Data Mining Techniques

    Artificial neural networks
    Decision trees
    Genetic algorithms
    Nearest neighbor method
    Rule induction

  • The Current State of Data Mining Tools

    Many of the vendors are small companies
    IBM and SAS have been in the market for some time, and more big players are moving into this market
    BI tools and RDBMS products are increasingly including basic data mining capabilities
    Packaged data mining applications are becoming common

  • The Data Mining Process

    Requires personnel with domain, data warehousing, and data mining expertise
    Requires data selection, data extraction, data cleansing, and data transformation
    Most data mining tools work with highly granular flat files
    Is an iterative and interactive process

  • Why Data Mining

    Credit ratings/targeted marketing: given a database of 100,000 names, which persons are the least likely to default on their credit cards? Identify likely responders to sales promotions
    Fraud detection: which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
    Customer relationship management: which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

    Data Mining helps extract such information

  • Applications

    Banking: loan/credit card approval; predict good customers based on old customers
    Customer relationship management: identify those who are likely to leave for a competitor
    Targeted marketing: identify likely responders to promotions
    Fraud detection: telecommunications, financial transactions; from an online stream of events, identify fraudulent events
    Manufacturing and production: automatically adjust knobs when process parameters change


    Any area where large amounts of historic data that if understood

    better can help shape future decisions.

  • Applications (continued)

    Medicine: disease outcome, effectiveness of treatments; analyze patient disease history to find relationships between diseases
    Molecular/pharmaceutical: identify new drugs
    Scientific data analysis: identify new galaxies by searching for sub-clusters
    Web site/store design and promotion: find affinity of visitors to pages and modify layout

  • The KDD process

    Problem formulation
    Data collection; subset data: sampling might hurt if the data are highly skewed; feature selection: principal component analysis, heuristic search
    Pre-processing: cleaning; name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values
    Transformation: map complex objects, e.g. time series data, to features, e.g. frequency
    Choosing the mining task and mining method
    Result evaluation and visualization

    Knowledge discovery is an iterative process

  • Relationship with other fields

    Overlaps with machine learning, statistics, artificial intelligence, databases, visualization, but with more stress on:
    scalability in the number of features and instances
    algorithms and architectures (whereas the foundations of methods and formulations are provided by statistics and machine learning)
    automation for handling large, heterogeneous data


  • Some basic operations

    Predictive: regression, classification, collaborative filtering
    Descriptive: clustering / similarity matching, association rules and variants, deviation detection


    Each topic is a talk in itself.

  • Classification

    Given old data about customers and payments, predict new applicants loan eligibility.

    (Figure: previous customers with attributes Age, Salary, Profession, Location, and Customer type feed a classifier; the resulting decision rules, such as "Salary > 5 L" and "Prof. = Exec", label new applicants' data as good or bad.)

  • Classification methods

    Goal: predict class Ci = f(x1, x2, ..., xn)
    Regression (linear or any other polynomial): a*x1 + b*x2 + c = Ci
    Nearest neighbor
    Decision tree classifier: divide the decision space into piecewise constant regions
    Probabilistic/generative models
    Neural networks: partition by non-linear boundaries


    Nearest neighbor

    Define proximity between instances, find neighbors of a new instance, and assign the majority class
    Case based reasoning: when attributes are more complicated than real-valued

    Pros: fast training
    Cons: slow during application; no feature selection; notion of proximity vague
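A minimal majority-vote sketch; Euclidean proximity and k=3 are illustrative choices, and the tiny labeled set is an assumption for the example:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.
    `train` is a list of (feature_vector, label) pairs; "training" is just
    storing the data, which is why training is fast and application is slow."""
    dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
    neighbors = sorted(train, key=lambda t: dist(t[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), 'good'), ((1, 2), 'good'), ((2, 1), 'good'),
         ((8, 8), 'bad'), ((9, 8), 'bad')]
label = knn_predict(train, (1.5, 1.5), k=3)   # all 3 neighbors vote 'good'
```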


  • Clustering

    Unsupervised learning when old data with class labels is not available, e.g. when introducing a new product
    Group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster
    Key requirement: need a good measure of similarity between instances
    Identify micro-markets and develop policies for each


  • Applications

    Customer segmentation, e.g. for targeted marketing: group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster; identify micro-markets and develop policies for each
    Collaborative filtering: group based on common items purchased
    Text clustering
    Compression

  • Distance functions

    Numeric data: Euclidean, Manhattan distances
    Categorical data: 0/1 to indicate presence/absence, followed by
    Hamming distance (# of dissimilarities)
    Jaccard coefficient: # of shared 1s / (# of positions with a 1)
    Data-dependent measures: similarity of A and B depends on co-occurrence with C
    Combined numeric and categorical data: weighted normalized distance
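Sketches of these distance functions on 0/1 presence vectors and numeric vectors:

```python
def hamming(a, b):
    """Number of positions where two 0/1 presence vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Shared 1s divided by positions where either vector has a 1."""
    both = sum(x and y for x, y in zip(a, b))
    any1 = sum(x or y for x, y in zip(a, b))
    return both / any1

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

h = hamming([1, 0, 1, 1], [1, 1, 0, 1])   # 2 mismatched positions
j = jaccard([1, 0, 1, 1], [1, 1, 0, 1])   # 2 shared 1s / 4 positions with a 1
```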

  • Clustering methods

    Hierarchical clustering: agglomerative vs. divisive; single link vs. complete link
    Partitional clustering: distance-based (K-means), model-based (EM), density-based

  • Agglomerative Hierarchical clustering

    Given: matrix of similarity between every point pair
    Start with each point in a separate cluster and merge clusters based on some criterion:
    Single link: merge the two clusters such that the minimum distance between two points from the two different clusters is the least
    Complete link: merge the two clusters such that all points in one cluster are close to all points in the other

  • Partitional methods: K-means

    Criterion: minimize the sum of squared distances
    Between each point and the centroid of its cluster, or
    Between each pair of points in the cluster
    Algorithm:
    Select an initial partition with K clusters: random, first K, or K separated points
    Repeat until stabilization:
    Assign each point to the closest cluster center
    Generate new cluster centers
    Adjust clusters by merging/splitting

  • Collaborative Filtering

    Given a database of user preferences, predict the preference of a new user
    Example: predict what new movies you will like based on
    your past preferences
    others with similar past preferences
    their preferences for the new movies
    Example: predict what books/CDs a person may want to buy (and suggest them, or give discounts to tempt the customer)

  • Association rules

    Given a set T of groups of items
    Example: sets of items purchased
    Goal: find all rules on itemsets of the form a --> b such that:
    support of a and b > user threshold s
    conditional probability (confidence) of b given a > user threshold c
    Examples: Milk --> bread; purchase of product A --> service B

    T:
    {Milk, cereal}
    {Tea, milk}
    {Tea, rice, bread}
    {Cereal}

  • Prevalent Interesting

    Analysts already know about prevalent rules
    Interesting rules are those that deviate from prior expectation
    Mining's payoff is in finding surprising phenomena

    (Cartoon: two analysts announce "Milk and cereal sell together!", in 1995 and again years later; only the first finding is interesting.)

  • Applications of fast itemset counting

    Find correlated events:

    Applications in medicine: find redundant tests
    Cross selling in retail, banking
    Improve predictive capability of classifiers that assume attribute independence
    New similarity measures of categorical attributes [Mannila et al, KDD 98]


  • Application Areas

    Industry                Application
    Finance                 Credit card analysis
    Insurance               Claims, fraud analysis
    Telecommunication       Call record analysis
    Transport               Logistics management
    Consumer goods          Promotion analysis
    Data service providers  Value-added data
    Utilities               Power usage analysis

  • Usage scenarios

    Data warehouse mining: assimilate data from operational sources; mine static data
    Mining log data
    Continuous mining: example in process control
    Stages in mining: data selection, pre-processing (cleaning), transformation, mining, result evaluation, visualization


  • Mining market

    Around 20 to 30 mining tool vendors
    Major tool players: Clementine, IBM's Intelligent Miner, SGI's MineSet, SAS's Enterprise Miner
    All pretty much the same set of tools
    Many embedded products: fraud detection, electronic commerce applications, health care, customer relationship management (Epiphany)


    Market size: $40M, expected to grow 10 times by 2000 (Forrester Research)

  • Vertical integration:
    Mining on the web

    Web log analysis for site design: what are popular pages, what links are hard to find
    Electronic stores' sales enhancements: recommendations, advertisement
    Collaborative filtering: Net Perception, Wisewire
    Inventory control: what was a shopper looking for and could not find


  • State of art in mining OLAP integration

    Decision trees [Information Discovery, Cognos]: find factors influencing high profits
    Clustering [Pilot Software]: segment customers to define a hierarchy on that dimension
    Time series analysis [Seagate's Holos]: query for various shapes along time, e.g. spikes, outliers
    Multi-level associations [Han et al.]: find associations between members of dimensions
    Sarawagi [VLDB 2000]


    Little integration so far; here are a few exceptions.

    People are starting to wake up to this possibility, and here are some examples found by web surfing.

    Decision trees are the most common. Information Discovery claimed to be the only serious integrator [DBMS, April 98].

    Clustering is used by some to define new product hierarchies.

    Of course, a rich set of time-series functions, especially for forecasting, was always there.

    New charting software: 80/20, A-B-C analysis, quadrant plotting.

    University work: Jiawei Han.

    The previous approach has been to bring mining operations into OLAP: look at mining operations and choose what fits.

    My approach has been to reflect on what people do with the cube metaphor and drill-down, roll-up based exploration, and to see if anything there can be automated.

    Discuss my work first.

  • Data Mining in Use

    The US Government uses data mining to track fraud
    A supermarket becomes an information broker
    Basketball teams use it to track game strategy
    Cross selling
    Target marketing
    Holding on to good customers
    Weeding out bad customers


  • Some success stories

    Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data
    Won over the (manual) knowledge engineering approach
    http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides a good detailed description of the entire process
    Major US bank: customer attrition prediction
    First segment customers based on financial behavior: found 3 segments
    Build attrition models for each of the 3 segments
    40-50% of attritions were predicted == factor of 18 increase
    Targeted credit marketing: major US banks
    Find customer segments based on 13 months of credit balances
    Build another response model based on surveys
    Increased response 4 times -- to 2%

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    What is KnowledgeSeeker?

    Produced by ANGOSS Software Corporation, who focus solely on data mining software.

    Offer training and consulting services

    Produce data mining add-ins which accept data from all major databases

    Works with popular query and reporting, spreadsheet, statistical and OLAP & ROLAP tools.


    *

    Angoss Software Corp. was formed under the Business Corp. Act (Ontario) in 1980. It began data mining software production in 1992. It is publicly traded on the Canadian Venture Exchange under the trading symbol ANC.

    Promote the rapid knowledge transfer to customers in the use of technology and adoption of best practice for data mining

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    Major Competitors

    Company    Software
    SPSS       Clementine 6.0
    SAS        Enterprise Miner 3.0
    IBM        Intelligent Miner

    *

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    Major Competitors

    Company    Software
    SGI        MineSet 3.1
    Oracle     Darwin
    Cognos     Scenario

    *

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    Current Applications

    Manufacturing

    Used by the R.R. Donnelly & Sons commercial printing company to improve process control, cut costs and increase productivity.

    Used extensively by Hewlett Packard in their United States manufacturing plants as a process control tool both to analyze factors impacting product quality as well as to generate rules for production control systems.


    *

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    Current Applications

    Auditing

    Used by the IRS to combat fraud, reduce risk, and increase collection rates.

    Finance

    Used by the Canadian Imperial Bank of Commerce (CIBC) to create models for fraud detection and risk management.


    *

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    Current Applications

    CRM

    Telephony

    Used by US West to reduce churning and increase customer loyalty for a new voice messaging technology.


    *

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    Current Applications

    Marketing

    Used by the Washington Post to improve their direct mail targeting and to conduct survey analysis.

    Health Care

    Used by the Oxford Transplant Center to discover factors affecting transplant survival rates.

    Used by the University of Rochester Cancer Center to study the effect of anxiety on chemotherapy-related nausea.


    *

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    More Customers


    *

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    Questions

    What percentage of people in the test group have high blood pressure with these characteristics: a 66-year-old male regular smoker with low to moderate salt consumption?

    Do the risk levels change for a male with the same characteristics who quit smoking? What are the percentages?

    If you are a 2% milk drinker, how many factors are still interesting?

    Knowing that salt consumption and smoking habits are interesting factors, which one has a stronger correlation to blood pressure levels?

    Grow an automatic tree. Look to see whether gender is an interesting factor for a 55-year-old regular smoker who does not eat cheese.


  • Association

    Classic market-basket analysis, which treats the purchase of a number of items (for example, the contents of a shopping basket) as a single transaction. This information can be used to adjust inventories, modify floor or shelf layouts, or introduce targeted promotional activities to increase overall sales or move specific products. Example : 80 percent of all transactions in which beer was purchased also included potato chips.
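    The beer-and-chips rule above can be checked directly: support counts how often an itemset appears, and confidence is the conditional frequency. A minimal sketch in Python; the transaction list is invented for illustration.

```python
# Hedged sketch: confidence of the association rule {beer} -> {chips}
# over a toy list of market baskets (all data invented).

transactions = [
    {"beer", "chips"}, {"beer", "chips", "milk"},
    {"beer", "diapers"}, {"beer", "chips", "diapers"},
    {"milk", "bread"},
]

def confidence(transactions, antecedent, consequent):
    """Fraction of baskets containing antecedent that also contain consequent."""
    have_a = [t for t in transactions if antecedent <= t]
    have_both = [t for t in have_a if consequent <= t]
    return len(have_both) / len(have_a)

conf = confidence(transactions, {"beer"}, {"chips"})
print(conf)  # 3 of the 4 beer baskets also contain chips -> 0.75
```

    A real market-basket miner (e.g. Apriori) enumerates many candidate rules and filters them by minimum support and confidence; this sketch only scores one rule.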

  • Sequence-based analysis

    Traditional market-basket analysis deals with a collection of items as part of a point-in-time transaction. Sequence-based analysis adds the time dimension: the goal is to identify a typical set of purchases that might predict the subsequent purchase of a specific item.

  • Clustering

    Clustering approaches address segmentation problems. These approaches assign records with a large number of attributes into a relatively small set of groups or "segments." Example: the buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.
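    As an illustration of segmentation, here is a minimal k-means-style sketch in Python that groups one-dimensional spending values into segments; the data, the number of segments, and the starting centers are all invented.

```python
# Minimal k-means sketch (not a production clusterer): segment 1-D
# "spending" values into groups around the nearest center.

def kmeans_1d(values, centers, iters=20):
    for _ in range(iters):
        # Assignment step: each value joins its nearest center's group.
        groups = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        # Update step: each center moves to the mean of its group.
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

spend = [10, 12, 11, 95, 100, 98, 51, 49]
print(kmeans_1d(spend, centers=[0.0, 50.0, 100.0]))
# three segments: low, medium and high spenders
```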

  • Classification

    The most commonly applied data mining technique. The algorithm uses preclassified examples to determine the set of parameters required for proper discrimination. Example: a classifier derived from the classification approach that is capable of identifying risky loans could be used to aid in the decision of whether to grant a loan to an individual.
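    A toy sketch of learning from preclassified examples: a 1-nearest-neighbour classifier over invented (income, debt) loan records. This is the simplest possible illustration of discrimination from labeled data, not any specific product's algorithm.

```python
# Preclassified loan examples: (income, debt) -> "risky"/"safe".
# All figures are made up for illustration.

labeled = [
    ((20, 15), "risky"), ((25, 18), "risky"),
    ((80, 5), "safe"), ((90, 10), "safe"),
]

def classify(point):
    """Label a new applicant with the class of the closest known example."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(labeled, key=lambda ex: dist2(ex[0], point))
    return label

print(classify((22, 16)))  # -> risky
print(classify((85, 7)))   # -> safe
```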

  • Issues of Data Mining

    Present-day tools are strong but require significant expertise to implement effectively.
    Susceptibility to "dirty" or irrelevant data.
    Inability to "explain" results in human terms.

  • Issues

    susceptibility to "dirty" or irrelevant data Data mining tools of today simply take everything they are given as factual and draw the resulting conclusions. Users must take the necessary precautions to ensure that the data being analyzed is "clean."

  • Issues, cont

    inability to "explain" results in human terms
    Many of the tools employed in data mining analysis use complex mathematical algorithms that are not easily mapped into human terms. What good does the information do if you don't understand it?

  • Comparison with reporting, BI and OLAP

    Reporting

    Simple relationships
    Choose the relevant factors
    Examine all details

    (Also applies to visualisation & simple statistics)

    Data Mining

    Complex relationships
    Automatically find the relevant factors
    Show only relevant details

    Prediction

    *

    Here it's obviously the algorithms.

  • Comparison with Statistics

    Statistical analysis

    Mainly about hypothesis testing
    Focussed on precision

    Data mining

    Mainly about hypothesis generation
    Focussed on deployment

    *

    Here it's less clear: maybe it's the algorithms, but more it's the attitude.

  • Example: data mining and customer processes

    Insight: Who are my customers and why do they behave the way they do?
    Prediction: Who is a good prospect, for what product, who is at risk, what is the next thing to offer?
    Uses: Targeted marketing, mail-shots, call-centres, adaptive web-sites

    *

  • Example: data mining and fraud detection

    Insight: How can a (specific method of) fraud be recognised? What constitutes normal, abnormal and suspicious events?
    Prediction: Recognise similarity to previous frauds (how similar?); spot abnormal events (how suspicious?)
    Used by: Banks, telcos, retail, government

    *

  • Example: data mining and diagnosing cancer

    Complex data from genetics
    Challenging data mining problem
    Find patterns of gene activation indicating different diseases / stages
    "Changed the way I think about cancer" (Oncologist, Chicago Children's Memorial Hospital)


    *

  • Example: data mining and policing

    Knowing the patterns helps plan effective crime prevention
    Crime hot-spots understood better
    Sift through mountains of crime reports
    Identify crime series
    "Other people save money using data mining; we save lives." (Police force homicide specialist and data miner)

    *

  • Data mining tools:
    Clementine and its philosophy

    *

    How it works

    How it's really used.

    Handling of business problems and algorithms / expert features

    Deep embedding of deployment

    CRISP-DM pane

  • How to do data mining

    Lots of data mining operations
    How do you glue them together to solve a problem?
    How do we actually do data mining?
    Methodology
    Not just the right way, but any way

    *

  • Myths about Data Mining (1)
    Data, Process and Tech

    Data mining is all about
    massive data

    Data mining is a
    technical process

    Data mining is all
    about algorithms

    Data mining is all
    about predictive accuracy

  • Myths about Data Mining (2)
    Data Quality

    Data mining only works
    with clean data

    Data mining only works
    with complete data

    Data mining only works

    with correct data

  • One last exploding myth

    Neural Networks are not useful when you need to understand the patterns that you find
    (which is nearly always in data mining)

    Related to over-simplistic views of data mining

    Data mining techniques form a toolkit

    We often use techniques in surprising ways

    E.g. Neural nets for field selection
    Neural nets for pattern confirmation
    Neural nets combined with other techniques
    for cross-checking

    What use is a pair of pliers?

  • *

    Related Concepts Outline

    Database/OLTP Systems
    Fuzzy Sets and Logic
    Information Retrieval (Web Search Engines)
    Dimensional Modeling
    Data Warehousing
    OLAP/DSS
    Statistics
    Machine Learning
    Pattern Matching

    Goal: Examine some areas which are related to data mining.

  • *

    Fuzzy Sets and Logic

    Fuzzy set: a set membership function is a real-valued function with output in the range [0,1].
    f(x): probability x is in F.
    1-f(x): probability x is not in F.
    Ex: T = {x | x is a person and x is tall}
    Let f(x) be the probability that x is tall
    Here f is the membership function

    DM: Prediction and classification are fuzzy.
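    A membership function like f above can be sketched as a simple ramp; the 160 cm and 190 cm thresholds below are invented for illustration.

```python
# Sketch of a fuzzy membership function f(x) for the set T of "tall"
# people: a linear ramp between two invented height thresholds.

def tall(height_cm):
    """Degree of membership in 'tall', a value in [0, 1]."""
    if height_cm <= 160:
        return 0.0          # definitely not tall
    if height_cm >= 190:
        return 1.0          # definitely tall
    return (height_cm - 160) / 30   # partial membership in between

print(tall(150), tall(175), tall(200))  # 0.0 0.5 1.0
```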

  • *

    Information Retrieval

    Information Retrieval (IR): retrieving desired information from textual data.
    Library Science
    Digital Libraries
    Web Search Engines
    Traditionally keyword based
    Sample query:

    Find all documents about data mining.

    DM: Similarity measures;

    Mine text/Web data.

  • Prentice Hall

    *

    Dimensional Modeling

    View data in a hierarchical manner, more as business executives might
    Useful in decision support systems and mining
    Dimension: collection of logically related attributes; axis for modeling data.
    Facts: data stored
    Ex: Dimensions: products, locations, date

    Facts quantity, unit price

    DM: May view data as dimensional.


  • *

    Dimensional Modeling Queries

    Roll Up: more general dimension
    Drill Down: more specific dimension
    Dimension (Aggregation) Hierarchy
    SQL uses aggregation
    Decision Support Systems (DSS): computer systems and tools to assist managers in making decisions and solving problems.
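    Roll-up as aggregation can be illustrated in miniature: summing facts up from the (product, city, day) level to the more general (product, city) level. The dimension names and figures are invented.

```python
# Roll-up in miniature: aggregate (product, city, day, quantity) facts
# up to the (product, city) dimension level by summing out the day.
from collections import defaultdict

facts = [
    ("milk", "Riyadh", "2010-05-01", 10),
    ("milk", "Riyadh", "2010-05-02", 15),
    ("milk", "Jeddah", "2010-05-01", 7),
]

rolled_up = defaultdict(int)
for product, city, day, qty in facts:
    rolled_up[(product, city)] += qty      # drop the day dimension

print(dict(rolled_up))
# {('milk', 'Riyadh'): 25, ('milk', 'Jeddah'): 7}
```

    Drill-down is the reverse move: going back to the finer (product, city, day) facts.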

  • *

    Cube view of Data

  • *

    Data Warehousing

    "Subject-oriented, integrated, time-variant, nonvolatile" (William Inmon)
    Operational data: data used in the day-to-day needs of the company.
    Informational data: supports other functions such as planning and forecasting.
    Data mining tools often access data warehouses rather than operational data.

    DM: May access data in warehouse.

  • *

    OLAP

    Online Analytic Processing (OLAP): provides more complex queries than OLTP.
    OnLine Transaction Processing (OLTP): traditional database/transaction processing.
    Dimensional data; cube view
    Visualization of operations:
    Slice: examine a sub-cube.
    Dice: rotate the cube to look at another dimension.
    Roll Up/Drill Down

    DM: May use OLAP queries.

  • *

    OLAP Operations

    Single Cell

    Multiple Cells

    Slice

    Dice

    Roll Up

    Drill Down

  • *

    Statistics

    Simple descriptive models
    Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
    Exploratory Data Analysis: data can actually drive the creation of the model
    Opposite of the traditional statistical view.
    Data mining targeted to the business user

    DM: Many data mining methods come from statistical techniques.

  • *

    Machine Learning

    Machine Learning: area of AI that examines how to write programs that can learn.
    Often used in classification and prediction
    Supervised Learning: learns by example.
    Unsupervised Learning: learns without knowledge of correct answers.
    Machine learning often deals with small static datasets.

    DM: Uses many machine learning techniques.

  • Prentice Hall

    *

    Pattern Matching (Recognition)

    Pattern Matching: finds occurrences of a predefined pattern in the data.
    Applications include speech recognition, information retrieval, time series analysis.

    DM: Type of classification.


  • *

    DM vs. Related Topics

    Area     Query     Data              Results  Output
    DB/OLTP  Precise   Database          Precise  DB Objects or Aggregation
    IR       Precise   Documents         Vague    Documents
    OLAP     Analysis  Multidimensional  Precise  DB Objects or Aggregation
    DM       Vague     Preprocessed      Vague    KDD Objects

  • Prentice Hall

    *

    Data Mining Techniques Outline

    Statistical
    Point Estimation
    Models Based on Summarization
    Bayes Theorem
    Hypothesis Testing
    Regression and Correlation
    Similarity Measures
    Decision Trees
    Neural Networks
    Activation Functions
    Genetic Algorithms

    Goal: Provide an overview of basic data mining techniques


  • *

    Point Estimation

    Point estimate: estimate a population parameter.
    May be made by calculating the parameter for a sample.
    May be used to predict a value for missing data.
    Ex: R contains 100 employees
    99 have salary information
    Mean salary of these is $50,000
    Use $50,000 as the value of the remaining employee's salary.

    Is this a good idea?
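    The slide's example as code: use the sample mean of the 99 known salaries as a point estimate for the missing one. The individual values are synthetic, chosen so their mean is $50,000.

```python
# Mean imputation: estimate the one missing salary with the sample mean
# of the 99 known ones (values synthetic, mean chosen to be 50,000).

salaries = [48_000.0, 50_000.0, 52_000.0] * 33 + [None]   # 99 known, 1 missing
known = [s for s in salaries if s is not None]
mean = sum(known) / len(known)
imputed = [mean if s is None else s for s in salaries]
print(mean)  # 50000.0, used as the point estimate for the missing value
```

    Whether this is a good idea depends on the data: mean imputation shrinks the variance and can bias any model built on the imputed column.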

  • *

    Estimation Error

    Bias: Difference between expected value and actual value.

    Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value:

    Why square?
    Root Mean Square Error (RMSE)
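    Bias, MSE and RMSE can be computed directly once the true values are known; the numbers below are invented so that the positive and negative errors cancel in the bias but not in the MSE, which is one answer to "why square?".

```python
# Bias, MSE and RMSE of an estimator against known actual values.
import math

actual   = [10.0, 12.0, 14.0, 16.0]
estimate = [11.0, 11.0, 15.0, 15.0]

errors = [e - a for e, a in zip(estimate, actual)]
bias = sum(errors) / len(errors)              # signed errors cancel
mse  = sum(err ** 2 for err in errors) / len(errors)   # squaring stops cancellation
rmse = math.sqrt(mse)                         # back in the original units
print(bias, mse, rmse)  # 0.0 1.0 1.0
```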

  • *

    Expectation-Maximization (EM)

    Solves estimation with incomplete data.
    Obtain initial estimates for parameters.
    Iteratively use estimates for missing data and continue until convergence.
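    A stripped-down EM-style loop for the simplest possible case: estimating a mean when some observations are missing. The E-step fills the missing values with the current estimate; the M-step re-estimates the mean. Real EM handles much richer models (e.g. mixtures); the data here is invented.

```python
# EM-style iteration for a mean with missing observations (toy case).

data = [3.0, 7.0, None, 5.0, None]
mu = 0.0                                           # crude initial estimate
for _ in range(10):
    filled = [mu if x is None else x for x in data]   # E-step: impute
    mu = sum(filled) / len(filled)                    # M-step: re-estimate
print(mu)  # approaches the mean of the observed values, 5.0
```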

  • *

    Models Based on Summarization

    Visualization: frequency distribution, mean, variance, median, mode, etc.
    Box plot:

  • *

    Bayes Theorem

    Posterior probability: P(h1|xi)
    Prior probability: P(h1)
    Bayes theorem:

    Assign probabilities of hypotheses given a data value.
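    Bayes theorem in code: the posterior P(h1|x) computed from a prior and two likelihoods. The probabilities are invented; note how a rare hypothesis stays fairly unlikely even after a strong signal.

```python
# Bayes theorem for two competing hypotheses h1 and h0 = not h1:
# P(h1|x) = P(x|h1) P(h1) / [ P(x|h1) P(h1) + P(x|h0) P(h0) ]

def posterior(prior_h1, p_x_given_h1, p_x_given_h0):
    evidence = p_x_given_h1 * prior_h1 + p_x_given_h0 * (1 - prior_h1)
    return p_x_given_h1 * prior_h1 / evidence

p = posterior(prior_h1=0.01, p_x_given_h1=0.9, p_x_given_h0=0.1)
print(round(p, 4))  # 0.0833: the 1% prior keeps the posterior low
```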

  • *

    Hypothesis Testing

    Find a model to explain behavior by creating and then testing a hypothesis about the data.
    Exact opposite of the usual DM approach.
    H0: null hypothesis; the hypothesis to be tested.
    H1: alternative hypothesis

  • *

    Regression

    Predict future values based on past values
    Linear regression assumes a linear relationship exists:

    y = c0 + c1x1 + ... + cnxn

    Find values to best fit the data
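    For a single predictor the best-fit values of c0 and c1 have a closed form (least squares); a sketch on invented data generated near y = 1 + 2x.

```python
# Least-squares fit of y = c0 + c1*x using the closed-form formulas:
# c1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²),  c0 = ȳ - c1*x̄

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]   # roughly y = 1 + 2x with noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
c1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
c0 = mean_y - c1 * mean_x
print(round(c0, 2), round(c1, 2))  # close to the generating line (1, 2)
```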

  • *

    Correlation

    Examine the degree to which the values for two variables behave similarly.
    Correlation coefficient r:

    1 = perfect correlation

    -1 = perfect but opposite correlation

    0 = no correlation
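    Computing r directly from its definition; in this invented data y is exactly -2x, so the result is the "perfect but opposite correlation" case from the slide.

```python
# Pearson correlation coefficient:
# r = cov(x, y) / sqrt(var(x) * var(y))
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-2.0, -4.0, -6.0, -8.0, -10.0]   # y = -2x exactly

mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
print(r)  # -1.0: perfect but opposite correlation
```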

  • Prentice Hall

    *

    Similarity Measures

    Determine similarity between two objects.
    Similarity characteristics:

    Alternatively, distance measures measure how unlike or dissimilar objects are.


  • *

    Distance Measures

    Measure dissimilarity between objects

  • *

    Decision Trees

    Decision Tree (DT): a tree where the root and each internal node is labeled with a question. The arcs represent each possible answer to the associated question. Each leaf node represents a prediction of a solution to the problem.
    Popular technique for classification; the leaf node indicates the class to which the corresponding tuple belongs.

  • Prentice Hall

    *

    Decision Trees

    A Decision Tree Model is a computational model consisting of three parts:
    Decision tree
    Algorithm to create the tree
    Algorithm that applies the tree to data
    Creation of the tree is the most difficult part.
    Processing is basically a search similar to that in a binary search tree (although a DT may not be binary).
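    The third part (the algorithm that applies the tree to data) is the easy one; a sketch using a nested-dict tree. The tree format and the weather-style example are invented for illustration, not taken from the text.

```python
# Applying a decision tree to a record: descend from the root, answering
# each node's question from the record, until a leaf (class label).

tree = {
    "attr": "outlook",
    "branches": {
        "sunny": {"attr": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": "no",
    },
}

def apply_tree(node, record):
    while isinstance(node, dict):                 # internal node: ask its question
        node = node["branches"][record[node["attr"]]]
    return node                                   # leaf: the predicted class

print(apply_tree(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
```

    This mirrors the search described above: like a binary search tree lookup, except a node may have more than two branches.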


  • Prentice Hall

    *

    Neural Networks

    Based on the observed functioning of the human brain (Artificial Neural Networks, ANN).
    Our view of neural networks is very simplistic.
    We view a neural network (NN) from a graphical viewpoint.
    Alternatively, a NN may be viewed from the perspective of matrices.
    Used in pattern recognition, speech recognition, computer vision, and classification.


  • *

    Generating Rules

    A decision tree can be converted into a rule set
    Straightforward conversion: each path to a leaf becomes a rule; this makes an overly complex rule set
    More effective conversions are not trivial (e.g. C4.8 tests each node in the root-leaf path to see if it can be eliminated without loss in accuracy)

    *

    In the previous lesson we discussed Classification using decision trees.

    Sometimes decision trees are inconvenient: they can be very large

    Also, they require starting at the same attribute

    We can extract modular nuggets of knowledge by getting rules
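    The straightforward conversion can be sketched as a recursive walk: every root-to-leaf path becomes one rule. The nested-dict tree format is invented, and no C4.8-style simplification is attempted.

```python
# Convert a (nested-dict) decision tree into rules: one rule per
# root-to-leaf path, each rule = (list of attr=value conditions, class).

tree = {
    "attr": "outlook",
    "branches": {
        "sunny": {"attr": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
    },
}

def paths_to_rules(node, conditions=()):
    if not isinstance(node, dict):               # leaf: emit one rule
        return [(list(conditions), node)]
    rules = []
    for answer, child in node["branches"].items():
        rules += paths_to_rules(child, conditions + ((node["attr"], answer),))
    return rules

for conds, label in paths_to_rules(tree):
    print("IF", " AND ".join(f"{a}={v}" for a, v in conds), "THEN", label)
```

    As the slide notes, rules produced this way are modular "nuggets": unlike the tree, each can be read on its own without starting at the root attribute.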

  • *

    Covering algorithms

    Strategy for generating a rule set directly: for each class in turn, find a rule set that covers all instances in it (excluding instances not in the class).
    This approach is called a covering approach because at each stage a rule is identified that covers some of the instances.
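    One step of a covering algorithm can be sketched as: among all attribute=value tests, pick the one whose rule "IF test THEN class" is most accurate on the instances it covers. A full covering algorithm such as PRISM repeats this to refine rules and to add more; the dataset below is invented.

```python
# Greedy selection of the single best test for the class buys=yes.

data = [
    {"age": "young", "income": "low",  "buys": "no"},
    {"age": "young", "income": "high", "buys": "yes"},
    {"age": "old",   "income": "high", "buys": "yes"},
    {"age": "old",   "income": "low",  "buys": "yes"},
]

def best_test(data, target_attr, target_value):
    best, best_acc = None, -1.0
    for attr in data[0]:
        if attr == target_attr:
            continue
        for value in {row[attr] for row in data}:
            covered = [r for r in data if r[attr] == value]
            # accuracy = fraction of covered instances in the target class
            acc = sum(r[target_attr] == target_value for r in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (attr, value), acc
    return best, best_acc

print(best_test(data, "buys", "yes"))
# ('age', 'old') covers 2 rows, both with buys=yes -> accuracy 1.0
```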

    *

  • *

    Rules vs. trees

    Corresponding decision tree:

    (produces exactly the same

    predictions)

    But: rule sets can be clearer when decision trees suffer from replicated subtrees
    Also: in multi-class situations, a covering algorithm concentrates on one class at a time whereas a decision tree learner takes all classes into account

    *

  • *

    A simple covering algorithm

    Generates a rule by adding tests that maximize the rule's accuracy
    Similar to the situation in decision trees: the problem of selecting an attribute to split on
    But: a decision tree inducer maximizes overall purity
    Each new test reduces the rule's coverage:

    witten&eibe

    *

  • Algorithm Components

    1. The task the algorithm is used to address (e.g. classification, clustering, etc.)

    2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model)

    3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.)

    4. The search or optimization method used to search over parameters and/or structures (e.g. steepest descent, MCMC, etc.)

    5. The data management technique used for storing, indexing, and retrieving data (critical when data too large to reside in memory)

  • Models and Patterns

    Models

    Prediction

    Probability Distributions

    Structured Data

    Linear regression
    Piecewise linear

  • Models

    Prediction

    Probability Distributions

    Structured Data

    Linear regression
    Piecewise linear
    Nonparametric regression

  • Models

    Prediction

    Probability Distributions

    Structured Data

    Linear regression
    Piecewise linear
    Nonparametric regression
    Classification

    logistic regression

    naïve Bayes/TAN/Bayesian networks

    NN

    support vector machines

    Trees

    etc.

  • Models

    Prediction

    Probability Distributions

    Structured Data

    Linear regression
    Piecewise linear
    Nonparametric regression
    Classification

    Parametric models
    Mixtures of parametric models
    Graphical Markov models (categorical, continuous, mixed)

  • Models

    Prediction

    Probability Distributions

    Structured Data

    Linear regression
    Piecewise linear
    Nonparametric regression
    Classification

    Parametric models
    Mixtures of parametric models
    Graphical Markov models (categorical, continuous, mixed)

    Time series
    Markov models
    Mixture Transition Distribution models
    Hidden Markov models
    Spatial models

  • Bias-Variance Tradeoff

    High Bias - Low Variance

    Low Bias - High Variance

    overfitting - modeling the random component

    Score function should embody the compromise

  • Patterns

    Global

    Local

    Clustering via partitioning
    Hierarchical Clustering
    Mixture Models

    Outlier detection
    Changepoint detection

    Bump hunting
    Scan statistics
    Association rules

    [Figure: x marks scattered along a curve representing a road; each x is an accident, red for injury, black for no injury]

    The curve represents a road

    Each x marks an accident

    Red x denotes an injury accident

    Black x means no injury

    Is there a stretch of road where there is an unusually large fraction of injury accidents?

    Scan Statistics via Permutation Tests
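    The fixed-window scan described on the next slide can be sketched directly: slide a window of w consecutive accidents along the road and report where the injury fraction peaks. The 0/1 injury sequence below is invented.

```python
# Fixed-window scan statistic: find the window of w consecutive
# accidents with the highest fraction of injury accidents.

injuries = [0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0]   # 1 = injury accident

def scan_fixed_window(seq, w):
    best_start, best_frac = 0, -1.0
    for i in range(len(seq) - w + 1):        # slide the window along the road
        frac = sum(seq[i:i + w]) / w
        if frac > best_frac:
            best_start, best_frac = i, frac
    return best_start, best_frac

print(scan_fixed_window(injuries, w=4))  # (5, 1.0): accidents 5-8 are all injuries
```

    A permutation test then asks how often a fraction this high would appear if the injury labels were shuffled at random along the road.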

  • Scan with Fixed Window

    If we know the length of the stretch of road that we seek, e.g., we could slide this window along