[PPT]DATA WAREHOUSING AND DATA MINING - Prince...
-
Data Mining Tools
Overview & Tutorial
Ahmed Sameh
Prince Sultan University
Department of Computer Science & Info Sys
May 2010
(Some slides belong to IBM)
*
-
*
Introduction Outline
Define data mining
Data mining vs. databases
Basic data mining tasks
Data mining development
Data mining issues
Goal: Provide an overview of data mining.
-
*
Introduction
Data is growing at a phenomenal rate
Users expect more sophisticated information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING
-
*
Data Mining Definition
Finding hidden information in a database
Fit data to a model
Similar terms:
Exploratory data analysis
Data-driven discovery
Deductive learning
-
*
Data Mining Algorithm
Objective: fit data to a model
Descriptive
Predictive
Preference: technique to choose the best model
Search: technique to search the data
Query
-
*
Database Processing vs. Data Mining Processing
Database processing:
Query: well defined, SQL
Data: operational data
Output: precise, a subset of the database
Data mining processing:
Query: poorly defined, no precise query language
Data: not operational data
Output: fuzzy, not a subset of the database
-
*
Query Examples
Database queries:
Find all customers who have purchased milk.
Find all credit applicants with last name of Smith.
Identify customers who have purchased more than $10,000 in the last month.
Data mining queries:
Find all items which are frequently purchased with milk. (association rules)
Find all credit applicants who are poor credit risks. (classification)
Identify customers with similar buying habits. (clustering)
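The first data mining query above can be sketched as a frequent-pair count. This is a minimal illustration, not a full association-rule miner; the baskets and the support threshold are invented for the sketch.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction data (invented for this sketch): each
# basket is the set of items bought in one visit.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

# Count the support of every item pair (how many baskets contain both).
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# "Items frequently purchased with milk": pairs containing milk whose
# support meets a minimum threshold of 2 baskets.
frequent_with_milk = {pair: n for pair, n in pair_counts.items()
                      if "milk" in pair and n >= 2}
```

A real association-rule algorithm (e.g., Apriori) extends this idea to itemsets of any size and adds a confidence measure on top of support.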
-
*
Related Fields
Statistics
Machine Learning
Databases
Visualization
Data Mining and Knowledge Discovery
*
-
*
Statistics, Machine Learning and Data Mining
Statistics:
more theory-based
more focused on testing hypotheses
Machine learning:
more heuristic
focused on improving performance of a learning agent
also looks at real-time learning and robotics, areas not part of data mining
Data mining and knowledge discovery:
integrates theory and heuristics
focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
The distinctions are fuzzy
-
Definition
A class of database applications that analyze data in a database using tools that look for trends or anomalies. IBM was an early pioneer and promoter of the field.
-
Purpose
To look for hidden patterns or previously unknown relationships among the data in a group of data that can be used to predict future behavior.
Example: data mining software can help retail companies find customers with common interests.
-
Background Information
Many of the techniques used by today's data mining tools have been around for many years, having originated in the artificial intelligence research of the 1980s and early 1990s. Data Mining tools are only now being applied to large-scale database systems.
-
The Need for Data Mining
The amount of raw data stored in corporate data warehouses is growing rapidly. There is too much data, and too much complexity, to manually identify what might be relevant to a specific problem. Data mining promises to bridge the analytical gap by giving knowledge workers the tools to navigate this complex analytical space.
-
The Need for Data Mining, cont
The need for information has resulted in the proliferation of data warehouses that integrate information from multiple sources to support decision making. These warehouses often include data from external sources, such as customer demographics and household information.
-
Definition (Cont.)
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend the patterns.
-
Of laws, Monsters, and Giants
Moore's law: processing capacity (CPU, cache, memory) doubles every 18 months.
Its more aggressive cousin: disk storage capacity doubles every 9 months.
-
What is Data Mining?
Finding interesting structure in data
Structure: refers to statistical patterns, predictive models, hidden relationships
Examples of tasks addressed by data mining:
Predictive modeling (classification, regression)
Segmentation (data clustering)
Summarization
Visualization
-
*
Major Application Areas for
Data Mining Solutions
Advertising
Bioinformatics
Customer Relationship Management (CRM)
Database marketing
Fraud detection
eCommerce
Health care
Investment/securities
Manufacturing, process control
Sports and entertainment
Telecommunications
Web
*
-
*
Data Mining
The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets:
Extremely large datasets
Discovery of the non-obvious
Useful knowledge that can improve processes
Cannot be done manually
Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind.
Sophisticated data search capability that uses statistical algorithms to discover patterns and correlations in data.
-
*
Data Mining (cont.)
-
*
Data Mining (cont.)
Data mining is a step of the Knowledge Discovery in Databases (KDD) process:
Data warehousing
Data selection
Data preprocessing
Data transformation
Data mining
Interpretation/evaluation
Data mining is sometimes referred to as KDD; DM and KDD tend to be used as synonyms.
-
*
Data Mining Evaluation
-
*
Data Mining is Not
Data warehousing
SQL / ad hoc queries / reporting
Software agents
Online Analytical Processing (OLAP)
Data visualization
-
*
Data Mining Motivation
Changes in the business environment:
Customers becoming more demanding
Markets are saturated
Databases today are huge:
More than 1,000,000 entities/records/rows
From 10 to 10,000 fields/attributes/variables
Gigabytes and terabytes
Databases are growing at an unprecedented rate
Decisions must be made rapidly
Decisions must be made with maximum knowledge
-
Why Use Data Mining Today?
Human analysis skills are inadequate:
Volume and dimensionality of the data
High data growth rate
Availability of:
Data
Storage
Computational power
Off-the-shelf software
Expertise
-
An Abundance of Data
Supermarket scanners, POS data
Preferred customer cards
Credit card transactions
Direct mail response
Call center records
ATM machines
Demographic data
Sensor networks
Cameras
Web server logs
Customer web site trails
-
Evolution of Database Technology
1960s: IMS, network model
1970s: the relational data model, first relational DBMS implementations
1980s: maturing RDBMS, application-specific DBMS (spatial data, scientific data, image data, etc.), OODBMS
1990s: mature, high-performance RDBMS technology, parallel DBMS, terabyte data warehouses, object-relational DBMS, middleware and web technology
2000s: high availability, zero administration, seamless integration into business processes
2010: sensor database systems, databases on embedded systems, P2P database systems, large-scale pub/sub systems, ???
-
Much Commercial Support
Many data mining tools: http://www.kdnuggets.com/software
Database systems with data mining support
Visualization tools
Data mining process support
Consultants
-
Why Use Data Mining Today?
Competitive pressure!
The secret of success is to know something that nobody else knows.
Aristotle Onassis
Competition on service, not only on price (banks, phone companies, hotel chains, rental car companies)
Personalization, CRM
The real-time enterprise
Systemic listening
Security, homeland defense
-
The Knowledge Discovery Process
Steps:
Identify business problem
Data mining
Action
Evaluation and measurement
Deployment and integration into businesses processes
-
Data Mining Step in Detail
2.1 Data preprocessing
Data selection: identify target datasets and relevant fields
Data cleaning: remove noise and outliers
Data transformation:
Create common units
Generate new fields
2.2 Data mining model construction
2.3 Model evaluation
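Step 2.1 above can be sketched in a few lines. This is a minimal illustration; the field names, values, and the outlier threshold are all invented for the sketch.

```python
# Hypothetical raw records with a field that is irrelevant to the
# analysis and one obviously corrupted price.
raw = [
    {"id": 1, "price_usd": 10.0, "weight_lb": 2.0, "note": "ok"},
    {"id": 2, "price_usd": 12.0, "weight_lb": 2.5, "note": "ok"},
    {"id": 3, "price_usd": 9999.0, "weight_lb": 2.1, "note": "bad scan"},
]

# Data selection: keep only the fields relevant to the analysis.
target = [{"price_usd": r["price_usd"], "weight_lb": r["weight_lb"]}
          for r in raw]

# Data cleaning: drop records with an implausibly large price (outlier).
cleaned = [r for r in target if r["price_usd"] < 1000]

# Data transformation: convert to a common unit (kilograms) and
# generate a new derived field, price per kilogram.
for r in cleaned:
    r["weight_kg"] = r["weight_lb"] * 0.453592
    r["price_per_kg"] = r["price_usd"] / r["weight_kg"]
```

Real preprocessing pipelines add many more steps (missing-value handling, deduplication, encoding), but the structure is the same: select, clean, transform.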
-
Preprocessing and Mining
Original Data -> Target Data -> Preprocessed Data -> Patterns -> Knowledge
(Stages: data integration and selection; preprocessing; model construction; interpretation)
-
*
Data Mining Techniques
-
*
Data Mining Models and Tasks
-
*
Basic Data Mining Tasks
Classification: maps data into predefined groups or classes
Supervised learning
Pattern recognition
Prediction
Regression: used to map a data item to a real-valued prediction variable
Clustering: groups similar data together into clusters
Unsupervised learning
Segmentation
Partitioning
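The supervised/unsupervised contrast above can be made concrete with a toy sketch (all numbers and labels invented): classification uses labelled training examples, while clustering groups unlabelled values on its own.

```python
# Supervised: 1-nearest-neighbour classification into predefined classes,
# using labelled training data.
train = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]

def classify(x):
    # Return the label of the closest training point.
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

# Unsupervised: split unlabelled values at their largest gap,
# a crude one-dimensional clustering with no labels at all.
values = sorted([1.1, 1.9, 8.2, 9.1])
gaps = [(values[i + 1] - values[i], i + 1) for i in range(len(values) - 1)]
split = max(gaps)[1]
clusters = [values[:split], values[split:]]
```

The same distinction carries over to real algorithms: decision trees and neural networks are supervised; k-means and hierarchical clustering are unsupervised.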
-
*
Basic Data Mining Tasks (contd)
Summarization: maps data into subsets with associated simple descriptions
Characterization
Generalization
Link analysis: uncovers relationships among data
Affinity analysis
Association rules
Sequential analysis: determines sequential patterns
-
*
Ex: Time Series Analysis
Example: stock market
Predict future values
Determine similar patterns over time
Classify behavior
-
*
Data Mining vs. KDD
Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data.
Data mining: the use of algorithms to extract the information and patterns derived by the KDD process.
-
*
Data Mining Development
Information retrieval: similarity measures, hierarchical clustering, IR systems, imprecise queries, textual data, web search engines
Statistics: Bayes theorem, regression analysis, EM algorithm, k-means clustering, time series analysis
Machine learning: neural networks, decision tree algorithms
Algorithms: algorithm design techniques, algorithm analysis, data structures
Databases: relational data model, SQL, association rule algorithms, data warehousing, scalability techniques
-
*
KDD Issues
Human interaction
Overfitting
Outliers
Interpretation
Visualization
Large datasets
High dimensionality
-
*
KDD Issues (contd)
Multimedia data
Missing data
Irrelevant data
Noisy data
Changing data
Integration
Application
-
*
Visualization Techniques
Graphical
Geometric
Icon-based
Pixel-based
Hierarchical
Hybrid
-
*
Data Mining Applications
-
*
Data Mining Applications:
Retail
Performing basket analysis: which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions.
Sales forecasting: examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?
Database marketing: retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer-label clothing or those who attend sales. This information can be used to focus cost-effective promotions.
Merchandise planning and allocation: when retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic characteristics. Retailers can also use data mining to determine the ideal layout for a specific store.
-
*
Data Mining Applications:
Banking
Card marketing: by identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs, targeted product development, and customized pricing.
Cardholder pricing and profitability: card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. Includes risk-based pricing.
Fraud detection: fraud is enormously costly. By analyzing past transactions that were later determined to be fraudulent, banks can identify patterns.
Predictive life-cycle management: DM helps banks predict each customer's lifetime value and service each segment appropriately (for example, offering special deals and discounts).
-
*
Data Mining Applications:
Telecommunications
Call detail record analysis: telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.
Customer loyalty: some customers repeatedly switch providers, or "churn", to take advantage of attractive incentives from competing companies. Companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, enabling them to target their spending on the customers who will produce the most profit.
-
*
Data Mining Applications:
Other applications
Customer segmentation: all industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis.
Manufacturing: through choice boards, manufacturers are beginning to customize products for customers; therefore they must be able to predict which features should be bundled to meet customer demand.
Warranties: manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims.
Frequent flier incentives: airlines can identify groups of customers that can be given incentives to fly more.
-
*
A producer wants to know.
-
*
Data, Data everywhere
yet...
I can't find the data I need
data is scattered over the network
many versions, subtle differences
I can't get the data I need
need an expert to get the data
I can't understand the data I found
available data poorly documented
I can't use the data I found
results are unexpected
data needs to be transformed from one form to another
-
*
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
[Barry Devlin]
-
*
What are the users saying...
Data should be integrated across the enterprise
Summary data has real value to the organization
Historical data holds the key to understanding data over time
What-if capabilities are required
-
*
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
[Forrester Research, April 1996]
-
*
Very Large Data Bases
Terabytes -- 10^12 bytes: Walmart -- 24 terabytes
Petabytes -- 10^15 bytes: geographic information systems
Exabytes -- 10^18 bytes: national medical records
Zettabytes -- 10^21 bytes: weather images
Yottabytes -- 10^24 bytes: intelligence agency videos
-
*
Data Warehousing --
It is a process: a technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
A decision support database maintained separately from the organization's operational database
-
*
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse 1996
-
Data Warehousing Concepts
Decision support is key for companies wanting to turn their organizational data into an information asset
Traditional database is transaction-oriented while data warehouse is data-retrieval optimized for decision-support
Data Warehouse
"A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process"
Related technologies: OLAP (on-line analytical processing), Decision Support Systems (DSS), Executive Information Systems (EIS), and data mining applications
*
-
What does a data warehouse do?
integrate diverse information from various systems, enabling users to quickly produce powerful ad hoc queries and perform complex analysis
create an infrastructure for reusing the data in numerous ways
create an open systems environment to make useful information easily accessible to authorized users
help managers make informed decisions
*
-
Benefits of Data Warehousing
Potential high returns on investment
Competitive advantage
Increased productivity of corporate decision-makers
*
-
Comparison of OLTP and Data Warehousing
OLTP systems vs. data warehousing systems:
Holds current data vs. holds historic data
Stores detailed data vs. stores detailed, lightly, and highly summarized data
Data is dynamic vs. data is largely static
Repetitive processing vs. ad hoc, unstructured, and heuristic processing
High level of transaction throughput vs. medium to low transaction throughput
Predictable pattern of usage vs. unpredictable pattern of usage
Transaction driven vs. analysis driven
Application oriented vs. subject oriented
Supports day-to-day decisions vs. supports strategic decisions
Serves a large number of clerical/operational users vs. serves a relatively lower number of managerial users
*
-
Data Warehouse Architecture
Operational Data
Load Manager
Warehouse Manager
Query Manager
Detailed Data
Lightly and Highly Summarized Data
Archive / Backup Data
Meta-Data
End-user Access Tools
*
-
End-user Access Tools
Reporting and query tools
Application development tools
Executive Information System (EIS) tools
Online Analytical Processing (OLAP) tools
Data mining tools
*
-
Data Warehousing Tools and Technologies
Extraction, Cleansing, and Transformation Tools
Data Warehouse DBMS
Load performance
Load processing
Data quality management
Query performance
Terabyte scalability
Networked data warehouse
Warehouse administration
Integrated dimensional tools
Advanced query functionality
*
-
Data Marts
A subset of the data warehouse that supports the requirements of a particular department or business function
*
-
Online Analytical Processing (OLAP)
OLAP: the dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data
Multi-dimensional OLAP: cubes of data
*
(Example cube dimensions: City, Time, Product type)
-
Problems of Data Warehousing
Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long-duration projects
Complexity of integration
*
-
Codd's Rules for OLAP
Multi-dimensional conceptual view
Transparency
Accessibility
Consistent reporting performance
Client-server architecture
Generic dimensionality
Dynamic sparse matrix handling
Multi-user support
Unrestricted cross-dimensional operations
Intuitive data manipulation
Flexible reporting
Unlimited dimensions and aggregation levels
*
-
OLAP Tools
Multi-dimensional OLAP (MOLAP): multi-dimensional DBMS (MDDBMS)
Relational OLAP (ROLAP): creation of multiple multi-dimensional views of two-dimensional relations
Managed Query Environment (MQE): delivers selected data directly from the DBMS to the desktop in the form of a data cube, where it is stored, analyzed, and manipulated locally
*
-
Data Mining
Definition
The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions
Knowledge discovery
Association rules
Sequential patterns
Classification trees
Goals
Prediction
Identification
Classification
Optimization
*
-
Data Mining Techniques
Predictive modeling: supervised training with two phases
Training phase: building a model using a large sample of historical data called the training set
Testing phase: trying the model on new data
Database segmentation
Link analysis
Deviation detection
*
-
What are Data Mining Tasks?
Classification
Regression
Clustering
Summarization
Dependency modeling
Change and deviation detection
*
-
What are Data Mining Discoveries?
New purchase trends
Planning investment strategies
Detecting unauthorized expenditure
Fraudulent activities
Crime trends
Smugglers crossing borders
*
-
*
Data Warehouse Architecture
-
*
Data Warehouse for Decision Support & OLAP
Putting information technology to work to help the knowledge worker make faster and better decisions:
Which of my customers are most likely to go to the competition?
What product promotions have the biggest impact on revenue?
How did the share price of software companies correlate with profits over the last 10 years?
-
*
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad hoc
Used by managers and end-users to understand the business and make judgements
-
*
Data Mining works with Warehouse Data
Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence
-
*
We want to know ...
Given a database of 100,000 names, which persons are the least likely to default on their credit cards? Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer? If I raise the price of my product by Rs. 2, what is the effect on my ROI? If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result? If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues? Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
-
*
Application Areas
Industry: Application
Finance: credit card analysis
Insurance: claims, fraud analysis
Telecommunications: call record analysis
Transport: logistics management
Consumer goods: promotion analysis
Data service providers: value-added data
Utilities: power usage analysis
-
*
Data Mining in Use
The US Government uses data mining to track fraud
A supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross selling
Warranty claims routing
Holding on to good customers
Weeding out bad customers
-
*
What makes data mining possible?
Advances in the following areas are making data mining deployable:
data warehousing
better and more data (i.e., operational, behavioral, and demographic)
the emergence of easily deployed data mining tools
the advent of new data mining techniques
-- Gartner Group
-
*
Why Separate Data Warehouse?
Performance
Operational databases are designed and tuned for known transactions and workloads.
Complex OLAP queries would degrade performance for operational transactions.
Special data organization, access, and implementation methods are needed for multidimensional views and queries.
Function
Missing data: decision support requires historical data, which operational databases do not typically maintain.
Data consolidation: decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
Data quality: different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
-
*
What are Operational Systems?
They are OLTP systems
Run mission-critical applications
Need to meet stringent performance requirements for routine tasks
Used to run a business!
-
*
RDBMS used for OLTP
Database systems have been used traditionally for OLTP:
clerical data processing tasks
detailed, up-to-date data
structured, repetitive tasks
read/update a few records
isolation, recovery, and integrity are critical
-
*
Operational Systems
Run the business in real time
Based on up-to-the-second data
Optimized to handle large numbers of simple read/write transactions
Optimized for fast response to predefined transactions
Used by people who deal with customers and products: clerks, salespeople, etc.
They are increasingly used by customers
-
*
Examples of Operational Data
-
*
Application-Orientation vs. Subject-Orientation
-
*
OLTP vs. Data Warehouse
OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse
Special data organization, access methods, and implementation methods are needed to support data warehouse queries (typically multidimensional queries), e.g., average amount spent on phone calls between 9 AM and 5 PM in Pune during the month of December
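The example query above (select on three dimensions, then aggregate) can be sketched over a tiny invented call-record table; the rows, field names, and amounts are made up for illustration.

```python
# Hypothetical call-detail records: city, month, hour of day, amount.
calls = [
    {"city": "Pune", "month": 12, "hour": 10, "amount": 5.0},
    {"city": "Pune", "month": 12, "hour": 16, "amount": 7.0},
    {"city": "Pune", "month": 12, "hour": 20, "amount": 9.0},   # outside 9-5
    {"city": "Mumbai", "month": 12, "hour": 11, "amount": 4.0}, # wrong city
]

# Select along the City, Time-of-day, and Month dimensions,
# then aggregate: average amount spent on qualifying calls.
selected = [c["amount"] for c in calls
            if c["city"] == "Pune" and c["month"] == 12
            and 9 <= c["hour"] < 17]
avg_amount = sum(selected) / len(selected)
```

In a warehouse this would be a single aggregate query over a fact table; the point is that the access pattern (scan, filter on several dimensions, aggregate) is very different from an OLTP read/update of a few records.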
-
*
OLTP vs Data Warehouse
OLTP:
Application oriented
Used to run business
Detailed data
Current, up to date
Isolated data
Repetitive access
Clerical user
Warehouse (DSS):
Subject oriented
Used to analyze business
Summarized and refined
Snapshot data
Integrated data
Ad hoc access
Knowledge user (manager)
-
*
OLTP vs Data Warehouse
OLTP:
Performance sensitive
Few records accessed at a time (tens)
Read/update access
No data redundancy
Database size 100 MB - 100 GB
Data warehouse:
Performance relaxed
Large volumes accessed at a time (millions)
Mostly read (batch update)
Redundancy present
Database size 100 GB - a few terabytes
-
*
OLTP vs Data Warehouse
OLTP:
Transaction throughput is the performance metric
Thousands of users
Managed in entirety
Data warehouse:
Query throughput is the performance metric
Hundreds of users
Managed by subsets
-
*
To summarize ...
OLTP systems are used to run a business
The data warehouse helps to optimize the business
-
*
Why Now?
Data is being produced
ERP provides clean data
The computing power is available
The computing power is affordable
The competitive pressures are strong
Commercial products are available
-
*
Myths surrounding OLAP Servers and Data Marts
Data marts and OLAP servers are departmental solutions supporting a handful of users
Million-dollar massively parallel hardware is needed to deliver fast response times for complex queries
OLAP servers require massive and unwieldy indices
Complex OLAP queries clog the network with data
Data warehouses must be at least 100 GB to be effective
Source -- Arbor Software Home Page
-
II. On-Line Analytical Processing (OLAP)
Making Decision Support Possible
-
*
Typical OLAP Queries
Write a multi-table join to compare sales for each product line YTD this year vs. last year. Repeat the above process to find the top 5 product contributors to margin. Repeat the above process to find the sales of a product line to new vs. existing customers. Repeat the above process to find the customers that have had negative sales growth.
-
*
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
What Is OLAP?
Online Analytical Processing: coined by E.F. Codd in a 1994 paper contracted by Arbor Software*
Generally synonymous with earlier terms such as decision support, business intelligence, executive information systems
OLAP = multidimensional database
MOLAP: multidimensional OLAP (Arbor Essbase, Oracle Express)
ROLAP: relational OLAP (Informix MetaCube, MicroStrategy DSS Agent)
-
*
The OLAP Market
Rapid growth in the enterprise market:
1995: $700 million
1997: $2.1 billion
Significant consolidation activity among major DBMS vendors:
10/94: Sybase acquires ExpressWay
7/95: Oracle acquires Express
11/95: Informix acquires Metacube
10/96: Microsoft acquires Panorama
1/97: Arbor partners with IBM
Result: OLAP shifted from a small vertical niche to a mainstream DBMS category
-
*
Strengths of OLAP
It is a powerful visualization paradigm
It provides fast, interactive response times
It is good for analyzing time series
It can be useful for finding some clusters and outliers
Many vendors offer OLAP tools
-
*
Nigel Pendse, Richard Creath - The OLAP Report
OLAP Is FASMI
Fast
Analysis
Shared
Multidimensional
Information
-
*
Dimensions: Product, Region, Time
Hierarchical summarization paths:
Product: Industry > Category > Product
Region: Country > Region > City > Office
Time: Year > Quarter > Month or Week > Day
Multi-dimensional Data
"Hey, I sold $100M worth of goods!"
-
*
A Visual Operation: Pivot (Rotate)
[Figure: a sales cube with dimensions Product (Juice, Cola, Milk, Cream), Region (NY, LA, SF), and Date (3/1-3/4, rolled up to Month); pivoting rotates the cube so a different pair of dimensions forms the visible face. Example cell values: 10, 47, 30, 12.]
-
*
Slicing and Dicing
[Figure: a cube with dimensions Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special), and Region (India, Far East, Europe). Fixing Product = Telecomm yields "The Telecomm Slice".]
-
*
Roll-up and Drill Down
Example drill path: Sales Channel > Region > Country > State > Location Address > Sales Representative. Roll-up aggregates to coarser levels; drill-down moves to finer ones.
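Roll-up over a dimension hierarchy can be sketched as aggregation over key prefixes. The fact rows and the Region > Country > State hierarchy below are invented for the sketch.

```python
from collections import defaultdict

# Hypothetical fact table: (region, country, state, sales) rows.
facts = [
    ("Asia", "India", "Maharashtra", 100),
    ("Asia", "India", "Karnataka", 150),
    ("Asia", "Japan", "Tokyo", 200),
    ("Europe", "France", "Ile-de-France", 120),
]

def roll_up(level):
    """Aggregate sales up to a hierarchy level:
    0 = region (coarsest), 1 = country, 2 = state (finest)."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[:level + 1]] += row[3]
    return dict(totals)

by_region = roll_up(0)    # roll-up: coarsest summary
by_country = roll_up(1)   # drill-down: one level finer
```

An OLAP server precomputes and indexes many of these aggregates so that moving between levels is interactive rather than a full scan.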
-
Results of Data Mining Include:
Forecasting: what may happen in the future
Classifying: people or things into groups by recognizing patterns
Clustering: people or things into groups based on their attributes
Associating: what events are likely to occur together
Sequencing: what events are likely to lead to later events
-
Data mining is not
Brute-force crunching of bulk data
Blind application of algorithms
Going to find relationships where none exist
Presenting data in different ways
A database-intensive task
A difficult-to-understand technology requiring an advanced degree in computer science
-
Data Mining versus OLAP
OLAP (On-line Analytical Processing) provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening.
-
Data Mining Versus Statistical Analysis
Data Mining:
Originally developed to act as expert systems to solve problems
Less interested in the mechanics of the technique
If it makes sense, then let's use it
Does not require assumptions to be made about data
Can find patterns in very large amounts of data
Requires understanding of data and business problem
Statistical Analysis:
Tests for statistical correctness of models
Are statistical assumptions of models correct? (e.g., is the R-square good?)
Hypothesis testing: is the relationship significant? (use a t-test to validate significance)
Tends to rely on sampling
Techniques are not optimised for large amounts of data
Requires strong statistical skills
-
Examples of What People are Doing with Data Mining:
Fraud/Non-Compliance Anomaly detection
Isolate the factors that lead to fraud, waste and abuse
Target auditing and investigative efforts more effectively
Credit/Risk Scoring
Intrusion detection
Parts failure prediction
Recruiting/Attracting customers
Maximizing profitability (cross selling, identifying profitable customers)
Service Delivery and Customer Retention
Build profiles of customers likely to use which services
Web Mining
-
What data mining has done for...
The US Internal Revenue Service needed to improve customer service and...
scheduled its workforce to provide faster, more accurate answers to questions.
*
The US Internal Revenue Service is using data mining to improve customer service. By analyzing incoming requests for help and information, the IRS hopes to schedule its workforce to provide faster, more accurate answers to questions.
-
What data mining has done for...
The US Drug Enforcement Agency needed to be more effective in their drug busts and...
analyzed suspects' cell phone usage to focus investigations.
*
The US DFAS needed to search through 2.5 million financial transactions that might indicate inaccurate charges. Instead of relying on tips to point out fraud, the DFAS mined the data to identify suspicious transactions. Using Clementine, the agency examined credit card transactions and was able to identify purchases that did not match past patterns. With this information, DFAS could focus investigations, finding fraud more cost-effectively.
-
What data mining has done for...
HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher-yielding investments and...
reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.
*
Retail banking is a highly competitive business. In addition to competition from other banks, banks also see intense competition from financial services companies of all kinds, from stockbrokers to mortgage companies.
With so many organizations working the same customer base, the value of customer retention is greater than ever before. As a result, HSBC Bank USA looks to entice existing customers to "roll over" maturing products, or to cross-sell new ones.
Using SPSS products, HSBC found that it could reduce direct mail costs by 30% while still bringing in 95% of the campaign's revenue. Because HSBC is sending out fewer mail pieces, customers are likely to be more loyal because they don't receive junk mail from the bank.
-
Suggestion: Predicting Washington
C-SPAN has launched a digital archive of 500,000 hours of audio debates. Text mining or audio mining of these talks could answer certain questions, such as...
-
Example Application: Sports
IBM Advanced Scout analyzes
NBA game statistics:
Shots blocked
Assists
Fouls
Google: IBM Advanced Scout
-
Advanced Scout
Example pattern: an analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that "when Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots."
The pattern is interesting: the average shooting percentage for the Charlotte Hornets during that game was 54%.
-
Data Mining: Types of Data
Relational data and transactional data
Spatial and temporal data, spatio-temporal observations
Time-series data
Text
Images, video
Mixtures of data
Sequence data
Features from processing other data sources
-
Data Mining Techniques
Supervised learning
Classification and regression
Unsupervised learning
Clustering
Dependency modeling
Associations, summarization, causality
Outlier and deviation detection
Trend analysis and change detection
-
Different Types of Classifiers
Linear discriminant analysis (LDA)
Quadratic discriminant analysis (QDA)
Density estimation methods
Nearest neighbor methods
Logistic regression
Neural networks
Fuzzy set theory
Decision trees
-
Test Sample Estimate
Divide D into D1 and D2
Use D1 to construct the classifier d
Then use the resubstitution estimate R(d, D2) to calculate the estimated misclassification error of d
Unbiased and efficient, but removes D2 from the training dataset D
-
V-fold Cross Validation
Procedure:
Construct classifier d from D
Partition D into V datasets D1, …, DV
Construct classifier di using D \ Di
Calculate the estimated misclassification error R(di, Di) of di using test sample Di
Final misclassification estimate:
Weighted combination of individual misclassification errors:
R(d, D) = (1/V) Σi R(di, Di)
-
Cross-Validation: Example
[Figure: classifier d built from the full dataset D, and classifiers d1, d2, d3 each built with one fold held out.]
-
Cross-Validation
Misclassification estimate obtained through cross-validation is usually nearly unbiased
Costly computation (we need to compute d, and d1, …, dV); computation of di is nearly as expensive as computation of d
Preferred method to estimate quality of learning algorithms in the machine learning literature
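The V-fold procedure above can be sketched as follows; `train_fn` and `predict_fn` are hypothetical stand-ins for whatever classifier is being evaluated:

```python
import numpy as np

def v_fold_cv_error(X, y, train_fn, predict_fn, V=5, seed=0):
    """Estimate misclassification error R(d, D) by V-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, V)                            # partition D into D1..DV
    errors = []
    for i in range(V):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(V) if j != i])
        model = train_fn(X[train], y[train])                  # classifier d_i built on D \ D_i
        err = np.mean(predict_fn(model, X[test]) != y[test])  # R(d_i, D_i) on held-out fold
        errors.append(err)
    return float(np.mean(errors))                             # combine the V fold estimates
```

With equal-sized folds the plain average matches the 1/V weighted combination on the slide.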
-
Decision Tree Construction
Three algorithmic components:
Split selection (CART, C4.5, QUEST, CHAID, CRUISE, …)
Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping)
Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator)
*
-
Goodness of a Split
Consider node t with impurity phi(t)
The reduction in impurity through splitting predicate s (t splits into children nodes tL with impurity phi(tL) and tR with impurity phi(tR)) is:
phi(s, t) = phi(t) − pL · phi(tL) − pR · phi(tR)
-
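As a concrete instance of the impurity-reduction formula above, a minimal sketch using Gini impurity as phi (any impurity function would slot in the same way):

```python
def gini(labels):
    """Gini impurity phi(t) = 1 - sum_c p_c^2 over the class labels at a node."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def impurity_reduction(parent, left, right):
    """phi(s,t) = phi(t) - pL*phi(tL) - pR*phi(tR) for a split s of node t."""
    pL, pR = len(left) / len(parent), len(right) / len(parent)
    return gini(parent) - pL * gini(left) - pR * gini(right)
```

A perfect split of [0, 0, 1, 1] into [0, 0] and [1, 1] yields the maximum possible reduction, 0.5.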
Pruning Methods
Test dataset pruning
Direct stopping rule
Cost-complexity pruning
MDL pruning
Pruning by randomization testing
-
Stopping Policies
A stopping policy indicates when further growth of the tree at a node t is counterproductive.
All records are of the same class
The attribute values of all records are identical
All records have missing values
At most one class has a number of records larger than a user-specified number
All records go to the same child node if t is split (only possible with some split selection methods)
-
Test Dataset Pruning
Use an independent test sample D to estimate the misclassification cost using the resubstitution estimate R(T, D) at each node
Select the subtree T' of T with the smallest expected cost
-
Missing Values
What is the problem?
During computation of the splitting predicate, we can selectively ignore records with missing values (note that this has some problems)
But if a record r misses the value of the variable in the splitting attribute, r cannot participate further in tree construction
Algorithms for missing values address this problem.
-
Mean and Mode Imputation
Assume record r has missing value r.X, and splitting variable is X.
Simplest algorithm:
If X is numerical (categorical), impute the overall mean (mode)
Improved algorithm:
If X is numerical (categorical), impute the mean(X | t.C) (the mode(X | t.C))
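A minimal sketch of the simplest algorithm above (the improved, per-class variant would just restrict `values` to the records of one class t.C):

```python
from statistics import mean, mode

def impute(values):
    """Replace None entries: mean if the column is numeric, mode if categorical."""
    present = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    fill = mean(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]
```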
-
Decision Trees: Summary
Many applications of decision trees
There are many algorithms available for:
Split selection
Pruning
Handling missing values
Data access
Decision tree construction is still an active research area (after 20+ years!)
Challenges: performance, scalability, evolving datasets, new applications
-
Supervised vs. Unsupervised Learning
Supervised
y = F(x): true function
D: labeled training set, D = {xi, F(xi)}
Learn: G(x), a model trained to predict the labels of D
Goal: E[(F(x) − G(x))²] ≈ 0
Well-defined criteria: accuracy, RMSE, ...

Unsupervised
Generator: true model
D: unlabeled data sample, D = {xi}
Learn: ??????????
Goal: ??????????
Well-defined criteria: ??????????
-
Clustering: Unsupervised Learning
Given:
Data set D (training set)
Similarity/distance metric/information
Find:
Partitioning of data
Groups of similar/close items
-
Similarity?
Groups of similar customers
Similar demographics
Similar buying behavior
Similar health
Similar products
Similar cost
Similar function
Similar store
Similarity usually is domain/problem specific
-
Clustering: Informal Problem Definition
Input:
A data set of N records each given as a d-dimensional data feature vector.
Output:
Determine a natural, useful partitioning of the data set into a number of (k) clusters and noise such that we have:
High similarity of records within each cluster (intra-cluster similarity)
Low similarity of records between clusters (inter-cluster similarity)
-
Types of Clustering
Hard clustering: each object is in one and only one cluster
Soft clustering: each object has a probability of being in each cluster
-
Clustering Algorithms
Partitioning-based clustering
K-means clustering
K-medoids clustering
EM (expectation maximization) clustering
Hierarchical clustering
Divisive clustering (top down)
Agglomerative clustering (bottom up)
Density-based methods
Regions of dense points separated by sparser regions of relatively low density
-
K-Means Clustering Algorithm
Initialize k cluster centers
Do
Assignment step: Assign each data point to its closest cluster center
Re-estimation step: Re-compute cluster centers
While (there are still changes in the cluster centers)
Visualization at:
http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
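The loop above, sketched for 1-D points; real implementations handle arbitrary dimensions and distance metrics, but the assignment/re-estimation structure is the same:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: assign each point to its closest center, re-estimate, repeat."""
    centers = random.Random(seed).sample(points, k)    # initialize k cluster centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                               # assignment step
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centers[i]    # re-estimation step
               for i, c in enumerate(clusters)]
        if new == centers:                             # stop: no change in the centers
            break
        centers = new
    return sorted(centers)
```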
-
Issues
Why does K-means work?
How does it find the cluster centers?
Does it find an optimal clustering?
What are good starting points for the algorithm?
What is the right number of cluster centers?
How do we know it will terminate?
-
Agglomerative Clustering
Algorithm:
Put each item in its own cluster (all singletons)
Find all pairwise distances between clusters
Merge the two closest clusters
Repeat until everything is in one cluster
Observations:
Results in a hierarchical clustering
Yields a clustering for each possible number of clusters
Greedy clustering: the result is not optimal for any cluster size
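The algorithm above, sketched for 1-D points with single-link (minimum) distance between clusters, stopped at k clusters instead of running all the way to one:

```python
def agglomerative(points, k):
    """Bottom-up single-link clustering of 1-D points until k clusters remain."""
    clusters = [[p] for p in points]           # every item starts as a singleton
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # merge the pair with the smallest minimum point-to-point distance
        i, j = min(pairs, key=lambda ij: min(
            abs(a - b) for a in clusters[ij[0]] for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return sorted(sorted(c) for c in clusters)
```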
-
Density-Based Clustering
A cluster is defined as a connected dense component.
Density is defined in terms of the number of neighbors of a point.
We can find clusters of arbitrary shape.
-
Market Basket Analysis
Consider a shopping cart filled with several items
Market basket analysis tries to answer the following questions:
Who makes purchases?
What do customers buy together?
In what order do customers purchase items?
-
Market Basket Analysis
Given:
A database of customer transactions
Each transaction is a set of items
Example:
Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

TID  CID  Date    Item   Qty
111  201  5/1/99  Pen    2
111  201  5/1/99  Ink    1
111  201  5/1/99  Milk   3
111  201  5/1/99  Juice  6
112  105  6/3/99  Pen    1
112  105  6/3/99  Ink    1
112  105  6/3/99  Milk   1
113  106  6/5/99  Pen    1
113  106  6/5/99  Milk   1
114  201  7/1/99  Pen    2
114  201  7/1/99  Ink    2
114  201  7/1/99  Juice  4
-
Market Basket Analysis (Contd.)
Co-occurrences: 80% of all customers purchase items X, Y and Z together.
Association rules: 60% of all customers who purchase X and Y also buy Z.
Sequential patterns: 60% of customers who first buy X also purchase Y within three weeks.
*
-
Confidence and Support
We prune the set of all possible association rules using two interestingness measures:
Confidence of a rule: X → Y has confidence c if P(Y|X) = c
Support of a rule: X → Y has support s if P(X,Y) = s
We can also define
Support of an itemset (a co-occurrence) XY: XY has support s if P(X,Y) = s
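The two measures translate directly into code when transactions are sets of items (a minimal sketch):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    s = set(itemset)
    return sum(s <= set(t) for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """P(Y|X) = support of X and Y together, divided by support of X."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)
```

On the four transactions from the earlier example table ({Pen, Ink, Milk, Juice}, {Pen, Ink, Milk}, {Pen, Milk}, {Pen, Ink, Juice}), the rule Pen → Milk has support 0.75 and confidence 0.75.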
*
-
Market Basket Analysis: Applications
Sample applications:
Direct marketing
Fraud detection for medical insurance
Floor/shelf planning
Web site layout
Cross-selling
*
-
Applications of Frequent Itemsets
Market basket analysis
Association rules
Classification (especially: text, rare classes)
Seeds for construction of Bayesian networks
Web log analysis
Collaborative filtering
-
Association Rule Algorithms
More abstract problem redux
Breadth-first search
Depth-first search
-
Problem Redux
Abstract:
A set of items {1, 2, …, k}
A database of transactions (itemsets) D = {T1, T2, …, Tn}, Tj ⊆ {1, 2, …, k}
GOAL:
Find all itemsets that appear in at least x transactions
(appear in == are subsets of)
I ⊆ T: T supports I
For an itemset I, the number of transactions it appears in is called the support of I.
x is called the minimum support.
Concrete:
I = {milk, bread, cheese, …}
D = { {milk, bread, cheese}, {bread, cheese, juice}, … }
GOAL:
Find all itemsets that appear in at least 1000 transactions
{milk,bread,cheese} supports {milk,bread}
-
Problem Redux (Contd.)
Definitions:
An itemset is frequent if it is a subset of at least x transactions. (FI)
An itemset is maximally frequent if it is frequent and does not have a frequent superset. (MFI)
GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)).
Obvious relationship:
MFI ⊆ FI
Example:
D={ {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} }
Minimum support x = 3
{1,2} is frequent
{1,2,3} is maximally frequent
Support({1,2}) = 4
All maximally frequent itemsets: {1,2,3}
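A brute-force sketch of these definitions (a real miner such as Apriori prunes the candidate space rather than enumerating it, but the semantics are identical):

```python
from itertools import combinations

def frequent_itemsets(D, x):
    """All itemsets that are subsets of at least x transactions (the FI)."""
    items = sorted({i for T in D for i in T})
    return [set(I)
            for r in range(1, len(items) + 1)
            for I in combinations(items, r)
            if sum(set(I) <= set(T) for T in D) >= x]

def maximal(FI):
    """The MFI: frequent itemsets with no frequent proper superset."""
    return [I for I in FI if not any(I < J for J in FI)]
```

On the example D above with minimum support x = 3, the only maximally frequent itemset is {1, 2, 3}.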
-
Applications
Spatial association rules
Web mining
Market basket analysis
User/customer profiling
-
Suggestions / Extensions: Sequential Patterns
In the market itemset analysis, replace Milk, Pen, etc. with the names of medications and use the idea in hospital data mining (new proposal). Add the idea of swarm intelligence, plus extra analysis of the induction rules in this set of slides.
-
Kraft Foods: Direct Marketing
Company maintains a large database of purchases by customers.
Data mining
1. Analysts identified associations among groups of products bought by particular segments of customers.
2. Sent out 3 sets of coupons to various households.
Better response rates: 50% increase in sales for one of its products
Continues to use this approach
Health Insurance Commission of Australia: Insurance Fraud
Commission maintains a database of insurance claims, including laboratory tests ordered during the diagnosis of patients.
Data mining
1. Identified the practice of "up coding" to reflect more expensive tests than are necessary.
2. Now monitors orders for lab tests.
Commission expects to save US$1,000,000 / year by eliminating the practice of "up coding."
-
HNC Software: Credit Card Fraud
Payment Fraud
Large issuers of cards may lose
$10 million / year due to fraud
Difficult to identify the few transactions among thousands which reflect potential fraud
Falcon software
Mines data through neural networks
Introduced in September 1992
Models each cardholder's requested transaction against the customer's past spending history.
processes several hundred requests per second
compares current transaction with customer's history
identifies the transactions most likely to be frauds
enables bank to stop high-risk transactions before they are authorized
Used by many retail banks: currently monitors
160 million card accounts for fraud
-
New Account Fraud
Fraudulent applications for credit cards are growing at 50 % per year
Falcon Sentry software
Mines data through neural networks and a rule base
Introduced in September 1992
Checks information on applications against data from credit bureaus
Allows card issuers to simultaneously:
increase the proportion of applications received
reduce the proportion of fraudulent applications authorized
New Account Fraud
-
Quality Control
IBM Microelectronics: Quality Control
Analyzed manufacturing data on Dynamic Random Access Memory (DRAM) chips.
Data mining
1. Built predictive models of
manufacturing yield (% non-defective)
effects of production parameters on chip performance.
2. Discovered critical factors behind
production yield &
product performance.
3. Created a new design for the chip
increased yield, saving millions of dollars in direct manufacturing costs
enhanced product performance by substantially lowering the memory cycle time
-
B & L Stores
Belk and Leggett Stores =
one of the largest retail chains
280 stores in southeast U.S.
data warehouse contains 100s of gigabytes (billion characters) of data
data mining to:
increase sales
reduce costs
Selected DSS Agent from MicroStrategy, Inc.
analyze merchandising (patterns of sales)
manage inventory
Retail Sales
-
DSS Agent
uses intelligent agents for data mining
provides multiple functions
recognizes sales patterns among stores
discovers sales patterns by
time of day
day of year
category of product
etc.
swiftly identifies trends & shifts in customer tastes
performs Market Basket Analysis (MBA)
analyzes Point-of-Sale or -Service (POS) data
identifies relationships among products and/or services purchased
E.g. A customer who buys Brand X slacks has a 35% chance of buying Brand Y shirts.
Agent tool is also used by other Fortune 1000 firms
average ROI > 300 %
average payback in 1 ~ 2 years
Market Basket Analysis
-
Case Based Reasoning
(CBR)
General scheme for a case based reasoning (CBR) model. The target case is
matched against similar precedents in the historical database, such as cases A and B.
-
Case Based Reasoning (CBR): learning through the accumulation of experience
Key issues
Indexing: storing cases for quick, effective access of precedents
Retrieval: accessing the appropriate precedent cases
Advantages
Explicit knowledge form recognizable to humans
No need to re-code knowledge for computer processing
Limitations
Retrieving precedents based on superficial features
E.g. matching Indonesia with the U.S. because both have similar population sizes
Traditional approach ignores the issue of generalizing knowledge
-
Genetic Algorithm: generation of candidate solutions using the procedures of biological evolution.
Procedure
0. Initialize.
Create a population of potential solutions ("organisms").
1. Evaluate.
Determine the level of "fitness" for each solution.
2. Cull.
Discard the poor solutions.
3. Breed.
a. Select 2 "fit" solutions to serve as parents.
b. From the 2 parents, generate offspring.
* Crossover:
Cut the parents at random and switch the 2 halves.
* Mutation:
Randomly change the value in a parent solution.
4. Repeat.
Go back to Step 1 above.
-
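A toy version of the procedure above on bit-string "organisms"; the population size, culling half, and single-bit mutation are illustrative choices, not part of any standard:

```python
import random

def genetic_search(fitness, length=8, pop=20, gens=50, seed=0):
    """Evolve bit lists: evaluate, cull the worst half, breed with crossover + mutation."""
    rng = random.Random(seed)
    P = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop)]  # 0. initialize
    for _ in range(gens):
        P.sort(key=fitness, reverse=True)        # 1. evaluate fitness
        P = P[: pop // 2]                        # 2. cull the poor solutions
        while len(P) < pop:                      # 3. breed
            a, b = rng.sample(P[: pop // 2], 2)  #    select 2 fit parents
            cut = rng.randrange(1, length)       #    crossover: cut and switch halves
            child = a[:cut] + b[cut:]
            child[rng.randrange(length)] ^= 1    #    mutation: flip one random bit
            P.append(child)
    return max(P, key=fitness)                   # 4. repeated gens times; best survivor
```

With `fitness=sum` (count the 1 bits), the search quickly drives organisms toward the all-ones string.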
Genetic Algorithm (Cont.): Advantages
Applicable to a wide range of problem domains.
Robustness:
can obtain solutions even when the performance function is highly irregular or input data are noisy.
Implicit parallelism:
can search in many directions concurrently.
Limitations
Slow, like neural networks.
But: computation can be distributed over multiple processors
(unlike neural networks)
Source: www.pathology.washington.edu
-
Multistrategy Learning: every technique has advantages & limitations
Multistrategy approach
Take advantage of the strengths of diverse techniques
Circumvent the limitations of each methodology
-
Types of Models
Prediction Models for Predicting and Classifying
Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)
Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)
Descriptive Models for Grouping and Finding Associations
Clustering/Grouping algorithms: K-means, Kohonen
Association algorithms: apriori, GRI
-
Neural Networks
Description:
Difficult interpretation
Tends to overfit the data
Extensive amount of training time
A lot of data preparation
Works with all data types
-
Rule Induction
Description
Intuitive output
Handles all forms of numeric data, as well as non-numeric (symbolic) data
C5 Algorithm a special case of rule induction
Target variable must be symbolic
-
Apriori
Description
Seeks association rules in dataset
Market basket analysis
Sequence discovery
-
Data Mining Is
The automated process of finding relationships and patterns in stored data. It is different from the use of SQL queries and other business intelligence tools.
-
Data Mining Is
Motivated by business need, large amounts of available data, and humans' limited cognitive processing abilities
Enabled by data warehousing, parallel processing, and data mining algorithms
-
Common Types of Information from Data Mining
Associations -- identifies occurrences that are linked to a single event
Sequences -- identifies events that are linked over time
Classification -- recognizes patterns that describe the group to which an item belongs
-
Common Types of Information from Data Mining
Clustering -- discovers different groupings within the data
Forecasting -- estimates future values
-
Commonly Used Data Mining Techniques
Artificial neural networks
Decision trees
Genetic algorithms
Nearest neighbor method
Rule induction
-
The Current State of Data Mining Tools
Many of the vendors are small companies
IBM and SAS have been in the market for some time, and more biggies are moving into this market
BI tools and RDBMS products are increasingly including basic data mining capabilities
Packaged data mining applications are becoming common
-
The Data Mining Process
Requires personnel with domain, data warehousing, and data mining expertise
Requires data selection, data extraction, data cleansing, and data transformation
Most data mining tools work with highly granular flat files
Is an iterative and interactive process
-
Why Data Mining
Credit ratings/targeted marketing: given a database of 100,000 names, which persons are the least likely to default on their credit cards? Identify likely responders to sales promotions
Fraud detection: which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
Customer relationship management: which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Data Mining helps extract such information
-
Applications
Banking: loan/credit card approval; predict good customers based on old customers
Customer relationship management: identify those who are likely to leave for a competitor
Targeted marketing: identify likely responders to promotions
Fraud detection: telecommunications, financial transactions; from an online stream of events identify fraudulent events
Manufacturing and production: automatically adjust knobs when process parameters change
*
Any area with large amounts of historic data that, if understood better, can help shape future decisions.
-
Applications (continued)
Medicine: disease outcome, effectiveness of treatments; analyze patient disease history to find relationships between diseases
Molecular/pharmaceutical: identify new drugs
Scientific data analysis: identify new galaxies by searching for sub-clusters
Web site/store design and promotion: find affinity of visitors to pages and modify layout
-
The KDD process
Problem formulation
Data collection
Subset data: sampling might hurt if highly skewed data
Feature selection: principal component analysis, heuristic search
Pre-processing: cleaning
Name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values
Transformation: map complex objects, e.g. time-series data, to features, e.g. frequency
Choosing the mining task and mining method
Result evaluation and visualization
Knowledge discovery is an iterative process
-
Relationship with other fields
Overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but with more stress on:
scalability to the number of features and instances
algorithms and architectures (foundations of methods and formulations are provided by statistics and machine learning)
automation for handling large, heterogeneous data
*
-
Some basic operations
Predictive:
Regression
Classification
Collaborative filtering
Descriptive:
Clustering / similarity matching
Association rules and variants
Deviation detection
*
Each topic is a talk..
-
Classification
Given old data about customers and payments, predict a new applicant's loan eligibility.
[Figure: previous customers' data (age, salary, profession, location, customer type) trains a classifier, producing decision rules such as "Salary > 5 L" and "Prof. = Exec"; new applicants' data is then classified as good/bad.]
*
-
Classification methods
Goal: predict class Ci = f(x1, x2, …, xn)
Regression: (linear or any other polynomial) a*x1 + b*x2 + c = Ci
Nearest neighbour
Decision tree classifier: divide decision space into piecewise constant regions
Probabilistic/generative models
Neural networks: partition by non-linear boundaries
*
-
Nearest neighbor
Define proximity between instances, find neighbors of a new instance and assign the majority class
Case based reasoning: when attributes are more complicated than real-valued
Pros: fast training
Cons: slow during application; no feature selection; notion of proximity vague
*
-
Clustering
Unsupervised learning, used when old data with class labels is not available, e.g. when introducing a new product
Group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster
Key requirement: need a good measure of similarity between instances
Identify micro-markets and develop policies for each
*
-
Applications
Customer segmentation, e.g. for targeted marketing
Group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster
Identify micro-markets and develop policies for each
Collaborative filtering: group based on common items purchased
Text clustering
Compression
-
Distance functions
Numeric data: Euclidean, Manhattan distances
Categorical data: 0/1 to indicate presence/absence, followed by
Hamming distance (# of dissimilarities)
Jaccard coefficient: # of similarities in 1s / (# of 1s)
Data-dependent measures: similarity of A and B depends on co-occurrence with C
Combined numeric and categorical data: weighted normalized distance
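The two categorical measures above, on 0/1 presence vectors (a minimal sketch):

```python
def hamming(a, b):
    """Number of positions where two 0/1 presence vectors disagree (dissimilarity)."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Number of shared 1s divided by the total number of 1s (similarity)."""
    both = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return both / either if either else 1.0
```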
-
Clustering methods
Hierarchical clustering
agglomerative vs. divisive
single link vs. complete link
Partitional clustering
distance-based: K-means
model-based: EM
density-based
-
Agglomerative Hierarchical clustering
Given: matrix of similarities between every point pair
Start with each point in a separate cluster and merge clusters based on some criterion:
Single link: merge the two clusters such that the minimum distance between two points from the two different clusters is the least
Complete link: merge the two clusters such that all points in one cluster are close to all points in the other
-
Partitional methods: K-means
Criterion: minimize the sum of squared distances
Between each point and the centroid of the cluster
Between each pair of points in the cluster
Algorithm:
Select an initial partition with K clusters: random, first K, K separated points
Repeat until stabilization:
Assign each point to the closest cluster center
Generate new cluster centers
Adjust clusters by merging/splitting
-
Collaborative Filtering
Given a database of user preferences, predict the preferences of a new user
Example: predict what new movies you will like based on
your past preferences
others with similar past preferences
their preferences for the new movies
Example: predict what books/CDs a person may want to buy (and suggest them, or give discounts to tempt the customer)
-
Association rules
Given a set T of groups of items
Example: set of itemsets purchased
Goal: find all rules on itemsets of the form a --> b such that
support of a and b > user threshold s
conditional probability (confidence) of b given a > user threshold c
Example: Milk --> bread
Purchase of product A --> service B
Example transaction set T:
{Milk, cereal}
{Tea, milk}
{Tea, rice, bread}
{cereal}
*
-
Prevalent vs. Interesting
Analysts already know about prevalent rules
Interesting rules are those that deviate from prior expectation
Mining's payoff is in finding surprising phenomena
1995: "Milk and cereal sell together!"
*
-
Applications of fast itemset counting
Find correlated events:
Applications in medicine: find redundant tests
Cross-selling in retail, banking
Improve predictive capability of classifiers that assume attribute independence
New similarity measures of categorical attributes [Mannila et al., KDD 98]
*
-
Application Areas
Industry                Application
Finance                 Credit card analysis
Insurance               Claims, fraud analysis
Telecommunication       Call record analysis
Transport               Logistics management
Consumer goods          Promotion analysis
Data service providers  Value-added data
Utilities               Power usage analysis
-
Usage scenarios
Data warehouse mining: assimilate data from operational sources; mine static data
Mining log data
Continuous mining: example in process control
Stages in mining: data selection, pre-processing (cleaning), transformation, mining, result evaluation, visualization
*
-
Mining market
Around 20 to 30 mining tool vendors
Major tool players: Clementine, IBM's Intelligent Miner, SGI's MineSet, SAS's Enterprise Miner
All pretty much the same set of tools
Many embedded products: fraud detection, electronic commerce applications, health care, customer relationship management (Epiphany)
*
Absolute: $40M, expected to grow 10 times by 2000 (Forrester Research)
-
Vertical integration: mining on the web
Web log analysis for site design: what are popular pages, what links are hard to find
Electronic stores' sales enhancements: recommendations, advertisement
Collaborative filtering: Net perception, Wisewire
Inventory control: what was a shopper looking for and could not find?
*
-
State of art in mining OLAP integration
Decision trees [Information Discovery, Cognos]: find factors influencing high profits
Clustering [Pilot Software]: segment customers to define a hierarchy on that dimension
Time series analysis [Seagate's Holos]: query for various shapes along time, e.g. spikes, outliers
Multi-level associations [Han et al.]: find associations between members of dimensions
Sarawagi [VLDB 2000]
*
Little integration; here are a few exceptions.
People are starting to wake up to this possibility, and here are some examples I have found by web surfing.
Decision trees are most common. Information Discovery claimed to be the only serious integrator [DBMS Apr 98].
Clustering is used by some to define new product hierarchies.
Of course, a rich set of time-series functions, especially for forecasting, was always there.
New charting software: 80/20, A-B-C analysis, quadrant plotting.
Univ.: Jiawei Han.
The previous approach has been to bring mining operations into OLAP: look at mining operations and choose what fits.
My approach has been to reflect on what people do with the cube metaphor and the drill-down, roll-up based exploration, and see if there is anything there that can be automated.
Discuss my work first.
-
Data Mining in Use
The US government uses data mining to track fraud
A supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross selling
Target marketing
Holding on to good customers
Weeding out bad customers
*
-
Some success stories
Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data
Won over the (manual) knowledge engineering approach
http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides a good detailed description of the entire process
Major US bank: customer attrition prediction
First segment customers based on financial behavior: found 3 segments
Build attrition models for each of the 3 segments
40-50% of attritions were predicted == factor of 18 increase
Targeted credit marketing: major US banks
find customer segments based on 13 months' credit balances
build another response model based on surveys
increased response 4 times -- to 2%
-
Data Mining Tools: KnowledgeSeeker 4.5
*
What is KnowledgeSeeker?
Produced by ANGOSS Software Corporation, which focuses solely on data mining software.
Offers training and consulting services
Produces data mining add-ins which accept data from all major databases
Works with popular query and reporting, spreadsheet, statistical, and OLAP & ROLAP tools.
*
Angoss Software Corp. was formed under the Business Corporations Act (Ontario) in 1980. It began data mining software production in 1992. It is publicly traded on the Canadian Venture Exchange under the trading symbol ANC.
Promotes rapid knowledge transfer to customers in the use of technology and the adoption of best practices for data mining
-
Data Mining Tools: KnowledgeSeeker 4.5
*
Major Competitors
Company  Software
SPSS     Clementine 6.0
SAS      Enterprise Miner 3.0
IBM      Intelligent Miner
*
-
Data Mining Tools: KnowledgeSeeker 4.5
*
Major Competitors
Company  Software
SGI      MineSet 3.1
Oracle   Darwin
Cognos   Scenario
*
-
Data Mining Tools: KnowledgeSeeker 4.5
*
Current Applications
Manufacturing
Used by the R.R. Donnelley & Sons commercial printing company to improve process control, cut costs and increase productivity.
Used extensively by Hewlett Packard in their United States manufacturing plants as a process control tool both to analyze factors impacting product quality as well as to generate rules for production control systems.
*
-
Data Mining Tools: KnowledgeSeeker 4.5
*
Current Applications
Auditing
Used by the IRS to combat fraud, reduce risk, and increase collection rates.
Finance
Used by the Canadian Imperial Bank of Commerce (CIBC) to create models for fraud detection and risk management.
*
-
Data Mining Tools: KnowledgeSeeker 4.5
*
Current Applications
CRM
Telephony
Used by US West to reduce churning and increase customer loyalty for a new voice messaging technology.
*
-
Data Mining Tools: KnowledgeSeeker 4.5
*
Current Applications
Marketing
Used by the Washington Post to improve their direct mail targeting and to conduct survey analysis.
Health Care
Used by the Oxford Transplant Center to discover factors affecting transplant survival rates.
Used by the University of Rochester Cancer Center to study the effect of anxiety on chemotherapy-related nausea.
*
-
Data Mining Tools: KnowledgeSeeker 4.5
*
More Customers
*
-
Data Mining Tools: KnowledgeSeeker 4.5
*
Questions
What percentage of people in the test group have high blood pressure with these characteristics: a 66-year-old male regular smoker who has low to moderate salt consumption?
Do the risk levels change for a male with the same characteristics who quit smoking? What are the percentages?
If you are a 2% milk drinker, how many factors are still interesting?
Knowing that salt consumption and smoking habits are interesting factors, which one has a stronger correlation to blood pressure levels?
Grow an automatic tree. Look to see if gender is an interesting factor for a 55-year-old regular smoker who does not eat cheese.
-
Association
Classic market-basket analysis, which treats the purchase of a number of items (for example, the contents of a shopping basket) as a single transaction. This information can be used to adjust inventories, modify floor or shelf layouts, or introduce targeted promotional activities to increase overall sales or move specific products. Example: 80 percent of all transactions in which beer was purchased also included potato chips.
-
Sequence-based analysis
Traditional market-basket analysis deals with a collection of items as part of a point-in-time transaction. Sequence-based analysis extends this to identify a typical set of purchases that might predict the subsequent purchase of a specific item.
-
Clustering
Clustering approaches address segmentation problems. These approaches assign records with a large number of attributes into a relatively small set of groups or "segments." Example: Buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.
-
Classification
The most commonly applied data mining technique. The algorithm uses preclassified examples to determine the set of parameters required for proper discrimination. Example: A classifier derived from the classification approach is capable of identifying risky loans and could be used to aid in the decision of whether to grant a loan to an individual.
-
Issues of Data Mining
Present-day tools are strong but require significant expertise to implement effectively.
Susceptibility to "dirty" or irrelevant data
Inability to "explain" results in human terms
-
Issues
Susceptibility to "dirty" or irrelevant data: data mining tools of today simply take everything they are given as factual and draw the resulting conclusions. Users must take the necessary precautions to ensure that the data being analyzed is "clean."
-
Issues, cont
Inability to "explain" results in human terms: many of the tools employed in data mining analysis use complex mathematical algorithms that are not easily mapped into human terms. What good does the information do if you don't understand it?
-
Comparison with reporting, BI and OLAP
Reporting
Simple relationships
Choose the relevant factors
Examine all details
(Also applies to visualisation & simple statistics)
Data Mining
Complex relationships
Automatically find the relevant factors
Show only relevant details
Prediction
*
Here it's obviously the algorithms.
-
Comparison with Statistics
Statistical analysis
Mainly about hypothesis testing
Focussed on precision
Data mining
Mainly about hypothesis generation
Focussed on deployment
*
Here it's less clear: maybe it's the algorithms, but more it's the attitude.
-
Example: data mining and customer processes
Insight: Who are my customers and why do they behave the way they do?
Prediction: Who is a good prospect, for what product? Who is at risk? What is the next thing to offer?
Uses: Targeted marketing, mail-shots, call-centres, adaptive web-sites
*
-
Example: data mining and fraud detection
Insight: How can (a specific method of) fraud be recognised? What constitutes normal, abnormal and suspicious events?
Prediction: Recognise similarity to previous frauds (how similar?); spot abnormal events (how suspicious?)
Used by: Banks, telcos, retail, government
*
-
Example: data mining and diagnosing cancer
Complex data from genetics
Challenging data mining problem
Find patterns of gene activation indicating different diseases / stages
"Changed the way I think about cancer" (Oncologist, Chicago Children's Memorial Hospital)
*
-
Example: data mining and policing
Knowing the patterns helps plan effective crime prevention
Crime hot-spots understood better
Sift through mountains of crime reports
Identify crime series
"Other people save money using data mining; we save lives." (Police force homicide specialist and data miner)
*
-
Data mining tools:
Clementine and its philosophy*
How it works
How it's really used.
Handling of business problems and algorithms / expert features
Deep embedding of deployment
CRISP-DM pane
-
How to do data mining
Lots of data mining operations
How do you glue them together to solve a problem?
How do we actually do data mining?
Methodology
Not just the right way, but any way
*
-
Myths about Data Mining (1)
Data, Process and Tech
Data mining is all about massive data
Data mining is a technical process
Data mining is all about algorithms
Data mining is all about predictive accuracy
-
Myths about Data Mining (2)
Data Quality
Data mining only works with clean data
Data mining only works with complete data
Data mining only works with correct data
-
One last exploding myth
Neural Networks are not useful when you need to understand the patterns that you find (which is nearly always in data mining)
Related to over-simplistic views of data mining
Data mining techniques form a toolkit
We often use techniques in surprising ways
E.g. neural nets for field selection
Neural nets for pattern confirmation
Neural nets combined with other techniques for cross-checking
What use is a pair of pliers?
-
*
Related Concepts Outline
Database/OLTP Systems
Fuzzy Sets and Logic
Information Retrieval (Web Search Engines)
Dimensional Modeling
Data Warehousing
OLAP/DSS
Statistics
Machine Learning
Pattern Matching
Goal: Examine some areas which are related to data mining.
-
*
Fuzzy Sets and Logic
Fuzzy Set: set membership function is a real-valued function with output in the range [0,1].
f(x): probability x is in F.
1 - f(x): probability x is not in F.
Ex: T = {x | x is a person and x is tall}
Let f(x) be the probability that x is tall; here f is the membership function.
DM: Prediction and classification are fuzzy.
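A membership function such as f(x) for the "tall" example can be sketched directly in Python. The height breakpoints (160 cm and 190 cm) are illustrative assumptions, not from the text:

```python
def tall_membership(height_cm):
    """Degree in [0, 1] to which a person of this height belongs to the fuzzy set 'tall'."""
    if height_cm <= 160:
        return 0.0          # definitely not tall
    if height_cm >= 190:
        return 1.0          # definitely tall
    # Linear ramp between the two breakpoints.
    return (height_cm - 160) / (190 - 160)
```

For example, tall_membership(175) gives 0.5: a 175 cm person is "tall" only to degree one half.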
-
*
Information Retrieval
Information Retrieval (IR): retrieving desired information from textual data.
Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query:
Find all documents about data mining.
DM: Similarity measures;
Mine text/Web data.
-
Prentice Hall
*
Dimensional Modeling
View data in a hierarchical manner, more as business executives might
Useful in decision support systems and mining
Dimension: collection of logically related attributes; axis for modeling data.
Facts: data stored
Ex: Dimensions: products, locations, date
Facts: quantity, unit price
DM: May view data as dimensional.
-
*
Dimensional Modeling Queries
Roll Up: more general dimension
Drill Down: more specific dimension
Dimension (Aggregation) Hierarchy
SQL uses aggregation
Decision Support Systems (DSS): computer systems and tools to assist managers in making decisions and solving problems.
-
*
Cube view of Data
-
*
Data Warehousing
Subject-oriented, integrated, time-variant, nonvolatile (William Inmon)
Operational Data: data used in day-to-day needs of the company.
Informational Data: supports other functions such as planning and forecasting.
Data mining tools often access data warehouses rather than operational data.
DM: May access data in warehouse.
-
*
OLAP
Online Analytic Processing (OLAP): provides more complex queries than OLTP.
OnLine Transaction Processing (OLTP): traditional database/transaction processing.
Dimensional data; cube view
Visualization of operations:
Slice: examine a sub-cube.
Dice: rotate the cube to look at another dimension.
Roll Up/Drill Down
DM: May use OLAP queries.
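Slice and roll-up can be illustrated on a toy cube held as a Python dict keyed by dimension values. The dimensions (product, location, year) and all quantities are invented for the sketch:

```python
# A toy data cube: (product, location, year) -> quantity sold.
cube = {
    ("milk", "east", 2009): 10,
    ("milk", "west", 2009): 7,
    ("beer", "east", 2009): 5,
    ("milk", "east", 2010): 12,
}

def slice_cube(cube, axis, value):
    """Sub-cube where dimension `axis` (0=product, 1=location, 2=year) equals `value`."""
    return {k: v for k, v in cube.items() if k[axis] == value}

def roll_up(cube, axis):
    """Aggregate quantities over one dimension (drop it from the key)."""
    out = {}
    for k, v in cube.items():
        reduced = k[:axis] + k[axis + 1:]
        out[reduced] = out.get(reduced, 0) + v
    return out

east_slice = slice_cube(cube, 1, "east")   # only the "east" plane of the cube
by_product_location = roll_up(cube, 2)     # roll up over year
```

Rolling up over year collapses the two "milk/east" cells (10 + 12) into a single total of 22.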
-
*
OLAP Operations
Single Cell
Multiple Cells
Slice
Dice
Roll Up
Drill Down
-
*
Statistics
Simple descriptive models
Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
Exploratory Data Analysis: data can actually drive the creation of the model, the opposite of the traditional statistical view.
Data mining targeted to the business user
DM: Many data mining methods come from statistical techniques.
-
*
Machine Learning
Machine Learning: area of AI that examines how to write programs that can learn.
Often used in classification and prediction
Supervised Learning: learns by example.
Unsupervised Learning: learns without knowledge of correct answers.
Machine learning often deals with small static datasets.
DM: Uses many machine learning techniques.
-
*
Pattern Matching (Recognition)
Pattern Matching: finds occurrences of a predefined pattern in the data.
Applications include speech recognition, information retrieval, time series analysis.
DM: Type of classification.
-
*
DM vs. Related Topics
Area    | Query    | Data             | Results | Output
DB/OLTP | Precise  | Database         | Precise | DB Objects or Aggregation
IR      | Precise  | Documents        | Vague   | Documents
OLAP    | Analysis | Multidimensional | Precise | DB Objects or Aggregation
DM      | Vague    | Preprocessed     | Vague   | KDD Objects
-
*
Data Mining Techniques Outline
Statistical
Point Estimation
Models Based on Summarization
Bayes Theorem
Hypothesis Testing
Regression and Correlation
Similarity Measures
Decision Trees
Neural Networks
Activation Functions
Genetic Algorithms
Goal: Provide an overview of basic data mining techniques
-
*
Point Estimation
Point Estimate: estimate a population parameter.
May be made by calculating the parameter for a sample.
May be used to predict a value for missing data.
Ex: R contains 100 employees
99 have salary information
Mean salary of these is $50,000
Use $50,000 as the value of the remaining employee's salary.
Is this a good idea?
-
*
Estimation Error
Bias: Difference between expected value and actual value.
Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: MSE = E[(estimate - actual)^2]
Why square? Squaring keeps positive and negative errors from cancelling and penalizes large errors more heavily.
Root Mean Square Error (RMSE): the square root of the MSE.
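Bias, MSE, and RMSE as defined above can be computed over a set of estimates and actual values; a small Python sketch:

```python
import math

def bias(estimates, actuals):
    """Average signed error: positive and negative errors can cancel."""
    n = len(estimates)
    return sum(e - a for e, a in zip(estimates, actuals)) / n

def mse(estimates, actuals):
    """Average squared error: squaring prevents cancellation."""
    n = len(estimates)
    return sum((e - a) ** 2 for e, a in zip(estimates, actuals)) / n

def rmse(estimates, actuals):
    """Square root of the MSE, back in the original units."""
    return math.sqrt(mse(estimates, actuals))
```

With estimates [3, 5] against actuals [2, 6], the bias is 0 (the +1 and -1 errors cancel) while the MSE is 1, which is exactly why squaring matters.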
-
*
Expectation-Maximization (EM)
Solves estimation with incomplete data.
Obtain initial estimates for parameters.
Iteratively use estimates for missing data and continue until convergence.
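Applied to the missing-salary example from the point-estimation slide, the EM-style loop looks like this. It is a deliberately simplified sketch: for a plain mean the iteration converges immediately, but the loop shows the general shape of the algorithm:

```python
def em_mean(values, max_iter=100, tol=1e-9):
    """Estimate the mean of `values`, where None marks a missing entry."""
    observed = [v for v in values if v is not None]
    estimate = sum(observed) / len(observed)          # initial estimate from observed data
    for _ in range(max_iter):
        # E-step: fill in missing values with the current estimate.
        filled = [v if v is not None else estimate for v in values]
        # M-step: re-estimate the parameter from the completed data.
        new_estimate = sum(filled) / len(filled)
        if abs(new_estimate - estimate) < tol:        # converged
            break
        estimate = new_estimate
    return estimate

salary_estimate = em_mean([40000, 60000, None])
```

Here the missing salary is imputed with the observed mean of 50,000, and re-estimating leaves the mean unchanged, so the loop stops after one pass.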
-
*
Models Based on Summarization
Visualization: Frequency distribution, mean, variance, median, mode, etc.Box Plot:
-
*
Bayes Theorem
Posterior Probability: P(h1|xi)
Prior Probability: P(h1)
Bayes Theorem: P(h1|xi) = P(xi|h1) P(h1) / P(xi)
Assign probabilities of hypotheses given a data value.
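The theorem can be applied directly once the priors and likelihoods are known, with P(x) obtained by summing over all hypotheses. The prior and likelihood values below are illustrative:

```python
def posterior(prior, likelihood, h):
    """prior: dict h -> P(h); likelihood: dict h -> P(x|h). Returns P(h|x)."""
    evidence = sum(likelihood[k] * prior[k] for k in prior)   # P(x) by total probability
    return likelihood[h] * prior[h] / evidence

prior = {"h1": 0.6, "h2": 0.4}          # P(h) before seeing the data value
likelihood = {"h1": 0.2, "h2": 0.5}     # P(x|h) for the observed data value x
p_h1_given_x = posterior(prior, likelihood, "h1")
```

Here P(x) = 0.2*0.6 + 0.5*0.4 = 0.32, so the posterior for h1 is 0.12/0.32 = 0.375: the data value shifts belief away from h1 despite its larger prior.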
-
*
Hypothesis Testing
Find a model to explain behavior by creating and then testing a hypothesis about the data.
Exact opposite of the usual DM approach.
H0: null hypothesis; the hypothesis to be tested.
H1: alternative hypothesis
-
*
Regression
Predict future values based on past values
Linear Regression assumes a linear relationship exists.
y = c0 + c1 x1 + ... + cn xn
Find values to best fit the data
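For a single predictor, the least-squares values of c0 and c1 have a closed form. A pure-Python sketch with made-up data points:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = c0 + c1*x; returns (c0, c1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    c1 = num / den
    c0 = my - c1 * mx        # intercept: line passes through the mean point
    return c0, c1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]            # exactly y = 1 + 2x
c0, c1 = fit_line(xs, ys)
```

Since the points lie exactly on a line, the fit recovers c0 = 1 and c1 = 2.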
-
*
Correlation
Examine the degree to which the values for two variables behave similarly.
Correlation coefficient r:
1 = perfect correlation
-1 = perfect but opposite correlation
0 = no correlation
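The coefficient r can be computed directly as the covariance of the two variables divided by the product of their standard deviations; a pure-Python sketch:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

For y = 2x the coefficient is +1 (perfect correlation); reversing the y values gives -1 (perfect but opposite correlation).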
-
*
Similarity Measures
Determine similarity between two objects.
Similarity characteristics:
Alternatively, a distance measure measures how unlike or dissimilar objects are.
-
*
Distance Measures
Measure dissimilarity between objects
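Two common distance measures over numeric feature vectors, sketched in Python:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

For the points (0, 0) and (3, 4) the Euclidean distance is 5 while the Manhattan distance is 7; identical objects are at distance 0 under both measures.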
-
*
Decision Trees
Decision Tree (DT): tree where the root and each internal node is labeled with a question. The arcs represent each possible answer to the associated question. Each leaf node represents a prediction of a solution to the problem.
Popular technique for classification; the leaf node indicates the class to which the corresponding tuple belongs.
-
*
Decision Trees
A Decision Tree Model is a computational model consisting of three parts:
Decision Tree
Algorithm to create the tree
Algorithm that applies the tree to data
Creation of the tree is the most difficult part.
Processing is basically a search similar to that in a binary search tree (although a DT may not be binary).
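The third part, applying the tree to data, is a simple root-to-leaf walk. A sketch with the tree stored as nested dicts; the weather attributes are an illustrative example, not from the text:

```python
# Each internal node is {question: {answer: subtree-or-leaf}}; a leaf is a class label.
tree = {
    "outlook": {
        "sunny": {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"windy": {"true": "no", "false": "yes"}},
    }
}

def classify(tree, record):
    """Walk from the root to a leaf, following the arc matching each answer."""
    node = tree
    while isinstance(node, dict):
        attribute = next(iter(node))                  # the question at this node
        node = node[attribute][record[attribute]]     # follow the matching arc
    return node                                       # leaf = predicted class

prediction = classify(tree, {"outlook": "sunny", "humidity": "normal"})
```

The traversal inspects only the attributes on one root-to-leaf path, which is what makes applying a tree fast compared to building one.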
-
*
Neural Networks
Based on observed functioning of the human brain (Artificial Neural Networks, ANN).
Our view of neural networks is very simplistic: we view a neural network (NN) from a graphical viewpoint.
Alternatively, a NN may be viewed from the perspective of matrices.
Used in pattern recognition, speech recognition, computer vision, and classification.
-
*
Generating Rules
Decision tree can be converted into a rule set
Straightforward conversion: each path to a leaf becomes a rule, which makes an overly complex rule set
More effective conversions are not trivial (e.g. C4.8 tests each node in the root-leaf path to see if it can be eliminated without loss in accuracy)
*
In the previous lesson we discussed Classification using decision trees.
Sometimes decision trees are inconvenient: they can be very large
Also, they require starting at the same attribute
We can extract modular nuggets of knowledge by getting rules
-
*
Covering algorithms
Strategy for generating a rule set directly: for each class in turn, find a rule set that covers all instances in it (excluding instances not in the class).
This approach is called a covering approach because at each stage a rule is identified that covers some of the instances.
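Growing a single rule this way can be sketched as a greedy loop that keeps adding the attribute-value test with the best accuracy until the covered instances are pure. This is a simplified illustration of the covering idea, not any specific tool's algorithm, and the shopping data is invented:

```python
def grow_rule(instances, target):
    """instances: list of (attribute-dict, class). Returns a list of (attr, value) tests."""
    rule = []
    covered = instances
    # Keep refining while the rule still covers instances of other classes.
    while covered and any(c != target for _, c in covered):
        best = None
        for attrs, _ in covered:
            for a, v in attrs.items():
                if (a, v) in rule:
                    continue                           # test already in the rule
                subset = [(x, c) for x, c in covered if x.get(a) == v]
                acc = sum(1 for _, c in subset if c == target) / len(subset)
                if best is None or acc > best[0]:
                    best = (acc, (a, v), subset)
        if best is None:
            break                                      # no test left to add
        rule.append(best[1])                           # add the most accurate test
        covered = best[2]                              # rule's coverage shrinks
    return rule

data = [
    ({"age": "young", "student": "yes"}, "buys"),
    ({"age": "young", "student": "no"}, "no"),
    ({"age": "old", "student": "yes"}, "buys"),
    ({"age": "old", "student": "no"}, "no"),
]
rule = grow_rule(data, "buys")
```

On this data a single test, student = yes, already covers only "buys" instances, so the loop stops after one refinement. Note how each added test reduces the rule's coverage, exactly as the slide below describes.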
*
-
*
Rules vs. trees
Corresponding decision tree:
(produces exactly the same
predictions)
But: rule sets can be clearer when decision trees suffer from replicated subtrees
Also: in multi-class situations, a covering algorithm concentrates on one class at a time whereas a decision tree learner takes all classes into account
*
-
*
A simple covering algorithm
Generates a rule by adding tests that maximize the rule's accuracy
Similar to the situation in decision trees: the problem of selecting an attribute to split on
But: a decision tree inducer maximizes overall purity
Each new test reduces the rule's coverage:
witten&eibe
*
-
Algorithm Components
1. The task the algorithm is used to address (e.g. classification, clustering, etc.)
2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model)
3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.)
4. The search or optimization method used to search over parameters and/or structures (e.g. steepest descent, MCMC, etc.)
5. The data management technique used for storing, indexing, and retrieving data (critical when data too large to reside in memory)
-
Models and Patterns
Models
Prediction:
Linear regression
Piecewise linear
Nonparametric regression
Classification
logistic regression
naive Bayes/TAN/Bayesian networks
NN
support vector machines
Trees
etc.
Probability Distributions:
Parametric models
Mixtures of parametric models
Graphical Markov models (categorical, continuous, mixed)
Structured Data:
Time series
Markov models
Mixture Transition Distribution models
Hidden Markov models
Spatial models
-
Bias-Variance Tradeoff
High Bias - Low Variance
Low Bias - High Variance
Overfitting: modeling the random component
Score function should embody the compromise
-
Patterns
Global:
Clustering via partitioning
Hierarchical Clustering
Mixture Models
Local:
Outlier detection
Changepoint detection
Bump hunting
Scan statistics
Association rules
-
[Figure: a curve representing a road, with x marks plotted along it for accidents]
The curve represents a road
Each x marks an accident
Red x denotes an injury accident
Black x means no injury
Is there a stretch of road where there is an unusually large fraction of injury accidents?
Scan Statistics via Permutation Tests
-
Scan with Fixed Window
If we know the length of the stretch of road that we seek, we could slide a window of that length along the road
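The fixed-window scan can be sketched by anchoring a window of the given length at each accident position and recording the injury fraction inside it. Positions and labels below are illustrative (1 marks an injury accident, 0 a non-injury accident); assessing whether the best window is significant would then use the permutation test named above:

```python
def max_window_fraction(positions, injuries, length):
    """Return (best_start, best_fraction) over windows [s, s + length]."""
    best = (None, 0.0)
    for s in positions:                       # anchor a window at each accident
        inside = [i for p, i in zip(positions, injuries) if s <= p <= s + length]
        frac = sum(inside) / len(inside)      # injury fraction inside this window
        if frac > best[1]:
            best = (s, frac)
    return best

positions = [1, 2, 3, 10, 11, 12]             # accident locations along the road
injuries = [0, 0, 1, 1, 1, 0]                 # 1 = injury accident
best = max_window_fraction(positions, injuries, length=2)
```

The scan statistic is the maximum fraction found; the permutation test re-shuffles the injury labels over the fixed accident positions to see how often chance alone produces a window that extreme.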