[PPT]DATA WAREHOUSING AND DATA MINING - Prince...


  • Data Mining Tools

    Overview & Tutorial

    Ahmed Sameh

    Prince Sultan University

    Department of Computer Science & Info Sys

    May 2010

    (Some slides belong to IBM)

    *

  • *

    Introduction Outline

    Define data mining
    Data mining vs. databases
    Basic data mining tasks
    Data mining development
    Data mining issues

    Goal: Provide an overview of data mining.

  • *

    Introduction

    Data is growing at a phenomenal rate
    Users expect more sophisticated information
    How?

    UNCOVER HIDDEN INFORMATION

    DATA MINING

  • *

    Data Mining Definition

    Finding hidden information in a database
    Fit data to a model
    Similar terms: exploratory data analysis, data driven discovery, deductive learning

  • *

    Data Mining Algorithm

    Objective: fit data to a model (descriptive or predictive)
    Preference: technique to choose the best model
    Search: technique to search the data ("query")

  • *

    Database Processing vs. Data Mining Processing

    Database processing:
    Query: well defined, SQL
    Data: operational data
    Output: precise, a subset of the database

    Data mining processing:
    Query: poorly defined, no precise query language
    Data: not operational data
    Output: fuzzy, not a subset of the database

  • *

    Query Examples

    Database

    Data Mining

    Find all customers who have purchased milk

    Find all items which are frequently purchased with milk. (association rules)

    Find all credit applicants with last name of Smith.

    Identify customers who have purchased more than $10,000 in the last month.

    Find all credit applicants who are poor credit risks. (classification)

    Identify customers with similar buying habits. (Clustering)
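    The contrast in these examples can be made concrete in code. Below is a minimal sketch (customer names and spend figures are invented) showing a database-style query that returns a precise subset alongside a crude mining-style grouping of customers with similar habits:

```python
# Toy contrast between a database query (precise, well-defined subset)
# and a mining task (discovering groups). Names and figures are invented.
customers = [
    {"name": "Smith", "monthly_spend": 12000},
    {"name": "Jones", "monthly_spend": 300},
    {"name": "Smith", "monthly_spend": 450},
    {"name": "Lee",   "monthly_spend": 11000},
]

# Database-style query: "customers who purchased more than $10,000".
big_spenders = [c["name"] for c in customers if c["monthly_spend"] > 10000]

# Mining-style task: group customers with similar buying habits
# (a crude one-dimensional clustering by a spend threshold).
def cluster_by_spend(custs, threshold=5000):
    clusters = {"high": [], "low": []}
    for c in custs:
        bucket = "high" if c["monthly_spend"] >= threshold else "low"
        clusters[bucket].append(c["name"])
    return clusters

clusters = cluster_by_spend(customers)   # fuzzy grouping, not a subset
```

    Real clustering algorithms learn the groups from the data rather than from a fixed threshold; the sketch only illustrates the difference in the kind of question being asked.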

  • *

    Related Fields

    Statistics

    Machine

    Learning

    Databases

    Visualization

    Data Mining and

    Knowledge Discovery

    *

  • *

    Statistics, Machine Learning and Data Mining

    Statistics: more theory-based; more focused on testing hypotheses
    Machine learning: more heuristic; focused on improving performance of a learning agent; also looks at real-time learning and robotics -- areas not part of data mining
    Data mining and knowledge discovery: integrates theory and heuristics; focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
    Distinctions are fuzzy

  • Definition

    A class of database applications that analyze data in a database using tools which look for trends or anomalies. IBM was an early pioneer of the field.

  • Purpose

    To look for hidden patterns or previously unknown relationships among the data in a group of data that can be used to predict future behavior. Ex: data mining software can help retail companies find customers with common interests.

  • Background Information

    Many of the techniques used by today's data mining tools have been around for many years, having originated in the artificial intelligence research of the 1980s and early 1990s. Data Mining tools are only now being applied to large-scale database systems.

  • The Need for Data Mining

    The amount of raw data stored in corporate data warehouses is growing rapidly. There is too much data, and too much complexity, to determine manually what is relevant to a specific problem. Data mining promises to bridge the analytical gap by giving knowledge workers the tools to navigate this complex analytical space.

  • The Need for Data Mining, cont

    The need for information has resulted in the proliferation of data warehouses that integrate information from multiple sources to support decision making. They often include data from external sources, such as customer demographics and household information.

  • Definition (Cont.)

    Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

    Valid: The patterns hold in general.

    Novel: We did not know the pattern beforehand.

    Useful: We can devise actions from the patterns.

    Understandable: We can interpret and comprehend the patterns.

  • Of laws, Monsters, and Giants

    Moore's law: processing capacity doubles every 18 months (CPU, cache, memory)
    Its more aggressive cousin: disk storage capacity doubles every 9 months

  • What is Data Mining?

    Finding interesting structure in data

    Structure: refers to statistical patterns, predictive models, hidden relationships

    Examples of tasks addressed by data mining:
    Predictive modeling (classification, regression)
    Segmentation (data clustering)
    Summarization
    Visualization

  • *

    Major Application Areas for
    Data Mining Solutions

    Advertising
    Bioinformatics
    Customer Relationship Management (CRM)
    Database Marketing
    Fraud Detection
    eCommerce
    Health Care
    Investment/Securities
    Manufacturing, Process Control
    Sports and Entertainment
    Telecommunications
    Web

  • *

    Data Mining

    The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets:
    Extremely large datasets
    Discovery of the non-obvious
    Useful knowledge that can improve processes
    Cannot be done manually
    Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind.
    Sophisticated data search capability that uses statistical algorithms to discover patterns and correlations in data.

  • *

    Data Mining (cont.)

  • *

    Data Mining (cont.)

    Data mining is one step of the Knowledge Discovery in Databases (KDD) process:
    Data warehousing
    Data selection
    Data preprocessing
    Data transformation
    Data mining
    Interpretation/evaluation
    Data mining is sometimes referred to as KDD; DM and KDD tend to be used as synonyms.

  • *

    Data Mining Evaluation

  • *

    Data Mining is Not

    Data warehousing
    SQL / ad hoc queries / reporting
    Software agents
    Online Analytical Processing (OLAP)
    Data visualization

  • *

    Data Mining Motivation

    Changes in the business environment:
    Customers becoming more demanding
    Markets are saturated
    Databases today are huge:
    More than 1,000,000 entities/records/rows
    From 10 to 10,000 fields/attributes/variables
    Gigabytes and terabytes
    Databases are growing at an unprecedented rate
    Decisions must be made rapidly
    Decisions must be made with maximum knowledge

  • Why Use Data Mining Today?

    Human analysis skills are inadequate:

    Volume and dimensionality of the data
    High data growth rate

    Availability of:

    Data
    Storage
    Computational power
    Off-the-shelf software
    Expertise

  • An Abundance of Data

    Supermarket scanners, POS data
    Preferred customer cards
    Credit card transactions
    Direct mail response
    Call center records
    ATM machines
    Demographic data
    Sensor networks
    Cameras
    Web server logs
    Customer web site trails

  • Evolution of Database Technology

    1960s: IMS, network model
    1970s: the relational data model, first relational DBMS implementations
    1980s: maturing RDBMS, application-specific DBMS (spatial data, scientific data, image data, etc.), OODBMS
    1990s: mature, high-performance RDBMS technology, parallel DBMS, terabyte data warehouses, object-relational DBMS, middleware and web technology
    2000s: high availability, zero-administration, seamless integration into business processes
    2010: sensor database systems, databases on embedded systems, P2P database systems, large-scale pub/sub systems, ???

  • Much Commercial Support

    Many data mining tools: http://www.kdnuggets.com/software
    Database systems with data mining support
    Visualization tools
    Data mining process support
    Consultants

  • Why Use Data Mining Today?

    Competitive pressure!

    The secret of success is to know something that nobody else knows.

    Aristotle Onassis

    Competition on service, not only on price (banks, phone companies, hotel chains, rental car companies)
    Personalization, CRM
    The real-time enterprise
    Systemic listening
    Security, homeland defense

  • The Knowledge Discovery Process

    Steps:

    Identify business problem

    Data mining

    Action

    Evaluation and measurement

    Deployment and integration into businesses processes

  • Data Mining Step in Detail

    2.1 Data preprocessing

    Data selection: identify target datasets and relevant fields
    Data cleaning: remove noise and outliers
    Data transformation: create common units; generate new fields

    2.2 Data mining model construction

    2.3 Model evaluation
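    The preprocessing sub-steps in 2.1 can be sketched as follows. The field name `kwh`, the readings, and the outlier rule (drop values above three times the median) are all invented for illustration; real pipelines use domain-appropriate rules:

```python
import statistics

# Toy raw records: one numeric field with a missing value and an outlier.
raw = [{"kwh": 10.0}, {"kwh": 12.0}, {"kwh": 11.0},
       {"kwh": 500.0}, {"kwh": None}]

# Data selection / cleaning: keep the relevant field, drop missing values.
values = [r["kwh"] for r in raw if r["kwh"] is not None]

# Remove outliers with a crude median-based cut (one simple choice).
med = statistics.median(values)
clean = [v for v in values if v <= 3 * med]

# Data transformation: rescale to [0, 1] so fields share common units.
lo, hi = min(clean), max(clean)
scaled = [(v - lo) / (hi - lo) for v in clean]
```

    A median-based cut is used instead of a mean/standard-deviation rule because a single extreme value inflates the standard deviation and can hide itself.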

  • Preprocessing and Mining

    Original Data

    Target
    Data

    Preprocessed
    Data

    Patterns

    Knowledge

    Data
    Integration
    and Selection

    Preprocessing

    Model
    Construction

    Interpretation

  • *

    Data Mining Techniques

  • *

    Data Mining Models and Tasks

  • *

    Basic Data Mining Tasks

    Classification maps data into predefined groups or classes (supervised learning, pattern recognition, prediction).
    Regression is used to map a data item to a real-valued prediction variable.
    Clustering groups similar data together into clusters (unsupervised learning, segmentation, partitioning).
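    As a concrete illustration of the classification task above, here is a minimal sketch of a 1-nearest-neighbour rule; the training points and the risk labels are invented toy data:

```python
# Minimal sketch of classification: a 1-nearest-neighbour rule mapping
# a point into predefined classes. Points and risk labels are invented.
train = [((1.0, 1.0), "low-risk"), ((1.2, 0.8), "low-risk"),
         ((6.0, 7.0), "high-risk"), ((6.5, 6.0), "high-risk")]

def classify(point):
    def dist2(a, b):
        # Squared Euclidean distance (no sqrt needed for comparison).
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Predict the label of the closest training example.
    return min(train, key=lambda example: dist2(example[0], point))[1]

label = classify((5.5, 6.5))    # lands near the high-risk examples
```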

  • *

    Basic Data Mining Tasks (contd)

    Summarization maps data into subsets with associated simple descriptions (characterization, generalization).
    Link analysis uncovers relationships among data (affinity analysis, association rules).
    Sequential analysis determines sequential patterns.
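    The association-rule flavour of link analysis rests on two measures, support and confidence, which can be sketched over invented market-basket transactions:

```python
# Minimal sketch of association-rule measures (support, confidence)
# over invented market-basket transactions.
transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"milk", "cereal"}, {"bread", "butter"}, {"milk", "bread"},
]

def support(itemset):
    # Fraction of baskets containing every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    # Of the baskets containing lhs, the fraction also containing rhs.
    return support(lhs | rhs) / support(lhs)

# Rule "milk => bread": how often it holds in the data.
rule_support = support({"milk", "bread"})         # 3 of 5 baskets
rule_confidence = confidence({"milk"}, {"bread"})
```

    Algorithms such as Apriori avoid scanning every candidate itemset like this; the sketch only shows what the measures mean.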

  • *

    Ex: Time Series Analysis

    Example: stock market
    Predict future values
    Determine similar patterns over time
    Classify behavior

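    The "predict future values" task can be sketched with the simplest possible model, a moving-average forecast; the price series is invented and real time-series models (ARIMA, exponential smoothing, etc.) are far more sophisticated:

```python
# Minimal sketch of time-series prediction: forecast the next value as
# the moving average of the last k observations. Prices are invented.
prices = [10, 11, 13, 12, 14, 15]

def forecast(series, k=3):
    window = series[-k:]          # the k most recent observations
    return sum(window) / k

next_value = forecast(prices)     # average of 12, 14, 15
```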

  • *

    Data Mining vs. KDD

    Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data.
    Data mining: the use of algorithms to extract the information and patterns derived by the KDD process.

  • *

    Data Mining Development

    Information retrieval: similarity measures, hierarchical clustering, IR systems, imprecise queries, textual data, web search engines

    Statistics: Bayes theorem, regression analysis, EM algorithm, K-means clustering, time series analysis

    Machine learning: neural networks, decision tree algorithms

    Algorithms: algorithm design techniques, algorithm analysis, data structures

    Databases: relational data model, SQL, association rule algorithms, data warehousing, scalability techniques

  • *

    KDD Issues

    Human interaction
    Overfitting
    Outliers
    Interpretation
    Visualization
    Large datasets
    High dimensionality

  • *

    KDD Issues (contd)

    Multimedia data
    Missing data
    Irrelevant data
    Noisy data
    Changing data
    Integration
    Application

  • *

    Visualization Techniques

    Graphical
    Geometric
    Icon-based
    Pixel-based
    Hierarchical
    Hybrid

  • *

    Data Mining Applications

  • *

    Data Mining Applications:
    Retail

    Performing basket analysis: which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions.
    Sales forecasting: examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?
    Database marketing: retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer-label clothing or those who attend sales. This information can be used to focus cost-effective promotions.
    Merchandise planning and allocation: when retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic characteristics. Retailers can also use data mining to determine the ideal layout for a specific store.

  • *

    Data Mining Applications:
    Banking

    Card marketing: by identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs, targeted product development, and customized pricing.
    Cardholder pricing and profitability: card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. Includes risk-based pricing.
    Fraud detection: fraud is enormously costly. By analyzing past transactions that were later determined to be fraudulent, banks can identify patterns.
    Predictive life-cycle management: DM helps banks predict each customer's lifetime value and service each segment appropriately (for example, offering special deals and discounts).

  • *

    Data Mining Applications:
    Telecommunication

    Call detail record analysis: telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.
    Customer loyalty: some customers repeatedly switch providers, or "churn", to take advantage of attractive incentives from competing companies. Companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling them to target their spending on customers who will produce the most profit.

  • *

    Data Mining Applications:
    Other Applications

    Customer segmentation: all industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis.
    Manufacturing: through choice boards, manufacturers are beginning to customize products for customers; therefore they must be able to predict which features should be bundled to meet customer demand.
    Warranties: manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims.
    Frequent flier incentives: airlines can identify groups of customers that can be given incentives to fly more.

  • *

    A producer wants to know.

  • *

    Data, Data everywhere
    yet ...

    I can't find the data I need: data is scattered over the network; many versions, subtle differences

    I can't get the data I need: need an expert to get the data

    I can't understand the data I found: available data poorly documented

    I can't use the data I found: results are unexpected; data needs to be transformed from one form to another

  • *

    What is a Data Warehouse?

    A single, complete, and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

    [Barry Devlin]

  • *

    What are the users saying...

    Data should be integrated across the enterprise
    Summary data has a real value to the organization
    Historical data holds the key to understanding data over time
    What-if capabilities are required

  • *

    What is Data Warehousing?

    A process of transforming data into information and making it available to users in a timely enough manner to make a difference

    [Forrester Research, April 1996]

  • *

    Very Large Data Bases

    Terabytes (10^12 bytes): Walmart -- 24 terabytes
    Petabytes (10^15 bytes): geographic information systems
    Exabytes (10^18 bytes): national medical records
    Zettabytes (10^21 bytes): weather images
    Yottabytes (10^24 bytes): intelligence agency videos

  • *

    Data Warehousing --
    It is a process

    A technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
    A decision support database maintained separately from the organization's operational database

  • *

    Data Warehouse

    A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.

    -- Bill Inmon, Building the Data Warehouse, 1996

  • Data Warehousing Concepts

    Decision support is key for companies wanting to turn their organizational data into an information asset

    Traditional database is transaction-oriented while data warehouse is data-retrieval optimized for decision-support

    Data Warehouse
    "A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process"

    OLAP (on-line analytical processing), Decision Support Systems (DSS), Executive Information Systems (EIS), and data mining applications

    *

  • What does data warehouse do?

    integrate diverse information from various systems which enable users to quickly produce powerful ad-hoc queries and perform complex analysis

    create an infrastructure for reusing the data in numerous ways

    create an open systems environment to make useful information easily accessible to authorized users

    help managers make informed decisions

    *

  • Benefits of Data Warehousing

    Potential high returns on investment
    Competitive advantage
    Increased productivity of corporate decision-makers

    *

  • Comparison of OLTP and Data Warehousing

    OLTP systems | Data warehousing systems
    Holds current data | Holds historic data
    Stores detailed data | Stores detailed, lightly, and highly summarized data
    Data is dynamic | Data is largely static
    Repetitive processing | Ad hoc, unstructured, and heuristic processing
    High level of transaction throughput | Medium to low transaction throughput
    Predictable pattern of usage | Unpredictable pattern of usage
    Transaction driven | Analysis driven
    Application oriented | Subject oriented
    Supports day-to-day decisions | Supports strategic decisions
    Serves large number of clerical/operational users | Serves relatively lower number of managerial users

    *

  • Data Warehouse Architecture

    Operational Data

    Load Manager

    Warehouse Manager

    Query Manager

    Detailed Data

    Lightly and Highly Summarized Data

    Archive / Backup Data

    Meta-Data

    End-user Access Tools

    *

  • End-user Access Tools

    Reporting and query tools
    Application development tools
    Executive Information System (EIS) tools
    Online Analytical Processing (OLAP) tools
    Data mining tools

    *

  • Data Warehousing Tools and Technologies

    Extraction, Cleansing, and Transformation Tools

    Data Warehouse DBMS

    Load performance

    Load processing

    Data quality management

    Query performance

    Terabyte scalability

    Networked data warehouse

    Warehouse administration

    Integrated dimensional tools

    Advanced query functionality

    *

  • Data Marts

    A subset of data warehouse that supports the requirements of a particular department or business function

    *

  • Online Analytical Processing (OLAP)

    OLAP: the dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data
    Multi-dimensional OLAP: cubes of data

    *

    (Figure: a data cube with dimensions City, Time, and Product type)
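    A cube over the three dimensions in the figure can be sketched as a fact table plus an aggregation function; the cities, quarters, products, and amounts below are invented:

```python
# Minimal sketch of a data cube over the three dimensions in the figure
# (city, time, product type); cities, quarters, and amounts are invented.
facts = [
    ("Pune",  "Q1", "Video", 100), ("Pune",  "Q2", "Video", 120),
    ("Delhi", "Q1", "Audio",  80), ("Delhi", "Q1", "Video",  90),
]

def total(city=None, time=None, product=None):
    # Fixing one dimension slices the cube; fixing two dices it;
    # leaving all free rolls everything up to a grand total.
    return sum(v for c, t, p, v in facts
               if city in (None, c) and time in (None, t) and product in (None, p))

grand  = total()                          # all facts
slice_ = total(product="Video")           # one dimension fixed
dice   = total(city="Pune", time="Q1")    # two dimensions fixed
```

    MOLAP servers precompute and index such aggregates instead of scanning the facts per query; the sketch only shows the cube's logical model.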

  • Problems of Data Warehousing

    Underestimation of resources for data loading
    Hidden problems with source systems
    Required data not captured
    Increased end-user demands
    Data homogenization
    High demand for resources
    Data ownership
    High maintenance
    Long-duration projects
    Complexity of integration

    *

  • Codd's Rules for OLAP

    Multi-dimensional conceptual view

    Transparency

    Accessibility

    Consistent reporting performance

    Client-server architecture

    Generic dimensionality

    Dynamic sparse matrix handling

    Multi-user support

    Unrestricted cross-dimensional operations

    Intuitive data manipulation

    Flexible reporting

    Unlimited dimensions and aggregation levels

    *

  • OLAP Tools

    Multi-dimensional OLAP (MOLAP): multi-dimensional DBMS (MDDBMS)
    Relational OLAP (ROLAP): creation of multiple multi-dimensional views of the two-dimensional relations
    Managed Query Environment (MQE): delivers selected data directly from the DBMS to the desktop in the form of a data cube, where it is stored, analyzed, and manipulated locally

    *

  • Data Mining

    Definition

    The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions

    Knowledge discovery

    Association rules

    Sequential patterns

    Classification trees

    Goals

    Prediction

    Identification

    Classification

    Optimization

    *

  • Data Mining Techniques

    Predictive modeling: supervised training with two phases
    Training phase: building a model using a large sample of historical data called the training set
    Testing phase: trying the model on new data
    Database segmentation
    Link analysis
    Deviation detection
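    The two-phase discipline can be sketched end to end. The "model" here is just a spend cutoff learned from labelled history; all numbers and labels are invented, and real predictive models (trees, neural networks) follow the same train-then-test pattern:

```python
# Minimal sketch of supervised training: build a model on a training
# set, then try it on held-out test data. Numbers/labels are invented.
history = [(200, "good"), (950, "bad"), (180, "good"), (1100, "bad"),
           (220, "good"), (990, "bad")]

train, test = history[:4], history[4:]        # simple split

# Training phase: learn a cutoff separating the two classes.
goods = [x for x, label in train if label == "good"]
bads  = [x for x, label in train if label == "bad"]
cutoff = (max(goods) + min(bads)) / 2         # midpoint between classes

# Testing phase: try the model on new data and measure accuracy.
def predict(x):
    return "good" if x < cutoff else "bad"

accuracy = sum(predict(x) == label for x, label in test) / len(test)
```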

    *

  • What are Data Mining Tasks?

    Classification
    Regression
    Clustering
    Summarization
    Dependency modeling
    Change and deviation detection

    *

  • What are Data Mining Discoveries?

    New purchase trends
    Plan investment strategies
    Detect unauthorized expenditure
    Fraudulent activities
    Crime trends
    Smugglers -- border crossing

    *

  • *

    Data Warehouse Architecture

  • *

    Data Warehouse for Decision Support & OLAP

    Putting information technology to work to help the knowledge worker make faster and better decisions:
    Which of my customers are most likely to go to the competition?
    What product promotions have the biggest impact on revenue?
    How did the share price of software companies correlate with profits over the last 10 years?

  • *

    Decision Support

    Used to manage and control business
    Data is historical or point-in-time
    Optimized for inquiry rather than update
    Use of the system is loosely defined and can be ad hoc
    Used by managers and end-users to understand the business and make judgements

  • *

    Data Mining works with Warehouse Data

    Data Warehousing provides the Enterprise with a memory

    Data Mining provides the Enterprise with intelligence

  • *

    We want to know ...

    Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
    Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
    If I raise the price of my product by Rs. 2, what is the effect on my ROI?
    If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
    If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
    Which of my customers are likely to be the most loyal?

    Data Mining helps extract such information

  • *

    Application Areas

    Industry -- Application

    Finance -- credit card analysis
    Insurance -- claims, fraud analysis
    Telecommunication -- call record analysis
    Transport -- logistics management
    Consumer goods -- promotion analysis
    Data service providers -- value added data
    Utilities -- power usage analysis

  • *

    Data Mining in Use

    The US Government uses data mining to track fraud
    A supermarket becomes an information broker
    Basketball teams use it to track game strategy
    Cross selling
    Warranty claims routing
    Holding on to good customers
    Weeding out bad customers

  • *

    What makes data mining possible?

    Advances in the following areas are making data mining deployable:
    data warehousing
    better and more data (i.e., operational, behavioral, and demographic)
    the emergence of easily deployed data mining tools
    the advent of new data mining techniques

    -- Gartner Group

  • *

    Why Separate Data Warehouse?

    Performance:
    Operational databases are designed and tuned for known transactions and workloads.
    Complex OLAP queries would degrade performance for operational transactions.
    Special data organization, access, and implementation methods are needed for multidimensional views and queries.

    Function:
    Missing data: decision support requires historical data, which operational databases do not typically maintain.
    Data consolidation: decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
    Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.

  • *

    What are Operational Systems?

    They are OLTP systems
    Run mission-critical applications
    Need to work with stringent performance requirements for routine tasks
    Used to run a business!

  • *

    RDBMS used for OLTP

    Database systems have traditionally been used for OLTP:
    clerical data processing tasks
    detailed, up-to-date data
    structured, repetitive tasks
    read/update a few records
    isolation, recovery, and integrity are critical

  • *

    Operational Systems

    Run the business in real time
    Based on up-to-the-second data
    Optimized to handle large numbers of simple read/write transactions
    Optimized for fast response to predefined transactions
    Used by people who deal with customers and products -- clerks, salespeople, etc.
    Increasingly used by customers themselves

  • *

    Examples of Operational Data

  • *

    Application-Orientation vs. Subject-Orientation

  • *

    OLTP vs. Data Warehouse

    OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse
    Special data organization, access methods, and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
    e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December

  • *

    OLTP vs Data Warehouse

    OLTP:
    Application oriented
    Used to run business
    Detailed data
    Current, up to date
    Isolated data
    Repetitive access
    Clerical user

    Warehouse (DSS):
    Subject oriented
    Used to analyze business
    Summarized and refined
    Snapshot data
    Integrated data
    Ad-hoc access
    Knowledge user (manager)

  • *

    OLTP vs Data Warehouse

    OLTP:
    Performance sensitive
    Few records accessed at a time (tens)
    Read/update access
    No data redundancy
    Database size 100 MB - 100 GB

    Data warehouse:
    Performance relaxed
    Large volumes accessed at a time (millions)
    Mostly read (batch update)
    Redundancy present
    Database size 100 GB - a few terabytes

  • *

    OLTP vs Data Warehouse

    OLTP:
    Transaction throughput is the performance metric
    Thousands of users
    Managed in entirety

    Data warehouse:
    Query throughput is the performance metric
    Hundreds of users
    Managed by subsets

  • *

    To summarize ...

    OLTP Systems are
    used to run a business

    The Data Warehouse helps to optimize the business

  • *

    Why Now?

    Data is being produced
    ERP provides clean data
    The computing power is available
    The computing power is affordable
    The competitive pressures are strong
    Commercial products are available

  • *

    Myths surrounding OLAP Servers and Data Marts

    Data marts and OLAP servers are departmental solutions supporting a handful of users
    Million-dollar massively parallel hardware is needed to deliver fast time for complex queries
    OLAP servers require massive and unwieldy indices
    Complex OLAP queries clog the network with data
    Data warehouses must be at least 100 GB to be effective

    Source -- Arbor Software Home Page

  • II. On-Line Analytical Processing (OLAP)

    Making Decision Support Possible

  • *

    Typical OLAP Queries

    Write a multi-table join to compare sales for each product line YTD this year vs. last year.
    Repeat the above process to find the top 5 product contributors to margin.
    Repeat the above process to find the sales of a product line to new vs. existing customers.
    Repeat the above process to find the customers that have had negative sales growth.

  • *

    What Is OLAP?

    Online Analytical Processing: coined by E.F. Codd in a 1994 paper contracted by Arbor Software*
    Generally synonymous with earlier terms such as decision support, business intelligence, executive information system
    OLAP = multidimensional database
    MOLAP: multidimensional OLAP (Arbor Essbase, Oracle Express)
    ROLAP: relational OLAP (Informix MetaCube, MicroStrategy DSS Agent)

    * Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html

  • *

    The OLAP Market

    Rapid growth in the enterprise market:
    1995: $700 million
    1997: $2.1 billion
    Significant consolidation activity among major DBMS vendors:
    10/94: Sybase acquires ExpressWay
    7/95: Oracle acquires Express
    11/95: Informix acquires Metacube
    10/96: Microsoft acquires Panorama
    1/97: Arbor partners with IBM
    Result: OLAP shifted from a small vertical niche to a mainstream DBMS category

  • *

    Strengths of OLAP

    It is a powerful visualization paradigm
    It provides fast, interactive response times
    It is good for analyzing time series
    It can be useful to find some clusters and outliers
    Many vendors offer OLAP tools

  • *

    OLAP Is FASMI

    Fast
    Analysis
    Shared
    Multidimensional
    Information

    -- Nigel Pendse, Richard Creeth, The OLAP Report

  • *

    Dimensions: Product, Region, Time

    Hierarchical summarization paths:

    Product: Industry -> Category -> Product
    Region: Country -> Region -> City -> Office
    Time: Year -> Quarter -> Month / Week -> Day

    Multi-dimensional data: "Hey, I sold $100M worth of goods!"

  • *

    A Visual Operation: Pivot (Rotate)

    (Figure: pivoting a sales cube -- products Juice, Cola, Milk, Cream by dates 3/1-3/4 for regions NY, LA, SF -- rotated so that different dimensions (Month, Region, Product) form the visible axes.)
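    The pivot (rotate) operation can be sketched as re-orienting the same facts so a different dimension forms the rows vs. the columns; the sales figures below are invented:

```python
# Minimal sketch of the pivot (rotate) operation: the same invented
# sales facts re-oriented around different row/column dimensions.
facts = [("Juice", "NY", 10), ("Cola", "NY", 47),
         ("Juice", "LA", 30), ("Cola", "LA", 12)]

def pivot(rows_dim, cols_dim):
    # rows_dim / cols_dim: field index (0 = product, 1 = city).
    table = {}
    for fact in facts:
        r, c, v = fact[rows_dim], fact[cols_dim], fact[2]
        row = table.setdefault(r, {})
        row[c] = row.get(c, 0) + v
    return table

by_product = pivot(0, 1)   # products as rows, cities as columns
by_city    = pivot(1, 0)   # rotated: cities as rows, products as columns
```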

  • *

    Slicing and Dicing

    (Figure: a cube with dimensions Product -- Household, Telecomm, Video, Audio; Sales Channel -- Retail, Direct, Special; and Region -- India, Far East, Europe. Fixing Product = Telecomm yields "the Telecomm slice".)

  • *

    Roll-up and Drill Down

    Drill-down path: Sales Channel -> Region -> Country -> State -> Location Address -> Sales Representative
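    Roll-up and drill-down along such a hierarchy can be sketched as aggregating the same facts at coarser or finer levels; the location hierarchy (state -> country -> region) and sales facts below are invented:

```python
# Minimal sketch of roll-up along an invented location hierarchy
# (state -> country -> region); drilling down is the reverse move.
facts = [("Asia", "India", "MH", 10), ("Asia", "India", "KA", 20),
         ("Asia", "Japan", "TK", 15), ("Europe", "France", "IDF", 25)]

def roll_up(level):
    # level 0 = region, 1 = country, 2 = state (higher index = finer grain).
    totals = {}
    for fact in facts:
        key = fact[level]
        totals[key] = totals.get(key, 0) + fact[3]
    return totals

by_country = roll_up(1)   # roll up states into country totals
by_region  = roll_up(0)   # roll up further into region totals
```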

  • Results of Data Mining Include:

    Forecasting what may happen in the future
    Classifying people or things into groups by recognizing patterns
    Clustering people or things into groups based on their attributes
    Associating what events are likely to occur together
    Sequencing what events are likely to lead to later events

  • Data mining is not

    Brute-force crunching of bulk data
    Blind application of algorithms
    Going to find relationships where none exist
    Presenting data in different ways
    A database-intensive task
    A difficult-to-understand technology requiring an advanced degree in computer science

  • Data Mining versus OLAP

    OLAP -- On-line Analytical Processing
    Provides a very good view of what is happening, but cannot predict what will happen in the future or why it is happening

  • Data Mining Versus Statistical Analysis

    Data Mining

    Originally developed to act as expert systems to solve problems

    Less interested in the mechanics of the technique

    If it makes sense, then let's use it

    Does not require assumptions to be made about data

    Can find patterns in very large amounts of data

    Requires understanding of data and business problem

    Data Analysis

    Tests for statistical correctness of models

    Are statistical assumptions of models correct?

    E.g., is the R-squared good?

    Hypothesis testing

    Is the relationship significant?

    Use a t-test to validate significance

    Tends to rely on sampling

    Techniques are not optimised for large amounts of data

    Requires strong statistical skills

  • Examples of What People are Doing with Data Mining:

    Fraud/Non-Compliance Anomaly detection

    Isolate the factors that lead to fraud, waste and abuse

    Target auditing and investigative efforts more effectively

    Credit/Risk Scoring

    Intrusion detection

    Parts failure prediction

    Recruiting/Attracting customers

    Maximizing profitability (cross selling, identifying profitable customers)

    Service Delivery and Customer Retention

    Build profiles of customers likely to use which services

    Web Mining

  • What data mining has done for...

    The US Internal Revenue Service needed to improve customer service and...

    scheduled its workforce to provide faster, more accurate answers to questions.

    By analyzing incoming requests for help and information, the IRS hopes to schedule its workforce to provide faster, more accurate answers to questions.

  • What data mining has done for...

    The US Drug Enforcement Agency needed to be more effective in its drug busts, and analyzed suspects' cell phone usage to focus investigations.

    The US DFAS needs to search through 2.5 million financial transactions that may indicate inaccurate charges. Instead of relying on tips to point out fraud, the DFAS is mining the data to identify suspicious transactions.

    Using Clementine, the agency examined credit card transactions and was able to identify purchases that did not match past patterns. Using this information, DFAS could focus investigations, finding fraud more cost-effectively.

  • What data mining has done for...

    HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher-yielding investments, and reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.

    Retail banking is a highly competitive business. In addition to competition from other banks, banks also see intense competition from financial services companies of all kinds, from stockbrokers to mortgage companies.

    With so many organizations working the same customer base, the value of customer retention is greater than ever before. As a result, HSBC Bank USA looks to enticing existing customers to "roll over" maturing products, or to cross-selling new ones.

    Using SPSS products, HSBC found that it could reduce direct mail costs by 30% while still bringing in 95% of the campaign's revenue. Because HSBC is sending out fewer mail pieces, customers are likely to be more loyal because they don't receive junk mail from the bank.

  • Suggestion:Predicting Washington

    C-SPAN has launched a digital archive of 500,000 hours of audio debates. Text mining or audio mining of these talks could reveal answers to certain questions.

  • Example Application: Sports

    IBM Advanced Scout analyzes
    NBA game statistics

    Shots blocked, assists, fouls

    Google: IBM Advanced Scout

  • Advanced Scout

    Example pattern: An analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that "when Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots."

    Pattern is interesting:
    The average shooting percentage for the Charlotte Hornets during that game was 54%.

  • Data Mining: Types of Data

    Relational data and transactional data
    Spatial and temporal data, spatio-temporal observations
    Time-series data
    Text
    Images, video
    Mixtures of data
    Sequence data

    Features from processing other data sources

  • Data Mining Techniques

    Supervised learning: classification and regression
    Unsupervised learning: clustering
    Dependency modeling: associations, summarization, causality
    Outlier and deviation detection
    Trend analysis and change detection

  • Different Types of Classifiers

    Linear discriminant analysis (LDA)
    Quadratic discriminant analysis (QDA)
    Density estimation methods
    Nearest neighbor methods
    Logistic regression
    Neural networks
    Fuzzy set theory
    Decision trees

  • Test Sample Estimate

    Divide D into D1 and D2
    Use D1 to construct the classifier d
    Then use the resubstitution estimate R(d,D2) to calculate the estimated misclassification error of d
    Unbiased and efficient, but removes D2 from the training dataset D

  • V-fold Cross Validation

    Procedure:

    Construct classifier d from D
    Partition D into V datasets D1, ..., DV
    Construct classifier di using D \ Di
    Calculate the estimated misclassification error R(di,Di) of di using test sample Di

    Final misclassification estimate:

    Weighted combination of individual misclassification errors:
    R(d,D) = 1/V Σi R(di,Di)
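As a sketch, the procedure above can be written out directly. The toy majority-class learner and the `train`/`classify` interfaces here are illustrative assumptions, not part of any particular tool:

```python
import random

def cross_val_error(train, classify, data, labels, v=5):
    """Estimate misclassification error of a learner by V-fold cross-validation."""
    idx = list(range(len(data)))
    random.seed(0)
    random.shuffle(idx)
    folds = [idx[i::v] for i in range(v)]                # partition D into D1..DV
    errors = []
    for fold in folds:
        hold = set(fold)
        tr_x = [data[i] for i in idx if i not in hold]   # D \ Di
        tr_y = [labels[i] for i in idx if i not in hold]
        model = train(tr_x, tr_y)                        # classifier di
        wrong = sum(classify(model, data[i]) != labels[i] for i in fold)
        errors.append(wrong / len(fold))                 # R(di, Di)
    return sum(errors) / v                               # averaged estimate

# toy learner: always predict the majority label of the training set
train = lambda x, y: max(set(y), key=y.count)
classify = lambda m, xi: m
data = list(range(10))
labels = [0] * 8 + [1] * 2
err = cross_val_error(train, classify, data, labels, v=5)   # 2 of 10 misclassified
```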

  • Cross-Validation: Example

    (Figure: classifier d is built from all of D; classifiers d1, d2, d3 are each built from D with one fold held out.)

  • Cross-Validation

    Misclassification estimate obtained through cross-validation is usually nearly unbiased
    Costly computation (we need to compute d, and d1, ..., dV); computation of di is nearly as expensive as computation of d
    Preferred method to estimate the quality of learning algorithms in the machine learning literature

  • Decision Tree Construction

    Three algorithmic components:

    Split selection (CART, C4.5, QUEST, CHAID, CRUISE, ...)

    Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping)

    Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator)


  • Goodness of a Split

    Consider node t with impurity phi(t)

    The reduction in impurity through splitting predicate s (t splits into children nodes tL with impurity phi(tL) and tR with impurity phi(tR)) is:
    phi(s,t) = phi(t) - pL phi(tL) - pR phi(tR)
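With Gini impurity as the concrete choice of phi (an assumption for illustration; entropy would work the same way), the goodness of a split can be computed as:

```python
def gini(labels):
    """Gini impurity phi(t) of the class labels at a node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_goodness(labels, left, right):
    """Impurity reduction phi(s,t) = phi(t) - pL*phi(tL) - pR*phi(tR)."""
    pL = len(left) / len(labels)
    pR = len(right) / len(labels)
    return gini(labels) - pL * gini(left) - pR * gini(right)

# a perfect split drives both children to purity, so the reduction
# equals the parent's impurity (0.5 for a balanced two-class node)
parent = ['a', 'a', 'b', 'b']
g = split_goodness(parent, ['a', 'a'], ['b', 'b'])
```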

  • Pruning Methods

    Test dataset pruning
    Direct stopping rule
    Cost-complexity pruning
    MDL pruning
    Pruning by randomization testing

  • Stopping Policies

    A stopping policy indicates when further growth of the tree at a node t is counterproductive.

    All records are of the same class
    The attribute values of all records are identical
    All records have missing values
    At most one class has a number of records larger than a user-specified number
    All records go to the same child node if t is split (only possible with some split selection methods)

  • Test Dataset Pruning

    Use an independent test sample D' to estimate the misclassification cost using the resubstitution estimate R(T,D') at each node
    Select the subtree T' of T with the smallest expected cost

  • Missing Values

    What is the problem?
    During computation of the splitting predicate, we can selectively ignore records with missing values (note that this has some problems)
    But if a record r misses the value of the splitting variable, r cannot participate further in tree construction

    Algorithms for missing values address this problem.

  • Mean and Mode Imputation

    Assume record r has missing value r.X, and splitting variable is X.

    Simplest algorithm: if X is numerical (categorical), impute the overall mean (mode)
    Improved algorithm: if X is numerical (categorical), impute the class-conditional mean(X | t.C) (mode(X | t.C))
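A minimal sketch of the simplest scheme; the dict-per-record representation is an assumption for illustration:

```python
from statistics import mean, mode

def impute(records, attr, numeric=True):
    """Fill missing (None) values of attr with the overall mean (numeric)
    or mode (categorical), per the simplest imputation algorithm above."""
    present = [r[attr] for r in records if r[attr] is not None]
    fill = mean(present) if numeric else mode(present)
    for r in records:
        if r[attr] is None:
            r[attr] = fill
    return records

rows = [{'age': 20}, {'age': 40}, {'age': None}]
impute(rows, 'age')        # missing age becomes the mean, 30
```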

  • Decision Trees: Summary

    Many applications of decision trees
    There are many algorithms available for: split selection, pruning, handling missing values, data access
    Decision tree construction is still an active research area (after 20+ years!)
    Challenges: performance, scalability, evolving datasets, new applications

  • Supervised vs. Unsupervised Learning

    Supervised

    y = F(x): true function
    D: labeled training set, D = {xi, F(xi)}
    Learn: G(x), a model trained to predict the labels of D
    Goal: E[(F(x) - G(x))^2] ≈ 0
    Well-defined criteria: accuracy, RMSE, ...

    Unsupervised

    Generator: true model
    D: unlabeled data sample, D = {xi}
    Learn:

    ??????????

    Goal:

    ??????????

    Well defined criteria:

    ??????????

  • Clustering: Unsupervised Learning

    Given: a data set D (training set) and a similarity/distance metric
    Find: a partitioning of the data into groups of similar/close items

  • Similarity?

    Groups of similar customers: similar demographics, similar buying behavior, similar health
    Similar products: similar cost, similar function, similar store
    Similarity usually is domain/problem specific

  • Clustering: Informal Problem Definition

    Input:

    A data set of N records each given as a d-dimensional data feature vector.

    Output:

    Determine a natural, useful partitioning of the data set into a number of (k) clusters and noise such that we have:
    High similarity of records within each cluster (intra-cluster similarity)
    Low similarity of records between clusters (inter-cluster similarity)

  • Types of Clustering

    Hard clustering: each object is in one and only one cluster
    Soft clustering: each object has a probability of being in each cluster

  • Clustering Algorithms

    Partitioning-based clustering: K-means, K-medoids, EM (expectation maximization)
    Hierarchical clustering: divisive (top down), agglomerative (bottom up)
    Density-based methods: regions of dense points separated by sparser regions of relatively low density

  • K-Means Clustering Algorithm

    Initialize k cluster centers

    Do

    Assignment step: Assign each data point to its closest cluster center

    Re-estimation step: Re-compute cluster centers

    While (there are still changes in the cluster centers)

    Visualization at:

    http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
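The loop above can be sketched in a few lines; the fixed seed and the toy 1-D points are illustrative assumptions:

```python
import random

def kmeans(points, k, iters=100):
    """Lloyd's K-means on points represented as tuples."""
    random.seed(1)
    centers = random.sample(points, k)            # initialize k cluster centers
    for _ in range(iters):
        # assignment step: each point goes to its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # re-estimation step: re-compute centers as cluster means
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]
        if new == centers:                        # stop when centers no longer change
            break
        centers = new
    return centers

pts = [(0.0,), (0.5,), (10.0,), (10.5,)]
centers = sorted(kmeans(pts, 2))                  # one center per tight group
```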

  • Issues

    Why is K-Means working:

    How does it find the cluster centers?
    Does it find an optimal clustering?
    What are good starting points for the algorithm?
    What is the right number of cluster centers?
    How do we know it will terminate?

  • Agglomerative Clustering

    Algorithm:

    Put each item in its own cluster (all singletons)
    Find all pairwise distances between clusters
    Merge the two closest clusters
    Repeat until everything is in one cluster

    Observations:

    Results in a hierarchical clustering
    Yields a clustering for each possible number of clusters
    Greedy clustering: result is not optimal for any cluster size
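A sketch of the algorithm with single-link distance; the 1-D numeric items and stopping at k clusters (rather than one) are illustrative choices:

```python
def single_link(points, k):
    """Bottom-up single-link clustering: repeatedly merge the two
    closest clusters until only k remain."""
    clusters = [[p] for p in points]                 # every item starts as a singleton
    while len(clusters) > k:
        # pair of clusters with the smallest point-to-point distance
        i, j = min(((i, j)
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(abs(a - b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)               # merge the two closest clusters
    return clusters

groups = single_link([1, 2, 9, 10, 30], 2)           # 30 is left on its own
```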

  • Density-Based Clustering

    A cluster is defined as a connected dense component
    Density is defined in terms of the number of neighbors of a point
    We can find clusters of arbitrary shape

  • Market Basket Analysis

    Consider a shopping cart filled with several items
    Market basket analysis tries to answer the following questions:
    Who makes purchases?
    What do customers buy together?
    In what order do customers purchase items?

  • Market Basket Analysis

    Given:

    A database of customer transactionsEach transaction is a set of items

    Example:
    Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

    TID  CID  Date    Item   Qty
    111  201  5/1/99  Pen    2
    111  201  5/1/99  Ink    1
    111  201  5/1/99  Milk   3
    111  201  5/1/99  Juice  6
    112  105  6/3/99  Pen    1
    112  105  6/3/99  Ink    1
    112  105  6/3/99  Milk   1
    113  106  6/5/99  Pen    1
    113  106  6/5/99  Milk   1
    114  201  7/1/99  Pen    2
    114  201  7/1/99  Ink    2
    114  201  7/1/99  Juice  4

  • Market Basket Analysis (Contd.)

    Co-occurrences: 80% of all customers purchase items X, Y and Z together.
    Association rules: 60% of all customers who purchase X and Y also buy Z.
    Sequential patterns: 60% of customers who first buy X also purchase Y within three weeks.


  • Confidence and Support

    We prune the set of all possible association rules using two interestingness measures:

    Confidence of a rule: X → Y has confidence c if P(Y|X) = c
    Support of a rule: X → Y has support s if P(X,Y) = s

    We can also define

    Support of an itemset (a co-occurrence) XY: XY has support s if P(X,Y) = s
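Using transactions built from the pen/ink/milk/juice items of the earlier table, support and confidence can be computed directly (the particular four baskets here are an illustrative assumption):

```python
def support(itemset, transactions):
    """P(itemset): fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

T = [{'pen', 'ink', 'milk', 'juice'},
     {'pen', 'ink', 'milk'},
     {'pen', 'milk'},
     {'pen', 'ink', 'juice'}]
s = support({'pen', 'ink'}, T)                 # 3 of 4 baskets -> 0.75
c = confidence({'pen', 'ink'}, {'juice'}, T)   # 2 of those 3 baskets -> 2/3
```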


  • Market Basket Analysis: Applications

    Sample applications:
    Direct marketing
    Fraud detection for medical insurance
    Floor/shelf planning
    Web site layout
    Cross-selling


  • Applications of Frequent Itemsets

    Market basket analysis
    Association rules
    Classification (especially: text, rare classes)
    Seeds for construction of Bayesian networks
    Web log analysis
    Collaborative filtering

  • Association Rule Algorithms

    More abstract problem redux
    Breadth-first search
    Depth-first search

  • Problem Redux

    Abstract:

    A set of items {1,2,...,k}
    A database of transactions (itemsets) D = {T1, T2, ..., Tn},
    Tj ⊆ {1,2,...,k}

    GOAL:

    Find all itemsets that appear in at least x transactions

    (appear in == are subsets of)

    I subset T: T supports I

    For an itemset I, the number of transactions it appears in is called the support of I.

    x is called the minimum support.

    Concrete:

    I = {milk, bread, cheese, ...}
    D = { {milk,bread,cheese}, {bread,cheese,juice}, ... }

    GOAL:

    Find all itemsets that appear in at least 1000 transactions

    {milk,bread,cheese} supports {milk,bread}

  • Problem Redux (Contd.)

    Definitions:

    An itemset is frequent if it is a subset of at least x transactions. (FI.)
    An itemset is maximally frequent if it is frequent and it does not have a frequent superset. (MFI.)

    GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)).

    Obvious relationship:
    MFI subset FI

    Example:

    D={ {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} }

    Minimum support x = 3

    {1,2} is frequent

    {1,2,3} is maximal frequent

    Support({1,2}) = 4

    All maximal frequent itemsets: {1,2,3}
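The small example above can be checked by brute-force enumeration; this is a sketch that tries every candidate itemset, which is only feasible for tiny item sets (Apriori-style algorithms prune this search in practice):

```python
from itertools import combinations

def frequent_itemsets(D, minsup):
    """All itemsets contained in at least minsup transactions (the FI)."""
    items = sorted(set().union(*D))
    FI = []
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            if sum(set(cand) <= t for t in D) >= minsup:
                FI.append(set(cand))
    return FI

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
FI = frequent_itemsets(D, minsup=3)
# maximal frequent itemsets: frequent sets with no frequent proper superset
MFI = [s for s in FI if not any(s < t for t in FI)]
```

On the slide's example this reproduces the stated answer: {1,2} is frequent with support 4, and {1,2,3} is the only maximal frequent itemset.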

  • Applications

    Spatial association rulesWeb miningMarket basket analysisUser/customer profiling

  • Extensions: Sequential Patterns

    Suggestion: In the market basket analysis, replace Milk, Pen, etc. with names of medications and use the idea in a new hospital data mining proposal. Add to the idea of swarm intelligence the extra analysis of the induction rules in this set of slides.

  • Kraft Foods: Direct Marketing

    Company maintains a large database of purchases by customers.

    Data mining

    1. Analysts identified associations among groups of products bought by particular segments of customers.

    2. Sent out 3 sets of coupons to various households.

    Better response rates: 50% increase in sales for one of its products

    Continues to use this approach

    Health Insurance Commission of Australia: Insurance Fraud

    Commission maintains a database of insurance claims, including laboratory tests ordered during the diagnosis of patients.

    Data mining

    1. Identified the practice of "up coding" to reflect more expensive tests than are necessary.

    2. Now monitors orders for lab tests.

    Commission expects to save US$1,000,000 / year by eliminating the practice of "up coding."

  • HNC Software: Credit Card Fraud

    Payment Fraud

    Large issuers of cards may lose

    $10 million / year due to fraud

    Difficult to identify the few transactions among thousands which reflect potential fraud

    Falcon software

    Mines data through neural networks

    Introduced in September 1992

    Models each cardholder's requested transaction against the customer's past spending history.

    processes several hundred requests per second

    compares current transaction with customer's history

    identifies the transactions most likely to be frauds

    enables bank to stop high-risk transactions before they are authorized

    Used by many retail banks: currently monitors

    160 million card accounts for fraud

  • New Account Fraud

    Fraudulent applications for credit cards are growing at 50 % per year

    Falcon Sentry software

    Mines data through neural networks and a rule base

    Introduced in September 1992

    Checks information on applications against data from credit bureaus

    Allows card issuers to simultaneously:

    increase the proportion of applications received

    reduce the proportion of fraudulent applications authorized


  • Quality Control

    IBM Microelectronics: Quality Control

    Analyzed manufacturing data on Dynamic Random Access Memory (DRAM) chips.

    Data mining

    1. Built predictive models of

    manufacturing yield (% non-defective)

    effects of production parameters on chip performance.

    2. Discovered critical factors behind

    production yield &

    product performance.

    3. Created a new design for the chip

    increased yield saved millions of dollars in direct manufacturing costs

    enhanced product performance by substantially lowering the memory cycle time

  • B & L Stores

    Belk and Leggett Stores =

    one of the largest retail chains

    280 stores in southeast U.S.

    data warehouse contains 100s of gigabytes (billion characters) of data

    data mining to:

    increase sales

    reduce costs

    Selected DSS Agent from MicroStrategy, Inc.

    analyze merchandising (patterns of sales)

    manage inventory

    Retail Sales

  • DSS Agent

    uses intelligent agents for data mining

    provides multiple functions

    recognizes sales patterns among stores

    discovers sales patterns by

    time of day

    day of year

    category of product

    etc.

    swiftly identifies trends & shifts in customer tastes

    performs Market Basket Analysis (MBA)

    analyzes Point-of-Sale or -Service (POS) data

    identifies relationships among products and/or services purchased

    E.g. A customer who buys Brand X slacks has a 35% chance of buying Brand Y shirts.

    Agent tool is also used by other Fortune 1000 firms

    average ROI > 300 %

    average payback in 1 ~ 2 years

    Market Basket Analysis

  • Case Based Reasoning

    (CBR)

    General scheme for a case based reasoning (CBR) model. The target case is

    matched against similar precedents in the historical database, such as cases A and B.




  • Case Based Reasoning (CBR)

    Learning through the accumulation of experience

    Key issues

    Indexing:
    storing cases for quick, effective access of precedents

    Retrieval:
    accessing the appropriate precedent cases

    Advantages

    Explicit knowledge form recognizable to humans

    No need to re-code knowledge for computer processing

    Limitations

    Retrieving precedents based on superficial features
    E.g. Matching Indonesia with U.S. because both have similar population size

    Traditional approach ignores the issue of generalizing knowledge


  • Genetic Algorithm

    Generation of candidate solutions using the procedures of biological evolution.

    Procedure

    0. Initialize.
    Create a population of potential solutions ("organisms").
    1. Evaluate.
    Determine the level of "fitness" for each solution.
    2. Cull.
    Discard the poor solutions.
    3. Breed.
    a. Select 2 "fit" solutions to serve as parents.
    b. From the 2 parents, generate offspring.
    * Crossover:
    Cut the parents at random and switch the 2 halves.
    * Mutation:
    Randomly change the value in a parent solution.
    4. Repeat.
    Go back to Step 1 above.
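A minimal sketch of these steps on the toy "one-max" problem (maximize the number of 1 bits); the population size, mutation rate, and bit-string encoding are illustrative assumptions:

```python
import random

def genetic_search(fitness, length=10, pop_size=20, generations=60):
    """Initialize, then repeatedly evaluate, cull, and breed with
    crossover and mutation, following the procedure above."""
    random.seed(0)
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]                   # 0. initialize
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)            # 1. evaluate
        pop = pop[:pop_size // 2]                      # 2. cull the poor half
        while len(pop) < pop_size:                     # 3. breed
            p1, p2 = random.sample(pop[:5], 2)         #    two fit parents
            cut = random.randrange(1, length)          #    crossover: cut and swap
            child = p1[:cut] + p2[cut:]
            if random.random() < 0.1:                  #    mutation: flip a bit
                i = random.randrange(length)
                child[i] ^= 1
            pop.append(child)
    return max(pop, key=fitness)

best = genetic_search(sum)    # fitness = number of 1 bits
```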


  • Genetic Algorithm (Cont.)

    Advantages

    Applicable to a wide range of problem domains.

    Robustness:
    can obtain solutions even when the performance

    function is highly irregular or input data are noisy.

    Implicit parallelism:
    can search in many directions concurrently.

    Limitations

    Slow, like neural networks.
    But: computation can be distributed

    over multiple processors

    (unlike neural networks)

    Source: www.pathology.washington.edu


  • Multistrategy Learning

    Every technique has advantages & limitations

    Multistrategy approach

    Take advantage of the strengths of diverse techniques

    Circumvent the limitations of each methodology

  • Types of Models

    Prediction Models for Predicting and Classifying

    Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)

    Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)

    Descriptive Models for Grouping and Finding Associations

    Clustering/Grouping algorithms: K-means, Kohonen

    Association algorithms: apriori, GRI

  • Neural Networks

    Description
    Difficult interpretation
    Tends to overfit the data
    Extensive amount of training time
    A lot of data preparation
    Works with all data types

  • Rule Induction

    Description

    Intuitive output
    Handles all forms of numeric data, as well as non-numeric (symbolic) data

    C5 algorithm: a special case of rule induction

    Target variable must be symbolic

  • Apriori

    Description

    Seeks association rules in dataset

    Market basket analysis

    Sequence discovery

  • Data Mining Is

    The automated process of finding relationships and patterns in stored data. It is different from the use of SQL queries and other business intelligence tools

  • Data Mining Is

    Motivated by business need, large amounts of available data, and humans' limited cognitive processing abilities
    Enabled by data warehousing, parallel processing, and data mining algorithms

  • Common Types of Information from Data Mining

    Associations -- identifies occurrences that are linked to a single event
    Sequences -- identifies events that are linked over time
    Classification -- recognizes patterns that describe the group to which an item belongs

  • Common Types of Information from Data Mining

    Clustering -- discovers different groupings within the data
    Forecasting -- estimates future values

  • Commonly Used Data Mining Techniques

    Artificial neural networks
    Decision trees
    Genetic algorithms
    Nearest neighbor method
    Rule induction

  • The Current State of Data Mining Tools

    Many of the vendors are small companies
    IBM and SAS have been in the market for some time, and more big players are moving into this market
    BI tools and RDBMS products are increasingly including basic data mining capabilities
    Packaged data mining applications are becoming common

  • The Data Mining Process

    Requires personnel with domain, data warehousing, and data mining expertise
    Requires data selection, data extraction, data cleansing, and data transformation
    Most data mining tools work with highly granular flat files
    Is an iterative and interactive process

  • Why Data Mining

    Credit ratings/targeted marketing: given a database of 100,000 names, which persons are the least likely to default on their credit cards? Identify likely responders to sales promotions
    Fraud detection: which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
    Customer relationship management: which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

    Data Mining helps extract such information

  • Applications

    Banking: loan/credit card approval; predict good customers based on old customers
    Customer relationship management: identify those who are likely to leave for a competitor
    Targeted marketing: identify likely responders to promotions
    Fraud detection: telecommunications, financial transactions; from an online stream of events, identify fraudulent events
    Manufacturing and production: automatically adjust knobs when process parameters change


    Any area where large amounts of historic data that if understood

    better can help shape future decisions.

  • Applications (continued)

    Medicine: disease outcome, effectiveness of treatments; analyze patient disease history to find relationships between diseases
    Molecular/pharmaceutical: identify new drugs
    Scientific data analysis: identify new galaxies by searching for sub-clusters
    Web site/store design and promotion: find affinity of visitors to pages and modify layout

  • The KDD process

    Problem formulation
    Data collection; subset data: sampling might hurt if the data are highly skewed; feature selection: principal component analysis, heuristic search
    Pre-processing: cleaning; name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values
    Transformation: map complex objects, e.g. time series data, to features, e.g. frequency
    Choosing the mining task and mining method
    Result evaluation and visualization

    Knowledge discovery is an iterative process

  • Relationship with other fields

    Overlaps with machine learning, statistics, artificial intelligence, databases, visualization, but with more stress on:
    scalability in the number of features and instances
    algorithms and architectures (whereas the foundations of methods and formulations are provided by statistics and machine learning)
    automation for handling large, heterogeneous data


  • Some basic operations

    Predictive: regression, classification, collaborative filtering
    Descriptive: clustering / similarity matching, association rules and variants, deviation detection


    Each topic is a talk in itself.

  • Classification

    Given old data about customers and payments, predict new applicants loan eligibility.

    (Figure: previous customers with attributes Age, Salary, Profession, Location, and Customer type feed a classifier; the resulting decision rules, such as "Salary > 5 L" and "Prof. = Exec", label new applicants' data as good or bad.)

  • Classification methods

    Goal: predict class Ci = f(x1, x2, ..., xn)
    Regression (linear or any other polynomial): a*x1 + b*x2 + c = Ci
    Nearest neighbor
    Decision tree classifier: divide the decision space into piecewise constant regions
    Probabilistic/generative models
    Neural networks: partition by non-linear boundaries


    Nearest neighbor

    Define proximity between instances, find neighbors of a new instance, and assign the majority class
    Case based reasoning: when attributes are more complicated than real-valued

    Pros: fast training
    Cons: slow during application; no feature selection; notion of proximity vague
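A minimal majority-vote sketch; Euclidean proximity and k=3 are illustrative choices, and the tiny labeled set is an assumption for the example:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.
    `train` is a list of (feature_vector, label) pairs; "training" is just
    storing the data, which is why training is fast and application is slow."""
    dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
    neighbors = sorted(train, key=lambda t: dist(t[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), 'good'), ((1, 2), 'good'), ((2, 1), 'good'),
         ((8, 8), 'bad'), ((9, 8), 'bad')]
label = knn_predict(train, (1.5, 1.5), k=3)   # all 3 neighbors vote 'good'
```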


  • Clustering

    Unsupervised learning when old data with class labels is not available, e.g. when introducing a new product
    Group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster
    Key requirement: need a good measure of similarity between instances
    Identify micro-markets and develop policies for each


  • Applications

    Customer segmentation, e.g. for targeted marketing: group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster; identify micro-markets and develop policies for each
    Collaborative filtering: group based on common items purchased
    Text clustering
    Compression

  • Distance functions

    Numeric data: Euclidean, Manhattan distances
    Categorical data: 0/1 to indicate presence/absence, followed by
    Hamming distance (# of dissimilarities)
    Jaccard coefficient: # of shared 1s / (# of positions with a 1)
    Data-dependent measures: similarity of A and B depends on co-occurrence with C
    Combined numeric and categorical data: weighted normalized distance
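Sketches of these distance functions on 0/1 presence vectors and numeric vectors:

```python
def hamming(a, b):
    """Number of positions where two 0/1 presence vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Shared 1s divided by positions where either vector has a 1."""
    both = sum(x and y for x, y in zip(a, b))
    any1 = sum(x or y for x, y in zip(a, b))
    return both / any1

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

h = hamming([1, 0, 1, 1], [1, 1, 0, 1])   # 2 mismatched positions
j = jaccard([1, 0, 1, 1], [1, 1, 0, 1])   # 2 shared 1s / 4 positions with a 1
```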

  • Clustering methods

    Hierarchical clustering: agglomerative vs. divisive; single link vs. complete link
    Partitional clustering: distance-based (K-means), model-based (EM), density-based

  • Agglomerative Hierarchical clustering

    Given: matrix of similarity between every point pair
    Start with each point in a separate cluster and merge clusters based on some criterion:
    Single link: merge the two clusters such that the minimum distance between two points from the two different clusters is the least
    Complete link: merge the two clusters such that all points in one cluster are close to all points in the other

  • Partitional methods: K-means

    Criterion: minimize the sum of squared distances
    Between each point and the centroid of its cluster, or
    Between each pair of points in the cluster
    Algorithm:
    Select an initial partition with K clusters: random, first K, or K separated points
    Repeat until stabilization:
    Assign each point to the closest cluster center
    Generate new cluster centers
    Adjust clusters by merging/splitting

  • Collaborative Filtering

    Given a database of user preferences, predict the preference of a new user
    Example: predict what new movies you will like based on
    your past preferences
    others with similar past preferences
    their preferences for the new movies
    Example: predict what books/CDs a person may want to buy (and suggest them, or give discounts to tempt the customer)

  • Association rules

    Given a set T of groups of items
    Example: sets of items purchased
    Goal: find all rules on itemsets of the form a --> b such that:
    support of a and b > user threshold s
    conditional probability (confidence) of b given a > user threshold c
    Examples: Milk --> bread; purchase of product A --> service B

    T:
    {Milk, cereal}
    {Tea, milk}
    {Tea, rice, bread}
    {Cereal}

  • Prevalent Interesting

    Analysts already know about prevalent rules
    Interesting rules are those that deviate from prior expectation
    Mining's payoff is in finding surprising phenomena

    (Cartoon: two analysts announce "Milk and cereal sell together!", in 1995 and again years later; only the first finding is interesting.)

  • Applications of fast itemset counting

    Find correlated events:

    Applications in medicine: find redundant tests
    Cross selling in retail, banking
    Improve predictive capability of classifiers that assume attribute independence
    New similarity measures of categorical attributes [Mannila et al, KDD 98]


  • Application Areas

    Industry                Application
    Finance                 Credit card analysis
    Insurance               Claims, fraud analysis
    Telecommunication       Call record analysis
    Transport               Logistics management
    Consumer goods          Promotion analysis
    Data service providers  Value-added data
    Utilities               Power usage analysis

  • Usage scenarios

    Data warehouse mining: assimilate data from operational sources; mine static data
    Mining log data
    Continuous mining: example in process control
    Stages in mining: data selection, pre-processing (cleaning), transformation, mining, result evaluation, visualization


  • Mining market

    Around 20 to 30 mining tool vendors
    Major tool players: Clementine, IBM's Intelligent Miner, SGI's MineSet, SAS's Enterprise Miner
    All pretty much the same set of tools
    Many embedded products: fraud detection, electronic commerce applications, health care, customer relationship management (Epiphany)


    Market size: $40M, expected to grow 10 times by 2000 (Forrester Research)

  • Vertical integration:
    Mining on the web

    Web log analysis for site design: what are popular pages, what links are hard to find
    Electronic stores' sales enhancements: recommendations, advertisement
    Collaborative filtering: Net Perception, Wisewire
    Inventory control: what was a shopper looking for and could not find


  • State of art in mining OLAP integration

    Decision trees [Information Discovery, Cognos]: find factors influencing high profits
    Clustering [Pilot Software]: segment customers to define a hierarchy on that dimension
    Time series analysis [Seagate's Holos]: query for various shapes along time, e.g. spikes, outliers
    Multi-level associations [Han et al.]: find associations between members of dimensions
    Sarawagi [VLDB 2000]


    Little integration so far; here are a few exceptions.

    People are starting to wake up to this possibility, and here are some examples found by web surfing.

    Decision trees are the most common. Information Discovery claimed to be the only serious integrator [DBMS, April 98].

    Clustering is used by some to define new product hierarchies.

    Of course, a rich set of time-series functions, especially for forecasting, was always there.

    New charting software: 80/20, A-B-C analysis, quadrant plotting.

    University work: Jiawei Han.

    The previous approach has been to bring mining operations into OLAP: look at mining operations and choose what fits.

    My approach has been to reflect on what people do with the cube metaphor and drill-down, roll-up based exploration, and to see if anything there can be automated.

    Discuss my work first.

  • Data Mining in Use

    The US Government uses data mining to track fraud
    A supermarket becomes an information broker
    Basketball teams use it to track game strategy
    Cross selling
    Target marketing
    Holding on to good customers
    Weeding out bad customers


  • Some success stories

    Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data
    Won over the (manual) knowledge engineering approach
    http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides a good detailed description of the entire process
    Major US bank: customer attrition prediction
    First segment customers based on financial behavior: found 3 segments
    Build attrition models for each of the 3 segments
    40-50% of attritions were predicted == factor of 18 increase
    Targeted credit marketing: major US banks
    Find customer segments based on 13 months of credit balances
    Build another response model based on surveys
    Increased response 4 times -- to 2%

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    What is KnowledgeSeeker?

    Produced by ANGOSS Software Corporation, who focus solely on data mining software.

    Offer training and consulting services

    Produce data mining add-ins which accept data from all major databases

    Works with popular query and reporting, spreadsheet, statistical and OLAP & ROLAP tools.


    *

    Angoss Software Corp. was formed under the Business Corp. Act (Ontario) in 1980. It began data mining software production in 1992. It is publicly traded on the Canadian Venture Exchange under the trading symbol ANC.

    Promote the rapid knowledge transfer to customers in the use of technology and adoption of best practice for data mining

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    Major Competitors

    Company    Software
    SPSS       Clementine 6.0
    SAS        Enterprise Miner 3.0
    IBM        Intelligent Miner

    *

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    Major Competitors

    Company    Software
    SGI        MineSet 3.1
    Oracle     Darwin
    Cognos     Scenario

    *

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    Current Applications

    Manufacturing

    Used by the R.R. Donnelly & Sons commercial printing company to improve process control, cut costs and increase productivity.

    Used extensively by Hewlett Packard in their United States manufacturing plants as a process control tool both to analyze factors impacting product quality as well as to generate rules for production control systems.


    *

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    Current Applications

    Auditing

    Used by the IRS to combat fraud, reduce risk, and increase collection rates.

    Finance

    Used by the Canadian Imperial Bank of Commerce (CIBC) to create models for fraud detection and risk management.


    *

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    Current Applications

    CRM

    Telephony

    Used by US West to reduce churning and increase customer loyalty for a new voice messaging technology.


    *

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    Current Applications

    Marketing

    Used by the Washington Post to improve their direct mail targeting and to conduct survey analysis.

    Health Care

    Used by the Oxford Transplant Center to discover factors affecting transplant survival rates.

    Used by the University of Rochester Cancer Center to study the effect of anxiety on chemotherapy-related nausea.


    *

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    More Customers


    *

  • Data Mining Tools: KnowledgeSeeker 4.5

    *

    Questions

    What percentage of people in the test group have high blood pressure with these characteristics: a 66-year-old male regular smoker with low to moderate salt consumption?

    Do the risk levels change for a male with the same characteristics who quit smoking? What are the percentages?

    If you are a 2% milk drinker, how many factors are still interesting?

    Knowing that salt consumption and smoking habits are interesting factors, which one has a stronger correlation to blood pressure levels?

    Grow an automatic tree. Look to see whether gender is an interesting factor for a 55-year-old regular smoker who does not eat cheese.


  • Association

    Classic market-basket analysis, which treats the purchase of a number of items (for example, the contents of a shopping basket) as a single transaction. This information can be used to adjust inventories, modify floor or shelf layouts, or introduce targeted promotional activities to increase overall sales or move specific products. Example : 80 percent of all transactions in which beer was purchased also included potato chips.
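    The beer-and-chips rule above can be checked directly: support counts how often an itemset appears, and confidence is the conditional frequency. A minimal sketch in Python; the transaction list is invented for illustration.

```python
# Hedged sketch: confidence of the association rule {beer} -> {chips}
# over a toy list of market baskets (all data invented).

transactions = [
    {"beer", "chips"}, {"beer", "chips", "milk"},
    {"beer", "diapers"}, {"beer", "chips", "diapers"},
    {"milk", "bread"},
]

def confidence(transactions, antecedent, consequent):
    """Fraction of baskets containing antecedent that also contain consequent."""
    have_a = [t for t in transactions if antecedent <= t]
    have_both = [t for t in have_a if consequent <= t]
    return len(have_both) / len(have_a)

conf = confidence(transactions, {"beer"}, {"chips"})
print(conf)  # 3 of the 4 beer baskets also contain chips -> 0.75
```

    A real market-basket miner (e.g. Apriori) enumerates many candidate rules and filters them by minimum support and confidence; this sketch only scores one rule.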

  • Sequence-based analysis

    Traditional market-basket analysis deals with a collection of items as part of a point-in-time transaction. Sequence-based analysis adds the time dimension: the goal is to identify a typical set of purchases that might predict the subsequent purchase of a specific item.

  • Clustering

    Clustering approaches address segmentation problems. These approaches assign records with a large number of attributes into a relatively small set of groups or "segments." Example: the buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.
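    As an illustration of segmentation, here is a minimal k-means-style sketch in Python that groups one-dimensional spending values into segments; the data, the number of segments, and the starting centers are all invented.

```python
# Minimal k-means sketch (not a production clusterer): segment 1-D
# "spending" values into groups around the nearest center.

def kmeans_1d(values, centers, iters=20):
    for _ in range(iters):
        # Assignment step: each value joins its nearest center's group.
        groups = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        # Update step: each center moves to the mean of its group.
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

spend = [10, 12, 11, 95, 100, 98, 51, 49]
print(kmeans_1d(spend, centers=[0.0, 50.0, 100.0]))
# three segments: low, medium and high spenders
```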

  • Classification

    The most commonly applied data mining technique. The algorithm uses preclassified examples to determine the set of parameters required for proper discrimination. Example: a classifier derived from the classification approach that is capable of identifying risky loans could be used to aid in the decision of whether to grant a loan to an individual.
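    A toy sketch of learning from preclassified examples: a 1-nearest-neighbour classifier over invented (income, debt) loan records. This is the simplest possible illustration of discrimination from labeled data, not any specific product's algorithm.

```python
# Preclassified loan examples: (income, debt) -> "risky"/"safe".
# All figures are made up for illustration.

labeled = [
    ((20, 15), "risky"), ((25, 18), "risky"),
    ((80, 5), "safe"), ((90, 10), "safe"),
]

def classify(point):
    """Label a new applicant with the class of the closest known example."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(labeled, key=lambda ex: dist2(ex[0], point))
    return label

print(classify((22, 16)))  # -> risky
print(classify((85, 7)))   # -> safe
```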

  • Issues of Data Mining

    Present-day tools are strong but require significant expertise to implement effectively.
    Susceptibility to "dirty" or irrelevant data.
    Inability to "explain" results in human terms.

  • Issues

    susceptibility to "dirty" or irrelevant data Data mining tools of today simply take everything they are given as factual and draw the resulting conclusions. Users must take the necessary precautions to ensure that the data being analyzed is "clean."

  • Issues, cont

    inability to "explain" results in human terms
    Many of the tools employed in data mining analysis use complex mathematical algorithms that are not easily mapped into human terms. What good does the information do if you don't understand it?

  • Comparison with reporting, BI and OLAP

    Reporting

    Simple relationships
    Choose the relevant factors
    Examine all details

    (Also applies to visualisation & simple statistics)

    Data Mining

    Complex relationships
    Automatically find the relevant factors
    Show only relevant details

    Prediction

    *

    Here it's obviously the algorithms.

  • Comparison with Statistics

    Statistical analysis

    Mainly about hypothesis testing
    Focussed on precision

    Data mining

    Mainly about hypothesis generation
    Focussed on deployment

    *

    Here it's less clear: maybe it's the algorithms, but more it's the attitude.

  • Example: data mining and customer processes

    Insight: Who are my customers and why do they behave the way they do?
    Prediction: Who is a good prospect, for what product, who is at risk, what is the next thing to offer?
    Uses: Targeted marketing, mail-shots, call-centres, adaptive web-sites

    *

  • Example: data mining and fraud detection

    Insight: How can a (specific method of) fraud be recognised? What constitutes normal, abnormal and suspicious events?
    Prediction: Recognise similarity to previous frauds (how similar?); spot abnormal events (how suspicious?)
    Used by: Banks, telcos, retail, government

    *

  • Example: data mining and diagnosing cancer

    Complex data from genetics
    Challenging data mining problem
    Find patterns of gene activation indicating different diseases / stages
    "Changed the way I think about cancer" (Oncologist, Chicago Children's Memorial Hospital)


    *

  • Example: data mining and policing

    Knowing the patterns helps plan effective crime prevention
    Crime hot-spots understood better
    Sift through mountains of crime reports
    Identify crime series
    "Other people save money using data mining; we save lives." (Police force homicide specialist and data miner)

    *

  • Data mining tools:
    Clementine and its philosophy

    *

    How it works

    How it's really used.

    Handling of business problems and algorithms / expert features

    Deep embedding of deployment

    CRISP-DM pane

  • How to do data mining

    Lots of data mining operations
    How do you glue them together to solve a problem?
    How do we actually do data mining?
    Methodology
    Not just the right way, but any way

    *

  • Myths about Data Mining (1)
    Data, Process and Tech

    Data mining is all about
    massive data

    Data mining is a
    technical process

    Data mining is all
    about algorithms

    Data mining is all
    about predictive accuracy

  • Myths about Data Mining (2)
    Data Quality

    Data mining only works
    with clean data

    Data mining only works
    with complete data

    Data mining only works

    with correct data

  • One last exploding myth

    Neural Networks are not useful when you need to understand the patterns that you find
    (which is nearly always in data mining)

    Related to over-simplistic views of data mining

    Data mining techniques form a toolkit

    We often use techniques in surprising ways

    E.g. Neural nets for field selection
    Neural nets for pattern confirmation
    Neural nets combined with other techniques
    for cross-checking

    What use is a pair of pliers?

  • *

    Related Concepts Outline

    Database/OLTP Systems
    Fuzzy Sets and Logic
    Information Retrieval (Web Search Engines)
    Dimensional Modeling
    Data Warehousing
    OLAP/DSS
    Statistics
    Machine Learning
    Pattern Matching

    Goal: Examine some areas which are related to data mining.

  • *

    Fuzzy Sets and Logic

    Fuzzy set: a set membership function is a real-valued function with output in the range [0,1].
    f(x): probability x is in F.
    1-f(x): probability x is not in F.
    Ex: T = {x | x is a person and x is tall}
    Let f(x) be the probability that x is tall
    Here f is the membership function

    DM: Prediction and classification are fuzzy.
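    A membership function like f above can be sketched as a simple ramp; the 160 cm and 190 cm thresholds below are invented for illustration.

```python
# Sketch of a fuzzy membership function f(x) for the set T of "tall"
# people: a linear ramp between two invented height thresholds.

def tall(height_cm):
    """Degree of membership in 'tall', a value in [0, 1]."""
    if height_cm <= 160:
        return 0.0          # definitely not tall
    if height_cm >= 190:
        return 1.0          # definitely tall
    return (height_cm - 160) / 30   # partial membership in between

print(tall(150), tall(175), tall(200))  # 0.0 0.5 1.0
```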

  • *

    Information Retrieval

    Information Retrieval (IR): retrieving desired information from textual data.
    Library Science
    Digital Libraries
    Web Search Engines
    Traditionally keyword based
    Sample query:

    Find all documents about data mining.

    DM: Similarity measures;

    Mine text/Web data.

  • Prentice Hall

    *

    Dimensional Modeling

    View data in a hierarchical manner, more as business executives might
    Useful in decision support systems and mining
    Dimension: collection of logically related attributes; axis for modeling data.
    Facts: data stored
    Ex: Dimensions: products, locations, date

    Facts quantity, unit price

    DM: May view data as dimensional.


  • *

    Dimensional Modeling Queries

    Roll Up: more general dimension
    Drill Down: more specific dimension
    Dimension (Aggregation) Hierarchy
    SQL uses aggregation
    Decision Support Systems (DSS): computer systems and tools to assist managers in making decisions and solving problems.
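    Roll-up as aggregation can be illustrated in miniature: summing facts up from the (product, city, day) level to the more general (product, city) level. The dimension names and figures are invented.

```python
# Roll-up in miniature: aggregate (product, city, day, quantity) facts
# up to the (product, city) dimension level by summing out the day.
from collections import defaultdict

facts = [
    ("milk", "Riyadh", "2010-05-01", 10),
    ("milk", "Riyadh", "2010-05-02", 15),
    ("milk", "Jeddah", "2010-05-01", 7),
]

rolled_up = defaultdict(int)
for product, city, day, qty in facts:
    rolled_up[(product, city)] += qty      # drop the day dimension

print(dict(rolled_up))
# {('milk', 'Riyadh'): 25, ('milk', 'Jeddah'): 7}
```

    Drill-down is the reverse move: going back to the finer (product, city, day) facts.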

  • *

    Cube view of Data

  • *

    Data Warehousing

    "Subject-oriented, integrated, time-variant, nonvolatile" (William Inmon)
    Operational data: data used in the day-to-day needs of the company.
    Informational data: supports other functions such as planning and forecasting.
    Data mining tools often access data warehouses rather than operational data.

    DM: May access data in warehouse.

  • *

    OLAP

    Online Analytic Processing (OLAP): provides more complex queries than OLTP.
    OnLine Transaction Processing (OLTP): traditional database/transaction processing.
    Dimensional data; cube view
    Visualization of operations:
    Slice: examine a sub-cube.
    Dice: rotate the cube to look at another dimension.
    Roll Up/Drill Down

    DM: May use OLAP queries.

  • *

    OLAP Operations

    Single Cell

    Multiple Cells

    Slice

    Dice

    Roll Up

    Drill Down

  • *

    Statistics

    Simple descriptive models
    Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
    Exploratory Data Analysis: data can actually drive the creation of the model
    Opposite of the traditional statistical view.
    Data mining targeted to the business user

    DM: Many data mining methods come from statistical techniques.

  • *

    Machine Learning

    Machine Learning: area of AI that examines how to write programs that can learn.
    Often used in classification and prediction
    Supervised Learning: learns by example.
    Unsupervised Learning: learns without knowledge of correct answers.
    Machine learning often deals with small static datasets.

    DM: Uses many machine learning techniques.

  • Prentice Hall

    *

    Pattern Matching (Recognition)

    Pattern Matching: finds occurrences of a predefined pattern in the data.
    Applications include speech recognition, information retrieval, time series analysis.

    DM: Type of classification.


  • *

    DM vs. Related Topics

    Area     Query     Data              Results  Output
    DB/OLTP  Precise   Database          Precise  DB Objects or Aggregation
    IR       Precise   Documents         Vague    Documents
    OLAP     Analysis  Multidimensional  Precise  DB Objects or Aggregation
    DM       Vague     Preprocessed      Vague    KDD Objects

  • Prentice Hall

    *

    Data Mining Techniques Outline

    Statistical
    Point Estimation
    Models Based on Summarization
    Bayes Theorem
    Hypothesis Testing
    Regression and Correlation
    Similarity Measures
    Decision Trees
    Neural Networks
    Activation Functions
    Genetic Algorithms

    Goal: Provide an overview of basic data mining techniques


  • *

    Point Estimation

    Point estimate: estimate a population parameter.
    May be made by calculating the parameter for a sample.
    May be used to predict a value for missing data.
    Ex: R contains 100 employees
    99 have salary information
    Mean salary of these is $50,000
    Use $50,000 as the value of the remaining employee's salary.

    Is this a good idea?
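    The slide's example as code: use the sample mean of the 99 known salaries as a point estimate for the missing one. The individual values are synthetic, chosen so their mean is $50,000.

```python
# Mean imputation: estimate the one missing salary with the sample mean
# of the 99 known ones (values synthetic, mean chosen to be 50,000).

salaries = [48_000.0, 50_000.0, 52_000.0] * 33 + [None]   # 99 known, 1 missing
known = [s for s in salaries if s is not None]
mean = sum(known) / len(known)
imputed = [mean if s is None else s for s in salaries]
print(mean)  # 50000.0, used as the point estimate for the missing value
```

    Whether this is a good idea depends on the data: mean imputation shrinks the variance and can bias any model built on the imputed column.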

  • *

    Estimation Error

    Bias: Difference between expected value and actual value.

    Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value:

    Why square?
    Root Mean Square Error (RMSE)
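    Bias, MSE and RMSE can be computed directly once the true values are known; the numbers below are invented so that the positive and negative errors cancel in the bias but not in the MSE, which is one answer to "why square?".

```python
# Bias, MSE and RMSE of an estimator against known actual values.
import math

actual   = [10.0, 12.0, 14.0, 16.0]
estimate = [11.0, 11.0, 15.0, 15.0]

errors = [e - a for e, a in zip(estimate, actual)]
bias = sum(errors) / len(errors)              # signed errors cancel
mse  = sum(err ** 2 for err in errors) / len(errors)   # squaring stops cancellation
rmse = math.sqrt(mse)                         # back in the original units
print(bias, mse, rmse)  # 0.0 1.0 1.0
```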

  • *

    Expectation-Maximization (EM)

    Solves estimation with incomplete data.
    Obtain initial estimates for parameters.
    Iteratively use estimates for missing data and continue until convergence.
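    A stripped-down EM-style loop for the simplest possible case: estimating a mean when some observations are missing. The E-step fills the missing values with the current estimate; the M-step re-estimates the mean. Real EM handles much richer models (e.g. mixtures); the data here is invented.

```python
# EM-style iteration for a mean with missing observations (toy case).

data = [3.0, 7.0, None, 5.0, None]
mu = 0.0                                           # crude initial estimate
for _ in range(10):
    filled = [mu if x is None else x for x in data]   # E-step: impute
    mu = sum(filled) / len(filled)                    # M-step: re-estimate
print(mu)  # approaches the mean of the observed values, 5.0
```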

  • *

    Models Based on Summarization

    Visualization: frequency distribution, mean, variance, median, mode, etc.
    Box plot:

  • *

    Bayes Theorem

    Posterior probability: P(h1|xi)
    Prior probability: P(h1)
    Bayes theorem:

    Assign probabilities of hypotheses given a data value.
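    Bayes theorem in code: the posterior P(h1|x) computed from a prior and two likelihoods. The probabilities are invented; note how a rare hypothesis stays fairly unlikely even after a strong signal.

```python
# Bayes theorem for two competing hypotheses h1 and h0 = not h1:
# P(h1|x) = P(x|h1) P(h1) / [ P(x|h1) P(h1) + P(x|h0) P(h0) ]

def posterior(prior_h1, p_x_given_h1, p_x_given_h0):
    evidence = p_x_given_h1 * prior_h1 + p_x_given_h0 * (1 - prior_h1)
    return p_x_given_h1 * prior_h1 / evidence

p = posterior(prior_h1=0.01, p_x_given_h1=0.9, p_x_given_h0=0.1)
print(round(p, 4))  # 0.0833: the 1% prior keeps the posterior low
```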

  • *

    Hypothesis Testing

    Find a model to explain behavior by creating and then testing a hypothesis about the data.
    Exact opposite of the usual DM approach.
    H0: null hypothesis; the hypothesis to be tested.
    H1: alternative hypothesis

  • *

    Regression

    Predict future values based on past values
    Linear regression assumes a linear relationship exists:

    y = c0 + c1x1 + ... + cnxn

    Find values to best fit the data
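    For a single predictor the best-fit values of c0 and c1 have a closed form (least squares); a sketch on invented data generated near y = 1 + 2x.

```python
# Least-squares fit of y = c0 + c1*x using the closed-form formulas:
# c1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²),  c0 = ȳ - c1*x̄

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]   # roughly y = 1 + 2x with noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
c1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
c0 = mean_y - c1 * mean_x
print(round(c0, 2), round(c1, 2))  # close to the generating line (1, 2)
```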

  • *

    Correlation

    Examine the degree to which the values for two variables behave similarly.
    Correlation coefficient r:

    1 = perfect correlation

    -1 = perfect but opposite correlation

    0 = no correlation
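    Computing r directly from its definition; in this invented data y is exactly -2x, so the result is the "perfect but opposite correlation" case from the slide.

```python
# Pearson correlation coefficient:
# r = cov(x, y) / sqrt(var(x) * var(y))
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-2.0, -4.0, -6.0, -8.0, -10.0]   # y = -2x exactly

mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
print(r)  # -1.0: perfect but opposite correlation
```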

  • Prentice Hall

    *

    Similarity Measures

    Determine similarity between two objects.
    Similarity characteristics:

    Alternatively, distance measures measure how unlike or dissimilar objects are.


  • *

    Distance Measures

    Measure dissimilarity between objects

  • *

    Decision Trees

    Decision Tree (DT): a tree where the root and each internal node is labeled with a question. The arcs represent each possible answer to the associated question. Each leaf node represents a prediction of a solution to the problem.
    Popular technique for classification; the leaf node indicates the class to which the corresponding tuple belongs.

  • Prentice Hall

    *

    Decision Trees

    A Decision Tree Model is a computational model consisting of three parts:
    Decision tree
    Algorithm to create the tree
    Algorithm that applies the tree to data
    Creation of the tree is the most difficult part.
    Processing is basically a search similar to that in a binary search tree (although a DT may not be binary).
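    The third part (the algorithm that applies the tree to data) is the easy one; a sketch using a nested-dict tree. The tree format and the weather-style example are invented for illustration, not taken from the text.

```python
# Applying a decision tree to a record: descend from the root, answering
# each node's question from the record, until a leaf (class label).

tree = {
    "attr": "outlook",
    "branches": {
        "sunny": {"attr": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": "no",
    },
}

def apply_tree(node, record):
    while isinstance(node, dict):                 # internal node: ask its question
        node = node["branches"][record[node["attr"]]]
    return node                                   # leaf: the predicted class

print(apply_tree(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
```

    This mirrors the search described above: like a binary search tree lookup, except a node may have more than two branches.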


  • Prentice Hall

    *

    Neural Networks

    Based on the observed functioning of the human brain (Artificial Neural Networks, ANN).
    Our view of neural networks is very simplistic.
    We view a neural network (NN) from a graphical viewpoint.
    Alternatively, a NN may be viewed from the perspective of matrices.
    Used in pattern recognition, speech recognition, computer vision, and classification.


  • *

    Generating Rules

    A decision tree can be converted into a rule set
    Straightforward conversion: each path to a leaf becomes a rule; this makes an overly complex rule set
    More effective conversions are not trivial (e.g. C4.8 tests each node in the root-leaf path to see if it can be eliminated without loss in accuracy)

    *

    In the previous lesson we discussed Classification using decision trees.

    Sometimes decision trees are inconvenient: they can be very large

    Also, they require starting at the same attribute

    We can extract modular nuggets of knowledge by getting rules
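    The straightforward conversion can be sketched as a recursive walk: every root-to-leaf path becomes one rule. The nested-dict tree format is invented, and no C4.8-style simplification is attempted.

```python
# Convert a (nested-dict) decision tree into rules: one rule per
# root-to-leaf path, each rule = (list of attr=value conditions, class).

tree = {
    "attr": "outlook",
    "branches": {
        "sunny": {"attr": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
    },
}

def paths_to_rules(node, conditions=()):
    if not isinstance(node, dict):               # leaf: emit one rule
        return [(list(conditions), node)]
    rules = []
    for answer, child in node["branches"].items():
        rules += paths_to_rules(child, conditions + ((node["attr"], answer),))
    return rules

for conds, label in paths_to_rules(tree):
    print("IF", " AND ".join(f"{a}={v}" for a, v in conds), "THEN", label)
```

    As the slide notes, rules produced this way are modular "nuggets": unlike the tree, each can be read on its own without starting at the root attribute.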

  • *

    Covering algorithms

    Strategy for generating a rule set directly: for each class in turn, find a rule set that covers all instances in it (excluding instances not in the class).
    This approach is called a covering approach because at each stage a rule is identified that covers some of the instances.
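    One step of a covering algorithm can be sketched as: among all attribute=value tests, pick the one whose rule "IF test THEN class" is most accurate on the instances it covers. A full covering algorithm such as PRISM repeats this to refine rules and to add more; the dataset below is invented.

```python
# Greedy selection of the single best test for the class buys=yes.

data = [
    {"age": "young", "income": "low",  "buys": "no"},
    {"age": "young", "income": "high", "buys": "yes"},
    {"age": "old",   "income": "high", "buys": "yes"},
    {"age": "old",   "income": "low",  "buys": "yes"},
]

def best_test(data, target_attr, target_value):
    best, best_acc = None, -1.0
    for attr in data[0]:
        if attr == target_attr:
            continue
        for value in {row[attr] for row in data}:
            covered = [r for r in data if r[attr] == value]
            # accuracy = fraction of covered instances in the target class
            acc = sum(r[target_attr] == target_value for r in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (attr, value), acc
    return best, best_acc

print(best_test(data, "buys", "yes"))
# ('age', 'old') covers 2 rows, both with buys=yes -> accuracy 1.0
```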

    *

  • *

    Rules vs. trees

    Corresponding decision tree:

    (produces exactly the same

    predictions)

    But: rule sets can be clearer when decision trees suffer from replicated subtrees
    Also: in multi-class situations, a covering algorithm concentrates on one class at a time whereas a decision tree learner takes all classes into account

    *

  • *

    A simple covering algorithm

    Generates a rule by adding tests that maximize the rule's accuracy
    Similar to the situation in decision trees: the problem of selecting an attribute to split on
    But: a decision tree inducer maximizes overall purity
    Each new test reduces the rule's coverage:

    witten&eibe

    *

  • Algorithm Components

    1. The task the algorithm is used to address (e.g. classification, clustering, etc.)

    2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model)

    3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.)

    4. The search or optimization method used to search over parameters and/or structures (e.g. steepest descent, MCMC, etc.)

    5. The data management technique used for storing, indexing, and retrieving data (critical when data too large to reside in memory)

  • Models and Patterns

    Models

    Prediction

    Probability Distributions

    Structured Data

    Linear regression
    Piecewise linear

  • Models

    Prediction

    Probability Distributions

    Structured Data

    Linear regression
    Piecewise linear
    Nonparametric regression

  • Models

    Prediction

    Probability Distributions

    Structured Data

    Linear regression
    Piecewise linear
    Nonparametric regression
    Classification

    logistic regression

    naïve Bayes/TAN/Bayesian networks

    NN

    support vector machines

    Trees

    etc.

  • Models

    Prediction

    Probability Distributions

    Structured Data

    Linear regression
    Piecewise linear
    Nonparametric regression
    Classification

    Parametric models
    Mixtures of parametric models
    Graphical Markov models (categorical, continuous, mixed)

  • Models

    Prediction

    Probability Distributions

    Structured Data

    Linear regression
    Piecewise linear
    Nonparametric regression
    Classification

    Parametric models
    Mixtures of parametric models
    Graphical Markov models (categorical, continuous, mixed)

    Time series
    Markov models
    Mixture Transition Distribution models
    Hidden Markov models
    Spatial models

  • Bias-Variance Tradeoff

    High Bias - Low Variance

    Low Bias - High Variance

    overfitting - modeling the random component

    Score function should embody the compromise

  • Patterns

    Global

    Local

    Clustering via partitioning
    Hierarchical Clustering
    Mixture Models

    Outlier detection
    Changepoint detection

    Bump hunting
    Scan statistics
    Association rules

    [Figure: x marks scattered along a curve representing a road; each x is an accident, red for injury, black for no injury]

    The curve represents a road

    Each x marks an accident

    Red x denotes an injury accident

    Black x means no injury

    Is there a stretch of road where there is an unusually large fraction of injury accidents?

    Scan Statistics via Permutation Tests
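    The fixed-window scan described on the next slide can be sketched directly: slide a window of w consecutive accidents along the road and report where the injury fraction peaks. The 0/1 injury sequence below is invented.

```python
# Fixed-window scan statistic: find the window of w consecutive
# accidents with the highest fraction of injury accidents.

injuries = [0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0]   # 1 = injury accident

def scan_fixed_window(seq, w):
    best_start, best_frac = 0, -1.0
    for i in range(len(seq) - w + 1):        # slide the window along the road
        frac = sum(seq[i:i + w]) / w
        if frac > best_frac:
            best_start, best_frac = i, frac
    return best_start, best_frac

print(scan_fixed_window(injuries, w=4))  # (5, 1.0): accidents 5-8 are all injuries
```

    A permutation test then asks how often a fraction this high would appear if the injury labels were shuffled at random along the road.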

  • Scan with Fixed Window

    If we know the length of the stretch of road that we seek, e.g., we could slide this window along