
Page 1: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

An Introduction to Data Mining

Padhraic Smyth
Information and Computer Science

University of California, Irvine

July 2000

Page 2: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Today’s talk:

An introduction to data mining

General concepts

Focus on current practice of data mining: the main message is to be aware of the "hype factor"

Wednesday’s talk:

Application of ideas in data mining to problems in atmospheric/environmental science

Page 3: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Outline of Today’s Talk

• What is Data Mining?

• Computer Science and Statistics: a Brief History

• Models and Algorithms

• Hot Topics in Data Mining

• Conclusions

Page 4: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

The Data Revolution

• Context
– "... drowning in data, but starving for knowledge"
– Ubiquitous in business, science, medicine, military
– Analyzing/exploring data manually becomes difficult with massive data sets

• Viewpoint: data as a resource
– Data themselves are not of direct use
– How can we leverage data to make better decisions?

Page 5: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Technology is a Driving Factor

• Larger, cheaper memory
– Moore's law for magnetic disk density: "capacity doubles every 18 months" (Jim Gray, Microsoft)
– storage cost per byte falling rapidly

• Faster, cheaper processors
– can analyze more data
– fit more complex models
– invoke massive search techniques
– more powerful visualization

Page 6: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Massive Data Sets

• Characteristics
– very large N (billions)
– very large d (thousands or millions)
– heterogeneous
– dynamic
– (Note: in scientific applications there is often a temporal and/or spatial dimension)

[Figure: an N x d data matrix, with rows 1..N indexing cases and columns 1..d indexing variables]

Page 7: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

High-dimensional data

• Volume of sphere relative to cube in d dimensions?

Relative volume of a hypersphere inscribed in a hypercube, in d dimensions:

Dimension:    2     3     4     5     6     7
Rel. volume:  0.79  ?     ?     ?     ?     ?

(David Scott, Multivariate Density Estimation, Wiley, 1992)

Page 8: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

High-dimensional data

Relative volume of a hypersphere inscribed in a hypercube, in d dimensions (checked numerically in the sketch below):

Dimension:    2     3     4     5     6     7
Rel. volume:  0.79  0.53  0.31  0.16  0.08  0.04

• high-d, uniform => most data points will be "out" at the corners

• high-d space is sparse and non-intuitive
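For a quick check of the table above (my own addition, not part of the original slides), the ratio of the volume of the unit-radius d-ball to that of its enclosing cube of side 2 is pi^(d/2) / (2^d * Gamma(d/2 + 1)):

    from math import pi, gamma

    def ball_to_cube_ratio(d):
        # Volume of the unit-radius d-ball divided by the volume of the
        # enclosing hypercube of side 2.
        return pi ** (d / 2) / (2 ** d * gamma(d / 2 + 1))

    for d in range(2, 8):
        print(d, round(ball_to_cube_ratio(d), 2))
    # 0.79, 0.52, 0.31, 0.16, 0.08, 0.04 -- matching the table above up to rounding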

Page 9: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

What is data mining?

Page 10: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

What is data mining?

“Data-driven discovery of models and patterns from massive observational data sets”

Page 11: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

What is data mining?

“The magic phrase to put in every funding proposalyou write to NSF, DARPA, NASA, etc”

Page 12: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

What is data mining?

“The magic phrase you use to sell your….. - database software - statistical analysis software - parallel computing hardware - consulting services”

Page 13: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

What is data mining?

“Data-driven discovery of models and patterns from

massive observational data sets”

Statistics, Inference

Page 14: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

What is data mining?

“Data-driven discovery of models and patterns from massive observational data sets”

Statistics, Inference

Languages and Representations

Page 15: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

What is data mining?

“Data-driven discovery of models and patterns from massive observational data sets”

Statistics, Inference

Engineering, Data Management

Languages, Representations

Page 16: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

What is data mining?

“Data-driven discovery of models and patterns from massive observational data sets”

Statistics, Inference

Engineering, Data Management

Languages, Representations

Applications

Page 17: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Who is involved in Data Mining?

• Business Applications
– customer-based, transaction-oriented applications
– very specific applications in fraud, marketing, credit-scoring
– in-house applications (e.g., AT&T, Microsoft, etc.)
– consulting firms: considerable hype factor!
– largely involve the application of existing statistical ideas, scaled up to massive data sets ("engineering")

• Academic Researchers
– mainly in computer science
– extensions of existing ideas, significant "bandwagon effect"
– largely focused on prediction with multivariate data

• Bottom Line:
– primarily computer scientists, often with little knowledge of statistics; main focus is on algorithms

Page 18: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Myths and Legends in Data Mining

• “Data analysis can be fully automated”

– human judgement is critical in almost all applications

– “semi-automation” is however very useful

Page 19: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Myths and Legends in Data Mining

• “Data analysis can be fully automated”

– human judgement is critical in almost all applications

– “semi-automation” is however very useful

• “Association rules are useful”

– association rules are essentially lists of correlations

– no documented successful application

– compare with decision trees (numerous applications)

Page 20: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Myths and Legends in Data Mining

• “Data analysis can be fully automated”

– human judgement is critical in almost all applications

– “semi-automation” is however very useful

• “Association rules are useful”

– association rules are essentially lists of correlations

– no documented successful application

– compare with decision trees (numerous applications)

• “With massive data sets you don’t need statistics”

– massiveness brings heterogeneity - even more statistics

Page 21: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Current Data Mining Software

1. General purpose tools

– software systems for data mining (IBM, SGI, etc)

• just simple statistical algorithms with SQL?

• limited support for temporal, spatial data

– some successes (difficult to validate)

• banking, marketing, retail

• mainly useful for large-scale EDA?

– “mining the miners” (Jerry Friedman):

• similar to the expert systems/neural networks hype of the 1980s?

Page 22: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Transaction Data and Association Rules

• Supermarket example: (Srikant and Agrawal, 1997)

– #items = 500,000, #transactions = 1.5 million

[Figure: sparse transactions-by-items matrix; an x marks each item that appears in a transaction]

Page 23: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Transaction Data and Association Rules

• Example of an Association Rule: "If a customer buys beer they will also buy chips"

– p(chips|beer) = “confidence”

– p(beer) = “support”

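A minimal illustration of these two quantities on made-up data (my own sketch; note that this slide defines support as p(beer), while many treatments define it as p(beer and chips)):

    # Hypothetical toy market-basket transactions for the rule "beer -> chips".
    transactions = [
        {"beer", "chips"}, {"beer", "chips", "salsa"},
        {"beer"}, {"milk", "bread"}, {"chips", "soda"},
    ]

    n = len(transactions)
    n_beer = sum("beer" in t for t in transactions)
    n_beer_and_chips = sum({"beer", "chips"} <= t for t in transactions)

    support = n_beer / n                       # p(beer), as defined on this slide
    confidence = n_beer_and_chips / n_beer     # p(chips | beer)
    print(f"support = {support:.2f}, confidence = {confidence:.2f}")
    # support = 0.60, confidence = 0.67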

Page 24: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Current Data Mining Software

2. Special purpose (“niche”) applications

- fraud detection, direct-mail marketing, credit-scoring, etc.

- often solve high-dimensional classification/regression problems

- Telephone industry applications

- fraud

- Direct-mail advertising

- find new customers

- increase # home-equity loans

- common theme: “track the customer!”

- difficult to validate claims of success (few publications)

Page 25: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Advanced Scout

• Background
– every NBA game is annotated (each pass, shot, foul, etc.)
– potential competitive advantage for coaches
– Problem: over a season, this generates a lot of data!

• Solution (Bhandari et al., IBM, 1997)
– "attribute focusing" finds conditional ranges on attributes where the distributions differ from the norm
– generates descriptions of interesting patterns, e.g., "Player X made 100% of his shots when Player Y was in the game; X normally makes only 50% of his shots"

• Status
– used by 28 of the 29 teams in the NBA
– an intelligent assistant

Page 26: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

AT&T Classification of Telephone Numbers

• Background

– AT&T has about 100 million customers

– It logs 300 million calls per day, 40 attributes each

– 350 million unique telephone numbers

– Which are business and which are residential?

• Solution (Pregibon and Cortes, AT&T,1997)

– Proprietary model, using a few attributes, trained on known business customers to adaptively track p(business|data)

– Significant systems engineering: data are downloaded nightly, model updated (20 processors, 6 GB RAM, terabyte disk farm)

• Status:

– invaluable evolving “snapshot” of phone usage in US for AT&T

– basis for fraud detection, marketing, and other applications

Page 27: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Bad Debt Prediction

• Background
– Bank has 120,000 accounts which are delinquent
– employs 500 collectors
– process is expensive and inefficient

• Predictive Modeling
– target variable: amount repaid within 6 months
– input variables: 2000 different variables derived from credit history
– model outputs are used to "score" each debtor based on likelihood of paying

• Results
– decision trees, "bump-hunting" used to score customers
– non-trivial software issues in handling such large data sets
– "scoring" system in routine use
– estimated savings to bank are in millions/annum

Page 28: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Outline

• What is Data Mining?

• Computer Science and Statistics: a Brief History

Page 29: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Historical Context: Statistics

• Gauss, Fisher, and all that
– least-squares, maximum likelihood
– development of fundamental principles

• The Mathematical Era
– 1950's: Neyman, etc.: the mathematicians take over

• The Computational Era
– steadily growing since the 1960's
– note: "data mining/fishing" viewed very negatively!
– 1970's: EDA, Bayesian estimation, flexible models, EM, etc.
– a growing awareness of the power and role of computing in data analysis

Page 30: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Historical Context: Computer Science

• Pattern Recognition and AI

– focus on perceptual problems (e.g., speech, images)

– 1960’s: bifurcation into statistical and non-statistical approaches, e.g., grammars

– convergence of applied statistics and engineering

• e.g., statistical image analysis: Geman, Grenander, etc

• Machine Learning and Neural Networks

– 1980’s: failure of non-statistical learning approaches

– emergence of flexible models (trees, networks)

– convergence of applied statistics and learning

• e.g., work of Friedman, Spiegelhalter, Jordan, Hinton

Page 31: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

The Emergence of Data Mining

• Distinct threads of evolution

– AI/machine learning

• 1989 KDD workshop -> ACM SIGKDD 2000

• focus on “automated discovery, novelty”

– Database Research

• focus on massive data sets

• e.g., SIGMOD -> association rules, scalable algorithms

– “Data Owners”

• what can we do with all this data in our RDBMS?

• primarily customer-oriented transaction data owners

• industry dominated, applications-oriented

Page 32: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

The Emergence of Data Mining

• The "Mother-in-Law" phenomenon
– even your mother-in-law has heard about data mining

• Beware of the hype!
– remember expert systems, neural nets, etc.
– basically sound ideas that were oversold, creating a backlash

Page 33: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

[Diagram: Statistics and Computer Science shown as two broad, overlapping fields]

Page 34: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

[Diagram: fields arranged on a spectrum from Statistics to Computer Science: Statistical Inference, Statistical Pattern Recognition, Neural Networks, Machine Learning, Data Mining, Databases]

Page 35: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Where Work is Published

[Diagram: the same Statistics-to-Computer-Science spectrum, annotated with publication venues]
– Statistical Inference: JASA, JRSS
– Statistical Pattern Recognition: IEEE PAMI, ICPR, ICCV
– Neural Networks: NIPS, Neural Computation
– Machine Learning: ICML, COLT, Machine Learning Journal
– Data Mining: KDD, IJDMKD
– Databases: SIGMOD, VLDB

Page 36: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Focus Areas

[Diagram: the same spectrum, annotated with focus areas: nonlinear regression; computer vision and signal recognition; flexible classification models; graphical models; hidden variable models; pattern finding; scalable algorithms]

Page 37: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

General Characteristics

[Diagram: the focus areas arranged along a spectrum from "more statistical" (statistics end) to "more algorithmic" (computer science end)]

Page 38: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

General Characteristics

[Diagram: the same spectrum, adding the contrast "continuous signals" (statistical end) vs. "categorical data" (algorithmic end)]

Page 39: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

General Characteristics

[Diagram: the same spectrum, adding the contrast "model-based" vs. "model-free"]

Page 40: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

General Characteristics

[Diagram: the same spectrum, adding the contrast "time/space modeling" vs. "multivariate data"]

Page 41: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

"Hot Topics"

[Diagram: the same spectrum, annotated with current hot topics: hidden Markov models, belief networks, support vector machines, mixture/factor models, classification trees, association rules, deformable templates, model combining]

Page 42: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Implications

• The "renaissance data miner" is skilled in:
– statistics: theories and principles of inference
– modeling: languages and representations for data
– optimization and search
– algorithm design and data management

• The educational problem
– is it necessary to know all these areas in depth?
– Is it possible?
– Do we need a new breed of professionals?

• The applications viewpoint:
– How does a scientist or business person keep up with all these developments?
– How can they choose the best approach for their problem?

Page 43: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Outline

• What is Data Mining?

• Computer Science and Statistics: a Brief History

• Models and Algorithms

Page 44: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Data Set
(e.g., multivariate, continuous/categorical, temporal, spatial, combinations, etc.)

Page 45: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Data Set, Task
(Task: e.g., exploration, prediction, clustering, density estimation, pattern discovery)

Page 46: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Data Set, Task, Model
(Model, i.e., language/representation: underlying functional form used for representation, e.g., linear functions, hierarchies, rules/boxes, grammars, etc.)

Page 47: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Data Set, Task, Model, Score Function
(Score Function, i.e., statistical inference: how well a model fits data, e.g., square-error, likelihood, classification loss, query match, interpretation)

Page 48: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Data Set, Task, Model, Score Function, Optimization    [Modeling]
(Optimization: computational method used to optimize the score function, given the model and score function, e.g., hill-climbing, greedy search, linear programming)

Page 49: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Data Set, Task, Model, Score Function, Optimization, Data Access    [Modeling -> Algorithm]
(Data Access: actual instantiation as an algorithm, with data structures, efficient implementation, etc.)

Page 50: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Data Set, Task, Model, Score Function, Optimization, Data Access, Human Evaluation/Decisions    [Modeling -> Algorithm]

Page 51: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

CART (emphasis on predictive power and flexibility of the model)
– Task: prediction
– Data Set: multivariate
– Model: hierarchical representation of a piecewise constant mapping
– Score Function: cross-validation
– Optimization: greedy search
– Data Access: flat file
– Human Evaluation: accuracy and interpretability

Page 52: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Association Rules (emphasis on computational efficiency and data access)
– Task: exploratory
– Data Set: transaction data
– Model: sets of local rules / conditional probabilities
– Score Function: thresholds on p
– Optimization: systematic search
– Data Access: relational database
– Human Evaluation: ????
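One way to make the decomposition concrete is to record each algorithm as a tuple of its components. This is a minimal sketch of my own (all names are illustrative, not from the slides), filled in with the CART and association-rule instantiations above:

    from dataclasses import dataclass

    @dataclass
    class MiningAlgorithm:
        # The reductionist components from the preceding slides.
        task: str
        data_set: str
        model: str
        score_function: str
        optimization: str
        data_access: str

    cart = MiningAlgorithm(
        task="prediction", data_set="multivariate",
        model="hierarchical, piecewise-constant mapping (tree)",
        score_function="cross-validation",
        optimization="greedy search", data_access="flat file")

    association_rules = MiningAlgorithm(
        task="exploratory", data_set="transactions",
        model="sets of local rules / conditional probabilities",
        score_function="thresholds on support and confidence",
        optimization="systematic search", data_access="relational database")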

Page 53: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

The Reductionist Viewpoint

• Methodology

– reduce problems to fundamental components

– think in terms of components first, algorithms second

– ultimately the application should “drive” the algorithm

– allows systematic comparison and synthesis

– clarifies relative role of statistics, databases, search, etc

Page 54: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Cultural Differences

• Computer Scientists:

– often have little exposure to the “modeling art” of data analysis

– tend to stick to a small set of well-understood models and problems

– papers focus on algorithms, not models

– but are typically good at making things run fast

• Statisticians:

– applied statisticians are often very good at the “art” component

– little experience with the data management/engineering part

– papers focus on models, not algorithms

• Bottom line

– the computer scientists get more attention since they are much savvier at marketing new ideas than the statisticians

– The “right” way: systematically combine both statistics and engineering/CS, beware of hype

Page 55: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Outline

• What is Data Mining?

• Computer Science and Statistics: a Brief History

• Models and Algorithms

• Hot Topics in Data Mining

Page 56: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Hot Topics

• 1. Flexible Prediction Models

• 2. Scalable Algorithms

• 3. Pattern Discovery

• 4. Graphical Models

• 5. Hidden Variable Models

• 6. Deformable Templates

• 7. Heterogenous Data

Today’s talk

Wednesday’s talk

Page 57: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

1. Flexible Prediction Models

• Model Combining:

– Stacking

• linear combinations of models with X-validated weights

– Bagging

• equally weighted combinations trained on bootstrap samples

– Boosting

• iterative re-training on data points which contribute to error

• Flexible Model Forms

– Decision trees

– Neural networks

– Support vector machines
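To make the model-combining idea above concrete, here is a minimal bagging sketch of my own. It assumes a generic train_model(X, y) function that returns a fitted model with a .predict(x) method (an assumed interface, not a specific library):

    import random
    from collections import Counter

    def bagged_predict(train_model, X, y, x_new, n_models=25, seed=0):
        # Equally weighted majority vote over models trained on bootstrap samples.
        rng = random.Random(seed)
        n = len(X)
        votes = []
        for _ in range(n_models):
            idx = [rng.randrange(n) for _ in range(n)]   # sample n rows with replacement
            model = train_model([X[i] for i in idx], [y[i] for i in idx])
            votes.append(model.predict(x_new))
        return Counter(votes).most_common(1)[0][0]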

Page 58: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

2. Scalable Algorithms

• How far away are the data?

Memory

RAM

Disk

Page 59: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

2. Scalable Algorithms

• How far away are the data?

Memory    Random Access Time
RAM       10^-8 seconds
Disk      10^-3 seconds

Page 60: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

2. Scalable Algorithms

• How far away are the data?

Memory Random EffectiveAccess Time Distance

RAM 10-8 seconds 1 meter

Disk 10-3 seconds 100 km

Page 61: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

2. Scalable Algorithms

• "Scaling down the data" or "data approximation"
– work from clever data summarizations (e.g., sufficient statistics)

• Squashing (DuMouchel et al., AT&T, KDD '99)
– create a small "pseudo data set"
– similar statistical properties to the original (massive) data set
– now run your standard algorithm on the pseudo-data
– can be significantly better than random sampling
– interesting theoretical (statistical) basis

• Frequent Itemsets (sketched below)
– find all itemsets with more than T occurrences in D
– (basis for association rule algorithms)
– itemsets: a cheap computational way to generate joint probabilities
– use maximum entropy to construct the full model from itemsets (Pavlov, Mannila, and Smyth, KDD '99)
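A naive, brute-force sketch of the frequent-itemset step referenced above (my own illustration; real systems rely on Apriori-style candidate pruning and careful data access to make this scale to millions of transactions):

    from itertools import combinations
    from collections import Counter

    def frequent_itemsets(transactions, min_count, max_size=3):
        # Count every itemset of up to max_size items; keep those occurring
        # in at least min_count transactions.
        counts = Counter()
        for t in transactions:
            items = sorted(t)
            for k in range(1, max_size + 1):
                for itemset in combinations(items, k):
                    counts[itemset] += 1
        return {s: c for s, c in counts.items() if c >= min_count}

    # frequent_itemsets([{"beer", "chips"}, {"beer"}, {"beer", "chips"}], min_count=2)
    # -> {('beer',): 3, ('chips',): 2, ('beer', 'chips'): 2}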

Page 62: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

2. Scalable Algorithms

• “Scaling up the algorithm”

– data structures and caching strategies to speed up known algorithms

– typically orders of magnitude speed improvements

• Exact Algorithms

– BOAT (Gehrke et al, SIGMOD 98):

• a scalable decision tree construction algorithm

• clever algorithms can work from only 2 scans

– ADTrees (Moore, CMU, 1998)

• clever data structures for caching sufficient statistics for multivariate categorical data

• Approximate Algorithms

– approximate EM for Gaussian mixture modeling (Bradley and Fayyad, KDD 98)

– various heuristics for caching, approximation

Page 63: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

3. Pattern Finding

• Patterns = unusual hard-to-find local “pockets” of data

– finding patterns is not the same as global model fitting

– the simplest example of patterns are association rules

• “Bump-hunting”

– PRIM algorithm of Friedman and Fisher (1999)

– finds multivariate "boxes" in high-dimensional spaces where the mean of the target variable is relatively high

– effective and flexible

• e.g., finding small highly profitable groups of customers
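A rough sketch of the peeling idea behind bump-hunting (my own simplification, not the published PRIM algorithm): repeatedly shave a small fraction off one face of the current box, always choosing the shave that most increases the mean of the target variable among the points that remain:

    import numpy as np

    def peel_box(X, y, peel_frac=0.1, min_support=0.05):
        X, y = np.asarray(X, float), np.asarray(y, float)
        n, d = X.shape
        box = [(-np.inf, np.inf)] * d          # current box: one interval per variable
        inside = np.ones(n, dtype=bool)        # points currently inside the box
        while inside.mean() > min_support:
            best = None
            for j in range(d):
                for side, q in (("lo", peel_frac), ("hi", 1 - peel_frac)):
                    cut = np.quantile(X[inside, j], q)
                    keep = inside & ((X[:, j] >= cut) if side == "lo" else (X[:, j] <= cut))
                    if keep.any() and (best is None or y[keep].mean() > best[0]):
                        best = (y[keep].mean(), j, side, cut, keep)
            if best is None or best[4].sum() == inside.sum():
                break                          # no peel makes progress; stop
            _, j, side, cut, keep = best
            lo, hi = box[j]
            box[j] = (cut, hi) if side == "lo" else (lo, cut)
            inside = keep
        return box, y[inside].mean(), inside.mean()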

Page 64: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

"Bump-Hunting"

[Pages 64-69: figure-only slides illustrating bump-hunting]

Pattern Finding in Sequence Data

• Clustering Sequences
– sequences of different lengths from different individuals, e.g., sequences of Web-page requests
– Problem: do the sequences cluster into groups?
– Clustering problem is non-trivial: what is the distance between 2 sequences of different lengths?

• Model-based approach (Cadez, Heckerman, Smyth, KDD 2000); a toy sketch follows below
– each cluster described as a Markov model
– defines a mixture of Markov models; EM used for clustering
– Application to MSNBC.com Web data:
• 900,000 users/sequences per day
• clustered into on the order of 100 groups
• useful for visualization/exploration of a massive Web log
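A toy sketch of the model-based approach above (my own simplified illustration, not the published implementation): fit a mixture of first-order Markov chains to integer-coded sequences with EM, then assign each sequence to its most probable cluster:

    import numpy as np

    def markov_mixture_em(seqs, n_states, n_clusters, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        K = n_clusters
        weights = np.full(K, 1.0 / K)                              # cluster weights
        start = rng.dirichlet(np.ones(n_states), K)                # initial-state probabilities
        trans = rng.dirichlet(np.ones(n_states), (K, n_states))    # one transition matrix per cluster
        for _ in range(n_iter):
            # E-step: log p(sequence, cluster) for every sequence and cluster
            loglik = np.zeros((len(seqs), K))
            for i, s in enumerate(seqs):
                for k in range(K):
                    ll = np.log(weights[k]) + np.log(start[k, s[0]])
                    ll += sum(np.log(trans[k, a, b]) for a, b in zip(s[:-1], s[1:]))
                    loglik[i, k] = ll
            resp = np.exp(loglik - loglik.max(axis=1, keepdims=True))
            resp /= resp.sum(axis=1, keepdims=True)                # responsibilities
            # M-step: re-estimate parameters from responsibility-weighted counts
            weights = resp.mean(axis=0)
            start = np.full((K, n_states), 1e-3)
            trans = np.full((K, n_states, n_states), 1e-3)
            for i, s in enumerate(seqs):
                for k in range(K):
                    start[k, s[0]] += resp[i, k]
                    for a, b in zip(s[:-1], s[1:]):
                        trans[k, a, b] += resp[i, k]
            start /= start.sum(axis=1, keepdims=True)
            trans /= trans.sum(axis=2, keepdims=True)
        return resp.argmax(axis=1)                                 # hard cluster assignments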

Page 71: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Clusters of Dynamic Behavior

[Figure: dynamic behavior over states A, B, C, D shown separately for Cluster 1, Cluster 2, and Cluster 3]

Page 72: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.
Page 73: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Final Comments

• Successful data mining requires integration of
– statistics
– computer science
– the application discipline

• Current practice of data mining
– computer scientists focused on business applications
– relatively little statistical sophistication: but some new ideas
– considerable "hype" factor

• Wednesday's talk:
– new ideas in temporal and spatial models
– new ideas in latent variable modeling
– potential applications in atmospheric/environmental science

Page 74: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Further Reading

• Papers:
– www.ics.uci.edu/~datalab
– e.g., see P. Smyth, "Data mining: data analysis on a grand scale?", preprint of a review paper to appear in Statistical Methods in Medical Research

• Text (forthcoming)

– Principles of Data Mining

• D. J. Hand, H. Mannila, P. Smyth

• MIT Press, late 2000

Page 75: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

3. Pattern Finding

• Contrast Sets (Bay and Pazzani, KDD99)

– individuals or objects categorized into 2 groups

• e.g., students enrolled in CS and in Engineering

– high-dimensional multivariate measurements on each

– Problem: automatically summarize the significant differences between the two groups.

• e.g., [fraction of ESL >] AND [mean SAT >] in CS

• Approach

– massive systematic breadth-first search through potential variable-value conjunctions

– branch-and-bound pruning of exponentially large search space

– statistical adjustments for multiple hypothesis problem
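A toy sketch of the underlying comparison in the approach above (my own illustration, limited to single attribute-value pairs; the actual method searches conjunctions breadth-first with branch-and-bound pruning and corrections for multiple testing):

    from collections import Counter

    def single_attribute_contrasts(group_a, group_b, min_diff=0.10):
        # Each record is a dict mapping attribute -> value. Report attribute=value
        # pairs whose frequency differs between the two groups by at least min_diff.
        def freqs(group):
            counts = Counter(item for record in group for item in record.items())
            return {pair: c / len(group) for pair, c in counts.items()}
        fa, fb = freqs(group_a), freqs(group_b)
        diffs = {pair: fa.get(pair, 0.0) - fb.get(pair, 0.0) for pair in set(fa) | set(fb)}
        return sorted(((d, pair) for pair, d in diffs.items() if abs(d) >= min_diff),
                      key=lambda t: abs(t[0]), reverse=True)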

Page 76: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

3. Pattern Finding

• Contrast Sets (Bay and Pazzani, KDD '99)
– individuals or objects categorized into 2 groups, e.g., students enrolled in CS and in Engineering
– high-dimensional multivariate measurements on each
– automatically produces a summary of significant differences between groups
– combines massive search with statistical estimation

• Time-Series Pattern Spotting
– "find me a shape that looks like this"
– semi-Markov deformable templates (Ge and Smyth, KDD 2000)
– significantly outperforms template matching and DTW
– Bayesian approach integrates prior knowledge with data

Page 77: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Example: Deformable Templates

Each waveform segment corresponds to a state in the model: a segmental hidden semi-Markov model.

[Figure: a waveform divided into segments, each mapped to one of the states S1, S2, ..., ST]

Page 78: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Pattern-Based End-Point Detection

[Figure: end-point detection in semiconductor manufacturing; the original pattern and the detected pattern plotted against time (seconds)]

Page 79: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Heterogeneous Data Modeling

• Clustering Objects (sequences, curves, etc)

– probabilistic approach: define a mixture of models (Cadez, Gaffney, and Smyth, KDD 2000)

– unified framework for clustering objects of different dimensions

– applications:

• curve-clustering:

– e.g., mixture of regression models (Gaffney and Smyth, KDD '99)

– video movement, gene expression data, storm trajectories

• sequence clustering

– e.g., mixtures of Markov models

– clustering of MSNBC Web data (Cadez et al, KDD ‘00)
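The unifying idea above is that each mixture component only needs to supply a log-likelihood for an object, whatever that object's length or dimension. A minimal sketch of that assignment step (my own illustration; component_loglik is an assumed callable, not a library function):

    import numpy as np

    def assign_objects(objects, component_loglik, weights):
        # component_loglik(k, obj) returns log p(obj | component k) for objects of
        # any length (a curve, a sequence, ...); assign each object to the
        # component with the highest posterior weight.
        K = len(weights)
        assignments = []
        for obj in objects:
            scores = [np.log(weights[k]) + component_loglik(k, obj) for k in range(K)]
            assignments.append(int(np.argmax(scores)))
        return assignments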

Page 80: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

[Figure: trajectories of centroids of a moving hand in video streams (x-position and y-position vs. time), together with estimated cluster trajectories]

Page 81: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.

Heterogeneous Populations of Objects

[Diagram: population model in parameter space -> individuals and their parameters -> observed data]