An Introduction to Data Mining

Padhraic Smyth
Information and Computer Science
University of California, Irvine

July 2000
Today’s talk:
An introduction to data mining
General concepts
Focus on the current practice of data mining: the main message is to be aware of the “hype factor”
Wednesday’s talk:
Application of ideas from data mining to problems in atmospheric/environmental science
Outline of Today’s Talk
• What is Data Mining?
• Computer Science and Statistics: a Brief History
• Models and Algorithms
• Hot Topics in Data Mining
• Conclusions
The Data Revolution
• Context
– “…drowning in data, but starving for knowledge”
– ubiquitous in business, science, medicine, the military
– analyzing/exploring data manually becomes difficult with massive data sets

• Viewpoint: data as a resource
– data themselves are not of direct use
– how can we leverage data to make better decisions?
Technology is a Driving Factor
• Larger, cheaper memory
– Moore’s law for magnetic disk density: “capacity doubles every 18 months” (Jim Gray, Microsoft)
– storage cost per byte falling rapidly

• Faster, cheaper processors
– can analyze more data
– fit more complex models
– invoke massive search techniques
– more powerful visualization
Massive Data Sets
• Characteristics
– very large N (billions of rows)
– very large d (thousands or millions of variables)
– heterogeneous
– dynamic
– (note: in scientific applications there is often a temporal and/or spatial dimension)

[Figure: an N × d data matrix, with N rows (cases) and d columns (variables)]
High-dimensional data

• What is the volume of a hypersphere relative to that of the enclosing hypercube in d dimensions?

Dimension      2     3     4     5     6     7
Rel. Volume    0.79  0.52  0.31  0.16  0.08  0.04

(David Scott, Multivariate Density Estimation, Wiley, 1992)

• in high dimensions, uniformly distributed data points end up “out” at the corners
• high-dimensional space is sparse and non-intuitive
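The table follows from the standard volume formula for a d-dimensional ball of radius r, V = π^{d/2} r^d / Γ(d/2 + 1), divided by the volume (2r)^d of the enclosing cube. A quick check in plain Python (not from the talk):

```python
from math import pi, gamma

def sphere_to_cube_ratio(d):
    # unit-radius ball volume: pi^(d/2) / Gamma(d/2 + 1)
    # the enclosing cube has side 2r = 2, hence volume 2^d
    return pi ** (d / 2) / gamma(d / 2 + 1) / 2 ** d

for d in range(2, 8):
    print(d, round(sphere_to_cube_ratio(d), 2))
```

For d = 2..7 this reproduces 0.79, 0.52, 0.31, 0.16, 0.08, 0.04, and the ratio keeps shrinking toward zero as d grows.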
What is data mining?
What is data mining?
“Data-driven discovery of models and patterns from massive observational data sets”
What is data mining?
“The magic phrase to put in every funding proposal you write to NSF, DARPA, NASA, etc.”
What is data mining?
“The magic phrase you use to sell your…
- database software
- statistical analysis software
- parallel computing hardware
- consulting services”
What is data mining?

“Data-driven discovery of models and patterns from massive observational data sets”

[Diagram, slides 13–16: the definition draws on four areas]
– Statistics, Inference
– Languages, Representations
– Engineering, Data Management
– Applications
Who is involved in Data Mining?
• Business applications
– customer-based, transaction-oriented applications
– very specific applications in fraud, marketing, credit scoring
– in-house applications (e.g., AT&T, Microsoft)
– consulting firms: considerable hype factor!
– largely involve the application of existing statistical ideas, scaled up to massive data sets (“engineering”)

• Academic researchers
– mainly in computer science
– extensions of existing ideas, significant “bandwagon effect”
– largely focused on prediction with multivariate data

• Bottom line
– primarily computer scientists, often with little knowledge of statistics; the main focus is on algorithms
Myths and Legends in Data Mining

• “Data analysis can be fully automated”
– human judgement is critical in almost all applications
– “semi-automation” is, however, very useful

• “Association rules are useful”
– association rules are essentially lists of correlations
– no documented successful application
– compare with decision trees (numerous applications)

• “With massive data sets you don’t need statistics”
– massiveness brings heterogeneity, so you need even more statistics
Current Data Mining Software
1. General purpose tools
– software systems for data mining (IBM, SGI, etc)
• just simple statistical algorithms with SQL?
• limited support for temporal, spatial data
– some successes (difficult to validate)
• banking, marketing, retail
• mainly useful for large-scale EDA?
– “mining the miners” (Jerry Friedman):
• similar to the expert systems/neural networks hype of the ’80s?
Transaction Data and Association Rules
• Supermarket example: (Srikant and Agrawal, 1997)
– #items = 500,000, #transactions = 1.5 million
[Figure: a sparse items × transactions matrix; an “x” marks an item bought in a transaction]
Transaction Data and Association Rules
• Example of an association rule: “If a customer buys beer they will also buy chips”
– p(chips | beer) = “confidence”
– p(beer) = “support”
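In code, both quantities are just empirical frequencies over the transaction database. A minimal sketch over a made-up transaction list (the items and counts are invented), using the slide’s definitions:

```python
# hypothetical mini transaction database; each transaction is a set of items
transactions = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer"},
    {"chips", "salsa"},
    {"milk", "bread"},
]

n_beer = sum(1 for t in transactions if "beer" in t)
n_both = sum(1 for t in transactions if {"beer", "chips"} <= t)

support = n_beer / len(transactions)   # p(beer), as defined on the slide
confidence = n_both / n_beer           # p(chips | beer)
print(support, confidence)
```

Here the rule “beer → chips” has support 0.6 and confidence 2/3: three of five customers bought beer, and two of those three also bought chips.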
Current Data Mining Software
2. Special purpose (“niche”) applications
- fraud detection, direct-mail marketing, credit scoring, etc.
- often solve high-dimensional classification/regression problems
- telephone industry applications
  - fraud detection
- direct-mail advertising
  - find new customers
  - increase the number of home-equity loans
- common theme: “track the customer!”
- difficult to validate claims of success (few publications)
Advanced Scout
• Background
– every NBA game is annotated (each pass, shot, foul, etc.)
– potential competitive advantage for coaches
– problem: over a season, this generates a lot of data!

• Solution (Bhandari et al., IBM, 1997)
– “attribute focusing” finds conditional ranges on attributes where the distributions differ from the norm
– generates descriptions of interesting patterns, e.g., “Player X made 100% of his shots when Player Y was in the game; X normally makes only 50% of his shots”

• Status
– used by 28 of the 29 teams in the NBA
– an intelligent assistant
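Attribute focusing can be caricatured in a few lines: compare a conditional rate against the marginal rate and report the attribute values where they diverge by more than some threshold. The shot log and the 0.25 cutoff below are invented for illustration, not from the IBM system:

```python
# invented shot log: (teammate_Y_on_court, shot_made)
shots = [(True, 1), (True, 1), (True, 1), (True, 1),
         (False, 1), (False, 0), (False, 0), (False, 0)]

overall = sum(made for _, made in shots) / len(shots)
with_y = [made for on_court, made in shots if on_court]
conditional = sum(with_y) / len(with_y)

THRESHOLD = 0.25  # "interestingness" cutoff, assumed for this sketch
if abs(conditional - overall) > THRESHOLD:
    print(f"made {conditional:.0%} with Y on court vs {overall:.1%} overall")
```

The real system searches over many attributes and value ranges; this shows only the core comparison that makes a pattern “interesting”.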
AT&T Classification of Telephone Numbers
• Background
– AT&T has about 100 million customers
– It logs 300 million calls per day, 40 attributes each
– 350 million unique telephone numbers
– Which are business and which are residential?
• Solution (Pregibon and Cortes, AT&T, 1997)
– Proprietary model, using a few attributes, trained on known business customers to adaptively track p(business|data)
– Significant systems engineering: data are downloaded nightly, model updated (20 processors, 6Gb RAM, terabyte disk farm)
• Status:
– an invaluable, evolving “snapshot” of phone usage in the US for AT&T
– basis for fraud detection, marketing, and other applications
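The AT&T model itself is proprietary; the flavor of “adaptively tracking p(business | data)” can be sketched as an exponentially weighted update from each night’s call summary. Everything below (the decay factor, the evidence feature, the numbers) is an assumption for illustration only:

```python
DECAY = 0.85  # assumed forgetting factor; not from the talk

def nightly_update(p_business, business_hours_fraction):
    # blend the running estimate with today's evidence: the fraction of
    # this number's calls that were placed during business hours
    return DECAY * p_business + (1 - DECAY) * business_hours_fraction

p = 0.5  # uninformative starting estimate
for fraction in [0.9, 0.95, 0.85, 0.9]:  # four nights of (made-up) logs
    p = nightly_update(p, fraction)
print(round(p, 3))
```

The point of the design is that each night’s batch only nudges the estimate, so a number’s classification evolves smoothly as usage patterns change.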
Bad Debt Prediction
• Background
– the bank has 120,000 delinquent accounts
– it employs 500 collectors
– the process is expensive and inefficient

• Predictive modeling
– target variable: amount repaid within 6 months
– input variables: 2,000 different variables derived from credit history
– model outputs are used to “score” each debtor by likelihood of repaying

• Results
– decision trees and “bump-hunting” used to score customers
• non-trivial software issues in handling such large data sets
– the “scoring” system is in routine use
– estimated savings to the bank are in the millions per annum
Outline
• What is Data Mining?
• Computer Science and Statistics: a Brief History
Historical Context: Statistics
• Gauss, Fisher, and all that
– least squares, maximum likelihood
– development of fundamental principles

• The Mathematical Era
– 1950’s: Neyman, etc.: the mathematicians take over

• The Computational Era
– steadily growing since the 1960’s (note: “data mining/fishing” was viewed very negatively!)
– 1970’s: EDA, Bayesian estimation, flexible models, EM, etc.
– a growing awareness of the power and role of computing in data analysis
Historical Context: Computer Science
• Pattern Recognition and AI
– focus on perceptual problems (e.g., speech, images)
– 1960’s: bifurcation into statistical and non-statistical approaches, e.g., grammars
– convergence of applied statistics and engineering
• e.g., statistical image analysis: Geman, Grenander, etc
• Machine Learning and Neural Networks
– 1980’s: failure of non-statistical learning approaches
– emergence of flexible models (trees, networks)
– convergence of applied statistics and learning
• e.g., work of Friedman, Spiegelhalter, Jordan, Hinton
The Emergence of Data Mining
• Distinct threads of evolution
– AI/machine learning
• 1989 KDD workshop -> ACM SIGKDD 2000
• focus on “automated discovery, novelty”
– Database Research
• focus on massive data sets
• e.g., SIGMOD -> association rules, scalable algorithms
– “Data Owners”
• what can we do with all this data in our RDBMS?
• primarily customer-oriented transaction data owners
• industry dominated, applications-oriented
The Emergence of Data Mining
• The “mother-in-law” phenomenon
– even your mother-in-law has heard about data mining
• Beware of the hype!
– remember expert systems, neural nets, etc
– basically sound ideas that were oversold, creating a backlash
Statistics … Computer Science

[Diagram, slides 33–41: the fields arranged along a spectrum from Statistics (left) to Computer Science (right)]

Fields: Statistical Inference – Statistical Pattern Recognition – Neural Networks – Machine Learning – Data Mining – Databases

Where Work is Published
– Statistical Inference: JASA, JRSS
– Statistical Pattern Recognition: IEEE PAMI, ICPR, ICCV
– Neural Networks: NIPS, Neural Computation
– Machine Learning: ICML, COLT, Machine Learning Journal
– Data Mining: KDD, IJDMKD
– Databases: SIGMOD, VLDB

Focus Areas
– Statistical Inference: nonlinear regression
– Statistical Pattern Recognition: computer vision, signal recognition
– Neural Networks: flexible classification models
– Machine Learning: graphical models, hidden variable models
– Data Mining: pattern finding
– Databases: scalable algorithms

General Characteristics (moving from the statistical end to the algorithmic end)
– more statistical → more algorithmic
– continuous signals → categorical data
– model-based → “model-free”
– time/space modeling → multivariate data

“Hot Topics” along the same spectrum: hidden Markov models, belief networks, deformable templates, mixture/factor models, support vector machines, model combining, classification trees, association rules
Implications
• The “renaissance data miner” is skilled in:
– statistics: theories and principles of inference
– modeling: languages and representations for data
– optimization and search
– algorithm design and data management

• The educational problem
– is it necessary to know all these areas in depth?
– is it possible?
– do we need a new breed of professionals?

• The applications viewpoint
– how does a scientist or business person keep up with all these developments?
– how can they choose the best approach for their problem?
Outline
• What is Data Mining?
• Computer Science and Statistics: a Brief History
• Models and Algorithms
[Diagram, slides 44–50: the components of a data mining algorithm, built up one layer at a time]

– Data Set: e.g., multivariate, continuous/categorical, temporal, spatial, combinations, etc.
– Task: e.g., exploration, prediction, clustering, density estimation, pattern discovery
– Model (language/representation): the underlying functional form used for representation, e.g., linear functions, hierarchies, rules/boxes, grammars, etc.
– Score Function (statistical inference): how well a model fits the data, e.g., squared error, likelihood, classification loss, query match, interpretation
– Optimization: the computational method used to optimize the score function, given the model and score function, e.g., hill-climbing, greedy search, linear programming
– Data Access: the actual instantiation as an algorithm, with data structures, efficient implementation, etc.
– Human Evaluation/Decisions

(In the diagram, the model and score function make up the “Modeling” layer; optimization and data access make up the “Algorithm” layer.)
CART (emphasis on the predictive power and flexibility of the model):
– Task: prediction
– Data set: multivariate
– Model: hierarchical representation of a piecewise-constant mapping
– Score function: cross-validation
– Optimization: greedy search
– Data access: flat file
– Human evaluation: accuracy and interpretability
Association Rules (emphasis on computational efficiency and data access):
– Task: exploratory
– Data set: transaction data
– Model: sets of local rules / conditional probabilities
– Score function: thresholds on p
– Optimization: systematic search
– Data access: relational database
– Human evaluation: ????
The Reductionist Viewpoint
• Methodology
– reduce problems to fundamental components
– think in terms of components first, algorithms second
– ultimately the application should “drive” the algorithm
– allows systematic comparison and synthesis
– clarifies relative role of statistics, databases, search, etc
Cultural Differences
• Computer Scientists:
– often have little exposure to the “modeling art” of data analysis
– tend to stick to a small set of well-understood models and problems
– papers focus on algorithms, not models
– but are typically good at making things run fast
• Statisticians:
– applied statisticians are often very good at the “art” component
– little experience with the data management/engineering part
– papers focus on models, not algorithms
• Bottom line
– the computer scientists get more attention since they are much savvier at marketing new ideas than the statisticians
– The “right” way: systematically combine both statistics and engineering/CS, beware of hype
Outline
• What is Data Mining?
• Computer Science and Statistics: a Brief History
• Models and Algorithms
• Hot Topics in Data Mining
Hot Topics
• 1. Flexible Prediction Models
• 2. Scalable Algorithms
• 3. Pattern Discovery
(topics 1–3: today’s talk)

• 4. Graphical Models
• 5. Hidden Variable Models
• 6. Deformable Templates
• 7. Heterogeneous Data
(topics 4–7: Wednesday’s talk)
1. Flexible Prediction Models
• Model Combining:
– Stacking
• linear combinations of models with X-validated weights
– Bagging
• equally weighted combinations trained on bootstrap samples
– Boosting
• iterative re-training on data points which contribute to error
• Flexible Model Forms
– Decision trees
– Neural networks
– Support vector machines
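Of the three combining schemes, bagging is the easiest to show in miniature: train the same weak learner on bootstrap resamples and take an equally weighted vote. The decision-stump learner and the toy 1-D data below are invented for the sketch:

```python
import random

random.seed(0)

# toy 1-D data: x in [0, 1], label 1 iff x > 0.5
data = [(i / 20, int(i / 20 > 0.5)) for i in range(21)]

def train_stump(sample):
    # weak learner: pick the threshold t minimizing training error
    # for the rule "predict 1 if x > t"
    best_t, best_err = 0.0, len(sample) + 1
    for t in (x for x, _ in sample):
        err = sum((x > t) != bool(y) for x, y in sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# bagging: 25 stumps, each fit to a bootstrap resample of the data
stumps = [train_stump([random.choice(data) for _ in data]) for _ in range(25)]

def bagged_predict(x):
    votes = sum(x > t for t in stumps)  # equally weighted vote
    return int(votes > len(stumps) / 2)

print(bagged_predict(0.9), bagged_predict(0.1))
```

Boosting differs in that later learners are refit to reweighted data emphasizing current errors, and stacking learns the combination weights by cross-validation rather than voting equally.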
2. Scalable Algorithms

• How far away are the data?

Memory   Random Access Time   Effective Distance
RAM      10^-8 seconds        1 meter
Disk     10^-3 seconds        100 km
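The “effective distance” column is just the access latency re-expressed as light-travel distance; a back-of-envelope check:

```python
C = 3.0e8  # speed of light, m/s

for device, seconds in [("RAM", 1e-8), ("disk", 1e-3)]:
    meters = seconds * C
    print(f"{device}: {meters:g} m")
```

This gives 3 m for RAM and 300 km for disk, the same orders of magnitude as the slide’s rounded figures: a disk seek is a round trip to another city.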
2. Scalable Algorithms
• “Scaling down the data”, or data approximation
– work from clever data summarizations (e.g., sufficient statistics)

• Squashing (DuMouchel et al., AT&T, KDD ’99)
– create a small “pseudo data set”
– with statistical properties similar to the original (massive) data set
– then run your standard algorithm on the pseudo-data
– can be significantly better than random sampling
– has an interesting theoretical (statistical) basis

• Frequent itemsets
– find all itemsets with more than T occurrences in D
– (the basis for association rule algorithms)
– itemsets: a computationally cheap way to generate joint probabilities
– use maximum entropy to construct a full model from the itemsets (Pavlov, Mannila, and Smyth, KDD ’99)
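A brute-force version of the frequent-itemset step is easy to write; real systems use Apriori-style pruning instead of enumerating every candidate. The transactions and threshold below are made up:

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"beer", "chips"}, {"beer", "chips", "salsa"},
    {"beer", "salsa"}, {"chips", "salsa"}, {"beer", "chips"},
]
T = 3  # minimum number of occurrences

counts = Counter()
for t in transactions:
    for size in (1, 2):  # itemsets of size 1 and 2 only, for brevity
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1

frequent = {s: c for s, c in counts.items() if c >= T}
print(frequent)
```

Here {beer}, {chips}, {salsa}, and {beer, chips} survive the threshold; the frequent-itemset counts are exactly the joint frequencies a maximum-entropy model would then be fit to.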
2. Scalable Algorithms
• “Scaling up the algorithm”
– data structures and caching strategies to speed up known algorithms
– typically orders of magnitude speed improvements
• Exact Algorithms
– BOAT (Gehrke et al, SIGMOD 98):
• a scalable decision tree construction algorithm
• clever algorithms can work from only 2 scans
– ADTrees (Moore, CMU, 1998)
• clever data structures for caching sufficient statistics for multivariate categorical data
• Approximate Algorithms
– approximate EM for Gaussian mixture modeling (Bradley and Fayyad, KDD 98)
– various heuristics for caching, approximation
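The "cache sufficient statistics" idea behind structures like ADTrees and scalable EM can be illustrated in its simplest form: one scan accumulates (n, Σx, Σx²), after which mean and variance are available without touching the data again. This is a toy one-dimensional illustration, not ADTrees themselves (which cache multivariate categorical counts).

```python
def gaussian_suffstats(stream):
    """One-scan accumulation of Gaussian sufficient statistics.
    Once (n, sum, sum of squares) are cached, mean and variance can
    be recomputed or updated without another pass over the data."""
    n = s = s2 = 0.0
    for x in stream:
        n += 1
        s += x
        s2 += x * x
    mean = s / n
    var = s2 / n - mean * mean   # population variance
    return n, mean, var

print(gaussian_suffstats([1.0, 2.0, 3.0]))
```

Because such summaries are additive, they can also be merged across data partitions, which is what makes them attractive for disk-resident or distributed data.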
![Page 63: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.](https://reader030.fdocuments.in/reader030/viewer/2022033106/56649d625503460f94a4517e/html5/thumbnails/63.jpg)
3. Pattern Finding
• Patterns = unusual, hard-to-find local “pockets” of data
– finding patterns is not the same as global model fitting
– the simplest example of patterns are association rules
• “Bump-hunting”
– PRIM algorithm of Friedman and Fisher (1999)
– finds multivariate “boxes” in high-dimensional spaces where the mean of a target variable is unusually high
– effective and flexible
• e.g., finding small highly profitable groups of customers
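The peeling phase of PRIM can be sketched as follows. This is a toy version only: Friedman and Fisher's full algorithm also includes pasting, multiple boxes, and more careful candidate evaluation, all omitted here.

```python
import numpy as np

def prim_peel(X, y, alpha=0.1, min_support=0.1):
    """Toy PRIM-style peeling: start with a box containing all points,
    then repeatedly peel the alpha-fraction slice off whichever face
    yields the highest mean of y inside the remaining box."""
    n, d = X.shape
    lo, hi = X.min(axis=0).astype(float), X.max(axis=0).astype(float)
    inside = np.ones(n, dtype=bool)
    while inside.sum() > max(1, min_support * n):
        best = None
        for j in range(d):
            xj = X[inside, j]
            for bound, q in (("lo", np.quantile(xj, alpha)),
                             ("hi", np.quantile(xj, 1 - alpha))):
                trial = inside & ((X[:, j] >= q) if bound == "lo"
                                  else (X[:, j] <= q))
                if trial.sum() == 0 or trial.sum() == inside.sum():
                    continue
                m = y[trial].mean()
                if best is None or m > best[0]:
                    best = (m, j, bound, q, trial)
        if best is None or best[0] <= y[inside].mean():
            break  # no peel improves the box mean: stop
        _, j, bound, q, trial = best
        if bound == "lo":
            lo[j] = q
        else:
            hi[j] = q
        inside = trial
    return lo, hi, inside
```

On data with a high-mean "pocket" (e.g., customers who are profitable only in one corner of feature space), the returned box shrinks toward that region, and `y[inside].mean()` is at least the global mean by construction.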
![Page 64: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.](https://reader030.fdocuments.in/reader030/viewer/2022033106/56649d625503460f94a4517e/html5/thumbnails/64.jpg)
“Bump-Hunting”

(Figure sequence: successive image-only slides illustrating PRIM’s box refinement.)
![Page 70: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.](https://reader030.fdocuments.in/reader030/viewer/2022033106/56649d625503460f94a4517e/html5/thumbnails/70.jpg)
Pattern Finding in Sequence Data
• Clustering Sequences
  – sequences of different lengths from different individuals
    • e.g. sequences of Web-page requests
  – Problem: do the sequences cluster into groups?
  – Clustering problem is non-trivial:
    • distance between 2 sequences of different lengths?

• Model-based approach (Cadez, Heckerman, Smyth, KDD 2000)
  – each cluster described as a Markov model
  – defines a mixture of Markov models, EM used for clustering
  – Application to MSNBC.com Web data
    • 900,000 users/sequences per day
    • clustered into order of 100 groups
    • useful for visualization/exploration of massive Web log
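The model-based approach above can be sketched as a minimal EM loop for a mixture of first-order Markov chains. This is an illustrative reimplementation in the spirit of the cited work, not the authors' code; the smoothing constants and simple full E-step are assumptions.

```python
import numpy as np

def markov_mixture_em(seqs, K, S, iters=50, seed=0):
    """Minimal EM for a mixture of K first-order Markov chains over
    S states.  seqs: list of state-index sequences (lengths may
    differ).  Returns mixing weights, initial-state probabilities,
    transition matrices, and soft cluster memberships."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                        # cluster weights
    init = rng.dirichlet(np.ones(S), size=K)        # P(first state | k)
    trans = rng.dirichlet(np.ones(S), size=(K, S))  # P(next | current, k)
    for _ in range(iters):
        # E-step: log-likelihood of each sequence under each cluster.
        logr = np.zeros((len(seqs), K))
        for i, seq in enumerate(seqs):
            for k in range(K):
                ll = np.log(pi[k]) + np.log(init[k, seq[0]])
                for a, b in zip(seq[:-1], seq[1:]):
                    ll += np.log(trans[k, a, b])
                logr[i, k] = ll
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)           # responsibilities
        # M-step: responsibility-weighted counts, lightly smoothed.
        pi = r.mean(axis=0)
        init = np.full((K, S), 1e-6)
        trans = np.full((K, S, S), 1e-6)
        for i, seq in enumerate(seqs):
            for k in range(K):
                init[k, seq[0]] += r[i, k]
                for a, b in zip(seq[:-1], seq[1:]):
                    trans[k, a, b] += r[i, k]
        init /= init.sum(axis=1, keepdims=True)
        trans /= trans.sum(axis=2, keepdims=True)
    return pi, init, trans, r
```

Because each sequence's likelihood is computed under its own length, sequences of different lengths cluster together naturally, with no pairwise distance needed.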
![Page 71: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.](https://reader030.fdocuments.in/reader030/viewer/2022033106/56649d625503460f94a4517e/html5/thumbnails/71.jpg)
Clusters of Dynamic Behavior
(Figure: Markov-chain state diagrams over states A, B, C, D for Cluster 1, Cluster 2, and Cluster 3.)
![Page 72: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.](https://reader030.fdocuments.in/reader030/viewer/2022033106/56649d625503460f94a4517e/html5/thumbnails/72.jpg)
![Page 73: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.](https://reader030.fdocuments.in/reader030/viewer/2022033106/56649d625503460f94a4517e/html5/thumbnails/73.jpg)
Final Comments
• Successful data mining requires integration of
  – statistics
  – computer science
  – the application discipline

• Current practice of data mining
  – computer scientists focused on business applications
  – relatively little statistical sophistication: but some new ideas
  – considerable “hype” factor

• Wednesday’s talk:
  – new ideas in temporal and spatial models
  – new ideas in latent variable modeling
  – potential applications in atmospheric/environmental science
![Page 74: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.](https://reader030.fdocuments.in/reader030/viewer/2022033106/56649d625503460f94a4517e/html5/thumbnails/74.jpg)
Further Reading
• Papers:
  – www.ics.uci.edu/~datalab
  – e.g., see P. Smyth, “Data mining: data analysis on a grand scale?”, preprint of review paper to appear in Statistical Methods in Medical Research

• Text (forthcoming)
  – Principles of Data Mining
    • D. J. Hand, H. Mannila, P. Smyth
    • MIT Press, late 2000
![Page 75: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.](https://reader030.fdocuments.in/reader030/viewer/2022033106/56649d625503460f94a4517e/html5/thumbnails/75.jpg)
3. Pattern Finding
• Contrast Sets (Bay and Pazzani, KDD99)
– individuals or objects categorized into 2 groups
• e.g., students enrolled in CS and in Engineering
– high-dimensional multivariate measurements on each
– Problem: automatically summarize the significant differences between the two groups.
• e.g., [fraction of ESL >] AND [mean SAT >] in CS
• Approach
– massive systematic breadth-first search through potential variable-value conjunctions
– branch-and-bound pruning of exponentially large search space
– statistical adjustments for multiple hypothesis problem
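The search idea can be sketched as a toy contrast-set miner. This is illustrative only: Bay and Pazzani's approach adds chi-square significance tests, multiple-hypothesis corrections, and branch-and-bound pruning of the exponential search space, all omitted here in favor of a simple support-difference threshold.

```python
from itertools import combinations

def contrast_sets(group_a, group_b, min_diff=0.2, max_terms=2):
    """Toy contrast-set miner: enumerate attribute=value conjunctions
    up to max_terms terms and keep those whose support (fraction of
    matching records) differs between the two groups by min_diff."""
    attrs = sorted(group_a[0])
    # All attribute=value pairs observed in either group.
    pairs = sorted({(a, r[a]) for r in group_a + group_b for a in attrs})

    def support(rows, conj):
        hits = sum(all(r[a] == v for a, v in conj) for r in rows)
        return hits / len(rows)

    found = []
    for k in range(1, max_terms + 1):
        for conj in combinations(pairs, k):
            if len({a for a, _ in conj}) < k:
                continue  # skip conjunctions reusing an attribute
            sa, sb = support(group_a, conj), support(group_b, conj)
            if abs(sa - sb) >= min_diff:
                found.append((conj, sa, sb))
    return found
```

For example, with records like `{"esl": "yes"}` for CS students versus Engineering students, the miner surfaces `esl=yes` when its frequency differs sufficiently between the groups.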
![Page 76: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.](https://reader030.fdocuments.in/reader030/viewer/2022033106/56649d625503460f94a4517e/html5/thumbnails/76.jpg)
3. Pattern Finding
• Contrast Sets (Bay and Pazzani, KDD ‘99)
  – individuals or objects categorized into 2 groups
    • e.g., students enrolled in CS and in Engineering
  – high-dimensional multivariate measurements on each
  – automatically produces a summary of significant differences between groups (Bay and Pazzani, KDD ‘99)
  – combines massive search with statistical estimation

• Time-Series Pattern Spotting
  – “find me a shape that looks like this”
  – semi-Markov deformable templates (Ge and Smyth, KDD 2000)
  – significantly outperforms template matching and DTW
  – Bayesian approach integrates prior knowledge with data
![Page 77: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.](https://reader030.fdocuments.in/reader030/viewer/2022033106/56649d625503460f94a4517e/html5/thumbnails/77.jpg)
Example: Deformable Templates
Segmental hidden semi-Markov model: each waveform segment corresponds to a state in the model.

(Figure: a waveform partitioned into segments, each segment mapped to one of the states S1, S2, …, ST.)
![Page 78: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.](https://reader030.fdocuments.in/reader030/viewer/2022033106/56649d625503460f94a4517e/html5/thumbnails/78.jpg)
Pattern-Based End-Point Detection
(Figure: two plots of signal vs. TIME (SECONDS) — the Original Pattern and the Detected Pattern.)

End-Point Detection in Semiconductor Manufacturing
![Page 79: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.](https://reader030.fdocuments.in/reader030/viewer/2022033106/56649d625503460f94a4517e/html5/thumbnails/79.jpg)
Heterogeneous Data Modeling
• Clustering Objects (sequences, curves, etc)
– probabilistic approach: define a mixture of models (Cadez, Gaffney, and Smyth, KDD 2000)
– unified framework for clustering objects of different dimensions
– applications:
• curve-clustering:
– e.g., mixture of regression models (Gaffney and Smyth, KDD ‘99)
– video movement, gene expression data, storm trajectories
• sequence clustering
– e.g., mixtures of Markov models
– clustering of MSNBC Web data (Cadez et al, KDD ‘00)
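The curve-clustering idea can be sketched as EM for a mixture of linear regressions. Illustrative assumptions: one-dimensional inputs, a straight-line model per cluster, and simple random initialization; the cited work handles richer regression models and whole curves per individual.

```python
import numpy as np

def regression_mixture_em(x, y, K=2, iters=100, seed=0):
    """Minimal EM for a mixture of K linear regressions y = b0 + b1*x
    with Gaussian noise.  Returns per-cluster coefficients, noise
    scales, mixing weights, and soft memberships."""
    rng = np.random.default_rng(seed)
    n = len(x)
    X = np.column_stack([np.ones(n), x])        # design matrix
    beta = rng.normal(size=(K, 2))              # (intercept, slope)
    sigma = np.full(K, y.std() + 1e-9)
    w = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resid = y[:, None] - X @ beta.T         # shape (n, K)
        logp = (np.log(w) - np.log(sigma)
                - 0.5 * (resid / sigma) ** 2)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted least squares per component (tiny ridge
        # term keeps the normal equations well-conditioned).
        for k in range(K):
            W = r[:, k]
            XtW = X.T * W
            beta[k] = np.linalg.solve(XtW @ X + 1e-9 * np.eye(2), XtW @ y)
            sigma[k] = np.sqrt((W * (y - X @ beta[k]) ** 2).sum()
                               / W.sum()) + 1e-9
        w = r.mean(axis=0)
    return beta, sigma, w, r
```

Replacing the linear model with a Markov chain per cluster gives the sequence-clustering case above; that interchangeability is the point of the unified mixture framework.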
![Page 80: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.](https://reader030.fdocuments.in/reader030/viewer/2022033106/56649d625503460f94a4517e/html5/thumbnails/80.jpg)
(Figures: “TRAJECTORIES OF CENTROIDS OF MOVING HAND IN VIDEO STREAMS”, X-POSITION vs. TIME; and two “ESTIMATED CLUSTER TRAJECTORY” plots, Y-POSITION vs. TIME and X-POSITION vs. TIME.)
![Page 81: An Introduction to Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine July 2000.](https://reader030.fdocuments.in/reader030/viewer/2022033106/56649d625503460f94a4517e/html5/thumbnails/81.jpg)
Heterogeneous Populations of Objects

(Figure: a Population Model in parameter space generates Individuals and Parameters, which in turn generate the Observed Data.)