San Diego Supercomputer Center
National Partnership for Advanced Computational Infrastructure
SDSC Summer Institute 2004
TUTORIAL: Data Mining for Scientific Applications
Peter Shin and Hector Jasso
San Diego Supercomputer Center, UCSD
Overview
• Introduction to data mining
  • Definitions, concepts, applications
• Machine learning methods for KDD
  • Supervised learning – classification
  • Unsupervised learning – clustering
• Cyberinfrastructure for data mining
  • SDSC/NPACI resources – hardware and software
• Survey of applications at SKIDL
• Break
• Hands-on tutorial with IBM Intelligent Miner and SKIDLkit
  • Customer targeting
  • Microarray analysis (leukemia dataset)
Data Mining Definition
The search for interesting patterns and models in large data collections, using statistical and machine learning methods and high-performance computational infrastructure.

Key point: applications are data-driven and compute-intensive.
Analysis Levels and Infrastructure
• Informal methods – graphs, plots, visualizations, exploratory data analysis (yes, Excel is a data mining tool)
• Advanced query processing and OLAP – e.g., the National Virtual Observatory (NVO)
• Machine learning (compute-intensive statistical methods)
  • Supervised – classification, prediction
  • Unsupervised – clustering
• Computational infrastructure is needed at all levels – collections management, information integration, high-performance database systems, web services, grid services, scientific workflows, the global IT grid
The Case for Data Mining: Data Reality
• Deluge from new sources
  • Remote sensing
  • Microarray processing
  • Wireless communication
  • Simulation models
  • Instrumentation – microscopes, telescopes
  • Digital publishing
  • Federation of collections
• "5 exabytes (5 million terabytes) of new information was created in 2002" (source: UC Berkeley researchers Peter Lyman and Hal Varian)
• This is the result of a recent paradigm shift: from hypothesis-driven data collection to data mining
• Data destination: legacy archives and independent collection activities
Knowledge Discovery Process
Data → Collection → Processing/Cleansing/Corrections → Management/Federation/Warehousing → Analysis/Modeling → Presentation/Visualization → Application/Decision Support → Knowledge

"Data is not information; information is not knowledge; knowledge is not wisdom." – Gary Flake, Principal Scientist and Head of Yahoo! Research Labs, July 2004.
Characteristics of Data Mining Applications
• Data:
  • Lots of data, numerous sources
  • Noisy – missing values, outliers, interference
  • Heterogeneous – mixed types, mixed media
  • Complex – scale, resolution, temporal and spatial dimensions
• Relatively little domain theory, few quantitative causal models
• Lack of valid ground truth
• Advice: don't choose problems that have all of these characteristics…
Scientific vs. Commercial Data Mining

Goals:
• Science – theories: need for insight and theory-based models, interpretable model structures, generated domain rules or causal structures, support for theory development
• Commercial – profits: black boxes OK

Types of data:
• Science – images, sensors, simulations
• Commercial – transaction data
• Both – spatial and temporal dimensions, heterogeneous data

Trend – common IT (information technology) tools fit both enterprises:
• Database systems (Oracle, DB2, etc.), integration tools (Information Integrator), web services (Blue Titan, .NET)
• This is good!
Introduction to Machine Learning
• Basic machine learning theory
• Concepts and feature vectors
• Supervised and unsupervised learning
• Model development
  • Training and testing methodology, model validation, overfitting, confusion matrices
• Survey of algorithms
  • Decision tree classification
  • k-means clustering
  • Hierarchical clustering
  • Bayesian networks and probabilistic inference
  • Support vector machines
Basic Machine Learning Theory
Basic inductive learning hypothesis:
• Given a large number of observations, we can approximate the rule that describes how the data was generated, and thus build a model (using some algorithm).

No Free Lunch Theorem:
• There is no ultimate algorithm: in the absence of prior information about the problem, there is no reason to prefer one learning algorithm over another.

Conclusion:
• There is no problem-independent "best" learning system. Formal theory and algorithms are not enough.
• Machine learning is an empirical subject.
Concepts are described as feature vectors
Example: vehicles
• Has wheels
• Runs on gasoline
• Carries people
• Flies
• Weighs less than 500 pounds

Boolean feature vectors for vehicles:
• car254 [ 1 1 1 0 0 ]
• motorcycle14 [ 1 1 1 0 1 ]
• airplane132 [ 1 1 1 1 0 ]
Easy to generalize to complex data types:
• Number of wheels
• Fuel type
• Carrying capacity
• Flies
• Weight

car254 [ 4, gas, 6, 0, 2000 ]
motorcycle14 [ 2, gas, 2, 0, 400 ]
airplane132 [ 10, jetfuel, 110, 1, 35000 ]

Most machine learning algorithms expect feature vectors, stored in text files or databases.

Suggestions:
• Identify the target concept
• Organize your data to fit the feature vector representation
• Design your database schemas to support generation of data in this format
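Expressed in code, the vehicle vectors above look like this; a minimal Python sketch (the feature names and the `to_record` helper are illustrative choices, not from the slides):

```python
# Each concept is a fixed-length feature vector; most learning
# algorithms consume exactly this representation.
FEATURES = ["num_wheels", "fuel_type", "capacity", "flies", "weight_lb"]

vehicles = {
    "car254":       [4,  "gas",     6,   0,  2000],
    "motorcycle14": [2,  "gas",     2,   0,   400],
    "airplane132":  [10, "jetfuel", 110, 1, 35000],
}

def to_record(name):
    """Pair each feature name with its value for one example."""
    return dict(zip(FEATURES, vehicles[name]))

print(to_record("car254"))
```

In practice these rows would come from a text file or a database table, one feature vector per line.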
Supervised vs. Unsupervised Learning
Supervised – Each feature vector belongs to a class (label). Labels are given externally, and algorithms learn to predict the label of new samples/observations.
Unsupervised – Finds structure in the data, by clustering similar elements together. No previous knowledge of classes needed.
Model development
Training and testing: Train → Test → Apply

Model validation:
• Hold-out validation (2/3, 1/3 splits)
• Cross validation, simple and n-fold (reuse)
• Bootstrap validation (sample with replacement)
• Jackknife validation (leave one out)
• When possible, hide a subset of the data until train-test is complete.
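A 2/3–1/3 hold-out split like the one above can be sketched in a few lines of pure Python (`holdout_split` is an illustrative helper, not part of any toolkit mentioned here):

```python
import random

def holdout_split(data, train_frac=2/3, seed=0):
    """Shuffle the data, then cut it into train and test partitions."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    items = list(data)
    rng.shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

samples = list(range(30))
train, test = holdout_split(samples)
print(len(train), len(test))  # 20 10
```

The same shuffled-split idea underlies n-fold cross validation: repeat with different partitions and average the test scores.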
[Figure: training vs. test accuracy (0–100%) plotted against algorithm steps (0–8). Training accuracy keeps climbing while test accuracy peaks at the optimal depth and then falls off – avoid overfitting.]
Confusion matrices

                 Predicted
                 Negative  Positive
Actual Negative    124       15
       Positive      8       84

Accuracy = (124 + 84) / (124 + 15 + 8 + 84) – "proportion of predictions correct"
True positive rate = 84 / (8 + 84) – "proportion of positive cases correctly identified"
False positive rate = 15 / (124 + 15) – "proportion of negative cases incorrectly classified as positive"
True negative rate = 124 / (124 + 15) – "proportion of negative cases correctly identified"
False negative rate = 8 / (8 + 84) – "proportion of positive cases incorrectly classified as negative"
Precision = 84 / (15 + 84) – "proportion of predicted positive cases that were correct"
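All of these metrics follow directly from the four cell counts; a quick check in Python:

```python
# Cell counts from the confusion matrix above.
tn, fp, fn, tp = 124, 15, 8, 84

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # proportion of predictions correct
tpr       = tp / (tp + fn)                    # true positive rate
fpr       = fp / (fp + tn)                    # false positive rate
tnr       = tn / (tn + fp)                    # true negative rate
fnr       = fn / (fn + tp)                    # false negative rate
precision = tp / (tp + fp)                    # predicted positives that were correct

print(f"accuracy={accuracy:.3f} precision={precision:.3f} tpr={tpr:.3f}")
```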
Classification – Decision Tree
Ecosystem   Annual Precipitation
Desert        2
Forest      120
Forest      104
Desert        5
Forest      116
Prairie      63
First split:

Precipitation > 63?
  YES → Forest 120, Forest 104, Forest 116
  NO  → Desert 2, Desert 5, Prairie 63
Second split:

Precipitation > 63?
  YES → Forest 120, Forest 104, Forest 116
  NO  → Precipitation > 5?
          YES → Prairie 63
          NO  → Desert 2, Desert 5
If (Precip > 63 ) then “Forest”
else If (Precip > 5) then “Prairie”
else “Desert”
Classification accuracy on training data is 100%
Confusion matrix for the learned model on training data (D = Desert, F = Forest, P = Prairie):

           Predicted
           D  F  P
Actual D   2  0  0
       F   0  3  0
       P   0  0  1
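The learned rules translate directly into code; running them over the six training samples confirms the 100% training accuracy:

```python
def classify(precip):
    """Decision tree learned from the training data."""
    if precip > 63:
        return "Forest"
    elif precip > 5:
        return "Prairie"
    return "Desert"

# The six (ecosystem, annual precipitation) training samples.
train = [("Desert", 2), ("Forest", 120), ("Forest", 104),
         ("Desert", 5), ("Forest", 116), ("Prairie", 63)]

correct = sum(classify(p) == label for label, p in train)
print(f"training accuracy: {correct}/{len(train)}")  # 6/6
```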
Testing Set Results
Test data and predictions of the learned model:

Ecosystem   Precipitation   Predicted
Desert        8             Prairie
Forest      100             Forest
Prairie      55             Prairie
Desert        4             Desert
Forest      116             Forest
Prairie      72             Forest

Learned model:
If (Precip > 63) then "Forest"
else if (Precip > 5) then "Prairie"
else "Desert"

Result: accuracy 67%. The model shows overfitting and generalizes poorly.

Confusion matrix on the test data (D = Desert, F = Forest, P = Prairie):

           Predicted
           D  F  P
Actual D   1  0  1
       F   0  2  0
       P   0  1  1
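Applying the same learned rules to the test data reproduces the 67% figure (4 of 6 correct):

```python
def classify(precip):
    """Decision tree learned from the training data."""
    if precip > 63:
        return "Forest"
    elif precip > 5:
        return "Prairie"
    return "Desert"

# The six (true ecosystem, precipitation) test samples.
test = [("Desert", 8), ("Forest", 100), ("Prairie", 55),
        ("Desert", 4), ("Forest", 116), ("Prairie", 72)]

preds = [(label, classify(p)) for label, p in test]
correct = sum(actual == predicted for actual, predicted in preds)
print(f"test accuracy: {correct}/{len(test)}")  # 4/6, about 67%
```

The two misses are Desert 8 (predicted Prairie) and Prairie 72 (predicted Forest) – exactly the boundary cases the overly deep tree got wrong.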
Pruning to Improve Generalization – Pruned Decision Tree

Precipitation < 60?
  YES → Desert 2, Desert 5
  NO  → Forest 120, Forest 104, Forest 116, Prairie 63

Pruned rule:
If (Precip < 60) then "Desert"
else [P(Forest) = .75] and [P(Prairie) = .25]
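The pruned tree can likewise be written as code; since its mixed leaf keeps class probabilities rather than a single label, a sketch might return a distribution (the dictionary representation is an illustrative choice):

```python
def classify_pruned(precip):
    """Pruned tree: a single split, with class probabilities in the mixed leaf."""
    if precip < 60:
        return {"Desert": 1.0}
    # 3 of the 4 training samples above the split are Forest, 1 is Prairie.
    return {"Forest": 0.75, "Prairie": 0.25}

print(classify_pruned(2))    # pure Desert leaf
print(classify_pruned(120))  # mixed Forest/Prairie leaf
```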
Decision Trees Summary
• Simple to understand
• Works with mixed data types
• Heuristic search is sensitive to local minima
• Models non-linear functions
• Handles classification and regression
• Many successful applications
• Readily available tools
Overview of Clustering
• Definition:
  • Clustering is the discovery of classes
  • Unlabeled examples => unsupervised learning
• Survey of applications:
  • Grouping of web-visit data, clustering of genes according to their expression values, grouping of customers into distinct profiles
• Survey of methods:
  • k-means clustering
  • Hierarchical clustering
  • Expectation Maximization (EM) algorithm
  • Gaussian mixture modeling
• Cluster analysis:
  • Concept (class) discovery
  • Data compression/summarization
  • Bootstrapping knowledge
Clustering – k-Means
Precipitation Temperature
8 81
71 70
62 63
49 45
17 76
32 49
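The iterative k-means procedure can be sketched in pure Python on these six (precipitation, temperature) samples; k = 3 and the starting centroids below are illustrative choices (the outcome of k-means depends on initialization):

```python
# The six (precipitation, temperature) samples from the table above.
points = [(8, 81), (71, 70), (62, 63), (49, 45), (17, 76), (32, 49)]

def kmeans(points, centroids, iters=10):
    """Alternate assignment and update steps for a fixed number of iterations."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to its cluster mean.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

centroids, clusters = kmeans(points, [(8, 81), (71, 70), (32, 49)])
print(centroids)  # converges to three centroids of two points each
```

With this initialization the algorithm converges after one update to three clusters of two points each, matching the three ecosystem groups identified on the slides that follow.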
Clustering – k-Means

[Figure sequence: scatter plots of Temperature (30–90) versus Precipitation (0–80) for the six samples, showing successive k-means iterations – points are assigned to the nearest centroid, centroids are recomputed, and the process repeats until the clusters stabilize.]
Clustering – k-Means
Cluster   Temperature   Precipitation
C1        70 – 85        0 – 25
C2        35 – 60       25 – 55
C3        50 – 80       50 – 80
Clustering – k-Means

Cluster   Temperature   Precipitation   Ecosystem
C1        70 – 85        0 – 25         Desert
C2        35 – 60       25 – 55         Prairie
C3        50 – 80       50 – 80         Forest
Using k-means
• Requires a priori knowledge of 'k'
• The final outcome depends on the initial choice of the k means – inconsistency
• Sensitive to outliers, which can skew the means of their clusters
• Favors spherical clusters – clusters may not match domain boundaries
• Requires real-valued features
Cyberinfrastructure for Data Mining
• Resources – hardware and software (analysis tools and middleware)
• Policies – allocating resources to the scientific community. Challenges to the traditional supercomputer model. Requirements for interactive and real-time analysis resources.
NSF TeraGrid: Building Integrated National Cyberinfrastructure
• Prototype for cyberinfrastructure:
  • Ubiquitous computational resources
  • Plug-in compatibility
• National reach:
  • SDSC, NCSA, CIT, ANL, PSC
• High-performance network:
  • 40 Gb/s backbone, 30 Gb/s to each site
• Over 20 teraflops of compute power
• Over 1 PB online storage
• 8.9 PB archival storage
SDSC is a Data-Intensive Center
SDSC Machine Room Data Architecture

Philosophy: enable the SDSC configuration to serve the grid as a Data Center
• 0.5 PB disk
• 6 PB archive
• 1 GB/s disk-to-tape
• Optimized support for DB2/Oracle

Components (from the original architecture diagram):
• Blue Horizon: 1,152-processor IBM SP, 1.7 teraflops
• HPSS: over 600 TB of data stored
• Networks: LAN (multiple GbE, TCP/IP), SAN (2 Gb/s, SCSI), WAN (30 Gb/s), SCSI/IP or FC/IP
• Linux cluster, 4 TF, with local disk (50 TB)
• Sun F15K
• FC disk cache (400 TB); FC GPFS disk (100 TB), 200 MB/s per controller
• Silos and tape: 6 PB, 1 GB/s disk to tape, 32 tape drives at 30 MB/s per drive
• Power 4 nodes: database engine (DB2), data miner, vis engine
SDSC IBM Regatta – DataStar
• 100+ TB disk
• Numerous fast CPUs
• 64 GB of RAM per node
• DB2 v8.x ESE
• IBM Intelligent Miner
• SAS Enterprise Miner
• Platform for high-performance database, data mining, comparative IT studies…
Data Mining Tools Used at SDSC
• SAS Enterprise Miner (protein crystallization – JCSG)
• IBM Intelligent Miner (protein crystallization – JCSG; corn yield – Michigan State University; security logs – SDSC)
• CART (protein crystallization – JCSG)
• Matlab SVM package (TeraBridge health monitoring – UCSD Structural Engineering Department; North Temperate Lakes monitoring – LTER)
• PyML (text mining – NSDL; hyperspectral data – LTER)
• SKIDLkit by SDSC (microarray analysis – UCSD Cancer Center; hyperspectral data – LTER)
• SVMlight (hyperspectral data – LTER)
• LSI by Telcordia (text mining – NSDL)
• CoClustering by Fair Isaac (text mining – NSDL)
• Matlab Bayes Net package
• WEKA
SKIDLkit
• Toolkit for feature selection and classification
  • Filter methods
  • Wrapper methods
• Data normalization
• Feature selection
• Support Vector Machine & Naïve Bayesian clustering
• http://daks.sdsc.edu/skidl
• We will use it in the hands-on demo…
Survey of Applications at SDSC
• Sensor networks for bridge monitoring (with Structural Engineering Dept., UCSD)
• Text mining the NSDL (National Science Digital Library) collection
• Hyperspectral remote sensing data for groundcover classification (with Long Term Ecological Research Network - LTER)
• Microarray analysis for tumor detection (with UCSD Cancer Center)
Sensor Networks for Bridge Monitoring
• Task: detection and classification
  • Identify damaged piers based on the data stream of acceleration measurements.
  • Determine which sensors are key to determining bridge health.
  • Multi-resolution analysis for rational resource management.
• Testbed:
  • Humboldt Bay Bridge, with 8 piers.
• Assumptions:
  • Damage only happens at the lower end of each pier (location of the plastic hinge)
  • Only one pier is damaged at a time.
Text Mining the NSDL

Processing pipeline:
1. Variously formatted documents
2. Strip formatting
3. Pick out content words using "stop lists"
4. Stemming
5. Discard words that appear in every document or in only one
6. Word count, term weighting
7. Generate term-document matrix
8. Query: for a list of words, get the documents with the highest score, via various retrieval schemes (LSI, classification, or clustering modules)
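A toy version of this pipeline can be sketched in Python. The stop list and one-character stemmer below are deliberate oversimplifications, and the "discard words in every document or only one" step is skipped because the corpus here is just three tiny documents:

```python
import re
from collections import Counter

# Placeholder stop list; a real system would use a much larger one.
STOP = {"the", "a", "of", "and", "is", "in", "to"}

def tokens(doc):
    words = re.findall(r"[a-z]+", doc.lower())   # strip formatting/punctuation
    words = [w for w in words if w not in STOP]  # stop-list filtering
    # Naive plural-stripping stand-in for a real stemmer (e.g., Porter).
    return [w[:-1] if w.endswith("s") else w for w in words]

docs = ["The mining of data.", "Data is mined in science.", "Science and data."]

# Term-document matrix: one row per document, one column per vocabulary term.
vocab = sorted({t for d in docs for t in tokens(d)})
matrix = [[Counter(tokens(d))[t] for t in vocab] for d in docs]
print(vocab)
print(matrix)
```

A query is then scored against the rows of this matrix; LSI, classification, or clustering modules all operate on the same term-document representation.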
Hyperspectral Image Classification
• Characteristics of the data:
  • Over 200 bands
  • Small number of samples, due to a labor-intensive collection process
• Collaboration with the Long Term Ecological Research Network (LTER)
• Tasks:
  • Classify the vegetation (e.g., juniper tree, sage, etc.)
  • Identify key bands
  • Detect spatio-temporal patterns
Microarray Analysis for Tumor Detection

Characteristics of the data:
• 88 prostate tissue samples:
  • 37 labeled "no tumor"
  • 51 labeled "tumor"
• Each tissue with 10,600 gene expression measurements
• Collected by the UCSD Cancer Center, analyzed at SDSC

Tasks:
• Build a model to classify new, unseen tissues as either "no tumor" or "tumor"
• Identify key genes to determine their biological significance in the cancer process
Some genes are more useful than others for building classification models.

Example: genes 36569_at and 36495_at are useful.

[Figure: expression scatter plot separating the "No Tumor" and "Tumor" samples.]
Results using independent test set
Break
Hands-on Analysis
• Decision Tree classification with IBM Intelligent Miner
• Using classification models to make rational decisions
Data Mining Example: Targeting Customers
• Problem characteristics:
  1. We make $50 profit on a sale of $200 shoes.
  2. A preliminary study shows that people who make over $50k will buy the shoes at a rate of 5% when they receive the brochure.
  3. People who make less than $50k will buy the shoes at a rate of 1% when they receive the brochure.
  4. It costs $1 to send a brochure to a potential customer.
  5. In general, we do not know whether a person makes more than $50k or not.
Available Information
• Variable Description• Please refer to the hand-out.
Possible Marketing Plans
• We will send out 30,000 brochures.
• Plan A: ignore the data and randomly send brochures (a.k.a. the ran-dumb plan)
• Plan B: use data mining to target a specific group with high probabilities of responding (a.k.a. the Intelligent Target (IT) plan)
Plan A (ran-dumb plan)
• Strategy: send brochures to anyone
• Cost of sending one brochure = $1
• Probability of response:
  • 1% of the population who make <= $50k (76%)
  • 5% of the population who make > $50k (24%)
  • Resulting in: (1% * 76% + 5% * 24%) = 1.96% final response rate
• Earnings:
  • Expected profit from one brochure = (probability of response * profit – cost of a brochure) = (1.96% * $50 – $1) = -$0.02
  • Expected earnings = expected profit from one brochure * number of brochures sent = -$0.02 * 30,000 = -$600
Plan B (Intelligent Target (IT) plan)
• Strategy: send brochures only to: married, college or above, managerial/professional/sales/tech support/protective service/armed forces, age >= 28.5, hours_per_week >= 31
• Cost of sending one brochure = $1
• Probability of response:
  • 1% of the population who make <= $50k (20.6%)
  • 5% of the population who make > $50k (79.4%)
  • Resulting in: (1% * 20.6% + 5% * 79.4%) = 4.176% final response rate
• Earnings:
  • Expected profit from one brochure = (probability of response * profit – cost of a brochure) = (4.176% * $50 – $1) = $1.088
  • Expected earnings = $1.088 * 30,000 = $32,640
Comparison of Two Plans
• Expected earnings from the ran-dumb plan: -$600
• Expected earnings from the IT plan: $32,640
• Net difference: $32,640 – (-$600) = $33,240
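The arithmetic behind both plans can be checked in a few lines (the constants come from the problem statement above):

```python
PROFIT, COST, N = 50, 1, 30_000  # profit per sale, brochure cost, brochures sent

def expected_earnings(resp_rate):
    """(response rate * profit - brochure cost) * number of brochures sent."""
    return (resp_rate * PROFIT - COST) * N

# Plan A: random mailing over the whole population (76% / 24% income split).
plan_a_rate = 0.01 * 0.76 + 0.05 * 0.24    # 1.96%
# Plan B: mail only the data-mined target group (20.6% / 79.4% split).
plan_b_rate = 0.01 * 0.206 + 0.05 * 0.794  # 4.176%

a = expected_earnings(plan_a_rate)
b = expected_earnings(plan_b_rate)
print(round(a), round(b), round(b - a))  # -600 32640 33240
```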
Acknowledgements
• Original source: Census Bureau (1994)
• Data processed and donated by Ron Kohavi and Barry Becker (Data Mining and Visualization, SGI)