System Support for High Performance Scientific Data Mining
Gagan Agrawal, Ruoming Jin, Raghu Machiraju, S. Parthasarathy
Department of Computer and Information Sciences, Ohio State University
Scientific Data Mining Problem
Datasets used for scientific data mining are large, particularly those from simulations
Our understanding of which algorithms and parameters will give the desired insights is limited
Implementing different algorithms and running them with different parameters on large datasets takes time, which slows down the scientific data mining process
Project Overview
FREERIDE (Framework for Rapid Implementation of Datamining Engines) is the base system
Already demonstrated for a variety of standard mining algorithms
Currently being applied to feature analysis and mining of simulation data
FREERIDE offers:
  The ability to rapidly prototype a high-performance mining implementation
  Distributed memory parallelization
  Shared memory parallelization
  Ability to process large and disk-resident datasets
  Only modest modifications to a sequential implementation for the above three
Key Observation from Mining Algorithms
Popular algorithms have a common canonical loop
Can be used as the basis for supporting a common middleware
While( ) {
    forall (data instances d) {
        I = process(d)
        R(I) = R(I) op d
    }
    …
}
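
The canonical loop above can be made concrete with k-means clustering, one of the algorithms this deck reports results for. The following is a minimal sketch, not FREERIDE code: the names `nearest`, `kmeans_pass`, and `kmeans` are hypothetical, and the stopping rule (centers unchanged, or an iteration cap) is an assumption, since the slide leaves the `While` condition blank. The reduction object R is the per-cluster sum and count, `process(d)` picks the nearest center, and `op` is addition.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(centers: Sequence[Point], d: Point) -> int:
    """process(d): index of the center closest to d (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(centers[i], d)),
    )


def kmeans_pass(data: Sequence[Point], centers: Sequence[Point]):
    """One execution of the inner forall loop; returns the reduction object R."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]  # R: running coordinate sums per cluster
    counts = [0] * len(centers)            # R: number of points per cluster
    for d in data:                         # forall (data instances d)
        i = nearest(centers, d)            # I = process(d)
        counts[i] += 1                     # R(I) = R(I) op d, with op = addition
        for j, x in enumerate(d):
            sums[i][j] += x
    return sums, counts


def kmeans(data: Sequence[Point], centers: Sequence[Point], max_iters: int = 20) -> List[Point]:
    """Outer loop; the stopping rule here is an assumption, not taken from the slide."""
    centers = [tuple(float(x) for x in c) for c in centers]
    for _ in range(max_iters):
        sums, counts = kmeans_pass(data, centers)
        new_centers = [
            tuple(s / counts[i] for s in sums[i]) if counts[i] else centers[i]
            for i in range(len(centers))
        ]
        if new_centers == centers:         # centers stopped moving
            break
        centers = new_centers
    return centers
```

Because the update to R is associative and commutative, the order in which data instances are visited does not affect the result, which is the property that lets a middleware split the inner loop across threads or nodes.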
Performance of Shared Memory Parallelization
[Chart: K-means clustering. Thread counts: 1, 4, 16. Techniques: full replication, optimized full locks, cache-sensitive locks.]
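
The chart labels name three shared memory techniques without describing them. As a rough illustration of the first one, full replication is generally understood to mean that each thread updates a private copy of the reduction object and the copies are merged at the end, so no locks are needed during the loop. The sketch below is written under that assumption; `local_histogram` and `replicated_reduce` are hypothetical names, and a simple histogram stands in for a real mining reduction.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def local_histogram(chunk: Sequence[int], num_bins: int) -> List[int]:
    """Reduce one chunk into a private copy of the reduction object."""
    local = [0] * num_bins
    for d in chunk:
        local[d % num_bins] += 1  # lock-free: no other thread sees this copy
    return local


def replicated_reduce(data: Sequence[int], num_bins: int, num_threads: int) -> List[int]:
    """Full replication sketch: one private copy per thread, merged afterwards."""
    # Split the data into one strided chunk per thread.
    chunks = [data[t::num_threads] for t in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(lambda c: local_histogram(c, num_bins), chunks))
    # Merge step: combine the private copies element by element.
    merged = [0] * num_bins
    for partial in partials:
        for i, value in enumerate(partial):
            merged[i] += value
    return merged
```

This shows the structure of the technique rather than its speedup: the cost is one copy of the reduction object per thread plus the merge, which is the trade-off the locking variants in the chart are meant to avoid.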
Performance on Cluster of SMPs
[Chart: Apriori association mining. Nodes: 1, 2, 4, 8. Threads: 1, 2, 3.]
SPIES On (a) FREERIDE
Developed a new communication-efficient decision tree construction algorithm: Statistical Pruning of Intervals for Enhanced Scalability (SPIES)
Combines RainForest with statistical pruning of intervals of numerical attributes to reduce memory requirements and communication volume
Does not require sorting of data, or partitioning and writing-back of records
[Chart: Nodes: 1, 8. Threads: 1, 2, 3.]
Broader Research Agenda
Applying FREERIDE for Scientific Data Mining
Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al.
A feature is a region of interest in a dataset
A suite of algorithms for extracting and tracking features
A Feature Analysis Algorithm
[Diagram: Classify-Aggregate. Labels: Data, Tour Grid, Operator, Transform, Classify Points, Aggregate, Denoise, Rank, Track, ROIs, Catalog.]
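
The classify and aggregate steps named in the diagram can be illustrated on a small 2D grid. This is only a sketch under stated assumptions, not the method of Machiraju et al.: classification is reduced to a threshold test, and aggregation groups classified points that touch horizontally or vertically into regions of interest. The names `classify` and `aggregate` are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Cell = Tuple[int, int]


def classify(grid: List[List[float]], threshold: float) -> List[List[bool]]:
    """Classify each grid point independently (assumed rule: value >= threshold)."""
    return [[value >= threshold for value in row] for row in grid]


def aggregate(mask: List[List[bool]]) -> List[List[Cell]]:
    """Group 4-connected classified points into regions using breadth-first search."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    regions: List[List[Cell]] = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Start a new region and flood outwards from this point.
            region: List[Cell] = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions
```

In this sketch the classify step touches each point independently, so it fits the per-instance reduction loop directly, while the aggregate step depends on neighboring points.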
Ongoing Work: Parallelization Using FREERIDE
Most of the steps involve generalized reductions, which are supported well in FREERIDE
Extensions to FREERIDE are required for the aggregation and tracking steps
Overall, FREERIDE can allow rapid implementation of scalable versions of a variety of steps and algorithms that are part of the feature mining paradigm