System Support for High Performance Scientific Data Mining


Gagan Agrawal Ruoming Jin

Raghu Machiraju S. Parthasarathy

Department of Computer and Information Sciences, Ohio State University

Scientific Data Mining Problem

Datasets used for scientific data mining are large, particularly those from simulations

Our understanding of what algorithms and parameters will give desired insights is limited

The time required to implement different algorithms and run them with different parameters on large datasets slows down the scientific data mining process

Project Overview

FREERIDE (Framework for Rapid Implementation of Datamining Engines) as the base system

Already demonstrated for a variety of standard mining algorithms

Currently being applied to feature analysis and mining of simulation data

FREERIDE offers:

The ability to rapidly prototype a high-performance mining implementation

Distributed memory parallelization

Shared memory parallelization

Ability to process large and disk-resident datasets

Only modest modifications to a sequential implementation are needed for the above three

Key Observation from Mining Algorithms

Popular algorithms have a common canonical loop

Can be used as the basis for supporting a common middleware

While( ) {
    forall( data instances d ) {
        I = process(d)
        R(I) = R(I) op d
    }
    …
}
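To make the canonical loop concrete, here is a minimal sketch of k-means written in that shape. It is an illustration only, not FREERIDE code: the function names `nearest` and `kmeans`, the `max_iters` parameter, and the convergence test are all assumptions introduced for this example. The reduction object R is taken to be the per-cluster coordinate sums and counts, `process(d)` is the nearest-center lookup, and `op` is addition.

```python
import math


def nearest(point, centers):
    """process(d): return the index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, centers, max_iters=20):
    """Sequential k-means in the canonical-loop form (hypothetical helper)."""
    dim = len(centers[0])
    for _ in range(max_iters):                 # outer While loop
        # Reduction object R: per-cluster coordinate sums and counts.
        sums = [[0.0] * dim for _ in centers]
        counts = [0] * len(centers)
        for d in points:                       # forall data instances d
            i = nearest(d, centers)            # I = process(d)
            for j in range(dim):               # R(I) = R(I) op d, with op = +
                sums[i][j] += d[j]
            counts[i] += 1
        # Post-processing of R: new center = mean of assigned points.
        # An empty cluster keeps its previous center.
        new_centers = [
            [s / counts[i] for s in sums[i]] if counts[i] else list(centers[i])
            for i in range(len(centers))
        ]
        if new_centers == centers:             # assumed stopping rule
            break
        centers = new_centers
    return centers
```

Because the inner loop only updates R through an associative, commutative operation, the data instances can be processed in any order, which is the property a shared middleware can exploit.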

Performance of Shared Memory Parallelization

[Chart: K-means clustering. Series for 1, 4, and 16 threads; techniques compared: full replication, optimized full locks, cache-sensitive locks. Axis range 0 to 1600.]

Performance on Cluster of SMPs

[Chart: Apriori association mining. Results for 1, 2, 4, and 8 nodes with 1, 2, and 3 threads per node. Axis range 0 to 70000.]

SPIES On (a) FREERIDE

Developed a new communication-efficient decision tree construction algorithm: Statistical Pruning of Intervals for Enhanced Scalability (SPIES)

Combines RainForest with statistical pruning of intervals of numerical attributes to reduce memory requirements and communication volume

Does not require sorting of data, or partitioning and writing-back of records

[Chart: results for 1 node and 8 nodes with 1, 2, and 3 threads. Axis range 0 to 7000.]

Broader Research Agenda

Applying FREERIDE for Scientific Data Mining

Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al.

A feature is a region of interest in a dataset

A suite of algorithms for extracting and tracking them

A Feature Analysis Algorithm

[Diagram: Classify-Aggregate feature analysis algorithm. Labels: Tour Grid, Operator, Classify Points, Aggregate, Denoise, Rank, Track, Transform, Data, ROIs, Catalog.]

Ongoing Work – Parallelization Using FREERIDE

Most of the steps involve generalized reductions, which are supported well in FREERIDE

Extensions to FREERIDE required for aggregation and tracking steps

Overall, FREERIDE can allow rapid implementation of scalable versions of a variety of steps and algorithms that are part of the feature mining paradigm
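As a rough illustration of why generalized reductions parallelize well, the sketch below counts points per class by giving each worker a private copy of the reduction object and merging the copies at the end, in the spirit of the full replication strategy. Nothing here is FREERIDE's API: `local_reduce`, `replicated_reduce`, the `classify` callback, and `num_workers` are hypothetical names chosen for the example, and the per-class count stands in for whatever a real classification step would accumulate.

```python
from concurrent.futures import ThreadPoolExecutor


def local_reduce(chunk, classify, num_classes):
    """Reduce one chunk into a private copy of the reduction object."""
    counts = [0] * num_classes
    for d in chunk:
        counts[classify(d)] += 1       # R(I) = R(I) op d, with op = +1
    return counts


def replicated_reduce(data, classify, num_classes, num_workers=4):
    """Split data into chunks, reduce each privately, then merge the copies."""
    chunk_size = max(1, -(-len(data) // num_workers))   # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(
            pool.map(lambda c: local_reduce(c, classify, num_classes), chunks)
        )
    # Global combination: element-wise sum of the private copies.
    total = [0] * num_classes
    for partial in partials:
        for k, v in enumerate(partial):
            total[k] += v
    return total
```

Python threads are used here only to show the structure; because of the interpreter lock they will not speed up CPU-bound work. The same chunk, reduce, and merge pattern applies whether the private copies live in threads on one machine or on separate nodes.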
