Compiler and Runtime Support for Shared Memory Parallelization of Data Mining Algorithms

Xiaogang Li, Ruoming Jin, Gagan Agrawal

Department of Computer and Information Sciences, Ohio State University

Motivation

- Languages, compilers, and runtime systems for high-end computing typically focus on scientific applications
- Can commercial applications benefit?
  - A majority of top 500 parallel configurations are used as database servers
- Is there a role for parallel systems research?
  - Parallel relational databases: probably not
  - Data mining, decision support: quite likely

Data Mining

- Extracting useful models or patterns from large datasets
- Includes a variety of tasks: mining associations and sequences, clustering data, building decision trees and predictive models; several algorithms have been proposed for each
- Both compute and data intensive
- Algorithms are well suited for parallel execution
- High-level interfaces can be useful for application development

Project Overview

[Figure: project components. Data Parallel Java (compiler techniques), FREERIDE middleware (runtime techniques), MPI + POSIX threads + file I/O, clusters of SMPs]

Outline

- Key observation from mining algorithms
- Parallelization techniques
- Middleware support and interface
- Language interface and compilation techniques
- Experimental results
  - K-means
  - Apriori
- Summary

Common Processing Structure

Structure of common data mining algorithms:

    {* Outer sequential loop *}
    While () {
        {* Reduction loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }

Applies to major association mining, clustering, and decision tree construction algorithms.
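The reduction loop above can be sketched in C++ as a single sequential pass. This is only an illustration under stated assumptions: the name `reduction_pass`, the use of `+` as the operator `op`, and the `double` element type are not from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One pass of the reduction loop. process(e) returns (index, value), and the
// reduction object is updated with an associative, commutative operator
// (here +, chosen for illustration).
template <typename Element, typename Process>
std::vector<double> reduction_pass(const std::vector<Element>& data,
                                   std::size_t reduc_size, Process process) {
  std::vector<double> reduc(reduc_size, 0.0);
  for (const Element& e : data) {
    // The index to update is known only after the element is processed.
    const std::pair<std::size_t, double> update = process(e);
    assert(update.first < reduc.size());
    reduc[update.first] += update.second;
  }
  return reduc;
}
```

Because the target index comes out of `process(e)`, the updates cannot be partitioned among threads ahead of time.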

Challenges in Parallelization

- Statically partitioning the reduction object to avoid race conditions is generally impossible.
- Runtime preprocessing or scheduling also cannot be applied: you can't tell what you need to update without processing the element.
- The size of the reduction object means significant memory overheads for replication.
- Locking and synchronization costs could be significant because of the fine-grained updates to the reduction object.

Parallelization Techniques

- Full Replication: create a copy of the reduction object for each thread
- Full Locking: associate a lock with each element
- Cache-Sensitive Locking: one lock for all elements in a cache block
- Optimized Full Locking: put the element and corresponding lock on the same cache block
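As a minimal sketch of Full Replication, the following C++ function counts values into bins with one private copy of the reduction object per thread and merges the copies at the end. The function name, the histogram workload, and the cyclic assignment of elements to threads are assumptions made for illustration; this is not the middleware's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Full replication: each thread updates its own copy of the reduction object
// without locks; the copies are combined after all threads have finished.
// Input values are assumed to be non-negative.
std::vector<long> histogram_full_replication(const std::vector<int>& data,
                                             std::size_t bins,
                                             unsigned num_threads) {
  assert(bins > 0 && num_threads > 0);
  std::vector<std::vector<long>> copies(num_threads,
                                        std::vector<long>(bins, 0));
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      // Thread t handles elements t, t + num_threads, t + 2 * num_threads, ...
      for (std::size_t n = t; n < data.size(); n += num_threads) {
        const std::size_t i = static_cast<std::size_t>(data[n]) % bins;
        copies[t][i] += 1;  // private copy, so no race condition
      }
    });
  }
  for (std::thread& w : workers) w.join();

  // Merge step: add up the per-thread copies.
  std::vector<long> result(bins, 0);
  for (const std::vector<long>& copy : copies)
    for (std::size_t i = 0; i < bins; ++i) result[i] += copy[i];
  return result;
}
```

The memory cost grows with the number of threads times the size of the reduction object, which is the overhead the locking schemes try to avoid.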

Memory Layout for Various Locking Schemes

[Figure: memory layouts for Full Locking, Optimized Full Locking, and Cache-Sensitive Locking. Legend: lock, reduction element]
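The layouts can be approximated with C++ structs. This is a rough sketch, not the layout used by the middleware: the 64-byte cache block, the spin lock built on `std::atomic_flag`, the `double` element type, and all names are assumptions, and C++17 is assumed so that `std::vector` honors the over-aligned type.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kCacheBlock = 64;  // assumed cache block size in bytes

// Full locking would keep locks and elements in two separate arrays, so one
// update may touch two cache blocks. The two structs below avoid that.

// Optimized full locking: each element is stored right next to its own lock.
struct OptimizedSlot {
  std::atomic_flag lock = ATOMIC_FLAG_INIT;
  double value = 0.0;
};

// Cache-sensitive locking: one lock guards every element in its cache block.
constexpr std::size_t kPerBlock = kCacheBlock / sizeof(double) - 1;

struct alignas(kCacheBlock) CacheSensitiveBlock {
  std::atomic_flag lock = ATOMIC_FLAG_INIT;
  double values[kPerBlock] = {};
};
static_assert(sizeof(CacheSensitiveBlock) == kCacheBlock,
              "lock and its elements should fill exactly one cache block");

// Add val to logical element `index` under the lock of its cache block.
void add_cache_sensitive(std::vector<CacheSensitiveBlock>& blocks,
                         std::size_t index, double val) {
  assert(index / kPerBlock < blocks.size());
  CacheSensitiveBlock& b = blocks[index / kPerBlock];
  while (b.lock.test_and_set(std::memory_order_acquire)) {
    // spin until the lock is free
  }
  b.values[index % kPerBlock] += val;
  b.lock.clear(std::memory_order_release);
}
```

With these assumptions one lock covers seven elements, so the lock overhead is one eighth of the block, compared with roughly half of the memory for `OptimizedSlot`.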

Middleware Support for Shared Memory Parallelization

Interface requires:
- Specification of an iterator and termination condition
- Local reduction for each parallel loop

Functionality:
- Fetch data elements chunk by chunk, apply local reduction
- Parallelization and synchronization
- Global reduction for all threads
- Check termination condition, move to next iteration

Example: K-means Clustering Algorithm

Problem:
- Given N points in a metric space and a distance function.
- Try to find K centers and assign each point to one of these centers.
- Minimize the total distance between each point and the center it belongs to.

Algorithm:

    Make initial guesses for the centers m1, m2, ..., mk
    Until there are no changes in any center
        Use the estimated centers to classify the points into clusters
        For i from 1 to k
            Replace mi with the mean of all of the points for cluster i
        end_for
    end_until
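A sequential C++ sketch of the algorithm follows. The squared Euclidean distance, the `max_iters` safety cap, and keeping the center of an empty cluster unchanged are assumptions added here; the slides do not specify them.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

// Squared Euclidean distance (the choice of distance is an assumption).
double squared_distance(const Point& a, const Point& b) {
  double sum = 0.0;
  for (std::size_t d = 0; d < a.size(); ++d)
    sum += (a[d] - b[d]) * (a[d] - b[d]);
  return sum;
}

// One iteration: classify every point, then replace each center with the
// mean of its cluster. Returns true if any center changed.
bool kmeans_step(const std::vector<Point>& points, std::vector<Point>& centers) {
  assert(!centers.empty());
  const std::size_t k = centers.size();
  const std::size_t ndim = centers[0].size();
  std::vector<Point> sums(k, Point(ndim, 0.0));
  std::vector<std::size_t> counts(k, 0);
  for (const Point& p : points) {
    std::size_t best = 0;
    double best_dist = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < k; ++i) {
      const double dist = squared_distance(p, centers[i]);
      if (dist < best_dist) { best_dist = dist; best = i; }
    }
    for (std::size_t d = 0; d < ndim; ++d) sums[best][d] += p[d];
    ++counts[best];
  }
  bool changed = false;
  for (std::size_t i = 0; i < k; ++i) {
    if (counts[i] == 0) continue;  // empty cluster: keep its center as is
    for (std::size_t d = 0; d < ndim; ++d) {
      const double mean = sums[i][d] / static_cast<double>(counts[i]);
      if (mean != centers[i][d]) { centers[i][d] = mean; changed = true; }
    }
  }
  return changed;
}

// Repeat until no center changes, or until max_iters iterations have run.
void kmeans(const std::vector<Point>& points, std::vector<Point>& centers,
            int max_iters = 100) {
  for (int it = 0; it < max_iters && kmeans_step(points, centers); ++it) {
  }
}
```

The per-cluster sums and counts play the role of the reduction object: the cluster a point updates is known only after its nearest center has been found.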

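Programming Interface: k-means example

Initialization Function

    void Kmeans::initialize() {
        for (int i = 0; i < k; i++) {
            /* ndim coordinate sums plus two extra slots (ndim and ndim + 1),
               which the local reduction uses for a count and a distance sum */
            clusterID[i] = reductionobject->alloc(ndim + 2);
        }
        /* Initialize centers */
    }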

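k-means example (contd.)

Local Reduction Function

    void Kmeans::reduction(void *point) {
        /* Find the nearest center */
        for (int i = 0; i < k; i++) {
            dis = distance(point, i);
            if (dis < min) {
                min = dis;
                min_index = i;
            }
        }
        /* Assign the point to that center. objectID is assumed to be the
           group allocated for the center in initialize(). */
        objectID = clusterID[min_index];
        for (int j = 0; j < ndim; j++)
            reductionobject->Add(objectID, j, point[j]);
        reductionobject->Add(objectID, ndim, 1);
        /* min, not dis: dis holds the distance to the last center tested */
        reductionobject->Add(objectID, ndim + 1, min);
    }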

Outline

- Key observation from mining algorithms
- Middleware support for shared memory parallelization
- Interface and compilation techniques
- Experimental results
  - K-means
  - Apriori
- Summary

Language Support

A data parallel dialect of Java that gives the compiler information about independent collections of objects, parallel loops, and reduction operations:

- domain and rectdomain
- foreach loop
- reduction interface:
  - can only be updated inside a foreach loop by operations that are associative and commutative
  - intermediate values of the reduction variables may not be used within the loop, except for self-updates

K-means Clustering Expressed in Data Parallel Java

    public class Kmeans {
        public static void main(String[] args) {
            // Input data
            RectDomain<1> InputDomain = [lowend:hiend];
            KmPoint[1d] Input = new KmPoint[InputDomain];
            while (not_converged) {
                // Reduction loop
                foreach (p in InputDomain) {
                    min = MIN_NUMBER;
                    for (i = 0; i < k; i++) {
                        int dis = kcenter.distance(Input[p], i);
                        if (dis < min) {
                            min = dis;
                            minindex = i;
                        }
                    }
                    kcenter.assign(Input[p], minindex, min);
                }
                kcenter.finalizing();
            }
        }
    }

Tasks of Compilation

- Map the reduction interface in our dialect of Java to the reduction object used by the middleware
  - Parallelization techniques are transparent to the compiler because it works through the reduction object
- Extract the important functions from the Java code to fit into our middleware
  - Data fetching
  - Local reduction
  - Iterator and termination condition

Mapping of Reduction Interface

- Decide the size of the reduction object to be allocated
  - From the declaration information of the reduction interface
  - By symbolic analysis if the size cannot be decided statically
- Allocate the reduction object
  - Layout can be block or cyclic
- Change references to and modifications of members into accesses to the corresponding elements of the reduction object

Example: x[1]=0 becomes (*reductionElement)(reduct_buffer,1)=0

Extract Important Functions

- Local reduction function
  - Taken from the body of the data parallel loop
  - Commutative and associative operations on the reduction interface are replaced by operators of the reduction object
  - Example: meansx1[i]+=Input[p].x1 becomes reducObject->Add(reduct_buffer, i, Input[p].x1)
- Iterator and termination
  - Simple to obtain from the overall code
- Data fetching function
  - Derived from the declaration of the input class
  - Uses the constructor of the input class to provide additional information

Results

Relative performance of Full Replication, Optimized Full Locking, and Cache-Sensitive Locking: 4 threads, different support levels

- Cache-Sensitive Locking outperforms Full Replication and Optimized Full Locking as the size of the reduction object increases
- Full Replication achieves the best result when the size of the reduction object is small

Results

Comparison of compiler-generated and manual versions: Apriori Association Mining (1 GB dataset)

Results

Comparison of compiler-generated and manual versions: K-means Clustering (1 GB dataset, K=100)

Conclusion

- Provides runtime and compiler support for shared memory parallelization of data mining applications
  - Different parallelization techniques
  - Middleware support simplifies code generation
  - Compiler-generated code is competitive
