Transcript of Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs
Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs
Ruoming Jin and Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University
Outline
Motivation
Random Write Reductions and Parallelization Techniques
Problem Definition
Analytical Model: General Approach; Modeling Cache and TLB; Modeling Waiting for Locks and Memory Contention
Experimental Validation
Conclusions
Motivation
Frequently need to mine very large datasets
Large and powerful SMP machines are becoming available
Vendors often target data mining and data warehousing as the main market
Data mining emerging as an important class of applications for SMP machines
Common Processing Structure
Structure of Common Data Mining Algorithms:

    {* Outer Sequential Loop *}
    While () {
        {* Reduction Loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }
Applies to major association mining, clustering and decision tree construction algorithms
How to parallelize it on a shared memory machine?
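Before looking at the parallelization options, a minimal runnable C rendering of the loop above may help make the access pattern concrete. The element type, the process() function, and the use of addition as the combining operator are hypothetical stand-ins for whatever a particular mining algorithm uses.

    #include <stddef.h>

    #define REDUC_SIZE 100000            /* hypothetical reduction-object size */

    typedef struct { int key; double value; } element_t;  /* hypothetical input element */

    static double Reduc[REDUC_SIZE];     /* the reduction object */

    /* Hypothetical process(): maps an element to the index of the reduction
       element it must update and the value to combine into it. */
    static void process(const element_t *e, size_t *i, double *val) {
        *i = (size_t)(e->key) % REDUC_SIZE;
        *val = e->value;
    }

    /* The reduction loop: the index i is known only after processing e,
       so updates land at effectively random locations in Reduc. */
    void reduction_loop(const element_t *data, size_t n) {
        for (size_t k = 0; k < n; k++) {
            size_t i;
            double val;
            process(&data[k], &i, &val);
            Reduc[i] = Reduc[i] + val;   /* Reduc(i) = Reduc(i) op val, with op = + */
        }
    }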
Challenges in Parallelization
Statically partitioning the reduction object to avoid race conditions is generally impossible.
Runtime preprocessing or scheduling also cannot be applied – can't tell what you need to update without processing the element
The size of the reduction object means significant memory overheads for replication
Locking and synchronization costs could be significant because of the fine-grained updates to the reduction object.
Parallelization Techniques
Full Replication: create a copy of the reduction object for each thread
Full Locking: associate a lock with each element
Optimized Full Locking: put the element and corresponding lock on the same cache block
Cache Sensitive Locking: one lock for all elements in a cache block
Memory Layout for Locking Schemes
[Diagram: in optimized full locking, each cache block holds one reduction element and its lock; in cache-sensitive locking, each cache block holds one lock and several reduction elements. Legend: lock, reduction element.]
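In C, the two layouts can be sketched roughly as follows. The 64-byte cache block and 8-byte reduction elements are illustrative assumptions, not the paper's exact parameters.

    #define CACHE_BLOCK 64               /* assumed cache-block size in bytes */

    /* Optimized full locking: each reduction element shares a cache block
       with its own lock, so one miss brings in both. */
    typedef struct {
        double       element;            /* the reduction element */
        volatile int lock;               /* its lock, on the same cache block */
        char pad[CACHE_BLOCK - sizeof(double) - sizeof(int)];  /* fill out the block */
    } ofl_entry_t;

    /* Cache-sensitive locking: one lock guards all the reduction elements
       on its cache block, cutting the per-element memory overhead. */
    typedef struct {
        volatile int lock;               /* one lock per cache block */
        double elements[(CACHE_BLOCK - sizeof(int)) / sizeof(double)];  /* 7 elements
                                            with the assumed sizes */
    } csl_entry_t;

Both structs come out to exactly one assumed cache block (64 bytes); in practice an array of these entries would also need to be allocated on a cache-block boundary for the layout to hold.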
Relative Experimental Performance
Relative Performance of Full Replication, Optimized Full Locking and Cache-Sensitive Locking
[Chart: execution time (s), 0 to 100,000, versus support level (0.10%, 0.05%, 0.03%, 0.02%) for full replication (fr), optimized full locking (ofl), and cache-sensitive locking (csl).]
Different techniques can outperform each other depending upon problem and machine parameters
Problem Definition
Can we predict the relative performance of different techniques for given machine, algorithm and dataset parameters?
Develop an analytical model capturing the impact of memory hierarchy and modeling different parallelization overheads
Other applications of the model:
Predicting speedups possible on parallel configurations
Predicting performance as the output size is increased
Scheduling and QoS in multiprogrammed environments
Choosing accuracy of analysis and sampling rate in an interactive environment or when mining over data streams
Context
Part of the FREERIDE (Framework for Rapid Implementation of Datamining Engines) system
Support parallelization on shared-nothing configurations
Support parallelization on shared memory configurations
Support processing of large datasets
Previously reported our work on parallelization techniques and processing of disk-resident datasets (SDM 01, SDM 02)
Analytical Model: Overview
Input data is read from disks – constant processing time
Reduction elements are accessed randomly – their size can vary considerably
Factors to model:
Cache misses on reduction elements – capacity and coherence
TLB misses on reduction elements
Waiting time for locks
Memory contention
Basic Approach
Focus on modeling reduction loops:
T_loop = T_average × N
T_average = T_compute + T_reduc
T_reduc = T_update + T_wait + T_cache_miss + T_tlb_miss + T_memory_contention
T_update can be computed by executing the loop with a reduction object that fits into the L1 cache
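Assembled into code, the model is a straightforward sum. The sketch below simply paraphrases the slide's terms; all per-iteration costs are assumed to be in one common unit, such as cycles.

    /* Per-iteration cost components of the reduction loop (e.g., in cycles). */
    typedef struct {
        double t_compute;        /* independent computation per element */
        double t_update;         /* updating the reduction element, measured with
                                    a reduction object that fits in the L1 cache */
        double t_wait;           /* waiting for locks */
        double t_cache_miss;     /* cache-miss penalty on reduction elements */
        double t_tlb_miss;       /* TLB-miss penalty on reduction elements */
        double t_mem_contention; /* memory-contention penalty */
    } cost_model_t;

    /* T_loop = T_average * N, with T_average = T_compute + T_reduc. */
    double predict_loop_time(const cost_model_t *c, double n_elements) {
        double t_reduc = c->t_update + c->t_wait + c->t_cache_miss
                       + c->t_tlb_miss + c->t_mem_contention;
        return (c->t_compute + t_reduc) * n_elements;
    }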
Modeling Waiting time for Locks
The time spent by a thread in one iteration of the loop can be divided into three components:
computing independently (a)
waiting for a lock (T_wait)
holding a lock (b), where b = T_reduc − T_wait
Each lock is an M/D/1 queue
The rate at which requests to acquire a given lock are issued is t / ((a + b + T_wait) × m), for t threads and m locks
Modeling Waiting Time for Locks
Standard result on M/D/1 queues:
T_wait = bU / (2(1 − U))
where U is the server utilization; taking the request rate as t / ((a + b)m), i.e., neglecting T_wait, gives U = tb / ((a + b)m)
Substituting, the result on T_wait is:
T_wait = b / (2((a/b + 1)(m/t) − 1))
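Transcribing the closed form directly (a and b per iteration as defined above, m locks, t threads), as a sketch:

    #include <assert.h>

    /* Closed-form M/D/1 estimate of the time a thread waits for a lock.
       a: time computing independently per iteration
       b: time holding a lock per iteration (T_reduc - T_wait)
       m: number of locks, t: number of threads.
       Valid only while the locks are unsaturated, i.e. (a/b + 1)*(m/t) > 1. */
    double lock_wait_time(double a, double b, double m, double t) {
        double denom = 2.0 * ((a / b + 1.0) * (m / t) - 1.0);
        assert(denom > 0.0);   /* otherwise utilization >= 1 and the queue diverges */
        return b / denom;
    }

Note how the formula diverges as utilization approaches one, i.e., as the ratio of locks to threads m/t shrinks toward b/(a + b); this is what makes fine-grained (full or cache-sensitive) locking scale where a single coarse lock would not.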
Modeling Memory Hierarchy
Need to model: L1 cache, L2 cache, TLB misses
Ignore cold misses
Only consider directly-mapped caches – analyze capacity and conflict misses together
Simple analysis for capacity and conflict misses because of random accesses to the reduction object
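The slide does not reproduce the analysis, but for uniformly random accesses through a cache of size C to a reduction object of size S, a first-order estimate of the miss probability is 1 − C/S when S > C, and 0 otherwise. The sketch below encodes that estimate; it is an illustrative assumption standing in for the paper's more detailed treatment, not the paper's exact formula.

    /* First-order miss-probability estimate for uniformly random accesses
       to a reduction object of s bytes through a directly-mapped cache of
       c bytes. Illustrative assumption only: the paper's model is more
       detailed, and the same style of estimate would apply to the TLB
       with the TLB reach in place of the cache size. */
    double random_access_miss_prob(double s, double c) {
        if (s <= c) return 0.0;   /* object fits: no capacity misses */
        return 1.0 - c / s;       /* fraction of the object not resident */
    }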
Modeling Coherence Cache Misses
A coherence miss occurs when a cache block is invalidated by another CPU
Analyze the probability that, between two accesses to a cache block on a processor, the same memory block is accessed and this memory block is not updated by one of the other processors in the meantime
Details are available in the paper
Modeling Memory Contention
Input elements displace reduction objects from cache
Results in a write-back followed by a read operation
The memory system on many machines requires extra cycles to switch between write-back and read operations
Source of contention
Model using M/D/1 queues, similar to the waiting time for locks
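Since the same M/D/1 machinery is reused, a generic waiting-time helper covers it. Treating the write-back/read switch penalty as the deterministic service time and the rate of such switches as the arrival rate is an assumption about how the terms map, not the paper's exact formulation.

    /* Generic M/D/1 waiting time: deterministic service time s,
       utilization u = lambda * s, with u < 1. */
    double md1_wait(double s, double u) {
        return s * u / (2.0 * (1.0 - u));
    }

    /* Hypothetical use for memory contention: the "server" is the memory
       system, the service time is the extra cycles needed to switch between
       a write-back and a read, and switch_rate is the rate at which all
       threads together generate such switches. */
    double memory_contention_wait(double switch_cycles, double switch_rate) {
        double u = switch_rate * switch_cycles;   /* utilization, must be < 1 */
        return md1_wait(switch_cycles, u);
    }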
Experimental Platform
Small SMP machine: Sun Ultra Enterprise 450, 4 × 250 MHz Ultra-II processors, 1 GB of 4-way interleaved main memory
Large SMP machine: Sun Fire 6800, 24 × 900 MHz Sun UltraSparc III processors, a 96 KB L1 cache and a 64 MB L2 cache per processor, 24 GB main memory
Impact of Memory Hierarchy, Large SMP
Measured and predicted performance as the size of the reduction object is scaled
[Charts for full replication, optimized full locking, and cache-sensitive locking.]
Modeling Parallel Performance with Locking, Large SMP
Parallel performance with cache-sensitive locking, small reduction object sizes
[Curves for 1, 2, 4, 8, and 12 threads.]
Modeling Parallel Performance, Large SMP
Performance of optimized full locking with large reduction object sizes
[Curves for 1, 2, 4, 8, and 12 threads.]
How good is the model in predicting relative performance? (Large SMP)
Performance of optimized full locking and cache-sensitive locking (12 threads)
Impact of Memory Hierarchy, Small SMP
Measured and predicted performance as the size of the reduction object is scaled
[Charts for full replication, optimized full locking, and cache-sensitive locking.]
Parallel Performance, Small SMP
Performance of optimized full locking
[Curves for 1, 2, and 3 threads.]
Summary
A new application of performance modeling: choosing among different parallelization techniques
Detailed analytical model capturing memory hierarchy and parallelization overheads
Evaluated on two different SMP machines
Predicted performance within 20% in almost all cases
Effectively captures the impact of both memory hierarchy and parallelization overheads
Quite accurate in predicting the relative performance of different techniques