Transcript of Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs
Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs
Ruoming Jin and Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University
Outline
Motivation
Random Write Reductions and Parallelization Techniques
Problem Definition
Analytical Model: General Approach; Modeling Cache and TLB; Modeling Waiting for Locks and Memory Contention
Experimental Validation
Conclusions
Motivation
Frequently need to mine very large datasets
Large and powerful SMP machines are becoming available
Vendors often target data mining and data warehousing as the main market
Data mining emerging as an important class of applications for SMP machines
Common Processing Structure
Structure of Common Data Mining Algorithms:

    {* Outer Sequential Loop *}
    While () {
        {* Reduction Loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }
Applies to major association mining, clustering and decision tree construction algorithms
How to parallelize it on a shared memory machine?
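Before looking at the parallelization options, a minimal runnable C rendering of the loop above may help make the access pattern concrete. The element type, the process() function, and the use of addition as the combining operator are hypothetical stand-ins for whatever a particular mining algorithm uses.

    #include <stddef.h>

    #define REDUC_SIZE 100000            /* hypothetical reduction-object size */

    typedef struct { int key; double value; } element_t;  /* hypothetical input element */

    static double Reduc[REDUC_SIZE];     /* the reduction object */

    /* Hypothetical process(): maps an element to the index of the reduction
       element it must update and the value to combine into it. */
    static void process(const element_t *e, size_t *i, double *val) {
        *i = (size_t)(e->key) % REDUC_SIZE;
        *val = e->value;
    }

    /* The reduction loop: the index i is known only after processing e,
       so updates land at effectively random locations in Reduc. */
    void reduction_loop(const element_t *data, size_t n) {
        for (size_t k = 0; k < n; k++) {
            size_t i;
            double val;
            process(&data[k], &i, &val);
            Reduc[i] = Reduc[i] + val;   /* Reduc(i) = Reduc(i) op val, with op = + */
        }
    }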
Challenges in Parallelization
Statically partitioning the reduction object to avoid race conditions is generally impossible.
Runtime preprocessing or scheduling also cannot be applied – can't tell what you need to update without processing the element
The size of the reduction object means significant memory overheads for replication
Locking and synchronization costs could be significant because of the fine-grained updates to the reduction object.
Parallelization Techniques
Full Replication: create a copy of the reduction object for each thread
Full Locking: associate a lock with each element
Optimized Full Locking: put the element and corresponding lock on the same cache block
Cache Sensitive Locking: one lock for all elements in a cache block
Memory Layout for Locking Schemes
[Diagram: in optimized full locking, each cache block holds one reduction element and its lock; in cache-sensitive locking, each cache block holds one lock and several reduction elements. Legend: lock, reduction element.]
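In C, the two layouts can be sketched roughly as follows. The 64-byte cache block and 8-byte reduction elements are illustrative assumptions, not the paper's exact parameters.

    #define CACHE_BLOCK 64               /* assumed cache-block size in bytes */

    /* Optimized full locking: each reduction element shares a cache block
       with its own lock, so one miss brings in both. */
    typedef struct {
        double       element;            /* the reduction element */
        volatile int lock;               /* its lock, on the same cache block */
        char pad[CACHE_BLOCK - sizeof(double) - sizeof(int)];  /* fill out the block */
    } ofl_entry_t;

    /* Cache-sensitive locking: one lock guards all the reduction elements
       on its cache block, cutting the per-element memory overhead. */
    typedef struct {
        volatile int lock;               /* one lock per cache block */
        double elements[(CACHE_BLOCK - sizeof(int)) / sizeof(double)];  /* 7 elements
                                            with the assumed sizes */
    } csl_entry_t;

Both structs come out to exactly one assumed cache block (64 bytes); in practice an array of these entries would also need to be allocated on a cache-block boundary for the layout to hold.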
Relative Experimental Performance
Relative Performance of Full Replication, Optimized Full Locking and Cache-Sensitive Locking
[Chart: execution time (s), 0 to 100,000, versus support level (0.10%, 0.05%, 0.03%, 0.02%) for full replication (fr), optimized full locking (ofl), and cache-sensitive locking (csl).]
Different techniques can outperform each other depending upon problem and machine parameters
Problem Definition
Can we predict the relative performance of different techniques for given machine, algorithm and dataset parameters?
Develop an analytical model capturing the impact of memory hierarchy and modeling different parallelization overheads
Other applications of the model:
Predicting speedups possible on parallel configurations
Predicting performance as the output size is increased
Scheduling and QoS in multiprogrammed environments
Choosing accuracy of analysis and sampling rate in an interactive environment or when mining over data streams
Context
Part of the FREERIDE (Framework for Rapid Implementation of Datamining Engines) system
Support parallelization on shared-nothing configurations
Support parallelization on shared memory configurations
Support processing of large datasets
Previously reported our work on parallelization techniques and processing of disk-resident datasets (SDM 01, SDM 02)
Analytical Model: Overview
Input data is read from disks – constant processing time
Reduction elements are accessed randomly – their size can vary considerably
Factors to model:
Cache misses on reduction elements – capacity and coherence
TLB misses on reduction elements
Waiting time for locks
Memory contention
Basic Approach
Focus on modeling reduction loops:
T_loop = T_average × N
T_average = T_compute + T_reduc
T_reduc = T_update + T_wait + T_cache_miss + T_tlb_miss + T_memory_contention
T_update can be computed by executing the loop with a reduction object that fits into the L1 cache
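Assembled into code, the model is a straightforward sum. The sketch below simply paraphrases the slide's terms; all per-iteration costs are assumed to be in one common unit, such as cycles.

    /* Per-iteration cost components of the reduction loop (e.g., in cycles). */
    typedef struct {
        double t_compute;        /* independent computation per element */
        double t_update;         /* updating the reduction element, measured with
                                    a reduction object that fits in the L1 cache */
        double t_wait;           /* waiting for locks */
        double t_cache_miss;     /* cache-miss penalty on reduction elements */
        double t_tlb_miss;       /* TLB-miss penalty on reduction elements */
        double t_mem_contention; /* memory-contention penalty */
    } cost_model_t;

    /* T_loop = T_average * N, with T_average = T_compute + T_reduc. */
    double predict_loop_time(const cost_model_t *c, double n_elements) {
        double t_reduc = c->t_update + c->t_wait + c->t_cache_miss
                       + c->t_tlb_miss + c->t_mem_contention;
        return (c->t_compute + t_reduc) * n_elements;
    }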
Modeling Waiting time for Locks
The time spent by a thread in one iteration of the loop can be divided into three components:
computing independently (a)
waiting for a lock (T_wait)
holding a lock (b), where b = T_reduc − T_wait
Each lock is an M/D/1 queue
The rate at which requests to acquire a given lock are issued is t / ((a + b + T_wait) × m), for t threads and m locks
Modeling Waiting Time for Locks
Standard result on M/D/1 queues:
T_wait = bU / (2(1 − U))
where U is the server utilization; taking the request rate as t / ((a + b)m), i.e., neglecting T_wait, gives U = tb / ((a + b)m)
Substituting, the result on T_wait is:
T_wait = b / (2((a/b + 1)(m/t) − 1))
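Transcribing the closed form directly (a and b per iteration as defined above, m locks, t threads), as a sketch:

    #include <assert.h>

    /* Closed-form M/D/1 estimate of the time a thread waits for a lock.
       a: time computing independently per iteration
       b: time holding a lock per iteration (T_reduc - T_wait)
       m: number of locks, t: number of threads.
       Valid only while the locks are unsaturated, i.e. (a/b + 1)*(m/t) > 1. */
    double lock_wait_time(double a, double b, double m, double t) {
        double denom = 2.0 * ((a / b + 1.0) * (m / t) - 1.0);
        assert(denom > 0.0);   /* otherwise utilization >= 1 and the queue diverges */
        return b / denom;
    }

Note how the formula diverges as utilization approaches one, i.e., as the ratio of locks to threads m/t shrinks toward b/(a + b); this is what makes fine-grained (full or cache-sensitive) locking scale where a single coarse lock would not.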
Modeling Memory Hierarchy
Need to model: L1 cache, L2 cache, TLB misses
Ignore cold misses
Only consider directly-mapped caches – analyze capacity and conflict misses together
Simple analysis for capacity and conflict misses because of random accesses to the reduction object
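The slide does not reproduce the analysis, but for uniformly random accesses through a cache of size C to a reduction object of size S, a first-order estimate of the miss probability is 1 − C/S when S > C, and 0 otherwise. The sketch below encodes that estimate; it is an illustrative assumption standing in for the paper's more detailed treatment, not the paper's exact formula.

    /* First-order miss-probability estimate for uniformly random accesses
       to a reduction object of s bytes through a directly-mapped cache of
       c bytes. Illustrative assumption only: the paper's model is more
       detailed, and the same style of estimate would apply to the TLB
       with the TLB reach in place of the cache size. */
    double random_access_miss_prob(double s, double c) {
        if (s <= c) return 0.0;   /* object fits: no capacity misses */
        return 1.0 - c / s;       /* fraction of the object not resident */
    }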
Modeling Coherence Cache Misses
A coherence miss occurs when a cache block is invalidated by another CPU
Analyze the probability that, between two accesses to a cache block on a processor, the same memory block is accessed and this memory block is not updated by one of the other processors in the meantime
Details are available in the paper
Modeling Memory Contention
Input elements displace reduction objects from cache
Results in a write-back followed by a read operation
The memory system on many machines requires extra cycles to switch between write-back and read operations
Source of contention
Model using M/D/1 queues, similar to the waiting time for locks
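Since the same M/D/1 machinery is reused, a generic waiting-time helper covers it. Treating the write-back/read switch penalty as the deterministic service time and the rate of such switches as the arrival rate is an assumption about how the terms map, not the paper's exact formulation.

    /* Generic M/D/1 waiting time: deterministic service time s,
       utilization u = lambda * s, with u < 1. */
    double md1_wait(double s, double u) {
        return s * u / (2.0 * (1.0 - u));
    }

    /* Hypothetical use for memory contention: the "server" is the memory
       system, the service time is the extra cycles needed to switch between
       a write-back and a read, and switch_rate is the rate at which all
       threads together generate such switches. */
    double memory_contention_wait(double switch_cycles, double switch_rate) {
        double u = switch_rate * switch_cycles;   /* utilization, must be < 1 */
        return md1_wait(switch_cycles, u);
    }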
Experimental Platform
Small SMP machine: Sun Ultra Enterprise 450, 4 × 250 MHz Ultra-II processors, 1 GB of 4-way interleaved main memory
Large SMP machine: Sun Fire 6800, 24 × 900 MHz Sun UltraSparc III processors, a 96 KB L1 cache and a 64 MB L2 cache per processor, 24 GB main memory
Impact of Memory Hierarchy, Large SMP
Measured and predicted performance as the size of the reduction object is scaled
[Charts for full replication, optimized full locking, and cache-sensitive locking.]
Modeling Parallel Performance with Locking, Large SMP
Parallel performance with cache-sensitive locking, small reduction object sizes
[Curves for 1, 2, 4, 8, and 12 threads.]
Modeling Parallel Performance, Large SMP
Performance of optimized full locking with large reduction object sizes
[Curves for 1, 2, 4, 8, and 12 threads.]
How good is the model in predicting relative performance? (Large SMP)
Performance of optimized full locking and cache-sensitive locking (12 threads)
Impact of Memory Hierarchy, Small SMP
Measured and predicted performance as the size of the reduction object is scaled
[Charts for full replication, optimized full locking, and cache-sensitive locking.]
Parallel Performance, Small SMP
Performance of optimized full locking
[Curves for 1, 2, and 3 threads.]
Summary
A new application of performance modeling: choosing among different parallelization techniques
Detailed analytical model capturing memory hierarchy and parallelization overheads
Evaluated on two different SMP machines
Predicted performance within 20% in almost all cases
Effectively captures the impact of both memory hierarchy and parallelization overheads
Quite accurate in predicting the relative performance of different techniques