Multi-tier Data Access and Hierarchical Memory Design: Performance Modeling and Analysis Marwan...

Multi-tier Data Access and Hierarchical Memory Design:

Performance Modeling and Analysis Marwan Sleiman

PhD Defense

Department of Computer Science & Engineering

University of Connecticut

371 Fairfield Road Unit 2155 Storrs, CT 06269

Major advisor: Dr. Lester Lipsky

Associate advisors:

Dr. Reda Ammar

Dr. Swapna Gokhale

Dr. Chun-Hsi Huang

2

Overview of the Presentation

Introduction & previous work

Motivation and Objectives

Markov-Chain model and performance metrics

Interdependence of the hit ratios between the levels

Design Constraints

Approximation Function

Power-Tailed aspect of the memory access

Effect of increasing the cost on performance

Improving the performance while maintaining a constant cost

Optimization techniques

Performance measures

Conclusion and Future work

3

Hierarchy of Storage Systems

Storage systems are present in several forms and on different hierarchical levels; they expand the concept of the classical hierarchy beyond the local machine:

Registers, caches, main memory (RAM), disks, tapes, middle-tiers, network storage, internet storage.

Storage systems provide the basic functions of:

• storing data permanently

• holding data until it is accessed and processed

4

Hierarchical Memory Model

[Figure: the CPU connected to memory levels 1, …, i, …, n and to the main memory M, with access times X1, …, Xi, …, Xn and XM.]

5

Storage Systems and Performance

Fast memory access is vital to achieving superior system performance

Because of the gap between CPU speed and memory access time, memory access time is increasingly becoming the bottleneck of system performance

Thus applications cannot fully benefit from a processor clock-speed upgrade

Speed is expensive! => $$ cost must be optimized

6

Solution

Increasing the speed and size of the existing levels.

Inserting smaller and faster intermediate memory levels

Which one is better?

* We need to evaluate the cost and performance of each alternative

7

Previous Work

Du et al. ’00 showed the importance of the depth of the memory hierarchy as a primary factor on a cluster of workstations, but their results depend on the workload type.

D. G. Dolgikh et al. ’01 showed the importance of developing an analytical model to optimize the use of web caches.

Jin et al. ’02 developed a limited analytical model that captures only a two-level cache, and their work shows a large discrepancy between the predicted and the measured memory performance.

El-Zanfaly et al. ’04 presented an analytical model to study the performance of multi-level caches in distributed database systems.

Garcia Molina and Rege ’76 and Nagi ’06 demonstrated that, in some cases, it is more suitable to use a slower CPU for effective utilization of memory.

E. Robinson and G. Cooperman ’06 showed that, under certain conditions, it can be more efficient to discard the memory and use a disk-based architecture, which means reducing the memory hierarchy.

8

Motivation

Memory hierarchy is becoming more complex.

Memory access time differs from application to application

The average memory access time is a crucial factor in system performance, but other performance metrics and measures may also be important and must be taken into consideration.

Even when the mean time is small, for an infinite hierarchy the access time may have unbounded higher jth moments, $E[|T - T_{avg}|^j] = \infty$. This leads to long queues of memory accesses that take a long time to drain and severely affect system performance (pipelined processors and shared-memory cases).

It is necessary to develop a universal model able to cover all possible cases.

9

Our Objectives:

Several objective functions help us improve the performance of hierarchical memory systems:

1. Minimizing the mean memory access time.

2. Minimizing the memory queueing time for a given arrival rate by using the P-K formula of an M/G/1 queue.

3. Minimizing the probability of exceeding a target time, $\Pr(X > z)$.

4. Finding an approximation to the above functions by maximizing the ratio of time lag to variance for a given objective time, to minimize the width of the confidence interval and reduce the probability of exceeding the target time.

* Do these objectives have the same optima? If not, what about a trade-off? What is (are) the best optimization technique(s)?

Moments of Memory Access Time:

MARKOV-CHAIN Model

11

State Transition Diagram

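[Figure: state transition diagram with start state S0, lookup states L1, …, Li, …, Ln (times XL1, …, XLi, …, XLn), data states D1, …, Di, …, Dn (times XD1, …, XDi, …, XDn) and main memory M. Level i is hit with probability hi and missed with probability 1 − hi; all other transitions have probability 1.]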

12

Notation

n is the depth of the memory hierarchy; L = n+1 is the total number of memory levels.

P is the sub-stochastic matrix of the transition probabilities from one state to another. Its dimension is (2n+1)×(2n+1).

p is the entrance vector that corresponds to the state of the system at the first memory request. p is a row vector of size 2n+1, where n is the number of intermediate levels: p = [1 0 0 … 0].

ε' is the unit column vector of size 2n+1 (all entries equal to 1).

M is the transition rate matrix; it corresponds to the rates of leaving each state. M is a diagonal matrix of dimension (2n+1)×(2n+1).

I is the identity matrix of the same dimension as P and M.

h is the hit ratio, ĥ = 1 − h.

13

Sub-stochastic Matrices

[Matrices: P, the sub-stochastic transition matrix with entries 0, 1, h_i and 1 − h_i; M, the diagonal rate matrix built from the access times X.]

$B = M(I - P)$ and $V = B^{-1}$.

14

Access Time Calculations

Let X be the random variable denoting the memory access time.

Assuming we have memories with exponential service times, the pdf of the ith memory level is given by:

\[ f_i(x) = \mu_i e^{-\mu_i x} \]

The jth moment of the access time is given by:

\[ E(X^j) = j!\, p V^j \varepsilon' \]

The mean access time is given by the first moment and does not depend on whether the memory service time is exponential or not:

\[ E(X) = \bar{x} = p V \varepsilon' \]

The variance of the access time is given by:

\[ \sigma_e^2 = E(X^2) - [E(X)]^2 = 2\, p V^2 \varepsilon' - (p V \varepsilon')^2 \]
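The moment formula can be sketched numerically. The block below is a minimal illustration, not the authors' code: it assumes the states are ordered L1, D1, …, Ln, Dn, M (the ordering, the function names and the numeric values are ours), uses exponential stages, and relies on the feed-forward structure so that $(I - P)^{-1}$ is the finite sum $I + P + P^2 + \cdots$, which gives $V = (I - P)^{-1} M^{-1}$.

```python
from math import factorial

def build_model(hits, lookup, data, main):
    """Return (p, P, rates) for states ordered L1, D1, ..., Ln, Dn, M (assumed order)."""
    n = len(hits)
    size = 2 * n + 1
    P = [[0.0] * size for _ in range(size)]
    rates = [0.0] * size
    for i, h in enumerate(hits):
        li, di = 2 * i, 2 * i + 1
        nxt = 2 * (i + 1)        # next lookup state, or M after the last level
        P[li][di] = h            # hit: fetch the data from this level
        P[li][nxt] = 1.0 - h     # miss: go one level down
        rates[li] = 1.0 / lookup[i]
        rates[di] = 1.0 / data[i]
    rates[-1] = 1.0 / main
    p = [1.0] + [0.0] * (size - 1)   # every request starts at the first state
    return p, P, rates

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def service_matrix(P, rates):
    """V = B^{-1} = (I - P)^{-1} M^{-1}; the power series is finite for a feed-forward P."""
    size = len(P)
    identity = [[float(i == j) for j in range(size)] for i in range(size)]
    total, power = identity, identity
    for _ in range(size):
        power = matmul(power, P)
        total = [[t + q for t, q in zip(tr, qr)] for tr, qr in zip(total, power)]
    return [[total[i][j] / rates[j] for j in range(size)] for i in range(size)]

def moment(p, V, j):
    """E[X^j] = j! p V^j eps', with eps' the all-ones column vector."""
    row = p
    for _ in range(j):
        row = [sum(row[i] * V[i][k] for i in range(len(V))) for k in range(len(V))]
    return factorial(j) * sum(row)

# Illustrative one-cache system: hit ratio 0.9, lookup 1, data fetch 2, main memory 50.
p, P, rates = build_model(hits=[0.9], lookup=[1.0], data=[2.0], main=50.0)
V = service_matrix(P, rates)
mean = moment(p, V, 1)
variance = moment(p, V, 2) - mean ** 2
print(mean, variance)
```

For this example the mean can be checked by hand as 1 + 0.9·2 + 0.1·50, which is the weighted path sum the matrix formula encodes.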

15

Non Exponential Memories

Each node is represented by the vector-matrix pair <p_i, B_i>; its pdf is:

\[ f_i(x) = p_i B_i \exp(-x B_i)\, \varepsilon_i' \]

The mean access time remains the same. However, the variance differs by a correction term compared to the exponential case $\sigma_e^2$:

\[ \sigma^2 = \sigma_e^2 + p V T \Gamma \varepsilon' \]

where $\Gamma$ is a diagonal matrix of elements $\Gamma_{ii} = C_{v_i}^2 - 1$ and $T$ is the diagonal matrix of the mean stage times, $T_{ii} = \bar{x}_i$.

$C_{v_i}^2$ is the squared coefficient of variation of the non-exponential stage i.

For exponential memories, $C_{v_i}^2 - 1 = 0$.

*This is an innovation! (CATA 2006)

16

More performance metrics

Let $T_\lambda$ be the random variable denoting the system time. The Pollaczek-Khintchine formula (called the P-K formula) is used to calculate the mean time spent by a customer in an M/G/1 queue:

\[ E(T_\lambda) = \bar{x} + \frac{\rho\, \bar{x}}{1 - \rho} \cdot \frac{1 + C_v^2}{2} \]

where $C_v^2 = \sigma^2 / \bar{x}^2$ is the squared coefficient of variation, $\rho = \lambda \bar{x}$ is the utilization factor, $\bar{x} = E(X)$, and $\lambda$ is the arrival rate.

• The probability of exceeding a target time, Pr(X > z), is given by the reliability function for our hierarchical memory system:

\[ R(z) = p \exp(-z B)\, \varepsilon' \]
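The P-K formula is easy to evaluate once the mean and variance of the access time are known. The following is a small sketch under the standard M/G/1 assumptions (Poisson arrivals, one server); the function name and the numeric values are illustrative, not taken from the slides.

```python
def mg1_system_time(mean, variance, arrival_rate):
    """Mean time in an M/G/1 queue (waiting plus service), Pollaczek-Khintchine formula."""
    rho = arrival_rate * mean                 # utilization factor
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must satisfy 0 <= rho < 1")
    cv2 = variance / mean ** 2                # squared coefficient of variation
    return mean + rho * mean * (1.0 + cv2) / (2.0 * (1.0 - rho))

# Exponential service (cv2 = 1) reduces to the M/M/1 result mean / (1 - rho).
print(mg1_system_time(mean=2.0, variance=4.0, arrival_rate=0.25))
# Same mean with zero variance (deterministic service) gives a shorter system time.
print(mg1_system_time(mean=2.0, variance=0.0, arrival_rate=0.25))
```

The two calls show why the variance matters as an objective: with the same mean access time, the higher-variance memory produces a longer queueing time.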

17

Interdependence of the Hit Ratios

Let Y be the random variable representing the data fetched in memory and $M_i$ the dataset in memory level i, $M_i \subset M_{i+1}$. We define the following additional terms:

$a_i = \Pr(Y \in M_i)$: the probability of finding data in the intermediate memory level i; it depends on size, locality and working set, with $a_i \le a_{i+1} \le 1$.

$S_i$: the size of memory level i.

$C_i$: the cost per unit of size of each memory level i, $M_i$.

We assume that:

\[ a_i = 1 - \frac{1}{S_i^{\alpha}} \]

where $\alpha$ is a constant. The total cost of the L-level hierarchical system becomes:

\[ C = \sum_{i=1}^{L} C_i S_i \]

18

Interdependence of the Hit Ratios (continued)

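$h_i$ is the local hit ratio at memory level i:

\[ \text{for } i = 1, \quad h_1 = a_1 = 1 - \frac{1}{S_1^{\alpha}} \]

\[ \text{for } i > 1, \quad h_i = \Pr(Y \in M_i \mid Y \notin M_{i-1}) = \frac{a_i - a_{i-1}}{1 - a_{i-1}} = 1 - \left( \frac{S_{i-1}}{S_i} \right)^{\alpha} \]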
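The relation between the global probabilities and the local hit ratios can be checked with a few lines of code. This is a sketch assuming the power-law form $a_i = 1 - 1/S_i^{\alpha}$ for the probability of finding data in level i; the sizes and the value of alpha are illustrative.

```python
def global_hit_probabilities(sizes, alpha):
    """a_i = 1 - 1 / S_i**alpha for each intermediate level."""
    return [1.0 - size ** (-alpha) for size in sizes]

def local_hit_ratios(sizes, alpha):
    """h_1 = a_1 and h_i = (a_i - a_{i-1}) / (1 - a_{i-1}) for i > 1."""
    a = global_hit_probabilities(sizes, alpha)
    ratios = [a[0]]
    for prev, cur in zip(a, a[1:]):
        ratios.append((cur - prev) / (1.0 - prev))
    return ratios

sizes = [4.0, 16.0, 64.0]   # illustrative sizes, each level 4 times larger
alpha = 0.5                 # illustrative exponent
print(local_hit_ratios(sizes, alpha))
```

Because the sizes grow by a constant factor, every local hit ratio for i > 1 equals $1 - (S_{i-1}/S_i)^{\alpha}$ with the same value, which is the equal-hit-ratio situation in which the power-tailed behavior appears.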

19

Design Constraints

If we consider any L-level hierarchical memory with total cost C, there are constraints on the sizes of the levels we can select:

\[ 1 \le S_i \le S_{i+1} \]

For an L-level memory system with $h_i$ the hit ratio at memory level i, the constraints on the size of each memory level become:

\[ S_i \le \frac{C - \sum_{j=1,\, j \ne i}^{L} C_j S_j}{C_i} \]

For simplicity of calculations, we assume in what follows that the memory access time increases geometrically from one level to the next and the cost decreases geometrically:

\[ \frac{T_{L_{i+1}}}{T_{L_i}} = \frac{T_{D_{i+1}}}{T_{D_i}} = \text{constant}, \qquad \frac{C_{i+1}}{C_i} = \text{constant} \]

20

Power-tailed Aspect of the Memory Access Time

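A power-tailed function (also called a Pareto distribution) with parameter $\alpha$ is a function that has infinite high moments.

Its reliability function, for large z, behaves as:

\[ R(z) = \Pr(X > z) \propto z^{-\alpha} \]

\[ \text{if } \alpha \le 1, \text{ then } E(X) = \infty; \qquad \text{if } \alpha \le 2, \text{ then } E(X^2) = \infty \]

We showed that, if we have the same hit ratio, h, at all levels, the moments become unbounded as the number of levels goes to infinity:

\[ \lim_{n \to \infty} E[X^j] = \infty \quad \text{for } (1-h)\,\gamma^j \ge 1, \text{ i.e. iff } j \ge \alpha, \quad \text{where } \alpha = \frac{-\log(1-h)}{\log(\gamma)} \]

Here $\gamma$ is the factor by which the access time grows from one level to the next.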
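The divergence condition can be explored numerically. The sketch below assumes that the access time grows by a constant factor, written gamma here (the symbol and the value 2 are our assumptions), from one level to the next and that every level has the same hit ratio h. The hit ratios are the ones used in the simulations.

```python
import math

def tail_exponent(h, gamma):
    """alpha = -log(1 - h) / log(gamma)."""
    return -math.log(1.0 - h) / math.log(gamma)

def moment_diverges(h, gamma, j):
    """True when (1 - h) * gamma**j >= 1, i.e. when j >= alpha."""
    return (1.0 - h) * gamma ** j >= 1.0

for h in (0.3, 0.5, 0.7):
    alpha = tail_exponent(h, gamma=2.0)
    diverging = [j for j in (1, 2, 3) if moment_diverges(h, 2.0, j)]
    print(h, round(alpha, 3), diverging)
```

A lower hit ratio gives a smaller exponent and therefore a heavier tail: in the limit of infinitely many levels, more of the low-order moments become unbounded.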

21

Simulations

Three plots for the reliability function with power tails obtained by simulating 100,000 memory accesses

The system has 10 memory levels with hit-ratios h = 0.3, 0.5 and 0.7 with µ = 1, =1 and = 2

The slopes of the plots correspond to the slope $-\alpha$ ($\alpha > 0$) in the reliability function of the power-tailed distribution:

\[ \log[R(z)] = a - \alpha \log(z) \]

22

Log R(z) vs Log(z) plot shows power-tailed aspect of access time

23

Two-Level Cache Memory

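[Figure: state transition diagram of a two-level cache memory with start state S0, lookup states L1 and L2 (times XL1, XL2), data states D1 and D2 (times XD1, XD2) and main memory M; hit probabilities h1 and h2, miss probabilities 1 − h1 and 1 − h2.]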

24

Effect of doubling the Cost on Exponential & Non-Exponential Memories in a 2-level cache memory

Plots of the mean and variance for exponential and non-exponential 2-level memory hierarchies versus the size of the outer memory. The lines correspond to the original system and the dashed lines to a system with double the cost. The exponential case has $\Gamma = 0$; the non-exponential case has $\Gamma = 4I$.

\[ C = C_1 S_1 + C_2 S_2 \]

25

Behavior of the Hit Ratios in a 2-Level Cache Memory.

Behavior of the memory hit ratios as we change the size of the lower/outer memory level and increase the cost of the memory system. A level may become obsolete because it has a low hit ratio.

26

Queueing Time vs Access Time

Mean memory access time E(X) and mean queueing time $E(T_\lambda)$ versus the size S of the outer-level memory in a 2-level hierarchical memory system. E(X) has its minimum at S = 71, while $E(T_\lambda)$ has different minima depending on the value of $\lambda$.

There is a difference of 8% between Min[$E(T_\lambda)$] and the value of $E(T_\lambda)$ at the outer memory size that minimizes E(X)! This difference increases as the arrival rate increases.

27

Inserting an Upper Faster Level

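[Figure: state transition diagram after inserting an upper, faster level C0 (data state D0, hit ratio h0, miss ratio 1 − h0, times XL0 and XD0) in front of the levels C1 and C2 (data states D1 and D2, hit ratios h1 and h2, times XL1, XL2, XD1, XD2) and the main memory Cm, with start state S0 and the probabilities a0, a1 and a2.]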

28

Increasing the Size vs Inserting Intermediate Levels

Increasing the size of the existing levels versus inserting intermediate memory levels.

29

Exceeding a Target Time

Plots of the mean, variance and reliability of exponential 2-level and 3-level memory hierarchies versus the size of the outer memory. The straight lines correspond to 2-level memory systems and the dotted lines correspond to 3-level memory systems. The mean is plotted in blue and the probability of exceeding a target time is plotted in green.

The probability of exceeding a target time is given by the reliability of our hierarchical memory system: $R(z) = p \exp(-z B)\, \varepsilon'$

30

Effect of the memory levels on the probability of exceeding a threshold access time

Effect of the memory levels on the probability of exceeding a threshold access time on a log scale: the reliability curve is steeper when the system includes an upper-level memory. E(X) = 4.78 for the 3-level memory, E(X) = 6.91 for the 2-level memory with the upper level removed, and E(X) = 5.81 for the 2-level memory with the lower level removed.

31

Target time (continued)

Probability of exceeding a threshold access time for 2-Level and 3-Level hierarchical memories on a log scale: As the access time becomes greater than 100ns, the reliability curves become tangent to their asymptotes. E(X)= 4.78 for the 3-level memory, and E(X)=6.91 for the 2-level memory.

32

Exceeding a Target Time: asymptotic Behavior

From the spectral decomposition theorem, R(z) is given by:

\[ R(z) = p \exp(-z B)\, \varepsilon' = \sum_i (p v_i')(u_i \varepsilon')\, e^{-\beta_i z} = \sum_i a_i e^{-\beta_i z} \qquad (1) \]

where $\beta_i$ is the ith eigenvalue of the matrix B; $v_i'$ is the ith column eigenvector of B, that is $B v_i' = \beta_i v_i'$; $u_i$ is the ith row eigenvector of B, that is $u_i B = \beta_i u_i$; and $a_i = (p v_i')(u_i \varepsilon')$.

The probability of getting memory requests that take a relatively long time along this stochastic hierarchy is found from the limit of R(z) as z becomes very large; it is dominated by the mth term of R(z), the one with the smallest eigenvalue.

Let $\beta_m = \min_i[\beta_i]$; thus, for large z,

\[ R(z) \approx a_m e^{-\beta_m z} \quad \text{and} \quad \log[R(z)] \approx \log(a_m) - \beta_m z \qquad (2) \]

So if we plot the probability of exceeding time z on a semi-log scale, we find that it approaches the line $y = a - \beta_m z$, which intercepts the y-axis at the value $a = \log(a_m) = \log[(p v_m')(u_m \varepsilon')]$.
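The asymptotic behavior can be illustrated with a toy example. The sketch below takes the spectral form of R(z) as a sum of exponentials as given; the weights and eigenvalues are hypothetical numbers chosen for illustration, not values from the model, and the names are ours.

```python
import math

def reliability(z, weights, rates):
    """R(z) = sum_i a_i * exp(-beta_i * z)."""
    return sum(a * math.exp(-b * z) for a, b in zip(weights, rates))

def tail_asymptote(z, weights, rates):
    """log(a_m) - beta_m * z, with m the index of the smallest eigenvalue."""
    m = min(range(len(rates)), key=rates.__getitem__)
    return math.log(weights[m]) - rates[m] * z

weights = [0.70, 0.25, 0.05]   # hypothetical a_i, summing to 1 so that R(0) = 1
rates = [1.0, 0.2, 0.01]       # hypothetical eigenvalues beta_i
for z in (10.0, 100.0, 1000.0):
    print(z, math.log(reliability(z, weights, rates)), tail_asymptote(z, weights, rates))
```

As z grows, the two printed columns converge: the slowest stage, the one with the smallest eigenvalue, alone determines the probability of very long accesses.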

33

Exceeding a Target Time 3-D

3-D plot of the probability of exceeding a threshold time R(z) for a three-level memory versus the size of the intermediate memory levels: Sb is the index of the upper memory level and Sc is the index of the lower level. The surface is steeper along the upper-level axis because R(z) is more sensitive to the size of the upper level, and flatter along the lower-level axis because R(z) is less sensitive to the size of the lower level.

34

Optimization techniques

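Local search

Lagrange multipliers method

We assume that the ratio of the unit costs of consecutive levels, $C_i$ and $C_{i+1}$, is a power k of the ratio of their access times, $X_i$ and $X_{i+1}$, where k is the ratio of the logarithms of the two geometric factors.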

35

Analytic Solution for Minimizing E(X)

Optimizing E(X) versus the total cost: because our model is a feed-forward network, the total access time for this memory system is given as a function of the intermediate memory sizes by:

\[ E(X) = T_{tot} = f(S_1, S_2, \ldots, S_n) = \sum_{k=1}^{L} \prod_{i=1}^{k-1} (1 - h_i)\, W_k = W_1 + \frac{1}{S_1^{\alpha}} W_2 + \frac{1}{S_2^{\alpha}} W_3 + \cdots + \frac{1}{S_n^{\alpha}} W_L \]

where $W_k$ is formed from $X_{L_k}$ and $X_{D_k}$ for $k < L$, and $W_L = X_M$ for $k = L$.

So we have to optimize $f(S_1, S_2, \ldots, S_n)$ subject to the following total cost constraint:

\[ g(S_1, S_2, \ldots, S_n) = C_1 S_1 + C_2 S_2 + \cdots + C_n S_n = \sum_{i=1}^{n} C_i S_i = C \]

By using the Lagrange multipliers method, we will have:

\[ \nabla f(S_1, S_2, \ldots, S_n) = \lambda \nabla g(S_1, S_2, \ldots, S_n) \;\Rightarrow\; f_{S_i} = \lambda\, g_{S_i} \]

36

Lagrange Multipliers method and constant hit ratios

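By solving these equations, we get:

\[ S_i = \frac{C \left( W_{i+1} / C_i \right)^{\frac{1}{\alpha+1}}}{\sum_{j=1}^{n} C_j \left( W_{j+1} / C_j \right)^{\frac{1}{\alpha+1}}} \]

\[ \text{for } i = 1, \quad h_1 = 1 - \frac{1}{S_1^{\alpha}} \]

\[ \text{for } i > 1, \quad h_i = 1 - \left( \frac{S_{i-1}}{S_i} \right)^{\alpha} = 1 - \left( \frac{C_i W_i}{C_{i-1} W_{i+1}} \right)^{\frac{\alpha}{\alpha+1}} = \text{constant} \]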
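The Lagrange solution can be checked numerically. The sketch below assumes the objective $f(S) = W_1 + \sum_i W_{i+1} / S_i^{\alpha}$ with constant weights and the budget constraint $\sum_i C_i S_i = C$; it ignores the ordering constraints on the sizes, and all names and numeric values are hypothetical.

```python
def mean_access_time(sizes, weights, alpha):
    """f(S) = W_1 + sum_i W_{i+1} / S_i**alpha."""
    return weights[0] + sum(w / s ** alpha for w, s in zip(weights[1:], sizes))

def optimal_sizes(weights, unit_costs, budget, alpha):
    """Stationary point of f under sum_i C_i * S_i = budget (Lagrange multipliers)."""
    # From -alpha * W_{i+1} * S_i**-(alpha + 1) = lambda * C_i,
    # S_i is proportional to (W_{i+1} / C_i) ** (1 / (alpha + 1)).
    shares = [(w / c) ** (1.0 / (alpha + 1.0)) for w, c in zip(weights[1:], unit_costs)]
    scale = budget / sum(c * r for c, r in zip(unit_costs, shares))
    return [scale * r for r in shares]

weights = [1.0, 8.0, 64.0]   # hypothetical W_1, W_2, W_3
unit_costs = [8.0, 1.0]      # hypothetical C_1, C_2 (the faster level costs more)
budget = 512.0               # hypothetical total cost C
alpha = 0.5                  # hypothetical exponent

sizes = optimal_sizes(weights, unit_costs, budget, alpha)
print(sizes, mean_access_time(sizes, weights, alpha))
```

Because each term $W_{i+1}/S_i^{\alpha}$ is convex in $S_i$ and the constraint is linear, this stationary point is the minimum over positive sizes; moving budget from one level to the other in either direction increases the mean access time.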

37

Plot of the hit ratios at steady state (PT)

Hit ratios versus cost for a three-level hierarchical memory: h2 and h3 converge to a constant, as given by $h_i = 1 - (S_{i-1}/S_i)^{\alpha}$ for $i > 1$.

38

Difference between E(X) and $E(T_\lambda)$ for a 3-level hierarchy

Mean memory access time E(X) and mean queueing time $E(T_\lambda)$ versus the total memory cost for a 3-level hierarchical memory. The difference between the minimal queueing time and the value of the queueing time at the optimal mean memory time is more significant here and is of the order of 15%.

39

Difference between E(X) and $E(T_\lambda)$ for a 3-level hierarchy

Optimal system time, Min[$E(T_\lambda)$], and the value of $E(T_\lambda)$ at the optimal mean access time, Min[E(X)], versus the total memory cost for a 3-level hierarchical memory. The relative difference between the minimal queueing time and the value of the queueing time at the optimal mean memory time decreases as we decrease the cost.

(1) 0.87

0.072

e

40

Performance Measures

Different hierarchical memory architectures with intermediate levels at different locations: The closer the memory is to the CPU, the smaller and faster it is.

[Figure: four architectures between the CPU and the main memory m. System 1: L1, L2 and L3. System 2: L2 and L3. System 3: L1 and L3. System 4: L1 and L2.]

41

Performance Measurements

C     C1  C2  C3  Architecture         Min E(X)  StdDev at min E(X)  h1    R(8)
512   32  8   2   3-LVL                4.82      22.26               0.75  0.0958
512   32  8   2   2-LVL, L1 removed    6.96      19.66               N/A   0.231
512   32  8   2   2-LVL, L2 removed    6.13      31.74               0.93  0.0628
512   32  8   2   2-LVL, L3 removed    5.13      28.41               0.83  0.0557
512   50  10  2   3-LVL                5.6       23.29               0.75  0.1152
512   50  10  2   2-LVL, L1 removed    7.13      19.87               N/A   0.2348
512   50  10  2   2-LVL, L2 removed    6.74      29.8                0.88  0.0947
512   50  10  2   2-LVL, L3 removed    6.32      24.06               0.71  0.0895
512   72  12  2   3-LVL                7.35      29.27               0.75  0.1334
512   72  12  2   2-LVL, L1 removed    7.31      20.07               N/A   0.2382
512   72  12  2   2-LVL, L2 removed    8.0749    30.62               0.81  0.1381
512   72  12  2   2-LVL, L3 removed    7.72      33.78               0.59  0.1352
1024  32  8   2   3-LVL                3.73      18.16               0.75  0.0758
1024  32  8   2   2-LVL, L1 removed    6.49      19.11               N/A   0.217
1024  32  8   2   2-LVL, L2 removed    4.19      26.14               0.96  0.0316
1024  32  8   2   2-LVL, L3 removed    3.6       23.64               0.94  0.0312
1024  50  10  2   3-LVL                3.9171    18.9313             0.75  0.0789
1024  50  10  2   2-LVL, L1 removed    6.2       15.59               N/A   0.219
1024  50  10  2   2-LVL, L2 removed    4.04      22.11               0.94  0.0475
1024  50  10  2   2-LVL, L3 removed    3.79      21.46               0.84  0.053
1024  72  12  2   3-LVL                4.09      19.18               0.75  0.0833
1024  72  12  2   2-LVL, L1 removed    6.28      15.71               N/A   0.221
1024  72  12  2   2-LVL, L2 removed    4.69      22.74               0.91  0.0669
1024  72  12  2   2-LVL, L3 removed    4.43      23.92               0.79  0.0695

42

Observations

Observation 1: for the same cost, inserting an intermediate memory at the upper level results in a system with a lower mean time.

Observation 2: for the same cost, inserting an intermediate memory at the upper level results in a system with a smaller probability of exceeding small threshold access times and higher probability of exceeding high threshold access times.

Observation 3: for the same cost, inserting an intermediate memory at the upper level results in a system with a worse variance regardless of the distribution of the service time of the intermediate memory levels.

Observation 4: a higher variance corresponds to a lower hit ratio at the upper memory levels.

Observation 5: the variance of the memory access time is relatively high. Such a high variance can dramatically affect the performance of some architectures sensitive to a high access time, such as pipelined, decoupled, and multi-grid architectures. So it is important to consider optimizing the variance of hierarchical storage.

Observation 6: doubling the cost of the hierarchical memory has a positive effect on all the performance metrics, but in different ratios. Each performance metric responds differently to a modification of the memory architecture (number of levels, size, cost, etc.) because the metrics have different optimal points.

Observation 7: the probability of exceeding a target time is more sensitive to the upper memory level than to the lower level, and it improves at a faster rate by optimizing the upper level size than by optimizing the lower levels.

Observation 8: if cost and speed are proportional (i.e. there is a geometric relationship between the levels), we get an optimal access time when $C_i S_i = C_{i+1} S_{i+1}$, that is, when we invest the same amount in each level.

Observation 9: there is a linear relationship between the probability of going to the main memory, $P_m$, and the value of $a_m$ in equation 2. We have found that the ratio is a constant:

\[ \frac{a_m}{P_m} \approx 1.3 \]

43

Conclusions

Markov chains can model the access time of hierarchical memories.

Our analytical model is powerful, universal and flexible.

The hierarchical memory access time CAN be power-tailed.

The variance is not the same for non-exponential memory stages.

The different performance metrics don’t have the same optima => designing an optimal system is application dependent.

44

Contributions

Robust analytical model (independent of the application, number of levels and architecture)

New performance metrics

Effect of location and proximity of the memory levels

Power-tailed aspect of the memory access

45

Future Work

Running more simulations to validate our model and make sure it is realistic and reflects the real computing environment.

Including models that account for localities and working sets.

Trying different objective functions and optimizing them.

Studying the sensitivity to each performance metric and finding its effect on performance.

Trying different architectures (like decoupled architecture, dual-processor and shared memory).

Studying memory hierarchies with more levels.

More optimization techniques like NN.

46

Publications

“Moments of Memory Access Time for Systems with Hierarchical Memories,” 21st International Conference on Computers and Their Applications (CATA-2006), Seattle WA, March 2006. With Lester Lipsky and Kishori Konwar.

“Performance Modeling of Hierarchical Memories,” 19th International Conference on Computer Applications in Industry and Engineering (CAINE-2006), Las Vegas, Nevada USA, November 13-15, 2006. With Lester Lipsky and Kishori Konwar.

“Multi-channel Software-Oriented Pulse Width Modulation (SPWM),” 21st International Conference on Computers and Their Applications (CATA-2006), Seattle WA, March 2006.

“Dynamic Resource Allocation of Computer Clusters with Probabilistic Workloads,” Marwan Sleiman, Lester Lipsky, and Robert Sheahan, in the proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium, April 25-29, Rhodes Island, Greece.

“Multi-Tier Data Access & Hierarchical Memory Optimization,” submitted to the 20th International Conference on Parallel and Distributed Computing Systems. With Lester Lipsky.

“Moments and Distributions of Response Time for Systems with Hierarchical Memories,” submitted to the International Journal of Computers and Their Applications.

“Performance Metrics of Hierarchical Memories,” to be submitted to the International Journal of Computers and Their Applications.

47

The End

Questions & Suggestions?

Thank You!
