Performance Architecture and Platform Positioning


Transcript of Performance Architecture and Platform Positioning

Page 1: Performance Architecture and Platform Positioning

Performance Architecture and Platform Positioning

Joe Temple ([email protected]) IBM

August 6, 2012

11847

Page 2: Performance Architecture and Platform Positioning

Common Metrics

•  There have been many attempts to create “a common metric” for the relative capacity of servers

•  There are even vendors, such as Ideas International (RPE) and IDC (QPI), who sell their own versions of such a metric for various machines

•  Many people in the Intel community still rely on “MHz” or interpolate from published SPECint Rate results

•  Many UNIX people, particularly in the AP countries, use a metric estimated or interpolated from published TPC-C values.


Page 3: Performance Architecture and Platform Positioning

The Problem

•  Which metric is right?
•  They show a wide variance
•  Switching metrics can “reorder the list”
•  Interpolation between measured or published results leads to even more variation
•  A single metric will not “cover” all workloads
•  Relative Capacity is workload dependent
•  Little or no understanding of the differences between the machines is conveyed by the metric values.
•  An accepted metric drives machines toward a common denominator and loses its effectiveness as a differentiator over time.
•  Administratively declaring an official metric is like declaring that pi = 3 “because it makes the math easier”

Page 4: Performance Architecture and Platform Positioning

The Result

If your understanding of Relative Capacity and server performance is based on a “Common Metric” … you don’t know what you think you know. Sooner or later you will be misled. Therefore an administratively established Standard Metric will lead to losses.

Page 5: Performance Architecture and Platform Positioning

The Reality

•  Relative Capacity is dependent on at least three machine parameters:
•  Thread Speed
•  Thread Count
•  Cache/Thread

0"

0.2"

0.4"

0.6"

0.8"

1"

1.2"

)0.5" 0" 0.5" 1" 1.5"

Thread

'Spe

ed'

Cache/Thread'

'Socket'Designs'in'Fitness'Space'Bubble'Size'is'threads'

Super"Cluster"

Exadata"2)2"

Exadata"8)2"

Power"780"SMT4"

Power"780T"SMT2"

Power"780T"ST"

z196"

Sandy"Bridge"

Note that high thread count means reduced thread speed and cache per thread.

Page 6: Performance Architecture and Platform Positioning

A Model

We observe:

•  Maximum achievable Throughput is Thread Speed x Thread Count

•  Maximum Thread Capacity is Thread Speed x Cache per Thread

There is a clear trade-off between Throughput and Thread Capacity. Enterprise servers are distinguished by higher thread capacity. Pure Systems compute nodes generally have low thread capacity but can have high throughput; Pure Systems aggregate more nodes in a rack.

0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"

0" 0.2" 0.4" 0.6" 0.8" 1"

Maxim

um'Throu

ghpu

t'

Maximum'Thread'Capacity'

Sandy"Bridge"

Pure"Power"

Power"7"

Power"7T"

z196"

6

Page 7: Performance Architecture and Platform Positioning

Capacity

•  We assert:
•  Total Capacity is a combination of Throughput and Thread Capacity
•  Relative Capacity varies with workload because the relative weight of each parameter is workload dependent.
•  We define:
•  Capacity = w(Thread Capacity) + (1-w)(Throughput)
•  C = w(TC) + (1-w)TP
(a small numeric sketch of this model follows below)
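A minimal numeric sketch of the model above, in Python. The machine parameters and weights are hypothetical, normalized values used only to show the trade-off; they are not measurements of any product.

```python
# Capacity model sketch:
#   TP (max throughput)      = thread speed x thread count
#   TC (max thread capacity) = thread speed x cache per thread
#   C                        = w*TC + (1-w)*TP
# All inputs below are hypothetical normalized values.

def capacity(thread_speed: float, thread_count: float,
             cache_per_thread: float, w: float) -> float:
    tp = thread_speed * thread_count        # maximum achievable throughput
    tc = thread_speed * cache_per_thread    # maximum thread capacity
    return w * tc + (1.0 - w) * tp

# A throughput-oriented node vs. an enterprise core with more cache per thread.
for w in (0.1, 0.5, 0.9):
    node = capacity(thread_speed=1.0, thread_count=2.0, cache_per_thread=0.3, w=w)
    ent = capacity(thread_speed=0.8, thread_count=1.0, cache_per_thread=1.5, w=w)
    print(f"w={w:.1f}  node={node:.2f}  enterprise={ent:.2f}")
```

At a low weight the high-thread-count node comes out ahead; at a high weight the enterprise core does. That workload dependence is what the next page expresses as Relcap.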


Page 8: Performance Architecture and Platform Positioning

Relative Capacity

•  Relative Capacity is the Capacity Ratio between Machines A and B, often expressed as a core ratio (see the sketch below)
•  Relcap = CapA/CapB = Cores B/Cores A
•  Model: Relcap = (w(TCA) + (1-w)TPA) / (w(TCB) + (1-w)TPB)

•  The following work normalizes everything to the Intel Sandy Bridge Core
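A short illustration of the Relcap formula, with hypothetical TC and TP values normalized so the reference core has throughput 1.0 (illustrative numbers, not measured data):

```python
def relcap(tc_a: float, tp_a: float, tc_b: float, tp_b: float, w: float) -> float:
    # Relcap = (w*TC_A + (1-w)*TP_A) / (w*TC_B + (1-w)*TP_B)
    return (w * tc_a + (1 - w) * tp_a) / (w * tc_b + (1 - w) * tp_b)

# Machine A has twice the thread capacity of reference machine B but less throughput.
for w in (0.0, 0.5, 1.0):
    print(f"w={w:.1f}  Relcap={relcap(tc_a=2.0, tp_a=0.8, tc_b=1.0, tp_b=1.0, w=w):.2f}")
```

The same pair of machines yields core ratios from 0.8 to 2.0 depending only on the weight, which is why no single metric can order the list.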


Page 9: Performance Architecture and Platform Positioning

Workload Weight and Relcap

•  By our definition, weight increases as a workload exploits thread capacity:
•  Increased cache misses
•  Increased dependence on a single thread

•  Cache misses increase with data intensity of individual loads and with VM or application density when consolidating loads

•  Serial Dependence increases with data and resource sharing. This happens by definition in consolidations


Page 10: Performance Architecture and Platform Positioning

Core ratios and weight

•  Over the years we have published various descriptions of workload factors and “core ratios” comparing z to other machines.

•  The paper “Relative Capacity and Fit for Purpose Platform Selection” (CMG #123, March, 2009) contains a set based on various benchmarks and “real world” comparisons

•  IBM has done internal benchmarking (Friendly Bank, single threaded cpu measurements, etc.)

•  The next 3 charts are a composite based on all that work.


Page 11: Performance Architecture and Platform Positioning

Core Ratios

0"

1"

2"

3"

4"

5"

6"

7"

8"

9"

0" 0.2" 0.4" 0.6" 0.8" 1" 1.2"

Sand

y&Bridge&Core&pe

r&Core&

weight&

Sandy"Bridge"

Power"7"ITE"

Power"7"

Power"7T"

z196"

11

Page 12: Performance Architecture and Platform Positioning

Workloads and Weight

Workload                             VMs/Core   weight   Mean
SAPs                                    0.5      -0.13    0.5
Throughput Benchmarks                   0.5       0.00    0.5
Threaded Integer Processing             0.5       0.49    0.5
Partitioned OLTP                        0.5       0.57    0.5
Java OLTP                               0.5       0.62    0.5
Web Transactions                        0.5       0.65    0.5
Single threaded Integer processing      0.5       0.70    0.5
Java Heavy                              0.5       0.72    0.5
TPoX 5 VMs                              5         0.52    0.7
TPoX 10 VMs                             5         0.53    0.7
TPoX 20 VMs                             5         0.62    0.7
Virtualization Benchmark                5         0.69    0.7
TPoX 40 VMs                             5         0.75    0.7
TPoX 80 VMs                             5         0.81    0.7
Scaled Virtualization Benchmark         5         0.88    0.7
Friendly Bank Heavy I/O                10         1.16    0.9
Friendly Bank VM Light                 10         0.77    0.9
Friendly Bank Raw                      10         0.87    0.9
Power VM Max Density                   10         0.88    0.9
Thread Capacity                        10         1.00    0.9
20 VMs/Core                            20         1.10    1.1
30 VMs/Core                            30         1.20    1.2


Page 13: Performance Architecture and Platform Positioning

VMs per Core and weight

[Chart: weight vs. VMs per zCore, showing the individual weights, the mean values, and a polynomial trend line. Fit: y = -0.0009x² + 0.049x + 0.496, R² = 0.58852.]

Note: weights > 1 are due to either I/O or PowerVM’s product limit of 10 VMs per core. Power will probably raise that limit somewhat if they build and leverage more thread capacity. Some metrics, like SAPs (SAP SD benchmark), have weights < 0 due to software differences. (A small sketch evaluating the fitted curve follows.)
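A small sketch evaluating the trend line from the chart above. The polynomial is read off the chart (the sign of the quadratic term is reconstructed from the garbled slide text), so treat it as a rough planning curve, not a precise law:

```python
def weight_from_vm_density(vms_per_core: float) -> float:
    # Trend line from the chart: y = -0.0009x^2 + 0.049x + 0.496 (R^2 ~ 0.59).
    # The quadratic term's sign is assumed; the fit itself is approximate.
    x = vms_per_core
    return -0.0009 * x * x + 0.049 * x + 0.496

for vms in (0.5, 5, 10, 20, 30):
    print(f"{vms:>4} VMs/core -> estimated weight {weight_from_vm_density(vms):.2f}")
```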


Page 14: Performance Architecture and Platform Positioning

Load Variability Effects

•  All of the above is about steady-state load
•  Random variability introduces consolidation and queuing effects
•  High variability implies low resource utilization
•  Consolidation increases utilization by decreasing variability in the composite load
•  Consolidated loads are heavier than the individual loads from which they are built.
•  Consolidation leads to more emphasis on thread capacity (see the sketch below)
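A one-line illustration of why consolidation lowers variability: with n similar loads, the composite coefficient of variation falls as c/SQRT(n), the relation used later on page 18. The value of c is illustrative:

```python
from math import sqrt

c_single = 1.0  # coefficient of variation (std dev / mean) of one load, illustrative
for n in (1, 4, 16, 64):
    print(f"n={n:>2}  composite variability ~ {c_single / sqrt(n):.2f}")
```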


Page 15: Performance Architecture and Platform Positioning

Local Factors Govern Variability and Client Response to it

Variability is driven by a combination of market dynamics and business process variation. In the face of variability, clients must make operational tradeoffs between throughput, response time, and utilization efficiency. Absent batching and buffering, variability effects are stronger on distributed solutions than on centralized solutions.


Page 16: Performance Architecture and Platform Positioning

The operational tradeoffs are governed by Normalized Headroom (HR)

•  HR = Capacity Remaining/Capacity Used
•  HR = (1-u)/u
•  As u → 1, HR → 0. As u → 0, HR becomes unbounded
•  Rearranging: Uavg = 1/(1+HR)
•  We also know that:
•  Uavg = 1/(1 + kc/SQRT(n))   (Rogers’ Equation)
•  k is the number of standard deviations of the load from the mean at the peak-utilization design point
•  c is the standard deviation / mean of a single load instance
•  n is the number of loads on the server
•  Note that n can be less than 1 when a load is distributed
•  Therefore, at average utilization:
•  HR = kc/SQRT(n)
(a small sketch of these relationships follows)
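A minimal sketch of these relationships in Python. The values of k, c, and n are illustrative choices, not recommendations:

```python
from math import sqrt

def avg_utilization(k: float, c: float, n: float) -> float:
    # Rogers' equation: Uavg = 1 / (1 + k*c/sqrt(n))
    return 1.0 / (1.0 + k * c / sqrt(n))

def headroom(u: float) -> float:
    # HR = capacity remaining / capacity used = (1 - u) / u
    return (1.0 - u) / u

k, c = 2.0, 1.0                  # design at 2 sigma, moderately variable load
for n in (0.5, 1, 4, 16, 64):    # n < 1 models a load distributed across servers
    u = avg_utilization(k, c, n)
    print(f"n={n:>4}  Uavg={u:.2f}  HR={headroom(u):.2f}")
```

More loads per server (larger n) pushes average utilization toward 1 and headroom toward 0, which is the scaling effect analyzed on the next page.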


Page 17: Performance Architecture and Platform Positioning

Analysis of HR

•  Given HR = kc/SQRT(n) we can see:
•  Service Level requirement impact
•  As k → 0, HR → 0 and Uavg → 1
•  As k gets large, HR gets large and Uavg → 0
•  At k = 1, HR is governed by c and n
•  Load Variability impact
•  As c → 0, HR → 0 and Uavg → 1
•  As c gets large, HR gets larger and Uavg → 0
•  At c = 1, HR is governed by k and n
•  Scaling Impact (Variability as a function of n)
•  As n → 0 (distribution of load), HR gets larger and Uavg → 0
•  As n gets larger, HR → 0 and Uavg → 1
•  At n = 1, HR is governed by k and c

Note that as n increases, so does the “weight” of the load.

Page 18: Performance Architecture and Platform Positioning

Response time

•  (We will assume here that response time is measured on a single thread of work. When this is not the case, the base response time can be adjusted.)
•  t = F(1/Capacity, Variability, utilization)
•  t0 ~ 1/Capacity (link to the raw capacity model)
•  Variability = c/SQRT(n) (link to Rogers’ equation)
•  t = t0 + twait = t0 + (t0)(c²/n)(u/(1-u))
•  t = t0 + twait = t0 + (t0)(c²/n)/HR (by definition of HR)
•  t(Uavg) = t0 + (t0)(c²/n)/(kc/SQRT(n)) = t0 + (t0)(c/(k·SQRT(n)))
(a short sketch checking these formulas follows)
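A short sketch checking that the two forms of the response-time formula agree at Uavg. The values of t0, c, n, and k are illustrative:

```python
from math import sqrt

def response_time(t0: float, c: float, n: float, u: float) -> float:
    # t = t0 + t0*(c^2/n)*(u/(1-u))
    return t0 + t0 * (c * c / n) * (u / (1.0 - u))

def response_time_at_avg(t0: float, c: float, n: float, k: float) -> float:
    # t(Uavg) = t0 + t0*c/(k*sqrt(n))
    return t0 + t0 * c / (k * sqrt(n))

t0, c, n, k = 1.0, 1.0, 4.0, 2.0
u_avg = 1.0 / (1.0 + k * c / sqrt(n))     # Rogers' equation
print(response_time(t0, c, n, u_avg))     # 1.25
print(response_time_at_avg(t0, c, n, k))  # 1.25 -- same value both ways
```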


Page 19: Performance Architecture and Platform Positioning

Analysis of Peak Response Time

•  Given:
•  t = t0 + twait = t0 + (t0)(c²/n)/HR
•  As capacity goes up, t goes down with t0
•  As variability increases, the wait time increases with c², taking t up with it.
•  Consolidation decreases variability; distribution increases it.
•  As the maximum utilization increases, HR decreases and t goes up


Page 20: Performance Architecture and Platform Positioning

Analysis of Average Response Time

•  Given:
•  t(Uavg) = t0 + (t0)(c/(k·SQRT(n)))
•  As capacity goes up, t goes down with t0
•  As variability increases, the wait time increases with c, taking t up with it.
•  Consolidation decreases variability; distribution increases it.
•  As the SLA gets tighter, k is increased, reducing the average response time


Page 21: Performance Architecture and Platform Positioning

Assume design for good service level at 2 sigma

[Chart: “System Characteristic Curve, Piecewise Linear Model” — Response Time vs. Throughput. Annotations: “to Mean”, “Knee at 2 sigma”, “Delay at 3 sigma”.]
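A minimal sketch of how the knee and the delay arise from the earlier formulas, assuming the server is sized so that a 3.1-sigma peak reaches 100% utilization (the k(100%) = 3.1 assumption used on the scenario pages that follow). The values of t0, c, and n are illustrative:

```python
from math import sqrt

t0, c, n, k100 = 1.0, 1.0, 4.0, 3.1   # illustrative; a k100-sigma peak is 100% busy

u_avg = 1.0 / (1.0 + k100 * c / sqrt(n))    # Rogers' equation

def u_at_sigma(m: float) -> float:
    # Utilization when the load runs m standard deviations above its mean
    return u_avg * (1.0 + m * c / sqrt(n))

def t_at(u: float) -> float:
    # Response time model from page 18
    return t0 + t0 * (c * c / n) * (u / (1.0 - u))

for label, m in (("mean", 0.0), ("knee (2 sigma)", 2.0), ("delay (3 sigma)", 3.0)):
    u = u_at_sigma(m)
    print(f"{label:>16}: u={u:.2f}  t={t_at(u):.2f}")
```

Near the mean the curve sits close to t0, at 2 sigma the wait term becomes visible (the knee), and approaching 3 sigma utilization nears 100% and the delay term dominates.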

Page 22: Performance Architecture and Platform Positioning

Trading off efficiency and service level

[Chart: “2 sigma response v average utilization” — y-axis: Response Time; x-axis: VMs/Core. Series: Sandy Bridge, Sandy Bridge ST, Power 7 ITE SMT4, Power 7 ITE SMT2, Power 7 ITE ST, Power 7 SMT4, Power 7 SMT2, Power 7 ST, Power 7T SMT4, Power 7T SMT2, Power 7T ST, z196.]

Page 23: Performance Architecture and Platform Positioning

Trading off efficiency and Throughput

[Chart: “2 Sigma Throughput v average utilization” — y-axis: Throughput; x-axis: average Utilization. Series: the same twelve configurations as the previous chart.]

Page 24: Performance Architecture and Platform Positioning

Throughput and Capacity are not the same thing

[Two charts: “2 Sigma knee of System Characteristic” — Response Time plotted against Throughput/Core in one chart and against VMs/Core in the other. Series: Sandy Bridge, Sandy Bridge ST, Power 7 ITE SMT4, Power 7 ITE SMT2, Power 7 ITE ST, Power 7 SMT4, Power 7 SMT2, Power 7 ST, Power 7T SMT4, Power 7T SMT2, Power 7T ST, z196.]

Page 25: Performance Architecture and Platform Positioning

Moderate weight and variability

•  Variability is Moderate
•  Weight is Moderate
•  3 VMs per Sandy Bridge Core
•  5 VMs per Power 7 Core
•  9 VMs per z196 Core
•  Response time is single threaded

Model parameters: k(100%) = 3.1, c = 1, w = 0.5, N (Sandy Bridge) = 3

[Chart: “2 Sigma knee of System Characteristic” — Response Time vs. Throughput/Core for the same twelve configurations.]

Page 26: Performance Architecture and Platform Positioning

Consolidation of highly variable loads

•  Weight is high
•  Variability is high
•  4 VMs per SB core
•  9 VMs per Power 7 Core
•  25 VMs per z196 Core
•  Response time is single threaded

Model parameters: k(100%) = 3.1, c = 2, w = 0.7, N (Sandy Bridge) = 4, Thread Unit = 1

0"

0.2"

0.4"

0.6"

0.8"

1"

1.2"

1.4"

1.6"

0" 0.5" 1" 1.5" 2" 2.5"Re

spon

se'Tim

e'

Throughput'/'Core'

2'Sigma'knee'of'System'Characteris9c'

Sandy"Bridge" Sandy"Bridge"ST" power"7"ITE"SMT4"

Power"7"ITE"SMT2" Power"7"ITE"ST" power"7"SMT4"

Power"7"SMT2" Power"7"ST" Power7T"SMT4"

Power"7T"SMT2" Power"7T"ST" z196"

Page 27: Performance Architecture and Platform Positioning

Coarse Grain: N < 1 for Sandy Bridge

•  Variability is Moderate
•  Weight is Moderate
•  0.5 VMs per SB Core
•  1 VM per Power 7 Core
•  2 VMs per z196 Core
•  Response time is multithreaded

Model parameters: k(100%) = 3.1, c = 1, w = 0.5, N (Sandy Bridge) = 0.5, Thread Unit = 4

0"

0.2"

0.4"

0.6"

0.8"

1"

1.2"

0" 0.5" 1" 1.5" 2" 2.5" 3" 3.5" 4"

Respon

se'Tim

e'

Throughput'/'Core'

2'Sigma'knee'of'System'Characteris9c'

Sandy"Bridge" Sandy"Bridge"ST" power"7"ITE"SMT4"

Power"7"ITE"SMT2" Power"7"ITE"ST" power"7"SMT4"

Power"7"SMT2" Power"7"ST" Power7T"SMT4"

Power"7T"SMT2" Power"7T"ST" z196"

Page 28: Performance Architecture and Platform Positioning

Low Variability and Low Weight

•  Variability Low
•  Weight Low
•  0.5 VMs per SB Core
•  0.92 VMs per Power 7 Core
•  0.47 VMs per z196 Core
•  Response Time Multithreaded

Model parameters: k(100%) = 3.1, c = 0.5, w = 0.1, N (Sandy Bridge) = 0.5, Thread Unit = 4

0"

0.2"

0.4"

0.6"

0.8"

1"

1.2"

0" 0.5" 1" 1.5" 2" 2.5" 3" 3.5"

Respon

se'Tim

e'

Throughput'/'Core'

2'Sigma'knee'of'System'Characteris9c'

Sandy"Bridge" Sandy"Bridge"ST" power"7"ITE"SMT4"

Power"7"ITE"SMT2" Power"7"ITE"ST" power"7"SMT4"

Power"7"SMT2" Power"7"ST" Power7T"SMT4"

Power"7T"SMT2" Power"7T"ST" z196"