R. Clapp, 1/21/01
STiNG Revisited:
Performance of Commercial Database Benchmarks on a CC-NUMA Computer System
Russell M. Clapp
IBM NUMA-Q
rclapp@us.ibm.com
January 21st, 2001
Background
• STiNG was the code name of an engineering development.
– Architecture introduced at ISCA in 1996.
• NUMA-Q was the name of the product family that resulted.
– Features phased in over multiple releases.
– Not yet feature complete in mid-1997.
– Unable to validate simulation results at that time.
• Left Sequent to pursue other opportunities in 1997.
• Recently returned to what is now IBM NUMA-Q.
– Looking at data collected between late 1997 and 1999.
– Now trying to validate simulation results – sort of.
• This presentation compares apples and oranges.
– At least they are both types of fruit!
• A lot of the details are different, but the overall conclusions are the same.
Outline
• Brief review of STiNG architecture.
– Updates on changes made for second-generation NUMA-Q.
• Examination of OLTP and DSS workloads.
– Per instruction workload profiles.
– How they look on a CC-NUMA machine.
• Comparison of architectural simulation and measured results.
– OLTP and DSS workloads.
– Report latencies and consumed bandwidths throughout the system.
• Conclusions
STiNG Block Diagram
• Homogenous nodes called quads.
• Quad local bus defined by Intel processor and chipset.
• 1 GByte/sec SCI interconnect.
• Shared global memory address space.
• Shared global I/O address space.
• MESI and modified SCI coherence protocols.
• Up to 63 nodes in a coherence domain.
[Block diagram: quads 0 through N-1, each containing four processors, memory/PCI control, and a Lynx ASIC, connected by the SCI interconnect.]
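For intuition, the bus-side MESI protocol named above can be sketched as a small state table. This is an illustrative simplification only, not the Lynx implementation; the real system layers a modified SCI directory protocol on top of bus MESI.

```python
# Illustrative bus-side MESI transition table (sketch, not the Lynx hardware).
MESI = {
    "I": {"local_read": "S", "local_write": "M"},   # fill on miss (E if no sharer)
    "S": {"local_write": "M", "bus_write": "I"},    # upgrade / snoop-invalidate
    "E": {"local_write": "M", "bus_read": "S", "bus_write": "I"},
    "M": {"bus_read": "S", "bus_write": "I"},       # supply data, then share
}

def step(state, event):
    """Next state for a cache line; unchanged if the event is irrelevant."""
    return MESI[state].get(event, state)

state = "I"
for event in ["local_read", "bus_write", "local_write"]:
    state = step(state, event)
print(state)  # "M": the invalidated copy is re-fetched for the write
```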
Lynx2 Block Diagram
[Lynx2 board block diagram: dual DOBICs and a DSCLIC on the Xeon bus (147 bits); DataPump with SCI in/out links; SCI remote cache tags (even/odd, 2Mx40); local directory (even/odd, 8Mx40); fast directory (2Mx72); remote cache tags (512Kx72); remote cache (16Mx72); Lynx bus connecting the LYNX2 board to the SBB board.]
STiNG to NUMA-Q Product Evolution
• STiNG assumed product features that were introduced in NUMA-Q over time:
– Multipath I/O.
– Intelligent I/O drivers.
– Improved chipset behavior.
– Dual protocol processing engine in SCLIC.
– Larger processor and remote cache sizes.
• These changes were all incorporated in the second generation of NUMA-Q, code named “Scorpion”.
– This is the data that I will present today for OLTP workloads.
– I will also present some data from the first generation of NUMA-Q.
• “Centurion” increases the bus and ASIC speeds to 100MHz and adds some other minor enhancements.
STiNG Simulation vs. NUMA-Q Data Collection
• “Application profiles” were determined for several database benchmark workloads.
– We compare to measured profiles on NUMA-Q.
• The TPC-B benchmark and Query 6 of the TPC-D suite were chosen as representative of OLTP and DSS usage models.
– TPC-C and Query 5 of TPC-D were measured on NUMA-Q.
• A simulator with behavioral-level models of all system ASICs and data paths was built.
• The application profiles were used to drive the simulation model.
• The simulation model reports latencies, resource utilization rates, and instruction throughput.
– Data was collected using hardware performance counters.
– The counters provide data for over 1000 different metrics.
STiNG Simulation vs. NUMA-Q Data Collection
• STiNG simulations included the “Dual SCLIC.”
– TPC-C data is for the dual sequencer version of the SCLIC, the “DSCLIC”.
• Clearly, we are comparing apples and oranges.
– But many of the results are similar!
System configurations by workload:

Parameter           | TPC-B       | TPC-C             | Query 6     | Query 5
System Name         | STiNG       | NUMA-Q “Scorpion” | STiNG       | NUMA-Q
Core Speed          | 133MHz      | 495MHz            | 133MHz      | 180MHz
L2 Cache            | 512K 4-way  | 2M 4-way          | 512K 4-way  | 1M 4-way
Bus/Lynx Speed      | 66MHz       | 90MHz             | 66MHz       | 60MHz
Remote Cache        | 32MB 4-way  | 128MB 4-way       | 32MB 4-way  | 32MB 4-way
SCI/Datapump        | 500MHz      | 500MHz            | 500MHz     | 500MHz
Protocol Processors | 2 per SCLIC | 2 per SCLIC       | 1 per SCLIC | 1 per SCLIC
Application Profiles
• Larger cache leads to lower MPI for TPC-C vs. TPC-B.
• Query 5 has a higher MPI than Query 6 despite a larger cache.
• Remote memory access rates are lower than what we assumed.
– Lots of OS work made this happen.
• Remote cache miss rates are higher than expected.– Fewer L2 capacity misses to remote memory that would hit in remote cache.
• TPC-C has lower I/O rate than TPC-B.
Event                     | Rate                 | TPC-B  | TPC-C  | TPC-D Q6 | TPC-D Q5
Processor Count           |                      | 16     | 16     | 32       | 32
L2 Cache Size             | per processor        | 512K   | 2M     | 512K     | 1M
L2 Cache Miss             | per instruction      | 0.0223 | 0.0073 | 0.0018   | 0.0031
Remote Memory Access Rate | per L2 cache miss    | 35%    | 24%    | 35%      | 27%
Remote Cache Size         | MB (all 4-way)       | 32     | 128    | 32       | 32
RC Miss Rate              | per reference        | 11%    | 24%    | 15%      | 43%
I/O                       | cache line per inst  | 0.0014 | 0.0003 | 0.0019   | 0.0017
I/O                       | bits per instruction | 0.36   | 0.08   | 0.49     | 0.44
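One way to read the profile data: the external (memory-stall) component of CPI is roughly the L2 misses per instruction (MPI) times the average miss penalty in CPU cycles. A minimal sketch using the MPI values from this slide; the penalty figures are purely illustrative placeholders, not the measured latencies reported later.

```python
# External CPI ~= MPI x average miss penalty (cycles).
# MPI values are from the profile slide; penalties are assumed for illustration.
profiles = {                 # workload: (MPI, assumed penalty in CPU cycles)
    "TPC-B": (0.0223, 150),
    "TPC-C": (0.0073, 300),  # faster core => more cycles per miss (assumed)
    "Q6":    (0.0018, 150),
    "Q5":    (0.0031, 200),
}

for name, (mpi, penalty) in profiles.items():
    print(f"{name}: external CPI ~= {mpi * penalty:.2f}")
```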
Cache Miss Service Distribution for OLTP – 16 Processors

Category                      | TPC-B on STiNG | TPC-C on NUMA-Q
Hit to Local Memory           | 50.4%          | 68.8%
Hit to Remote Cache           | 27.2%          | 14.5%
Local Cache-to-Cache Transfer | 2.7%           | 9.1%
2 Hop Remote                  | 5.4%           | 3.4%
4 Hop Remote                  | 5.4%           | 3.3%
Local Hit/Remote Invalidate   | 8.9%           | 0.9%
• Locality was higher than expected, including cache-to-cache transfers.
– “Buddy Locks” helped this quite a bit.
• Most invalidations completed locally as well.
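The distribution above determines the average miss penalty as a weighted sum over the service categories. A sketch using the TPC-C fractions from this slide; the per-category latencies in bus clocks are assumptions for illustration, not measured values.

```python
# Average miss penalty = sum over categories of (fraction x latency).
dist = {                     # fraction of L2 misses, TPC-C on NUMA-Q
    "local memory":          0.688,
    "remote cache hit":      0.145,
    "local cache-to-cache":  0.091,
    "2-hop remote":          0.034,
    "4-hop remote":          0.033,
    "local hit/remote inv":  0.009,
}
latency = {                  # assumed bus clocks per category (not measured)
    "local memory": 25, "remote cache hit": 30, "local cache-to-cache": 30,
    "2-hop remote": 250, "4-hop remote": 450, "local hit/remote inv": 60,
}

avg = sum(dist[k] * latency[k] for k in dist)
print(f"average miss penalty ~= {avg:.1f} bus clocks")
```

This is why high locality matters: the rare 4-hop misses are an order of magnitude more expensive than local hits, so even small shifts in the distribution move the average noticeably.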
Cache Miss Service Distribution for DSS – 32 Processors
• Larger systems should have more remote references take 4 hops.
• Query 5 does not show this behavior.
– This is probably due to the higher remote cache miss rate.
– These are compulsory misses for Query 5.
Category                      | Query 6 on STiNG | Query 5 on NUMA-Q
Hit to Local Memory           | 51.6%            | 69.4%
Hit to Remote Cache           | 27.8%            | 14.6%
Local Cache-to-Cache Transfer | 1.6%             | 3.1%
2 Hop Remote                  | 3.6%             | 7.0%
4 Hop Remote                  | 10.9%            | 4.4%
Local Hit/Remote Invalidate   | 4.5%             | 1.4%
Average Bus Latency on L2 Cache Miss
[Chart: Average Cache Miss Penalty – average bus latency on an L2 cache miss, in bus clocks (0 to 120), vs. processor count (0 to 32), for STiNG/DSCLIC/TPC-B, NUMA-Q/DSCLIC/TPC-C, STiNG/Query 6, and NUMA-Q/Query 5.]
• Higher locality leads to lower average latency.
• Rate of increase is decreasing with quads – as predicted.
Remote Quad Access Latency
• OLTP latency is higher than simulated, in part due to the speed difference between Scorpion quads and the SCI ring.
[Chart: Average Remote Latency in bus clocks (150 to 500) vs. processor count (4 to 32), for STiNG/DSCLIC/TPC-B, NUMA-Q/DSCLIC/TPC-C, STiNG/Query 6, and NUMA-Q/Query 5.]
Remote Latency Breakdown - 4 Quad OLTP
• Similar breakdown despite loading and chipset differences.
• SCI is a bigger percentage due to the faster quad speed.
Component (of average) | STiNG TPC-B with Dual SCLIC | NUMA-Q TPC-C with Dual SCLIC
Bus time               | 26%                         | 16%
Lynx time              | 62%                         | 65%
SCI time               | 12%                         | 19%
Remote Latency Breakdown – 8 Quad DSS
• Much lower latency makes SCI responsible for a larger percentage.
• Simulation had low SCLIC utilization; NUMA-Q has fewer “4 Hop” line fills.

Component (of average) | STiNG Query 6 | NUMA-Q Query 5
Bus time               | 23%           | 18%
Lynx time              | 55%           | 54%
SCI time               | 22%           | 28%
Quad Bus Bandwidth Consumption
• Bus utilization for TPC-C on NUMA-Q is close to STiNG despite a much lower MPI.
– Lower overall latency results in lower CPI, which allows this to occur.
• Less I/O keeps Query 5 bus utilization close to Query 6 despite higher MPI.
[Chart: Data Bandwidth Consumed vs. Maximum Available – quad bus utilization (0% to 45%) vs. processor count (0 to 32), for STiNG/DSCLIC/TPC-B, NUMA-Q/DSCLIC/TPC-C, STiNG/Query 6, and NUMA-Q/Query 5.]
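The utilization curves follow from a simple identity: consumed data bandwidth is instruction throughput times cache-line transfers per instruction times line size. A sketch with illustrative inputs; only the TPC-C MPI is taken from the profile slide, while the throughput and line size are assumptions.

```python
# Consumed bandwidth = instr/sec x cache-line transfers per instr x line size.
instr_per_sec = 100e6          # instructions/sec per quad (assumed)
transfers_per_instr = 0.0073   # TPC-C L2 cache misses per instruction
line_size_bytes = 32           # assumed cache-line size

bw = instr_per_sec * transfers_per_instr * line_size_bytes
print(f"~{bw / 1e6:.1f} MB/s of miss traffic per quad")
```

The same identity explains the bullets above: halving latency raises instructions per second, which can hold bus utilization steady even when MPI drops.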
Relative CPI Breakdown for OLTP

Source      | TPC-B on STiNG, 4 Quads | TPC-C on NUMA-Q, 4 Quads
internal    | 21%                     | 21%
local miss  | 26%                     | 6%
remote miss | 11%                     | 28%
invalidate  | 42%                     | 45%
Despite different core speeds and ratios, the relative contribution from the sources of CPI is similar.
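Given a measured total CPI, the relative percentages split it into additive components. A sketch: the 4.0 total is a placeholder rather than a measured number, and the split is one reading of the TPC-C chart on this slide.

```python
# Split a total CPI into additive components using relative fractions.
total_cpi = 4.0                                  # placeholder, not measured
split = {"internal": 0.21, "local miss": 0.06,
         "remote miss": 0.28, "invalidate": 0.45}

assert abs(sum(split.values()) - 1.0) < 1e-9     # fractions must cover all CPI
for source, frac in split.items():
    print(f"{source:12s}: {frac * total_cpi:.2f} CPI")
```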
Conclusions
• Despite many differences between STiNG and NUMA-Q, simulation results were reasonably close to measured results in most areas.
– Architectural simulation provided an accurate high-level view of system behavior.
– The simulation results had a significant impact on implementation decisions.
• Dual protocol processing engines.
• Hardware assist.
• Lots of other areas not covered today.
– The 1 man-year investment in modeling was well worth it.
– All this despite known deficiencies in the modeling technique.
• NUMA architectures are clearly capable of executing commercial workloads.
– Average latency is similar to other large-scale UMA machines in the same timeframe.
– However, OS restructuring for NUMA was required to get the right mix of local and remote transfers.
Final Remarks
• The industry is moving to NUMA.
– New offerings from Sun, HP, and Compaq, for example.
• Lower remote-to-local ratios than NUMA-Q will be required to reduce the effort of OS restructuring.
– This helps the “shrink wrap” OS segment as well.
• As NUMA becomes more pervasive, even “shrink wrap” OSes will make the necessary modifications.
– Memory allocation.
– Affinity scheduling.
– Multipath I/O.
Backup
“Scorpion/Centurion” Quad Block Diagram
[Block diagram of the “Scorpion/Centurion” quad: four Xeon processors on the external bus @ 90-100 MHz; 82450NX MIOC with OMMs, SLMMs, RCGs (2), MUXs (4), and 1 GB - 8 GB of quad memory on the external memory bus; two PXBs (PXB-A, PXB-B), each driving a 64-bit PCI bus @ 30-33 MHz (PCI-A0..A3, PCI-B0..B3) with an I/O APIC; PIIX4 with IDE, USB, GPIO, flash, RTC/NVRAM, and Super I/O (keyboard, mouse, 2 serial/1 parallel ports, floppy) on the ISA/X bus; the LYNX2 board attached via F16 buses and a DOBIC to the Lynx bus and SCI bus.]
Multipath I/O
[Diagrams: two quads (Quad 0, Quad 1) connected by IQ-Link. In the point-to-point configuration each quad attaches directly to its own disks (Disks A, Disks B); in the multipath configuration both quads reach Disks A and Disks B through a Fibre Channel switch.]
• With multipath I/O, all DMA transfers are to and from local memory addresses.
Workload Characterization: OLTP vs. DSS
• OLTP does more “bad” things than DSS.
– OLTP has higher processor cache miss rates than DSS.
– OLTP has higher I/O operations per second than DSS.
– OLTP spends more time and instructions in kernel mode than DSS.
– OLTP requires higher cache-to-cache transfer bandwidth than DSS.
• However, DSS requires higher I/O bandwidth than OLTP.
– But large block I/O softens the blow.
• With sufficient I/O bandwidth available, both workloads are processor-bound with throughput limited by the latency of the interconnect and memory.
– This is true for all results I present today.
• DSS has lower CPI and scales better than OLTP.
• These conclusions are based on benchmarks – real OLTP applications behave better, but still not as well as DSS.
Quad Local Access Latency
• Model was too pessimistic at low utilizations and too optimistic at high utilizations.
[Chart: Quad Local Latencies – 16-processor OLTP and 32-processor DSS, in bus clocks (0 to 35): 450NX @ 90MHz (NUMA-Q/DSCLIC/TPC-C), 450GX model @ 66MHz (STiNG/DSCLIC/TPC-B and STiNG/Query 6), and 450GX @ 60MHz (NUMA-Q/Query 5).]
Sequencer Core Utilization
• NUMA-Q has far fewer remote invalidates and some “hardware assist.”
• This doesn’t help Query 5, as its higher MPI and remote cache miss rate add up to higher bandwidth consumption.
[Chart: SCLIC Sequencer Utilization (0% to 70%) vs. processor count (4 to 32), for STiNG/DSCLIC/TPC-B, NUMA-Q/DSCLIC/TPC-C, STiNG/Query 6, and NUMA-Q/Query 5.]
Quad-to-Quad Bandwidth Consumption
• Noops cause unanticipated bandwidth consumption for OLTP.
• Rate of increase is higher due to the speed of Scorpion quads relative to SCI.
• Despite higher CPI, Query 5 uses more remote bandwidth than Query 6 due to higher core speed, MPI, and remote cache miss rate.
[Chart: SCI Ring Utilization (0% to 30%) vs. processor count (8 to 32): STiNG/DSCLIC/TPC-B, TPC-D/Q6 baseline, NUMA-Q/DSCLIC/TPC-C (with and without NOOPs), and NUMA-Q/Query 5.]
Relative CPI Breakdown for DSS

Source      | Query 6 on STiNG, 8 Quads | Query 5 on NUMA-Q, 8 Quads
internal    | 85%                       | 59%
local miss  | 4%                        | 18%
remote miss | 8%                        | 21%
invalidate  | 3%                        | 2%
• Higher miss rate for Query 5 leads to higher external CPI.
Throughput Scaling
• OLTP benchmarks are the worst case; over 60% efficiency is acceptable.
[Chart: Performance Relative to One Quad – effective quads (0 to 8) vs. physical quads (0 to 8) for STiNG/Dual SCLIC/TPC-B, NUMA-Q/Dual SCLIC/TPC-C, and TPC-D/Q6.]
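Scaling efficiency here is simply effective quads divided by physical quads. A sketch with made-up throughput points, not the measured curve from the chart:

```python
# Efficiency = (throughput relative to one quad) / (number of quads).
relative_throughput = {1: 1.0, 2: 1.9, 4: 3.4, 8: 5.6}  # illustrative points

for quads, effective in relative_throughput.items():
    print(f"{quads} quads: {100 * effective / quads:.0f}% efficiency")
```

With these assumed numbers, eight quads deliver 5.6 effective quads, i.e. 70% efficiency, which would clear the 60% bar stated above.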