Session: E03 DB2 Performance Update

May 19, 2008 • 1:30 p.m. – 2:30 p.m. • Platform: Linux, UNIX, Windows • Berni Schiefer, IBM Toronto Lab • Session: E03 DB2 Performance Update

Transcript of Session: E03 DB2 Performance Update


Page 2: Session: E03 DB2 Performance Update


Agenda
• Basics
• Benchmarks
• Performance Proof Points
• The great new stuff …
• Summary

Page 3: Session: E03 DB2 Performance Update


Basics – Platforms/OS
• The basic fundamentals haven’t changed
  – You still want/need a balanced (I/O, memory, CPU) configuration
  – We recommend 4GB-8GB RAM per core
  – 6-20 disks per core where feasible
• Use a recommended, generally available 64-bit OS
  – Applies to Linux, Windows, AIX, Solaris, HP-UX
  – e.g. AIX 5.3 TL07, SLES10 SP1, RHEL5.2, etc.
• All performance measurements/assumptions are with a 64-bit DB2 server
  – Clients can be 32-bit, 64-bit, or mixed
  – Even LOCAL clients
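The per-core guidelines above lend themselves to a quick back-of-the-envelope check. The sketch below simply multiplies the recommended ranges (4-8 GB RAM and 6-20 disks per core) by a core count; the function name and the 16-core example are illustrative only, not an IBM sizing tool.

```python
# Per-core guidelines from the slide: 4-8 GB RAM and 6-20 disks per core.
RAM_GB_PER_CORE = (4, 8)
DISKS_PER_CORE = (6, 20)


def balanced_config(cores: int) -> dict:
    """Return the (min, max) RAM in GB and disk count suggested for `cores` cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return {
        "ram_gb": (RAM_GB_PER_CORE[0] * cores, RAM_GB_PER_CORE[1] * cores),
        "disks": (DISKS_PER_CORE[0] * cores, DISKS_PER_CORE[1] * cores),
    }


if __name__ == "__main__":
    # A hypothetical 16-core server.
    print(balanced_config(16))  # {'ram_gb': (64, 128), 'disks': (96, 320)}
```

The disk range is often the hardest part to meet, which is why the next slide stresses finding the actual spindles behind virtualized storage.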

Page 4: Session: E03 DB2 Performance Update


Basics - Storage
• Disk spindles still matter
  – With sophisticated storage subsystems and storage virtualization, it just takes more sleuthing than ever to find them
  – Drives keep getting bigger; 146GB is now the norm
• Be leery of storage administrators who tell you:
  – “Don’t worry, it doesn’t matter”
  – “The cache will take care of it”
• Make the storage administrator your best friend!
  – Take them out for lunch/dinner, whatever it takes!

Page 5: Session: E03 DB2 Performance Update


Benchmarks
• DB2 is THE performance leader
  – for OLTP and Data Warehousing

Page 6: Session: E03 DB2 Performance Update


TPC-H result on IBM Balanced Warehouse E7100

[Chart: IBM System p6 570 and DB2 9.5 create top 10TB TPC-H performance (QphH)]
• IBM p6 570 / DB2 9.5: 343,551 QphH
• HP Integrity Superdome-DC Itanium 2 / Oracle 11g: 208,457 QphH
• HP Integrity Superdome / SQL Server 2008: 63,651 QphH

TPC Benchmark, TPC-H, QphH, are trademarks of the Transaction Processing Performance Council. For further TPC-related information, please see http://www.tpc.org.

DB2 9.5 on IBM System p6 570 (128 core POWER6 4.7GHz): 343,551 QphH@10000GB, 32.89 US $ per QphH@10000GB, available: April 15, 2008
Oracle 11g Enterprise Ed w/ Partitioning on HP Integrity Superdome-DC Itanium 2, HP-UX 11i v3 64 bit (128 core Intel Itanium 2 1.6 GHz): 208,457 QphH@10000GB, 27.97 US $ per QphH@10000GB, available: September 10, 2008
SQL Server 2005 on HP Integrity Superdome-DC Itanium 2, Windows (64 core Intel Itanium 2 1.6GHz): 63,651 QphH@10000GB, 38.54 US $ per QphH@10000GB, available: August 30, 2008

Latest POWER6 hardware combined with DB2 9.5 and DS4800 storage produce outstanding data warehouse performance

Delivers 1.65x faster performance than best Oracle result

Loaded 10TB data @ 6TB / hour (incl. data load, index creation, runstats)

Results as of 2008/03/24

TPC-H is a benchmark that uses DSS-type queries; it is the best candidate to measure data warehouse performance.
DB2 9.5 provides a significant proof point for the new IBM Balanced Warehouse E7100.
Delivers 2-3x the performance of existing Oracle results.
10TB database built in just 1h40m with DB2 9.5 (compared to Oracle/Sun 18h13m and Oracle/HP 5h51m) on DS4800 storage.
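Two headline numbers on this slide can be cross-checked from the figures quoted: the 1.65x ratio follows from the two QphH results, and the 6 TB/hour rate from building 10TB in 1h40m. The short sketch below does only that arithmetic; the helper names are illustrative.

```python
from datetime import timedelta

# Figures quoted on the slide and in the notes.
DB2_QPHH = 343_551
ORACLE_QPHH = 208_457
DATA_TB = 10
BUILD_TIME = timedelta(hours=1, minutes=40)  # DB2 9.5 build time from the notes


def speedup(a: float, b: float) -> float:
    """How many times larger result a is than result b."""
    return a / b


def load_rate_tb_per_hour(tb: float, elapsed: timedelta) -> float:
    """Average build rate in TB/hour over the whole elapsed time."""
    return tb / (elapsed / timedelta(hours=1))


if __name__ == "__main__":
    print(f"{speedup(DB2_QPHH, ORACLE_QPHH):.2f}x")                  # 1.65x
    print(f"{load_rate_tb_per_hour(DATA_TB, BUILD_TIME):.1f} TB/h")  # 6.0 TB/h
```

Note that the 1h40m build time includes data load, index creation and RUNSTATS, so the 6 TB/hour figure is an end-to-end rate, not a raw load rate.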

Page 7: Session: E03 DB2 Performance Update


[Chart: TPC-C performance comparison on 4-processor Intel Xeon 7350 (tpmC)]
• IBM x3850 / DB2 9.5: 516,752 tpmC
• HP DL580 / SQL Server 2005: 407,079 tpmC

TPC Benchmark, TPC-C, tpmC, are trademarks of the Transaction Processing Performance Council. For further TPC-related information, please see http://www.tpc.org.

DB2 9.5 on IBM System x3850, Red Hat Enterprise Linux Advanced Platform (4-way Intel Quad Core Xeon 7350 2.93 GHz): 516,752 tpmC @ $2.59/tpmC, available: April 15, 2008
SQL Server 2005 on HP DL580G5, Microsoft Windows Server 2003 Enterprise x64 (4-way Intel Quad Core Xeon 7350 2.93GHz): 407,079 tpmC @ $1.71/tpmC, available: September 5, 2007

Latest System x server (x3850 M2) combined with DB2 9.5 and Red Hat Enterprise Linux delivers outstanding OLTP performance

First data server to cross the half-million tpmC ceiling with 4 processors

With about 1.1 Billion web users in the world, the performance delivered in this benchmark would handle purchase and delivery of items to all these web users every 4 days

Results as of 2008/03/24

TPC-C result on IBM System x3850 M2 with Linux

TPC-C is a benchmark emulating an OLTP workload; DB2 is the TPC-C leader. This chart shows the relative performance on 4 processors.
IBM System x and DB2 9.5 on Red Hat beat SQL Server by 27%.
DB2 9.5 has a good relationship with Red Hat.
Oracle numbers are not available.

Page 8: Session: E03 DB2 Performance Update


IBM System p 570 and DB2 9 leader on SAP R/3 2-tier SD

[Chart: 8-core TPC-C results (tpmC)]
• IBM System p 550 / DB2 9.5: 629,159 tpmC
• HP Integrity rx6600 Itanium 2 9050 DC, 1.6GHz: 372,140 tpmC
• IBM System p 570 1.9GHz POWER5: 371,044 tpmC

DB2 9.5 on System p550 takes industry leadership in the 8-core TPC-C benchmark.

Demonstrates excellent performance of DB2 and POWER6 with AIX 5L

Demonstrates superior per core performance for DB2 9.5 on POWER6 processors

TPC-C on IBM System p550 and DB2 9.5/AIX 5.3

Results as of 2008/03/24

Configuration | Processors/Cores/Threads | Software | tpmC | $/tpmC | Submitted/Available
IBM System p 550 4.2GHz POWER6 | 4/8/16 | DB2 9.5, AIX 5.3 | 629,159 | $2.49 | 3/20/08 / 4/20/08
HP Integrity rx6600 Itanium 2 9050 DC, 1.6GHz | 4/8/16 | SQL Server 2005, Windows 2003 | 372,140 | $1.81 | 6/11/07 / 6/11/07
IBM System p 570 1.9GHz POWER5 | 4/8/16 | Oracle 10g, AIX 5.3 | 371,044 | $5.26 | 7/12/04 / 9/30/04

Also the leader on SAP R/3. SAP SD benchmarks are sales and distribution benchmarks designed to test the performance of database components and SAP applications. DB2 9 outperforms SQL Server and Oracle once again …

IBM System p 570, 8 processors / 16 cores / 32 threads, POWER6 4.7 GHz, 128 KB L1 cache and 4 MB L2 cache per core, 32 MB L3 cache per processor, 8000 benchmark users, AIX 5.3, DB2 9, available: May 2007
HP ProLiant DL580 G5, 4 processors / 16 cores / 16 threads, Quad-Core Intel Xeon Processor X7350 2.93 GHz, 64 KB L1 cache per core and 4 MB L2 cache per 2 cores, 3705 benchmark users, Windows Server 2003 Enterprise Edition, SQL Server 2005, available: Sept 2007
HP Integrity rx8620, 16-way XMP, Intel Itanium 2 1.5 GHz, 32 KB L1 cache, 256 KB L2 cache, 6 MB L3 cache per processor, 2880 benchmark users, HP-UX 11i, Oracle 9i, available: Dec 2003

Page 9: Session: E03 DB2 Performance Update


[Chart: DB2 9 top TPC-C performer among data server vendors on 8 processors (tpmC, higher is better)]
• DB2 9: 1,616,162 tpmC
• SQL Server 2005: 520,467 tpmC
• Oracle 10g: 254,471 tpmC

TPC Benchmark, TPC-C, tpmC, are trademarks of the Transaction Processing Performance Council. For further TPC-related information, please see http://www.tpc.org.

DB2 9 on IBM System p570, IBM AIX 5L V5.3 (8 P, 16 C 4.7 GHz POWER6): 1,616,162 tpmC @ $3.54/tpmC, available: November 21, 2007
SQL Server 2005 on Unisys ES7000, Microsoft Windows Server 2003 Enterprise x64 Edition (8 P, 16 C Intel Dual Core Xeon MP 3.4 GHz): 520,467 tpmC @ $2.73/tpmC, available: May 1, 2007
Oracle 10g on NEC Express5800, Red Hat Enterprise Linux AS 4.0 (8 P, 8 C Intel Itanium2 1.6GHz): 254,471 tpmC @ $5.32/tpmC, available: February 17, 2006
Oracle 10g on HP Integrity rx6600, HP-UX 11i v2 64 bit (2 P, 4 C Intel Itanium2 1.6GHz): 230,569 tpmC @ $2.63/tpmC, available: December 1, 2006
SQL Server 2000 on HP ProLiant ML350G4p, Microsoft Windows Server 2003 Enterprise Edition (1 P, 1 C Intel Xeon 3.4GHz): 42,432 tpmC @ $1.96/tpmC, available: March 29, 2005

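[Chart: DB2 9 best TPC-C performance per CPU/core among data servers (tpmC per core); top performer on POWER6]
• DB2 9: 101,010 tpmC per core
• Other bars: 57,642 and 42,432 tpmC per core (matching the Oracle 10g on HP Integrity rx6600 and SQL Server 2000 on HP ProLiant ML350G4p results in the footnote)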

Results as of 2008/03/24

DB2 9 has the best TPC-C number on 8 processors: 3.1x better than SQL Server and an amazing 6.3x better than Oracle 10g.
DB2 9 also has the best TPC-C performance per core among the competitors: 2-2.5x more tpmC per core than the competitors, which means better TCO with DB2.
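The per-core bars and the 8-processor ratios can be recomputed directly from the (tpmC, cores) pairs in this slide's footnote. The sketch below does that; the dictionary keys are just labels for the footnote entries.

```python
# (tpmC, cores) pairs taken from the footnote on this slide.
RESULTS = {
    "DB2 9 / IBM System p570": (1_616_162, 16),
    "SQL Server 2005 / Unisys ES7000": (520_467, 16),
    "Oracle 10g / NEC Express5800": (254_471, 8),
    "Oracle 10g / HP Integrity rx6600": (230_569, 4),
    "SQL Server 2000 / HP ProLiant ML350G4p": (42_432, 1),
}


def per_core(name: str) -> float:
    """tpmC divided by the number of cores for one footnote entry."""
    tpmc, cores = RESULTS[name]
    return tpmc / cores


def ratio(a: str, b: str) -> float:
    """Total-throughput ratio of result a to result b."""
    return RESULTS[a][0] / RESULTS[b][0]


if __name__ == "__main__":
    for name in RESULTS:
        print(f"{name}: {per_core(name):,.0f} tpmC per core")
    db2 = "DB2 9 / IBM System p570"
    print(f"vs SQL Server 2005: {ratio(db2, 'SQL Server 2005 / Unisys ES7000'):.2f}x")
    print(f"vs Oracle 10g:      {ratio(db2, 'Oracle 10g / NEC Express5800'):.2f}x")
```

This shows that the per-core chart compares the best per-core results (4-core and 1-core systems for the competitors), not the 8-processor systems from the first chart.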

Page 10: Session: E03 DB2 Performance Update


SPECjAppServer 2004 World Record

[Chart: DB2 9.5 has best SPECjAppServer 2004 results (JOPS@Standard)]
• 40-core System p5-595 / DB2 9.5: 14004
• 64-core HP Superdome / Oracle 10g: 10519

SPEC and the benchmark name SPECjAppServer 2004 are registered trademarks of the Standard Performance Evaluation Corporation. For the latest SPECjAppServer 2004 benchmark results, visit http://www.spec.org/.

DB2 9.5 has 1/3 more performance with ½ the number of cores!

Results as of 2008/03/24

Illustrates the advantage of combining DB2 with WebSphere

SPECjAppServer is the only official multi-tier end-to-end performance benchmark for J2EE technologies. It emulates information flow among an automotive dealership, manufacturing, supply chain management, and an order/inventory system.

DB2 was the first to publish on every version of the SPECjAppServer benchmark!
Only DB2 has published both single-database (i.e. non-XA) and multi-database (i.e. XA 2PC) results; the others were all single-database.
DB2 9 also has the best performance per core.
Only DB2 has leading results with both WebSphere and WebLogic on IBM and non-IBM platforms.

Sun/WLS/DB2: 8253.21 SPECjAppServer2004 JOPS@Standard; WLS 10 on Sun Blade 6000, 10x 8-core T6300 UltraSPARC T1 1.4GHz running Solaris 10 8/07, and DB2 9 on 48-core Sun E6900 UltraSPARC IV+ 1.95GHz
HP/WLS/Oracle: 7629.45 SPECjAppServer2004 JOPS@Standard; WLS 9.2 on 6x 8-core IA64 rx6600 1.6GHz running HP-UX 11iv3, and Oracle 10g EE 10.2.0.2 on HP Superdome 64x1.6GHz, 256GB RAM, running HP-UX 11v3
IBM/WAS/DB2: 4368 SPECjAppServer2004 JOPS@Standard; WAS 6.1 on xSeries Blade Center with 20 HS20 on SLES 9, 40 cores, and DB2 9 on p5-570 POWER5+ 1.9GHz, 16 cores, 128GB RAM, AIX 5.3
Sun/WLS/ORA: 4099 SPECjAppServer2004 JOPS@Standard; WLS 9.0 on Sun T2000 cluster 7x8 core, Solaris 10, Oracle 10g EE on E6900 UltraSPARC IV+ 40x1.5GHz, Solaris 10

Page 11: Session: E03 DB2 Performance Update


TPoX performance with DB2 9.5

[Diagram: TPoX scenario with Customers, a Brokerage House, and the DB]

For more information on TPoX please visit tpox.sourceforge.net

[Chart: TPoX throughput with DB2 9 and DB2 9.5 (txns/sec) for 15M order inserts, 3M custacc inserts, 21K security inserts, and full document replacement]

Transaction Processing over XML (TPoX) is an open source application-level XML database benchmark based on a financial application scenario

DB2 9.5 yields a 10%-54% throughput improvement over DB2 9 for TPoX inserts and full document replacement

The TPoX schema consists of order, security, holding and customer account tables. There are 15M documents in the order table, 3M documents in the customer account table and 21K documents in the security table.

Each customer has one or multiple accounts. Each account has one or multiple holdings. A holding is a certain number of shares of a security. A security can be a stock, a bond or a mutual fund. Customers place orders to buy or sell securities for their account(s).

In DB2 9 there were no sub-document updates; you could only replace an entire document. Full document replacement has improved by ~54% with DB2 9.5.
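The customer, account and holding relationship described above is a simple containment hierarchy. The sketch below models one hypothetical customer-account document with Python's standard `xml.etree.ElementTree`, then changes one holding the DB2 9 way: by re-serializing and writing back the *whole* document. The element and attribute names are invented for illustration and do not follow the actual TPoX schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical custacc-style document: customer -> accounts -> holdings.
DOC = """
<Customer id="1001">
  <Account id="A1">
    <Holding symbol="IBM" quantity="100"/>
    <Holding symbol="BOND-7Y" quantity="20"/>
  </Account>
  <Account id="A2">
    <Holding symbol="FUND-X" quantity="350"/>
  </Account>
</Customer>
"""


def holdings_per_account(xml_text: str) -> dict:
    """Map each account id to the number of holdings it contains."""
    root = ET.fromstring(xml_text)
    return {acct.get("id"): len(acct.findall("Holding")) for acct in root.findall("Account")}


def replace_full_document(xml_text: str, symbol: str, quantity: int) -> str:
    """DB2 9 style: change one holding, then re-serialize the entire document.

    Without sub-document update, the whole serialized document is what
    gets written back, even though only one attribute changed.
    """
    root = ET.fromstring(xml_text)
    for holding in root.iter("Holding"):
        if holding.get("symbol") == symbol:
            holding.set("quantity", str(quantity))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(holdings_per_account(DOC))  # {'A1': 2, 'A2': 1}
    new_doc = replace_full_document(DOC, "IBM", 150)
    print(len(new_doc), "characters rewritten to change one attribute")
```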

Page 12: Session: E03 DB2 Performance Update


TPoX performance on Intel

[Chart: Relative TPoX performance and CPU utilization using DB2 9.5]
• Intel Tulsa, DB2 9, 16GB: throughput 1.00 (baseline), CPU utilization 97%
• Intel Tigerton, DB2 9.5, 16GB: throughput 1.30, CPU utilization 56%
• Intel Tigerton, DB2 9.5, compression and in-lining, 16GB: throughput 1.90, CPU utilization 82%
• Intel Tigerton, DB2 9.5, compression and in-lining, 32GB: throughput 2.20, CPU utilization 90%

DB2 9 to DB2 9.5 size reduction of 67%

With the new quad-core Intel Tigerton processors: 2.2x TPoX throughput on DB2 9.5 with the new XML features (in-lining and compression)

[Chart: Database size (GB): DB2 9: 81; DB2 9.5: 57; compressed DB2 9.5: 27]

Tulsa processor: 4-socket dual-core Intel Xeon processor 7100 series
Tigerton processor: 4-socket quad-core Intel Xeon processor 7300 series

The first bar is the baseline result on DB2 9 with Tulsa processors.
With the Tigerton processors, there was a 30% improvement in throughput moving to quad-core, with ~44% idle CPU since the system was I/O bound.
Applying the new XML features in DB2 9.5 (in-lining and compression) allowed us to increase throughput by 1.9x and raise CPU utilization to ~82%. Since compression reduced I/O costs, we were able to drive more throughput at the cost of more user CPU.
Doubling the memory on Tigerton allowed us to achieve 2.2x throughput with DB2 9.5: there was even less I/O, so we were able to drive higher CPU utilization.

Storage space was also reduced by 67%, which results in disk savings ($).

Common storage: DS4800 with 78 disks, RAID5

Equivalent OS level: SLES 10 64-bit SP1

Intel Tulsa system referred to in slides: 16 GB of memory, DB2 9. The fastest Tulsa is 3.5GHz.

Page 13: Session: E03 DB2 Performance Update


DB2 / SAP AMD Opteron Virtualization performance on VMWare ESX 3.0.1

[Chart: DB2 9 virtualization capabilities outperform SQL Server 2005 (SD users)]
• DB2 9: 445 SD users
• SQL Server 2005: 350 SD users
Systems: 2 VCPU IBM System x3755, AMD Opteron 8220 SE 2.8GHz; 2 VCPU Dell PowerEdge 6950, AMD Opteron 8220 SE 2.8GHz

Virtualization enables superior efficiency allowing you to maximize use of unused server capacity and hardware resources with the least overhead

DB2 on VMWare ESX provides an effective and scalable production ready platform for hosting multiple virtualized transaction processing workloads

Self-Tuning Memory Manager allows DB2 to automatically adapt in dynamic resource allocation environments.

DB2 also offers automatic storage which enables storage virtualization

For SAP benchmark and related information please see http://www.sap.com/benchmark

For information on DB2 scalability on VMWare see http://www.vmware.com/pdf/db2_scalability_wp_vi3.pdf

Virtualization makes it possible to run multiple operating systems and multiple applications on the same computer at the same time, increasing the utilization and flexibility of hardware. VMware is the most widely deployed software for optimizing and managing IT environments through virtualization.
DB2 on VMware provides an effective, production-ready platform for hosting multiple virtualized transaction processing workloads.

DB2 9 supports the most SD (Sales and Distribution) users: ~30% more than SQL Server. With the same processors in both cases, DB2 delivers ~30% higher performance.

There are more automatic parameters now in DB2 9.5.
Db cfg:
  Database heap (4KB) (DBHEAP) = AUTOMATIC
  SQL statement heap (4KB) (STMTHEAP) = AUTOMATIC
  Default application heap (4KB) (APPLHEAPSZ) = AUTOMATIC
  Application Memory Size (4KB) (APPL_MEMORY) = AUTOMATIC
  Statistics heap size (4KB) (STAT_HEAP_SZ) = AUTOMATIC

Dbm cfg

Page 14: Session: E03 DB2 Performance Update


DB2 & AIX LPAR Mobility
• DB2 9.5 on AIX 5.3
• 2 CPU dedicated LPAR
• 14 GB RAM
• Virtual I/O Server

[Chart: Partition Migration Impact on OLTP: tpmC per interval, with the start and end of the LPAR migration marked]

Challenge: To migrate an LPAR with DB2 9.5 running a TPC-C-like workload from one physical host to another.

Outcome: During the memory copy portion of the migration, an 18% degradation in throughput was observed.

At the final switch-over phase, throughput stopped for a few seconds.

Immediately after the migration completed, DB2 9.5 was running at peak performance!

At 14GB RAM, the entire migration took about 5 minutes.

Page 15: Session: E03 DB2 Performance Update


The Great New Stuff

• When you think about the new features …
• As always:
  – “It depends”
  – We don’t know everything (yet)
  – Your mileage will vary
  – Please tell us what you think!

Page 16: Session: E03 DB2 Performance Update


Today’s UNIX/Linux Architecture

[Diagram: Today’s UNIX/Linux architecture; each circle is an OS process]
• Client: UDB Client Library, connecting to the UDB Server via shared memory & semaphores, TCPIP, named pipes, …
• Instance level (per-instance): listeners (db2tcpcm, db2ipccm) and the idle agent pool of idle, pooled agents or subagents (db2agent (idle))
• Per-application: coordinator agents (db2agent) and active subagents (db2agntp)
• Database level (per-database): buffer pool(s); prefetchers (db2pfchr) serving async I/O prefetch requests as parallel, big-block read requests; page cleaners (db2pclnr) handling victim notifications and issuing parallel page write requests; the logging subsystem (db2loggr, db2loggw) with the log buffer, handling write log requests; the deadlock detector (db2dlock)
• Data disks and log disks

Page 17: Session: E03 DB2 Performance Update


New UNIX/Linux Architecture

[Diagram: New UNIX/Linux architecture. The components are the same as in today’s architecture (client with the UDB Client Library; listeners db2tcpcm and db2ipccm; idle agent pool; coordinator agents db2agent and subagents db2agntp; prefetchers db2pfchr; page cleaners db2pclnr; logging subsystem db2loggr/db2loggw with the log buffer; deadlock detector db2dlock; buffer pool(s); data and log disks), but each circle is now an OS thread, all within a single, multi-threaded process.]

Page 18: Session: E03 DB2 Performance Update


Performance Advantages of the Threaded Architecture on UNIX/Linux

• Context switching between threads is generally faster than between processes
  – No need to switch address space
  – Less cache “pollution”
• Operating system threads require less context than processes
  – They share the address space and context information (such as uid, file handle table, etc.)
  – Memory savings
• Significantly fewer system file descriptors are used
  – All threads in a process can share the same file descriptors
  – No need for each agent to maintain its own file descriptor table
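The shared-address-space and shared-descriptor points above can be illustrated with ordinary OS threads. In this Python analogy (not DB2 code), several "agent" threads update one in-memory dictionary standing in for shared memory and write through a single open file, so every thread sees the same data and the same descriptor number. The names `buffer_pool`, `agent` and `run_agents` are hypothetical.

```python
import tempfile
import threading


def run_agents(n_agents: int) -> tuple:
    """Run n_agents threads that share one dict and one open file.

    Returns (entries in the shared dict, set of descriptor numbers seen,
    lines written to the shared file).
    """
    buffer_pool = {}            # stands in for memory shared by all agents
    seen_fds = []
    lock = threading.Lock()     # serializes access to the shared objects

    with tempfile.TemporaryFile(mode="w+") as log:
        def agent(agent_id: int) -> None:
            with lock:
                buffer_pool[f"page-{agent_id}"] = agent_id  # visible to every thread
                seen_fds.append(log.fileno())              # same descriptor in every thread
                log.write(f"agent {agent_id}\n")

        threads = [threading.Thread(target=agent, args=(i,)) for i in range(n_agents)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        log.seek(0)
        lines = log.readlines()

    return len(buffer_pool), set(seen_fds), len(lines)


if __name__ == "__main__":
    pages, fds, lines = run_agents(8)
    print(pages, len(fds), lines)  # 8 1 8
```

With one process per agent, each agent would instead need its own copy of per-process context and its own descriptor table entry for the same file.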

Page 19: Session: E03 DB2 Performance Update


Performance characteristics of threaded architecture

[Chart: Relative performance on Linux x64 with threaded DB2 9.5: relative throughput on Linux x64, DB2 9 vs DB2 9.5]

[Chart: Decrease in agent memory footprint with DB2 9.5: per-agent memory footprint (MB), lower is better, on Linux x64 and AIX, DB2 9 vs DB2 9.5]

Savings of up to 1 MB per agent due to new threaded architecture

Increased throughput by 14 % on Linux x64 internal OLTP workload

With the new threaded architecture, throughput increased by 14% on Linux x64 and the agent footprint decreased by up to 1MB per agent. On AIX we also see a ~0.6MB decrease in footprint.
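Because the savings are per agent, they scale with the number of agents. The sketch below just multiplies the per-agent figures from this slide (up to ~1 MB on Linux x64, ~0.6 MB on AIX) by an agent count; the 500-agent example is hypothetical, and the result is an upper-bound estimate, not a measurement.

```python
# Up to ~1 MB per agent on Linux x64 and ~0.6 MB on AIX, per the slide.
SAVINGS_MB_PER_AGENT = {"linux_x64": 1.0, "aix": 0.6}


def total_savings_mb(platform: str, agents: int) -> float:
    """Upper-bound memory saved across `agents` agents on `platform`."""
    return SAVINGS_MB_PER_AGENT[platform] * agents


if __name__ == "__main__":
    # 500 concurrent agents is a hypothetical example, not a measured configuration.
    for platform in SAVINGS_MB_PER_AGENT:
        print(platform, round(total_savings_mb(platform, 500)), "MB")
```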

Page 20: Session: E03 DB2 Performance Update


Tuning hints/tips

• Be current on your OS maintenance
• Use large pages where feasible
  – 64K pages are selected automatically on AIX
• Ensure the resource limits assigned to the few DB2 processes are “unlimited”
• Set the NUM_IOCLEANERS configuration parameter to AUTOMATIC
  – It uses the number of CPUs as a key factor
  – You don’t want to have too many cleaners
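One way to audit resource limits from inside a process is Python's standard `resource` module (Unix only). Which limits matter for a DB2 instance is an assumption here: the sketch checks only the data segment and file size limits as examples, so consult the DB2 documentation for your platform for the full list. Limits are inherited, so the check is meaningful only when run as the instance owner, in the same environment that starts DB2.

```python
import resource

# Example set of limits to audit; which ones DB2 needs is an assumption.
LIMITS = {
    "data": resource.RLIMIT_DATA,
    "fsize": resource.RLIMIT_FSIZE,
}


def limits_report() -> dict:
    """Return {name: (soft, hard, soft_is_unlimited)} for each audited limit."""
    report = {}
    for name, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        report[name] = (soft, hard, soft == resource.RLIM_INFINITY)
    return report


if __name__ == "__main__":
    for name, (soft, hard, unlimited) in limits_report().items():
        status = "unlimited" if unlimited else "LIMITED"
        print(f"{name:6} soft={soft} hard={hard} -> {status}")
```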

Page 21: Session: E03 DB2 Performance Update


XML Enhancements in DB2 9.5
• Base Table In-lining (BTI) and Compression
  – Store small XML docs in the XML column of the base table; no .xda storage needed
  – In-lined documents can be compressed
• XML Load
  – For bulk inserts of XML documents
• Faster Insert with XML Schema Validation
  – Up to 5x faster than DB2 9
• XML Update
  – Based on the XQuery Update Facility, a standardized extension to XQuery; allows you to modify, insert, or delete individual elements and attributes
  – 2-3x faster than the stored procedure approach in DB2 9
• Extensive path length reduction and optimizer improvements

Also:
• Instant compatible schema evolution. Schema evolution is a change to an XML schema. As we understand it: if your schema changed and you saved it under the same name, you can still validate your documents against it with the same validation statement as before, provided the schema became less restrictive. If it became more restrictive, you have to save it under a different name and validate only the new documents (the ones that conform to it) against it.
• Enabling existing customers
  – Non-Unicode, Offline Load, Replication, Federation
• Richer tool support (not mentioned on the slide; less central to a performance talk, although it is performance-related in that it makes users' work faster and more effective)
  – IBM Data Studio, RDA, DB2W, and many more
  – Altova, Skytide, and many more

Page 22: Session: E03 DB2 Performance Update


Base Table In-lining & Compression

[Chart: TPoX throughput by workload type (Tx/min) for inserts, queries and mixed workloads: DB2 9, DB2 9.5, in-lined DB2 9.5, and in-lined, compressed DB2 9.5]

Up to 3x improvement in throughput with In-lining and Compression of the XML data

XML document structures are now stored more efficiently on DB2 9.5

Obtained a 30% reduction in space just by using DB2 9.5 to load the data.

With in-lining and compression achieved space savings of ~68%

[Chart: Database size reduction with compressed XML tables in DB2 9.5 (database size, GB)]
• DB2 9: 16.5
• DB2 9.5: 11.8
• In-lined DB2 9.5: 10.9
• Compressed DB2 9.5: 5.3

17% improvement in inserts, ~ 3x improvement in queries and 2.2x improvement in mixed transactions

We save space when migrating to DB2 9.5 because internal structures are now stored more efficiently
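The space-saving percentages quoted on this slide follow from the database sizes in its chart (16.5 GB for DB2 9, 11.8 GB for DB2 9.5, 10.9 GB in-lined, 5.3 GB compressed, as read from the chart labels). The sketch below recomputes them: about 28.5% for the plain DB2 9.5 load (the "30%" claim) and about 67.9% with in-lining and compression (the "~68%" claim).

```python
# Database sizes (GB) as read from the chart labels on this slide.
SIZES_GB = {
    "DB2 9": 16.5,
    "DB2 9.5": 11.8,
    "In-lined DB2 9.5": 10.9,
    "Compressed DB2 9.5": 5.3,
}


def reduction_pct(baseline: str, variant: str) -> float:
    """Space saved by `variant` relative to `baseline`, in percent."""
    return (1 - SIZES_GB[variant] / SIZES_GB[baseline]) * 100


if __name__ == "__main__":
    for variant in ("DB2 9.5", "In-lined DB2 9.5", "Compressed DB2 9.5"):
        print(f"{variant}: {reduction_pct('DB2 9', variant):.1f}% smaller than DB2 9")
```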

Page 23: Session: E03 DB2 Performance Update


XML Load Performance

LOAD support for XML is new in DB2 9.5!

Tests done on TPoX tables in DB2 9.5 show that LOAD outperforms IMPORT by a factor of 3-8x

[Chart: Population of an XML table using Load is faster than Import: elapsed time (sec), lower is better, for XML Import vs XML Load with no indexes, 10 indexes, and schema validation]

Increase CPU and DISK PARALLELISM to speed up LOAD

Build indexes during LOAD rather than loading then creating the indexes

Always run RUNSTATS after loading tables

With no indexes, 7x improvement in throughput using Load.
With 10 indexes, 4x improvement with Load.
Schema validation can also be done during Load; the results show that schema validation is ~8x faster with Load than with Import.
RUNSTATS during LOAD forces CPU_PARALLELISM to 1.

Page 24: Session: E03 DB2 Performance Update


Best Practices for XML

• Use Base Table In-lining and Row Compression for workloads that …
  – Tend to be more I/O-bound than CPU-bound
  – Contain statements that involve XML columns
  – Do not touch large numbers of XML documents per statement (be aware of temping)
• Use Load instead of Import to insert XML documents
• Use the new XML Update facility rather than the XML Update stored procedure
• Filter XML documents passed to the XML Update transform rather than filtering within the XML Update transform
• Apply normal tunings for update workloads
• Use parameter markers in your XML update statements to avoid recompilation

Here is an example of the new XML Update using SQL-style parameter marker ("?") to avoid recompilation. In this example we also filter the data passed to the update transform with a where clause.

    update xmlcustomer
      set info = xmlquery('transform
                             copy $new := $INFO
                             modify do replace value of $new/customer_info/phone with $z
                             return $new'
                          passing cast(? as varchar(15)) as "z")
      where cid = ?

For more information on how to use the new XML Update see http://www.ibm.com/developerworks/db2/library/techarticle/dm-0710nicola/

Here is another example, this time filtering the documents passed to the transform:

    xquery
    for $i in db2-fn:xmlcolumn("XMLCUSTOMER.INFO")[/customerinfo/name="John Smith"]
    return
      transform copy $new := $i
      modify do delete $new/customerinfo/phone
      return $new;
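The benefit of filtering before the transform is that only matching documents get copied and modified. The Python sketch below mimics the second XQuery example with `xml.etree.ElementTree`: documents whose name does not match are skipped before any copy is made, and matching ones have their phone element deleted from a copy. The documents and the `copy_count` bookkeeping are hypothetical; this illustrates the idea and is not how DB2 evaluates the query.

```python
import copy
import xml.etree.ElementTree as ET

DOCS = [
    "<customerinfo><name>John Smith</name><phone>555-0100</phone></customerinfo>",
    "<customerinfo><name>Jane Doe</name><phone>555-0199</phone></customerinfo>",
    "<customerinfo><name>John Smith</name><phone>555-0142</phone></customerinfo>",
]


def delete_phone_filtered_first(docs, name):
    """Filter first, then copy-modify only the matching documents.

    Mirrors: for $i in ...[/customerinfo/name=...] return transform copy $new := $i
    modify do delete $new/customerinfo/phone return $new
    """
    results, copy_count = [], 0
    for text in docs:
        doc = ET.fromstring(text)
        if doc.findtext("name") != name:
            continue                      # filtered out before any copy is made
        new = copy.deepcopy(doc)          # "copy $new := $i"
        copy_count += 1
        for phone in new.findall("phone"):
            new.remove(phone)             # "modify do delete $new/customerinfo/phone"
        results.append(ET.tostring(new, encoding="unicode"))
    return results, copy_count


if __name__ == "__main__":
    out, copies = delete_phone_filtered_first(DOCS, "John Smith")
    print(copies)   # 2
    print(out[0])   # <customerinfo><name>John Smith</name></customerinfo>
```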

Page 25: Session: E03 DB2 Performance Update


Fast Redistribute Utility
• Enables rapid incremental growth of a data warehouse
  – Rapidly moves rows from partitions with more data to partitions with less/no data, including space reclamation
• High performing, as it …
  – Reduces the active log space requirement
  – Reduces code path
  – Performs multiple activities in a single pass of the data
  – Redistributes multiple tables in parallel
• With DB2 9.5, a redistribute command is equivalent to these steps in DB2 9:
  1. Dropping and re-creating the indexes
  2. Redistribute
  3. Running REORG on the table
  4. Executing RUNSTATS on the table

A RUNSTATS profile has to be defined beforehand for these steps to apply.
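Conceptually, a partitioned database maps each row's distribution key to a bucket in a distribution map, and each bucket belongs to a partition; adding a partition means reassigning some buckets and moving only the rows in those buckets. The toy sketch below uses `key % 12` in place of DB2's hashing and a 12-entry map in place of DB2's much larger one. It illustrates why redistribution moves only a fraction of the data; it is not how the utility is implemented.

```python
from collections import Counter

NUM_BUCKETS = 12  # hypothetical map size; DB2's real distribution map is much larger


def initial_map(partitions):
    """Assign buckets to partitions round-robin."""
    return [partitions[b % len(partitions)] for b in range(NUM_BUCKETS)]


def add_partition(dmap, new_partition):
    """Hand buckets to a new partition, moving as few buckets as possible."""
    dmap = list(dmap)
    counts = Counter(dmap)
    target = len(dmap) // (len(counts) + 1)   # fair share per partition
    for bucket, owner in enumerate(dmap):
        if counts[new_partition] >= target:
            break
        if counts[owner] > target:            # take only from over-full partitions
            counts[owner] -= 1
            counts[new_partition] += 1
            dmap[bucket] = new_partition
    return dmap


def rows_to_move(keys, old_map, new_map):
    """Rows whose bucket now belongs to a different partition."""
    return [k for k in keys if old_map[k % NUM_BUCKETS] != new_map[k % NUM_BUCKETS]]


if __name__ == "__main__":
    keys = range(1200)                      # hypothetical distribution-key values
    old = initial_map([0, 1, 2])
    new = add_partition(old, 3)
    print(Counter(new[k % NUM_BUCKETS] for k in keys))  # 300 rows per partition
    print(len(rows_to_move(keys, old, new)))            # 300
```

Only a quarter of the rows move when going from three to four partitions; the DB2 9.5 utility's advantage lies in how it moves them (page-level, minimally logged, single pass).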

Page 26: Session: E03 DB2 Performance Update


Fast Redistribute Details

Feature | DB2 9 Redistribute | DB2 9.5 Redistribute (aka Fast Redistribute)
Implementation and Architecture | Uses standard SQL inserts and deletes | Bypasses runtime; parallel architecture; single pass of data; data compaction; parallel processing of tables
Performance | Low: record-level processing | High: page-level processing
Logging Requirements | High: full SQL logging; high disk requirements | Low: minimal logging
Indexes | Incremental indexing (slow and heavy logging); no sorting; very costly | Single table scan, parallel sorting, parallel index rebuild
Disk Requirements | Fully logged; large active log space and total log space/archive required | No additional disk requirements
Catalog Contention | Low: tables are not created and dropped | Low: tables are not created and dropped
Post-Redistribute Steps Required | REORG, RUNSTATS, rebind; possibly re-create indexes | Only rebinding of packages

Neither DB2 9 redistribute nor DB2 9.5 fast redistribute supports replicated MQTs; drop and re-create them after redistribution.

Page 27: Session: E03 DB2 Performance Update


Fast Redistribute Performance Data

Up to 83% improvement in total time to redistribute all the data

More consistent redistribution times: the time to redistribute ½ the data is ~½ the overall time in DB2 9.5

[Chart: Elapsed time to redistribute table on DB2 9 and DB2 9.5: relative elapsed time (lower is better) vs. percentage of table redistributed (50%, 100%), DB2 9 Redistribute vs DB2 9.5 Fast Redistribute]

Linux x64: ~13% reduction in total time to redistribute the table for 50% of the data; for 100% of the data, an improvement of 83%.

Page 28: Session: E03 DB2 Performance Update


Fast Redistribute Hints / Tips

• Use separate disks for the logs and for the table space containers

• Create a large temporary table space for each node and increase the buffer pool and sort heap sizes needed for index creation

• Define RUNSTATS profiles for all tables to be redistributed

• Back up the affected table spaces before and after redistribution

In DB2 9.5, we place the table space in backup pending state after redistribution.

Page 29: Session: E03 DB2 Performance Update


LOB Performance Enhancements

• Large objects are becoming more prevalent
  – BLOB, CLOB, DBCLOB
• DB2 9.5 enhances client/server LOB performance
  – Blocking of cursors containing LOB columns
  – DB2 automatically chooses the best-performing method to send LOB data back to the client
  – CLI also supports RTNEXTALL for blocking cursors with LOB columns
• Improved performance by reducing the number of network flows required to retrieve a LOB

Page 30: Session: E03 DB2 Performance Update


Performance benefit depends on the row size

• The smaller the row size, the more LOBs can be blocked, and the more significant the improvement

• With a larger row size, blocking saves little network traffic, so the performance benefit for medium and large-sized LOBs is not as high as the improvement for small pages

LOB Performance Improvements

[Chart: Elapsed time to retrieve LOBs on pages with different sizes: elapsed time (sec), lower is better, for 4K, 8K, 16K and 32K page sizes, DB2 9 vs DB2 9.5]

Smaller row size -> more rows in a query block -> significant improvement (40% for LOB size = 4K)

Larger row size -> more network traffic, limited by the size of the TCP/IP packets -> more send calls (the performance benefit of DDF for medium and large-sized LOBs is < 20%)

For large pages, consider setting DB2SOSNDBUF and DB2SORCVBUF above 64K; these increase the send and receive buffers in DB2.
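A simple model shows why blocking helps small rows much more than large ones. The sketch below counts how many fixed-size blocks are needed to return a result set with and without blocking. The 32,767-byte block size and the "one row per flow when unblocked" model are simplifying assumptions, not a description of DB2's actual client/server protocol.

```python
import math

# Assumed query block size in bytes, used here purely as an example.
BLOCK_BYTES = 32_767


def network_flows(rows: int, row_bytes: int, blocking: bool, block_bytes: int = BLOCK_BYTES) -> int:
    """Rough count of blocks needed to return `rows` rows of `row_bytes` each.

    Without blocking, each row travels in its own flow(s). With blocking,
    as many whole rows as fit are packed into each block.
    """
    flows_per_row = math.ceil(row_bytes / block_bytes)
    if not blocking or row_bytes > block_bytes:
        return rows * flows_per_row
    rows_per_block = block_bytes // row_bytes
    return math.ceil(rows / rows_per_block)


if __name__ == "__main__":
    for size in (4_096, 8_192, 16_384, 32_768):
        before = network_flows(1_000, size, blocking=False)
        after = network_flows(1_000, size, blocking=True)
        print(f"{size:>6} bytes/row: {before} flows -> {after} flows")
```

In this model, 4K rows go from 1,000 flows to 143, while rows at or above half the block size gain nothing, which matches the trend described above.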

Page 31: Session: E03 DB2 Performance Update


Decimal Floating Point (DFP)
• How to represent numbers
  – BIGINT: good for whole numbers
  – FLOAT: binary floating point is fast but can be inaccurate for business applications
  – DECIMAL: the SQL DECIMAL data type, implemented in software
  – DECFLOAT(n): new data type with 16- and 34-digit precision
• IBM POWER6 is the first UNIX microprocessor to support decimal floating point arithmetic in hardware
  – Provides additional performance acceleration

DECFLOAT takes 8 bytes (16 digits) or 16 bytes (34 digits).
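The accuracy argument for DECFLOAT can be seen with any decimal floating-point arithmetic. Python's standard `decimal` module (based on IBM's General Decimal Arithmetic Specification) is used below as a stand-in; it says nothing about POWER6 hardware speed, only about why binary FLOAT drifts on money-like values while a 16-digit decimal does not.

```python
from decimal import Decimal, localcontext


def binary_float_total(amount: str, times: int) -> float:
    """Add a decimal amount repeatedly using binary floating point."""
    return sum(float(amount) for _ in range(times))


def decimal_total(amount: str, times: int, digits: int = 16) -> Decimal:
    """Same sum using decimal floating point with `digits` of precision.

    16 and 34 digits match the precisions of DECFLOAT(16) and DECFLOAT(34).
    """
    with localcontext() as ctx:
        ctx.prec = digits
        return sum((Decimal(amount) for _ in range(times)), Decimal(0))


if __name__ == "__main__":
    print(binary_float_total("0.10", 10))  # 0.9999999999999999
    print(decimal_total("0.10", 10))       # 1.00
```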

Page 32: Session: E03 DB2 Performance Update


Decimal Floating Point Performance on IBM POWER6 Hardware with AIX

• Significant range of speed-up
  – Depends on how CPU-bound the workload is and on the number/kind of math expressions
  – Have seen up to 6x faster performance
  – In one complex expression with mainly aggregation, DECFLOAT(16) was 1.6x faster than DECIMAL(15,4) when using DFP on POWER6
• Have seen ~40% gains in SAP BW environments

[Chart: Elapsed time reduction for a sample query (on POWER6 hardware): elapsed time (sec), lower is better, for Float/Double, Decimal and DFP]

Q1 in TPC-H does a lot of calculations on DFP columns; we compute sums and averages across these columns, so the performance benefit is more noticeable:

    select
      l_returnflag,
      l_linestatus,
      sum(l_quantity) as sum_qty,
      sum(l_extendedprice) as sum_base_price,
      sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
      sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
      avg(l_quantity) as avg_qty,
      avg(l_extendedprice) as avg_price,
      avg(l_discount) as avg_disc,
      count_big(*) as count_order
    from tpcd.lineitem
    where l_shipdate <= date ('1998-12-01') - 117 day
    group by l_returnflag, l_linestatus
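To make the "lots of calculations" point concrete, the sketch below evaluates the same per-row expressions as Q1 (discounted price, charge including tax, per-group sums, averages and counts) on a few made-up rows, using Python's `decimal` module with 16 digits of precision as a stand-in for DECFLOAT(16). It mirrors the arithmetic only; the rows are invented, and the `where` filter and DB2 execution are not modeled.

```python
from collections import defaultdict
from decimal import Decimal, localcontext

# Hypothetical lineitem rows: (returnflag, linestatus, quantity, extendedprice, discount, tax)
ROWS = [
    ("A", "F", Decimal("17"), Decimal("21168.23"), Decimal("0.04"), Decimal("0.02")),
    ("A", "F", Decimal("36"), Decimal("45983.16"), Decimal("0.09"), Decimal("0.06")),
    ("N", "O", Decimal("8"), Decimal("13309.60"), Decimal("0.10"), Decimal("0.02")),
]


def q1_aggregates(rows, digits=16):
    """Group by (returnflag, linestatus) and compute Q1-style sums and averages."""
    groups = defaultdict(list)
    for flag, status, qty, price, disc, tax in rows:
        groups[(flag, status)].append((qty, price, disc, tax))
    result = {}
    with localcontext() as ctx:
        ctx.prec = digits  # 16 digits, as with DECFLOAT(16)
        for key, items in sorted(groups.items()):
            n = len(items)
            sum_qty = sum(q for q, _, _, _ in items)
            result[key] = {
                "sum_qty": sum_qty,
                "sum_base_price": sum(p for _, p, _, _ in items),
                "sum_disc_price": sum(p * (1 - d) for _, p, d, _ in items),
                "sum_charge": sum(p * (1 - d) * (1 + t) for _, p, d, t in items),
                "avg_qty": sum_qty / n,
                "count_order": n,
            }
    return result


if __name__ == "__main__":
    for key, agg in q1_aggregates(ROWS).items():
        print(key, agg["sum_charge"], agg["avg_qty"])
```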

Page 33: Session: E03 DB2 Performance Update


Decimal Floating Point Performance on x64 Hardware running Linux

• Significant range of speed-up
  – Depends on how CPU-bound the workload is and on the number/kind of math expressions
  – In one complex expression with mainly aggregation, DECFLOAT(16) was 1.4x faster than DECIMAL(15,4) when using the DFPAL library

[Chart: Elapsed time reduction for a sample query (on x64 hardware): elapsed time (sec), lower is better, for Float/Double, Decimal and DFP]

DFP hardware support exists on POWER6, but today we only use it on AIX, not on Linux on POWER. Q1 in TPC-H does a lot of calculations on DFP columns; we compute sums and averages across these columns, so the performance benefit is more noticeable:

    select
      l_returnflag,
      l_linestatus,
      sum(l_quantity) as sum_qty,
      sum(l_extendedprice) as sum_base_price,
      sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
      sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
      avg(l_quantity) as avg_qty,
      avg(l_extendedprice) as avg_price,
      avg(l_discount) as avg_disc,
      count_big(*) as count_order
    from tpcd.lineitem
    where l_shipdate <= date ('1998-12-01') - 117 day
    group by l_returnflag, l_linestatus

Page 34: Session: E03 DB2 Performance Update


MDC Rollout
• Faster DELETE along cell or slice boundaries
• Immediate Index Cleanup Rollout (implemented in DB2 V8.2.2)
• Deferred Index Cleanup Rollout (new in DB2 9.5)

[Diagram: MDC table clustered on the year, nation and colour dimensions. Cells include (1997, Canada, blue), (1997, Canada, yellow), (1997, Mexico, blue), (1997, Mexico, yellow), (1998, Canada, yellow) and (1998, Mexico, yellow); the cell for (1997, Canada, yellow) is highlighted. Each cell contains one or more blocks.]

MDC tables are multi-dimensional clustered tables that deliver fast performance by introducing block indexes, which point to blocks (groups of records) instead of to individual records. By physically organizing data in an MDC table into blocks according to clustering values, and then accessing these blocks through block indexes, MDC provides significant additional performance benefits. In the example above, the block indexes on nation, year and colour improve performance for queries that use these columns.
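A toy model may help picture cells, blocks and block indexes. A minimal sketch, assuming a hypothetical fixed block capacity of 2 rows and in-memory dictionaries; the class and method names are invented for illustration. Real MDC blocks are extents of pages and DB2's block indexes are on-disk B-trees, so this shows only the bookkeeping idea: index entries point at blocks, and a rollout along a slice empties whole blocks.

```python
from collections import defaultdict

BLOCK_CAPACITY = 2  # hypothetical rows per block; real MDC blocks are extents of pages

# The cells from the diagram, in the order shown: (year, nation, colour)
SLIDE_ROWS = [
    (1997, "Mexico", "blue"), (1997, "Canada", "blue"),
    (1997, "Mexico", "yellow"), (1997, "Canada", "yellow"),
    (1997, "Canada", "yellow"), (1997, "Mexico", "yellow"),
    (1998, "Canada", "yellow"), (1998, "Mexico", "yellow"),
]


class ToyMDCTable:
    """Rows live in blocks; every block belongs to exactly one cell."""

    def __init__(self, dims):
        self.dims = tuple(dims)               # e.g. ("year", "nation", "colour")
        self.blocks = []                      # block id -> list of rows
        self.cell_blocks = defaultdict(list)  # cell (tuple of dim values) -> block ids
        # One block index per dimension: dimension value -> set of block ids.
        self.block_index = {d: defaultdict(set) for d in self.dims}

    def insert(self, row):
        cell = tuple(row[d] for d in self.dims)
        ids = self.cell_blocks[cell]
        if not ids or len(self.blocks[ids[-1]]) >= BLOCK_CAPACITY:
            self.blocks.append([])            # allocate a new block for this cell
            ids.append(len(self.blocks) - 1)
            for d in self.dims:               # index the block, not the row
                self.block_index[d][row[d]].add(ids[-1])
        self.blocks[ids[-1]].append(row)

    def blocks_for(self, dim, value):
        """Block-index lookup: ids of blocks holding rows with dim == value."""
        return sorted(self.block_index[dim].get(value, ()))

    def rollout(self, dim, value):
        """Delete a whole slice by emptying its blocks.

        Index entries are removed per block, not per row. Returns rows removed.
        """
        removed = 0
        for b in self.blocks_for(dim, value):
            removed += len(self.blocks[b])
            self.blocks[b] = []
            for d in self.dims:
                for ids in self.block_index[d].values():
                    ids.discard(b)
        pos = self.dims.index(dim)
        for cell in [c for c in self.cell_blocks if c[pos] == value]:
            del self.cell_blocks[cell]
        return removed
```

Loading `SLIDE_ROWS` yields six blocks; a lookup on nation = Canada touches three of them, and rolling out year = 1997 removes six rows by emptying four blocks.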

Page 35: Session: E03 DB2 Performance Update


Deferred index cleanup rollout
• Since DB2 v8.2.2, rollout deletion has provided faster, block-based deletes with less logging than regular deletes
  • But it still required row-level processing and logging for each index
  • Performance depended on the number of indexes
• DB2 9.5 provides further enhancements: deferred index cleanup
  • Enabled by setting the DB2_MDC_ROLLOUT registry variable to DEFER, or with SET CURRENT MDC ROLLOUT MODE DEFERRED
  • Removes index keys after the transaction commits
  • Cleans up multiple indexes in parallel
  • Reduces logging: instead of logging one record for every RID removed from the indexes, only one record per index page is logged
  • The application doing the delete does not need to wait before processing other transactions
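The logging reduction can be estimated with simple arithmetic. A back-of-the-envelope sketch, assuming (as described above) one log record per RID per index for immediate cleanup versus one per index page per index for deferred cleanup, and a hypothetical uniform 200 keys per index leaf page; the function name is invented, and log records for the data blocks themselves are ignored. The 11 million rows and 8 RID indexes come from the test setup on the next slide.

```python
import math


def index_cleanup_log_records(rows_deleted, rid_indexes, keys_per_index_page, deferred):
    """Estimate log records written for index cleanup during an MDC rollout.

    Immediate cleanup: one record per RID removed, for each RID index.
    Deferred cleanup: one record per index page cleaned, for each RID index.
    keys_per_index_page is a hypothetical average; real values depend on key size.
    """
    if deferred:
        pages_per_index = math.ceil(rows_deleted / keys_per_index_page)
        return rid_indexes * pages_per_index
    return rid_indexes * rows_deleted


if __name__ == "__main__":
    deleted = 11_000_000 * 97 // 100  # the "97% of table deleted" case
    print(index_cleanup_log_records(deleted, 8, 200, deferred=False))  # 85,360,000
    print(index_cleanup_log_records(deleted, 8, 200, deferred=True))   # 426,800
```

Under these assumptions the ratio between the two is simply the number of keys per index page, which is why the saving grows with smaller keys and more RID indexes.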

Page 36: Session: E03 DB2 Performance Update


MDC rollout performance data

Deferred rollout, excluding the wait time for async index cleanup, is the fastest: transactions do not have to wait until the index cleanup is finished

Significant reduction in log space usage with deferred rollout

[Chart: Time to delete rows using different options of MDC rollout; y-axis: relative time to delete rows (0-100%), lower is better; x-axis: percentage of table deleted (30%, 97%); series: Delete (no rollout), Immediate rollout, Deferred rollout (including async cleanup), Deferred rollout (excluding async cleanup)]

[Chart: Log space needed to delete rows using various MDC rollout options; y-axis: % log space needed (0-100%), lower is better; x-axis: percentage of table deleted (30%, 97%); series: Delete (no rollout), Immediate rollout, Deferred rollout]

Test configuration: 11 million rows (134,260 pages), 16K page size, 16K extent size, 4 nodes, 8 RID indexes. The "Deferred rollout (excluding async cleanup)" result is included to show how quickly the DELETE statement itself finishes.

Page 37: Session: E03 DB2 Performance Update


When to use deferred index cleanup rollout

• Consider deferred index cleanup when
  • There are a number of RID indexes
  • There is a large number of deletes
  • Several rollouts are planned for a particular table
  • Log space is limited
  • You are doing a lot of roll-in/roll-out in a short maintenance window (transactions do not have to wait for index cleanup to occur)

Limited log space environments: customers may need to break a large delete into smaller ones, deleting the first N rows at a time. Deferred index cleanup can avoid this problem because it not only reduces logging but also performs internal commits during the cleanup.

Deferred rollout has the following drawbacks:
1) Blocks cannot be reused immediately after deletion.
2) Index scans are slower while the indexes are being cleaned up.
3) Additional memory is needed to track which blocks have been rolled out.

Immediate index cleanup is the default in DB2 9.5.
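For comparison, the workaround mentioned in the notes above (breaking a large delete into smaller ones and committing after each) can be sketched as follows. This uses SQLite from Python purely to show the pattern; the `sales` table, the `year = 1997` predicate and the batch size are hypothetical, and the DB2 syntax for limiting a delete is not shown here.

```python
import sqlite3


def make_sales_db(rows_1997, rows_1998):
    """Hypothetical demo table with rows for two years."""
    con = sqlite3.connect(":memory:")
    con.execute("create table sales (year integer)")
    con.executemany(
        "insert into sales values (?)", [(1997,)] * rows_1997 + [(1998,)] * rows_1998
    )
    con.commit()
    return con


def delete_in_batches(con, batch_size):
    """Delete matching rows in small transactions so each commit bounds log usage.

    Returns the number of non-empty batches (commits) performed.
    """
    batches = 0
    while True:
        cur = con.execute(
            "delete from sales where rowid in "
            "(select rowid from sales where year = 1997 limit ?)",
            (batch_size,),
        )
        con.commit()  # each batch is its own small transaction
        if cur.rowcount == 0:
            return batches
        batches += 1
```

The cost of this pattern is many statements and commits driven by the application; the point made in the notes is that deferred index cleanup commits its cleanup work internally, so the application does not need this loop.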

Page 38: Session: E03 DB2 Performance Update


Optimizer enhancements in DB2 9.5
• Real Time Statistics (RTS)
  • Table statistics are updated automatically over time, driven by variables including UDI (update/delete/insert) activity
  • Potentially significant query plan improvements
  • Slight overhead
• FFNR (Fetch First N Rows), OFNR (Optimize for N Rows) and Group-By with min/max
  • Improved costing and better query/sub-query plan alternatives for FFNR, OFNR and Group-By
• Performance results:
  • Up to 99% improvement for such queries
  • < 1% overhead for RTS
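Part of why FETCH FIRST N ROWS queries benefit from better costing is that they do not need a full sort of the input. A minimal sketch of the general top-N idea on in-memory data, with hypothetical function names; it illustrates the technique, not DB2's actual plan operators.

```python
import heapq


def fetch_first_n_full_sort(rows, key, n):
    """Naive plan: sort all m rows (O(m log m)), then keep the first n."""
    return sorted(rows, key=key)[:n]


def fetch_first_n_topn(rows, key, n):
    """Top-N plan: keep only n candidates in a heap while scanning (O(m log n))."""
    return heapq.nsmallest(n, rows, key=key)
```

Both return the same rows; the top-N version never holds more than n candidates, so for small n over a large input the work and memory drop sharply. An optimizer that costs this correctly can also pick plans (for example, an index that already delivers rows in order) that stop after n rows.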

Page 39: Session: E03 DB2 Performance Update


STMM Updated for 9.5
• The self-tuning memory manager (STMM) has been enhanced in DB2 9.5 with feedback from customers
• We have seen very positive results
  • Particularly in OLTP environments
  • Some ISVs now use it out of the box
• The DB2 performance team uses it to tune OLTP benchmarks
  • But we turn it off at the end, once tuning is complete
• We continue to work on enhancements
  • Particularly for data warehousing environments

Page 40: Session: E03 DB2 Performance Update


Optimizer enhancements in DB2 9.5
• In-list cardinality improvements
  • DB2 has enhanced the costing of IN-list predicates that are converted to nested loop joins
  • Performance results:
    • Various individual query improvements for SAP-SSQJ and internal query workloads (30%-99%)
• Improved filter-factor/selectivity estimation of 'between' predicates using parameter markers
  • Performance results:
    • 5% improvement in overall throughput, SAP-SD (AIX)
    • 12% faster average response time, SAP-SD (Linux)

The IN-list-to-join optimization already existed in DB2 9; DB2 9.5 improves it.
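To make "filter factor" concrete, here is the textbook uniform-distribution estimate for BETWEEN and IN-list predicates. This is a simplified sketch with hypothetical function names, not DB2's cost model: DB2 also uses distribution statistics, and with parameter markers the actual range is unknown when the plan is compiled, which is exactly what makes the estimate harder.

```python
def between_filter_factor(lo, hi, col_min, col_max):
    """Fraction of rows with lo <= col <= hi, assuming values uniform in [col_min, col_max]."""
    if col_max <= col_min:                    # single-valued column
        return 1.0 if lo <= col_min <= hi else 0.0
    overlap = min(hi, col_max) - max(lo, col_min)
    return max(0.0, min(1.0, overlap / (col_max - col_min)))


def in_list_filter_factor(num_values, col_distinct):
    """Fraction of rows matching an IN list of distinct values, assuming uniform frequencies."""
    return min(1.0, num_values / col_distinct)


def estimated_rows(table_rows, filter_factor):
    """Cardinality estimate = table cardinality x filter factor."""
    return round(table_rows * filter_factor)
```

For example, `BETWEEN 25 AND 50` on a column spread over 0 to 100 gives a filter factor of 0.25, or 250,000 of 1,000,000 rows; a 3-value IN list on a column with 300 distinct values gives 0.01. These cardinalities drive choices such as whether an IN list should become a nested loop join.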


Page 41: Session: E03 DB2 Performance Update


Unicode Enhancements
• Unicode standards exist to support the worldwide interchange and processing of texts in diverse languages
  • Unicode has the ability to encode 1.1 million characters
• Unicode is the default for new databases created in DB2 9.5
  • With functional and performance enhancements for Unicode
  • UCA400 collation improvements for sorting and organizing data
    • Normalized Unicode, Thai and Slovakian

Changes in DB2 9.5:
• Codepage conversion improvements in the libg11n library (caching values, fewer function calls, minimal tracing, short-cutting sqlnlsIconv)
• Rearranged DB2 code to minimize cycles spent on inter-library glue code
• Added adjustable ICU key buffering for binary sort
• Eliminated excessive ICU tracing

ICU (International Components for Unicode) is a widely used library implementing the Unicode standards, which DB2 follows. ICU tracing: tracing of the code in the ICU library. ICU key buffer: the buffer used to pass information/mappings in DB2; its size is what is now adjustable.

Unicode normalization is a form of text normalization that transforms equivalent characters or sequences of characters into a consistent underlying representation so that they can be easily compared. Normalization is important when comparing text strings for searching and sorting (collation).

Unicode Collation Algorithm (UTS #10): http://unicode.org/reports/tr10/
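Normalization is easy to demonstrate with Python's standard `unicodedata` module: two strings that render identically compare unequal until both are normalized to the same form. This illustrates the concept only; it is not how DB2 or ICU is invoked, and `canonically_equal` is a hypothetical helper name.

```python
import unicodedata

composed = "\u00e9"     # 'é' as one code point (LATIN SMALL LETTER E WITH ACUTE)
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT


def canonically_equal(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same Unicode form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

`composed == decomposed` is `False`, but `canonically_equal(composed, decomposed)` is `True` in either NFC or NFD. A database that sorts or searches normalized text must do this work on every comparison or on every stored value, which is why its cost shows up in the performance results.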

Page 42: Session: E03 DB2 Performance Update


Unicode Performance Results

[Chart: Overall Unicode Performance Improvement in DB2 9.5; y-axis: elapsed time (sec), lower is better; series: DB2 9, DB2 9.5]

[Chart: UCA 400 Collation Improvement in DB2 9.5; y-axis: elapsed time (sec), lower is better; categories: Normalized, Thai, Slovakian; series: DB2 9, DB2 9.5]

From internal tests, overall performance improvement of ~30% when using Unicode in DB2 9.5

Performance improvements of 11-14% for Normalized Unicode, Thai and Slovakian with UCA400 Collation

Page 43: Session: E03 DB2 Performance Update


[Diagram: subagents (db2agntp) perform a parallel scan & sort of the table space containers and pass their output through a table queue to the coordinating agent (db2agent), which performs the index build. Up to 6 agents, depending on active CPUs & number of nodes (1:N). In DB2 9, just one DB2 agent handles all of this in the non-SMP case.]

Parallel Index Create
• DB2 9.5 parallelizes index create to exploit extra processors
• A CPU-bound index create in DB2 9 will see a substantial performance boost in DB2 9.5
• Improvements between 20% and 2x, depending on the number of CPUs & the I/O capacity of the system

Up to degree 6 parallelism, controlled by the registry variable DB2_SMP_INDEX_CREATE.
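The scan/sort/merge split in the diagram follows the classic parallel sort pattern: partition the input, sort each partition in a worker, then merge the sorted runs. A minimal sketch of that pattern with hypothetical function names; the degree cap of 6 mirrors the slide, everything else is illustrative. Python threads do not speed up pure-Python sorting because of the GIL, so this shows the structure only, not the performance.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

MAX_DEGREE = 6  # mirrors the slide's "up to degree 6"


def choose_degree(active_cpus, max_degree=MAX_DEGREE):
    """One worker per active CPU, capped at max_degree and at least 1."""
    return max(1, min(active_cpus, max_degree))


def parallel_sorted_keys(keys, degree):
    """Partition keys, sort each partition in a worker ('subagents'),
    then merge the sorted runs in the caller ('coordinating agent')."""
    chunks = [keys[i::degree] for i in range(degree)]
    with ThreadPoolExecutor(max_workers=degree) as pool:
        runs = list(pool.map(sorted, chunks))
    return list(heapq.merge(*runs))  # single ordered stream for the index build
```

The merge step is sequential, which is one reason the speed-up tops out well below the degree of parallelism and depends on I/O capacity, consistent with the 20% to 2x range on the slide.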

Page 44: Session: E03 DB2 Performance Update


DB2 Audit Enhancements in DB2 9.5
• Introduction of auditing at the database level instead of at the instance level
  • One audit log file per database
• Audit policies can be created and associated with a table, user, group or role, enabling fine-grained auditing
• Customizable audit log path
• Introduction of archiving
  • Quick method of switching the active log file to an archived log file and starting a new active log file
  • Allows follow-on operations such as backup, extraction and deletion to have zero effect on the performance of the server
• EXECUTE category
  • A new database-level category that audits the execution of SQL statements
  • Can optionally include input data: host variables and parameter markers
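The archiving idea (switch the active audit log to an archive file and start a fresh active file, so that later extraction or backup touches only the archive) can be sketched generically. This is a plain-Python illustration with hypothetical file and function names; it is not the db2audit command or DB2's audit file naming.

```python
import os
import tempfile
import time


def archive_active_log(log_dir, active_name="audit.log"):
    """Rename the active log to a uniquely named archive and start a new empty one.

    os.replace renames within one file system without copying any log data,
    which is what makes the switch quick. Returns the archive path.
    """
    active = os.path.join(log_dir, active_name)
    archive = os.path.join(log_dir, f"{active_name}.{time.time_ns()}")
    os.replace(active, archive)
    open(active, "w").close()  # new, empty active log file
    return archive


def demo_archive():
    """Write one record, archive it, and report (archived text, new active size)."""
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "audit.log"), "w") as f:
            f.write("record 1\n")
        archive = archive_active_log(d)
        with open(archive) as f:
            archived = f.read()
        active_size = os.path.getsize(os.path.join(d, "audit.log"))
    return archived, active_size
```

Once the switch is done, slow follow-on work reads the archive file while new records go to the fresh active file, so the two do not compete for the same file.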

Page 45: Session: E03 DB2 Performance Update


DB2 Audit Performance Results

[Chart: Relative throughput of an internal OLTP workload collecting DB2 Audit data; y-axis: relative throughput (0-8); series: DB2 9, DB2 9.5]

From internal tests on an OLTP workload, collecting db2audit data is ~8x faster on DB2 9.5

In the DB2 9.5 measurement, the audit log was on a separate disk from the DB2 transaction log, the table space container disks and the database directory.

Page 46: Session: E03 DB2 Performance Update


DB2 Audit Hints / Tips

• Use asynchronous logging: set AUDIT_BUF_SZ to > 0
• Place audit logs and archived audit logs on separate disks, different from the DB2 transaction log, database directory and table space container disks
• Only audit the tables needed, with the level of audit required
• In a DPF environment, set the log path for each node on different disks
• Use the filter capability of audit to selectively audit data
• Use the EXECUTE category instead of the CONTEXT category when auditing SQL statements
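The first tip is about taking audit writes off the critical path: with a non-zero AUDIT_BUF_SZ, audit records are buffered and written asynchronously instead of synchronously. A generic sketch of that producer/consumer pattern, assuming a hypothetical sink callable; the class name is invented and this does not model DB2's actual audit buffer.

```python
import queue
import threading


class BufferedAuditWriter:
    """Callers enqueue records and return immediately; a background thread
    drains the buffer to the sink in batches."""

    def __init__(self, sink, max_buffered=1000):
        self.sink = sink  # hypothetical: any callable that accepts a list of records
        self.buffer = queue.Queue(maxsize=max_buffered)
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def audit(self, record):
        self.buffer.put(record)  # blocks only when the buffer is full

    def _drain(self):
        while True:
            record = self.buffer.get()
            if record is None:  # shutdown sentinel
                return
            batch = [record]
            while True:  # also take whatever else is already queued
                try:
                    nxt = self.buffer.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    self.sink(batch)
                    return
                batch.append(nxt)
            self.sink(batch)

    def close(self):
        """Flush everything queued so far and stop the background thread."""
        self.buffer.put(None)
        self.worker.join()
```

Usage: `w = BufferedAuditWriter(records.extend)`, then call `w.audit(...)` from the transaction path and `w.close()` at shutdown. The transaction only pays for a queue insert; the disk write happens in the background, which is also why placing the audit files on their own disks matters.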

Page 47: Session: E03 DB2 Performance Update


Workload Management (WLM)
• Provides a foundation for more predictable performance through improved resource & request management
  • Explicit control of CPU priority for different classes of work
  • Controlling 'rogue' queries
  • Finer-grained monitoring capabilities than DB2 9
• Integrated with AIX WLM for control of CPU consumption by service class
• Agent priority and prefetch priority provided on all platforms
• Thresholds allow control of activity execution & monitoring
  • Query exceeded a threshold? Capture information, or even block it
  • Too many queries or utilities bombarding the system? Queue up the excess ones
• Aggregate statistics and individual event records can be captured on activity in any service class
  • Allows monitoring to be as narrow or broad as required
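The queuing behaviour above ("Too many queries or utilities bombarding the system? Queue up the excess ones") is a concurrency threshold. A minimal sketch of the idea using a semaphore; the class and function names and the limits are hypothetical, and this is neither DB2 WLM's threshold DDL nor its implementation.

```python
import threading
import time


class ConcurrencyThreshold:
    """Admit at most max_concurrent activities; excess callers wait (queue) in run()."""

    def __init__(self, max_concurrent):
        self.slots = threading.BoundedSemaphore(max_concurrent)
        self.lock = threading.Lock()
        self.running = 0
        self.peak = 0  # highest concurrency observed, as a simple monitoring statistic

    def run(self, activity):
        with self.slots:  # blocks while the threshold is reached
            with self.lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                return activity()
            finally:
                with self.lock:
                    self.running -= 1


def demo_threshold(max_concurrent=2, activities=6):
    """Start more activities than allowed; return (peak concurrency, still running)."""
    gate = ConcurrencyThreshold(max_concurrent)
    threads = [
        threading.Thread(target=gate.run, args=(lambda: time.sleep(0.02),))
        for _ in range(activities)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return gate.peak, gate.running
```

All six activities complete, but never more than two at once; the rest wait their turn instead of competing for CPU and I/O. The "capture information, or even block it" option would correspond to rejecting or logging in `run()` instead of waiting.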

Page 48: Session: E03 DB2 Performance Update


[Chart: CPU-bound OLTP workloads competing for system resources; y-axis: Tx/s; configurations: No WLM, DB2 WLM + AIX WLM, DB2 WLM + AGENTPRI; series: low priority workload, high priority workload]

WLM Examples
• Example 1
  • Two CPU-intensive workloads occupy the same system & compete for resources
  • Simple DB2 WLM workloads & service classes plus AIX WLM allow the high priority workload to use most of the resources
• Example 2
  • Two I/O-intensive scan workloads occupy the same system & compete for resources
  • Simple DB2 WLM workloads & service classes using PREFETCH PRIORITY allow the high priority workload to use most of the resources

[Chart: Disk-bound BI workloads competing for system resources; y-axis: query throughput; configurations: No WLM, DB2 WLM + PREFETCH PRIORITY; series: low priority workload, high priority workload]

Page 49: Session: E03 DB2 Performance Update


WLM Hints / Tips

• For CPU-intensive environments, use CPU prioritization with AIX WLM CPU shares or DB2 WLM AGENTPRI 'nice' values
• For BI environments, prefetch prioritization provides the best control of scan-oriented workloads
• Use fine-grained service class definitions to choose which applications to monitor
• WLM enables activity (event) monitoring with much lower overhead than statement event monitors in previous versions of DB2
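CPU shares are relative: under full contention, a class receives roughly its shares divided by the total shares of the classes that are currently active, and an idle class's shares are effectively redistributed. A small sketch of that proportional-share arithmetic with hypothetical class names and share values; real AIX WLM behaviour also depends on configured minimum/maximum limits, which are ignored here.

```python
def cpu_fraction(shares, active=None):
    """Approximate CPU fraction per class under proportional-share scheduling.

    shares: class name -> share count. active: names of busy classes (default: all).
    Idle classes get 0.0; their shares do not count toward the total.
    """
    active = set(shares) if active is None else set(active)
    total = sum(s for name, s in shares.items() if name in active)
    return {name: (s / total if name in active else 0.0) for name, s in shares.items()}
```

With `{"high": 80, "low": 20}` both busy, the split is 0.8 / 0.2, matching the high-priority-dominated bars in the CPU-bound example; if only the low-priority class is busy it gets 1.0, so prioritization costs nothing when there is no contention.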

Page 50: Session: E03 DB2 Performance Update


Many more new DB2 9.5 capabilities exist …

Page 51: Session: E03 DB2 Performance Update


Summary

• DB2 is the performance benchmark leader
  • TPC-C, TPC-H, XML, SAP, SPEC-J …
  • Leader in the virtualization space
• New features in DB2 9.5 that further boost performance
  • Threaded architecture
  • XML enhancements
  • LOB blocking
  • …

• Initial performance results and usage guidance

Page 52: Session: E03 DB2 Performance Update


Berni Schiefer, IBM Toronto Lab


Session E03: DB2 9.5 Performance Update