Michigan DB2 Users Group – Presentation Transcript
14 October 2008 • 9:30 – 10:45 • Platform: Linux, UNIX, and Windows
Rob Williams, MHC Inc.
Session: E05
DB2 Design Patterns – Solutions to Problems
2
Agenda
• Definition of design patterns
• Problems and solutions for common design patterns
  • Hardware Layout
  • Application Development
  • Database Design
  • Database Configuration
  • Application Architecture
3
What Are Design Patterns?
• Design patterns are commonly used in the software development field
• A general, reusable solution to a commonly occurring problem
• You may already know many of these patterns
• A pattern can be a solution to a problem
  • "When you see this, do this"
• Or a specific way to implement something
  • e.g., a pattern for DB2 configuration
4
HARDWARE LAYOUTS
5
Hard Disk/File System Layout Pattern
• RAID stripe size = extent size * DB2 page size (see the sketch below)
• File system block size = DB2 page size
• We have seen many 4 KB RAID stripe sizes; this causes dramatically higher disk activity and can lower throughput
• The goal is to prevent disk thrashing
• Hard disks are good at sequential reading but bad at seeking
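A minimal sketch of the arithmetic, assuming a 16 KB page size and a 16-page extent (all names and sizes are illustrative):

  -- 16 pages * 16 KB = 256 KB per extent, so the RAID stripe size
  -- on the underlying array would be set to 256 KB to match
  CREATE BUFFERPOOL bp16k SIZE 10000 PAGESIZE 16K;

  CREATE TABLESPACE ts_sales_data
    PAGESIZE 16K
    EXTENTSIZE 16
    PREFETCHSIZE 32
    BUFFERPOOL bp16k;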
6
Raid Pattern
• People tend to get very uptight about their RAID mode
• Use RAID-1 as a bare minimum for any new installation
• RAID-5
  • Slow rebuild times that affect production performance
  • Slower insert/update/delete because of parity updates
• RAID-10 for anything else
7
Tablespace Layout Pattern
• Create a data tablespace and an index tablespace per table (see the sketch below)
  • Small tables can be grouped into one tablespace
  • Be wary of the new initial-size default of 32 MB if doing this across all tables
• With tablespace-level recovery in V9 this makes life much easier
  • Recovers from the pesky not-logged drop of a table
• Prevents logical fragmentation
• Con: extra administration
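A minimal sketch of the per-table layout, with illustrative names:

  CREATE TABLESPACE ts_orders_data;
  CREATE TABLESPACE ts_orders_ix;

  CREATE TABLE orders (
      order_id   BIGINT NOT NULL PRIMARY KEY,
      order_date DATE,
      amount     DECIMAL(15,2)
  ) IN ts_orders_data INDEX IN ts_orders_ix;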
8
Fragmentation Testing Pattern
• Tablespaces can be fragmented both logically and physically
• Depending on your I/O patterns this can have a huge impact on reporting queries
• How can you tell if you are suffering from fragmentation issues?
  • Perform a SELECT COUNT(*) FROM table WHERE not_indexed_column = 0
    • Causes a table scan; make sure the column isn't indexed, otherwise DB2 may use an index
    • Clear the bufferpool and file system cache first
    • Example result: 22 MB/s
  • Look at vmstat and check the read speed
  • Compare this to cat tablespace_file > /dev/null
    • Example result: 70 MB/s
  • The read speed of the SQL statement should be very close to that of the raw file read
9
File System Frag Testing
• Overlooked these days, but can be an issue in hybrid data marts over a long period of time
• How do you test whether file system overhead is an issue?
  • cat /dev/sda1 > /dev/null
    • Example result: 200 MB/s
  • This reads the actual data off the hard drive in a completely sequential manner
  • Allows you to estimate the file system overhead in reporting situations
    • Can be substantial
• After creating a new tablespace and running a defrag, the result was 160 MB/s
10
APPLICATION DEVELOPMENT
11
Problem: Loop Processing Pattern
• Problem: It is very common for developers to write looping logic because it feels natural to them
  • Performance is typically poor for even 30,000 rows

  resultset = SELECT * FROM xyz
  while (resultset->fetch_row) {
      SELECT something ...
      EXEC UPDATE ...
  }

• Context: DB2 has a large number of facilities that can typically process such logic in a single statement
• Some solutions are presented on the following slides
12
Solution 1: Delta Pattern
• Goal: merge the differences into another database
  • A common activity in ETL processes and data warehouses
  • Deltas are typically implemented in some form of loop
• Solution:

  MERGE INTO archive ar
  USING (SELECT activity, description FROM activities) ac
    ON (ar.activity = ac.activity)
  WHEN MATCHED AND (cond 1) THEN
    UPDATE SET description = ac.description
  WHEN NOT MATCHED THEN
    INSERT (activity, description) VALUES (ac.activity, ac.description)

• Useful in other programming situations, and more efficient than looking for SQL exception cases
• Be careful about locks and unit-of-work size
13
Solution 2: Hierarchical SQL Pattern
• Many developers are unfamiliar or uncomfortable with recursive SQL
• Typically implemented with loop logic, or with application functions that call themselves recursively and issue SQL
• Solution: a recursive common table expression

  WITH temptab(deptid, empcount, superdept) AS (
      SELECT root.deptid, root.empcount, root.superdept
      FROM departments root
      WHERE deptname = 'Production'
    UNION ALL
      SELECT sub.deptid, sub.empcount, sub.superdept
      FROM departments sub, temptab super
      WHERE sub.superdept = super.deptid
  )
  SELECT SUM(empcount) FROM temptab
14
Solution 3: Loop Insert Pattern
• Goal: insert more than a few records

  for (int i = 0; i < arr_size; i++) {
      insert into table values (...)
  }

• Instead, bind arrays, either column- or row-based, to a prepared statement (a SQL-level sketch follows below)
  • Very low network overhead
  • Extremely fast
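Array binding itself is done through the CLI/JDBC API rather than in SQL. As a SQL-only illustration of the same idea (many rows sent in one statement and one network round trip), DB2 also accepts a multi-row VALUES list; the table and columns here are illustrative:

  INSERT INTO order_lines (order_id, line_no, qty)
  VALUES (1001, 1, 5),
         (1001, 2, 3),
         (1001, 3, 12);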
15
Solution 4: Highest Balance / Moving Average Pattern
• Many programmers have built ugly solutions for analyzing trends and linear data
  • Typically implemented in nested loops
• Use an OLAP window function instead:

  select date_timestamp,
         stock_price,
         avg(stock_price) over (order by date_timestamp
                                range between 90 preceding and current row) as moving_avg
  from stock_prices
16
Solution 5: Paging Through a Result Set Pattern
• We typically see paging poorly implemented in applications using DB2, as it does not have OFFSET like the open source databases and only has FETCH FIRST n ROWS ONLY
• Paging is typically done with a for loop and sql->fetchrow, which generates lots of network traffic
• Instead, page with ROW_NUMBER() in a nested select (the row number alias cannot be referenced directly in the WHERE clause of the same select):

  SELECT *
  FROM (SELECT ROW_NUMBER() OVER (ORDER BY name) AS rn,
               name, other
        FROM employee) AS t
  WHERE rn BETWEEN 5 AND 10
17
Problem: OR and AND Simplification
• SELECT * FROM t1
  WHERE ((x = 'a' OR x = 'b') AND (y = 'c' OR j = 'e'))
     OR ((x = 'a' OR x = 'c') AND (y = 'c' OR j = 'e'))
     OR ((x = 'b' OR x = 'c') AND (y = 'c' OR j = 'e'))
• Assume high selectivity of the predicates and full distribution statistics
• One index that includes all the columns
• Only a small set of rows returned
• Problem:
  • Large amount of index space used
  • DB2 9 has a tendency to avoid index ANDing in the 20–100 million row range when ORs and ANDs are chained like this
  • This has caused us some grief in migrations
  • Extra processing on select, insert, update, and delete
18
Solution: OR and AND Simplification
• Solution: use a generated column with an index (see the sketch below)
• May reduce the number of columns indexed; increases performance through reduced index writing and simpler index access paths
• Sometimes you do not even need to rewrite the queries
• Tip: use a standard prefix so developers know not to update generated columns
• Consider triggers/views/LBAC to enforce the development policy on generated columns
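A minimal sketch of the generated-column approach. The condition below is the simplified form of the OR/AND chain on the previous slide; the column and index names are illustrative, and SET INTEGRITY may be needed after adding the column to an existing table:

  ALTER TABLE t1
    ADD COLUMN g_match SMALLINT
    GENERATED ALWAYS AS (CASE WHEN x IN ('a', 'b', 'c')
                               AND (y = 'c' OR j = 'e')
                              THEN 1 ELSE 0 END);

  CREATE INDEX ix_t1_gmatch ON t1 (g_match);

  -- queries can then filter on the single generated column
  SELECT * FROM t1 WHERE g_match = 1;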
19
LIKE %%
• SELECT * FROM table WHERE column LIKE '%SOMETHING%'
• Problem:
  • A % at the front of the pattern causes, at a bare minimum, a full index scan and most likely a table scan
  • You can potentially have problems even with 'ASDF%' if a large number of strings start with 'ASDF'
• Solution:
  • Use the DB2 Text Extender (free), Apache Lucene, or a word map table
20
Prepared Statements and Flags/Skewed Data Pattern
• In general we believe prepared statements are great
• Problem: we occasionally saw a large spike in read I/O on a table; we captured all the SQL and didn't see any abnormal queries
• We noticed prepared statements were filtering on flag/skewed-distribution data. This can be an issue because the access path is only generated once
• Solution: switch to dynamic SQL in the bean, and if using stored procedures, use the REOPT(ALWAYS) bind option
21
Concurrency Patterns
• Always use CS unless another isolation level is truly necessary
• Don't create artificial constraints
• Pessimistic locking should not be considered unless it is critical to functionality
  • Favor optimistic locking
• Consider DB2_EVALUNCOMMITTED and DB2_SKIPDELETED when having concurrency issues
  • Concurrency issues can be a result of denormalized data
• FOR UPDATE is typically not understood by developers
  • They often think it's equivalent to CS
  • Can kill concurrency
  • Slows down RUNSTATS
• COMMIT SELECT statements or close result sets as soon as possible
  • Otherwise you can hold a row-level lock longer than needed
• Be careful with WITH HOLD; it can leave locks behind until the cursor is closed
22
MQT Federated Caching
• Often overlooked: use MQTs to cache data from a federated source (see the sketch below)
  • Refresh nightly
• Reduces network round-trip time
• Allows for better optimization of access paths
• Much simpler and faster than other caching implementations
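A minimal sketch, assuming a nickname rem.customers has already been defined against the federated source (names illustrative):

  CREATE TABLE customers_cache AS
    (SELECT cust_id, cust_name, region FROM rem.customers)
    DATA INITIALLY DEFERRED REFRESH DEFERRED;

  -- populate / refresh the cache, e.g. from a nightly job
  REFRESH TABLE customers_cache;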
23
DB2 Java Driver Pattern
• Developers are typically confused over which Java driver to use; normally they take the driver from the first example they find
• Use the Type 2 JDBC driver for a local DB2 connection. It runs much faster than Type 4 because the driver makes system-level calls instead of network calls
• Use the Type 4 JDBC driver for a remote database connection. Easier portability, and similar performance to Type 2 in this setup; it communicates with DB2 over TCP/IP
24
Splitting Table Pattern
• There is a general tendency to have huge central tables with a large number of flag columns, text data, and infrequently accessed data
• Split the core data, preferably the fixed-width columns, from the other columns
  • Can speed up reporting
  • Reduces CPU overhead
• When you need access to the other table, ensure both tables are in clustering order so that a merge join can be used; that way little overhead is incurred
25
DATABASE DESIGN
26
Problem: Flag Pattern
• Flag columns that are typically selected and processed based on their values
  • There are generally reaper jobs that run based on flag values
  • Requires larger indexes and slows update/insert
• Context: a lot of disk space and memory is wasted on a flag index
  • Cardinality issues can slow updates of flags
• Solution: use MDC on flag columns unless there are large amounts of sequential I/O (see the sketch below)
  • An additional license is potentially required for MDC
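A minimal sketch of a table organized by a flag dimension (names illustrative):

  CREATE TABLE work_queue (
      id          BIGINT       NOT NULL,
      payload     VARCHAR(200),
      status_flag CHAR(1)      NOT NULL    -- e.g. 'N' = new, 'P' = processed
  )
  ORGANIZE BY DIMENSIONS (status_flag);

  -- the block index on status_flag replaces a conventional RID index on the flag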
27
MDC Indexes
• Dimension
  • A "block" index column
  • e.g., year, region, itemId
• Slice
  • The key value in one dimension, e.g., year = 1997
• Cell
  • A unique set of dimension values, e.g., year = 1997, region = Canada, AND itemId = 1
28
Statistics and Access Path Patterns
• Developers use the SELECTIVITY clause on predicates to influence the optimizer
• Certain "experts" recommend bogus predicates to change access paths
• This works in the short term but fails in the long run
• IBM employs lots of smart people who work on the optimizer
• Rather than hacking a solution with SELECTIVITY, inform the optimizer instead
29
Statistical View Pattern
• In base DB2, statistics are kept on the base tables and carry no information about the cardinality of join relationships
• Statistical views allow the optimizer to compute more accurate estimates
  • Helps with correlated and skewed data across complex relationships
  • On larger tables, poor access paths mean dramatically more CPU, I/O, and elapsed time
  • The optimizer has a tendency to be overly optimistic about join selectivity, particularly when distributions change over time
30
Statistical View Pattern
• Create statistical views for common filtering columns on fact tables that are used in large joins to snowflake dimension tables
• Example:
  • SALE_FACT (store_id, item_id, ...)
  • STORE (store_id, store_name, manager)
  • ITEM (item_id, item_class, item_desc, ...)

  CREATE VIEW sv_salesfact_store AS
    (SELECT sf.* FROM store s, sale_fact sf WHERE s.store_id = sf.store_id)

  ALTER VIEW sv_salesfact_store ENABLE QUERY OPTIMIZATION

  RUNSTATS ON TABLE sv_salesfact_store WITH DISTRIBUTION
31
Data Correlation Pattern
• Problem: poorly performing SQL running against a large fact table with multiple filter predicates

  SELECT item.*
  FROM item, supplier
  WHERE item.type = supplier.type AND item.type = 'TOOL'
    AND item.material = supplier.material AND item.material = 'STEEL'

• Context: looking at the explain output we noticed a nested loop join was used where a merge/hash join should have been
• By default the optimizer assumes predicates are independent, so the combined selectivity is calculated as:
  • SELECTIVITY(item.type) * SELECTIVITY(item.material)
  • 0.25 * 0.01 = 0.0025
  • If the two predicates are actually correlated, the true combined selectivity can be as high as the smaller of the two (0.01), four times the independent estimate
• 0 <= SELECTIVITY() <= 1
• The result is an overestimation of the filtering effect when the data is correlated
32
Data Correlation Pattern Cont
• Solution: either create a multicolumn index on both columns and collect statistics, or run RUNSTATS with column-group statistics
  • RUNSTATS ON TABLE item ON COLUMNS ((type, material)) WITH DISTRIBUTION
• How do you test whether you should do this? Compute the correlation coefficient:

  db2 "SELECT (COUNT(*) * SUM(comm * salary) - SUM(comm) * SUM(salary))
              / SQRT( (COUNT(*) * SUM(POWER(comm, 2))   - POWER(SUM(comm), 2))
                    * (COUNT(*) * SUM(POWER(salary, 2)) - POWER(SUM(salary), 2)) )
       FROM employee"

• If the result is >= 0.3 or <= -0.3, collect the grouped statistics
33
Fact Table Cluster Pattern
• Problem: poor performance on large joins against a large fact table even though statistics are perfect
• Context: DB2 was using a nested loop join, or a hash join that overflowed, instead of a merge join because the data was not in clustered order on both tables
• Solution: avoid clustering central fact tables on a unique ID column; cluster on the columns that will have large joins against them. MDC can be used for finer granularity (see the sketch below)
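A minimal sketch, with illustrative names, clustering the fact table on the join column rather than a surrogate key:

  CREATE INDEX ix_salefact_store ON sale_fact (store_id) CLUSTER;

  -- put existing rows into clustering order, then refresh statistics
  REORG TABLE sale_fact INDEX ix_salefact_store;
  RUNSTATS ON TABLE sale_fact AND INDEXES ALL;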
34
MQT OLAP Pattern
• Problem: customers, when trying to use MQTs, often make them too granular, causing either no matches or a very large number of MQTs
  • People sometimes take Design Advisor recommendations without modifying or analyzing them
  • You can get great recommendations; other times very poor ones
• Context: we don't want too many MQTs, as that slows down overall query optimization and carries a heavy cost on insert/update/delete. How can we make MQT matches more general?
35
MQT OLAP Pattern Cont
• Using GROUPING SETS

  SELECT HOUR(TIMESTAMP) AS HOUR,
         DAYOFWEEK(TIMESTAMP) AS DAY,
         ITEM_NAME
  FROM LINE_ITEMS
  GROUP BY GROUPING SETS ( (HOUR(TIMESTAMP), ITEM_NAME),
                           (DAYOFWEEK(TIMESTAMP), ITEM_NAME) )
36
MQT OLAP Pattern Cont
• Using ROLLUP

  SELECT SUBSTR(tabschema, 1, 20) AS "SCHEMA",
         SUBSTR(tabname, 1, 30)  AS "TABLE",
         COUNT(*)    AS num_tables,
         SUM(npages) AS pages
  FROM syscat.tables
  GROUP BY ROLLUP(tabschema, tabname)

  SCHEMA      TABLE        NUM_TABLES  PAGES
  ----------  -----------  ----------  -----
  -           -                   119   3948
  SYSIBM      -                   105    948
  SYSTOOLS    -                     6      3
  EATON       -                     8   2997
  SYSIBM      SYSTABLES             1     41
  SYSIBM      SYSCOLUMNS            1    233
37
MQT OLAP Pattern Cont
• Using CUBE
• SELECT SNAP_DATE, APP_NAME,
         AVG(LOCK_WAIT_TIME) AS "AVG WAIT(ms)",
         SUM(LOCK_WAIT_TIME) AS "TOT WAIT(ms)",
         SUM(DEADLOCKS)      AS "TOT DL",
         AVG(DEADLOCKS)      AS "AVG DL"
  FROM LOCK_SNAP
  GROUP BY CUBE(SNAP_DATE, APP_NAME)
• Similar to ROLLUP, except subtotals are produced for every combination
38
Refactoring Without SQL Change Pattern
• Problem: a customer designed the system to use 64-byte (not bit) identifiers for everything. This worked fine in test, but after billions of transactions the system had huge storage requirements and was slow
• Solution: DB2 8.1 introduced INSTEAD OF triggers, which allow any view to be made updatable
  • Turn the top 5 largest tables into views and create a mapping table from the 64-byte identifiers to BIGINT keys (see the sketch below)
  • Utilize statistical views
  • No performance difference was noticed in the application, and reporting throughput was much higher
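A minimal sketch of the view-plus-mapping approach, with illustrative names; a real refactoring would cover all columns and add UPDATE/DELETE triggers as well (use an alternate statement terminator, e.g. @, when creating the trigger in the CLP):

  CREATE SEQUENCE txn_seq;

  -- narrow physical table keyed by BIGINT, plus a map from the old 64-byte id
  CREATE TABLE txn_data (txn_key BIGINT   NOT NULL PRIMARY KEY, amount DECIMAL(15,2));
  CREATE TABLE txn_map  (txn_id  CHAR(64) NOT NULL PRIMARY KEY, txn_key BIGINT NOT NULL);

  -- the view keeps the original shape, so application SQL does not change
  CREATE VIEW txn AS
    SELECT m.txn_id, d.amount
    FROM txn_map m, txn_data d
    WHERE m.txn_key = d.txn_key;

  -- INSTEAD OF trigger makes the view insertable
  CREATE TRIGGER txn_ins INSTEAD OF INSERT ON txn
  REFERENCING NEW AS n
  FOR EACH ROW
  BEGIN ATOMIC
    INSERT INTO txn_data (txn_key, amount)
      VALUES (NEXT VALUE FOR txn_seq, n.amount);
    INSERT INTO txn_map (txn_id, txn_key)
      VALUES (n.txn_id, PREVIOUS VALUE FOR txn_seq);
  END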
39
Money Column Pattern
• Never allow nulls on any column that holds a dollar value
• Never use a float value to represent money
  • It loses accuracy as values move farther away from 0
  • (a + b) + c is not necessarily equal to a + (b + c):
      1234.567 + 45.67844 = 1280.245
      1280.245 + 0.0004 = 1280.245
    but
      45.67840 + 0.0004 = 45.67844
      45.67844 + 1234.567 = 1280.246
• For new development on 9.5, always use the DECFLOAT type (see the sketch below)
  • New IEEE decimal floating-point standard
  • Multiple rounding modes: ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP, ROUND_HALF_EVEN, ROUND_DOWN
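A minimal sketch, with illustrative names:

  CREATE TABLE invoice (
      invoice_id BIGINT       NOT NULL PRIMARY KEY,
      amount     DECFLOAT(34) NOT NULL DEFAULT 0   -- money column: never null
  );

  -- the rounding mode is controlled by the decflt_rounding database configuration parameter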
40
Data Quality Pattern
• Validate data on the way in, and batch-validate yearly
• With web services these days, externally validating data has become extremely simple and cheap
  • Address validation
  • Phone number validation
  • SIN validation
• Consider batch validation, or adding checks to data going in
• Catch input errors right away
41
User Patterns
• Always use a connection pool
• Use trusted contexts in 9.5 LUW to be able to audit usage based on the "real" user (see the sketch below)
• Segment application modules with different user IDs
  • It becomes extremely easy to isolate modules with performance issues/problems
  • You can use an event monitor trace based on authid
• Add permissions only as required. When refactoring, being able to tell which applications and modules access a given table is extremely useful
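A minimal sketch of a trusted context for a pooled application-server connection; the names, address, and authid are illustrative:

  CREATE TRUSTED CONTEXT appsrv_ctx
    BASED UPON CONNECTION USING SYSTEM AUTHID appsrv
    ATTRIBUTES (ADDRESS '192.0.2.10')
    WITH USE FOR PUBLIC WITHOUT AUTHENTICATION
    ENABLE;

  -- the pooled connection can then switch to the end user's authid, so auditing
  -- and monitoring see the real user instead of the generic pool id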
42
Constraint Pattern
• Problem: companies do not use check constraints to verify data going into and out of their database
• Context: this violates the principle of not duplicating business logic. You could put the logic in the ESB, but that does not guarantee accuracy
  • People also do not realize that you can use UDFs in a CHECK constraint
  • What about Java/C validation routines?
• Solution: design the check rules as scalar functions and enforce them at the DBMS level; that way the data is always clean (see the sketch below)
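A minimal sketch following the slide's suggestion, assuming a deterministic SQL scalar function with no external action (the names and the validation rule are illustrative):

  CREATE FUNCTION is_valid_province (code CHAR(2))
    RETURNS SMALLINT
    LANGUAGE SQL DETERMINISTIC NO EXTERNAL ACTION CONTAINS SQL
    RETURN CASE WHEN code IN ('ON','QC','BC','AB','MB','SK','NS','NB','NL','PE','YT','NT','NU')
                THEN 1 ELSE 0 END;

  ALTER TABLE customer
    ADD CONSTRAINT ck_province CHECK (is_valid_province(province) = 1);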
43
BP Layout Pattern
• Separate bufferpools for OLTP and DW tablespaces
• Then separate them by index, XDA, and data
  • You can go for finer granularity, although you risk wasting memory
• Allows you to control the most critical components of what's in memory
• Consider disabling STMM to make use of block-based areas
44
DATABASE CONFIGURATION
45
Block Based Buffer Pool Pattern
• Always use block-based bufferpools unless you have sequential I/O for which performance is critical
  • STMM doesn't support block-based bufferpools; you need to enable them manually (see the sketch below)
• Prevents large scans / sequential I/O from evicting frequently used pages from memory
• Better reliability and average response time
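A minimal sketch, with illustrative sizes; BLOCKSIZE is typically set to match the tablespace extent size:

  CREATE BUFFERPOOL bp_dw
    SIZE 100000 PAGESIZE 16K
    NUMBLOCKPAGES 24000 BLOCKSIZE 16;   -- 16-page blocks to match EXTENTSIZE 16

  -- an existing bufferpool can also be altered
  ALTER BUFFERPOOL bp_dw NUMBLOCKPAGES 32000 BLOCKSIZE 16;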
46
Testing Parameter Pattern
• Set your lock timeout as close to 1 as possible
  • Find any concurrency issues during testing
• Disable STMM
  • Its goal is optimal performance, which is not helpful when you want to find bugs
• Make your bufferpools as small as possible to simulate larger data sets than expected
• Copy production statistics to the test machines
• Reduce the sort heap to as small a value as possible
  • Can help find cases of bad clustering and improper indexes
• Create random network delays to simulate real-world situations and create different locking patterns
47
Load Patterns
• Use IXF over DEL and ASC
  • Dramatically less CPU usage, and it runs faster
  • Contains the table DDL information as a plus
• Don't forget the DISK_PARALLELISM option when working with RAID drives
  • Default = number of containers
• Use statistics profiles to collect stats during the load instead of doing a RUNSTATS afterwards
• In 9.5, compress data during the load. Not as effective as a reorg with compression, but still useful
• As long as the network is reasonably fast, loading over network drives typically does not slow down the total load time
48
Upgrade Pattern
• People are too quick to upgrade hardware/software licenses when there are performance problems, as they don't like to blame their own code
• Review indexes
• Ensure:
  • the system is tuned
  • you have identified the limiting resource in the system, e.g., CPU or memory
• Identify the exact statements and processes that are causing the problem and validate that they are optimal
49
Monitoring Pattern
• Health Center works but isn't great
• Check out Hyperic HQ; it's an open source monitoring tool that lets you easily monitor DB2 through the new SQL administrative views
• It's free and can interface with the operating system, disk subsystems, and network controllers
• Has support built in for Oracle, SQL Server, operating systems, etc.
• Huge thanks to Fred Sabotka for letting me know about this software
50
Monitoring Pattern Cont
• What do you want to monitor? (a sample query against the SQL administrative views follows below)
  • CPU usage
  • Hard disk utilization
  • BP hit ratios
  • Hash/sort overflows
  • Deadlocks
  • Lock timeouts
  • Rollbacks
  • Sync read percentage
  • Average transaction time
  • Statements per minute
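A minimal sketch using the SQL administrative views (DB2 9.x); the exact view and column names assumed here should be verified against your release:

  SELECT deadlocks, lock_timeouts, sort_overflows,
         commit_sql_stmts, rollback_sql_stmts
  FROM sysibmadm.snapdb;

  SELECT bp_name, total_hit_ratio_percent
  FROM sysibmadm.bp_hitratio;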
51
APPLICATION ARCHITECTURE
52
Active Record Pattern
• From Martin Fowler's book "Patterns of Enterprise Application Architecture"
• We still see a lot of companies hard-coding SQL in the presentation layer, typically caused by quick web scripting languages
  • Database schema changes become a nightmare
  • So does acquisition integration
• Map relational data to classes, then use the classes for presentation
  • CREATE TABLE student (id, first_name, last_name)
  • CLASS Student { int id; string first_name; string last_name; }
• Very important to follow at the most basic level, as it prevents tight coupling to the database
• Many technologies help automate this, such as Hibernate, pureQuery, etc.
53
pureXML Abstraction Pattern
• Even having data-manager classes issue SQL to a database creates a database dependency
• An XML message-based architecture removes any dependence on the logical/physical schema
  • SOA, ESB?
  • XML message in / XML message out, all onto an ESB
  • Great when being acquired or consolidating, with reasonable performance
  • XSLT for transformations
• The application only needs to worry about the XML schema
• Don't rush in; prototype and start slow
• Very good tools exist to manage such an infrastructure
54
Data Analysis Patterns
• When people believe there is magic in some product, they are willing to pay money for it
  • Particularly in the BI field we see many customers spending crazy amounts of money on fairly trivial algorithms
• Design several summary tables in DB2 to serve as the basis for end-user recommendations
  • You don't need to export all your data to a third-party product
  • Surprisingly trivial algorithms give fairly reasonable results
• Product recommendations / similarity recommendations (see the sketch below)
  • Euclidean distance
  • Pearson correlation
  • k-clustering
• Price models
  • Weighted kNN
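A minimal sketch of an item-to-item Euclidean distance over shared user ratings, assuming an illustrative ratings(user_id, item_id, rating) summary table:

  SELECT a.item_id AS item_a,
         b.item_id AS item_b,
         SQRT(SUM(POWER(a.rating - b.rating, 2))) AS euclidean_distance
  FROM ratings a, ratings b
  WHERE a.user_id = b.user_id
    AND a.item_id < b.item_id
  GROUP BY a.item_id, b.item_id
  ORDER BY 3;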
55
Data Analysis Patterns
• Grouping
  • k-means
  • Hierarchical clustering
• Not difficult and well documented online
• Typically more flexible than bundled solutions and faster
• Easy to prototype
56
Processor Evaluation Pattern
• When upgrading systems you are faced with a choice between higher clock speed and more cores
• Clock speed:
  • Favor clock speed if you are looking for elapsed-time improvement
  • Cheaper
• Number of cores:
  • Scales the number of concurrent transactions
  • Note: we have an unhappy customer on the Niagara core due to its low clock speed per core
57
Questions?
58
Rob Williams & Martin Hubel
MHC Inc.
[email protected] / [email protected]
Session E05
DB2 Design Patterns – Solutions to Problems