In this talk we focus on the unique challenges of high-concurrency
workloads and applications that require low latency.
Slide 9
Laying the foundations for OLTP Performance
Slide 10
Slide 11
[Figure: Windows OS view: Kernel Groups 0-3, each containing eight NUMA nodes (NUMA 0-31). Hardware view: each NUMA node (e.g. NUMA 6, NUMA 7) contains CPU sockets; each socket contains CPU cores, each with hyper-threads (HT).]
Slide 12
SQL Server Today: Capabilities and Challenges with real
customer workloads
Slide 13
Slide 14
Slide 15
Challenge: Consideration/Workaround
Network:
- 10 Gb/s network used; no bottlenecks observed.
Concurrency:
- Observed spikes in CPU at random times during workload.
- Significant spinlock contention on SOS_CACHESTORE due to frequent re-generation of security tokens. Hotfix provided by SQL Server team. Result: SOS_CACHESTORE contention removed.
- Spinlock contention on LOCK_HASH due to heavy reading of the same rows. This was due to an incorrect parameter being passed in by the test workload. Result: LOCK_HASH contention removed, reduced CPU from 100% to 18%.
Transaction Log:
- Synchronous replication at the storage level.
- Observed 10-30 ms log latency; expected 3-5 ms.
- Encountered Virtual Log File fragmentation (dbcc loginfo); rebuilt log.
- Observed overutilization of front-end Fibre Channel ports on the array; reconfigured storage, balancing traffic across front-end ports.
- Result: 3-5 ms latency.
Database and table design/Schema:
- Schema utilizes hash partitioning to avoid page latch contention on inserts.
- Requirement for low privileges requires longer code paths in the SQL engine.
Monitoring:
- Heavily utilized Extended Events to diagnose spinlock contention points.
Architecture/Hardware:
- Currently running 16-socket IA64 in production.
- Benchmark performed on 8-socket x64 Nehalem-EX (64 physical cores).
- Hyper-threading to 128 logical cores offered little benefit to this workload.
- Encountered high NUMA latencies (coreinfo.exe), resolved via firmware updates.
[Figure: Resource, Lock Manager, Lock Hash Table]
- Thread attempts to obtain a lock (row, page, database, etc.).
- Hash of the lock is maintained in the lock hash table.
- Threads accessing the same hash bucket of the table are synchronized (LOCK_HASH).
Slide 19
These symptoms may indicate spinlock contention:
1. A high number of spins is reported for a particular spinlock type, AND
2. The system is experiencing heavy CPU utilization, AND
3. The system has a high amount of concurrency.
--Get the type value for any given spinlock type
select map_value, map_key, name
from sys.dm_xe_map_values
where map_value IN ('SOS_CACHESTORE')

--Create the event session that will capture the callstacks to a bucketizer
create event session spin_lock_backoff on server
add event sqlos.spinlock_backoff
    (action (package0.callstack)
     where type = 144 --SOS_CACHESTORE
    )
add target package0.asynchronous_bucketizer
    (set filtering_event_name='sqlos.spinlock_backoff',
         source_type=1,
         source='package0.callstack')
with (MAX_MEMORY=50MB, MEMORY_PARTITION_MODE = PER_NODE)

--Ensure the session was created
select * from sys.dm_xe_sessions where name = 'spin_lock_backoff'

--Run this section to measure the contention
alter event session spin_lock_backoff on server state=start

--Wait to measure the number of backoffs over a 1 minute period
waitfor delay '00:01:00'

--To view the data
--1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
--2. Enable this trace flag to turn on symbol resolution
DBCC traceon (3656, -1)

--Get the callstacks from the bucketizer target
select event_session_address, target_name, execution_count, cast (target_data as XML)
from sys.dm_xe_session_targets xst
inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
where xs.name = 'spin_lock_backoff'

--Clean up the session
alter event session spin_lock_backoff on server state=stop
drop event session spin_lock_backoff on server
Slide 22
Observation: It is counterintuitive for high wait times (LCK_M_X) to correlate with heavy CPU. This is the symptom, not the cause.
Huge increase in the number of spins and backoffs associated with SOS_CACHESTORE.
Approach: Use Extended Events to profile the code path with the spinlock contention (i.e. where there is a high number of backoffs).
Root cause: Regeneration of security tokens exposes contention in code paths for access permission checks.
Workaround/Problem Isolation: Run with sysadmin rights.
Long Term Changes Required: SQL Server fix.
Slide 23
Slide 24
Slide 25
Slide 26
[Figure: lab hardware setup]
- SAN: CX-960 (240 drives, 15K, 300 GB)
- SAN switch: Brocade 4900 (32 ports active)
- Transaction DB server: 1 x DL785, 8P (quad core), 2.3 GHz, 256 GB RAM
- Reporting DB server: 1 x DL585, 4P (dual core), 2.6 GHz, 32 GB RAM
- App servers: 5 x BL460, 2 proc (quad core), 32-bit, 32 GB memory
- Load drivers: 12 x, 2 proc (quad core), x64, 32+ GB memory
- Network switches between the tiers
- Other labels: DL785, DL585, BL460 Blade Servers, Dell R900s, R805s, Active/Active failover cluster
Slide 27
Challenge: Consideration/Workaround
Network:
- CPU bottlenecks for network processing were observed and resolved via network tuning (RSS).
- Further network optimization was performed by implementing compression in the application.
- After optimizations, we were able to push ~180K packets/sec, approx. 111 MB/sec, through a single 1 Gb/s NIC.
Concurrency:
- Page buffer latch waits were by far the biggest pain point.
- Hash partitioning was used to scale out the B-trees and eliminate the contention.
- Some PFS contention for the tables containing LOB data, resolved by placing LOB tables on dedicated filegroups and adding more files.
Transaction Log:
- No log bottlenecks were observed. When cache on the array behaves well, log response times are very low.
Database and table design/Schema:
- Observed overhead related to PK/FK relationships. Insert statements required additional work.
- Adding the persisted computed column needed for hash partitioning is an offline operation.
- Moving LOB data is an offline operation.
Monitoring:
- For the latch contention, utilized dm_os_wait_stats, dm_os_waiting_tasks and dm_db_index_operational_stats to identify the indexes with the most contention.
Architecture/Hardware:
- Be careful about shared components in blade server deployments; this became a bottleneck for our middle tier.
Slide 28
[Figure: 8K page containing rows; EX_LATCH wait on the page]
Slide 29
Dig into details with:
- sys.dm_os_wait_stats
- sys.dm_os_latch_stats
Slide 30
select *,
    wait_time_ms/waiting_tasks_count [avg_wait_time],
    signal_wait_time_ms/waiting_tasks_count [avg_signal_wait_time]
from sys.dm_os_wait_stats
where wait_time_ms > 0
and wait_type like '%PAGELATCH%'
order by wait_time_ms desc
Slide 31
/* latch waits ********************************************/
select top 20
    database_id, object_id, index_id,
    count(partition_number) [num partitions],
    sum(leaf_insert_count) [leaf_insert_count],
    sum(leaf_delete_count) [leaf_delete_count],
    sum(leaf_update_count) [leaf_update_count],
    sum(singleton_lookup_count) [singleton_lookup_count],
    sum(range_scan_count) [range_scan_count],
    sum(page_latch_wait_in_ms) [page_latch_wait_in_ms],
    sum(page_latch_wait_count) [page_latch_wait_count],
    sum(page_latch_wait_in_ms) / sum(page_latch_wait_count) [avg_page_latch_wait],
    sum(tree_page_latch_wait_in_ms) [tree_page_latch_wait_ms],
    sum(tree_page_latch_wait_count) [tree_page_latch_wait_count],
    case when (sum(tree_page_latch_wait_count) = 0) then 0
         else sum(tree_page_latch_wait_in_ms) / sum(tree_page_latch_wait_count)
    end [avg_tree_page_latch_wait]
from sys.dm_db_index_operational_stats (null, null, null, null) os
where page_latch_wait_count > 0
group by database_id, object_id, index_id
order by sum(page_latch_wait_in_ms) desc
Slide 32
[Figure: B-tree with tree pages above leaf (data) pages, in the logical key order of a monotonically increasing index]
We call this Last Page Insert Contention.
- Many threads inserting into the end of the range
- Expect: PAGELATCH_EX/SH waits
- And this is the observation
Slide 33
Hash Partitioning
[Figure: INSERT into key ranges 0-1000, 1001-2000, 2001-3000, 3001-4000, compared with a hash partitioned table/index]
- Without partitioning: threads inserting into the end of the range; contention on the last page.
- Hash partitioned table/index: threads inserting into the end of the range, but across each partition.
Reference: http://sqlcat.com/technicalnotes/archive/2009/09/22/resolving-pagelatch-contention-on-highly-concurrent-insert-workloads-part-1.aspx
Slide 34
Latch waits of approximately 36 ms at baseline of 99
checks/sec.
Slide 35
Latch waits of approximately 0.6 ms at highest throughput of
249 checks/sec.
Slide 36
Note: Requires application changes. Ensure Select/Update/Delete
have appropriate partition elimination.
Slide 46
Q & A
Slide 47
© 2008 Microsoft Corporation. All rights reserved. Microsoft,
Windows, Windows Vista and other product names are or may be
registered trademarks and/or trademarks in the U.S. and/or other
countries. The information herein is for informational purposes
only and represents the current view of Microsoft Corporation as of
the date of this presentation. Because Microsoft must respond to
changing market conditions, it should not be interpreted to be a
commitment on the part of Microsoft, and Microsoft cannot guarantee
the accuracy of any information provided after the date of this
presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR
STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Slide 48
Agenda
- Windows Server 2008R2 and SQL Server 2008R2 improvements
- Scale architecture
- Customer Requirements
- Hardware setup
- Transaction log essentials
- Getting the code right
  - Application Server Essentials
  - Database Design
- Tuning Data Modification
  - UPDATE statements
  - INSERT statements
  - Management of LOB data
- The problem with NUMA and what to do about it
- Final results and Thoughts
Slide 49
Top statistics
Category: Metric
- Largest single database: 80 TB
- Largest table: 20 TB
- Biggest total data, 1 customer: 2.5 PB
- Highest writes per second, 1 db: 60,000
- Fastest I/O subsystem in production (and in lab): 18 GB/sec (26 GB/sec)
- Fastest real-time cube: 1 sec latency
- Data load for 1 TB: 20 minutes
- Largest cube: 12 TB
Slide 50
Upping the Limits
- Previously (before 2008R2), Windows was limited to 64 cores
  - Kernel tuned for this config
- With Windows Server 2008R2 this limit is now upped to 256 cores (plumbing for 1024 cores)
  - New concept: Kernel Groups
  - A bit like NUMA, but an extra layer in the hierarchy
- SQL Server generally follows suit, but for now 256 cores is the limit on R2
- Example x64 machines: HP DL980 (64 cores, 128 with Hyper-Threading), IBM 3950 (up to 256 cores)
- And the largest IA-64 is 256 hyper-threads (at 128 cores)
Slide 51
The Path to the Sockets
[Figure: Windows OS view: Kernel Groups 0-3, each containing eight NUMA nodes (NUMA 0-31). Hardware view: each NUMA node (e.g. NUMA 6, NUMA 7) contains CPU sockets; each socket contains CPU cores, each with hyper-threads (HT).]
Slide 52
And we measure it like this
- Sysinternals CoreInfo: http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx
- Nehalem-EX: every socket is a NUMA node
- How fast is your interconnect?
Slide 53
And it Looks Like This...
Slide 54
Customer Scenarios
Core Banking
- Workload: Credit card transactions from ATMs and branches
- Scale requirements: 10,000 business transactions/sec
- Technology: App tier: .NET 3.5/WCF; SQL 2008R2; Windows 2008R2
Healthcare System
- Workload: Sharing patient information across multiple healthcare trusts
- Scale requirements: 37,500 concurrent users
- Technology: App tier: .NET; SQL 2008R2; Windows 2008R2; virtualized
POS
- Workload: World record deployment of ISV POS application across 8,000 US stores
- Scale requirements: Handle peak holiday load of 228 checks/sec
- Technology: App tier: COM+, Windows 2003; SQL 2008, Windows 2008
Server: HP Superdome; HP DL785G6; IBM 3950 and HP DL980; DL785
Slide 55
Network Cards: Rule of Thumb
- At scale, network traffic will generate a LOT of interrupts for the CPU
- These must be handled by CPU cores
  - Must distribute packets to cores for processing
- Rule of thumb (OLTP): 1 NIC / 16 cores
  - Watch the DPC activity in Task Manager
- In Windows 2003, remove SQL Server (with affinity mask) from the NIC cores
Slide 56
Lab: Network Tuning Approaches
1. Tune the configuration options of a single NIC to provide the maximum throughput.
2. Improve the application code to compress LOB data before sending it to SQL Server.
3. Team a pair of 1 Gb/s NICs to provide more bandwidth (transparent to the app).
4. Add multiple NICs (better for scale).
Slide 57
Tuning a Single NIC: POS system
- Enable RSS to enable multiple CPUs to process receive indications: http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx
- The next step was to disable the Base Filtering Service in Windows and explicitly enable TCP Chimney Offload.
- Careful with Chimney Offload, as per KB 942861.
Slide 58
Before and After Tuning a Single NIC
1. Before any network changes, the workload was CPU bound on CPU0.
2. After tuning RSS, disabling the Base Filtering Service and explicitly enabling TCP Chimney Offload, CPU time on CPU0 was reduced. The base CPU for RSS successfully moved from CPU0 to another CPU.
Slide 59
Teaming NICs
- Workload bound by network throughput
- Teaming of 2 network adapters realized no more aggregate throughput
  - Application blade servers shared a 1 Gb/s network connection
- Left until next episode: consider 10 Gb/s NICs for this throughput
Slide 60
SQL Server Configuration Changes
- As we increased the number of connections to around 6000 (users had think time), we started seeing waits on THREADPOOL
- Solution: increase sp_configure 'max worker threads'
  - Probably don't want to go higher than 4096
  - Gradually increase it; default max is 980
  - Avoid killing yourself in thread management; the bottleneck is likely somewhere else
- Use affinity mask to get rid of SQL Server for cores running NIC traffic
- Well tuned, pure play OLTP: no need to consider parallel plans
  - sp_configure 'max degree of parallelism', 1
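The two sp_configure changes above can be sketched as follows. This is a minimal illustration, not the exact script used in the lab: the value 2048 is an assumed intermediate step chosen only to show a gradual increase, and both settings are advanced options, so 'show advanced options' has to be enabled first.

```sql
-- Check whether sessions are actually waiting for worker threads.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'THREADPOOL';

-- Both settings below are advanced options.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Assumption: 2048 is an illustrative value, raised gradually
-- and kept below the 4096 ceiling suggested on the slide.
EXEC sp_configure 'max worker threads', 2048;

-- Pure OLTP workload: disable parallel plans.
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;
```

Re-running the THREADPOOL query after each increase shows whether the extra worker threads helped or whether the bottleneck is elsewhere.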
Slide 61
Getting the Code Right
Designing Highly Scalable OLTP Systems
Slide 62
Lessons from ISV Applications
- Parameterize, or pay the CPU cost and potentially hit the gateway limits for compilations (RESOURCE_SEMAPHORE_QUERY_COMPILATIONS)
- Watch out for cursors
  - They tie up worker threads, and if they consume workspace memory you could see blocking (RESOURCE_SEMAPHORE)
  - Consume those results as quickly as possible (watch for ASYNC_NETWORK_IO)
- Schema design
  - For insert-heavy workloads RI can be very expensive
  - If performance is key, work out the RI outside the DB and trust your app
Slide 63
Things to Double Check
- Connection pooling enabled?
  - How much connection memory are we using?
  - Monitor perfmon: MSSQL: Memory Manager
- Obvious memory or handle leaks?
  - Check the Process counters in perfmon for the .NET app
  - Server side processes will keep memory unless under pressure
- Can the application handle the load?
  - Call into dummy procedures that do nothing
  - Check measured application throughput
  - Typical case: application breaks before SQL
Slide 64
Remote Calling from WCF
- Original client code: synchronous calls in WCF
- Each thread must wait for network latency before proceeding
  - Around 1 ms waiting
  - Very similar to disk I/O: the thread will fall asleep
- Lots of sleeping threads
  - Limited to around 50 client simulations per machine
- Instead, use IAsyncInterface
Slide 65
Tuning Data Modification
Designing Highly Scalable OLTP Systems
Slide 67
Summary of Concerns
- Transaction table is hot
  - Lots of INSERT
  - How to handle ID numbers?
  - Allocation structures in database
- Account table must be transactionally consistent with Transaction
  - Do I trust the developers to do this?
  - Cannot release lock until BOTH are in sync
  - What about latency of round trips for this?
  - Potentially hot rows in Account: are some accounts touched more than others?
- ATM table has hot rows
  - Each row on average touched at least ten times per second
  - E.g. 10**3 rows with 10**4 transactions/sec
Tables:
- Transaction: Transaction_ID, Customer_ID, ATM_ID, Account_ID, TransactionDate, Amount
- Account: Account_ID, LastUpdateDate, Balance
- ATM: ID_ATM, ID_Branch, LastTransactionDate, LastTransaction_ID
Slide 68
Generating a Unique ID
Why won't this work?
CREATE PROCEDURE GetID
    @ID INT OUTPUT,
    @ATM_ID INT
AS
DECLARE @LastTransaction_ID INT
SELECT @LastTransaction_ID = LastTransaction_ID
FROM ATM
WHERE ATM_ID = @ATM_ID
SET @ID = @LastTransaction_ID + 1
UPDATE ATM
SET LastTransaction_ID = @ID
WHERE ATM_ID = @ATM_ID
Slide 69
Concurrency is Fun
ATM row: ID_ATM = 13, LastTransaction_ID = 42
Two sessions run the same statements at the same time:
SELECT @LastTransaction_ID = LastTransaction_ID FROM ATM WHERE ATM_ID = 13
SET @ID = @LastTransaction_ID + 1
UPDATE ATM SET LastTransaction_ID = @ID WHERE ATM_ID = 13
Both sessions read the same value (@LastTransaction_ID = 42) before either writes, so both hand out the same ID.
Slide 70
Generating a Unique ID: The Right Way
CREATE PROCEDURE GetID
    @ID INT OUTPUT,
    @ATM_ID INT
AS
-- Increment the counter and read the new value in one atomic statement
UPDATE ATM
SET @ID = LastTransaction_ID = LastTransaction_ID + 1
WHERE ATM_ID = @ATM_ID
And it is simple too...
Slide 71
Hot rows in ATM
- Initial runs with a few hundred ATMs show excessive waits for LCK_M_U
- Diagnosed in sys.dm_os_wait_stats
- Drilling down to individual locks using sys.dm_tran_locks
- Inventive readers may wish to use XEvents
  - Event objects: sqlserver.lock_acquired and sqlos.wait_info
  - Bucketize them
- As concurrency increases, lock waits keep increasing
  - While throughput stays constant
  - Until...
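A minimal sketch of the sys.dm_tran_locks drill-down mentioned above. It lists only the lock requests that are currently waiting and groups them by resource, so the hot rows surface at the top; the grouping and column selection are an illustrative choice, not the exact query used in the lab.

```sql
-- Lock requests that are currently blocked, grouped by resource.
-- Many waiters on the same KEY/RID resource indicate a hot row.
SELECT resource_type,
       resource_database_id,
       resource_description,
       request_mode,
       COUNT(*) AS waiting_requests
FROM sys.dm_tran_locks
WHERE request_status = 'WAIT'
GROUP BY resource_type, resource_database_id,
         resource_description, request_mode
ORDER BY waiting_requests DESC;
```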
Slide 72
Spinning around
- Diagnosed using sys.dm_os_spinlock_stats
  - Pre SQL 2008 this was DBCC SQLPERF(spinlockstats)
- Can dig deeper using XEvents with the sqlos.spinlock_backoff event
- We are spinning for LOCK_HASH
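As a rough sketch of the sys.dm_os_spinlock_stats check, the query below ranks spinlock types by total spins. The DMV's counters are cumulative since instance start, so in practice two snapshots taken a known interval apart are compared; the TOP (10) cut-off is an arbitrary choice for illustration.

```sql
-- Spinlock types ranked by total spins since instance start.
-- A type such as LOCK_HASH dominating this list, combined with
-- high CPU, points at spinlock contention.
SELECT TOP (10)
       name,
       collisions,
       spins,
       spins_per_collision,
       backoffs
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC;
```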
Slide 73
LOCK_HASH: what is it?
[Figure: a thread, and more threads, going through the Lock Manager (LOCK_HASH) to take LCK_U on a row]
Why not go to sleep?
Slide 74
Locking at Scale
- Ratio between ATM machines and transactions generated was too low
  - Can only sustain a limited number of locks/unlocks per second
  - Depends a LOT on NUMA hardware, memory speeds and CPU caches
  - Each ATM was generating 200 transactions/sec in the test harness
- Solution: Increase the number of ATM machines
- Key takeaway: If a locked resource is contended, create more of it
- Notice: This is not SQL Server specific; any piece of code will be bound by memory speeds when access to a region must be serialized
Slide 75
Non-Fully Qualified Calls To Stored Procedures Results in SOS_CACHESTORE Spins
- Almost all sessions waiting on LCK_M_X
- Huge increase in spinlocks on SOS_CACHESTORE & SOS_SUSPEND_QUEUE
- Corresponded with high wait time for the compile exclusive lock LCK_M_X. Note: ignore SOS_SCHEDULER_YIELD
- Root cause: The Lorenzo user was not a member of the db_owner role; all stored procedures were owned by dbo, e.g. dbo.lorenzoproc
- Workaround: add the application user to the db_owner role
- Long Term Changes Required: Change Exec lorenzoproc to Exec dbo.lorenzoproc
Slide 76
Hot rows in Account
Three ways to update the Account table:
1. Let application servers invoke a transaction to both INSERT in TRANSACTION and UPDATE Account
2. Set a trigger on TRANSACTION
3. Create a stored proc that handles the entire transaction
Option 1 has two issues:
- App developers may forget in some code paths
- Latency of roundtrip: around 1 ms, i.e. no more than 1000 locks/sec possible on a single row
Option 2 is the better choice!
Option 3 must be used in all places in the app to be better than option 2.
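A minimal sketch of the trigger approach (option 2), using the Transaction and Account columns listed under Summary of Concerns. The trigger name is hypothetical, and the rule that Balance simply increases by Amount is an assumption made for illustration; the deck does not show the actual trigger.

```sql
-- Hypothetical trigger: keeps Account in sync with every insert
-- into [Transaction] inside the same transaction, with no extra
-- round trip from the application server.
CREATE TRIGGER trg_Transaction_Insert
ON [Transaction]
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Aggregate first so a multi-row insert touching the same
    -- account updates that account row only once.
    UPDATE a
    SET a.Balance        = a.Balance + i.TotalAmount,  -- assumed sign convention
        a.LastUpdateDate = i.LastDate
    FROM Account AS a
    JOIN (SELECT Account_ID,
                 SUM(Amount)          AS TotalAmount,
                 MAX(TransactionDate) AS LastDate
          FROM inserted
          GROUP BY Account_ID) AS i
      ON i.Account_ID = a.Account_ID;
END;
```

Because the trigger fires on every insert regardless of which code path issued it, it addresses the first issue with option 1, and it holds the Account row lock only for the duration of the server-side statement rather than across a network round trip.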
Slide 77
Hot Latches!
- LCK waits are gone, but we are seeing very high waits for PAGELATCH_EX
  - High = more than 1 ms
- What are we contending on?
- Latch: a lightweight semaphore
  - Locks are logical (transactional consistency)
  - Latches are internal to the SQL engine (memory consistency)
- Because rows are small (many fit on a page), multiple locks may compete for one PAGELATCH
[Figure: 8K page containing rows; LCK_U on a row, PAGELATCH_EX on the page]
Slide 78
Row Padding
- In the case of the ATM table, our rows are small and few
- We can waste a bit of space to get more performance
- Solution: Pad rows with a CHAR column to make each row take a full page
  - 1 LCK = 1 PAGELATCH
[Figure: 8K page containing a single row padded with CHAR(5000); LCK_U and PAGELATCH_EX now map one to one]
ALTER TABLE ATM
ADD Padding CHAR(5000) NOT NULL DEFAULT ('X')
Slide 79
INSERT throughput
- Transaction table is by far the most active table
- Fortunately, only INSERT
  - No need to lock rows
  - But several rows must still fit on a single page
- Cannot pad pages: there are 10**10 rows in the table
- A new page will eventually be allocated, but until it is, every insert goes to the same page
- Expect: PAGELATCH_EX waits
  - And this is the observation
Slide 80
Hot page at the end of B-tree with increasing index
Slide 81
Waits & Latches
Dig into details with:
- sys.dm_os_wait_stats
- sys.dm_os_latch_stats
Slide 82
How to Solve the INSERT hotspot
- Hash partition the table
  - Create multiple B-trees
  - Round robin between the B-trees creates more resources and less contention
- Do not use a sequential key
  - Distribute the inserts all over the B-tree
[Figure: IDs hashed into partitions 0-7 (0,8,16 / 1,9,17 / 2,10,18 / 3,11,19 / 4,12,20 / 5,13,21 / 6,14,22 / 7,15,23), compared with INSERT into sequential key ranges 0-1000, 1001-2000, 2001-3000, 3001-4000]
Slide 83
Design Pattern: Table Hash Partitioning
- Create a new filegroup or use an existing one to hold the partitions
  - Equally balance over LUNs using an optimal layout
- Use the CREATE PARTITION FUNCTION command
  - Partition the tables into #cores partitions
- Use the CREATE PARTITION SCHEME command
  - Bind the partition function to filegroups
- Add a hash column to the table (tinyint or smallint)
  - Calculate a good hash distribution
  - For example, use hashbytes with modulo, or binary_checksum
[Figure: hash column values 0-255 mapped to partitions]
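The pattern above can be sketched in T-SQL as follows. Everything named here is an assumption for illustration: the object names (pf_hash, ps_hash, TransactionDemo, HashID), the choice of 8 partitions, placing all partitions on PRIMARY, and a simple modulo hash on the ID instead of hashbytes or binary_checksum.

```sql
-- 7 boundary values with RANGE LEFT give 8 partitions (hash 0..7).
CREATE PARTITION FUNCTION pf_hash (TINYINT)
AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6);

-- Assumption: all partitions on PRIMARY; in practice spread them
-- over the filegroups prepared for this purpose.
CREATE PARTITION SCHEME ps_hash
AS PARTITION pf_hash ALL TO ([PRIMARY]);

-- The hash column is a persisted computed column and is part of
-- the clustered key, as required for a partitioned unique index.
CREATE TABLE dbo.TransactionDemo (
    Transaction_ID BIGINT NOT NULL,
    Amount         MONEY  NOT NULL,
    HashID AS CAST(ABS(Transaction_ID) % 8 AS TINYINT) PERSISTED NOT NULL,
    CONSTRAINT PK_TransactionDemo
        PRIMARY KEY CLUSTERED (Transaction_ID, HashID)
) ON ps_hash (HashID);
```

Lookups then need the hash value in the predicate to get partition elimination, for example `WHERE Transaction_ID = @id AND HashID = CAST(ABS(@id) % 8 AS TINYINT)`, which is the application change the pattern requires.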
Slide 84
Lab Example: Before Partitioning
Latch waits of approximately 36 ms at baseline of 99 checks/sec.
Slide 85
Lab Example: After Partitioning*
Latch waits of approximately 0.6 ms at highest throughput of 249 checks/sec.
*Other optimizations were applied
Slide 86
Pick The Right Number of Buckets
Slide 87
B-Tree Root Split
[Figure: B-tree root split with Next/Prev page pointers and a virtual root; latches shown: SH latch on the virtual root (ACCESS_METHODS_HBOT_VIRTUAL_ROOT), LCK X, PAGELATCH SH/EX on the pages]
Slide 88
Management of LOB Data
- Resolving latch contention required rebuilding indexes into a new filegroup
- Resulted in PFS contention (PAGELATCH_UP):
  - Engine uses proportional fill algorithm
  - Moving indexes from one filegroup to another resulted in imbalance between underlying data files in the PRIMARY filegroup
- Resolve: move the hot table to a dedicated filegroup
  - Neither ALTER TABLE nor any method of index rebuild supports the movement of LOB data
- Technique used:
  1. Create the new filegroup and files.
  2. SELECT/INTO from the existing table into a new table. Change the default filegroup, as specifying a target filegroup is not supported. INSERT...WITH (TABLOCK) SELECT will have similar behaviour without the need to change the default filegroup.
  3. Drop the original table and rename the newly created table to the original name.
- As a general best practice we advised the partner/customer to use dedicated filegroups for LOB data
  - Don't use the PRIMARY filegroup
  - See Paul Randal's post: http://www.sqlskills.com/BLOGS/PAUL/post/Importance-of-choosing-the-right-LOB-storage-technique.aspx
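The SELECT/INTO technique described above could look roughly like the sketch below. All names are hypothetical (database MyDb, filegroup LobFG, table dbo.Documents, and the file path and size), and indexes, constraints and permissions on the original table are not carried over by SELECT/INTO, so they would need to be recreated afterwards.

```sql
-- 1. Create the new filegroup and a file in it (path and size are assumptions).
ALTER DATABASE MyDb ADD FILEGROUP LobFG;
ALTER DATABASE MyDb
ADD FILE (NAME = N'LobData1',
          FILENAME = N'D:\Data\LobData1.ndf',
          SIZE = 1GB)
TO FILEGROUP LobFG;

-- 2. SELECT/INTO creates the new table on the default filegroup,
--    so temporarily make the new filegroup the default.
ALTER DATABASE MyDb MODIFY FILEGROUP LobFG DEFAULT;
SELECT * INTO dbo.Documents_new FROM dbo.Documents;
ALTER DATABASE MyDb MODIFY FILEGROUP [PRIMARY] DEFAULT;

-- 3. Swap the tables (an offline operation).
DROP TABLE dbo.Documents;
EXEC sp_rename 'dbo.Documents_new', 'Documents';
```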
Slide 89
NUMA and What to do
- Remember those PAGELATCH waits for UPDATE statements?
- Our solution: add more pages
- Improvement: Get out of the PAGELATCH fast so the next one can work on it
- On NUMA systems, going to a foreign memory node is at least 4-10 times more expensive
- Use the SysInternals CoreInfo tool
Slide 90
How does NUMA work in SQL Server?
- The first NUMA node to request a page will own that page
  - Ownership continues until the page is evicted from the buffer pool
- Every other NUMA node that needs that page will have to do foreign memory access
- Additional (SQL 2008) feature is SuperLatch
  - Useful when a page is read a lot but written rarely
  - Only kicks in on 32 cores or more
  - The "this page is latched" information is copied to all NUMA nodes
  - Acquiring a PAGELATCH_SH only requires local NUMA access
  - But: acquiring PAGELATCH_EX must signal all NUMA nodes
- Perfmon object: MSSQL:Latches
  - Number of SuperLatches
  - SuperLatch demotions / sec
  - SuperLatch promotions / sec
- See CSS blog post
Slide 91
Effect of UPDATE on NUMA traffic
[Figure: app servers issuing UPDATE ATM SET LastTransaction_ID statements for ATM_ID 0-3 against NUMA nodes 0-3]
Slide 92
Using NUMA affinity
[Figure: UPDATE ATM SET LastTransaction_ID statements for ATM_ID 0-3 reaching NUMA nodes 0-3 through ports 8000, 8001, 8002 and 8003]
How to: Map TCP/IP Ports to NUMA Nodes
Slide 93
Final Results and Thoughts
- 120,000 Batch Requests / sec
- 100,000 SQL Transactions / sec
- 50,000 SQL Write Transactions / sec
- 12,500 Business Transactions / sec
- CPU load: 34 CPU cores busy
- Given more time, we would get the CPUs to 100%, tune the NICs more, and work on balancing NUMA more
- As for NICs, we only had two, and they were loading two CPUs at 100%
Slide 94
Slide 95
Slide 96
[Figure: system landscape. Labels: Other 30+; Other 20+; News Letter 2+; SMS 4+; User Account & Sportsbook 8+; Bookmaking 2+; Betcache 4+; Casino 2+; VS Games 2+; 1x2 Games 12+; CSM 2+; CMS 15+; Repl; Other 40+; BGI; DWH 60+; Payment 20+; Monitoring 10+; Administration 20+; DWH Stage 50+; Internal Office, Sharepoint (300+); ASP.NET Sessions 8+; OLAP 10+]
Business areas:
- Operation: Betoffer & Odds, Translation, Security, Other
- Marketing: Product Mgmt, Promotions, Campaigns, Other
- Administration: Payment, Call center, Other
- Finance: Bookkeeping, Controlling, Other
This is not a WinMo7, it's a SuperDome
Slide 97
Challenge: Consideration/Workaround
Network:
- CPU bottlenecks for network processing were observed and resolved via network tuning (RSS).
- Dedicated networks for backup, replication etc.
- 8 network cards for clients.
Concurrency:
- Latch contention on heavily used tables, last page insert.
- Hash partition option caused other problems in query performance and application design.
- Resolution: Co-operative scale-out.
Transaction Log:
- Latency on log writes.
- Resolution: Increased throughput/decreased latency by placing the transaction log on SSDs.
- Database mirroring overhead very significant in synchronous mode.
Database and table design/Schema:
- Latency on IO intensive data files (including tempdb).
- Resolution: Session state database on SSDs.
- Resolution: Betting slips/customer databases testing sharding:
  - Single server, single database: 500 tx/sec
  - Single server, 4 databases: 1,800 tx/sec (sharding)
  - Multiple servers: 2,600 tx/sec (sharding)
Monitoring:
- Security monitoring (PCI and intrusion detection): between 10% and 25% impact/overhead when monitoring.
Architecture/Hardware:
- Tests using x64 (8-socket, 8-core) vs. Itanium Superdome (32-socket, dual-core):
  - Same transaction throughput
  - IO and backups were a bit slower
Slide 98
Slide 99
Slide 100
Slide 101
Slide 102
Challenge: Consideration/Workaround
Network:
- Network round trip time for synchronous calls from the client induced latency.
- Resolution: Batch data into a single large parameter (varchar(8000)) to avoid network roundtrips.
Concurrency:
- Page latch contention, small table: latching on 36 rows on a single page. Resolution: pad the rows to spread out latching to multiple pages. Performance gain: 20%.
- Page latch contention, large table: concurrent INSERTs into an incremental column (identity), last page insert. Resolution: clustered index on (partition_id, identity) columns. Performance gain: 30%.
- Heavy, long running threads contending for time on the scheduler. Resolution: Map TCP/IP Ports to NUMA Nodes (http://msdn.microsoft.com/en-us/library/ms345346.aspx). Performance gain: 20%.
Transaction Log:
- Log waits. Resolution: Batching business transactions within a single COMMIT to avoid WRITELOG waits.
- Test of SSDs for log helped with latency.
Database and table design/Schema:
- Change decimal datatypes to money, others to int. Integer based datatypes go through an optimized code path. Performance gain: 10%.
- No RI, as this has an overhead on performance. Executed in the application.
Monitoring:
- 5% overhead in running the default trace alone.
- Collect perfmon and targeted DMV/XEvents output to a repository.
Architecture/Hardware:
- x32/x64 performance: x32 12% faster. Application is not memory constrained*.
*Interesting for Futures discussion later in presentation