
Laying the foundation and tuning for OLTP

In this talk we focus on the unique challenges of high-concurrency applications that require low latency.

Laying the foundations for OLTP Performance

Figure: the Windows OS view of the machine. Four kernel groups (Kernel Group 0 to 3) each contain eight NUMA nodes (NUMA 0 to NUMA 31 in total). At the hardware level (NUMA 6 and NUMA 7 are shown), each NUMA node contains CPU sockets, each socket contains CPU cores, and each core exposes hyper-threads (HT).

SQL Server Today: Capabilities and Challenges with real customer workloads

Challenge / Consideration and Workaround
Network: 10 Gb/s network used; no bottlenecks observed.
Concurrency: Observed spikes in CPU at random times during the workload. Significant spinlock contention on SOS_CACHESTORE due to frequent regeneration of security tokens; hotfix provided by the SQL Server team. Result: SOS_CACHESTORE contention removed. Spinlock contention on LOCK_HASH due to heavy reading of the same rows; this was due to an incorrect parameter being passed in by the test workload. Result: LOCK_HASH contention removed, CPU reduced from 100% to 18%.
Transaction Log: Synchronous replication at the storage level. Observed 10-30 ms log latency; expected 3-5 ms. Encountered Virtual Log File fragmentation (dbcc loginfo); rebuilt the log. Observed overutilization of front-end Fibre Channel ports on the array; reconfigured storage to balance traffic across the front-end ports. Result: 3-5 ms latency.
Database and table design/Schema: Schema utilizes hash partitioning to avoid page latch contention on inserts. The requirement for low privileges requires longer code paths in the SQL engine.
Monitoring: Heavily utilized Extended Events to diagnose spinlock contention points.
Architecture/Hardware: Currently running 16-socket IA64 in production. Benchmark performed on 8-socket x64 Nehalem-EX (64 physical cores). Hyper-threading to 128 logical cores offered little benefit to this workload. Encountered high NUMA latencies (coreinfo.exe), resolved via firmware updates.

FileId  FileSize   StartOffset  FSeqNo  Status  Parity  CreateLSN
------  ---------  -----------  ------  ------  ------  --------------------
2       253952     8192         48141   0       64      0
2       427556864  74398826496  0       0       128     22970000047327200649
2       427950080  74826383360  0       0       128     22970000047327200649

Lock Manager and the lock hash table: a thread attempts to obtain a lock on a resource (row, page, database, etc.). A hash of the lock is maintained in the lock hash table. Threads accessing the same hash bucket of the table are synchronized by the LOCK_HASH spinlock.

These symptoms may indicate spinlock contention:
1. A high number of spins is reported for a particular spinlock type, AND
2. The system is experiencing heavy CPU utilization, AND
3. The system has a high amount of concurrency.

Name               Collisions  Spins            Spins_Per_Collision  Backoffs
-----------------  ----------  ---------------  -------------------  ----------
SOS_CACHESTORE     14,752,117  942,869,471,526  63,914               67,900,620
SOS_SUSPEND_QUEUE  69,267,367  473,760,338,765  6,840                2,167,281
LOCK_HASH          5,765,761   260,885,816,584  45,247               3,739,208
MUTEX              2,802,773   9,767,503,682    3,485                350,997
SOS_SCHEDULER      1,207,007   3,692,845,572    3,060                109,746
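The Spins_Per_Collision figures above are simply Spins divided by Collisions, rounded to a whole number. A minimal Python sketch of that arithmetic, using the numbers from the table, may help when ranking spinlocks by how long each collision spins. The dictionary and function names are illustrative only and are not part of any SQL Server API.

```python
# Figures copied from the spinlock table above: (collisions, spins).
SPINLOCK_STATS = {
    "SOS_CACHESTORE": (14_752_117, 942_869_471_526),
    "SOS_SUSPEND_QUEUE": (69_267_367, 473_760_338_765),
    "LOCK_HASH": (5_765_761, 260_885_816_584),
    "MUTEX": (2_802_773, 9_767_503_682),
    "SOS_SCHEDULER": (1_207_007, 3_692_845_572),
}


def spins_per_collision(collisions, spins):
    """Average spins per collision, rounded; 0 when there were no collisions."""
    if collisions == 0:
        return 0
    return round(spins / collisions)


def rank_by_spins_per_collision(stats):
    """Spinlock names ordered from most to least spins per collision."""
    return sorted(
        stats,
        key=lambda name: spins_per_collision(*stats[name]),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank_by_spins_per_collision(SPINLOCK_STATS):
        print(name, spins_per_collision(*SPINLOCK_STATS[name]))
```

Ranking by spins per collision rather than by raw collisions puts SOS_CACHESTORE and LOCK_HASH at the top, which matches the two contention points investigated in this deck.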

--Get the type value for any given spinlock type
select map_value, map_key, name
from sys.dm_xe_map_values
where map_value IN ('SOS_CACHESTORE')
--Create the event session that will capture the callstacks to a bucketizer
create event session spin_lock_backoff on server
add event sqlos.spinlock_backoff
    (action (package0.callstack)
     where type = 144 --SOS_CACHESTORE
    )
add target package0.asynchronous_bucketizer
    (set filtering_event_name='sqlos.spinlock_backoff',
         source_type=1,
         source='package0.callstack')
with (MAX_MEMORY=50MB, MEMORY_PARTITION_MODE = PER_NODE)
--Ensure the session was created
select * from sys.dm_xe_sessions where name = 'spin_lock_backoff'
--Run this section to measure the contention
alter event session spin_lock_backoff on server state=start
--Wait to measure the number of backoffs over a 1 minute period
waitfor delay '00:01:00'
--To view the data:
--1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
--2. Enable this trace flag to turn on symbol resolution
DBCC traceon (3656, -1)
--Get the callstacks from the bucketizer target
select event_session_address, target_name, execution_count, cast (target_data as XML)
from sys.dm_xe_session_targets xst
    inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
where xs.name = 'spin_lock_backoff'
--Clean up the session
alter event session spin_lock_backoff on server state=stop
drop event session spin_lock_backoff on server

Observation: It is counterintuitive to have high wait times (LCK_M_X) correlate with heavy CPU. This is the symptom, not the cause.
Observation: Huge increase in the number of spins and backoffs associated with SOS_CACHESTORE.
Approach: Use Extended Events to profile the code path with the spinlock contention (i.e. where there is a high number of backoffs).
Root cause: Regeneration of security tokens exposes contention in code paths for access permission checks.
Workaround/Problem Isolation: Run with sysadmin rights.
Long Term Changes Required: SQL Server fix.

Figure: lab hardware layout.
SAN: CX-960 (240 drives, 15K, 300 GB)
App servers: 5 x BL460 blade servers, 2 proc (quad core), 32-bit, 32 GB memory
Load drivers: 12 x 2 proc (quad core), x64, 32+ GB memory
Transaction DB server: 1 x DL785, 8P (quad core), 2.3 GHz, 256 GB RAM
Reporting DB server: 1 x DL585, 4P (dual core), 2.6 GHz, 32 GB RAM
Switches: network switch; SAN switch Brocade 4900 (32 ports active)
Other labels in the diagram: Dell R900s, R805s; Active/Active failover cluster

Challenge / Consideration and Workaround
Network: CPU bottlenecks for network processing were observed and resolved via network tuning (RSS). Further network optimization was performed by implementing compression in the application. After optimizations we were able to push ~180K packets/sec, approx. 111 MB/sec, through a single 1 Gb/s NIC.
Concurrency: Page buffer latch waits were by far the biggest pain point. Hash partitioning was used to scale out the B-trees and eliminate the contention. Some PFS contention for the tables containing LOB data was resolved by placing LOB tables on dedicated filegroups and adding more files.
Transaction Log: No log bottlenecks were observed. When the cache on the array behaves well, log response times are very low.
Database and table design/Schema: Observed overhead related to PK/FK relationships; insert statements required additional work. Adding the persisted computed column needed for hash partitioning is an offline operation. Moving LOB data is an offline operation.
Monitoring: For the latch contention, utilized dm_os_wait_stats, dm_os_waiting_tasks and dm_db_index_operational_stats to identify the indexes with the most contention.
Architecture/Hardware: Be careful about shared components in blade server deployments; this became a bottleneck for our middle tier.

Page (8K) ROW EX_LATCH wait

Dig into details with: sys.dm_os_wait_stats and sys.dm_os_latch_stats

select *,
    wait_time_ms/waiting_tasks_count [avg_wait_time],
    signal_wait_time_ms/waiting_tasks_count [avg_signal_wait_time]
from sys.dm_os_wait_stats
where wait_time_ms > 0
    and wait_type like '%PAGELATCH%'
order by wait_time_ms desc

/* latch waits ********************************************/
select top 20
    database_id, object_id, index_id,
    count(partition_number) [num partitions],
    sum(leaf_insert_count) [leaf_insert_count],
    sum(leaf_delete_count) [leaf_delete_count],
    sum(leaf_update_count) [leaf_update_count],
    sum(singleton_lookup_count) [singleton_lookup_count],
    sum(range_scan_count) [range_scan_count],
    sum(page_latch_wait_in_ms) [page_latch_wait_in_ms],
    sum(page_latch_wait_count) [page_latch_wait_count],
    sum(page_latch_wait_in_ms) / sum(page_latch_wait_count) [avg_page_latch_wait],
    sum(tree_page_latch_wait_in_ms) [tree_page_latch_wait_ms],
    sum(tree_page_latch_wait_count) [tree_page_latch_wait_count],
    case when (sum(tree_page_latch_wait_count) = 0) then 0
         else sum(tree_page_latch_wait_in_ms) / sum(tree_page_latch_wait_count)
    end [avg_tree_page_latch_wait]
from sys.dm_db_index_operational_stats (null, null, null, null) os
where page_latch_wait_count > 0
group by database_id, object_id, index_id
order by sum(page_latch_wait_in_ms) desc

Figure: a B-tree with tree pages above the leaf (data) pages, laid out in the logical key order of the index. With a monotonically increasing key, many threads insert into the end of the range. We call this Last Page Insert Contention.
Expect: PAGELATCH_EX/SH waits. And this is the observation.

Hash Partitioning
Figure: INSERTs into an index with key ranges 0-1000, 1001-2000, 2001-3000 and 3001-4000.
Without partitioning: threads inserting into the end of the range contend on the last page.
Hash partitioned table/index: threads still insert into the end of the range, but across each partition.
Reference: http://sqlcat.com/technicalnotes/archive/2009/09/22/resolving-pagelatch-contention-on-highly-concurrent-insert-workloads-part-1.aspx

Latch waits of approximately 36 ms at baseline of 99 checks/sec.

Latch waits of approximately 0.6 ms at highest throughput of 249 checks/sec.

Note: Requires application changes. Ensure Select/Update/Delete have appropriate partition elimination.

Single 1 Gb/s NIC

wait_type               total_wait_time_ms  total_waiting_tasks_count  average_wait_ms
----------------------  ------------------  -------------------------  ---------------
DTC_STATE               5,477,997,934       4,523,019                  1,211
PREEMPTIVE_TRANSIMPORT  2,852,073,282       3,672,147                  776
PREEMPTIVE_DTC_ENLIST   2,718,413,458       3,670,307                  740

Applied Architecture Patterns on the Microsoft Platform

46 Q & A

47 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

48 Agenda
Windows Server 2008R2 and SQL Server 2008R2 improvements
Scale architecture
Customer Requirements
Hardware setup
Transaction log essentials
Getting the code right
Application Server Essentials
Database Design
Tuning Data Modification
UPDATE statements
INSERT statements
Management of LOB data
The problem with NUMA and what to do about it
Final results and Thoughts

49 Top statistics
Category                                          Metric
------------------------------------------------  ----------------------
Largest single database                           80 TB
Largest table                                     20 TB
Biggest total data, 1 customer                    2.5 PB
Highest writes per second, 1 db                   60,000
Fastest I/O subsystem in production (and in lab)  18 GB/sec (26 GB/sec)
Fastest real-time cube                            1 sec latency
Data load for 1 TB                                20 minutes
Largest cube                                      12 TB

50 Upping the Limits
Previously (before 2008R2), Windows was limited to 64 cores, and the kernel was tuned for this configuration.
With Windows Server 2008R2 this limit is raised to 256 cores (with plumbing for 1024 cores).
New concept: Kernel Groups. A bit like NUMA, but an extra layer in the hierarchy.
SQL Server generally follows suit, but for now 256 cores is the limit on R2.
Example x64 machines: HP DL980 (64 cores, 128 with Hyper-Threading) and IBM 3950 (up to 256 cores).
The largest IA-64 is 256 hyper-threads (at 128 cores).

51 The Path to the Sockets
Figure: Windows OS at the top, then Kernel Groups 0 to 3, each holding eight NUMA nodes (NUMA 0 to NUMA 31). Below that, the hardware view of NUMA 6 and NUMA 7: CPU sockets, each with CPU cores and their hyper-threads (HT).

52 And we measure it like this
Sysinternals CoreInfo: http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx
Nehalem-EX: every socket is a NUMA node.
How fast is your interconnect?

53 And it Looks Like This...

54 Customer Scenarios
Workload
  Core Banking: Credit card transactions from ATMs and branches
  Healthcare System: Sharing patient information across multiple healthcare trusts
  POS: World record deployment of ISV POS application across 8,000 US stores
Scale Requirements
  Core Banking: 10,000 business transactions / sec
  Healthcare System: 37,500 concurrent users
  POS: Handle peak holiday load of 228 checks/sec
Technology
  Core Banking: App tier .NET 3.5/WCF; SQL 2008R2; Windows 2008R2
  Healthcare System: App tier .NET; SQL 2008R2; Windows 2008R2; virtualized
  POS: App tier Com+, Windows 2003; SQL 2008, Windows 2008
Server (as listed across the scenarios): HP Superdome; HP DL785G6; IBM 3950 and HP DL 980; DL785

55 Network Cards: Rule of Thumb
At scale, network traffic will generate a LOT of interrupts for the CPU.
These must be handled by CPU cores, so packets must be distributed to cores for processing.
Rule of thumb (OLTP): 1 NIC / 16 cores.
Watch the DPC activity in Task Manager.
In Windows 2003, remove SQL Server (with affinity mask) from the NIC cores.

56 Lab: Network Tuning Approaches
1. Tune the configuration options of a single NIC to provide the maximum throughput.
2. Improve the application code to compress LOB data before sending it to SQL Server.
3. Team a pair of 1 Gb/s NICs to provide more bandwidth (transparent to the app).
4. Add multiple NICs (better for scale).

57 Tuning a Single NIC Card (POS system)
Enable RSS to enable multiple CPUs to process receive indications: http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx
The next step was to disable the Base Filtering Service in Windows and explicitly enable TCP Chimney Offload.
Be careful with Chimney Offload, as per KB 942861.

58 Before and After Tuning Single NIC 1. Before any network changes the workload was CPU bound on CPU0 2. After tuning RSS, disabling Base Filtering Service and explicitly enabling TCP Chimney Offload CPU time on CPU0 was reduced. The base CPU for RSS successfully moved from CPU0 to another CPU.

59 Teaming NICs
The workload was bound by network throughput.
Teaming of 2 network adapters realized no more aggregate throughput.
The application blade servers shared a 1 Gb/s network connection.
Left until the next episode.
Consider 10 Gb/s NICs for this throughput.

60 SQL Server Configuration Changes
As we increased the number of connections to around 6000 (users had think time), we started seeing waits on THREADPOOL.
Solution: increase sp_configure 'max worker threads'.
  Probably don't want to go higher than 4096.
  Gradually increase it; default max is 980.
  Avoid killing yourself in thread management; the bottleneck is likely somewhere else.
Use affinity mask to get rid of SQL Server for cores running NIC traffic.
Well-tuned, pure-play OLTP: no need to consider parallel plans.
  sp_configure 'max degree of parallelism', 1

61 Getting the Code Right Designing Highly Scalable OLTP Systems

62 Lessons from ISV Applications
Parameterize, or pay the CPU cost and potentially hit the gateway limits for compilations (RESOURCE_SEMAPHORE_QUERY_COMPILE).
Watch out for cursors. They tie up worker threads, and if they consume workspace memory you could see blocking (RESOURCE_SEMAPHORE).
Consume those results as quickly as possible (watch for ASYNC_NETWORK_IO).
Schema design: for insert-heavy workloads, RI can be very expensive. If performance is key, work out the RI outside the DB and trust your app.

63 Things to Double Check
Connection pooling enabled?
How much connection memory are we using? Monitor perfmon: MSSQL: Memory Manager.
Obvious memory or handle leaks? Check the Process counters in perfmon for the .NET app. Server-side processes will keep memory unless under pressure.
Can the application handle the load? Call into dummy procedures that do nothing and check the measured application throughput.
Typical case: the application breaks before SQL.

64 Remote Calling from WCF
Original client code: synchronous calls in WCF.
Each thread must wait for network latency before proceeding: around 1 ms of waiting.
Very similar to disk I/O: the thread will fall asleep.
Lots of sleeping threads: limited to around 50 client simulations per machine.
Instead, use IAsyncInterface.

65 Tuning Data Modification Designing Highly Scalable OLTP Systems

66 Database Schema: Credit Cards
Transaction (10**10 rows): Transaction_ID, Customer_ID, ATM_ID, Account_ID, TransactionDate, Amount
Account (10**5 rows): Account_ID, LastUpdateDate, Balance
ATM (10**3 rows): ID_ATM, ID_Branch, LastTransactionDate, LastTransaction_ID
Statements shown in the diagram:
Transaction: INSERT .. VALUES (@amount) and INSERT .. VALUES (-1 * @amount)
ATM: UPDATE .. SET LastTransaction_ID = @ID + 1, LastTransactionDate = GETDATE()
Account: UPDATE .. SET Balance

67 Summary of Concerns
Transaction table is hot: lots of INSERTs. How to handle ID numbers? Allocation structures in the database.
Account table must be transactionally consistent with Transaction. Do I trust the developers to do this? Cannot release the lock until BOTH are in sync. What about the latency of round trips for this?
Potentially hot rows in Account: are some accounts touched more than others?
ATM table has hot rows: each row is touched on average at least ten times per second (e.g. 10**3 rows with 10**4 transactions/sec).

68 Generating a Unique ID
Why won't this work?
CREATE PROCEDURE GetID
    @ID INT OUTPUT,
    @ATM_ID INT
AS
DECLARE @LastTransaction_ID INT
SELECT @LastTransaction_ID = LastTransaction_ID FROM ATM WHERE ATM_ID = @ATM_ID
SET @ID = @LastTransaction_ID + 1
UPDATE ATM SET LastTransaction_ID = @ID WHERE ATM_ID = @ATM_ID

69 Concurrency is Fun
ATM row: ID_ATM = 13, LastTransaction_ID = 42
Two sessions run the same statements at the same time:
SELECT @LastTransaction_ID = LastTransaction_ID FROM ATM WHERE ATM_ID = 13
SET @ID = @LastTransaction_ID + 1
UPDATE ATM SET LastTransaction_ID = @ID WHERE ATM_ID = 13
Both sessions read @LastTransaction_ID = 42 before either UPDATE runs, so both end up with the same @ID.
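The interleaving on this slide can be replayed deterministically outside the database. The following is a minimal Python model of the read-then-write sequence, not SQL Server code: the dictionary stands in for the ATM row, and the function names `read_then_write` and `atomic_increment` are hypothetical. It assumes both sessions SELECT before either one UPDATEs.

```python
def read_then_write(atm, steps):
    """Replay the SELECT / SET / UPDATE sequence for several sessions.

    `steps` is the interleaving: a list of (session, action) pairs, where
    action is "select" (read the row and compute @ID) or "update" (write
    @ID back). Returns the @ID each session ends up with.
    """
    ids = {}  # per-session @ID
    for session, action in steps:
        if action == "select":
            ids[session] = atm["LastTransaction_ID"] + 1
        elif action == "update":
            atm["LastTransaction_ID"] = ids[session]
    return ids


def atomic_increment(atm):
    """Single-step variant: increment and return with no gap in between."""
    atm["LastTransaction_ID"] += 1
    return atm["LastTransaction_ID"]


if __name__ == "__main__":
    row = {"ID_ATM": 13, "LastTransaction_ID": 42}
    interleaving = [(1, "select"), (2, "select"), (1, "update"), (2, "update")]
    print(read_then_write(row, interleaving))  # both sessions hold the same ID

    row = {"ID_ATM": 13, "LastTransaction_ID": 42}
    print(atomic_increment(row), atomic_increment(row))  # two distinct IDs
```

The point of the model is that the gap between reading and writing is where the duplicate comes from; collapsing both into one step removes the gap.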

70 Generating a Unique ID: The Right Way
CREATE PROCEDURE GetID
    @ID INT OUTPUT,
    @ATM_ID INT
AS
UPDATE ATM
SET @ID = LastTransaction_ID = LastTransaction_ID + 1
WHERE ATM_ID = @ATM_ID
And it is simple too...

71 Hot rows in ATM
Initial runs with a few hundred ATMs show excessive waits for LCK_M_U.
Diagnosed in sys.dm_os_wait_stats, drilling down to individual locks using sys.dm_tran_locks.
Inventive readers may wish to use XEvents: bucketize the event objects sqlserver.lock_acquired and sqlos.wait_info.
As concurrency increases, lock waits keep increasing while throughput stays constant.
Until...

72 Spinning around Diagnosed using sys.dm_os_spinlock_stats Pre SQL2008 this was DBCC SQLPERF(spinlockstats) Can dig deeper using Xevents with sqlos.spinlock_backoff event We are spinning for LOCK_HASH

73 LOCK_HASH what is it? ROW Lock Manager Thread More Threads LOCK_HASH LCK_U - Why not go to sleep?

74 Locking at Scale
The ratio between ATM machines and transactions generated was too low.
Only a limited number of locks/unlocks per second can be sustained. This depends a LOT on NUMA hardware, memory speeds and CPU caches.
Each ATM was generating 200 transactions / sec in the test harness.
Solution: increase the number of ATM machines.
Key takeaway: if a locked resource is contended, create more of it.
Notice: this is not SQL Server specific. Any piece of code will be bound by memory speeds when access to a region must be serialized.

75 Non-Fully Qualified Calls To Stored Procedures Result in SOS_CACHESTORE Spins
Observation: Almost all sessions were waiting on LCK_M_X.
Observation: Huge increase in spinlocks on SOS_CACHESTORE and SOS_SUSPEND_QUEUE.
Observation: This corresponded with high wait time for the compile exclusive lock LCK_M_X. Note: ignore SOS_SCHEDULER_YIELD.
Root cause: The Lorenzo user was not a member of the db_owner role, and all stored procedures were owned by dbo, e.g. dbo.lorenzoproc.
Workaround: Add the application user to the db_owner role.
Long Term Changes Required: Change Exec lorenzoproc to Exec dbo.lorenzoproc.

76 Hot rows in Account
Three ways to update the Account table:
1. Let application servers invoke a transaction to both INSERT into Transaction and UPDATE Account.
2. Set a trigger on Transaction.
3. Create a stored proc that handles the entire transaction.
Option 1 has two issues: app developers may forget it in some code paths, and the latency of the round trip is around 1 ms, i.e. no more than 1000 locks/sec are possible on a single row.
Option 2 is the better choice!
Option 3 must be used in all places in the app to be better than option 2.
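The "no more than 1000 locks/sec" figure follows from simple arithmetic: if every transaction holds the lock on the same row for the full round trip, the row can serve at most one transaction per round trip. A small Python sketch of that bound is shown below; the function name and the 0.05 ms server-side hold time are hypothetical values chosen for illustration.

```python
def max_serialized_ops_per_sec(hold_time_ms):
    """Upper bound on operations/sec when each one holds the same lock
    for hold_time_ms and the operations cannot overlap."""
    if hold_time_ms <= 0:
        raise ValueError("hold_time_ms must be positive")
    return 1000.0 / hold_time_ms


if __name__ == "__main__":
    # Lock held across a ~1 ms client round trip (option 1 on the slide).
    print(max_serialized_ops_per_sec(1.0))
    # Hypothetical: lock held only for a short server-side step.
    print(max_serialized_ops_per_sec(0.05))
```

This is why keeping the whole transaction on the server (a trigger or a stored procedure) raises the ceiling on a hot row: the lock hold time no longer includes the network round trip.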

77 Hot Latches!
LCK waits are gone, but we are seeing very high waits for PAGELATCH_EX (high = more than 1 ms).
What are we contending on?
Latch: a lightweight semaphore.
Locks are logical (transactional consistency); latches are internal to the SQL engine (memory consistency).
Because rows are small (many fit on a page), multiple locks may compete for one PAGELATCH.
Figure: an 8K page holding several rows; each row has its own LCK_U, but the page has a single PAGELATCH_EX.

78 Row Padding
In the case of the ATM table, our rows are small and few, so we can waste a bit of space to get more performance.
Solution: pad rows with a CHAR column so that each row takes a full page: 1 LCK = 1 PAGELATCH.
ALTER TABLE ATM ADD Padding CHAR(5000) NOT NULL DEFAULT ('X')
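Why a CHAR(5000) column is enough to force one row per page can be checked with rough arithmetic. The sketch below assumes the commonly documented SQL Server page layout (8192-byte page, 96-byte header, 2-byte slot entry per row) and treats `row_bytes` as the full on-page row size including row overhead. The 222-byte and 5030-byte row sizes are hypothetical examples, not measurements from this workload.

```python
PAGE_BYTES = 8192         # size of a data page
PAGE_HEADER_BYTES = 96    # assumed page header size
SLOT_BYTES = 2            # assumed per-row entry in the slot array


def rows_per_page(row_bytes):
    """Approximate number of rows of the given size that fit on one page."""
    usable = PAGE_BYTES - PAGE_HEADER_BYTES
    return usable // (row_bytes + SLOT_BYTES)


if __name__ == "__main__":
    print(rows_per_page(222))   # small row: many rows share one page latch
    print(rows_per_page(5030))  # row padded past half a page
```

Any padding that pushes the row past half of the usable page space gives one row per page, so each row lock maps to its own page latch.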

79 INSERT throughput
Transaction table is by far the most active table.
Fortunately, only INSERT: no need to lock rows. But several rows must still fit on a single page.
Cannot pad pages: there are 10**10 rows in the table.
A new page will eventually be allocated, but until it is, every insert goes to the same page.
Expect: PAGELATCH_EX waits. And this is the observation.

80 Hot page at the end of B-tree with increasing index

81 Waits & Latches
Dig into details with: sys.dm_os_wait_stats and sys.dm_os_latch_stats

82 How to Solve INSERT hotspot
Hash partition the table: create multiple B-trees and round robin between them. Create more resources and less contention.
Do not use a sequential key: distribute the inserts all over the B-tree.
Figure: IDs are hashed into 8 partitions (0 to 7), so IDs 0, 8, 16 go to partition 0, IDs 1, 9, 17 to partition 1, and so on up to IDs 7, 15, 23 in partition 7. For comparison, INSERTs into a single B-tree with key ranges 0-1000, 1001-2000, 2001-3000 and 3001-4000.

83 Design Pattern: Table Hash Partitioning
Create a new filegroup or use an existing one to hold the partitions. Equally balance over LUNs using an optimal layout.
Use the CREATE PARTITION FUNCTION command: partition the tables into #cores partitions.
Use the CREATE PARTITION SCHEME command: bind the partition function to filegroups.
Add a hash column to the table (tinyint or smallint). Calculate a good hash distribution, for example using hashbytes with modulo, or binary_checksum.
Figure: hash values 0 to 255, each mapped to its own partition.
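The effect of the hash column can be illustrated without a database. This Python sketch uses plain modulo on a sequential integer key, matching the 0,8,16 / 1,9,17 grouping in the earlier diagram; the function names are illustrative, and a real deployment would compute the hash in a persisted computed column as described above.

```python
from collections import Counter


def hash_bucket(key, buckets):
    """Bucket for a sequential integer key, using plain modulo."""
    return key % buckets


def inserts_per_bucket(keys, buckets):
    """Count how many of the given keys land in each bucket."""
    return Counter(hash_bucket(k, buckets) for k in keys)


if __name__ == "__main__":
    # 24 consecutive IDs over 8 partitions: each partition receives 3 of them,
    # so 8 "last pages" share the insert load instead of one.
    counts = inserts_per_bucket(range(24), 8)
    for bucket in sorted(counts):
        print(bucket, counts[bucket])
```

With a monotonically increasing key, consecutive inserts rotate through the buckets, so the contention that used to sit on one last page is spread over as many last pages as there are partitions.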

84 Lab Example: Before Partitioning Latch waits of approximately 36 ms at baseline of 99 checks/sec.

85 Lab Example: After Partitioning*
Latch waits of approximately 0.6 ms at highest throughput of 249 checks/sec.
*Other optimizations were applied

86 Pick The Right Number of Buckets

87 B-Tree Root Split
Figure: a root split in a B-tree. Labels in the diagram: Virtual Root with SH LATCH (ACCESS_METHODS_HOBT_VIRTUAL_ROOT), Next and Prev page pointers, LCK, and PAGELATCH in SH, EX and X modes.

88 Management of LOB Data
Resolving latch contention required rebuilding indexes into a new filegroup.
This resulted in PFS contention (PAGELATCH_UP): the engine uses a proportional fill algorithm, and moving indexes from one filegroup to another resulted in imbalance between the underlying data files in the PRIMARY filegroup.
Resolve: move the hot table to a dedicated filegroup.
Neither ALTER TABLE nor any method of index rebuild supports the movement of LOB data.
Technique used:
Create the new filegroup and files.
SELECT/INTO from the existing table into a new table. Change the default filegroup, as specifying a target filegroup is not supported.
INSERT...WITH (TABLOCK) SELECT will have similar behaviour without the need to change the default filegroup.
Drop the original table and rename the newly created table to the original name.
As a general best practice we advised the partner/customer to use dedicated filegroups for LOB data. Don't use the PRIMARY filegroup.
See Paul Randal post: http://www.sqlskills.com/BLOGS/PAUL/post/Importance-of-choosing-the-right-LOB-storage-technique.aspx

89 NUMA and What to do
Remember those PAGELATCH waits for UPDATE statements? Our solution: add more pages.
Improvement: get out of the PAGELATCH fast so the next one can work on it.
On NUMA systems, going to a foreign memory node is 4 to 10 times more expensive than a local access.
Use the SysInternals CoreInfo tool.

90 How does NUMA work in SQL Server?
The first NUMA node to request a page will own that page. Ownership continues until the page is evicted from the buffer pool.
Every other NUMA node that needs that page will have to do foreign memory access.
An additional (SQL 2008) feature is SuperLatch:
Useful when a page is read a lot but written rarely.
Only kicks in on 32 cores or more.
The "this page is latched" information is copied to all NUMA nodes.
Acquiring a PAGELATCH_SH only requires local NUMA access.
But: acquiring a PAGELATCH_EX must signal all NUMA nodes.
Perfmon object MSSQL:Latches: Number of SuperLatches, SuperLatch demotions / sec, SuperLatch promotions / sec.
See the CSS blog post.

91 Effect of UPDATE on NUMA traffic
Figure: app servers send UPDATE ATM SET LastTransaction_ID statements for ATM_ID 0, 1, 2 and 3 across NUMA nodes 0 to 3.

92 Using NUMA affinity
Figure: the UPDATE ATM SET LastTransaction_ID statements for ATM_ID 0, 1, 2 and 3 are each sent to their own port (8000, 8001, 8002, 8003), one port per NUMA node (NUMA 0 to NUMA 3).
How to: Map TCP/IP Ports to NUMA Nodes
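On the client side, NUMA affinity comes down to choosing the port consistently per ATM, so that updates to a given row always arrive at the same node. A minimal Python sketch of such a routing rule follows. The modulo rule and the names `BASE_PORT`, `NUMA_NODES` and `port_for_atm` are assumptions made for illustration; the server-side mapping of ports to NUMA nodes is configured separately.

```python
BASE_PORT = 8000   # first port shown in the diagram
NUMA_NODES = 4     # NUMA 0 to NUMA 3 in the diagram


def port_for_atm(atm_id):
    """Pick the TCP port (and so the NUMA node) that serves this ATM_ID.

    Assumed rule: ATM_ID modulo the number of NUMA nodes.
    """
    return BASE_PORT + atm_id % NUMA_NODES


if __name__ == "__main__":
    for atm_id in (0, 1, 2, 3, 13):
        print(atm_id, port_for_atm(atm_id))
```

Because the same ATM_ID always maps to the same port, the page holding that ATM row tends to stay owned by one node, which avoids foreign memory access for the hot UPDATE.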

93 Final Results and Thoughts
120,000 Batch Requests / sec
100,000 SQL Transactions / sec
50,000 SQL Write Transactions / sec
12,500 Business Transactions / sec
CPU load: 34 CPU cores busy.
Given more time, we would get the CPUs to 100%, tune the NICs more, and work on balancing NUMA more.
As for NICs, we only had two, and they were loading two CPUs at 100%.

Figure: system landscape by functional area (numbers as labeled in the diagram).
Other 30+; Other 20+; News Letter 2+; SMS 4+; User Account & Sportsbook 8+; Bookmaking 2+; Betcache 4+; Casino 2+; VS Games 2+; 1x2 Games 12+; CSM 2+; CMS 15+; Repl; Other 40+; BGI; DWH 60+; Payment 20+; Monitoring 10+; Administration 20+; DWH Stage 50+; Internal Office, Sharepoint (300+); ASP.NET Sessions 8+; OLAP 10+
Operation: Betoffer & Odds, Translation, Security, Other
Marketing: Product Mgmt, Promotions, Campaigns, Other
Administration: Payment, Call center, Other
Finance: Bookkeeping, Controlling, Other
This is not a WinMo7, it's a SuperDome

Challenge / Consideration and Workaround
Network: CPU bottlenecks for network processing were observed and resolved via network tuning (RSS). Dedicated networks for backup, replication, etc. 8 network cards for clients.
Concurrency: Latch contention on heavily used tables (last page insert). The hash partition option caused other problems in query performance and application design. Resolution: co-operative scale-out.
Transaction Log: Latency on log writes. Resolution: increased throughput and decreased latency by placing the transaction log on SSDs. Database mirroring overhead is very significant in synchronous mode.
Database and table design/Schema: Latency on IO-intensive data files (including tempdb). Resolution: session state database on SSDs. Resolution: testing sharding for the betting slips/customer databases. Single server, single database: 500 tx/sec. Single server, 4 databases: 1,800 tx/sec (sharding). Multiple servers: 2,600 tx/sec (sharding).
Monitoring: Security monitoring (PCI and intrusion detection): between 10% and 25% overhead when monitoring.
Architecture/Hardware: Tests using x64 (8-socket, 8-core) vs. Itanium Superdome (32-socket, dual-core): same transaction throughput; IO and backups were a bit slower.

Challenge / Consideration and Workaround
Network: Network round trip time for synchronous calls from the client induced latency. Resolution: batch data into a single large parameter (varchar(8000)) to avoid network round trips.
Concurrency: Page latch contention, small table: latching on 36 rows on a single page. Resolution: pad the rows to spread out latching to multiple pages. Performance gain: 20%.
Concurrency: Page latch contention, large table: concurrent INSERTs into an incremental (identity) column, last page insert. Resolution: clustered index on (partition_id, identity) column. Performance gain: 30%.
Concurrency: Heavy, long-running threads contending for time on the scheduler. Resolution: Map TCP/IP Ports to NUMA Nodes (http://msdn.microsoft.com/en-us/library/ms345346.aspx). Performance gain: 20%.
Transaction Log: Log waits. Resolution: batching business transactions within a single COMMIT to avoid WRITELOG waits. A test of SSDs for the log helped with latency.
Database and table design/Schema: Change decimal datatypes to money, others to int; integer-based datatypes go through an optimized code path. Performance gain: 10%. No RI, as this has an overhead on performance; it is executed in the application.
Monitoring: 5% overhead in running the default trace alone. Collect perfmon and targeted DMV/XEvents output to a repository.
Architecture/Hardware: x32 vs. x64 performance: x32 12% faster. The application is not memory constrained.*
*Interesting for Futures discussion later in presentation