105-07-2002CS757 Lock Behaviour Characterization of Commercial Workloads Jichuan Chang Xidong Wang.

CS757 Lock Behaviour Characterization of Commercial Workloads Jichuan Chang Xidong Wang CS757 Outline Motivation Methods Results Speculative Lock Elision Issues Conclusions CS757 Understanding the Synchronization Behavior of Commercial Workloads (OLTP, Apache, SpecJBB) Motivation Identifying Opportunities for Speculative Lock Elision (performance, ease of programming) CS757 Questions to Answer Lock related statistics Can hardware identify critical sections? Critical section size Lock-free section size Amount of lock contentions Hardware optimizations by speculation Context switching implications Resource requirements Other issues Realistic timing model Other synchronization (reader/writer, etc) Lock-free section Critical section Contention (spin/wait) time CS757 Methods Benchmarks OLTP, Apache, JBB, Barnes (for comparison) Full system simulation (tracing) using Simics Simple timing model - Simics tracer Ruby timing model - Simics + Ruby Using #instr (not #cycle) as the measurement unit Set cpu_switch_time to 1, disable STC Validating our approach Using micro-benchmarks, to compare our stats with the result reported by kernel tools (lockstat) Tracing into disassembly code (kernel/user) CS757 Lock Identification Basic idea [from SLE] Lock acquisition must use one atomic instruction. Silent store pair: as a pair, the stores in lock acquisition and release operations are silent. SPARC v9 atomic instructions ldstub, swap, casa (compare-and-swap) casa [%l2] 128,%g4,%g3 casa [%l2] 128,%l0,%g4 ldstub [%o0 + %g0], %o4 brnz,pn %o4, stbar stb %g0, [%o0 + 12] OLTPJBB 0x0->0xff 0xff->0x0 0x1->0x8410f8bc 0x8410f8bc->0x1 Values CS757 Lock Identification Algorithm Starts with an atomic instruction that writes back a different value to the lock otherwise meaning unsuccessful lock acquisition Examine each following store made by the same CPU Until we meet a normal store that completes the silent store pair usually with the value of 0x0 Other completion patterns Self-release (by the same CPU) using atomic instruction, pair-silently (JBB) using atomic instruction, not pair-silently Cross-release (by a different CPU) using atomic instruction; Removed: cant observe lock release (16K limited window). CS757 Lock Frequency CS757 Execution Phase Breakdown CS757 Critical Section Size CS757 Lock-free Section Size CS757 Timing Models Adding Ruby doesnt change the size of critical section and lock-free section, but removes lock contentions. Why? Shrinking caused by less frequent memory accesses within critical sections or simulation effect? Guess: more shrinking using Ruby and Opal Simple Timing Ruby Timing CS757 Lock Contention 46%236%70% Waiting: from the first try to successful acquisition Spinning: ignore those have been waiting for more than 4K instructions. CS757 Distinguishing wait and spin Why bother? Very few long-waiting events make big difference in the percentages of wasted instructions Easy if we can identify thread switching But the identification is not easy Waiting if spinning for too many instructions Using 4096 instructions as the limit 90+% contentions are shorter than 4K instr It makes sense for different timing models. CS757 Lock Contention Most Contended Locks CS757 SLE on Commercial Workloads Context switching (later) Buffering requirement Not much Small critical sections dominate Except for Apache user locks (1-8K) Single shared buffer among threads on the same CPU Possible performance gain Not big if only counting num of instructions (1 - 6%) Critical section size already small Contention already infrequent Can be larger if lock spinning latency increases Can be smaller less lock contentions happen (as in Ruby case) Must throttle speculation (to avoid unnecessary rollbacks) CS757 Context Switch Why bother? Needed to precisely quantify the amount of instructions spent on lock waiting (process and thread switching) Needed to correctly implement speculative lock elision (process switching only) Process Switching Identification Marker: Demap TLB on context switch Apache (100 transactions, CPU #3) Average: ~210K instructions (Max ~360K, Min ~160K) Process switching are infrequent, performance implication negligible Thread Switching Identification is hard No simple patterns to observe, No feedback to validate assumptions Not a good idea to provide separate buffer for each thread on a single processor. Hard to detect conflicts, thread switch & need many buffers. CS757 Hard to recognize complex synchronization Barriers, Read/writer locks, etc Mutual Exclusion implementation composed of the small critical sections pthread_mutex_lock(&lock) acquires 3 lock Reader/writer lock use locks to maintain data structure (reader/writer queues, num of current reader, etc) Other Synchronization Algorithms writer_enter () writer_exit () Serialized Execution (maintained by synch. algo.) HW only sees two small critical sections CS757 Conclusion Commercial workloads lock characterization Small critical sections dominate Infrequent lock contention User/kernel code have different behavior Kernel locks cant be ignored (Kernel) contented PCs predictable Performance Improvements SLE wont help as much CS757 Thank You! Questions? CS757 Backup Slides Thread switching details Critical section size using Ruby timing model Sparc Atomic Instructions Misc Issues Acknowledgement CS757 Thread Switch Identification User thread scheduling Disassemble user thread library, Observe execution of scheduling methods (_disp, _switch). not always possible!! Kernel thread scheduling Involve a set of interleaved method invocations (resume, disp, swtch, _resume_from_idle..). Hard to identify starting and ending point of thread switch Impossible to identify kernel thread switch by only observing register window swap since it also happen in user thread switch No feedback from OS to validate our assumption Methodology & Preliminary Observations Disassemble kernel code to build VA kernel method map. Observe the method control flow in Simics trace. resume may indicate a kernel thread switch user_rtt may indicate a user level thread switch. Conclusion: Thread Switch Identification is a hard, unresolved issue CS757 Critical Section Size (Ruby) CS757 Sparc Atomic Instructions ldstub Write all 1 into a byte Swap Swap the value of the reg and the mem location Compare-and-swap Swap if (value in the 1 st reg == value in mem) Membar/stbar Usually follows such atomic instructions CS757 Misc. Why Apache strange? Lock more frequent, few user lock (1-2%) Large percentage of critical section instruction Nested Locks Intertwined Locks Critical sections in Barnes are more clustered Buffer size 2^9 * 30% * 1/3 = 64 Blocks The same as SLE CS757 Acknowledgement Project suggested by Prof. Mark Hill Guiding and supporting Lots of discussion with and help from Min Xu, our TA Carl Mauer, Multifacet simulator expert Ravi Rajwar, SLE paper author

105-07-2002CS757 Lock Behaviour Characterization of Commercial Workloads Jichuan Chang Xidong Wang.

Documents

Transcript of 105-07-2002CS757 Lock Behaviour Characterization of Commercial Workloads Jichuan Chang Xidong Wang.