Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)
© 2014 IBM Corporation
Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory
Rei Odaira (Twitter: @ReiOdaira), IBM Research – Tokyo
Jose G. Castanos, IBM Research – Watson Research Center
Hisanobu Tomari, University of Tokyo
IBM Research – Tokyo
Giant VM Locks in Scripting Languages
Scripting languages (Ruby, Python, etc.) are everywhere, with increasing demands for speed.
Multi-thread speed is restricted by Giant VM locks.
– Only one thread can execute the interpreter at a time.
– Not required by language specification.
☺Simplified implementation in interpreter and libraries.
☹ No scalability on multi cores.
– CRuby, CPython, and PyPy
Hardware Transactional Memory (HTM) Coming into the Market
Can we improve performance by simply replacing locks with TM?
HTM has lower overhead than software TM.
– Blue Gene/Q (2012)
– zEC12 (2012)
– Intel 4th Generation Core Processor (2013)
– POWER8 (2014)
Our Goal
How will realistic applications perform if we replace the GVL with HTM?
– Giant VM Lock (GVL)
What modifications are needed in the interpreter?
Our Goal
How will realistic applications perform if we replace the GVL with HTM?
– Giant VM Lock (GVL)
What modifications are needed in the interpreter?
Eliminate the GVL in CRuby through HTM (atomic { } regions) on the Intel Xeon E3-1275 v3 (aka Haswell).
Evaluate the WEBrick HTTP server and Ruby on Rails.
Outline
Motivation
Transactional Memory
GVL in Ruby
Eliminating GVL through HTM
Experimental Results
Conclusion
Transactional Memory
At programming time
– Enclose critical sections with transaction begin/end operations.
At execution time
– Memory operations within a transaction are observed as one step by other threads.
– Multiple transactions are executed in parallel as long as their memory operations do not conflict, providing higher parallelism than locks.
Thread X: xbegin(); a->count++; xend();
Thread Y: xbegin(); b->count++; xend();   (different data: runs in parallel with X)
Thread Y: xbegin(); a->count++; xend();   (same data: conflicts with X; one transaction aborts and retries)
With a lock — lock(); a->count++; unlock(); — even the non-conflicting case is serialized.
HTM Instruction set (Intel TSX)
– XBEGIN: Begin a transaction
– XEND: End a transaction
– XABORT, etc.
Micro-architecture
– Read and write sets held in CPU caches
– Conflict detection using CPU cache coherence protocol
Abort in four cases:
– Read set and write set conflict
– Read set and write set overflow
– Restricted instructions (e.g. system calls)
– External interrupts, etc.
    XBEGIN abort_handler
    ...
    XEND
    ...
abort_handler:
    ...
Ruby Multi-thread Programming and the GVL (based on Ruby 1.9.3)
Ruby language
– Program with Thread, Mutex, and ConditionVariable classes
Ruby virtual machine
– Ruby program compiled into bytecode sequences.
– Ruby threads assigned to native threads (Pthread)
– Only one thread can execute the interpreter at any given time due to the GVL.
GVL acquired/released when a thread starts/finishes.
GVL yielded during a blocking operation, such as I/O.
GVL yielded also at pre-defined yield-point bytecode ops.
– Conditional/unconditional jumps, method/block exits, etc.
How the GVL is Yielded in Ruby
It is too costly to yield the GVL at every yield point.
Instead, the GVL is yielded every 250 msec using a timer thread.
[Diagram: the timer thread wakes up every 250 msec and sets a flag; the application thread holding the GVL sees the flag, releases the GVL, and another application thread acquires it.]
if (flag is true) {
    gvl_release();
    sched_yield();
    gvl_acquire();
}
Check the flag at every yield point.
Actual implementation is more complex to ensure fairness.
Outline
Motivation
Transactional Memory
GVL in Ruby
Eliminating GVL through HTM
– Our Algorithm
– Conflict Removal
Experimental Results
Conclusion
Eliminating GVL through HTM
Execute as a transaction first.
– Begins/ends at the same points as GVL’s yield points.
Acquire GVL after consecutive aborts in a transaction.
[Diagram: each thread begins a transaction at a yield point and ends it at the next; on a conflict the aborted transaction retries; after consecutive aborts the thread acquires the GVL, and the other threads wait for the GVL to be released.]
No timer thread needed.
Where to Begin and End Transactions?
Should be the same as GVL’s acquisition/release/yield points.
– Guaranteed as critical section boundaries.
☹ However, the original yield points are too coarse-grained.
– Cause many transaction overflows.
Bytecode boundaries are supposed to be safe critical section boundaries.
– Bytecode can be generated in arbitrary order.
– Therefore, an interpreter is not supposed to have a critical section that straddles a bytecode boundary.
☹ Ruby programs that are not correctly synchronized can change behavior.
Added the following bytecode instructions as transaction yield points:
– getinstancevariable, getclassvariable, getlocal, send, opt_plus, opt_minus, opt_mult, opt_aref
– Criteria: they appear frequently or are complex.
5 Sources of Conflicts and How to Remove Them (1/2)
Global variables pointing to the current running thread
– Cause conflicts because they are written every time transactions yield
Moved to Pthread’s thread-local storage
Updates to inline caches at the time of misses
– Cache the result of hash table access for method invocation and instance-field access
– Cause conflicts because caches are shared among threads.
Implement miss reduction techniques.
Source code needs to be changed for higher performance, but each change is limited to only a few dozen lines of Ruby interpreter’s source code.
5 Sources of Conflicts and How to Remove Them (2/2)
Manipulation of global free list when allocating objects
Introduce thread-local free lists
• When a thread-local free list becomes empty, take 256 objects in bulk from the global free list.
Garbage collection
– Causes conflicts when multiple threads try to do GC.
– Causes overflows even when a single thread performs GC.
Reduce GC frequency by increasing the Ruby heap.
False sharing in Ruby’s thread structures (rb_thread_t)
– Many fields were added to store thread-local information; they conflict with other structures on the same cache line.
Assign each thread structure to a dedicated cache line.
Outline
Motivation
Transactional Memory
GVL in Ruby
Eliminating GVL through HTM
Experimental Results
Conclusion
Experimental Environment
Implemented in Ruby 1.9.3-p547.
~400 MB Ruby heap
Xeon E3-1275 v3 (aka Haswell)
– 4-core 2-SMT 3.5-GHz Xeon E3-1275 v3
– ~6-MB read set and ~19-KB write set
– Linux 3.10.5
Benchmark Programs
WEBrick: HTTP server included in Ruby distribution
– Spawn 1 thread to handle 1 request.
Ruby on Rails running on WEBrick, Thin, and Puma.
– Measured a Web application to fetch a book list from SQLite3.
– Thin executed in the multi-threaded mode.
HTTP client and server running on the same machine.
– To measure maximum speed-ups possible.
Throughput of WEBrick
GET a ~50-byte HTML file.
Normalized to 1-thread GVL
HTM was faster than GVL by 57%.
[Chart: WEBrick throughput (1 = 1-thread GVL) vs. number of clients (0–7), GVL vs. HTM.]
Throughput of Ruby on Rails / WEBrick
HTM improved the throughput by 24% over GVL.
Many aborts due to transaction overflows.
– Regular expression libraries and method invocations.
Need to split into multiple transactions?
[Chart: Ruby on Rails/WEBrick throughput (1 = 1-thread GVL) vs. number of clients (0–7), GVL vs. HTM.]
Throughput of Ruby on Rails / Thin and Puma
HTM improved the throughput by 43% with Thin and 58% with Puma over GVL.
Many aborts due to transaction overflows.
[Charts: Ruby on Rails/Thin and Ruby on Rails/Puma throughput (1 = 1-thread GVL) vs. number of clients (0–7), GVL vs. HTM.]
Conclusion
What was the performance of realistic applications when the GVL was replaced with HTM?
– 57% speed-up in WEBrick and 58% in Ruby on Rails.
What was required for good scalability?
– Removed 5 sources of conflicts.
Code patch available!
– http://researcher.watson.ibm.com/researcher/view_person_subpage.php?id=5206
Using HTM is an effective way to outperform the GVL without fine-grain locking.
Backup
Tradeoff in Transaction Lengths
No need to end and begin transactions at every yield point.
= Variable transaction lengths at the granularity of yield points.
Longer transaction = Smaller relative overhead to begin/end transaction.
Shorter transaction = Smaller abort overhead
– Smaller amount of wasteful work rolled back at aborts
– Smaller probability of transaction overflows
[Diagram: shorter transactions end and begin at every yield point, so an abort wastes little work; longer transactions skip yield points, so an abort rolls back more work.]
Dynamic Adjustment of Transaction Lengths
Transaction length = How many yield points to skip
Adjust transaction lengths on a per-yield-point basis.
1. Initialize with a long length (255).
2. Calculate the abort ratio at each yield point: number of aborts / number of transactions.
3. If the abort ratio exceeds a threshold (6% for Xeon), shorten the transaction length (× 0.75) and return to Step 2.
4. If a pre-defined number (300) of transactions started before the abort ratio exceeds the threshold, stop adjusting that yield point.
Single-thread Overhead
Single-thread speed is important too.
18-35% single-thread overhead
– Optimization: with 1 thread, use the GVL instead of HTM.
– 5-14% overhead in micro-benchmarks even with this optimization.
Sources of the overhead:
– Checks at the yield points (mitigation: adaptive insertion of yield points)
– Disabled method-invocation inline caches (mitigation: concurrent inline caches)
Beginning a Transaction
Persistent aborts
– Overflow
– Restricted instructions
Otherwise, transient aborts
– Read-set and write-set conflicts, etc.
Abort reason reported by CPU
– Using a specified memory area
if (TBEGIN()) {
    /* Transaction */
    if (GVL.acquired)
        TABORT();
} else {
    /* Abort */
    if (GVL.acquired) {
        if (retried 16 times)
            acquire GVL;
        else
            retry after GVL release;
    } else if (persistent abort) {
        acquire GVL;
    } else {
        /* Transient abort */
        if (retried 3 times)
            acquire GVL;
        else
            retry;
    }
}
Execute Ruby code;
Ending and Yielding a Transaction
if (GVL.acquired)
    Release GVL;
else
    TEND();
Ending a transaction
if (--yield_counter == 0) {
    End a transaction;
    Begin a transaction;
}
Yielding a transaction(transaction boundary)
Dynamic adjustment of a transaction length
Platform-Specific Optimizations
Spin-wait for a while at GVL contention.
– Original GVL is acquired and released every ~250 msec.
– The GVL is acquired and released far more frequently when used as a fallback path for HTM; parallelism is severely damaged if a thread sleeps in the OS every time the GVL is contended.
– Some environments already include this optimization in Pthread implementation.
HTM in Intel Xeon E3-1275 v3
XBEGIN, XEND, and XABORT
– Offset to abort handler specified by XBEGIN.
– EAX register tells abort reason.
Flat nesting
Implementation details not disclosed yet.
Conflict detected at the granularity of 64-byte cache line
Read and write sets tracked by CPU caches
– 32-KB L1D, 256-KB L2, and 8-MB shared L3.
Read- and write-set sizes depend on the model, according to our experiments.
– Xeon E3-1275v3: ~6MB read and ~19KB write sets
– Core i7 4770S: ~4MB read and ~22KB write sets
Throughput of FT in Ruby NPB
[Charts: FT throughput (1 = 1-thread GIL) vs. number of threads on zEC12 (up to 13 threads) and Xeon E3-1275 v3 (up to 9 threads, 2-way SMT), comparing GIL, HTM-1, HTM-16, HTM-256, and HTM-dynamic.]
HTM-dynamic was the best.
HTM-1 suffered high overhead.
HTM-256 incurred high abort ratios.
zEC12 and Xeon showed similar speed-ups up to 4 threads.
SMT did not help.
Throughput of Ruby NPB on Xeon E3-1275 v3 (1/2)
[Charts: BT, CG, and FT throughput (1 = 1-thread GIL) vs. number of threads (0–9), comparing GIL, HTM-1, HTM-16, HTM-256, and HTM-dynamic.]
Showed the same scalability as zEC12.
SMT did not help.
HTM-dynamic was not the best.
– Due to learning in HTM?
Throughput of Ruby NPB on Xeon E3-1275 v3 (2/2)
[Charts: IS, LU, MG, and SP throughput (1 = 1-thread GIL) vs. number of threads (0–9), comparing GIL, HTM-1, HTM-16, HTM-256, and HTM-dynamic.]