Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)
© 2014 IBM Corporation
Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory
Rei Odaira (Twitter: @ReiOdaira), IBM Research – Tokyo
Jose G. Castanos, IBM Research – Watson Research Center
Hisanobu Tomari, University of Tokyo
IBM Research – Tokyo
Giant VM Locks in Scripting Languages
Scripting languages (Ruby, Python, etc.) are everywhere, with increasing demands for speed.
Multi-thread speed is restricted by Giant VM locks.
– Only one thread can execute the interpreter at a time.
– Not required by language specification.
☺Simplified implementation in interpreter and libraries.
☹ No scalability on multi cores.
– CRuby, CPython, and PyPy
Hardware Transactional Memory (HTM) Coming into the Market
Can we improve performance by simply replacing locks with TM?
HTM has lower overhead than software TM.
– Blue Gene/Q (2012)
– zEC12 (2012)
– Intel 4th Generation Core Processor (2013)
– POWER8 (2014)
Our Goal
How will realistic applications perform if we replace the GVL with HTM?
– Giant VM Lock (GVL)
What modifications are needed in the interpreter?
Our Goal
How will realistic applications perform if we replace the GVL with HTM?
– Giant VM Lock (GVL)
What modifications are needed in the interpreter?
Eliminate the GVL in CRuby through HTM (atomic { } regions) on the Intel Xeon E3-1275 v3 (aka Haswell).
Evaluate the WEBrick HTTP server and Ruby on Rails.
Outline
Motivation
Transactional Memory
GVL in Ruby
Eliminating GVL through HTM
Experimental Results
Conclusion
Transactional Memory
At programming time
– Enclose critical sections with transaction begin/end operations.
At execution time
– Memory operations within a transaction are observed as one step by other threads.
– Multiple transactions are executed in parallel as long as their memory operations do not conflict, providing higher parallelism than locks.
Thread X: xbegin(); a->count++; xend();
Thread Y: xbegin(); b->count++; xend();   (different data: runs in parallel with X)
Thread Y: xbegin(); a->count++; xend();   (same data: conflicts with X; one transaction aborts and retries)
With a lock — lock(); a->count++; unlock(); — even the non-conflicting case is serialized.
HTM Instruction set (Intel TSX)
– XBEGIN: Begin a transaction
– XEND: End a transaction
– XABORT, etc.
Micro-architecture
– Read and write sets held in CPU caches
– Conflict detection using CPU cache coherence protocol
Abort in four cases:
– Read set and write set conflict
– Read set and write set overflow
– Restricted instructions (e.g. system calls)
– External interrupts, etc.
    XBEGIN abort_handler
    ...
    XEND
    ...
abort_handler:
    ...
Ruby Multi-thread Programming and the GVL (based on Ruby 1.9.3)
Ruby language
– Program with Thread, Mutex, and ConditionVariable classes
Ruby virtual machine
– Ruby program compiled into bytecode sequences.
– Ruby threads assigned to native threads (Pthread)
– Only one thread can execute the interpreter at any given time due to the GVL.
GVL acquired/released when a thread starts/finishes.
GVL yielded during a blocking operation, such as I/O.
GVL yielded also at pre-defined yield-point bytecode ops.
– Conditional/unconditional jumps, method/block exits, etc.
How the GVL is Yielded in Ruby
It is too costly to yield the GVL at every yield point.
Instead, the GVL is yielded every 250 msec using a timer thread.
[Diagram: the timer thread wakes up every 250 msec and sets a flag; the application thread holding the GVL sees the flag, releases the GVL, and another application thread acquires it.]
if (flag is true) {
    gvl_release();
    sched_yield();
    gvl_acquire();
}
Check the flag at every yield point.
Actual implementation is more complex to ensure fairness.
Outline
Motivation
Transactional Memory
GVL in Ruby
Eliminating GVL through HTM
– Our Algorithm
– Conflict Removal
Experimental Results
Conclusion
Eliminating GVL through HTM
Execute as a transaction first.
– Begins/ends at the same points as GVL’s yield points.
Acquire GVL after consecutive aborts in a transaction.
[Diagram: each thread begins a transaction at a yield point and ends it at the next; on a conflict the aborted transaction retries; after consecutive aborts the thread acquires the GVL, and the other threads wait for the GVL to be released.]
No timer thread needed.
Where to Begin and End Transactions?
Should be the same as GVL’s acquisition/release/yield points.
– Guaranteed as critical section boundaries.
☹ However, the original yield points are too coarse-grained.
– Cause many transaction overflows.
Bytecode boundaries are supposed to be safe critical section boundaries.
– Bytecode can be generated in arbitrary order.
– Therefore, an interpreter is not supposed to have a critical section that straddles a bytecode boundary.
☹ Ruby programs that are not correctly synchronized can change behavior.
Added the following bytecode instructions as transaction yield points:
– getinstancevariable, getclassvariable, getlocal, send, opt_plus, opt_minus, opt_mult, opt_aref
– Criteria: they appear frequently or are complex.
5 Sources of Conflicts and How to Remove Them (1/2)
Global variables pointing to the current running thread
– Cause conflicts because they are written every time transactions yield
Moved to Pthread’s thread-local storage
Updates to inline caches at the time of misses
– Cache the result of hash table access for method invocation and instance-field access
– Cause conflicts because caches are shared among threads.
Implement miss reduction techniques.
Source code needs to be changed for higher performance, but each change is limited to only a few dozen lines of Ruby interpreter’s source code.
5 Sources of Conflicts and How to Remove Them (2/2)
Manipulation of global free list when allocating objects
Introduce thread-local free lists
• When a thread-local free list becomes empty, take 256 objects in bulk from the global free list.
Garbage collection
– Causes conflicts when multiple threads try to do GC.
– Causes overflows even when a single thread performs GC.
Reduce GC frequency by increasing the Ruby heap.
False sharing in Ruby’s thread structures (rb_thread_t)
– Many fields were added to store thread-local information; they conflict with other structures on the same cache line.
Assign each thread structure to a dedicated cache line.
Outline
Motivation
Transactional Memory
GVL in Ruby
Eliminating GVL through HTM
Experimental Results
Conclusion
Experimental Environment
Implemented in Ruby 1.9.3-p547.
~400 MB Ruby heap
Xeon E3-1275 v3 (aka Haswell)
– 4-core 2-SMT 3.5-GHz Xeon E3-1275 v3
– ~6-MB read set and ~19-KB write set
– Linux 3.10.5
Benchmark Programs
WEBrick: HTTP server included in Ruby distribution
– Spawn 1 thread to handle 1 request.
Ruby on Rails running on WEBrick, Thin, and Puma.
– Measured a Web application to fetch a book list from SQLite3.
– Thin executed in the multi-threaded mode.
HTTP client and server running on the same machine.
– To measure maximum speed-ups possible.
Throughput of WEBrick
GET a ~50-byte HTML file.
Normalized to 1-thread GVL
HTM was faster than GVL by 57%.
[Chart: WEBrick throughput (1 = 1-thread GVL) vs. number of clients (0–7), GVL vs. HTM.]
Throughput of Ruby on Rails / WEBrick
HTM improved the throughput by 24% over GVL.
Many aborts due to transaction overflows.
– Regular expression libraries and method invocations.
Need to split into multiple transactions?
[Chart: Ruby on Rails/WEBrick throughput (1 = 1-thread GVL) vs. number of clients (0–7), GVL vs. HTM.]
Throughput of Ruby on Rails / Thin and Puma
HTM improved the throughput by 43% with Thin and 58% with Puma over GVL.
Many aborts due to transaction overflows.
[Charts: Ruby on Rails/Thin and Ruby on Rails/Puma throughput (1 = 1-thread GVL) vs. number of clients (0–7), GVL vs. HTM.]
Conclusion
What was the performance of realistic applications when the GVL was replaced with HTM?
– 57% speed-up in WEBrick and 58% in Ruby on Rails.
What was required for good scalability?
– Removed 5 sources of conflicts.
Code patch available!
– http://researcher.watson.ibm.com/researcher/view_person_subpage.php?id=5206
Using HTM is an effective way to outperform the GVL without fine-grain locking.
Backup
Tradeoff in Transaction Lengths
No need to end and begin transactions at every yield point.
= Variable transaction lengths at the granularity of yield points.
Longer transaction = Smaller relative overhead to begin/end transaction.
Shorter transaction = Smaller abort overhead
– Smaller amount of wasteful work rolled back at aborts
– Smaller probability of transaction overflows
[Diagram: shorter transactions end and begin at every yield point, so an abort wastes little work; longer transactions skip yield points, so an abort rolls back more work.]
Dynamic Adjustment of Transaction Lengths
Transaction length = How many yield points to skip
Adjust transaction lengths on a per-yield-point basis.
1. Initialize with a long length (255).
2. Calculate the abort ratio at each yield point: number of aborts / number of transactions.
3. If the abort ratio exceeds a threshold (6% for Xeon), shorten the transaction length (× 0.75) and return to Step 2.
4. If a pre-defined number (300) of transactions started before the abort ratio exceeds the threshold, stop adjusting that yield point.
Single-thread Overhead
Single-thread speed is important too.
18-35% single-thread overhead
– Optimization: with 1 thread, use the GVL instead of HTM.
– 5-14% overhead in micro-benchmarks even with this optimization.
Sources of the overhead:
– Checks at the yield points (mitigation: adaptive insertion of yield points)
– Disabled method-invocation inline caches (mitigation: concurrent inline caches)
Beginning a Transaction
Persistent aborts
– Overflow
– Restricted instructions
Otherwise, transient aborts
– Read-set and write-set conflicts, etc.
Abort reason reported by CPU
– Using a specified memory area
if (TBEGIN()) {
    /* Transaction */
    if (GVL.acquired)
        TABORT();
} else {
    /* Abort */
    if (GVL.acquired) {
        if (retried 16 times)
            acquire GVL;
        else
            retry after GVL release;
    } else if (persistent abort) {
        acquire GVL;
    } else {
        /* Transient abort */
        if (retried 3 times)
            acquire GVL;
        else
            retry;
    }
}
Execute Ruby code;
Ending and Yielding a Transaction
if (GVL.acquired)
    Release GVL;
else
    TEND();
Ending a transaction
if (--yield_counter == 0) {
    End a transaction;
    Begin a transaction;
}
Yielding a transaction(transaction boundary)
Dynamic adjustment of a transaction length
Platform-Specific Optimizations
Spin-wait for a while at GVL contention.
– Original GVL is acquired and released every ~250 msec.
– The GVL is acquired and released far more frequently when used as a fallback path for HTM; parallelism is severely damaged if a thread sleeps in the OS every time the GVL is contended.
– Some environments already include this optimization in Pthread implementation.
HTM in Intel Xeon E3-1275 v3
XBEGIN, XEND, and XABORT
– Offset to abort handler specified by XBEGIN.
– EAX register tells abort reason.
Flat nesting
Implementation details not disclosed yet.
Conflict detected at the granularity of 64-byte cache line
Read and write sets tracked by CPU caches
– 32-KB L1D, 256-KB L2, and 8-MB shared L3.
Read- and write-set sizes depend on the model, according to our experiments.
– Xeon E3-1275v3: ~6MB read and ~19KB write sets
– Core i7 4770S: ~4MB read and ~22KB write sets
Throughput of FT in Ruby NPB
[Charts: FT throughput (1 = 1-thread GIL) vs. number of threads on zEC12 (up to 13 threads) and Xeon E3-1275 v3 (up to 9 threads, 2-way SMT), comparing GIL, HTM-1, HTM-16, HTM-256, and HTM-dynamic.]
HTM-dynamic was the best.
HTM-1 suffered high overhead.
HTM-256 incurred high abort ratios.
zEC12 and Xeon showed similar speed-ups up to 4 threads.
SMT did not help.
Throughput of Ruby NPB on Xeon E3-1275 v3 (1/2)
[Charts: BT, CG, and FT throughput (1 = 1-thread GIL) vs. number of threads (0–9), comparing GIL, HTM-1, HTM-16, HTM-256, and HTM-dynamic.]
Showed the same scalability as zEC12.
SMT did not help.
HTM-dynamic was not the best.
– Due to learning in HTM?
Throughput of Ruby NPB on Xeon E3-1275 v3 (2/2)
[Charts: IS, LU, MG, and SP throughput (1 = 1-thread GIL) vs. number of threads (0–9), comparing GIL, HTM-1, HTM-16, HTM-256, and HTM-dynamic.]