Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation

Rei Odaira (Twitter: @ReiOdaira), IBM Research – Tokyo
Jose G. Castanos, IBM Research – Watson Research Center
Hisanobu Tomari, University of Tokyo

description

This deck is about eliminating Ruby's giant VM lock (GVL) by using the hardware transactional memory (HTM) of the Intel Haswell processor.

Transcript of Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Page 1: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

© 2014 IBM Corporation

Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory

Rei Odaira (Twitter: @ReiOdaira), IBM Research – Tokyo

Jose G. Castanos, IBM Research – Watson Research Center

Hisanobu Tomari, University of Tokyo

Page 2: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Giant VM Locks in Scripting Languages

Scripting languages (Ruby, Python, etc.) everywhere. Increasing demands for speed.

Multi-thread speed restricted by Giant VM locks.

– Only one thread can execute interpreter at one time.

– Not required by language specification.

☺Simplified implementation in interpreter and libraries.

☹ No scalability on multi-core processors.

– Used in CRuby, CPython, and PyPy.

Page 3: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Hardware Transactional Memory (HTM) Coming into the Market

Improve performance by simply replacing locks with TM?

Lower overhead than software TM via hardware.

– Blue Gene/Q (2012)

– zEC12 (2012)

– Intel 4th Generation Core Processor (2013)

– POWER8 (2014)

Page 4: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Our Goal

How will realistic applications perform if we replace the GVL with HTM?

– GVL: Giant VM Lock

What modifications are needed in the interpreter?

Page 5: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Our Goal

How will realistic applications perform if we replace the GVL with HTM?

– GVL: Giant VM Lock

What modifications are needed in the interpreter?

atomic { }

Eliminate the GVL in CRuby through HTM on an Intel Xeon E3-1275 v3 (aka Haswell).

Evaluate the WEBrick HTTP server and Ruby on Rails.

Page 6: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Outline

Motivation

Transactional Memory

GVL in Ruby

Eliminating GVL through HTM

Experimental Results

Conclusion

Page 7: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Transactional Memory

At programming time

– Enclose critical sections with transaction begin/end operations.

At execution time

– Memory operations within a transaction observed as one step by other threads.

– Multiple transactions executed in parallel as long as their memory operations do not conflict. Higher parallelism than locks.

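With a lock: lock(); a->count++; unlock();

With a transaction: xbegin(); a->count++; xend();

Threads touching different objects (no conflict):

Thread X: xbegin(); a->count++; xend();

Thread Y: xbegin(); b->count++; xend();

Threads touching the same object (conflict):

Thread X: xbegin(); a->count++; xend();

Thread Y: xbegin(); a->count++; xend();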

Page 8: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

HTM

Instruction set (Intel TSX)

– XBEGIN: Begin a transaction

– XEND: End a transaction

– XABORT, etc.

Micro-architecture

– Read and write sets held in CPU caches

– Conflict detection using CPU cache coherence protocol

Abort in four cases:

– Read set and write set conflict

– Read set and write set overflow

– Restricted instructions (e.g. system calls)

– External interruptions, etc.

XBEGIN abort_handler
...
...
XEND

abort_handler:
...
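To make the abort handler more concrete: on Intel TSX the handler receives a status word in EAX whose low bits describe why the transaction aborted. The sketch below decodes those bits into the abort cases listed above. The bit positions follow Intel's documented layout, but the macro and function names (`ST_*`, `abort_cause`, `retry_hinted`) are hypothetical and are not taken from the authors' patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low bits of the abort status word (names are local to this sketch). */
#define ST_EXPLICIT (1u << 0)  /* aborted by an XABORT instruction */
#define ST_RETRY    (1u << 1)  /* hardware hints that a retry may succeed */
#define ST_CONFLICT (1u << 2)  /* another thread touched the read or write set */
#define ST_CAPACITY (1u << 3)  /* read or write set overflowed */

typedef enum {
    CAUSE_EXPLICIT,
    CAUSE_CONFLICT,
    CAUSE_OVERFLOW,
    CAUSE_OTHER      /* e.g. restricted instruction or external interruption */
} abort_cause_t;

/* Map a status word to one abort cause, checking the most specific bits first. */
abort_cause_t abort_cause(uint32_t status) {
    if (status & ST_EXPLICIT) return CAUSE_EXPLICIT;
    if (status & ST_CAPACITY) return CAUSE_OVERFLOW;
    if (status & ST_CONFLICT) return CAUSE_CONFLICT;
    return CAUSE_OTHER;
}

/* True if the hardware suggests the transaction may succeed when retried. */
bool retry_hinted(uint32_t status) {
    return (status & ST_RETRY) != 0;
}
```

A handler could call `abort_cause(status)` first and then consult `retry_hinted(status)` to decide whether another attempt is worthwhile.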

Page 9: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Ruby Multi-thread Programming and GVL (based on 1.9.3)

Ruby language

– Program with Thread, Mutex, and ConditionVariable classes

Ruby virtual machine

– Ruby program compiled into bytecode sequences.

– Ruby threads assigned to native threads (Pthreads).

– Only one thread can execute the interpreter at any given time due to the GVL.

GVL acquired/released when a thread starts/finishes.

GVL yielded during a blocking operation, such as I/O.

GVL yielded also at pre-defined yield-point bytecode ops.

– Conditional/unconditional jumps, method/block exits, etc.

Page 10: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

How GVL is Yielded in Ruby

It is too costly to yield GVL at every yield point.

Yield GVL every 250 msec using a timer thread.

[Diagram: the timer thread wakes up every 250 msec and sets a flag. An application thread that sees the flag releases the GVL, and a waiting application thread acquires it.]

if (flag is true) {
    gvl_release();
    sched_yield();
    gvl_acquire();
}

Check the flag at every yield point.

Actual implementation is more complex to ensure fairness.
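The flag check can be sketched in C as follows. This is a minimal illustration, not CRuby's code: `gvl_release` and `gvl_acquire` are stubs, `timer_tick` stands in for the timer thread's periodic wake-up, and clearing the flag inside `yield_point` is an assumption of this sketch.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the interpreter's GVL primitives. */
static int gvl_releases = 0;
static void gvl_release(void) { gvl_releases++; }
static void gvl_acquire(void) { }

/* Set by the timer thread, cleared by the thread that yields (assumption). */
static atomic_bool yield_flag = false;

/* What the timer thread would do on each 250 msec wake-up. */
void timer_tick(void) {
    atomic_store(&yield_flag, true);
}

/* Called at every yield point; returns true if the GVL was yielded. */
bool yield_point(void) {
    if (!atomic_exchange(&yield_flag, false))
        return false;            /* common case: one cheap flag check */
    gvl_release();
    sched_yield();               /* give another thread a chance to run */
    gvl_acquire();
    return true;
}

/* Number of times the GVL was released, for inspection. */
int gvl_release_count(void) {
    return gvl_releases;
}
```

The point of the design is that the common path through a yield point costs one flag check, while the expensive release and re-acquire happens at most once per timer period.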

Page 11: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Outline

Motivation

Transactional Memory

GVL in Ruby

Eliminating GVL through HTM

– Our Algorithm

– Conflict Removal

Experimental Results

Conclusion

Page 12: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Eliminating GVL through HTM

Execute as a transaction first.

– Begins/ends at the same points as GVL’s yield points.

Acquire GVL after consecutive aborts in a transaction.

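[Diagram: each thread begins a transaction and ends it at a yield point. On a conflict, the transaction aborts and is retried. After consecutive aborts, the thread acquires the GVL, and other threads wait until the GVL is released.]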

No timer thread needed.

Page 13: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Where to Begin and End Transactions?

Should be the same as GVL’s acquisition/release/yield points.

– Guaranteed as critical section boundaries.

☹ However, the original yield points are too coarse-grained.

– Cause many transaction overflows.

Bytecode boundaries are supposed to be safe critical section boundaries.

– Bytecode can be generated in arbitrary orders.

– Therefore, an interpreter is not supposed to have a critical section that straddles a bytecode boundary.

☹ Ruby programs that are not correctly synchronized can change behavior.

Added the following bytecode instructions as transaction yield points.

– getinstancevariable, getclassvariable, getlocal, send, opt_plus, opt_minus, opt_mult, opt_aref

– Criteria: they appear frequently or are complex.

Page 14: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

5 Sources of Conflicts and How to Remove Them (1/2)

Global variables pointing to the current running thread

– Cause conflicts because they are written every time transactions yield

Moved to Pthread’s thread-local storage
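A minimal sketch of this change, assuming a simplified `rb_thread_t` and hypothetical accessor names (`set_current_thread`, `get_current_thread`): the current-thread pointer is stored under a Pthreads key, so each thread reads and writes its own slot and no shared global is written at transaction boundaries.

```c
#include <pthread.h>
#include <stddef.h>

/* Simplified stand-in; the real rb_thread_t has many more fields. */
typedef struct rb_thread { int id; } rb_thread_t;

static pthread_key_t current_thread_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

/* Create the key exactly once, no matter which thread gets here first. */
static void make_key(void) {
    pthread_key_create(&current_thread_key, NULL);
}

/* Record the running thread in this native thread's private slot. */
void set_current_thread(rb_thread_t *th) {
    pthread_once(&key_once, make_key);
    pthread_setspecific(current_thread_key, th);
}

/* Read this native thread's slot; NULL if nothing has been stored yet. */
rb_thread_t *get_current_thread(void) {
    pthread_once(&key_once, make_key);
    return pthread_getspecific(current_thread_key);
}
```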

Updates to inline caches at the time of misses

– Cache the result of hash table access for method invocation and instance-field access

– Cause conflicts because caches are shared among threads.

Implement miss reduction techniques.

Source code needs to be changed for higher performance, but each change is limited to only a few dozen lines of Ruby interpreter’s source code.

Page 15: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

5 Sources of Conflicts and How to Remove Them (2/2)

Manipulation of global free list when allocating objects

Introduce thread-local free lists

• When a thread-local free list becomes empty, take 256 objects in bulk from the global free list.
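The bulk refill can be sketched as below. This is an illustration under stated assumptions, not the authors' patch: `obj_t`, `seed_global`, `alloc_obj`, and `global_len` are hypothetical names, the per-thread list uses C11 `_Thread_local`, and the synchronization that would protect the global list during a refill is omitted.

```c
#include <stddef.h>

#define BULK 256   /* objects moved per refill, as stated on the slide */

typedef struct obj { struct obj *next; } obj_t;

static obj_t *global_free_list;               /* shared: touching it can conflict */
static _Thread_local obj_t *local_free_list;  /* private to each thread */

/* Move up to BULK objects from the global list to this thread's list. */
static size_t refill_local(void) {
    size_t n = 0;
    while (n < BULK && global_free_list) {
        obj_t *o = global_free_list;
        global_free_list = o->next;
        o->next = local_free_list;
        local_free_list = o;
        n++;
    }
    return n;
}

/* Allocate one object; NULL means both lists are empty (GC would run here). */
obj_t *alloc_obj(void) {
    if (!local_free_list && refill_local() == 0)
        return NULL;
    obj_t *o = local_free_list;
    local_free_list = o->next;
    return o;
}

/* Helper for experiments: push n objects from pool onto the global list. */
void seed_global(obj_t *pool, size_t n) {
    for (size_t i = 0; i < n; i++) {
        pool[i].next = global_free_list;
        global_free_list = &pool[i];
    }
}

/* Length of the global list, for inspection. */
size_t global_len(void) {
    size_t n = 0;
    for (obj_t *o = global_free_list; o; o = o->next) n++;
    return n;
}
```

With this layout the shared list is touched once per 256 allocations instead of on every allocation, which is what removes most of the conflicts.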

Garbage collection

– Cause conflicts when multiple threads try to do GC.

– Cause overflows even when a single thread performs GC.

Reduce GC frequency by increasing the Ruby heap.

False sharing in Ruby’s thread structures (rb_thread_t)

– Many fields were added to store thread-local information, and they conflicted with other structures on the same cache line.

Assign each thread structure to a dedicated cache line.
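One way to give a structure its own cache line in C11 is an alignment specifier, sketched below. `padded_thread_t` is a hypothetical, much-simplified stand-in for `rb_thread_t`, and the 64-byte line size is the value reported for the Xeon in this talk.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* bytes per cache line on the Xeon used in the talk */

/* Aligning the first member aligns the whole struct and pads its size
   to a multiple of CACHE_LINE, so no other object shares its line. */
typedef struct {
    alignas(CACHE_LINE) int yield_counter;
    void *local_free_list;
} padded_thread_t;

size_t padded_thread_size(void)  { return sizeof(padded_thread_t); }
size_t padded_thread_align(void) { return alignof(padded_thread_t); }

/* True if two addresses fall on the same cache line. */
int same_line(const void *a, const void *b) {
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}
```

Two adjacent `padded_thread_t` elements in an array then never share a line, so a write to one thread's fields cannot abort a transaction that only reads another thread's fields.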

Page 16: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Outline

Motivation

Transactional Memory

GVL in Ruby

Eliminating GVL through HTM

Experimental Results

Conclusion

Page 17: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Experimental Environment

Implemented in Ruby 1.9.3-p547.

~400 MB Ruby heap

Xeon E3-1275 v3 (aka Haswell)

– 4-core 2-SMT 3.5-GHz Xeon E3-1275 v3

– 6-MB read set and 19-KB write set

– Linux 3.10.5

Page 18: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Benchmark Programs

WEBrick: HTTP server included in Ruby distribution

– Spawn 1 thread to handle 1 request.

Ruby on Rails running on WEBrick, Thin, and Puma.

– Measured a Web application that fetches a book list from SQLite3.

– Thin executed in the multi-threaded mode.

HTTP client and server running on the same machine.

– To measure the maximum possible speed-ups.

Page 19: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Throughput of WEBrick

GET a ~50-byte HTML file.

Normalized to 1-thread GVL

HTM was faster than GVL by 57%.

[Figure: WEBrick throughput (1 = 1-thread GVL) versus number of clients, for GVL and HTM.]

Page 20: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Throughput of Ruby on Rails / WEBrick

HTM improved the throughput by 24% over GVL.

Many aborts due to transaction overflows.

– Regular expression libraries and method invocations.

Need to split into multiple transactions?

[Figure: Ruby on Rails/WEBrick throughput (1 = 1-thread GVL) versus number of clients, for GVL and HTM.]

Page 21: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Throughput of Ruby on Rails / Thin and Puma

HTM improved the throughput by 43% with Thin and 58% with Puma over GVL.

Many aborts due to transaction overflows.

[Figures: Ruby on Rails/Thin and Ruby on Rails/Puma throughput (1 = 1-thread GVL) versus number of clients, for GVL and HTM.]

Page 22: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Conclusion

What was the performance of realistic applications when GVL was replaced with HTM?

– 57% speed-up in WEBrick and 58% in Ruby on Rails.

What was required for good scalability?

– Removed 5 sources of conflicts.

Code patch available!

– http://researcher.watson.ibm.com/researcher/view_person_subpage.php?id=5206

Using HTM is an effective way to outperform the GVL without fine-grain locking.

Page 23: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Backup

Page 24: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Tradeoff in Transaction Lengths

No need to end and begin transactions at every yield point.

= Variable transaction lengths at the granularity of yield points.

Longer transaction = Smaller relative overhead to begin/end transaction.

Shorter transaction = Smaller abort overhead

– Smaller amount of wasteful work rolled-back at aborts

– Smaller probability of transaction overflows

[Diagram: timelines of Begin/End markers for shorter transactions and for a longer transaction, with the wasteful work rolled back at an abort highlighted.]

Page 25: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Dynamic Adjustment of Transaction Lengths

Transaction length = How many yield points to skip

Adjust transaction lengths on a per-yield-point basis.

1. Initialize with a long length (255).

2. Calculate the abort ratio at each yield point: number of aborts / number of transactions.

3. If the abort ratio exceeds a threshold (6% for Xeon), shorten the transaction length (multiply it by 0.75) and return to Step 2.

4. If a pre-defined number (300) of transactions started before the abort ratio exceeds the threshold, stop adjusting that yield point.
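The four steps above can be sketched as a small per-yield-point state machine. The constants come from the slide, but the structure and function names are hypothetical. The sketch also makes one interpretive assumption: the abort ratio is measured against the 300-transaction profiling window, so the threshold is crossed after more than 18 aborts (6% of 300), which avoids reacting to a single early abort.

```c
#include <stdbool.h>

#define INITIAL_LENGTH  255   /* step 1 */
#define PROFILE_PERIOD  300   /* step 4 */
#define MAX_ABORTS      18    /* step 3: 6% of PROFILE_PERIOD (assumption) */
#define SHRINK_FACTOR   0.75  /* step 3 */

/* Profile kept for each yield point; field names are hypothetical. */
typedef struct {
    int  length;        /* yield points to skip before ending a transaction */
    int  transactions;  /* transactions started since the last adjustment */
    int  aborts;        /* aborts since the last adjustment */
    bool frozen;        /* true once adjustment has stopped */
} yield_point_t;

void yp_init(yield_point_t *yp) {
    yp->length = INITIAL_LENGTH;
    yp->transactions = 0;
    yp->aborts = 0;
    yp->frozen = false;
}

/* Call when a transaction begins at this yield point. */
void yp_on_begin(yield_point_t *yp) {
    if (!yp->frozen && ++yp->transactions >= PROFILE_PERIOD)
        yp->frozen = true;               /* step 4: stop adjusting */
}

/* Call when a transaction that began at this yield point aborts. */
void yp_on_abort(yield_point_t *yp) {
    if (yp->frozen) return;
    if (++yp->aborts <= MAX_ABORTS) return;
    int shorter = (int)(yp->length * SHRINK_FACTOR);   /* step 3 */
    yp->length = shorter < 1 ? 1 : shorter;
    yp->transactions = 0;                /* return to step 2 */
    yp->aborts = 0;
}
```

Starting from 255, the first adjustment yields a length of 191. A yield point whose transactions rarely abort keeps its length and stops being profiled after 300 transactions.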

Page 26: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Single-thread Overhead

Single-thread speed is important too.

18-35% single-thread overhead

– Optimization: with 1 thread, use the GVL instead of HTM. 5-14% overhead in micro-benchmarks even with this optimization.

Sources of the overhead:

– Checks at the yield points → Adaptive insertion of yield points

– Disabled method-invocation inline caches → Concurrent inline caches

Page 27: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Beginning a Transaction

Persistent aborts

– Overflow

– Restricted instructions

Otherwise, transient aborts

– Read-set and write-set conflicts, etc.

Abort reason reported by CPU

– Using a specified memory area

if (TBEGIN()) {
    /* Transaction */
    if (GVL.acquired) TABORT();
} else {
    /* Abort */
    if (GVL.acquired) {
        if (Retried 16 times) Acquire GVL;
        else Retry after GVL release;
    } else if (Persistent abort) {
        Acquire GVL;
    } else {
        /* Transient abort */
        if (Retried 3 times) Acquire GVL;
        else Retry;
    }
}
Execute Ruby code;
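The abort branch of this pseudocode can be isolated as a pure decision function, which makes the retry limits easy to check without HTM hardware. The sketch below uses the limits from the slide (16 and 3). The type and function names are hypothetical, and "Retried N times" is read as "N retries have already been made".

```c
#include <stdbool.h>

#define GVL_RETRY_LIMIT        16  /* retries while another thread holds the GVL */
#define TRANSIENT_RETRY_LIMIT  3   /* retries after transient aborts */

typedef enum { ABORT_TRANSIENT, ABORT_PERSISTENT } abort_kind_t;

typedef enum {
    ACTION_RETRY,            /* begin the transaction again */
    ACTION_WAIT_THEN_RETRY,  /* wait for the GVL release, then retry */
    ACTION_ACQUIRE_GVL       /* give up on HTM and take the GVL */
} action_t;

/* Retry counters for one attempt to execute a transaction. */
typedef struct {
    int gvl_retries;
    int transient_retries;
} retry_state_t;

/* Decide what to do after an abort, following the slide's three cases. */
action_t on_abort(retry_state_t *st, bool gvl_acquired, abort_kind_t kind) {
    if (gvl_acquired) {
        if (st->gvl_retries >= GVL_RETRY_LIMIT)
            return ACTION_ACQUIRE_GVL;
        st->gvl_retries++;
        return ACTION_WAIT_THEN_RETRY;
    }
    if (kind == ABORT_PERSISTENT)
        return ACTION_ACQUIRE_GVL;       /* retrying cannot succeed */
    if (st->transient_retries >= TRANSIENT_RETRY_LIMIT)
        return ACTION_ACQUIRE_GVL;
    st->transient_retries++;
    return ACTION_RETRY;
}
```

Under this reading, a persistent abort falls back to the GVL at once, while a conflict is retried three times before the fallback.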

Page 28: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Ending and Yielding a Transaction

Ending a transaction:

if (GVL.acquired) Release GVL;
else TEND();

Yielding a transaction (transaction boundary):

if (--yield_counter == 0) {
    End a transaction;
    Begin a transaction;
}

(yield_counter: dynamic adjustment of a transaction length)
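A small sketch of the yield counter, assuming that beginning a transaction reloads the counter with the current transaction length. The slide does not show the reload, so that detail and all names here are hypothetical. The end and begin hooks are stubs that only count boundaries.

```c
/* Per-thread transaction state; field names are hypothetical. */
typedef struct {
    int yield_counter;  /* yield points left in the current transaction */
    int length;         /* transaction length used to reload the counter */
    int boundaries;     /* times a transaction was ended and restarted */
} tx_thread_t;

/* Stub: a real implementation would release the GVL or execute TEND here. */
static void end_transaction(tx_thread_t *t) {
    t->boundaries++;
}

/* Stub: a real implementation would start a transaction here. */
static void begin_transaction(tx_thread_t *t) {
    t->yield_counter = t->length;
}

/* Called at every yield point; only every length-th call is a boundary. */
void transaction_yield(tx_thread_t *t) {
    if (--t->yield_counter == 0) {
        end_transaction(t);
        begin_transaction(t);
    }
}
```

With a length of 3, seven yield points produce two transaction boundaries, so the cost of ending and beginning a transaction is paid once per three yield points.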

Page 29: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Platform-Specific Optimizations

Spin-wait for a while at GVL contention.

– Original GVL is acquired and released every ~250 msec.

– The GVL is acquired and released far more frequently when it is used as a fallback path for HTM. Parallelism is severely damaged if a thread sleeps in the OS every time the GVL is contended.

– Some environments already include this optimization in Pthread implementation.
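A minimal sketch of spin-then-block acquisition with a Pthreads mutex standing in for the GVL. `SPIN_LIMIT` is a made-up tuning value and `gvl_acquire_spin` is a hypothetical name; neither comes from the talk.

```c
#include <pthread.h>
#include <stdbool.h>

#define SPIN_LIMIT 1000   /* hypothetical tuning value, not from the talk */

/* Acquire the lock, spinning briefly before sleeping in the OS.
   Returns true if the lock was obtained during the spin phase. */
bool gvl_acquire_spin(pthread_mutex_t *gvl) {
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (pthread_mutex_trylock(gvl) == 0)
            return true;
    }
    pthread_mutex_lock(gvl);   /* give up spinning and block */
    return false;
}
```

Because the fallback GVL is typically held only briefly, a short spin often succeeds and spares the thread a sleep and wake-up through the OS.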

Page 30: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

HTM in Intel Xeon E3-1275 v3

XBEGIN, XEND, and XABORT

– Offset to abort handler specified by XBEGIN.

– EAX register tells abort reason.

Flat nesting

Implementation details not disclosed yet.

Conflict detected at the granularity of 64-byte cache line

Read and write sets tracked by CPU caches

– 32-KB L1D, 256-KB L2, and 8-MB shared L3.

Read- and write-set sizes depend on the models, according to our experiments.

– Xeon E3-1275v3: ~6MB read and ~19KB write sets

– Core i7 4770S: ~4MB read and ~22KB write sets
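Because tracking happens per 64-byte line, the footprint of a transaction is better counted in cache lines than in bytes. The sketch below does that arithmetic and converts the measured ~19 KB write set into lines. The function names are hypothetical, and the result is only a rough bound, since the usable capacity also depends on cache details that the slide says were not disclosed.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE       64u            /* conflict-detection granularity */
#define WRITE_SET_BYTES  (19u * 1024u)  /* ~19 KB measured on the Xeon E3-1275 v3 */

/* Number of cache lines covered by len bytes starting at addr. */
size_t cache_lines_touched(uintptr_t addr, size_t len) {
    if (len == 0) return 0;
    uintptr_t first = addr / CACHE_LINE;
    uintptr_t last  = (addr + len - 1) / CACHE_LINE;
    return (size_t)(last - first + 1);
}

/* Rough upper bound on the distinct lines one transaction can write. */
size_t write_set_lines(void) {
    return WRITE_SET_BYTES / CACHE_LINE;
}
```

For example, a 2-byte write that straddles a line boundary occupies two lines, and 19 KB corresponds to 304 lines.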

Page 31: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Throughput of FT in Ruby NPB

[Figure: two panels, zEC12 and Xeon E3-1275 v3 (with the SMT region marked), plotting FT throughput (1 = 1-thread GIL) against the number of threads for GIL, HTM-1, HTM-16, HTM-256, and HTM-dynamic.]

HTM-dynamic was the best.

HTM-1 suffered high overhead.

HTM-256 incurred high abort ratios.

zEC12 and Xeon showed similar speed-ups up to 4 threads.

SMT did not help.

Page 32: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Throughput of Ruby NPB on Xeon E3-1275 v3 (1/2)

[Figure: three panels (BT, CG, FT) plotting throughput (1 = 1-thread GIL) against the number of threads for GIL, HTM-1, HTM-16, HTM-256, and HTM-dynamic.]

Showed the same scalability as zEC12.

SMT did not help.

HTM-dynamic was not the best.

– Due to learning in HTM?

Page 33: Eliminating Giant VM Lock in Ruby through Hardware Transactional Memory (RubyKaigi 2014)

Throughput of Ruby NPB on Xeon E3-1275 v3 (2/2)

[Figure: four panels (IS, LU, MG, SP) plotting throughput (1 = 1-thread GIL) against the number of threads.]