Konstantin Serebryany "Fast dynamic analysis...

Fast dynamic program analysis
Race detection

Konstantin Serebryany <[email protected]>
May 20 2011

Agenda

● Dynamic program analysis
● Race detection: theory
● ThreadSanitizer: race detector
● Making ThreadSanitizer faster
● Announcement of a new tool (premiere)
● War stories

Dynamic analysis

● Execute program and monitor interesting events
● Lightweight: no need to monitor memory accesses
  ○ Leak detection (monitor malloc/free)
  ○ Deadlock detection (monitor lock/unlock)
● Heavyweight: monitor memory accesses
  ○ Memory bugs:
    ■ Out-of-bounds, use-after-free, uninitialized reads
  ○ Races
  ○ Pointer taintedness analysis
● Many more: profiling, coverage, ...

Data races are scary

A data race occurs when two or more threads concurrently access a shared memory location and at least one of the accesses is a write.

std::map<int,int> my_map;

void Thread1() {
  my_map[123] = 1;
}

void Thread2() {
  my_map[345] = 2;
}
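One way to remove the race in this example is to guard every access to my_map with a mutex. The following is a minimal sketch under that assumption; the lock my_map_mu and the helper RunBothThreads are hypothetical names, not part of the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>

std::map<int, int> my_map;
std::mutex my_map_mu;  // hypothetical lock guarding my_map

void Thread1() {
  std::lock_guard<std::mutex> lock(my_map_mu);
  my_map[123] = 1;
}

void Thread2() {
  std::lock_guard<std::mutex> lock(my_map_mu);
  my_map[345] = 2;
}

// Runs both writers concurrently and returns the final map size.
std::size_t RunBothThreads() {
  std::thread t1(Thread1);
  std::thread t2(Thread2);
  t1.join();
  t2.join();
  std::lock_guard<std::mutex> lock(my_map_mu);
  assert(my_map.count(123) == 1 && my_map.count(345) == 1);
  return my_map.size();
}
```

Both insertions modify the map's internal tree, so the lock has to cover the whole operator[] call even though the keys differ.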

Our goal: find races in Google code

Happens-before (precedes): partial order on all events

Segment: a sequence of READ/WRITE events of one thread
A Signal(obj) / Wait(obj) pair is a happens-before arc

Seg1 h.b. Seg4 -- segments belong to the same thread.
Seg1 h.b. Seg5 -- due to a Signal/Wait pair with a matching object.
Seg1 h.b. Seg7 -- happens-before is transitive.
Seg3 and Seg6 -- no ordering constraint.
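The four relations above can be checked as reachability in a graph of happens-before arcs. The sketch below assumes a hypothetical segment layout, since the slide's figure is not reproduced in the text: T1 runs Seg1, Seg4, Seg6; T2 runs Seg2, Seg5, Seg7; T3 runs Seg3; and a Signal at the end of Seg1 matches a Wait before Seg5. The names HbGraph, MakeExampleGraph and HappensBefore are illustrative only.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Happens-before arcs between segments, keyed by segment number.
using HbGraph = std::map<int, std::vector<int>>;

// Hypothetical layout; see the assumptions stated in the lead-in.
HbGraph MakeExampleGraph() {
  return {
      {1, {4, 5}},  // same-thread arc to Seg4, Signal/Wait arc to Seg5
      {4, {6}},     // same thread
      {2, {5}},     // same thread
      {5, {7}},     // same thread
  };
}

// True if `to` is reachable from `from` through one or more arcs.
bool HappensBefore(const HbGraph& g, int from, int to) {
  std::set<int> seen;
  std::vector<int> stack = {from};
  while (!stack.empty()) {
    int seg = stack.back();
    stack.pop_back();
    auto it = g.find(seg);
    if (it == g.end()) continue;
    for (int next : it->second) {
      if (next == to) return true;
      if (seen.insert(next).second) stack.push_back(next);
    }
  }
  return false;
}

// Re-checks the four statements from the slide on the assumed layout.
void CheckSlideRelations() {
  HbGraph g = MakeExampleGraph();
  assert(HappensBefore(g, 1, 4));   // same thread
  assert(HappensBefore(g, 1, 5));   // Signal/Wait
  assert(HappensBefore(g, 1, 7));   // transitivity via Seg5
  assert(!HappensBefore(g, 3, 6));  // unordered in both directions
  assert(!HappensBefore(g, 6, 3));
}
```

Graph search is chosen here only for readability; it says nothing about how a production detector stores the relation.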

LockSet

void Thread1() {
  mu1.Lock();
  mu2.Lock();
  *X = 1;
  mu2.Unlock();
  mu1.Unlock(); ...

void Thread2() {
  mu1.Lock();
  mu3.Lock();
  *X = 2;
  mu3.Unlock();
  mu1.Unlock(); ...

● LockSet: a set of locks held during a memory access
  ○ Thread1: {mu1, mu2}
  ○ Thread2: {mu1, mu3}
● Common LockSet: intersection of LockSets
  ○ {mu1}
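The common LockSet is a plain set intersection. A minimal sketch, assuming locks are identified by name; the LockSet alias and the CommonLockSet helper are illustrative, not ThreadSanitizer code.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

using LockSet = std::set<std::string>;

// Returns the locks that were held during both accesses.
LockSet CommonLockSet(const LockSet& a, const LockSet& b) {
  LockSet common;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(common, common.begin()));
  return common;
}

// Reproduces the example from the slide.
void CheckSlideExample() {
  LockSet thread1 = {"mu1", "mu2"};
  LockSet thread2 = {"mu1", "mu3"};
  LockSet expected = {"mu1"};
  assert(CommonLockSet(thread1, thread2) == expected);
}
```

An empty result means no single lock protected both accesses to X.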

Dynamic race detector: state machine

● Intercepts program events at run-time
  ○ Memory access: READ, WRITE
  ○ Synchronization: LOCK, UNLOCK, SIGNAL, WAIT
● Maintains global state
  ○ Locks, other synchronization events, threads
  ○ Memory allocation
● Maintains shadow state for each memory location (byte)
  ○ Records previous accesses
  ○ Reports a race in the appropriate state. E.g. the current WRITE
    ■ ... is not ordered by happens-before with the previous READ
    ■ ... and the previous WRITE have no common locks.
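The reporting condition can be written as a small predicate over a recorded access and the current one. This sketch assumes a hybrid check (happens-before plus LockSet) and a hypothetical bitmask encoding of locks; the happens-before result is passed in as a flag because computing it is outside the sketch.

```cpp
#include <cassert>
#include <cstdint>

struct Access {
  int thread;           // id of the accessing thread
  bool is_write;        // WRITE if true, READ otherwise
  std::uint32_t locks;  // bit i set = lock i held (hypothetical encoding)
};

// True if `prev` and `cur` form a race under the assumed hybrid rule.
bool IsRace(const Access& prev, const Access& cur,
            bool prev_happens_before_cur) {
  if (prev.thread == cur.thread) return false;        // same thread: ordered
  if (!prev.is_write && !cur.is_write) return false;  // two reads never race
  if (prev_happens_before_cur) return false;          // ordered by sync
  return (prev.locks & cur.locks) == 0;               // no common lock
}

// Uses the accesses from the LockSet slide: mu1 is bit 0, mu2 bit 1, mu3 bit 2.
void CheckLockSetSlideExample() {
  const std::uint32_t mu1 = 1u << 0, mu2 = 1u << 1, mu3 = 1u << 2;
  Access w1{1, true, mu1 | mu2};
  Access w2{2, true, mu1 | mu3};
  assert(!IsRace(w1, w2, false));  // mu1 is common
  Access w3{2, true, mu3};
  assert(IsRace(w1, w3, false));   // no common lock, no ordering
  assert(!IsRace(w1, w3, true));   // ordered by happens-before
}
```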

ThreadSanitizer

● Implemented in late 2008, open source.
● Initially based on the Valgrind binary translation framework.
● SLOW, 20x-50x slowdown.
  ○ Binary translation overhead is 1.5x-3x
  ○ Serializes threads (up to 8x on our machines)
  ○ Slow generalized state machine.
● Slow is bad:
  ○ Many tests (and bugs) are timing dependent
  ○ Users are unhappy
  ○ Machines cost money
● Still very useful -- found thousands of races all over Google.
  ○ Server-side software (e.g. Bigtable, GWS)
  ○ Google Chrome browser

ThreadSanitizer: algorithm

Speedup #1: fast path state machine

● Observation: 90%-99% of reads/writes are thread-private
● Simplification: special case for thread-private access
  ○ Very few global objects touched
  ○ No loops (~20 hand-written if/else statements)
  ○ 1.5x speedup
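The thread-private special case can be pictured as a short branch sequence in front of the general state machine. The sketch below is an assumption about the shape of such a check, not ThreadSanitizer's actual code: ShadowCell, Stats, SlowPath and OnAccess are hypothetical names, and the slow path is reduced to a counter.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int32_t kNoOwner = -1;

struct ShadowCell {
  std::int32_t owner = kNoOwner;  // the only thread seen so far, if any
  bool shared = false;            // set once a second thread touches the cell
};

struct Stats {
  long fast = 0;
  long slow = 0;
};

// Hypothetical stand-in for the general (slow) state machine.
void SlowPath(ShadowCell& cell, Stats& stats) {
  cell.shared = true;
  ++stats.slow;
}

// Loop-free fast path: a few branches, no global state touched.
void OnAccess(ShadowCell& cell, std::int32_t tid, Stats& stats) {
  assert(tid >= 0);
  if (!cell.shared) {
    if (cell.owner == kNoOwner) cell.owner = tid;  // first access
    if (cell.owner == tid) {
      ++stats.fast;
      return;
    }
  }
  SlowPath(cell, stats);
}
```

Once a cell has been seen by two threads it stays on the slow path in this sketch, which keeps the fast path free of any history tracking.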

Speedup #2: parallel fast path

● Fast path does not touch global state (almost)
  ○ Easy to parallelize (fast path w/o a lock, fallback to serialized slow path)
● Valgrind is not parallel, so used PIN (pintool.org)
  ○ Good alternative, also works on Windows.
  ○ But not being open source is a huge disadvantage.
● Up to #CPUs times speedup (for Chrome: ~2x).
● Problem: how to find races in the race detector itself (Valgrind can't run PIN)?
  ○ OUCH!
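A lock-free fast path with a serialized fallback might look like the following. This is a sketch under stated assumptions: the per-cell owner is a std::atomic<int> claimed with compare-and-swap, the slow path is a counter behind one mutex, and Detector, ShadowCell, Result and RunThreads are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kNoOwner = -1;  // cell not accessed yet
constexpr int kShared = -2;   // cell accessed by more than one thread

struct ShadowCell {
  std::atomic<int> owner{kNoOwner};
};

struct Detector {
  std::mutex slow_path_mu;  // serializes the slow path
  long slow_count = 0;      // guarded by slow_path_mu

  // Returns true if the access was handled on the lock-free fast path.
  bool OnAccess(ShadowCell& cell, int tid) {
    assert(tid >= 0);
    int seen = cell.owner.load();
    if (seen == kNoOwner && cell.owner.compare_exchange_strong(seen, tid)) {
      return true;  // first access: this thread claimed the cell
    }
    // If the exchange failed, `seen` holds the value another thread stored.
    if (seen == tid) return true;  // still thread-private
    std::lock_guard<std::mutex> lock(slow_path_mu);
    cell.owner.store(kShared);
    ++slow_count;  // stand-in for the general state machine
    return false;
  }
};

struct Result {
  long private_fast = 0;
  long shared_fast = 0;
  long slow = 0;
};

// Each thread touches its own cell `n` times and one shared cell once.
Result RunThreads(int num_threads, int n) {
  Detector detector;
  std::vector<ShadowCell> private_cells(num_threads);
  ShadowCell shared_cell;
  std::atomic<long> private_fast{0};
  std::atomic<long> shared_fast{0};
  std::vector<std::thread> threads;
  for (int tid = 0; tid < num_threads; ++tid) {
    threads.emplace_back([&, tid] {
      for (int i = 0; i < n; ++i) {
        if (detector.OnAccess(private_cells[tid], tid)) ++private_fast;
      }
      if (detector.OnAccess(shared_cell, tid)) ++shared_fast;
    });
  }
  for (std::thread& t : threads) t.join();
  Result r;
  r.private_fast = private_fast.load();
  r.shared_fast = shared_fast.load();
  r.slow = detector.slow_count;  // safe to read: all threads have joined
  return r;
}
```

Only the first thread to reach the shared cell stays on the fast path; every other thread falls back to the mutex, so the cost of the lock is paid only for genuinely shared locations.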

Speedup #3: faster instrumentation

● Valgrind/PIN add 1.5x-3x slowdown. Why pay that price?
● Use compiler instrumentation
  ○ + Less run-time overhead
  ○ - Need to recompile all libraries to catch races there
● Implemented LLVM and GCC plugins. Indeed 1.5x-3x faster.
● Bonus: now can detect races in the parallel race detector
  ○ TSan-Valgrind over TSan-LLVM
● Result: up to 50M memory events per second

Speedup #4: sampling

● Idea: ignore some accesses in hot regions
  ○ LiteRace, PLDI'09
● Execution counter for every code region (function or smaller).
● While the counter is small, don't ignore the region
● Larger counter -- ignore more frequently
● Moderate sampling rate: loses no races, 2x-4x speedup.

if (num_to_skip-- <= 0) {
  HandleThisRegion();
  num_to_skip = (counter >> sampling_rate) + 1;
  counter += num_to_skip;
}
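To see how the snippet above behaves, it can be wrapped in a small per-region object and driven in a loop. The wrapper below is a sketch: RegionSampler and CountHandled are hypothetical names, the variable types are assumptions, and the shift values used in the note after the code are chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Per-region sampling state; the update rule is the one from the slide.
struct RegionSampler {
  std::uint32_t counter = 0;
  std::int32_t num_to_skip = 0;
  int sampling_rate;  // shift applied to counter
  long handled = 0;   // how many executions were actually analyzed

  explicit RegionSampler(int rate) : sampling_rate(rate) {}

  void HandleThisRegion() { ++handled; }

  void OnRegionEntry() {
    if (num_to_skip-- <= 0) {
      HandleThisRegion();
      num_to_skip = static_cast<std::int32_t>((counter >> sampling_rate) + 1);
      counter += static_cast<std::uint32_t>(num_to_skip);
    }
  }
};

// Executes one region `executions` times and reports how many were handled.
long CountHandled(int sampling_rate, int executions) {
  assert(sampling_rate >= 0 && sampling_rate < 32);
  RegionSampler sampler(sampling_rate);
  for (int i = 0; i < executions; ++i) sampler.OnRegionEntry();
  return sampler.handled;
}
```

With a shift of 0 the skip distance roughly doubles after each handled execution, so a hot region is analyzed only a handful of times per thousand runs. With a shift of 30 the skip distance stays at 1 for small counters, so every other execution is handled.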

Results

● 1.5x-4x slowdown
● Can run Chrome interactively
  ○ Play Farmville or use GMail.
● Finds more bugs per day.

Premiere: AddressSanitizer (ASAN)

● Many memory error detectors exist:
  ○ Slow: Valgrind, DrMemory, Purify, Boundschecker, Insure++, Intel Inspector, mudflap, ...
  ○ Incomplete: libgmalloc, Electric Fence, Page Heap, ...
● AddressSanitizer (ASAN): fast address sanity checker
  ○ Use-after-free
  ○ Out-of-bounds (aka buffer overflow) for heap and stack
  ○ Double-free, etc
  ○ Linux, Mac, ChromeOS
  ○ 2x-2.5x slowdown (faster than Debug build!)
  ○ LLVM instrumentation module + specialized malloc

Generic addressability checking

● malloc()/free() replacement library (most tools):
  ○ poison redzones around malloc-ed memory
  ○ poison memory on free()
  ○ delay reuse of free-ed memory
● Stack poisoning (few tools)
● Instrument all loads and stores
  ○ if (IsPoisoned(mem)) BANG();
● The tricky part: how to implement IsPoisoned and BANG
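The redzone and poison-on-free ideas can be illustrated with a toy heap that tracks one poison flag per byte. This is a deliberately naive sketch, not how any of the listed tools work internally: ToyHeap hands out offsets instead of pointers, never reuses freed memory (delayed reuse taken to the extreme), and the 16-byte redzone size is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class ToyHeap {
 public:
  // All bytes start out poisoned; only live allocations are unpoisoned.
  explicit ToyHeap(std::size_t size) : poisoned_(size, true) {}

  // Returns the offset of an n-byte region with a redzone on each side.
  std::size_t Malloc(std::size_t n) {
    std::size_t start = next_ + kRedzone;
    assert(start + n + kRedzone <= poisoned_.size());
    for (std::size_t i = start; i < start + n; ++i) poisoned_[i] = false;
    next_ = start + n;  // the following redzone also separates the next block
    return start;
  }

  // Poisons the region again; the memory is never handed out a second time.
  void Free(std::size_t start, std::size_t n) {
    for (std::size_t i = start; i < start + n; ++i) poisoned_[i] = true;
  }

  // The check an instrumented load or store would perform.
  bool IsPoisoned(std::size_t addr) const { return poisoned_.at(addr); }

 private:
  static constexpr std::size_t kRedzone = 16;  // arbitrary size for the sketch
  std::vector<bool> poisoned_;
  std::size_t next_ = 0;
};
```

A byte-per-byte flag table like this is simple but wasteful, which is why the choice of IsPoisoned representation is called the tricky part.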

AddressSanitizer algorithm

[Figure: 32-bit address space layout, regions listed from high to low addresses]
[0x80000000, 0xffffffff]
[0x60000000, 0x7fffffff]
[0x40000000, 0x47ffffff]
[0x30000000, 0x3fffffff]
[0x20000000, 0x23ffffff]
[0x00000000, 0x1fffffff]

Mem => Shadow is an 8-to-1 mapping

Instrumenting 8 byte access to Mem:

Shadow = (Mem >> 3) + 0x20000000;
if (*Shadow) {  // 1 byte load
  Bad = Shadow * 2;
  *Bad = 0;  // SEGV!
}
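Applying the two formulas from this slide to the listed address ranges shows how the regions relate. The sketch below only does the arithmetic on 32-bit values and dereferences nothing; ShadowOf, BadOf and CheckSlideLayout are hypothetical names. Reading the ranges as "memory", "shadow" and "bad" regions is an inference from the arithmetic, not something the slide text states.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kShadowOffset = 0x20000000u;

// Mem => Shadow: every 8 bytes of memory map to 1 shadow byte.
constexpr std::uint32_t ShadowOf(std::uint32_t mem) {
  return (mem >> 3) + kShadowOffset;
}

// Shadow => Bad: the address written to when a poisoned access is detected.
constexpr std::uint32_t BadOf(std::uint32_t shadow) { return shadow * 2u; }

// Checks the range endpoints listed on the slide against the formulas.
void CheckSlideLayout() {
  // [0x00000000, 0x1fffffff] maps onto [0x20000000, 0x23ffffff].
  assert(ShadowOf(0x00000000u) == 0x20000000u);
  assert(ShadowOf(0x1fffffffu) == 0x23ffffffu);
  // [0x80000000, 0xffffffff] maps onto [0x30000000, 0x3fffffff].
  assert(ShadowOf(0x80000000u) == 0x30000000u);
  assert(ShadowOf(0xffffffffu) == 0x3fffffffu);
  // Doubling a shadow address lands inside the two remaining ranges.
  assert(BadOf(0x20000000u) == 0x40000000u);
  assert(BadOf(0x23ffffffu) <= 0x47ffffffu);
  assert(BadOf(0x30000000u) == 0x60000000u);
  assert(BadOf(0x3fffffffu) <= 0x7fffffffu);
}
```

Under this reading, doubling a shadow address never produces an address inside the memory or shadow ranges, which is consistent with the slide's comment that the store to Bad triggers a SEGV.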

AddressSanitizer demo
