Konstantin Serebryany, "Fast dynamic analysis...
Fast dynamic program analysis
Race detection
Konstantin Serebryany
May 20, 2011
Agenda
● Dynamic program analysis
● Race detection: theory
● ThreadSanitizer: race detector
● Making ThreadSanitizer faster
● Announcement of a new tool (premiere)
● War stories
Dynamic analysis
● Execute the program and monitor interesting events
● Lightweight: no need to monitor memory accesses
  ○ Leak detection (monitor malloc/free)
  ○ Deadlock detection (monitor lock/unlock)
● Heavyweight: monitor memory accesses
  ○ Memory bugs:
    ■ Out-of-bounds accesses, use-after-free, uninitialized reads
  ○ Races
  ○ Pointer taintedness analysis
● Many more: profiling, coverage, ...
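The lightweight case needs only allocation events, never individual loads and stores. A minimal sketch of leak detection, assuming every malloc/free can be routed through wrappers (the `LeakTracker` class and its methods are hypothetical; real tools intercept the allocator itself instead of requiring explicit calls):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>

// Hypothetical lightweight leak checker: it observes only allocation
// events (malloc/free), not memory accesses.
class LeakTracker {
 public:
  void* Malloc(std::size_t size) {
    void* p = std::malloc(size);
    if (p != nullptr) live_[p] = size;  // remember the live block
    return p;
  }

  void Free(void* p) {
    live_.erase(p);  // the block is no longer live
    std::free(p);
  }

  // Blocks that are still live (e.g. at exit) are reported as leaks.
  std::size_t LeakedBlocks() const { return live_.size(); }

  std::size_t LeakedBytes() const {
    std::size_t total = 0;
    for (const auto& entry : live_) total += entry.second;
    return total;
  }

 private:
  std::unordered_map<void*, std::size_t> live_;  // pointer -> size
};
```

Deadlock detection works in the same spirit, but records the order in which locks are acquired instead of allocations.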
Data races are scary
A data race occurs when two or more threads concurrently access a shared memory location and at least one of the accesses is a write.
std::map<int,int> my_map;
void Thread1() { my_map[123] = 1;}
void Thread2() { my_map[345] = 2;}
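These two writes race even though the keys differ: `operator[]` inserts into the map's shared tree structure, so both threads modify the same internal memory. A minimal sketch of one fix, assuming a single mutex is an acceptable way to serialize the accesses (`my_map_mu` and `RunBoth` are hypothetical names):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>

std::map<int, int> my_map;
std::mutex my_map_mu;  // guards every access to my_map

void Thread1() {
  std::lock_guard<std::mutex> lock(my_map_mu);
  my_map[123] = 1;
}

void Thread2() {
  std::lock_guard<std::mutex> lock(my_map_mu);
  my_map[345] = 2;
}

// Runs both writers concurrently; returns the number of map entries.
std::size_t RunBoth() {
  std::thread t1(Thread1);
  std::thread t2(Thread2);
  t1.join();
  t2.join();
  return my_map.size();
}
```

With the lock held, the two inserts can no longer be concurrent, so the definition above no longer applies.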
Our goal: find races in Google code
Happens-before (precedes): a partial order on all events
Segment: a sequence of READ/WRITE events of one thread.
A Signal(obj) and the matching Wait(obj) form a happens-before arc.
Seg1 h.b. Seg4 -- the segments belong to the same thread.
Seg1 h.b. Seg5 -- due to a Signal/Wait pair with a matching object.
Seg1 h.b. Seg7 -- happens-before is transitive.
Seg3 and Seg6 -- no ordering constraint.
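The slide does not say how happens-before is computed. One standard technique is vector clocks; the sketch below assumes it (ThreadSanitizer's internal representation may differ, and `Thread`, `SyncObject`, `Signal`, `Wait` here are hypothetical). Each segment is stamped with its thread's clock; a Signal publishes the signaller's clock in the object, and the matching Wait joins it into the waiter's clock.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Entry i is the latest known segment number of thread i.
using VectorClock = std::vector<unsigned>;

// a happens-before b iff a <= b component-wise and a != b.
bool HappensBefore(const VectorClock& a, const VectorClock& b) {
  bool strictly_less = false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] > b[i]) return false;
    if (a[i] < b[i]) strictly_less = true;
  }
  return strictly_less;
}

struct Thread {
  std::size_t id;
  VectorClock clock;
  Thread(std::size_t id, std::size_t num_threads)
      : id(id), clock(num_threads, 0) {}
  // Starts a new segment of this thread and returns its clock.
  VectorClock NewSegment() {
    ++clock[id];
    return clock;
  }
};

struct SyncObject {
  VectorClock clock;
  explicit SyncObject(std::size_t num_threads) : clock(num_threads, 0) {}
};

// Signal(obj): publish the signaller's knowledge in the object.
void Signal(const Thread& t, SyncObject& obj) {
  for (std::size_t i = 0; i < obj.clock.size(); ++i)
    obj.clock[i] = std::max(obj.clock[i], t.clock[i]);
}

// Wait(obj): join the object's clock into the waiter (the h.b. arc).
void Wait(Thread& t, const SyncObject& obj) {
  for (std::size_t i = 0; i < t.clock.size(); ++i)
    t.clock[i] = std::max(t.clock[i], obj.clock[i]);
}
```

Transitivity comes for free: a Wait joins everything the signaller already knew, not just its own entry.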
LockSet
void Thread1() { mu1.Lock(); mu2.Lock(); *X = 1; mu2.Unlock(); mu1.Unlock(); ...
void Thread2() { mu1.Lock(); mu3.Lock(); *X = 2; mu3.Unlock(); mu1.Unlock(); ...
● LockSet: a set of locks held during a memory access
  ○ Thread1: {mu1, mu2}
  ○ Thread2: {mu1, mu3}
● Common LockSet: intersection of LockSets
  ○ {mu1}
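The common LockSet is a plain set intersection. A minimal sketch, assuming locks are identified by name (the `LockSet` alias and `CommonLockSet` are hypothetical): the non-empty result {mu1} means some lock consistently protects `*X`; an empty result would be a warning sign.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

using LockSet = std::set<std::string>;  // lock names, for readability

// Locks held during *both* accesses.
LockSet CommonLockSet(const LockSet& a, const LockSet& b) {
  LockSet common;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(common, common.begin()));
  return common;
}
```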
Dynamic race detector: state machine
● Intercepts program events at run time
  ○ Memory accesses: READ, WRITE
  ○ Synchronization: LOCK, UNLOCK, SIGNAL, WAIT
● Maintains global state
  ○ Locks, other synchronization events, threads
  ○ Memory allocation
● Maintains shadow state for each memory location (byte)
  ○ Records previous accesses
  ○ Reports a race in the appropriate state, e.g. when the current WRITE
    ■ ... and a previous READ are not ordered by happens-before
    ■ ... and a previous WRITE have no common locks
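A minimal sketch of the shadow-state check for one memory location. It assumes vector clocks for happens-before, integer lock IDs, and a hybrid rule that reports a race only when two accesses (at least one of them a WRITE) are neither ordered by happens-before nor protected by a common lock. `ShadowCell`, `Access`, and the helpers are hypothetical, and keeping the full access history is a simplification a real detector would avoid.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

using VectorClock = std::vector<unsigned>;
using LockSet = std::set<int>;  // hypothetical integer lock IDs

// a happens-before b iff a <= b component-wise and a != b.
bool HappensBefore(const VectorClock& a, const VectorClock& b) {
  bool strictly_less = false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] > b[i]) return false;
    if (a[i] < b[i]) strictly_less = true;
  }
  return strictly_less;
}

bool HaveCommonLock(const LockSet& a, const LockSet& b) {
  for (int mu : a)
    if (b.count(mu) != 0) return true;
  return false;
}

struct Access {
  bool is_write;
  VectorClock clock;  // clock of the accessing segment
  LockSet locks;      // locks held during the access
};

// Shadow state of one memory location.
class ShadowCell {
 public:
  // Records the access; returns true if it races with a previous one.
  bool OnAccess(const Access& cur) {
    bool race = false;
    for (const Access& prev : history_) {
      if (!prev.is_write && !cur.is_write) continue;  // READ/READ is fine
      bool ordered = HappensBefore(prev.clock, cur.clock);
      bool locked = HaveCommonLock(prev.locks, cur.locks);
      if (!ordered && !locked) race = true;
    }
    history_.push_back(cur);
    return race;
  }

 private:
  std::vector<Access> history_;
};
```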
ThreadSanitizer
● Implemented in late 2008, open source
● Initially based on the Valgrind binary translation framework
● SLOW: 20x-50x slowdown
  ○ Binary translation overhead is 1.5x-3x
  ○ Serializes threads (up to 8x on our machines)
  ○ Slow generalized state machine
● Slow is bad:
  ○ Many tests (and bugs) are timing dependent
  ○ Users are unhappy
  ○ Machines cost money
● Still very useful: found thousands of races all over Google
  ○ Server-side software (e.g. Bigtable, GWS)
  ○ Google Chrome browser
ThreadSanitizer: algorithm
Speedup #1: fast path state machine
● Observation: 90%-99% of reads/writes are thread-private
● Simplification: special case for thread-private accesses
  ○ Very few global objects touched
  ○ No loops (~20 hand-written if/else statements)
  ○ 1.5x speedup
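A minimal sketch of the thread-private fast path, assuming a hypothetical shadow word that records which thread owns a location (`kNoOwner`, `kShared`, `Shadow`, `SlowPath`, `OnMemoryAccess` are illustrative names). The real fast path consists of the hand-written if/else statements described on the slide and must also keep enough history for later checks; this shows only the core idea.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int32_t kNoOwner = -1;  // never accessed yet
constexpr std::int32_t kShared = -2;   // accessed by several threads

struct Shadow {
  std::int32_t owner = kNoOwner;
};

int slow_path_calls = 0;  // how often the full state machine runs

// Stand-in for the generalized (slow) state machine.
void SlowPath(Shadow& s, std::int32_t tid) {
  ++slow_path_calls;
  s.owner = (s.owner == kNoOwner) ? tid : kShared;
}

// A location touched only by the current thread needs no race check,
// so it costs a single comparison.
void OnMemoryAccess(Shadow& s, std::int32_t tid) {
  if (s.owner == tid) return;  // thread-private: fast path
  SlowPath(s, tid);
}
```

With 90%-99% of accesses thread-private, almost every call returns after the first comparison.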
Speedup #2: parallel fast path
● Fast path does not touch global state (almost)
  ○ Easy to parallelize (fast path without a lock, fallback to a serialized slow path)
● Valgrind is not parallel, so we used PIN (pintool.org)
  ○ Good alternative, also works on Windows
  ○ But not being open source is a huge disadvantage
● Up to #CPUs times speedup (for Chrome: ~2x)
● Problem: how to find races in the race detector itself (Valgrind can't run PIN)?
  ○ OUCH!
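The parallel fast path can be sketched with standard C++ threads instead of PIN: the fast path is a single lock-free atomic load, and only accesses that miss it take a global mutex standing in for the serialized slow path. All names are hypothetical and the ownership rule is deliberately simplistic.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

constexpr std::int32_t kNoOwner = -1;
constexpr std::int32_t kShared = -2;

std::mutex global_state_mu;  // serializes the slow path
long slow_path_calls = 0;    // guarded by global_state_mu

// Slow path: touches global state, so it runs under the global lock.
void SlowPath(std::atomic<std::int32_t>& owner, std::int32_t tid) {
  std::lock_guard<std::mutex> lock(global_state_mu);
  ++slow_path_calls;
  std::int32_t cur = owner.load();
  if (cur != tid) owner.store(cur == kNoOwner ? tid : kShared);
}

// Fast path: one atomic load and no lock, so threads don't serialize.
void OnMemoryAccess(std::atomic<std::int32_t>& owner, std::int32_t tid) {
  if (owner.load(std::memory_order_relaxed) == tid) return;
  SlowPath(owner, tid);
}

// Each thread repeatedly touches its own location.
long RunThreads(int num_threads, int accesses_per_thread) {
  std::vector<std::atomic<std::int32_t>> shadow(num_threads);
  for (auto& s : shadow) s.store(kNoOwner);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t)
    threads.emplace_back([&shadow, t, accesses_per_thread] {
      for (int i = 0; i < accesses_per_thread; ++i)
        OnMemoryAccess(shadow[t], t);
    });
  for (auto& th : threads) th.join();
  return slow_path_calls;  // safe to read: all threads joined
}
```

Each thread takes the lock only on its first access, which is where the "up to #CPUs times" speedup comes from. The sketch also shows the new problem: the detector itself now has shared state that can race.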
Speedup #3: faster instrumentation
● Valgrind/PIN add a 1.5x-3x slowdown. Why pay that price?
● Use compiler instrumentation
  ○ + Less run-time overhead
  ○ - Need to recompile all libraries to catch races there
● Implemented LLVM and GCC plugins: indeed 1.5x-3x faster
● Bonus: we can now detect races in the parallel race detector itself
  ○ TSan-Valgrind over TSan-LLVM
● Result: up to 50M memory events per second
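Compiler instrumentation inserts calls into the runtime at build time, next to each load and store, instead of translating machine code at run time. A minimal sketch of what an instrumented loop conceptually looks like; `OnRead` and `OnWrite` are hypothetical hooks, not the actual interface of the ThreadSanitizer LLVM/GCC plugins.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical runtime hooks; here they only count events.
std::size_t reads = 0, writes = 0;
void OnRead(const void* addr, std::size_t size) { (void)addr; (void)size; ++reads; }
void OnWrite(void* addr, std::size_t size) { (void)addr; (void)size; ++writes; }

// Original source:
//   void Copy(int* dst, const int* src, int n) {
//     for (int i = 0; i < n; ++i) dst[i] = src[i];
//   }
// Conceptually, after the instrumentation pass:
void CopyInstrumented(int* dst, const int* src, int n) {
  for (int i = 0; i < n; ++i) {
    OnRead(&src[i], sizeof(int));   // inserted before the load
    int tmp = src[i];
    OnWrite(&dst[i], sizeof(int));  // inserted before the store
    dst[i] = tmp;
  }
}
```

Code that was not recompiled never calls the hooks, which is why uninstrumented libraries hide their races.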
Speedup #4: sampling
● Idea: ignore some accesses in hot regions
  ○ LiteRace, PLDI'09
● Execution counter for every code region (function or smaller)
● While the counter is small, don't ignore the region
● Larger counter: ignore more frequently
● Moderate sampling rate: loses no races, 2x-4x speedup
if (num_to_skip-- <= 0) {
  HandleThisRegion();
  num_to_skip = (counter >> sampling_rate) + 1;
  counter += num_to_skip;
}
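The sampling snippet can be wrapped into a self-contained, testable form. A minimal sketch, assuming one counter pair per region (`RegionSampler` and `CountHandled` are hypothetical names); the update rule is kept as on the slide.

```cpp
#include <cassert>
#include <cstdint>

// Per-region sampling state (one instance per instrumented region).
struct RegionSampler {
  std::int64_t num_to_skip = 0;  // executions to skip before the next handled one
  std::uint64_t counter = 0;     // approximate execution count of the region
  unsigned sampling_rate;        // larger value => skips grow more slowly

  explicit RegionSampler(unsigned rate) : sampling_rate(rate) {}

  // Returns true if this execution of the region should be analyzed.
  bool ShouldHandle() {
    if (num_to_skip-- <= 0) {
      num_to_skip = static_cast<std::int64_t>(counter >> sampling_rate) + 1;
      counter += static_cast<std::uint64_t>(num_to_skip);
      return true;
    }
    return false;
  }
};

// Number of analyzed executions out of `executions` runs of one region.
int CountHandled(int executions, unsigned sampling_rate) {
  RegionSampler sampler(sampling_rate);
  int handled = 0;
  for (int i = 0; i < executions; ++i)
    if (sampler.ShouldHandle()) ++handled;
  return handled;
}
```

With `sampling_rate` = 0 the skip doubles after each handled execution (1, 2, 4, ...), so only 7 of the first 100 executions are analyzed; with a large `sampling_rate` the skip stays at 1 and every other execution is analyzed.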
Results
● 1.5x-4x slowdown
● Can run Chrome interactively
  ○ Play FarmVille or use Gmail
● Finds more bugs per day
Premiere: AddressSanitizer (ASAN)
● Many memory error detectors exist:
  ○ Slow: Valgrind, Dr. Memory, Purify, BoundsChecker, Insure++, Intel Inspector, mudflap, ...
  ○ Incomplete: libgmalloc, Electric Fence, Page Heap, ...
● AddressSanitizer (ASAN): fast address sanity checker
  ○ Use-after-free
  ○ Out-of-bounds accesses (aka buffer overflows) for heap and stack
  ○ Double-free, etc.
  ○ Linux, Mac, ChromeOS
  ○ 2x-2.5x slowdown (faster than a Debug build!)
  ○ LLVM instrumentation module + specialized malloc
Generic addressability checking
● malloc()/free() replacement library (most tools):
  ○ Poison redzones around malloc-ed memory
  ○ Poison memory on free()
  ○ Delay reuse of freed memory
● Stack poisoning (few tools)
● Instrument all loads and stores:
  ○ if (IsPoisoned(mem)) BANG();
● The tricky part: how to implement IsPoisoned and BANG
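A minimal sketch of the malloc()/free() replacement, assuming a fixed arena and one poison flag per byte in place of real shadow memory (`ToyHeap`, `kRedzone`, `kQuarantine` are hypothetical). Freed blocks are poisoned and parked in a small FIFO quarantine before they can be reused, so a stale pointer keeps hitting poisoned bytes for a while.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <utility>
#include <vector>

class ToyHeap {
 public:
  static constexpr std::size_t kRedzone = 16;    // poisoned gap between blocks
  static constexpr std::size_t kQuarantine = 2;  // freed blocks kept unusable

  explicit ToyHeap(std::size_t arena_size) : poisoned_(arena_size, true) {}

  // Returns the arena offset of a new block of `size` bytes.
  std::size_t Malloc(std::size_t size) {
    std::size_t begin;
    auto reusable = free_.find(size);
    if (reusable != free_.end()) {  // reuse a block that left quarantine
      begin = reusable->second;
      free_.erase(reusable);
    } else {                        // carve a new block; redzones stay poisoned
      begin = next_ + kRedzone;
      assert(begin + size + kRedzone <= poisoned_.size());  // toy: arena full
      next_ = begin + size;
    }
    for (std::size_t i = 0; i < size; ++i) poisoned_[begin + i] = false;
    live_[begin] = size;
    return begin;
  }

  // Returns false on a double free or an invalid free.
  bool Free(std::size_t begin) {
    auto it = live_.find(begin);
    if (it == live_.end()) return false;
    std::size_t size = it->second;
    live_.erase(it);
    for (std::size_t i = 0; i < size; ++i) poisoned_[begin + i] = true;
    quarantine_.emplace_back(begin, size);    // delay reuse
    if (quarantine_.size() > kQuarantine) {   // oldest block becomes reusable
      free_.emplace(quarantine_.front().second, quarantine_.front().first);
      quarantine_.pop_front();
    }
    return true;
  }

  // The check an instrumented load/store would perform.
  bool IsPoisoned(std::size_t offset) const { return poisoned_[offset]; }

 private:
  std::vector<bool> poisoned_;                                 // one flag per byte
  std::size_t next_ = 0;                                       // bump pointer
  std::map<std::size_t, std::size_t> live_;                    // begin -> size
  std::deque<std::pair<std::size_t, std::size_t>> quarantine_; // (begin, size)
  std::multimap<std::size_t, std::size_t> free_;               // size -> begin
};
```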
AddressSanitizer algorithm
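Memory layout (32-bit address space, high to low addresses):
[0x80000000, 0xffffffff] high application memory
[0x60000000, 0x7fffffff] protected: Bad addresses for the high shadow
[0x40000000, 0x47ffffff] protected: Bad addresses for the low shadow
[0x30000000, 0x3fffffff] shadow of high application memory
[0x20000000, 0x23ffffff] shadow of low application memory
[0x00000000, 0x1fffffff] low application memory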
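Mem => Shadow is an 8-to-1 mapping (one shadow byte per 8 bytes of application memory)
Instrumenting an 8-byte access to Mem:
uintptr_t Shadow = (Mem >> 3) + 0x20000000;
if (*(char *)Shadow) {               // 1-byte load
  char *Bad = (char *)(Shadow * 2);  // lands in a protected region
  *Bad = 0;                          // SEGV!
}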
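The address arithmetic on this slide can be checked mechanically. The sketch below encodes the slide's formulas (Shadow = (Mem >> 3) + 0x20000000 and Bad = Shadow * 2) as 32-bit integer functions and simulates the 8-byte check against a small in-process shadow array rather than real memory; `MemToShadow`, `ShadowToBad`, and `Bad8ByteAccess` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kShadowOffset = 0x20000000;

// One shadow byte describes 8 application bytes.
constexpr std::uint32_t MemToShadow(std::uint32_t mem) {
  return (mem >> 3) + kShadowOffset;
}

// Where the error path writes to trigger a SEGV.
constexpr std::uint32_t ShadowToBad(std::uint32_t shadow) {
  return shadow * 2;
}

// Simulated check for an 8-aligned, 8-byte access at `offset` in a fake
// region: a non-zero shadow byte means some of the 8 bytes are poisoned.
bool Bad8ByteAccess(const std::vector<std::uint8_t>& shadow,
                    std::uint32_t offset) {
  return shadow[offset >> 3] != 0;
}
```

The boundaries of the application ranges map exactly onto the shadow ranges on the slide (e.g. 0x1fffffff maps to 0x23ffffff and 0xffffffff to 0x3fffffff), and doubling a shadow address lands in 0x40000000-0x47ffffff or 0x60000000-0x7fffffff.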
AddressSanitizer demo