Capriccio: Scalable Threads for Internet Services (von Behren) Kenneth Chiu.

Background

• Non-blocking I/O, async I/O
  – NB
    • Usually doesn’t work well for disks.
  – Async I/O
    • Issue a request, get completion.
• epoll()/poll()
• convoy: tendency for threads to “bunch up”
• priority inversion
• call graph
• average, weighted moving average
• capriccio: improvisatory style, free form

The Problem

• Web “transactions” involve a number of steps which must be performed in sequence.

• For high throughput, we want to service many of these requests concurrently.
  – When does concurrency help? When does it not?

• If we use a single thread per request, we will have too many threads.

• If we multiplex requests on a small set of threads, the code is more difficult to write.

Read two numbers and add

Event-driven version (state kept explicitly per connection):

    while (true) {
        fd = get_read_ready();
        state = lookup(fd);
        if (state.step == READING_FIRST) {
            c = read(fd, …, bytes_left);
            if (have enough) {
                state.step = READING_SECOND;  /* assignment, not comparison */
            }
        } else if (state.step == READING_SECOND) {
            …
        }
    }

Threaded version (state kept in the control flow):

    while (true) {
        int n1, n2;
        readexact(fd, &n1, 4);
        readexact(fd, &n2, 4);
        printf("%d\n", n1 + n2);
    }
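The threaded version relies on a helper, readexact(), that the slide does not define. A minimal sketch, assuming it means "block until exactly n bytes have been read", might look like the following. The return convention (0 on success, -1 on error or early end of file) is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: read exactly n bytes from fd into buf.
 * Returns 0 on success, -1 on error or if EOF arrives first. */
int readexact(int fd, void *buf, size_t n)
{
    char *p = buf;

    assert(buf != NULL);
    while (n > 0) {
        ssize_t got = read(fd, p, n);
        if (got < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: retry */
            return -1;           /* real I/O error */
        }
        if (got == 0)
            return -1;           /* EOF before n bytes arrived */
        p += got;                /* short read: keep going */
        n -= (size_t)got;
    }
    return 0;
}
```

The loop hides short reads from the caller, which is what lets the threaded version stay two lines long while the event-driven version has to track bytes_left by hand.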

Thread Design and Scalability

The Case for User-Level Threads

• Flexibility
  – Level of indirection between applications and the kernel, which helps decouple the two.
  – Kernel-level thread scheduling must handle all applications. User-level scheduling can be tailored.
  – Lightweight, which means we can use zillions of them.
• Performance
  – Cooperative scheduling is nearly free.
  – Do not require kernel crossing for uncontended locks. (Why do contended locks require kernel crossings?)
• Disadvantages
  – Non-blocking I/O requires an additional system call. (Why?)
  – SMPs
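One way to see where the additional system call can come from is the sketch below, which splits a blocking read into a readiness check followed by the read itself. The function name read_when_ready and its return convention are hypothetical. A real user-level thread package would yield to its scheduler instead of waiting inside poll(), so this shows only the call pattern.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical wrapper: emulate a blocking read with two system calls.
 * Returns the byte count from read(), or -1 on timeout or error. */
ssize_t read_when_ready(int fd, void *buf, size_t n, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready;

    assert(buf != NULL);
    ready = poll(&pfd, 1, timeout_ms);   /* system call 1: is fd readable? */
    if (ready <= 0)
        return -1;                       /* 0 = timed out, <0 = error */
    return read(fd, buf, n);             /* system call 2: the actual read */
}
```

A kernel thread doing a plain blocking read() pays for one call. The wrapper pays for two whenever the data is not already known to be available.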

Implementation

• Context switches
  – Built on coroutine library.
• I/O
  – Intercept blocking system calls, use epoll() and AIO for disk.
  – Can be less efficient.
• Scheduling
  – Main scheduling loop looks very much like an event-driven application. (What is an EDA?)
  – Makes it relatively easy to switch schedulers.
• Synchronization
  – Cooperative threading on a uniprocessor (UP).
• Efficiency
  – All O(1), except sleep queue.
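The sketch below illustrates cooperative user-level context switching and a scheduler loop shaped like an event loop. It uses the POSIX ucontext calls (getcontext, makecontext, swapcontext) as a stand-in, not the coroutine library Capriccio is built on. All names (worker, coop_yield, run_threads, events) are hypothetical, and the round-robin policy is an assumption chosen for brevity.

```c
#include <assert.h>
#include <ucontext.h>

enum { NUM_THREADS = 2, STEPS = 3, STACK_SIZE = 64 * 1024, MAX_EVENTS = 16 };

static ucontext_t sched_ctx;                  /* context of the scheduler loop */
static ucontext_t thread_ctx[NUM_THREADS];    /* one saved context per thread */
static char stacks[NUM_THREADS][STACK_SIZE];  /* statically allocated stacks */
static int finished[NUM_THREADS];
static int current;                           /* index of the running thread */

int events[MAX_EVENTS];                       /* log of (id * 10 + step) */
int n_events;

/* Give up the CPU voluntarily: save this thread, resume the scheduler. */
static void coop_yield(void)
{
    swapcontext(&thread_ctx[current], &sched_ctx);
}

static void worker(int id)
{
    for (int step = 0; step < STEPS; step++) {
        assert(n_events < MAX_EVENTS);
        events[n_events++] = id * 10 + step;
        coop_yield();
    }
    finished[id] = 1;   /* returning resumes sched_ctx through uc_link */
}

/* Round-robin scheduler loop; returns the number of logged events. */
int run_threads(void)
{
    int remaining = NUM_THREADS;

    n_events = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        finished[i] = 0;
        getcontext(&thread_ctx[i]);
        thread_ctx[i].uc_stack.ss_sp = stacks[i];
        thread_ctx[i].uc_stack.ss_size = STACK_SIZE;
        thread_ctx[i].uc_link = &sched_ctx;
        makecontext(&thread_ctx[i], (void (*)(void))worker, 1, i);
    }
    while (remaining > 0) {
        for (current = 0; current < NUM_THREADS; current++) {
            if (finished[current])
                continue;
            swapcontext(&sched_ctx, &thread_ctx[current]);
            if (finished[current])
                remaining--;
        }
    }
    return n_events;
}
```

Because a thread gives up the processor only inside coop_yield(), code between yields runs without interruption on a uniprocessor, which is why synchronization can be so cheap in this model.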

Benchmarks

• 2 × 2.4 GHz Xeon, 1 GB memory, 2 × 10K RPM SCSI, GigE.
  – 2 × 1.2 GHz US III
• Linux 2.5.70, epoll(), AIO.
  – Solaris 8
• Capriccio, LinuxThreads, NPTL

Thread Primitives

                         Capriccio   Capriccio (notrace)   LinuxThreads   NPTL   Solaris
Thread creation          21.5        21.5                  37.9           17.7   32
Thread context switch    0.56        0.24                  0.71           0.65
Uncontended mutex lock   0.04        0.04                  0.14           0.15   0.08

Thread Scalability

• Producer-consumer

Thread Scalability

• The drop between 100 and 1000 threads is due to cache footprint.

I/O Performance

• pipetest
  – Pass a number of tokens among a set of pipes.
• Disk scheduling
  – A number of threads perform random 4 KB reads from a 1 GB file.
• Disk I/O through buffer cache
  – 200 threads reading with a fixed miss rate.

• When concurrency is low, performance is poorer.

• Benefits of disk head scheduling.

• I/O out of the buffer cache.

• Performance is lower due to AIO.

Linked Stack Management

Thread Stacks

• With a lot of threads, the cumulative stack space can be quite large.

• Solution: Use a dynamic allocation policy and allocate on demand. Link stack chunks together.

• Problem: How do you link stack chunks together? How do you know when to link a new one?

Weighted Call Graph

• Use static analysis to create a weighted call graph.
• Each node is weighted by the maximum stack space that the function might consume. (Why is it maximum, and not exact?)

• Now what?
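One thing a weighted call graph makes possible, as long as it has no cycles, is a static bound on stack use: the heaviest path starting at a given function. The sketch below computes that bound on a small adjacency-matrix graph. The struct layout, the names, and the byte counts in the usage note are all hypothetical, and the function assumes the graph is acyclic (it would not terminate on recursion).

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_FUNCS = 8 };

/* Hypothetical weighted call graph: one node per function. */
struct weighted_call_graph {
    int n;                              /* number of functions */
    size_t frame_bytes[MAX_FUNCS];      /* node weight: max stack per call */
    int calls[MAX_FUNCS][MAX_FUNCS];    /* calls[f][g] != 0: f may call g */
};

/* Worst-case stack bytes needed from the entry of function f downward.
 * Assumes the call graph is acyclic. */
size_t max_stack_from(const struct weighted_call_graph *g, int f)
{
    size_t deepest_callee = 0;

    assert(g->n <= MAX_FUNCS && f >= 0 && f < g->n);
    for (int callee = 0; callee < g->n; callee++) {
        if (g->calls[f][callee]) {
            size_t below = max_stack_from(g, callee);
            if (below > deepest_callee)
                deepest_callee = below;
        }
    }
    return g->frame_bytes[f] + deepest_callee;
}
```

For a graph where A (100 bytes) calls B (50) and C (200), and B calls D (20), the bound from A is 300 bytes, taken along A to C, even if most executions go through B and D.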

Bounds

• Most real-world programs use recursion.

• Even without recursion, a static bound wastes too much space.

• Instead, insert checkpoints at key places to link in new stack chunks.

• Chunks switched right before arguments are pushed.
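The decision a checkpoint makes can be sketched as bookkeeping: if the current chunk cannot hold the stack space needed before the next checkpoint, link in a new chunk. The code below only models that decision with heap-allocated chunk headers. It does not switch the real stack pointer, and the names (stack_chunk, checkpoint, default_size) are hypothetical and not taken from Capriccio.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical chunk header; the chunk's usable bytes follow it in memory. */
struct stack_chunk {
    struct stack_chunk *prev;   /* chunk we came from, restored on return */
    size_t size;                /* usable bytes in this chunk */
    size_t used;                /* bytes consumed so far */
};

/* Return a chunk with at least `needed` free bytes: the current one if it
 * has room, otherwise a new chunk linked back to it. NULL if malloc fails. */
struct stack_chunk *checkpoint(struct stack_chunk *cur, size_t needed,
                               size_t default_size)
{
    size_t size;
    struct stack_chunk *next;

    assert(cur == NULL || cur->used <= cur->size);
    if (cur != NULL && cur->size - cur->used >= needed)
        return cur;                       /* enough room: nothing to do */

    size = needed > default_size ? needed : default_size;
    next = malloc(sizeof *next + size);
    if (next == NULL)
        return NULL;
    next->prev = cur;                     /* link chunks together */
    next->size = size;
    next->used = 0;
    return next;
}
```

The fast path is a single comparison, so a checkpoint that does not need a new chunk costs very little.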

Placing Checkpoints

• Make sure there is at least one checkpoint in every cycle by inserting checkpoints on back edges. (How?) (Is this efficient?)

• Then make sure each path (the sum of its node weights) is not too long.

• Function B is executing.
• Function D, both ways.
• Recursion.
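One common way to find back edges is a depth-first search that flags any edge leading to a node still on the DFS stack. The sketch below applies that to a small adjacency-matrix call graph and marks those edges as checkpoint sites. This is a standard textbook technique used for illustration, and the slides do not say it is how Capriccio finds back edges. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { MAX_NODES = 8 };
enum color { WHITE, GRAY, BLACK };   /* unvisited, on DFS stack, done */

struct cycle_graph {
    int n;                                 /* number of functions */
    int calls[MAX_NODES][MAX_NODES];       /* calls[u][v] != 0: u calls v */
    int checkpoint[MAX_NODES][MAX_NODES];  /* output: 1 on each back edge */
};

static void visit(struct cycle_graph *g, int u, enum color *color)
{
    color[u] = GRAY;
    for (int v = 0; v < g->n; v++) {
        if (!g->calls[u][v])
            continue;
        if (color[v] == GRAY)
            g->checkpoint[u][v] = 1;   /* v is an ancestor: back edge */
        else if (color[v] == WHITE)
            visit(g, v, color);
    }
    color[u] = BLACK;
}

/* Mark every back edge as a checkpoint site; returns how many were marked. */
int place_cycle_checkpoints(struct cycle_graph *g)
{
    enum color color[MAX_NODES];
    int count = 0;

    assert(g->n >= 0 && g->n <= MAX_NODES);
    for (int i = 0; i < g->n; i++)
        color[i] = WHITE;
    memset(g->checkpoint, 0, sizeof g->checkpoint);
    for (int u = 0; u < g->n; u++)
        if (color[u] == WHITE)
            visit(g, u, color);
    for (int u = 0; u < g->n; u++)
        for (int v = 0; v < g->n; v++)
            count += g->checkpoint[u][v];
    return count;
}
```

Every cycle contains at least one back edge, so after this pass every cycle passes through a checkpoint. Bounding the length of the remaining acyclic paths is a separate step.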

Special Cases

• Function pointers
  – Difficult, but they try to analyze.
• External functions
  – Allow annotations.
  – Alternatively, link in a large chunk.
• Variable length arrays
  – C99

Question

• What kind of a problem is this?

• Is it being solved at the right level?

Resource-Aware Scheduling

Admission Control

• We’ve seen many graphs where performance degrades as some variable increases.

• The goal of scheduling in Capriccio is to keep performance in the “good” part of the curve.

Blocking Graph

• Each node is a location where the program blocked.
  – Location is call chain.
• Generated at run time.
• Annotate with resource usage:
  – Average running time (with exponentially-weighted “moving” average), memory, stack, sockets, etc.
• Maintain a run queue for each node. Admit threads till resources reach maximum capacity.
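The two numeric pieces of this scheme can be sketched as follows: an exponentially weighted moving average that folds each new measurement into a node's estimate, and an admission test against a capacity limit. The smoothing factor alpha, the function names, and the simple "estimate must fit under capacity" rule are assumptions for illustration. The slides do not give the formula or policy Capriccio uses.

```c
#include <assert.h>

/* Fold one new sample into a running estimate.
 * alpha near 1 favors recent samples; alpha near 0 favors history. */
double ewma_update(double avg, double sample, double alpha)
{
    assert(alpha > 0.0 && alpha <= 1.0);
    return alpha * sample + (1.0 - alpha) * avg;
}

/* Hypothetical admission rule: run a thread from a node only if that
 * node's estimated usage still fits under the resource's capacity. */
int can_admit(double in_use, double node_estimate, double capacity)
{
    return in_use + node_estimate <= capacity;
}
```

With alpha = 0.5, an estimate of 10 and a new sample of 20 give an updated estimate of 15. A node estimated at 15 units would be admitted with 80 of 100 units in use, but not with 90 in use.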

Pitfalls

• Too many non-linear effects to predict.

• One solution is to use some kind of instrumentation, plus feedback control.
  – But even detecting that is hard.

Web Server Test

Summary

• Control flow maintains state. Control flow can be swapped for explicit maintenance.

• Threads perform two functions:
  – Maintain state (logical threads of programming model)
  – Allow concurrency (kernel)

• Should separate the two, since the overhead of concurrency is not necessary when we just want to maintain state.

• Cooperative multitasking has been denigrated before, but can be good.
