Servers: Concurrency and Performance
Jeff Chase, Duke University
HTTP Server
• HTTP Server
  – Creates a socket (socket)
  – Binds to an address
  – Listens to set up the accept backlog
  – Can call accept to block waiting for connections
  – (Can call select to check for data on multiple sockets)
• Handle request
  – GET /index.html HTTP/1.0\n
    <optional body, multiple lines>\n\n
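A minimal C sketch of these setup steps, assuming IPv4, an arbitrary port 6789, and a canned response; error handling is omitted and this is illustrative rather than the course's reference code:

  #include <string.h>
  #include <unistd.h>
  #include <netinet/in.h>
  #include <sys/socket.h>

  int main(void) {
      int lfd = socket(AF_INET, SOCK_STREAM, 0);       /* create a TCP socket */

      struct sockaddr_in addr;
      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);        /* bind to any local address */
      addr.sin_port = htons(6789);
      bind(lfd, (struct sockaddr *)&addr, sizeof(addr));

      listen(lfd, 128);                                /* set up the accept backlog */

      for (;;) {
          int cfd = accept(lfd, NULL, NULL);           /* block waiting for a connection */
          char buf[4096];
          read(cfd, buf, sizeof(buf));                 /* e.g., "GET /index.html HTTP/1.0..." */
          const char *resp = "HTTP/1.0 200 OK\r\n\r\nhello\n";
          write(cfd, resp, strlen(resp));              /* send a (trivial) response */
          close(cfd);
      }
  }

This version handles one connection at a time; the rest of the lecture is about what to do when that is not enough.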
Inside your server
[Figure: request path inside the server. Packet queues, the listen queue, and the accept queue feed the server application (Apache, Tomcat/Java, etc.)]

Measures: offered load, response time, throughput, utilization
Example: Video On Demand
Client() {
  fd = connect("server");
  write(fd, "video.mpg");
  while (!eof(fd)) {
    read(fd, buf);
    display(buf);
  }
}

Server() {
  while (1) {
    cfd = accept();
    read(cfd, name);
    fd = open(name);
    while (!eof(fd)) {
      read(fd, block);
      write(cfd, block);
    }
    close(cfd);
    close(fd);
  }
}
[MIT/Morris]
How many clients can the server support? Suppose, say, 200 kb/s video on a 100 Mb/s network link?
Performance “analysis”
• Server capacity:
  – Network (100 Mbit/s)
  – Disk (20 Mbyte/s)
• Obtained performance: one client stream
• Server is limited by software structure
• If a video is 200 Kbit/s, the server should be able to support more than one client.
[MIT/Morris]
500?
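As a back-of-the-envelope check on that answer (ignoring protocol overhead and assuming the network link is the binding resource):

  \frac{100\ \text{Mbit/s}}{200\ \text{Kbit/s}} = \frac{100{,}000\ \text{Kbit/s}}{200\ \text{Kbit/s}} = 500\ \text{concurrent streams}

With the capacities listed above, the disk (20 Mbyte/s ≈ 160 Mbit/s) outruns the 100 Mbit/s network, so the network link is the first resource to saturate.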
WebServer Flow
[Figure: TCP socket space on a host with interfaces 128.36.232.5 and 128.36.230.2]
  state: listening; address: {*.6789, *.*}; completed connection queue; sendbuf; recvbuf
  state: listening; address: {*.25, *.*}; completed connection queue; sendbuf; recvbuf
  state: established; address: {128.36.232.5:6789, 198.69.10.10:1500}; sendbuf; recvbuf
Create ServerSocket
connSocket = accept()
read request from connSocket
read local file
write file to connSocket
close connSocket

Discussion: what does each step do and how long does it take?
Web Server Processing Steps
Accept Client Connection
Read HTTP Request Header
Find File
Send HTTP Response Header
Read File / Send Data

may block waiting on network (accepting connections, reading requests, sending data)
may block waiting on disk I/O (finding and reading the file)

Want to be able to process requests concurrently.
Process States and Transitions

[Figure: process state diagram. States: running (user), running (kernel), ready, blocked. Transitions: Run, Sleep, Wakeup, Yield, interrupt/exception, trap/return]
Server Blocking
• accept() when no connect requests are waiting on the listen queue
  – What if the server has multiple ports to listen on?
    • E.g., 80 for HTTP, 443 for HTTPS
• open/read/write on server files
• read() on a socket, if the client is sending too slowly
• write() on a socket, if the client is receiving too slowly
  – Yup, TCP has flow control like pipes
What if the server blocks while serving one client, and another client has work to do?
Under the Hood
[Figure: queueing diagram. Requests start at arrival rate λ, circulate between the CPU and an I/O device (I/O request / I/O completion), and exit at throughput λ until some center saturates]
Concurrency and Pipelining
[Figure: Before: CPU, disk, and network handle each request's stages one at a time. After: stages of multiple requests are pipelined so CPU, disk, and network work concurrently]
Better single-server performance
• Goal: run at the server's hardware speed
  – Disk or network should be the bottleneck
• Method:
  – Pipeline blocks of each request
  – Multiplex requests from multiple clients
• Two implementation approaches:
  – Multithreaded server
  – Asynchronous I/O
[MIT/Morris]
Concurrent threads or processes
• Using multiple threads/processes
  – so that only the flow processing a particular request is blocked
  – Java: extends Thread or implements the Runnable interface
• Example: a multithreaded web server, which creates a thread for each request
Multiple Process Architecture
• Advantages
  – Simple programming while addressing the blocking issue (see the fork-per-connection sketch below)
• Disadvantages
  – Many processes; large context-switch overheads
  – Consumes much memory
  – Optimizations involving sharing information among processes (e.g., caching) are harder
[Figure: Process 1 … Process N, in separate address spaces, each running the full pipeline Accept Conn → Read Request → Find File → Send Header → Read File / Send Data]
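A minimal sketch of this architecture in C, assuming a fork-per-connection design, an already-listening socket lfd, and a hypothetical handle_request() standing in for the request pipeline above:

  #include <sys/types.h>
  #include <sys/socket.h>
  #include <signal.h>
  #include <unistd.h>

  void handle_request(int cfd);              /* hypothetical: read request, find file, send reply */

  void serve_forever(int lfd) {
      signal(SIGCHLD, SIG_IGN);              /* let the kernel reap exited children */
      for (;;) {
          int cfd = accept(lfd, NULL, NULL); /* may block: no pending connections */
          if (cfd < 0) continue;
          if (fork() == 0) {                 /* child: its own address space */
              close(lfd);
              handle_request(cfd);
              close(cfd);
              _exit(0);
          }
          close(cfd);                        /* parent keeps only the listening socket */
      }
  }

Each child can block on disk or network without stalling the others, which is the point of the architecture; the cost is one process (with its memory and context-switch overhead) per in-flight request.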
Using Threads
• Advantages
  – Lower context-switch overheads
  – Shared address space simplifies optimizations (e.g., caches)
• Disadvantages
  – Need kernel-level threads (why?)
  – Some extra memory needed to support multiple stacks
  – Need thread-safe programs, synchronization
[Figure: Thread 1 … Thread N, sharing one address space, each running the full pipeline Accept Conn → Read Request → Find File → Send Header → Read File / Send Data]
Multithreaded server

server() {
  while (1) {
    cfd = accept();
    read(cfd, name);
    fd = open(name);
    while (!eof(fd)) {
      read(fd, block);
      write(cfd, block);
    }
    close(cfd);
    close(fd);
  }
}

for (i = 0; i < 10; i++)
  threadfork(server);
• When waiting for I/O, thread scheduler runs another thread
• What about references to shared data?
• Synchronization
[MIT/Morris]
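A sketch of how the pseudocode above might look with POSIX threads; the names server, serve_one, and start_threads, the file-scope lfd, and the pool of 10 threads mirror the slide and are assumptions, not part of pthreads itself:

  #include <pthread.h>
  #include <sys/socket.h>

  static int lfd;                              /* listening socket, set up elsewhere */

  void serve_one(int cfd);                     /* hypothetical: read name, copy file blocks, close */

  static void *server(void *arg) {
      (void)arg;
      for (;;) {
          int cfd = accept(lfd, NULL, NULL);   /* only this thread blocks here */
          if (cfd >= 0)
              serve_one(cfd);
      }
      return NULL;
  }

  void start_threads(void) {                   /* plays the role of threadfork(server) x 10 */
      pthread_t tid;
      for (int i = 0; i < 10; i++)
          pthread_create(&tid, NULL, server, NULL);
  }

Any data shared across requests (for example, a block cache) would need locks or other synchronization, which is exactly the cost the bullets above point at.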
Event-Driven Programming
• One execution stream: no CPU concurrency.
• Register interest in events (callbacks).
• Event loop waits for events, invokes handlers.
• No preemption of event handlers.
• Handlers generally short-lived.
[Figure: an event loop dispatching to event handlers]
[Ousterhout 1995]
Single Process Event Driven (SPED)
• Single threaded
• Asynchronous (non-blocking) I/O (see the non-blocking-socket sketch below)
• Advantages
  – Single address space
  – No synchronization
• Disadvantages
  – In practice, disk reads still block
[Figure: a single Event Dispatcher drives the pipeline Accept Conn → Read Request → Find File → Send Header → Read File / Send Data]
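A small sketch of how a SPED-style server marks its sockets non-blocking; this is standard fcntl usage, but the helper name set_nonblocking is mine:

  #include <fcntl.h>

  /* Mark a descriptor non-blocking so accept()/read()/write() return
   * EAGAIN/EWOULDBLOCK instead of stalling the single event loop. */
  int set_nonblocking(int fd) {
      int flags = fcntl(fd, F_GETFL, 0);
      if (flags < 0) return -1;
      return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
  }

This works for sockets and pipes, but reads of regular files on a local disk still block the caller on most Unix systems, which is the "disk reads still block" disadvantage above.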
Asynchronous Multi-Process Event Driven (AMPED)
• Like SPED, but uses helper processes/threads for disk I/O
• Uses IPC to communicate with the helper processes
• Advantages
  – Shared address space for most web server functions
  – Concurrency for disk I/O
• Disadvantages
  – IPC between the main thread and the helper threads
[Figure: an Event Dispatcher drives Accept Conn → Read Request → Find File → Send Header → Read File / Send Data; blocking disk reads are handed off to Helper 1 … Helper N]
This hybrid model is used by the “Flash” web server.
Event-Based Concurrent Servers Using I/O Multiplexing
• Maintain a pool of connected descriptors.
• Repeat the following forever:
  – Use the Unix select function to block until:
    • (a) a new connection request arrives on the listening descriptor, or
    • (b) new data arrives on an existing connected descriptor.
  – If (a), add the new connection to the pool of connections.
  – If (b), read any available data from the connection.
    • Close the connection on EOF and remove it from the pool.
[CMU 15-213]
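A condensed C sketch of that loop, assuming an already-listening descriptor listenfd; partial reads, error handling, and per-connection state are omitted:

  #include <sys/select.h>
  #include <sys/socket.h>
  #include <unistd.h>

  void event_loop(int listenfd) {
      fd_set pool;                                       /* pool of connected descriptors */
      FD_ZERO(&pool);
      FD_SET(listenfd, &pool);
      int maxfd = listenfd;

      for (;;) {
          fd_set ready = pool;                           /* select() overwrites its argument */
          select(maxfd + 1, &ready, NULL, NULL, NULL);   /* block until something is readable */

          if (FD_ISSET(listenfd, &ready)) {              /* (a) new connection request */
              int cfd = accept(listenfd, NULL, NULL);
              FD_SET(cfd, &pool);
              if (cfd > maxfd) maxfd = cfd;
          }
          for (int fd = 0; fd <= maxfd; fd++) {          /* (b) data on an existing connection */
              if (fd == listenfd || !FD_ISSET(fd, &ready)) continue;
              char buf[4096];
              ssize_t n = read(fd, buf, sizeof(buf));
              if (n <= 0) {                              /* EOF or error: drop from the pool */
                  close(fd);
                  FD_CLR(fd, &pool);
              } else {
                  /* process buf[0..n) for this connection */
              }
          }
      }
  }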
Select
• If a server has many open sockets, how does it know when one of them is ready for I/O?

  int select(int n, fd_set *readfds, fd_set *writefds,
             fd_set *exceptfds, struct timeval *timeout);
• Issues with scalability: alternative event interfaces have been offered.
Asynchronous I/O

struct callback {
  bool (*is_ready)();
  void (*cb)(arg);
  void *arg;
}

main() {
  while (1) {
    for (c = each callback) {
      if (c->is_ready())
        c->cb(c->arg);
    }
  }
}
• Code is structured as a collection of handlers
• Handlers are nonblocking
• Create new handlers for blocking operations
• When the operation completes, call the handler
[MIT/Morris]
Asynchronous server

init() {
  on_accept(accept_cb);
}
accept_cb(cfd) {
  on_readable(cfd, name_cb);
}
on_readable(fd, fn) {
  c = new callback(test_readable, fn, fd);
  add c to callback list;
}
name_cb(cfd) {
  read(cfd, name);
  fd = open(name);
  on_readable(fd, read_cb);
}
read_cb(cfd, fd) {
  read(fd, block);
  on_writeable(cfd, write_cb);
}
write_cb(cfd, fd) {
  write(cfd, block);
  on_readable(fd, read_cb);
}
[MIT/Morris]
Multithreaded vs. Async

Multithreaded:
• Hard to program
  – Locking code
  – Need to know what blocks
• Coordination explicit
• State stored on the thread's stack
  – Memory allocation implicit
• Context switch may be expensive
• Multiprocessors

Async:
• Hard to program
  – Callback code
  – Need to know what blocks
• Coordination implicit
• State passed around explicitly
  – Memory allocation explicit
• Lightweight context switch
• Uniprocessors
[MIT/Morris]
Coordination example

• Threaded server:
  – Thread for the network interface
  – Interrupt wakes up the network thread
  – Shared buffer between server threads and the network thread, protected by locks and condition variables
• Asynchronous I/O:
  – Poll for packets
    • How often to poll?
  – Or, an interrupt generates an event
    • Be careful: disable interrupts when manipulating the callback queue.
[MIT/Morris]
Threads!
One View
Should You Abandon Threads?
• No: important for high-end servers (e.g., databases).
• But, avoid threads wherever possible:
  – Use events, not threads, for GUIs, distributed systems, low-end servers.
  – Only use threads where true CPU concurrency is needed.
  – Where threads are needed, isolate their usage in a threaded application kernel: keep most of the code single-threaded.
[Figure: event-driven handlers surrounding a threaded application kernel]
[Ousterhout 1995]
Another view
• Events obscure control flow
  – For programmers and tools
Threads:
thread_main(int sock) {
  struct session s;
  accept_conn(sock, &s);
  read_request(&s);
  pin_cache(&s);
  write_response(&s);
  unpin(&s);
}
pin_cache(struct session *s) {
  pin(&s);
  if( !in_cache(&s) )
    read_file(&s);
}

Events:
AcceptHandler(event e) {
  struct session *s = new_session(e);
  RequestHandler.enqueue(s);
}
RequestHandler(struct session *s) {
  …; CacheHandler.enqueue(s);
}
CacheHandler(struct session *s) {
  pin(s);
  if( !in_cache(s) ) ReadFileHandler.enqueue(s);
  else ResponseHandler.enqueue(s);
}
. . .
ExitHandler(struct session *s) {
  …; unpin(&s); free_session(s);
}
[Figure: Web Server control flow: Accept Conn → Read Request → Pin Cache → Read File → Write Response → Exit]
[von Behren]
Control Flow
[Figure: Web Server control flow: Accept Conn → Read Request → Pin Cache → Read File → Write Response → Exit]
(Thread and event code as on the previous slide.)
• Events obscure control flow
  – For programmers and tools
[von Behren]
Exceptions
• Exceptions complicate control flow
  – Harder to understand program flow
  – Cause bugs in cleanup code
[Figure: Web Server control flow: Accept Conn → Read Request → Pin Cache → Read File → Write Response → Exit]
Threads:
thread_main(int sock) {
  struct session s;
  accept_conn(sock, &s);
  if( !read_request(&s) ) return;
  pin_cache(&s);
  write_response(&s);
  unpin(&s);
}
pin_cache(struct session *s) {
  pin(&s);
  if( !in_cache(&s) )
    read_file(&s);
}

Events:
AcceptHandler(event e) {
  struct session *s = new_session(e);
  RequestHandler.enqueue(s);
}
RequestHandler(struct session *s) {
  …; if( error ) return;
  CacheHandler.enqueue(s);
}
CacheHandler(struct session *s) {
  pin(s);
  if( !in_cache(s) ) ReadFileHandler.enqueue(s);
  else ResponseHandler.enqueue(s);
}
. . .
ExitHandler(struct session *s) {
  …; unpin(&s); free_session(s);
}
[von Behren]
State Management
(Thread and event code as on the previous slide.)
[Figure: Web Server control flow: Accept Conn → Read Request → Pin Cache → Read File → Write Response → Exit]
• Events require manual state management
• Hard to know when to free
  – Use GC or risk bugs
[von Behren]
[Figure (repeated): Thread 1 … Thread N, each running the pipeline Accept Conn → Read Request → Find File → Send Header → Read File / Send Data]
Internet Growth and Scale
The Internet
How to handle all those client requests raining on your server?
Servers Under Stress
[Figure: performance vs. load (concurrent requests, or arrival rate). Curves: Ideal; Peak: some resource at max; Overload: some resource thrashing]
[Von Behren]
Response Time
Components:
• Wire time +
• Queuing time +
• Service demand +
• Wire time (response)

Depends on:
• Cost/length of request
• Load conditions at server

[Figure: latency vs. offered load]
Queuing Theory for Busy People
• Big assumptions
  – Queue is First-Come-First-Served (FIFO, FCFS).
  – Request arrivals are independent (Poisson arrivals).
  – Requests have independent service demands.
  – i.e., arrival interval and service demand are exponentially distributed (denoted "M").
M/M/1 Service Center
[Figure: M/M/1 service center. An offered-load request stream arrives at rate λ, waits in the queue ("wait here"), and is processed by a server with mean service demand D]
Utilization
• What is the probability that the center is busy?
  – Answer: some number between 0 and 1.
• What percentage of the time is the center busy?
  – Answer: some number between 0 and 100.
• These are interchangeable: called utilization U.
• If the center is not saturated, i.e., it completes all its requests in some bounded time, then:
  – U = λD = (arrivals/T) * (service demand)
  – the "Utilization Law"
• The probability that the service center is idle is 1 − U.
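A quick worked example with made-up numbers: suppose requests arrive at rate λ = 50 per second and each has mean service demand D = 10 ms. Then

  U = \lambda D = 50\,\mathrm{s}^{-1} \times 0.010\,\mathrm{s} = 0.5

so the center is busy half the time and idle with probability 1 − U = 0.5.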
Little's Law
• For an unsaturated queue in steady state, mean response time R and mean queue length N are governed by:

  Little's Law: N = λR

• Suppose a task T is in the system for R time units.
• During that time:
  – λR new tasks arrive.
  – N tasks depart (all tasks ahead of T).
• But in steady state, the flow in balances the flow out.
  – Note: this means that throughput X = λ.
Inverse Idle Time "Law"

[Figure: response time R rises toward infinity as utilization U approaches 1 (100%)]

The service center saturates as 1/λ approaches D: small increases in λ cause large increases in the expected response time R.

Little's Law gives response time R = D/(1 − U).

Intuitively, each task T's response time is R = D + DN.
Substituting λR for N:  R = D + DλR
Substituting U for λD:  R = D + UR
R − UR = D  ⇒  R(1 − U) = D  ⇒  R = D/(1 − U)
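Plugging illustrative numbers into R = D/(1 − U), with D = 10 ms as in the example above:

  U = 0.5:\quad R = \frac{10\ \mathrm{ms}}{0.5} = 20\ \mathrm{ms}
  U = 0.9:\quad R = \frac{10\ \mathrm{ms}}{0.1} = 100\ \mathrm{ms}
  U = 0.99:\quad R = \frac{10\ \mathrm{ms}}{0.01} = 1\ \mathrm{s}

At U = 0.9 the arrival rate is λ = U/D = 90 requests per second, so by Little's Law the queue holds about N = λR = 90 × 0.1 s = 9 requests on average.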
Why Little’s Law Is Important
1. Intuitive understanding of FCFS queue behavior.
   • Compute response time from demand parameters (λ, D).
   • Compute N: how much storage is needed for the queue.
2. Notion of a saturated service center.
   – Response times rise rapidly with load and are unbounded.
   • At 50% utilization, a 10% increase in load increases R by 10%.
   • At 90% utilization, a 10% increase in load increases R by 10x.
3. Basis for predicting performance of queuing networks.
   • Cheap and easy "back of napkin" estimates of system performance based on observed behavior and proposed changes, e.g., capacity planning, "what if" questions.
What does this tell us about server behavior at saturation?
Under the Hood
[Figure: queueing diagram. Requests start at arrival rate λ, circulate between the CPU and an I/O device (I/O request / I/O completion), and exit at throughput λ until some center saturates]
Common Bottlenecks
• No more file descriptors
• Sockets stuck in TIME_WAIT
• High memory use (swapping)
• CPU overload
• Interrupt (IRQ) overload
[Aaron Bannert]
Scaling Server Sites: Clustering
[Figure: clients connect through a smart switch (L4: TCP; L7: HTTP, SSL, etc.) that maps virtual IP addresses (VIPs) onto a server array]

Goals:
• server load balancing
• failure detection
• access control filtering
• priorities/QoS
• request locality
• transparent caching

What to switch/filter on?
• L3: source IP and/or VIP
• L4: (TCP) ports, etc.
• L7: URLs and/or cookies
• L7: SSL session IDs
Scaling Services: Replication
[Figure: clients reach Site A or Site B across the Internet]

Distribute service load across multiple sites.

How to select a server site for each client or request?

Is it scalable?
Extra Slides
(Any new information on the following slides will not be tested.)
Event-Based Concurrent Servers Using I/O Multiplexing
• Maintain a pool of connected descriptors.
• Repeat the following forever:
  – Use the Unix select function to block until:
    • (a) a new connection request arrives on the listening descriptor, or
    • (b) new data arrives on an existing connected descriptor.
  – If (a), add the new connection to the pool of connections.
  – If (b), read any available data from the connection.
    • Close the connection on EOF and remove it from the pool.
[CMU 15-213]
Problems of Multithreaded Servers
• High resource usage, context-switch overhead, contended locks
• Too many threads → throughput meltdown, response time explosion
• Solution: bound the total number of threads
Event-Driven Programming
• Event-driven programming, also called asynchronous I/O
• Uses finite state machines (FSMs) to monitor the progress of requests
• Yields efficient and scalable concurrency
• Many examples: Click router, Flash web server, TP monitors, etc.
• Java: asynchronous I/O
  – For an example see: http://www.cafeaulait.org/books/jnp3/examples/12/
Traditional Processes
• Expensive and "heavyweight"
• One system call per process
• Fork overhead
• Coordination
Events
• Need async I/O
• Need select
• Wasn't originally available
• Not standardized
• Immature
• But efficient
• Code is distributed all through the program
• Harder to debug and understand
Threads
• Separate interface and implementation
• Pthreads interface
• Implementation is user-level or kernel (native)
• If user-level, needs async I/O
• But hides the abstraction behind the thread interface
Reference
Valeria Cardellini, Emiliano Casalicchio, Michele Colajanni, and Philip S. Yu. "The State of the Art in Locally Distributed Web-Server Systems."