
Scalable Internet Services

Cluster Lessons and Architecture Design for Scalable Services

BR01, TACC, Porcupine, SEDA and Capriccio

Ben Y. Zhao ravenben@cs.ucsb.edu

Outline
• Overview of cluster services
• Lessons from giant-scale services (BR01)
• SEDA (staged event-driven architecture)
• Capriccio

Scalable Servers
• Clustered services
  • natural platform for large web services
  • search engines, DB servers, transactional servers
• Key benefits
  • low cost of computing: COTS clusters vs. SMP
  • incremental scalability
  • load balancing of traffic/requests across servers
• Extension of the single-server model
  • reliable/fast communication, but partitioned data

Goals
• Failure transparency
  • hot-swapping components without loss of availability
  • homogeneous functionality and/or replication
• Load balancing
  • partition data/requests to maximize service rate
  • need to colocate requests with their associated data
• Scalability
  • aggregate performance should scale with the number of servers

Two Different Models
• Read-mostly data
  • web servers, DB servers, search engines (query side)
  • replicate across servers + (round-robin DNS / redirector)

[Figure: clients on a WAN (IP network) reaching replicated servers via round-robin DNS]

Two Different Models …
• Read-write model
  • mail servers, e-commerce sites, hosted services
  • small(er) replication factor for stronger consistency

[Figure: clients on a WAN (IP network) reaching partitioned servers via a load redirector]

Key Architecture Challenges
• Providing high availability
  • availability across component failures
• Handling flash crowds / peak load
  • need support for massive concurrency
• Other challenges
  • upgradability: maintaining availability at minimal cost during upgrades to software, hardware, or functionality
  • error diagnosis: fast isolation of failures / performance degradation

Nuggets
• Definitions
  • uptime = (MTBF – MTTR) / MTBF
  • yield = queries completed / queries offered
  • harvest = data available / complete data
• MTTR
  • at least as important as MTBF
  • much easier to tune and quantify
• DQ principle
  • data per query × queries per second ≈ constant
  • physical bottlenecks limit overall throughput


Staged Event-driven Architecture

• SEDA (SOSP’01)


Break…
• Come back in 5 mins
• more on threads vs. events…

Tapestry Software Architecture

[Figure: layered diagram; applications sit atop an application programming interface over the dynamic Tapestry components (core router, distance map, Patchwork), all built on the SEDA event-driven framework in a Java Virtual Machine, over the network]

Impact of Correlated Events
• Web / application servers
  • independent requests
  • maximize individual throughput
• Correlated requests: A + B + C → D
  • e.g. online continuous queries, sensor aggregation, p2p control layer, streaming data mining

[Figure: an event handler combining correlated inputs A, B, and C into a single output]

Capriccio
• User-level lightweight threads (SOSP’03)
• Argument
  • threads are the natural programming model
  • current problems are artifacts of the implementations, not a fundamental flaw
• Approach
  • aim for massive scalability
  • compiler assistance
  • linked stacks, blocking-graph scheduling

The Price of Concurrency
• Why is concurrency hard?
  • race conditions
  • code complexity
  • scalability (no O(n) operations)
  • scheduling & resource sensitivity
  • inevitable overload
• Performance vs. programmability
  • no good solution

[Figure: ease of programming vs. performance; threads rate high on programmability, events on performance, and neither reaches the ideal]

The Answer: Better Threads
• Goals
  • simple programming model
  • good tools & infrastructure: languages, compilers, debuggers, etc.
  • good performance
• Claims
  • threads are preferable to events
  • user-level threads are key

“But Events Are Better!”
• Recent arguments for events
  • lower runtime overhead
  • better live-state management
  • inexpensive synchronization
  • more flexible control flow
  • better scheduling and locality
• All true, but…
  • Lauer & Needham duality argument
  • the criticisms target specific thread packages
  • no inherent problem with threads!

Criticism: Runtime Overhead
• Criticism: threads don’t perform well under high concurrency
• Response
  • avoid O(n) operations
  • minimize context-switch overhead
• Simple scalability test
  • slightly modified GNU Pth
  • thread-per-task vs. single thread
  • same performance!

[Figure: requests/second (20,000 to 110,000) vs. concurrent tasks (1 to 10^6); the event-based and threaded servers perform identically]

Criticism: Synchronization
• Criticism: thread synchronization is heavyweight
• Response
  • cooperative multitasking works for threads, too!
  • and it presents the same problems: starvation & fairness, multiprocessors, unexpected blocking (page faults, etc.)
  • both regimes need help: compiler/language support for concurrency, better OS primitives

Criticism: Scheduling
• Criticism: thread schedulers are too generic
  • can’t use application-specific information
• Response
  • 2D scheduling: task & program location
    • threads schedule based on task only
    • events schedule by location (e.g. SEDA)
    • allows batching and prediction for SRCT
  • threads can use 2D scheduling, too!
    • runtime system tracks the current location
    • call graph allows prediction

[Figure: task vs. program-location plane; threads cover the task axis, events the program-location axis]

The Proof’s in the Pudding
• User-level threads package
  • subset of pthreads
  • intercepts blocking system calls
  • no O(n) operations
  • supports > 100K threads
  • 5,000 lines of C code
• Simple web server: Knot
  • 700 lines of C code
• Similar performance
  • linear increase, then steady
  • drop-off due to poll() overhead

[Figure: bandwidth in Mbits/second (0 to 900) vs. concurrent clients (1 to 16,384) for KnotC (favor connections), KnotA (favor accept), and Haboob]

Arguments For Threads
• More natural programming model
  • control flow is more apparent
  • exception handling is easier
  • state management is automatic
• Better fit with current tools & hardware
  • better existing infrastructure

Why Threads: Control Flow
• Events obscure control flow
  • for programmers and tools

Threads:

  thread_main(int sock) {
    struct session s;
    accept_conn(sock, &s);
    read_request(&s);
    pin_cache(&s);
    write_response(&s);
    unpin(&s);
  }

  pin_cache(struct session *s) {
    pin(s);
    if( !in_cache(s) )
      read_file(s);
  }

Events:

  AcceptHandler(event e) {
    struct session *s = new_session(e);
    RequestHandler.enqueue(s);
  }
  RequestHandler(struct session *s) {
    …; CacheHandler.enqueue(s);
  }
  CacheHandler(struct session *s) {
    pin(s);
    if( !in_cache(s) ) ReadFileHandler.enqueue(s);
    else ResponseHandler.enqueue(s);
  }
  . . .
  ExitHandler(struct session *s) {
    …; unpin(s); free_session(s);
  }

[Figure: web-server stage graph; AcceptConn → ReadRequest → PinCache → ReadFile → WriteResponse → Exit]


Why Threads: Exceptions
• Exceptions complicate control flow
  • harder to understand program flow
  • cause bugs in cleanup code

Threads:

  thread_main(int sock) {
    struct session s;
    accept_conn(sock, &s);
    if( !read_request(&s) )
      return;
    pin_cache(&s);
    write_response(&s);
    unpin(&s);
  }

  pin_cache(struct session *s) {
    pin(s);
    if( !in_cache(s) )
      read_file(s);
  }

Events:

  AcceptHandler(event e) {
    struct session *s = new_session(e);
    RequestHandler.enqueue(s);
  }
  RequestHandler(struct session *s) {
    …; if( error ) return; CacheHandler.enqueue(s);
  }
  CacheHandler(struct session *s) {
    pin(s);
    if( !in_cache(s) ) ReadFileHandler.enqueue(s);
    else ResponseHandler.enqueue(s);
  }
  . . .
  ExitHandler(struct session *s) {
    …; unpin(s); free_session(s);
  }

[Figure: web-server stage graph; AcceptConn → ReadRequest → PinCache → ReadFile → WriteResponse → Exit]

Why Threads: State Management
• Events require manual state management
  • hard to know when to free
  • use GC or risk bugs

(The thread and event code from the previous slide make the point: the thread version keeps session state implicitly in local variables on its stack, while the event version must thread a heap-allocated struct session through every handler and free it by hand in ExitHandler.)

Why Threads: Existing Infrastructure
• Lots of infrastructure for threads
  • debuggers
  • languages & compilers
• Consequences
  • more amenable to analysis
  • less effort to get working systems

Building Better Threads
• Goals
  • simplify the programming model: thread per concurrent activity
  • scalability (100K+ threads)
  • support existing APIs and tools
  • automate application-specific customization
• Mechanisms
  • user-level threads
  • plumbing: avoid O(n) operations
  • compile-time analysis
  • run-time analysis

Case for User-Level Threads
• Decouple the programming model from the OS
  • kernel threads: abstract the hardware, expose device concurrency
  • user-level threads: provide a clean programming model, expose logical concurrency
• Benefits of user-level threads
  • control over the concurrency model!
  • independent innovation
  • enables static analysis
  • enables application-specific tuning

[Figure: the user-level threads layer sits between the application and the OS]

Case for User-Level Threads …
• Similar argument to the design of overlay networks

Capriccio Internals
• Cooperative user-level threads
  • fast context switches
  • lightweight synchronization
• Kernel mechanisms
  • asynchronous I/O (Linux)
• Efficiency
  • avoid O(n) operations
  • fast, flexible scheduling

Safety: Linked Stacks
• The problem: fixed stacks
  • overflow vs. wasted space
  • LinuxThreads: 2 MB/stack limits the number of threads
• The solution: linked stacks
  • allocate space as needed
  • compiler analysis adds runtime checkpoints that guarantee enough space until the next check

[Figure: fixed stacks either waste space or overflow; a linked stack grows chunk by chunk on demand]

Linked Stacks: Algorithm
• Parameters
  • MaxPath
  • MinChunk
• Steps
  • break cycles
  • trace back
  • checkpoints limit path length to MaxPath
• Special cases
  • function pointers
  • external calls: use a large stack

[Figure: weighted call graph (per-function stack sizes 2 to 6) with checkpoints placed so that no unchecked path exceeds MaxPath = 8; the deck steps through this placement over several animation frames]


Special Cases
• Function pointers
  • categorize function pointers by the number and types of their arguments
  • “guess” which functions can be called
• External functions
  • users annotate trusted stack bounds on libraries
  • or (re)use a small number of large stack chunks
• Result
  • use/reuse stack chunks much like VM pages
  • stack chunks can be shared efficiently
  • memory-touch benchmark: factor of 3 reduction in paging cost

Scheduling: Blocking Graph
• Lessons from event systems
  • break the app into stages
  • schedule based on stage priorities
  • allows SRCT scheduling, finding bottlenecks, etc.
• Capriccio does this for threads
  • deduce the stage from stack traces at blocking points
  • prioritize based on runtime information

[Figure: web-server blocking graph; Accept → Read → Open → Read → Write → Close]

Resource-Aware Scheduling
• Track resources used along blocking-graph edges
  • memory, file descriptors, CPU
  • predict the future from the past
• Algorithm
  • increase use when underutilized
  • decrease use near saturation
• Advantages
  • operate near the knee without thrashing
  • automatic admission control

[Figure: web-server blocking graph; Accept → Read → Open → Read → Write → Close]

Pitfalls
• What is the maximum amount of a resource?
  • depends on the workload (e.g. disk thrashing depends on sequential vs. random seeks)
  • use early signs of thrashing to indicate maximum capacity
• Detecting thrashing
  • can only be estimated, via a “productivity/overhead” ratio
  • productivity itself is guessed from events (threads created, files opened/closed)

Thread Performance

Time of thread operations (microseconds):

                          Capriccio   Capriccio-notrace   LinuxThreads   NPTL
  Thread creation         21.5        21.5                37.5           17.7
  Context switch          0.56        0.24                0.71           0.65
  Uncontended mutex lock  0.04        0.04                0.14           0.15

• Slightly slower thread creation
• Faster context switches, even with stack traces!
• Much faster mutexes

Runtime Overhead
• Tested Apache 2.0.44
• Stack linking
  • 78% slowdown for a null call
  • 3 to 4% overall
• Resource statistics
  • 2% (on all the time)
  • 0.1% (with sampling)
• Stack traces
  • 8% overhead

Microbenchmark: Producer / Consumer


Web Server Performance


Example of a “Great Systems Paper”
• Observe a higher-level issue
  • threads vs. events as programming abstractions
• Use previous work (the duality argument) to identify the problem
  • why are threads not as efficient as events?
• Good systems design
  • call-graph analysis for linked stacks
  • resource-aware scheduling
• Good execution
  • full, solid implementation
  • analysis leading to a full understanding of the detailed issues
• Cross-area approach (help from PL research)

Acknowledgements
• Many slides “borrowed” from the respective talks/papers:
  • Capriccio (Rob von Behren)
  • SEDA (Matt Welsh)
  • Brewer01: “Lessons…”