Scalable Internet Services
Cluster Lessons and Architecture Design for Scalable Services
BR01, TACC, Porcupine, SEDA and Capriccio
Ben Y. Zhao [email protected]
Outline
• Overview of cluster services
• Lessons from giant-scale services (BR01)
• SEDA (staged event-driven architecture)
• Capriccio
Scalable Servers
• Clustered services
  • natural platform for large web services
  • search engines, DB servers, transactional servers
• Key benefits
  • low cost of computing: COTS nodes vs. SMP
  • incremental scalability
  • load balancing of traffic/requests across servers
• Extension of the single-server model
  • reliable, fast communication, but partitioned data
Goals
• Failure transparency
  • hot-swapping components without loss of availability
  • homogeneous functionality and/or replication
• Load balancing
  • partition data/requests to maximize service rate
  • need to colocate requests with their associated data
• Scalability
  • aggregate performance should scale with the number of servers
Two Different Models
• Read-mostly data
  • web servers, DB servers, search engines (queries)
  • replicate across servers + (round-robin DNS / redirector)
[Figure: clients on an IP network (WAN) reaching replicated servers via round-robin DNS]
Two Different Models (cont.)
• Read-write model
  • mail servers, e-commerce sites, hosted services
  • small(er) replication factor for stronger consistency
[Figure: clients on an IP network (WAN) reaching servers through a load redirector]
Key Architecture Challenges
• Providing high availability
  • availability across component failures
• Handling flash crowds / peak load
  • need support for massive concurrency
• Other challenges
  • upgradability: maintaining availability at minimal cost during upgrades to software, hardware, and functionality
  • error diagnosis: fast isolation of failures and performance degradation
Nuggets
• Definitions
  • uptime = (MTBF – MTTR) / MTBF
  • yield = queries completed / queries offered
  • harvest = data available / complete data
• MTTR
  • at least as important as MTBF
  • much easier to tune and quantify
• DQ principle
  • data/query × queries/second ≈ constant
  • physical bottlenecks limit overall throughput
Tapestry Software Architecture
[Figure: layered architecture — applications atop an application programming interface, over the dynamic Tapestry layer (distance map, core router) and Patchwork, all hosted in the SEDA event-driven framework on a Java Virtual Machine, above the network]
Impact of Correlated Events
• web / application servers
  • independent requests
  • maximize individual throughput
• correlated requests: A + B + C → D
  • e.g. online continuous queries, sensor aggregation, p2p control layer, streaming data mining
[Figure: an event handler combining related events A, B, C arriving from the network]
Capriccio
• User-level, lightweight threads (SOSP '03)
• Argument
  • threads are the natural programming model
  • current problems are the result of implementations, not a fundamental flaw
• Approach
  • aim for massive scalability
  • compiler assistance
  • linked stacks, blocking-graph scheduling
The Price of Concurrency
• Why is concurrency hard?
  • race conditions
  • code complexity
  • scalability (no O(n) operations)
  • scheduling & resource sensitivity
  • inevitable overload
• Performance vs. programmability
  • no good solution
[Figure: ease of programming vs. performance — threads rate high on ease of programming, events high on performance; the ideal combines both]
The Answer: Better Threads
• Goals
  • simple programming model
  • good tools & infrastructure: languages, compilers, debuggers, etc.
  • good performance
• Claims
  • threads are preferable to events
  • user-level threads are key
“But Events Are Better!”
• Recent arguments for events
  • lower runtime overhead
  • better live-state management
  • inexpensive synchronization
  • more flexible control flow
  • better scheduling and locality
• All true, but…
  • Lauer & Needham duality argument
  • the criticisms apply to specific thread packages
  • no inherent problem with threads!
Criticism: Runtime Overhead
• Criticism: threads don’t perform well under high concurrency
• Response
  • avoid O(n) operations
  • minimize context-switch overhead
• Simple scalability test
  • slightly modified GNU Pth
  • thread-per-task vs. single thread
  • same performance!
[Figure: requests/second (roughly 20,000–110,000) vs. concurrent tasks (1 to 1e6) — the threaded server tracks the event-based server]
Criticism: Synchronization
• Criticism: thread synchronization is heavyweight
• Response
  • cooperative multitasking works for threads, too!
  • it also presents the same problems
    • starvation & fairness
    • multiprocessors
    • unexpected blocking (page faults, etc.)
  • both regimes need help
    • compiler / language support for concurrency
    • better OS primitives
Criticism: Scheduling
• Criticism: thread schedulers are too generic
  • can’t use application-specific information
• Response
  • 2D scheduling: task & program location
    • threads schedule based on task only
    • events schedule by location (e.g. SEDA)
    • allows batching
    • allows prediction for SRCT
  • threads can use 2D scheduling, too!
    • runtime system tracks the current location
    • call graph allows prediction
[Figure: task vs. program location — threads cover the task axis, events the program-location axis]
The Proof’s in the Pudding
• User-level threads package
  • subset of pthreads
  • intercepts blocking system calls
  • no O(n) operations
  • supports > 100K threads
  • 5000 lines of C code
• Simple web server: Knot
  • 700 lines of C code
• Similar performance
  • linear increase, then steady
  • drop-off due to poll() overhead
[Figure: bandwidth (0–900 Mbits/second) vs. concurrent clients (1 to 16384) for KnotC (favor connections), KnotA (favor accept), and Haboob]
Arguments For Threads
• More natural programming model
  • control flow is more apparent
  • exception handling is easier
  • state management is automatic
• Better fit with current tools & hardware
  • better existing infrastructure
Why Threads: Control Flow
• Events obscure control flow
  • for programmers and tools

Threads:

thread_main(int sock) {
    struct session s;
    accept_conn(sock, &s);
    read_request(&s);
    pin_cache(&s);
    write_response(&s);
    unpin(&s);
}

pin_cache(struct session *s) {
    pin(s);
    if( !in_cache(s) )
        read_file(s);
}

Events:

AcceptHandler(event e) {
    struct session *s = new_session(e);
    RequestHandler.enqueue(s);
}
RequestHandler(struct session *s) {
    …; CacheHandler.enqueue(s);
}
CacheHandler(struct session *s) {
    pin(s);
    if( !in_cache(s) ) ReadFileHandler.enqueue(s);
    else ResponseHandler.enqueue(s);
}
. . .
ExitHandler(struct session *s) {
    …; unpin(s); free_session(s);
}

[Figure: web server stage graph — AcceptConn → ReadRequest → PinCache → ReadFile → WriteResponse → Exit]
Why Threads: Exceptions
• Exceptions complicate control flow
  • harder to understand program flow
  • cause bugs in cleanup code

Threads:

thread_main(int sock) {
    struct session s;
    accept_conn(sock, &s);
    if( !read_request(&s) )
        return;
    pin_cache(&s);
    write_response(&s);
    unpin(&s);
}

pin_cache(struct session *s) {
    pin(s);
    if( !in_cache(s) )
        read_file(s);
}

Events:

AcceptHandler(event e) {
    struct session *s = new_session(e);
    RequestHandler.enqueue(s);
}
RequestHandler(struct session *s) {
    …; if( error ) return; CacheHandler.enqueue(s);
}
CacheHandler(struct session *s) {
    pin(s);
    if( !in_cache(s) ) ReadFileHandler.enqueue(s);
    else ResponseHandler.enqueue(s);
}
. . .
ExitHandler(struct session *s) {
    …; unpin(s); free_session(s);
}

[Figure: web server stage graph — AcceptConn → ReadRequest → PinCache → ReadFile → WriteResponse → Exit]
Why Threads: State Management
• Events require manual state management
  • hard to know when to free
  • use GC or risk bugs
(Same threads-vs-events code as the previous slide: in the event version, the session state must be kept alive by hand across handlers until ExitHandler frees it.)
Why Threads: Existing Infrastructure
• Lots of infrastructure for threads
  • debuggers
  • languages & compilers
• Consequences
  • more amenable to analysis
  • less effort to get working systems
Building Better Threads
• Goals
  • simplify the programming model
    • thread per concurrent activity
    • scalability (100K+ threads)
  • support existing APIs and tools
  • automate application-specific customization
• Mechanisms
  • user-level threads
  • plumbing: avoid O(n) operations
  • compile-time analysis
  • run-time analysis
Case for User-Level Threads
• Decouple the programming model from the OS
  • kernel threads
    • abstract the hardware
    • expose device concurrency
  • user-level threads
    • provide a clean programming model
    • expose logical concurrency
• Benefits of user-level threads
  • control over the concurrency model!
  • independent innovation
  • enables static analysis
  • enables application-specific tuning
[Figure: user-level threads sit between the application and the OS, in user space]
Case for User-Level Threads (cont.)
• Similar argument to the design of overlay networks
Capriccio Internals
• Cooperative user-level threads
  • fast context switches
  • lightweight synchronization
• Kernel mechanisms
  • asynchronous I/O (Linux)
• Efficiency
  • avoid O(n) operations
  • fast, flexible scheduling
Safety: Linked Stacks
• The problem: fixed stacks
  • overflow vs. wasted space
  • LinuxThreads: 2 MB per stack
  • limits the number of threads
• The solution: linked stacks
  • allocate stack space as needed
  • compiler analysis
    • add runtime checkpoints
    • guarantee enough space until the next check
[Figure: fixed stacks either waste space or overflow; a linked stack grows in chunks]
Linked Stacks: Algorithm
• Parameters
  • MaxPath
  • MinChunk
• Steps
  • break cycles
  • trace back
  • checkpoints limit checkpoint-free path length to MaxPath
• Special cases
  • function pointers
  • external calls: use a large stack
[Figure: weighted call graph with per-function frame sizes 5, 4, 2, 6, 3, 3, 2, 3; MaxPath = 8]
Special Cases
• Function pointers
  • categorize function pointers by the number and types of their arguments
  • “guess” which functions can be called
• External functions
  • users annotate trusted stack bounds on libraries
  • or (re)use a small number of large stack chunks
• Result
  • use/reuse stack chunks much like virtual memory
  • can efficiently share stack chunks
  • memory-touch benchmark: factor of 3 reduction in paging cost
Scheduling: Blocking Graph
• Lessons from event systems
  • break the application into stages
  • schedule based on stage priorities
  • allows SRCT scheduling, finding bottlenecks, etc.
• Capriccio does this for threads
  • deduce the stage from stack traces at blocking points
  • prioritize based on runtime information
[Figure: web server blocking graph — Accept → Read → Open → Read → Write → Close]
Resource-Aware Scheduling
• Track resources used along blocking-graph edges
  • memory, file descriptors, CPU
  • predict the future from the past
• Algorithm
  • increase use when underutilized
  • decrease use near saturation
• Advantages
  • operate near the knee of the curve without thrashing
  • automatic admission control
[Figure: web server blocking graph, as on the previous slide]
Pitfalls
• What is the maximum amount of a resource?
  • depends on the workload
  • e.g. disk thrashing depends on sequential vs. random seeks
  • use early signs of thrashing to indicate maximum capacity
• Detecting thrashing
  • can only estimate it, using a “productivity/overhead” ratio
  • productivity is itself guessed (threads created, files opened/closed)
Thread Performance

Time of thread operations (microseconds):

Operation                Capriccio   Capriccio-notrace   LinuxThreads   NPTL
Thread creation          21.5        21.5                37.5           17.7
Context switch           0.56        0.24                0.71           0.65
Uncontested mutex lock   0.04        0.04                0.14           0.15

• Slightly slower thread creation
• Faster context switches
  • even with stack traces!
• Much faster mutexes
Runtime Overhead
• Tested Apache 2.0.44
• Stack linking
  • 78% slowdown for a null call
  • 3–4% overall
• Resource statistics
  • 2% (on all the time)
  • 0.1% (with sampling)
• Stack traces
  • 8% overhead
Microbenchmark: Producer / Consumer
Web Server Performance
Example of a “Great Systems Paper”
• Observe a higher-level issue
  • threads vs. events as programming abstractions
• Use previous work (duality) to identify the problem
  • why are threads not as efficient as events?
• Good systems design
  • call-graph analysis for linked stacks
  • resource-aware scheduling
• Good execution
  • full, solid implementation
  • analysis leading to a full understanding of the detailed issues
  • cross-area approach (help from PL research)
Acknowledgements
• Many slides “borrowed” from the respective talks / papers:
  • Capriccio (Rob von Behren)
  • SEDA (Matt Welsh)
  • Brewer01: “Lessons…”