Scaling SIP Servers
Sankaran Narayanan
Joint work with CINEMA team
IRT Group Meeting – April 17, 2002
Agenda
• Introduction
• Issues in scaling
• Facets of sipd architecture
• Some results
• Conclusion and future work
Introduction: SIP servers
• SIP signaling: proxy, redirect
• Proxies
  • Call routing by contact location
  • UDP/TCP/TLS
  • Stateful or stateless
  • Programmable scripts
• User location: registrars
[Figure: SQL database]
What is scale?
• Large call volumes, commodity hardware [Schu0012:Industrial]
• Response times (mean, deviation), turnaround time
• Goals
  • Delay budget [SIPstone]: R2 < 2 s, R1 < 500 ms
  • Class-5 switches handle > 750K BHCA
[Figure: SIP message flow showing the REGISTER / 200 OK and INVITE / 180 / 200 / ACK exchanges, with the measured intervals R1 and R2 marked]
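The goals above can be put in per-second terms: a busy hour has 3600 s, so a BHCA figure divides down to an average call-attempt rate. The following is a minimal sketch of that arithmetic plus a check against the delay budget; the function names are illustrative and do not come from the talk or from SIPstone.

```python
def bhca_to_calls_per_second(bhca: float) -> float:
    """Convert busy-hour call attempts to an average call-attempt rate."""
    seconds_per_hour = 3600
    return bhca / seconds_per_hour


def within_delay_budget(r1_ms: float, r2_ms: float) -> bool:
    """Check measured R1 and R2 against the slide's budget (R1 < 500 ms, R2 < 2 s)."""
    return r1_ms < 500 and r2_ms < 2000


if __name__ == "__main__":
    rate = bhca_to_calls_per_second(750_000)
    print(f"750K BHCA is about {rate:.1f} call attempts/s")
    print("budget met:", within_delay_budget(r1_ms=120, r2_ms=900))
```

This gives only the average rate over the hour; bursts within the busy hour can be higher.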
Limits to scaling
• Not CPU bound
• Network I/O: blocking
  • Wait for responses
  • Latency: contact, DNS lookups
• OS resource limits
  • Open files (<= 1024 on Unix)
  • LWPs (Solaris) vs. user-kernel threads (Linux, Windows)
• Try not to…
  • Customize and recompile the OS
  • Move (parts of) the server into the kernel (khttpd, AFPA, …)
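The open-file limit mentioned above can be inspected from inside a process without touching the OS. This is a small sketch using Python's standard `resource` module (Unix only); the `socket_budget` helper and its `reserved` parameter are assumptions made for illustration, not part of sipd.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def socket_budget(reserved: int = 16):
    """Estimate how many descriptors remain for sockets.

    `reserved` is a guess at descriptors used for logs, stdio, database
    connections and so on. Returns None when the soft limit is unlimited.
    """
    soft, _hard = open_file_limits()
    if soft == resource.RLIM_INFINITY:
        return None
    return max(soft - reserved, 0)


if __name__ == "__main__":
    print("soft/hard open-file limits:", open_file_limits())
    print("descriptors left for sockets:", socket_budget())
```

With a soft limit of 1024, a design that opens one socket per transaction exhausts the budget after roughly a thousand concurrent transactions.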
The problem
• Scaling CPU-bound jobs (throughput = 1/delay)
  • Hardware: CPU speed, RAM, …
  • Software: better OS, scheduler, …
  • Algorithm: optimize protocol processing
• Blocking (network, disk I/O) is expensive
• Hypothesis
  • Turn I/O-bound work into CPU-bound work by reducing blocking
  • Optimized resource usage gives stability at high loads
Facets of sipd architecture
• Blocking
• Process models
• Socket management
• Protocol processing
Blocking
• Mutex, event (socket, timeout), fread
• Queue builds up
  • Potentially high variability
  • Tandem queue system
• Easy to fix
  • Non-blocking calls (event driven, later!)
  • Move queue to different thread (lazy logger)
Logger { lock; write; unlock; }
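The "lazy logger" idea above, moving the queue to a different thread so that request threads never wait on the log lock or the disk, can be sketched as follows. This is an illustration under assumptions, not sipd's logger: the class name, the `sink` parameter, and the use of `None` as a stop marker are all invented here.

```python
import io
import queue
import threading


class LazyLogger:
    """Request threads enqueue lines; one background thread does the writes."""

    def __init__(self, sink):
        self._sink = sink                  # any object with write(str), e.g. an open file
        self._queue = queue.Queue()        # unbounded: log() never blocks the caller
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def log(self, line: str) -> None:
        # Cheap for the caller: no file lock, no disk I/O on the request path.
        self._queue.put(line)

    def _drain(self) -> None:
        while True:
            line = self._queue.get()
            if line is None:               # stop marker queued by close()
                break
            self._sink.write(line + "\n")

    def close(self) -> None:
        """Flush everything queued so far and stop the writer thread."""
        self._queue.put(None)
        self._thread.join()


if __name__ == "__main__":
    buffer = io.StringIO()
    logger = LazyLogger(buffer)
    logger.log("INVITE received")
    logger.log("180 sent")
    logger.close()
    print(buffer.getvalue(), end="")
```

The trade-off is that log lines are written some time after the event they describe, and queued lines are lost if the process dies before the writer catches up.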
Blocking (2)
• Call routing involves (>= 1) contact lookups
  • 10 ms per query (approx.)
• Cache
  • Works well for sipd-style servers
  • Fetch-on-demand with replacement (harder)
  • Loading the entire database is easy
  • Needs refresh: long-lived servers
  • Potentially useful for DNS SRV lookups (?)
[Figure: SQL database feeding a cache through periodic refresh; cache lookups take < 1 ms]
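The "load the entire database, refresh periodically" variant described above might look like the sketch below. The `load_all` callable stands in for the SQL query and is an assumption, as are the class name and the 60 s default refresh interval; the talk does not give these details.

```python
import threading


class ContactCache:
    """In-memory copy of the contact table, reloaded in full at intervals."""

    def __init__(self, load_all, refresh_seconds: float = 60.0):
        self._load_all = load_all          # hypothetical: returns {user: [contact, ...]}
        self._refresh_seconds = refresh_seconds
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._table = {}
        self.refresh()                     # initial full load

    def refresh(self) -> None:
        snapshot = dict(self._load_all())  # slow query runs outside the lock
        with self._lock:
            self._table = snapshot         # swap in the new copy

    def lookup(self, user: str):
        # Served from memory; no database round trip on the call-routing path.
        with self._lock:
            return list(self._table.get(user, []))

    def start(self) -> None:
        """Start the periodic refresh thread."""
        def loop():
            # wait() returns False on timeout, True once stop() is called.
            while not self._stop.wait(self._refresh_seconds):
                self.refresh()
        threading.Thread(target=loop, daemon=True).start()

    def stop(self) -> None:
        self._stop.set()
```

A registration written to the database becomes visible to `lookup` only after the next refresh, which is the cost of avoiding fetch-on-demand with replacement.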
REGISTER performance
• Single CPU Sun Ultra10
• Response time is constant for Cache (FastSQL)
Process models (1)
One thread per request
• Doesn't scale
  • Too many threads over a short timescale
  • Stateless proxy: 2-4 threads per transaction
• High load affects throughput
[Figure: incoming requests R1-4, each handled by its own thread; plot of throughput vs. load]
Process models (2)
Thread pool + queue
• Thread overhead is lower; more useful processing
• Overload management
  • Drop requests over responses, drop tail
• Not enough if holding time is high
  • Each request holds (blocks) a thread
[Figure: incoming requests R1-4 queued for a fixed number of threads; plot of throughput vs. load]
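The thread pool with a bounded queue and the overload policy above (drop requests in preference to responses, drop tail) can be sketched as follows. This is one possible reading of that policy, not sipd's code: when the queue is full a new request is discarded, while a new response displaces the most recently queued request. All names are invented for the example.

```python
import threading
from collections import deque


class DropTailQueue:
    """Bounded queue: when full, requests are dropped before responses."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.dropped = 0
        self._items = deque()              # entries are (message, is_response)
        self._cond = threading.Condition()
        self._closed = False

    def offer(self, message, is_response: bool = False) -> bool:
        with self._cond:
            if len(self._items) >= self.capacity:
                # Full: drop the new request, or make room for a response.
                if not is_response or not self._evict_newest_request():
                    self.dropped += 1
                    return False
            self._items.append((message, is_response))
            self._cond.notify()
            return True

    def _evict_newest_request(self) -> bool:
        for i in range(len(self._items) - 1, -1, -1):
            if not self._items[i][1]:
                del self._items[i]
                self.dropped += 1
                return True
        return False                       # queue holds only responses

    def take(self):
        """Block until a message is available; return None once closed and empty."""
        with self._cond:
            while not self._items and not self._closed:
                self._cond.wait()
            return self._items.popleft()[0] if self._items else None

    def close(self) -> None:
        with self._cond:
            self._closed = True
            self._cond.notify_all()


def start_workers(work_queue, handler, count: int):
    """Start a fixed number of threads that pull messages off the queue."""
    def loop():
        while True:
            message = work_queue.take()
            if message is None:
                return
            handler(message)
    threads = [threading.Thread(target=loop, daemon=True) for _ in range(count)]
    for thread in threads:
        thread.start()
    return threads
```

Because each worker runs `handler` to completion, a handler that blocks for a long holding time still ties up a thread, which is the limitation the slide points out.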
Stateless proxy (Solaris)
• Turnaround time is almost constant for the stateless proxy
• The sudden increase in response time is a client problem
• UDP losses on Ultra10 @ (120 * 6 * 500 * 8) bps, i.e. 2.88 Mb/s
Stateless proxy (Linux)
• Request turnaround time breaks down
• Response turnaround time is constant
• Effect of high holding times and thread scheduling
• How to set queue size: investigate?
Queue evolution for sipd
Number of requests (y-axis) waiting in the queue for a free thread on Solaris (left) and Linux (right) over a period of up-time (x-axis).
Process models (3)
• Blocking thread model needs “too many” threads
  • Stateful transaction stays for 30 s
• Return thread to free pool instead of blocking
• Event-driven architectures
  • State transition triggered by a global event scheduler
  • OnIncoming1xx(), OnInviteTimeout(), …
• SIP-CGI: pre-forked multiple processes
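The event-driven model above can be illustrated with a single-threaded scheduler that fires callbacks in time order, so a transaction waiting out its 30 s lifetime holds a heap entry rather than a thread. This is a simplified sketch with a simulated clock: the handler names mirror OnIncoming1xx() and OnInviteTimeout() from the slide, while `on_incoming_2xx`, the state names, and the scheduler API are assumptions and do not reproduce the full SIP transaction state machine.

```python
import heapq
import itertools


class EventScheduler:
    """Global scheduler: runs callbacks in timestamp order on simulated time."""

    def __init__(self):
        self.now = 0.0
        self._heap = []
        self._seq = itertools.count()      # tie-breaker so callbacks are never compared

    def schedule(self, delay, callback, *args):
        event = [self.now + delay, next(self._seq), callback, args, False]
        heapq.heappush(self._heap, event)
        return event

    def cancel(self, event) -> None:
        event[4] = True                    # lazily skipped when popped

    def run(self) -> None:
        while self._heap:
            when, _seq, callback, args, cancelled = heapq.heappop(self._heap)
            if cancelled:
                continue
            self.now = when
            callback(*args)


class InviteTransaction:
    """State changes only inside event handlers; no thread blocks while waiting."""

    def __init__(self, scheduler, timeout: float = 30.0):
        self.state = "calling"
        self._scheduler = scheduler
        self._timer = scheduler.schedule(timeout, self.on_invite_timeout)

    def on_incoming_1xx(self) -> None:
        if self.state == "calling":
            self.state = "proceeding"

    def on_incoming_2xx(self) -> None:
        if self.state in ("calling", "proceeding"):
            self.state = "completed"
            self._scheduler.cancel(self._timer)

    def on_invite_timeout(self) -> None:
        if self.state != "completed":
            self.state = "timed_out"


if __name__ == "__main__":
    scheduler = EventScheduler()
    answered = InviteTransaction(scheduler)
    unanswered = InviteTransaction(scheduler)
    scheduler.schedule(0.1, answered.on_incoming_1xx)
    scheduler.schedule(0.5, answered.on_incoming_2xx)
    scheduler.run()
    print(answered.state, unanswered.state)
```

In a real server the events would come from sockets and timers rather than from pre-scheduled calls, but the handlers would have the same shape.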
Socket management
• Problem: open sockets limit (1024), “liveness” detection, retransmission
• One socket per transaction does not scale
• Global socket if downstream server is alive, soft state: works for UDP
• Hard for TCP/TLS: connections
• Worse for Java servers: no select, poll
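The "global socket" approach for UDP can be sketched as one shared socket whose incoming datagrams are matched to transactions by the Via branch parameter, instead of one socket per transaction. This is a minimal illustration: the class and method names are invented, the branch extraction is a naive regular expression rather than a SIP parser, and liveness detection and retransmission are left out.

```python
import re
import select
import socket

# Naive extraction of the first ";branch=" parameter in the datagram.
BRANCH_RE = re.compile(rb";branch=([^;\s,]+)")


class SharedUdpTransport:
    """One UDP socket for all transactions; responses are demultiplexed by branch."""

    def __init__(self, bind_addr=("127.0.0.1", 0)):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.bind(bind_addr)
        self.sock.setblocking(False)
        self._pending = {}                 # branch (bytes) -> callback(datagram)

    def send(self, branch: bytes, payload: bytes, addr, on_response) -> None:
        self._pending[branch] = on_response
        self.sock.sendto(payload, addr)

    def poll(self, timeout: float = 1.0) -> bool:
        """Wait for one datagram and dispatch it; return False on timeout."""
        readable, _, _ = select.select([self.sock], [], [], timeout)
        if not readable:
            return False
        data, _addr = self.sock.recvfrom(65535)
        match = BRANCH_RE.search(data)
        callback = self._pending.pop(match.group(1), None) if match else None
        if callback is not None:
            callback(data)
        return True

    def pending_count(self) -> int:
        return len(self._pending)

    def close(self) -> None:
        self.sock.close()
```

The descriptor count stays at one regardless of how many transactions are in flight; the per-transaction state lives in the `_pending` table instead.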
Optimizing protocol processing
• Not too useful if CPU is not the bottleneck
• Text protocol: parsing, formatting overheads
• Order of headers matters (Via)
• Other optimizations (parse-on-demand, date formatting), …
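Parse-on-demand, listed above, means splitting a message into start line, raw header block, and body up front, and only parsing a header when some code asks for it. The sketch below shows the idea under simplifying assumptions: it ignores header line folding and compact header names, and the class name and `parse_count` counter are invented for the example.

```python
class LazySipMessage:
    """Keeps headers as raw text and parses each header name on first access."""

    def __init__(self, raw: str):
        head, _, self.body = raw.partition("\r\n\r\n")
        self.start_line, _, self._raw_headers = head.partition("\r\n")
        self._parsed = {}                  # lower-cased name -> list of values
        self.parse_count = 0               # how many scans were actually needed

    def header(self, name: str):
        key = name.lower()
        if key not in self._parsed:
            self._parsed[key] = self._scan(key)
        return self._parsed[key]

    def _scan(self, key: str):
        self.parse_count += 1
        values = []
        for line in self._raw_headers.split("\r\n"):
            field, sep, value = line.partition(":")
            if sep and field.strip().lower() == key:
                values.append(value.strip())
        return values


if __name__ == "__main__":
    message = LazySipMessage(
        "INVITE sip:bob@example.com SIP/2.0\r\n"
        "Via: SIP/2.0/UDP a.example.com;branch=z9hG4bK1\r\n"
        "To: <sip:bob@example.com>\r\n"
        "\r\n"
    )
    print(message.header("Via"), message.parse_count)
```

A stateless proxy that only needs Via and the request URI never pays for parsing the remaining headers.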
Conclusion
• Unlike web servers: can be stateful, less disk I/O, lesser impact of TCP stack/behavior, …
• Pros: UDP, stateless routing, load balancing using DNS, …
• Challenges: scaling state machine
• Towards 2.5M BHCA (3600 messages/s)
  • Event-driven architecture (SEDA?)
  • Resource management (file limits, threads)
  • Tuning operating system (scheduler, …)
Future work
• Stateful proxy performance
• Evaluate event-driven architecture
• Effect of request forking (> 1 contacts) on server behavior
• Programmable scripts
• Queue management and overload control
• Other types of servers (conference servers, media servers, etc.)
References
• CINEMA web page. http://www.cs.columbia.edu/IRT/cinema
• H. Schulzrinne. “Industrial strength internet telephony,” presentation at 6th SIP bakeoff, Dec. 2000.
• H. Schulzrinne et al. “SIPstone – Benchmarking SIP server performance,” CS technical report, Columbia University.