Scaling SIP Servers


Sankaran Narayanan
Joint work with CINEMA team

IRT Group Meeting – April 17, 2002

Agenda

• Introduction
• Issues in scaling
• Facets of sipd architecture
• Some results
• Conclusion and Future Work

Introduction – SIP servers

• SIP Signaling – Proxy, redirect
• Proxies
    • Call routing by contact location
    • UDP/TCP/TLS
    • Stateful or stateless
    • Programmable scripts
• User location – Registrars

[Figure label: SQL database]

What is scale?

• Large call volumes, commodity hardware [Schu0012:Industrial]
• Response times (mean, deviation), turnaround time
• Goals
    • Delay budget [SIPstone]: R2 < 2 s, R1 < 500 ms
    • Class-5 switches handle > 750K BHCA

[Figure: message sequence showing the REGISTER / 200 OK exchange and the INVITE / 180 / 200 / ACK exchange, with the delays R1 and R2 marked]
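As a rough sanity check on the figures above, busy-hour call attempts (BHCA) convert to an average per-second rate by dividing by the 3600 seconds in an hour. The helper name below is illustrative and does not come from the slides.

```python
SECONDS_PER_HOUR = 3600


def bhca_to_calls_per_second(bhca: float) -> float:
    """Convert busy-hour call attempts into an average calls-per-second rate."""
    return bhca / SECONDS_PER_HOUR


# Class-5 switch figure quoted on the slide: 750K BHCA.
class5_rate = bhca_to_calls_per_second(750_000)
print(f"{class5_rate:.1f} calls/s")  # prints "208.3 calls/s"
```

This is an average over the hour; it says nothing about burstiness within the hour.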

Limits to scaling

• Not CPU bound
    • Network I/O – blocking
    • Wait for responses
    • Latency: contact, DNS lookups
• OS resource limits
    • Open files (<= 1024 on Unix)
    • LWPs (Solaris) vs. user-kernel threads (Linux, Windows)
• Try not to…
    • Customize and recompile the OS
    • Move (parts of) the server into the kernel (khttpd, AFPA, …)
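The open-files limit mentioned above can be inspected from user space without touching the kernel. The following is a minimal sketch using Python's standard `resource` module, which is only available on Unix-like systems. The function name is an assumption made for illustration.

```python
import resource  # Unix-only standard library module
from typing import Tuple


def open_file_limits() -> Tuple[int, int]:
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
# Every socket consumes one descriptor, so the soft limit bounds the number
# of simultaneously open sockets in a single server process.
print(f"soft limit: {soft}, hard limit: {hard}")
```

A process may raise its soft limit up to the hard limit with `resource.setrlimit`; going beyond the hard limit needs administrator privileges.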

The problem

• Scaling CPU-bound jobs (throughput = 1/delay)
    • Hardware: CPU speed, RAM, …
    • Software: better OS, scheduler, …
    • Algorithm: optimize protocol processing
• Blocking (network, disk I/O) is expensive
• Hypothesis
    • I/O-bound → CPU-bound; reduce blocking
    • Optimized resource usage – stability at high loads

Facets of sipd architecture

• Blocking
• Process models
• Socket management
• Protocol processing

Blocking

• Mutex, event (socket, timeout), fread
    • Queue builds up
    • Potentially high variability
    • Tandem queue system
• Easy to fix
    • Non-blocking calls (event driven, later!)
    • Move queue to different thread (lazy logger)

Logger { lock; write; unlock;}
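The "lazy logger" idea above can be sketched as follows: request threads only enqueue a message, and a single background thread performs the slow write. This is a minimal illustration, not sipd's implementation. The class and method names are assumptions, and the queue is left unbounded for brevity.

```python
import io
import queue
import threading


class LazyLogger:
    """Hypothetical logger: callers enqueue, one background thread writes."""

    def __init__(self, sink):
        self._sink = sink              # any object with a write(str) method
        self._queue = queue.Queue()    # unbounded here; a server would bound it
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, message: str) -> None:
        # The caller pays only for a short enqueue, not for the write itself.
        self._queue.put(message)

    def _drain(self) -> None:
        while True:
            message = self._queue.get()
            if message is None:        # sentinel: stop the worker
                break
            self._sink.write(message + "\n")

    def close(self) -> None:
        self._queue.put(None)
        self._worker.join()


sink = io.StringIO()
logger = LazyLogger(sink)
logger.log("INVITE received")
logger.close()                         # flushes everything queued before it
print(sink.getvalue(), end="")
```

The trade-off is that log lines reach the sink slightly later than they were produced, and anything still queued is lost if the process dies.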

Blocking (2)

• Call routing involves (≥ 1) contact lookups
    • 10 ms per query (approx.)
• Cache
    • Works well for sipd-style servers
    • Fetch-on-demand with replacement (harder)
    • Loading the entire database is easy
    • Need for refresh – long-lived servers
    • Potentially useful for DNS SRV lookups (?)

[Figure: SQL database and cache, linked by periodic refresh; label: < 1 ms]
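The "load the entire database, then refresh periodically" approach above can be sketched as below. This is an illustration under stated assumptions: `load_all` stands in for whatever full-table query the server would run, and the class name, parameter names, and example address are hypothetical.

```python
import threading
from typing import Callable, Dict, List


class ContactCache:
    """Hypothetical in-memory copy of the contact table, reloaded periodically."""

    def __init__(self, load_all: Callable[[], Dict[str, List[str]]],
                 refresh_seconds: float = 60.0):
        self._load_all = load_all            # stands in for a full-table query
        self._refresh_seconds = refresh_seconds
        self._contacts = load_all()          # initial full load
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._refresh_loop, daemon=True)
        self._thread.start()

    def lookup(self, user: str) -> List[str]:
        # Reads the current snapshot; never touches the database.
        return self._contacts.get(user, [])

    def _refresh_loop(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._refresh_seconds):
            self._contacts = self._load_all()  # swap in a fresh snapshot

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()


table = {"alice": ["sip:alice@host1.example.com"]}
cache = ContactCache(lambda: dict(table), refresh_seconds=0.05)
print(cache.lookup("alice"))
cache.stop()
```

Lookups may return data that is up to one refresh interval old, which is the price of keeping the database off the call-routing path.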

REGISTER performance

Single CPU Sun Ultra10

Response time is constant for Cache (FastSQL)

Process models (1)

• One thread per request
    • Doesn’t scale
    • Too many threads over a short timescale
    • Stateless proxy: 2-4 threads per transaction
• High load affects throughput

[Figure: incoming requests R1-4 (R1, R2, R3, R4); plot of throughput against load]

Process models (2)

• Thread pool + queue
    • Thread overhead less; more useful processing
    • Overload management: drop requests over responses, drop tail
• Not enough if holding time is high
    • Each request holds (blocks) a thread

[Figure: incoming requests R1-4 served by a fixed number of threads; plot of throughput against load]
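A thread pool fed by a bounded queue, with the overload policy named above, might look like the sketch below. It is one possible reading of "drop requests over responses, drop tail", not sipd's code: a new request is dropped when the queue is full, while a new response evicts the newest queued request if there is one. All names are hypothetical.

```python
import threading
from collections import deque


class WorkerPool:
    """Hypothetical fixed-size thread pool fed by a bounded queue."""

    def __init__(self, handler, num_threads=4, max_queue=100):
        self._handler = handler
        self._max_queue = max_queue
        self._queue = deque()              # items are (message, is_response)
        self._cond = threading.Condition()
        self._closed = False
        self.dropped = 0
        self._threads = [threading.Thread(target=self._work, daemon=True)
                         for _ in range(num_threads)]
        for thread in self._threads:
            thread.start()

    def submit(self, message, is_response=False) -> bool:
        """Queue a message; return False if the new message was dropped."""
        with self._cond:
            if len(self._queue) >= self._max_queue:
                # Drop tail: a new request is discarded outright; a response
                # is kept if a queued request can be evicted in its favour.
                if not is_response or not self._evict_newest_request():
                    self.dropped += 1
                    return False
            self._queue.append((message, is_response))
            self._cond.notify()
            return True

    def _evict_newest_request(self) -> bool:
        for i in range(len(self._queue) - 1, -1, -1):
            if not self._queue[i][1]:
                del self._queue[i]
                self.dropped += 1
                return True
        return False

    def _work(self) -> None:
        while True:
            with self._cond:
                while not self._queue and not self._closed:
                    self._cond.wait()
                if not self._queue:
                    return                 # closed and fully drained
                message, _ = self._queue.popleft()
            self._handler(message)         # holds this thread until it returns

    def close(self) -> None:
        with self._cond:
            self._closed = True
            self._cond.notify_all()
        for thread in self._threads:
            thread.join()


handled = []
pool = WorkerPool(handled.append, num_threads=2, max_queue=10)
for n in range(5):
    pool.submit(f"INVITE {n}")
pool.close()
print(sorted(handled))
```

The handler call is where the slide's caveat applies: if it blocks for a long holding time, all threads end up waiting and the queue fills regardless of its size.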

Stateless proxy (Solaris)

Turnaround time is almost constant for stateless proxy

• The sudden increase in response time - client problem

• UDP losses on Ultra10 @ (120 * 6 * 500 * 8) bps ≈ 2.88 Mb/s

Stateless proxy (Linux)

• Request turnaround time breaks down
• Response turnaround time is constant
• Effect of high holding times and thread scheduling
• How to set queue size – investigate?

Queue evolution for sipd

Number of requests (y-axis) waiting in the queue for a free thread on Solaris (left) and Linux (right) over a period of up-time (x-axis).

Process models (3)

• Blocking thread model needs “too many” threads
    • Stateful transaction stays for 30 s
• Return thread to free pool instead of blocking
• Event-driven architectures
    • State transition triggered by a global event scheduler
    • OnIncoming1xx(), OnInviteTimeout(), …
• SIP-CGI: pre-forked multiple processes
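The event-driven idea above can be sketched with a single scheduler that fires callbacks in time order, so a transaction waiting out its 30 s lifetime costs a heap entry rather than a blocked thread. The handler names `OnIncoming1xx` and `OnInviteTimeout` come from the slide. Everything else is an assumption made for illustration: the simulated clock, the class names, `OnIncoming2xx`, and the simplified state names.

```python
import heapq
import itertools


class EventScheduler:
    """Hypothetical global scheduler: one loop dispatches timed events."""

    def __init__(self):
        self._events = []                 # heap of (due, seq, callback, args)
        self._seq = itertools.count()     # tie-breaker for equal due times
        self.now = 0.0                    # simulated clock, in seconds

    def schedule(self, delay, callback, *args):
        entry = (self.now + delay, next(self._seq), callback, args)
        heapq.heappush(self._events, entry)

    def run(self):
        while self._events:
            self.now, _, callback, args = heapq.heappop(self._events)
            callback(*args)


class InviteTransaction:
    """Transaction state lives in an object, not on a blocked thread's stack."""

    def __init__(self, scheduler, timeout=30.0):
        self.state = "calling"
        self.log = []
        scheduler.schedule(timeout, self.OnInviteTimeout)

    def OnIncoming1xx(self, code):
        if self.state == "calling":
            self.state = "proceeding"
            self.log.append(f"{code} -> proceeding")

    def OnIncoming2xx(self, code):        # hypothetical handler name
        if self.state in ("calling", "proceeding"):
            self.state = "completed"
            self.log.append(f"{code} -> completed")

    def OnInviteTimeout(self):
        if self.state != "completed":
            self.state = "terminated"
            self.log.append("timeout -> terminated")


scheduler = EventScheduler()
txn = InviteTransaction(scheduler)
scheduler.schedule(0.1, txn.OnIncoming1xx, 180)
scheduler.schedule(2.0, txn.OnIncoming2xx, 200)
scheduler.run()
print(txn.state, txn.log)
```

In a real server the events would come from socket readiness and wall-clock timers rather than a simulated clock, but the dispatch structure is the same.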

Socket management

• Problem: open sockets limit (1024), “liveness” detection, retransmission
• One socket per transaction does not scale
• Global socket if downstream server is alive, soft state – works for UDP
• Hard for TCP/TLS – connections
• Worse for Java servers – no select, poll
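The "global socket" approach for UDP can be sketched as one shared socket plus a table that maps a transaction key to its callback, so the descriptor count stays at one however many transactions are pending. This is a toy illustration over loopback. The `"<key> <payload>"` framing and all class and method names are assumptions; a real SIP stack would match responses using the branch parameter of the Via header. Liveness detection and retransmission are omitted.

```python
import socket


class SharedUdpTransport:
    """Hypothetical transport: one UDP socket shared by every transaction."""

    def __init__(self, bind_addr=("127.0.0.1", 0)):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.bind(bind_addr)
        self.pending = {}                  # transaction key -> callback

    def send(self, key, payload, dest, on_response):
        self.pending[key] = on_response
        # Toy framing "<key> <payload>", assumed for this sketch only.
        self.sock.sendto(f"{key} {payload}".encode(), dest)

    def poll_once(self, timeout=1.0) -> bool:
        """Receive one datagram and hand it to the matching transaction."""
        self.sock.settimeout(timeout)
        try:
            data, _ = self.sock.recvfrom(65535)
        except socket.timeout:
            return False
        key, _, body = data.decode().partition(" ")
        callback = self.pending.pop(key, None)
        if callback is not None:
            callback(body)
        return True


# Stand-in downstream server that answers every datagram with "200 OK".
downstream = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
downstream.bind(("127.0.0.1", 0))
downstream.settimeout(1.0)

transport = SharedUdpTransport()
replies = {}
dest = downstream.getsockname()
transport.send("z9hG4bK-1", "INVITE", dest, lambda b: replies.update(first=b))
transport.send("z9hG4bK-2", "INVITE", dest, lambda b: replies.update(second=b))

for _ in range(2):
    data, peer = downstream.recvfrom(65535)
    key = data.decode().split(" ", 1)[0]
    downstream.sendto(f"{key} 200 OK".encode(), peer)

while transport.pending and transport.poll_once():
    pass
print(replies)
downstream.close()
transport.sock.close()
```

The same trick does not carry over directly to TCP or TLS, where each downstream peer needs its own connection, which is the difficulty the slide points out.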

Optimizing protocol processing

• Not too useful if CPU is not the bottleneck
• Text protocol – parsing, formatting overheads
• Order of headers matters (Via)
• Other optimizations (parse-on-demand, date formatting), …
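Parse-on-demand can be illustrated with a message wrapper that keeps the raw text and scans for a header only the first time it is requested, caching the result. This is a simplified sketch with hypothetical names. It ignores compact header forms, folded header lines, and comma-separated values within one line.

```python
class LazyMessage:
    """Hypothetical SIP message wrapper that parses headers only when asked."""

    def __init__(self, raw: str):
        self._raw = raw
        self._cache = {}        # lowercase header name -> list of values
        self.parse_count = 0    # number of scans performed, for illustration

    def header(self, name: str):
        key = name.lower()
        if key not in self._cache:
            self.parse_count += 1
            self._cache[key] = self._scan(key)
        return self._cache[key]

    def _scan(self, key: str):
        values = []
        head = self._raw.split("\r\n\r\n", 1)[0]   # headers end at blank line
        for line in head.split("\r\n")[1:]:        # skip the start line
            field, sep, value = line.partition(":")
            if sep and field.strip().lower() == key:
                values.append(value.strip())
        return values


raw = ("INVITE sip:bob@example.com SIP/2.0\r\n"
       "Via: SIP/2.0/UDP proxy.example.com;branch=z9hG4bK-2\r\n"
       "Via: SIP/2.0/UDP client.example.com;branch=z9hG4bK-1\r\n"
       "To: <sip:bob@example.com>\r\n"
       "\r\n")
msg = LazyMessage(raw)
top_via = msg.header("Via")[0]   # first scan for Via
msg.header("Via")                # served from the cache, no second scan
print(top_via, msg.parse_count)
```

Values are returned in message order, so the topmost Via is always the first element. That matters because a proxy acts on the topmost Via when forwarding responses.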

Conclusion

• Unlike web servers: can be stateful, less disk I/O, lesser impact of TCP stack/behavior, …
• Pros: UDP, stateless routing, load balancing using DNS, …
• Challenges: scaling state machine
• Towards 2.5M BHCA (3600 messages/s)
    • Event-driven architecture (SEDA?)
    • Resource management (file limits, threads)
    • Tuning operating system (scheduler, …)

Future work

• Stateful proxy performance
• Evaluate event-driven architecture
• Effect of request forking (> 1 contacts) on server behavior
• Programmable scripts
• Queue management and overload control
• Other types of servers (conference servers, media servers, etc.)

References

• CINEMA web page. http://www.cs.columbia.edu/IRT/cinema
• [Schu0012:Industrial] H. Schulzrinne. “Industrial strength internet telephony,” Presentation at 6th SIP bakeoff, Dec. 2000.
• [SIPstone] H. Schulzrinne et al. “SIPstone – Benchmarking SIP server performance,” CS Technical report, Columbia University. http://www.cs.columbia.edu/IRT/cinema
