0.5mln packets per second with Erlang

Nov 22, 2014

Maxim Kharchenko

CTO/Cloudozer LLP

The road map

Erlang on Xen intro

LINCX project overview

Speed-related notes

– Arguments are registers

– ETS tables are (mostly) ok

– Do not overuse records

– GC is key to speed

– gen_server vs. barebone process

– NIFs: more pain than gain

– Fast counters

– Static compiler?

Q&A

Erlang on Xen a.k.a. LING

A new Erlang platform that runs without OS

Conceived in 2009

Highly-compatible with Erlang/OTP

Built from scratch, not a “port”

Optimized for low startup latency

Open sourced in 2014 (github.com/cloudozer/ling)

Local and remote builds

Go to erlangonxen.org

Zerg demo: zerg.erlangonxen.org




LINCX: project overview

Started in December, 2013

Initial scope = porting LINC-Switch to LING

High degree of compatibility demonstrated for LING

Extended scope = fix LINC-Switch fast path

Beta version of LINCX open sourced on March 3, 2014

LINCX runs 100x faster than the old code

LINCX repository:

github.com/FlowForwarding/lincx

Raw network interfaces in Erlang

LING adds raw network interfaces:

Port = net_vif:open("eth1", []),

port_command(Port, <<1,2,3>>),

receive

{Port,{data,Frame}} -> ...

end

Raw interface receives whole Ethernet frames

LINCX uses standard gen_tcp for the control connection and net_vif for the data ports

Raw interfaces support the mailbox_limit option; packets are dropped if the mailbox of the receiving process overflows:

Port = net_vif:open("eth1", [{mailbox_limit,16384}]),

...
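A minimal sketch of a receive loop for such frames, runnable on stock Erlang/OTP. The module name frame_loop, the {stop, From} message and the parse/1 helper are assumptions made for illustration; on LING the Port term would come from net_vif:open/2, while here any term can stand in for it.

```erlang
-module(frame_loop).
-export([loop/2, parse/1]).

%% Split a raw Ethernet frame into destination MAC, source MAC,
%% EtherType and payload. Anything shorter than the 14-byte header
%% is rejected.
parse(<<Dst:6/binary, Src:6/binary, Type:16, Payload/binary>>) ->
    {ok, Dst, Src, Type, Payload};
parse(_) ->
    {error, short_frame}.

%% Count well-formed frames tagged with Port until a stop message arrives.
%% Port is a hypothetical stand-in for the value returned by net_vif:open/2.
loop(Port, Count) ->
    receive
        {Port, {data, Frame}} ->
            case parse(Frame) of
                {ok, _Dst, _Src, _Type, _Payload} -> loop(Port, Count + 1);
                {error, short_frame} -> loop(Port, Count)
            end;
        {stop, From} ->
            From ! {frames, Count}
    end.
```

The frame is matched with a single binary pattern, so the header fields are extracted without building intermediate lists.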

Testbed configuration

* Test traffic goes between vm1 and vm2

* LINCX runs as a separate Xen domain

* Virtual interfaces are bridged in Dom0

IXIA confirms 460kpps peak rate

1GbE hw NICs/128 byte packets

IXIA packet generator/analyzer

Processing delay and low-level stats

LING can measure a processing delay for a packet:

1> ling:experimental(processing_delay, []).

Processing delay statistics:

Packets: 2000 Delay: 1.342us +- 0.143 (95%)

LING can collect low-level stats for a network interface:

1> ling:experimental(llstat, 1). %% stop/display

Duration: 4868.6ms

RX: interrupts: 69170 (0 kicks 0.0%) (freq 14207.4/s period 70.4us)

RX: reqs per int: 0/0.0/0

RX: tx buf freed per int: 0/8.5/234

TX: outputs: 1479707 (112263 kicks 7.6) (freq 303928.8/s period 3.3us)

TX: tx buf freed per int: 0/0.6/113

TX: rates: 303.9kpps 3622.66Mbps avg pkt size 1489.9B

TX: drops: 12392 (freq 2545.3/s period 392.9us)

TX: drop rates: 2.5kpps 30.26Mbps avg pkt size 1486.0B




Arguments are registers

Many arguments do not make a function any slower:

animal(batman = Cat, Dog, Horse, Pig, Cow, State) -> feed(Cat, Dog, Horse, Pig, Cow, State);

animal(Cat, deli = Dog, Horse, Pig, Cow, State) -> pet(Cat, Dog, Horse, Pig, Cow, State);

...

But do not reshuffle arguments:

%% SLOW

animal(batman = Cat, Dog, Horse, Pig, Cow, State) -> feed(Goat, Cat, Dog, Horse, Pig, Cow, State);

...

ETS tables are (mostly) ok

A lookup in a small ETS table costs about as much as 10 function activations

Do not use ets:tab2list() inside tight loops

Treat ETS as a database; not a pool of global variables

1-2 ETS lookups on the fast path are ok

Beware that ets:lookup() and similar calls copy the data to the heap of the caller, much like message passing
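One way to keep ETS lookups off the inner loop is to read the value once and pass it along as an argument. The sketch below assumes a table holding a {scale, K} entry; the module name ets_hoist and the table contents are made up for illustration.

```erlang
-module(ets_hoist).
-export([sum_scaled/2]).

%% Multiply every element by a factor stored in ETS and sum the results.
%% The factor is looked up once, before the loop, so the copy made by
%% ets:lookup/2 happens a single time rather than once per element.
sum_scaled(Tab, Xs) ->
    [{scale, K}] = ets:lookup(Tab, scale),
    sum_scaled(Xs, K, 0).

%% Tail-recursive loop; K travels as a plain argument.
sum_scaled([X | Xs], K, Acc) -> sum_scaled(Xs, K, Acc + X * K);
sum_scaled([], _K, Acc) -> Acc.
```

This keeps the fast path at a single lookup, in line with the 1-2 lookups budget above.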

Do not overuse records

setelement() creates a copy of the tuple

State#state{foo=Foo1,bar=Bar1,baz=Baz1} creates 3(?) copies of the

tuple

Use tuples explicitly in performance-critical sections to control the heap

footprint of the code:

%% from 9p.erl

mixer({rauth,_,_}, {tauth,_,AFid,_,_}, _) -> {write_auth,AFid};

mixer({rauth,_,_}, {tauth,_,AFid,_,_,_}, _) -> {write_auth,AFid};

mixer({rwrite,_,_}, _, initial) -> start_attaching;

mixer({rerror,_,_}, _, initial) -> auth_failed;

mixer({rlerror,_,_}, _, initial) -> auth_failed;

mixer({rattach,_,Qid}, {tattach,_,Fid,_,_,AName,_}, initial) -> {attach_more,Fid,AName,qid_type(Qid)};

mixer({rclunk,_}, {tclunk,_,Fid}, initial) -> {forget,Fid};
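The difference between chained updates and an explicit tuple can be sketched as follows. The module name tuple_update and the five-element state tuple are assumptions for illustration; how many copies a given compiler and VM actually make for the chained version may vary.

```erlang
-module(tuple_update).
-export([update_chain/4, update_once/4]).

%% Three setelement/3 calls in a row. Semantically each call returns
%% a new tuple, so up to three tuples may be allocated.
update_chain(State, Foo, Bar, Baz) ->
    S1 = setelement(2, State, Foo),
    S2 = setelement(3, S1, Bar),
    setelement(4, S2, Baz).

%% Match the old tuple apart and build the new one in a single step:
%% exactly one new tuple is constructed.
update_once({state, _Foo, _Bar, _Baz, Other}, Foo, Bar, Baz) ->
    {state, Foo, Bar, Baz, Other}.
```

Both functions return the same value; the second makes the allocation explicit in the source.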

Garbage collection is key to speed

Heap is a list of chunks

The 'new heap' is close to its head, the 'old heap' to its tail

A GC run takes 10μs on average

GC may run thousands of times per second

[Figure: heap diagram; labels: proc_t, HTO, P]

How to tackle GC-related issues

(Priority 1) Call erlang:garbage_collect() at strategic points

(Priority 2) For the fastest code avoid GC completely – restart the fast

process regularly:

spawn(F, [{suppress_gc,true}]), %% LING-only

(Priority 3) Use fullsweep_after option
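Priorities 1 and 3 can be tried on stock Erlang/OTP. The sketch below is one possible arrangement: the module name gc_points, the message shapes and the value 0 for fullsweep_after are assumptions chosen for illustration, not settings taken from LINCX.

```erlang
-module(gc_points).
-export([start/0, worker/1]).

%% Spawn a worker with fullsweep_after set to 0, which makes every
%% collection in that process a full sweep.
start() ->
    spawn_opt(?MODULE, worker, [0], [{fullsweep_after, 0}]).

%% Handle batches and collect garbage explicitly between them,
%% at a point the program chooses rather than mid-batch.
worker(Done) ->
    receive
        {batch, Items} ->
            _ = [process(I) || I <- Items],
            true = erlang:garbage_collect(),
            worker(Done + length(Items));
        {report, From} ->
            From ! {done, Done},
            worker(Done)
    end.

%% Stand-in for real per-item work; allocates a small tuple.
process(Item) -> {Item, Item}.
```

Whether an explicit collection helps depends on the workload, so it is worth measuring before and after.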

gen_server vs barebone process

Message passing using gen_server:call() is 2x slower than Pid ! Msg

For speedy code prefer barebone processes to gen_servers

Design Principles are about high availability, not high performance
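A barebone process needs only spawn, send and receive. The sketch below is a hypothetical counter process (module name bare_counter and message shapes are made up) showing a fire-and-forget send and a synchronous read built by hand.

```erlang
-module(bare_counter).
-export([start/0, bump/1, read/1]).

%% Start the process with an initial count of 0.
start() -> spawn(fun() -> loop(0) end).

%% Fire-and-forget: a plain send, no reply expected.
bump(Pid) -> Pid ! bump, ok.

%% Synchronous read: send a request tagged with a fresh reference
%% and wait for the reply carrying the same reference.
read(Pid) ->
    Ref = make_ref(),
    Pid ! {read, self(), Ref},
    receive
        {Ref, Value} -> Value
    after 1000 -> exit(timeout)
    end.

loop(N) ->
    receive
        bump -> loop(N + 1);
        {read, From, Ref} -> From ! {Ref, N}, loop(N)
    end.
```

This gives up what gen_server provides (supervision hooks, monitoring of the callee, code change handling), which is the trade-off the slide describes.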

NIFs: more pain than gain

A new principle of Erlang development: do not use NIFs

For a small performance boost, NIFs undermine key properties of

Erlang: reliability and soft-realtime guarantees

Most of the time Erlang code can be made as fast as C

Most performance problems in Erlang are traceable to NIFs or to external C libraries, which are similar

Erlang on Xen does not have NIFs and we do not plan to add them

Fast counters

32-bit or 64-bit unsigned integer counters with overflow - trivial in C, not easy in Erlang

FIXNUMs are signed 29-bit integers, BIGNUMs consume heap and are

10-100x slower

Use two variables for a counter?

foo(C1, 16#ffffff, ...) -> foo(C1+1, 0, ...);

foo(C1, C2, ...) -> foo(C1, C2+1, ...);

...

LING has a new experimental feature – fast counters:

erlang:new_counter(Bits) -> Ref

erlang:increment_counter(Ref, Incr)

erlang:read_counter(Ref)

erlang:release_counter(Ref)
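The two-variable idea works on any Erlang VM. The sketch below fills in the elided foo clauses as a 48-bit wrapping counter made of two 24-bit halves; the module name split_counter, the function names and the choice of 24 bits per half are assumptions for illustration.

```erlang
-module(split_counter).
-export([count/1, count/3, value/2]).

-define(HALF_MAX, 16#ffffff).  %% each half holds 24 bits

%% Increment a counter N times, starting from zero.
count(N) -> count(N, 0, 0).

%% Hi and Lo stay small integers; when Lo is full it resets to 0 and
%% carries into Hi, and Hi itself wraps around after 24 bits.
count(0, Hi, Lo) -> {Hi, Lo};
count(N, Hi, ?HALF_MAX) -> count(N - 1, (Hi + 1) band ?HALF_MAX, 0);
count(N, Hi, Lo) -> count(N - 1, Hi, Lo + 1).

%% Combine the halves for reporting. The result can exceed the small
%% integer range on some VMs, so keep this off the fast path.
value(Hi, Lo) -> (Hi bsl 24) bor Lo.
```

Both halves travel as function arguments, so the loop itself allocates nothing on the heap until the final tuple is returned.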

Future: static compiler for Erlang

Scalars and algebraic types

Structural types only – no nominal types

Target: compiler efficiency, not static type checking

A middle ground between:

“Type is a first class citizen” (Haskell)

“A single type is good enough” (Python, Erlang)

Future: static compiler for Erlang - 2

Challenges:

Pattern matching compilation

Type inference for recursive types

y = nil | {x, y}

y = {(unit | y), x, (unit | y)}

Work started in 2013

Currently the compiler is at the proof-of-concept stage

Questions

?

e-mail: [email protected]
