Swift: A Fast Dynamic Packet Filterhnw/paper/swift.pdf · The CMU/Stanford Packet Filter (CSPF)...

Swift: A Fast Dynamic Packet Filter

Zhenyu Wu Mengjun Xie Haining WangThe College of William and Mary

{adamwu, mjxie, hnw}@cs.wm.edu

AbstractThis paper presentsSwift, a packet filter for high perfor-mance packet capture on commercial off-the-shelf hard-ware. The key features of Swift include (1) extremelylow filter update latency for dynamic packet filtering,and (2) Gbps high-speed packet processing. Based oncomplex instruction set computer (CISC) instruction setarchitecture (ISA), Swift achieves the former with aninstruction set design that avoids the need for compi-lation and security checking, and the latter by mainlyutilizing SIMD (single instruction, multiple data). Weimplement Swift in the Linux 2.6 kernel for both i386and x8664 architectures. The Swift userspace librarysupports two sets of application programming interfaces(APIs): a BPF-friendly API for backward compatibilityand an object oriented API for simplifying filter cod-ing. We extensively evaluate the dynamic and static fil-tering performance of Swift on multiple machines withdifferent hardware setups. We compare Swift with BPF(the BSD packet filter)—thede factostandard for packetfiltering in modern operating systems—and hand-codedoptimized C filters that are used for demonstrating possi-ble performance gains. For dynamic filtering tasks, Swiftis at least three orders of magnitude faster than BPF interms of filter update latency. For static filtering tasks,Swift outperforms BPF up to three times in terms ofpacket processing speed, and achieves much closer per-formance to the optimized C filters.

1 IntroductionA packet filter is an operating system kernel facility thatclassifies network packets according to criteria given byuser applications, and conveys the accepted packets froma network interface directly to the designated applicationwithout traversing the network stack. Since the birth ofthe seminal BSD packet filter (BPF) [14], packet filtershave become essential to build fundamental network ser-vices ranging from traffic monitoring [12, 21] to networkengineering [19] and intrusion detection [20]. In recentyears, with dramatically increasing network speed andescalating protocol complexity, packet filters have beenfacing intensified challenges posed by more dynamic fil-tering tasks and faster filtering requirements. However,existing packet filter systems have not yet fully addressedthese challenges in an efficient and secure manner.

Dynamic filtering tasks refer to on-line packet filteringprocedures in which filtering criteria frequently changeover time. Typically, when a filtering task cannot fullyspecify its criteriaa priori and the unknown part can onlybe determined at runtime, the filtering criteria have to beupdated throughout the filtering process. For example,many network protocols, such as FTP, RTSP (Real TimeStreaming Protocol), and SIP (Session Initiation Proto-col), establish connections with dynamically-negotiatedport numbers. Capturing the network traffic that usessuch protocols requires dynamic filtering. Even with pre-determined filter criteria, quite often it is necessary to usedynamic filtering. For instance, a network intrusion de-tection system (NIDS) needs to perform expensive deeptraffic analyses on suspicious network flows. However,applying such costly procedures to every packet in highvolume traffic would severely overload the system. In-stead, an NIDS could first apply simple filtering criteria,such as monitoring traffic to and from a honeypot or adarknet only. When suspicious activities are detected, theNIDS can then update its filtering criteria and capture thetraffic of suspicious hosts for deep inspection.

As thede factopacket filter on modern UNIX vari-ants, BPF has shown insufficiency in handling bothstatic and dynamic filtering tasks, particularly the latter[4, 7, 10]. A filter update in BPF must undergo three“pre-processing” phases: compilation, user–kernel copy-ing, and security checking. In the compilation phase, thefiltering criteria specified by the human-orientedpcapfil-ter language [11] are translated and optimized into themachine-oriented BPF filter program. In the user–kernelcopying phase, the compiled filter program is copiedinto the kernel. Finally, in the security checking phase,the kernel-resident BPF instruction interpreter examinesthe copied filter program for potentially dangerous op-erations such as backward branches, ensuring that user-level optimizer errors cannot trigger kernel misbehavior(e.g., infinite loops). Consequently, the whole process ofa filter update in BPF induces prolonged latency, whichranges from milliseconds up to seconds depending oncriterion complexity and system workload. In high-speednetworks, hundreds or even thousands of packets of in-terest might be missed by BPF during each filter update,effectively leaving a “window of blindness.” Frequent fil-ter updates, often required by a dynamic filtering task,

exacerbate the degree of blindness. A “window of blind-ness” coinciding with the initialization of a new sessioncan cause serious problems on certain applications suchas NIDS, since the beginning of a network connectionnormally is of particular interest for security analysis [9].

Recent packet filters such as xPF [10] and FFPF[4] move more packet processing capabilities fromuserspace into the kernel, which reduces contextswitches and improves overall performance. However,neither filter works well for dynamic filtering tasks. Be-cause xPF uses the BPF-based filtering engine, it offersno improvement on filter update latency. FFPF attemptsto solve this problem by using kernel space library func-tions, called external functions, which are pre-compiledbinaries for specific functionalities such as parsing net-work protocols and updating states. The use of externalfunctions increases filtering speed and eases extensibil-ity, but also increases programming complexity. Externalfunctions, unlike safety-checked BPF filters, have accessto the kernel’s full privileges. New external functionsshould thus be carefully examined for potential securitybugs, making them a poor fit for frequently-changing dy-namic filtering tasks.

In this paper, we proposeSwift, a packet filter thattakes an alternative approach to achieving high perfor-mance, especially for dynamic filtering tasks. Like BPF,Swift is based on a fixed set of instructions executed byan in-kernel interpreter. Unlike BPF, Swift is designedto optimize filtering performance with powerful instruc-tions and a simplified computational model. Swift’s in-struction set is able to accomplish common filtering taskswith a small number of instructions. These powerful in-structions of Swift resemble those in CISC ISAs and sup-port optimizations analogous to SIMD (single instruc-tion, multiple data). Running on the powerful instruc-tions, Swift attains static filtering speedup due mainlyto SIMD extension and hierarchical execution optimiza-tion, a special runtime optimization technique for avoid-ing redundant instruction interpretation. More impor-tantly, combining the powerful instructions with the sim-plified computational model, Swift removes filter com-pilation and security checking in filter update, and thussignificantly improves dynamic filtering performance interms of filter update latency.

We implement Swift in the Linux 2.6 kernel for bothi386 and x8664 architectures. The kernel implementa-tion of Swift is fully compatible and can coexist withLSF (Linux Socket Filter), “a BPF clone” in Linux. TheSwift userspace libraries provide a BPF-friendly applica-tion programming interface (API) with textual filter syn-tax for backward compatibility, and an object-orientedAPI that simplifies filter coding. To validate the efficacyof Swift, we conduct extensive experiments on multi-ple machines with different hardware setups and proces-

sor speeds. We compare the performance of Swift withthat of LSF and optimized C filters. These C filters areused for demonstrating the possible performance gainsobtainable by optimized binary code. For dynamic filter-ing tasks, Swift achieves at least three orders of magni-tude lower filter update latency than LSF, and reduces thenumber of missing packets per connection by about twoorders of magnitude in comparison with LSF. For staticfiltering tasks with simple filtering criteria, Swift runsas fast as LSF; but with complex filtering criteria, Swiftoutperforms LSF up to three times in terms of packetprocessing speed, and performs much closer to the opti-mized C filters than LSF.

The remainder of this paper is structured as follows.Section 2 surveys related work on packet filters. Sec-tion 3 details the design of Swift. Section 4 describes theimplementation of Swift. Section 5 evaluates the perfor-mance of Swift. Section 6 concludes the paper.

2 Related WorkThe CMU/Stanford Packet Filter (CSPF) [16], a kernel-resident network packet demultiplexer, introduces thepacket filter concept. CSPF provides a fast path, insteadof the normal layered/stacked path, for network packetsto reach their destined userspace applications. Thus, theliteral meaning of filtering in CSPF isdelayered demulti-plexing[25]. The original motivation behind CSPF is tofacilitate the implementation of network protocols suchas TCP/IP at userspace. Although the purposes and tech-niques vary in subsequent packet filters, the CSPF modelof kernel-resident and protocol-independentpacket filter-ing is inherited by all its descendants.

BPF [14] aims to support high-speed network moni-toring applications such astcpdump [11]. Users informthe in-kernel filtering machine of their interests througha predicate-based filtering language [15], and then re-ceive from BPF the packets that conform to filteringcriteria. To achieve high performance, BPF introducesin-place packet filtering to reduce unnecessary cross-domain copies, a register-based filter machine to fix themismatch between the filter and its underlying architec-ture, and a control flow graph (CFG) model to avoid re-dundant computations. BPF+ [3] further enhances theperformance of BPF by exploiting global data-flow op-timization to eliminate redundant predicates across filtercriteria and employing just-in-time compilation to con-vert a filtering criterion to native machine code. xPF [10]increases the computation power of BPF by using persis-tent memory and allowing backward jumps.

In response totcpdump’s inefficiency at handling dy-namic ports, a special monitoring toolmmdump [24] hasbeen developed to capture Internet multimedia traffic, inwhich dynamic ports are widely used.mmdump reducesthe cost of compilation by exploiting the uniformity of

its filtering criterion patterns. A customized function inmmdump assembles new filtering criteria by using piecesof pre-compiled criterion blocks preserved from the ini-tial filter compilation. Swift’s high-level SIMD instruc-tions and hierarchical instruction optimization can beviewed as a generalization of this technique, but Swift’stechniques apply to any type of filter and require no spe-cial compiler techniques.

MPF (Mach Packet Filter) [26], PathFinder [2], andDPF (Dynamic Packet Filter) [8] are filters designed todemultiplex packets for user-level networking. To effi-ciently demultiplex packets for multiple user-level appli-cations and to dispatch fragmented packets, MPF extendsthe instruction set of BPF with an associative match func-tion. With the same goal of achieving high filter scal-ability as MPF, PathFinder, abstracts the filtering pro-cess as a pattern matching process and adopts a spe-cial data structure for the abstraction. The abstractionmakes PathFinder amenable to implementation in bothsoftware and hardware and capable of handling Gbpsnetwork traffic. DPF utilizes dynamic code generationtechnology, instead of a traditional interpreter-based fil-ter engine, to compile packet filtering criteria into nativemachine code. DPF-like dynamic code generation couldimprove Swift’s performance on static filtering tasks.

The Fairly Fast Packet Filter (FFPF, later renamed asStreamline) [4] is the most recent research on packetfiltering. Unlike traditional packet filters such as BPF,FFPF is a framework for network monitoring. Within theframework of FFPF, multiple packet filtering programscan be simultaneously loaded into the kernel. The pro-cessing flow among these programs is organized as a di-rected graph. In comparison to the filtering architectureof BPF, FFPF can significantly reduce the cost of packetcopying for multiple concurrent filtering programs by us-ing flow group, shared circular buffers, and even hard-ware (e.g., Network Processing Unit). Therefore, FFPFachieves far greater scalability than BPF. FFPF expandsfilter capacity via external functions, which are essen-tially native code running in kernel space. In addition,FFPF features language neutral design and providesbackward compatibility with BPF.

FFPF and Swift are complementary as they targetquite different problems. FFPF focuses on the packet fil-tering framework and its main contribution lies in theimprovement of scalability for supporting multiple con-current monitoring applications, while Swift aims at thepacket filtering engine and provides a fast, flexible, andsafe filtering mechanism for individual applications. Byvirtue of the language neutral design of FFPF, Swiftcan be implemented within the FFPF framework, takingmaximal advantage of both designs.

Besides software-based packet capture solutions, mul-tiple hardware-based solutions [5, 18] have been pro-

posed to meet the challenge posed by extremely highspeed networks. Specifically, FPGAs and ASICs havebeen widely used in recent intrusion detection and pre-vention systems [9, 22]. Moreover, other than the packet-filter-based network monitoring architecture, there existmany specialized-architecture monitoring systems suchas OC3MAN [1], Windmill [13], Nprobe [17], andSCAMPI [6]. Even with these hardware or specializedsystem solutions, packet filtering still plays a major rolein network monitoring and measurement due to its sim-plicity, universal installation, high cost-effectiveness, andrich applications.

3 DesignIn this section, we first present the motivation of Swiftand its design overview, then we detail the design ofSwift, including its unique ISA, and finally we analyzethe characteristics of Swift in terms of performance andsecurity.

3.1 MotivationThe inefficiency of BPF observed in our past experiencesdirectly motivated Swift’s design. The most significantperformance degradation of BPF occurs in dynamic fil-tering tasks. This degradation is mainly caused by fre-quent filter updates. As mentioned above, the undulylong filter update latency in BPF is attributed to three fil-ter pre-processing operations: filter re-compilation, user–kernel copying, and security checking. While the lattertwo play non-negligible roles in the long delay, the ma-jority of the latency is introduced by filter re-compilation[7, 24]. The duration of a filter update in BPF would besignificantly shortened if the re-compilation were selec-tively performed only on the changed primitive, or totallyskipped, asmmdump accomplishes for selected filteringtasks. However, for general purpose network monitoringtasks, re-compiling the entire BPF filter is inevitable foreach update, because its instruction set architecture andfilter program organization are unsuitable for fast update.

BPF uses a RISC-like instruction set for a low-levelregister machine abstraction. Thus, eachpcap languageprimitive is translated into an instruction block that com-prises avariablenumber of simple instructions. Chang-ing a primitive in a filter often alters the size of thecorresponding instruction block. Without re-compiling,we need to modify code-offset-related instructions (e.g.,conditional branch) throughout the entire compiled filterto accommodate the change. Control flow optimization,which is indispensable for BPF to speed up filter execu-tion, makes the matter even worse. The BPF control flowoptimization merges multiple identical instructions intoone. This significantly reduces both filter program sizeand execution time, but complicates updates to instruc-tions shared by several primitives.

In addition to filter update latency, we also find thatthe filter execution efficiency of BPF can be improvedsubstantially. The RISC-like ISA in BPF induces highinstruction interpretation overhead. Interpretation over-head refers to the operations an interpreter must performbefore executing an actual filter instruction, such as pro-gram counter maintenance, instruction loading, opera-tion decoding, and so forth. Those operations are un-productive towards evaluating filter criteria, but cannotbe omitted. Because each BPF instruction accomplishesmerely a very simple task, such as loading, arithmetic,and conditional branching, most of the CPU time in exe-cuting a BPF program is spent uselessly as interpretationoverhead, and the CPU time spent in evaluating the ac-tual packet filtering criteria only makes up a small frac-tion of the total. Our measurement results (detailed inSection 5.2) show that the BPF interpretation overhead isabout 5.2 nanoseconds on average in a machine with In-tel 64-bit Xeon 2.0GHz CPU, and makes up nearly 56%of the average instruction execution time.

BPF’s continuing widespread use can be mainly at-tributed to (1) the generic pseudo-machine abstraction,which guarantees cross-platform compatibility, and (2)its natural-language-like, primitive based filter language,which ensures ease of use to application developers andnetwork administrators. Therefore, we decide to inheritfrom BPF the pseudo-machine abstraction and languageprimitives, while developing our own filtering model toachieve significant performance improvement.

3.2 Design OverviewThe primary objective of Swift is to achieve low filterupdate latency. Our approach to reaching this goal is byreducing filter criterion pre-processing on filter update asmuch as possible. More specifically, we attempt to avoidfilter re-compilation and optimization, allow “in-place”filter updating, and eliminate security checking.

To achieve “compilation free” update, filtering crite-ria must mapdirectly onto interpreter instructions. Thismakes a high-level, CISC-like instruction set architecturea natural choice for Swift. In addition to saving compila-tion time, the CISC-like instruction set also opens a doorfor performance optimization. A complex Swift instruc-tion is able to accomplish the same task as several simpleBPF instructions, thereby reducing instruction interpre-tation overhead.

Two design choices are made to enable in-place fil-ter modification: fixing instruction length and removingfilter optimization. By fixing the instruction length, weavoid the need to shift instructions on instruction re-placement. By removing filter optimization, not only dowe save precious time during a filter update, but alsopreserve one-to-one mapping between filtering primi-tives and filter program instructions: no instructions are

Figure 1: Filter organizations of BPF (a) and Swift (b)for criteria matching HTTP and DNS traffic

shared. As a result, updates to a filter can be directly ap-plied to the affected instructions without altering otherinstructions or filter program structure. This feature fur-ther helps to optimize filter update by reducing unneces-sary user–kernel data copying. Only the updated part ofa filter criterion is copied from userspace to the kernel.

A simplified computational model ensures filter pro-gram safety for Swift without security checking. Withthe specialized ISA, each Swift instruction is able to per-form a set of complex pattern matching operations. Theexecution path of a filter program is determined by theBoolean evaluation result of each instruction: either con-tinue (“true”) or abort (“false”). Therefore, Swift doesnot need storage or branch instructions to control the ex-ecution path of a filter program. With a fixed set of in-structions, acyclic execution path, and zero data storage,any Swift filter program is safe to run in the kernel.

Our secondary objective is to increase filter executionefficiency. We achieve this goal by exploring the fol-lowing two optimizations: SIMD expansion to the Swiftinstruction set and hierarchical execution optimization.SIMD allows an interpreter to perform a single instruc-tion interpretation and apply the same operation on manysets of data, thereby significantly reducing the cost ofinstruction interpretation. SIMD has been widely usedin contemporary high performance processors, such asIntel Pentium series and IBM Power series processors.While Swift’s design ensures low filter update latency,it also forfeits the benefit associated with filter programoptimization. To offset the possible performance loss, weintroduce an alternative optimization method called hi-erarchical execution optimization. This optimization isbased on our observation that during a dynamic filter-ing process, the newly-added primitives are often relatedto some existing ones. For example, the new primitivesquite often monitor the same host but on different ports orcapture the same protocol traffic but for different hosts.Therefore, the existing primitives can be viewed as the“parent” of the new primitives. Swift utilizes this hierar-

Figure 2: Swift filter structure

chical relationship among primitives to avoid redundantinstruction executions. Instead of actively optimizing fil-ter programs, i.e., performing automatic optimization infilter update, Swift makes the primitive hierarchy a hintfor the filter execution engine, and leaves applications re-sponsible for constructing the hierarchy.

3.3 Detailed DesignThe Boolean logic in a Swift filter is organized in dis-junctive normal form. Figure 1 (b) illustrates the controlflow organization of Swift, while the control flow graphof BPF, which is semantically equivalent, is shown inFigure 1 (a) for comparison. In the Swift control floworganization, each disjunct cluster—the rounded rectan-gle with shaded background—specifies a complete set ofprimitives that a packet must satisfy in order to be ac-cepted by the Swift filter. In Swift, we name such a dis-junct cluster aPass, meaning a “passage” of packets.

A pass consists of one or more literals. In a BPF fil-ter, a literal corresponds to apcap language primitive.Swift inherits the primitives from BPF. However, insteadof realizing a primitive with multiple simple instruc-tions, Swift maps each type of primitives into a pseudo-machine instruction—the basic building block of a filterprogram. Figure 2 illustrates the structure of a Swift fil-ter.

3.3.1 Swift Instruction Set

All Swift instructions have the same size, facilitating fastinstruction modification on filter update. The Swift in-struction layout is shown in Figure 3, where one 32-bitcommand field is followed by seven 32-bit parameterfields. Such a nicely-aligned 32-byte structure ensuresefficient memory accesses.

Figure 3: The Swift instruction format

We formulate our instruction set based on BPF primi-tives. We first classify BPF primitives into two categoriesaccording to their addressing modes. “Direct addressing”primitives, such as “ether proto” and “ip src host,” fetch

data from an absolute offset in a packet. “L1 indirect ad-dressing” primitives, such as “tcp dst port,” address databy calculating the variable header length of a protocollayer and adding a relative offset to it. We then general-ize the manipulation and comparison operations used inthe semantics of BPF primitives. There are three typesof basic operations: (1) test if equal, (2) mask and test ifequal, and (3) test if in range. Each type also has somevariations on operand width (short or long integer). Fi-nally, we design the complex instructions to accomplishthe corresponding operations. We come up with 14 dif-ferent operations that, alone or by combination, are ableto perform equivalent operations of any BPF primitivesexcept “expr, ” which involves arbitrary arithmetic.

To further exploit the CISC architecture and enhanceperformance, we introduce a new addressing mode, “L2indirect addressing,” with four additional instructions.Inthe new addressing mode, filtering operations addressdata by first performing “L1 indirect addressing” to re-trieve the related information, which is used to calculatethe variable header length of a deeper layer, and thenadding the relative offset. While BPF does not providesuch primitives, there are practical demands such as fil-tering based on TCP payload. Moreover, we add fourmore “power instructions” that perform equivalent op-erations of several frequently-used BPF primitive com-binations, such as “ip src and dst net” and “tcp src or dstport.” Therefore, in total Swift has 22 different types ofinstructions.

Table 1 lists a selection of Swift instructions, whichcaptures the characteristics of Swift’s CISC-like ISA.The four columns from leftmost to rightmost refer tothe addressing mode, the instruction type, the instruc-tion functionality, and the equivalent BPF operation(s),respectively. Swift instructions are fairly generic in thatgiven different parameters, one Swift instruction canfunction as several differentpcap language primitives.Examples are given in the fourth column. Based on theSwift instruction set, we can derive alternative faster im-plementations for some BPF primitives. For instance, the“ip and tcp port” primitive in BPF requires three initialsteps with six instructions to examine whether a packet isIP, non-fragment, and TCP. In Swift, we can take advan-tage of the “continuous masked comparison” instruction(D LEQ M), to perform the same examination in a singleinstruction.

We add the SIMD feature into the Swift instructionset by packing additional operands into unused parame-ter fields. For instance, the “Direct addressing load, test ifequal” instruction (D EQ) uses only one 32-bit operand.In contrast, the SIMD version of this instruction cancarry up to six additional operands, and the correspond-ing operation becomes “Direct addressing load, and testif equal to any of P[1] through P[7].”

Table 1: Sample of Swift instruction set

3.3.2 Swift Pass and Filter Program

A series of instructions connected by logic “AND” forma pass. When a packet arrives, the instructions of a passare evaluated one by one. If all evaluation results are“true,” the packet is accepted and copied to userspace.Otherwise, if any evaluation result is “false,” the packetfails the current pass, and will be tested by remainingpasses or dropped if it fails all passes. Passes are thus ef-fectively independent and are combined by logic “OR.”

We achieve the hierarchical execution optimizationfeature by establishing hierarchical relationships amongpasses. When a Swift filter is initially set, the passes itcontains are created by the application from scratch. Insubsequent changes, if a new pass is related to one ofthe existing passes, the application is entitled to add thenew pass by duplicating an existing pass and modifyingthe copy, or “child” pass. Performing duplication, insteadof creating a pass afresh, has two benefits. First, the ap-plication saves time in updating the criterion—only thedifference between the old and new control flows needsto be updated. Second, the parent–child relationship isnoted by the filtering engine and is used to optimize fil-ter execution.

When a pass is added by duplication, Swift makes abit-exact copy of the parent pass and then marks all theinstructions of the child pass as “copied,” a hint for thefiltering engine that the marked instruction is exactly thesame as the corresponding one in its parent pass. Whenan instruction in the child pass is later modified, the as-sociated “copied” mark is removed. To evaluate a Swiftfilter, the filtering engine traverses the passes accordingto their hierarchical relationships (if any) in a depth-firstmanner. A parent pass is evaluated before its children. Ifthe parent pass succeeds, then, as for any pass, the fil-tering engine halts. Otherwise, Swift records those in-structions that succeeded. When evaluating child passes,

Figure 4: Hierarchical pass relation diagram

Swift need not re-execute any copied instruction that suc-ceeded in the parent.

Figure 4 illustrates an example of the pass relation ina Swift filter. Pass 1 is created from scratch, while theother two passes are added by duplicating pass 1. The in-structions bearing the “copied” mark have shaded back-ground, so the filtering engine may skip their evaluations.

3.4 Analysis

Before giving detailed analysis of Swift in terms of per-formance and security, we first summarize the shared de-sign principles of Swift with other packet filters, espe-cially BPF, as well as its unique design features that dis-tinguish Swift from other packet filters. The shared fea-tures are marked with “+,” and the unique ones markedwith “⋄.”

+ Runs as a kernel module, filtering packets in place.

+ Uses architecture-independent pseudo-machine.

⋄ Utilizes CISC ISA with SIMD support.

⋄ Enables compile-free, in-place filter modification.

⋄ Ensures security with simplified computational model.

3.4.1 Performance

The performance superiority of Swift mainly originsfrom two aspects: high filter execution efficiency and lowfilter update latency.

Using a complex instruction set and SIMD benefitsstatic filtering performance. Thanks to the capability ofaggregating multiple simple operations into one instruc-tion, a Swift program has much fewer instructions thanits BPF counterpart. As a result, even though its per-instruction interpretation overhead is slightly higher thanthat of BPF, Swift achieves much lower interpretationoverhead of an entire filter program. While the filter en-gine size of Swift (24KB) is much larger than that of BPF(6KB), our experimental results show that the larger codesize has insignificant impact on performance. Even run-ning Swift on the CPU with only 12KB L1 cache (“PC1”in our experiment setup), there is still no observable per-formance degradation indicating cache thrashing.

Swift’s superior dynamic filtering performance ismainly attributed to its very low filter update latency.For a filtering program withN primitives and experi-encingM changes per update, the three pre-processingphases in BPF—recompiling the entire filter, copying thewhole compiled filter code to the kernel, and securitychecking—all takeO(N) runtime. However, performingthe same filter update in Swift involves neither compila-tion nor security checking. The only required operations,mapping the changed primitives into instruction opcodesand parameters, and copying the modified instructionsinto kernel, takeO(M) runtime. Because the filter updatein Swift is only related to the number of changes per up-date (M), not to the complexity of the existing filter (N),its filter update latency can be substantially lower thanthat of BPF, especially whenN is large (i.e., the filteringcriteria are complicated). Furthermore, hierarchical exe-cution optimization can avoid performance degradationcaused by redundant filter instructions.

3.4.2 Security

Security, and filter code safety in particular, have al-ways been a concern in packet filter design. Since mod-ern packet filters execute in kernel space, without propercode safety checking, a faulty filter program containinginfinite loops, wild jumps, out-of-bound array indexes,etc., could lead to unpredictable results. In addition, amaliciously crafted filter program can bypass any user-level access protection and can seriously undermine sys-tem security.

Depending on the design model, different packet fil-ters have different mechanisms to enforce the securityof the filter programs. The FFPF filter languages allowmemory allocation, and hence, FFPF has compile-timechecks to control resource consumption and run-timechecks to detect array boundary violations. In contrast,

BPF only needs to perform a security check in the ker-nel just before the filter program is attached; any programcontaining backward or out-of-bound jumps or illegal in-structions is rejected. However, Swift enforces securityin its design and eliminates the necessity for run-timesecurity checks. Swift trades off some of its computa-tional power, i.e., arbitrary data manipulation, for sim-pler computational model. Because any Swift programis an acyclic DFA (deterministic finite-state automaton),the interpreter is always in a pre-determined state, the ex-ecution of a finite-size filter program is always bounded,and Swift requires no security check at all.

Two rationales justify this design tradeoff. First, thereduction of computational power is harmless in the con-text of packet filtering. A packet filter is a very spe-cific system tool with a well defined set of operations.PathFinder [2] shows that all operations in packet filter-ing can be generalized as pattern matching. Thepcapfil-ter language uses the special primitive “expr” to supportarbitrary data manipulation. However, this primitive israrely used in practice, because its main usage is to spec-ify uncommon filtering criteria that are not covered byregular primitives. Second, BPF’s support for arbitrarydata manipulation comes with a high cost of its perfor-mance. Instead of following BPF’s strategy, we apply thestrategy of “optimize for the common case and preparefor the worst” in Swift’s design.

BPF does not differentiate predefined and arbitrarydata manipulation operations. Instead, BPF executes anydata manipulation by breaking it down to multiple el-ementary operations. Thus, BPF wastes a significantamount of time in interpretation, and sometimes it takeslonger time to interpret an instruction than to execute it.Swift supports well defined and commonly used data ma-nipulations by incorporating each variant in a single in-struction, and integrating their implementations into thefiltering engine. Since those operations are carried outby native binary code, Swift achieves very high execu-tion efficiency. Swift cannot perform data manipulationsthat are not defined in its instruction set. Instead, the userapplications need to carry out the custom data manipula-tions by themselves. However, in case an unsupporteddata manipulation is desperately needed, for example,when a new protocol requires a different data manipu-lation, we can always add new instructions to Swift.

4 Implementation

We have implemented the Swift kernel engine anduserspace libraries in Linux 2.6. Currently we provideimplementations for both i386 and x8664 architectures,and we plan to port Swift to other open-source UNIXvariants such as FreeBSD in the future.

Table 2: Selection oflibswift APIsRoutine Functionality

SPFOpen Create and attach an empty Swift filterto a socket

SPFNewPass Create a new pass in the Swift filterSPFDelPass Remove a pass from the Swift filterSPFDupPass Create a copy of a given passSPFSelectOp Assign a predefined operation (equivalent

to apcaplanguage primitive) to aninstructionof a given pass

SPFAddParam Add an additional (SIMD) parameter toan instructionin the given pass

SPFDelParam Remove a given parameter from aninstructionin the given pass

SPFPokeInst Change an arbitrary part of aninstructionin the given pass

4.1 Kernel Implementation

Swift coexists with the Linux kernel’s LSF (LinuxSocket Filter). LSF is the module equivalent to BPFin BSD UNIX and the default packet filtering modulefor the widely usedlibpcap library. Our implementa-tion requires little modification to the existing kernelcode, and is compatible with the existing packet filteringframework. Swift’s user–kernel communication mecha-nism uses thesetsockopt() system call. Swift filterprograms are attached to the same kernel data structuresk filter as LSF filter programs, with a flag set to telltwo kinds of programs apart. Packets captured by Swiftand LSF share the same delivery path no matter whichpacket filter is being used.

4.2 Userland Libraries

The libpcap library provides a set of well-designed rou-tines for setting filter programs and processing pack-ets, as well as utility functions for handling devices anddumping captured packets. Instead of hackinglibpcaptoincorporate Swift, we developed a set of complementarylibraries. Applications based on Swift can seamlessly usethoselibpcapfunctions that are unrelated to filter setting,but must invoke a separate mechanism to communicatewith the Swift filter engine for filter program installationand update.

As shown in Table 2, the C librarylibswift provides aset of function APIs for the convenient manipulation ofSwift filter programs. We also implement a C++ libraryooswift, providing object-oriented filter program controland manipulation and improved debugging support. Ta-ble 3 shows a common filtering criterion expressed inpcapprimitives, in Swift usingooswiftand in compiledLSF code. The table illustrates the clear logical connec-tion and easy transformation between the Swift filter pro-gram and thepcaplanguage primitives.

Table 3: A filter expressed bypcap, Swift, and LSFFiltering criterion by pcap

ip src net 192.168.254.0/24 and tcp dst port 23Swift filter program

Pass.Inst(0)→EtherIPTCP NonFrag()Pass.Inst(1)→Ether IPSrc(0xFFFFFF00, 0xC0A8FE00)Pass.Inst(2)→EtherIPTCUDPDst(23)

LSF filter program(00) ldh [12](01) jeq #0x800 jt 2 jf 13(02) ld [26](03) and #0xffffff00(04) jeq #0xc0a8fe00 jt 5 jf 13(05) ldb [23](06) jeq #0x6 jt 7 jf 13(07) ldh [20](08) jseq #0x1fff jt 13 jf 9(09) ldxb 4*([14]&0xf)(10) ldh [x + 16](11) jeq #0x17 jt 12 jf 13(12) ret #96(13) ret #0

5 Evaluation

In this section, we evaluate the performance of Swift onboth dynamic and static filtering tasks and compare itwith that of LSF and C kernel filters. The C kernel fil-ters (“Opt-C” for short in the following) are hand-coded,compiled (usinggcc “-O2” option) C programs that pro-vide some indication of possible performance gains ob-tainable by non-interpreted binary code. Each C kernelfilter is coded for a specific filtering task. We use the per-formance of Opt-C filters as an approximation to the per-formance of FFPF filters. An FFPF filter written in FPL-3, which is FFPF’s native language, is first transformedinto a C program and compiled into native code bygcc,and then loaded as a kernel module for use. Since bothFPL-3 filters and our Opt-C filters run inside the kernelnatively and only a single filter program runs in each ex-periment, the performance difference between FFPF fil-ters and Opt-C filters should be minimal.

Swift, LSF, and Opt-C filters share the same filteringframework and only differ in filtering engine. Therefore,

Table 4: Testbed machine configurationsCPU L2 FSB

PC11× Intel Pentium 4 2.8GHz

1MB 533MHz(32-bit)

PC22× Intel Xeon 2.8GHz

512KB 800MHz(32-bit w/ HyperThreading)

PC32× Intel Xeon 2.0GHz

4MB 1333MHz(EM64T DualCore)

PCS1× Intel Pentium D 2.8GHz

2MB 800MHz(EM64T DualCore)

0 50 100 150 2000

20

40

60

80

100

120

140

160

Session Number

Upd

ate

Late

ncy

(ms)

PC1−LSF

No bg500 Kpps

0 50 100 150 2000

20

40

60

80

100

120

140

160

Session Number

Upd

ate

Late

ncy

(ms)

PC2−LSF

No bg500 Kpps

0 50 100 150 2000

20

40

60

80

100

120

140

160

Session Number

Upd

ate

Late

ncy

(ms)

PC3−LSF

no bg500 Kpps

Figure 5: LSF filter update latency

0 50 100 150 2000

10

20

30

40

50

Session Number

Upd

ate

Late

ncy

(us)

PC1−Swift

No bg500 Kpps

0 50 100 150 2000

10

20

30

40

50

Session Number

Upd

ate

Late

ncy

(us)

PC2−Swift

No bg500 Kpps

0 50 100 150 2000

10

20

30

40

50

Session Number

Upd

ate

Late

ncy

(us)

PC3−Swift

No bg500 Kpps

Figure 6: Swift filter update latency

we use the number of CPU cycles spent by the filter-ing engine as the micro-benchmark metric. This metricis computed by taking the difference of the x86 Time-Stamp Counter (TSC) just before and right after a spe-cific filter operation. For dynamic filtering tasks, the op-eration is filter update, while for static filtering tasks, theoperation is filter evaluation. In order to compare filterperformance across different platforms, we further con-vert CPU cycles into clock time based on the correspond-ing machine’s processor frequency.

To evaluate the filters in a realistic but controllable en-vironment, we set up a testbed using a Gbps SMC man-aged switch to connect four different machines. We usethe mirror function of the switch to redirect the traffic onthe specified source port to the mirror port. The packetgenerator machine (PCS) connected to the source port re-plays traces, and one of the other three machines (PC1–3)connected to the mirror port captures the replayed traf-fic as a monitoring device. The four machines (PCS andPC1–3) have different generations of processors rangingfrom Pentium 4 32-bit to the latest Xeon dual core 64-bit.The configurations of these machines are listed in Table4.

5.1 Dynamic Filtering PerformanceWe use the task of capturing FTP passive mode traffic,a typical dynamic filtering task, to measure the perfor-mances among Swift, LSF, and Opt-C filters. We de-veloped an application calledFTPCapto monitor FTPtraffic and collect performance statistics. Three variantsthat use Swift, LSF, or Opt-C are called FTPCap-Swift,FTPCap-LSF, and FTPCap-Opt-C, respectively.

5.1.1 Experimental Setup

In passive mode FTP, the server port of a control con-nection is fixed (usually 21), but the server ports of dataconnections are dynamically assigned. FTPCap-LSF ini-tially employs “(ip and tcp port ftp)” to capture FTPcontrol packets. When a control packet containing theport number for a new data connection is captured, theserver IP address and port number for the new connec-tion will be recorded, and FTPCap-LSF will generate anew criterion similar to “(ip and tcp port ftp) or (ip x1and (tcp port y1 or tcp port y2)) or (ip x2 and (tcp porty3 or tcp port y4)),” in which “x1” and “x2” refer to theserver IP addresses and “y1 . . . y4” refer to the port num-bers. The LSF optimizer performs better when the portnumbers of the same server are grouped together. Corre-spondingly, FTPCap-Swift initializes the first pass withthe criterion “(ip and tcp port ftp)” to capture FTP controlpackets. When a data connection setup event is detected,FTPCap-Swift either simply includes the new port num-ber in the corresponding pass if the server is already ob-served, or adds a pass using hierarchical execution opti-mization otherwise. FTPCap-Opt-C, unlike the previoustwo, has no code for filter setting and updating becausethe work is already taken by the Opt-C filter. It simplyturns on/off the Opt-C filter.

The FTP traffic trace is obtained in a LAN environ-ment. We set up 10 FTP servers with different IP ad-dresses. For each server we make 20 concurrent passive-mode file transfer connections, which are initiated oneby one. In other words, at maximum there are 200 con-current passive FTP data connections. This trace lasts

0 50 100 150 2000

30

60

90

120

150

180

Session Number

Avg

. # o

f Mis

sed

Pac

kets

PC1−LSF

No bg500 Kpps

0 50 100 150 2000

30

60

90

120

150

180

Session Number

Avg

. # o

f Mis

sed

Pac

kets

PC2−LSF

No bg500 Kpps

0 50 100 150 2000

30

60

90

120

150

180

Session Number

Avg

. # o

f Mis

sed

Pac

kets

PC3−LSF

No bg500 Kpps

Figure 7: Missing packets per data connection by LSF

0 50 100 150 200−1

0

1

2

3

4

5

Session Number

Avg

. # o

f Mis

sed

Pac

kets

PC1−Swift

No bg500 Kpps

0 50 100 150 200−1

0

1

2

3

4

5

Session Number

Avg

. # o

f Mis

sed

Pac

kets

PC2−Swift

No bg500 Kpps

0 50 100 150 200−1

0

1

2

3

4

5

Session Number

Avg

. # o

f Mis

sed

Pac

kets

PC3−Swift

No bg500 Kpps

Figure 8: Missing packets per data connection by Swift

45 seconds with 3,948 packets per second (pps) on av-erage. In addition, we emulate the scenario of monitor-ing FTP packets under high-rate background traffic bymixing the captured FTP traffic with a constant high-rate(500 Kpps) non-FTP background traffic. The backgroundtraffic is generated by usingtcpreplay [23] to playback a large trace file, which is captured at the edge gate-way of our campus network.

Besides using filter update latency as the micro-benchmark performance metric, we also use the num-ber of missing packets per data connection as the macro-benchmark performance metric. The missing packets re-fer to those packets that are not captured by FTPCapat the beginning of a newly-established data connection.The packet miss is caused by the filter update latency be-ing larger than the FTP client acknowledgment delay—the interval between the time when the client receives theport assignment message and the time when the clientstarts to communicate with the server on that port. Themetric is derived by counting the number of the trans-mitted packets (including TCP control packets) prior tothe first packet captured by FTPCap, based on the offlineanalysis of the replayed trace.

5.1.2 Experimental Results

We run FTPCap-LSF and FTPCap-Swift 20 times eachon PC1–3, 10 times with the FTP traffic trace replayedand the other 10 times with the mixed traffic trace re-played. We take the median of 10 experimental resultsas the final result. FTPCap-Opt-C is also tested. Becausethere is no filter update at userspace, its filter update la-tency is zero and no packet is missed by FTPCap-Opt-

C for either FTP traffic or mixed traffic. Therefore, wefocus on the performance comparison between LSF andSwift.

Figures 5 and 6 show how filter update latencychanges with an increase in concurrent data connec-tions for LSF and Swift, respectively. The thick and thincurves show the filter update latencies for the traces withno background traffic and with 500 Kpps backgroundtraffic, respectively. The most significant difference be-tween Figures 5 and 6 lies in the scale of the y-axis.While the filter update latency for LSF is on the orderof milliseconds (ms), the filter update latency for Swiftis only on the order of microsecond (µs). By eliminat-ing filter compilation and security checking, Swift gainsat least three orders of magnitude speedup against LSFin filter update. Over 99% of LSF’s latency is caused byuser-level filter recompilation, but the remaining user–kernel copy and security check latency is still muchlarger than Swift’s entire update latency. For example,the user–kernel copy and security check latency on PC3grows from 10µs to 20µs in the experiment. FTPCaprunning on PC2 does not capture all control packetsthat carry dynamic port information under mixed traffic,which results in incomplete thin curves in “PC2-LSF”and “PC2-Swift.” The missing critical control packets aremainly due to PC2’s insufficient processing capacity.

As shown in Figure 5, both concurrent connectionsand background traffic affect the filter update latency ofLSF. When the number of concurrent connections in-creases, the filtering criterion expressed inpcaplanguagebecomes longer. And the compilation procedure and se-curity checking consume more CPU cycles. In contrast,

Table 5: Static filters inpcaplanguage and their instruction counts in LSF and SwiftFilter Description LSF Inst.# Swift Inst.#

1 “” (Accept all packets) 1 02 “ip” 3 13 “ip src net 192.168.2.0/24 and dst net 10.0.0.0/8” 10 24 “ip src or dst net 192.168.2.0/24” 10 25 “ip and tcp port (ssh or http or imap or smtp or pop3 or ftp)” 23 26 “ip and (not tcp port (80 or 25 or 143) and not ip host ( ... ))” 95 10

(The ellipsis mark stands for 38 IP addresses ORed together.)

the filter update latency of Swift is basically insuscepti-ble to changes in concurrent connections and backgroundtraffic: although all Swift curves in Figure 6 fluctuateslightly, the thin curves overlap with the thick curves toa great extent. This is because Swift filter updates areincremental and adding filter instructions for a new con-nection takes almost constant time. The large spikes ofSwift curves, which occur at the beginning and aroundthe addition of the 120th connection, are attributed to therelatively large overheads caused by pass duplication.

Figures 7 and 8 illustrate the average number of miss-ing packets per data connection by LSF and Swift, re-spectively. The y-axis scales are again significantly dif-ferent. The average numbers of missing packets per con-nection for LSF range from 30 to 160, while those forSwift are only one or two at maximum. Without back-ground traffic, Swift does not miss any packet no mat-ter how many concurrent connections exist. With back-ground traffic, the average levels of the “500 Kpps”curves slightly lift after around 120 concurrent connec-tions, which coincides with the occurrence of the sec-ond group of large spikes in Figure 6. The lift of fluc-tuation level may be attributed to the added passes andrelated pass duplication. The addition of more passes ex-tends the filtering path for non-FTP packets and results inmore CPU time spent on non-FTP traffic filtering. Evenso, Swift only misses one or two packets per connection.

0 10 20 30 40 500

40

80

120

160

200

← 12.5

Time (ms)

# of

Tra

nsm

itted

Pac

kets

Figure 9: Initial transmission of a data connection

There are two additional issues associated with theLSF curves in Figure 7. First, the ceiling phenomena—both “No bg” and “500 Kpps” curves bounded by 160—are caused by the rate-limiting of the FTP servers. Asillustrated by Figure 9, in the initial period of a data con-

nection, the servers first transmit about 160 packets intens of milliseconds and then stay idle for the next severalhundred milliseconds (not all shown) to limit download-ing rate. Such bounding behavior occurs for a wide rangeof rate-limit settings (e.g., from 100 KBps to 2 MBps).Since an LSF filter update latency is always shorter thanthe duration of the idle phase, the number of missingpackets in each update is bounded. Second, the round-trip time (RTT) of the FTP trace is small, varying fromtens of microseconds to hundreds of microseconds, as thetrace is collected in a LAN environment. A larger RTTwould cause fewer packets to be transmitted during thetime window of a filter update, thus reducing the impactof filter updates on packet missing. Compared to LSF,Swift is almost insensitive to the variation of RTT, andhence can support applications that require high-fidelitydata capture in diverse network environments.

5.2 Static Filtering PerformanceWe use six sets of filters with increasing complexity, asshown in Table 5, for static filtering performance eval-uation. The instruction numbers of these filters in LSFand Swift are also listed for comparison. The Opt-C filterprograms show performance gains that could potentiallybe achieved by improving LSF and Swift to native codespeeds.

The trace for static filtering is captured at the gate-way of our campus network. It contains over 14 mil-lion packets (75 bytes snap length) and its size is around1.1GB. We play back the trace file at 250 Kpps rate us-ing tcpreplay. Assuming an average of 500 bytesper packet, the playback rate represents a fully utilized1Gbps link bandwidth. We record the average time spentin accepting and rejecting packets separately, and selectthe larger value of the two as the filter performance data.We choose the larger value, instead of the smaller oneor the average, because the worse case runtime is muchless affected by network traffic conditions, such as trafficspeed and composition.

Figure 10 illustrates the per-packet processing timeof LSF, Swift, and Opt-C on all machines for eachfilter. In addition, Table 6 details the breakdown ofthe per-packet processing time for both LSF and Swift

Figure 10: Per-packet processing time of each static filter (nanoseconds)

on PC3. The “Exec.” column shows the average ex-ecution time per instruction and the average numberof instructions executed per packet, in the format of(time/instruction)×(instruction count). The “Aux.” col-umn shows the auxiliary processing time spent on filterengine setup and shutdown operations, such as call/retinstructions and local stack maintenance.

Filters 1 and 2 are the simplest criteria designed toshow the minimum overhead induced by the filtering en-gine. The corresponding results in Figure 10 demonstratethat Swift and LSF have approximately the same pro-cessing speed with these two simple filters. Both Swiftand LSF run slower than Opt-C. Table 6 further shedssome light on the performances of both LSF and Swiftfilter engines. For filter 1, the LSF filter program onlyconsists of a simple “ret” instruction, and thus the 5.2nanoseconds per-instruction execution time is mainly de-termined by LSF’s interpretation overhead. In contrast,the Swift filter engine is designed to accept all packetsby default. Therefore, the Swift filter program does notcontain any code, and its processing time is spent en-tirely on the filter engine setup and shutdown. By addinga “nop” instruction for Swift to execute before acceptinga packet, we estimate Swift’s interpretation overhead tobe about 8.2 nanoseconds. For filter 2, although the per-instruction execution time of LSF is 29% shorter thanthat of Swift, its overall execution time is longer thanthat of Swift. This is because the instruction count ratiobetween LSF and Swift is three to one.

Filters 3 and 4 are light load criteria designed todemonstrate filtering engine performance on basic packet

Table 6: Processing time breakdown for LSF and Swift

FilterLSF Swift

Exec. Aux. Exec. Aux.1 5.2ns×1.0 13.8ns (N/A)×0 23.1ns2 10.3ns×3.0 14.1ns 14.5ns×1.0 22.1ns3 9.4ns×9.0 12.9ns 13.8ns×2.0 20.9ns4 8.4ns×6.0 12.5ns 13.7ns×2.0 21.2ns5 10.3ns×19.8 14.8ns 25.4ns×1.9 21.3ns6 8.0ns×78.6 13.4ns 29.5ns×7.4 21.5ns

classification. The corresponding results in Figure 10 in-dicate that Swift has a moderate performance advantageover LSF on all machines. For filter 3, compared to Opt-C, LSF takes a factor of 2.32 to 2.92 more time to pro-cess a packet, with an average slowdown of 2.67 times;Swift takes a factor of 1.22 to 2.07 more time to pro-cess a packet, with an average slowdown of 1.61 times.The average speedup of Swift over LSF is 1.43. For fil-ter 4, compared to Opt-C, LSF takes a factor of 0.84 to1.87 more time to process a packet, with an average slow-down of 1.48 times; Swift takes a factor of 0.65 to 0.95more time to process a packet, with an average slowdownof 0.80 times. The average speedup of Swift over LSFis 1.39. Similar to the cases of filters 1 and 2, Table 6shows that for filters 3 and 4, the per-instruction execu-tion time of Swift is about 50% longer than that of LSF,but the much larger instruction count makes LSF slowerthan Swift in packet processing.

Filter 5 is a moderate load criterion designed to testthe filtering engine’s capability of handling highly spe-

Table 7: Optimization effects of Swift filter 5Filter 5 Original Less-optimized Unoptimized

Exec. Time 69.7ns 178.4ns 204.5ns

cific operation. The corresponding results in Figure 10show that Swift outperforms LSF by a significant amounton all machines. Compared to Opt-C, LSF takes a factorof 3.68 to 6.38 more time to process a packet, with anaverage slowdown of 4.68 times; Swift takes a factor of0.70 to 2.60 more time to process a packet, with an av-erage slowdown of 1.39 times. The average speedup ofSwift over LSF is 2.50. The significant speedup of Swiftis due to its architectural advantages and specificallySIMD instructions. The ability to pack many operands(12 for TCP/DUP ports) in one instruction and batch theexecution of comparison operations within a single fil-ter engine “cycle” enables many-fold reduction at thecost of instruction interpretation, and improves the per-formance of Swift close to that of Opt-C. As shown inTable 6, compared to previous filters, LSF maintains itsper-instruction execution time, but executes much moreinstructions. By contrast, Swift maintains its instructioncount, and packs more operations in each instruction.

Filter 6 is a heavy load, “real life” criterion obtainedfrom the campus network administrator. This filter isused by an application to detect suspicious IRC traffic.The filter is sufficiently complex for Swift to utilize theoptimizations discussed earlier, namely SIMD instruc-tions and hierarchical execution optimization. The cor-responding results in Figure 10 show even higher perfor-mance increase of Swift against LSF. Compared to Opt-C, LSF takes a factor of 7.83 to 12.94 more time to pro-cess a packet, with an average slowdown of 10.09 times;Swift takes a factor of 1.79 to 5.21 more time to pro-cess a packet, with an average slowdown of 3.18 times.The average speedup of Swift over LSF is 2.79. Accord-ing to Table 6, even though the per-instruction executiontime of LSF is less than one third of Swift, the instructioncount ratio between LSF and Swift is ten to one.

For all these filters, the auxiliary processing time forboth LSF and Swift is fairly steady, as shown in Table6. Although the auxiliary cost of Swift is about 8 to 9nanoseconds more than that of LSF, the extra cost is in-significant as the filter becomes more complex and re-quires more time to execute.

We further measure the average execution timeof Swift filter 5 with no SIMD extension (“Less-optimized”) and with neither SIMD extension nor hier-archical execution (“Unoptimized”) to provide more in-sights into the effect of Swift optimizations on perfor-mance improvement.The corresponding results are listedin Table 7. The removal of SIMD instructions exerts agreat impact on the performance of the Swift filter, re-sulting in a slowdown of 156%. The comparatively small

Figure 11: Per-packet processing time on all machines

increase of execution time after the removal of hierar-chical execution indicates that the hierarchical executionhas a minor effect on the performance of the Swift fil-ter. For Swift filter 6, due to filter program organization,we remove hierarchical execution optimization for the”Less-optimized” experiment, and remove both hierar-chical execution and SIMD extension for the ”Unopti-mized” experiment. Again, the impact of SIMD exten-sion is far greater than that of hierarchical execution onthe average execution time.

Overall, we find that (1) the SIMD extension plays avery important role in speeding up Swift filter execution,and (2) the hierarchical execution also helps the speedupbut its effect is much smaller than that of SIMD exten-sion, especially with a large instruction count. Withoutthe SIMD extension and hierarchical execution, Swiftcan only perform comparably to optimized LSF for staticfilter tasks. These results are consistent with our observa-tion from Figure 10 and Table 6: the speedup of Swift ismainly attributed to its use of much fewer instructionsthan LSF.

Figure 11 presents a comprehensive picture of filterexecution time for LSF, Swift, and Opt-C filters amongall machines categorized by six filtering criteria. It pro-vides a good overview of the static filtering performancefor cross comparison. When the filtering criteria are sim-ple, LSF, Swift and Opt-C have nearly indistinguishableperformance. As the criteria become more complex, thedifferences of filter execution time among the three fil-tering engines grow. Although both Swift and LSF runslower than Opt-C, the filter execution time of Swiftgrows at a much slower rate than that of LSF, and thusSwift achieves much closer performance to Opt-C thanLSF.

6 ConclusionThis paper presents the design and implementation of theSwift packet filter. Swift provides an elegant, fast, and ef-ficient packet filtering technique to handle the challengeof high speed network monitoring with dynamic filter up-dates. The key features of Swift lie in its low filter up-

date latency and high execution efficiency. Swift attainsthese performance advantages by embracing several ma-jor design innovations: (1) a specialized CISC instructionset increases filter execution efficiency and eliminates fil-ter re-compilation, resulting in significantly reduced fil-ter update latency; (2) a simple computational model re-moves the necessity of security checking and improvesfilter update latency; and (3) SIMD extensions furtherboost filter execution efficiency.

Our extensive experiments have validated Swift’s effi-cacy and demonstrated the superiority of Swift againstthe de facto packet filter, BPF. For dynamic filteringtasks, the filter update latency of Swift is three orders ofmagnitude lower than that of BPF, and on each filter up-date, the number of packets missed by Swift is about twoorders of magnitude less than that by BPF. For static fil-tering tasks, Swift runs as fast as BPF on simple filteringcriteria, but is up to three times as fast as BPF on com-plex filtering criteria. Swift also performs much closer tooptimized C filters than BPF.

There are many avenues we would like to further ex-periment and exploit in Swift. For instance, we will ex-plore the multi-thread expansion of Swift, and developa hardware optimized filter engine. We will make use ofthe extra registers supplied in the x8664 processors forfurther performance improvement. Moreover, we envi-sion that x86 high performance multimedia instructions(such as MMX and SSE) can also be used to acceleratethe packet processing.

AcknowledgmentsWe are very grateful to our shepherd Eddie Kohler andthe anonymous reviewers for their insightful and detailedcomments, which have greatly improved the quality ofthe paper. This work was partially supported by NSFgrants CNS-0627339 and CNS-0627340.

References[1] J. Apisdorf, k claffy, K. Thompson, and R. Wilder. OC3MON:

Flexible, affordable, high performance statistics collection. InProc. USENIX LISA’96, pages 97–112, 1996.

[2] M. L. Bailey, B. Gopal, M. A. Pagels, L. L. Peterson, andP. Sarkar. PathFinder: A pattern-based packet classifier. In Proc.USENIX OSDI’94, pages 115–123, 1994.

[3] A. Begel, S. McCanne, and S. L. Graham. BPF+: Exploitingglobal data-flow optimization in a generalized packet filterarchi-tecture. InProc. ACM SIGCOMM’99, pages 123–134, 1999.

[4] H. Bos, W. de Bruijn, M. Cristea, T. Nguyen, and G. Portokalidis.FFPF: Fairly fast packet filters. InProc. USENIX OSDI’04, pages347–363, 2004.

[5] J. Cleary, S. Donnelly, I. Graham, A. McGregor, and M. Pearson.Design principles for accurate passive measurement. InProc. ofIEEE PAM’00, pages 1–8, 2000.

[6] J. Coppens, E. Markatos, J. Novotny, M. Polychronakis, V. Smot-lacha, and S. Ubik. Scampi - a scaleable monitoring platformfor the internet. InProc. 2nd Int’l Workshop on Inter-DomainPerformance and Simulation, 2004.

[7] H. Dreger, A. Feldmann, V. Paxson, and R. Sommer. Operationalexperiences with high-volume network intrusion detection. InProc. ACM CCS’04, pages 2–11, 2004.

[8] D. R. Engler and M. F. Kaashoek. DPF: Fast, flexible messagedemultiplexing using dynamic code generation. InProc. ACMSIGCOMM’96, pages 53–59, 1996.

[9] J. M. Gonzalez, V. Paxson, and N. Weaver. Shunting: A hard-ware/software architecture for flexible, high-performance net-work intrusion prevention. InProc. ACM CCS’07, pages 139–149, 2007.

[10] S. Ioannidis, K. G. Anagnostakis, J. Ioannidis, and A. D.Keromytis. xPF: Packet filtering for low-cost network monitor-ing. In Proc. IEEE HPSR’02, pages 121–126, 2002.

[11] V. Jacobson, C. Leres, and S. McCanne. Tcpdump(1).Unix Man-ual Page, 1990.

[12] S. Kornexl, V. Paxson, H. Dreger, A. Feldmann, and R. Sommer.Building a time machine for efficient recording and retrieval ofhigh-volume network traffic. InProc. ACM/USENIX IMC 2005,pages 267–272, 2005.

[13] G. R. Malan and F. Jahanian. An extensible probe architecturefor network protocol performance measurement. InProc. ACMSIGCOMM’98, pages 215–227, 1998.

[14] S. McCanne and V. Jacobson. The BSD packet filter: A newarchitecture for user-level packet capture. InProc. 1993 WinterUSENIX Technical Conference, pages 259–269, 1993.

[15] S. McCanne, C. Leres, and V. Jacobson. Libpcap. Available athttp://www.tcpdump.org/. Lawrence Berkeley Labora-tory, Berkeley, CA.

[16] J. C. Mogul, R. F. Rashid, and M. J. Accetta. The packet filter:An efficient mechanism for user-level network code. InProc.11th ACM SOSP, pages 39–51, 1987.

[17] A. Moore, J. Hall, C. Kreibich, E. Harris, and I. Pratt. Archi-tecture of a network monitor. InProceedings of IEEE PAM’03,2003.

[18] U. of Waikato. The DAG project. Available athttp://dag.cs.waikato.ac.nz/.

[19] C. Partridge, A. C. Snoeren, W. T. Strayer, B. Schwartz,M. Con-dell, and I. Castineyra. FIRE: Flexible intra-AS routing environ-ment. InProc. SIGCOMM’00, pages 191–203, 2000.

[20] V. Paxson. Bro: A system for detecting network intruders inreal-time.Computer Networks, 31(23-24):2435–2463, December1999.

[21] S. Saroiu, S. D. Gribble, and H. M. Levy. Measurement andanal-ysis of spyware in a university environment. InProc. USENIXNSDI’04, pages 141–153, 2004.

[22] H. Song and J. W. Lockwood. Efficient packet classification fornetwork intrusion detection using fpga. InProc. ACM FPGA’05,pages 238 – 245, 2005.

[23] A. Turner. Tcpreplay.http://tcpreplay.synfin.net/trac/.

[24] J. van der Merwe, R. Caceres, Y. hua Chu, and C. Sreenan. mm-dump - a tool for monitoring internet multimedia traffic.ACMComputer Communication Review, 30(4), October 2000.

[25] G. Varghese. Network algorithmics - an interdisciplinary ap-proach to designing fast networked devices. InMorgan Kauf-mann Publishers, 2005.

[26] M. Yuhara, B. N. Bershad, C. Maeda, and J. E. B. Moss. Efficientpacket demultiplexing for multiple endpoints and large messages.In Proc. 1994 Winter USENIX Technical Conference, pages 153–165, 1994.

Swift: A Fast Dynamic Packet Filterhnw/paper/swift.pdf · The CMU/Stanford Packet Filter (CSPF)...

Documents

Transcript of Swift: A Fast Dynamic Packet Filterhnw/paper/swift.pdf · The CMU/Stanford Packet Filter (CSPF)...