
Architectural Support for Software-Defined Metadata Processing

Udit Dhawan1 Catalin Hritcu2 Raphael Rubin1 Nikos Vasilakis1 Silviu Chiricescu3

Jonathan M. Smith1 Thomas F. Knight Jr.4 Benjamin C. Pierce1 André DeHon1

1 University of Pennsylvania   2 INRIA, Paris   3 BAE Systems   4 Ginkgo Bioworks
Contact author: [email protected]

Abstract

Optimized hardware for propagating and checking software-programmable metadata tags can achieve low runtime overhead. We generalize prior work on hardware tagging by considering a generic architecture that supports software-defined policies over metadata of arbitrary size and complexity; we introduce several novel microarchitectural optimizations that keep the overhead of this rich processing low. Our model thus achieves the efficiency of previous hardware-based approaches with the flexibility of the software-based ones. We demonstrate this by using it to enforce four diverse safety and security policies—spatial and temporal memory safety, taint tracking, control-flow integrity, and code and data separation—plus a composite policy that enforces all of them simultaneously. Experiments on SPEC CPU2006 benchmarks with a PUMP-enhanced RISC processor show modest impact on runtime (typically under 10%) and power ceiling (less than 10%), in return for some increase in energy usage (typically under 60%) and area for on-chip memory structures (110%).

Categories and Subject Descriptors C.1 [Processor Architecture]: Miscellaneous—security

Keywords security, metadata, tagged architecture, CFI, taint tracking, memory safety

1. Introduction

Today’s computer systems are notoriously hard to secure, and conventional processor architectures are partly to blame, admitting behaviors (pointer forging, buffer overflows, . . . ) that blatantly violate higher-level abstractions. The burden of closing the gap between programming language and hard-

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ASPLOS 2015, March 14–18, 2015, Istanbul, Turkey.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2835-7/15/03. . . $15.00.
http://dx.doi.org/10.1145/2694344.2694383

ware is left to software, where the cost of enforcing airtight abstractions is often deemed too high.

Several recent efforts [2, 37, 53, 57] have demonstrated the value of propagating metadata during execution to enforce policies that catch safety violations and malicious attacks as they occur. These policies can be enforced in software [3, 8, 28, 48, 69], but typically with high overheads that discourage their deployment or motivate coarse approximations providing less protection [23, 31]. Hardware support for fixed policies can often reduce the overhead to acceptable levels and prevent a large fraction of today’s attacks [19, 25, 40, 63]. Following this trend, Intel recently announced hardware for bounds checking [36] and isolation [42]. While these mitigate many of today’s attacks, fully securing systems will require more than memory safety and isolation. Some needs we can already identify (like control-flow integrity and information flow control), but the complete set remains unknown. Attacks rapidly evolve to exploit any remaining forms of vulnerability. What is needed is a flexible security architecture that can be quickly adapted to this ever-changing landscape. Some recent designs have made the hardware metadata computation configurable [24, 66] but have limited bits to represent metadata and only support a limited class of policies. A natural question, then, is: Is it possible to provide hardware to support extensible, software-defined metadata processing with low overhead? In particular, in the spirit of the 0-1-∞ rule, can we efficiently support fine-grained, software-defined metadata propagation without placing a visible, hard bound on the number of bits allocated to metadata or a bound on the number of policies simultaneously enforced?

To achieve this goal, we introduce a rich architectural model, the Programmable Unit for Metadata Processing (PUMP), that indivisibly associates a metadata tag with every word in the system’s main memory, caches, and registers. To support unbounded metadata, the tag is large enough to indirect to a data structure in memory. On every instruction, the tags of the inputs are used to determine if the operation is allowed, and if so, to determine the tags for the results. The tag checking and propagation rules are defined in software; however, to minimize performance impact, these rules are cached in a hardware structure, the PUMP rule cache, that


operates in parallel with the ALU. A software miss handler services cache misses based on the policy rule set currently in effect.

We measure the performance impact of the PUMP using a composition of four different policies (Tab. 1) that stress the PUMP in different ways and illustrate a range of security properties: (1) a Non-Executable Data and Non-Writable Code (NXD+NWC) policy that uses tags to distinguish code from data in memory and provides protection against simple code injection attacks; (2) a Memory Safety policy that detects all spatial and temporal violations in heap-allocated memory, extending [17] with an effectively unlimited (2^60) number of colors (“taint marks” in [17]); (3) a Control-Flow Integrity (CFI) [3] policy that restricts indirect control transfers to only the allowed edges in a program’s control-flow graph, preventing return-oriented-programming-style attacks [59] (we enforce fine-grained CFI [3, 50], not coarse-grained approximations [20, 72] that are potentially vulnerable to attack [23, 31]); and (4) a fine-grained Taint Tracking policy (generalizing [49]) where each word can potentially be tainted by multiple sources (libraries and IO streams) simultaneously. Since these are well-known policies whose protection capabilities have been established in the literature, we focus only on measuring and reducing the performance impact of enforcing them using the PUMP. Except for NXD+NWC, each of these policies needs to distinguish an essentially unlimited number of unique items; by contrast, solutions with a limited number of metadata bits can, at best, support only grossly simplified approximations.

As might be expected, a simple, direct implementation of the PUMP is rather expensive. Adding pointer-sized (64b) tags to 64b words at least doubles the size and energy usage of all the memories in the system; rule caches add area and energy on top of this. For this simple implementation, we measured an area overhead of 190% and a geomean energy overhead around 220%; moreover, runtime overhead is disappointing (over 300%) on some applications (see §3). Such high overheads would discourage adoption, if they were the best we could do.

However, we find that most policies exhibit spatial and temporal locality for both tags and the rules defined over them. The number of unique rules can be significantly reduced by defining them over a group of identical instructions, reducing compulsory misses and increasing the effective capacity of the rule caches. Off-chip memory traffic can be reduced by exploiting spatial locality in tags. On-chip area and energy overhead can be minimized by using a small number of bits to represent the subset of the pointer-sized tags in use at a time. Runtime costs of composite policy miss handlers can be decreased by providing hardware support for caching component policies. These optimizations allow the PUMP to achieve low overheads without compromising its rich policy model.

The main contributions of this work are (a) a metadata processing architecture supporting unbounded metadata and fully software-defined policies (§2); (b) an empirical evaluation (§3) of the runtime, energy, power ceiling, and area impacts of a simple implementation of the PUMP integrated with an in-order Alpha microarchitecture (§2) on the SPEC CPU2006 benchmark set under four diverse policies and their combination; (c) a set of microarchitectural optimizations (§4); and (d) an assessment of their impact (§5), showing typical runtime overhead under 10%, a power ceiling impact of 10%, and typical energy overhead under 60%, using 110% additional area for on-chip memory structures. Tab. 2 summarizes related work, which is discussed in §6; §7 concludes and highlights future work.

2. The PUMP

We illustrate the PUMP as an extension to a conventional RISC processor (we use an Alpha [1], but the mechanisms are not ISA-specific) and an in-order implementation with a 5-stage pipeline suitable for energy-conscious applications [10]. In this section, we describe the ISA-level extensions that constitute the PUMP’s hardware interface layer, the basic microarchitectural changes, and the accompanying low-level software.

Metadata Tags Every word in a PUMP system is associated with a pointer-sized tag. These tags are uninterpreted in the hardware. At the software level, a tag may represent metadata of unbounded size and complexity, as defined by the policy. Simple policies that need only a few bits of metadata may store it directly in the tag; if more bits are needed, indirection is used to store the metadata as a data structure in memory, with the address of this structure used as the tag. The basic addressable word is indivisibly extended with a tag, making all value slots, including memory, caches, and registers, suitably wider. The program counter is also tagged. This notion of software-defined metadata and its representation as a pointer-sized tag extends previous tagging approaches, where only a few bits are used for tags and/or they are hardwired to fixed interpretations (see Tab. 2).

All tag manipulation is defined and implemented with PUMP rules. Metadata tags are not addressable by user programs, only by software policy handlers invoked on rule cache misses, as detailed below.

Tag-propagation rules We define metadata computations in terms of rules of the form

    opcode : (PC, CI, OP1, OP2, MR) ⇒ (PCnew, R)

which should be read: “If the current opcode is opcode, the current tag on the program counter is PC, the tag on the current instruction is CI, the tags on its input operands (if any) are OP1 and OP2, and the tag on the memory location (in case of load/store) is MR, then the tag on the program counter in the next machine state should be PCnew and the tag on the instruction’s result (a destination register or a memory location, if any) should be R”. Instead of (PCnew, R), the function can also fail, indicating that the operation is not allowed. The PUMP can enforce any pol-


Policy | Threat | Metadata | Max Unique Tags | Tag Check (allow?) | Tag Propagation Rules | Ref.
NXD+NWC | code injection | DATA or CODE on memory locations | 2 | CI = CODE; MR ≠ CODE on write | none | 
Memory Safety | spatial/temporal memory-safety violation on heap | color on pointers; region color + payload color on memory locations | (#mallocs)^2 | pointer color == referenced region color | R ← OPi on mov/add/sub; R ← payload(MR) on load | [17]
CFI | control-flow hijacking (JOP/ROP/code reuse) | unique id on each indirect control-flow source and target | #sources + #targets | (PC, CI) ∈ program’s control-flow graph | PC ← CI on indirect jumps (including call, return) | [3]
Taint Tracking | untrusted code, low-integrity data from IO | source-taint-id on instructions and IO; set of input taint-ids on words | 2^(#code + #IO ids) | user-defined check | R ← CI ∪ OP1 ∪ OP2 ∪ MR | [49]
Composite | all of the above | structure with 4 elements | product of the above | all of the above | Alg. 2 with N = 4 | this paper

Table 1: Summary of Investigated Policies (simplified for brevity; details not critical to understand architecture and optimizations)

Tag Bits | Prop? | allow? | R | PCnew | PC | CI | OP1 | OP2 | MR | Usage (Example)
2 | ✗ | soft | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | memory protection (Mondrian [67])
word | ✗ | limited prog. | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | memory hygiene, stack, isolation (SECTAG [5])
32 | ✗ | limited prog. | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | unforgeable data, isolation (Loki [71])
2 | ✗ | fixed | fixed | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | fine-grained synchronization (HEP [61])
1 | ✓ | fixed | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | capabilities (IBM System/38 [34], Cheri [68])
2–8 | ✓ | fixed | fixed | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | types (Burroughs B5000, B6500/7500 [51], LISP Machine [44], SPUR [64])
128 | ✓ | fixed | copy | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | memory safety (HardBound [25], Watchdog [46, 47])
0 | ✓ | software defined | ✗ | propagates only one input | invariant checking (LBA [14])
1 | ✓ | fixed | fixed | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | taint (DIFT [63], [15], Minos [19])
4 | ✓ | limited prog. | limited prog. | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | taint, interposition, fault isolation (Raksha [22])
10 | ✓ | limited prog. | fixed | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | taint, isolation (DataSafe [16])
unspec. | ✓ | software defined | software defined | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | flexible taint (FlexiTaint [66])
32 | ✓ | software defined | software defined | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | programmable, taint, memory checking, reference counting (Harmoni [24])
0–64 | ✓ | software defined | software defined | software defined | ✓ | ✓ | ✓ | ✓ | ✓ | information flow, types (Aries [11])
Unbounded | ✓ | software defined | software defined | software defined | ✓ | ✓ | ✓ | ✓ | ✓ | fully programmable, pointer-sized tags (PUMP)

Table 2: Taxonomy of Tagging Schemes (Prop? = tag propagates with data (✓) or is a property of the memory address (✗); allow? = logic for allowing/disallowing operations; R = policy impacts tag of result; PCnew = programmable tag on program counter; PC, CI, OP1, OP2, MR = is the tag on this input checked or propagated)

icy that can be expressed as a function of this form. The mapping should be “purely functional”; i.e., it should neither depend on nor modify any mutable state. The policy function is defined and resolved entirely in software (see App. A [27] for an example). This rule format, allowing two output tags to be computed from up to five input tags, is markedly more flexible than those considered in prior work (see Tab. 2), which typically compute one output from up to two inputs. It should also generalize easily across most RISC architectures. Beyond previous solutions that only track data tags (OP1, OP2, MR, R), we add a current instruction tag (CI) that can be used to track and enforce provenance, integrity, and usage of code blocks, as well as a PC tag that can be used to record execution history [4], ambient authority [43], and “control state” including implicit information flows [39]. The CFI policy (Tab. 1) exploits the PC tag for recording the sources of indirect jumps and the CI tag for identifying jump targets, NXD+NWC leverages the CI tag to enforce that data is not executable, and Taint Tracking uses the CI tag to taint data based on the code that produced it.
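As an illustration, the rule form above can be modeled as a partial function from an opcode and five input tags to two output tags. The sketch below is a loose, hypothetical rendering of the NXD+NWC policy from Tab. 1 (names and the simplified result-tagging are our own, not the paper’s implementation):

```python
# Hypothetical model of a PUMP policy function: it maps an opcode and the
# five input tags (PC, CI, OP1, OP2, MR) to output tags (PCnew, R), or
# fails when the operation is not allowed.
CODE, DATA = "CODE", "DATA"

class PolicyViolation(Exception):
    pass

def nxd_nwc_policy(op, pc, ci, op1, op2, mr):
    """NXD+NWC sketch: only CODE-tagged words execute; CODE is never overwritten."""
    if ci != CODE:                    # executing a word not tagged as code
        raise PolicyViolation("non-executable data")
    if op == "store" and mr == CODE:  # overwriting a code word
        raise PolicyViolation("non-writable code")
    return (pc, DATA)                 # simplification: results are plain DATA

# A store over DATA is allowed; a store over CODE faults.
assert nxd_nwc_policy("store", None, CODE, None, None, DATA) == (None, DATA)
```

Failure is modeled here as an exception, matching the text: disallowed operations never produce output tags and instead trap to software.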

Rule Cache The other main feature of the PUMP is hardware support for single-cycle common-case computation on metadata. To resolve the rules in a single cycle in the common case, we provide a hardware cache of the most recently used rules; as long as we hit in this cache, we do not add any extra cycles. Fig. 1 shows the microarchitecture of our current rule cache implementation. Specifically, the rule cache does nothing more than perform an associative mapping from the instruction opcode and the five input tags to the two output tags. Since rules are purely functional, the rule cache can directly map between pointer tag inputs and pointer tag outputs without dereferencing the pointers or examining the metadata structures they point to. Failure cases are never inserted into the PUMP rule cache, since they must transfer control to software cleanup.
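Because rules are pure functions of the opcode and input tags, the rule cache behaves like a memo table keyed on pointer values. A minimal functional sketch (names hypothetical; the hardware, of course, uses an associative memory rather than a dictionary):

```python
# Minimal model of the PUMP rule cache: an associative map from
# (opcode, PC, CI, OP1, OP2, MR) to (PCnew, R). Hits return immediately;
# misses fall back to the software policy function, and the resulting rule
# is installed. Failures propagate as exceptions and are never cached.
def make_cached_policy(policy):
    cache = {}
    def lookup(op, pc, ci, op1, op2, mr):
        key = (op, pc, ci, op1, op2, mr)
        if key in cache:          # single-cycle hit in hardware
            return cache[key]
        result = policy(*key)     # miss handler: run the policy in software
        cache[key] = result       # install the rule for future hits
        return result
    return lookup, cache
```

Note that the cache compares only tag values (pointers), never the metadata structures behind them, which is exactly why purity of the policy function matters.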

While the PUMP can, in principle, implement any policy function, the rule cache can only effectively accelerate policy functions that generate small working sets and hence can be cached well, including: (i) policies that use only a limited number of distinct tags, and hence rules, and (ii) policies where the potential universe of tags is large but the set of tags actually encountered is still modest (e.g., library and file taint-tracking with up to 34 unique taints could see 2^34 distinct sets in principle, but only about 5–10,000 are seen in practice). Good performance depends upon temporal locality, just as good performance in the instruction and data caches also depends on temporal locality. As with data caches and virtual memory, the PUMP allows the policy programmer to express the policy without being concerned with arbitrary hard limits on the number of tags or rules. If the policies do have (or mostly have) small tag and rule sets, the policy will perform well. If the policies occasionally need to use more tags or rules, performance degrades only in proportion to the cases where many rules are required. If the policy


[Figure 1: PUMP Rule Cache Dataflow and Microarchitecture. The inputs PC, CI, OP1, OP2, and MR pass through a don’t-care mask into an associative memory (details show a dMHC with Hash 1/Hash 2 and G1/G2/MR memories), producing a Hit/Miss signal and the output tags PC and R.]

[Figure 2: L1 PUMP Rule Cache in a Processor Pipeline. The stages Fetch (L1-I$ with tags), Decode (register file with tags), Execute (don’t-care mask), Memory (L1-D$ with tags, L1 PUMP-$), and Writeback (PUMP match) operate on a tagged PC.]

writer develops a policy with poor locality, the rule cache will thrash, just as a program with poor locality will thrash virtual memory and the instruction and data caches. The experimental data in this paper shows that the PUMP rule caches can be engineered to efficiently support four quite different example policies, and their composition.

Depending on the instruction and policy, one or more of the input slots in a rule may be unused. For example, a register-to-register add will never depend on the value of the unused MR tag. To avoid polluting the cache with rules for all possible values of the unused slots, the rule-cache lookup logic refers to a bit vector containing a don’t-care bit for each input slot–opcode pair (shown in the Execute stage in Fig. 2), which determines whether the corresponding tag is actually used in the rule cache lookup. To handle these “don’t care” inputs efficiently, we mask them out before presenting the inputs to the PUMP rule cache (AND gates in Fig. 1). The don’t-care bit vectors are set by a privileged instruction as part of the miss handler installation. The implementation details of the rule cache are presented in §3.
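The effect of the don’t-care vector can be sketched as forcing unused input slots to a fixed value before the lookup, so that one cached rule covers every value of the irrelevant inputs. The opcodes and mask bits below are purely illustrative:

```python
# Masking don't-care inputs before the rule cache lookup. Each opcode has
# one bit per input slot (PC, CI, OP1, OP2, MR); slots marked don't-care
# are forced to None so that a single cached rule matches all values of
# the unused inputs (e.g., MR for a register-to-register add).
DONT_CARE = {
    "add":  (False, False, False, False, True),   # MR unused (hypothetical)
    "load": (False, False, False, True,  False),  # OP2 unused (hypothetical)
}

def mask_inputs(op, tags):
    mask = DONT_CARE.get(op, (False,) * 5)
    return tuple(None if dc else t for t, dc in zip(tags, mask))

# Two adds with different (irrelevant) MR tags produce the same cache key.
k1 = ("add",) + mask_inputs("add", ("P", "C", "a", "b", "m1"))
k2 = ("add",) + mask_inputs("add", ("P", "C", "a", "b", "m2"))
assert k1 == k2
```

This mirrors the AND-gate masking in Fig. 1: without it, every distinct MR value would occupy its own rule cache entry.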

Pipeline Integration Fig. 2 shows how we revise the 5-stage processor pipeline to incorporate the PUMP hardware. We add the rule cache lookup as an additional stage and bypass tag and data independently so that the PUMP stage does not create additional stalls in the processor pipeline.

Placing the PUMP as a separate stage is motivated by the need to provide the tag on the word read from memory (load), or to be overwritten in memory (store), as an input to the PUMP. Since we allow rules that depend on the existing tag of the memory location that is being written, write operations become read-modify-write operations. The existing tag is read during the Memory stage like

Algorithm 1 Taint Tracking Miss Handler (Fragment)

1: switch (op)
2:   case add, sub, or:
3:     PCnew ← PC
4:     R ← canonicalize(CI ∪ OP1 ∪ OP2)
5:   (... cases for other instructions omitted ...)
6:   default: trap to error handler

a read, the rule is checked in the PUMP stage, and the write is performed during the Commit stage. As with any caching scheme, we can use multiple levels of caches for the PUMP; in this paper, we use two.

Miss Handler When a last-level miss occurs in the rule cache, it is handled as follows: (i) the current opcode and tags are saved in a (new) set of processor registers used only for this purpose, and (ii) control is transferred to the policy miss handler (described in more detail below), which (iii) invokes the policy function to decide if the operation is allowed and, if so, generates an appropriate rule. When the miss handler returns, the hardware (iv) installs this rule into the rule caches, and (v) re-issues the faulting instruction. To provide isolation between the privileged miss handler and the rest of the system software and user code, we add a miss-handler operational mode to the processor, controlled by a bit in the processor state that is set on a last-level rule cache miss and reset when the miss handler returns. To avoid the need to save and restore registers on every miss handler invocation, we expand the integer register file with 16 additional registers that are available only to the miss handler. Additionally, the rule inputs and outputs appear as registers while in miss handler mode (cf. register windows [52]), allowing the miss handler (but nothing else) to manipulate the tags as ordinary values.

We add a new miss-handler-return instruction to finish installing the rule into the PUMP rule caches and return to user code; this instruction can only be issued when in miss-handler mode. While in miss-handler mode, the rule cache is ignored and the PUMP instead applies a single, hardwired rule: all instructions and data touched by the miss handler must be tagged with a predefined MISSHANDLER tag, and all instruction results are given the same tag. In this way, the PUMP architecture prevents user code from undermining the protection provided by the policy. User code cannot: (1) divide, address, or replace tags; (2) touch metadata data structures and miss handler code; or (3) directly insert rules into the rule cache.

Alg. 1 illustrates the operation of the miss handler for our Taint-Tracking policy. To minimize the number of distinct tags (and hence rules), the miss handler takes care to use a single tag for logically equivalent metadata by “canonicalizing” any new data structures that it builds. While it would be functionally correct to create a new data structure, with a new metadata pointer, for the set resulting from the unions


Algorithm 2 N-Policy Miss Handler

1: for i = 1 to N do
2:   Mi ← {op, PC[i], CI[i], OP1[i], OP2[i], MR[i]}
3:   {pci, resi} ← policyi(Mi)
4: PCnew ← canonicalize([pc1, pc2, ..., pcN])
5: R ← canonicalize([res1, res2, ..., resN])

in Line 4, it is important for performance to make sure that equivalent sets are assigned the same tag—i.e., the same canonical name—to increase reuse and sharing of rules in the rule cache, thereby increasing its effectiveness.
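One way to realize canonicalize is to intern metadata structures, so that structurally equal sets always yield the same representative and hence the same tag. A hypothetical sketch (the paper does not prescribe this data structure):

```python
# Interning-based canonicalization: structurally equal taint sets map to
# the same stored object, so they act as the same tag and rules defined
# over them coincide in the rule cache.
_interned = {}

def canonicalize(taints):
    key = frozenset(taints)
    # Reuse the first representative ever seen for this set of taints.
    return _interned.setdefault(key, key)

t1 = canonicalize({"libc", "stdin"})
t2 = canonicalize({"stdin", "libc"})  # same set, built differently
assert t1 is t2                        # identical tag, so rules are shared
```

Without interning, each union in Alg. 1 would mint a fresh pointer, and the rule cache would fill with rules that differ only in equivalent-but-distinct tags.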

Rather than forcing users to choose a single policy, we want to enforce multiple policies simultaneously and add new ones later. An advantage of our “unbounded” tags is that we can in principle enforce any number of policies at the same time. This can be achieved by letting tags be pointers to tuples of tags from several component policies. For example, to combine the NXD+NWC policy with the taint-tracking policy, we can let each tag be a pointer to a tuple (s, t), where s is an NXD+NWC tag (either DATA or CODE) and t is a taint tag (a pointer to a set of taints). The rule cache lookup is exactly the same, but when a miss occurs, both component policies are evaluated separately: the operation is allowed only if both policies allow it, and the resulting tags are pairs of results from the two component policies. Alg. 2 shows the general behavior of the composite miss handler for any N policies. Depending on how correlated the tags in the tuple are, this could result in a large increase in the number of tags and hence rules. In order to demonstrate the ability to support multiple policies simultaneously and measure its effect on working set sizes, we implement the composite policy (“Composite”) comprising all four policies described in §1. The Composite policy represents the kind of policy workloads we would like to support; hence we use it for most of the experiments described later.
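The composition in Alg. 2 can be sketched by letting a composite tag be a tuple with one component per policy; the operation is allowed only if every component policy allows its slice of the inputs. This is an illustrative model, not the paper’s code, and it omits the canonicalization of the result tuples:

```python
# N-policy composition: a composite tag is a tuple, one component per
# policy. On a miss, each policy sees only its own components; a failure
# in any component (raised as an exception) rejects the whole operation.
def compose(policies):
    def composite(op, pc, ci, op1, op2, mr):
        pcs, results = [], []
        for i, policy in enumerate(policies):
            def proj(t, i=i):            # None models an unused input slot
                return None if t is None else t[i]
            pc_i, r_i = policy(op, proj(pc), proj(ci), proj(op1), proj(op2), proj(mr))
            pcs.append(pc_i)
            results.append(r_i)
        # Alg. 2 additionally canonicalizes these tuples (lines 4-5).
        return tuple(pcs), tuple(results)
    return composite
```

Because the rule cache keys on the composite tag itself, the cached lookup path is unchanged; only the miss handler knows the tag is a tuple.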

Most policies will dispatch on the opcode to select the appropriate logic. Some, like NXD+NWC, will just check whether the operation is allowed. Others may consult a data structure (e.g., the CFI policy consults the graph of allowed indirect call and return ids). Memory Safety checks equality between address and memory region colors. Taint Tracking computes fresh result tags by combining the input tags (Alg. 1). Policies that must access large data structures (CFI) or canonicalize across large aggregates (Taint Tracking, Composite) may make many memory accesses that could miss in the on-chip caches and go to DRAM. On average across all of our benchmarks, servicing misses required 30 cycles for NXD+NWC, 60 for Memory Safety, 85 for CFI, 500 for Taint Tracking, and 800 for Composite.

If the policy miss handler determines that the operation is not allowed, it invokes a suitable security fault handler. What this fault handler does is up to the runtime system and the policy; typically, it would shut down the offending process, but in some cases it might return a suitable “safe value” instead [35, 54]. For incremental deployment with UNIX-style [55] operating systems, we assume policies are applied per process, allowing each process to get a different set of policies. This also allows us to place the tags, rules, and miss-handling support into the address space of the process; together with the miss-handler operational mode, this allows fast transitions to and from the miss-handler software without the software bookkeeping, data copying, or management of address-translation tables that would be necessary for an OS-level context switch. Longer term, perhaps PUMP policies can be used to protect the OS as well.

3. Evaluation of a Simple Implementation

We next describe our evaluation methodology for measuring runtime, energy, area, and power overheads and apply it to a simple implementation of the PUMP hardware and software, using 128b words (64b payload and 64b “integrated” tag [38]) and the modified pipeline sketched in Fig. 2. Although the optimized implementation of §4 is the one whose overheads (relative to the baseline processor) we really care about, it is useful to describe and measure the simple PUMP implementation first, both because it gives us a chance to show basic versions of the key mechanisms before getting to more sophisticated versions and because it allows us to confirm the natural intuition that the simple implementation does not perform very well.

Resource Estimates To estimate the physical resource impact of the PUMP, we focus on memory costs, since the memories are the dominant area and energy consumers in a simple RISC processor [21] and in the PUMP hardware extensions. We consider a 32 nm Low Operating Power (LOP) process for the L1 memories and Low Standby Power (LSTP) for the L2 memories, and we use CACTI 6.5 [45] for modeling the area, access time, energy per access, and static (leakage) power of the main memory and the processor on-chip memories.

Baseline Processor: Our (no-PUMP) baseline processor has separate 64KB L1 caches for data and instructions and a unified 512KB L2 cache. We use delay-optimized L1 caches and an energy-optimized L2 cache. All caches use a write-back discipline. The baseline L1 cache has a latency around 880 ps; we assume that it can return a result in one cycle and set its clock to 1 ns, giving us a 1 GHz cycle target, comparable to modern embedded and cell phone processors. Tab. 3 shows the parameters for this processor.

PUMP Implementation: The PUMP hardware implementation has two parts: extending all memories with integrated tags, and adding PUMP rule caches to the processor. Extending each 64b word in the on-chip memories with a 64b tag increases their area and energy per access and worsens their access latency. This is potentially tolerable for the L2 cache, which already has a multi-cycle access latency and is not used every cycle. But adding an extra cycle of latency to ac-

Page 6: Architectural Support for Software-Defined Metadata Processing

Unit Design Organization Area Access Energy Static Power Latency Cyc(mm2) Read/Write (pJ) (pJ/cyc) (ps)

Register File Baseline 64b, 2R1W, {32 Integer, 32 Floating} 0.002 0.3/0.5 0.080 264 1(Int. and FP) 64b tag extended 128b, 2R1W, {48 Integer, 32 Floating} 0.007 1.0/1.4 0.23 295 1

L1-$ Baseline 64KB, 4-way, 64B/line 0.236 17/11 14.4 880 1(I and D) 64b tag extended 64KB, 4-way, 128B/line (eff. 32KB, 64B/line) 0.244 19/14 14.5 880 1

L2-$ Baseline 512KB, 8-way, 64B/line 1.207 393/481 0.111 4000 5(unified) 64b tag extended 1MB, 8-way, 128B/line (eff. 512KB, 64B/line) 2.350 758/1223 0.214 4930 5

TLB Either 1KB, 2-way set-assoc. 0.040 3.6/4.5 2.0 800 1(I and D)DRAM Baseline 1GB, access 64B line per transfer - 15,000 - - 100

64b tag extended 1GB, access 128B line (eff. 64B line per transfer) - 31,000 - - 130L1 PUMP-$ 64b tag fully associative 1024 entry, 328b match, 128b out 1.500 750/900 3000 4

(not used; shown only for reference)1024 entry, 328b match, 128b out Fast-Value dMHC(4,2) 0.683 51/62 32.0 500 1

L2 PUMP-$ 64b tag 4096 entry, 328b match, 128b out 2-level dMHC(4,2) 0.994 173/444 0.085 3300 4Total Baseline Area 1.485mm2

Total 64b-tagged Area 4.318mm2 (+190% over baseline)

Table 3: Memory Resource Estimates for the Baseline and Simple PUMP-extended Processors at 32nm node

cess the L1 caches can lead to stalls in the pipeline. To avoidthis, in this simple implementation we reduce the effectivecapacity of our L1 caches to half of those in the baselinedesign and then add tags; this gives the same single-cycleaccess to the L1 caches, but can degrade performance due toincreased misses.

The PUMP rule caches require a long match key (5 pointer-sized tags plus an instruction opcode, or 328b) compared to a traditional cache address key (less than the address width), and they return a 128b result. Using a fully associative L1 rule cache would lead to high energy and delay (Tab. 3). Instead, we use a four-hash-function instance of a dynamic Multi-Hash Cache (dMHC) [26], a near-associative multi-hash cache scheme. The L1 rule cache, illustrated concretely with a two-hash-function instance in Fig. 1, is designed to produce a result in a single cycle, checking for a false hit (MR Mem lookup and comparison) in the second cycle, while the L2 rule cache is designed for low energy, giving a multi-cycle access latency. Entries are added to the dMHC in such a way that the XOR of the values stored in the hashed memory locations (G1 and G2 in Fig. 1) produces the desired output. As long as one of the hashed table entries is empty, a new rule mapping can be added without evicting or displacing existing rule mappings. Tab. 3 shows the parameters for the 1024-entry L1 and 4096-entry L2 rule caches used in the simple implementation. When these caches reach capacity, we use a simple FIFO replacement policy, which appears to work well in practice for our current workloads (FIFO is within 6% of LRU here).
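The XOR-based insertion property can be sketched in software. In this toy two-hash version, the hash functions and table size are our own illustrative choices (not the hardware's hash family), and we omit the second-cycle false-hit validation against the MR memory that the real design performs:

```c
#include <assert.h>
#include <stdint.h>

#define TBL 256  /* entries per hash table (toy size) */

static uint64_t g1[TBL], g2[TBL];
static uint8_t used1[TBL], used2[TBL];

/* toy hash functions standing in for the dMHC's hardware hashes */
static unsigned h1(uint64_t key) { return (unsigned)((key * 0x9e3779b97f4a7c15ULL) >> 56) % TBL; }
static unsigned h2(uint64_t key) { return (unsigned)((key * 0xc2b2ae3d27d4eb4fULL) >> 48) % TBL; }

/* Install so that g1[h1(key)] ^ g2[h2(key)] == value without disturbing
 * existing mappings, as long as one of the two hashed slots is free. */
static int dmhc_insert(uint64_t key, uint64_t value) {
    unsigned a = h1(key), b = h2(key);
    if (!used1[a])      { g1[a] = value ^ g2[b]; used1[a] = 1; return 1; }
    else if (!used2[b]) { g2[b] = value ^ g1[a]; used2[b] = 1; return 1; }
    return 0;  /* both slots taken: the real cache would evict/miss here */
}

/* Candidate result; the hardware validates it before declaring a hit. */
static uint64_t dmhc_lookup(uint64_t key) {
    return g1[h1(key)] ^ g2[h2(key)];
}
```

Because a later insertion only writes a slot that was previously unused, earlier XOR relationships remain intact, which is what lets new rules be added without displacing cached ones.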

Evaluation Methodology To estimate the performance impact of the PUMP, we use a combination of ISA, PUMP, and address-trace simulators (Fig. 3). We use the gem5 simulator [9] to generate instruction traces for the SPEC CPU2006 [33] programs on a 64-bit Alpha baseline ISA (omitting xalancbmk and tonto, on which gem5 fails). We simulate each program for each of the four policies listed in §1 and the Composite policy for a warm-up period of 1B instructions and then evaluate the next 500M instructions. In gem5, each benchmark is run on the baseline processor

Figure 3: PUMP Evaluation Framework. (Diagram: gem5 Alpha-ISA simulation takes the program and architecture parameters and produces an instruction trace; a PUMP simulator written in C applies the policy and PUMP parameters to that trace, producing miss-handler invocations that are themselves run through gem5; the merged address traces feed an address-trace simulator in C whose D/I-cache misses, PUMP-cache misses, and miss-handler costs drive the runtime/energy cost model.)

Figure 4: Single Policies with Simple Implementation. (Boxplots of % CPI overhead and % EPI overhead across the benchmarks for the NXD+NWC, Memory Safety, CFI, and Taint Tracking policies.)

with no tags or policies. The resulting instruction trace is then run through a PUMP simulator that performs metadata computation for each instruction. This "phased" simulation strategy is accurate for fail-stop policies (the only ones we consider), where the PUMP's results cannot cause a program's control flow to diverge from its baseline execution. While address-trace simulations can be inaccurate for highly pipelined and out-of-order processors, they are quite accurate for our simple, in-order, 5- and 6-stage pipeline. On the baseline configuration, the gem5 instruction simulation and address trace generation followed by our custom address-trace simulations and accounting were within 1.2% of gem5's cycle-accurate simulations.

We provide the PUMP simulator with miss-handler code (written in C) to implement each policy, and we assign meta-


(a) Composite Policy Runtime: % CPI overhead per benchmark (GemsFDTD through zeusmp, plus mean), broken down into miss-handler, L2 PUMP-$, DRAM, L2-$, and I-misfetch components; outliers reach 780%, 600%, and 320%.

(b) Composite Policy Energy: % EPI overhead per benchmark, broken down into miss-handler, L2 PUMP-$, L1 PUMP-$, DRAM, L2-$, and L1-$ components; outliers reach 1600%, 600%, and 520%.

(c) Absolute Power (pJ/cycle) per benchmark, Baseline vs. Tagged.

Figure 5: Overhead of Simple Implementation with 64b Tags Compared to Baseline

data tags on the initial memory depending on the policy. Our PUMP simulator allows us to capture the access patterns in the PUMP rule caches and estimate the associated runtime and energy costs, accounting for the longer wait cycles required to access the L2 rule cache. Since the PUMP miss-handler code also runs on the processor, we separately simulate our miss handler on gem5 to capture its dynamic behavior. Since the miss-handler code potentially impacts the data and instruction caches, we create a merged address trace that includes properly interleaved memory accesses from both user and miss-handler code, which we use for the final address-trace simulation to estimate the performance impact of the memory system.

Tag and Rule Usage One of the contributions of the PUMP model is support for unbounded metadata. The Composite policy applied to the benchmark set exercises this support. The Composite policy uses from 400 (specrand) to 2M (GemsFDTD) tags, requiring 1700 (specrand) to 14M (GemsFDTD) rules. (The large number of tags and rules in GemsFDTD is driven largely by the memory safety policy, which itself uses 640K memory component tags.) The composite tags are pointers that point to metadata structures for the component policies. The Taint Tracking policy tag component is itself a set represented as an ordered list. The largest single metadata structure is 176B. The total metadata structure memory ranges from 1KB (specrand) to 56MB (GemsFDTD). As a fraction of the maximum heap-allocated memory, this represents only 1–10%.
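The representation described here, a composite tag pointing to a tuple of per-policy metadata, with the taint component kept as an ordered list, can be pictured with a small C sketch. The structure names, field choices, and fixed set capacity are illustrative, not the paper's actual layout:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical composite-tag layout: a tag is a pointer to a tuple of
 * per-policy metadata pointers. The taint component is a set kept as a
 * sorted list, so equal sets have identical representations. */
typedef struct TaintSet {
    size_t n;
    int ids[8];              /* taint source ids, sorted ascending (toy capacity) */
} TaintSet;

typedef struct CompositeTag {
    const void *nxd_nwc;     /* per-component-policy metadata pointers */
    const void *mem_safety;
    const void *cfi;
    const TaintSet *taint;
} CompositeTag;

/* Ordered union of two taint sets: a merge of two sorted lists. */
static TaintSet taint_union(const TaintSet *a, const TaintSet *b) {
    TaintSet r = { 0, {0} };
    size_t i = 0, j = 0;
    while (i < a->n || j < b->n) {
        int v;
        if (j >= b->n || (i < a->n && a->ids[i] < b->ids[j])) v = a->ids[i++];
        else if (i >= a->n || b->ids[j] < a->ids[i])          v = b->ids[j++];
        else { v = a->ids[i++]; j++; }   /* element in both sets: take once */
        r.ids[r.n++] = v;
    }
    return r;
}
```

Keeping the list sorted is what makes canonical comparison cheap: two taint sets are equal exactly when their list representations are byte-for-byte equal.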

Simple Implementation We evaluate the simple PUMP implementation, comparing it to the no-PUMP baseline.

Area Overhead: The overall area overhead of the PUMP on top of the baseline processor is 190% (Tab. 3). The dominant portion of this area overhead (110%) comes from the PUMP rule caches. The unified L2 cache contributes most of the remaining area overhead. The L1 D/I caches stay roughly the same, since we halved their effective capacity. This high memory area overhead roughly triples the static power, contributing 24% of the energy overhead.

Runtime Overhead: For all single policies on most benchmarks, the average runtime overhead of even this simple implementation is only 10% (see Fig. 4; to read boxplots [12]: the bar is the median, the box covers one quartile above and below (the middle 50% of cases), dots represent individual data points, and whiskers denote the full range except for outliers (more than 1.5× the respective quartile)), with the dominant overhead coming from the additional DRAM traffic required to transfer tag bits to and from the processor. For the Memory Safety policy, a few benchmarks exhibit high miss-handler overhead, pushing their total overhead up to 40–50% due to compulsory misses on newly allocated memory blocks. For the Composite policy, five of the benchmarks suffer from very high overheads in the miss handler (Fig. 5a), with the worst case close to 780% and the geomean reaching 50%. Two factors contribute to this overhead: (1) the large number of cycles required to resolve a last-level rule cache miss (since every component miss handler must be consulted), and (2) an explosion in the number of rules, which expands the working set size and increases the rule cache miss rate. In the worst case, the number of unique composite tags could be the product of the unique tags in each component policy. However, we do not see this worst case; rather, total rules increase by a factor of 3–5 over the largest single policy, Memory Safety.

Energy Overhead: Moving more bits, due to wider words, and executing more instructions, due to miss-handler code, both contribute to energy overheads, impacting both the single and composite policies (Figs. 4 and 5b). The CFI and Memory Safety policies (and hence also the Composite policy) access large data structures that often require energy-expensive DRAM accesses. The worst-case energy overhead is close to 400% for single policies, and about 1600% for the Composite policy, with geomean overhead around 220%.

Power Ceiling: For many platform designs the worst-case power, or equivalently, energy per cycle, is the limiter. This power ceiling may be driven by the maximum current they can draw from a battery or the maximum sustained operating temperature, either in a mobile device or in a wired device with ambient cooling. Fig. 5c shows that the simple implementation raises the maximum power ceiling by 76%, with lbm driving the maximum power in both the baseline and simple PUMP implementations. Note that this power-ceiling increase is lower than the worst-case energy overhead, in part because some benchmarks slow down more than the extra energy they consume and in part because the benchmarks with high energy overhead are the ones consuming the least absolute energy per cycle in the baseline design. Typically the data working set of these energy-efficient programs fits


| Unit | Parameter | Range | Final |
|---|---|---|---|
| L1 PUMP-$ | Capacity | 512–4096 | 1024 |
| L1 PUMP-$ | Tag Bits | 8–12 | 10 |
| L2 PUMP-$ | Capacity | 1024–8192 | 4096 |
| L2 PUMP-$ | Tag Bits | 13–16 | 14 |
| UCP-$ | Capacity | 512–4096 | 2048 |
| CTAG-$ | Capacity | 128–1024 | 512 |

Table 4: PUMP Parameter Range Table

Figure 6: Impact of Opgroup Optimization. (a) Number of PUMP rules (log10 scale) without and with the opgroup optimization, a mean reduction of 1.5×. (b) L1 PUMP-$ miss-rate (log10 scale) for L1 PUMP capacities of 512–4096 entries.

into the on-chip caches, so they seldom pay the higher cost of DRAM accesses.

4. Optimizing the PUMP

Though the simple implementation described above achieves reasonable performance on most benchmarks, the runtime overhead for the Composite policy on some of them and the energy and power overheads on all policies and benchmarks seem unacceptably high. To address these overheads, we introduce a series of targeted microarchitecture optimizations and examine the impact of the architectural parameters associated with the PUMP components (Tab. 4) on the overall costs. We use groupings of opcodes with identical rules to increase the effective capacity of the rule caches, tag compression to reduce the delay and energy of DRAM transfers, short tags to reduce the area and energy in on-chip memories, and Unified Component Policy and Composition Tag caches to decrease the overheads in the miss handlers.

Opgroups In practical policies, it is common to define similar rules for several opcodes. For example, in the Taint Tracking policy, the rules for the add and sub instructions are identical (Alg. 1). However, in the simple implementation, these rules occupy separate entries in the rule caches. Consequently, we group instruction opcodes with the same rules into "opgroups", reducing the number of rules needed. Which opcodes can be grouped together depends on the policy; we therefore expand the don't-care SRAM in the Execute stage (Fig. 2, also shown as "Mem" in Fig. 1) to also translate opcodes to opgroups before the rule cache lookup. For the Composite policy, we can reduce 300+ Alpha opcodes to 14 opgroups, shrinking the total number of rules by a factor of 1.1×–6×, with an average of 1.5× (Fig. 6a measures this effect across all the benchmarks). This effectively increases the rule cache capacity for a given investment in silicon area. Opgroups also reduce the number of

compulsory misses, since a miss on a single instruction in the group installs the rule that applies to every instruction opcode in the group. Fig. 6b summarizes the miss-rate across all the benchmarks for different L1 rule cache sizes for the Composite policy with and without opgrouping. This figure shows that both the range and the mean of the miss-rates are reduced by opgrouping. With the opgroup optimization, a 1024-entry rule cache has a lower miss rate than a 4096-entry rule cache without it. A lower miss-rate naturally reduces the time and energy spent in miss handlers (Fig. 14), and smaller rule caches directly reduce area and energy.
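The opcode-to-opgroup translation amounts to a small table lookup inserted before the rule-cache probe. A sketch, using made-up opcode numbers rather than the actual Alpha encoding, and a grouping that is policy-configurable:

```c
#include <assert.h>

/* Illustrative opgroup ids; the Composite policy in the paper uses 14. */
enum opgroup { OG_NOP, OG_ARITH, OG_LOAD, OG_STORE };

#define NUM_OPCODES 64
static unsigned char opgroup_of[NUM_OPCODES];   /* the "don't-care SRAM" role */

static void init_opgroups(void) {
    /* add and sub share one opgroup because (e.g., for Taint Tracking)
     * their rules are identical; opcode numbers here are hypothetical */
    opgroup_of[0x10 /* "add" */] = OG_ARITH;
    opgroup_of[0x11 /* "sub" */] = OG_ARITH;
    opgroup_of[0x28 /* "ldq" */] = OG_LOAD;
    opgroup_of[0x2c /* "stq" */] = OG_STORE;
}
```

The rule cache is then keyed on the opgroup instead of the opcode, so one cached rule serves every opcode in the group.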

Main Memory Tag Compression Using 64b tags on 64b words doubles the off-chip memory traffic and hence roughly doubles the associated energy. Typically, though, tags exhibit spatial locality: many adjacent words have the same tag. For example, Fig. 7a plots the distribution of unique tags per DRAM transfer for the gcc benchmark with the Composite policy, showing that most words have the same tag: on average we see only 1.14 unique tags per DRAM transfer of an 8-word cache line. We exploit this spatial locality to compress the tag bits that must be transferred to and from the off-chip memory. Since we are already transferring data in cache lines, we use cache lines as the basis for this compression. We allocate 128B per cache line in the main memory, to keep addressing simple. However, rather than storing 128b tagged words directly, we store eight 64b words (payloads) followed by eight 4b indexes and then up to eight 60b tags (see Fig. 7b). The index identifies which of the 60b tags goes with the associated word. We trim the tag to 60b to accommodate the indexes, but this does not compromise the use of tags as pointers: assuming byte addressing and 16B (two 64b words) aligned metadata structures, the low 4b of the 64b pointer can be filled in as zeros. As a result, after transferring the 4B of indexes, we only need to transfer the unique 7.5B tags in the cache line. For instance, if the same tag is used by all the words in the cache line, we transfer 64B+4B=68B in a first read, then 8B in a second read, for a total of 76B instead of 128B. The second read is to the same DRAM page, needing only CAS cycles and not the full random-access cycle time. The 4b index can be either a direct index or a special value. We define a special index value to represent a default tag, so that there is no need to transfer any tag in this case. By compressing tags in this manner we are able to reduce the average energy overhead per DRAM transfer from 110% to 15%.
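The transfer-size arithmetic above can be checked with a short sketch. The rounding of the 7.5B tags to whole bytes is our assumption about the transfer granularity:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved for one 8-word cache line under the compression scheme:
 * 64B of payload plus 4B of 4b indexes in the first transfer, then the
 * unique 60b (7.5B) tags. Words carrying the default tag use the
 * special index and send no tag at all. */
static unsigned line_transfer_bytes(const uint64_t tags[8], uint64_t default_tag) {
    uint64_t uniq[8];
    unsigned n = 0;
    for (int i = 0; i < 8; i++) {
        if (tags[i] == default_tag) continue;   /* special index: no tag sent */
        int seen = 0;
        for (unsigned j = 0; j < n; j++)
            if (uniq[j] == tags[i]) seen = 1;
        if (!seen) uniq[n++] = tags[i];
    }
    return 64 + 4 + (n * 15 + 1) / 2;           /* 7.5B per unique tag, rounded up */
}
```

With a single tag shared by all eight words this gives 76B, matching the 68B+8B example in the text; an all-default line needs only the 68B first transfer.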

We chose this compression scheme for its combination of simplicity and effectiveness at reducing off-chip memory energy. Many clever schemes for fine-grained memory tagging exist (including multi-level tag page tables [60], variable-grained TLB-like structures [67, 71], and range caches [65]); these could also be used to reduce the DRAM footprint.

Tag Translation The simple PUMP rule caches are large (adding 110% area) since each cached rule is 456b wide.


Figure 7: Main Memory Tag Compression. (a) Distribution of the number of unique tags per DRAM transfer (1–7) for gcc. (b) Cache-line layout in DRAM: eight 8B payload words w0–w7, followed by eight 4b tag indexes i0–i7, followed by up to eight 7.5B tags t0–t7.

Supporting the PUMP also required extending the baseline on-chip memories (RFs and L1/L2 caches) with 64b tags. Using a full 64b (or 60b) tag for each 64b word here incurs heavy area and energy overheads, as we saw in §3. However, a 64KB L1 D-$ holds only 8192 words and hence at most 8192 unique tags. Along with a 64KB L1 I-$, we can have at most 16384 unique tags in the L1 memory subsystem; these can be represented with just 14b tags, reducing the delay, area, energy, and power in the system. Caches exist to exploit temporal locality, and this observation suggests we can leverage that locality to reduce area and energy. If we reduce the tag bits to 14b, the PUMP rule cache match key is reduced from 328b to 78b.
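The counting argument (8192 words in each of the D-$ and I-$, so at most 16384 resident tags) reduces to a one-line width calculation:

```c
#include <assert.h>

/* Smallest tag width able to name n distinct values. */
static int bits_needed(unsigned n) {
    int b = 0;
    while ((1u << b) < n) b++;
    return b;
}
```

With n = 2 × (64KB / 8B) = 16384, this yields exactly the 14 bits claimed above.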

To get these savings without losing the flexibility of full, pointer-sized tags, we use different-width tags for different on-chip memory subsystems and translate between them as needed. For example, one might use 12b tags in the L1 memories and 16b tags in the L2 memories. Fig. 8 shows how we manage tag translation between the L1 and L2 memory subsystems. Moving a word from the L2 cache to the L1 cache requires translating its 16b tag to the corresponding 12b tag, creating a new association if needed. We use a simple SRAM for the L2-tag-to-L1-tag translation, with an extra bit indicating whether or not there is an L1 mapping for the L2 tag. Translating an L1 tag to an L2 tag (on a write-back or an L2 lookup) is again performed with an SRAM lookup, using the L1 tag as the address. A similar translation occurs between the 60b main memory tags and the 16b L2 tags.

When a long tag is not in the long-to-short translation table, we must allocate a new short tag, potentially reclaiming a previously allocated short tag that is no longer in use. There is a rich design space to explore for determining when a short tag can be reclaimed, including garbage collection and tag-usage counting. For simplicity, we allocate short tags sequentially and flush all caches above a given level (instruction, data, PUMP) when the short tag space is exhausted, avoiding the need to track when a specific short tag is available for reclamation. We assume all caches are designed with a lightweight gang clear [41], making flushes inexpensive.
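The sequential allocate-then-flush discipline can be sketched as follows. Sizes and names are illustrative (10b L1 tags giving 1024 entries), and the long-to-short lookup that would precede allocation (an SRAM/dMHC in hardware) is omitted:

```c
#include <assert.h>
#include <stdint.h>

#define L1_TAG_BITS 10
#define L1_TAGS (1u << L1_TAG_BITS)

static uint64_t l1_to_long[L1_TAGS];  /* short-tag -> full-tag table */
static unsigned next_l1_tag;
static unsigned flushes;              /* counts gang-clears of the L1 I/D/PUMP caches */

/* Allocate the next sequential short tag for long_tag; on exhaustion,
 * gang-clear the caches above this level and restart from zero rather
 * than tracking per-tag liveness. */
static unsigned alloc_l1_tag(uint64_t long_tag) {
    if (next_l1_tag == L1_TAGS) {
        flushes++;
        next_l1_tag = 0;
    }
    l1_to_long[next_l1_tag] = long_tag;
    return next_l1_tag++;
}
```

The tradeoff the text describes is visible here: fewer tag bits shrink L1_TAGS and therefore force flushes more often.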

Compared to Tab. 3, where each L1 rule cache access costs 51pJ, we get down to 10pJ with 8b L1 tags or 18pJ with 16b L1 tags, with the energy scaling linearly with tag length between these points. The energy impact on the L1 instruction and data caches is small. Similarly, with 16b L2 tags, an L2 rule cache access costs 120pJ, down from 173pJ with 64b tags.

Figure 8: Translation Between 12b L1 Tags and 16b L2 Tags. (a) On an L1 miss that hits in the L2, the 16b tag on the L2 cache line indexes the L2tag→L1tag SRAM; a valid entry yields the 12b L1 tag, while an invalid entry allocates a new L1 tag (possibly forcing a flush) and updates the L1tag→L2tag SRAM. (b) On an L1-to-L2 writeback, the 12b tag indexes the L1tag→L2tag SRAM to recover the 16b tag.

Figure 9: Impact of L1 Tag Length on L1 PUMP Miss-Rates. (Left: number of L1 flushes (log10 scale) for L1 tag lengths of 8–12 bits. Right: L1 PUMP miss-rate (log10 scale) for L1 tag lengths of 8–12 and 64 bits.)

Slimming the L1 tags also allows us to restore the capacity of the L1 D/I caches. With 12b tags, the full-capacity (76KB, effective 64KB) L1 cache will meet single-cycle timing requirements, reducing the performance penalty the simple implementation incurred from the reduced cache capacity. As a result, we limit our L1 tag length exploration to 12 bits or less. While even shorter tags reduce energy, they also increase the frequency of flushes. Fig. 9 shows how flushes decrease with increasing L1 tag length, as well as the impact on the L1 rule cache miss-rate.

Miss-Handler Acceleration Enforcing large policies obtained by the orthogonal combination of smaller policies is, not surprisingly, expensive. We illustrated this by combining four policies into a single Composite policy. As shown in Alg. 2, each invocation of an N-policy miss handler must take apart a tuple of tags, compute result tags for each component policy, reassemble them into tuples, and canonicalize these tuples to prevent duplication. Moreover, the greater number of tags and rules needed for the Composite policy increases the rule cache miss rates: in Fig. 10a, even though the Taint Tracking and CFI policies individually have low miss-rates, the higher miss-rate of the Memory Safety policy drives the miss-rate for the Composite policy high as well. The lower miss rates of the individual policies suggest that their results may be cacheable even when the composite rules are not.

Microarchitecture: We introduce two hardware structures to optimize composite policy miss handlers. First, we add a Unified Component Policy (UCP) cache where we simply cache the most recent component policy results. The general miss handler for composite policies is modified to perform lookups in this cache while resolving component policies


Figure 10: Optimizing Composite Policy Miss Handler. (a) Per-policy L1 PUMP-$ miss-rates (%) for the CFI, Taint Tracking, Memory Safety, and Composite policies on GemsFDTD, astar, gcc, and omnetpp. (b) UCP-$ hit rate per component policy (NXD+NWC, Memory Safety, CFI, Taint Tracking) for capacities of 512–4096 entries. (c) Miss-service time (cycles) and hit rate (%) as a function of UCP-$ capacity. (d) CTAG-$ hit rate (%) for capacities of 128–1024 entries. (e) Normalized miss-handler service time for the base design, +UCP-$, and +CTAG-$.

Algorithm 3 N-Policy Miss Handler with HW support

 1: for i = 1 to N do
 2:   Mi ← {op, PC[i], CI[i], OP1[i], OP2[i], MR[i]}
 3:   {hit, pci, resi} ← UCP-$(i, Mi)
 4:   if !hit then
 5:     {pci, resi} ← policyi(Mi)
 6:     insert {i, pci, resi} into UCP-$
 7: pc[1..N] ← {pc1, pc2, ..., pcN}
 8: {hit, PCnew} ← CTAG-$(pc[1..N])
 9: if !hit then
10:   PCnew ← canonicalize(pc[1..N])
11:   insert {pc[1..N], PCnew} into CTAG-$
12: res[1..N] ← {res1, res2, ..., resN}
13: {hit, R} ← CTAG-$(res[1..N])
14: if !hit then
15:   R ← canonicalize(res[1..N])
16:   insert {res[1..N], R} into CTAG-$

(Alg. 3, line 3). When this cache misses for a component policy, we perform its policy computation in software (and insert the result into this cache). We implement the UCP cache with the same hardware organization as the regular PUMP rule cache, with an additional policy-identifier field. For simplicity we use a FIFO replacement policy for this cache, but it may be possible to achieve better results by prioritizing space using a metric such as the re-computation cost of the component policies. With modest capacity, this cache filters out most policy recomputations (Fig. 10b; the low hit rate for memory safety is driven by compulsory misses associated with new memory allocations). As a result, we are able to reduce the average number of miss-handler cycles by a factor of 5 for our most challenging benchmarks (Fig. 10e). It is possible for every policy to hit in the UCP cache even when there is a miss in the L2 PUMP, since the composite rules needed can be a product of a small number of component policy rules. For GemsFDTD, we hit in 3 or more component policies about 96% of the time.
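The component-resolution loop of Alg. 3 might look like this in software, with a toy direct-mapped stand-in for the UCP cache and a dummy policy function; the names, the hash, and the cache organization are all illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define N_POLICIES 4
#define UCP_SLOTS 64

typedef struct { uint64_t pc_tag, res_tag; } PolicyResult;

/* toy per-policy result cache standing in for the UCP-$ */
static struct { int valid; int policy; uint64_t key; PolicyResult r; } ucp[UCP_SLOTS];
static int policy_calls;   /* counts slow-path software invocations */

static unsigned ucp_idx(int policy, uint64_t key) {
    return (unsigned)((key * 2654435761u + (unsigned)policy) % UCP_SLOTS);
}

/* stand-in for a component policy's software handler */
static PolicyResult run_policy(int policy, uint64_t key) {
    PolicyResult r = { key + (uint64_t)policy, key ^ (uint64_t)policy };
    policy_calls++;
    return r;
}

/* Alg. 3, lines 1-6: probe the UCP cache for each component policy,
 * falling back to the software handler on a miss and installing the result. */
static void resolve_components(uint64_t key, PolicyResult res[N_POLICIES]) {
    for (int i = 0; i < N_POLICIES; i++) {
        unsigned s = ucp_idx(i, key);
        if (ucp[s].valid && ucp[s].policy == i && ucp[s].key == key) {
            res[i] = ucp[s].r;                       /* UCP hit */
        } else {
            res[i] = run_policy(i, key);             /* slow path */
            ucp[s].valid = 1; ucp[s].policy = i;
            ucp[s].key = key; ucp[s].r = res[i];
        }
    }
}
```

A repeated composite miss with the same inputs then resolves entirely from the cache, which is the effect that cuts the average miss-handler time by 5×.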

Second, we add a cache that translates a tuple of result tags into its canonical composite result tag (Alg. 2, Line 4). This identifies the single pointer tag that points to the component result-tag tuple, when such a pointer already exists; the tuple is canonicalized so that there is a single, unique tag for any distinct combination of components, in order to maximize the effectiveness of the rule caches. This Composition Tag (CTAG) cache is implemented by hashing together all the tuple components to get an index into a small CTAG memory. Multiple hashes and memories are used to minimize conflicts, as in the PUMP rule cache (Fig. 1). Once the CTAG cache returns a candidate composite tag, the hardware dereferences the pointer result to check that the hashed result is a valid hit. This CTAG cache is effective (Fig. 10d) because it is common for several component policy rules to return the same tuple of result tags. For example, in many cases the PC tag will be the same, even though the result tag is different. Furthermore, many different rule inputs can lead to the same output. For example, in Taint Tracking we perform set unions, and many different unions have the same result; e.g., (Blue, {A,B,C}) is the composite answer for writing the result of both {A}∪{B,C} and {A,B}∪{B,C} (Taint Tracking) into a Blue slot (Memory Safety). We also use a FIFO replacement policy for this cache. The CTAG cache reduces the average miss-handler cycles by another factor of 2 (Fig. 10e). Taken together, a 2048-entry UCP cache and a 512-entry CTAG cache reduce the average time spent on each L2 rule cache miss from 800 cycles to 80 cycles.

Rule Prefetch: Another way to reduce the compulsory miss rate is to pre-compute rules that might be needed in the near future. A full treatment of prefetching is beyond the scope of this paper. However, one easy case has high value for the Memory Safety rules. When we allocate a new memory tag, we will likely need the 7 rules (initialize (1), add offset to pointer and move (3), scalar load (1), scalar store (2)) for that tag in the near future. Consequently, we add all of these rules to the UCP cache at once. For the single-policy Memory Safety case, we add the rules directly into the rule caches. This reduces the number of Memory Safety miss-handler invocations by 2×.
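The allocation-time prefetch can be sketched as enumerating the seven rule keys for the fresh tag; the opgroup identifiers here are placeholders, not the real encoding:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int opgroup; uint64_t tag; } RuleKey;

/* On allocating a fresh memory-safety tag t, enumerate the rule keys
 * that will almost certainly be needed for it: one initialize, three
 * pointer-offset/move variants, one scalar load, two scalar stores. */
static int prefetch_alloc_rules(uint64_t t, RuleKey out[7]) {
    static const int ogs[7] = { 0,        /* initialize */
                                1, 2, 3,  /* add offset to pointer / move */
                                4,        /* scalar load */
                                5, 6 };   /* scalar store variants */
    for (int i = 0; i < 7; i++) { out[i].opgroup = ogs[i]; out[i].tag = t; }
    return 7;  /* keys to install into the UCP (or rule) cache */
}
```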

Fig. 11 shows how the optimizations discussed in this section extend the PUMP microarchitecture.

5. Overall Evaluation

Design-Point Selection For the most part, the architecture parameters (Tab. 4) monotonically impact a particular cost, providing tradeoffs among energy, delay, and area, but not defining a minimum within a single cost criterion. There is


| Unit | Design | Organization | Area (mm2) | Access Energy Read/Write (pJ) | Static Power (pJ/cyc) | Latency (ps) | Cyc |
|---|---|---|---|---|---|---|---|
| Register File | Extended 10b | 74b, 2R1W, {48 Integer, 32 Floating} | 0.005 | 0.4/0.5 | 0.13 | 360 | 1 |
| L1 Cache | 10b-tag | 74KB, 4-way, 74B/line (eff. 64KB, 64B/line) | 0.272 | 19/13 | 16.4 | 975 | 1 |
| L2 Cache | 14b-tag | 592KB, 8-way, 78B/line (eff. 512KB, 64B/line) | 1.247 | 343/640 | 0.133 | 4600 | 5 |
| TLB | – | 1KB, 2-way set-assoc. | 0.040 | 3.6/4.5 | 2.0 | 800 | 1 |
| DRAM | 64b-tag | 1GB, access 128B line (move 76B) | – | 17,500 | – | – | 112 |
| L1 PUMP Cache | 10b L1 tag | 1024-entry, 58b match, 20b out, Fast-Value dMHC(4,2) | 0.095 | 15/43.2 | 2.22 | 520 | 1 |
| L2 PUMP Cache | 14b L2 tag | 4096-entry, 78b match, 28b out, 2-level dMHC(4,2) | 0.287 | 99.4/267 | 0.032 | 2800 | 3 |
| full→L2-tag | 64b→14b | 8196-entry, 64b match, 14b out, dMHC(4,2) | 0.432 | 166/436 | 0.052 | 3400 | 4 |
| L2-tag→full | 14b→64b | 16K×64 SRAM | 0.216 | 55.5/31.5 | 0.027 | 1700 | 2 |
| L2-tag→L1-tag | 14b→10b | 16K×11 SRAM | 0.038 | 8.8/4.7 | 0.0040 | 1420 | 2 |
| L1-tag→L2-tag | 10b→14b | 1K×14 SRAM | 0.004 | 0.8/1 | 0.0004 | 780 | 1 |
| UCP Cache | 64b tags | 2048-entry, 328b match, 128b out, 2-level dMHC(4,2) | 0.377 | 196/479 | 0.035 | 2730 | 3 |
| CTAG Cache | 64b tags | 512-entry, 2-level dMHC(4,2) | 0.107 | 56/139 | 0.009 | 1700 | 2 |

Total area: 3.120mm2 (+110% over baseline)

Table 5: Memory Resource Estimates for the PUMP-optimized Processor at 32nm node

Figure 12: Overhead of Optimized Implementation as Shown in Tab. 5 (Composite Policy). (a) Runtime overhead (% CPI) per benchmark, broken down into miss-handler, L2 PUMP-$, DRAM, L2-$, and I-misfetch components; outliers at 85% and 54%. (b) Energy overhead (% EPI) per benchmark, broken down into miss-handler, tag-translation, CTAG-$, UCP-$, L2 PUMP-$, L1 PUMP-$, DRAM, L2-$, and L1-$ components; outlier at 230%. (c) Absolute power (pJ/cycle) per benchmark, Baseline vs. Tagged.

Figure 11: PUMP Microarchitecture. (The five input tags (pctag, citag, op1tag, op2tag, mrtag) pass through opgroup translation and don't-care mask-out logic into the L1 PUMP-$ (Fast-Value dMHC(4,2)), which returns pctag and restag on a hit. An L1 miss translates L1 tags to L2 tags and probes the L2 PUMP-$ (two-level dMHC(4,2)); an L2 miss enters miss-handler mode, where software uses the UCP-$ and CTAG-$ (both two-level dMHC(4,2)), L2-tag↔full-tag translation, and tag compression/decompression to and from main memory, with results translated back to L1 tags.)

the threshold effect that, once the tag bits are small enough, the L1 D/I caches can be restored to the capacity of the baseline, so we adopt that as our upper bound to explore for L1 tag length; beyond that point, decreasing tag length reduces energy with small impact on performance. Fig. 13b shows that reducing tag length is the dominant energy effect for most benchmark programs (e.g., leslie3d, mcf), with a few programs showing equal or larger benefits from increasing UCP cache capacity (e.g., GemsFDTD, gcc). Ignoring other cost concerns, to reduce energy, we would select

Figure 13: Impact of Tag Bits and UCP-$ Capacity. (Heat maps of overhead vs. L1 tag length (8–12b) and UCP-$ capacity (512–4096 entries) for GemsFDTD, gcc, lbm, and leslie3d. (a) Runtime overhead (% CPI); min–max per benchmark: 59–127, 8–23, 9–9, 4–4. (b) Energy overhead (% EPI); min–max per benchmark: 191–243, 51–54, 18–18, 30–33.)

large miss-handler caches and few tag bits. Runtime overhead (Fig. 13a) is also minimized with larger miss-handler caches, but benefits from more rather than fewer tag bits (e.g., GemsFDTD, gcc). The magnitude of the benefits varies across benchmarks and policies. Across all benchmarks, the benefit beyond 10b L1 tags is small for the SPEC CPU2006 benchmarks, so we use 10b as the compromise between energy and delay, along with a 2048-entry UCP cache and a 512-entry CTAG cache, to reduce area overhead while coming close to the minimum energy level within the space of the architecture parameters explored.

Runtime and Energy Impact of Optimizations Fig. 14 shows the overall impact on runtime and energy overheads of applying the optimizations from §4 in sequence. Every optimization is dominant for some benchmark (e.g., opgroups for astar, DRAM tag compression for lbm, short tags for h264ref, miss handler acceleration for GemsFDTD), and some benchmarks see benefits from all optimizations (e.g., gcc), with each optimization successively removing one bottleneck and exposing the next. The different behavior from the benchmarks follows their baseline characteristics as detailed below.

Figure 14: Runtime and Energy Impact of Optimizations on Representative Benchmarks (Composite Policy) (A: Simple; B: A+Opgrouping; C: B+DRAM Compression; D: C+(10b L1, 14b L2) short tags; E: D+(2048-UCP, 512-CTAG)). (Stacked bars of % CPI and % EPI overhead for GemsFDTD, astar, gcc, h264ref, and lbm, broken down into miss handler, tag translation, L2 PUMP-$, L1 PUMP-$, DRAM, L2$, L1$, and I-fetch components.)

Applications with low locality have baseline energy and performance driven by DRAM due to high main memory traffic. The overhead in such benchmarks (e.g., lbm) trends to the DRAM overhead, so reductions in DRAM overhead directly impact runtime and energy overhead. Applications with more locality are faster in the baseline configuration, consume less energy, and suffer less from DRAM overheads; as a result, these are more heavily impacted by the reduced L1 capacity and the tag energy in the L1 D/I and rule caches. DRAM optimization has less effect on these applications, but using short tags has a large effect on energy and removes the L1 D/I cache capacity penalty (e.g., h264ref).

The benchmarks with heavy dynamic memory allocation have higher L2 rule cache miss rates due to compulsory misses as newly created tags must be installed in the cache. This drove the high overheads for several benchmarks (GemsFDTD, omnetpp) in the simple implementation. The miss handler optimizations reduce the common-case cost of such misses, and the opgroup optimization reduces the capacity miss rate. For the simple implementation, GemsFDTD took an L2 rule cache miss every 200 instructions and took 800 cycles to service each miss, driving a large part of its 780% runtime overhead. With the optimizations, it services an L2 rule cache miss every 400 instructions and takes only 140 cycles on average per miss, reducing its runtime overhead to 80%.
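As a rough sanity check on the miss numbers quoted above, the amortized miss cost can be computed directly. This is a simplified model, not the paper's simulator: it assumes a baseline CPI of 1 and counts only rule-cache miss cycles, ignoring all other overhead sources, so it gives a lower bound on the reported totals.

```python
# Back-of-envelope model of L2 rule-cache miss cost (assumes baseline
# CPI = 1 and ignores every overhead source other than miss handling).

def miss_cpi_overhead(instrs_per_miss, cycles_per_miss, baseline_cpi=1.0):
    """Fraction of CPI overhead contributed by rule-cache misses alone."""
    extra_cycles_per_instr = cycles_per_miss / instrs_per_miss
    return extra_cycles_per_instr / baseline_cpi

# Simple implementation: a miss every 200 instructions costing 800 cycles.
simple = miss_cpi_overhead(200, 800)      # 4.0, i.e., ~400% CPI from misses
# Optimized: a miss every 400 instructions costing 140 cycles on average.
optimized = miss_cpi_overhead(400, 140)   # 0.35, i.e., ~35% CPI from misses

print(f"simple: {simple:.0%}, optimized: {optimized:.0%}")
```

Even this crude accounting shows misses alone explain a large share of the 780% and 80% figures; the remainder comes from the other components broken out in Fig. 14.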

Overall, these optimizations bring runtime overhead below 10% for all benchmarks except GemsFDTD and omnetpp (Fig. 12a), which are high on memory allocation. The mean energy overhead is close to 60%, with only 4 benchmarks exceeding 80% (Fig. 12b).

Power Ceiling The optimizations reduce the impact on power ceiling to 10% (Fig. 12c), suggesting the optimized PUMP will have little impact on the operating envelope of the platform. The programs that still have the worst energy overhead (GemsFDTD, h264ref) are well below the baseline absolute power ceiling set by lbm. DRAM compression reduces the energy overhead for lbm to 20%; since it also slows down by 9%, its power requirement increases by 10%.

Area The area overhead of the optimized design is around 110% (Tab. 5), compared to the 190% of the simple design (Tab. 3). On the one hand, short tags significantly reduce the area of the L1 and L2 caches (now adding only 5% over the baseline) and of the rule caches (adding only 26%). On the other hand, the optimized design spends some area to reduce runtime and energy overhead. The UCP and CTAG caches add 33% area overhead, while the translation memories for short tags (both L1 and L2) add another 46%. While these additional hardware structures add area, they provide a net reduction in energy, since they are accessed infrequently and the UCP and CTAG caches also substantially reduce the miss-handler cycles.

Policy Scaling A key goal of our model (and optimizations) is to make it relatively painless to add additional policies that are simultaneously enforced. The Composite policy on the simple PUMP design incurred more than incremental costs for several benchmarks due to the large increase in miss handler runtime, but we reduced these with the miss handler optimizations. For most benchmarks, one policy (typically Memory Safety) dominates the performance of the composition. Adding more policies to this dominant one incurs little further performance degradation—the runtime overhead is increased by less than 1% per policy. This is shown in Fig. 15, where we first show the overhead of each single policy, then show composites that incrementally add one policy at a time to Memory Safety, our most complex single policy. Between the Memory-Safety-only case and the full composite, the average runtime overhead grows from 5% to 8% and the average energy overhead grows from 50% to 60%. The progression makes it clearer what overhead comes simply from adding any policy as opposed to adding a higher-overhead policy. These results show that the caches are effective for these benchmarks: the working set does not grow substantially, and the time spent in miss handlers resolving all the policies is effectively contained such that it does not degrade performance.
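Conceptually, composition treats the composite tag as a tuple with one component per policy, resolved component-wise in the miss handler. The following is a minimal illustrative sketch of that idea; the function and policy names (`compose`, `taint`, `nxd`, `PolicyViolation`) are hypothetical, not the paper's actual DSL or encodings.

```python
class PolicyViolation(Exception):
    pass

def compose(policies):
    """Combine per-policy rule functions into one composite rule: the
    composite tag is a tuple with one component tag per policy, and the
    result is computed component-wise; any policy may veto by raising."""
    def composite(op, tags):
        return tuple(p(op, t) for p, t in zip(policies, tags))
    return composite

# Two toy component policies (illustrative only):
def taint(op, t):
    return t                       # propagate the taint label unchanged

def nxd(op, t):
    if op == "exec" and t == "data":
        raise PolicyViolation("NXD: executing data")
    return t

rules = compose([taint, nxd])
print(rules("add", ("tainted", "code")))   # ('tainted', 'code')
```

Because each composite result is cached as a single concrete rule, adding a policy grows the miss-handler work but not the per-hit hardware cost, which is consistent with the small per-policy overhead reported above.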

To provide a sense of scaling beyond the four policies measured here, we have decomposed the CFI policy into returns (CFI1) and indirect jumps (CFI2) and the Taint Tracking policy into code tainting and I/O tainting. This does not increase the main PUMP cache working set, as it requires the same number of distinct composite rules, but it does increase the miss-handler complexity and pressure in the UCP and CTAG caches. The fact that we do not see overhead jumps between code and I/O tainting and between CFI1 and CFI2 suggests the working set remains small enough that the additional miss-handler complexity has only incremental impact on overall overhead.


Figure 15: Impact Per Policy in Composition. (Plots of % CPI overhead and % EPI overhead for each single policy—NXD/NWC, CFI, Code+IO-taint, Memory Safety—and for composites that successively add Code-taint, IO-taint, CFI1, CFI2, and NXD/NWC to Memory Safety.)

Two outliers (GemsFDTD and omnetpp) have a larger increase in runtime overhead (20–40%), with only one (GemsFDTD) showing a larger increase in energy. The increases occur only when we add the first taint and first CFI policy, due mostly to higher memory safety compulsory miss rates for these benchmarks coupled with the increased miss-handler resolution complexity.

6. Related Work

The idea of using metadata for safety and security has been around since the 1960s [29]. During the past decade [63], it has been greatly refined and expanded (see Tab. 2). Our proposal extends previous work by offering a unified architecture for metadata tags and making it fully programmable.

The earliest uses of tags were for types, capabilities, and synchronization [34, 51, 61]. More recent uses include taint tracking [15, 19, 63] and bounds checking [17, 46]. Raksha [22] and DataSafe [16] generalize from the 1-bit taints and hardwired policies of earlier systems to larger taint fields (4b and 10b, respectively) and taint-specific configurable propagation rules. Our proposal further generalizes these systems by offering pointer-sized tags and fully programmable, software-defined rules for metadata validation and propagation.

Metadata has long been used for page-size units; later work has associated metadata at finer granularities, down to individual words or bytes [5, 67]. These systems can also trap to software for handling, though the tags do not propagate with the data and no tag results are computed.

DISE [18], Log-Based Architecture (LBA) [13, 14], and FADE [30] provide software-defined checking by inserting instructions in the processor's instruction stream (DISE) or running software on a processor core dedicated to checking (LBA, FADE). Aside from performance gains from code compactness, the performance and energy of DISE are comparable to software-only solutions. LBA adds hardware accelerators for checking and simple propagation (but not computation) of metadata, and FADE adds the ability to perform programmable AND or OR computations. Some of the innovations in these accelerators (e.g., the restriction of taint propagation to unary inheritance tracking, excluding taint combining, in LBA, or the programmable AND or OR in the FADE Event Table) give up generality that our solution retains. LBA and FADE were demonstrated on simpler policies than our single policies, but have ∼50% (LBA) and 20–80% (FADE) runtime overhead compared to our 6% single-policy overhead (Fig. 15). The PUMP can cache any rule for single-cycle PUMP resolution, even rules involving composite tags, while FADE can only capture a subset of operations in a single cycle ("single-shot"), with others requiring multi-cycle resolution. Using a second processor for log processing incurs roughly 100% energy overhead—higher than our worst-case single-policy energy overhead of 60%. Conversely, FADE is designed to work with superscalar, out-of-order cores, while the PUMP design presented here is only for in-order cores.

Other work aims to support software-defined metadata processing with hardware acceleration, similar to our approach. Aries sketches a vision for cached metadata propagation rules [11], and more recently Harmoni shows where a rule cache would fit in a programmable tag processing architecture [24], though neither shows how to map policies into rule tables or evaluates the runtime characteristics of rule-cache-supported policies. FlexiTaint [66] is closest to our design; it demonstrates support for two taint propagation policies and their combination. The policies we show here are richer than the ones supported by FlexiTaint, due both to the extra tag inputs and outputs and to the richer tag metadata. FlexiTaint could in principle be scaled to wide metadata; it would then suffer performance and energy problems similar to our simple design (§3) and would likely benefit from our optimizations.

7. Conclusions and Future Work

With evidence mounting for the value of metadata-based policy enforcement, the time is ripe to define an architecture for software-defined metadata processing and identify accelerators to remove most of the runtime overhead. The PUMP architecture bounds neither the number of metadata bits nor the number of policies simultaneously supported; its microarchitectural optimizations (opgroups, tag compression, tag translation, and miss handler acceleration—see §4) achieve performance comparable to dedicated, hardware metadata propagation solutions (§5). We believe the software-defined metadata policy model and its acceleration will be applicable to a large range of policies beyond those illustrated here, including sound information-flow control [7, 8, 32, 56, 62], fine-grained access control [67, 71], integrity, synchronization [6, 61], race detection [58, 73], debugging, application-specific policies [70], and controlled generation and execution of dynamic code.

Acknowledgments

This material is based upon work supported by the DARPA CRASH program through the United States Air Force Research Laboratory (AFRL) under Contract No. FA8650-10-C-7090. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.


References

[1] Alpha Architecture Handbook. Digital Equipment Corporation, 1992.

[2] M. Abadi, M. Budiu, Ú. Erlingsson, and J. Ligatti. Control-flow integrity. In 12th ACM Conference on Computer and Communications Security, pages 340–353. ACM, 2005.

[3] M. Abadi, M. Budiu, Ú. Erlingsson, and J. Ligatti. Control-flow integrity principles, implementations, and applications. ACM Transactions on Information System Security, 13(1), 2009.

[4] M. Abadi and C. Fournet. Access control based on execution history. In Proceedings of the 10th Annual Network and Distributed System Security Symposium, pages 107–121. The Internet Society, 2003.

[5] D. Arora, S. Ravi, A. Raghunathan, and N. K. Jha. Architectural support for run-time validation of program data properties. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 15(5):546–559, May 2007.

[6] Arvind, R. S. Nikhil, and K. K. Pingali. I-structures: Data structures for parallel computing. In Proceedings of the Workshop on Graph Reduction (Springer-Verlag Lecture Notes in Computer Science 279), Sept. 1986.

[7] T. H. Austin and C. Flanagan. Efficient purely-dynamic information flow analysis. In Workshop on Programming Languages and Analysis for Security (PLAS), pages 113–124. ACM, 2009.

[8] A. Bichhawat, V. Rajani, D. Garg, and C. Hammer. Information flow control in WebKit's JavaScript bytecode. In 3rd International Conference on Principles of Security and Trust, volume 8414 of Lecture Notes in Computer Science, pages 159–178. Springer, 2014.

[9] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, Aug. 2011.

[10] E. Blem, J. Menon, and K. Sankaralingam. Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures. In Proc. HPCA, pages 1–12, 2013.

[11] J. Brown and T. F. Knight, Jr. A minimally trusted computing base for dynamically ensuring secure information flow. Technical Report 5, MIT CSAIL, November 2001. Aries Memo No. 15.

[12] J. M. Chambers, W. S. Cleveland, B. Kleiner, and P. A. Tukey. Graphical Methods for Data Analysis. Wadsworth Statistics/Probability Series. Duxbury Press, 1983.

[13] S. Chen, B. Falsafi, P. B. Gibbons, M. Kozuch, T. C. Mowry, R. Teodorescu, A. Ailamaki, L. Fix, G. R. Ganger, B. Lin, and S. W. Schlosser. Log-based architectures for general-purpose monitoring of deployed code. In 1st Workshop on Architectural and System Support for Improving Software Dependability (ASID), pages 63–65. ACM, 2006.

[14] S. Chen, M. Kozuch, T. Strigkos, B. Falsafi, P. B. Gibbons, T. C. Mowry, V. Ramachandran, O. Ruwase, M. P. Ryan, and E. Vlachos. Flexible hardware acceleration for instruction-grain program monitoring. In 35th International Symposium on Computer Architecture (ISCA), pages 377–388. IEEE, 2008.

[15] S. Chen, J. Xu, N. Nakka, Z. Kalbarczyk, and R. Iyer. Defeating memory corruption attacks via pointer taintedness detection. In International Conference on Dependable Systems and Networks (DSN), pages 378–387, 2005.

[16] Y.-Y. Chen, P. A. Jamkhedkar, and R. B. Lee. A software-hardware architecture for self-protecting data. In ACM Conference on Computer and Communications Security, pages 14–27. ACM, 2012.

[17] J. A. Clause, I. Doudalis, A. Orso, and M. Prvulovic. Effective memory protection using dynamic tainting. In 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 284–292. ACM, 2007.

[18] M. L. Corliss, E. C. Lewis, and A. Roth. DISE: a programmable macro engine for customizing applications. SIGARCH Comput. Archit. News, 31(2):362–373, May 2003.

[19] J. R. Crandall, F. T. Chong, and S. F. Wu. Minos: Architectural support for protecting control data. ACM Transactions on Architecture and Code Optimization, 5:359–389, December 2006.

[20] J. Criswell, N. Dautenhahn, and V. Adve. KCoFI: Complete control-flow integrity for commodity operating system kernels. In IEEE Security and Privacy Symposium, 2014.

[21] W. J. Dally, J. Balfour, D. Black-Shaffer, J. Chen, R. C. Harting, V. Parikh, J. Park, and D. Sheffield. Efficient embedded computing. IEEE Computer, 41(7):27–32, July 2008.

[22] M. Dalton, H. Kannan, and C. Kozyrakis. Raksha: a flexible information flow architecture for software security. In International Symposium on Computer Architecture (ISCA), pages 482–493, 2007.

[23] L. Davi, A. Sadeghi, D. Lehmann, and F. Monrose. Stitching the gadgets: On the ineffectiveness of coarse-grained control-flow integrity protection. In 23rd USENIX Security Symposium, pages 401–416, 2014.

[24] D. Y. Deng and G. E. Suh. High-performance parallel accelerator for flexible and efficient run-time monitoring. In IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 1–12. IEEE Computer Society, 2012.

[25] J. Devietti, C. Blundell, M. M. K. Martin, and S. Zdancewic. HardBound: Architectural support for spatial safety of the C programming language. In 13th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 103–114, 2008.

[26] U. Dhawan and A. DeHon. Area-efficient near-associative memories on FPGAs. In Proceedings of the International Symposium on Field-Programmable Gate Arrays, pages 191–200, 2013.

[27] U. Dhawan, C. Hritcu, R. Rubin, N. Vasilakis, S. Chiricescu, J. M. Smith, T. F. Knight, Jr., B. C. Pierce, and A. DeHon. Online appendix to Architectural support for software-defined metadata processing. Available from http://ic.ese.upenn.edu/abstracts/sdmp_asplos2015.html, January 2015.


[28] Ú. Erlingsson, M. Abadi, M. Vrable, M. Budiu, and G. C. Necula. XFI: Software guards for system address spaces. In 7th Symposium on Operating Systems Design and Implementation, pages 75–88. USENIX Association, 2006.

[29] E. A. Feustel. On the advantages of tagged architectures. IEEE Transactions on Computers, 22:644–652, July 1973.

[30] S. Fytraki, E. Vlachos, Y. O. Koçberber, B. Falsafi, and B. Grot. FADE: A programmable filtering accelerator for instruction-grain monitoring. In 20th IEEE International Symposium on High Performance Computer Architecture, pages 108–119, 2014.

[31] E. Göktas, E. Athanasopoulos, H. Bos, and G. Portokalidis. Out of control: Overcoming control-flow integrity. In IEEE Symposium on Security and Privacy, 2014.

[32] D. Hedin and A. Sabelfeld. Information-flow security for a core of JavaScript. In 25th IEEE Computer Security Foundations Symposium (CSF), pages 3–18. IEEE, 2012.

[33] J. L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34(4):1–17, Sept. 2006.

[34] M. E. Houdek, F. G. Soltis, and R. L. Hoffman. IBM System/38 Support for Capability-based Addressing. In Proceedings of the Eighth Annual Symposium on Computer Architecture, pages 341–348, 1981.

[35] C. Hritcu, M. Greenberg, B. Karel, B. C. Pierce, and G. Morrisett. All your IFCException are belong to us. In 34th IEEE Symposium on Security and Privacy, pages 3–17. IEEE Computer Society Press, May 2013.

[36] Introduction to Intel Memory Protection extensions. http://software.intel.com/en-us/articles/introduction-to-intel-memory-protection-extensions. Accessed: 2014-05-24.

[37] M. G. Kang, S. McCamant, P. Poosankam, and D. Song. DTA++: Dynamic taint analysis with targeted control-flow propagation. In Network and Distributed System Security Symposium (NDSS). The Internet Society, 2011.

[38] H. Kannan. Ordering decoupled metadata accesses in multiprocessors. In Proceedings of IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 381–390, 2009.

[39] D. King, B. Hicks, M. Hicks, and T. Jaeger. Implicit flows: Can't live with 'em, can't live without 'em. In 4th International Conference on Information Systems Security, ICISS, pages 56–70, 2008.

[40] A. Kwon, U. Dhawan, J. M. Smith, T. F. Knight, Jr., and A. DeHon. Low-fat pointers: compact encoding and efficient gate-level implementation of fat pointers for spatial safety and capability-based security. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 721–732. ACM, 2013.

[41] K. Mai, R. Ho, E. Alon, D. Liu, Y. Kim, D. Patil, and M. Horowitz. Architecture and Circuit Techniques for a 1.1GHz 16-kb Reconfigurable Memory in 0.18um-CMOS. IEEE J. Solid-State Circuits, 40(1):261–275, January 2005.

[42] F. McKeen, I. Alexandrovich, A. Berenzon, C. V. Rozas, H. Shafi, V. Shanbhogue, and U. R. Savagaonkar. Innovative instructions and software model for isolated execution. In Workshop on Hardware and Architectural Support for Security and Privacy, pages 10:1–10:1, 2013.

[43] M. S. Miller, K.-P. Yee, and J. Shapiro. Capability myths demolished. Version submitted to Usenix Security 2003, 2003.

[44] D. A. Moon. Architecture of the Symbolics 3600. In Proceedings of the 12th Annual International Symposium on Computer Architecture, ISCA, pages 76–83, Los Alamitos, CA, USA, 1985. IEEE Computer Society.

[45] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi. CACTI 6.0: A tool to model large caches. HPL 2009-85, HP Labs, Palo Alto, CA, April 2009. Latest code release for CACTI 6 is 6.5.

[46] S. Nagarakatte, M. M. K. Martin, and S. Zdancewic. Hardware-Enforced Comprehensive Memory Safety. IEEE Micro, 33(3):38–47, May-June 2013.

[47] S. Nagarakatte, M. M. K. Martin, and S. Zdancewic. WatchdogLite: Hardware-accelerated compiler-based pointer checking. In 12th Annual IEEE/ACM International Symposium on Code Generation and Optimization, page 175. ACM, 2014.

[48] S. Nagarakatte, J. Zhao, M. M. K. Martin, and S. Zdancewic. CETS: compiler enforced temporal safety for C. In 9th International Symposium on Memory Management, pages 31–40. ACM, 2010.

[49] J. Newsome and D. X. Song. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In Network and Distributed System Security Symposium (NDSS). The Internet Society, 2005.

[50] B. Niu and G. Tan. Modular control-flow integrity. In ACM SIGPLAN Conference on Programming Language Design and Implementation, page 58. ACM, 2014.

[51] E. I. Organick. Computer System Organization: The B5700/B6700 Series. Academic Press, 1973.

[52] D. A. Patterson and C. H. Sequin. RISC I: A Reduced Instruction Set VLSI Computer. In Proceedings of the 8th Annual Symposium on Computer Architecture, ISCA '81, pages 443–457, 1981.

[53] F. Qin, C. Wang, Z. Li, H. Kim, Y. Zhou, and Y. Wu. LIFT: A low-overhead practical information flow tracking system for detecting security attacks. In 39th IEEE/ACM International Symposium on Microarchitecture (MICRO-39), pages 135–148, 2006.

[54] M. Rinard, C. Cadar, D. Dumitran, D. M. Roy, T. Leu, and W. S. Beebee, Jr. Enhancing Server Availability and Security Through Failure-Oblivious Computing. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), December 2004.

[55] D. Ritchie and K. Thompson. The UNIX Time-Sharing System. Bell System Technical Journal, 57(6):1905–1930, 1978.

[56] A. Russo and A. Sabelfeld. Dynamic vs. static flow-sensitive security analysis. In 23rd Computer Security Foundations Symposium (CSF), pages 186–199. IEEE Computer Society, 2010.

[57] O. Ruwase and M. S. Lam. A Practical Dynamic Buffer Overflow Detector. In Proc. NDSS, pages 159–169, 2004.


[58] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: A dynamic race detector for multi-threaded programs. ACM Transactions on Computer Systems, 15(4), 1997.

[59] H. Shacham. The Geometry of Innocent Flesh on the Bone: Return-into-libc without Function Calls (on the x86). In Proc. ACM CCS, pages 552–561, Oct. 2007.

[60] R. Shioya, D. Kim, K. Horio, M. Goshima, and S. Sakai. Low-overhead architecture for security tag. In Proceedings of the 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC '09, pages 135–142, Washington, DC, USA, 2009. IEEE Computer Society.

[61] B. J. Smith. A pipelined, shared-resource MIMD computer. In Proc. ICPP, pages 6–8, 1978.

[62] D. Stefan, A. Russo, J. C. Mitchell, and D. Mazières. Flexible dynamic information flow control in Haskell. In 4th Symposium on Haskell, pages 95–106. ACM, 2011.

[63] G. E. Suh, J. W. Lee, D. Zhang, and S. Devadas. Secure program execution via dynamic information flow tracking. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 85–96, 2004.

[64] G. S. Taylor, P. N. Hilfinger, J. R. Larus, D. A. Patterson, and B. G. Zorn. Evaluation of the SPUR lisp architecture. In Proceedings of the 13th Annual International Symposium on Computer Architecture, ISCA, pages 444–452, 1986.

[65] M. Tiwari, B. Agrawal, S. Mysore, J. Valamehr, and T. Sherwood. A Small Cache of Large Ranges: Hardware Methods for Efficiently Searching, Storing, and Updating Big Dataflow Tags. In Proc. MICRO, pages 94–105, 2008.

[66] G. Venkataramani, I. Doudalis, Y. Solihin, and M. Prvulovic. FlexiTaint: A programmable accelerator for dynamic taint propagation. In 14th International Symposium on High Performance Computer Architecture (HPCA), pages 173–184, Feb. 2008.

[67] E. Witchel, J. Cates, and K. Asanovic. Mondrian memory protection. In 10th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS X, pages 304–316, New York, NY, USA, 2002. ACM.

[68] J. Woodruff, R. N. Watson, D. Chisnall, S. W. Moore, J. Anderson, B. Davis, B. Laurie, P. G. Neumann, R. Norton, and M. Roe. The CHERI capability model: Revisiting RISC in an age of risk. In Proc. of the International Symposium on Computer Architecture (ISCA), pages 457–468, June 2014.

[69] B. Yee, D. Sehr, G. Dardyk, J. B. Chen, R. Muth, T. Ormandy, S. Okasaka, N. Narula, and N. Fullagar. Native Client: a sandbox for portable, untrusted x86 native code. Communications of the ACM, 53(1):91–99, 2010.

[70] A. Yip, X. Wang, N. Zeldovich, and M. F. Kaashoek. Improving application security with data flow assertions. In Proceedings of the Symposium on Operating Systems Principles, Big Sky, MT, USA, October 2009.

[71] N. Zeldovich, H. Kannan, M. Dalton, and C. Kozyrakis. Hardware enforcement of application security policies using tagged memory. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI, pages 225–240. USENIX Association, 2008.

[72] C. Zhang, T. Wei, Z. Chen, L. Duan, L. Szekeres, S. McCamant, D. Song, and W. Zou. Practical Control Flow Integrity & Randomization for Binary Executables. In IEEE Symposium on Security and Privacy, 2013.

[73] P. Zhou, R. Teodorescu, and Y. Zhou. HARD: Hardware-assisted lockset-based race recording. In Proc. HPCA, 2007.


A. Policy Example: Memory Safety

Description We implement a scheme (based on [17]) that identifies all temporal and spatial violations in heap-allocated memory. Intuitively, for each new allocation we make up a fresh color-id, c, and write c as the tag on each memory location in the newly created memory block (à la memset). The pointer to the new block is also tagged c. Later, when we dereference a pointer, we check that its tag is the same as the tag on the memory cell to which it points. When a block is freed, the tags on all its cells are changed to a constant F representing free memory. The heap is initially all tagged F. We use a special tag ⊥ for non-pointers, and we write t for a tag that is either a color c or ⊥.

Because memory cells may contain pointers, in general each word in memory has to be associated with two tags. We handle this by making the tag on each memory cell be a pointer to a pair (c, t), where c is the id of the memory block in which this cell was allocated and t is the tag on the word stored in the cell. We use a domain-specific language based on the rule function described in §2 for specifying a policy in terms of symbolic rules. The rules for load and store take care of packing and unpacking these pairs, along with checking that each memory access is valid (i.e., the accessed cell is within the block pointed to by this pointer):

load : (−, −, c1, −, (c2, t2)) → (−, t2) if c1 = c2   (1)

store : (−, −, t1, c2, (c3, t3)) → (−, (c3, t1)) if c2 = c3   (2)

The checking shows up as conditions under which the symbolic rule is valid (e.g., c2 = c3 above). The "−" symbol indicates the don't-care fields in the rule.
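The two symbolic rules above can be rendered in software as follows. This is an illustrative Python sketch, not the paper's rule language: `BOTTOM`, `PolicyViolation`, and the function names are hypothetical, and tags are modeled directly as Python values rather than pointers to pairs.

```python
# A tag is either BOTTOM (non-pointer) or a color id; each memory cell's
# tag is a (block_color, stored_word_tag) pair, as described above.
BOTTOM = None

class PolicyViolation(Exception):
    pass

def load_rule(ptr_tag, mem_tag):
    """Rule (1): the pointer's color must match the cell's block color;
    the loaded value carries the tag stored with the cell."""
    c2, t2 = mem_tag
    if ptr_tag != c2:
        raise PolicyViolation("load: color mismatch")
    return t2

def store_rule(val_tag, ptr_tag, mem_tag):
    """Rule (2): on a matching color, the cell's new tag pairs the block
    color with the tag of the word being stored."""
    c3, _t3 = mem_tag
    if ptr_tag != c3:
        raise PolicyViolation("store: color mismatch")
    return (c3, val_tag)

print(load_rule(23, (23, BOTTOM)))         # None (a non-pointer word)
print(store_rule(BOTTOM, 7, (7, BOTTOM)))  # (7, None)
```

A dereference through a stale or out-of-bounds pointer carries the wrong color and raises, which corresponds to the miss handler transferring control to the error handler.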

Address arithmetic operations preserve the pointer tag:

add : (−,−, c,⊥,−)→ (−, c) (3)

To maintain the invariant that tags on pointers can only originate from allocation, operations that create data from scratch (like loading constants) set the result's tag to ⊥.

We augment malloc and free to tag memory regions using the tagged instructions and ephemeral rules (which are deleted from the rule cache once they are used). In malloc we generate a fresh tag for the pointer to the new region via an ephemeral rule.

move : (−, tmalloc, t, −, −) −1→ (−, tnewtag)   (4)

The arrow annotated with the 1 denotes an ephemeral rule. We then use the newly tagged pointer to write a zero to every word in the allocated region using a special store rule

store : (−, tmallocinit , t1, c2, F )→ (−, (c2, t1)) (5)

before returning the tagged pointer. Later, free uses a modified store instruction to retag the region as unallocated

store : (−, tfreeinit , t1, c2, (c3, t4))→ (−, F ) (6)

before returning the memory region to the free list.
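The malloc/free protocol above can be simulated in a few lines. This is an illustrative sketch with hypothetical names; the real system performs these retaggings with tagged store instructions and hardware rules, not a software loop over a dictionary, and the fresh-color counter stands in for the ephemeral newtag rule.

```python
F = "F"                    # tag for unallocated (free) heap cells
BOTTOM = None              # tag for non-pointers
_next_color = 0
memory_tags = {}           # address -> F or (block_color, word_tag)

def tagged_malloc(base, size):
    """Allocate [base, base+size): mint a fresh color (standing in for
    the ephemeral newtag rule) and retag every cell from F to
    (c, BOTTOM), mirroring the malloc-init store rule."""
    global _next_color
    _next_color += 1
    c = _next_color
    for addr in range(base, base + size):
        memory_tags[addr] = (c, BOTTOM)
    return base, c         # the returned pointer carries tag c

def tagged_free(base, size):
    """Retag the region as unallocated, mirroring the free-init store rule."""
    for addr in range(base, base + size):
        memory_tags[addr] = F

p, c = tagged_malloc(0x1000, 4)
tagged_free(0x1000, 4)
print(memory_tags[0x1000])     # 'F'
```

Because freed cells revert to F and colors are never reused, a dangling pointer (still tagged with the old color) can never match a cell's tag again, which is how the policy catches temporal violations.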

Complete Symbolic Rule Set We use opgroups to compactly describe the ruleset:

nop, cbranch, ubranch, ijump, return : (−, −, −, −, −) → (−, −)   (1)

ar2s1d : (−, −, ⊥, ⊥, −) → (−, ⊥)   (2)
ar2s1d : (−, −, c, ⊥, −) → (−, c)   (3)
ar2s1d : (−, −, ⊥, c, −) → (−, c)   (4)
ar2s1d : (−, −, c, c, −) → (−, ⊥)   (5)
ar1s1d : (−, −, t, −, −) → (−, t)   (6)
ar1d, dcall, icall, flags : (−, −, −, −, −) → (−, ⊥)   (7)
load : (−, −, c1, −, (c2, t2)) → (−, t2) if c1 = c2   (8)
store : (−, ci, t1, c2, (c3, t3)) → (−, (c3, t1)) if c2 = c3 ∧ ci ∉ {tmallocinit, tfreeinit}   (9)
store : (−, tmallocinit, t1, c2, F) → (−, (c2, t1))   (10)
store : (−, tfreeinit, t1, c2, (c3, t4)) → (−, F)   (11)
move : (−, tmalloc, t, −, −) −1→ (−, tnewtag)   (12)
move : (−, tmalloc, t, −, −) → (−, t)   (13)

Concrete Rules The symbolic rules used above for policy specification are written using variables, allowing a few symbolic rules to describe the policy over an unbounded universe of distinct values. The concrete rules stored in the rule cache, however, must refer to specific, concrete tag values. For example, if 23 and 24 are valid memory block colors, we will need concrete instances of Rule 3 in the PUMP rule cache for c = 23 and c = 24. Assuming we encode ⊥ as 0 and mark don't-care fields as 0, the concrete rules are:

ar2s1d : (0, 0, 23, 0, 0)→ (0, 23)

ar2s1d : (0, 0, 24, 0, 0)→ (0, 24)

The miss handler gets the concrete input tags and runs code compiled from the symbolic rules to produce the associated concrete output tags in order to insert rules into the PUMP rule cache. When the symbolic rule identifies a violation, control transfers to an error handler and no new concrete rules are inserted into the PUMP rule cache.
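The cache-plus-miss-handler interaction can be sketched as follows. The names are hypothetical and only symbolic Rule 3 (ar2s1d with a color and ⊥, encoded as 0) is modeled; a violation raises before any concrete rule is installed, mirroring the behavior described above.

```python
class PolicyViolation(Exception):
    pass

rule_cache = {}   # (opgroup, concrete input tags) -> concrete output tags

def resolve(opgroup, in_tags, symbolic_rules):
    """Look up a concrete rule; on a miss, run the code compiled from the
    symbolic rules to compute the output tags, then install the concrete
    rule. A violation raises before anything is installed."""
    key = (opgroup, in_tags)
    if key not in rule_cache:
        rule_cache[key] = symbolic_rules(opgroup, in_tags)
    return rule_cache[key]

def mem_safety(opgroup, tags):
    """Only symbolic Rule 3: ar2s1d with a color and BOTTOM (encoded 0)
    propagates the color; all other cases are omitted in this sketch."""
    _pc, _ci, op1, op2, _mem = tags
    if opgroup == "ar2s1d" and op2 == 0:
        return (0, op1)
    raise PolicyViolation("unhandled or violating case")

print(resolve("ar2s1d", (0, 0, 23, 0, 0), mem_safety))  # (0, 23)
print(resolve("ar2s1d", (0, 0, 24, 0, 0), mem_safety))  # (0, 24)
```

After the first miss for each color, subsequent lookups with the same concrete tags hit in `rule_cache` without invoking the handler, which is the software analogue of single-cycle rule-cache hits in the hardware.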

Tag and Rule Statistics In the first 1.5B instructions, the GemsFDTD benchmark generates 580K memory block colors. 3.3M concrete rules are generated from the 13 symbolic rules above. Since not all combinations of memory block and pointer tags actually occur, symbolic rules like Rule 8 do not generate the theoretical maximum number of concrete rules ((580K)² = 330B ≫ 3.3M).