
Supporting Differentiated Services in Computers via Programmable Architecture for Resourcing-on-Demand (PARD)

Jiuyue Ma 1,2, Xiufeng Sui 1, Ninghui Sun 1, Yupeng Li 1, Zihao Yu 1,2, Bowen Huang 1, Tianni Xu 1,2, Zhicheng Yao 1, Yun Chen 3, Haibin Wang 3, Lixin Zhang 1, Yungang Bao 1

1 State Key Laboratory of Computer Architecture, ICT, CAS
2 University of Chinese Academy of Sciences
3 Huawei Technologies

Abstract

This paper presents PARD, a programmable architecture for resourcing-on-demand that provides a new programming interface to convey an application's high-level information, like quality-of-service requirements, to the hardware. PARD enables new functionalities like fully hardware-supported virtualization and differentiated services in computers.

PARD is inspired by the observation that a computer is inherently a network in which hardware components communicate via packets (e.g., over the NoC or PCIe). We apply principles of software-defined networking to this intra-computer network and address three major challenges. First, to deal with the semantic gap between high-level applications and underlying hardware packets, PARD attaches a high-level semantic tag (e.g., a virtual machine or thread ID) to each memory-access, I/O, or interrupt packet. Second, to make hardware components more manageable, PARD implements programmable control planes that can be integrated into various shared resources (e.g., cache, DRAM, and I/O devices) and can differentially process packets according to tag-based rules. Third, to facilitate programming, PARD abstracts all control planes as a device file tree to provide a uniform programming interface via which users create and apply tag-based rules.

Full-system simulation results show that by co-locating latency-critical memcached applications with other workloads, PARD can improve a four-core computer's CPU utilization by up to a factor of four without significantly increasing tail latency. FPGA emulation based on a preliminary RTL implementation demonstrates that the cache control plane introduces no extra latency and that the memory control plane can reduce queueing delay for high-priority memory-access requests by up to a factor of 5.6.

Categories and Subject Descriptors C.5.5 [Computer System Organization]: Computer System Implementation–Server

Keywords Hardware/Software Interface, QoS, Data Center

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ASPLOS '15, March 14-18, 2015, Istanbul, Turkey.
Copyright © 2015 ACM 978-1-4503-2835-7/15/03 . . . $15.00.
http://dx.doi.org/10.1145/2694344.2694382

1. Introduction

Conventional computer organization provides an instruction set architecture (ISA) for software developers and compilers to interact with the hardware. However, as pointed out in the community white paper 21st Century Computer Architecture [13]:

"Current ISAs fail to provide an efficient means of capturing software-intent or conveying critical high-level information to the hardware. For example, they have no way of specifying when a program requires energy efficiency, robust security, or a desired Quality of Service (QoS) level."

The white paper further suggests that "new, higher-level interfaces are needed to encapsulate and convey programmer and compiler knowledge to the hardware, resulting in major efficiency gains and valuable new functionality."

Taking data centers as a particular case study, such new architectural interfaces that convey an application's QoS requirements to the hardware can bring significant efficiency benefits for contemporary data centers, because they must manage difficult trade-offs between resource utilization and applications' QoS.

On one hand, to achieve high utilization, one straightforward approach is to co-locate workloads or virtual machines (VMs) on each physical server. In this manner, Google's batch-workload data centers achieve 75% CPU utilization, on average [15]. However, co-location induces contention for various shared hardware resources (such as CPU cores, caches, memory bandwidth, and network switches [19]), as well as shared software resources (e.g., page caches, socket buffers, and multiple layers of queueing buffers [19, 35]). Such contention causes unpredictable performance variability [45, 47, 50] that is amplified at data center scales [19], where online services involve tens to hundreds of servers in processing even a single user request [51]. Moreover, such performance variability occurs frequently due to the unpredictability of frequent, short-running workloads. A month of profiling data from a 12,000-server Google data center shows that more than a million jobs ran for only a few minutes [59].

On the other hand, to guarantee QoS of latency-critical online services, data center operators or developers tend to avoid sharing by either dedicating resources or exaggerating reservations for online services in shared environments. Another Google example shows that typical online-service data centers exhibit only about 30% CPU utilization, on average, much lower than batch-workload data centers [15]. In fact, without co-location, industry-wide utilization is even lower, only between 6% [34] and 12% [1]. For shared environments, Reiss et al. [59] verify that developers indeed exaggerate resource requirements: in their 12,000-server data center,


actual CPU and memory utilizations are only 20% and 40%, respectively, while the average reservations are 75% and 60%.

Application: Background daemon [19], backup job [19, 77]
Network Stacks: Nagle's algorithm, limited buffers, delayed-ACK-caused RTO [77], TCP congestion control [14, 23], packet scheduling [29, 71, 73, 78], kernel sockets [39]
OS Kernel: Lock contention [35], context switch, kernel scheduling, SMT load imbalance and IRQ imbalance [39]
Hypervisor: Virtual machine scheduling [72, 74, 75], network bandwidth [33, 67, 72, 74]
Hardware: SMT [79], shared caches [20, 36, 39, 56, 63, 64, 69], memory [20, 49, 69, 76], NIC [57], I/O [20, 48]
Datacenter: Global file system [19]

Table 1. Contention Identified in Prior Work (layer: contention points)

Recognizing the inherent trade-offs, researchers have delved into the contention problem over recent years. Table 1 highlights efforts to eliminate resource contention points at various levels of the software stack, from the interrupt handler, OS scheduler [39], and hypervisor scheduler [74, 75], to the network stack, queuing buffers, and application locks [35]. Despite these efforts, utilization of practical online-service data centers remains low. First, software-only approaches are limited to coarse-grained management, which results in relatively poor control accuracy. Second, sophisticated software stacks and varying application demands cause contention points to change across different scenarios (as shown in Table 1). Third, finding contention points usually requires tremendous time and effort [74].

Hardware isolation techniques such as cache partitioning [36, 56, 64] and memory channel partitioning [49] suffer from two primary limitations: focusing on only one type rather than multiple types of resources, and lacking uniform interfaces for software to manage various types of resources. Building resource-efficient data centers thus requires coordinated, cross-layer mechanisms [38].

In this paper, we propose Programmable Architecture for Resourcing-on-Demand (PARD), which provides a new programming interface to convey an application's QoS requirements to the hardware. PARD supports new functionalities such as fully hardware-supported virtualization without software hypervisors and differentiated service (DiffServ) [60] in data center servers. For instance, PARD can accurately isolate performance in shared data centers to improve server utilization without degrading QoS for latency-critical applications.

The key idea of PARD is inspired by software-defined networking (SDN) [2], in which decoupling the control plane from the data plane allows the control plane to be made programmable. We find that a computer can be viewed as a network and then apply SDN principles to computer architecture to form PARD. Specifically, PARD consists of four mechanisms. First, a tagging mechanism labels each intra-computer network (ICN) packet with a high-level entity (e.g., VM, container, process, or thread) ID. Second, a programmable control plane that can be easily integrated into a variety of hardware components uses tag-based rules to differentially process ICN packets. Third, a per-computer IPMI-like [6] platform resource manager (PRM) running Linux-based firmware connects all control planes. Fourth, the firmware abstracts all control planes as a device file tree to provide a uniform programming interface for users to define tag-based rules (e.g., basic priority rules and advanced "trigger⇒action" rules) in the control planes.

We implemented PARD in a full-system simulator based on GEM5 [17]. Simulations of unmodified Linux 2.6.28.4 running Memcached [8] and SPEC CPU2006 [26] workloads show that

PARD successfully balances CPU utilization and QoS requirements: on a simulated four-core physical server without PARD, co-running applications achieve 100% CPU utilization but increase Memcached's tail latency by two orders of magnitude, unless the server limits CPU utilization to 25%. In contrast, when PARD is enabled, the server can achieve 100% CPU utilization without substantially increasing Memcached's tail latency.

We evaluate hardware overhead on an FPGA development board [11] with a preliminary RTL implementation based on the OpenSPARC T1 microprocessor [54]. Synthesis results show that when supporting 256 different tags and 64 advanced "trigger⇒action" rules, the last level cache (LLC) and memory control planes introduce about 3.1% and 10.1% FPGA resource overhead, respectively. Since software is still being ported, we run the RTL emulation with microbenchmarks, finding that:

1. the LLC control plane does not introduce extra cycles because its control logic is effectively hidden within the pipeline of the LLC controller (OpenSPARC T1's LLC has eight pipeline stages), and

2. the memory control plane can reduce queueing delays for high-priority memory-access requests by up to 5.6× (from 15.2 cycles to only 2.7 cycles) while reasonably increasing the delay of low-priority memory-access requests by 33.6%.

In summary, our major contributions are:

• a new computer architecture supporting virtualization fully in hardware and providing a new programming interface to convey high-level information to the hardware,

• a programmable control plane structure that applies to various hardware resources and can perform differentiated services,

• a firmware that abstracts all control planes as a device file tree and provides an advanced "trigger⇒action" programming methodology to deploy resourcing-on-demand management policies, and

• a full-system simulation model and a proof-of-concept FPGA implementation.

The rest of the paper is organized as follows: Section 2 introduces background. Section 3 presents an overview of PARD. Section 4 outlines the architecture-level design and Section 5 describes the software design. Some implementation details are given in Section 6. Section 7 demonstrates case studies based on PARD and overhead evaluations. Section 8 further discusses several issues such as overhead, optimization and functionality extension. Section 9 covers related work and Section 10 concludes.

2. Background

We describe the background of DiffServ as well as SDN and then present several challenges in applying SDN's principles to computer architecture.

2.1 DiffServ and SDN: The Origin and Inspiration

Historically, the trade-offs that data centers confront today are similar to those the Internet faced in the 1990s, when many network applications emerged, such as streaming video, e-commerce, email and file transfers, which exhibited varying QoS requirements. To deliver end-to-end QoS on an IP network, the Internet Engineering Task Force (IETF) proposed Differentiated Services (DiffServ) [61] in 1998. Today DiffServ is one of the most widely adopted traffic engineering mechanisms. The key idea of DiffServ is defining an 8-bit Differentiated Services field (DS field) in the IP header to designate an application's requirements so that routers can


leverage the DS field to manage each packet for applications' different requirements.

Figure 1. Intra-Computer Network (ICN). A traditional computer can be viewed as a network. (The figure shows CPU cores, the last level cache, the DRAM controller and memory, the I/O bridge and I/O interface, the NIC, disks and other I/O devices, interrupts, and a neighbor CPU connected as nodes of a network.)

The advent of software-defined networking (SDN) [2] further facilitates network management and traffic engineering. There is one prevalent interpretation of SDN's key principles:

• Each network packet is attached with a tag containing application information, like flowid in OpenFlow [10].

• For network devices, the control plane is decoupled from the data plane and is responsible for packet management, while the data plane is in charge of packet forwarding operations.

• All control planes are logically centralized and are directly programmable via a uniform programming interface.

The prevalence of DiffServ and SDN motivates us to investigate whether it is feasible to apply SDN's principles to computer architecture to address the aforementioned trade-offs. Our key observation is that a computer is inherently a network. As shown in Figure 1, a computer can be viewed as an intra-computer network (ICN)¹; hardware components communicate with each other via different types of packets, such as network-on-chip (NoC), QuickPath Interconnect (QPI) [30] and PCIe packets; and, apart from processing packets, the controllers of hardware components actually play the role of network routers, i.e., forwarding packets to the next hop. Therefore, it should be possible to apply SDN to computer architecture.

2.2 Challenges of Applying SDN to Computer Architecture

Nevertheless, compared with deploying SDN in a network, there are at least three challenges in applying SDN to computer architecture.

First, in a network, the network stack is usually the only source of packets, so it is easy to add a tagging mechanism to the network stack. In a computer, in contrast, a number of different hardware components can generate different types of packets. Thus, we need to address how to tag packets generated by various hardware components. (§4.1)

Second, unlike network routers, which perform almost the same store-and-forward behavior, hardware components in a computer behave differently. It is challenging to figure out a uniform control plane structure for different hardware components such as the LLC, memory and I/O devices. (§4.2)

Finally, there is already firmware running in a network router for users to access and configure the router's control plane. However, conventional computers lack similar firmware. The current IPMI² [6] firmware in servers only performs limited monitoring

1 The Barrelfish OS work [16] makes a similar observation that a computer is already a distributed system.
2 IPMI (Intelligent Platform Management Interface) is used to monitor and manage hardware when a server is powered off or unresponsive.

and management functionality such as temperature, fan speed and power. Thus, it is essential to devise a firmware that provides a uniform interface for server operators to access the control planes of hardware components. Moreover, a flexible programming methodology is desirable as well (§5).

3. PARD Overview

To tackle the challenges above, we propose the PARD architecture to demonstrate how to apply SDN to computer architecture. (As mentioned in the Introduction, this paper focuses on data center scenarios.) Figure 2 shows an overview of the architecture, which consists of four mechanisms:

① Tagging Mechanism. In shared data center environments, in order to allow shared hardware resources to distinguish different entities such as multiple logic domains (LDoms³), a DiffServ ID (DS-id) tag is assigned to each entity by a centralized platform resource manager. We add a DS-id tag register to all sources that generate requests (i.e., ICN packets), such as each CPU core and every I/O device. These registers are used for tagging cache-access, memory-access, DMA and interrupt requests. When a DS-id tag is attached to a request at the source, the tag travels along with the request during its whole lifetime until the request is completed.

② Programmable Control Plane. Since each request contains a DS-id tag, we introduce a programmable control plane into every shared resource to make use of the DS-id tag. Upon receiving a request, a control plane first serves the request according to its DS-id and then generates a new request attached with the same DS-id to the next component.

There are various hardware components that behave differently and use DS-id tags in different ways as well; e.g., the LLC uses DS-ids for capacity allocation while the memory controller uses DS-ids for bandwidth allocation. We devise a basic control plane structure for a variety of components. Specifically, this structure consists of three DS-id indexed tables, i.e., a parameter table storing resource allocation policies, a statistics table storing resource usage information and a trigger table storing performance triggers. Besides, the structure includes a programming interface and an interrupt line. All control planes are connected to a centralized platform resource manager. This control plane structure can be easily instantiated and integrated into LLC, memory and I/O controllers. More details are described in §4.2.
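To make the structure concrete, the following is a minimal C sketch of this basic control plane layout: three DS-id-indexed tables, a statistics update, and an interrupt path to the PRM. The field names, table width, and the single generic policy and statistic fields are illustrative assumptions rather than the actual RTL; the real columns differ per component, as Table 3 later shows.

    #include <stdint.h>

    #define MAX_DSID 256   /* number of DS-id tags supported (assumed) */

    /* Parameter table: per-DS-id resource allocation policy (columns vary per component). */
    struct param_entry   { uint64_t alloc; };        /* e.g., LLC way mask or memory priority */
    /* Statistics table: per-DS-id usage counters collected by the data plane. */
    struct stats_entry   { uint64_t count; };        /* e.g., miss count, bytes served */
    /* Trigger table: per-DS-id condition "statistic <op> value". */
    struct trigger_entry { uint8_t stat_id, op; uint64_t value; };

    struct control_plane {
        struct param_entry   param[MAX_DSID];
        struct stats_entry   stats[MAX_DSID];
        struct trigger_entry trig[MAX_DSID];
    };

    /* Data-plane hook invoked for every tagged request: read the policy for this
     * DS-id, update the statistics table, and raise an interrupt to the PRM when
     * the trigger condition holds. */
    static uint64_t on_request(struct control_plane *cp, uint16_t dsid,
                               uint64_t observed_stat, void (*irq_to_prm)(uint16_t))
    {
        cp->stats[dsid].count++;                             /* statistics table update   */
        const struct trigger_entry *t = &cp->trig[dsid];
        if ((t->op == '>' && observed_stat > t->value) ||
            (t->op == '<' && observed_stat < t->value))
            irq_to_prm(dsid);                                /* trigger fires: notify PRM */
        return cp->param[dsid].alloc;                        /* policy used by the data plane */
    }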

③ Per-Computer Centralized Platform Resource Manager (PRM). Like IPMI [6] in conventional servers, PARD includes a per-computer centralized platform resource manager (PRM) that connects all control planes and tag registers (see the dashed lines in Figure 2). The PRM is essentially an embedded system-on-chip (SoC) that consists of an embedded processor, RAM, flash, a local bus, an Ethernet adaptor and several control plane adaptors (CPAs).

A Linux-based firmware running on the PRM abstracts all control planes as a device file tree that is logically centralized. The firmware provides a uniform file-based programming interface to access different control planes and a more advanced "trigger⇒action" programming methodology (see below) for operators to create and deploy resource management policies.

④ "Trigger⇒Action" Programming Methodology. To help operators define resource management policies and program control planes, PARD provides not only basic priority-based rules but also an advanced "trigger⇒action" programming methodology.

3 The term logic domain (LDom) was originally introduced by the SPARC processor [53]. In this paper, an LDom represents a submachine that contains several CPU cores, memory capacity and storage capacity and can run an unmodified OS. A server can be partitioned into multiple LDoms.


Figure 2. PARD Architecture Overview. The grey boxes represent PARD components. (The figure shows per-core tag registers, control planes with parameter, statistics and trigger tables attached to the shared cache, memory controller, I/O bridge and devices, a control plane network, and the platform resource manager (embedded processor, RAM, flash, Ethernet adaptor and control plane adaptors CPA[1..n]) running trigger handlers.)

Figure 3. An Illustrative Example of PARD Usage. Above the timeline is the users' view and below is the data center operators' view. (Users launch applications at times T1 through T9; for each, the operator creates an LDom, assigns a DS-id, allocates resources, installs "trigger⇒action" rules (e.g., CacheMissRate > 30% ⇒ Action #1) and defines trigger handlers (e.g., #1: increase allocated cache capacity, #2: increase schedule priority); when interference occurs at T9, trigger action #1 fires.)

As shown in Figure 3, operators can define several DS-id tag based "trigger⇒action" rules, each of which targets a set of hardware resources. In particular, triggers are based on performance metrics such as LLC miss rate and are stored in trigger tables; an action (a.k.a. trigger handler) can be written in any language that supports file primitives. Operator-defined rules are installed in the device file tree of the firmware. It is worth noting that, thanks to the centralized PRM, a trigger and its action can be designated to different resources. For instance, if a trigger is created to monitor memory bandwidth, its action can be defined to adjust LLC capacity, because LLC miss rate is strongly correlated with both memory bandwidth and LLC capacity.

Data center operators define actions and "trigger⇒action" rules to represent different resource management policies that are correlated with service-level agreements (SLAs), so that users can choose a suitable SLA according to their QoS requirements.

3.1 Illustrative Example

The example in Figure 3 demonstrates several usage scenarios on a PARD server in a shared data center.

1. At times T1 and T3, user A and user B intend to launch their applications, respectively. They submit resource requests to a data center operator, who decides to allocate resources from the same PARD server to users A and B. (A data center operator could be a resource and job management system, such as Mesos [28].)

2. The operator in turn (at T2 and T4) processes the users' requests and then sends resource requests to the PRM firmware of the server. Take user B as an example: once it receives the request, the firmware first creates an LDom, assigns the LDom's DS-id as "1" (i.e., forming LDom1) and allocates resources for LDom1. Since LDom1 is created for a normal-priority application, the firmware adopts default "trigger⇒action" rules for it, which means that LDom1 can share the LLC with other LDoms rather than possess dedicated resources. Finally, the firmware executes commands to program the control planes (i.e., modify the tables in Figure 2) and then launches LDom1 and its OS.

3. At T5, user C wants to run a high-priority latency-critical application and sends a resource request to the firmware. At T6, the firmware creates LDom2 with DS-id "2" and adopts a set of "trigger⇒action" rules for LDom2. These rules are programmed into the related control planes.

4. During T7 and T8, more users submit resource requests to the firmware and the utilization of the server keeps increasing.

5. Unfortunately, at T9, severe interference occurs and the cache miss rate of user C's application sharply increases (>30%). On a conventional server, a high cache miss rate can worsen the tail response time of the application. In contrast, on the PARD server, this change activates LDom2's trigger stored in the LLC control plane, which immediately sends an interrupt signal to the PRM. The firmware running on the PRM captures the interrupt and invokes the corresponding trigger handler (see the rightmost part of Figure 2) to modify the parameter table of the LLC control plane, which thereby allocates more LLC capacity to user C's application. As a result, the response time remains low.


Figure 4. Control Plane for Last Level Cache (LLC). (The figure shows the parameter, statistics and trigger tables indexed by DS-id, the programming interface and interrupt line to the PRM, and the modified data path: the tag array stores an owner DS-id per block, the pseudo-LRU logic is way-partitioning enabled, and the MSHR and writeback buffer carry DS-ids.)

Figure 5. Control Plane for Memory Controller. (The figure shows the parameter table (address map, priority), statistics table (average queueing latency, served count) and trigger table indexed by DS-id, the programming interface and interrupt line to the PRM, per-DS-id address translation, priority queues feeding the arbiter and DRAM state machines, and row-buffer IDs attached to requests.)

4. Architecture Design

This section presents the architecture design of PARD. The emphasis is on the challenges encountered in the tagging mechanism and control plane design.

4.1 Tagging Mechanism

Although the rationale of the tagging mechanism (every in-flight request in a server is associated with a tag) is easy to understand, there are two major design issues.

Tagging multi-phase writeback requests. Usually a write operation is divided into multiple phases due to caching. For example, at the LLC level, the first phase of a write request to DRAM only stores data in the LLC and marks the data block as dirty. In the second phase, the dirty data block is selected for eviction when a cache miss occurs, and then written back to DRAM. During these latter phases, we need to determine which tag should be assigned to the writeback request.

To resolve this issue, we find it necessary to store the DS-id tag in the cache whenever there are multi-phase writeback requests; otherwise the result is inaccurate. For example, assume that the current request is attached with DS-id1 and the cache block to be written back belongs to another LDom with DS-id2, but DS-id2 is not stored in the LLC. The writeback request issued to DRAM then has to use the DS-id1 tag. Consequently, the memory control plane applies DS-id1's "trigger⇒action" rules rather than DS-id2's rules and updates usage statistics for DS-id1, resulting in wrong behavior.

Therefore, in the current LLC design, once a piece of requested data is filled into the LLC, its DS-id tag is stored into the tag array of the LLC and marked as the owner DS-id. When the data is later written back to DRAM, the owner DS-id is assigned to the writeback request.
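A minimal sketch of this tag-assignment rule, assuming a simplified cache-line structure (field names are illustrative, not the actual tag-array encoding):

    #include <stdint.h>
    #include <stdbool.h>

    /* Simplified LLC line: normal tag/state bits plus the stored owner DS-id. */
    struct llc_line {
        uint64_t tag;
        bool     valid, dirty;
        uint16_t owner_dsid;   /* DS-id recorded when the line was filled */
    };

    /* On a fill, remember which LDom owns the line. */
    static void on_fill(struct llc_line *line, uint64_t tag, uint16_t req_dsid)
    {
        line->tag = tag;
        line->valid = true;
        line->dirty = false;
        line->owner_dsid = req_dsid;
    }

    /* On eviction of a dirty line, the writeback request carries the owner's
     * DS-id, not the DS-id of the request that caused the eviction. */
    static uint16_t writeback_dsid(const struct llc_line *victim, uint16_t req_dsid)
    {
        (void)req_dsid;                 /* the triggering request's tag is NOT used */
        return victim->owner_dsid;      /* charge the writeback to its owner */
    }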

Tagging I/O requests and interrupt requests. There are two I/O access modes, i.e., programmed I/O (PIO) and direct memory access (DMA). While it is easy to tag a PIO request because it is issued by the CPU, tagging a DMA request is challenging.

Let us first revisit how DMA works in three steps. First, a device driver informs a DMA engine of a DMA descriptor's start address; the descriptor stores DMA buffer information such as start address, size and status. Then, after being initialized, the DMA engine loads the descriptor's contents to get the necessary information for each DMA operation and starts the data transfer. Finally, the DMA engine may raise an interrupt after all data transfers are done.

To tag DMA requests, each DMA engine includes a tag register that is used as follows: (1) Initialize the tag register. When a device driver writes a descriptor's information into a DMA engine, the DS-id tag associated with the write request is stored into the tag register of the DMA engine. (2) Tag data transfer requests. The DMA engine attaches the DS-id from the tag register to all DMA data transfer requests, which are issued to a memory controller. (3) Tag interrupt signals. To deal with interrupts, we augment the current APIC (advanced programmable interrupt controller) [5] by duplicating multiple interrupt route tables, each of which is associated with a DS-id. When a DMA engine is going to raise an interrupt, it attaches a DS-id to the interrupt request and sends the request to an APIC component. The APIC component then uses the DS-id to select the corresponding route table and forwards the interrupt to the specified CPU cores.

There are two DMA directions, i.e., from-device and to-device. For a disk, our approach works well for both directions. However, for a network interface card (NIC), this is still insufficient for the from-device direction because the sources of incoming packets are unknown. Our current solution is to augment an Intel multi-queue [9] NIC with a control plane that contains a set of tag registers. An augmented NIC can be virtualized as multiple v-NICs with different MAC addresses. Each v-NIC is dedicated to an LDom and stores the LDom's DS-id in the v-NIC's tag register. When an incoming packet arrives, the physical NIC checks its destination MAC address and forwards the packet to the corresponding v-NIC, which consequently issues DMA requests with that DS-id. It is worth noting that an alternative is to integrate PARD with an SDN network (e.g., OpenFlow [2]) to allow a DS-id to travel across servers, by correlating a DS-id with a network packet's flowid.
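A minimal sketch of the three tagging steps above, assuming a simplified DMA-engine model (structure and function names are illustrative):

    #include <stdint.h>

    struct dma_request { uint16_t dsid; uint64_t addr; uint32_t len; };
    struct interrupt   { uint16_t dsid; uint8_t vector; };

    struct dma_engine {
        uint16_t tag_reg;        /* DS-id latched from the driver's PIO write */
        uint64_t desc_addr;      /* DMA descriptor start address */
    };

    /* (1) Initialize: the driver's descriptor write carries a DS-id; latch it. */
    static void dma_set_descriptor(struct dma_engine *e, uint64_t desc, uint16_t pio_dsid)
    {
        e->desc_addr = desc;
        e->tag_reg = pio_dsid;
    }

    /* (2) Tag data transfers: every request issued to memory carries the tag. */
    static struct dma_request dma_issue(const struct dma_engine *e, uint64_t addr, uint32_t len)
    {
        return (struct dma_request){ .dsid = e->tag_reg, .addr = addr, .len = len };
    }

    /* (3) Tag interrupts: the APIC uses the DS-id to pick a per-LDom route table. */
    static struct interrupt dma_raise_irq(const struct dma_engine *e, uint8_t vector)
    {
        return (struct interrupt){ .dsid = e->tag_reg, .vector = vector };
    }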

4.2 Control Plane Design

This subsection presents how to design component-specific control planes based on the basic control plane structure described in §3. We focus on the LLC control plane and the memory control plane, whose microarchitectures are shown in Figure 4 and Figure 5. Comparing the two figures, we can see that their bottom halves exhibit almost the same structure with only slightly different table columns. Thus, this basic structure can be easily integrated into a variety of hardware components.

For I/O devices, we add control planes into the I/O bridge and the IDE controller. Since the design of these I/O control planes is similar to that of the memory control plane, we omit their detailed design here.

LLC control plane. Figure 4 shows the microarchitecture of the LLC control plane, which supports a programmable way-based partitioning mechanism to adjust cache capacity for applications.


The LLC control plane consists of three basic tables, i.e., a parameter table, a statistics table and a trigger table, all of which can be accessed by the PRM firmware through the programming interface. There is also an interrupt line connected to the PRM. Besides, the tag array and pseudo-LRU logic are modified (modifications are indicated as the shaded parts and dashed lines in Figure 4).

A key concern about the LLC control plane is whether it introduces extra delay. Fortunately, there is no extra delay because a pipelined cache controller allows the control plane's operations to be hidden within the original pipeline, as shown in Figure 4: (1) When a cache-access request arrives at the controller, its DS-id is used to query the parameter table to get way-partitioning mask bits. For instance, the mask bits "0x00FF" indicate using the rightmost 8 ways of the whole 16 ways. Meanwhile, the request's address is used to find the corresponding entry in the tag array, which stores not only normal cache tags and states but also DS-id tags. (2) The pseudo-LRU logic receives the mask bits from the parameter table and the access history information from the tag array, and then calculates which way to evict. (3) If the request hits in the LLC, data is retrieved from the data array, attached with the request's DS-id and then sent back to a CPU core via the NoC or crossbar. Note that an LLC hit occurs if and only if the request's address matches the original cache tag and the request's DS-id matches the cache block's owner DS-id tag⁴. (4) If a cache miss occurs, the LLC controller allocates an MSHR entry for the request and the DS-id. If the evicted block is dirty, its data is put into the writeback buffer (WB Buf) with its owner DS-id. When the requested data returns, the LLC updates the tag array with the original request's DS-id. (5) During the above steps, the LLC control plane performs several operations off the critical path, such as updating the statistics table, sending LLC usage statistics to the PRM, activating a trigger if necessary, and raising an interrupt signal.
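A minimal sketch of steps (1)-(3) for one cache set, assuming a 16-way set and plain arrays in place of the real tag array and masked pseudo-LRU (names are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_WAYS 16

    struct llc_way { uint64_t tag; bool valid; uint16_t owner_dsid; };

    /* (1) The way_mask comes from the parameter table, queried by the request's
     *     DS-id (e.g., 0x00FF allows the rightmost 8 of 16 ways).
     * (2) Victim selection is restricted to the allowed ways; real hardware uses
     *     masked pseudo-LRU, approximated here by "last allowed way".
     * (3) A hit requires BOTH the address tag and the owner DS-id to match.     */
    static int llc_lookup(const struct llc_way set[NUM_WAYS], uint16_t way_mask,
                          uint64_t tag, uint16_t req_dsid, int *victim_way)
    {
        *victim_way = -1;
        for (int w = 0; w < NUM_WAYS; w++) {
            if (set[w].valid && set[w].tag == tag && set[w].owner_dsid == req_dsid)
                return w;                       /* hit: return the matching way        */
            if ((way_mask >> w) & 1)
                *victim_way = w;                /* candidate victim within the mask    */
        }
        return -1;                              /* miss: allocate MSHR, refill later   */
    }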

The programming interface allows the PRM firmware to access the three tables of the LLC control plane. Specifically, it receives commands from the PRM, extracts information from each command to select the corresponding table, and reads data from (or writes data to) the table. More details are in §5.1.

Memory control plane. Figure 5 shows the microarchitecture of the memory control plane. To enable fully virtualized LDoms that are able to run unmodified OSes and applications, the parameter table of the memory control plane stores address mapping information to translate an LDom physical address to a DRAM address. Besides, the table stores priority information for each DS-id.

Regarding performance isolation, unlike the LLC, the memory control plane needs to take into account two factors, i.e., queuing delay and row buffer locality. To manage queuing delay, we add a priority queuing mechanism into the memory controller. The number of queues depends on the priority levels supported by the memory control plane. Our current design only supports two priorities, but it is easy to extend. To avoid low-priority memory requests degrading the row buffer hit rate of high-priority requests, we add one extra row buffer into each DRAM chip for high-priority memory requests. If we want to augment the DRAM chip with more buffers, we can leverage commercial designs such as NEC's virtual-channel memory (VCM).

4 If two requests contain the same physical address, their DS-id tags are used to distinguish different LDoms. When a PARD server is partitioned into multiple LDoms, each LDom has its own physical address space starting from 0. Thus, different LDoms with different DS-ids can issue cache requests for identical physical addresses. This address mapping mechanism is critical to enabling an LDom to run unmodified OSes and applications (i.e., a fully virtualized LDom).

Example 1: Set miss rate trigger

    pardtrigger /dev/cpa0 -ldom=0 -action=0 -stats=miss_rate -cond=gt,30
    echo "/cpa0_ldom0_t0.sh" > /sys/cpa/cpa0/ldoms/ldom0/triggers/0

Example 2: /cpa0_ldom0_t0.sh

    #!/bin/sh
    echo "<log message>" > /log/triggers.log
    cur_mask=`cat /sys/cpa/.../waymask`
    miss_rate=`cat /sys/cpa/.../miss_rate`
    capacity=`cat /sys/cpa/.../capacity`
    # update_mask: policy helper shown as pseudo-code in the original figure
    target=`update_mask $cur_mask $miss_rate $capacity`
    echo $target > /sys/cpa/.../waymask

Device file tree exported by the firmware (excerpt):

    /sys/cpa/
    |- cpa0
    |  |- ident : "CACHE_CP"
    |  |- type  : 0x43 'C'
    |  |- ldoms
    |     |- ldom0
    |        |- parameters
    |        |  |- waymask
    |        |  |- ...
    |        |- statistics
    |        |  |- miss_rate
    |        |  |- capacity
    |        |  |- ...
    |        |- triggers
    |           |- 0 > /cpa0_ldom0_t0.sh
    |           |- 1 > /scripts/killme.sh
    |           |- ...
    |- ...
    |- cpaX

Figure 6. Control Plane Programming Methodology. (The figure also shows the 32-byte CPA register layout, with IDENT, IDENT_HIGH, type, addr, cmd and data fields mapped into a 16-bit control plane address space; see §5.1.)

With the mechanisms above, when a memory-access request attached with a DS-id tag arrives at the memory controller, a sequence of operations occurs: (1) The control plane uses the DS-id to query the parameter table to get the corresponding address mapping mask, priority and row-buffer id. (2) The requested LDom physical address is translated into a DRAM physical address. (3) The original request with its DS-id enters a queue specified by the priority information. (4) The DRAM scheduler (arbiter) chooses a request to serve according to two rules: high-priority first and FR-FCFS [62]. (5) The memory control plane updates the statistics table and checks whether it needs to activate a trigger. When a trigger takes effect, an interrupt signal is sent to the PRM.
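A minimal sketch of steps (1)-(4), assuming a simple base-plus-offset address map per LDom, two fixed-size FIFO queues, and plain FCFS within a queue in place of FR-FCFS (names and the mapping scheme are illustrative assumptions):

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    struct mem_req   { uint16_t dsid; uint64_t addr; bool is_write; };
    /* One row of the memory control plane's parameter table. */
    struct mem_param { uint64_t addr_base; uint8_t priority; uint8_t rowbuf_id; };

    struct mem_cp {
        struct mem_param param[256];
        struct mem_req   high_q[64], low_q[64];   /* two priority queues (bounds checks omitted) */
        int              high_n, low_n;
    };

    /* Steps (1)-(3): look up the parameter table, translate the LDom physical
     * address to a DRAM address, and enqueue the request by priority. */
    static void mem_cp_enqueue(struct mem_cp *cp, struct mem_req req)
    {
        const struct mem_param *p = &cp->param[req.dsid];
        req.addr += p->addr_base;                 /* (2) LDom phys addr -> DRAM addr */
        if (p->priority)
            cp->high_q[cp->high_n++] = req;       /* (3) high-priority queue */
        else
            cp->low_q[cp->low_n++] = req;
    }

    /* Step (4): serve high-priority requests first; within a queue the real
     * controller applies FR-FCFS, approximated here by plain FCFS. */
    static bool mem_cp_dequeue(struct mem_cp *cp, struct mem_req *out)
    {
        if (cp->high_n > 0) {
            *out = cp->high_q[0];
            memmove(cp->high_q, cp->high_q + 1, --cp->high_n * sizeof *cp->high_q);
            return true;
        }
        if (cp->low_n > 0) {
            *out = cp->low_q[0];
            memmove(cp->low_q, cp->low_q + 1, --cp->low_n * sizeof *cp->low_q);
            return true;
        }
        return false;
    }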

5. Software Support

To take advantage of the PARD architecture, we devise a Linux-based firmware that runs on the PRM. This section presents how the firmware interacts with the control planes.

5.1 Control Plane Abstraction and Interface

To provide a uniform interface for firmware users (data center operators) to access various control planes, we leverage Linux's device file mechanism. Figure 6 shows the abstraction of the control planes from a firmware user's perspective. Specifically, each control plane is connected to one of the PRM's control plane adaptors (CPAs), which is abstracted as a file and mounted into the sysfs [55] of the firmware as "/sys/cpa/cpa[0-9][0-9]*". Thus firmware users can use either bash commands (e.g., cat or echo) or system calls to access these files.

Each control plane file is organized as a subtree, which contains some general information such as ident (e.g., cpa0's "CACHE_CP" in Figure 6) and type, indicating which type of control plane it is: cache ("C"), memory ("M") or I/O bridge ("B"). Besides, a control plane file further includes several subtrees, each of which represents an LDom with a specific DS-id. When a new LDom is created, the firmware creates a new subtree with root "ldom[0-9][0-9]*" under the node ldoms, where the number represents a DS-id. Take the ldom0 subtree as an example: there are three small trees rooted at parameters, statistics and triggers, respectively, which are mapped to several rows of the three tables of the LLC control plane.

For the underlying implementation of the control plane tree, the PRM reserves a 64KB (16-bit) I/O address space for mapping the control plane adaptors (CPAs); each CPA occupies 32 bytes.


CPU: 4 4-issue out-of-order X86 cores, 2GHz
L1-I/core: 64KB 2-way, hit = 2 cycles
L1-D/core: 64KB 2-way, hit = 2 cycles
Shared LLC: 4MB 16-way, hit = 20 cycles
DRAM: 8GB DDR3-1600 11-11-11; 4Gbit chips (Micron MT41J512M8); 1 channel, 2 ranks/channel, 8 banks/rank; burst length = 8, row buffer = 1KB; tCK = 1.25ns, tRCD = 13.75ns, tCL = 13.75ns, tRP = 13.75ns, tRAS = 35ns, tRRD = 6ns
Disks: 4-channel IDE controller, 8 disks
Platform Resource Manager (PRM): 1 X86 core, 100MHz; 16MB DRAM, 32MB Flash storage; 4 control plane adaptors (CPAs); 1 Ethernet adaptor; firmware: tailored Linux kernel 2.6.28.4 with Busybox [3]
Server OS: Gentoo Linux with kernel 2.6.28.4
Workloads: Memcached [8], SPEC CPU2006 [26]; microbenchmarks: CacheFlush & DiskCopy

Table 2. Simulation Parameters

The layout of a CPA, shown in the top right of Figure 6, is as follows: two fields, IDENT and IDENT_HIGH (12 bytes in total), represent the ident node of the tree; the 32-bit type field indicates the type of a control plane; the 32-bit addr field contains a 16-bit DS-id, a 14-bit offset designating the column of a table, and 2 bits for table selection; the 32-bit cmd field specifies either a READ operation or a WRITE operation on the selected table; finally, the 64-bit data field depends on the cmd field. The content of the data field is filled by the user if cmd is a WRITE operation; otherwise it is filled by the control plane and read by the user.

To program a control plane, a driver first writes data into the addr field to select a table cell by a DS-id (row) and an offset (column). If a user intends to modify the table, the driver fills the data field and then writes the WRITE command into the cmd field. For reading data from the control plane, after the addr field is initialized, the driver writes the READ command into the cmd field and then reads the data field.
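A minimal C sketch of this 32-byte CPA register block and the write/read procedure, assuming memory-mapped access; the bit placement within the addr field and the command encodings are assumptions, not the documented layout.

    #include <stdint.h>

    /* 32-byte control plane adaptor (CPA) register block, as described above. */
    struct cpa_regs {
        uint64_t ident;       /* IDENT: low part of the ident string              */
        uint32_t ident_high;  /* IDENT_HIGH: remaining ident bytes (12 B total)   */
        uint32_t type;        /* control plane type: 'C', 'M', 'B', ...           */
        uint32_t addr;        /* 16-bit DS-id | 14-bit offset | 2-bit table select
                                 (bit placement assumed)                          */
        uint32_t cmd;         /* READ or WRITE (encodings assumed)                */
        uint64_t data;        /* payload for WRITE, result for READ               */
    };

    #define CPA_CMD_READ   0u
    #define CPA_CMD_WRITE  1u

    static uint32_t cpa_addr(uint16_t dsid, uint16_t offset, uint8_t table_sel)
    {
        return ((uint32_t)dsid << 16) | ((uint32_t)(offset & 0x3FFF) << 2) | (table_sel & 0x3);
    }

    /* Write one table cell: select the cell, fill data, then issue WRITE. */
    static void cpa_write(volatile struct cpa_regs *cpa, uint16_t dsid,
                          uint16_t offset, uint8_t table_sel, uint64_t value)
    {
        cpa->addr = cpa_addr(dsid, offset, table_sel);
        cpa->data = value;
        cpa->cmd  = CPA_CMD_WRITE;
    }

    /* Read one table cell: select the cell, issue READ, then fetch data. */
    static uint64_t cpa_read(volatile struct cpa_regs *cpa, uint16_t dsid,
                             uint16_t offset, uint8_t table_sel)
    {
        cpa->addr = cpa_addr(dsid, offset, table_sel);
        cpa->cmd  = CPA_CMD_READ;
        return cpa->data;
    }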

5.2 Programming Methodology

As mentioned in §3, we adopt a "trigger⇒action" programming methodology. Data center operators predefine a set of "trigger⇒action" rules for different priorities. Usually a trigger is based on performance metrics, e.g., "MissRate>30%".

Example 1 in Figure 6 demonstrates how to install a "trigger⇒action" rule. A data center operator uses the pardtrigger command to program the trigger condition "MissRate>30%" for "ldom=0" (i.e., DS-id=0) into the trigger table of the cache control plane (cpa0). The parameter "action=0" guides the command to create a leaf node "0" under ".../cpa0/ldoms/ldom0/triggers". Then the operator calls "echo ..." to install the action script "/cpa0_ldom0_t0.sh" shown in Example 2.

There are two programming approaches based on the device file tree. One is invoking system calls (open/read/write, etc.) to open and manipulate a CPA file. For example, the pardtrigger command is written in C and invokes syscalls to install a trigger into the cache control plane. The advantage of this approach is speed. A more convenient approach is leveraging bash commands to directly access these CPA files, as shown in Example 2.
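For illustration, a hedged sketch of the syscall-based approach, performing the same trigger-handler installation as the echo in Example 1 (the sysfs path and action script are taken from Figure 6; error handling is minimal):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Install an action script as trigger 0 of ldom0 on the cache control plane,
     * equivalent to the `echo` in Example 1 but via open/write system calls. */
    int main(void)
    {
        const char *path   = "/sys/cpa/cpa0/ldoms/ldom0/triggers/0";
        const char *action = "/cpa0_ldom0_t0.sh";

        int fd = open(path, O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, action, strlen(action)) < 0) { perror("write"); close(fd); return 1; }
        close(fd);
        return 0;
    }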

Parameter table: Cache: way mask-bits. Memory: row-buffer mask-bits, scheduling priority, address mapping. Disk: bandwidth.
Statistics table: Cache: miss rate, capacity. Memory: bandwidth, latency. Disk: bandwidth.
Trigger table: LLC miss rate ⇒ way mask-bits. Memory latency ⇒ row-buffer mask-bits. Memory latency ⇒ scheduling priority.

Table 3. Control Plane Tables

6. Implementation

We implemented the PARD architecture and the PRM firmware in PARD-gem5⁵, a GEM5 [17] based full-system simulator. Moreover, we have been implementing PARD on a Xilinx VC709 FPGA development board.

Full-system simulation. As shown in Table 2, PARD-gem5 is a cycle-accurate simulator that is able to simulate out-of-order (OoO) X86 systems. In particular, we added tag registers to the CPU cores and implemented the LLC, memory controller, I/O bridge and IDE control planes. Table 3 lists the major contents of the three tables of the LLC, memory and IDE control planes. We also implemented the PRM, which is based on an X86 processor running at 100MHz with 16MB DRAM, 32MB Flash, one Ethernet adaptor and four control plane adaptors (CPAs). The firmware running on the PRM is based on a tailored Linux kernel 2.6.28.4 with the Busybox [3] toolkit.

With these efforts, PARD-gem5 is able to simulate a PARD server that can be partitioned into multiple LDoms. Each LDom can run unmodified Gentoo Linux with kernel 2.6.28.4 and memcached 1.4.17 [8] from CloudSuite [24], etc.

Preliminary RTL implementation. Our RTL implementation is based on the OpenSPARC T1 processor [54]. The original OpenSPARC T1 includes eight 4-thread cores (32 threads in total), a 3MB LLC, a crossbar, four memory controllers, etc. All of these modules make the processor too large to fit into the FPGA development board. Therefore, we first tailored the OpenSPARC T1 down to a small processor with only two cores, a crossbar, a 768KB LLC and one memory controller, which is able to boot unmodified Linux. Then we added a tag register into each core and propagated its value to the LLC, crossbar and memory controller. Finally, we implemented the LLC and memory control planes, which are synthesizable.

Currently the LLC control plane is able to distinguish DS-id tags and perform way-based LLC partitioning according to different DS-id tags. The memory control plane can deliver tagged memory requests into different priority queues in accordance with the requests' DS-ids. However, we are still working on the PRM, the programming mechanism, the integration of multiple components, and software porting. The FPGA prototype aims at fully hardware-supported virtualization, performance isolation, differentiated QoS, and flexible programmability. This goal is still in progress.

Fortunately, the current preliminary RTL implementation is sufficient for analyzing the latency and logic resources of the control planes, because several major mechanisms and components have been implemented and verified, such as the tagging mechanism and the LLC and memory control planes.

5 PARD-gem5 is open-sourced and available at https://github.com/fsg-ict/PARD-gem5.


Figure 7. Dynamically Partitioning a PARD Server into Four LDoms and Launching Three LDoms in Turn. (The figure plots the occupied last level cache capacity (MB) and memory bandwidth (GB/s) of LDom0, LDom1 and LDom2 over simulated time (ms); it marks OS boot, bash ready, the start of 437.leslie3d, 470.lbm and CacheFlush, the time T_CacheFlush, and the three way-mask commands issued from the firmware:)

    echo 0xFF00 > /sys/cpa/cpa0/ldom0/parameters/waymask
    echo 0x00FF > /sys/cpa/cpa0/ldom1/parameters/waymask
    echo 0x00FF > /sys/cpa/cpa0/ldom2/parameters/waymask

7. Evaluation

This section describes the experiments we conducted on both the simulation and FPGA platforms. The goal is to verify the new functionalities enabled by the PARD architecture and to measure the overhead of the current PARD control plane design.

For experimental methodology, we leveraged GEM5's SimpleTiming mode to boot Linux, launch and warm up workloads, and make checkpoints, and then switched to cycle-accurate out-of-order (OoO) simulations. Simulation parameters are listed in Table 2. Because full-system cycle-accurate OoO simulation is very slow, we ran the simulator for about 30 hours to simulate three seconds of memcached execution, of which the first second is used for warming up the microarchitecture and the latter two seconds are used for evaluation.

7.1 Functionality

7.1.1 Fully Hardware-Supported Virtualization

This experiment demonstrates the effectiveness of the fully hardware-supported virtualization enabled by PARD. We partitioned the four-core PARD server into four LDoms (LDom0 to LDom3) that equally share hardware resources. We launched only three LDoms, and each LDom ran Linux and a user application. In particular, LDom0 and LDom1 ran two SPEC CPU2006 workloads (437.leslie3d and 470.lbm) and LDom2 ran the CacheFlush microbenchmark.

Figure 7 displays the change in memory bandwidth and occupied LLC capacity of the three LDoms over about two seconds. These usage data are collected by the LLC and memory control planes. For example, the occupied LLC capacity is stored in the statistics table of the LLC control plane⁶. To obtain these statistics, we implemented a tool running on the firmware that periodically reads data from the two control planes.

Assume that LDom0 has high priority. According to Figure 7, at time T_CacheFlush, when LDom2 runs CacheFlush, the occupied LLC capacity of LDom0 substantially decreases (see the top right of Figure 7). This situation mimics a software-based virtualization environment where multiple VMs run together on a physical machine. Although a hypervisor can guarantee resource isolation for these VMs, it is difficult to ensure performance isolation. Thus, these VMs may contend for hardware resources such as LLC capacity, as shown in the figure.

6 Given an LDom, its occupied LLC capacity is calculated by counting the number of blocks with its DS-id in the LLC tag array. Specifically, the count increases by 1 when a block with the DS-id is filled into the LLC and decreases by 1 when a block with the DS-id is evicted.

In this experiment, in order to guarantee reasonable LLC capacity for LDom0, we manually ran the three echo commands shown in the figure to adjust LLC capacity. Since the LLC of the simulated server is 16-way, the way mask bits "0xFF00" indicate that the LLC control plane allocates eight ways to LDom0, and the mask bits "0x00FF" mean that LDom1 and LDom2 share the other eight ways. Consequently, the percentage of LDom0's LLC capacity increases to 50% while the percentage of LDom2 drops sharply.

7.1.2 "Trigger⇒Action" Mechanism Verification

We partitioned the server into four LDoms and ran latency-critical memcached in LDom0. Because the simulation speed is too slow to simulate network environments, we ran both the memcached client and server in one LDom, where they share a CPU core. Although the client and server may contend for resources within the LDom, this contention inside a black box is acceptable since we focus on the interference between LDoms. The other three LDoms all ran the STREAM microbenchmark.

Figure 8 illustrates the change in memcached's tail response time over different request loads. When memcached runs in solo mode (only LDom0 is active), it can serve 22.5K requests per second (RPS) with a reasonable 95th-percentile response time (0.6ms). However, only one CPU core is fully busy, so the total server CPU utilization is 25%. If the other three LDoms are launched to share resources with LDom0, all four CPU cores are fully busy and the server exhibits 100% CPU utilization. Unfortunately, the throughput of memcached drops to 15K RPS while the 95th-percentile response time already exceeds 1ms. When the throughput increases to 20K RPS, the 95th-percentile response time dramatically increases by two orders of magnitude (62.6ms).

To verify the effectiveness of the "trigger⇒action" mechanism, as shown in Figure 6, we predefined a trigger for LDom0, i.e., "LLC.MissRate > 30% ⇒ increase LLC capacity up to 50%".

Next, we installed the LLC trigger into the LLC control plane before launching LDom0 and then launched all four LDoms. Figure 9 shows how LDom0's LLC miss rate varies when the trigger takes effect. After half of the LLC capacity is dedicated to LDom0, memcached's LLC miss rate quickly decreases to only about 10%, just slightly higher than the solo-mode miss rate (7%). As a result, the throughput increases to 22.5K RPS with a reasonable 95th-percentile response time (1.2ms).


Figure 8. Change of Memcached's Tail Response Time (95-percentile). [Plot: response time (ms) vs. kilo requests per second (KRPS); series: solo, shared, w/ LLC trigger.]

Figure 9. Change of Memcached's LLC Miss Rate (20 KRPS). [Plot: LLC miss rate (%) vs. simulated time (ms) for the four-LDom run with cache way-partitioning; annotations: memcached only, 3*STREAM startup, LLC miss-rate trigger fires.]

Figure 10. Disk I/O Performance Isolation. [Plot: the two LDoms' I/O bandwidth shares (%) vs. simulated time (ms); series: LDom0, LDom1; annotation: echo 80 > /sys/cpa/cpa3/ldom0/parameters/bandwidth.]

Figure 11. CDF of Queueing Delay of Memory Requests. [Plot: CDF vs. delay cycles; series: w/o control plane, high priority w/ control plane, low priority w/ control plane.]

Compared with the solo mode, this tail response time is slightly longer because LDom0 uses only half of the LLC capacity. Overall, these experimental results show that even when the total server CPU utilization is 100% (all four cores are fully busy), memcached's behavior can still approach that of the solo mode.

This experiment demonstrates that the “trigger⇒action” mechanism is able to take advantage of PARD's programmability and helps balance the trade-off between resource utilization and applications' QoS.

7.1.3 Disk I/O Performance Isolation

PARD is able to manage not only capacity resources (e.g., LLC) but also bandwidth resources such as memory and I/O bandwidth. In this experiment, two LDoms are active and each LDom performs disk writes through this command: “dd if=/dev/zero of=/dev/sdb bs=32M count=16”.

Initially, the two LDoms share the IDE controller and thus exhibit the same disk I/O bandwidth. Assuming that the user of LDom0 pays more in order to get better I/O performance, we manually ran “echo 80 > /sys/cpa/cpa3/ldom0/parameters/bandwidth” to program the IDE control plane to increase LDom0's disk I/O bandwidth quota to 80%. Figure 10 illustrates the effect of this disk I/O bandwidth reallocation.

Although there are other approaches to managing I/O bandwidth, such as cgroups [4], PARD provides a more efficient, fine-grained I/O management scheme that requires no modifications to the OS kernel or applications, reducing the complexity of the software stack.

7.2 Overhead

There are two major concerns about adding control planes to the controllers of hardware components: (1) How much extra latency is introduced? (2) How many hardware resources are required? To address these two concerns, we performed an evaluation based on the preliminary RTL implementation on an FPGA development board, as described in §6.

Figure 12. FPGA resource usage of the LLC and memory control planes. For reference, the original memory controller and 768KB 12-way LLC (w/ tag array only) consume 15178 and 75032 LUT/FF respectively. [Chart: logic LUT, LUTRAM, and FF counts for the statistics, parameter, and trigger tables at table sizes of 64/128/256 entries (16/32/64 for the trigger tables).]


Latency. We find that the LLC control plane does not introduce extra latency. According to Figure 4, the processing logic of the LLC control plane can be hidden in the pipeline of the LLC controller; in fact, pipelining is a typical design for the LLCs of modern CPUs. For example, the L2 cache of the OpenSPARC T1 has eight pipeline stages. Furthermore, synthesis data verify that the logic of the LLC control plane is not on critical paths.

For the memory control plane, we used the RTL implementation to evaluate the queueing latency of memory-access requests, which depends on the request injection rate (i.e., bandwidth utilization). Figure 11 shows a representative case, the distribution of queueing delay at an injection rate of 0.44, where half of the requests are high priority and half are low priority. According to the figure, the average queueing delay of the original memory controller is 15.2 memory cycles. When we enabled the memory control plane, the average queueing delay of high-priority requests significantly decreases to 2.7 cycles (a 5.6X reduction) while the queueing delay of low-priority requests increases to 20.3 cycles (by 33.6%).

FPGA Resources. The FPGA resources required by a control plane mainly depend on the number of entries in its three tables.

Figure 12 shows the synthesis data reported by the Xilinx Vivado Design Suite [12]. Specifically, the 256-entry parameter and statistics tables of the memory control plane require 220 LUT for logic and 688 LUTRAM for storage. The 64-entry trigger table consumes more logic resources (582 LUT + 387 FF) than storage (40 LUTRAM) because it requires many comparators to implement triggers. Two 16-deep priority queues require only 324 LUT + 30 FF. The total required FPGA resources (1526 LUT/FF) account for 10.1% of the original Xilinx MIGv7 memory controller (15178 LUT/FF).



The LLC control plane exhibits similar resource consumption. The 256-entry parameter and statistics tables and the 64-entry trigger table require 2359 LUT/FF, introducing only 3.1% extra FPGA resources compared with the original LLC controller (not counting the data array). Additionally, storing the Owner DS-id in the tag array adds 50% more blockRAMs (from 12 to 18), because the original tag array stores 28 bits per cache block and the DS-id in our RTL implementation is 8 bits. In fact, the tag array consumes very little FPGA resources compared with the whole cache, so these extra six blockRAMs are negligible.

8. Discussion

In this section, we discuss the following issues:

Physical Implementation Cost. PARD's physical implementation is similar to design-for-test (DFT) logic and can even be built on existing DFT logic or added as extra logic, so the hardware cost ranges from incremental to modest. As a concrete example, it is common practice to use fixed-width physical registers from the libraries for logic registers of various widths. For instance, in OpenSPARC T1's RTL, the packets of the L2 cache are 128/144 bits wide, but the queues used to store them are composed of customized 160-bit physical register files. The unused 32/16 bits can be reused for PARD's tagging mechanism.

For the centralized platform resource manager (PRM), the implementation can be based on prevalent IPMI components. In fact, some IPMI components already include embedded processors and operating systems. For instance, the IPMI component in one of our experimental servers consists of an ARM926EJ-S processor and 128MB of memory, running an embedded Linux with kernel 2.6.27.

Functionality Extension. Apart from supporting differentiated QoS, it is also possible to extend PARD to support other differentiated services such as encryption, compression, and security checking. Take compression as an example: IBM's Memory eXpansion Technology (MXT) [70] integrates a compression engine into a memory controller. If a PARD server included an MXT engine, the engine could be programmed to compress memory-access packets only for designated DS-id sets. However, there are still many open problems in how to leverage PARD to explore new functionalities.
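
Purely as an illustration of how such an extension could be exposed, and assuming the MXT engine were wrapped in a control plane under the same device file tree (the control-plane index cpa4 and the compress parameter below are hypothetical, not part of the current PARD prototype):

    # Sketch only: an imagined compression control plane wrapping an MXT engine.
    echo 1 > /sys/cpa/cpa4/ldom2/parameters/compress   # compress packets carrying LDom2's DS-id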

Control Plane Design Optimization. Current control planes are based on a table structure, which is simple to design and implement but relatively inflexible. An alternative is to embed a customized microprocessor into the control planes. This is feasible: recent research demonstrates that a microprocessor with a specific instruction set architecture (ISA) can be integrated into high-speed DDRx memory controllers [18]. Although that ISA currently handles only simple scheduling, it may be augmented to support more management policies.

9. Related Work

Table 1 illustrates prior work on eliminating contention at various layers. Companies like Google and Twitter now adopt advanced data center management frameworks (e.g., Borg, Omega [65], and Mesos [28]) to balance the trade-offs between resource utilization and applications' QoS. There are also a number of distributed techniques and OS kernel optimizations, such as multi-level priorities [25], scheduling [21, 22, 39, 46, 52], backup requests [19], cgroups [4], and resource containers [7].

Instead of presenting related work for all layers, in this section we focus on related work within a server.

Software-based Partitioning. Researchers have proposed leveraging page coloring to partition the cache and DRAM. The key idea of page coloring is to leverage address-mapping information to manage hardware resources. Lin et al. [42] and Tam et al. [68] propose page-coloring-based cache partitioning to address the cache contention problem. Liu et al. [43] present a practical page-coloring-based DRAM bank partitioning approach to eliminate memory bank-level interference and further propose an approach to cooperatively partition both cache and DRAM [44].

Although these software techniques can effectively partition the shared cache and DRAM banks, there are two major concerns. First, they require reorganizing free-page lists and migrating pages when workload mixtures change, and this overhead is not negligible. Although some work designs different kernel buddy systems for different workload mixtures, this approach requires significant kernel-hacking effort and has limited usage scenarios in production data centers. Second, contemporary processors adopt hashing algorithms to map physical addresses to cache indices (e.g., Intel's Sandy Bridge). These hash algorithms are usually unknown, so it is difficult to apply the page-coloring approach to such systems.

In contrast to page coloring, PARD's control planes can expose more underlying information (such as per-DS-id cache miss rate and memory bandwidth) to the firmware, so operators can deploy better resource management policies with this information.

Hardware-based techniques. Kasture and Sanchez propose Ubik [36], a cache partitioning policy that characterizes and leverages the transient behavior of latency-critical applications to maintain their target tail latency. Vantage [64] implements fine-grained cache partitioning using the statistical properties of zcaches [63]. Utility-based cache partitioning (UCP) [56] strictly partitions the shared cache depending on the benefit of allocating different numbers of ways to each application. Muralidhara et al. propose application-aware memory channel partitioning (MCP) [49] to reduce memory system interference. However, these works usually focus on only one type of resource, while PARD is able to simultaneously manage all shared hardware resources within a server.

Similar to PARD, NoHype [37] removes the virtualization layer and makes use of hardware virtualization extensions to partition a server into multiple submachines. However, this partitioning is static. In contrast, PARD allows operators to dynamically partition a physical server into multiple LDoms. Furthermore, PARD supports the “trigger⇒action” mechanism to deploy resourcing-on-demand resource management policies.

Architectural support for QoS. Rafique et al. [58] propose an OS-driven hardware cache partitioning mechanism that tags cache requests and allows the OS to adjust cache quotas according to the tags. Sharifi et al. [66] further propose a feedback-based control architecture for end-to-end on-chip resource management. Iyer et al. make substantial contributions to architectural support for QoS [27, 31, 32, 40, 41]. The closest work to PARD is the class-of-service based QoS architecture (CoQoS) [40, 41], which assigns a priority tag to each on-chip request and allows the cache/DRAM/NoC to schedule requests according to the priority tag.

PARD differs from these prior proposals in the following aspects: (1) Prior work primarily focuses on on-chip resources such as the cache, NoC, and memory bandwidth (managed by the on-chip memory controller), while PARD is able to manage not only on-chip resources but also I/O resources. (2) We design programmable control planes for PARD and a uniform programming interface, while prior work does not support programmability. (3) PARD includes a centralized platform resource manager and a Linux-based firmware to facilitate operators' management. (4) PARD supports not only QoS but also NoHype-like virtualization that can partition one server into multiple LDoms. (5) PARD allows users to install “trigger⇒action” rules, while prior work lacks this useful mechanism.


10. Conclusion and Open Problems

This paper presented a study of building DiffServ in computer servers via architectural support. We proposed the programmable architecture for resourcing-on-demand (PARD) and developed a Linux-based firmware to facilitate programming PARD. We implemented PARD on both a full-system simulator and an FPGA development board. Our experiments demonstrated that PARD is able to address the trade-offs between high utilization and high QoS in datacenter environments.

We believe that architectural support for DiffServ in computers is a promising trend as the number of cores keeps increasing. PARD makes a case for this direction and provides new interfaces for users to interact with the hardware. However, there are still many open issues, such as:

• how to translate applications' QoS requirements into efficient “trigger⇒action” rules?
• how to leverage compilers to automatically generate trigger rules?
• how to make an OS run directly on a PARD server to support process-level DiffServ?
• how to support fine-grained DiffServ not only at the process level but also at the thread level or even the C++/Java object level?
• how to support nested DiffServ, i.e., guarantee the QoS of a process within an LDom?
• how to deal with simultaneous multithreading (SMT)?
• how to deal with multiprocessor issues such as cache coherency?
• how to extend DiffServ to accelerators, e.g., enabling an encryption engine to encrypt/decrypt LDoms with specific DS-ids?
• how to integrate PARD and SDN so that DS-ids can be propagated data-center wide?
• how to design and deploy security policies on PARD servers?
• how to develop firmware applications that take advantage of PARD?

To facilitate further optimization and exploration, we release the PARD-gem5 simulator, which is available at https://github.com/fsg-ict/PARD-gem5.

Acknowledgments

The authors would like to thank Kai Li and Babak Falsafi for their insights into the idea and the system. We give special thanks to Sally McKee for her detailed feedback on the paper. We thank our group members, Rui Hou, Xin Jin, Wenjie Li, Pin Li, Yupeng Qu, Rui Ren, Cong Wang, Cheng Yang, Xusheng Zhan, and Jianbo Zhu, for their help on the work. We also thank David Brooks, Mingyu Chen, Fred Chong, Mark Hill, Xiaozhou Li, Feng Liu, Jason Mars, Calton Pu, Lingjia Tang, Yida Wang, Jin Xin (Princeton University), Zhiwei Xu, Jianfeng Zhan, Wei Zhang, Xiaodong Zhang, the anonymous reviewers, and our shepherd David Meisner for their valuable comments. This work was supported by the Huawei-ICT Joint Research Lab, the National Natural Science Foundation of China (NSFC) under grants No. 61420106013, 61221062, and 61202062, and the 973 Program under grant No. 2011CB302500. Yungang Bao is supported by the CCF-Intel Young Faculty Researcher Program, and Xiufeng Sui is supported by a 2014 IBM SUR Cycle 2 grant.

References

[1] Gartner says efficient data center design can lead to 300 percent capacity growth in 60 percent less space. http://www.gartner.com/newsroom/id/1472714.
[2] Software-Defined Networking. https://www.opennetworking.org/sdn-resources/sdn-definition/.
[3] BusyBox. http://www.busybox.net/.
[4] Cgroups. http://en.wikipedia.org/wiki/Cgroups.
[5] Intel 64 and IA-32 Architectures Software Developer Manuals, volume 3: System Programming Guide.
[6] Intelligent Platform Management Interface (IPMI). http://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface.
[7] Linux Container (LXC). http://lxc.sourceforge.net/.
[8] Memcached. http://memcached.org/.
[9] Intel 82599 10 Gigabit Ethernet Controller: Datasheet. http://www.intel.com/content/www/us/en/ethernet-controllers/82599-10-gbe-controller-datasheet.html.
[10] OpenFlow switch specification. https://www.opennetworking.org/sdn-resources/openflow/.
[11] Xilinx Virtex-7 FPGA VC709 Connectivity Kit. http://www.xilinx.com/products/boards-and-kits/EK-V7-VC709-CES-G.htm.
[12] Vivado Design Suite. http://www.xilinx.com/products/design-tools/vivado/.
[13] Computing Community Consortium (CCC). 21st century computer architecture. A community white paper, 2012. URL http://cra.org/ccc/docs/init/21stcenturyarchitecturewhitepaper.pdf.
[14] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data center TCP (DCTCP). In Proceedings of the ACM SIGCOMM 2010 Conference, SIGCOMM '10, New York, NY, USA, 2010.
[15] L. A. Barroso, J. Clidaras, and U. Holzle. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis Lectures on Computer Architecture, 8(3):1–154, 2013.
[16] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schupbach, and A. Singhania. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 29–44, New York, NY, USA, 2009.
[17] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, Aug. 2011.
[18] M. N. Bojnordi and E. Ipek. PARDIS: A programmable memory controller for the DDRx interfacing standards. In Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA '12, pages 13–24, Washington, DC, USA, 2012.
[19] J. Dean and L. A. Barroso. The tail at scale. Commun. ACM, 56(2):74–80, Feb. 2013.
[20] C. Delimitrou and C. Kozyrakis. iBench: Quantifying interference for datacenter applications. In Proceedings of the IEEE International Symposium on Workload Characterization, pages 23–33, 2013.
[21] C. Delimitrou and C. Kozyrakis. Paragon: QoS-aware scheduling for heterogeneous datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 77–88, 2013.
[22] C. Delimitrou and C. Kozyrakis. Quasar: Resource-efficient and QoS-aware cluster management. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, pages 127–144, New York, NY, USA, 2014.
[23] X. L. Dong, B. Saha, and D. Srivastava. Less is more: Selecting sources wisely for integration. In Proceedings of the 39th International Conference on Very Large Data Bases, PVLDB '13, Trento, Italy, 2013.
[24] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the clouds: A study of emerging scale-out workloads on modern hardware. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '12, pages 37–48, New York, NY, USA, 2012.

[25] Google. Google cluster workload traces. http://code.google.com/p/googleclusterdata.
[26] J. L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34(4):1–17, Sept. 2006.
[27] A. Herdrich, R. Illikkal, R. Iyer, D. Newell, V. Chadha, and J. Moses. Rate-based QoS techniques for cache/memory in CMP platforms. In Proceedings of the 23rd International Conference on Supercomputing, pages 479–488, 2009.
[28] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI '11, pages 22–22, Berkeley, CA, USA, 2011.
[29] C.-Y. Hong, M. Caesar, and P. B. Godfrey. Finishing flows quickly with preemptive scheduling. SIGCOMM Comput. Commun. Rev., 42(4), Aug. 2012.
[30] Intel. An Introduction to the Intel QuickPath Interconnect. Jan. 2009.
[31] R. Iyer. CQoS: A framework for enabling QoS in shared caches of CMP platforms. In Proceedings of the 18th Annual International Conference on Supercomputing, pages 257–266, 2004.
[32] R. Iyer, L. Zhao, F. Guo, R. Illikkal, S. Makineni, D. Newell, Y. Solihin, L. Hsu, and S. Reinhardt. QoS policies and architecture for cache/memory in CMP platforms. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '07, pages 25–36, New York, NY, USA, 2007.
[33] V. Jeyakumar, M. Alizadeh, D. Mazières, B. Prabhakar, C. Kim, and A. Greenberg. EyeQ: Practical network performance isolation at the edge. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, NSDI '13, Berkeley, CA, USA, 2013.
[34] J. M. Kaplan, W. Forrest, and N. Kindler. Revolutionizing data center energy efficiency. Technical report, McKinsey & Company, 2008.
[35] R. Kapoor, G. Porter, M. Tewari, G. M. Voelker, and A. Vahdat. Chronos: Predictable low latency for data center applications. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC '12, pages 9:1–9:14, New York, NY, USA, 2012.
[36] H. Kasture and D. Sanchez. Ubik: Efficient cache sharing with strict QoS for latency-critical workloads. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, pages 729–742, New York, NY, USA, 2014.
[37] E. Keller, J. Szefer, J. Rexford, and R. B. Lee. NoHype: Virtualized cloud infrastructure without the virtualization. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pages 350–361, New York, NY, USA, 2010.
[38] C. Kozyrakis. Resource efficient computing for warehouse-scale datacenters. In Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1351–1356, Mar. 2013.
[39] J. Leverich and C. Kozyrakis. Reconciling high server utilization and sub-millisecond quality-of-service. In Proceedings of the 2014 EuroSys Conference, Amsterdam, Netherlands, 2014.
[40] B. Li, L. Zhao, R. Iyer, L. S. Peh, M. Leddige, M. Espig, S. E. Lee, and D. Newell. CoQoS: Coordinating QoS-aware shared resources in NoC-based SoCs. Journal of Parallel and Distributed Computing, 71(5):700–713, 2011.
[41] B. Li, L. S. Peh, L. Zhao, and R. Iyer. Dynamic QoS management for chip multiprocessors. ACM Trans. Archit. Code Optim., 9(3):17:1–17:29, Oct. 2012.
[42] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In IEEE 14th International Symposium on High Performance Computer Architecture (HPCA 2008), pages 367–378, Feb. 2008.
[43] L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and C. Wu. A software memory partition approach for eliminating bank-level interference in multicore systems. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, pages 367–376, 2012.
[44] L. Liu, Y. Li, Z. Cui, Y. Bao, M. Chen, and C. Wu. Going vertical in memory management: Handling multiplicity by multi-policy. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 169–180, June 2014.
[45] J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa. Contention aware execution: Online contention detection and response. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '10, pages 257–265, New York, NY, USA, 2010.
[46] J. Mars, L. Tang, and R. Hundt. Heterogeneity in "homogeneous" warehouse-scale computers: A performance opportunity. Computer Architecture Letters, 10(2):29–32, 2011.
[47] J. Mars, L. Tang, and M. L. Soffa. Directly characterizing cross core interference through contention synthesis. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, HiPEAC '11, pages 167–176, New York, NY, USA, 2011.
[48] M. Mesnier, F. Chen, T. Luo, and J. B. Akers. Differentiated storage services. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP '11, pages 57–70, New York, NY, USA, 2011.
[49] S. P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and T. Moscibroda. Reducing memory interference in multicore systems via application-aware memory channel partitioning. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, pages 374–385, 2011.
[50] O. Mutlu and T. Moscibroda. Stall-time fair memory access scheduling for chip multiprocessors. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 146–160, 2007.
[51] R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab, D. Stafford, T. Tung, and V. Venkataramani. Scaling Memcache at Facebook. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pages 385–398, Lombard, IL, 2013.
[52] D. Novakovic, N. Vasic, S. Novakovic, D. Kostic, and R. Bianchini. DeepDive: Transparently identifying and managing performance interference in virtualized environments. In the 2013 USENIX Annual Technical Conference (USENIX ATC 13), pages 219–230, San Jose, CA, 2013.
[53] Oracle. Oracle VM Server for SPARC (Logical Domains). http://www.oracle.com/technetwork/systems/logical-domains/index.html.
[54] Oracle. OpenSPARC T1 microprocessor. http://www.oracle.com/technetwork/systems/opensparc/index.html.
[55] P. Mochel. The sysfs filesystem. In Linux Symposium, 2005.
[56] M. K. Qureshi and Y. N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 423–432, 2006.
[57] S. Radhakrishnan, Y. Geng, V. Jeyakumar, A. Kabbani, G. Porter, and A. Vahdat. SENIC: Scalable NIC for end-host rate limiting. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation, NSDI '14, pages 475–488, Berkeley, CA, USA, 2014.
[58] N. Rafique, W.-T. Lim, and M. Thottethodi. Architectural support for operating system-driven CMP cache management. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, PACT '06, pages 2–12, New York, NY, USA, 2006.


[59] C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC '12, pages 7:1–7:13, New York, NY, USA, 2012.
[60] RFC 2474. Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers. http://tools.ietf.org/html/rfc2474.
[61] RFC 2475. An Architecture for Differentiated Services. http://tools.ietf.org/html/rfc2475.
[62] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory access scheduling. In Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA '00, pages 128–138, New York, NY, USA, 2000.
[63] D. Sanchez and C. Kozyrakis. The ZCache: Decoupling ways and associativity. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '43, pages 187–198, Washington, DC, USA, 2010.
[64] D. Sanchez and C. Kozyrakis. Vantage: Scalable and efficient fine-grain cache partitioning. In ACM SIGARCH Computer Architecture News, volume 39, pages 57–68, 2011.
[65] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 351–364, New York, NY, USA, 2013.
[66] A. Sharifi, S. Srikantaiah, A. K. Mishra, M. Kandemir, and C. R. Das. METE: Meeting end-to-end QoS in multicores through system-wide resource management. In Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pages 13–24, 2011.
[67] A. Shieh, S. Kandula, A. Greenberg, C. Kim, and B. Saha. Sharing the data center network. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI '11, Berkeley, CA, USA, 2011.
[68] D. Tam, R. Azimi, L. Soares, and M. Stumm. Managing shared L2 caches on multicore systems in software. In Proc. of the Workshop on the Interaction between Operating Systems and Computer Architecture (WIOSCA), 2007.
[69] L. Tang, J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa. The impact of memory subsystem resource sharing on datacenter applications. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, pages 283–294, New York, NY, USA, 2011.
[70] R. B. Tremaine, P. Franaszek, J. Robinson, C. Schulz, T. Smith, M. Wazlowski, and P. M. Bland. IBM memory expansion technology (MXT). IBM Journal of Research and Development, 45(2):271–285, Mar. 2001.
[71] B. Vamanan, J. Hasan, and T. Vijaykumar. Deadline-aware datacenter TCP (D2TCP). In Proceedings of the ACM SIGCOMM 2012 Conference, SIGCOMM '12, New York, NY, USA, 2012.
[72] G. Wang and T. S. E. Ng. The impact of virtualization on network performance of Amazon EC2 data center. In Proceedings of the 29th Conference on Information Communications, INFOCOM '10, Piscataway, NJ, USA, 2010.
[73] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowtron. Better never than late: Meeting deadlines in datacenter networks. In Proceedings of the ACM SIGCOMM 2011 Conference, SIGCOMM '11, New York, NY, USA, 2011.
[74] Y. Xu, M. Bailey, B. Noble, and F. Jahanian. Small is better: Avoiding latency traps in virtualized data centers. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC '13, pages 7:1–7:16, New York, NY, USA, 2013.
[75] Y. Xu, Z. Musgrave, B. Noble, and M. Bailey. Bobtail: Avoiding long tails in the cloud. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, NSDI '13, pages 329–342, Berkeley, CA, USA, 2013.
[76] H. Yang, A. Breslow, J. Mars, and L. Tang. Bubble-flux: Precise online QoS management for increased utilization in warehouse scale computers. In Proceedings of the 40th Annual International Symposium on Computer Architecture, pages 607–618, 2013.
[77] M. Yu, A. Greenberg, D. Maltz, J. Rexford, L. Yuan, S. Kandula, and C. Kim. Profiling network performance for multi-tier data center applications. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI '11, Berkeley, CA, USA, 2011.
[78] D. Zats, T. Das, P. Mohan, D. Borthakur, and R. Katz. DeTail: Reducing the flow completion time tail in datacenter networks. SIGCOMM Comput. Commun. Rev., 42(4), Aug. 2012.
[79] Y. Zhang, M. Laurenzano, J. Mars, and L. Tang. SMiTe: Precise QoS prediction on real-system SMT processors to improve utilization in warehouse scale computers. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2014.