
Virtual Tree Coherence: Leveraging Regions and In-Network Multicast Trees for Scalable Cache Coherence

Natalie D. Enright Jerger
Dept. of Electrical and Comp. Engineering

University of Wisconsin-Madison

Madison, WI 53706

Email: [email protected]

Li-Shiuan Peh
Dept. of Electrical Engineering

Princeton University

Princeton, NJ 08544

Email: [email protected]

Mikko H. Lipasti
Dept. of Electrical and Comp. Engineering

University of Wisconsin-Madison

Madison, WI 53706

Email: [email protected]

Abstract

Scalable cache coherence solutions are imperative to drive the many-core revolution forward. To fully realize the massive computation power of these many-core architectures, the communication substrate must be carefully examined and streamlined. There is tension between the need for an ordered interconnect to simplify coherence and the need for an unordered interconnect to provide scalable communication. In this work, we propose a coherence protocol, Virtual Tree Coherence (VTC), that relies on a virtually ordered interconnect. Our virtual ordering can be overlaid on any unordered interconnect to provide scalable, high-bandwidth communication. Specifically, VTC keeps track of the sharers of a coarse-grained region and multicasts requests to them through a virtual tree, employing properties of the virtual tree to enforce ordering amongst coherence requests. We compare VTC against a commonly used directory-based protocol and a greedy-order protocol extended onto an unordered interconnect. VTC outperforms both of these by averages of 25% and 11% in execution time, respectively, across a suite of scientific and commercial applications on 16 cores. For a 64-core system running server consolidation workloads, VTC outperforms the directory and greedy protocols with average runtime improvements of 31% and 12%.

1. Introduction

As transistor scaling affords chip architects the ability to place more and more cores on-chip, the coherence and communication substrates are going to become a significant performance bottleneck. Already we are seeing dozens of cores available on-chip from Intel [34] and Tilera [35]. Efficient coherence mechanisms are required to continue scaling these systems into the future.

Throughout the literature, there are two predominant classes of cache coherence: broadcast-based protocols and directory-based protocols. Both classes exhibit problems for emerging many-core chip multiprocessors. Broadcast protocols do not scale well in performance and power, since they rely on a central ordering point for all coherence requests and flood the interconnect with broadcast requests. Directory protocols provide greater scalability by distributing the ordering points across various directories, but latency penalties are paid for traveling to and accessing these ordering points. In large multi-chip multiprocessor systems, it is fairly easy to add extra bits to memory for directory storage and use directories to build scalable coherence protocols. However, to make performance acceptable for on-chip usage, directory information must be stored on-chip (not as part of off-chip memory, as is done in distributed systems). Caching directory information on chip represents substantial overhead; this additional area could be better utilized for other CMP components, such as the processing cores.

In [22], significant directory cache miss rates were observed for commercial workloads (up to 74%). Directory cache miss rates will become an even more significant performance problem with server consolidation workloads. Multiple server workloads sharing the same resources on chip will touch large amounts of memory, placing enormous pressure on these directory caches. So, while directory protocols do provide designers with scalability, they do so with significant performance degradation once directory indirections and directory misses are factored in.

Even with hundreds to thousands of cores on-chip, in the common case only a few processors need to observe a given coherence request. With the single-writer, multiple-reader protocol invariant, only cores that are actively caching a block need to be made aware of a pending write to that block. Since providing global coherence, such as a broadcast, across thousands of nodes is impractical from both a performance and a power standpoint, a logical solution is to maintain coherence amongst just the current subset of readers. Infrequently, it will be the case that every core on-chip does need to observe a coherence request; this case must be handled correctly but not necessarily quickly.

(Footnote: This research was supported in part by the NSF under grants CCR-0133437, CCF-0429854, CCF-0702272, and CNS-0509402, the MARCO Gigascale Systems Research Center, an IBM PhD Fellowship, as well as grants and equipment donations from IBM and Intel. The authors would like to thank Ravi Rajwar and the anonymous reviewers for their comments and suggestions on improving this work.)

In short, to address these key shortcomings of broadcast and directory-based protocols, the desirable properties for scalable on-chip coherence are:

• Limit coherence actions to the necessary subset of nodes: This will reduce power consumption and interconnect pressure.

• Fast cache-to-cache transfers: Send requests to sharers as quickly as possible, avoiding the overheads of directory indirections.

• Limited bandwidth overhead: Limit unnecessary broadcasts to the entire chip; only communicate amongst the subset of sharers where coherence needs to be maintained.

• Limited storage overhead: Make efficient use of on-chip directory cache storage and cache tag array storage.

The need for scalable coherence is not a new one; however, the tight integration and the abundance of on-chip resources provide a new set of opportunities in which to explore this significant challenge. The tight coupling and integration of dozens of computation elements on-chip necessitates the co-design of various components of the memory hierarchy. This work specifically leverages the important interplay of the on-chip interconnection network and the cache coherence protocol to achieve scalable on-chip coherence.

Specifically, in this paper, we propose Virtual Tree Coherence (VTC), which targets and achieves each of the above attributes of an ideal scalable on-chip coherence protocol:

Limit coherence actions to the necessary subset of nodes: We use virtual trees to connect and order sharers, with requests multicast through these virtual trees so that coherence actions are limited to true sharers and not broadcast to all nodes. These virtual trees are the virtual circuit trees recently proposed to support efficient multicasting in on-chip networks [14] and can be mapped onto any unordered interconnect, thus enabling scaling to many-core chips.

Fast cache-to-cache transfers: The root of each virtual tree is used as the ordering point where all requests are first sent and ordered, much like the home node of a vanilla directory protocol. However, as the root is one of the sharers and not a statically mapped home node that may be far away, the cost of indirection is reduced considerably.

Limited bandwidth overhead: By multicasting coherence requests only to the current sharers, VTC avoids the bandwidth overhead of broadcast-based protocols, while approaching their benefit of fast cache-to-cache transfers.

Limited storage overhead: VTC allocates virtual trees for sharers of coarse-grained regions [9], [26], rather than per cache line. The sharers of each region are tracked in local structures and are guaranteed to completely capture all current sharers of a region. Tracking at a coarse granularity reduces the state that needs to be maintained and limits storage overhead.

We show that VTC improves performance by an average of 25% (16-core) and 31% (64-core) over a directory protocol and reduces interconnect bandwidth by an average of 35% (16-core) and 68% (64-core) over a broadcast protocol.

The rest of this paper is organized as follows: background information on the underlying interconnection network and recent coarse-grain optimizations is provided in Section 2. The proposed coherence ordering mechanism, Virtual Tree Coherence, is explained in Section 3, with its implementation details discussed in Section 4. Results are presented in Section 5, followed by related work in Section 6. Finally, the paper is concluded in Section 7.

2. Background

Two key considerations for a coherence protocol are the network and the storage of state information. In this section, we provide background on the use of coarse-grain tracking of coherence state as well as the multicast network design that we leverage for Virtual Tree Coherence.

2.1. Regions

In conventional systems, information about cache coherence is maintained on a per-block granularity. However, by tracking coherence information across multiple contiguous addresses (a region), more optimizations can be enabled. A region is defined as a contiguous portion of memory consisting of a power-of-two number of cache blocks. A cache line address maps to one and only one region and will map to the same region for all processors.
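As a concrete illustration of this mapping (a minimal sketch; the line and region sizes below are examples, not a prescribed configuration), the region an address belongs to is computed by discarding the low-order offset bits, so every core maps a given line to the same region:

```cpp
// Regions are aligned, power-of-two-sized blocks of memory, so mapping a
// cache line address to its region is a shift; every processor computes
// the same region for a given line.
#include <cstdint>

constexpr int kLineShift   = 6;    // 64B cache lines (example)
constexpr int kRegionShift = 10;   // 1KB regions = 16 lines (example)

inline uint64_t line_id(uint64_t addr)   { return addr >> kLineShift; }
inline uint64_t region_id(uint64_t addr) { return addr >> kRegionShift; }
// region_id(addr) identifies one and only one region for any address.
```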

Coarse Grain Coherence Tracking (CGCT) [9] has been proposed to eliminate unnecessary broadcasts in order to improve the scalability of broadcast-based systems. Requests to non-shared regions of the address space can be sent directly to the memory controller rather than ordered on the broadcast bus, which is a precious and limited resource. Tracking sharing patterns at a coarser granularity than cache lines can take advantage of spatial locality and reduce the storage overhead associated with maintaining this information. RegionScout [26] makes similar observations about the benefits of tracking information at a coarse granularity for coherence purposes.

Tracking coherence state at a region granularity has several benefits. If a processor holds a region exclusively, it can upgrade any cache line in that region to a modified state without sending a request to other processors. If a processor knows no other core is caching a region, it can send its cache miss directly to memory without snooping those caches. The CGCT work tracks whether other cores are caching a region; we expand this to encompass not only whether but who is caching each region.
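A minimal sketch of this filtering decision follows (the RegionState bookkeeping here is hypothetical, not the paper's exact structure): an exclusively held region lets a miss skip the snoop entirely.

```cpp
// Sketch of region-granularity filtering. If the region is held
// exclusively, no other core can be caching any line in it, so a miss
// goes straight to memory and an upgrade needs no external request.
enum class RegionState { Invalid, Shared, Exclusive };

// On a cache miss within the region: snoop others only if shared.
inline bool must_snoop_other_caches(RegionState s) {
    return s == RegionState::Shared;   // Exclusive => memory only
}

// On an upgrade within the region: silent if held exclusively.
inline bool upgrade_needs_request(RegionState s) {
    return s != RegionState::Exclusive;
}
```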

The structures required by CGCT and RegionScout have been generalized in RegionTracker [37] to scalably incorporate additional functionality. Specifically, RegionTracker replaces a conventional tag array with region tracking structures and is shown to achieve performance comparable to a conventional fine-grained tag array with the same area budget. With RegionTracker, one structure encompasses the functionality of the fine-grained tags of a conventional cache while also maintaining additional information about larger regions. For details on how RegionTracker achieves this low overhead, we refer the reader to [37].

Further distinguishing this work from previous proposals that used region tracking structures to filter away unnecessary broadcasts [9], [26], to perform DRAM speculation [3], and to guide prefetching [10], Virtual Tree Coherence leverages these structures to keep track of the current set of sharers of a region. Additional bits are thus added to the Region Vector Array (RVA) of RegionTracker to track the current region sharers and the region root node (these additions are discussed in Section 3). The low-overhead design of RegionTracker allows us to optimize for region-based sharing with only a very modest area increase over a conventional L2 cache design.

2.2. Virtual circuit tree multicasting (VCTM)

To reduce bandwidth overhead, Virtual Tree Coherence utilizes multicasting to send requests to sharers within a region, rather than broadcasting to all nodes. However, as state-of-the-art on-chip networks do not support multicasting (due to the high hardware overhead of providing such support), multicast functionality has to be synthesized by sending multiple unicast messages destined for the individual nodes. This creates high bandwidth overhead, which translates to high interconnect delay and power, reducing the attractiveness of multicast coherence protocols such as Virtual Tree Coherence.

Recently, a lightweight multicast network design, Virtual Circuit Tree Multicasting (VCTM), that can be implemented within the tight delay, area, and power budgets of on-chip networks has been proposed [14]. VCTM makes multicast coherence protocols such as the Virtual Tree Coherence proposed here viable for CMPs.

VCTM essentially overlays virtual multicast trees on any unordered interconnect (e.g. mesh, torus, hierarchical meshes). Figure 1(a) illustrates how a multicast message must be realized with multiple unicasts in a network that does not support multicasting. VCTM observes that it is much more efficient (in both power and bandwidth) to replicate messages only at forks in the tree. In this example, a message is thus sent from A to the closest nodes B and C, and those messages are in turn forwarded to the remaining leaves (Figure 1(b)). Finally, Figure 1(c) shows how VCTM maps the logical tree onto an underlying mesh network.

Each tree is associated with a fixed tree route from the source to the destinations in the multicast destination set. This tree route is set up upon the first multicast to a set, with setup packets snaking through the network to build up the tree. The tree route information is then stored in virtual circuit tree (VCT) tables at each router along the path. Subsequent multicasts to the same set of sharers reuse this tree route by looking it up in the VCT tables. At each intermediate router in the network, the packet follows the predetermined tree route, while network resources (virtual channels [12] and buffers) are granted on a per-hop basis so as to maintain high network bandwidth.

As may be evident from this brief description of VCTM, its performance improvement and power savings are predicated on reuse of the multicast tree, as tree reuse amortizes the setup time of the tree route and reduces the size of the VCT table needed. This makes a VCTM network particularly suited for VTC: VTC uses one multicast tree per coarse-grained region and reuses this multicast tree for all requests to the sharers of that region. In the rest of this paper, we discuss how Virtual Tree Coherence leverages VCTM to lower its bandwidth overhead, and how VCTM needs to be modified to support the ordering properties required by VTC.

3. Virtual Tree Coherence

Most cache coherence protocols rely on an ordering point to maintain correctness. In directory-based protocols, the directories serve as the ordering point for requests. With bus-based protocols, the bus serves as the ordering point. Recently, several proposals have leveraged the partial ordering properties of a ring to facilitate cache coherence [24], [31], [32]. Exploiting the partial ordering properties of a ring eliminates some of the disadvantages of using a bus; however, there are some downsides. A ring is a significantly less scalable topology than a mesh or torus. Logically embedding a ring into a mesh topology [32] provides higher-bandwidth communication for data but still suffers from large hop counts and lower bandwidth for ordering requests.

Virtual Tree Coherence observes that ordering can be achieved through structures other than a ring or a bus, in particular, through a tree. Specifically, rather than realizing ordering through a physical tree interconnect [16], Virtual Tree Coherence maintains coherence through virtual trees. These trees are embedded into a physical network of arbitrary topology by VCTM, with a virtual tree connecting the sharers of a region. The root of this virtual tree now serves as the ordering point in place of a directory protocol's home node.

Virtual Tree Coherence is a hierarchical protocol. At the first level, snooping is achieved through logical trees. At the second level of the protocol, the coarse directories provide the caches with information about which processors must be involved in first-level snooping.

In a nutshell, every request is first sent to its virtual tree root node (obtained from the local region tracking structure). This root node orders requests in order of receipt, then multicasts them to all sharers of the region through that region's VCTM virtual tree. The above works for all current sharers. New sharers, however, need to take a two-level approach: they must first go to the directory home node to obtain the root node. The new sharer is then updated into the local region tracking structure and added incrementally into the virtual tree. A sketch of this request flow, from the requester's perspective, follows.
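The sketch below summarizes the two-level flow at a requesting core; the structures and helper names (region_lookup, fetch_from_directory, unicast_to_root) are illustrative assumptions, not the paper's interfaces.

```cpp
// Sketch of the VTC request path at a requester. On a region hit, the
// request is unicast to the region's tree root, which orders it and
// multicasts it down the tree. On a region miss (a new sharer), the
// directory home node is consulted first to learn the root.
#include <cstdint>
#include <optional>

struct RegionEntry { int root; uint16_t sharers; };         // from RegionTracker

std::optional<RegionEntry> region_lookup(uint64_t region);  // assumed helper
RegionEntry fetch_from_directory(uint64_t region);          // second level
void unicast_to_root(int root, uint64_t addr, bool is_write);

void issue_request(uint64_t addr, bool is_write) {
    const uint64_t region = addr >> 10;      // 1KB regions (example)
    auto e = region_lookup(region);
    if (!e) {
        // New sharer: ask the directory home node for the root (and
        // sharing list); this core is added to the tree incrementally.
        e = fetch_from_directory(region);
    }
    unicast_to_root(e->root, addr, is_write);  // root orders + multicasts
}
```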

Virtual Tree Coherence results in several benefits. First, the root node can be strategically selected to be one of the sharers to cut down on latency, especially when sharers are physically clustered close by. In a conventional directory protocol, the directory home node is statically determined by the address and is thus not necessarily a sharer. Second, latency is low, comparable to a broadcast-based protocol, since we do not need to collect acknowledgments or wait for a directory lookup; messages that reach the root node begin their tree traversal immediately. Third, bandwidth overhead is low compared to a broadcast-based protocol, as only the current sharers are involved in the multicast, not all nodes.

3.1. Walk-through Example

TABLE 1
STEPS CORRESPONDING TO FIGURE 2

(1) Both A and B issue requests to the root to modify the same block, owned by F, with A, B, E, F caching this region; E is the root node for this region.
(2a) E receives B's root request and it becomes ordered.
(2b) E receives A's root request; it becomes ordered after B's.
(3) E forwards B's request to all leaf nodes of the region tree. A sees that B's request has been ordered prior to its own request; A knows it will receive data from B after B has completed; A must invalidate its existing copy so as not to read stale data.
(4) E forwards A's request to all leaf nodes of the region tree. B sees its own request forwarded from E, so B knows its request has been ordered; B does not need to wait for acknowledgments. F sees B's request and responds with the data.
(5) B receives the data response from F and completes its transition to Modified state. A, B, F see A's ordered request.
(6) B invalidates its Modified copy and sends the data to A.
(7) A receives the data from B and completes its transition to Modified state.

An example of Virtual Tree Coherence is illustrated in Figure 2, with the corresponding step descriptions presented in Table 1. In this example, Node E is the root of this region's tree; as such, all requests to addresses within that region must be ordered through Node E.

In the example, A and F are initially caching the block in question (A has the block in Shared state and F has the block in Owned state). Both invalidate their blocks when they see B's request from the root node. This prevents A from reading a stale copy of the block after B has written it. Invalidation acknowledgments are unnecessary in VTC for writes to complete, since the virtual trees are snoop-based; this is analogous to the lack of acknowledgments in a snooping bus protocol. With VTC, a write can complete when it sees its own request returned from the root node.

Figure 1. VCTM Overview. (a) Multicast message from A to B-G realized as multiple unicasts; (b) logical multicast tree; (c) VCTM: logical multicast tree mapped to a mesh.

Figure 2. Virtual Tree Coherence Example. This example illustrates two exclusive requesters to the same address in the tree-order protocol. Dashed and curved arrows represent messages originating at or intended for B; solid and straight arrows represent messages originating at or intended for A. E is the root node for the region being accessed.

Currently we use a first-touch assignment policy to determine the root node. When the first system-wide request for a region is made to the directory, the requester becomes the root node. The directory stores the identity of the root node as well as the sharing vector for each region block. Other root assignment policies could be employed; however, the impact of these policies, as well as root migration, is left for future work.

3.2. Ordering

Virtual Tree Coherence provides the following ordering invariants:

• Ordering point: Each region is mapped to a single ordering point, so all requests to the same address go to the same ordering point. This is achieved by assigning a single virtual tree per region and having the root of that virtual tree serve as the ordering point. Requests are unicast to the tree root. This is similar to the use of a directory as an ordering point.

• Sharers observe the same ordering of requests: Requests multicast from the root node must arrive at leaf nodes in the same order. Logically, the tree needs to maintain the ordering of a bus: sending a request to the root node of the tree is equivalent to arbitrating for the bus (observing the forwarding of one's request from the root node is equivalent to gaining access to the bus). All requests sent out from the root of the tree are in a total order. In other words, requests on the same virtual tree must not be reordered by the underlying physical interconnect. This is achieved by modifying VCTM to ensure that each virtual tree is tied to a single virtual channel (see the sketch after this list).

• Cores caching a block must see all coherence requests to that block: A multicast must contain all current sharers. Additional non-sharing cores can be included, but never fewer. This is achieved by the second-level directory always holding a complete list of the sharers. When a non-sharer requests a block, it must first get the sharing list from the directory and be added to the sharing list at that time. Then, when the tree root multicasts the new sharer's request to all sharers, the current sharers add the requester to their sharing lists, so that region sharing lists at the L2 cache are kept up-to-date.

• Write serialization: Ordering through the root node serializes all writes to the same address region. In Section 4.2, we discuss how that write order is preserved through the network from the root to all leaf nodes. Requests to the same virtual tree maintain a total order. Since write order is maintained from the ordering point to the leaf nodes, invalidation acknowledgments are not necessary.

• Write propagation: A write can complete once it sees its ordered request returned from the root node; this guarantees that any subsequent request to that cache block by any processor will receive the newly written value. It is essential that all cores caching the region be involved in the virtual tree; stale values are invalidated when the root forwards the write request to all leaf nodes.
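As a sketch of how these invariants compose at the root (structures and helper names are illustrative, not the paper's implementation), requests are serialized in arrival order, and every multicast for a region travels on that region's tree with a fixed virtual channel so the network cannot reorder two requests between the root and any leaf:

```cpp
// Sketch of root-node ordering under the invariants above. Arrival order
// at the root defines the total order; leaves all observe it because the
// whole tree is pinned to one virtual channel.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Request { int requester; uint64_t addr; bool is_write; };

struct RegionTree {
    int tree_id;               // VCTM virtual circuit tree id
    int virtual_channel;       // fixed VC for this tree (ordering invariant)
    std::vector<int> sharers;  // leaf nodes: current sharers of the region
};

class TreeRoot {
public:
    // Requests are processed one at a time in arrival order; that order
    // is the serialization point for the region (like bus arbitration).
    void order_and_multicast(uint64_t region, const Request& req) {
        RegionTree& t = trees_.at(region);
        for (int leaf : t.sharers)
            inject(leaf, t.tree_id, t.virtual_channel, req);
    }
    std::unordered_map<uint64_t, RegionTree> trees_;
private:
    void inject(int leaf, int tree, int vc, const Request& r) {
        // Hand the packet to the router; replication happens at tree forks.
    }
};
```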

3.3. Coherence States, Requests, and Actions

Coherence information is maintained at the processor at two granularities. The local last-level cache (L2) maintains coherence state information at a cache block granularity. Coherence information is also maintained by RegionTracker, encompassing multiple contiguous cache blocks. At a per-region granularity, we track which external cores are caching the region and the location of the root node for the region.

CGCT was first proposed for SMP systems; on clean-shared misses, a request would go directly to the memory controller rather than waste precious bus bandwidth. Since our work deals with a many-core CMP, we want requests to stay on-chip to save miss latency. Therefore, clean-shared misses are multicast to the other cores caching the region. Table 2 gives an overview of the steps taken based on the region state for loads, stores, and upgrades. Numbers with multiple parts (e.g. 1a and 1b) indicate actions that occur in parallel.

When a core replaces a region, it notifies the root node for that region to remove it from the sharing list. The root node then constructs a new tree for that region with one fewer leaf node. The other cores caching the region do not need to be notified, since only the root is responsible for sending coherence requests to all sharers. A sketch of this replacement flow follows.

3.4. Relationship between Trees and Regions

To maintain coherence, all cores caching a region must see coherence requests to that region. A single tree is maintained at the root node for that region. Remember, an address maps to only one region, so that address participates in a single tree connecting all sharers. All requests use this tree; it is not possible for a single region to map to multiple trees. If that were allowed, it would mean that different cores had different sharing lists for that region, and incoherence would result. Multiple regions can map to the same tree; this simply means that multiple regions are being shared by the same set of processors with the same root node. If all cores are caching a region (for example, one containing a lock variable), a single tree is constructed at the root node with all cores as leaf nodes.

4. Implementation

The following section details issues related to the hardware implementation of Virtual Tree Coherence. Specifically, we first present high-level changes to the system architecture. Next, we discuss the requirements placed on the underlying interconnection network to preserve coherence ordering. Finally, we discuss the area overheads of the additional structures.

4.1. High-level Architecture

Figure 3 presents a high-level diagram of the architecture we are proposing. Each node consists of a core and private L1 instruction and data caches. Modifications are made starting at the private L2 cache; the key structures for this architecture are highlighted with bold borders. A RegionTracker sits alongside the Level 2 cache and maintains sharing lists for currently cached address regions; this structure also encompasses the functionality of the Level 2 tags, thus resulting in very small area overhead. The directory at each node is coarse-grained and also contains sharing lists, but is decoupled from the private L2 cache, as regions are distributed across all the directories in the system. Upon a miss in the RegionTracker, a request has to go to the directory to obtain the current sharing lists, thereby refreshing the RegionTracker. Entries are evicted from the RegionTracker in an LRU fashion. The packet-switched router has been modified to include VCTM extensions, in particular the virtual circuit tree (VCT) table. For details about the VCT overheads, we refer the reader to [14].

Figure 3. Hardware Architecture Diagram (core + L1 I/D cache, L2 cache data array, RegionTracker, coarse directory, and router with VCT table and allocators).

4.2. Network design

Virtual tree construction. VCTM maintains a small content-addressable memory (CAM) of currently active trees at the network interface of each node. RegionTracker provides the destination set required for the coherence request to the network interface controller, which then searches the CAM to determine whether an active tree exists for this destination set. If a tree exists, the CAM returns the virtual circuit tree id, which is used to index into the VCT lookup tables at each router. If no match is found in the CAM, a new tree must be set up; this is done with low overhead, per the VCTM mechanisms.

When a sharer has been added to a region, a new destination set is provided to the network interface controller. This will likely trigger a new tree setup, unless that destination set is already being actively used by another region. Region sharing lists are decoupled from virtual circuit trees; multiple regions can map to a single tree. A sketch of this lookup follows.
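A software sketch of this lookup (the CAM is modeled here with a hash map keyed by the destination-set bits; TreeCam and its members are illustrative names, and the 16-node size is an example):

```cpp
// Destination-set lookup in front of the VCT tables. A hit returns the
// existing virtual circuit tree id; a miss allocates a new tree and
// triggers VCTM setup packets.
#include <bitset>
#include <cstdint>
#include <unordered_map>

constexpr int kNodes = 16;
using DestSet = std::bitset<kNodes>;

class TreeCam {
public:
    int lookup_or_setup(const DestSet& dests) {
        auto it = active_.find(dests.to_ullong());
        if (it != active_.end()) return it->second;   // hit: reuse tree
        int id = next_tree_id_++;                     // miss: new tree
        active_[dests.to_ullong()] = id;
        send_setup_packets(id, dests);                // snaking setup (VCTM)
        return id;
    }
private:
    void send_setup_packets(int id, const DestSet& d) { /* network setup */ }
    std::unordered_map<uint64_t, int> active_;
    int next_tree_id_ = 0;
};
```

Because the table is keyed by the destination set rather than the region, two regions shared by the same set of cores naturally reuse one tree, matching the decoupling described above.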

Preserving order in the network. We modify VCTM to restrict virtual channel allocation so as to maintain the ordering invariants associated with a tree. Specifically, requests must not be reordered from the time they depart the root node to the time they arrive at each leaf node. An unrestricted virtual channel allocation policy would permit such a reordering, leading to a leaf node seeing a different order than other leaves. As such, each region is assigned a single virtual channel. All requests for that region must use the same virtual channel and cannot change virtual channels at intermediate nodes in the network. It should be noted that ordering only needs to be maintained within a single virtual tree, not across the entire network. Hence, different regions and different trees can be assigned to different virtual channels, which avoids degrading network performance.

Scalability. As the number of nodes in the network grows, VCTM requires additional storage for multicast trees. Each node in the network is allocated a portion of the virtual circuit tree tables located in each router.

TABLE 2
VIRTUAL TREE COHERENCE

Region State: Invalid (no information about remote copies of region)
  Load/Store:
    1. Request region destination-set information from directory.
    2. Directory responds with region sharing list.
    3. Region state set to Exclusive/Modified if sharing list is null; else region state is set to Shared.
    4. Load/Store actions performed according to the steps below.

Region State: Shared (other cores are caching the region; copies may be clean or dirty)
  Load:
    1a. Send read request to root node.
    1b. Request data from memory: speculative memory request (partially overlaps memory latency with ordering).
    2. Request is ordered by root node and forwarded to region sharers.
    3. Observe own request: ordered with respect to other requests to this address.
    4. Multicast sharers caching the data respond to the read request with data.
    5. If data is not on chip, wait for the memory response.
  Store:
    1a. Send store request to root node.
    1b. Request data from memory: speculative memory request.
    2. Request is ordered by root node and forwarded to region sharers.
    3. Observe own request.
    4. Region sharers caching the data respond to the store request with data and invalidate their own copies.
    5. Receive data from a multicast sharer, or wait for the memory response if the line is not cached on chip. Once the requester has observed its own request and received the cache line, it is safe to perform the store.
  Upgrade:
    1. Send upgrade request to root node.
    2. Root node forwards upgrade to all sharers.
    3. Region sharers caching the data observe the upgrade request and invalidate the cache block.
    4. If another store/upgrade request is observed first, another request was ordered before our own: invalidate the cache line; the request ordered prior to ours will supply fresh data.
    5. Else, if own request is observed, the upgrade is complete.

Region State: Exclusive or Modified (no other core caching region)
  Load:
    1. No other cores are caching the region; the request does not need to be ordered.
    2. Send read request to memory.
  Store:
    1. No other cores are caching the region; the request does not need to be ordered.
    2. Send store request to memory.
  Upgrade:
    1. Can upgrade without sending a message.
  Replace:
    1. Invalidate all copies cached in the region.
    2a. Send region invalidate acknowledgment to directory.
    2b. Notify root of the invalidation.
    3. Directory removes sharer from the sharing list for the region.

If 64 trees are allocated to each core, in a sixteen-node system this results in 1024 entries in each VCT table. With 64 cores in the system, this grows to 4096 entries. (Note: the size of each VCT entry remains the same, with Destination Set CAM entries widened to account for the added destinations.) As the system grows, so does the number of unique trees; however, server consolidation workloads will not realistically access a large percentage of trees. To better scale the VCTM mechanism, we propose replacing the original Destination CAM (used to track active trees at the network interface controller) with a ternary CAM (TCAM). A TCAM allows us to collapse similar trees by specifying "don't care" bits in the search for an active tree. This allows us to hit on an active tree that includes all necessary destinations for our request, with some extra unnecessary destinations included in the tree. To prevent tree collapsing from driving up bandwidth, we restrict the maximum size of a similar tree to be within 2 links of the original tree. As a result, 8 trees per core can achieve the same performance and hit rates as 64 trees per core, greatly reducing the storage requirements at each router. Additionally, trees with many destinations can map to a single broadcast tree. A sketch of this superset matching follows.
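The sketch below is a software stand-in for the TCAM match (find_collapsible_tree and link_cost are assumed helpers; a real TCAM performs the superset search in parallel in hardware): a request may reuse any active tree whose destination set covers the one needed, provided the tree is at most 2 links larger.

```cpp
// TCAM-style tree collapsing: reuse an active tree that is a superset of
// the needed destination set, bounded to +2 links of extra traffic.
#include <bitset>
#include <vector>

constexpr int kNodes = 64;
using DestSet = std::bitset<kNodes>;

struct ActiveTree { int id; DestSet dests; int links; };

int link_cost(const DestSet& dests);   // links in a minimal tree (assumed)

// Returns a reusable tree id, or -1 if a new tree must be set up.
int find_collapsible_tree(const std::vector<ActiveTree>& active,
                          const DestSet& needed) {
    const int needed_links = link_cost(needed);
    for (const ActiveTree& t : active) {
        bool superset = (t.dests & needed) == needed;  // all needed leaves
        if (superset && t.links <= needed_links + 2)   // bounded overshoot
            return t.id;                               // "don't care" hit
    }
    return -1;
}
```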

4.3. Storage overheads

The storage overheads of Virtual Tree Coherence stem largely from three components: (1) the RegionTracker structure in each core and (2) the second-level coarse-grained directory cache in each core, both adding to the storage overhead; but (3) as RegionTracker obviates the need for L2 tags, the L2 tag array storage is saved. Additionally, the second-level coarse-grained directory cache replaces the fine-grained directory cache required for the baseline directory protocol.

Figure 4. Characterization of Sharers by Region Size (number of region sharers for 64B, 1KB, and 4KB regions across SPECjbb, SPECweb, TPC-H, TPC-W, Barnes, Ocean, Radiosity, and Raytrace).

The sizes of these three components clearly depend on the size of a region, R. To arrive at R, we perform a characterization study across our suite of benchmarks. In Figure 4, we present the sharing patterns for a variety of scientific and commercial workloads for various region sizes. Similar to what has been previously observed [7], [19], the number of sharers for a cache block (cache block size = 64B) is small. As the region size increases (16 and 64 cache lines), the number of sharers increases slightly; this increase is a result of false sharing. Sending multicasts to an increased number of cores will utilize additional bandwidth, but at a substantial area savings for tracking this information. Previous work leveraging regions has used a 1KB region size; similarly, we believe this is a good trade-off between area overhead and multicast sharing information.

The original RegionTracker proposal consumes area comparable to a conventional L2 tag array, assuming 1KB regions and an 8MB data array. We have added additional bits of information, specifically 16 bits to track the multicast sharing vector and 4 bits to track the multicast root node, for 16 cores. The region array size per core is determined by Equation 1 in terms of N, the number of cores, and the region geometry (RSets, the number of sets in the region array; RegSize, the size of each region; RWays, the associativity of the region array). Each entry contains the region tag, 3 bits of state, N bits for the multicast sharing vector, log2(N) bits to identify the root, and 4 bits of state per cache line (valid bit + way) in the region. This design includes per-region storage for both the root node and the sharing vector; however, only the root node needs to be responsible for maintaining the current sharing vector, while each sharer can just maintain the identity of the root node. Storage requirements could be reduced by removing the sharing vector from the RegionTracker and employing a second structure at the root node to look up the sharing vector associated with that region.

RegionArraySize = (Tag + 3 + N + log2(N) + (RegSize / CacheLineSize) × 4) × RSets × RWays    (1)

For example, assuming a 50-bit address, 1KB regions, 1024 region sets, and 8 region ways, we find the region array size to be 113 KB. Our simulation parameters assume an L2 cache of 1MB, where the size of conventional cache tags would be 74 KB.

The second-level directories also consume area; however, compared to a conventional fine-grained directory, coarse-grained directories have significantly greater memory reach, which improves performance by reducing the directory miss rate. The directory size (in bits) is computed using Equation 2, assuming each directory entry contains N bits for the sharing vector and log2(N) bits to identify the owner node (in a conventional directory) or the tree root (with VTC). Directory sizes assuming 1024 directory sets and 16 directory ways are presented in Table 3; a worked check of both equations follows Equation 2.

DirSize = (Tag + N + log2(N) + 3) × DirSets × DirWays    (2)
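The small program below plugs the stated geometries into Equations 1 and 2. It is a sketch: the 26-bit region tag and 31-bit directory tag are assumptions chosen so the totals line up with Table 3; the paper does not state the tag widths explicitly.

```cpp
// Worked check of Equations 1 and 2 (tag widths are assumed, see above).
#include <cstdio>

int main() {
    const long N = 16;                 // cores
    const long log2N = 4;              // log2(N)
    const long region_size = 1024;     // 1KB regions
    const long line_size = 64;         // 64B cache lines
    const long rsets = 1024, rways = 8;
    const long rtag = 26;              // assumed region tag width (bits)

    // Equation 1: bits per RegionTracker entry x sets x ways.
    long region_bits =
        (rtag + 3 + N + log2N + (region_size / line_size) * 4) * rsets * rways;
    std::printf("RegionArraySize = %ld Kbits (%ld KB)\n",
                region_bits / 1024, region_bits / 8 / 1024);  // 904 Kbits, 113 KB

    // Equation 2: coarse-grained directory with 1024 sets x 16 ways.
    const long dsets = 1024, dways = 16;
    const long dtag = 31;              // assumed directory tag width (bits)
    long dir_bits = (dtag + N + log2N + 3) * dsets * dways;
    std::printf("DirSize = %ld Kbits\n", dir_bits / 1024);    // 864 Kbits
    return 0;
}
```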

Bit requirements for the baseline configuration (a conventional L2 tag array + fine-grained directory cache) are compared against the RegionTracker + coarse-grained directories in Table 3, with the parameters detailed above. We see that VTC has a 21% storage overhead over the baseline directory-based protocol.

However, this storage overhead can be tuned by reducing the size of the coarse-grained directories, since a single entry in the coarse-grained directory covers 16 times more memory than a corresponding entry in a fine-grained directory, given a region size that is 16 times the cache line size. So, we can reduce the coarse-grained storage to trade off the storage overhead of the RegionTrackers, while still being able to cache and cover more memory than conventional fine-grained directories. Setting the number of bits of the fine-grained directory + conventional L2 tag array equal to that of the coarse-grained directory + RegionTracker, we have enough bits for a smaller coarse-grained directory composed of 690 sets and 16 ways. With 690 sets and 16 ways, the small (area-equivalent) coarse-grained directory can cache 11MB of memory, versus the 1MB of memory cacheable with the fine-grained directory.

Using the same number of sets and ways in both the fine- and coarse-grained directories, the coarse-grained directory + RegionTracker configuration consumes 21% more bits. More geometries could be explored to find additional area and memory-reach trade-offs.

TABLE 3
STORAGE COMPARISONS IN KBITS

(a) Storage Breakdown
Conventional Directory-Based Protocol:
  Conventional L2 Tag Array                      592
  Fine-Grained Directory                         896
Virtual Tree Coherence:
  RegionTracker                                  904
  Coarse-Grained Directory                       864
  Area-Equivalent Coarse-Grained Directory       582

(b) Kbit Totals
  Conventional L2 Tag Array + Fine-Grained Directory          1488
  RegionTracker + Coarse-Grained Directory                    1768
  RegionTracker + Area-Equivalent Coarse-Grained Directory    1486

4.3.1 Scalability. In Section 4.2, we discussed a TCAM technique to improve the scalability of VCTM for larger systems. For the other added hardware discussed in this section, we consider scalability issues as the number of cores in the system grows. First, we expect that much larger regions will provide benefit for systems running server consolidation workloads. Coarse address regions will include more false sharing but will still keep virtual machines isolated from one another, providing performance benefits and scalability.

A concentrated mesh (CMESH) [5] has been proposed for large systems. A CMESH groups four cores onto a single router, so a 64-core system requires a 4x4 mesh. This clustering can be applied to our storage structures as well; four cores can share a region array, last-level caches, and a coarse-grain directory to reduce the amount of required storage. A CMESH will also reduce pressure on the VCTM hardware; multicasts can be routed to network nodes and then broadcast to the four tiles connected to each router.

5. Evaluation

In the following sections, we present our evaluation methodology as well as detailed information regarding the baseline systems, followed by results.

5.1. Methodology

We use PHARMsim, a full-system multiprocessor simulator [8], [18] built on SIMOS-PPC. Included in our simulation infrastructure is a cycle-accurate network model including pipelined routers, buffers, virtual channels, and allocators for both the baseline packet-switched mesh and the routers augmented with VCTM. Our simulation parameters are given in Table 4. Results are presented for the following commercial workloads: TPC-H and TPC-W [33], SPECweb99 and SPECjbb2000 [30], and several SPLASH-2 workloads [36]. Details for each workload are presented in Table 5. We compare Virtual Tree Coherence against two baselines, a directory protocol and a greedy-order protocol, which are explained below.


TABLE 4
SIMULATION PARAMETERS

Cores: 16 and 64 in-order cores

Memory System:
  L1 I/D Caches (latency): 32 KB, 2-way set assoc. (1 cycle)
  Private L2 Caches: 1 MB (16 MB total), 8-way set assoc. (6 cycles), 64-byte lines
  RegionTracker: 1024 sets, 8 ways, 1KB regions (associated with each L2)
  Memory Latency: 500 cycles

Interconnect:
  Packet-Switched Mesh: 3 pipeline stages, 8 VCs with 4 buffers per VC
  VCTM: 64 trees per source node (1024 total)

Statistical simulation is used to quantify overall performance with 95% confidence intervals [4].

We have configured our simulation environment to support server consolidation workloads [15], [25] for up to 64 cores. For the server consolidation workloads, we create homogeneous combinations of each of the commercial workloads listed in Table 5; e.g., we run 4 copies of SPECjbb to create a 64-core workload. Each virtual machine is scheduled to maintain affinity among the threads of its workload.

TABLE 5
BENCHMARK DESCRIPTIONS

SPECjbb: Standard Java server workload utilizing 24 warehouses, executing 200 requests
SPECweb: Zeus Web Server 3.3.7 servicing 300 HTTP requests
TPC-W: TPC's Web e-commerce benchmark, DB tier
TPC-H: TPC's Decision Support System benchmark, IBM DB2 v6.1 running query 12 with a 512MB database and 1GB of memory
Barnes: 8K particles, full end-to-end run including initialization
Ocean: 514x514, full end-to-end run (parallel phase only)
Radiosity: -room -batch -ae 5000 -en .050 -bf .10 (parallel phase only)
Raytrace: car input (parallel phase only)

5.1.1 Baseline I: Directory-Based Coherence. The first baseline we evaluate VTC against is a standard directory protocol modeled after the SGI Origin protocol [17]. This protocol suffers from the latency overhead associated with an indirection through a directory on each cache miss. Additionally, to make this protocol amenable to a many-core architecture, directory caches must be used. Misses to these directory caches suffer the latency overhead of going off-chip to memory and can be quite frequent for server workloads, and even more frequent for server consolidation workloads. For a set of commercial workloads, miss rates between 22% and 74% have been observed [22].

5.1.2 Baseline II: Greedy-Order Region Coherence. Greedy-order protocols have been proposed for ring interconnects [6], [24], [29] and overlaid atop unordered interconnects [32]. Here, as a second baseline, we map and optimize a greedy-order protocol that leverages the same region tracking structures and multicast network that VTC uses. The key difference is that VTC relies on the virtual tree for ordering, while greedy order does not.

TABLE 6
STEPS CORRESPONDING TO FIGURE 5

(1) Both A and B issue requests to all processors caching the region to modify a block owned by F, with A, B, E, F caching this region.
(2) A's request reaches B and is replicated and forwarded to E and F. B's request reaches E and F.
(3) A's request reaches F. B's request reached F (the owner) first, so B's request will win.
(4) E and F respond with acknowledgments to A's and B's requests. B gets an owner acknowledgment from F. F transitions from Owned to Invalid.
(5) A receives E's acknowledgment; B receives E's acknowledgment.
(6) A receives F's acknowledgment; B receives F's owner acknowledgment.
(7) B knows its Modified request will succeed; it sends a negative acknowledgment to A.
(8) A receives the negative acknowledgment from B; it has now collected all acknowledgments and did not succeed, so it acknowledges B's request and must retry its own request.
(9) B collects its final acknowledgment from A and successfully transitions to Modified state.

In greedy-order protocols, requests are ordered by the current owner. Requests are live as soon as they leave the requester; in other words, they do not need to arbitrate for a shared resource such as a bus or pass through a central ordering point such as a directory. A request becomes ordered when it reaches the owner of the cache block (another cache or memory). In the common case, when no race occurs, these requests are serviced very quickly because they do not require the additional latency of an indirection through a directory. In [24], requests complete after they have observed the combined snoop response that trails the request on a ring. Based on the combined response, a request is successful or must retry.

Mapping and extending greedy-order protocols onto an unordered interconnect such as a mesh requires acknowledgments to be collected from all relevant processors. Also, extending to regions rather than individual cache lines means that acknowledgments have to be collected only from cores that are caching that particular address region, rather than from all cores. So, the number of acknowledgments expected is derived from the sharing vector in the region cache. The collection of acknowledgments is similar to the combined response on the ring but requires more messages. This is also similar to the process of collecting invalidates in a directory protocol. The owner sends an owner acknowledgment signifying the transfer of ownership. If no owner acknowledgment is received, then another request was ordered before this one and this request must retry. Greedy-order can be applied in a broadcast fashion as well, where no sharers are tracked and acknowledgments are gathered from every processor. An example of greedy order is depicted in Figure 5 and walked through in Table 6. A sketch of the acknowledgment-collection logic follows.
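The following sketch shows acknowledgment collection at a greedy-order requester (AckState and the helper names are illustrative, not the baseline's exact implementation); the expected count comes from the region's sharing vector, and success requires the owner acknowledgment:

```cpp
// Greedy-order acknowledgment collection at a requester.
struct AckState {
    int expected;           // sharers in the region's sharing vector
    int received = 0;
    bool owner_ack = false; // did ownership transfer to us?
    bool nack = false;      // did a competing winner nack us?
};

// Called for each incoming acknowledgment; returns true when all are in.
bool on_ack(AckState& s, bool is_owner_ack, bool is_nack) {
    s.received++;
    s.owner_ack |= is_owner_ack;
    s.nack |= is_nack;
    return s.received == s.expected;
}

// Once complete: succeed only if ownership transferred; otherwise retry.
bool request_succeeded(const AckState& s) {
    return s.owner_ack && !s.nack;
}
```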

5.2. Results

In the following sections, we present quantitative performance results comparing Virtual Tree Coherence against our two baselines. Additionally, we present a comparison between Virtual Tree Multicast (VT-M) and Virtual Tree Broadcast (VT-B). With VT-B, a virtual tree connects all nodes; however, regions are used to designate the root node so that there is not a single root bottleneck. Significant network bandwidth and dynamic power can be saved by limiting coherence actions to multicasts instead of broadcasts.

In Figure 6, results are presented for Greedy-Order Multicasting, Virtual Tree Broadcast, and Virtual Tree Multicast Coherence.

Figure 5. Greedy-Order Example. This example illustrates two exclusive requesters in the greedy-order region protocol. Dashed and curved arrows represent messages originating at or intended for B; solid and straight arrows represent messages originating at or intended for A. Time progresses from left to right in the figure.

All results are normalized to Baseline I: Directory Coherence. Overall, significant performance gains are achieved by VT-B and VT-M, up to 39% and 38% respectively (19% and 25% on average) when compared to the directory protocol. Virtual Tree Coherence outperforms Greedy-Order by up to 31%, with an average improvement of 11%. In a 4x4 system, the differences between VT-B and VT-M are minor; however, as systems scale, the difference between the two becomes much more pronounced, with favorable results for VT-M (Figure 6b).

In a couple of instances, notably TPC-H and SPECjbb, VT-B outperforms VT-M. With larger memory footprints and irregular access patterns, these workloads experience much larger region miss rates, which incur additional overhead to re-fetch region information from the second-level directory. For SPECjbb and TPC-H, 21% and 18% of L2 misses also result in region misses; the rest of the workloads experience region miss rates of less than 10%. SPECweb sees only a small performance improvement from VTC (5%); SPECweb sees the sharpest increase in traffic, which limits the performance improvement. Techniques to improve the region hit rate and lower the number of false sharers will lead to performance improvements for SPECweb.

In Figure 6b, the difference between VT-B and VT-M becomes more pronounced: VT-M outperforms VT-B by an average of 11% and up to 16%. With a 64-core system, broadcasting becomes more expensive (in both performance and power). VT-M provides more isolation for the virtual machines; coherence requests are only sent to nodes involved in sharing. VT-B sends broadcasts to all nodes (across multiple virtual machines).

In Figure 6, Greedy-Order, VT-B, and VT-M all leverage the benefits of VCTM. On a non-VCTM packet-switched mesh, the performance of VT-M degrades by an average of 15%. Greedy-Order, which places significant pressure on the interconnect due to retries, sees performance degradations of 48% when VCTM is removed. VCTM simplifies ordering of coherence requests in the network and is essential for performance improvements and scalability. Without VCTM, in a 64-core system, VT-B sends out 63 coherence packets for each cache miss, which saturates the network.

Both VTC and directory protocols require an indirection to the ordering point for coherence requests. VTC derives performance benefit in part from reducing the cost of these indirections.

Figure 6. Performance of Coherence Protocols. Normalized runtime for Greedy-Order Multicast, VT-B, and VT-M: (a) performance of single workloads on 16 cores; (b) performance of 64-core server consolidation workloads.

With VTC, the hop count to the ordering point is reduced by 15% for 16 cores and 50% for 64 cores, since the root node is a region sharer. Furthermore, on average 4.2x more coherence requests are ordered in zero hops with VTC than with the directory protocol.

The interplay between region size and the efficiency of a VCTM network is an interesting motivation for the need to co-design the coherence protocol and the interconnect. Choosing small regions results in a much larger number of unique trees being needed; this large number of unique trees causes the virtual circuit trees to thrash in the network. The virtual circuit tree hit rate in the interconnect ranges from 78% to 99% for 1KB regions; the hit rate drops to 65% to 95% for 64B regions, as depicted in Figure 7.


[Figure 7. Impact of Region Size on Virtual Circuit Tree Hit Rate (16 cores). Virtual circuit tree hit rate per workload for 64B, 1KB, and 4KB regions.]

With a lower hit rate, there are more tree setups in the network; recall from the original VCTM proposal that tree setup requires replication of the multicast message into many unicasts, resulting in a short burst of traffic during the setup phase which impacts interconnect latency and throughput. As a result, a 1KB region, which suffers from a modest amount of false sharing compared to 64B regions, actually has 3% less interconnect traffic by making better use of virtual circuit trees. With 4KB regions, performance is similar to 1KB regions, with slight degradations observed for SPECweb and Ocean.
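To make the region-size tradeoff concrete, the following back-of-the-envelope sketch (our illustration; the 4MB footprint and 64-entry table size are hypothetical parameters, not measurements from our evaluation) counts how many distinct regions, and hence candidate virtual circuit trees, a shared working set induces at each region size:

#include <stdio.h>

/* Illustrative only: each region with a distinct sharing pattern needs its
 * own virtual circuit tree, so smaller regions mean more unique trees
 * competing for the fixed-size VCT table at each source node. */
int main(void) {
    const long footprint = 4L * 1024 * 1024;   /* hypothetical 4MB shared working set */
    const int  vct_entries = 64;               /* hypothetical per-node VCT table size */
    const long region_sizes[] = {64, 1024, 4096};

    for (int i = 0; i < 3; i++) {
        long regions = footprint / region_sizes[i];
        printf("region=%4ldB -> %6ld unique regions vs. %d VCT entries\n",
               region_sizes[i], regions, vct_entries);
    }
    return 0;
}

Under these assumptions, 64B regions oversubscribe the tree table by 16x more than 1KB regions do, which is consistent with the hit-rate gap shown in Figure 7.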

Restricting a virtual tree to a single virtual channel is necessary to maintain in-network order; however, with a large number of available trees, we are able to use network bandwidth efficiently and only see a degradation of 3% in end-to-end network latency (versus an unrestricted virtual channel allocation).
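As a minimal sketch of how this single-virtual-channel restriction can be enforced (the router structures and names below are our assumptions, not the implementation evaluated here), the virtual-channel allocator simply reuses the one VC bound to a tree at setup time rather than choosing any free VC:

#include <stdint.h>

#define NUM_VCS 4

/* Hypothetical per-router table: the single VC bound to each virtual
 * circuit tree when the tree was set up. */
static uint8_t tree_vc[1024];

/* Bind a tree to one VC at setup time (here, a simple hash of the id). */
void bind_tree_to_vc(uint16_t tree_id) {
    tree_vc[tree_id] = tree_id % NUM_VCS;
}

/* VC allocation for a flit of an ordered multicast: the flit may only use
 * its tree's VC. Because all requests on a tree share one FIFO VC along
 * the same path, they cannot be reordered in the network. */
int allocate_vc(uint16_t tree_id, const int vc_free[NUM_VCS]) {
    int vc = tree_vc[tree_id];
    return vc_free[vc] ? vc : -1;  /* -1: stall until the tree's VC drains */
}

Stalling on a busy VC is what costs the 3% end-to-end latency noted above; with many trees spread across the VCs, the remaining virtual channels stay well utilized.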

5.2.1 Activity Comparisons. Figure 8 shows the network activity (based on link traversals by flits) for each coherence protocol relative to directory coherence. Of course, directory coherence has the lowest interconnect traffic, since nearly all of the messages are of a unicast (point-to-point) nature (invalidation requests from the directory are the exception). Data traffic is similar for each protocol; the main difference lies in the required coherence traffic. Greedy-Order requires the most interconnect bandwidth of all the protocols, averaging 3.8x the number of link traversals as the directory protocol. VT-M consumes less network bandwidth than VT-B, a 35% usage reduction on average, with the most significant reductions of 68% for SPECjbb and 40% for Ocean. A large fraction of Ocean's references are memory misses; VT-M will optimize and go directly to memory if no other cores are caching the region. These memory misses are broadcast to all cores in VT-B, resulting in a bandwidth spike when compared to VT-M and Greedy-Order. The interconnect traffic difference between VT-B and VT-M grows from 35% with 16 cores to 68% with 64 cores. VT-M requires 1.6x more traffic than a directory protocol for 64 cores.

Network activity is only part of the story for power consumption differences in VT-B and VT-M. VT-B will consume significantly more power since all caches will snoop all coherence requests; VT-M eliminates a significant fraction of the cache accesses required with VT-B. The retries in Greedy-Order also increase the number of cache accesses required.

[Figure 8. Interconnect Traffic Comparison Normalized to Directory for 16 cores (measured in link traversals by flits). Data and coherence traffic for Greedy-Order, VT-B, and VT-M on SPECjbb, SPECweb, TPC-H, TPC-W, Barnes, Ocean, Radiosity, and Raytrace.]

5.3. Discussion

In the previous sections, we illustrated that VTC could provide substantial performance improvements over a directory-based protocol. VTC Broadcast performs well without the need to track multicast sharing groups; however, the broadcast protocol places heavy demands on the interconnection network and consumes significant unnecessary power, limiting its ability to scale.

VTC largely outperforms Greedy-Order, but there are a few benchmarks where Greedy-Order does slightly better. The key downside to Greedy-Order protocols is that they can have an unbounded number of retries when certain race conditions occur (which can lead to starvation). In practice, we found the number of retries to be small (fewer than 10 retries per 1000 L2 misses), with a few exceptions: Radiosity and TPC-H exhibit large numbers of retries, 696 and 327 per 1000 L2 misses, which accounts for the large spike in traffic for Greedy-Order. An additional downside of a Greedy-Order protocol on a mesh, as compared to its use on a ring, is the additional messages required. With no combined snoop response, there is a significant increase in acknowledgment messages, placing high bandwidth demands on the interconnect.

In addition to the problem of retries, the Greedy-Order protocol makes no use of the implicit ordering properties of the underlying network functionality. Greedy-Order only utilizes the virtual multicast trees to overcome the overhead associated with multicasting on a mesh network. Greedy-Order also places substantially more pressure on the VCTM hardware; with VTC, each region has at most one tree. With Greedy-Order, each region sharer creates its own tree, resulting in a forest of trees for widely shared variables.

In Section 1, we posited four key features of a scalable coherence protocol for many-core architectures. Here we examine how well VTC achieves these goals.

• Limit coherence actions to the necessary subset of nodes: Coherence requests are seen by an average of 4 cores with 1KB region sizes. Increasing the region size reduces storage overhead at the expense of more unnecessary cores becoming involved in coherence requests. For example, with a region size of 4KB, on average 10% more cores see each coherence request.

• Fast cache-to-cache transfers: The ordering point, the virtual tree root, is located at one of the region sharers, reducing the latency of ordering a request.


Virtual Tree Coherence outperforms Directory Coherence by 25% on average, a direct result of fast cache-to-cache transfers. VTC outperforms Greedy-Order by an average of 11% by avoiding retries.

• Limited bandwidth overhead: VTC reduces on-chip bandwidth utilization by an average of 35% over a broadcast protocol and 45% over a greedy protocol. We increase on-chip bandwidth utilization by a factor of 2.1x over a directory protocol. Only in one instance (SPECweb) did this bandwidth increase result in limited performance improvement.

• Limited storage overhead: In our modified system, RegionTracker + Coarse-Grain Directories consume 21% more area than the baseline L2 tags + Fine-Grain Directories. However, with this increase in area comes significantly more reach for our directories. The coarse directories are able to touch 16x more memory with 1KB regions and the same number of directory entries (see the worked calculation after this list). Additionally, with a fixed area budget (baseline: L2 Tags + Fine-Grain Directories; VTC: RegionTracker + Coarse-Grain Directories), the Coarse-Grain Directories have a 4% lower directory cache miss rate than the Fine-Grain Directories. Considerable performance improvement is possible with VTC without incurring large area overheads.

• Scalability: With a 64-core system, VTC outperforms both a directory protocol and VT-B by an average of 31% and 15%, respectively.
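To see where the 16x reach figure in the storage-overhead bullet comes from (a back-of-the-envelope check, assuming 64B cache lines, which is the smallest region size in Figure 7, and N directory entries), the memory covered by a directory is just its entry count times its tracking granularity:

\[
\text{reach}_{\text{fine}} = N \cdot 64\,\text{B}, \qquad
\text{reach}_{\text{coarse}} = N \cdot 1024\,\text{B}, \qquad
\frac{\text{reach}_{\text{coarse}}}{\text{reach}_{\text{fine}}} = \frac{1024}{64} = 16.
\]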

6. Related Work

Cache coherence research has been of significant interest in both single- and multi-chip multiprocessor systems. A variety of protocols have been proposed and implemented to achieve performance and scalability on both ordered and unordered interconnects. We contrast these prior works with Virtual Tree Coherence in the following sections.

6.1. Ordered Interconnect

Multicast snooping and destination set prediction [7], [19] use prediction mechanisms to determine which processors will likely need to see a coherence request. In contrast, our work determines exactly who must be included in a multicast. Extra cores might be contained in the destination set, but never fewer cores than necessary. These protocols rely on a totally ordered interconnect for sending out multicast requests. Our design relaxes this constraint, permitting a higher-performance interconnect. Requests to the same address region use the same virtual tree and are restricted to using the same virtual channel as prior requests to that region. This virtual channel restriction prevents messages from becoming reordered with respect to each other in the network.

Bandwidth Adaptive Snooping [21] employs a hybrid protocol that achieves the latency of broadcasting when bandwidth is plentiful but converts to a directory-style protocol when bandwidth is limited. This work relies on a totally ordered interconnect but overcomes some of the pressure that large snooping systems can place on the interconnect.

6.2. Unordered Interconnect

Token coherence provides the token abstraction to decouple performance from correctness [20]. Several variants of Token Coherence have been proposed, including one based on broadcasts and one on directories. TokenB, the broadcast Token protocol, requires more bandwidth than multicasting with Virtual Tree Coherence. Extensions have been proposed for multi-chip CMP-based systems in [23].

Virtual hierarchies [25] propose cache coherence variations targeting server consolidation workloads running on chip multiprocessors. One proposal utilizes two levels of directories to provide fast local coherence and correct (and substantially slower) global coherence (with the observation that global coherence is rare). The other proposal still utilizes local directories for fast coherence within a server application and a backing broadcast protocol for global coherence. The alternative of utilizing a local broadcast backed by a global directory protocol is mentioned but not explored. The coherence mechanism in this work is similar to the latter case. First of all, we examine a different hierarchy than what is proposed in Virtual Hierarchies. As such, we do not provide a direct quantitative comparison; however, we do provide some points that further distinguish our proposal.

Some of the performance improvements of Virtual Hierarchies are predicated on the ability of the scheduler to provide locality between communicating and sharing cores or threads. Virtual Hierarchies will work when locality is not preserved; however, we believe that the coupling of multicast coherence with a fast multicast substrate (VCTM) results in superior performance. Virtual Tree Coherence will support flexible placement and scheduling of communicating threads, whereas the benefits achieved with Virtual Hierarchies are predicated on physical proximity.

The actions of the second-level directories in Virtual Tree Coherence are very simple, unlike directories in other hierarchical protocols. Virtual Hierarchies requires a very large number of states and transitions in the coherence protocol to accommodate two levels of directories; this is not the case for Virtual Tree Coherence. The directories contain the sharing list of each region that is cached anywhere on chip, the identity of the tree root for that region, and whether a block is owned on-chip or if memory is the owner.
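The per-region directory state just described is compact enough to sketch directly. The layout below is our own illustration for a 16-core system; the field names and widths are assumptions rather than the implemented format:

#include <stdint.h>

/* Illustrative second-level directory entry for one region, holding the
 * three pieces of state described in the text. */
struct region_dir_entry {
    uint16_t sharers;       /* bit-vector: which of 16 cores share the region */
    uint8_t  tree_root;     /* node id of the virtual tree root (a region sharer) */
    uint8_t  owned_onchip;  /* 1 if a cache owns blocks of the region, 0 if memory */
};

Because a single small entry covers an entire region rather than one line, this is the same state reduction that gives the coarse directories their 16x reach in the discussion above.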

In-Network Cache Coherence [13] replaces directories by embedding sharing information in tree structures within the network. These virtual trees (different from VTC's virtual circuit trees) are used to locate data on-chip. When a request is en route to the directory, it can bump into a tree, which will redirect the request to the appropriate core that is sharing the cache block. This optimization targets the directory indirection latency and can lead to fewer interconnect hops to find a valid cache line.

However, with In-Network Coherence, depending on the route taken by a request, a sharer may be nearby, but the request may miss it and still have to travel the whole stretch to the directory. In such scenarios, VTC will perform better. Cache misses that do bump into a tree in In-Network Coherence will be satisfied more quickly than requests that have to travel a significant distance to the root node in VTC. It should be noted, though, that in VTC it is always the case that the root node is a sharer of the region, which may be closer than the statically-mapped directory node in In-Network Coherence. Also, VTC utilizes coarse-grained tracking, which requires less storage overhead than the per-line, per-hop storage needed in In-Network Coherence.

UnCorq [32] broadcasts coherence requests on an unordered


interconnect (e.g., a mesh) and then orders snoop responses via a logical ring. Similarly, we utilize the ordering implied by logical trees to maintain coherence; however, we order requests via virtual trees rather than responses. An additional difference is the use of multicasting instead of the full broadcast used by UnCorq. Greedy-Order bears some similarity to UnCorq: requests are sent to sharers quickly without regard to order. UnCorq then orders responses via a logical ring, whereas Greedy-Order uses the owner to order requests.

Trees have also been leveraged in previous proposals to build more scalable directories for large distributed shared memory machines [11], [28]. These trees are used to reduce storage overhead, but the directories still serve as the ordering point.

6.3. Network Designs Cognizant of Cache Coherence

In addition to relevant work in the domain of cache coherence protocols, VTC also examines the requirements that are placed on the interconnect to maintain correctness. These requirements have also been examined in prior work.

The Rotary Router [1], [2] provides mechanisms to maintain the ordering of coherence requests in the network and prevent coherence deadlock within the interconnect. Buffer resource allocation is divided up between dependent messages to prevent dependent messages from deadlocking in the network due to unavailable resources. In-order delivery is guaranteed by forcing ordered messages to traverse the same path.

Another common solution to avoiding deadlock and preventing message reordering is to dedicate a separate virtual network to each class of coherence message (e.g., requests vs. responses), where each virtual network has distinct virtual channels. This technique is employed by the Alpha 21364 [27]. VTC forces the underlying network to deliver coherence requests in order by restricting a virtual tree to use a single virtual channel.

7. Conclusion

This work proposes a new coherence protocol for many-core architectures, Virtual Tree Coherence. Unordered interconnection network topologies such as a mesh or a torus can be overlaid with an ordering invariant to more easily facilitate cache coherence mechanisms. We utilize one such network, Virtual Circuit Tree Multicasting, to realize Virtual Tree Coherence as a scalable on-chip cache coherence solution that improves performance by an average of 25% over a directory-based protocol. By relying on coarse-grained region coherence state, we reduce the on-chip storage overhead for coherence state substantially, without the expected negative side effects of false sharing and coherence thrashing. Instead, we employ efficient on-chip multicasting to reach all nodes in the sharing set, and maintain a total order of messages to the same region by restricting each tree to a single virtual channel. Virtual Tree Coherence is simple in concept and implementation since it relies on a straightforward ordering invariant based on a logical tree. In summary, we have extended the fruitful space of region-based optimizations to include a scalable multicasting protocol. Extending Virtual Tree Coherence to the domain of server consolidation workloads [15], where sharing is generally limited to within a virtual machine and to subsets of cores within that VM, results in even more substantial benefits, with an average improvement of 31%.

References

[1] P. Abad, V. Puente, and J. Gregorio, "Reducing the interconnection network cost of chip multiprocessors," in NOCS, 2008.
[2] P. Abad, V. Puente, J. Gregorio, and P. Prieto, "Rotary router: an efficient architecture for CMP interconnection networks," in ISCA, 2007.
[3] N. Aggarwal, J. Cantin, M. Lipasti, and J. E. Smith, "Power-aware DRAM speculation," in HPCA-12, 2008.
[4] A. R. Alameldeen and D. A. Wood, "Variability in architectural simulations of multi-threaded workloads," in Proceedings of HPCA-9, 2003.
[5] J. Balfour and W. Dally, "Design tradeoffs for tiled CMP on-chip networks," in International Conference on Supercomputing, 2006.
[6] L. A. Barroso and M. Dubois, "The performance of cache-coherent ring-based multiprocessors," in ISCA-20, 1993.
[7] E. Bilir, R. Dickson, Y. Hu, M. Plakal, D. Sorin, M. Hill, and D. Wood, "Multicast snooping: A new coherence method using a multicast address network," in Proc. of ISCA, May 1999.
[8] H. Cain, K. Lepak, B. Schwarz, and M. H. Lipasti, "Precise and accurate processor simulation," in Workshop on Computer Architecture Evaluation using Commercial Workloads, 2002.
[9] J. F. Cantin, M. H. Lipasti, and J. E. Smith, "Improving multiprocessor performance with coarse-grain coherence tracking," in ISCA-32, 2005.
[10] ——, "Stealth prefetching," in International Conference on Architectural Support for Programming Languages and Operating Systems, 2006.
[11] Y. Chang and L. N. Bhuyan, "An efficient tree cache coherence protocol for distributed shared memory multiprocessors," IEEE Transactions on Computers, vol. 48, no. 3, 1998.
[12] W. J. Dally, "Virtual-channel flow control," in ISCA, 1990.
[13] N. Eisley, L.-S. Peh, and L. Shang, "In-network cache coherence," in International Symposium on Microarchitecture, 2006.
[14] N. Enright Jerger, L.-S. Peh, and M. H. Lipasti, "Virtual circuit tree multicasting: A case for on-chip hardware multicast support," in Proceedings of ISCA-35, 2008.
[15] N. Enright Jerger, D. Vanatrease, and M. Lipasti, "An evaluation of server consolidation workloads for multi-core designs," in IISWC, 2007.
[16] K. Gharachorloo, M. Sharma, S. Steely, and S. V. Doren, "Architecture and design of AlphaServer GS320," in Architectural Support for Programming Languages and Operating Systems, 2000.
[17] J. Laudon and D. Lenoski, "The SGI Origin: a ccNUMA highly scalable server," in ISCA-24, 1997.
[18] K. M. Lepak, H. W. Cain, and M. H. Lipasti, "Redeeming IPC as a performance metric for multithreaded programs," in Proceedings of 12th PACT, 2003, pp. 232-243.
[19] M. M. K. Martin, P. J. Harper, D. J. Sorin, M. D. Hill, and D. A. Wood, "Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors," in Proceedings of the 30th ISCA, June 2003.
[20] M. M. K. Martin, M. D. Hill, and D. A. Wood, "Token coherence: Decoupling performance and correctness," in ISCA-30, 2003.
[21] M. M. K. Martin, D. J. Sorin, M. D. Hill, and D. A. Wood, "Bandwidth adaptive snooping," in HPCA-8, 2002.
[22] M. R. Marty, "Cache coherence techniques for multicore processors," PhD dissertation, University of Wisconsin-Madison, 2008.
[23] M. R. Marty, J. D. Bingham, M. D. Hill, A. J. Hu, M. M. K. Martin, and D. A. Wood, "Improving multiple-CMP systems using token coherence," in HPCA, February 2005.
[24] M. R. Marty and M. D. Hill, "Coherence ordering for ring-based chip multiprocessors," in MICRO-39, December 2006.
[25] ——, "Virtual hierarchies to support server consolidation," in ISCA-34, 2007.
[26] A. Moshovos, "RegionScout: Exploiting coarse grain sharing in snoop-based coherence," in ISCA-32, 2005.
[27] S. S. Mukherjee, P. Bannon, S. Lang, A. Spink, and D. Webb, "The Alpha 21364 network architecture," IEEE Micro, vol. 22, no. 1, pp. 26-35, 2002.
[28] H. Nilsson and P. Stenstrom, "The scalable tree protocol: a cache coherence approach for large-scale multiprocessors," in IPDPS, 1992.
[29] B. Sinharoy, R. Kalla, J. Tendler, R. Eickemeyer, and J. Joyner, "Power5 system microarchitecture," IBM Journal of Research and Development, vol. 49, no. 4, 2005.
[30] SPEC, "SPEC benchmarks," http://www.spec.org.
[31] K. Strauss, X. Shen, and J. Torrellas, "Flexible snooping: Adaptive forwarding and filtering of snoops in embedded ring multiprocessors," in International Symposium on Computer Architecture, 2006.
[32] ——, "UnCorq: Unconstrained snoop request delivery in embedded-ring multiprocessors," in MICRO-40, 2007.
[33] TPC, "TPC benchmarks," http://www.tpc.org.
[34] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar, "An 80-tile 1.28TFLOPS network-on-chip in 65nm CMOS," in IEEE International Solid-State Circuits Conference, 2007.
[35] D. Wentzlaff, P. Griffin, H. Hoffman, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. B. III, and A. Agarwal, "On-chip interconnection architecture of the Tile Processor," IEEE Micro, pp. 15-31, 2007.
[36] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta, "The SPLASH-2 programs: Characterization and methodological considerations," in ISCA-22, June 1995.
[37] J. Zebchuk, E. Safi, and A. Moshovos, "A framework for coarse-grain optimizations in the on-chip memory hierarchy," in MICRO-40, 2007.