HawkEye: Efficient Fine-grained OS Support for Huge Pages

Ashish Panwar, Indian Institute of Science, ashishpanwar@iisc.ac.in
Sorav Bansal, Indian Institute of Technology Delhi, sbansal@iitd.ac.in
K. Gopinath, Indian Institute of Science, gopi@iisc.ac.in

Abstract
Effective huge page management in operating systems is necessary for mitigation of address translation overheads. However, this continues to remain a difficult area in OS design. Recent work on Ingens [55] uncovered some interesting pitfalls in current huge page management strategies. Using both page access patterns discovered by the OS kernel and fine-grained data from hardware performance counters, we expose problematic aspects of current huge page management strategies. In our system, called HawkEye/Linux, we demonstrate alternate ways to address issues related to performance, page fault latency and memory bloat; the primary ideas behind HawkEye management algorithms are async page pre-zeroing, de-duplication of zero-filled pages, fine-grained page access tracking and measurement of address translation overheads through hardware performance counters. Our evaluation shows that HawkEye is more performant, robust and better-suited to handle diverse workloads when compared with current state-of-the-art systems.

CCS Concepts: • Software and its engineering → Operating systems; Virtual memory.

Keywords: Virtual memory; huge pages; hardware counters

ACM Reference Format:
Ashish Panwar, Sorav Bansal, and K. Gopinath. 2019. HawkEye: Efficient Fine-grained OS Support for Huge Pages. In 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS '19), April 13-17, 2019, Providence, RI, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3297858.3304064

1 Introduction
Modern applications with large memory footprints have put address translation overheads in general-purpose processors into focus [32, 49, 59, 61, 63]. Modern architectures implementing large multi-level TLBs and page-walk caches, all supporting multiple page sizes [35, 40], require careful OS design to determine suitable page sizes for different workloads [47, 55, 57, 64]. The problem becomes more severe with virtual machines, where two layers of address translation cause additional MMU overheads [34, 46, 62].

Despite robust hardware support available across architectures [15], huge pages have provided unsatisfactory performance on important applications [6, 9, 10, 12, 23-25]. These performance issues are often due to inadequate OS-based huge page management algorithms [5, 7, 11].
OS-based huge page management algorithms need to balance complex trade-offs between address translation overheads (aka MMU overheads), memory bloat, page fault latency, fairness and the overheads of the algorithm itself. In this paper, we discuss some subtleties related to huge pages, expose important weaknesses of current approaches, and propose a new set of algorithms to address them. We begin with a brief overview of three representative state-of-the-art systems: Linux, FreeBSD, and a recent research proposal by Kwon et al., i.e., Ingens [55].

Linux: Linux's transparent huge page (THP) support employs huge pages through two mechanisms: (1) it allocates a huge page at the time of page fault if contiguous memory is available, or (2) it promotes base pages to huge pages, by optionally compacting memory [39], in a background kernel thread called khugepaged. Linux triggers background promotion when fragmentation is high and huge pages are difficult to allocate at the time of page fault. While promoting huge pages, khugepaged selects processes in first-come-first-serve (FCFS) order and promotes all huge pages in a process before selecting the next process. For security, a page is zeroed synchronously (except for copy-on-write pages) before getting mapped into the user process' page table.

FreeBSD: FreeBSD supports multiple huge page sizes [57]. Unlike Linux, FreeBSD reserves a contiguous region of physical memory in the page fault handler but defers the promotion of baseline pages until all base pages in a huge page sized region are allocated by the application. If the reserved memory region is only partially mapped, its unused pages are returned back to the page allocator when memory pressure increases. This way, FreeBSD manages memory contiguity more efficiently than Linux, at the potential expense of a higher number of page faults and higher MMU overheads due to multiple TLB entries, one per baseline page.


Ingens: In the Ingens paper [55], the authors point out pitfalls in the huge page management policies of both Linux and FreeBSD and present a policy that is better at handling the associated trade-offs. In summary: (1) Ingens uses an adaptive strategy to balance address translation overhead and memory bloat; it uses conservative utilization-threshold based huge page allocation to prevent memory bloat under high memory pressure, but relaxes the threshold to allocate huge pages aggressively under no memory pressure, to try and achieve the best of both worlds. (2) To avoid the high page fault latency associated with synchronous page-zeroing, Ingens employs asynchronous huge page allocation with a dedicated kernel thread. (3) To maintain fairness across multiple processes, Ingens treats memory contiguity as a resource and employs a share-based policy to allocate huge pages fairly.

The Ingens paper highlights that current OSs deal with huge page management issues through "spot fixes" and motivates the need for a principled approach. While Ingens proposes a more sophisticated strategy than previous work, we show that its static, configuration-based and heuristic-driven approaches are suboptimal: conflicting performance objectives and inadequate knowledge of the MMU overheads of running applications can limit its effectiveness. Several aspects of an OS-based huge page management system can thus benefit from a dynamic, data-driven approach.

This paper presents HawkEye, an automated OS-level solution for huge page management. HawkEye proposes a set of simple-yet-effective algorithms to (1) balance the tradeoffs between memory bloat, address translation overheads and page fault latency, (2) allocate huge pages to the applications with the highest expected performance improvement due to the allocation, and (3) improve memory sharing behaviour in virtualized systems. We also show that the actual address-translation overheads, as measured through performance counters, can sometimes be quite different from the expected/estimated overheads. To demonstrate this, we also evaluate a variant of HawkEye that relies on hardware performance counters for making huge page allocation decisions, and compare results. Our evaluation involves workloads with a diverse set of requirements vis-a-vis huge page management, and demonstrates that HawkEye can achieve considerable performance improvement over existing systems while adding negligible overhead (≈3.4% single-core overhead in the worst case).

2 Motivation
In this section, we discuss the different tradeoffs involved in OS-level huge page management and how current solutions handle them. We also provide a glimpse of the results achieved through HawkEye, which is discussed in detail in §3.

[Figure 1: Resident Set Size (RSS) of the Redis server across 3 phases: P1 (insert), P2 (delete) and P3 (insert). The plot shows RSS (GB) over time (seconds) for Linux, Ingens and HawkEye; Linux and Ingens hit out-of-memory in P3 while HawkEye succeeds.]

2.1 Address Translation Overhead vs. Memory Bloat
One of the most important tradeoffs associated with huge pages is between address translation overheads and memory bloat. While Linux's synchronous huge page allocation is aimed at minimizing MMU overheads, it can often lead to memory bloat when an application uses only a fraction of the allocated huge page. FreeBSD promotes only after all base pages in a huge page region are allocated; FreeBSD's conservative approach tackles memory bloat but sacrifices performance by delaying the mapping of huge pages.

Ingens' strategy is adaptive, in that it promotes huge pages aggressively to minimize MMU overheads when memory fragmentation is low. To measure fragmentation, it uses the Free Memory Fragmentation Index (FMFI [50]): when FMFI < 0.5 (low fragmentation), Ingens behaves like Linux, promoting allocated pages to huge pages at the first available opportunity; when FMFI > 0.5 (high fragmentation), Ingens uses a conservative utilization-based strategy (i.e., promote only after a certain fraction of base pages, say 90%, are allocated) to mitigate memory bloat. While Ingens appears to capture the best of both Linux and FreeBSD, we show that this adaptive policy is far from a complete solution, because memory bloat generated in the aggressive phase remains unrecovered. This property is also true for Linux, and we demonstrate the problem through a simple experiment with the Redis key-value store [38] running on a 48GB memory system (see Figure 1).

We execute a client to generate memory bloat and measure the interaction of different huge page policies with Redis. For exposition, we divide our workload into three phases: (P1) the client inserts 11 million key-value pairs of size (10B key, 4KB value) for an in-memory dataset of 45GB; (P2) the client deletes 80% randomly selected keys, leaving Redis with a sparsely populated address space; (P3) after some time gap, the client tries to insert 17K (10B, 2MB) key-value pairs so that the dataset again reaches 45GB. While this workload is used for clear exposition of the problem, it resembles realistic workload patterns that involve a mix of allocation and deallocation and leave the address space fragmented. An ideal system should recover elegantly from memory bloat to avoid disruptions under high memory pressure.

In phase P2, when Redis releases pages back to the OS through the madvise system call [16], the corresponding huge page mappings are broken by the kernel and the resident set size (RSS) reduces to around 11GB. At this point, the kernel's khugepaged thread promotes the regions containing the deallocated pages back to huge pages. As more allocations happen in phase P3, Ingens promotes huge pages aggressively until the RSS reaches 32GB (or fragmentation is low); after that, Ingens employs conservative promotion to avoid any further bloat. This is evident in Figure 1, where the rate of RSS growth decreases compared to Linux. However, both Linux and Ingens reach the memory limit (an out-of-memory, OOM, exception is thrown) at significantly lower memory utilization. While Linux generates a bloat of 28GB (i.e., only 20GB useful data), Ingens generates a bloat of 20GB (i.e., 28GB useful data). Although Ingens tries to prevent bloat at high memory utilization, it is unable to recover from bloat generated at low memory utilization.

We note that it is possible to avoid memory bloat in Ingens by configuring it to use only conservative promotion. However, as also explained in the Ingens paper, this strategy risks losing performance due to high MMU overheads in low memory pressure situations. An ideal strategy should promote huge pages aggressively, but should also be able to recover from bloat in an elegant fashion. Figure 1 shows HawkEye's behaviour for this workload: it is able to effectively recover from memory bloat even with an aggressive promotion strategy.

2.2 Page Fault Latency vs. Number of Page Faults
The OS often needs to clear (zero) a page before mapping it into the process address space to prevent insecure information flow. Clearing a page is significantly more expensive for huge pages: in Linux on our experimental system, clearing a base page takes about 25% of total page fault time, which increases to 97% for huge pages. High page fault latency often leads to high user-perceived latencies, jeopardizing performance for interactive applications. Ingens avoids this problem by allocating only base pages in the page fault handler and relegating the promotion task to an asynchronous thread, khugepaged; it prioritizes the promotion of recently faulted regions over older allocations.

While Ingens reduces page allocation latency, it nullifies an important advantage of huge pages, namely fewer page faults for access patterns exhibiting high spatial locality [26]. Several applications that allocate and initialize a large memory region sequentially exhibit this type of access pattern, and their performance degrades due to a higher number of page faults if only base pages are allocated. We demonstrate the severity of this problem through a custom microbenchmark that allocates a 10GB buffer, touching one byte in every base page, and later frees the buffer. Table 1 shows the cumulative performance summary for 10 runs of this workload.
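To make the described access pattern concrete, here is a minimal user-space sketch of such a microbenchmark; the paper does not include its source, so names and constants are illustrative:

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    const size_t SIZE = 10UL << 30;   /* 10GB anonymous buffer */
    const size_t PAGE = 4096;         /* base page size */
    char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    /* Touch one byte per base page: every touch is a page fault,
       so fault-handling cost dominates the run time. */
    for (size_t off = 0; off < SIZE; off += PAGE)
        buf[off] = 1;
    munmap(buf, SIZE);                /* free the buffer */
    return 0;                         /* run 10x for the ~100GB total of Table 1 */
}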

                                sync page-zeroing       async promotion    no page-zeroing
Event                           Linux-4KB   Linux-2MB   Ingens-90%         Linux-4KB   Linux-2MB
Page faults                     26.2M       51.5K       26.2M              26.2M       51.5K
Total page fault time (secs)    92.6        23.9        92.8               69.5        0.7
Avg. page fault time (µs)       3.5         465         3.5                2.65        13
System time (secs)              102         24          104                79          1.3
Total time (secs)               106         24.9        116                83          4.4

Table 1: Page faults, allocation latency and performance for a microbenchmark with ≈100GB memory allocation.

Linux with THP support (Linux-2MB with sync page-zeroing in Table 1) reduces the number of page faults by more than 500× over Linux without THP support (Linux-4KB) and leads to more than 4× performance improvement for this workload, despite being 133× worse on average page fault latency (465µs vs. 3.5µs). Ingens considerably reduces latency compared to Linux-2MB, but does not reduce the number of page faults for this workload, because asynchronous promotion is activated only after 90% of the base pages are allocated from a region; copying these pages to a contiguous memory block takes more time than the time taken by this workload to generate the remaining 10% of page faults in the same region. Consequently, the overall performance of this workload degrades in Ingens due to excessive page faults. If it were possible to allocate pages without having to zero them at page fault time, we could achieve both low page fault latency and fewer page faults, resulting in higher overall performance over all existing approaches (last two columns in Table 1). HawkEye implements rate-limited asynchronous page pre-zeroing (§3.1) to achieve this in the common case.

2.3 Huge Page Allocation Across Multiple Processes
Under memory fragmentation, the OS must allocate huge pages among multiple processes fairly. The Ingens authors tackled this problem by defining fairness through a proportional huge page promotion metric, wherein they consider "memory contiguity as a resource" and try to be equitable in distributing it (through huge pages) among applications. Ingens penalizes applications that have been allocated huge pages but are not accessing them frequently (i.e., have idle huge pages). Idleness of a page is estimated through the access bit present in the page table entries, which is maintained by the hardware and periodically cleared by the OS. Ingens employs an idleness penalty factor, whereby an application is penalized for its idle (or cold) huge pages during the computation of the proportional huge page promotion metric.

Ingens' fairness policy has an important weakness: two processes may have similar huge page requirements, but one of them (say P1) may have significantly higher TLB pressure than the other (say P2). This can happen, for example, if P1's accesses are spread across many base pages within its huge page regions while P2's accesses are concentrated in one (or a few) base pages within its huge page regions. In this scenario, promotion of huge pages in P1 is more desirable than in P2, but Ingens would treat them equally.

Benchmark Suite     Total   TLB sensitive applications
SPEC CPU2006_int    12      4 (mcf, astar, omnetpp, xalancbmk)
SPEC CPU2006_fp     19      3 (zeusmp, GemsFDTD, cactusADM)
PARSEC              13      2 (canneal, dedup)
SPLASH-2            10      0
Biobench            9       2 (tigr, mummer)
NPB                 9       2 (cg, bt)
CloudSuite          7       2 (graph-analytics, data-analytics)
Total               79      15

Table 2: Number of TLB sensitive applications in popular benchmark suites.

Table 2 provides some empirical evidence that different applications usually behave quite differently vis-a-vis huge pages. For example, less than 20% of the applications in popular benchmark suites experience noticeable performance improvement (> 3%) with huge pages.

We posit that instead of considering "memory contiguity as a resource" and granting it equally among processes, it is more effective to consider "MMU overheads as a system overhead" and try to ensure that it is distributed equally across processes. A fair algorithm should attempt to equalize MMU overheads across all applications: e.g., if two processes P1 and P2 experience 30% and 10% MMU overheads resp., then huge pages should be allocated to P1 until its overhead also reaches 10%. In doing so, it should be acceptable if P1 has to be allocated more huge pages than P2. Such a policy would additionally yield the best overall system performance, by helping the most afflicted processes first.
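As a sketch, this allocation policy reduces to a simple selection rule; struct proc and its mmu_overhead field below are illustrative stand-ins for whatever per-process bookkeeping the OS maintains:

/* Give the next huge page to the process currently suffering the highest
   MMU overhead; repeating this drives all overheads toward equality. */
struct proc {
    double mmu_overhead;   /* measured or estimated, in percent */
};

static struct proc *pick_next_beneficiary(struct proc *procs, int n)
{
    struct proc *worst = &procs[0];
    for (int i = 1; i < n; i++)
        if (procs[i].mmu_overhead > worst->mmu_overhead)
            worst = &procs[i];
    return worst;   /* e.g., P1 at 30% keeps receiving huge pages until it nears P2's 10% */
}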

Further, in Ingens, huge page promotion is triggered in response to a few page faults in the aggressive (non-fragmented) phase. These huge pages are not necessarily frequently accessed by the application, yet they contribute to a process's allocation quota of huge pages. Under memory pressure, these idle huge pages lower the promotion metric of a process, potentially preventing the OS from allocating more huge pages to it. This would increase MMU overheads if the process had other hot (frequently accessed) memory regions that required huge page promotion. This behaviour, where previously-allocated cold huge pages can prevent an application from being allocated huge pages for its hot regions, seems sub-optimal and avoidable.

Finally, within a process, Linux and Ingens promote huge pages through a sequential scan from lower to higher VAs. This approach is unfair to processes whose hot regions lie in the higher VAs. Because different applications would usually contain hot regions in different parts of their VA spaces (see Figure 6), this scheme is likely to cause unfairness in practice.

2.4 How to Capture Address Translation Overheads?
It is common to estimate MMU overheads based on working-set size (WSS) [56]: bigger WSS should entail higher MMU and performance overheads. We find that this is often not true for modern hardware, where access patterns play an important role in determining MMU overheads.

Workload   RSS      WSS        TLB-misses       % cycles          speedup
                               (native-4KB)     4KB      2MB      native   virtual
bt.D       10GB     7-10 GB    0.45             6.4      1.31     1.05     1.15
sp.D       12GB     8-12 GB    0.48             4.7      0.25     1.01     1.06
lu.D       8GB      8 GB       0.06             3.3      0.18     1.0      1.01
mg.D       26GB     24 GB      0.03             1.04     0.04     1.01     1.11
cg.D       16GB     7-8 GB     28.57            39       0.02     1.62     2.7
ft.D       78GB     7-35 GB    0.21             3.9      2.14     1.01     1.04
ua.D       9.6GB    5-7 GB     0.01             0.8      0.03     1.01     1.03

Table 3: Memory characteristics, address translation overheads, and the speedup huge pages provide over base pages for NPB workloads.

Performance Counters:
C1: DTLB_LOAD_MISSES_WALK_DURATION
C2: DTLB_STORE_MISSES_WALK_DURATION
C3: CPU_CLK_UNHALTED
MMU Overhead (%) = ((C1 + C2) × 100) / C3

Table 4: Methodology used to measure MMU overhead [54].

For example, sequential access patterns allow prefetching to hide TLB miss latencies. Further, different TLB miss requests can experience high latency variations: a translation may be present anywhere in the multi-level page-walk caches, the multi-level regular data caches, or in main memory. For these reasons, WSS is often not a good indicator of MMU overheads. Table 3 demonstrates this with the NPB benchmark suite [30]: a workload with large WSS (e.g., mg.D) can have low MMU overheads compared to one with a smaller WSS (e.g., cg.D).

It is not clear to us how an OS can reliably capture MMU overheads: these are dependent on complex interactions between applications and the underlying hardware architecture, and we show that the actual address translation overheads of a workload can be quite different from what can be estimated through its memory access pattern (e.g., working-set size). Hence, we propose directly measuring TLB overheads through hardware performance counters when available (see Table 4 for the methodology). This approach enables a more efficient solution at the cost of portability, as the required performance counters may not be available on all platforms (e.g., most hypervisors have not yet virtualized TLB-related performance counters). To overcome this challenge, we present two variants of our algorithm/implementation in HawkEye/Linux: one where the MMU overheads are measured through hardware performance counters (HawkEye-PMU), and another where MMU overheads are estimated through the memory access pattern (HawkEye-G). We compare both approaches in §4.
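As a rough user-space illustration of the Table 4 methodology (not HawkEye's kernel implementation), the counters can be read through perf_event_open; the raw event encodings below are the Intel Haswell ones (event 0x08/umask 0x10 and event 0x49/umask 0x10) and are an assumption that varies across microarchitectures:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>

static int open_counter(uint32_t type, uint64_t config, pid_t pid)
{
    struct perf_event_attr attr = {0};
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}

/* MMU overhead (%) = ((C1 + C2) * 100) / C3, sampled over sleep_secs. */
double mmu_overhead_percent(pid_t pid, unsigned sleep_secs)
{
    int c1 = open_counter(PERF_TYPE_RAW, 0x1008, pid);   /* C1: DTLB_LOAD_MISSES.WALK_DURATION  */
    int c2 = open_counter(PERF_TYPE_RAW, 0x1049, pid);   /* C2: DTLB_STORE_MISSES.WALK_DURATION */
    int c3 = open_counter(PERF_TYPE_HARDWARE,
                          PERF_COUNT_HW_CPU_CYCLES, pid); /* C3: CPU_CLK_UNHALTED */
    sleep(sleep_secs);
    uint64_t v1 = 0, v2 = 0, v3 = 1;
    read(c1, &v1, sizeof(v1)); read(c2, &v2, sizeof(v2)); read(c3, &v3, sizeof(v3));
    close(c1); close(c2); close(c3);
    return 100.0 * (double)(v1 + v2) / (double)v3;
}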

3 Design and Implementation
Figure 2 shows our high-level design objectives. Our solution is based on four primary observations: (1) high page fault latency for huge pages can be avoided by asynchronously pre-zeroing free pages; (2) memory bloat can be tackled by identifying and de-duplicating zero-filled baseline pages present within allocated huge pages; (3) promotion decisions should be based on finer-grained access tracking of huge-page-sized regions, and should include recency, frequency and access-coverage (i.e., how many baseline pages are accessed inside a huge page) measurements; and (4) fairness should be based on estimation of MMU overheads.

[Figure 2: Design objectives in HawkEye. Under low memory pressure: (1) fewer page faults, (2) low-latency allocation, (3) high performance. Under high memory pressure: (1) high memory efficiency, (2) efficient huge page promotion, (3) fairness.]


3.1 Asynchronous Page Pre-zeroing
We propose that page zeroing of available free pages should be performed asynchronously in a rate-limited background kernel thread, to eliminate high-latency allocation. We call this scheme async pre-zeroing. Async pre-zeroing is a rather old idea that was discussed extensively among kernel developers in the early 2000s [1, 2, 14, 41].¹ Linux developers opined that async pre-zeroing is not an overall performance win, for two main reasons. We think that it is time to revisit these opinions.

First, the async pre-zeroing thread might interfere with primary workloads by polluting the cache [1]. In particular, async pre-zeroing suffers from the "double cache miss" problem, because it causes the same datum to be accessed twice with a large re-use distance: first for pre-zeroing and then for the actual access by the application. These extra cache misses are expected to degrade overall performance in general. However, these problems are partially solvable on modern hardware that supports memory writes with non-temporal hints: non-temporal hints instruct the hardware to bypass caches during memory load/store instructions [43]. We find that using non-temporal hints during pre-zeroing significantly reduces both cache contention and the double cache miss problem.

Second, there was no consensus or empirical evidence to demonstrate the benefits of page pre-zeroing for real workloads [1, 42]. We note that the early discussions on page pre-zeroing were evaluating trade-offs with baseline 4KB pages. Our experiments corroborate the kernel developers' observation that, despite reducing the page fault overhead by 25%, pre-zeroing does not necessarily enable high performance with 4KB pages. At the same time, we also show that it enables non-negligible performance improvements (e.g., 14× faster VM boot time) with huge pages, due to a much higher reduction (97%) in page fault overheads. Since huge pages (and huge-huge pages) are supported by most general-purpose processors today [15, 27], pre-zeroing pages is an important optimization that is worth revisiting.

Pre-zeroing offers another advantage for virtualized systems: it increases the number of zero pages in the guest's physical address (GPA) space, enabling opportunities for content-based page-sharing at the virtualization host. We evaluate this aspect in §4.

¹ Windows and FreeBSD implement async pre-zeroing in a limited fashion [4, 19]. However, their huge page policies do not allow high-latency allocations. This idea is more relevant for Linux due to its synchronous huge page allocation behaviour.

To implement async pre-zeroing, HawkEye manages free pages in the Linux buddy allocator through two lists: zero and non-zero. Pages released by applications are first added to the non-zero list, while the zero list is preferentially used for allocation. A rate-limited thread periodically transfers pages from the non-zero list to the zero list, after zero-filling them using non-temporal writes. Because pre-zeroing involves sequential memory accesses, non-temporal store instructions provide performance similar to regular (caching) store instructions, but without polluting the cache [3]. Finally, we note that for copy-on-write or filesystem-backed memory regions, pre-zeroing may sometimes be unnecessary and wasteful. This problem is avoidable by preferentially allocating pages for these memory regions from the non-zero list.
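The zero-filling step can be sketched in user-space terms as follows, assuming x86-64 SSE2 stream stores (the in-kernel code path differs, but the non-temporal idea is the same):

#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/* Zero one page with non-temporal stores, bypassing the cache hierarchy so
   the pre-zeroing thread does not evict the primary workload's cache lines.
   `page` must be 16-byte aligned and `size` a multiple of 64. */
static void zero_page_nt(void *page, size_t size)
{
    __m128i zero = _mm_setzero_si128();
    char *p = (char *)page;
    for (size_t i = 0; i < size; i += 64) {   /* one cache line per iteration */
        _mm_stream_si128((__m128i *)(p + i), zero);
        _mm_stream_si128((__m128i *)(p + i + 16), zero);
        _mm_stream_si128((__m128i *)(p + i + 32), zero);
        _mm_stream_si128((__m128i *)(p + i + 48), zero);
    }
    _mm_sfence();   /* order the weakly-ordered stores before the page is handed out */
}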

Overall, we believe that async pre-zeroing is a compelling idea given modern workload requirements and modern hardware support. In our evaluation, we provide some early evidence to corroborate this claim with different workloads.

3.2 Managing Bloat vs. Performance
We observe that the fundamental tension between MMU overheads and memory bloat can be resolved. Our approach stems from the insight that most allocations in large-memory workloads are typically "zero-filled page allocations"; the remaining are either filesystem-backed (e.g., through mmap) or copy-on-write (COW) pages. Huge pages in modern scale-out workloads are primarily used for "anonymous" pages that are initially zero-filled by the kernel [54]; e.g., Linux supports huge pages only for anonymous memory. This property of typical workloads and Linux allows automatic recovery of bloat under memory pressure.

To state our approach succinctly: we allocate huge pages at the time of first page fault, but under memory pressure, to recover unused memory, we scan existing huge pages to identify zero-filled baseline pages within them. If the number of zero-filled baseline pages inside a huge page is significant (i.e., beyond a threshold), we break the huge page into its constituent baseline pages and de-duplicate the zero-filled baseline pages to a canonical zero-page through standard COW page management techniques [37]. In this approach, it is possible for applications' in-use zero-pages to also get de-duplicated. While this can result in a marginally higher number of COW page faults in rare situations, it does not compromise correctness.

To trigger recovery from memory bloat, HawkEye uses two watermarks on the amount of allocated memory in the system: high and low. When the amount of allocated memory exceeds high (85% in our prototype), a rate-limited bloat-recovery thread is activated, which executes periodically until the allocated memory falls below low (70% in our prototype).

[Figure 3: Average distance (bytes) to the first non-zero byte in baseline (4KB) pages, across 56 diverse workloads; the first four bars represent the average of all workloads in the respective benchmark suite. The overall average distance is 9.11 bytes.]

At each step, the bloat-recovery thread chooses the application whose huge page allocations need to be scanned (and potentially demoted) based on the estimated MMU overheads of that application: the application with the lowest estimated MMU overheads is chosen first for scanning. This strategy ensures that the application that least requires huge pages is considered first, which is consistent with our huge page allocation strategy (§2.3).

While scanning a baseline page to verify whether it is zero-filled, we stop on encountering the first non-zero byte. In practice, the number of bytes that need to be scanned per in-use (not zero-filled) page before a non-zero byte is encountered is very small: we measured this over a total of 56 diverse workloads and found that the average distance of the first non-zero byte in a 4KB page is only 9.11 bytes (see Figure 3). Hence, only about ten bytes need to be scanned, on average, per in-use application page. For bloat pages, however, all 4096 bytes need to be scanned. This implies that the overheads of our bloat-recovery thread are largely proportional to the number of bloat pages in the system, and not to the total size of the allocated memory. This is an important property that allows our method to scale to large-memory systems.
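The per-page test this implies is a plain early-exit scan; a kernel version would scan word-at-a-time, but the proportionality argument above is already visible here:

#include <stdbool.h>
#include <stddef.h>

/* Returns true only for bloat (all-zero) pages, which cost a full 4096-byte
   scan; for in-use pages the loop exits after ~9 bytes on average (Figure 3). */
static bool page_is_zero_filled(const unsigned char *page, size_t size)
{
    for (size_t i = 0; i < size; i++)
        if (page[i] != 0)
            return false;   /* first non-zero byte: page is in use */
    return true;            /* candidate for de-duplication to the zero-page */
}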

We note that our bloat-recovery procedure has many similarities with the standard de-duplication kernel threads used for content-based page sharing for virtual machines [67], e.g., the kernel same-page merging (ksm) thread in Linux. Unfortunately, in current kernels, the huge page management logic (e.g., khugepaged) and the content-based page sharing logic (e.g., ksm) are unconnected and can often interact in counter-productive ways [51]. Ingens and SmartMD [52] proposed coordinated mechanisms to avoid such conflicts: Ingens demotes only infrequently-accessed huge pages through ksm, while SmartMD demotes pages based on access frequency and repetition rate (i.e., the number of shareable pages within a huge page). These techniques are useful for in-use pages, and our bloat-recovery proposal complements them by identifying unused zero-filled pages, which can execute much faster than typical same-page merging logic.

3.3 Fine-grained Huge Page Promotion
An efficient huge page promotion strategy should try to maximize performance with a minimal number of huge page promotions.

[Figure 4: A sample representation of access_map for three processes A, B and C. Each access_map is an array of buckets indexed 0-9; regions (A1-A3, B1-B4, C1-C5) occupy buckets according to their access-coverage, with hot regions in higher-index buckets and cold regions in lower-index buckets; promotion proceeds from higher to lower indices.]

Current systems promote pages through a sequential scan from lower to higher VAs, which is inefficient for applications whose hot regions are not in the lower VA space. Our approach makes promotion decisions based on memory access patterns. First, we define an important metric used in HawkEye to improve the efficiency of huge page promotions: access-coverage denotes the number of base pages that are accessed from a huge-page-sized region in short intervals. We sample the page table access bits at regular intervals and maintain an exponential moving average (EMA) of access-coverage across different samples. More precisely, we clear the page table access bits and test the number of set bits after 1 second to check how many pages are accessed; this process is repeated once every 30 seconds.

The access-coverage of a region potentially indicates its TLB space requirement (the number of base page entries required in the TLB), which we use to estimate the profitability of promoting it to a huge page. A region with high access-coverage is likely to exhibit high contention on TLB entries, and hence likely to benefit more from promotion.

HawkEye implements access-coverage based promotion using a per-process data structure called access_map, which is an array of buckets; a bucket contains huge page regions with similar access-coverage. On x86 with 2MB huge pages, the value of access-coverage is in the range 0-512. In our prototype, we maintain ten buckets in the access_map, which provides the necessary resolution across access-coverage values at relatively low overheads: regions with access-coverage of 0-49 are placed in bucket 0, regions with access-coverage of 50-99 are placed in bucket 1, and so on. Figure 4 shows an example state of the access_map of three processes A, B and C. Regions can move up or down in their respective arrays after every sampling period, depending on their newly computed EMA-based access-coverage. If a region moves up in the access_map, it is added to the head of its bucket; if a region moves down, it is added to the tail. Within a bucket, pages are promoted from head to tail. This strategy helps in prioritizing recently accessed regions within an index.
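A small sketch of this bookkeeping follows; the EMA smoothing factor alpha is an assumed value, since the paper text does not specify the weight:

#define BASE_PAGES_PER_2MB 512
#define NUM_BUCKETS 10

struct region {
    double ema_coverage;   /* EMA of base pages accessed per sampling period (0..512) */
};

/* sampled: number of set access bits observed 1 second after clearing them;
   called once per 30-second sampling period. Returns the access_map bucket. */
static int update_access_coverage(struct region *r, int sampled)
{
    const double alpha = 0.5;   /* assumed EMA weight */
    r->ema_coverage = alpha * sampled + (1.0 - alpha) * r->ema_coverage;
    int bucket = (int)(r->ema_coverage / 50.0);   /* 0-49 -> 0, 50-99 -> 1, ... */
    return bucket < NUM_BUCKETS ? bucket : NUM_BUCKETS - 1;  /* 450-512 -> bucket 9 */
}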

HawkEye promotes regions from higher to lower indices in the access_map. Notice that our approach captures both recency and frequency of accesses.

A region that has not been accessed, or has been accessed with only low access-coverage in recent sampling periods, is likely to shift towards a lower index in the access_map, or towards the tail of its current index; promotion of cold regions is thus automatically deferred to prioritize hot regions.

We note that HawkEye's access_map is somewhat similar to the population_map and access_bitvector used in FreeBSD and Ingens resp.; these data structures are primarily used to capture utilization or page-access related metadata at huge page granularity. HawkEye's access_map additionally provides an index of the hotness of memory regions and enables fine-grained huge page promotion.

3.4 Huge Page Allocation Across Multiple Processes
In our access-coverage based strategy, regions belonging to the process with the highest expected MMU overheads are promoted before others. This is also consistent with our notion of fairness (§2.3), as pages are promoted first from regions/processes with high expected TLB pressure (and consequently high expected MMU overheads).

Our HawkEye variant that does not rely on hardware performance counters (i.e., HawkEye-G) promotes regions from the non-empty highest access-coverage index across all processes. It is possible that multiple processes have non-empty buckets at the (globally) highest non-empty index; in this case, round-robin is used to ensure fairness among such processes. We explain this further with an example. Consider three applications A, B and C with their VA regions arranged in the access_map as shown in Figure 4. HawkEye-G promotes regions in the following order in this example:

A1, B1, C1, C2, B2, C3, C4, B3, B4, A2, C5, A3.
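This order falls out of a simple selection loop; the sketch below is illustrative (the struct layout and the promote_huge_page helper, assumed to unlink and promote the bucket's head region, are not HawkEye's actual interfaces):

#define NUM_BUCKETS 10

struct region;                                   /* opaque huge-page-sized region */
extern void promote_huge_page(struct region *);  /* assumed: unlinks and promotes */

struct process {
    struct region *buckets[NUM_BUCKETS];  /* head of each access-coverage bucket */
};

static int rr;   /* round-robin cursor for processes tied at the same index */

/* Promote one region per call: highest non-empty bucket index first,
   round-robin among processes at that index; returns 0 when done. */
static int promote_next(struct process *procs, int nprocs)
{
    for (int idx = NUM_BUCKETS - 1; idx >= 0; idx--) {
        for (int i = 0; i < nprocs; i++) {
            struct process *p = &procs[(rr + i) % nprocs];
            if (p->buckets[idx]) {
                rr = (rr + i + 1) % nprocs;
                promote_huge_page(p->buckets[idx]);
                return 1;
            }
        }
    }
    return 0;
}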

Recall, however, that MMU overheads may not necessarily be correlated with our access-coverage based estimation, and may depend on other, more complex features of the access pattern (§2.4). To capture this, HawkEye-PMU first chooses the process with the highest measured MMU overheads, and then chooses regions from higher to lower indices in the selected process's access_map. Among processes with similar highest MMU overheads, round-robin is used.

3.5 Limitations and Discussion
We briefly outline a few issues that are related to huge page management but are currently unhandled in HawkEye.
1) Thresholds for identifying memory pressure: We measure the extent of memory pressure with statically configured values for the low and high watermarks (i.e., 70% and 85% of total system memory) while dealing with memory bloat. Any strategy that relies on static thresholds faces the risk of being conservative or overly aggressive when memory pressure consistently fluctuates. An ideal solution should adjust these thresholds dynamically to prevent unintended system behavior. The approach proposed by Guo et al. [51] in the context of memory deduplication for virtualized environments is relevant here.

2) Huge page starvation: While we believe our approach of allocating huge pages based on MMU overheads optimizes the system as a whole, unbounded huge page allocations to a single process can be thought of as a starvation problem for other processes. An adversarial application can also potentially monopolize HawkEye to get more huge pages, or prevent other applications from getting a fair share of huge pages. Preliminary investigations show that our approach is reasonable even if the memory footprints of workloads differ by more than an order of magnitude. However, if limiting huge page allocations is still desirable, it seems reasonable to integrate such a policy with existing resource limiting/monitoring tools such as Linux's cgroups [18].
3) Other algorithms: We do not discuss some parts of the management algorithms, such as rate-limiting khugepaged to reduce promotion overheads, demotion based on low utilization to enable better sharing through same-page merging, techniques for minimizing the overhead of page-table access-bit tracking, and compaction algorithms. Much of this material has been discussed and evaluated extensively in the literature [32, 55, 59, 68], and we have not contributed significantly new approaches in these areas.

4 Evaluation
We now evaluate our algorithms in more detail. Our experimental platform is an Intel Haswell-EP based E5-2690 v3 server system running CentOS v7.4, on which 96GB memory and 48 cores (with hyperthreading enabled) running at 2.3GHz are partitioned over two sockets; we bind each workload to a single socket to avoid NUMA effects. The L1 TLB contains 64 and 8 entries for 4KB and 2MB pages respectively, while the L2 TLB contains 1024 entries for both 4KB and 2MB pages. The sizes of the L1, L2 and shared L3 caches are 768KB, 3MB and 30MB resp. A 96GB SSD-backed swap partition is used to evaluate performance in an overcommitted system. We evaluate HawkEye with a diverse set of workloads ranging over HPC, graph algorithms, in-memory databases, genomics and machine learning [24, 30, 33, 36, 45, 53, 65, 66]. HawkEye is implemented in Linux kernel v4.3.

We evaluate (a) the improvements due to our fine-grained promotion strategy based on access-coverage, in both performance and fairness, for single, multiple homogeneous, and multiple heterogeneous workloads; (b) the bloat vs. performance tradeoff; (c) the impact of low-latency page faults; (d) the cache interference caused by the asynchronous pre-zeroing thread; and (e) the impact of memory efficiency enabled by asynchronous page pre-zeroing.

Performance advantages of fine-grained huge page promotion. We first evaluate the effectiveness of our access-coverage based promotion strategy. For this, we measure the time required for our algorithm to recover from a fragmented state with high address translation overheads to a state with low address translation overheads.

[Figure 5: Performance speedup (left) and time saved per huge page promotion in ms (right), over baseline pages, for Graph500, XSBench and cg.D, under Linux, Ingens, HawkEye-PMU and HawkEye-G.]

Workload      Execution Time (Seconds)
              Linux-4KB   Linux-2MB     Ingens        HawkEye-PMU   HawkEye-G
Graph500-1    2270        2145 (1.06)   2243 (1.01)   1987 (1.14)   2007 (1.13)
Graph500-2    2289        2252 (1.02)   2253 (1.02)   1994 (1.15)   2013 (1.14)
Graph500-3    2293        2293 (1.0)    2299 (1.00)   2012 (1.14)   2018 (1.14)
Average       2284        2230 (1.02)   2265 (1.01)   1998 (1.14)   2013 (1.13)

XSBench-1     2427        2415 (1.0)    2392 (1.01)   2108 (1.15)   2098 (1.15)
XSBench-2     2437        2427 (1.0)    2415 (1.01)   2109 (1.15)   2110 (1.15)
XSBench-3     2443        2455 (1.0)    2456 (1.00)   2133 (1.15)   2143 (1.14)
Average       2436        2432 (1.0)    2421 (1.00)   2117 (1.15)   2117 (1.15)

Table 5: Execution time of 3 instances of Graph500 and XSBench when executed simultaneously. Values in parentheses represent speedup over baseline pages.

Recall that HawkEye promotes pages based on access-coverage, while previous approaches promote pages in VA order (from low VAs to high). We fragment the memory initially by reading several files into memory; our test workloads are started in this fragmented state, and we measure the time taken for the system to recover from high MMU overheads. This experimental setup simulates expected realistic situations, where the memory fragmentation in the system fluctuates over time. Without promotion, our test workloads would keep incurring high address translation overheads; HawkEye, however, is quickly able to recover from these overheads through appropriate huge page allocations. Figure 5 (left) shows the performance improvement obtained by HawkEye (over a strategy that never promotes pages) for three workloads: Graph500, XSBench and cg.D. These workloads allocate all their required memory in the beginning, i.e., in the fragmented state of the system. For these workloads, the speedups due to effective huge page management through HawkEye are as high as 22%. The access-coverage based huge page promotion strategy of HawkEye improves performance by 13%, 12% and 6% respectively over both Linux and Ingens.

To understand this more clearly, Figure 6 shows the access pattern, MMU overheads and the number of allocated huge pages over time for Graph500 and XSBench. We find that hot-spots in these applications are concentrated in the high VAs, and any sequential-scanning based promotion is likely to be sub-optimal. The second and third columns corroborate this observation: for example, both HawkEye variants take ≈300 seconds to eliminate the MMU overheads of XSBench, while Linux and Ingens have high overheads even after 1000 seconds.

To quantify the cost-benefit of huge page promotions further, we propose a new metric: the average execution time saved (over using only baseline pages) per huge page promotion. A scheme that maximizes this metric would be most effective in reducing MMU overheads. Figure 5 (right) shows that HawkEye performs significantly better than Linux and Ingens on this metric. The difference between the efficiency of HawkEye-PMU and HawkEye-G is also evident: HawkEye-PMU is more efficient, as it stops promoting huge pages when MMU overheads are below a certain threshold (2% in our experiments). In summary, HawkEye-G and HawkEye-PMU are up to 67× and 44× more efficient (for XSBench) than Linux in terms of time saved per huge page promotion. For workloads whose access patterns are spread uniformly over the VA space, HawkEye delivers performance similar to Linux and Ingens.

Fairness advantages of fine-grained huge page promotion. We next experiment with multiple applications to study both performance and fairness, first with identical applications and then with heterogeneous applications running simultaneously.

Identical workloads. Figure 7 shows the MMU overheads and huge page allocations when 3 instances of Graph500 and XSBench (in separate runs) are executed concurrently after fragmenting the system (see Table 5 for execution times).

While Linux creates performance imbalance by promoting huge pages in one process at a time, HawkEye ensures fairness by judiciously distributing huge pages across all workload instances. Overall, HawkEye-PMU and HawkEye-G achieve 1.14× and 1.13× speedup for Graph500, and 1.15× speedup for XSBench, over Linux on average. Unlike Linux, Ingens promotes huge pages proportionally in all 3 instances. However, it fails to improve the performance of these workloads; in fact, it may lead to poorer performance than Linux. We explain this behaviour with an example.

Linux selects an address from Graph500-1's address space and promotes all pages above it in around 10 minutes. Even though this strategy is unfair, it improves the performance of Graph500-1 by promoting its hot regions. Linux then selects Graph500-2, whose MMU overheads decrease after ≈20 minutes. In contrast, Ingens promotes huge pages from lower VAs in each instance of Graph500; thus, it takes even longer for Ingens to promote the suitable (hot) regions of the respective processes, which leads to a 9% performance degradation over Linux for Graph500. For XSBench, the performance of both Ingens and Linux is similar (but inferior to HawkEye), because both fail to promote the application's hot regions before the application finishes.

Heterogeneous workloads. To measure the efficacy of different strategies for heterogeneous workloads, we execute workloads by grouping them into sets, where a set contains one TLB sensitive and one TLB insensitive application.

[Figure 6: For (a) Graph500 and (b) XSBench: access-coverage across the application's virtual address space, MMU overhead over time, and huge page promotions over time, for Linux, Ingens, HawkEye-PMU and HawkEye-G. HawkEye is more efficient in promoting huge pages due to fine-grained decision making.]

[Figure 7: Huge page promotions and their impact on TLB overhead over time for (a) Linux-2MB, (b) Ingens, (c) HawkEye-PMU and (d) HawkEye-G, for three instances each of Graph500 and XSBench.]

Each set is executed twice after fragmenting the system, to assess the impact of huge page policies on the order of execution. For the TLB insensitive workload, we execute a lightly loaded Redis server with 40M keys of 1KB size, servicing 10K requests per second.

Figure 8 shows the performance speedup over 4KB pages. The (Before) and (After) configurations show whether the TLB sensitive workload is launched before Redis or after. Because Linux promotes huge pages in process launch order, the two configurations (Before) and (After) are intended to cover its two different behaviours; Ingens and HawkEye are agnostic to process creation order.

[Figure 8: Performance speedup over baseline pages of TLB sensitive applications (cactusADM, tigr, Graph500, lbm_s, SVM, XSBench, cg.D) when they are executed alongside a lightly stressed Redis server, in (Before) and (After) orders, under Linux, Ingens, HawkEye-PMU and HawkEye-G.]

[Figure 9: Performance normalized to Linux in a virtualized system when HawkEye is applied at the host, guest and both layers (HawkEye-Host, HawkEye-Guest, HawkEye-Both).]

The impact of execution order is clearly visible for Linux. In the (Before) case, Linux with THP support improves performance over baseline pages by promoting huge pages in the TLB sensitive applications. However, in the (After) case, it promotes huge pages in Redis, resulting in poor performance for the TLB sensitive workloads. While the execution order does not matter in Ingens, its proportional huge page promotion strategy favors large-memory workloads (Redis in this case). Further, our client requests randomly selected keys, which makes the Redis server access all its huge pages uniformly. Consequently, Ingens promotes most huge pages in Redis, leading to suboptimal performance of the TLB sensitive workloads. In contrast, HawkEye promotes huge pages based on MMU overhead measurements (HawkEye-PMU) or access-coverage based estimation of MMU overheads (HawkEye-G). This leads to 15-60% performance improvement over 4KB pages, irrespective of the order of execution.

Performance in virtualized systems. For a virtualized system, we use the KVM hypervisor running with Ubuntu 16.04 and fragment the system prior to running the workloads. Evaluation is presented for three different configurations, i.e., when HawkEye is deployed at the host, the guest, and both layers. Each of these configurations requires running the same set of applications differently (see Table 6 for details).

Figure 9 shows that HawkEye provides 18-90% speedup compared to Linux in virtual environments. Notice that in some cases (e.g., cg.D), the performance improvement is higher in a virtualized system than on bare-metal. This is due to the high MMU overheads involved in traversing nested page tables while servicing TLB misses of guest applications, which presents higher opportunities to be exploited by HawkEye.

Configuration    Details
HawkEye-Host     Two VMs, each with 24 vCPUs and 48GB memory;
                 VM-1 runs Redis, VM-2 runs TLB sensitive workloads.
HawkEye-Guest    Single VM with 48 vCPUs and 80GB memory,
                 running Redis and TLB sensitive workloads.
HawkEye-Both     Two VMs, each with 24 vCPUs; VM-1 (30GB) runs Redis,
                 VM-2 (60GB) runs both Redis and TLB sensitive workloads.

Table 6: Experimental setup for configurations used to evaluate a virtualized system.

Bloat vs. performance. We next study the relationship between memory bloat and performance; we revisit an experiment similar to the one summarized in §2.1. We populate a Redis instance with 8M (10B key, 4KB value) key-value pairs and then delete 60% randomly selected keys (see Table 7). While Linux-4KB (no THP) is memory efficient (no bloat), its performance is low (7% lower throughput than Linux-2MB, i.e., with THP). Linux-2MB delivers high performance but consumes more memory, which remains unavailable to the rest of the system even when memory pressure increases. Ingens can be configured to balance the memory vs. performance tradeoff: for example, Ingens-90% (Ingens with a 90% utilization threshold) is memory efficient, while Ingens-50% (Ingens with a 50% utilization threshold) favors performance and allows more memory bloat. In either case, it is unable to recover from bloat that may have already been generated (e.g., during its aggressive phase). By switching from aggressive huge page allocation under low memory pressure to a memory-conserving strategy at runtime, HawkEye is able to achieve both low MMU overheads and high memory efficiency, depending on the state of the system.

Fast page faults. Table 8 shows the performance of different strategies for the workloads chosen for these experiments; their performance depends on the efficiency of the OS page fault handler. We measure Redis throughput when 2MB-sized values are inserted. SparseHash [13] is a hash-map library in C++, while HACC-IO [8] is a parallel IO benchmark used with an in-memory file system. We also measure the spin-up time of two virtual machines, KVM and a Java Virtual Machine (JVM), where both are configured to allocate all memory during initialization.

Performance benefits of async page pre-zeroing with base pages (HawkEye-4KB) are modest; in this case, the other page fault overheads (apart from zeroing) dominate performance.

Kernel                       Self-tuning   Memory Size   Throughput
Linux-4KB                    No            16.2GB        106.1K
Linux-2MB                    No            33.2GB        113.8K
Ingens-90%                   No            16.3GB        106.8K
Ingens-50%                   No            33.1GB        113.4K
HawkEye (no mem pressure)    Yes           33.2GB        113.6K
HawkEye (mem pressure)       Yes           16.2GB        105.8K

Table 7: Redis memory consumption and throughput.

                       OS Configuration
Workload               Linux-4KB   Linux-2MB   Ingens-90%   HawkEye-4KB   HawkEye-2MB
Redis (45GB)           23.3        43.7        19.2         23.6          55.1
SparseHash (36GB)      50.1        17.2        51.5         46.6          10.6
HACC-IO (6GB)          6.5         4.5         6.6          6.5           4.2
JVM Spin-up (36GB)     37.7        18.6        52.7         29.8          13.7
KVM Spin-up (36GB)     40.6        9.7         41.8         30.2          0.70

Table 8: Performance implications of asynchronous page zeroing. Values for Redis represent throughput (higher is better); all other values represent time in seconds (lower is better).

Figure 10. Performance overhead (%) of async pre-zeroing with and without caching instructions (caching stores vs. non-temporal stores), for NPB, PARSEC, Graph500, cactusADM, PageRank, and omnetpp. The first two bars (NPB and PARSEC) represent the average of all workloads in the respective benchmark suite.

Performance benefits of async page pre-zeroing with base pages (HawkEye-4KB) are modest, because in this case the other page fault overheads (apart from zeroing) dominate performance. However, page-zeroing cost becomes significant with huge pages. Consequently, HawkEye with huge pages (HawkEye-2MB) improves the performance of Redis and SparseHash by 1.26× and 1.62× over Linux. The spin-up time of virtual machines is purely dominated by page faults; hence, we observe a dramatic reduction in the spin-up time of VMs with async pre-zeroing of huge pages: 13.8× (KVM) and 1.36× (JVM) over Linux. Notice that without huge pages (only base pages), this reduction is only 1.34× and 1.26×. All these workloads have high spatial locality of page faults, and hence the Ingens strategy of utilization-based promotion has a negative impact on performance due to a higher number of page faults.

Async pre-zeroing enables memory sharing in virtualized environments: Finally, we note that async pre-zeroing within VMs enables memory sharing across VMs: the free memory of a VM returns to the host through pre-zeroing in the VM and same-page merging at the host. This can have the same net effect as ballooning, as the free memory in a VM gets shared with other VMs. In our experiments with memory-overcommitted systems, we have confirmed this behaviour: the overall performance of an overcommitted system where the VMs are running HawkEye matches the performance achieved through para-virtual interfaces like memory balloon drivers.

Figure 11. Performance normalized to the case of no ballooning in an overcommitted virtualized system, for PageRank, Redis, MongoDB, CC, Memcached, and SPD across Node 0 and Node 1, comparing Linux, Linux-Ballooning, and HawkEye.

For example, in an experiment involving a mix of latency-sensitive key-value stores and HPC workloads (see Figure 11), with a total peak memory consumption of around 150GB (1.5× of total memory), HawkEye-G provides 2.3× and 1.42× higher throughput for Redis and MongoDB respectively, along with a significant tail latency reduction. For PageRank, performance degrades slightly due to a higher number of COW page faults caused by unnecessary same-page merging. These results are very close to the results obtained when memory ballooning is enabled on all VMs. We believe that this is a potential positive of our design and may offer an alternative method to efficiently manage memory in overcommitted systems. It is well known that balloon drivers and other para-virtual interfaces are riddled with compatibility problems [29], and a fully-virtual solution to this problem would be highly desirable. While our experiments show early promise in this direction, we leave an extensive evaluation for future work.

Performance overheads of HawkEye: We evaluated HawkEye on non-fragmented systems and with several workloads that don't benefit from huge pages, and found that the performance with HawkEye is very similar to that with Linux or Ingens in these scenarios, confirming that HawkEye's algorithms for access-bit tracking, asynchronous huge page promotion, and memory de-duplication add negligible overheads over existing mechanisms in Linux. However, the async pre-zeroing thread requires special attention, as it may become a source of noticeable performance overhead for certain types of workloads, as discussed next.

Overheads of async pre-zeroing: A primary concern with async pre-zeroing is its potential detrimental effect due to memory and cache interference (§3.1). Recall that we employ memory stores with non-temporal hints to avoid these effects. To measure the worst-case cache effects of async pre-zeroing, we run our workloads while simultaneously zero-filling pages on a separate core sharing the same L3 cache (i.e., on the same socket) at a high rate of 0.25M pages per second (1GB/s), with and without non-temporal memory stores. Figure 10 shows that using non-temporal hints significantly brings down the overhead of async pre-zeroing (e.g., from 27% to 6% for omnetpp); the remaining overhead is due to the additional memory traffic generated by the pre-zeroing thread.

Workload | MMU Overhead | Time: 4KB | HawkEye-PMU | HawkEye-G
random (4GB) | 60% | 582 | 328 (1.77×) | 413 (1.41×)
sequential (4GB) | <1% | 517 | 535 | 532
Total | | 1099 | 863 (1.27×) | 945 (1.16×)
cg.D (16GB) | 39% | 1952 | 1202 (1.62×) | 1450 (1.35×)
mg.D (24GB) | <1% | 1363 | 1364 | 1377
Total | | 3315 | 2566 (1.29×) | 2827 (1.17×)

Table 9. Comparison between HawkEye-PMU and HawkEye-G. Time is in seconds; values in parentheses represent speedup over baseline pages.

Further, we note that this experiment presents a highly-exaggerated behavior of async pre-zeroing on worst-case workloads. In practice, the zeroing thread is rate-limited (e.g., to at most 10K pages per second), and the expected cache-interference overheads would be proportionally smaller.

Comparison between HawkEye-PMU and HawkEye-G: Even though HawkEye-G is based on simple and approximate estimations of TLB contention, it is reasonably accurate in identifying TLB-sensitive processes and memory regions in most cases. However, HawkEye-PMU performs better than HawkEye-G in some cases. Table 9 demonstrates two such examples, where four workloads, all with high access-coverage but different MMU overheads, are executed in two sets; each set contains one TLB-sensitive and one TLB-insensitive workload. The 4KB column represents the case where no huge pages are promoted. As discussed earlier (§2.4), MMU overheads depend on the complex interaction between the hardware and the access pattern. In this case, despite having high access-coverage in their VA regions, sequential and mg.D have negligible MMU overheads (i.e., they are TLB-insensitive). While HawkEye-PMU correctly identifies the process with the higher MMU overheads for huge page allocation, HawkEye-G treats them similarly (due to imprecise estimations) and may allocate huge pages to the TLB-insensitive process. Consequently, HawkEye-PMU may perform up to 36% better than HawkEye-G. Bridging this gap between the two approaches in a hardware-independent manner is interesting future work.

5 Related Work

MMU overheads have been a topic of much research in recent years, and several solutions have been proposed, including solutions that are architecture-based, OS-based, and/or compiler-based [28, 34, 49, 56, 58, 62-64].

Hardware support: Most proposals for hardware support aim to reduce TLB-miss frequency or accelerate TLB-miss processing. Large multi-level TLBs and support for multiple page sizes to minimize the number of TLB misses are already common in general-purpose processors today [15, 20, 21, 40]. Further, modern processors employ page-walk caches to avoid memory lookups in the TLB-miss path [31, 35]. POM-TLB services a TLB miss using a single memory lookup and further leverages regular data caches to speed up address translation [63].

Direct segments [32, 46] were proposed to completely avoid TLB-miss processing cost through special segmentation hardware. OS-level challenges of memory fragmentation have also been considered in hardware designs: CoLT, or Coalesced Large-reach TLBs [61], was initially proposed to increase TLB reach using base pages, based on the observation that OSs naturally provide contiguous mappings at smaller granularity. This approach was further extended to page-walk caches and huge pages [35, 40, 60]. Techniques to support huge pages in non-contiguous memory have also been proposed [44]. Hardware optimizations are important but complementary to HawkEye.

Software support: FreeBSD's huge page management is largely influenced by previous work on superpages [57]. Carrefour-LP [47] showed that ad-hoc usage of huge pages can degrade performance on NUMA systems, due to additional remote memory accesses and traffic imbalance across node interconnects, and proposed hardware-profiling based techniques to overcome these problems. Anti-fragmentation techniques for clustering pages based on their mobility type, proposed by Gorman et al. [48], have been adopted by Linux to support the allocation of huge pages in long-running systems. Illuminator [59] highlighted critical drawbacks of Linux's fragmentation-mitigation techniques and their implications on performance and latency, and proposed techniques to solve these problems. Mechanisms for dealing with physical memory fragmentation and NUMA systems address important but orthogonal problems, and their solutions can be adopted to improve the robustness of HawkEye. Compiler or application hints can also be used to assist OSs in prioritizing huge page mappings for certain parts of the address space [27, 56], for which an interface is already provided by Linux through the madvise system call [16].

An alternative approach for supporting huge pages has also been explored via libhugetlbfs [22], where the user is given more control over huge page allocation. However, such an approach requires manual intervention for reserving huge pages in advance and considers each application in isolation. Windows and OS X support huge pages only through this reservation-based approach, to avoid the issues associated with transparent huge page management [17, 55, 59]. We believe that insights from HawkEye can be leveraged to improve huge page support in these important systems.

6 Conclusions

To summarize: transparent huge page management is essential but complex. Effective and efficient algorithms are required in the OS to automatically balance the tradeoffs and provide performant and stable system behavior. We expose some important subtleties related to huge page management in existing proposals and propose a new set of algorithms to address them in HawkEye.

References

[1] 2000. Clearing pages in the idle loop. https://www.mail-archive.com/freebsd-hackers@freebsd.org/msg13993.html
[2] 2000. Linux Page Zeroing Strategy. https://yarchive.net/comp/linux/page_zeroing_strategy.html
[3] 2007. Memory part 5: What programmers can do. https://lwn.net/Articles/255364/
[4] 2010. Mysteries of Windows Memory Management Revealed, Part Two. https://blogs.msdn.microsoft.com/tims/2010/10/29/pdc10-mysteries-of-windows-memory-management-revealed-part-two/
[5] 2012. khugepaged eating 100% CPU. https://bugzilla.redhat.com/show_bug.cgi?id=879801
[6] 2012. Recommendation to disable huge pages for Hadoop. https://developer.amd.com/wordpress/media/2012/10/Hadoop_Tuning_Guide-Version5.pdf
[7] 2014. Arch Linux becomes unresponsive from khugepaged. http://unix.stackexchange.com/questions/161858/arch-linux-becomes-unresponsive-from-khugepaged
[8] 2014. CORAL Benchmark Codes. https://asc.llnl.gov/CORAL-benchmarks/#hacc
[9] 2014. Recommendation to disable huge pages for NuoDB. http://www.nuodb.com/techblog/linux-transparent-huge-pages-jemalloc-and-nuodb
[10] 2014. Why TokuDB Hates Transparent HugePages. https://www.percona.com/blog/2014/07/23/why-tokudb-hates-transparent-hugepages/
[11] 2015. The Black Magic Of Systematically Reducing Linux OS Jitter. http://highscalability.com/blog/2015/4/8/the-black-magic-of-systematically-reducing-linux-os-jitter.html
[12] 2015. Tales from the Field: Taming Transparent Huge Pages on Linux. https://www.perforce.com/blog/151016/tales-field-taming-transparent-huge-pages-linux
[13] 2016. C++ associative containers. https://github.com/sparsehash/sparsehash
[14] 2016. Remove PG_ZERO and zeroidle (page-zeroing) entirely. https://news.ycombinator.com/item?id=12227874
[15] 2017. Hugepages. https://wiki.debian.org/Hugepages
[16] 2017. madvise(2): Linux Programmer's Manual. http://man7.org/linux/man-pages/man2/madvise.2.html
[17] 2018. Large-Page Support in Windows. https://docs.microsoft.com/en-gb/windows/desktop/Memory/large-page-support
[18] 2019. cgroups(7): Linux Programmer's Manual. http://man7.org/linux/man-pages/man7/cgroups.7.html
[19] 2019. FreeBSD: Pre-Faulting and Zeroing Optimizations. https://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/prefault-optimizations.html
[20] 2019. Intel Haswell. http://www.7-cpu.com/cpu/Haswell.html
[21] 2019. Intel Skylake. http://www.7-cpu.com/cpu/Skylake.html
[22] 2019. libhugetlbfs(7): Linux man page. https://linux.die.net/man/7/libhugetlbfs
[23] 2019. Recommendation to disable huge pages for MongoDB. https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
[24] 2019. Recommendation to disable huge pages for Redis. http://redis.io/topics/latency
[25] 2019. Recommendation to disable huge pages for VoltDB. https://docs.voltdb.com/AdminGuide/adminmemmgt.php
[26] 2019. Transparent Hugepage Support. https://www.kernel.org/doc/Documentation/vm/transhuge.txt
[27] Mohammad Agbarya, Idan Yaniv, and Dan Tsafrir. 2018. Memomania: From Huge to Huge-Huge Pages. In Proceedings of the 11th ACM International Systems and Storage Conference (SYSTOR '18). ACM, New York, NY, USA, 112. https://doi.org/10.1145/3211890.3211918
[28] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. 2017. Do-It-Yourself Virtual Memory Translation. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 457-468. https://doi.org/10.1145/3079856.3080209
[29] Nadav Amit, Dan Tsafrir, and Assaf Schuster. 2014. VSwapper: A Memory Swapper for Virtualized Environments. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 349-366. https://doi.org/10.1145/2541940.2541969
[30] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS Parallel Benchmarks: Summary and Preliminary Results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91). ACM, New York, NY, USA, 158-165. https://doi.org/10.1145/125826.125925
[31] Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don't Walk (the Page Table). In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM, New York, NY, USA, 48-59. https://doi.org/10.1145/1815961.1815970
[32] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient Virtual Memory for Big Memory Servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 237-248. https://doi.org/10.1145/2485922.2485943
[33] Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. CoRR abs/1508.03619 (2015). arXiv:1508.03619 http://arxiv.org/abs/1508.03619
[34] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha Manne. 2008. Accelerating Two-dimensional Page Walks for Virtualized Systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XIII). ACM, New York, NY, USA, 26-35. https://doi.org/10.1145/1346281.1346286
[35] Abhishek Bhattacharjee. 2013. Large-reach Memory Management Unit Caches. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 383-394. https://doi.org/10.1145/2540708.2540741
[36] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08). ACM, New York, NY, USA, 72-81. https://doi.org/10.1145/1454115.1454128
[37] Daniel Bovet and Marco Cesati. 2005. Understanding the Linux Kernel. O'Reilly & Associates, Inc.
[38] Josiah L. Carlson. 2013. Redis in Action. Manning Publications Co., Greenwich, CT, USA.
[39] Jonathan Corbet. 2010. Memory compaction. https://lwn.net/Articles/368869/
[40] Guilherme Cox and Abhishek Bhattacharjee. 2017. Efficient Address Translation for Architectures with Multiple Page Sizes. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17). ACM, New York, NY, USA, 435-448. https://doi.org/10.1145/3037697.3037704
[41] Cort Dougan, Paul Mackerras, and Victor Yodaiken. 1999. Optimizing the Idle Task and Other MMU Tricks. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (OSDI '99). USENIX Association, Berkeley, CA, USA, 229-237. http://dl.acm.org/citation.cfm?id=296806.296833
[42] Niall Douglas. 2011. User Mode Memory Page Allocation: A Silver Bullet For Memory Allocation? CoRR abs/1105.1811 (2011). arXiv:1105.1811 http://arxiv.org/abs/1105.1811
[43] Ulrich Drepper. 2007. What Every Programmer Should Know About Memory.
[44] Yu Du, Miao Zhou, Bruce R. Childers, Daniel Mossé, and Rami G. Melhem. 2015. Supporting superpages in non-contiguous physical memory. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 223-234.
[45] M. Franklin, D. Yeung, Xue Wu, A. Jaleel, K. Albayraktaroglu, B. Jacob, and Chau-Wen Tseng. 2005. BioBench: A Benchmark Suite of Bioinformatics Applications. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005). 2-9. https://doi.org/10.1109/ISPASS.2005.1430554
[46] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2014. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 178-189. https://doi.org/10.1109/MICRO.2014.37
[47] Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston, Alexandra Fedorova, and Vivien Quéma. 2014. Large Pages May Be Harmful on NUMA Systems. In Proceedings of the 2014 USENIX Annual Technical Conference (USENIX ATC '14). USENIX Association, Berkeley, CA, USA, 231-242. http://dl.acm.org/citation.cfm?id=2643634.2643659
[48] Mel Gorman and Patrick Healy. 2008. Supporting Superpage Allocation Without Additional Hardware Support. In Proceedings of the 7th International Symposium on Memory Management (ISMM '08). ACM, New York, NY, USA, 41-50. https://doi.org/10.1145/1375634.1375641
[49] Mel Gorman and Patrick Healy. 2012. Performance Characteristics of Explicit Superpage Support. In Proceedings of the 2010 International Conference on Computer Architecture (ISCA '10). Springer-Verlag, Berlin, Heidelberg, 293-310. https://doi.org/10.1007/978-3-642-24322-6_24
[50] Mel Gorman and Andy Whitcroft. 2006. The What, The Why and the Where To of Anti-Fragmentation. In Linux Symposium, Vol. 1. 141.
[51] Fei Guo, Seongbeom Kim, Yury Baskakov, and Ishan Banerjee. 2015. Proactively Breaking Large Pages to Improve Memory Overcommitment Performance in VMware ESXi. In Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '15). ACM, New York, NY, USA, 39-51. https://doi.org/10.1145/2731186.2731187
[52] Fan Guo, Yongkun Li, Yinlong Xu, Song Jiang, and John C. S. Lui. 2017. SmartMD: A High Performance Deduplication Engine with Mixed Pages. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC '17). USENIX Association, Berkeley, CA, USA, 733-744. http://dl.acm.org/citation.cfm?id=3154690.3154759
[53] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (Sept. 2006), 1-17. https://doi.org/10.1145/1186736.1186737
[54] Vasileios Karakostas, Osman S. Unsal, Mario Nemirovsky, Adrián Cristal, and Michael M. Swift. 2014. Performance analysis of the memory management unit under scale-out workloads. In 2014 IEEE International Symposium on Workload Characterization (IISWC). 1-12.
[55] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 705-721. http://dl.acm.org/citation.cfm?id=3026877.3026931
[56] Joshua Magee and Apan Qasem. 2009. A Case for Compiler-driven Superpage Allocation. In Proceedings of the 47th Annual Southeast Regional Conference (ACM-SE 47). ACM, New York, NY, USA, Article 82, 4 pages. https://doi.org/10.1145/1566445.1566553
[57] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. 2002. Practical, Transparent Operating System Support for Superpages. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 89-104. https://doi.org/10.1145/844128.844138
[58] Ashish Panwar, Naman Patel, and K. Gopinath. 2016. A Case for Protecting Huge Pages from the Kernel. In Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys '16). ACM, New York, NY, USA, Article 15, 8 pages. https://doi.org/10.1145/2967360.2967371
[59] Ashish Panwar, Aravinda Prasad, and K. Gopinath. 2018. Making Huge Pages Actually Useful. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 679-692. https://doi.org/10.1145/3173162.3173203
[60] Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel H. Loh. 2014. Increasing TLB reach by exploiting clustering in page translations. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). 558-567.
[61] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced Large-Reach TLBs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 258-269. https://doi.org/10.1109/MICRO.2012.32
[62] Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee. 2015. Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways? In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 1-12. https://doi.org/10.1145/2830772.2830773
[63] Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. 2017. Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 469-480. https://doi.org/10.1145/3079856.3080210
[64] Indira Subramanian, Clifford Mather, Kurt Peterson, and Balakrishna Raghunath. 1998. Implementation of Multiple Pagesize Support in HP-UX. In USENIX Annual Technical Conference. 105-119.
[65] John R. Tramm, Andrew R. Siegel, Tanzima Islam, and Martin Schulz. [n. d.]. XSBench: The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis. In PHYSOR 2014: The Role of Reactor Physics toward a Sustainable Future. Kyoto.
[66] Koji Ueno and Toyotaro Suzumura. 2012. Highly Scalable Graph Search for the Graph500 Benchmark. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC '12). ACM, New York, NY, USA, 149-160. https://doi.org/10.1145/2287076.2287104
[67] Carl A. Waldspurger. 2002. Memory Resource Management in VMware ESX Server. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 181-194. https://doi.org/10.1145/844128.844146
[68] Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards Practical Page Coloring-based Multicore Cache Management. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys '09). ACM, New York, NY, USA, 89-102. https://doi.org/10.1145/1519065.1519076


higher number of page faults and higher MMU overheads due to multiple TLB entries, one per baseline page.

Ingens: In the Ingens paper [55], the authors point out pitfalls in the huge page management policies of both Linux and FreeBSD and present a policy that is better at handling the associated trade-offs. In summary: (1) Ingens uses an adaptive strategy to balance address translation overhead and memory bloat; it uses conservative utilization-threshold based huge page allocation to prevent memory bloat under high memory pressure, but relaxes the threshold to allocate huge pages aggressively under no memory pressure, to try and achieve the best of both worlds. (2) To avoid the high page fault latency associated with synchronous page-zeroing, Ingens employs asynchronous huge page allocation with a dedicated kernel thread. (3) To maintain fairness across multiple processes, Ingens treats memory contiguity as a resource and employs a share-based policy to allocate huge pages fairly.

The Ingens paper highlights that current OSs deal with huge page management issues through "spot fixes" and motivates the need for a principled approach. While Ingens proposes a more sophisticated strategy than previous work, we show that its static-configuration based and heuristic-driven approaches are suboptimal: conflicting performance objectives and inadequate knowledge of the MMU overheads of running applications can limit its effectiveness. Several aspects of an OS-based huge page management system can thus benefit from a dynamic, data-driven approach.

This paper presents HawkEye, an automated OS-level solution for huge page management. HawkEye proposes a set of simple-yet-effective algorithms to (1) balance the tradeoffs between memory bloat, address translation overheads, and page fault latency; (2) allocate huge pages to the applications with the highest expected performance improvement due to the allocation; and (3) improve memory sharing behaviour in virtualized systems. We also show that the actual address-translation overheads, as measured through performance counters, can sometimes be quite different from the expected/estimated overheads. To demonstrate this, we also evaluate a variant of HawkEye that relies on hardware performance counters for making huge page allocation decisions, and compare results. Our evaluation involves workloads with a diverse set of requirements vis-a-vis huge page management, and demonstrates that HawkEye can achieve considerable performance improvement over existing systems while adding negligible overhead (≈3.4% single-core overhead in the worst case).

2 Motivation

In this section, we discuss different tradeoffs involved in OS-level huge page management and how current solutions handle them. We also provide a glimpse of the results achieved through HawkEye, which is discussed in detail in §3.

Figure 1. Resident Set Size (RSS) of Redis server across 3 phases, P1 (insert), P2 (delete), and P3 (insert), for Linux, Ingens, and HawkEye. In phase P3, Linux and Ingens run out of memory while HawkEye succeeds.

2.1 Address Translation Overhead vs. Memory Bloat

One of the most important tradeoffs associated with huge pages is between address translation overheads and memory bloat. While Linux's synchronous huge page allocation is aimed at minimizing MMU overheads, it can often lead to memory bloat when an application uses only a fraction of the allocated huge page. FreeBSD promotes only after all base pages in a huge page region are allocated. FreeBSD's conservative approach tackles memory bloat but sacrifices performance by delaying the mapping of huge pages.

Ingens' strategy is adaptive in that it promotes huge pages aggressively to minimize MMU overheads when memory fragmentation is low. To measure fragmentation, it uses the Free Memory Fragmentation Index (FMFI) [50]: when FMFI < 0.5 (low fragmentation), Ingens behaves like Linux, promoting allocated pages to huge pages at the first available opportunity; when FMFI > 0.5 (high fragmentation), Ingens uses a conservative utilization-based strategy (i.e., promote only after a certain fraction of base pages, say 90%, are allocated) to mitigate memory bloat. While Ingens appears to capture the best of both Linux and FreeBSD, we show that this adaptive policy is far from a complete solution, because memory bloat generated in the aggressive phase remains unrecovered. This property is also true for Linux, and we demonstrate this problem through a simple experiment with the Redis key-value store [38] running on a 48GB memory system (see Figure 1).

We execute a client to generate memory bloat and measure the interaction of different huge page policies with Redis. For exposition, we divide our workload into three phases: (P1) the client inserts 11 million key-value pairs of size (10B key, 4KB value) for an in-memory dataset of 45GB; (P2) the client deletes 80% of randomly selected keys, leaving Redis with a sparsely populated address space; (P3) after some time gap, the client tries to insert 17K (10B key, 2MB value) key-value pairs so that the dataset again reaches 45GB. While this workload is used for clear exposition of the problem, it resembles realistic workload patterns that involve a mix of allocation and deallocation and leave the address space fragmented. An ideal system should recover elegantly from memory bloat to avoid disruptions under high memory pressure.

In phase P2, when Redis releases pages back to the OS through the madvise system call [16] (a sketch of this release path appears below), the corresponding huge page mappings are broken by the kernel and the resident-set size (RSS) reduces to around 11GB. At this point, the kernel's khugepaged thread promotes the regions containing the deallocated pages back to huge pages. As more allocations happen in phase P3, Ingens promotes huge pages aggressively until the RSS reaches 32GB (or fragmentation is low); after that, Ingens employs conservative promotion to avoid any further bloat. This is evident in Figure 1, where the rate of RSS growth decreases compared to Linux. However, both Linux and Ingens reach the memory limit (an out-of-memory, OOM, exception is thrown) at significantly lower memory utilization. While Linux generates a bloat of 28GB (i.e., only 20GB useful data), Ingens generates a bloat of 20GB (i.e., 28GB useful data). Although Ingens tries to prevent bloat at high memory utilization, it is unable to recover from bloat generated at low memory utilization.
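The release path in phase P2 can be reproduced with a few lines of C. The sketch below is our illustration (the region size, fill pattern, and the offset of the released chunk are arbitrary, and error handling is minimal): it maps an anonymous THP-eligible region, faults it in, and then returns a 2MB hole to the kernel with madvise(MADV_DONTNEED), which forces the kernel to break the covering huge page mapping, just as Redis's deallocations do.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define REGION (64UL << 20)            /* 64MB anonymous region */

    int main(void) {
        char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(buf, REGION, MADV_HUGEPAGE);   /* opt in to THP */
        memset(buf, 0xab, REGION);             /* fault the region in */
        /* Release one 2MB chunk: the kernel must split the huge page
         * mapping that covers it, and RSS drops accordingly. */
        madvise(buf + (2UL << 20), 2UL << 20, MADV_DONTNEED);
        return 0;
    }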

We note that it is possible to avoid memory bloat in Ingens by configuring it to use only conservative promotion. However, as also explained in the Ingens paper, this strategy risks losing performance due to high MMU overheads in low-memory-pressure situations. An ideal strategy should promote huge pages aggressively but should also be able to recover from bloat in an elegant fashion. Figure 1 shows HawkEye's behaviour for this workload: it is able to effectively recover from memory bloat even with an aggressive promotion strategy.

2.2 Page fault latency vs. number of page faults

The OS often needs to clear (zero) a page before mapping it into the process address space, to prevent insecure information flow. Clearing a page is significantly more expensive for huge pages: in Linux on our experimental system, zeroing a base page takes about 25% of total page fault time, which increases to 97% for huge pages. High page fault latency often leads to high user-perceived latencies, jeopardizing performance for interactive applications. Ingens avoids this problem by allocating only base pages in the page fault handler and relegating the promotion task to an asynchronous thread, khugepaged. It prioritizes the promotion of recently faulted regions over older allocations.

While Ingens reduces page allocation latency, it nullifies an important advantage of huge pages, namely fewer page faults for access patterns exhibiting high spatial locality [26]. Several applications that allocate and initialize a large memory region sequentially exhibit this type of access pattern, and their performance degrades due to a higher number of page faults if only base pages are allocated. We demonstrate the severity of this problem through a custom microbenchmark that allocates a 10GB buffer, touches one byte in every base page, and then frees the buffer. Table 1 shows the cumulative performance summary for 10 runs of this workload.

Event | Linux-4KB (sync zeroing) | Linux-2MB (sync zeroing) | Ingens-90 (async promotion) | Linux-4KB (no zeroing) | Linux-2MB (no zeroing)
Page faults | 26.2M | 51.5K | 26.2M | 26.2M | 51.5K
Total page fault time (secs) | 92.6 | 23.9 | 92.8 | 69.5 | 0.7
Avg. page fault time (µs) | 3.5 | 465 | 3.5 | 2.65 | 13
System time (secs) | 102 | 24 | 104 | 79 | 1.3
Total time (secs) | 106 | 24.9 | 116 | 83 | 4.4

Table 1. Page faults, allocation latency, and performance for a microbenchmark with ≈100GB memory allocation.

Linux with THP support (Linux-2MB with sync page-zeroing in Table 1) reduces the number of page faults by more than 500× over Linux without THP support (Linux-4KB) and leads to more than 4× performance improvement for this workload, despite being 133× worse on average page fault latency (465µs vs. 3.5µs). Ingens considerably reduces latency compared to Linux-2MB but does not reduce the number of page faults for this workload, because asynchronous promotion is activated only after 90% of the base pages in a region are allocated; copying these pages to a contiguous memory block takes more time than the time taken by this workload to generate the remaining 10% of page faults in the same region. Consequently, the overall performance of this workload degrades in Ingens due to excessive page faults. If it were possible to allocate pages without having to zero them at page fault time, we could achieve both low page fault latency and fewer page faults, resulting in higher overall performance than all existing approaches (last two columns in Table 1). HawkEye implements rate-limited asynchronous page pre-zeroing (§3.1) to achieve this in the common case. A sketch of the microbenchmark appears below.
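A minimal C sketch of this microbenchmark, reconstructed from the description above (only the 10GB buffer size, the one-byte-per-page touch, and the 10 runs are from the text; everything else is illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define BUF_SIZE (10UL << 30)    /* 10GB buffer, as in the text */
    #define PAGE_SIZE 4096UL

    int main(void) {
        for (int run = 0; run < 10; run++) {     /* 10 runs, ~100GB total */
            char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED) { perror("mmap"); return 1; }
            /* Touch one byte per base page: each touch triggers a page
             * fault, so the run is dominated by fault-handler cost. */
            for (size_t off = 0; off < BUF_SIZE; off += PAGE_SIZE)
                buf[off] = 1;
            munmap(buf, BUF_SIZE);               /* free the buffer */
        }
        return 0;
    }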

2.3 Huge page allocation across multiple processes

Under memory fragmentation, the OS must allocate huge pages among multiple processes fairly. The Ingens authors tackled this problem by defining fairness through a proportional huge page promotion metric, wherein they consider "memory contiguity as a resource" and try to be equitable in distributing it (through huge pages) among applications. Ingens penalizes applications that have been allocated huge pages but are not accessing them frequently (i.e., have idle huge pages). Idleness of a page is estimated through the access bit present in the page table entries, which is maintained by the hardware and periodically cleared by the OS. Ingens employs an idleness penalty factor whereby an application is penalized for its idle or cold huge pages during the computation of the proportional huge page promotion metric.

Ingens' fairness policy has an important weakness: two processes may have similar huge page requirements, but one of them (say P1) may have significantly higher TLB pressure than the other (say P2). This can happen, for example, if P1's accesses are spread across many base pages within its huge page regions, while P2's accesses are concentrated in one (or a few) base pages within its huge page regions. In this scenario, promotion of huge pages in P1 is more desirable than in P2, but Ingens would treat them equally.

Benchmark Suite | Total | TLB-sensitive applications
SPEC CPU2006_int | 12 | 4 (mcf, astar, omnetpp, xalancbmk)
SPEC CPU2006_fp | 19 | 3 (zeusmp, GemsFDTD, cactusADM)
PARSEC | 13 | 2 (canneal, dedup)
SPLASH-2 | 10 | 0
Biobench | 9 | 2 (tigr, mummer)
NPB | 9 | 2 (cg, bt)
CloudSuite | 7 | 2 (graph-analytics, data-analytics)
Total | 79 | 15

Table 2. Number of TLB-sensitive applications in popular benchmark suites.

Table 2 provides some empirical evidence that different applications usually behave quite differently vis-a-vis huge pages. For example, fewer than 20% of the applications in popular benchmark suites experience noticeable performance improvement (>3%) with huge pages.

We posit that instead of considering "memory contiguity as a resource" and granting it equally among processes, it is more effective to consider "MMU overheads as a system overhead" and to try to ensure that it is distributed equally across processes. A fair algorithm should attempt to equalize MMU overheads across all applications: e.g., if two processes P1 and P2 experience 30% and 10% MMU overheads respectively, then huge pages should be allocated to P1 until its overhead also reaches 10%. In doing so, it should be acceptable if P1 has to be allocated more huge pages than P2. Such a policy would additionally yield the best overall system performance, by helping the most afflicted processes first.

Further, in Ingens, huge page promotion is triggered in response to a few page faults in the aggressive (non-fragmented) phase. These huge pages are not necessarily frequently accessed by the application, yet they contribute to a process's allocation quota of huge pages. Under memory pressure, these idle huge pages lower the promotion metric of a process, potentially preventing the OS from allocating more huge pages to it. This would increase MMU overheads if the process had other hot (frequently accessed) memory regions that required huge page promotion. This behaviour, where previously-allocated cold huge pages can prevent an application from being allocated huge pages for its hot regions, seems sub-optimal and avoidable.

Finally, within a process, Linux and Ingens promote huge pages through a sequential scan from lower to higher VAs. This approach is unfair to processes whose hot regions lie in the higher VAs. Because different applications usually contain hot regions in different parts of their VA spaces (see Figure 6), this scheme is likely to cause unfairness in practice.

2.4 How to capture address translation overheads

It is common to estimate MMU overheads based on working-set size (WSS) [56]: bigger WSS should entail higher MMU and performance overheads. We find that this is often not true for modern hardware, where access patterns play an important role in determining MMU overheads.

Workload | RSS | WSS | TLB misses (native-4KB) | Cycles % (4KB) | Cycles % (2MB) | Speedup (native) | Speedup (virtual)
bt.D | 10GB | 7-10GB | 0.45 | 6.4 | 1.31 | 1.05 | 1.15
sp.D | 12GB | 8-12GB | 0.48 | 4.7 | 0.25 | 1.01 | 1.06
lu.D | 8GB | 8GB | 0.06 | 3.3 | 0.18 | 1.0 | 1.01
mg.D | 26GB | 24GB | 0.03 | 1.04 | 0.04 | 1.01 | 1.11
cg.D | 16GB | 7-8GB | 28.57 | 39 | 0.02 | 1.62 | 2.7
ft.D | 7.8GB | 0.7-3.5GB | 0.21 | 3.9 | 2.14 | 1.01 | 1.04
ua.D | 9.6GB | 5-7GB | 0.01 | 0.8 | 0.03 | 1.01 | 1.03

Table 3. Memory characteristics, address translation overheads, and speedup huge pages provide over base pages for NPB workloads.

Performance Counter
C1: DTLB_LOAD_MISSES.WALK_DURATION
C2: DTLB_STORE_MISSES.WALK_DURATION
C3: CPU_CLK_UNHALTED
MMU Overhead = ((C1 + C2) × 100) / C3

Table 4. Methodology used to measure MMU overhead [54].

For example, sequential access patterns allow prefetching to hide TLB miss latencies. Further, different TLB miss requests can experience high latency variations: a translation may be present anywhere in the multi-level page-walk caches, the multi-level regular data caches, or in main memory. For these reasons, WSS is often not a good indicator of MMU overheads. Table 3 demonstrates this with the NPB benchmark suite [30]: a workload with large WSS (e.g., mg.D) can have low MMU overheads compared to one with a smaller WSS (e.g., cg.D).
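Table 4's formula can be realized in user space with perf_event_open. The sketch below is illustrative, not the paper's implementation: the raw encodings 0x1008 and 0x1049 correspond to the documented Haswell codes for DTLB_LOAD_MISSES.WALK_DURATION and DTLB_STORE_MISSES.WALK_DURATION (event 0x08/0x49 with umask 0x10) and are microarchitecture-specific assumptions; the one-second sampling window and PID argument are our choices.

    #include <linux/perf_event.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int open_counter(pid_t pid, unsigned type, unsigned long long config) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = type;
        attr.config = config;        /* counter starts enabled by default */
        return syscall(__NR_perf_event_open, &attr, pid, -1 /* any CPU */, -1, 0);
    }

    int main(int argc, char **argv) {
        pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : 0;   /* 0 = self */
        int c1 = open_counter(pid, PERF_TYPE_RAW, 0x1008);   /* load-walk cycles */
        int c2 = open_counter(pid, PERF_TYPE_RAW, 0x1049);   /* store-walk cycles */
        int c3 = open_counter(pid, PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
        if (c1 < 0 || c2 < 0 || c3 < 0) { perror("perf_event_open"); return 1; }

        sleep(1);                    /* sample the target for one second */

        long long v1 = 0, v2 = 0, v3 = 0;
        read(c1, &v1, sizeof(v1));
        read(c2, &v2, sizeof(v2));
        read(c3, &v3, sizeof(v3));
        printf("MMU overhead = %.2f%%\n", 100.0 * (double)(v1 + v2) / (double)v3);
        return 0;
    }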

It is not clear to us how an OS can reliably capture MMU overheads: these depend on complex interactions between applications and the underlying hardware architecture, and we show that the actual address translation overheads of a workload can be quite different from what can be estimated through its memory access pattern (e.g., working-set size). Hence, we propose directly measuring TLB overheads through hardware performance counters, when available (see Table 4 for our methodology). This approach enables a more efficient solution at the cost of portability, as the required performance counters may not be available on all platforms (e.g., most hypervisors have not yet virtualized TLB-related performance counters). To overcome this challenge, we present two variants of our algorithm/implementation in HawkEye/Linux: one where the MMU overheads are measured through hardware performance counters (HawkEye-PMU), and another where MMU overheads are estimated through the memory access pattern (HawkEye-G). We compare both approaches in §4.

3 Design and Implementation

Figure 2 shows our high-level design objectives. Our solution is based on four primary observations: (1) high page fault latency for huge pages can be avoided by asynchronously pre-zeroing free pages; (2) memory bloat can be tackled by identifying and de-duplicating zero-filled baseline pages present within allocated huge pages; (3) promotion decisions should be based on finer-grained access tracking of huge-page-sized regions and should include recency, frequency,

Figure 2. Design objectives in HawkEye. Under low memory pressure: (1) fewer page faults, (2) low-latency allocation, (3) high performance. Under high memory pressure: (1) high memory efficiency, (2) efficient huge page promotion, (3) fairness.

and access-coverage (i.e., how many baseline pages are accessed inside a huge page) measurements; and (4) fairness should be based on estimation of MMU overheads.

3.1 Asynchronous page pre-zeroing

We propose that zeroing of available free pages should be performed asynchronously in a rate-limited background kernel thread, to eliminate high-latency allocation. We call this scheme async pre-zeroing. Async pre-zeroing is a rather old idea and was discussed extensively among kernel developers in the early 2000s [1, 2, 14, 41].¹ Linux developers opined that async pre-zeroing is not an overall performance win, for two main reasons. We think that it is time to revisit these opinions.

First, the async pre-zeroing thread might interfere with primary workloads by polluting the cache [1]. In particular, async pre-zeroing suffers from the "double cache miss" problem, because it causes the same datum to be accessed twice with a large reuse distance: first for pre-zeroing and then for the actual access by the application. These extra cache misses are expected to degrade overall performance in general. However, these problems are partially solvable on modern hardware that supports memory writes with non-temporal hints: such hints instruct the hardware to bypass caches during memory load/store instructions [43]. We find that using non-temporal hints during pre-zeroing significantly reduces both cache contention and the double cache miss problem.

Second, there was no consensus or empirical evidence to demonstrate the benefits of page pre-zeroing for real workloads [1, 42]. We note that the early discussions on page pre-zeroing were evaluating trade-offs with baseline 4KB pages. Our experiments corroborate the kernel developers' observation that, despite reducing the page fault overhead by 25%, pre-zeroing does not necessarily enable high performance with 4KB pages. At the same time, we also show that it enables non-negligible performance improvements (e.g., ≈14× faster VM boot time) with huge pages, due to a much higher reduction (97%) in page fault overheads. Since huge pages (and huge-huge pages) are supported by most general-purpose processors today [15, 27], pre-zeroing pages is an important optimization that is worth revisiting.

Pre-zeroing offers another advantage for virtualized systems: it increases the number of zero pages in the guest's physical address (GPA) space, enabling opportunities for

¹Windows and FreeBSD implement async pre-zeroing in a limited fashion [4, 19]. However, their huge page policies do not allow high-latency allocations. This idea is more relevant for Linux due to its synchronous huge page allocation behaviour.

content-based page sharing at the virtualization host. We evaluate this aspect in §4.

To implement async pre-zeroing, HawkEye manages free pages in the Linux buddy allocator through two lists: zero and non-zero. Pages released by applications are first added to the non-zero list, while the zero list is preferentially used for allocation. A rate-limited thread periodically transfers pages from the non-zero to the zero list after zero-filling them with non-temporal writes. Because pre-zeroing involves sequential memory accesses, non-temporal store instructions provide performance similar to regular (caching) store instructions, but without polluting the cache [3]. Finally, we note that for copy-on-write or filesystem-backed memory regions, pre-zeroing may sometimes be unnecessary and wasteful. This problem is avoidable by preferentially allocating pages for these memory regions from the non-zero list. A sketch of non-temporal page zeroing appears below.
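A minimal user-space sketch of non-temporal page zeroing (SSE2 intrinsics stand in for the kernel's movnti-based routines, so this is an approximation of the technique rather than HawkEye's in-kernel code):

    #include <emmintrin.h>
    #include <stdlib.h>

    #define PAGE_SIZE 4096

    /* Zero one page with streaming stores that bypass the cache
     * hierarchy, so the pre-zeroing thread does not evict the
     * workload's cache lines. */
    static void zero_page_nt(void *page) {
        __m128i zero = _mm_setzero_si128();
        for (int i = 0; i < PAGE_SIZE / 16; i++)     /* 256 x 16B stores */
            _mm_stream_si128((__m128i *)page + i, zero);
        _mm_sfence();   /* order weakly-ordered NT stores before reuse */
    }

    int main(void) {
        void *page;
        if (posix_memalign(&page, PAGE_SIZE, PAGE_SIZE))
            return 1;
        zero_page_nt(page);   /* page is now ready for the 'zero' list */
        free(page);
        return 0;
    }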

Overall, we believe that async pre-zeroing is a compelling idea given modern workload requirements and modern hardware support. In our evaluation, we provide some early evidence to corroborate this claim with different workloads.

3.2 Managing bloat vs. performance

We observe that the fundamental tension between MMU overheads and memory bloat can be resolved. Our approach stems from the insight that most allocations in large-memory workloads are typically "zero-filled page allocations"; the remaining are either filesystem-backed (e.g., through mmap) or copy-on-write (COW) pages. Moreover, huge pages in modern scale-out workloads are primarily used for "anonymous" pages that are initially zero-filled by the kernel [54]; e.g., Linux supports huge pages only for anonymous memory. This property of typical workloads and Linux allows automatic recovery from bloat under memory pressure.

To state our approach succinctly: we allocate huge pages at the time of the first page fault, but under memory pressure, to recover unused memory, we scan existing huge pages to identify zero-filled baseline pages within them. If the number of zero-filled baseline pages inside a huge page is significant (i.e., beyond a threshold), we break the huge page into its constituent baseline pages and de-duplicate the zero-filled baseline pages to a canonical zero page through standard COW page management techniques [37]. In this approach, it is possible for an application's in-use zero pages to also get de-duplicated. While this can result in a marginally higher number of COW page faults in rare situations, it does not compromise correctness.

To trigger recovery from memory bloat, HawkEye uses two watermarks on the amount of allocated memory in the system: high and low. When the amount of allocated memory exceeds high (85% in our prototype), a rate-limited bloat-recovery thread is activated, which executes periodically until the allocated memory falls below low (70% in our prototype).

Figure 3. Average distance (bytes) to the first non-zero byte in baseline (4KB) pages. The first four bars represent the average of all workloads in the respective benchmark suite.

At each step, the bloat-recovery thread chooses the application whose huge page allocations need to be scanned (and potentially demoted) based on the estimated MMU overheads of that application: the application with the lowest estimated MMU overheads is chosen first for scanning. This strategy ensures that the application that least requires huge pages is considered first, which is consistent with our huge page allocation strategy (§2.3).

While scanning a baseline page to verify whether it is zero-filled, we stop on encountering the first non-zero byte. In practice, the number of bytes that need to be scanned per in-use (not zero-filled) page before a non-zero byte is encountered is very small: we measured this over a total of 56 diverse workloads and found that the average distance of the first non-zero byte in a 4KB page is only 9.11 bytes (see Figure 3). Hence, only about ten bytes need to be scanned, on average, per in-use application page. For bloat pages, however, all 4096 bytes need to be scanned. This implies that the overheads of our bloat-recovery thread are largely proportional to the number of bloat pages in the system, and not to the total size of the allocated memory. This is an important property that allows our method to scale to large-memory systems; the early-exit scan is sketched below.
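A sketch of the early-exit zero-page test (illustrative only; the in-kernel version would operate on mapped huge page ranges, and the word-at-a-time scan is our choice):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* Returns true only for a fully zero-filled base page. For in-use
     * pages the loop exits after ~10 bytes on average (Figure 3); only
     * true bloat pages pay for scanning all 4096 bytes. */
    static bool is_zero_filled(const void *page) {
        const uint64_t *words = (const uint64_t *)page;
        for (unsigned i = 0; i < PAGE_SIZE / sizeof(uint64_t); i++)
            if (words[i] != 0)
                return false;        /* stop at the first non-zero word */
        return true;
    }

    int main(void) {
        static uint64_t bloat[PAGE_SIZE / 8];      /* zero-initialized */
        uint64_t in_use[PAGE_SIZE / 8];
        memset(in_use, 0xab, sizeof(in_use));
        return (is_zero_filled(bloat) && !is_zero_filled(in_use)) ? 0 : 1;
    }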

We note that our bloat-recovery procedure has many similarities with the standard de-duplication kernel threads used for content-based page sharing across virtual machines [67], e.g., the kernel same-page merging (ksm) thread in Linux. Unfortunately, in current kernels, the huge page management logic (e.g., khugepaged) and the content-based page sharing logic (e.g., ksm) are unconnected and can often interact in counter-productive ways [51]. Ingens and SmartMD [52] proposed coordinated mechanisms to avoid such conflicts: Ingens demotes only infrequently-accessed huge pages through ksm, while SmartMD demotes pages based on access frequency and repetition rate (i.e., the number of shareable pages within a huge page). These techniques are useful for in-use pages, and our bloat-recovery proposal complements them by identifying unused zero-filled pages, which can execute much faster than typical same-page merging logic.

3.3 Fine-grained huge page promotion

An efficient huge page promotion strategy should try to maximize performance with a minimal number of huge page promotions.

Figure 4. A sample representation of access_map for three processes A, B, and C. Each access_map has buckets indexed 0 to 9; the order of promotion runs from hot regions (high indices) to cold regions (low indices). Regions A1-A3, B1-B4, and C1-C5 occupy different buckets in their respective access_maps.

Current systems promote pages through a sequential scan from lower to higher VAs, which is inefficient for applications whose hot regions are not in the lower VA space. Our approach makes promotion decisions based on memory access patterns. First, we define an important metric used in HawkEye to improve the efficiency of huge page promotions: access-coverage denotes the number of base pages that are accessed from a huge-page-sized region in short intervals. We sample the page table access bits at regular intervals and maintain an exponential moving average (EMA) of access-coverage across samples. More precisely, we clear the page table access bits and test the number of set bits after 1 second, to check how many pages were accessed; this process is repeated once every 30 seconds. The access-coverage of a region potentially indicates its

TLB space requirement (the number of base page entries required in the TLB), which we use to estimate the profitability of promoting it to a huge page. A region with high access-coverage is likely to exhibit high contention on the TLB, and hence is likely to benefit more from promotion.

HawkEye implements access-coverage based promotion using a per-process data structure called access_map, which is an array of buckets; a bucket contains huge page regions with similar access-coverage. On x86 with 2MB huge pages, the value of access-coverage is in the range 0-512. In our prototype, we maintain ten buckets in the access_map, which provides the necessary resolution across access-coverage values at relatively low overheads: regions with access-coverage of 0-49 are placed in bucket 0, regions with access-coverage of 50-99 are placed in bucket 1, and so on (a small sketch of this bookkeeping follows below). Figure 4 shows an example state of the access_map of three processes A, B, and C. Regions can move up or down in their respective arrays after every sampling period, depending on their newly computed EMA-based access-coverage. If a region moves up in the access_map, it is added to the head of its bucket; if a region moves down, it is added to the tail. Within a bucket, pages are promoted from head to tail. This strategy helps in prioritizing recently accessed regions within an index.
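The per-region bookkeeping can be sketched as follows. The 512 base pages per 2MB region, the ten buckets, and the 50-wide bucket ranges are from the text; the EMA weight of 0.4 and the toy sample values are our illustrative choices:

    #include <stdio.h>

    #define BUCKETS        10
    #define PAGES_PER_HUGE 512

    struct region {
        double ema_coverage;   /* EMA of base pages accessed per sample */
    };

    /* Fold one access-bit sample (how many of the 512 pages were
     * accessed in the last second) into the moving average. */
    static void sample_region(struct region *r, int pages_accessed) {
        const double alpha = 0.4;          /* weight of the newest sample */
        r->ema_coverage = alpha * pages_accessed +
                          (1.0 - alpha) * r->ema_coverage;
    }

    /* Map the EMA to an access_map bucket: 0-49 -> 0, 50-99 -> 1, ... */
    static int bucket_index(const struct region *r) {
        int idx = (int)(r->ema_coverage / 50.0);
        return idx >= BUCKETS ? BUCKETS - 1 : idx;
    }

    int main(void) {
        struct region r = { 0.0 };
        int samples[] = { 480, 500, 60, 10 };    /* a hot, then cooling, region */
        for (int i = 0; i < 4; i++) {
            sample_region(&r, samples[i]);
            printf("EMA %.1f -> bucket %d\n", r.ema_coverage, bucket_index(&r));
        }
        return 0;
    }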

HawkEye promotes regions from higher to lower indices in the access_map. Notice that our approach captures both recency and frequency of accesses:

a region that has not been accessed, or accessed with low access-coverage, in recent sampling periods is likely to shift towards a lower index in the access_map, or towards the tail of its current index. Promotion of cold regions is thus automatically deferred, to prioritize hot regions.

We note that HawkEye's access_map is somewhat similar to the population_map and access_bitvector used in FreeBSD and Ingens respectively; those data structures are primarily used to capture utilization or page-access related metadata at huge page granularity. HawkEye's access_map additionally provides an index of the hotness of memory regions and enables fine-grained huge page promotion.

3.4 Huge page allocation across multiple processes

In our access-coverage based strategy, regions belonging to the process with the highest expected MMU overheads are promoted before others. This is also consistent with our notion of fairness (§2.3), as pages are promoted first from regions/processes with high expected TLB pressure (and consequently high expected MMU overheads).

Our HawkEye variant that does not rely on hardware performance counters (i.e., HawkEye-G) promotes regions from the non-empty highest access-coverage index across all processes. It is possible that multiple processes have non-empty buckets at the (globally) highest non-empty index; in this case, round-robin is used to ensure fairness among such processes. We explain this further with an example. Consider three applications A, B, and C, with their VA regions arranged in the access_map as shown in Figure 4. HawkEye-G promotes regions in the following order in this example:

A1, B1, C1, C2, B2, C3, C4, B3, B4, A2, C5, A3.

Recall, however, that MMU overheads may not necessarily be correlated with our access-coverage based estimation, and may depend on other, more complex features of the access pattern (§2.4). To capture this, HawkEye-PMU first chooses the process with the highest measured MMU overheads, and then chooses regions from higher to lower indices in the selected process's access_map. Among processes with similar highest MMU overheads, round-robin is used. A sketch of the HawkEye-G ordering appears below.
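One way to realize HawkEye-G's global ordering is sketched below (toy data; a real implementation would walk per-process bucket lists rather than counters): scan bucket indices from high to low and round-robin among processes whose bucket at that index is non-empty, producing orders like A1, B1, C1, C2, B2, ... for Figure 4.

    #include <stdio.h>

    #define BUCKETS 10
    #define NPROC   3

    /* regions[p][b]: number of regions process p has in bucket b. */
    static int regions[NPROC][BUCKETS];

    static void promote_one(int proc, int bucket) {
        printf("promote a region of process %d from bucket %d\n", proc, bucket);
        regions[proc][bucket]--;
    }

    int main(void) {
        regions[0][9] = 1;                       /* process A: one hot region */
        regions[1][9] = 1; regions[1][5] = 1;    /* process B */
        regions[2][9] = 2; regions[2][5] = 2;    /* process C */

        for (int b = BUCKETS - 1; b >= 0; b--) {
            int remaining;
            do {   /* round-robin within the globally hottest bucket */
                remaining = 0;
                for (int p = 0; p < NPROC; p++)
                    if (regions[p][b] > 0) {
                        promote_one(p, b);
                        remaining += regions[p][b];
                    }
            } while (remaining > 0);
        }
        return 0;
    }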

3.5 Limitations and discussion

We briefly outline a few issues that are related to huge page management but are currently unhandled in HawkEye.

1) Thresholds for identifying memory pressure: We measure the extent of memory pressure with statically configured values for the low and high watermarks (i.e., 70% and 85% of total system memory) while dealing with memory bloat. Any strategy that relies on static thresholds faces the risk of being too conservative or overly aggressive when memory pressure fluctuates. An ideal solution should adjust these thresholds dynamically to prevent unintended system behavior. The approach proposed by Guo et al. [51] in the context of memory deduplication for virtualized environments is relevant here.

2) Huge page starvation: While we believe our approach of allocating huge pages based on MMU overheads optimizes the system as a whole, unbounded huge page allocation to a single process can be thought of as a starvation problem for other processes. An adversarial application could also potentially manipulate HawkEye to get more huge pages, or to prevent other applications from getting their fair share of huge pages. Preliminary investigations show that our approach is reasonable even if the memory footprints of workloads differ by more than an order of magnitude. However, if limiting huge page allocations is still desirable, it seems reasonable to integrate such a policy with existing resource limiting/monitoring tools such as Linux's cgroups [18].

3) Other algorithms: We do not discuss some parts of the management algorithms, such as rate-limiting khugepaged to reduce promotion overheads, demotion based on low utilization to enable better sharing through same-page merging, techniques for minimizing the overhead of page-table access-bit tracking, and compaction algorithms. Much of this material has been discussed and evaluated extensively in the literature [32, 55, 59, 68], and we have not contributed significantly new approaches in these areas.

4 Evaluation

We now evaluate our algorithms in more detail. Our experimental platform is an Intel Haswell-EP based E5-2690 v3 server system running CentOS v7.4, on which 96GB memory and 48 cores (with hyperthreading enabled) running at 2.3GHz are partitioned across two sockets; we bind each workload to a single socket to avoid NUMA effects. The L1 TLB contains 64 and 8 entries for 4KB and 2MB pages respectively, while the L2 TLB contains 1024 entries for both 4KB and 2MB pages. The sizes of the L1, L2, and shared L3 caches are 768KB, 3MB, and 30MB respectively. A 96GB SSD-backed swap partition is used to evaluate performance in an overcommitted system. We evaluate HawkEye with a diverse set of workloads, ranging over HPC, graph algorithms, in-memory databases, genomics, and machine learning [24, 30, 33, 36, 45, 53, 65, 66]. HawkEye is implemented in Linux kernel v4.3.

We evaluate: (a) the improvements, in both performance and fairness, due to our fine-grained promotion strategy based on access-coverage, for single, multiple homogeneous, and multiple heterogeneous workloads; (b) the bloat-vs-performance tradeoff; (c) the impact of low-latency page faults; (d) the cache interference caused by the asynchronous pre-zeroing thread; and (e) the impact of the memory efficiency enabled by asynchronous page pre-zeroing.

Performance advantages of fine-grained huge page promotion: We first evaluate the effectiveness of our access-coverage based promotion strategy.

[Figure 5: Performance speedup (left) and time saved per huge page promotion in ms (right), over baseline pages, for Graph500, XSBench and cg.D under Linux, Ingens, HawkEye-PMU and HawkEye-G.]

Workload      Linux-4KB   Linux-2MB     Ingens        HawkEye-PMU   HawkEye-G
Graph500-1    2270        2145 (1.06)   2243 (1.01)   1987 (1.14)   2007 (1.13)
Graph500-2    2289        2252 (1.02)   2253 (1.02)   1994 (1.15)   2013 (1.14)
Graph500-3    2293        2293 (1.0)    2299 (1.00)   2012 (1.14)   2018 (1.14)
Average       2284        2230 (1.02)   2265 (1.01)   1998 (1.14)   2013 (1.13)
XSBench-1     2427        2415 (1.0)    2392 (1.01)   2108 (1.15)   2098 (1.15)
XSBench-2     2437        2427 (1.0)    2415 (1.01)   2109 (1.15)   2110 (1.15)
XSBench-3     2443        2455 (1.0)    2456 (1.00)   2133 (1.15)   2143 (1.14)
Average       2436        2432 (1.0)    2421 (1.00)   2117 (1.15)   2117 (1.15)

Table 5. Execution time (seconds) of 3 instances of Graph500 and XSBench when executed simultaneously. Values in parentheses represent speedup over baseline pages.

Recall that HawkEye promotes pages based on access-coverage, while previous approaches promote pages in VA order (from low VAs to high). We fragment memory initially by reading several files into memory; our test workloads are started in this fragmented state, and we measure the time taken for the system to recover from high MMU overheads. This experimental setup simulates expected realistic situations where the memory fragmentation in the system fluctuates over time. Without promotion, our test workloads would keep incurring high address translation overheads; HawkEye, however, is quickly able to recover from these overheads through appropriate huge page allocations. Figure 5 (left) shows the performance improvement obtained by HawkEye (over a strategy that never promotes pages) for three workloads: Graph500, XSBench and cg.D. These workloads allocate all their required memory in the beginning, i.e., in the fragmented state of the system. For these workloads, the speedups due to effective huge page management through HawkEye are as high as 22%. Compared to Linux and Ingens, the access-coverage based huge page promotion strategy of HawkEye improves performance by up to 13%, 12% and 6% over both Linux and Ingens.

To understand this more clearly, Figure 6 shows the access pattern, MMU overheads and the number of allocated huge pages over time for Graph500 and XSBench. We find that the hot-spots in these applications are concentrated in the high VAs, so any sequential-scanning based promotion is likely to be sub-optimal. The second and third columns corroborate this observation: for example, both HawkEye variants take ≈300 seconds to eliminate the MMU overheads of XSBench, while Linux and Ingens still have high overheads even after 1000 seconds.

To further quantify the cost-benefit of huge page promotions, we propose a new metric: the average execution time saved (over using only baseline pages) per huge page promotion. A scheme that maximizes this metric would be most effective in reducing MMU overheads. Figure 5 (right) shows that HawkEye performs significantly better than Linux and Ingens on this metric. The difference between the efficiency of HawkEye-PMU and HawkEye-G is also evident: HawkEye-PMU stops promoting huge pages when MMU overheads fall below a certain threshold (2% in our experiments). In summary, HawkEye-G and HawkEye-PMU are up to 6.7× and 4.4× more efficient (for XSBench) than Linux in terms of time saved per huge page promotion. For workloads whose access patterns are spread uniformly over the VA space, HawkEye delivers performance similar to Linux and Ingens.
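Written out in the style of Table 4 (our rendering of this definition, not notation from the original):

    time saved per promotion = (T_base - T) / N_promotions

where T_base is the execution time with only baseline pages, T is the execution time under the evaluated scheme, and N_promotions is the number of huge page promotions performed by that scheme.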

Fairness advantages of fine-grained huge page promotion: We next experiment with multiple applications to study both performance and fairness, first with identical applications and then with heterogeneous applications running simultaneously.

Identical workloads: Figure 7 shows the MMU overheads and huge page allocations when 3 instances of Graph500 and XSBench (in separate runs) are executed concurrently after fragmenting the system (see Table 5 for execution times). While Linux creates performance imbalance by promoting huge pages in one process at a time, HawkEye ensures fairness by judiciously distributing huge pages across all workload instances. Overall, HawkEye-PMU and HawkEye-G achieve 1.14× and 1.13× speedup for Graph500, and both achieve 1.15× speedup for XSBench, over Linux on average. Unlike Linux, Ingens promotes huge pages proportionally in all 3 instances. However, it fails to improve the performance of these workloads; in fact, it may lead to poorer performance than Linux. We explain this behaviour with an example.

Linux selects an address from Graph500-1's address space and promotes all pages above it in around 10 minutes. Even though this strategy is unfair, it improves the performance of Graph500-1 by promoting its hot regions. Linux then selects Graph500-2, whose MMU overheads decrease after ≈20 minutes. In contrast, Ingens promotes huge pages from the lower VAs in each instance of Graph500; thus it takes even longer for Ingens to promote the suitable (hot) regions of the respective processes, which leads to a 9% performance degradation over Linux for Graph500. For XSBench, the performance of both Ingens and Linux is similar (but inferior to HawkEye) because both fail to promote the application's hot regions before the application finishes.

Heterogeneous workloads: To measure the efficacy of the different strategies for heterogeneous workloads, we execute workloads by grouping them into sets, where a set contains one TLB sensitive and one TLB insensitive application.

[Figure 6: Access-coverage across the application VA space, MMU overhead over time, and huge page promotions over time for (a) Graph500 and (b) XSBench, under Linux, Ingens, HawkEye-PMU and HawkEye-G. HawkEye is more efficient in promoting huge pages due to fine-grained decision making.]

[Figure 7: Huge page promotions and their impact on TLB overhead over time under (a) Linux-2MB, (b) Ingens, (c) HawkEye-PMU and (d) HawkEye-G, for three instances each of Graph500 and XSBench.]

Each set is executed twice after fragmenting the system, to assess the impact of huge page policies on the order of execution. For the TLB insensitive workload, we execute a lightly loaded Redis server with 40M keys and 1KB-sized values, servicing 10K requests per second.

Figure 8 shows the performance speedup over 4KB pages. The (Before) and (After) configurations show whether the TLB sensitive workload is launched before or after Redis. Because Linux promotes huge pages in process launch order, the two configurations (Before) and (After) are intended to cover its two different behaviours; Ingens and HawkEye are agnostic to process creation order.

[Figure 8: Performance speedup over baseline pages of TLB sensitive applications (cactusADM, tigr, Graph500, lbm_s, SVM, XSBench, cg.D) when executed alongside a lightly stressed Redis server in different orders (Before/After), under Linux, Ingens, HawkEye-PMU and HawkEye-G.]

[Figure 9: Performance compared to Linux in a virtualized system (normalized) when HawkEye is applied at the host, guest, and both layers.]

The impact of execution order is clearly visible for Linux. In the (Before) case, Linux with THP support improves performance over baseline pages by promoting huge pages in the TLB sensitive applications. However, in the (After) case, it promotes huge pages in Redis, resulting in poor performance for the TLB sensitive workloads. While the execution order does not matter in Ingens, its proportional huge page promotion strategy favors large-memory workloads (Redis in this case). Further, our client requests randomly selected keys, which makes the Redis server access all its huge pages uniformly. Consequently, Ingens promotes most huge pages in Redis, leading to suboptimal performance of the TLB sensitive workloads. In contrast, HawkEye promotes huge pages based on MMU overhead measurements (HawkEye-PMU) or access-coverage based estimation of MMU overheads (HawkEye-G). This leads to 15-60% performance improvement over 4KB pages, irrespective of the order of execution.

Performance in virtualized systems: For a virtualized system, we use the KVM hypervisor running with Ubuntu 16.04, and fragment the system prior to running the workloads. Evaluation is presented for three different configurations, i.e., when HawkEye is deployed at the host, guest, and both layers. Each of these configurations requires running the same set of applications differently (see Table 6 for details).

Figure 9 shows that HawkEye provides 18-90% speedup compared to Linux in virtual environments. Notice that in some cases (e.g., cg.D), the performance improvement is higher in a virtualized system than on bare-metal. This is due to the high MMU overheads involved in traversing nested page tables while servicing TLB misses of guest applications, which presents higher opportunities to be exploited by HawkEye.

Configuration    Details
HawkEye-Host     Two VMs, each with 24 vCPUs and 48GB memory.
                 VM-1 runs Redis; VM-2 runs TLB sensitive workloads.
HawkEye-Guest    Single VM with 48 vCPUs and 80GB memory,
                 running Redis and TLB sensitive workloads.
HawkEye-Both     Two VMs, each with 24 vCPUs. VM-1 (30GB) runs Redis;
                 VM-2 (60GB) runs both Redis and TLB sensitive workloads.

Table 6. Experimental setup for the configurations used to evaluate a virtualized system.

Bloat vs. performance: We next study the relationship between memory bloat and performance; we revisit an experiment similar to the one summarized in §2.1. We populate a Redis instance with 8M key-value pairs (10B keys, 4KB values) and then delete 60% of the keys, selected randomly (see Table 7). While Linux-4KB (no THP) is memory efficient (no bloat), its performance is low (7% lower throughput than Linux-2MB, i.e., with THP). Linux-2MB delivers high performance but consumes more memory, which remains unavailable to the rest of the system even when memory pressure increases. Ingens can be configured to balance the memory vs. performance tradeoff. For example, Ingens-90 (Ingens with a 90% utilization threshold) is memory efficient, while Ingens-50 (Ingens with a 50% utilization threshold) favors performance and allows more memory bloat. In either case, it is unable to recover from bloat that may have already been generated (e.g., during its aggressive phase). By switching from aggressive huge page allocation under low memory pressure to a memory-conserving strategy at runtime, HawkEye is able to achieve both low MMU overheads and high memory efficiency, depending on the state of the system.

Fast page faults: Table 8 shows the performance of different strategies for the workloads chosen for these experiments; their performance depends on the efficiency of the OS page fault handler. We measure Redis throughput when 2MB-sized values are inserted. SparseHash [13] is a hash-map library in C++, while HACC-IO [8] is a parallel IO benchmark used with an in-memory file system. We also measure the spin-up time of two virtual machines, KVM and a Java Virtual Machine (JVM), where both are configured to allocate all memory during initialization.

The performance benefits of async page pre-zeroing with base pages (HawkEye-4KB) are modest; in this case, the other page fault overheads (apart from zeroing) dominate performance.

Kernel                       Self-tuning   Memory Size   Throughput
Linux-4KB                    No            16.2GB        106.1K
Linux-2MB                    No            33.2GB        113.8K
Ingens-90                    No            16.3GB        106.8K
Ingens-50                    No            33.1GB        113.4K
HawkEye (no mem pressure)    Yes           33.2GB        113.6K
HawkEye (mem pressure)       Yes           16.2GB        105.8K

Table 7. Redis memory consumption and throughput.

Workload              Linux-4KB   Linux-2MB   Ingens-90   HawkEye-4KB   HawkEye-2MB
Redis (45GB)          23.3        43.7        19.2        23.6          55.1
SparseHash (36GB)     50.1        17.2        51.5        46.6          10.6
HACC-IO (6GB)         6.5         4.5         6.6         6.5           4.2
JVM Spin-up (36GB)    37.7        18.6        52.7        29.8          13.7
KVM Spin-up (36GB)    40.6        9.7         41.8        30.2          0.70

Table 8. Performance implications of asynchronous page zeroing. Values for Redis represent throughput (higher is better); all other values represent time in seconds (lower is better).

[Figure 10: Performance overhead of async pre-zeroing with caching stores and with non-temporal (nt) stores, for NPB, PARSEC, Graph500, cactusADM, PageRank and omnetpp. The first two bars (NPB and PARSEC) represent the average of all workloads in the respective benchmark suite.]

However, the page-zeroing cost becomes significant with huge pages. Consequently, HawkEye with huge pages (HawkEye-2MB) improves the performance of Redis and SparseHash by 1.26× and 1.62× over Linux. The spin-up time of virtual machines is purely dominated by page faults; hence, we observe a dramatic reduction in the spin-up time of VMs with async pre-zeroing of huge pages: 13.8× and 1.36× over Linux for KVM and the JVM respectively. Notice that without huge pages (only base pages), this reduction is only 1.34× and 1.26×. All these workloads have high spatial locality of page faults, and hence the Ingens strategy of utilization-based promotion has a negative impact on performance due to a higher number of page faults.

Async pre-zeroing enables memory sharing in virtualized environments: Finally, we note that async pre-zeroing within VMs enables memory sharing across VMs: the free memory of a VM returns to the host through pre-zeroing in the VM and same-page merging at the host. This can have the same net effect as ballooning, as the free memory in a VM gets shared with other VMs. In our experiments with memory over-committed systems, we have confirmed this behaviour: the overall performance of an overcommitted system where the VMs are running HawkEye matches the performance achieved through para-virtual interfaces like memory balloon drivers.

[Figure 11: Performance normalized to the case of no ballooning in an overcommitted virtualized system (PageRank, Redis, MongoDB, CC, Memcached and SPD across two nodes), comparing Linux, Linux-Ballooning and HawkEye.]

For example, in an experiment involving a mix of latency-sensitive key-value stores and HPC workloads (see Figure 11), with a total peak memory consumption of around 150GB (1.5× of total memory), HawkEye-G provides 2.3× and 1.42× higher throughput for Redis and MongoDB respectively, along with a significant reduction in tail latency. For PageRank, performance degrades slightly due to a higher number of COW page faults caused by unnecessary same-page merging. These results are very close to the results obtained when memory ballooning is enabled on all VMs. We believe that this is a potential positive of our design, and may offer an alternative method to efficiently manage memory in overcommitted systems. It is well known that balloon drivers and other para-virtual interfaces are riddled with compatibility problems [29], and a fully-virtual solution to this problem would be highly desirable. While our experiments show early promise in this direction, we leave an extensive evaluation for future work.

Performance overheads of HawkEye: We evaluated HawkEye on non-fragmented systems and with several workloads that do not benefit from huge pages, and found that performance with HawkEye is very similar to performance with Linux or Ingens in these scenarios, confirming that HawkEye's algorithms for access-bit tracking, asynchronous huge page promotion and memory de-duplication add negligible overheads over the existing mechanisms in Linux. However, the async pre-zeroing thread requires special attention, as it may become a source of noticeable performance overhead for certain types of workloads, as discussed next.

Overheads of async pre-zeroing: A primary concern with async pre-zeroing is its potential detrimental effect due to memory and cache interference (§3.1). Recall that we employ memory stores with non-temporal hints to avoid these effects. To measure the worst-case cache effects of async pre-zeroing, we run our workloads while simultaneously zero-filling pages on a separate core sharing the same L3 cache (i.e., on the same socket), at a high rate of 0.25M pages per second (1GB/s), with and without non-temporal memory stores. Figure 10 shows that using non-temporal hints significantly brings down the overhead of async pre-zeroing (e.g., from 27% to 6% for omnetpp); the remaining overhead is due to the additional memory traffic generated by the pre-zeroing thread. Further, we note that this experiment presents a highly-exaggerated behavior of async pre-zeroing on worst-case workloads: in practice, the zeroing thread is rate-limited (e.g., at most 10K pages per second), and the expected cache-interference overheads would be proportionally smaller.

Workload            MMU Overhead   Time (seconds)
                                   4KB     HawkEye-PMU     HawkEye-G
random (4GB)        60%            582     328 (1.77×)     413 (1.41×)
sequential (4GB)    <1%            517     535             532
Total                              1099    863 (1.27×)     945 (1.16×)
cg.D (16GB)         39%            1952    1202 (1.62×)    1450 (1.35×)
mg.D (24GB)         <1%            1363    1364            1377
Total                              3315    2566 (1.29×)    2827 (1.17×)

Table 9. Comparison between HawkEye-PMU and HawkEye-G. Values in parentheses represent speedup over baseline pages.

Comparison between HawkEye-PMU and HawkEye-G: Even though HawkEye-G is based on simple and approximate estimations of TLB contention, it is reasonably accurate in identifying TLB sensitive processes and memory regions in most cases. However, HawkEye-PMU performs better than HawkEye-G in some cases. Table 9 demonstrates two such examples, where four workloads, all with high access-coverage but different MMU overheads, are executed in two sets; each set contains one TLB sensitive and one TLB insensitive workload. The 4KB column represents the case where no huge pages are promoted. As discussed earlier (§2.4), MMU overheads depend on the complex interaction between the hardware and the access pattern. In this case, despite having high access-coverage in their VA regions, sequential and mg.D have negligible MMU overheads (i.e., they are TLB insensitive). While HawkEye-PMU correctly identifies the process with the higher MMU overheads for huge page allocation, HawkEye-G treats them similarly (due to imprecise estimations) and may allocate huge pages to the TLB insensitive process. Consequently, HawkEye-PMU may perform up to 36% better than HawkEye-G. Bridging this gap between the two approaches in a hardware-independent manner is interesting future work.

5 Related Work
MMU overheads have been a topic of much research in recent years, and several solutions have been proposed, including solutions that are architecture-based, OS-based and/or compiler-based [28, 34, 49, 56, 58, 62-64].

Hardware support: Most proposals for hardware support aim to reduce TLB-miss frequency or accelerate TLB-miss processing. Large multi-level TLBs and support for multiple page sizes to minimize the number of TLB misses are already common in general-purpose processors today [15, 20, 21, 40]. Further, modern processors employ page-walk caches to avoid memory lookups in the TLB-miss path [31, 35]. POM-TLB services a TLB miss using a single memory lookup, and further leverages the regular data caches to speed up address translation [63]. Direct segments [32, 46] were proposed to completely avoid TLB-miss processing cost through special segmentation hardware.

OS-level challenges of memory fragmentation have also been considered in hardware designs: CoLT, or Coalesced-Large-reach TLBs [61], were initially proposed to increase TLB reach using base pages, based on the observation that OSs naturally provide contiguous mappings at smaller granularity. This approach was further extended to page-walk caches and huge pages [35, 40, 60]. Techniques to support huge pages in non-contiguous memory have also been proposed [44]. Hardware optimizations are important but complementary to HawkEye.

Software support: FreeBSD's huge page management is largely influenced by previous work on superpages [57]. Carrefour-LP [47] showed that ad-hoc usage of huge pages can degrade performance in NUMA systems, due to additional remote memory accesses and traffic imbalance across node interconnects, and proposed hardware-profiling based techniques to overcome these problems. Anti-fragmentation techniques for clustering pages based on the mobility type of pages, proposed by Gorman et al. [48], have been adopted by Linux to support the allocation of huge pages in long-running systems. Illuminator [59] highlighted critical drawbacks of Linux's fragmentation mitigation techniques and their implications on performance and latency, and proposed techniques to solve these problems. Mechanisms for dealing with physical memory fragmentation and NUMA systems address important but orthogonal problems, and their proposed solutions can be adopted to improve the robustness of HawkEye. Compiler or application hints can also be used to assist OSs in prioritizing huge page mappings for certain parts of the address space [27, 56], for which an interface is already provided by Linux through the madvise system call [16].

An alternative approach for supporting huge pages has also been explored via libhugetlbfs [22], where the user is provided more control over huge page allocation. However, such an approach requires manual intervention for reserving huge pages in advance, and considers each application in isolation. Windows and OS X support huge pages only through this reservation-based approach, to avoid the issues associated with transparent huge page management [17, 55, 59]. We believe that insights from HawkEye can be leveraged to improve huge page support in these important systems.

6 Conclusions
To summarize, transparent huge page management is essential but complex. Effective and efficient algorithms are required in the OS to automatically balance the tradeoffs and provide performant and stable system behavior. We expose some important subtleties related to huge page management in existing proposals and propose a new set of algorithms to address them in HawkEye.

References
[1] 2000. Clearing pages in the idle loop. https://www.mail-archive.com/freebsd-hackers@freebsd.org/msg13993.html
[2] 2000. Linux Page Zeroing Strategy. https://yarchive.net/comp/linux/page_zeroing_strategy.html
[3] 2007. Memory part 5: What programmers can do. https://lwn.net/Articles/255364/
[4] 2010. Mysteries of Windows Memory Management Revealed: Part Two. https://blogs.msdn.microsoft.com/tims/2010/10/29/pdc10-mysteries-of-windows-memory-management-revealed-part-two/
[5] 2012. khugepaged eating 100% CPU. https://bugzilla.redhat.com/show_bug.cgi?id=879801
[6] 2012. Recommendation to disable huge pages for Hadoop. https://developer.amd.com/wordpress/media/2012/10/Hadoop_Tuning_Guide-Version5.pdf
[7] 2014. Arch Linux becomes unresponsive from khugepaged. http://unix.stackexchange.com/questions/161858/arch-linux-becomes-unresponsive-from-khugepaged
[8] 2014. CORAL Benchmark Codes. https://asc.llnl.gov/CORAL-benchmarks/#hacc
[9] 2014. Recommendation to disable huge pages for NuoDB. http://www.nuodb.com/techblog/linux-transparent-huge-pages-jemalloc-and-nuodb
[10] 2014. Why TokuDB Hates Transparent HugePages. https://www.percona.com/blog/2014/07/23/why-tokudb-hates-transparent-hugepages/
[11] 2015. The Black Magic Of Systematically Reducing Linux OS Jitter. http://highscalability.com/blog/2015/4/8/the-black-magic-of-systematically-reducing-linux-os-jitter.html
[12] 2015. Tales from the Field: Taming Transparent Huge Pages on Linux. https://www.perforce.com/blog/151016/tales-field-taming-transparent-huge-pages-linux
[13] 2016. C++ associative containers. https://github.com/sparsehash/sparsehash
[14] 2016. Remove PG_ZERO and zeroidle (page-zeroing) entirely. https://news.ycombinator.com/item?id=12227874
[15] 2017. Hugepages. https://wiki.debian.org/Hugepages
[16] 2017. MADVISE. Linux Programmer's Manual. http://man7.org/linux/man-pages/man2/madvise.2.html
[17] 2018. Large-Page Support in Windows. https://docs.microsoft.com/en-gb/windows/desktop/Memory/large-page-support
[18] 2019. CGROUPS. Linux Programmer's Manual. http://man7.org/linux/man-pages/man7/cgroups.7.html
[19] 2019. FreeBSD: Pre-Faulting and Zeroing Optimizations. https://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/prefault-optimizations.html
[20] 2019. Intel Haswell. http://www.7-cpu.com/cpu/Haswell.html
[21] 2019. Intel Skylake. http://www.7-cpu.com/cpu/Skylake.html
[22] 2019. libhugetlbfs: Linux man page. https://linux.die.net/man/7/libhugetlbfs
[23] 2019. Recommendation to disable huge pages for MongoDB. https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
[24] 2019. Recommendation to disable huge pages for Redis. http://redis.io/topics/latency
[25] 2019. Recommendation to disable huge pages for VoltDB. https://docs.voltdb.com/AdminGuide/adminmemmgt.php
[26] 2019. Transparent Hugepage Support. https://www.kernel.org/doc/Documentation/vm/transhuge.txt
[27] Mohammad Agbarya, Idan Yaniv, and Dan Tsafrir. 2018. Memomania: From Huge to Huge-Huge Pages. In Proceedings of the 11th ACM International Systems and Storage Conference (SYSTOR '18). ACM, New York, NY, USA. https://doi.org/10.1145/3211890.3211918
[28] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. 2017. Do-It-Yourself Virtual Memory Translation. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 457-468. https://doi.org/10.1145/3079856.3080209
[29] Nadav Amit, Dan Tsafrir, and Assaf Schuster. 2014. VSwapper: A Memory Swapper for Virtualized Environments. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 349-366. https://doi.org/10.1145/2541940.2541969
[30] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS Parallel Benchmarks: Summary and Preliminary Results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91). ACM, New York, NY, USA, 158-165. https://doi.org/10.1145/125826.125925
[31] Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don't Walk (the Page Table). In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM, New York, NY, USA, 48-59. https://doi.org/10.1145/1815961.1815970
[32] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient Virtual Memory for Big Memory Servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 237-248. https://doi.org/10.1145/2485922.2485943
[33] Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. CoRR abs/1508.03619. http://arxiv.org/abs/1508.03619
[34] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha Manne. 2008. Accelerating Two-dimensional Page Walks for Virtualized Systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XIII). ACM, New York, NY, USA, 26-35. https://doi.org/10.1145/1346281.1346286
[35] Abhishek Bhattacharjee. 2013. Large-reach Memory Management Unit Caches. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 383-394. https://doi.org/10.1145/2540708.2540741
[36] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08). ACM, New York, NY, USA, 72-81. https://doi.org/10.1145/1454115.1454128
[37] Daniel Bovet and Marco Cesati. 2005. Understanding The Linux Kernel. O'Reilly & Associates Inc.
[38] Josiah L. Carlson. 2013. Redis in Action. Manning Publications Co., Greenwich, CT, USA.
[39] Jonathan Corbet. 2010. Memory compaction. https://lwn.net/Articles/368869/
[40] Guilherme Cox and Abhishek Bhattacharjee. 2017. Efficient Address Translation for Architectures with Multiple Page Sizes. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17). ACM, New York, NY, USA, 435-448. https://doi.org/10.1145/3037697.3037704
[41] Cort Dougan, Paul Mackerras, and Victor Yodaiken. 1999. Optimizing the Idle Task and Other MMU Tricks. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (OSDI '99). USENIX Association, Berkeley, CA, USA, 229-237.
[42] Niall Douglas. 2011. User Mode Memory Page Allocation: A Silver Bullet For Memory Allocation? CoRR abs/1105.1811. http://arxiv.org/abs/1105.1811
[43] Ulrich Drepper. 2007. What Every Programmer Should Know About Memory.
[44] Yu Du, Miao Zhou, Bruce R. Childers, Daniel Mossé, and Rami G. Melhem. 2015. Supporting superpages in non-contiguous physical memory. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 223-234.
[45] M. Franklin, D. Yeung, Xue Wu, A. Jaleel, K. Albayraktaroglu, B. Jacob, and Chau-Wen Tseng. 2005. BioBench: A Benchmark Suite of Bioinformatics Applications. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005), 2-9. https://doi.org/10.1109/ISPASS.2005.1430554
[46] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2014. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 178-189. https://doi.org/10.1109/MICRO.2014.37
[47] Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston, Alexandra Fedorova, and Vivien Quéma. 2014. Large Pages May Be Harmful on NUMA Systems. In Proceedings of the 2014 USENIX Annual Technical Conference (USENIX ATC '14). USENIX Association, Berkeley, CA, USA, 231-242.
[48] Mel Gorman and Patrick Healy. 2008. Supporting Superpage Allocation Without Additional Hardware Support. In Proceedings of the 7th International Symposium on Memory Management (ISMM '08). ACM, New York, NY, USA, 41-50. https://doi.org/10.1145/1375634.1375641
[49] Mel Gorman and Patrick Healy. 2012. Performance Characteristics of Explicit Superpage Support. In Proceedings of the 2010 International Conference on Computer Architecture (ISCA '10). Springer-Verlag, Berlin, Heidelberg, 293-310. https://doi.org/10.1007/978-3-642-24322-6_24
[50] Mel Gorman and Andy Whitcroft. 2006. The What, The Why and the Where To of Anti-Fragmentation. In Linux Symposium, 141.
[51] Fei Guo, Seongbeom Kim, Yury Baskakov, and Ishan Banerjee. 2015. Proactively Breaking Large Pages to Improve Memory Overcommitment Performance in VMware ESXi. In Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '15). ACM, New York, NY, USA, 39-51. https://doi.org/10.1145/2731186.2731187
[52] Fan Guo, Yongkun Li, Yinlong Xu, Song Jiang, and John C. S. Lui. 2017. SmartMD: A High Performance Deduplication Engine with Mixed Pages. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC '17). USENIX Association, Berkeley, CA, USA, 733-744.
[53] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (Sept. 2006), 1-17. https://doi.org/10.1145/1186736.1186737
[54] Vasileios Karakostas, Osman S. Unsal, Mario Nemirovsky, Adrián Cristal, and Michael M. Swift. 2014. Performance analysis of the memory management unit under scale-out workloads. In 2014 IEEE International Symposium on Workload Characterization (IISWC), 1-12.
[55] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 705-721.
[56] Joshua Magee and Apan Qasem. 2009. A Case for Compiler-driven Superpage Allocation. In Proceedings of the 47th Annual Southeast Regional Conference (ACM-SE 47). ACM, New York, NY, USA, Article 82. https://doi.org/10.1145/1566445.1566553
[57] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. 2002. Practical, Transparent Operating System Support for Superpages. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 89-104. https://doi.org/10.1145/844128.844138
[58] Ashish Panwar, Naman Patel, and K. Gopinath. 2016. A Case for Protecting Huge Pages from the Kernel. In Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys '16). ACM, New York, NY, USA, Article 15. https://doi.org/10.1145/2967360.2967371
[59] Ashish Panwar, Aravinda Prasad, and K. Gopinath. 2018. Making Huge Pages Actually Useful. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 679-692. https://doi.org/10.1145/3173162.3173203
[60] Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel H. Loh. 2014. Increasing TLB reach by exploiting clustering in page translations. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 558-567.
[61] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced Large-Reach TLBs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 258-269. https://doi.org/10.1109/MICRO.2012.32
[62] Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee. 2015. Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways? In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 1-12. https://doi.org/10.1145/2830772.2830773
[63] Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. 2017. Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 469-480. https://doi.org/10.1145/3079856.3080210
[64] Indira Subramanian, Clifford Mather, Kurt Peterson, and Balakrishna Raghunath. 1998. Implementation of Multiple Pagesize Support in HP-UX. In USENIX Annual Technical Conference, 105-119.
[65] John R. Tramm, Andrew R. Siegel, Tanzima Islam, and Martin Schulz. [n. d.]. XSBench - The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis. In PHYSOR 2014 - The Role of Reactor Physics toward a Sustainable Future, Kyoto.
[66] Koji Ueno and Toyotaro Suzumura. 2012. Highly Scalable Graph Search for the Graph500 Benchmark. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC '12). ACM, New York, NY, USA, 149-160. https://doi.org/10.1145/2287076.2287104
[67] Carl A. Waldspurger. 2002. Memory Resource Management in VMware ESX Server. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 181-194. https://doi.org/10.1145/844128.844146
[68] Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards Practical Page Coloring-based Multicore Cache Management. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys '09). ACM, New York, NY, USA, 89-102. https://doi.org/10.1145/1519065.1519076


In phase P2, when Redis releases pages back to the OS through the madvise system call [16], the corresponding huge page mappings are broken by the kernel and the resident-set size (RSS) reduces to around 11GB. At this point, the kernel's khugepaged thread promotes the regions containing the deallocated pages back to huge pages. As more allocations happen in phase P3, Ingens promotes huge pages aggressively until the RSS reaches 32GB (or fragmentation is low); after that, Ingens employs conservative promotion to avoid any further bloat. This is evident in Figure 1, where the rate of RSS growth decreases compared to Linux. However, both Linux and Ingens reach the memory limit (an out-of-memory, OOM, exception is thrown) at significantly lower memory utilization. While Linux generates a bloat of 28GB (i.e., only 20GB of useful data), Ingens generates a bloat of 20GB (i.e., 28GB of useful data). Although Ingens tries to prevent bloat at high memory utilization, it is unable to recover from bloat generated at low memory utilization.

We note that it is possible to avoid memory bloat in Ingens by configuring it to use only conservative promotion. However, as also explained in the Ingens paper, this strategy risks losing performance due to high MMU overheads in low memory pressure situations. An ideal strategy should promote huge pages aggressively, but should also be able to recover from bloat in an elegant fashion. Figure 1 shows HawkEye's behaviour for this workload: it is able to effectively recover from memory bloat even with an aggressive promotion strategy.

2.2 Page fault latency vs. number of page faults
The OS often needs to clear (zero) a page before mapping it into the process address space, to prevent insecure information flow. Clearing a page is significantly more expensive for huge pages: in Linux on our experimental system, clearing a base page takes about 25% of total page fault time, which increases to 97% for huge pages. High page fault latency often leads to high user-perceived latencies, jeopardizing performance for interactive applications. Ingens avoids this problem by allocating only base pages in the page fault handler and relegating the promotion task to an asynchronous thread, khugepaged. It prioritizes the promotion of recently faulted regions over older allocations.

While Ingens reduces page allocation latency, it nullifies an important advantage of huge pages, namely fewer page faults for access patterns exhibiting high spatial locality [26]. Several applications that allocate and initialize a large memory region sequentially exhibit this type of access pattern, and their performance degrades due to a higher number of page faults if only base pages are allocated. We demonstrate the severity of this problem through a custom microbenchmark that allocates a 10GB buffer, touches one byte in every base page, and later frees the buffer. Table 1 shows the cumulative performance summary for 10 runs of this workload.
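The described microbenchmark is simple enough to sketch; the following is a minimal user-space re-creation of its behavior (our code, not the authors' benchmark):

    /* Allocate a 10GB anonymous buffer, touch one byte per 4KB base
     * page (each touch triggers a page fault), then free the buffer. */
    #define _DEFAULT_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    #define GB        (1024UL * 1024 * 1024)
    #define PAGE_SIZE 4096UL

    int main(void)
    {
        size_t len = 10 * GB;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;
        for (size_t off = 0; off < len; off += PAGE_SIZE)
            buf[off] = 1;              /* one write per base page */
        munmap(buf, len);
        return 0;
    }

With THP enabled, the same loop incurs one fault per 2MB region instead of one per 4KB page, which is the roughly 500x difference visible in Table 1.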

Event                          sync page-zeroing        async promotion   no page-zeroing
                               Linux-4KB   Linux-2MB    Ingens-90         Linux-4KB   Linux-2MB
Page faults                    26.2M       51.5K        26.2M             26.2M       51.5K
Total page fault time (secs)   92.6        23.9         92.8              69.5        0.7
Avg page fault time (micros)   3.5         465          3.5               2.65        13
System time (secs)             102         24           104               79          1.3
Total time (secs)              106         24.9         116               83          4.4

Table 1. Page faults, allocation latency and performance for a microbenchmark with ≈100GB of memory allocation.

Linux with THP support (Linux-2MB with sync page-zeroing in Table 1) reduces the number of page faults by more than 500× over Linux without THP support (Linux-4KB), and leads to more than 4× performance improvement for this workload, despite being 133× worse on average page fault latency (465µs vs 3.5µs). Ingens considerably reduces latency compared to Linux-2MB, but does not reduce the number of page faults for this workload, because asynchronous promotion is activated only after 90% of base pages are allocated from a region; copying these pages to a contiguous memory block takes more time than the time this workload takes to generate the remaining 10% of page faults in the same region. Consequently, the overall performance of this workload degrades in Ingens due to excessive page faults. If it were possible to allocate pages without having to zero them at page fault time, we could achieve both low page fault latency and fewer page faults, resulting in higher overall performance than all existing approaches (last two columns in Table 1). HawkEye implements rate-limited asynchronous page pre-zeroing (§3.1) to achieve this in the common case.

2.3 Huge page allocation across multiple processes
Under memory fragmentation, the OS must allocate huge pages among multiple processes fairly. The Ingens authors tackled this problem by defining fairness through a proportional huge page promotion metric, wherein they consider "memory contiguity as a resource" and try to be equitable in distributing it (through huge pages) among applications. Ingens penalizes applications that have been allocated huge pages but are not accessing them frequently (i.e., have idle huge pages). Idleness of a page is estimated through the access bit present in the page table entries, which is maintained by the hardware and periodically cleared by the OS. Ingens employs an idleness penalty factor, whereby an application is penalized for its idle or cold huge pages during the computation of the proportional huge page promotion metric.

Ingens' fairness policy has an important weakness: two processes may have similar huge page requirements, but one of them (say P1) may have significantly higher TLB pressure than the other (say P2). This can happen, for example, if P1's accesses are spread across many base pages within its huge page regions, while P2's accesses are concentrated in one (or a few) base pages within its huge page regions. In this scenario, promotion of huge pages in P1 is more desirable than in P2, but Ingens would treat them equally.

Benchmark Suite     Total   TLB sensitive applications
SPEC CPU2006_int    12      4 (mcf, astar, omnetpp, xalancbmk)
SPEC CPU2006_fp     19      3 (zeusmp, GemsFDTD, cactusADM)
PARSEC              13      2 (canneal, dedup)
SPLASH-2            10      0
Biobench            9       2 (tigr, mummer)
NPB                 9       2 (cg, bt)
CloudSuite          7       2 (graph-analytics, data-analytics)
Total               79      15

Table 2. Number of TLB sensitive applications in popular benchmark suites.

Table 2 provides some empirical evidence that different

applications usually behave quite differently vis-a-vis huge pages. For example, less than 20% of the applications in popular benchmark suites experience a noticeable performance improvement (> 3%) with huge pages.

We posit that instead of considering "memory contiguity as a resource" and granting it equally among processes, it is more effective to consider "MMU overheads as a system overhead" and try to ensure that it is distributed equally across processes. A fair algorithm should attempt to equalize MMU overheads across all applications: e.g., if two processes P1 and P2 experience 30% and 10% MMU overheads respectively, then huge pages should be allocated to P1 until its overhead also reaches 10%. In doing so, it should be acceptable if P1 has to be allocated more huge pages than P2. Such a policy would additionally yield the best overall system performance by helping the most afflicted processes first.
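A simple way to state this policy (our formalization, not notation from the text): if O_i denotes the current MMU overhead of process P_i, the allocator repeatedly grants the next available huge page to

    P* = argmax_i O_i

which greedily reduces max_i O_i, the overhead of the most afflicted process, until overheads are equalized across processes.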

Further, in Ingens, huge page promotion is triggered in response to a few page faults in the aggressive (non-fragmented) phase. These huge pages are not necessarily frequently accessed by the application, yet they contribute to a process's allocation quota of huge pages. Under memory pressure, these idle huge pages lower the promotion metric of a process, potentially preventing the OS from allocating more huge pages to it. This would increase MMU overheads if the process had other hot (frequently accessed) memory regions that required huge page promotion. This behaviour, where previously-allocated cold huge pages can prevent an application from being allocated huge pages for its hot regions, seems sub-optimal and avoidable.

Finally, within a process, Linux and Ingens promote huge pages through a sequential scan from lower to higher VAs. This approach is unfair to processes whose hot regions lie in the higher VAs. Because different applications usually contain hot regions in different parts of their VA spaces (see Figure 6), this scheme is likely to cause unfairness in practice.

2.4 How to capture address translation overheads
It is common to estimate MMU overheads based on working-set size (WSS) [56]: a bigger WSS should entail higher MMU and performance overheads. We find that this is often not true on modern hardware, where access patterns play an important role in determining MMU overheads; e.g., sequential access patterns allow prefetching to hide TLB miss latencies. Further, different TLB miss requests can experience high latency variations: a translation may be present anywhere in the multi-level page walk caches, the multi-level regular data caches, or in main memory. For these reasons, WSS is often not a good indicator of MMU overheads. Table 3 demonstrates this with the NPB benchmark suite [30]: a workload with a large WSS (e.g., mg.D) can have low MMU overheads compared to one with a smaller WSS (e.g., cg.D).

Workload   RSS    WSS       TLB misses      cycles (%)       speedup
                            (native-4KB)    4KB     2MB      native   virtual
bt.D       10GB   7-10GB    0.45            6.4     1.31     1.05     1.15
sp.D       12GB   8-12GB    0.48            4.7     0.25     1.01     1.06
lu.D       8GB    8GB       0.06            3.3     0.18     1.0      1.01
mg.D       26GB   24GB      0.03            1.04    0.04     1.01     1.11
cg.D       16GB   7-8GB     28.57           39      0.02     1.62     2.7
ft.D       78GB   7-35GB    0.21            3.9     2.14     1.01     1.04
ua.D       9.6GB  5-7GB     0.01            0.8     0.03     1.01     1.03

Table 3. Memory characteristics, address translation overheads, and the speedup huge pages provide over base pages for NPB workloads.

Performance Counters:
  C1 = DTLB_LOAD_MISSES.WALK_DURATION
  C2 = DTLB_STORE_MISSES.WALK_DURATION
  C3 = CPU_CLK_UNHALTED

  MMU Overhead (%) = ((C1 + C2) * 100) / C3

Table 4. Methodology used to measure MMU overhead [54].

It is not clear to us how an OS can reliably estimate MMU overheads: these depend on complex interactions between applications and the underlying hardware architecture, and we show that the actual address translation overheads of a workload can be quite different from what can be estimated through its memory access pattern (e.g., working-set size). Hence, we propose directly measuring TLB overheads through hardware performance counters when available (see Table 4 for the methodology). This approach enables a more efficient solution, at the cost of portability, as the required performance counters may not be available on all platforms (e.g., most hypervisors have not yet virtualized TLB-related performance counters). To overcome this challenge, we present two variants of our algorithm/implementation in HawkEye/Linux: one where the MMU overheads are measured through hardware performance counters (HawkEye-PMU), and another where MMU overheads are estimated through the memory access pattern (HawkEye-G). We compare both approaches in §4.
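For illustration, the methodology of Table 4 can be approximated from user space with perf_event_open(2). The sketch below hard-codes Intel Haswell raw event encodings (DTLB_LOAD_MISSES.WALK_DURATION = event 0x08, umask 0x10; DTLB_STORE_MISSES.WALK_DURATION = event 0x49, umask 0x10); these encodings are our assumption and must be checked against the event tables of the target CPU:

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    /* Open one counter for the calling process; counting starts
     * immediately since attr.disabled is left at 0. */
    static int open_counter(__u32 type, __u64 config)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = type;
        attr.config = config;
        return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int c1 = open_counter(PERF_TYPE_RAW, 0x1008); /* DTLB load walks  */
        int c2 = open_counter(PERF_TYPE_RAW, 0x1049); /* DTLB store walks */
        int c3 = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);

        /* ... run the workload of interest here ... */

        long long v1 = 0, v2 = 0, v3 = 1;
        read(c1, &v1, sizeof(v1));
        read(c2, &v2, sizeof(v2));
        read(c3, &v3, sizeof(v3));
        printf("MMU overhead: %.2f%%\n", 100.0 * (v1 + v2) / v3);
        return 0;
    }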

3 Design and Implementation
Figure 2 shows our high-level design objectives. Our solution is based on four primary observations: (1) high page fault latency for huge pages can be avoided by asynchronously pre-zeroing free pages; (2) memory bloat can be tackled by identifying and de-duplicating zero-filled baseline pages present within allocated huge pages; (3) promotion decisions should be based on finer-grained access tracking of huge page sized regions, and should include recency, frequency and access-coverage (i.e., how many baseline pages are accessed inside a huge page) measurements; and (4) fairness should be based on estimation of MMU overheads.

[Figure 2: Design objectives in HawkEye. Under low memory pressure: (1) fewer page faults, (2) low-latency allocation, (3) high performance. Under high memory pressure: (1) high memory efficiency, (2) efficient huge page promotion, (3) fairness.]


3.1 Asynchronous page pre-zeroing
We propose that page zeroing of available free pages should be performed asynchronously in a rate-limited background kernel thread, to eliminate high-latency allocation. We call this scheme async pre-zeroing. Async pre-zeroing is a rather old idea and was discussed extensively among kernel developers in the early 2000s [1, 2, 14, 41].¹ Linux developers opined that async pre-zeroing is not an overall performance win, for two main reasons. We think that it is time to revisit these opinions.

First, the async pre-zeroing thread might interfere with primary workloads by polluting the cache [1]. In particular, async pre-zeroing suffers from the "double cache miss" problem, because it causes the same datum to be accessed twice with a large re-use distance: first for pre-zeroing and then for the actual access by the application. These extra cache misses are expected to degrade overall performance in general. However, these problems are partially solvable on modern hardware that supports memory writes with non-temporal hints: such hints instruct the hardware to bypass caches during memory load/store instructions [43]. We find that using non-temporal hints during pre-zeroing significantly reduces both cache contention and the double cache miss problem.

Second, there was no consensus or empirical evidence to demonstrate the benefits of page pre-zeroing for real workloads [1, 42]. We note that the early discussions on page pre-zeroing were evaluating trade-offs with baseline 4KB pages. Our experiments corroborate the kernel developers' observation that, despite reducing the page fault overhead by 25%, pre-zeroing does not necessarily enable high performance with 4KB pages. At the same time, we also show that it enables non-negligible performance improvements (e.g., 1.4× faster VM boot-time) with huge pages, due to a much higher reduction (97%) in page fault overheads. Since huge pages (and huge-huge pages) are supported by most general-purpose processors today [15, 27], pre-zeroing pages is an important optimization that is worth revisiting.

Pre-zeroing offers another advantage for virtualized systems: it increases the number of zero pages in the guest's physical address (GPA) space, enabling opportunities for content-based page-sharing at the virtualization host. We evaluate this aspect in §4.

¹ Windows and FreeBSD implement async pre-zeroing in a limited fashion [4, 19]. However, their huge page policies do not allow high-latency allocations. This idea is more relevant for Linux due to its synchronous huge page allocation behaviour.

To implement async pre-zeroing, HawkEye manages free pages in the Linux buddy allocator through two lists: zero and non-zero. Pages released by applications are first added to the non-zero list, while the zero list is preferentially used for allocation. A rate-limited thread periodically transfers pages from the non-zero to the zero list after zero-filling them using non-temporal writes. Because pre-zeroing involves sequential memory accesses, non-temporal store instructions provide performance similar to regular (caching) store instructions, but without polluting the cache [3]. Finally, we note that for copy-on-write or filesystem-backed memory regions, pre-zeroing may sometimes be unnecessary and wasteful. This problem is avoidable by preferentially allocating pages for these memory regions from the non-zero list.
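The zeroing step itself can be illustrated with x86 non-temporal store intrinsics (a user-space sketch of the technique; the kernel implementation would use its own primitives):

    #include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
    #include <stddef.h>

    /* Zero a page with non-temporal stores: each 64B cache line is
     * written directly to memory without being allocated in the
     * cache hierarchy. 'page' must be 16B-aligned, which holds for
     * page-aligned memory. */
    static void zero_page_nt(void *page, size_t size)
    {
        __m128i z = _mm_setzero_si128();
        for (char *p = page; p < (char *)page + size; p += 64) {
            _mm_stream_si128((__m128i *)(p +  0), z);
            _mm_stream_si128((__m128i *)(p + 16), z);
            _mm_stream_si128((__m128i *)(p + 32), z);
            _mm_stream_si128((__m128i *)(p + 48), z);
        }
        _mm_sfence();  /* order NT stores before the page is handed out */
    }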

Overall, we believe that async pre-zeroing is a compelling idea for modern workload requirements and modern hardware support. In our evaluation, we provide some early evidence to corroborate this claim with different workloads.

3.2 Managing bloat vs. performance
We observe that the fundamental tension between MMU overheads and memory bloat can be resolved. Our approach stems from the insight that most allocations in large-memory workloads are typically "zero-filled page allocations"; the remaining are either filesystem-backed (e.g., through mmap) or copy-on-write (COW) pages. Huge pages in modern scale-out workloads, however, are primarily used for "anonymous" pages that are initially zero-filled by the kernel [54]; e.g., Linux supports huge pages only for anonymous memory. This property of typical workloads and Linux allows automatic recovery of bloat under memory pressure.

To state our approach succinctly: we allocate huge pages at the time of first page fault, but under memory pressure, to recover unused memory, we scan existing huge pages to identify zero-filled baseline pages within them. If the number of zero-filled baseline pages inside a huge page is significant (i.e., beyond a threshold), we break the huge page into its constituent baseline pages and de-duplicate the zero-filled baseline pages to a canonical zero-page through standard COW page management techniques [37]. In this approach, it is possible for an application's in-use zero-pages to also get de-duplicated. While this can result in a marginally higher number of COW page faults in rare situations, it does not compromise correctness.

To trigger recovery from memory bloat, HawkEye uses two watermarks on the amount of allocated memory in the system: high and low. When the amount of allocated memory exceeds high (85% in our prototype), a rate-limited bloat-recovery thread is activated, which executes periodically until the allocated memory falls below low (70% in our prototype).

[Figure 3: Average distance (bytes) to the first non-zero byte in baseline (4KB) pages; the overall average is 9.11 bytes. The first four bars represent the average of all workloads in the respective benchmark suite.]

At each step, the bloat-recovery thread chooses the application whose huge page allocations need to be scanned (and potentially demoted) based on the estimated MMU overheads of that application: the application with the lowest estimated MMU overheads is chosen first for scanning. This strategy ensures that the application that least requires huge pages is considered first, which is consistent with our huge page allocation strategy (§2.3).

While scanning a baseline page to verify whether it is zero-filled, we stop on encountering the first non-zero byte in it. In practice, the number of bytes that need to be scanned per in-use (not zero-filled) page before a non-zero byte is encountered is very small: we measured this over a total of 56 diverse workloads and found that the average distance of the first non-zero byte in a 4KB page is only 9.11 bytes (see Figure 3). Hence, only about ten bytes need to be scanned, on average, per in-use application page. For bloat pages, however, all 4096 bytes need to be scanned. This implies that the overheads of our bloat-recovery thread are largely proportional to the number of bloat pages in the system, and not to the total size of the allocated memory. This is an important property that allows our method to scale to large-memory systems.
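A minimal sketch of this check (our code; the in-kernel version would operate on kernel mappings of the candidate pages):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Scan a base page word-at-a-time and bail out at the first
     * non-zero word: in-use pages cost ~9 bytes of scanning on
     * average (Figure 3), while only true bloat pages pay for the
     * full 4096-byte scan. */
    static bool page_is_zero_filled(const void *page, size_t size)
    {
        const uint64_t *w = page;
        size_t n = size / sizeof(*w);
        for (size_t i = 0; i < n; i++)
            if (w[i] != 0)
                return false;   /* early exit on first non-zero data */
        return true;
    }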

We note that our bloat-recovery procedure has many similarities with the standard de-duplication kernel threads used for content-based page sharing for virtual machines [67], e.g., the kernel same-page merging (ksm) thread in Linux. Unfortunately, in current kernels, the huge page management logic (e.g., khugepaged) and the content-based page sharing logic (e.g., ksm) are unconnected, and can often interact in counter-productive ways [51]. Ingens and SmartMD [52] proposed coordinated mechanisms to avoid such conflicts: Ingens demotes only infrequently-accessed huge pages through ksm, while SmartMD demotes pages based on access frequency and repetition rate (i.e., the number of shareable pages within a huge page). These techniques are useful for in-use pages, and our bloat-recovery proposal complements them by identifying unused zero-filled pages, which can execute much faster than typical same-page merging logic.

3.3 Fine-grained huge page promotion
An efficient huge page promotion strategy should try to maximize performance with a minimal number of huge page promotions.

[Figure 4: A sample representation of access_map for three processes A, B and C. Each access_map has buckets 0-9; regions (A1-A3, B1-B4, C1-C5) sit in buckets according to their access-coverage, hot regions occupy higher indices and cold regions lower ones, and promotion proceeds in order from higher to lower indices.]

Current systems promote pages through a sequential scan from lower to higher VAs, which is inefficient for applications whose hot regions are not in the lower VA space. Our approach instead makes promotion decisions based on memory access patterns. First, we define an important metric used in HawkEye to improve the efficiency of huge page promotions. Access-coverage denotes the number of base pages that are accessed from a huge-page-sized region in short intervals. We sample the page table access bits at regular intervals and maintain an exponential moving average (EMA) of access-coverage across different samples. More precisely, we clear the page table access bits and test the number of set bits after 1 second to check how many pages are accessed; this process is repeated once every 30 seconds.
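A sketch of one sampling step follows, assuming hypothetical helpers for clearing and counting page-table access bits; the EMA weight is our illustrative choice, not a value from the paper:

```c
#include <unistd.h>

#define EMA_WEIGHT 0.4 /* hypothetical smoothing factor */

extern void clear_access_bits(void *region);     /* clear PTE access bits */
extern int  count_set_access_bits(void *region); /* 0..512 for a 2MB region */

/* One sample of a region's access-coverage, folded into its EMA.
 * The caller invokes this once every 30 seconds per region. */
static double sample_access_coverage(void *region, double prev_ema)
{
    clear_access_bits(region);
    sleep(1); /* let the workload touch pages for 1 second */
    int coverage = count_set_access_bits(region);
    return EMA_WEIGHT * coverage + (1.0 - EMA_WEIGHT) * prev_ema;
}
```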

The access-coverage of a region potentially indicates its TLB space requirement (the number of base page entries required in the TLB), which we use to estimate the profitability of promoting it to a huge page. A region with high access-coverage is likely to exhibit high contention on the TLB, and hence likely to benefit more from promotion.

HawkEye implements access-coverage based promotion using a per-process data structure called access_map, which is an array of buckets; a bucket contains huge page regions with similar access-coverage. On x86 with 2MB huge pages, the value of access-coverage is in the range 0–512. In our prototype, we maintain ten buckets in the access_map, which provides the necessary resolution across access-coverage values at relatively low overheads: regions with access-coverage of 0–49 are placed in bucket 0, regions with access-coverage of 50–99 are placed in bucket 1, and so on. Figure 4 shows an example state of the access_map of three processes A, B and C. Regions can move up or down in their respective arrays after every sampling period, depending on their newly computed EMA-based access-coverage. If a region moves up in access_map, it is added to the head of its bucket; if a region moves down, it is added to the tail. Within a bucket, pages are promoted from head to tail. This strategy helps in prioritizing recently accessed regions within an index.
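A sketch of the data structure follows, with hypothetical list helpers; the bucket width of 50 matches the description above, while the treatment of regions whose bucket is unchanged (tail insertion) is our assumption:

```c
#include <stddef.h>

#define NUM_BUCKETS 10

struct region {
    struct region *prev, *next;
    double ema_coverage; /* EMA of base pages accessed, 0..512 */
    int bucket;          /* current index in access_map */
};

struct access_map {
    struct region *head[NUM_BUCKETS];
    struct region *tail[NUM_BUCKETS];
};

/* Hypothetical doubly-linked-list helpers (not shown). */
extern void unlink_region(struct access_map *map, struct region *r);
extern void push_head(struct access_map *map, int bucket, struct region *r);
extern void push_tail(struct access_map *map, int bucket, struct region *r);

static int bucket_of(double coverage)
{
    int b = (int)(coverage / 50.0);
    return b > NUM_BUCKETS - 1 ? NUM_BUCKETS - 1 : b;
}

/* Called after each sampling period with the region's new EMA. Within a
 * bucket, promotion proceeds head to tail, so regions that moved up are
 * promoted before regions that moved down or stayed put. */
void update_region(struct access_map *map, struct region *r, double ema)
{
    int nb = bucket_of(ema);
    unlink_region(map, r);
    if (nb > r->bucket)
        push_head(map, nb, r); /* moved up: recently hot, promote soon */
    else
        push_tail(map, nb, r); /* moved down (or unchanged): defer */
    r->bucket = nb;
    r->ema_coverage = ema;
}
```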

HawkEye promotes regions from higher to lower indices in access_map. Notice that our approach captures both recency and frequency of accesses: a region that has not been accessed, or has been accessed with low access-coverage, in recent sampling periods is likely to shift towards a lower index in access_map, or towards the tail of its current index. Promotion of cold regions is thus automatically deferred to prioritize hot regions.

We note that HawkEye's access_map is somewhat similar to the population_map and access_bitvector used in FreeBSD and Ingens, respectively; those data structures are primarily used to capture utilization or page-access related metadata at huge page granularity. HawkEye's access_map additionally provides an index of the hotness of memory regions and enables fine-grained huge page promotion.

3.4 Huge page allocation across multiple processes

In our access-coverage based strategy, regions belonging to the process with the highest expected MMU overheads are promoted before others. This is also consistent with our notion of fairness (§2.3), as pages are promoted first from regions/processes with high expected TLB pressure (and consequently high expected MMU overheads).

Our HawkEye variant that does not rely on hardware performance counters (i.e., HawkEye-G) promotes regions from the highest non-empty access-coverage index across all processes. It is possible that multiple processes have non-empty buckets at the (globally) highest non-empty index; in this case, round-robin is used to ensure fairness among such processes. We explain this further with an example. Consider three applications A, B and C with their VA regions arranged in the access_map as shown in Figure 4. HawkEye-G promotes regions in the following order in this example:

A1, B1, C1, C2, B2, C3, C4, B3, B4, A2, C5, A3.
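A sketch of this global ordering follows; highest_nonempty_bucket(), pop_head_region() and promote() are hypothetical stand-ins for the kernel internals:

```c
struct process;
struct region;

extern int nprocs;
extern struct process *procs[];
extern int highest_nonempty_bucket(struct process *p);  /* -1 if empty */
extern struct region *pop_head_region(struct process *p, int bucket);
extern void promote(struct region *r);                  /* build 2MB mapping */

/* Promote one region: always serve the globally highest non-empty
 * access_map index, breaking ties round-robin among processes that
 * have regions at that index. */
void promote_next_region(void)
{
    static int rr = 0; /* round-robin cursor over processes */
    int best = -1;

    for (int i = 0; i < nprocs; i++) {
        int b = highest_nonempty_bucket(procs[i]);
        if (b > best)
            best = b;
    }
    if (best < 0)
        return; /* no promotion candidates left */

    for (int t = 0; t < nprocs; t++) {
        struct process *p = procs[(rr + t) % nprocs];
        if (highest_nonempty_bucket(p) == best) {
            promote(pop_head_region(p, best));
            rr = (rr + t + 1) % nprocs;
            return;
        }
    }
}
```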

Recall, however, that MMU overheads may not necessarily be correlated with our access-coverage based estimation, and may depend on other, more complex features of the access pattern (§2.4). To capture this, HawkEye-PMU first chooses the process with the highest measured MMU overheads, and then chooses regions from higher to lower indices in the selected process's access_map. Among processes with similar highest MMU overheads, round-robin is used.
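Where the hardware exposes the required counters, the measured overhead reduces to a simple ratio; the Haswell event names below are the ones used in our methodology (§2.4), and the mechanics of reading them (e.g., via perf) are omitted from this sketch:

```c
/* MMU overhead: cycles spent in hardware page walks as a percentage of
 * all cycles, i.e.,
 *   (DTLB_LOAD_MISSES.WALK_DURATION + DTLB_STORE_MISSES.WALK_DURATION)
 *   * 100 / CPU_CLK_UNHALTED. */
static double mmu_overhead_pct(unsigned long long dtlb_load_walk_cycles,
                               unsigned long long dtlb_store_walk_cycles,
                               unsigned long long unhalted_cycles)
{
    return 100.0 * (double)(dtlb_load_walk_cycles + dtlb_store_walk_cycles)
                 / (double)unhalted_cycles;
}
```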

3.5 Limitations and discussion

We briefly outline a few issues that are related to huge page management but are currently unhandled in HawkEye.

1) Thresholds for identifying memory pressure. We measure the extent of memory pressure with statically configured values for the low and high watermarks (i.e., 70% and 85% of total system memory) while dealing with memory bloat. Any strategy that relies on static thresholds faces the risk of being too conservative or overly aggressive when memory pressure fluctuates. An ideal solution should adjust these thresholds dynamically to prevent unintended system behavior. The approach proposed by Guo et al. [51] in the context of memory deduplication for virtualized environments is relevant here.

2) Huge page starvation. While we believe our approach of allocating huge pages based on MMU overheads optimizes the system as a whole, unbounded huge page allocations to a single process can be thought of as a starvation problem for other processes. An adversarial application could also potentially monopolize HawkEye to get more huge pages, or prevent other applications from getting a fair share of huge pages. Preliminary investigations show that our approach is reasonable even if the memory footprints of workloads differ by more than an order of magnitude. However, if limiting huge page allocations is still desirable, it seems reasonable to integrate such a policy with existing resource limiting/monitoring tools such as Linux's cgroups [18].

3) Other algorithms. We do not discuss some parts of the management algorithms, such as rate-limiting khugepaged to reduce promotion overheads, demotion based on low utilization to enable better sharing through same-page merging, techniques for minimizing the overhead of page-table access-bit tracking, and compaction algorithms. Much of this material has been discussed and evaluated extensively in the literature [32, 55, 59, 68], and we have not contributed significantly new approaches in these areas.

4 Evaluation

We now evaluate our algorithms in more detail. Our experimental platform is an Intel Haswell-EP based E5-2690 v3 server system running CentOS v7.4, on which 96GB of memory and 48 cores (with hyperthreading enabled) running at 2.3GHz are partitioned across two sockets; we bind each workload to a single socket to avoid NUMA effects. The L1 TLB contains 64 and 8 entries for 4KB and 2MB pages respectively, while the L2 TLB contains 1024 entries for both 4KB and 2MB pages. The sizes of the L1, L2 and shared L3 caches are 768KB, 3MB and 30MB, respectively. A 96GB SSD-backed swap partition is used to evaluate performance in an overcommitted system. We evaluate HawkEye with a diverse set of workloads, ranging over HPC, graph algorithms, in-memory databases, genomics and machine learning [24, 30, 33, 36, 45, 53, 65, 66]. HawkEye is implemented in Linux kernel v4.3.

We evaluate (a) the improvements due to our fine-grained promotion strategy, in both performance and fairness, for single, multiple homogeneous and multiple heterogeneous workloads, (b) the bloat-vs-performance tradeoff, (c) the impact of low-latency page faults, (d) the cache interference caused by the asynchronous pre-zeroing thread, and (e) the memory efficiency enabled by asynchronous page pre-zeroing.

Performance advantages of fine-grained huge page promotion. We first evaluate the effectiveness of our access-coverage based promotion strategy. For this, we measure the time required for our algorithm to recover from a fragmented state with high address translation overheads to a state with low address translation overheads.

[Figure 5. Performance speedup (left) and time saved per huge page promotion (right), over baseline pages, for Graph500, XSBench and cg.D under Linux, Ingens, HawkEye-PMU and HawkEye-G.]

Workload     Linux-4KB   Linux-2MB     Ingens        HawkEye-PMU   HawkEye-G
Graph500-1   2270        2145 (1.06)   2243 (1.01)   1987 (1.14)   2007 (1.13)
Graph500-2   2289        2252 (1.02)   2253 (1.02)   1994 (1.15)   2013 (1.14)
Graph500-3   2293        2293 (1.00)   2299 (1.00)   2012 (1.14)   2018 (1.14)
Average      2284        2230 (1.02)   2265 (1.01)   1998 (1.14)   2013 (1.13)
XSBench-1    2427        2415 (1.00)   2392 (1.01)   2108 (1.15)   2098 (1.15)
XSBench-2    2437        2427 (1.00)   2415 (1.01)   2109 (1.15)   2110 (1.15)
XSBench-3    2443        2455 (1.00)   2456 (1.00)   2133 (1.15)   2143 (1.14)
Average      2436        2432 (1.00)   2421 (1.00)   2117 (1.15)   2117 (1.15)

Table 5. Execution time (in seconds) of 3 instances of Graph500 and XSBench when executed simultaneously. Values in parentheses represent speedup over baseline pages.

Recall that HawkEye promotes pages based on access-coverage, while previous approaches promote pages in VA order (from low VAs to high). We fragment memory initially by reading several files into memory; our test workloads are started in this fragmented state, and we measure the time taken for the system to recover from high MMU overheads. This experimental setup simulates expected realistic situations where memory fragmentation in the system fluctuates over time. Without promotion, our test workloads would keep incurring high address translation overheads; HawkEye, however, is quickly able to recover from these overheads through appropriate huge page allocations. Figure 5 (left) shows the performance improvement obtained by HawkEye (over a strategy that never promotes pages) for three workloads: Graph500, XSBench and cg.D. These workloads allocate all their required memory in the beginning, i.e., in the fragmented state of the system. For these workloads, the speedups due to effective huge page management through HawkEye are as high as 22%. The access-coverage based huge page promotion strategy of HawkEye improves the performance of these workloads by 13%, 12% and 6% over both Linux and Ingens.

To understand this more clearly, Figure 6 shows the access pattern, MMU overheads, and the number of allocated huge pages over time for Graph500 and XSBench. We find that the hot-spots in these applications are concentrated in the higher VAs, so any sequential-scanning based promotion is likely to be sub-optimal. The second and third columns corroborate this observation: for example, both HawkEye variants take ≈300 seconds to eliminate the MMU overheads of XSBench, while Linux and Ingens still exhibit high overheads even after 1000 seconds.

To further quantify the cost-benefit of huge page promotions, we propose a new metric: the average execution time saved (over using only baseline pages) per huge page promotion. A scheme that maximizes this metric would be most effective in reducing MMU overheads. Figure 5 (right) shows that HawkEye performs significantly better than Linux and Ingens on this metric. The difference between the efficiency of HawkEye-PMU and HawkEye-G is also evident: HawkEye-PMU is more efficient, as it stops promoting huge pages when MMU overheads fall below a certain threshold (2% in our experiments). In summary, HawkEye-G and HawkEye-PMU are up to 6.7× and 4.4× more efficient (for XSBench) than Linux in terms of time saved per huge page promotion. For workloads whose access patterns are spread uniformly over the VA space, HawkEye delivers performance similar to Linux and Ingens.
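In symbols, writing T_4KB for a workload's execution time with only baseline pages, T_S for its execution time under scheme S, and N_S for the number of huge page promotions S performs (our notation):

    time saved per promotion(S) = (T_4KB − T_S) / N_S

A scheme that promotes only hot regions keeps N_S small while driving T_S down, and therefore scores high on this metric.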

Fairness advantages of fine-grained huge page promotion. We next experiment with multiple applications to study both performance and fairness, first with identical applications and then with heterogeneous applications running simultaneously.

Identical workloads. Figure 7 shows the MMU overheads and huge page allocations when 3 instances of Graph500 and XSBench (in separate runs) are executed concurrently after fragmenting the system (see Table 5 for execution times). While Linux creates performance imbalance by promoting huge pages in one process at a time, HawkEye ensures fairness by judiciously distributing huge pages across all workload instances. Overall, HawkEye-PMU and HawkEye-G achieve 1.14× and 1.13× speedup for Graph500, and 1.15× speedup for XSBench, over Linux on average. Unlike Linux, Ingens promotes huge pages proportionally in all 3 instances. However, it fails to improve the performance of these workloads; in fact, it may lead to poorer performance than Linux. We explain this behaviour with an example.

Linux selects an address from Graph500-1's address space and promotes all pages above it in around 10 minutes. Even though this strategy is unfair, it improves the performance of Graph500-1 by promoting its hot regions. Linux then selects Graph500-2, whose MMU overheads decrease after ≈20 minutes. In contrast, Ingens promotes huge pages from lower VAs in each instance of Graph500. Thus, it takes even longer for Ingens to promote the suitable (hot) regions of the respective processes, which leads to a 9% performance degradation over Linux for Graph500. For XSBench, the performance of both Ingens and Linux is similar (but inferior to HawkEye) because both fail to promote the application's hot regions before the application finishes.

Heterogeneous workloads. To measure the efficacy of different strategies for heterogeneous workloads, we execute workloads by grouping them into sets, where each set contains one TLB-sensitive and one TLB-insensitive application.

[Figure 6. Access-coverage across the application VA space, MMU overhead over time, and huge page promotions over time, for (a) Graph500 and (b) XSBench, under Linux, Ingens, HawkEye-PMU and HawkEye-G. HawkEye is more efficient in promoting huge pages due to fine-grained decision making.]

[Figure 7. Huge page promotions and their impact on TLB overhead over time under (a) Linux-2MB, (b) Ingens, (c) HawkEye-PMU and (d) HawkEye-G, for three instances each of Graph500 and XSBench.]

Each set is executed twice after fragmenting the system, to assess the impact of huge page policies on the order of execution. For the TLB-insensitive workload, we execute a lightly loaded Redis server with 40M 1KB-sized keys, servicing 10K requests per second.

Figure 8 shows the performance speedup over 4KB pages. The (Before) and (After) configurations show whether the TLB-sensitive workload is launched before or after Redis. Because Linux promotes huge pages in process launch order, the two configurations (Before) and (After) are intended to cover its two different behaviours; Ingens and HawkEye are agnostic to process creation order.

[Figure 8. Performance speedup over baseline pages of TLB-sensitive applications (cactusADM, tigr, Graph500, lbm_s, SVM, XSBench, cg.D) when executed alongside a lightly stressed Redis server, in the (Before) and (After) orders.]

[Figure 9. Performance compared to Linux in a virtualized system when HawkEye is applied at the host, the guest, and both layers.]

The impact of execution order is clearly visible for Linux. In the (Before) case, Linux with THP support improves performance over baseline pages by promoting huge pages in the TLB-sensitive applications. However, in the (After) case, it promotes huge pages in Redis, resulting in poor performance for the TLB-sensitive workloads. While execution order does not matter for Ingens, its proportional huge page promotion strategy favors large-memory workloads (Redis in this case). Further, our client requests randomly selected keys, which makes the Redis server access all its huge pages uniformly. Consequently, Ingens promotes most huge pages in Redis, leading to suboptimal performance of the TLB-sensitive workloads. In contrast, HawkEye promotes huge pages based on MMU overhead measurements (HawkEye-PMU) or access-coverage based estimation of MMU overheads (HawkEye-G). This leads to 15–60% performance improvement over 4KB pages, irrespective of the order of execution.

Performance in virtualized systems. For a virtualized system, we use the KVM hypervisor running with Ubuntu 16.04, and fragment the system prior to running the workloads. Evaluation is presented for three different configurations, i.e., when HawkEye is deployed at the host, the guest, and both layers. Each of these configurations requires running the same set of applications differently (see Table 6 for details).

Figure 9 shows that HawkEye provides an 18–90% speedup compared to Linux in virtual environments. Notice that in some cases (e.g., cg.D), the performance improvement is higher in the virtualized system than on bare-metal. This is due to the high MMU overheads involved in traversing nested page tables while servicing TLB misses of guest applications, which presents greater opportunities for HawkEye to exploit.

Configuration   Details
HawkEye-Host    Two VMs, each with 24 vCPUs and 48GB memory.
                VM-1 runs Redis; VM-2 runs TLB-sensitive workloads.
HawkEye-Guest   Single VM with 48 vCPUs and 80GB memory,
                running Redis and TLB-sensitive workloads.
HawkEye-Both    Two VMs, each with 24 vCPUs. VM-1 (30GB) runs Redis;
                VM-2 (60GB) runs both Redis and TLB-sensitive workloads.

Table 6. Experimental setup for the configurations used to evaluate a virtualized system.

Bloat vs. performance. We next study the relationship between memory bloat and performance; we revisit an experiment similar to the one summarized in §2.1. We populate a Redis instance with 8M (10B key, 4KB value) key-value pairs and then delete 60% of randomly selected keys (see Table 7). While Linux-4KB (no THP) is memory efficient (no bloat), its performance is low (7% lower throughput than Linux-2MB, i.e., with THP). Linux-2MB delivers high performance but consumes more memory, which remains unavailable to the rest of the system even when memory pressure increases. Ingens can be configured to balance the memory vs. performance tradeoff: for example, Ingens-90 (Ingens with a 90% utilization threshold) is memory efficient, while Ingens-50 (Ingens with a 50% utilization threshold) favors performance and allows more memory bloat. In either case, it is unable to recover from bloat that may have already been generated (e.g., during its aggressive phase). By switching from aggressive huge page allocation under low memory pressure to a memory-conserving strategy at runtime, HawkEye is able to achieve both low MMU overheads and high memory efficiency, depending on the state of the system.

Fast page faults. Table 8 shows the performance of different strategies for workloads whose performance depends on the efficiency of the OS page fault handler. We measure Redis throughput when 2MB-sized values are inserted. SparseHash [13] is a hash-map library in C++, while HACC-IO [8] is a parallel I/O benchmark used with an in-memory file system. We also measure the spin-up time of two virtual machines, KVM and a Java Virtual Machine (JVM), where both are configured to allocate all memory during initialization.

The performance benefits of async page pre-zeroing with base pages (HawkEye-4KB) are modest: in this case, the other page fault overheads (apart from zeroing) dominate performance.

Kernel                      Self-tuning   Memory Size   Throughput
Linux-4KB                   No            16.2GB        106.1K
Linux-2MB                   No            33.2GB        113.8K
Ingens-90                   No            16.3GB        106.8K
Ingens-50                   No            33.1GB        113.4K
HawkEye (no mem pressure)   Yes           33.2GB        113.6K
HawkEye (mem pressure)      Yes           16.2GB        105.8K

Table 7. Redis memory consumption and throughput.

Workload             Linux-4KB   Linux-2MB   Ingens-90   HawkEye-4KB   HawkEye-2MB
Redis (45GB)         23.3        43.7        19.2        23.6          55.1
SparseHash (36GB)    50.1        17.2        51.5        46.6          10.6
HACC-IO (6GB)        6.5         4.5         6.6         6.5           4.2
JVM Spin-up (36GB)   37.7        18.6        52.7        29.8          13.7
KVM Spin-up (36GB)   40.6        9.7         41.8        30.2          0.70

Table 8. Performance implications of asynchronous page zeroing. Values for Redis represent throughput (higher is better); all other values represent time in seconds (lower is better).

[Figure 10. Performance overhead of async pre-zeroing with and without caching instructions (caching stores vs. non-temporal stores) for NPB, PARSEC, Graph500, cactusADM, PageRank and omnetpp. The first two bars represent the average of all workloads in the respective benchmark suite.]

However, the page-zeroing cost becomes significant with huge pages. Consequently, HawkEye with huge pages (HawkEye-2MB) improves the performance of Redis and SparseHash by 1.26× and 1.62× over Linux. The spin-up time of virtual machines is purely dominated by page faults; hence, we observe a dramatic reduction in the spin-up time of VMs with async pre-zeroing of huge pages: 13.8× and 1.36× over Linux. Notice that without huge pages (only base pages), this reduction is only 1.34× and 1.26×. All these workloads have high spatial locality of page faults, and hence the Ingens strategy of utilization-based promotion has a negative impact on their performance due to a higher number of page faults.

Async pre-zeroing enables memory sharing in virtualized environments. Finally, we note that async pre-zeroing within VMs enables memory sharing across VMs: the free memory of a VM returns to the host through pre-zeroing in the VM and same-page merging at the host. This can have the same net effect as ballooning, as the free memory of a VM gets shared with other VMs. In our experiments with memory-overcommitted systems, we have confirmed this behaviour.

[Figure 11. Performance normalized to the case of no ballooning in an overcommitted virtualized system, for PageRank, Redis, MongoDB, CC, Memcached and SPD (spread across Node 0 and Node 1), under Linux, Linux-Ballooning and HawkEye.]

The overall performance of an overcommitted system where the VMs are running HawkEye matches the performance achieved through para-virtual interfaces like memory balloon drivers. For example, in an experiment involving a mix of latency-sensitive key-value stores and HPC workloads (see Figure 11), with a total peak memory consumption of around 150GB (1.5× of total memory), HawkEye-G provides 2.3× and 1.42× higher throughput for Redis and MongoDB, along with a significant reduction in tail latency. For PageRank, performance degrades slightly due to a higher number of COW page faults caused by unnecessary same-page merging. These results are very close to those obtained when memory ballooning is enabled on all VMs. We believe this is a potential positive of our design, and may offer an alternative method to efficiently manage memory in overcommitted systems: it is well-known that balloon drivers and other para-virtual interfaces are riddled with compatibility problems [29], and a fully-virtual solution to this problem would be highly desirable. While our experiments show early promise in this direction, we leave an extensive evaluation for future work.

Performance overheads of HawkEye. We evaluated HawkEye on non-fragmented systems and with several workloads that don't benefit from huge pages, and found that performance with HawkEye is very similar to performance with Linux or Ingens in these scenarios, confirming that HawkEye's algorithms for access-bit tracking, asynchronous huge page promotion, and memory de-duplication add negligible overheads over existing mechanisms in Linux. However, the async pre-zeroing thread requires special attention, as it may become a source of noticeable performance overhead for certain types of workloads, as discussed next.

Overheads of async pre-zeroing. A primary concern with async pre-zeroing is its potential detrimental effect due to memory and cache interference (§3.1). Recall that we employ memory stores with non-temporal hints to avoid these effects. To measure the worst-case cache effects of async pre-zeroing, we run our workloads while simultaneously zero-filling pages on a separate core sharing the same L3 cache (i.e., on the same socket) at a high rate of 0.25M pages per second (1GBps), with and without non-temporal memory stores. Figure 10 shows that using non-temporal hints significantly brings down the overhead of async pre-zeroing (e.g., from 27% to 6% for omnetpp); the remaining overhead is due to the additional memory traffic generated by the pre-zeroing thread.
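For reference, zeroing a page with streaming stores is only a few lines of SSE2; this user-space sketch illustrates the non-temporal path (the in-kernel implementation differs in detail):

```c
#include <emmintrin.h> /* SSE2 intrinsics; compile with -msse2 */

/* Zero a 4KB page with non-temporal (streaming) stores, which bypass
 * the cache hierarchy and avoid the pollution measured in Figure 10.
 * The page must be 16-byte aligned. */
static void zero_page_nt(void *page)
{
    __m128i zero = _mm_setzero_si128();
    char *p = (char *)page;
    for (int i = 0; i < 4096; i += 16)
        _mm_stream_si128((__m128i *)(p + i), zero);
    _mm_sfence(); /* order streaming stores before the page is handed out */
}
```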

Workload           MMU Overhead   4KB    HawkEye-PMU    HawkEye-G
random (4GB)       60%            582    328 (1.77×)    413 (1.41×)
sequential (4GB)   <1%            517    535            532
Total                             1099   863 (1.27×)    945 (1.16×)
cg.D (16GB)        39%            1952   1202 (1.62×)   1450 (1.35×)
mg.D (24GB)        <1%            1363   1364           1377
Total                             3315   2566 (1.29×)   2827 (1.17×)

Table 9. Comparison between HawkEye-PMU and HawkEye-G; time columns are in seconds. Values in parentheses represent speedup over baseline pages.

Further, we note that this experiment presents a highly exaggerated behavior of async pre-zeroing on worst-case workloads. In practice, the zeroing thread is rate-limited (e.g., at most 10K pages per second), and the expected cache-interference overheads would be proportionally smaller.

Comparison between HawkEye-PMU and HawkEye-G. Even though HawkEye-G is based on simple and approximate estimations of TLB contention, it is reasonably accurate in identifying TLB-sensitive processes and memory regions in most cases. However, HawkEye-PMU performs better than HawkEye-G in some cases. Table 9 demonstrates two such examples, where four workloads, all with high access-coverage but different MMU overheads, are executed in two sets; each set contains one TLB-sensitive and one TLB-insensitive workload. The 4KB column represents the case where no huge pages are promoted. As discussed earlier (§2.4), MMU overheads depend on the complex interaction between the hardware and the access pattern: in this case, despite having high access-coverage in their VA regions, sequential and mg.D have negligible MMU overheads (i.e., they are TLB-insensitive). While HawkEye-PMU correctly identifies the process with higher MMU overheads for huge page allocation, HawkEye-G treats both processes similarly (due to imprecise estimations) and may allocate huge pages to the TLB-insensitive process. Consequently, HawkEye-PMU may perform up to 36% better than HawkEye-G. Bridging this gap between the two approaches in a hardware-independent manner is interesting future work.

5 Related Work

MMU overheads have been a topic of much research in recent years, and several solutions have been proposed, including architecture-based, OS-based and compiler-based approaches [28, 34, 49, 56, 58, 62–64].

Hardware support. Most proposals for hardware support aim to reduce TLB-miss frequency or accelerate TLB-miss processing. Large multi-level TLBs and support for multiple page sizes to minimize the number of TLB misses are already common in general-purpose processors today [15, 20, 21, 40]. Further, modern processors employ page-walk caches to avoid memory lookups in the TLB-miss path [31, 35]. POM-TLB services a TLB miss using a single memory lookup, and further leverages the regular data caches to speed up address translation [63].

Direct segments [32, 46] were proposed to completely avoid TLB-miss processing cost through special segmentation hardware. OS-level challenges of memory fragmentation have also been considered in hardware designs: CoLT, or Coalesced-Large-reach TLBs [61], were initially proposed to increase TLB reach using base pages, based on the observation that OSs naturally provide contiguous mappings at smaller granularity. This approach was later extended to page-walk caches and huge pages [35, 40, 60]. Techniques to support huge pages in non-contiguous memory have also been proposed [44]. Hardware optimizations are important but complementary to HawkEye.

Software support. FreeBSD's huge page management is largely influenced by previous work on superpages [57]. Carrefour-LP [47] showed that ad-hoc usage of huge pages can degrade performance on NUMA systems, due to additional remote memory accesses and traffic imbalance across node interconnects, and proposed hardware-profiling based techniques to overcome these problems. The anti-fragmentation techniques for clustering pages based on their mobility type, proposed by Gorman et al. [48], have been adopted by Linux to support the allocation of huge pages in long-running systems. Illuminator [59] highlighted critical drawbacks of Linux's fragmentation mitigation techniques and their implications on performance and latency, and proposed techniques to solve these problems. Mechanisms for dealing with physical memory fragmentation and NUMA systems address important but orthogonal problems, and their proposed solutions can be adopted to improve the robustness of HawkEye. Compiler or application hints can also be used to assist OSs in prioritizing huge page mappings for certain parts of the address space [27, 56], for which Linux already provides an interface through the madvise system call [16].

An alternative approach for supporting huge pages has also been explored via libhugetlbfs [22], where the user is given more control over huge page allocation. However, such an approach requires manual intervention for reserving huge pages in advance, and considers each application in isolation. Windows and OS X support huge pages only through this reservation-based approach, to avoid the issues associated with transparent huge page management [17, 55, 59]. We believe that insights from HawkEye can be leveraged to improve huge page support in these important systems.

6 Conclusions

Transparent huge page management is essential but complex: effective and efficient algorithms are required in the OS to automatically balance the tradeoffs and provide performant and stable system behavior. We expose some important subtleties related to huge page management in existing proposals, and propose a new set of algorithms to address them in HawkEye.

References

[1] 2000. Clearing pages in the idle loop. https://www.mail-archive.com/freebsd-hackers@freebsd.org/msg13993.html
[2] 2000. Linux Page Zeroing Strategy. https://yarchive.net/comp/linux/page_zeroing_strategy.html
[3] 2007. Memory part 5: What programmers can do. https://lwn.net/Articles/255364
[4] 2010. Mysteries of Windows Memory Management Revealed: Part Two. https://blogs.msdn.microsoft.com/tims/2010/10/29/pdc10-mysteries-of-windows-memory-management-revealed-part-two
[5] 2012. khugepaged eating 100% CPU. https://bugzilla.redhat.com/show_bug.cgi?id=879801
[6] 2012. Recommendation to disable huge pages for Hadoop. https://developer.amd.com/wordpress/media/2012/10/Hadoop_Tuning_Guide-Version5.pdf
[7] 2014. Arch Linux becomes unresponsive from khugepaged. http://unix.stackexchange.com/questions/161858/arch-linux-becomes-unresponsive-from-khugepaged
[8] 2014. CORAL Benchmark Codes. https://asc.llnl.gov/CORAL-benchmarks/#hacc
[9] 2014. Recommendation to disable huge pages for NuoDB. http://www.nuodb.com/techblog/linux-transparent-huge-pages-jemalloc-and-nuodb
[10] 2014. Why TokuDB Hates Transparent HugePages. https://www.percona.com/blog/2014/07/23/why-tokudb-hates-transparent-hugepages
[11] 2015. The Black Magic Of Systematically Reducing Linux OS Jitter. http://highscalability.com/blog/2015/4/8/the-black-magic-of-systematically-reducing-linux-os-jitter.html
[12] 2015. Tales from the Field: Taming Transparent Huge Pages on Linux. https://www.perforce.com/blog/151016/tales-field-taming-transparent-huge-pages-linux
[13] 2016. C++ associative containers. https://github.com/sparsehash/sparsehash
[14] 2016. Remove PG_ZERO and zeroidle (page-zeroing) entirely. https://news.ycombinator.com/item?id=12227874
[15] 2017. Hugepages. https://wiki.debian.org/Hugepages
[16] 2017. madvise(2), Linux Programmer's Manual. http://man7.org/linux/man-pages/man2/madvise.2.html
[17] 2018. Large-Page Support in Windows. https://docs.microsoft.com/en-gb/windows/desktop/Memory/large-page-support
[18] 2019. cgroups(7), Linux Programmer's Manual. http://man7.org/linux/man-pages/man7/cgroups.7.html
[19] 2019. FreeBSD Pre-Faulting and Zeroing Optimizations. https://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/prefault-optimizations.html
[20] 2019. Intel Haswell. http://www.7-cpu.com/cpu/Haswell.html
[21] 2019. Intel Skylake. http://www.7-cpu.com/cpu/Skylake.html
[22] 2019. libhugetlbfs(7), Linux man page. https://linux.die.net/man/7/libhugetlbfs
[23] 2019. Recommendation to disable huge pages for MongoDB. https://docs.mongodb.com/manual/tutorial/transparent-huge-pages
[24] 2019. Recommendation to disable huge pages for Redis. http://redis.io/topics/latency
[25] 2019. Recommendation to disable huge pages for VoltDB. https://docs.voltdb.com/AdminGuide/adminmemmgt.php
[26] 2019. Transparent Hugepage Support. https://www.kernel.org/doc/Documentation/vm/transhuge.txt
[27] Mohammad Agbarya, Idan Yaniv, and Dan Tsafrir. 2018. Memomania: From Huge to Huge-Huge Pages. In Proceedings of the 11th ACM International Systems and Storage Conference (SYSTOR '18). ACM, New York, NY, USA, 112. https://doi.org/10.1145/3211890.3211918
[28] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. 2017. Do-It-Yourself Virtual Memory Translation. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 457–468. https://doi.org/10.1145/3079856.3080209
[29] Nadav Amit, Dan Tsafrir, and Assaf Schuster. 2014. VSwapper: A Memory Swapper for Virtualized Environments. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 349–366. https://doi.org/10.1145/2541940.2541969
[30] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS Parallel Benchmarks—Summary and Preliminary Results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91). ACM, New York, NY, USA, 158–165. https://doi.org/10.1145/125826.125925
[31] Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don't Walk (the Page Table). In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM, New York, NY, USA, 48–59. https://doi.org/10.1145/1815961.1815970
[32] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient Virtual Memory for Big Memory Servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 237–248. https://doi.org/10.1145/2485922.2485943
[33] Scott Beamer, Krste Asanović, and David A. Patterson. 2015. The GAP Benchmark Suite. CoRR abs/1508.03619 (2015). http://arxiv.org/abs/1508.03619
[34] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha Manne. 2008. Accelerating Two-dimensional Page Walks for Virtualized Systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XIII). ACM, New York, NY, USA, 26–35. https://doi.org/10.1145/1346281.1346286
[35] Abhishek Bhattacharjee. 2013. Large-reach Memory Management Unit Caches. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 383–394. https://doi.org/10.1145/2540708.2540741
[36] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08). ACM, New York, NY, USA, 72–81. https://doi.org/10.1145/1454115.1454128
[37] Daniel Bovet and Marco Cesati. 2005. Understanding The Linux Kernel. O'Reilly & Associates Inc.
[38] Josiah L. Carlson. 2013. Redis in Action. Manning Publications Co., Greenwich, CT, USA.
[39] Jonathan Corbet. 2010. Memory compaction. https://lwn.net/Articles/368869
[40] Guilherme Cox and Abhishek Bhattacharjee. 2017. Efficient Address Translation for Architectures with Multiple Page Sizes. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17). ACM, New York, NY, USA, 435–448. https://doi.org/10.1145/3037697.3037704
[41] Cort Dougan, Paul Mackerras, and Victor Yodaiken. 1999. Optimizing the Idle Task and Other MMU Tricks. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (OSDI '99). USENIX Association, Berkeley, CA, USA, 229–237. http://dl.acm.org/citation.cfm?id=296806.296833
[42] Niall Douglas. 2011. User Mode Memory Page Allocation: A Silver Bullet For Memory Allocation? CoRR abs/1105.1811 (2011). http://arxiv.org/abs/1105.1811
[43] Ulrich Drepper. 2007. What Every Programmer Should Know About Memory.
[44] Yu Du, Miao Zhou, Bruce R. Childers, Daniel Mossé, and Rami G. Melhem. 2015. Supporting superpages in non-contiguous physical memory. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 223–234.
[45] M. Franklin, D. Yeung, Xue Wu, A. Jaleel, K. Albayraktaroglu, B. Jacob, and Chau-Wen Tseng. 2005. BioBench: A Benchmark Suite of Bioinformatics Applications. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005). 2–9. https://doi.org/10.1109/ISPASS.2005.1430554
[46] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2014. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 178–189. https://doi.org/10.1109/MICRO.2014.37
[47] Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston, Alexandra Fedorova, and Vivien Quéma. 2014. Large Pages May Be Harmful on NUMA Systems. In Proceedings of the 2014 USENIX Annual Technical Conference (USENIX ATC '14). USENIX Association, Berkeley, CA, USA, 231–242. http://dl.acm.org/citation.cfm?id=2643634.2643659
[48] Mel Gorman and Patrick Healy. 2008. Supporting Superpage Allocation Without Additional Hardware Support. In Proceedings of the 7th International Symposium on Memory Management (ISMM '08). ACM, New York, NY, USA, 41–50. https://doi.org/10.1145/1375634.1375641
[49] Mel Gorman and Patrick Healy. 2012. Performance Characteristics of Explicit Superpage Support. In Proceedings of the 2010 International Conference on Computer Architecture (ISCA '10). Springer-Verlag, Berlin, Heidelberg, 293–310. https://doi.org/10.1007/978-3-642-24322-6_24
[50] Mel Gorman and Andy Whitcroft. 2006. The What, The Why and the Where To of Anti-Fragmentation. In Linux Symposium. 141.
[51] Fei Guo, Seongbeom Kim, Yury Baskakov, and Ishan Banerjee. 2015. Proactively Breaking Large Pages to Improve Memory Overcommitment Performance in VMware ESXi. In Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '15). ACM, New York, NY, USA, 39–51. https://doi.org/10.1145/2731186.2731187
[52] Fan Guo, Yongkun Li, Yinlong Xu, Song Jiang, and John C. S. Lui. 2017. SmartMD: A High Performance Deduplication Engine with Mixed Pages. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC '17). USENIX Association, Berkeley, CA, USA, 733–744. http://dl.acm.org/citation.cfm?id=3154690.3154759
[53] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (Sept. 2006), 1–17. https://doi.org/10.1145/1186736.1186737
[54] Vasileios Karakostas, Osman S. Unsal, Mario Nemirovsky, Adrián Cristal, and Michael M. Swift. 2014. Performance analysis of the memory management unit under scale-out workloads. In 2014 IEEE International Symposium on Workload Characterization (IISWC). 1–12.
[55] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 705–721. http://dl.acm.org/citation.cfm?id=3026877.3026931
[56] Joshua Magee and Apan Qasem. 2009. A Case for Compiler-driven Superpage Allocation. In Proceedings of the 47th Annual Southeast Regional Conference (ACM-SE 47). ACM, New York, NY, USA, Article 82. https://doi.org/10.1145/1566445.1566553
[57] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. 2002. Practical, Transparent Operating System Support for Superpages. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 89–104. https://doi.org/10.1145/844128.844138
[58] Ashish Panwar, Naman Patel, and K. Gopinath. 2016. A Case for Protecting Huge Pages from the Kernel. In Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys '16). ACM, New York, NY, USA, Article 15. https://doi.org/10.1145/2967360.2967371
[59] Ashish Panwar, Aravinda Prasad, and K. Gopinath. 2018. Making Huge Pages Actually Useful. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 679–692. https://doi.org/10.1145/3173162.3173203
[60] Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel H. Loh. 2014. Increasing TLB reach by exploiting clustering in page translations. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). 558–567.
[61] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced Large-Reach TLBs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 258–269. https://doi.org/10.1109/MICRO.2012.32
[62] Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee. 2015. Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways? In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 1–12. https://doi.org/10.1145/2830772.2830773
[63] Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. 2017. Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 469–480. https://doi.org/10.1145/3079856.3080210
[64] Indira Subramanian, Clifford Mather, Kurt Peterson, and Balakrishna Raghunath. 1998. Implementation of Multiple Pagesize Support in HP-UX. In USENIX Annual Technical Conference. 105–119.
[65] John R. Tramm, Andrew R. Siegel, Tanzima Islam, and Martin Schulz. [n. d.]. XSBench — The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis. In PHYSOR 2014 — The Role of Reactor Physics toward a Sustainable Future. Kyoto.
[66] Koji Ueno and Toyotaro Suzumura. 2012. Highly Scalable Graph Search for the Graph500 Benchmark. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC '12). ACM, New York, NY, USA, 149–160. https://doi.org/10.1145/2287076.2287104
[67] Carl A. Waldspurger. 2002. Memory Resource Management in VMware ESX Server. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 181–194. https://doi.org/10.1145/844128.844146
[68] Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards Practical Page Coloring-based Multicore Cache Management. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys '09). ACM, New York, NY, USA, 89–102. https://doi.org/10.1145/1519065.1519076

Page 4: HawkEye: Efficient Fine-grained OS Support for …sbansal/pubs/hawkeye.pdfHawkEye: Efficient Fine-grained OS Support for Huge Pages Ashish Panwar Indian Institute of Science ashishpanwar@iisc.ac.in

Benchmark Suite Number of applicationsTotal TLB sensitive applications

SPEC CPU2006_int 12 4 (mcf astar omnetpp xalancbmk)SPEC CPU2006_fp 19 3 (zeusmp GemsFDTD cactusADM)PARSEC 13 2 (canneal dedup)SPLASH-2 10 0Biobench 9 2 (tigr mummer)NPB 9 2 (cg bt)CloudSuite 7 2 (graph-analytics data-analytics)Total 79 15

Table 2 Number of TLB sensitive applications in popularbenchmark suitesTable 2 provides some empirical evidence that different

applications usually behave quite differently vis-a-vis hugepages For example less than 20 of the applications in pop-ular benchmark suites experience noticeable performanceimprovement (gt 3) with huge pages

We posit that instead of considering ldquomemory contiguityas a resourcerdquo and granting it equally among processes itis more effective to consider ldquoMMU overheads as a systemoverheadrdquo and try and ensure that it is distributed equallyacross processes A fair algorithm should attempt to equalizeMMU overheads across all applications eg if two processesP1 and P2 experience 30 and 10 MMU overheads respthen huge pages should be allocated to P1 until its overheadalso reaches 10 In doing so it should be acceptable if P1has to be allocated more huge pages than P2 Such a policywould additionally yield the best overall system performanceby helping the most afflicted processes first

Further in Ingens huge page promotion is triggered in re-sponse to a few page-faults in the aggressive (non-fragmented)phase These huge pages are not necessarily frequently ac-cessed by the application yet they contribute to a processrsquosallocation quota of huge pages Under memory pressurethese idle huge pages lower the promotion metric of a pro-cess potentially preventing the OS from allocating morehuge pages to it This would increase MMU overheads if theprocess had other hot (frequently accessed) memory regionsthat required huge page promotion This behaviour wherepreviously-allocated cold huge pages can prevent an appli-cation from being allocated huge pages for its hot regionsseems sub-optimal and avoidable

Finally within a process Linux and Ingens promote hugepages through a sequential scan from lower to higher VAsThis approach is unfair to processes whose hot regions lie inthe higher VAs Because different applications would usuallycontain hot regions in different parts of their VA spaces (seeFigure 6) this scheme is likely to cause unfairness in practice

24 How to capture address translation overheadsIt is common to estimate MMU overheads based on working-set size (WSS) [56] bigger WSS should entail higher MMUand performance overheads We find that this is often nottrue for modern hardware where access patterns play an im-portant role in determining MMU overheads eg sequential

Workload RSS WSS TLB-misses(native-4KB)

cycles speedup4KB 2MB native virtual

btD 10GB 7-10 GB 045 64 131 105 115spD 12GB 8-12 GB 048 47 025 101 106luD 8GB 8 GB 006 33 018 10 101mgD 26GB 24 GB 003 104 004 101 111cgD 16GB 7-8 GB 2857 39 002 162 27ftD 78 GB 7-35 GB 021 39 214 101 104uaD 96 GB 5-7 GB 001 08 003 101 103

Table 3Memory characteristics address translation over-heads and speedup huge pages provide over base pages forNPB workloads

Performance CounterC1 DTLB_LOAD_MISSES_WALK_DURATIONC2 DTLB_STORE_MISSES_WALK_DURATIONC3 CPU_CLK_UNHALTEDMMU Overhead = ((C1 + C2) 100) C3

Table 4Methodology used to measure MMUOverhead [54]access patterns allow prefetching to hide TLB miss latenciesFurther different TLB miss requests can experience highlatency variations a translation may be present anywherein the multi-level page walk caches multi-level regular datacaches or in main memory For these reasons WSS is oftennot a good indicator of MMU overheads Table 3 demon-strates this with the NPB benchmark suite [30] a workloadwith large WSS (eg mgD) can have low MMU overheadscompared to one with a smaller WSS (eg cgD)

It is not clear to us how an OS can reliably capture MMUoverheads these are dependent on complex interactions be-tween applications and the underlying hardware architec-ture and we show that the actual address translation over-heads of a workload can be quite different from what can beestimated through its memory access pattern (eg working-set size) Hence we propose directly measuring TLB over-heads through hardware performance counters when avail-able (see Table 4 for methodology) This approach enablesa more efficient solution at the cost of portability as the re-quired performance counters may not be available on all plat-forms (eg most hypervisors have not yet virtualized TLB-related performance counters) To overcome this challengewe present two variants of our algorithmimplementation inHawkEyeLinux one where the MMU overheads are mea-sured through hardware performance counters (HawkEye-PMU) and another where MMU overheads are estimatedthrough the memory access pattern (HawkEye-G) We com-pare both approaches in sect4

3 Design and ImplementationFigure 2 shows our high-level design objectives Our solutionis based on four primary observations (1) high page faultlatency for huge pages can be avoided by asynchronouslypre-zeroing free pages (2) memory bloat can be tackledby identifying and de-duplicating zero-filled baseline pagespresent within allocated huge pages (3) promotion decisionsshould be based on finer-grained access tracking of hugepage sized regions and should include recency frequency

Low Memory Pressure High Memory Pressure

1 Fewer Page Faults 2 Lowshylatency Allocation 3 High Performance

1 High Memory Efficiency 2 Efficient Huge Page Promotion 3 Fairness

Figure 2 Design objectives in HawkEye

and access-coverage (ie how many baseline pages are ac-cessed inside a huge page) measurements and (4) fairnessshould be based on estimation of MMU overheads

31 Asynchronous page pre-zeroingWe propose that page zeroing of available free pages shouldbe performed asynchronously in a rate-limited backgroundkernel thread to eliminate high latency allocation We callthis scheme async pre-zeroing Async pre-zeroing is a ratherold idea and had been discussed extensively among kerneldevelopers in early 2000s [1 2 14 41]1 Linux developersopined that async pre-zeroing is not an overall performancewin for two main reasons We think that it is time to revisitthese opinionsFirst the async pre-zeroing thread might interfere with

primary workloads by polluting the cache [1] In particu-lar async pre-zeroing suffers from the ldquodouble cache missrdquoproblem because it causes the same datum to be accessedtwice with a large re-use distance first for pre-zeroing andthen for the actual access by the application These extracache misses are expected to degrade overall performancein general However these problems are partially solvableon modern hardware that support memory writes with non-temporal hints non-temporal hints instruct the hardware tobypass caches during memory loadstore instructions [43]We find that using non-temporal hints during pre-zeroingsignificantly reduces both cache contention and the doublecache miss problem

Second there was no consensus or empirical evidence todemonstrate the benefits of page pre-zeroing for real work-loads [1 42] We note that the early discussions on pagepre-zeroing were evaluating trade-offs with baseline 4KBpages Our experiments corroborate the kernel developersrsquoobservation that despite reducing the page fault overheadby 25 pre-zeroing does not necessarily enable high per-formance with 4KB pages At the same time we also showthat it enables non-negligible performance improvements(eg 14times faster VM boot-time) with huge pages due to muchhigher reduction (97) in page fault overheads Since hugepages (and huge-huge pages) are supported by most general-purpose processors today [15 27] pre-zeroing pages is animportant optimization that is worth revisitingPre-zeroing offers another advantage for virtualized sys-

tems it increases the number of zero pages in the guestrsquosphysical address (GPA) space enabling opportunities for1Windows and FreeBSD implement async pre-zeroing in a limited fashion [419] However their huge page policies do not allow high latency allocationsThis idea is more relevant for Linux due to its synchronous huge pageallocation behaviour

content-based page-sharing at the virtualization host Weevaluate this aspect in sect4

To implement async pre-zeroing HawkEye manages freepages in the Linux buddy allocator through two lists zeroand non-zero Pages released by applications are first addedto the non-zero list while the zero list is preferentially usedfor allocation A rate-limited thread periodically transferspages from non-zero to zero lists after zero-filling themusing non-temporal writes Because pre-zeroing involves se-quential memory accesses non-temporal store instructionsprovide performance similar to regular (caching) store in-structions but without polluting the cache [3] Finally wenote that for copy-on-write or filesystem-backed memoryregions pre-zeroing may sometimes be unnecessary andwasteful This problem is avoidable by preferentially allo-cating pages for these memory regions from the non-zerolist

Overall we believe that async pre-zeroing is a compellingidea for modern workload requirements and modern hard-ware support In our evaluation we provide some early evi-dence to corroborate this claim with different workloads

32 Managing bloat vs performanceWe observe that the fundamental tension between MMUoverheads and memory bloat can be resolved Our approachstems from the insight that most allocations in large-memoryworkloads are typically ldquozero-filled page allocationsrdquo the re-maining are either filesystem-backed (eg through mmap)or copy-on-write (COW) pages However huge pages inmodern scale-out workloads are primarily used for ldquoanony-mousrdquo pages that are initially zero-filled by the kernel [54]eg Linux supports huge pages only for anonymous mem-ory This property of typical workloads and Linux allowsautomatic recovery of bloat under memory pressure

To state our approach succinctly we allocate huge pagesat the time of first page-fault but under memory pressureto recover unused memory we scan existing huge pages toidentify zero-filled baseline pages within them If the numberof zero-filled baseline pages inside a huge page is significant(ie beyond a threshold) we break the huge page into itsconstituent baseline pages and de-duplicate the zero-filledbaseline pages to a canonical zero-page through standardCOW page management techniques [37] In this approachit is possible for applicationsrsquo in-use zero-pages to also getde-duplicated While this can result in a marginally highernumber of COW page faults in rare situations this does notcompromise correctnessTo trigger recovery from memory bloat HawkEye uses

two watermarks on the amount of allocated memory inthe system high and low When the amount of allocatedmemory exceeds high (85 in our prototype) a rate-limitedbloat-recovery thread is activated which executes period-ically until the allocated memory falls below low (70 inour prototype) At each step the bloat-recovery thread

675554

1155

39 28 12 1 663274

9110

30

60

90

120di

stan

ce (b

ytes

)

Figure 3 Average distance to the first non-zero byte in base-line (4KB) pages First four bars represent the average of allworkloads in the respective benchmark suite

chooses the application whose huge page allocations need tobe scanned (and potentially demoted) based on the estimatedMMU overheads of that application the application with thelowest estimated MMU overheads is chosen first for scanningThis strategy ensures that the application that least requireshuge pages is considered first mdash this is consistent with ourhuge page allocation strategy (sect23)

While scanning a baseline page to verify if it is zero-filled, we stop on encountering the first non-zero byte in it. In practice, the number of bytes that need to be scanned per in-use (not zero-filled) page before a non-zero byte is encountered is very small: we measured this over a total of 56 diverse workloads and found that the average distance of the first non-zero byte in a 4KB page is only 9.11 (see Figure 3). Hence, only about ten bytes need to be scanned on average per in-use application page. For bloat pages, however, all 4096 bytes need to be scanned. This implies that the overheads of our bloat-recovery thread are largely proportional to the number of bloat pages in the system, and not to the total size of the allocated memory. This is an important property that allows our method to scale to large memory systems.
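The early-exit check itself is straightforward; a sketch (ours, not the kernel's) is below. Word-at-a-time comparison preserves the early-exit property: for a typical in-use page the loop terminates within the first couple of words.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true iff a 4KB baseline page is entirely zero-filled.
 * For in-use pages this typically exits after ~10 bytes (Figure 3);
 * only genuine bloat pages pay the full 4096-byte scan. */
static bool page_is_zero_filled(const void *page)
{
    const uint64_t *w = page;
    for (size_t i = 0; i < 4096 / sizeof(uint64_t); i++)
        if (w[i] != 0)
            return false;  /* first non-zero word: page is in use */
    return true;
}
```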

We note that our bloat-recovery procedure has many similarities with the standard de-duplication kernel threads used for content-based page sharing for virtual machines [67], e.g., the kernel same-page merging (ksm) thread in Linux. Unfortunately, in current kernels, the huge page management logic (e.g., khugepaged) and the content-based page sharing logic (e.g., ksm) are unconnected and can often interact in counter-productive ways [51]. Ingens and SmartMD [52] proposed coordinated mechanisms to avoid such conflicts: Ingens demotes only infrequently-accessed huge pages through ksm, while SmartMD demotes pages based on access-frequency and repetition rate (i.e., the number of shareable pages within a huge page). These techniques are useful for in-use pages, and our bloat-recovery proposal complements them by identifying unused zero-filled pages, which can execute much faster than typical same-page merging logic.

3.3 Fine-grained huge page promotion

An efficient huge page promotion strategy should try to maximize performance with a minimal number of huge page promotions.

Figure 4. A sample representation of access_map for three processes A, B, and C. Each access_map has buckets indexed 0-9; hot regions sit in higher buckets and are promoted before cold regions.

Current systems promote pages through a sequential scan from lower to higher VAs, which is inefficient for applications whose hot regions are not in the lower VA space. Our approach makes promotion decisions based on memory access patterns. First, we define an important metric used in HawkEye to improve the efficiency of huge page promotions. Access-coverage denotes the number of base pages that are accessed from a huge page sized region in short intervals. We sample the page table access bits at regular intervals and maintain an exponential moving average (EMA) of access-coverage across different samples. More precisely, we clear the page table access bits and test the number of set bits after 1 second to check how many pages were accessed; this process is repeated once every 30 seconds.
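The sampling step can be summarized as follows. This is a sketch under our own naming: the two helpers stand in for page-table walks over the region's 512 PTEs, and the smoothing factor is an assumed value, since the paper does not specify one.

```c
#include <unistd.h>  /* sleep() */

#define EMA_ALPHA 0.4  /* smoothing factor: an assumption, not from the paper */

struct region {
    double ema_coverage;  /* EMA of base pages accessed per sample, 0..512 */
};

/* Hypothetical helpers: walk the region's 512 PTEs. */
extern void clear_access_bits(struct region *r);
extern int count_set_access_bits(struct region *r);

/* Run once per 30-second sampling period. */
void sample_access_coverage(struct region *r)
{
    clear_access_bits(r);
    sleep(1);  /* wait 1 second, as in HawkEye's sampling scheme */
    int covered = count_set_access_bits(r);  /* 0..512 */

    r->ema_coverage = EMA_ALPHA * covered
                    + (1.0 - EMA_ALPHA) * r->ema_coverage;
}
```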

The access-coverage of a region potentially indicates its TLB space requirement (the number of base page entries required in the TLB), which we use to estimate the profitability of promoting it to a huge page. A region with high access-coverage is likely to exhibit high contention on the TLB, and hence is likely to benefit more from promotion.

HawkEye implements access-coverage based promotion using a per-process data structure called access_map, which is an array of buckets; a bucket contains huge page regions with similar access-coverage. On x86 with 2MB huge pages, the value of access-coverage is in the range 0-512. In our prototype, we maintain ten buckets in the access_map, which provides the necessary resolution across access-coverage values at relatively low overheads: regions with access-coverage of 0-49 are placed in bucket 0, regions with access-coverage of 50-99 are placed in bucket 1, and so on. Figure 4 shows an example state of the access_map of three processes A, B, and C. Regions can move up or down in their respective arrays after every sampling period, depending on their newly computed EMA-based access-coverage. If a region moves up in access_map, it is added to the head of its bucket; if a region moves down, it is added to the tail. Within a bucket, pages are promoted from head to tail. This strategy helps in prioritizing recently accessed regions within an index.
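A kernel-style sketch of the bucketing logic follows, using Linux's list API for illustration; the type and helper names are ours, and the bucket width follows the 0-49/50-99/... scheme just described.

```c
#include <linux/list.h>  /* kernel list API: sketch assumes a kernel context */

#define NUM_BUCKETS 10

struct hp_region {
    struct list_head node;
    double ema_coverage;  /* 0..512, from the sampling sketch above */
};

struct access_map {
    struct list_head bucket[NUM_BUCKETS];  /* bucket 0: 0-49, bucket 1: 50-99, ... */
};

static int bucket_index(int coverage)
{
    int idx = coverage / 50;
    return idx > NUM_BUCKETS - 1 ? NUM_BUCKETS - 1 : idx;  /* 500-512 -> bucket 9 */
}

/* After a sampling period, re-bucket a region: head-insert if it moved
 * up (hotter), tail-insert if it moved down. Promotion within a bucket
 * proceeds head to tail, so recently hot regions are promoted first. */
static void rebucket(struct access_map *m, struct hp_region *r, int old_idx)
{
    int new_idx = bucket_index((int)r->ema_coverage);
    list_del(&r->node);
    if (new_idx >= old_idx)
        list_add(&r->node, &m->bucket[new_idx]);       /* head */
    else
        list_add_tail(&r->node, &m->bucket[new_idx]);  /* tail */
}
```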

HawkEye promotes regions from higher to lower indices in access_map. Notice that our approach captures both recency and frequency of accesses: a region that has not been accessed, or has been accessed with low access-coverage, in recent sampling periods is likely to shift towards a lower index in access_map, or towards the tail in its current index. Promotion of cold regions is thus automatically deferred to prioritize hot regions.

We note that HawkEye's access_map is somewhat similar to the population_map and access_bitvector used in FreeBSD and Ingens respectively; these data structures are primarily used to capture utilization or page-access related metadata at huge page granularity. HawkEye's access_map additionally provides an index of the hotness of memory regions and enables fine-grained huge page promotion.

3.4 Huge page allocation across multiple processes

In our access-coverage based strategy, regions belonging to the process with the highest expected MMU overheads are promoted before others. This is also consistent with our notion of fairness (§2.3), as pages are promoted first from regions/processes with high expected TLB pressure (and consequently high expected MMU overheads).

Our HawkEye variant that does not rely on hardware performance counters (i.e., HawkEye-G) promotes regions from the non-empty highest access-coverage index across all processes. It is possible that multiple processes have non-empty buckets at the (globally) highest non-empty index; in this case, round-robin is used to ensure fairness among such processes. We explain this further with an example. Consider three applications A, B, and C with their VA regions arranged in the access_map as shown in Figure 4. HawkEye-G promotes regions in the following order in this example: A1, B1, C1, C2, B2, C3, C4, B3, B4, A2, C5, A3.

Recall, however, that MMU overheads may not necessarily be correlated with our access-coverage based estimation and may depend on other, more complex features of the access pattern (§2.4). To capture this, HawkEye-PMU first chooses the process with the highest measured MMU overheads, and then chooses regions from higher to lower indices in the selected process's access_map. Among processes with similar highest MMU overheads, round-robin is used.
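Putting §3.3 and §3.4 together, HawkEye-G's selection step might look as follows. This sketch continues the types from the previous sketch and uses our own names; HawkEye-PMU would instead pre-select the process with the highest measured MMU overheads before indexing its access_map. Removal of a promoted region from its bucket is not modeled here.

```c
/* Sketch: pick the next region to promote under HawkEye-G.
 * Scans buckets from hottest (9) to coldest (0); processes tied at
 * the same index are served round-robin via a rotating cursor. */
static int rr_cursor;

struct process {
    struct access_map amap;  /* from the previous sketch */
};

static struct hp_region *pick_next_promotion(struct process *procs, int n)
{
    for (int idx = NUM_BUCKETS - 1; idx >= 0; idx--) {
        for (int k = 0; k < n; k++) {
            struct process *p = &procs[(rr_cursor + k) % n];
            if (!list_empty(&p->amap.bucket[idx])) {
                rr_cursor = (rr_cursor + k + 1) % n;  /* advance round-robin */
                return list_first_entry(&p->amap.bucket[idx],
                                        struct hp_region, node);
            }
        }
    }
    return NULL;  /* nothing left to promote */
}
```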

3.5 Limitations and discussion

We briefly outline a few issues that are related to huge page management but are currently unhandled in HawkEye.

1) Thresholds for identifying memory pressure. We measure the extent of memory pressure with statically configured values for the low and high watermarks (i.e., 70% and 85% of total system memory) while dealing with memory bloat. Any strategy that relies on static thresholds faces the risk of being too conservative or overly aggressive when memory pressure consistently fluctuates. An ideal solution should adjust these thresholds dynamically to prevent unintended system behavior. The approach proposed by Guo et al. [51] in the context of memory deduplication for virtualized environments is relevant here.

2) Huge page starvation. While we believe our approach of allocating huge pages based on MMU overheads optimizes the system as a whole, unbounded huge page allocations to a single process can be thought of as a starvation problem for the other processes. An adversarial application can also potentially monopolize HawkEye to get more huge pages, or prevent other applications from getting a fair share of huge pages. Preliminary investigations show that our approach is reasonable even when the memory footprints of workloads differ by more than an order of magnitude. However, if limiting huge page allocations is still desirable, it seems reasonable to integrate such a policy with existing resource limiting/monitoring tools such as Linux's cgroups [18].

3) Other algorithms. We do not discuss some parts of the management algorithms, such as rate-limiting khugepaged to reduce promotion overheads, demotion based on low utilization to enable better sharing through same-page merging, techniques for minimizing the overhead of page-table access-bit tracking, and compaction algorithms. Much of this material has been discussed and evaluated extensively in the literature [32, 55, 59, 68], and we have not contributed significantly new approaches in these areas.

4 Evaluation

We now evaluate our algorithms in more detail. Our experimental platform is an Intel Haswell-EP based E5-2690 v3 server system running CentOS v7.4, on which 96GB memory and 48 cores (with hyperthreading enabled) running at 2.3GHz are partitioned across two sockets; we bind each workload to a single socket to avoid NUMA effects. The L1 TLB contains 64 and 8 entries for 4KB and 2MB pages respectively, while the L2 TLB contains 1024 entries for both 4KB and 2MB pages. The sizes of the L1, L2, and shared L3 caches are 768KB, 3MB, and 30MB, respectively. A 96GB SSD-backed swap partition is used to evaluate performance in an overcommitted system. We evaluate HawkEye with a diverse set of workloads ranging over HPC, graph algorithms, in-memory databases, genomics, and machine learning [24, 30, 33, 36, 45, 53, 65, 66]. HawkEye is implemented in Linux kernel v4.3.

We evaluate (a) the improvements due to our fine-grained, access-coverage based promotion strategy, in both performance and fairness, for single, multiple homogeneous, and multiple heterogeneous workloads; (b) the bloat-vs-performance tradeoff; (c) the impact of low-latency page faults; (d) the cache interference caused by the asynchronous pre-zeroing thread; and (e) the memory efficiency enabled by asynchronous page pre-zeroing.

Performance advantages of fine-grained huge page promotion. We first evaluate the effectiveness of our access-coverage based promotion strategy. For this, we measure the time required for our algorithm to recover from a fragmented state with high address translation overheads to a state with low address translation overheads.

Figure 5. Performance speedup (left) and time saved per huge page promotion in milliseconds (right), over baseline pages, for Graph500, XSBench, and cg.D under Linux, Ingens, HawkEye-PMU, and HawkEye-G.

Workload      Linux-4KB   Linux-2MB     Ingens        HawkEye-PMU   HawkEye-G
Graph500-1    2270        2145 (1.06)   2243 (1.01)   1987 (1.14)   2007 (1.13)
Graph500-2    2289        2252 (1.02)   2253 (1.02)   1994 (1.15)   2013 (1.14)
Graph500-3    2293        2293 (1.00)   2299 (1.00)   2012 (1.14)   2018 (1.14)
Average       2284        2230 (1.02)   2265 (1.01)   1998 (1.14)   2013 (1.13)
XSBench-1     2427        2415 (1.00)   2392 (1.01)   2108 (1.15)   2098 (1.15)
XSBench-2     2437        2427 (1.00)   2415 (1.01)   2109 (1.15)   2110 (1.15)
XSBench-3     2443        2455 (1.00)   2456 (1.00)   2133 (1.15)   2143 (1.14)
Average       2436        2432 (1.00)   2421 (1.00)   2117 (1.15)   2117 (1.15)

Table 5. Execution time (in seconds) of 3 instances of Graph500 and XSBench when executed simultaneously. Values in parentheses represent speedup over baseline pages.

Recall that HawkEye promotes pages based on access-coverage, while previous approaches promote pages in VA order (from low VAs to high). We initially fragment memory by reading several files into memory; our test workloads are started in this fragmented state, and we measure the time taken for the system to recover from high MMU overheads. This experimental setup simulates expected realistic situations where memory fragmentation in the system fluctuates over time. Without promotion, our test workloads would keep incurring high address translation overheads; HawkEye, however, is quickly able to recover from these overheads through appropriate huge page allocations. Figure 5 (left) shows the performance improvement obtained by HawkEye (over a strategy that never promotes pages) for three workloads: Graph500, XSBench, and cg.D. These workloads allocate all their required memory in the beginning, i.e., in the fragmented state of the system. For these workloads, the speedups due to effective huge page management through HawkEye are as high as 22%, and the access-coverage based huge page promotion strategy of HawkEye improves performance by 13%, 12%, and 6% over both Linux and Ingens.

To understand this more clearly, Figure 6 shows the access pattern, MMU overheads, and the number of allocated huge pages over time for Graph500 and XSBench. We find that hot-spots in these applications are concentrated in the high VAs, and any sequential-scanning based promotion is likely to be sub-optimal. The second and third columns corroborate this observation: for example, both HawkEye variants take ≈300 seconds to eliminate the MMU overheads of XSBench, while Linux and Ingens have high overheads even after 1000 seconds.

To quantify the cost-benefit of huge page promotions further, we propose a new metric: the average execution time saved (over using only baseline pages) per huge page promotion, i.e., the reduction in execution time divided by the number of promotions performed. A scheme that maximizes this metric is most effective at reducing MMU overheads. Figure 5 (right) shows that HawkEye performs significantly better than Linux and Ingens on this metric. The difference between the efficiency of HawkEye-PMU and HawkEye-G is also evident: HawkEye-PMU is more efficient, as it stops promoting huge pages when MMU overheads fall below a certain threshold (2% in our experiments). In summary, HawkEye-G and HawkEye-PMU are up to 6.7× and 4.4× more efficient (for XSBench) than Linux in terms of time saved per huge page promotion. For workloads whose access patterns are spread uniformly over the VA space, HawkEye delivers performance similar to Linux and Ingens.

Fairness advantages of fine-grained huge page promotion. We next experiment with multiple applications to study both performance and fairness, first with identical applications and then with heterogeneous applications running simultaneously.

Identical workloads. Figure 7 shows the MMU overheads and huge page allocations when 3 instances of Graph500 and of XSBench (in separate runs) are executed concurrently after fragmenting the system (see Table 5 for execution times).

While Linux creates performance imbalance by promoting huge pages in one process at a time, HawkEye ensures fairness by judiciously distributing huge pages across all workload instances. Overall, HawkEye-PMU and HawkEye-G achieve 1.14× and 1.13× speedup for Graph500, and 1.15× speedup for XSBench, over Linux on average. Unlike Linux, Ingens promotes huge pages proportionally in all 3 instances. However, it fails to improve the performance of these workloads; in fact, it may lead to poorer performance than Linux. We explain this behaviour with an example.

Linux selects an address from Graph500-1's address space and promotes all pages above it in around 10 minutes. Even though this strategy is unfair, it improves the performance of Graph500-1 by promoting its hot regions. Linux then selects Graph500-2, whose MMU overheads decrease after ≈20 minutes. In contrast, Ingens promotes huge pages from lower VAs in each instance of Graph500. Thus, it takes even longer for Ingens to promote the suitable (hot) regions of the respective processes, which leads to 9% performance degradation over Linux for Graph500. For XSBench, the performance of both Ingens and Linux is similar (but inferior to HawkEye) because both fail to promote the application's hot regions before the application finishes.

Heterogeneous workloads. To measure the efficacy of different strategies for heterogeneous workloads, we execute workloads grouped into sets, where a set contains one TLB sensitive and one TLB insensitive application.

Figure 6. Access-coverage across the virtual address space (left), MMU overheads over time (middle), and huge page promotions over time (right) for (a) Graph500 and (b) XSBench. HawkEye is more efficient in promoting huge pages due to fine-grained decision making.

Figure 7. Huge page promotions and their impact on TLB overheads over time for (a) Linux-2MB, (b) Ingens, (c) HawkEye-PMU, and (d) HawkEye-G, for three instances each of Graph500 and XSBench.

Each set is executed twice after fragmenting the system, to assess the impact of huge page policies on the order of execution. For the TLB insensitive workload, we execute a lightly loaded Redis server with 40M 1KB-sized keys, servicing 10K requests per second. Figure 8 shows the performance speedup over 4KB pages. The (Before) and (After) configurations denote whether the TLB sensitive workload is launched before or after Redis. Because Linux promotes huge pages in process launch order, the two configurations, (Before) and (After), are intended to cover its two different behaviours; Ingens and HawkEye are agnostic to process creation order.

Figure 8. Performance speedup over baseline pages of TLB sensitive applications (cactusADM, tigr, Graph500, lbm_s, SVM, XSBench, cg.D) when they are executed alongside a lightly stressed Redis server, in different orders (Before and After).

Figure 9. Performance compared to Linux in a virtualized system when HawkEye is applied at the host, guest, and both layers.

The impact of execution order is clearly visible for Linux. In the (Before) case, Linux with THP support improves performance over baseline pages by promoting huge pages in the TLB sensitive applications. However, in the (After) case, it promotes huge pages in Redis, resulting in poor performance for the TLB sensitive workloads. While the execution order does not matter for Ingens, its proportional huge page promotion strategy favors large-memory workloads (Redis in this case). Further, our client requests randomly selected keys, which makes the Redis server access all its huge pages uniformly. Consequently, Ingens promotes most huge pages in Redis, leading to suboptimal performance of the TLB sensitive workloads. In contrast, HawkEye promotes huge pages based on MMU overhead measurements (HawkEye-PMU) or access-coverage based estimation of MMU overheads (HawkEye-G). This leads to 15-60% performance improvement over 4KB pages, irrespective of the order of execution.

Performance in virtualized systems. For a virtualized system, we use the KVM hypervisor running with Ubuntu 16.04 and fragment the system prior to running the workloads. Evaluation is presented for three different configurations, i.e., when HawkEye is deployed at the host, the guest, and both layers. Each of these configurations requires running the same set of applications differently (see Table 6 for details).

Figure 9 shows that HawkEye provides 18-90% speedup compared to Linux in virtual environments. Notice that in some cases (e.g., cg.D), the performance improvement is higher in a virtualized system than on bare-metal. This is due to the high MMU overheads involved in traversing nested page tables while servicing TLB misses of guest applications, which presents greater opportunities for HawkEye to exploit.

Configuration    Details
HawkEye-Host     Two VMs, each with 24 vCPUs and 48GB memory.
                 VM-1 runs Redis; VM-2 runs TLB sensitive workloads.
HawkEye-Guest    Single VM with 48 vCPUs and 80GB memory,
                 running Redis and TLB sensitive workloads.
HawkEye-Both     Two VMs, each with 24 vCPUs. VM-1 (30GB) runs Redis;
                 VM-2 (60GB) runs both Redis and TLB sensitive workloads.

Table 6. Experimental setup for configurations used to evaluate a virtualized system.

Bloat vs. performance. We next study the relationship between memory bloat and performance; we revisit an experiment similar to the one summarized in §2.1. We populate a Redis instance with 8M (10B key, 4KB value) key-value pairs and then delete 60% of randomly selected keys (see Table 7). While Linux-4KB (no THP) is memory efficient (no bloat), its performance is low (7% lower throughput than Linux-2MB, i.e., with THP). Linux-2MB delivers high performance but consumes more memory, which remains unavailable to the rest of the system even when memory pressure increases. Ingens can be configured to balance the memory vs. performance tradeoff: for example, Ingens-90 (Ingens with a 90% utilization threshold) is memory efficient, while Ingens-50 (Ingens with a 50% utilization threshold) favors performance and allows more memory bloat. In either case, it is unable to recover from bloat that may have already been generated (e.g., during its aggressive phase). By switching from aggressive huge page allocation under low memory pressure to a memory-conserving strategy at runtime, HawkEye is able to achieve both low MMU overheads and high memory efficiency, depending on the state of the system.

Fast page faults. Table 8 shows the performance of different strategies for workloads whose performance depends on the efficiency of the OS page fault handler. We measure Redis throughput when 2MB-sized values are inserted. SparseHash [13] is a hash-map library in C++, while HACC-IO [8] is a parallel IO benchmark used with an in-memory file system. We also measure the spin-up time of two virtual machines, KVM and a Java Virtual Machine (JVM), where both are configured to allocate all memory during initialization.

Performance benefits of async page pre-zeroing with base pages (HawkEye-4KB) are modest; in this case, the other page fault overheads (apart from zeroing) dominate performance.

Kernel                       Self-tuning   Memory Size   Throughput
Linux-4KB                    No            16.2GB        106.1K
Linux-2MB                    No            33.2GB        113.8K
Ingens-90                    No            16.3GB        106.8K
Ingens-50                    No            33.1GB        113.4K
HawkEye (no mem pressure)    Yes           33.2GB        113.6K
HawkEye (mem pressure)       Yes           16.2GB        105.8K

Table 7. Redis memory consumption and throughput.

Workload              Linux-4KB   Linux-2MB   Ingens-90   HawkEye-4KB   HawkEye-2MB
Redis (45GB)          23.3        43.7        19.2        23.6          55.1
SparseHash (36GB)     50.1        17.2        51.5        46.6          10.6
HACC-IO (6GB)         6.5         4.5         6.6         6.5           4.2
JVM Spin-up (36GB)    37.7        18.6        52.7        29.8          13.7
KVM Spin-up (36GB)    40.6        9.7         41.8        30.2          0.70

Table 8. Performance implications of asynchronous page zeroing under different OS configurations. Values for Redis represent throughput (higher is better); all other values represent time in seconds (lower is better).

Figure 10. Performance overhead (%) of async pre-zeroing with caching stores and with non-temporal (nt) stores, for NPB, PARSEC, Graph500, cactusADM, PageRank, and omnetpp. The first two bars (NPB and PARSEC) represent the average of all workloads in the respective benchmark suite.

However, page-zeroing cost becomes significant with huge pages. Consequently, HawkEye with huge pages (HawkEye-2MB) improves the performance of Redis and SparseHash by 1.26× and 1.62× over Linux. The spin-up time of virtual machines is purely dominated by page faults; hence we observe a dramatic reduction in the spin-up time of VMs with async pre-zeroing of huge pages: 13.8× (KVM) and 1.36× (JVM) over Linux. Notice that without huge pages (only base pages), this reduction is only 1.34× and 1.26×. All these workloads have high spatial locality of page faults, and hence the Ingens strategy of utilization-based promotion has a negative impact on performance due to a higher number of page faults.

Async pre-zeroing enables memory sharing in virtualized environments. Finally, we note that async pre-zeroing within VMs enables memory sharing across VMs: the free memory of a VM returns to the host through pre-zeroing in the VM and same-page merging at the host. This can have the same net effect as ballooning, as the free memory of a VM gets shared with other VMs. In our experiments with memory-overcommitted systems, we have confirmed this behaviour: the overall performance of an overcommitted system where the VMs are running HawkEye matches the performance achieved through para-virtual interfaces such as memory balloon drivers.

Figure 11. Performance normalized to the case of no ballooning in an overcommitted virtualized system, for Linux, Linux-Ballooning, and HawkEye (workloads: PageRank, Redis, MongoDB, CC, Memcached, and SPD on Node 0 and Node 1).

For example, in an experiment involving a mix of latency-sensitive key-value stores and HPC workloads (see Figure 11), with a total peak memory consumption of around 150GB (1.5× of total memory), HawkEye-G provides 2.3× and 1.42× higher throughput for Redis and MongoDB, along with significant tail latency reductions. For PageRank, performance degrades slightly due to a higher number of COW page faults caused by unnecessary same-page merging. These results are very close to those obtained when memory ballooning is enabled on all VMs. We believe this is a potential positive of our design and may offer an alternative method for efficiently managing memory in overcommitted systems: it is well-known that balloon drivers and other para-virtual interfaces are riddled with compatibility problems [29], and a fully-virtual solution to this problem would be highly desirable. While our experiments show early promise in this direction, we leave an extensive evaluation for future work.

Performance overheads of HawkEye. We evaluated HawkEye on non-fragmented systems and with several workloads that don't benefit from huge pages, and find that the performance with HawkEye is very similar to that with Linux or Ingens in these scenarios, confirming that HawkEye's algorithms for access-bit tracking, asynchronous huge page promotion, and memory de-duplication add negligible overheads over existing mechanisms in Linux. However, the async pre-zeroing thread requires special attention, as it may become a source of noticeable performance overhead for certain types of workloads, as discussed next.

Overheads of async pre-zeroing. A primary concern with async pre-zeroing is its potential detrimental effect due to memory and cache interference (§3.1). Recall that we employ memory stores with non-temporal hints to avoid these effects. To measure the worst-case cache effects of async pre-zeroing, we run our workloads while simultaneously zero-filling pages on a separate core sharing the same L3 cache (i.e., on the same socket) at a high rate of 0.25M pages per second (1GBps), with and without non-temporal memory stores. Figure 10 shows that using non-temporal hints significantly brings down the overhead of async pre-zeroing (e.g., from 27% to 6% for omnetpp); the remaining overhead is due to the additional memory traffic generated by the pre-zeroing thread. Further, we note that this experiment presents a highly exaggerated behavior of async pre-zeroing on worst-case workloads; in practice, the zeroing thread is rate-limited (e.g., at most 10K pages per second) and the expected cache-interference overheads would be proportionally smaller.

Workload           MMU Overhead   4KB    HawkEye-PMU    HawkEye-G
random (4GB)       60%            582    328 (1.77×)    413 (1.41×)
sequential (4GB)   <1%            517    535            532
Total                             1099   863 (1.27×)    945 (1.16×)
cg.D (16GB)        39%            1952   1202 (1.62×)   1450 (1.35×)
mg.D (24GB)        <1%            1363   1364           1377
Total                             3315   2566 (1.29×)   2827 (1.17×)

Table 9. Comparison between HawkEye-PMU and HawkEye-G; times in seconds. Values in parentheses represent speedup over baseline pages.

Comparison between HawkEye-PMU and HawkEye-G. Even though HawkEye-G is based on simple and approximate estimations of TLB contention, it is reasonably accurate in identifying TLB sensitive processes and memory regions in most cases. However, HawkEye-PMU performs better than HawkEye-G in some cases. Table 9 demonstrates two such examples, where four workloads, all with high access-coverage but different MMU overheads, are executed in two sets; each set contains one TLB sensitive and one TLB insensitive workload. The 4KB column represents the case where no huge pages are promoted. As discussed earlier (§2.4), MMU overheads depend on the complex interaction between the hardware and the access pattern. In this case, despite having high access-coverage in their VA regions, sequential and mg.D have negligible MMU overheads (i.e., they are TLB insensitive). While HawkEye-PMU correctly identifies the process with higher MMU overheads for huge page allocation, HawkEye-G treats them similarly (due to imprecise estimations) and may allocate huge pages to the TLB insensitive process. Consequently, HawkEye-PMU may perform up to 36% better than HawkEye-G. Bridging the gap between the two approaches in a hardware-independent manner is interesting future work.
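For reference, the hardware-assisted measurement used by HawkEye-PMU can be summarized by the following ratio; this formalization is our paraphrase of the §2.4 definition of MMU overheads, not a new metric:

\[
\text{MMU overhead} = \frac{\text{cycles spent in page-table walks on TLB misses}}{\text{total execution cycles}} \times 100\%
\]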

5 Related Work

MMU overheads have been a topic of much research in recent years, and several solutions have been proposed, including solutions that are architecture-based, OS-based, and/or compiler-based [28, 34, 49, 56, 58, 62-64].

Hardware Support. Most proposals for hardware support aim to reduce TLB-miss frequency or accelerate TLB-miss processing. Large multi-level TLBs and support for multiple page sizes to minimize the number of TLB misses are already common in general-purpose processors today [15, 20, 21, 40]. Further, modern processors employ page-walk caches to avoid memory lookups in the TLB-miss path [31, 35]. POM-TLB services a TLB miss using a single memory lookup, and further leverages regular data caches to speed up address translation [63]. Direct segments [32, 46] were proposed to completely avoid TLB-miss processing cost through special segmentation hardware.

OS-level challenges of memory fragmentation have also been considered in hardware designs. CoLT, or Coalesced Large-reach TLBs [61], was initially proposed to increase TLB reach using base pages, based on the observation that OSs naturally provide contiguous mappings at smaller granularity. This approach was further extended to page-walk caches and huge pages [35, 40, 60]. Techniques to support huge pages in non-contiguous memory have also been proposed [44]. Hardware optimizations are important but complementary to HawkEye.

Software Support. FreeBSD's huge page management is largely influenced by previous work on superpages [57]. Carrefour-LP [47] showed that ad-hoc usage of huge pages can degrade performance in NUMA systems, due to additional remote memory accesses and traffic imbalance across node interconnects, and proposed hardware-profiling based techniques to overcome these problems. Anti-fragmentation techniques for clustering pages based on their mobility type, proposed by Gorman et al. [48], have been adopted by Linux to support the allocation of huge pages in long-running systems. Illuminator [59] highlighted critical drawbacks of Linux's fragmentation mitigation techniques and their implications on performance and latency, and proposed techniques to solve these problems. Mechanisms for dealing with physical memory fragmentation and NUMA systems are important but orthogonal problems, and their proposed solutions can be adopted to improve the robustness of HawkEye. Compiler or application hints can also be used to assist OSs in prioritizing huge page mappings for certain parts of the address space [27, 56], for which an interface is already provided by Linux through the madvise system call [16].

An alternative approach for supporting huge pages has also been explored via libhugetlbfs [22], where the user is provided more control over huge page allocation. However, such an approach requires manual intervention for reserving huge pages in advance, and considers each application in isolation. Windows and OS X support huge pages only through this reservation-based approach, to avoid issues associated with transparent huge page management [17, 55, 59]. We believe that insights from HawkEye can be leveraged to improve huge page support in these important systems.

6 Conclusions

To summarize: transparent huge page management is essential but complex. Effective and efficient algorithms are required in the OS to automatically balance the tradeoffs and provide performant and stable system behavior. We expose some important subtleties related to huge page management in existing proposals, and propose a new set of algorithms to address them in HawkEye.

References
[1] 2000. Clearing pages in the idle loop. https://www.mail-archive.com/freebsd-hackers@freebsd.org/msg13993.html
[2] 2000. Linux Page Zeroing Strategy. https://yarchive.net/comp/linux/page_zeroing_strategy.html
[3] 2007. Memory part 5: What programmers can do. https://lwn.net/Articles/255364
[4] 2010. Mysteries of Windows Memory Management Revealed: Part Two. https://blogs.msdn.microsoft.com/tims/2010/10/29/pdc10-mysteries-of-windows-memory-management-revealed-part-two
[5] 2012. khugepaged eating 100% CPU. https://bugzilla.redhat.com/show_bug.cgi?id=879801
[6] 2012. Recommendation to disable huge pages for Hadoop. https://developer.amd.com/wordpress/media/2012/10/Hadoop_Tuning_Guide-Version5.pdf
[7] 2014. Arch Linux becomes unresponsive from khugepaged. http://unix.stackexchange.com/questions/161858/arch-linux-becomes-unresponsive-from-khugepaged
[8] 2014. CORAL Benchmark Codes. https://asc.llnl.gov/CORAL-benchmarks/#hacc
[9] 2014. Recommendation to disable huge pages for NuoDB. http://www.nuodb.com/techblog/linux-transparent-huge-pages-jemalloc-and-nuodb
[10] 2014. Why TokuDB Hates Transparent HugePages. https://www.percona.com/blog/2014/07/23/why-tokudb-hates-transparent-hugepages
[11] 2015. The Black Magic Of Systematically Reducing Linux OS Jitter. http://highscalability.com/blog/2015/4/8/the-black-magic-of-systematically-reducing-linux-os-jitter.html
[12] 2015. Tales from the Field: Taming Transparent Huge Pages on Linux. https://www.perforce.com/blog/151016/tales-field-taming-transparent-huge-pages-linux
[13] 2016. C++ associative containers. https://github.com/sparsehash/sparsehash
[14] 2016. Remove PG_ZERO and zeroidle (page-zeroing) entirely. https://news.ycombinator.com/item?id=12227874
[15] 2017. Hugepages. https://wiki.debian.org/Hugepages
[16] 2017. madvise(2): Linux Programmer's Manual. http://man7.org/linux/man-pages/man2/madvise.2.html
[17] 2018. Large-Page Support in Windows. https://docs.microsoft.com/en-gb/windows/desktop/Memory/large-page-support
[18] 2019. cgroups(7): Linux Programmer's Manual. http://man7.org/linux/man-pages/man7/cgroups.7.html
[19] 2019. FreeBSD: Pre-Faulting and Zeroing Optimizations. https://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/prefault-optimizations.html
[20] 2019. Intel Haswell. http://www.7-cpu.com/cpu/Haswell.html
[21] 2019. Intel Skylake. http://www.7-cpu.com/cpu/Skylake.html
[22] 2019. libhugetlbfs: Linux man page. https://linux.die.net/man/7/libhugetlbfs
[23] 2019. Recommendation to disable huge pages for MongoDB. https://docs.mongodb.com/manual/tutorial/transparent-huge-pages
[24] 2019. Recommendation to disable huge pages for Redis. http://redis.io/topics/latency
[25] 2019. Recommendation to disable huge pages for VoltDB. https://docs.voltdb.com/AdminGuide/adminmemmgt.php
[26] 2019. Transparent Hugepage Support. https://www.kernel.org/doc/Documentation/vm/transhuge.txt
[27] Mohammad Agbarya, Idan Yaniv, and Dan Tsafrir. 2018. Memomania: From Huge to Huge-Huge Pages. In SYSTOR '18. ACM, 112. https://doi.org/10.1145/3211890.3211918
[28] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. 2017. Do-It-Yourself Virtual Memory Translation. In ISCA '17. ACM, 457-468. https://doi.org/10.1145/3079856.3080209
[29] Nadav Amit, Dan Tsafrir, and Assaf Schuster. 2014. VSwapper: A Memory Swapper for Virtualized Environments. In ASPLOS '14. ACM, 349-366. https://doi.org/10.1145/2541940.2541969
[30] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS Parallel Benchmarks: Summary and Preliminary Results. In Supercomputing '91. ACM, 158-165. https://doi.org/10.1145/125826.125925
[31] Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don't Walk (the Page Table). In ISCA '10. ACM, 48-59. https://doi.org/10.1145/1815961.1815970
[32] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient Virtual Memory for Big Memory Servers. In ISCA '13. ACM, 237-248. https://doi.org/10.1145/2485922.2485943
[33] Scott Beamer, Krste Asanović, and David A. Patterson. 2015. The GAP Benchmark Suite. CoRR abs/1508.03619. http://arxiv.org/abs/1508.03619
[34] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha Manne. 2008. Accelerating Two-dimensional Page Walks for Virtualized Systems. In ASPLOS XIII. ACM, 26-35. https://doi.org/10.1145/1346281.1346286
[35] Abhishek Bhattacharjee. 2013. Large-reach Memory Management Unit Caches. In MICRO-46. ACM, 383-394. https://doi.org/10.1145/2540708.2540741
[36] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In PACT '08. ACM, 72-81. https://doi.org/10.1145/1454115.1454128
[37] Daniel Bovet and Marco Cesati. 2005. Understanding the Linux Kernel. O'Reilly & Associates, Inc.
[38] Josiah L. Carlson. 2013. Redis in Action. Manning Publications Co., Greenwich, CT, USA.
[39] Jonathan Corbet. 2010. Memory compaction. https://lwn.net/Articles/368869
[40] Guilherme Cox and Abhishek Bhattacharjee. 2017. Efficient Address Translation for Architectures with Multiple Page Sizes. In ASPLOS '17. ACM, 435-448. https://doi.org/10.1145/3037697.3037704
[41] Cort Dougan, Paul Mackerras, and Victor Yodaiken. 1999. Optimizing the Idle Task and Other MMU Tricks. In OSDI '99. USENIX Association, 229-237.
[42] Niall Douglas. 2011. User Mode Memory Page Allocation: A Silver Bullet For Memory Allocation? CoRR abs/1105.1811. http://arxiv.org/abs/1105.1811
[43] Ulrich Drepper. 2007. What Every Programmer Should Know About Memory.
[44] Yu Du, Miao Zhou, Bruce R. Childers, Daniel Mossé, and Rami G. Melhem. 2015. Supporting superpages in non-contiguous physical memory. In HPCA 2015. IEEE, 223-234.
[45] M. Franklin, D. Yeung, Xue Wu, A. Jaleel, K. Albayraktaroglu, B. Jacob, and Chau-Wen Tseng. 2005. BioBench: A Benchmark Suite of Bioinformatics Applications. In ISPASS 2005. IEEE, 2-9. https://doi.org/10.1109/ISPASS.2005.1430554
[46] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2014. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. In MICRO-47. IEEE, 178-189. https://doi.org/10.1109/MICRO.2014.37
[47] Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston, Alexandra Fedorova, and Vivien Quéma. 2014. Large Pages May Be Harmful on NUMA Systems. In USENIX ATC '14. USENIX Association, 231-242.
[48] Mel Gorman and Patrick Healy. 2008. Supporting Superpage Allocation Without Additional Hardware Support. In ISMM '08. ACM, 41-50. https://doi.org/10.1145/1375634.1375641
[49] Mel Gorman and Patrick Healy. 2012. Performance Characteristics of Explicit Superpage Support. In ISCA '10. Springer-Verlag, 293-310. https://doi.org/10.1007/978-3-642-24322-6_24
[50] Mel Gorman and Andy Whitcroft. 2006. The What, The Why and the Where To of Anti-Fragmentation. In Linux Symposium, 141.
[51] Fei Guo, Seongbeom Kim, Yury Baskakov, and Ishan Banerjee. 2015. Proactively Breaking Large Pages to Improve Memory Overcommitment Performance in VMware ESXi. In VEE '15. ACM, 39-51. https://doi.org/10.1145/2731186.2731187
[52] Fan Guo, Yongkun Li, Yinlong Xu, Song Jiang, and John C. S. Lui. 2017. SmartMD: A High Performance Deduplication Engine with Mixed Pages. In USENIX ATC '17. USENIX Association, 733-744.
[53] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (Sept. 2006), 1-17. https://doi.org/10.1145/1186736.1186737
[54] Vasileios Karakostas, Osman S. Unsal, Mario Nemirovsky, Adrián Cristal, and Michael M. Swift. 2014. Performance analysis of the memory management unit under scale-out workloads. In IISWC 2014. IEEE, 1-12.
[55] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In OSDI '16. USENIX Association, 705-721.
[56] Joshua Magee and Apan Qasem. 2009. A Case for Compiler-driven Superpage Allocation. In ACM-SE 47. ACM, Article 82. https://doi.org/10.1145/1566445.1566553
[57] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. 2002. Practical, Transparent Operating System Support for Superpages. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 89-104. https://doi.org/10.1145/844128.844138
[58] Ashish Panwar, Naman Patel, and K. Gopinath. 2016. A Case for Protecting Huge Pages from the Kernel. In APSys '16. ACM, Article 15. https://doi.org/10.1145/2967360.2967371
[59] Ashish Panwar, Aravinda Prasad, and K. Gopinath. 2018. Making Huge Pages Actually Useful. In ASPLOS '18. ACM, 679-692. https://doi.org/10.1145/3173162.3173203
[60] Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel H. Loh. 2014. Increasing TLB reach by exploiting clustering in page translations. In HPCA 2014. IEEE, 558-567.
[61] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced Large-Reach TLBs. In MICRO-45. IEEE, 258-269. https://doi.org/10.1109/MICRO.2012.32
[62] Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee. 2015. Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways? In MICRO-48. ACM, 1-12. https://doi.org/10.1145/2830772.2830773
[63] Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. 2017. Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB. In ISCA '17. ACM, 469-480. https://doi.org/10.1145/3079856.3080210
[64] Indira Subramanian, Clifford Mather, Kurt Peterson, and Balakrishna Raghunath. 1998. Implementation of Multiple Pagesize Support in HP-UX. In USENIX Annual Technical Conference. 105-119.
[65] John R. Tramm, Andrew R. Siegel, Tanzima Islam, and Martin Schulz. 2014. XSBench: The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis. In PHYSOR 2014: The Role of Reactor Physics toward a Sustainable Future. Kyoto.
[66] Koji Ueno and Toyotaro Suzumura. 2012. Highly Scalable Graph Search for the Graph500 Benchmark. In HPDC '12. ACM, 149-160. https://doi.org/10.1145/2287076.2287104
[67] Carl A. Waldspurger. 2002. Memory Resource Management in VMware ESX Server. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 181-194. https://doi.org/10.1145/844128.844146
[68] Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards Practical Page Coloring-based Multicore Cache Management. In EuroSys '09. ACM, 89-102. https://doi.org/10.1145/1519065.1519076

  • Abstract
  • 1 Introduction
  • 2 Motivation
    • 21 Address Translation Overhead vs Memory Bloat
    • 22 Page fault latency vs Number of page faults
    • 23 Huge page allocation across multiple processes
    • 24 How to capture address translation overheads
      • 3 Design and Implementation
        • 31 Asynchronous page pre-zeroing
        • 32 Managing bloat vs performance
        • 33 Fine-grained huge page promotion
        • 34 Huge page allocation across multiple processes
        • 35 Limitations and discussion
          • 4 Evaluation
          • 5 Related Work
          • 6 Conclusions
          • References
Page 5: HawkEye: Efficient Fine-grained OS Support for …sbansal/pubs/hawkeye.pdfHawkEye: Efficient Fine-grained OS Support for Huge Pages Ashish Panwar Indian Institute of Science ashishpanwar@iisc.ac.in

Low Memory Pressure High Memory Pressure

1 Fewer Page Faults 2 Lowshylatency Allocation 3 High Performance

1 High Memory Efficiency 2 Efficient Huge Page Promotion 3 Fairness

Figure 2 Design objectives in HawkEye

and access-coverage (ie how many baseline pages are ac-cessed inside a huge page) measurements and (4) fairnessshould be based on estimation of MMU overheads

31 Asynchronous page pre-zeroingWe propose that page zeroing of available free pages shouldbe performed asynchronously in a rate-limited backgroundkernel thread to eliminate high latency allocation We callthis scheme async pre-zeroing Async pre-zeroing is a ratherold idea and had been discussed extensively among kerneldevelopers in early 2000s [1 2 14 41]1 Linux developersopined that async pre-zeroing is not an overall performancewin for two main reasons We think that it is time to revisitthese opinionsFirst the async pre-zeroing thread might interfere with

primary workloads by polluting the cache [1] In particu-lar async pre-zeroing suffers from the ldquodouble cache missrdquoproblem because it causes the same datum to be accessedtwice with a large re-use distance first for pre-zeroing andthen for the actual access by the application These extracache misses are expected to degrade overall performancein general However these problems are partially solvableon modern hardware that support memory writes with non-temporal hints non-temporal hints instruct the hardware tobypass caches during memory loadstore instructions [43]We find that using non-temporal hints during pre-zeroingsignificantly reduces both cache contention and the doublecache miss problem

Second there was no consensus or empirical evidence todemonstrate the benefits of page pre-zeroing for real work-loads [1 42] We note that the early discussions on pagepre-zeroing were evaluating trade-offs with baseline 4KBpages Our experiments corroborate the kernel developersrsquoobservation that despite reducing the page fault overheadby 25 pre-zeroing does not necessarily enable high per-formance with 4KB pages At the same time we also showthat it enables non-negligible performance improvements(eg 14times faster VM boot-time) with huge pages due to muchhigher reduction (97) in page fault overheads Since hugepages (and huge-huge pages) are supported by most general-purpose processors today [15 27] pre-zeroing pages is animportant optimization that is worth revisitingPre-zeroing offers another advantage for virtualized sys-

tems it increases the number of zero pages in the guestrsquosphysical address (GPA) space enabling opportunities for1Windows and FreeBSD implement async pre-zeroing in a limited fashion [419] However their huge page policies do not allow high latency allocationsThis idea is more relevant for Linux due to its synchronous huge pageallocation behaviour

content-based page-sharing at the virtualization host Weevaluate this aspect in sect4

To implement async pre-zeroing HawkEye manages freepages in the Linux buddy allocator through two lists zeroand non-zero Pages released by applications are first addedto the non-zero list while the zero list is preferentially usedfor allocation A rate-limited thread periodically transferspages from non-zero to zero lists after zero-filling themusing non-temporal writes Because pre-zeroing involves se-quential memory accesses non-temporal store instructionsprovide performance similar to regular (caching) store in-structions but without polluting the cache [3] Finally wenote that for copy-on-write or filesystem-backed memoryregions pre-zeroing may sometimes be unnecessary andwasteful This problem is avoidable by preferentially allo-cating pages for these memory regions from the non-zerolist

Overall we believe that async pre-zeroing is a compellingidea for modern workload requirements and modern hard-ware support In our evaluation we provide some early evi-dence to corroborate this claim with different workloads

32 Managing bloat vs performanceWe observe that the fundamental tension between MMUoverheads and memory bloat can be resolved Our approachstems from the insight that most allocations in large-memoryworkloads are typically ldquozero-filled page allocationsrdquo the re-maining are either filesystem-backed (eg through mmap)or copy-on-write (COW) pages However huge pages inmodern scale-out workloads are primarily used for ldquoanony-mousrdquo pages that are initially zero-filled by the kernel [54]eg Linux supports huge pages only for anonymous mem-ory This property of typical workloads and Linux allowsautomatic recovery of bloat under memory pressure

To state our approach succinctly we allocate huge pagesat the time of first page-fault but under memory pressureto recover unused memory we scan existing huge pages toidentify zero-filled baseline pages within them If the numberof zero-filled baseline pages inside a huge page is significant(ie beyond a threshold) we break the huge page into itsconstituent baseline pages and de-duplicate the zero-filledbaseline pages to a canonical zero-page through standardCOW page management techniques [37] In this approachit is possible for applicationsrsquo in-use zero-pages to also getde-duplicated While this can result in a marginally highernumber of COW page faults in rare situations this does notcompromise correctnessTo trigger recovery from memory bloat HawkEye uses

two watermarks on the amount of allocated memory inthe system high and low When the amount of allocatedmemory exceeds high (85 in our prototype) a rate-limitedbloat-recovery thread is activated which executes period-ically until the allocated memory falls below low (70 inour prototype) At each step the bloat-recovery thread

675554

1155

39 28 12 1 663274

9110

30

60

90

120di

stan

ce (b

ytes

)

Figure 3 Average distance to the first non-zero byte in base-line (4KB) pages First four bars represent the average of allworkloads in the respective benchmark suite

chooses the application whose huge page allocations need tobe scanned (and potentially demoted) based on the estimatedMMU overheads of that application the application with thelowest estimated MMU overheads is chosen first for scanningThis strategy ensures that the application that least requireshuge pages is considered first mdash this is consistent with ourhuge page allocation strategy (sect23)

While scanning a baseline page to verify if it is zero-filledwe stop on encountering the first non-zero byte in it Inpractice the number of bytes that need to be scanned perin-use (not zero-filled) page before a non-zero byte is en-countered is very small we measured this over a total of 56diverse workloads and found that the average distance of thefirst non-zero byte in a 4KB page is only 911 (see Figure 3)Hence only ten bytes need to be scanned on average perin-use application page For bloat pages however all 4096bytes need to be scanned This implies that the overheadsof our bloat-recovery thread are largely proportional tothe number of bloat pages in the system and not to the totalsize of the allocated memory This is an important propertythat allows our method to scale to large memory systemsWe note that our bloat-recovery procedure has many

similarities with the standard de-duplication kernel threadsused for content-based page sharing for virtual machines[67] eg the kernel same-page merging (ksm) thread inLinux Unfortunately in current kernels the huge page man-agement logic (eg khugepaged) and the content-based pagesharing logic (eg ksm) are unconnected and can often inter-act in counter-productive ways [51] Ingens and SmartMD[52] proposed coordinated mechanisms to avoid such con-flicts Ingens demotes only infrequently-accessed huge pagesthrough ksmwhile SmartMD demotes pages based on access-frequency and repetition rate (ie the number of shareablepages within a huge page) These techniques are useful for in-use pages and our bloat-recovery proposal complementsthem by identifying unused zero-filled pages which canexecute much faster than typical same-page merging logic

33 Fine-grained huge page promotionAn efficient huge page promotion strategy should try tomaximize performance with a minimal number of huge page

9876543210

B1

B2

B3 B4

B9876543210

C1

C3

C5

C

C4

C29876543210

access_map

A1

A2 A3

A

orde

r of p

rom

otio

n

hot-regions

cold-regions

access_map access_map

Figure 4 A sample representation of access_map for threeprocesses A B and C

promotions Current systems promote pages through a se-quential scan from lower to higher VAs which is inefficientfor applications whose hot regions are not in lower VA spaceOur approach makes promotion decisions based on memoryaccess patterns First we define an important metric used inHawkEye to improve the efficiency of huge page promotionsAccess-coverage denotes the number of base pages that areaccessed from a huge page sized region in short intervalsWe sample the page table access bits at regular intervals andmaintain the exponential moving average (EMA) of access-coverage across different samples More precisely we clearthe page table access bits and test the number of set bitsafter 1 second to check how many pages are accessed Thisprocess is repeated once every 30 secondsThe access-coverage of a region potentially indicates its

TLB space requirement (number of base page entries requiredin the TLB) which we use to estimate the profitability ofpromoting it to a huge page A region with high access-coverage is likely to exhibit high contention on TLB andhence likely to benefit more from promotionHawkEye implements access-coverage based promotion

HawkEye implements access-coverage based promotion using a per-process data structure called access_map, which is an array of buckets; a bucket contains huge page regions with similar access-coverage. On x86 with 2MB huge pages, the value of access-coverage is in the range 0-512. In our prototype, we maintain ten buckets in the access_map, which provides the necessary resolution across access-coverage values at relatively low overheads: regions with access-coverage of 0-49 are placed in bucket 0, regions with access-coverage of 50-99 are placed in bucket 1, and so on. Figure 4 shows an example state of the access_map of three processes A, B and C. Regions can move up or down in their respective arrays after every sampling period, depending on their newly computed EMA-based access-coverage. If a region moves up in access_map, it is added to the head of its bucket; if a region moves down, it is added to the tail. Within a bucket, pages are promoted from head to tail. This strategy helps in prioritizing recently accessed regions within an index, as sketched below.
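The bucket maintenance can be sketched as follows; struct node is a simplified stand-in for the kernel's struct list_head, and all names here are ours, not HawkEye's actual identifiers.

    #include <stddef.h>

    #define NBUCKETS 10

    /* Simplified list primitive: circular doubly-linked list with a
     * sentinel head per bucket. */
    struct node { struct node *prev, *next; };

    struct region {
        struct node link;                /* must stay the first member */
        int bucket;                      /* current index in access_map */
        unsigned int ema_coverage;       /* from the sampling sketch above */
    };

    struct process {
        struct node access_map[NBUCKETS];   /* bucket sentinels, 0..9 */
    };

    static void list_del(struct node *n)
    {
        n->prev->next = n->next;
        n->next->prev = n->prev;
    }

    static void list_add_head(struct node *h, struct node *n)
    {
        n->next = h->next; n->prev = h;
        h->next->prev = n; h->next = n;
    }

    static void list_add_tail(struct node *h, struct node *n)
    {
        n->prev = h->prev; n->next = h;
        h->prev->next = n; h->prev = n;
    }

    /* Bucket 0 holds coverage 0-49, bucket 1 holds 50-99, and so on;
     * coverage 450-512 maps to the top bucket, 9. */
    static int bucket_index(unsigned int coverage)
    {
        unsigned int b = coverage / 50;
        return b > 9 ? 9 : (int)b;
    }

    /* After each sampling period: a region whose EMA rose is queued at
     * the head of its new bucket (recently hot); one whose EMA fell goes
     * to the tail, so head-to-tail promotion favors recent accesses. */
    static void update_access_map(struct process *p, struct region *r)
    {
        int idx = bucket_index(r->ema_coverage);
        if (idx == r->bucket)
            return;
        list_del(&r->link);
        if (idx > r->bucket)
            list_add_head(&p->access_map[idx], &r->link);
        else
            list_add_tail(&p->access_map[idx], &r->link);
        r->bucket = idx;
    }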

HawkEye promotes regions from higher to lower indices in access_map. Notice that our approach captures both recency and frequency of accesses: a region that has not been accessed, or has been accessed with low access-coverage, in recent sampling periods is likely to shift towards a lower index in access_map, or towards the tail of its current index. Promotion of cold regions is thus automatically deferred to prioritize hot regions.

We note that HawkEye's access_map is somewhat similar to the population_map and access_bitvector used in FreeBSD and Ingens, respectively; these data structures are primarily used to capture utilization or page access related metadata at huge page granularity. HawkEye's access_map additionally provides an index of the hotness of memory regions and enables fine-grained huge page promotion.

3.4 Huge page allocation across multiple processes

In our access-coverage based strategy, regions belonging to a process with the highest expected MMU overheads are promoted before others. This is also consistent with our notion of fairness (§2.3), as pages are promoted first from regions/processes with high expected TLB pressure (and consequently high expected MMU overheads).

Our HawkEye variant that does not rely on hardware performance counters (i.e., HawkEye-G) promotes regions from the highest non-empty access-coverage index across all processes. It is possible that multiple processes have non-empty buckets at the (globally) highest non-empty index; in this case, round-robin is used to ensure fairness among such processes. We explain this further with an example. Consider three applications A, B and C, with their VA regions arranged in the access_map as shown in Figure 4. HawkEye-G promotes regions in the following order in this example:

A1, B1, C1, C2, B2, C3, C4, B3, B4, A2, C5, A3.

Recall, however, that MMU overheads may not necessarily be correlated with our access-coverage based estimation, and may depend on other, more complex features of the access pattern (§2.4). To capture this, HawkEye-PMU first chooses the process with the highest measured MMU overheads, and then chooses regions from higher to lower indices in the selected process's access_map. Among processes with similar highest MMU overheads, round-robin is used. A sketch of this selection logic appears below.
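The following minimal sketch illustrates HawkEye-G's global selection loop, reusing the types from the previous sketch; bucket_empty() and head_region() are illustrative helpers, and the real policy may differ in detail.

    #include <stddef.h>

    static int bucket_empty(const struct node *h)
    {
        return h->next == h;
    }

    static struct region *head_region(struct node *h)
    {
        return (struct region *)h->next;   /* link is the first member */
    }

    /* Pick the next region to promote: scan bucket indices from 9 down
     * to 0 and, at the highest globally non-empty index, rotate among
     * the processes that have regions there (*rr is the round-robin
     * cursor). HawkEye-PMU instead first fixes the process by measured
     * MMU overheads and only then walks that process's buckets. */
    struct region *next_promotion(struct process *procs[], int nprocs, int *rr)
    {
        for (int idx = NBUCKETS - 1; idx >= 0; idx--) {
            for (int k = 0; k < nprocs; k++) {
                struct process *p = procs[(*rr + k) % nprocs];
                if (!bucket_empty(&p->access_map[idx])) {
                    *rr = (*rr + k + 1) % nprocs;
                    return head_region(&p->access_map[idx]);
                }
            }
        }
        return NULL;                      /* nothing left to promote */
    }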

3.5 Limitations and discussion

We briefly outline a few issues that are related to huge page management but are currently unhandled in HawkEye.

1) Thresholds for identifying memory pressure. We measure the extent of memory pressure with statically configured values for low and high watermarks (i.e., 70% and 85% of total system memory) while dealing with memory bloat. Any strategy that relies on static thresholds faces the risk of being too conservative or overly aggressive when memory pressure fluctuates frequently. An ideal solution should adjust these thresholds dynamically to prevent unintended system behavior; the approach proposed by Guo et al. [51] in the context of memory deduplication for virtualized environments is relevant here. A minimal sketch of the static-watermark policy follows.
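This sketch shows only the static policy described above; the hysteresis between the two watermarks is our assumption about how such thresholds are typically used, not a statement of HawkEye's exact logic.

    /* Classify memory pressure using fixed fractions of total memory;
     * an adaptive scheme would tune LOW_WM/HIGH_WM at runtime instead. */
    enum hp_mode { HP_AGGRESSIVE, HP_CONSERVE };

    #define LOW_WM  70    /* percent of total memory in use */
    #define HIGH_WM 85

    static enum hp_mode huge_page_mode(unsigned int used_pct, enum hp_mode cur)
    {
        if (used_pct >= HIGH_WM)
            return HP_CONSERVE;      /* recover bloat, stop aggressive promotion */
        if (used_pct <= LOW_WM)
            return HP_AGGRESSIVE;    /* ample free memory: favor performance */
        return cur;                  /* between watermarks: keep current mode */
    }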

2) Huge page starvation. While we believe our approach of allocating huge pages based on MMU overheads optimizes the system as a whole, unbounded huge page allocations to a single process can be thought of as a starvation problem for the other processes. An adversarial application can also potentially monopolize HawkEye to get more huge pages, or prevent other applications from getting a fair share of huge pages. Preliminary investigations show that our approach is reasonable even when the memory footprints of workloads differ by more than an order of magnitude. However, if limiting huge page allocations is still desirable, it seems reasonable to integrate such a policy with existing resource limiting/monitoring tools such as Linux's cgroups [18].

3) Other algorithms. We do not discuss some parts of the management algorithms, such as rate-limiting khugepaged to reduce promotion overheads, demotion based on low utilization to enable better sharing through same-page merging, techniques for minimizing the overhead of page-table access-bit tracking, and compaction algorithms. Much of this material has been discussed and evaluated extensively in the literature [32, 55, 59, 68], and we have not contributed significantly new approaches in these areas.

4 Evaluation

We now evaluate our algorithms in more detail. Our experimental platform is an Intel Haswell-EP based E5-2690 v3 server system running CentOS v7.4, on which 96GB of memory and 48 cores (with hyperthreading enabled) running at 2.3GHz are partitioned across two sockets; we bind each workload to a single socket to avoid NUMA effects. The L1 TLB contains 64 and 8 entries for 4KB and 2MB pages respectively, while the L2 TLB contains 1024 entries for both 4KB and 2MB pages. The sizes of the L1, L2 and shared L3 caches are 768KB, 3MB and 30MB, respectively. A 96GB SSD-backed swap partition is used to evaluate performance in an overcommitted system. We evaluate HawkEye with a diverse set of workloads ranging over HPC, graph algorithms, in-memory databases, genomics and machine learning [24, 30, 33, 36, 45, 53, 65, 66]. HawkEye is implemented in Linux kernel v4.3.

We evaluate (a) the improvements due to our fine-grained promotion strategy based on access-coverage, in both performance and fairness, for single, multiple homogeneous, and multiple heterogeneous workloads; (b) the bloat-vs-performance tradeoff; (c) the impact of low-latency page faults; (d) the cache interference caused by the asynchronous pre-zeroing thread; and (e) the memory efficiency enabled by asynchronous page pre-zeroing.

Performance advantages of fine-grained huge page promotion. We first evaluate the effectiveness of our access-coverage based promotion strategy. For this, we measure the time required for our algorithm to recover from a fragmented state with high address translation overheads to a state with low address translation overheads.

[Figure 5: Performance speedup (left) and time saved per huge page promotion (right), over baseline pages, for Graph500, XSBench and cgD under Linux, Ingens, HawkEye-PMU and HawkEye-G.]

    Workload     Linux-4KB   Linux-2MB     Ingens        HawkEye-PMU   HawkEye-G
    Graph500-1   2270        2145 (1.06×)  2243 (1.01×)  1987 (1.14×)  2007 (1.13×)
    Graph500-2   2289        2252 (1.02×)  2253 (1.02×)  1994 (1.15×)  2013 (1.14×)
    Graph500-3   2293        2293 (1.00×)  2299 (1.00×)  2012 (1.14×)  2018 (1.14×)
    Average      2284        2230 (1.02×)  2265 (1.01×)  1998 (1.14×)  2013 (1.13×)
    XSBench-1    2427        2415 (1.00×)  2392 (1.01×)  2108 (1.15×)  2098 (1.15×)
    XSBench-2    2437        2427 (1.00×)  2415 (1.01×)  2109 (1.15×)  2110 (1.15×)
    XSBench-3    2443        2455 (1.00×)  2456 (1.00×)  2133 (1.15×)  2143 (1.14×)
    Average      2436        2432 (1.00×)  2421 (1.00×)  2117 (1.15×)  2117 (1.15×)

Table 5: Execution time (seconds) of 3 instances of Graph500 and XSBench when executed simultaneously. Values in parentheses represent speedup over baseline pages.

Recall that HawkEye promotes pages based on access-coverage, while previous approaches promote pages in VA order (from low VAs to high). We fragment memory initially by reading several files into memory; our test workloads are started in this fragmented state, and we measure the time taken for the system to recover from high MMU overheads. This experimental setup simulates expected realistic situations where memory fragmentation in the system fluctuates over time. Without promotion, our test workloads would keep incurring high address translation overheads; HawkEye, however, is quickly able to recover from these overheads through appropriate huge page allocations. Figure 5 (left) shows the performance improvement obtained by HawkEye (over a strategy that never promotes pages) for three workloads: Graph500, XSBench and cgD. These workloads allocate all their required memory in the beginning, i.e., in the fragmented state of the system. For these workloads, the speedups due to effective huge page management through HawkEye are as high as 22%. The access-coverage based huge page promotion strategy of HawkEye improves performance by 13%, 12% and 6% on the three workloads, over both Linux and Ingens.

To understand this more clearly, Figure 6 shows the access pattern, MMU overheads, and the number of allocated huge pages over time for Graph500 and XSBench. We find that hot-spots in these applications are concentrated in the high VAs, and any sequential-scanning based promotion is likely to be sub-optimal. The second and third columns corroborate this observation: for example, both HawkEye variants take ≈300 seconds to eliminate the MMU overheads of XSBench, while Linux and Ingens exhibit high overheads even after 1000 seconds.

To quantify the cost-benefit of huge page promotions further, we propose a new metric: the average execution time saved (over using only baseline pages) per huge page promotion, i.e., the total reduction in execution time divided by the number of promotions performed. A scheme that maximizes this metric is most effective at reducing MMU overheads. Figure 5 (right) shows that HawkEye performs significantly better than Linux and Ingens on this metric. The difference between the efficiency of HawkEye-PMU and HawkEye-G is also evident: HawkEye-PMU is more efficient as it stops promoting huge pages when MMU overheads fall below a certain threshold (2% in our experiments). In summary, HawkEye-G and HawkEye-PMU are up to 6.7× and 4.4× more efficient (for XSBench) than Linux in terms of time saved per huge page promotion. For workloads whose access patterns are spread uniformly over the VA space, HawkEye delivers performance similar to Linux and Ingens.

Fairness advantages of fine-grained huge page promotion. We next experiment with multiple applications to study both performance and fairness, first with identical applications and then with heterogeneous applications running simultaneously.

Identical workloads. Figure 7 shows the MMU overheads and huge page allocations when 3 instances of Graph500 and XSBench (in separate runs) are executed concurrently after fragmenting the system (see Table 5 for execution times).

While Linux creates performance imbalance by promoting huge pages in one process at a time, HawkEye ensures fairness by judiciously distributing huge pages across all workload instances. Overall, HawkEye-PMU and HawkEye-G achieve 1.14× and 1.13× speedups for Graph500, and 1.15× speedup for XSBench, over Linux on average. Unlike Linux, Ingens promotes huge pages proportionally in all 3 instances. However, it fails to improve the performance of these workloads; in fact, it may lead to poorer performance than Linux. We explain this behaviour with an example.

Linux selects an address from Graph500-1's address space and promotes all pages above it, in around 10 minutes. Even though this strategy is unfair, it improves the performance of Graph500-1 by promoting its hot regions. Linux then selects Graph500-2, whose MMU overheads decrease after ≈20 minutes. In contrast, Ingens promotes huge pages from lower VAs in each instance of Graph500; thus it takes even longer for Ingens to promote the suitable (hot) regions of the respective processes, which leads to a 9% performance degradation over Linux for Graph500. For XSBench, the performance of both Ingens and Linux is similar (but inferior to HawkEye) because both fail to promote the application's hot regions before the application finishes.

Heterogeneous workloads. To measure the efficacy of different strategies for heterogeneous workloads, we execute workloads by grouping them into sets, where a set contains one TLB-sensitive and one TLB-insensitive application.

[Figure 6: Access-coverage across the virtual address space (up to 2GB for Graph500, including a marked cold region, and up to 6GB for XSBench), MMU overhead over time, and huge pages allocated over time, for (a) Graph500 and (b) XSBench under Linux, Ingens, HawkEye-PMU and HawkEye-G. HawkEye is more efficient in promoting huge pages due to fine-grained decision making.]

[Figure 7: Huge page promotions and their impact on TLB overhead over time for (a) Linux-2MB, (b) Ingens, (c) HawkEye-PMU and (d) HawkEye-G, for three instances each of Graph500 and XSBench.]

Each set is executed twice after fragmenting the system, to assess the impact of huge page policies on the order of execution. For the TLB-insensitive workload, we execute a lightly loaded Redis server with 40M 1KB-sized keys, servicing 10K requests per second. Figure 8 shows the performance speedup over 4KB pages. The (Before) and (After) configurations denote whether the TLB-sensitive workload is launched before or after Redis. Because Linux promotes huge pages in process launch order, the two configurations are intended to cover its two different behaviours; Ingens and HawkEye are agnostic to process creation order.

[Figure 8: Performance speedup over baseline pages of TLB-sensitive applications (cactusADM, tigr, Graph500, lbm_s, SVM, XSBench, cgD) when executed alongside a lightly stressed Redis server, launched before and after Redis, under Linux, Ingens, HawkEye-PMU and HawkEye-G.]

[Figure 9: Performance compared to Linux in a virtualized system when HawkEye is applied at the host, the guest, and both layers.]

The impact of execution order is clearly visible for Linux. In the (Before) case, Linux with THP support improves performance over baseline pages by promoting huge pages in the TLB-sensitive applications. However, in the (After) case, it promotes huge pages in Redis, resulting in poor performance for the TLB-sensitive workloads. While execution order does not matter for Ingens, its proportional huge page promotion strategy favors large-memory workloads (Redis, in this case). Further, our client requests randomly selected keys, which makes the Redis server access all its huge pages uniformly; consequently, Ingens promotes most huge pages in Redis, leading to suboptimal performance of the TLB-sensitive workloads. In contrast, HawkEye promotes huge pages based on MMU overhead measurements (HawkEye-PMU) or on access-coverage based estimation of MMU overheads (HawkEye-G). This leads to 15-60% performance improvement over 4KB pages, irrespective of the order of execution.

Performance in virtualized systems. For a virtualized system, we use the KVM hypervisor running with Ubuntu 16.04, and fragment the system prior to running the workloads. Evaluation is presented for three different configurations, i.e., with HawkEye deployed at the host, the guest, and both layers; each configuration requires running the same set of applications differently (see Table 6 for details). Figure 9 shows that HawkEye provides 18-90% speedup compared to Linux in virtual environments. Notice that in some cases (e.g., cgD), the performance improvement is higher in a virtualized system than on bare-metal. This is due to the high MMU overheads involved in traversing nested page tables while servicing TLB misses of guest applications, which presents higher opportunities for HawkEye to exploit.

    Configuration    Details
    HawkEye-Host     Two VMs, each with 24 vCPUs and 48GB memory; VM-1 runs Redis,
                     VM-2 runs the TLB-sensitive workloads
    HawkEye-Guest    Single VM with 48 vCPUs and 80GB memory, running Redis and the
                     TLB-sensitive workloads
    HawkEye-Both     Two VMs, each with 24 vCPUs; VM-1 (30GB) runs Redis, VM-2 (60GB)
                     runs both Redis and the TLB-sensitive workloads

Table 6: Experimental setup for the configurations used to evaluate a virtualized system.

Bloat vs. performance. We next study the relationship between memory bloat and performance; we revisit an experiment similar to the one summarized in §2.1. We populate a Redis instance with 8M key-value pairs (10B keys, 4KB values) and then delete 60% of the keys, selected at random (see Table 7). While Linux-4KB (no THP) is memory efficient (no bloat), its performance is low (7% lower throughput than Linux-2MB, i.e., with THP). Linux-2MB delivers high performance but consumes more memory, which remains unavailable to the rest of the system even when memory pressure increases. Ingens can be configured to balance the memory-vs-performance tradeoff: for example, Ingens-90 (Ingens with a 90% utilization threshold) is memory efficient, while Ingens-50 (Ingens with a 50% utilization threshold) favors performance and allows more memory bloat. In either case, it is unable to recover from bloat that may already have been generated (e.g., during its aggressive phase). By switching at runtime from aggressive huge page allocation under low memory pressure to a memory-conserving strategy, HawkEye is able to achieve both low MMU overheads and high memory efficiency, depending on the state of the system.

Fast page faults. Table 8 shows the performance of different strategies for workloads whose performance depends on the efficiency of the OS page fault handler. We measure Redis throughput when 2MB-sized values are inserted. SparseHash [13] is a hash-map library in C++, while HACC-IO [8] is a parallel I/O benchmark used with an in-memory file system. We also measure the spin-up time of two virtual machines, KVM and a Java Virtual Machine (JVM), where both are configured to allocate all memory during initialization.

Performance benefits of async page pre-zeroing with base pages (HawkEye-4KB) are modest; in this case, the other page fault overheads (apart from zeroing) dominate performance.

    Kernel                       Self-tuning   Memory Size   Throughput
    Linux-4KB                    No            16.2GB        106.1K
    Linux-2MB                    No            33.2GB        113.8K
    Ingens-90                    No            16.3GB        106.8K
    Ingens-50                    No            33.1GB        113.4K
    HawkEye (no mem pressure)    Yes           33.2GB        113.6K
    HawkEye (mem pressure)       Yes           16.2GB        105.8K

Table 7: Redis memory consumption and throughput.

    Workload              Linux-4KB   Linux-2MB   Ingens-90   HawkEye-4KB   HawkEye-2MB
    Redis (45GB)          233         437         192         236           551
    SparseHash (36GB)     50.1        17.2        51.5        46.6          10.6
    HACC-IO (6GB)         6.5         4.5         6.6         6.5           4.2
    JVM Spin-up (36GB)    37.7        18.6        52.7        29.8          13.7
    KVM Spin-up (36GB)    40.6        9.7         41.8        30.2          0.70

Table 8: Performance implications of asynchronous page zeroing. Values for Redis represent throughput (higher is better); all other values represent time in seconds (lower is better).

[Figure 10: Performance overhead of async pre-zeroing with caching stores and with non-temporal (nt) stores, for NPB, PARSEC, Graph500, cactusADM, PageRank and omnetpp. The first two bars (NPB and PARSEC) represent the average of all workloads in the respective benchmark suite.]

However, the page-zeroing cost becomes significant with huge pages. Consequently, HawkEye with huge pages (HawkEye-2MB) improves the performance of Redis and SparseHash by 1.26× and 1.62× over Linux. The spin-up time of virtual machines is dominated purely by page faults; hence we observe a dramatic reduction in the spin-up time of VMs with async pre-zeroing of huge pages: 13.8× and 1.36× over Linux, for KVM and the JVM respectively. Notice that without huge pages (i.e., with only base pages), this reduction is only 1.34× and 1.26×. All these workloads have high spatial locality of page faults, and hence the Ingens strategy of utilization-based promotion has a negative impact on performance due to a higher number of page faults.

Async pre-zeroing enables memory sharing in virtualized environments. Finally, we note that async pre-zeroing within VMs enables memory sharing across VMs: the free memory of a VM returns to the host through pre-zeroing in the VM and same-page merging at the host. This can have the same net effect as ballooning, as the free memory in a VM gets shared with other VMs. In our experiments with memory-overcommitted systems, we have confirmed this behaviour: the overall performance of an overcommitted system where the VMs run HawkEye matches the performance achieved through para-virtual interfaces such as memory balloon drivers.

[Figure 11: Performance normalized to the case of no ballooning in an overcommitted virtualized system, for PageRank, Redis, MongoDB, CC, Memcached and SPD across two NUMA nodes, under Linux, Linux-Ballooning and HawkEye.]

For example, in an experiment involving a mix of latency-sensitive key-value stores and HPC workloads (see Figure 11), with a total peak memory consumption of around 150GB (1.5× of total memory), HawkEye-G provides 2.3× and 1.42× higher throughput for Redis and MongoDB, along with a significant reduction in tail latency. For PageRank, performance degrades slightly due to a higher number of copy-on-write page faults caused by unnecessary same-page merging. These results are very close to those obtained when memory ballooning is enabled on all VMs. We believe this is a potential positive of our design, and may offer an alternative method to efficiently manage memory in overcommitted systems: it is well-known that balloon drivers and other para-virtual interfaces are riddled with compatibility problems [29], and a fully-virtual solution to this problem would be highly desirable. While our experiments show early promise in this direction, we leave an extensive evaluation for future work.

Performance overheads of HawkEye. We evaluated HawkEye on non-fragmented systems and with several workloads that don't benefit from huge pages, and found that its performance is very similar to that of Linux or Ingens in these scenarios, confirming that HawkEye's algorithms for access-bit tracking, asynchronous huge page promotion, and memory de-duplication add negligible overheads over the existing mechanisms in Linux. However, the async pre-zeroing thread requires special attention, as it may become a source of noticeable performance overhead for certain types of workloads, as discussed next.

Overheads of async pre-zeroing. A primary concern with async pre-zeroing is its potential detrimental effect due to memory and cache interference (§3.1). Recall that we employ memory stores with non-temporal hints to avoid these effects (sketched below). To measure the worst-case cache effects of async pre-zeroing, we run our workloads while simultaneously zero-filling pages on a separate core sharing the same L3 cache (i.e., on the same socket) at a high rate of 0.25M pages per second (1GB/s), with and without non-temporal memory stores. Figure 10 shows that using non-temporal hints brings down the overhead of async pre-zeroing significantly (e.g., from 27% to 6% for omnetpp); the remaining overhead is due to the additional memory traffic generated by the pre-zeroing thread. Further, we note that this experiment presents a highly exaggerated behavior of async pre-zeroing on worst-case workloads: in practice, the zeroing thread is rate-limited (e.g., at most 10K pages per second), and the expected cache-interference overheads would be proportionally smaller.
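For illustration, the following is a user-space sketch of zeroing one 4KB page with non-temporal (streaming) stores using SSE2 intrinsics; an in-kernel implementation would issue the corresponding movnti/movntdq instructions directly, and the function name is ours.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    /* Zero a 4KB page with streaming stores so the zeroed lines bypass
     * the cache hierarchy; assumes `page` is at least 16-byte aligned. */
    static void zero_page_nt(void *page)
    {
        __m128i zero = _mm_setzero_si128();
        char *p = (char *)page;

        for (size_t i = 0; i < 4096; i += 64) {       /* one cache line per pass */
            _mm_stream_si128((__m128i *)(p + i),      zero);
            _mm_stream_si128((__m128i *)(p + i + 16), zero);
            _mm_stream_si128((__m128i *)(p + i + 32), zero);
            _mm_stream_si128((__m128i *)(p + i + 48), zero);
        }
        _mm_sfence();   /* order streaming stores before the page is handed out */
    }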

    Workload           MMU Overhead   4KB    HawkEye-PMU    HawkEye-G
    random (4GB)       60%            582    328 (1.77×)    413 (1.41×)
    sequential (4GB)   < 1%           517    535            532
    Total                             1099   863 (1.27×)    945 (1.16×)
    cgD (16GB)         39%            1952   1202 (1.62×)   1450 (1.35×)
    mgD (24GB)         < 1%           1363   1364           1377
    Total                             3315   2566 (1.29×)   2827 (1.17×)

Table 9: Comparison between HawkEye-PMU and HawkEye-G (execution time in seconds). Values in parentheses represent speedup over baseline pages.

Comparison between HawkEye-PMU and HawkEye-G. Even though HawkEye-G is based on simple and approximate estimations of TLB contention, it is reasonably accurate in identifying TLB-sensitive processes and memory regions in most cases. However, HawkEye-PMU performs better than HawkEye-G in some cases. Table 9 shows two such examples, where four workloads, all with high access-coverage but different MMU overheads, are executed in two sets; each set contains one TLB-sensitive and one TLB-insensitive workload. The 4KB column represents the case where no huge pages are promoted. As discussed earlier (§2.4), MMU overheads depend on the complex interaction between the hardware and the access pattern: despite having high access-coverage in their VA regions, sequential and mgD have negligible MMU overheads (i.e., they are TLB-insensitive). While HawkEye-PMU correctly identifies the process with higher MMU overheads for huge page allocation, HawkEye-G treats them similarly (due to imprecise estimations) and may allocate huge pages to the TLB-insensitive process. Consequently, HawkEye-PMU may perform up to 36% better than HawkEye-G. Bridging this gap between the two approaches in a hardware-independent manner is interesting future work.

5 Related Work

MMU overheads have been a topic of much research in recent years, and several solutions have been proposed, including solutions that are architecture-based, OS-based and/or compiler-based [28, 34, 49, 56, 58, 62-64].

Hardware support. Most proposals for hardware support aim to reduce TLB-miss frequency or accelerate TLB-miss processing. Large multi-level TLBs and support for multiple page sizes to minimize the number of TLB misses are already common in general-purpose processors today [15, 20, 21, 40]. Further, modern processors employ page-walk caches to avoid memory lookups in the TLB-miss path [31, 35]. POM-TLB services a TLB miss using a single memory lookup, and further leverages the regular data caches to speed up address translation [63]. Direct segments [32, 46] were proposed to completely avoid TLB-miss processing cost through special segmentation hardware. OS-level challenges of memory fragmentation have also been considered in hardware designs: CoLT, or Coalesced Large-reach TLBs [61], was initially proposed to increase TLB reach using base pages, based on the observation that OSs naturally provide contiguous mappings at smaller granularity; this approach was further extended to page-walk caches and huge pages [35, 40, 60]. Techniques to support huge pages in non-contiguous memory have also been proposed [44]. Hardware optimizations are important but complementary to HawkEye.

Software support. FreeBSD's huge page management is largely influenced by previous work on superpages [57]. Carrefour-LP [47] showed that ad-hoc usage of huge pages can degrade performance on NUMA systems, due to additional remote memory accesses and traffic imbalance across node interconnects, and proposed hardware-profiling based techniques to overcome these problems. Anti-fragmentation techniques for clustering pages based on their mobility type, proposed by Gorman et al. [48], have been adopted by Linux to support the allocation of huge pages in long-running systems. Illuminator [59] highlighted critical drawbacks of Linux's fragmentation-mitigation techniques and their implications on performance and latency, and proposed techniques to solve these problems. Mechanisms for dealing with physical memory fragmentation and NUMA systems address important but orthogonal problems, and their proposed solutions can be adopted to improve the robustness of HawkEye. Compiler or application hints can also be used to assist OSs in prioritizing huge page mappings for certain parts of the address space [27, 56], for which Linux already provides an interface through the madvise system call [16].

An alternative approach for supporting huge pages has also been explored via libhugetlbfs [22], where the user is provided more control over huge page allocation. However, such an approach requires manual intervention for reserving huge pages in advance, and considers each application in isolation. Windows and OS X support huge pages only through this reservation-based approach, to avoid the issues associated with transparent huge page management [17, 55, 59]. We believe that insights from HawkEye can be leveraged to improve huge page support in these important systems.

6 Conclusions

To summarize, transparent huge page management is essential but complex. Effective and efficient algorithms are required in the OS to automatically balance the tradeoffs and provide performant and stable system behavior. We expose some important subtleties related to huge page management in existing proposals, and propose a new set of algorithms in HawkEye to address them.

References

[1] 2000. Clearing pages in the idle loop. https://www.mail-archive.com/freebsd-hackers@freebsd.org/msg13993.html
[2] 2000. Linux Page Zeroing Strategy. https://yarchive.net/comp/linux/page_zeroing_strategy.html
[3] 2007. Memory part 5: What programmers can do. https://lwn.net/Articles/255364/
[4] 2010. Mysteries of Windows Memory Management Revealed, Part Two. https://blogs.msdn.microsoft.com/tims/2010/10/29/pdc10-mysteries-of-windows-memory-management-revealed-part-two/
[5] 2012. khugepaged eating 100% CPU. https://bugzilla.redhat.com/show_bug.cgi?id=879801
[6] 2012. Recommendation to disable huge pages for Hadoop. https://developer.amd.com/wordpress/media/2012/10/Hadoop_Tuning_Guide-Version5.pdf
[7] 2014. Arch Linux becomes unresponsive from khugepaged. http://unix.stackexchange.com/questions/161858/arch-linux-becomes-unresponsive-from-khugepaged
[8] 2014. CORAL Benchmark Codes. https://asc.llnl.gov/CORAL-benchmarks/#hacc
[9] 2014. Recommendation to disable huge pages for NuoDB. http://www.nuodb.com/techblog/linux-transparent-huge-pages-jemalloc-and-nuodb
[10] 2014. Why TokuDB Hates Transparent HugePages. https://www.percona.com/blog/2014/07/23/why-tokudb-hates-transparent-hugepages/
[11] 2015. The Black Magic Of Systematically Reducing Linux OS Jitter. http://highscalability.com/blog/2015/4/8/the-black-magic-of-systematically-reducing-linux-os-jitter.html
[12] 2015. Tales from the Field: Taming Transparent Huge Pages on Linux. https://www.perforce.com/blog/151016/tales-field-taming-transparent-huge-pages-linux
[13] 2016. C++ associative containers. https://github.com/sparsehash/sparsehash
[14] 2016. Remove PG_ZERO and zeroidle (page-zeroing) entirely. https://news.ycombinator.com/item?id=12227874
[15] 2017. Hugepages. https://wiki.debian.org/Hugepages
[16] 2017. MADVISE. Linux Programmer's Manual. http://man7.org/linux/man-pages/man2/madvise.2.html
[17] 2018. Large-Page Support in Windows. https://docs.microsoft.com/en-gb/windows/desktop/Memory/large-page-support
[18] 2019. CGROUPS. Linux Programmer's Manual. http://man7.org/linux/man-pages/man7/cgroups.7.html
[19] 2019. FreeBSD Pre-Faulting and Zeroing Optimizations. https://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/prefault-optimizations.html
[20] 2019. Intel Haswell. http://www.7-cpu.com/cpu/Haswell.html
[21] 2019. Intel Skylake. http://www.7-cpu.com/cpu/Skylake.html
[22] 2019. Libhugetlbfs: Linux man page. https://linux.die.net/man/7/libhugetlbfs
[23] 2019. Recommendation to disable huge pages for MongoDB. https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
[24] 2019. Recommendation to disable huge pages for Redis. http://redis.io/topics/latency
[25] 2019. Recommendation to disable huge pages for VoltDB. https://docs.voltdb.com/AdminGuide/adminmemmgt.php
[26] 2019. Transparent Hugepage Support. https://www.kernel.org/doc/Documentation/vm/transhuge.txt
[27] Mohammad Agbarya, Idan Yaniv, and Dan Tsafrir. 2018. Memomania: From Huge to Huge-Huge Pages. In Proceedings of the 11th ACM International Systems and Storage Conference (SYSTOR '18). ACM, New York, NY, USA, 112. https://doi.org/10.1145/3211890.3211918
[28] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. 2017. Do-It-Yourself Virtual Memory Translation. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, 457-468. https://doi.org/10.1145/3079856.3080209
[29] Nadav Amit, Dan Tsafrir, and Assaf Schuster. 2014. VSwapper: A Memory Swapper for Virtualized Environments. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, 349-366. https://doi.org/10.1145/2541940.2541969
[30] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS Parallel Benchmarks: Summary and Preliminary Results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91). ACM, 158-165. https://doi.org/10.1145/125826.125925
[31] Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don't Walk (the Page Table). In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM, 48-59. https://doi.org/10.1145/1815961.1815970
[32] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient Virtual Memory for Big Memory Servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, 237-248. https://doi.org/10.1145/2485922.2485943
[33] Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. CoRR abs/1508.03619. http://arxiv.org/abs/1508.03619
[34] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha Manne. 2008. Accelerating Two-dimensional Page Walks for Virtualized Systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XIII). ACM, 26-35. https://doi.org/10.1145/1346281.1346286
[35] Abhishek Bhattacharjee. 2013. Large-reach Memory Management Unit Caches. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, 383-394. https://doi.org/10.1145/2540708.2540741
[36] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08). ACM, 72-81. https://doi.org/10.1145/1454115.1454128
[37] Daniel Bovet and Marco Cesati. 2005. Understanding The Linux Kernel. O'Reilly & Associates Inc.
[38] Josiah L. Carlson. 2013. Redis in Action. Manning Publications Co., Greenwich, CT, USA.
[39] Jonathan Corbet. 2010. Memory compaction. https://lwn.net/Articles/368869/
[40] Guilherme Cox and Abhishek Bhattacharjee. 2017. Efficient Address Translation for Architectures with Multiple Page Sizes. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17). ACM, 435-448. https://doi.org/10.1145/3037697.3037704
[41] Cort Dougan, Paul Mackerras, and Victor Yodaiken. 1999. Optimizing the Idle Task and Other MMU Tricks. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (OSDI '99). USENIX Association, 229-237. http://dl.acm.org/citation.cfm?id=296806.296833
[42] Niall Douglas. 2011. User Mode Memory Page Allocation: A Silver Bullet For Memory Allocation. CoRR abs/1105.1811. http://arxiv.org/abs/1105.1811
[43] Ulrich Drepper. 2007. What Every Programmer Should Know About Memory.
[44] Yu Du, Miao Zhou, Bruce R. Childers, Daniel Mossé, and Rami G. Melhem. 2015. Supporting superpages in non-contiguous physical memory. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 223-234.
[45] M. Franklin, D. Yeung, Xue Wu, A. Jaleel, K. Albayraktaroglu, B. Jacob, and Chau-Wen Tseng. 2005. BioBench: A Benchmark Suite of Bioinformatics Applications. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005), 2-9. https://doi.org/10.1109/ISPASS.2005.1430554
[46] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2014. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, 178-189. https://doi.org/10.1109/MICRO.2014.37
[47] Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston, Alexandra Fedorova, and Vivien Quéma. 2014. Large Pages May Be Harmful on NUMA Systems. In Proceedings of the 2014 USENIX Annual Technical Conference (USENIX ATC '14). USENIX Association, 231-242. http://dl.acm.org/citation.cfm?id=2643634.2643659
[48] Mel Gorman and Patrick Healy. 2008. Supporting Superpage Allocation Without Additional Hardware Support. In Proceedings of the 7th International Symposium on Memory Management (ISMM '08). ACM, 41-50. https://doi.org/10.1145/1375634.1375641
[49] Mel Gorman and Patrick Healy. 2012. Performance Characteristics of Explicit Superpage Support. In Proceedings of the 2010 International Conference on Computer Architecture (ISCA '10). Springer-Verlag, Berlin, Heidelberg, 293-310. https://doi.org/10.1007/978-3-642-24322-6_24
[50] Mel Gorman and Andy Whitcroft. 2006. The What, The Why and the Where To of Anti-Fragmentation. In Linux Symposium, 141.
[51] Fei Guo, Seongbeom Kim, Yury Baskakov, and Ishan Banerjee. 2015. Proactively Breaking Large Pages to Improve Memory Overcommitment Performance in VMware ESXi. In Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '15). ACM, 39-51. https://doi.org/10.1145/2731186.2731187
[52] Fan Guo, Yongkun Li, Yinlong Xu, Song Jiang, and John C. S. Lui. 2017. SmartMD: A High Performance Deduplication Engine with Mixed Pages. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC '17). USENIX Association, 733-744. http://dl.acm.org/citation.cfm?id=3154690.3154759
[53] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (Sept. 2006), 1-17. https://doi.org/10.1145/1186736.1186737
[54] Vasileios Karakostas, Osman S. Unsal, Mario Nemirovsky, Adrián Cristal, and Michael M. Swift. 2014. Performance analysis of the memory management unit under scale-out workloads. In 2014 IEEE International Symposium on Workload Characterization (IISWC), 1-12.
[55] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, 705-721. http://dl.acm.org/citation.cfm?id=3026877.3026931
[56] Joshua Magee and Apan Qasem. 2009. A Case for Compiler-driven Superpage Allocation. In Proceedings of the 47th Annual Southeast Regional Conference (ACM-SE 47). ACM, Article 82. https://doi.org/10.1145/1566445.1566553
[57] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. 2002. Practical, Transparent Operating System Support for Superpages. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 89-104. https://doi.org/10.1145/844128.844138
[58] Ashish Panwar, Naman Patel, and K. Gopinath. 2016. A Case for Protecting Huge Pages from the Kernel. In Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys '16). ACM, Article 15. https://doi.org/10.1145/2967360.2967371
[59] Ashish Panwar, Aravinda Prasad, and K. Gopinath. 2018. Making Huge Pages Actually Useful. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, 679-692. https://doi.org/10.1145/3173162.3173203
[60] Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel H. Loh. 2014. Increasing TLB reach by exploiting clustering in page translations. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 558-567.
[61] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced Large-Reach TLBs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, 258-269. https://doi.org/10.1109/MICRO.2012.32
[62] Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee. 2015. Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways? In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, 1-12. https://doi.org/10.1145/2830772.2830773
[63] Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. 2017. Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, 469-480. https://doi.org/10.1145/3079856.3080210
[64] Indira Subramanian, Clifford Mather, Kurt Peterson, and Balakrishna Raghunath. 1998. Implementation of Multiple Pagesize Support in HP-UX. In USENIX Annual Technical Conference, 105-119.
[65] John R. Tramm, Andrew R. Siegel, Tanzima Islam, and Martin Schulz. [n.d.]. XSBench - The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis. In PHYSOR 2014 - The Role of Reactor Physics toward a Sustainable Future. Kyoto.
[66] Koji Ueno and Toyotaro Suzumura. 2012. Highly Scalable Graph Search for the Graph500 Benchmark. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC '12). ACM, 149-160. https://doi.org/10.1145/2287076.2287104
[67] Carl A. Waldspurger. 2002. Memory Resource Management in VMware ESX Server. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 181-194. https://doi.org/10.1145/844128.844146
[68] Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards Practical Page Coloring-based Multicore Cache Management. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys '09). ACM, 89-102. https://doi.org/10.1145/1519065.1519076

Page 6: HawkEye: Efficient Fine-grained OS Support for …sbansal/pubs/hawkeye.pdfHawkEye: Efficient Fine-grained OS Support for Huge Pages Ashish Panwar Indian Institute of Science ashishpanwar@iisc.ac.in

675554

1155

39 28 12 1 663274

9110

30

60

90

120di

stan

ce (b

ytes

)

Figure 3 Average distance to the first non-zero byte in base-line (4KB) pages First four bars represent the average of allworkloads in the respective benchmark suite

chooses the application whose huge page allocations need tobe scanned (and potentially demoted) based on the estimatedMMU overheads of that application the application with thelowest estimated MMU overheads is chosen first for scanningThis strategy ensures that the application that least requireshuge pages is considered first mdash this is consistent with ourhuge page allocation strategy (sect23)

While scanning a baseline page to verify if it is zero-filledwe stop on encountering the first non-zero byte in it Inpractice the number of bytes that need to be scanned perin-use (not zero-filled) page before a non-zero byte is en-countered is very small we measured this over a total of 56diverse workloads and found that the average distance of thefirst non-zero byte in a 4KB page is only 911 (see Figure 3)Hence only ten bytes need to be scanned on average perin-use application page For bloat pages however all 4096bytes need to be scanned This implies that the overheadsof our bloat-recovery thread are largely proportional tothe number of bloat pages in the system and not to the totalsize of the allocated memory This is an important propertythat allows our method to scale to large memory systemsWe note that our bloat-recovery procedure has many

similarities with the standard de-duplication kernel threadsused for content-based page sharing for virtual machines[67] eg the kernel same-page merging (ksm) thread inLinux Unfortunately in current kernels the huge page man-agement logic (eg khugepaged) and the content-based pagesharing logic (eg ksm) are unconnected and can often inter-act in counter-productive ways [51] Ingens and SmartMD[52] proposed coordinated mechanisms to avoid such con-flicts Ingens demotes only infrequently-accessed huge pagesthrough ksmwhile SmartMD demotes pages based on access-frequency and repetition rate (ie the number of shareablepages within a huge page) These techniques are useful for in-use pages and our bloat-recovery proposal complementsthem by identifying unused zero-filled pages which canexecute much faster than typical same-page merging logic

33 Fine-grained huge page promotionAn efficient huge page promotion strategy should try tomaximize performance with a minimal number of huge page

9876543210

B1

B2

B3 B4

B9876543210

C1

C3

C5

C

C4

C29876543210

access_map

A1

A2 A3

A

orde

r of p

rom

otio

n

hot-regions

cold-regions

access_map access_map

Figure 4 A sample representation of access_map for threeprocesses A B and C

promotions Current systems promote pages through a se-quential scan from lower to higher VAs which is inefficientfor applications whose hot regions are not in lower VA spaceOur approach makes promotion decisions based on memoryaccess patterns First we define an important metric used inHawkEye to improve the efficiency of huge page promotionsAccess-coverage denotes the number of base pages that areaccessed from a huge page sized region in short intervalsWe sample the page table access bits at regular intervals andmaintain the exponential moving average (EMA) of access-coverage across different samples More precisely we clearthe page table access bits and test the number of set bitsafter 1 second to check how many pages are accessed Thisprocess is repeated once every 30 secondsThe access-coverage of a region potentially indicates its

TLB space requirement (number of base page entries requiredin the TLB) which we use to estimate the profitability ofpromoting it to a huge page A region with high access-coverage is likely to exhibit high contention on TLB andhence likely to benefit more from promotionHawkEye implements access-coverage based promotion

using a per-process data structure called access_map whichis an array of buckets a bucket contains huge page regionswith similar access-coverage On x86 with 2MB huge pagesthe value of access-coverage is in the range of 0-512 In ourprototype wemaintain ten buckets in the access_mapwhichprovides the necessary resolution across access-coverage val-ues at relatively low overheads regions with access-coverageof 0-49 are placed in bucket 0 regions with access-coverageof 50 minus 99 are placed in bucket 1 and so on Figure 4 showsan example state of the access_map of three processes A Band C Regions can move up or down in their respective ar-rays after every sampling period depending on their newlycomputed EMA-based access-coverage If a region movesup in access_map it is added to the head of its bucket If aregion moves down it is added to the tail Within a bucketpages are promoted from head to tail This strategy helps inprioritizing recently accessed regions within an index

HawkEye promotes regions from higher to lower indicesin access_map Notice that our approach captures both re-cency and frequency of accesses a region that has not been

accessed or accessed with low access-coverage in recentsampling periods is likely to shift towards a lower indexin access_map or towards the tail in its current index Pro-motion of cold regions is thus automatically deferred toprioritize hot regions

We note that HawkEyersquos access_map is somewhat similarto population_map and access_bitvector used in FreeBSDand Ingens resp these data structures are primarily used tocapture utilization or page access related metadata at hugepage granularity HawkEyersquos access_map additionally pro-vides the index of hotness of memory regions and enablesfine-grained huge page promotion

34 Huge page allocation across multiple processesIn our access-coverage based strategy regions belonging toa process with the highest expected MMU overheads arepromoted before others This is also consistent with ournotion of fairness sect23 as pages are promoted first fromregionsprocesses with high expected TLB pressure (andconsequently high expected MMU overheads)Our HawkEye variant that does not rely on hardware

performance counters (ie HawkEye-G) promotes regionsfrom the non-empty highest access-coverage index in all pro-cesses It is possible that multiple processes have non-emptybuckets in the (globally) highest non-empty index In thiscase round-robin is used to ensure fairness among suchprocesses We explain this further with an example Con-sider three applications A B and C with their VA regionsarranged in the access_map as shown in Figure 4 HawkEye-G promotes regions in the following order in this example

A1B1C1C2B2C3C4B3B4A2C5A3Recall however that MMU overheads may not necessarily

be correlated with our access-coverage based estimationand may depend on other more complex features of theaccess pattern (sect24) To capture this in HawkEye-PMU wefirst choose the process with the highest measured MMUoverheads and then choose regions from higher to lowerindices in selected processrsquos access_map Among processeswith similar highest MMU overheads round-robin is used

35 Limitations and discussionWe briefly outline a few issues that are related to huge pagemanagement but are currently unhandled in HawkEye1) Thresholds for identifyingmemorypressureWemea-sure the extent ofmemory pressurewith statically configuredvalues for low and high watermarks (ie 70 and 85 oftotal systemmemory) while dealing with memory bloat Anystrategy that relies on static thresholds faces the risk of beingconservative or overly aggressive when memory pressureconsistently fluctuates An ideal solution should adjust thesethresholds dynamically to prevent unintended system behav-ior The approach proposed by Guo et al [51] in the contextof memory deduplication for virtualized environments isrelevant in this context

2) Huge page starvationWhile we believe our approachof allocating huge pages based on MMU overheads optimizesthe system as a whole unbounded huge page allocations toa single process can be thought of as a starvation problemfor other processes An adverserial application can also po-tentially monopolize HawkEye to get more huge pages orprevent other applications from getting a fair share of hugepages Preliminary investigations show that our approach isreasonable even if the memory footprint of workloads differby more than an order of magnitude However if limitinghuge page allocations is still desirable it seems reasonable tointegrate a policy with existing resource limitingmonitoringtools such as Linuxrsquos cgroups [18]3) Other algorithms We do not discuss some parts of themanagement algorithms such as rate limiting khugepagedto reduce promotion overheads demotion based on low uti-lization to enable better sharing through same-page mergingand techniques for minimizing the overhead of page-tableaccess-bit tracking and compaction algorithms Much of thismaterial has been discussed and evaluated extensively inthe literature [32 55 59 68] and we have not contributedsignificantly new approaches in these areas

4 EvaluationWe now evaluate our algorithms in more detail Our exper-imental platform is an Intel Haswell-EP based E5-2690 v3server system running CentOS v74 on which 96GB mem-ory and 48 cores (with hyperthreading enabled) running at23GHz are partitioned on two sockets we bind each work-load to a single socket to avoid NUMA effects The L1 TLBcontains 64 and 8 entries for 4KB and 2MB pages respectivelywhile the L2 TLB contains 1024 entries for both 4KB and2MB pages The size of L1 L2 and shared L3 cache is 768KB3MB and 30MB resp A 96GB SSD-backed swap partitionis used to evaluate performance in an overcommitted sys-tem We evaluate HawkEye with a diverse set of workloadsranging from HPC graph algorithms in-memory databasesgenomics and machine learning [24 30 33 36 45 53 65 66]HawkEye is implemented in Linux kernel v43

We evaluate (a) the improvements due to our fine-grainedpromotion strategy based on access-coverage in both per-formance and fairness for single multiple homogeneous andmultiple heterogeneousworkloads (b) the bloat-vs-performancetradeoff (c) the impact of low latency page faults (d) cacheinterference caused by asynchronous pre-zeroing thread and(e) the impact of memory efficiency enabled by asynchronouspage pre-zeroingPerformance advantages of fine-grainedhuge page pro-motion We first evaluate the effectiveness of our access-coverage based promotion strategy For this we measure thetime required for our algorithm to recover from a fragmentedstate with high address translation overheads to a state with

0

6

12

18

24

Graph500 XSBench cgD

speedup

Linux Ingens HawkEyeshyPMU HawkEyeshyG

152

23 25155

20 36

1007 1011

175

628

152 168

0

300

600

900

1200

Graph500 XSBench cgD

Time saved (ms)

Figure 5 Performance speedup and time saved per hugepage promotion over baseline pages

Workload Execution Time (Seconds)Linux-4KB Linux-2MB Ingens HawkEye-PMU HawkEye-G

Graph500-1 2270 2145(106) 2243(101) 1987(114) 2007(113)Graph500-2 2289 2252(102) 2253(102) 1994(115) 2013(114)Graph500-3 2293 2293(10) 2299(100) 2012(114) 2018(114)Average 2284 2230(102) 2265(101) 1998(114) 2013(113)

XSBench-1 2427 2415(10) 2392(101) 2108(115) 2098(115)XSBench-2 2437 2427(10) 2415(101) 2109(115) 2110(115)XSBench-3 2443 2455(10) 2456(100) 2133(115) 2143(114)Average 2436 2432(10) 2421(100) 2117(115) 2117(115)

Table 5 Execution time of 3 instances of Graph500 andXSBench when executed simultaneously Values in parenthe-ses represent speedup over baseline pages

low address translation overheads Recall that HawkEye pro-motes pages based on access-coverage while previous ap-proaches promote pages in VA order (from low VAs to high)We fragment the memory initially by reading several files inmemory our test workloads are started in the fragmentedstate and we measure the time taken for the system to re-cover from high MMU overheads This experimental setupsimulates expected realistic situations where the memoryfragmentation in the system fluctuates over time Withoutpromotion our test workloads would keep incurring highaddress translation overheads however HawkEye is quicklyable to recover from these overheads through appropriatehuge page allocations Figure 5 (left) shows the performanceimprovement obtained by HawkEye (over a strategy thatnever promotes the pages) for three workloads Graph500XSBench and cgD These workloads allocate all their re-quired memory in the beginning ie in the fragmented stateof the system For these workloads the speedups due toeffective huge page management through HawkEye are ashigh as 22 Compared to Linux and Ingens access-coveragebased huge page promotion strategy of HawkEye improvesperformance by 13 12 and 6 over both Linux and Ingens

To understand this more clearly Figure 6 shows the accesspattern MMU overheads and the number of allocated hugepages over time for Graph500 and XSBench We find thathot-spots in these applications are concentrated in the highVAs and any sequential-scanning based promotion is likelyto be sub-optimal The second and third columns corrobo-rate this observation for example both HawkEye variantstake asymp300 seconds to eliminate MMU overheads of XSBenchwhile Linux and Ingens have high overheads even after 1000seconds

To quantify the cost-benefit of huge page promotions further, we propose a new metric: the average execution time saved (over using only baseline pages) per huge page promotion. A scheme that maximizes this metric is most effective in reducing MMU overheads. Figure 5 (right) shows that HawkEye performs significantly better than Linux and Ingens on this metric. The difference in efficiency between HawkEye-PMU and HawkEye-G is also evident: HawkEye-PMU is more efficient because it stops promoting huge pages when MMU overheads fall below a certain threshold (2% in our experiments). In summary, HawkEye-G and HawkEye-PMU are up to 67× and 44× more efficient (for XSBench) than Linux in terms of time saved per huge page promotion. For workloads whose access patterns are spread uniformly over the VA space, HawkEye delivers performance similar to Linux and Ingens.
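Stated as a formula (our notation, not the paper's), for a workload with execution time T_4KB under baseline pages only, execution time T_scheme under a given promotion scheme, and N huge page promotions performed by that scheme:

\[
\text{time saved per promotion} \;=\; \frac{T_{\mathrm{4KB}} - T_{\mathrm{scheme}}}{N}
\]

For example, from Table 5, Graph500 under HawkEye-PMU saves 2284 − 1998 = 286 seconds on average; dividing by the number of promotions performed yields the per-promotion benefit plotted in Figure 5 (right).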

Fairness advantages of fine-grained huge page promotion: We next experiment with multiple applications to study both performance and fairness, first with identical applications and then with heterogeneous applications running simultaneously.

Identical workloads: Figure 7 shows the MMU overheads and huge page allocations when 3 instances of Graph500 and XSBench (in separate runs) are executed concurrently after fragmenting the system (see Table 5 for execution times). While Linux creates performance imbalance by promoting huge pages in one process at a time, HawkEye ensures fairness by judiciously distributing huge pages across all workload instances. Overall, HawkEye-PMU and HawkEye-G achieve 1.14× and 1.13× speedup for Graph500, and 1.15× speedup for XSBench, over Linux on average. Unlike Linux, Ingens promotes huge pages proportionally in all 3 instances. However, it fails to improve the performance of these workloads; in fact, it may lead to poorer performance than Linux.
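The fairness behaviour described above can be sketched as follows; this is a hypothetical simplification (all names ours) of serving processes round-robin when their pending regions tie at the globally highest access-coverage bucket, instead of draining one process at a time as Linux's FCFS khugepaged does:

    #include <stddef.h>

    struct proc_state {
        int top_bucket;   /* highest non-empty access-coverage index */
        int pending;      /* regions waiting in that bucket */
    };

    /* Returns the index of the process to promote next, or -1 if none.
     * Processes tied at the globally highest bucket are served
     * round-robin, one promotion each. */
    static int pick_next(struct proc_state *p, size_t n, size_t *rr_cursor)
    {
        int max_bucket = -1;
        for (size_t i = 0; i < n; i++)
            if (p[i].pending > 0 && p[i].top_bucket > max_bucket)
                max_bucket = p[i].top_bucket;
        if (max_bucket < 0)
            return -1;
        for (size_t k = 0; k < n; k++) {
            size_t i = (*rr_cursor + k) % n;
            if (p[i].pending > 0 && p[i].top_bucket == max_bucket) {
                *rr_cursor = i + 1;
                return (int)i;
            }
        }
        return -1;
    }

We explain the observed Linux and Ingens behaviour with an example.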

Linux selects an address from Graph500-1's address space and promotes all pages above it in around 10 minutes. Even though this strategy is unfair, it improves the performance of Graph500-1 by promoting its hot regions. Linux then selects Graph500-2, whose MMU overheads decrease after ≈20 minutes. In contrast, Ingens promotes huge pages from lower VAs in each instance of Graph500; thus it takes even longer for Ingens to promote the suitable (hot) regions of the respective processes, which leads to a 9% performance degradation over Linux for Graph500. For XSBench, the performance of both Ingens and Linux is similar (but inferior to HawkEye) because both fail to promote the application's hot regions before the application finishes.

Heterogeneous workloads: To measure the efficacy of different strategies for heterogeneous workloads, we execute workloads by grouping them into sets, where a set contains one TLB-sensitive and one TLB-insensitive application.

[Figure 6: For (a) Graph500 and (b) XSBench: access-coverage across the application's virtual address space (with cold regions marked), MMU overhead (%) over time, and huge pages promoted over time under Linux, Ingens, HawkEye-PMU and HawkEye-G. HawkEye is more efficient in promoting huge pages due to fine-grained decision making.]

[Figure 7: Huge page promotions and their impact on MMU overhead over time for (a) Linux-2MB, (b) Ingens, (c) HawkEye-PMU and (d) HawkEye-G, for three instances each of Graph500 and XSBench.]

Each set is executed twice after fragmenting the system, to assess the impact of huge page policies on the order of execution. For the TLB-insensitive workload, we execute a lightly loaded Redis server with 40M keys of 1KB size, servicing 10K requests per second. Figure 8 shows the performance speedup over 4KB pages. The (Before) and (After) configurations show whether the TLB-sensitive workload is launched before or after Redis. Because Linux promotes huge pages in process launch order, the two configurations are intended to cover its two different behaviours; Ingens and HawkEye are agnostic to process creation order.

[Figure 8: Performance speedup (%) over baseline pages of TLB-sensitive applications (cactusADM, tigr, Graph500, lbm_s, SVM, XSBench, cg.D) when executed alongside a lightly stressed Redis server in different orders (Before/After), for Linux, Ingens, HawkEye-PMU and HawkEye-G.]

[Figure 9: Performance normalized to Linux in a virtualized system when HawkEye is applied at the host, the guest, and both layers.]

The impact of execution order is clearly visible for Linux. In the (Before) case, Linux with THP support improves performance over baseline pages by promoting huge pages in the TLB-sensitive applications. However, in the (After) case, it promotes huge pages in Redis, resulting in poor performance for the TLB-sensitive workloads. While execution order does not matter for Ingens, its proportional huge page promotion strategy favors large-memory workloads (Redis in this case). Further, our client requests randomly selected keys, which makes the Redis server access all its huge pages uniformly. Consequently, Ingens promotes most huge pages in Redis, leading to suboptimal performance of the TLB-sensitive workloads. In contrast, HawkEye promotes huge pages based on MMU overhead measurements (HawkEye-PMU) or access-coverage based estimation of MMU overheads (HawkEye-G). This leads to 15–60% performance improvement over 4KB pages, irrespective of the order of execution.

Performance in virtualized systems: For a virtualized system, we use the KVM hypervisor running with Ubuntu 16.04 and fragment the system prior to running the workloads. Evaluation is presented for three different configurations, i.e., when HawkEye is deployed at the host, the guest, and both layers. Each of these configurations requires running the same set of applications differently (see Table 6 for details). Figure 9 shows that HawkEye provides 18–90% speedup compared to Linux in virtual environments. Notice that in some cases (e.g., cg.D), the performance improvement is higher in a virtualized system than on bare-metal. This is due to the high MMU overheads involved in traversing nested page tables while servicing TLB misses of guest applications, which presents higher opportunities to be exploited by HawkEye.
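The larger headroom under virtualization follows from standard two-dimensional page walk arithmetic (a general property of nested paging [34, 46], not a result specific to HawkEye): with n-level guest and m-level host page tables, a single TLB miss can require up to

\[
n\,m + n + m \;=\; 4 \times 4 + 4 + 4 \;=\; 24
\]

memory references for x86-64's four-level tables, versus at most 4 for native execution.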

Configuration  | Details
HawkEye-Host   | Two VMs, each with 24 vCPUs and 48GB memory. VM-1 runs Redis; VM-2 runs TLB-sensitive workloads.
HawkEye-Guest  | Single VM with 48 vCPUs and 80GB memory, running Redis and TLB-sensitive workloads.
HawkEye-Both   | Two VMs, each with 24 vCPUs. VM-1 (30GB) runs Redis; VM-2 (60GB) runs both Redis and TLB-sensitive workloads.

Table 6: Experimental setup for the configurations used to evaluate a virtualized system.

Bloat vs. performance: We next study the relationship between memory bloat and performance; we revisit an experiment similar to the one summarized in §2.1. We populate a Redis instance with 8M key-value pairs (10B keys, 4KB values) and then delete 60% of randomly selected keys (see Table 7). While Linux-4KB (no THP) is memory efficient (no bloat), its performance is low (7% lower throughput than Linux-2MB, i.e., with THP). Linux-2MB delivers high performance but consumes more memory, which remains unavailable to the rest of the system even when memory pressure increases. Ingens can be configured to balance the memory-vs-performance tradeoff: for example, Ingens-90 (Ingens with a 90% utilization threshold) is memory efficient, while Ingens-50 (Ingens with a 50% utilization threshold) favors performance and allows more memory bloat. In either case, it is unable to recover from bloat that may already have been generated (e.g., during its aggressive phase). By switching from aggressive huge page allocation under low memory pressure to a memory-conserving strategy at runtime, HawkEye is able to achieve both low MMU overheads and high memory efficiency, depending on the state of the system.
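A minimal sketch of this runtime switch, assuming the statically configured low/high watermarks HawkEye uses (70% and 85% of total system memory, per §3.5) and with hypothetical names of our own:

    enum mm_mode { MODE_AGGRESSIVE, MODE_CONSERVE };

    #define LOW_WATERMARK  70  /* % of total memory in use */
    #define HIGH_WATERMARK 85

    static enum mm_mode pick_mode(int used_pct, enum mm_mode cur)
    {
        if (used_pct >= HIGH_WATERMARK)
            return MODE_CONSERVE;    /* recover bloat: demote and dedup
                                        zero-filled base pages */
        if (used_pct <= LOW_WATERMARK)
            return MODE_AGGRESSIVE;  /* allocate huge pages at fault time */
        return cur;                  /* hysteresis between the two marks */
    }

The gap between the two watermarks provides hysteresis, so the system does not oscillate between modes when memory pressure fluctuates around a single threshold.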

Kernel                     | Self-tuning | Memory Size | Throughput
Linux-4KB                  | No          | 16.2GB      | 106.1K
Linux-2MB                  | No          | 33.2GB      | 113.8K
Ingens-90                  | No          | 16.3GB      | 106.8K
Ingens-50                  | No          | 33.1GB      | 113.4K
HawkEye (no mem pressure)  | Yes         | 33.2GB      | 113.6K
HawkEye (mem pressure)     | Yes         | 16.2GB      | 105.8K

Table 7: Redis memory consumption and throughput.

Fast page faults: Table 8 shows the performance of different strategies for the workloads chosen for these experiments; their performance depends on the efficiency of the OS page fault handler. We measure Redis throughput when 2MB-sized values are inserted. SparseHash [13] is a hash-map library in C++, while HACC-IO [8] is a parallel I/O benchmark used with an in-memory file system. We also measure the spin-up time of two virtual machines, KVM and a Java Virtual Machine (JVM), where both are configured to allocate all memory during initialization.

Workload            | Linux-4KB | Linux-2MB | Ingens-90 | HawkEye-4KB | HawkEye-2MB
Redis (45GB)        | 2.33      | 4.37      | 1.92      | 2.36        | 5.51
SparseHash (36GB)   | 50.1      | 17.2      | 51.5      | 46.6        | 10.6
HACC-IO (6GB)       | 6.5       | 4.5       | 6.6       | 6.5         | 4.2
JVM Spin-up (36GB)  | 37.7      | 18.6      | 52.7      | 29.8        | 13.7
KVM Spin-up (36GB)  | 40.6      | 9.7       | 41.8      | 30.2        | 0.70

Table 8: Performance implications of asynchronous page zeroing. Values for Redis represent throughput (higher is better); all other values represent time in seconds (lower is better).

[Figure 10: Performance overhead (%) of async pre-zeroing with caching stores vs. with non-temporal (nt) stores, for NPB, PARSEC, Graph500, cactusADM, PageRank and omnetpp. The first two bars (NPB and PARSEC) represent the average of all workloads in the respective benchmark suite.]

Performance benefits of async page pre-zeroing with base pages (HawkEye-4KB) are modest: in this case, the other page fault overheads (apart from zeroing) dominate performance. However, the page-zeroing cost becomes significant with huge pages. Consequently, HawkEye with huge pages (HawkEye-2MB) improves the performance of Redis and SparseHash by 1.26× and 1.62× over Linux. The spin-up time of virtual machines is purely dominated by page faults; hence we observe a dramatic reduction in the spin-up time of VMs with async pre-zeroing of huge pages: 13.8× and 1.36× over Linux. Notice that without huge pages (only base pages), this reduction is only 1.34× and 1.26×. All these workloads have high spatial locality of page faults, and hence the Ingens strategy of utilization-based promotion has a negative impact on their performance due to a higher number of page faults.

Async pre-zeroing enables memory sharing in virtualized environments: Finally, we note that async pre-zeroing within VMs enables memory sharing across VMs: the free memory of a VM returns to the host through pre-zeroing in the VM and same-page merging at the host. This can have the same net effect as ballooning, as the free memory of a VM gets shared with other VMs. In our experiments with memory-overcommitted systems we have confirmed this behaviour: the overall performance of an overcommitted system in which the VMs run HawkEye matches the performance achieved through para-virtual interfaces such as memory balloon drivers.
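The non-temporal zeroing on which pre-zeroing relies (evaluated below in Figure 10) can be illustrated with a small user-space sketch using SSE2 streaming-store intrinsics; the in-kernel implementation differs, and this only demonstrates the instruction choice that bypasses the cache hierarchy:

    #include <emmintrin.h>  /* SSE2 streaming-store intrinsics */
    #include <stddef.h>

    /* Zero one page with non-temporal stores so a pre-zeroing thread does
     * not evict the application's working set from the caches. The buffer
     * must be 16-byte aligned; page-aligned memory always is. */
    static void zero_page_nt(void *page, size_t bytes)
    {
        __m128i z = _mm_setzero_si128();
        for (size_t off = 0; off < bytes; off += 64) {
            char *p = (char *)page + off;
            _mm_stream_si128((__m128i *)(p +  0), z);
            _mm_stream_si128((__m128i *)(p + 16), z);
            _mm_stream_si128((__m128i *)(p + 32), z);
            _mm_stream_si128((__m128i *)(p + 48), z);
        }
        /* Streaming stores are weakly ordered: fence before the page can
         * be handed out to a process. */
        _mm_sfence();
    }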

[Figure 11: Performance normalized to the case of no ballooning in an overcommitted virtualized system, for PageRank, Redis, MongoDB, CC, Memcached and SPD across two nodes, comparing Linux, Linux-Ballooning and HawkEye.]

For example, in an experiment involving a mix of latency-sensitive key-value stores and HPC workloads (see Figure 11), with a total peak memory consumption of around 150GB (1.5× of total memory), HawkEye-G provides 2.3× and 1.42× higher throughput for Redis and MongoDB, along with significant tail latency reduction. For PageRank, performance degrades slightly due to a higher number of copy-on-write page faults caused by unnecessary same-page merging. These results are very close to the results obtained when memory ballooning is enabled on all VMs. We believe that this is a potential positive of our design and may offer an alternative method to efficiently manage memory in overcommitted systems: it is well-known that balloon drivers and other para-virtual interfaces are riddled with compatibility problems [29], and a fully-virtual solution to this problem would be highly desirable. While our experiments show early promise in this direction, we leave an extensive evaluation for future work.

Performance overheads of HawkEye: We evaluated HawkEye on non-fragmented systems and with several workloads that don't benefit from huge pages, and find that performance with HawkEye is very similar to performance with Linux or Ingens in these scenarios, confirming that HawkEye's algorithms for access-bit tracking, asynchronous huge page promotion and memory de-duplication add negligible overheads over existing mechanisms in Linux. However, the async pre-zeroing thread requires special attention, as it may become a source of noticeable performance overhead for certain types of workloads, as discussed next.

Overheads of async pre-zeroing: A primary concern with async pre-zeroing is its potential detrimental effect due to memory and cache interference (§3.1). Recall that we employ memory stores with non-temporal hints to avoid these effects. To measure the worst-case cache effects of async pre-zeroing, we run our workloads while simultaneously zero-filling pages on a separate core sharing the same L3 cache (i.e., on the same socket) at a high rate of 0.25M pages per second (1GBps), with and without non-temporal memory stores. Figure 10 shows that using non-temporal hints significantly brings down the overhead of async pre-zeroing (e.g., from 27% to 6% for omnetpp); the remaining overhead is due to the additional memory traffic generated by the pre-zeroing thread. Further, we note that this experiment presents a highly exaggerated behavior of async pre-zeroing on worst-case workloads. In practice, the zeroing thread is rate-limited (e.g., at most 10K pages per second) and the expected cache-interference overheads would be proportionally smaller.

Workload         | MMU Overhead | 4KB  | HawkEye-PMU  | HawkEye-G
random (4GB)     | 60%          | 582  | 328 (1.77×)  | 413 (1.41×)
sequential (4GB) | <1%          | 517  | 535          | 532
Total            |              | 1099 | 863 (1.27×)  | 945 (1.16×)
cg.D (16GB)      | 39%          | 1952 | 1202 (1.62×) | 1450 (1.35×)
mg.D (24GB)      | <1%          | 1363 | 1364         | 1377
Total            |              | 3315 | 2566 (1.29×) | 2827 (1.17×)

Table 9: Comparison between HawkEye-PMU and HawkEye-G; times are in seconds. Values in parentheses represent speedup over baseline pages.

Comparison between HawkEye-PMU and HawkEye-G: Even though HawkEye-G is based on simple and approximate estimations of TLB contention, it is reasonably accurate in identifying TLB-sensitive processes and memory regions in most cases. However, HawkEye-PMU performs better than HawkEye-G in some cases. Table 9 demonstrates two such examples, where four workloads, all with high access-coverage but different MMU overheads, are executed in two sets; each set contains one TLB-sensitive and one TLB-insensitive workload. The 4KB column represents the case where no huge pages are promoted. As discussed earlier (§2.4), MMU overheads depend on the complex interaction between the hardware and the access pattern. In this case, despite having high access-coverage in their VA regions, sequential and mg.D have negligible MMU overheads (i.e., they are TLB-insensitive). While HawkEye-PMU correctly identifies the process with higher MMU overheads for huge page allocation, HawkEye-G treats them similarly (due to its imprecise estimations) and may allocate huge pages to the TLB-insensitive process. Consequently, HawkEye-PMU may perform up to 36% better than HawkEye-G. Bridging this gap between the two approaches in a hardware-independent manner is interesting future work.
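As an illustration of PMU-based measurement, the sketch below estimates the fraction of CPU cycles spent in hardware page walks via Linux's perf_event_open syscall. The raw event encoding shown (DTLB_LOAD_MISSES.WALK_DURATION, event 0x08, umask 0x10) is Haswell-specific and given only as an example; HawkEye-PMU's in-kernel mechanism is more involved:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdint.h>

    static int open_counter(uint32_t type, uint64_t config)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = type;
        attr.config = config;      /* counting starts immediately */
        /* pid=0, cpu=-1: measure this process on any CPU */
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int cycles = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
        int walks  = open_counter(PERF_TYPE_RAW, 0x1008); /* umask 0x10, event 0x08 */

        sleep(1);  /* ... stand-in for the workload phase of interest ... */

        uint64_t c = 0, w = 0;
        read(cycles, &c, sizeof(c));
        read(walks,  &w, sizeof(w));
        printf("approx. MMU overhead: %.1f%%\n", c ? 100.0 * w / c : 0.0);
        return 0;
    }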

5 Related Work

MMU overheads have been a topic of much research in recent years, and several solutions have been proposed, including solutions that are architecture-based, OS-based and/or compiler-based [28, 34, 49, 56, 58, 62–64].

Hardware support: Most proposals for hardware support aim to reduce TLB-miss frequency or accelerate TLB-miss processing. Large multi-level TLBs and support for multiple page sizes to minimize the number of TLB misses are already common in general-purpose processors today [15, 20, 21, 40]. Further, modern processors employ page-walk caches to avoid memory lookups in the TLB-miss path [31, 35]. POM-TLB services a TLB miss using a single memory lookup and further leverages regular data caches to speed up address translation [63]. Direct segments [32, 46] were proposed to completely avoid TLB-miss processing cost through special segmentation hardware. OS-level challenges of memory fragmentation have also been considered in hardware designs: CoLT, or Coalesced-Large-reach TLBs [61], was initially proposed to increase TLB reach using base pages, based on the observation that OSs naturally provide contiguous mappings at smaller granularity; this approach was further extended to page-walk caches and huge pages [35, 40, 60]. Techniques to support huge pages in non-contiguous memory have also been proposed [44]. Hardware optimizations are important but complementary to HawkEye.

Software support: FreeBSD's huge page management is largely influenced by previous work on superpages [57]. Carrefour-LP [47] showed that ad-hoc usage of huge pages can degrade performance on NUMA systems due to additional remote memory accesses and traffic imbalance across node interconnects, and proposed hardware-profiling based techniques to overcome these problems. Anti-fragmentation techniques for clustering pages based on their mobility type, proposed by Gorman et al. [48], have been adopted by Linux to support the allocation of huge pages in long-running systems. Illuminator [59] highlighted critical drawbacks of Linux's fragmentation mitigation techniques and their implications on performance and latency, and proposed techniques to solve these problems. Mechanisms for dealing with physical memory fragmentation and NUMA systems address important but orthogonal problems, and their solutions can be adopted to improve the robustness of HawkEye. Compiler or application hints can also be used to assist OSs in prioritizing huge page mappings for certain parts of the address space [27, 56], for which an interface is already provided by Linux through the madvise system call [16].
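For concreteness, this is how an application opts a region into transparent huge pages through that interface (standard Linux API, not HawkEye-specific):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    int main(void)
    {
        size_t len = 1UL << 30;  /* a 1GB anonymous region */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;
        madvise(buf, len, MADV_HUGEPAGE);   /* prioritize THP for this range */
        /* madvise(buf, len, MADV_NOHUGEPAGE) would exclude it instead */
        return 0;
    }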

An alternative approach for supporting huge pages has also been explored via libhugetlbfs [22], where the user is provided more control over huge page allocation. However, such an approach requires manual intervention for reserving huge pages in advance, and considers each application in isolation. Windows and OS X support huge pages only through this reservation-based approach, to avoid the issues associated with transparent huge page management [17, 55, 59]. We believe that insights from HawkEye can be leveraged to improve huge page support in these important systems.

6 Conclusions

To summarize: transparent huge page management is essential but complex. Effective and efficient algorithms are required in the OS to automatically balance the tradeoffs and provide performant and stable system behavior. We expose some important subtleties related to huge page management in existing proposals, and propose a new set of algorithms to address them in HawkEye.

References

[1] 2000. Clearing pages in the idle loop. https://www.mail-archive.com/freebsd-hackers@freebsd.org/msg13993.html
[2] 2000. Linux Page Zeroing Strategy. https://yarchive.net/comp/linux/page_zeroing_strategy.html
[3] 2007. Memory part 5: What programmers can do. https://lwn.net/Articles/255364
[4] 2010. Mysteries of Windows Memory Management Revealed: Part Two. https://blogs.msdn.microsoft.com/tims/2010/10/29/pdc10-mysteries-of-windows-memory-management-revealed-part-two
[5] 2012. khugepaged eating 100% CPU. https://bugzilla.redhat.com/show_bug.cgi?id=879801
[6] 2012. Recommendation to disable huge pages for Hadoop. https://developer.amd.com/wordpress/media/2012/10/HadoopTuningGuide-Version5.pdf
[7] 2014. Arch Linux becomes unresponsive from khugepaged. http://unix.stackexchange.com/questions/161858/arch-linux-becomes-unresponsive-from-khugepaged
[8] 2014. CORAL Benchmark Codes. https://asc.llnl.gov/CORAL-benchmarks/#hacc
[9] 2014. Recommendation to disable huge pages for NuoDB. http://www.nuodb.com/techblog/linux-transparent-huge-pages-jemalloc-and-nuodb
[10] 2014. Why TokuDB Hates Transparent HugePages. https://www.percona.com/blog/2014/07/23/why-tokudb-hates-transparent-hugepages
[11] 2015. The Black Magic Of Systematically Reducing Linux OS Jitter. http://highscalability.com/blog/2015/4/8/the-black-magic-of-systematically-reducing-linux-os-jitter.html
[12] 2015. Tales from the Field: Taming Transparent Huge Pages on Linux. https://www.perforce.com/blog/151016/tales-field-taming-transparent-huge-pages-linux
[13] 2016. C++ associative containers. https://github.com/sparsehash/sparsehash
[14] 2016. Remove PG_ZERO and zeroidle (page-zeroing) entirely. https://news.ycombinator.com/item?id=12227874
[15] 2017. Hugepages. https://wiki.debian.org/Hugepages
[16] 2017. madvise(2): Linux Programmer's Manual. http://man7.org/linux/man-pages/man2/madvise.2.html
[17] 2018. Large-Page Support in Windows. https://docs.microsoft.com/en-gb/windows/desktop/Memory/large-page-support
[18] 2019. cgroups(7): Linux Programmer's Manual. http://man7.org/linux/man-pages/man7/cgroups.7.html
[19] 2019. FreeBSD Pre-Faulting and Zeroing Optimizations. https://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/prefault-optimizations.html
[20] 2019. Intel Haswell. http://www.7-cpu.com/cpu/Haswell.html
[21] 2019. Intel Skylake. http://www.7-cpu.com/cpu/Skylake.html
[22] 2019. libhugetlbfs: Linux man page. https://linux.die.net/man/7/libhugetlbfs
[23] 2019. Recommendation to disable huge pages for MongoDB. https://docs.mongodb.com/manual/tutorial/transparent-huge-pages
[24] 2019. Recommendation to disable huge pages for Redis. http://redis.io/topics/latency
[25] 2019. Recommendation to disable huge pages for VoltDB. https://docs.voltdb.com/AdminGuide/adminmemmgt.php
[26] 2019. Transparent Hugepage Support. https://www.kernel.org/doc/Documentation/vm/transhuge.txt
[27] Mohammad Agbarya, Idan Yaniv, and Dan Tsafrir. 2018. Memomania: From Huge to Huge-Huge Pages. In SYSTOR '18. ACM, 112. https://doi.org/10.1145/3211890.3211918
[28] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. 2017. Do-It-Yourself Virtual Memory Translation. In ISCA '17. ACM, 457–468. https://doi.org/10.1145/3079856.3080209
[29] Nadav Amit, Dan Tsafrir, and Assaf Schuster. 2014. VSwapper: A Memory Swapper for Virtualized Environments. In ASPLOS '14. ACM, 349–366. https://doi.org/10.1145/2541940.2541969
[30] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS Parallel Benchmarks: Summary and Preliminary Results. In Supercomputing '91. ACM, 158–165. https://doi.org/10.1145/125826.125925
[31] Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don't Walk (the Page Table). In ISCA '10. ACM, 48–59. https://doi.org/10.1145/1815961.1815970
[32] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient Virtual Memory for Big Memory Servers. In ISCA '13. ACM, 237–248. https://doi.org/10.1145/2485922.2485943
[33] Scott Beamer, Krste Asanović, and David A. Patterson. 2015. The GAP Benchmark Suite. CoRR abs/1508.03619. http://arxiv.org/abs/1508.03619
[34] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha Manne. 2008. Accelerating Two-dimensional Page Walks for Virtualized Systems. In ASPLOS XIII. ACM, 26–35. https://doi.org/10.1145/1346281.1346286
[35] Abhishek Bhattacharjee. 2013. Large-reach Memory Management Unit Caches. In MICRO-46. ACM, 383–394. https://doi.org/10.1145/2540708.2540741
[36] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In PACT '08. ACM, 72–81. https://doi.org/10.1145/1454115.1454128
[37] Daniel Bovet and Marco Cesati. 2005. Understanding the Linux Kernel. O'Reilly & Associates, Inc.
[38] Josiah L. Carlson. 2013. Redis in Action. Manning Publications Co., Greenwich, CT, USA.
[39] Jonathan Corbet. 2010. Memory compaction. https://lwn.net/Articles/368869
[40] Guilherme Cox and Abhishek Bhattacharjee. 2017. Efficient Address Translation for Architectures with Multiple Page Sizes. In ASPLOS '17. ACM, 435–448. https://doi.org/10.1145/3037697.3037704
[41] Cort Dougan, Paul Mackerras, and Victor Yodaiken. 1999. Optimizing the Idle Task and Other MMU Tricks. In OSDI '99. USENIX Association, 229–237.
[42] Niall Douglas. 2011. User Mode Memory Page Allocation: A Silver Bullet For Memory Allocation? CoRR abs/1105.1811. http://arxiv.org/abs/1105.1811
[43] Ulrich Drepper. 2007. What Every Programmer Should Know About Memory.
[44] Yu Du, Miao Zhou, Bruce R. Childers, Daniel Mossé, and Rami G. Melhem. 2015. Supporting superpages in non-contiguous physical memory. In HPCA 2015. 223–234.
[45] M. Franklin, D. Yeung, X. Wu, A. Jaleel, K. Albayraktaroglu, B. Jacob, and C.-W. Tseng. 2005. BioBench: A Benchmark Suite of Bioinformatics Applications. In ISPASS 2005. 2–9. https://doi.org/10.1109/ISPASS.2005.1430554
[46] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2014. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. In MICRO-47. IEEE Computer Society, 178–189. https://doi.org/10.1109/MICRO.2014.37
[47] Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston, Alexandra Fedorova, and Vivien Quéma. 2014. Large Pages May Be Harmful on NUMA Systems. In USENIX ATC '14. USENIX Association, 231–242.
[48] Mel Gorman and Patrick Healy. 2008. Supporting Superpage Allocation Without Additional Hardware Support. In ISMM '08. ACM, 41–50. https://doi.org/10.1145/1375634.1375641
[49] Mel Gorman and Patrick Healy. 2012. Performance Characteristics of Explicit Superpage Support. In ISCA '10. Springer-Verlag, Berlin, Heidelberg, 293–310. https://doi.org/10.1007/978-3-642-24322-6_24
[50] Mel Gorman and Andy Whitcroft. 2006. The What, The Why and the Where To of Anti-Fragmentation. In Linux Symposium, 141.
[51] Fei Guo, Seongbeom Kim, Yury Baskakov, and Ishan Banerjee. 2015. Proactively Breaking Large Pages to Improve Memory Overcommitment Performance in VMware ESXi. In VEE '15. ACM, 39–51. https://doi.org/10.1145/2731186.2731187
[52] Fan Guo, Yongkun Li, Yinlong Xu, Song Jiang, and John C. S. Lui. 2017. SmartMD: A High Performance Deduplication Engine with Mixed Pages. In USENIX ATC '17. USENIX Association, 733–744.
[53] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (Sept. 2006), 1–17. https://doi.org/10.1145/1186736.1186737
[54] Vasileios Karakostas, Osman S. Unsal, Mario Nemirovsky, Adrián Cristal, and Michael M. Swift. 2014. Performance analysis of the memory management unit under scale-out workloads. In IISWC 2014. 1–12.
[55] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In OSDI '16. USENIX Association, 705–721.
[56] Joshua Magee and Apan Qasem. 2009. A Case for Compiler-driven Superpage Allocation. In ACM-SE 47. ACM, Article 82. https://doi.org/10.1145/1566445.1566553
[57] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. 2002. Practical, Transparent Operating System Support for Superpages. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 89–104. https://doi.org/10.1145/844128.844138
[58] Ashish Panwar, Naman Patel, and K. Gopinath. 2016. A Case for Protecting Huge Pages from the Kernel. In APSys '16. ACM, Article 15. https://doi.org/10.1145/2967360.2967371
[59] Ashish Panwar, Aravinda Prasad, and K. Gopinath. 2018. Making Huge Pages Actually Useful. In ASPLOS '18. ACM, 679–692. https://doi.org/10.1145/3173162.3173203
[60] Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel H. Loh. 2014. Increasing TLB reach by exploiting clustering in page translations. In HPCA 2014. 558–567.
[61] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced Large-Reach TLBs. In MICRO-45. IEEE Computer Society, 258–269. https://doi.org/10.1109/MICRO.2012.32
[62] Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee. 2015. Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways? In MICRO-48. ACM, 1–12. https://doi.org/10.1145/2830772.2830773
[63] Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. 2017. Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB. In ISCA '17. ACM, 469–480. https://doi.org/10.1145/3079856.3080210
[64] Indira Subramanian, Clifford Mather, Kurt Peterson, and Balakrishna Raghunath. 1998. Implementation of Multiple Pagesize Support in HP-UX. In USENIX Annual Technical Conference, 105–119.
[65] John R. Tramm, Andrew R. Siegel, Tanzima Islam, and Martin Schulz. [n.d.]. XSBench: The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis. In PHYSOR 2014: The Role of Reactor Physics toward a Sustainable Future, Kyoto.
[66] Koji Ueno and Toyotaro Suzumura. 2012. Highly Scalable Graph Search for the Graph500 Benchmark. In HPDC '12. ACM, 149–160. https://doi.org/10.1145/2287076.2287104
[67] Carl A. Waldspurger. 2002. Memory Resource Management in VMware ESX Server. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 181–194. https://doi.org/10.1145/844128.844146
[68] Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards Practical Page Coloring-based Multicore Cache Management. In EuroSys '09. ACM, 89–102. https://doi.org/10.1145/1519065.1519076

4 EvaluationWe now evaluate our algorithms in more detail Our exper-imental platform is an Intel Haswell-EP based E5-2690 v3server system running CentOS v74 on which 96GB mem-ory and 48 cores (with hyperthreading enabled) running at23GHz are partitioned on two sockets we bind each work-load to a single socket to avoid NUMA effects The L1 TLBcontains 64 and 8 entries for 4KB and 2MB pages respectivelywhile the L2 TLB contains 1024 entries for both 4KB and2MB pages The size of L1 L2 and shared L3 cache is 768KB3MB and 30MB resp A 96GB SSD-backed swap partitionis used to evaluate performance in an overcommitted sys-tem We evaluate HawkEye with a diverse set of workloadsranging from HPC graph algorithms in-memory databasesgenomics and machine learning [24 30 33 36 45 53 65 66]HawkEye is implemented in Linux kernel v43

We evaluate (a) the improvements due to our fine-grainedpromotion strategy based on access-coverage in both per-formance and fairness for single multiple homogeneous andmultiple heterogeneousworkloads (b) the bloat-vs-performancetradeoff (c) the impact of low latency page faults (d) cacheinterference caused by asynchronous pre-zeroing thread and(e) the impact of memory efficiency enabled by asynchronouspage pre-zeroingPerformance advantages of fine-grainedhuge page pro-motion We first evaluate the effectiveness of our access-coverage based promotion strategy For this we measure thetime required for our algorithm to recover from a fragmentedstate with high address translation overheads to a state with

0

6

12

18

24

Graph500 XSBench cgD

speedup

Linux Ingens HawkEyeshyPMU HawkEyeshyG

152

23 25155

20 36

1007 1011

175

628

152 168

0

300

600

900

1200

Graph500 XSBench cgD

Time saved (ms)

Figure 5 Performance speedup and time saved per hugepage promotion over baseline pages

Workload Execution Time (Seconds)Linux-4KB Linux-2MB Ingens HawkEye-PMU HawkEye-G

Graph500-1 2270 2145(106) 2243(101) 1987(114) 2007(113)Graph500-2 2289 2252(102) 2253(102) 1994(115) 2013(114)Graph500-3 2293 2293(10) 2299(100) 2012(114) 2018(114)Average 2284 2230(102) 2265(101) 1998(114) 2013(113)

XSBench-1 2427 2415(10) 2392(101) 2108(115) 2098(115)XSBench-2 2437 2427(10) 2415(101) 2109(115) 2110(115)XSBench-3 2443 2455(10) 2456(100) 2133(115) 2143(114)Average 2436 2432(10) 2421(100) 2117(115) 2117(115)

Table 5 Execution time of 3 instances of Graph500 andXSBench when executed simultaneously Values in parenthe-ses represent speedup over baseline pages

low address translation overheads Recall that HawkEye pro-motes pages based on access-coverage while previous ap-proaches promote pages in VA order (from low VAs to high)We fragment the memory initially by reading several files inmemory our test workloads are started in the fragmentedstate and we measure the time taken for the system to re-cover from high MMU overheads This experimental setupsimulates expected realistic situations where the memoryfragmentation in the system fluctuates over time Withoutpromotion our test workloads would keep incurring highaddress translation overheads however HawkEye is quicklyable to recover from these overheads through appropriatehuge page allocations Figure 5 (left) shows the performanceimprovement obtained by HawkEye (over a strategy thatnever promotes the pages) for three workloads Graph500XSBench and cgD These workloads allocate all their re-quired memory in the beginning ie in the fragmented stateof the system For these workloads the speedups due toeffective huge page management through HawkEye are ashigh as 22 Compared to Linux and Ingens access-coveragebased huge page promotion strategy of HawkEye improvesperformance by 13 12 and 6 over both Linux and Ingens

To understand this more clearly Figure 6 shows the accesspattern MMU overheads and the number of allocated hugepages over time for Graph500 and XSBench We find thathot-spots in these applications are concentrated in the highVAs and any sequential-scanning based promotion is likelyto be sub-optimal The second and third columns corrobo-rate this observation for example both HawkEye variantstake asymp300 seconds to eliminate MMU overheads of XSBenchwhile Linux and Ingens have high overheads even after 1000seconds

To quantify the cost-benefit analysis of huge page promo-tions further we propose a new metric the average execu-tion time saved (over using only baseline pages) per hugepage promotion A scheme that maximizes this metric wouldbe most effective in reducing MMU overheads Figure 5(right) shows that HawkEye performs significantly betterthan Linux and Ingens on this metric The difference betweenthe efficiency of HawkEye-PMU and HawkEye-G is also evi-dent HawkEye-PMU is more efficient as it stops promotinghuge pages whenMMU overheads are below a certain thresh-old (2 in our experiments) In summary HawkEye-G andHawkEye-PMU are up to 67times and 44times more efficient (forXSBench) than Linux in terms of time saved per huge pagepromotion For workloads whose access patterns are spreaduniformly over the VA space HawkEye delivers performancesimilar to Linux and IngensFairness advantages of fine-grained huge page promo-tion We next experiment withmultiple applications to studyboth performance and fairness first with identical applica-tions and then with heterogeneous applications runningsimultaneouslyIdentical workloads Figure 7 shows the MMU overheadsand huge page allocations when 3 instances of Graph500 andXSBench (in separate runs) are executed concurrently afterfragmenting the system (see Table 5 for execution time)While Linux creates performance imbalance by promot-

ing huge pages in one process at a time HawkEye ensuresfairness by judiciously distributing huge pages across allworkload instances Overall HawkEye-PMU and HawkEye-G achieve 114times and 113times speedup for Graph500 and 115timesspeedup for XSBench over Linux on average Unlike LinuxIngens promotes huge pages proportionally in all 3 instancesHowever it fails to improve the performance of these work-loads In fact it may lead to poorer performance than LinuxWe explain this behaviour with an example

Linux selects an address from Graph500-1rsquos address spaceand promotes all pages above it in around 10 minutes Eventhough this strategy is unfair it improves the performanceof Graph500-1 by promoting its hot-regions Linux then se-lects Graph500-2whose MMU overheads decrease after asymp20minutes In contrast Ingens promotes huge pages from lowerVAs in each instance of Graph500 Thus it takes even longerfor Ingens to promote suitable (hot) regions of the respec-tive processes which leads to 9 performance degradationover Linux for Graph500 For XSBench the performance ofboth Ingens and Linux is similar (but inferior to HawkEye)because both fail to promote the applicationrsquos hot regionsbefore the application finishesHeterogeneous workloads To measure the efficacy of dif-ferent strategies for heterogeneous workloads we executeworkloads by grouping them into sets where a set containsone TLB sensitive and one TLB insensitive application Each

0

128

256

384

512

05GB 1GB 15GB 2GB

Cold Region

acce

ss-c

over

age

Virtual Address Space

0

15

30

45

60

300 600 900 1200 1500

MM

U O

verh

ead

Time (seconds)

0

200

400

600

800

1000

300 600 900 1200 1500

Linux Ingens

HawkEye-G

HawkEye-PMU

H

uge

Pag

es

Time (seconds)

(a) Graph500

0

128

256

384

512

1GB 2GB 3GB 4GB 5GB 6GB

acce

ss-c

over

age

Virtual Address Space

0

15

30

45

60

300 600 900 1200 1500 1800 2100 2400

Linux Ingens

HawkEye-PMU HawkEye-GMM

U O

ver

hea

dTime (seconds)

0

500

1000

1500

2000

300 600 900 1200 1500 1800 2100 2400

Linux Ingens H

awkEye-G

HawkEye-PMU H

uge

Pag

es

Time (seconds)

(b) XSBenchFigure 6 Access-coverage in application VA MMU overhead and huge page promotions for Graph500 and XSBench HawkEyeis more efficient in promoting huge pages due to fine-grained decision making

0

200

400

600

800

1000

10 20 30 40

1 2

3 H

uge

Pag

es

Time (minutes)

0

200

400

600

800

1000

10 20 30 40

1 2 3

H

uge

Pag

es

Time (minutes)

0

200

400

600

800

1000

10 20 30 40

H

uge

Pag

es

Time (minutes)

0

200

400

600

800

1000

10 20 30 40

1 2 3

H

uge

Pag

es

Time (minutes)

0

10

20

30

40

50

10 20 30 40

MM

U O

verh

ead

Time (Minutes)

0

10

20

30

40

50

10 20 30 40

MM

U O

verh

ead

Time (minutes)

0

10

20

30

40

50

10 20 30 40

MM

U O

verh

ead

Time (minutes)

0

10

20

30

40

50

10 20 30 40

MM

U O

verh

ead

Time (minutes)

0

500

1000

1500

2000

10 20 30 40 50

1

2 3

H

uge

Pag

es

Time (Minutes)

0

300

600

900

1200

10 20 30 40 50

1 2 3

H

uge

Pag

es

Time (Minutes)

0

300

600

900

1200

10 20 30 40 50

H

uge

Pag

es

Time (Minutes)

0

300

600

900

1200

10 20 30 40 50

1 2 3

H

uge

Pag

es

Time (Minutes)

0

15

30

45

60

10 20 30 40 50

1 2 3

MM

U O

verh

ead

Time (Minutes)

0

15

30

45

60

10 20 30 40 50

1 2 3

MM

U O

verh

ead

Time (Minutes)

0

15

30

45

60

10 20 30 40 50

1 2 3

MM

U O

verh

ead

Time (Minutes)

0

15

30

45

60

10 20 30 40 50

1 2 3

MM

U O

verh

ead

Time (Minutes)

(a) Linux-2MB (b) Ingens (c) HawkEye-PMU (d) HawkEye-GFigure 7 Huge page promotions and their impact on TLB overhead over time for Linux Ingens and HawkEye for threeinstances of Graph500 and XSBench each

set is executed twice after fragmenting the system to as-sess the impact of huge page policies on the order of ex-ecution For the TLB insensitive workload we execute alightly loaded Redis server with 1KB-sized 40M keys servic-ing 10K requests per-second Figure 8 shows the performance

speedup over 4KB pages The (Before) and (After) configura-tions show whether the TLB sensitive workload is launchedbefore Redis or after Because Linux promotes huge pagesin process launch order the two configurations (Before) and(After) are intended to cover its two different behavioursIngens and HawkEye are agnostic to process creation order

0

15

30

45

60

75

cactusADM tigr Graph500 lbm_s SVM XSBench cgD cactusADM tigr Graph500 lbm_s SVM XSBench cgDBefore After

s

peed

up

Chart Title

Linux Ingens HawkEye-PMU HawkEye-G

Figure 8 Performance speedup over baseline pages of TLB sensitive applications when they are executed alongside a lightlystressed Redis server in different orders

1

12

14

16

18

2

Nor

mal

ized

Per

form

ance

HawkEye-Host HawkEye-Guest HawkEye-Both

Figure 9 Performance compared to Linux in a virtualizedsystem when HawkEye is applied at the host guest and bothlayers

The impact of execution order is clearly visible for LinuxIn the (Before) case Linux with THP support improves per-formance over baseline pages by promoting huge pages inTLB sensitive applications However in the (After) case itpromotes huge pages in Redis resulting in poor performancefor TLB sensitive workloads While the execution order doesnot matter in Ingens its proportional huge page promotionstrategy favors large memory workloads (Redis in this case)Further our client requests randomly selected keys whichmakes the Redis server access all its huge pages uniformlyConsequently Ingens promotes most huge pages in Redisleading to suboptimal performance of TLB sensitive work-loads In contrast HawkEye promotes huge pages based onMMU overhead measurements (HawkEye-PMU) or access-coverage based estimation of MMU overheads (HawkEye-G)This leads to 15ndash60 performance improvement over 4KBpages irrespective of the order of executionPerformance in virtualized systems For a virtualizedsystemwe use the KVMhypervisor runningwith Ubuntu1604and fragment the system prior to running the workloadsEvaluation is presented for three different configurationsie when HawkEye is deployed at the host guest and boththe layers Each of these configurations requires running thesame set of applications differently (see Table 6 for details)Figure 9 shows that HawkEye provides 18ndash90 speedup

compared to Linux in virtual environments Notice that insome cases (eg cgD) performance improvement is higherin a virtualized system compared to bare-metal This is due tohigh MMU overheads involved in traversing nested page ta-bles while servicing TLB misses of guest applications whichpresents higher opportunities to be exploited by HawkEye

Configuration Details

HawkEye-Host Two VMs each with 24 vCPUs and 48GB memoryVM-1 runs Redis VM-2 runs TLB sensitive workloads

HawkEye-Guest Single VM with 48 vCPUs and 80GB memoryrunning Redis and TLB sensitive workloads

HawkEye-Both Two VMs each with 24 vCPUs VM-1 (30GB) runs RedisVM-2 (60GB) runs both Redis and TLB sensitive workloads

Table 6 Experimental setup for configurations used to eval-uate a virtualized system

Bloat vs performance We next study the relationship be-tween memory bloat and performance we revisit an experi-ment similar to the one summarized in sect21 We populate aRedis instance with 8M (10B key 4KB value) key-value pairsand then delete 60 randomly selected keys (see Table 7)While Linux-4KB (no THP) is memory efficient (no bloat)its performance is low (7 lower throughput than Linux-2MB ie with THP) Linux-2MB delivers high performancebut consumes more memory which remains unavailable tothe rest of system even when memory pressure increasesIngens can be configured to balance the memory vs perfor-mance tradeoff For example Ingens-90 (Ingens with 90utilization threshold) is memory efficient while Ingens-50(Ingens with 50 utilization threshold) favors performanceand allows more memory bloat In either case it is unableto recover from bloat that may have already been gener-ated (eg during its aggressive phase) By switching fromaggressive huge page allocations under low memory pres-sure to memory conserving strategy at runtime HawkEye isable to achieve both low MMU overheads and high memoryefficiency depending on the state of the systemFast page faults Table 8 shows the performance of differentstrategies for the workloads chosen for these experimentstheir performance depends on the efficiency of the OS pagefault handler We measure Redis throughput when 2MB-sized values are inserted SparseHash [13] is a hash-maplibrary in C++ while HACC-IO [8] is a parallel IO benchmarkused with an in-memory file system We also measure thespin-up time of two virtual machines KVM and a Java VirtualMachine (JVM) where both are configured to allocate allmemory during initialization

Performance benefits of async page pre-zeroing with basepages (HawkEye-4KB) aremodest in this case the other page

Kernel Self-tuning Memory Size ThroughputLinux-4KB No 162GB 1061KLinux-2MB No 332GB 1138KIngens-90 No 163GB 1068KIngens-50 No 331GB 1134KHawkEye (no mem pressure) Yes 332GB 1136KHawkEye (mem pressure) Yes 162GB 1058K

Table 7 Redis memory consumption and throughput

Workload OS ConfigurationLinux4KB

Linux2MB

Ingens90

HawkEye4KB

HawkEye2MB

Redis (45GB) 233 437 192 236 551SparseHash (36GB) 501 172 515 466 106HACC-IO (6GB) 65 45 66 65 42JVM Spin-up (36GB) 377 186 527 298 137KVM Spin-up (36GB) 406 97 418 302 070

Table 8 Performance implications of asynchronous pagezeroing Values for Redis represent throughput (higher isbetter) all other values represent time in seconds (lower isbetter)

0

10

20

30

40

NPB PARSEC Graph500 cactusADM PageRank omnetpp

overhead with caching-stores with nt-stores

Figure 10 Performance overhead of async pre-zeroing withand without caching instructions The first two bars (NPBand Parsec) represent the average of all workloads in therespective benchmark suite

fault overheads (apart from zeroing) dominate performanceHowever page-zeroing cost becomes significant with hugepages Consequently HawkEye with huge pages (HawkEye-2MB) improves the performance of Redis and SparseHashby 126times and 162times over Linux The spin-up time of virtualmachines is purely dominated by page faults Hence weobserve a dramatic reduction in the spin-up time of VMswith async pre-zeroing of huge pages 138times and 136times overLinux Notice that without huge pages (only base pages)this reduction is only 134times and 126times All these workloadshave high spatial locality of page faults and hence the Ingensstrategy of utilization-based promotion has a negative impacton performance due to a higher number of page faultsAsync pre-zeroing enables memory sharing in virtual-ized environments Finally we note that async pre-zeroingwithin VMs enables memory sharing across VMs the freememory of a VM returns to the host through pre-zeroing inthe VM and same-page merging at the host This can havethe same net effect as ballooning as the free memory in aVM gets shared with other VMs In our experiments withmemory over-committed systems we have confirmed thisbehaviour the overall performance of an overcommittedsystem where the VMs are running HawkEye matches the

0

05

1

15

2

25

PageRank Redis MongoDB CC Memcached SPDNode 0 Node 1

Nor

mal

ized

Per

form

ance

Linux Linux-Ballooning HawkEye

Figure 11 Performance normalized to the case of no bal-looning in an overcommitted virtualized system

performance achieved through para-virtual interfaces likememory balloon drivers For example in an experiment in-volving a mix of latency-sensitive key-value stores and HPCworkloads (see Figure 11) with a total peakmemory consump-tion of around 150GB (15times of total memory) HawkEye-Gprovides 23times and 142times higher throughput for Redis andMongoDB along with significant tail latency reduction ForPageRank performance degrades slightly due to a highernumber of COW page faults due to unnecessary same-pagemerging These results are very close to the results obtainedwhen memory ballooning is enabled on all VMs We believethat this is a potential positive of our design and may offeran alternative method to efficiently manage memory in over-committed systems It is well-known that balloon-driversand other para-virtual interfaces are riddled with compatibil-ity problems [29] and a fully-virtual solution to this problemwould be highly desirable While our experiments show earlypromise in this direction we leave an extensive evaluationfor future workPerformance overheads ofHawkEyeWe evaluatedHawk-Eye with non-fragmented systems and with several work-loads that donrsquot benefit from huge pages and find that theperformance with HawkEye is very similar to performancewith Linux or Ingens in these scenarios confirming thatHawkEyersquos algorithms for access-bit tracking asynchronoushuge page promotion and memory de-duplication add negli-gible overheads over existingmechanisms in Linux Howeverthe async pre-zeroing thread requires special attention as itmay become a source of noticeable performance overheadfor certain types of workloads as discussed nextOverheads of async pre-zeroing A primary concern withasync pre-zeroing is its potential detrimental effect due tomemory and cache interference (sect31) Recall that we em-ploy memory stores with non-temporal hints to avoid theseeffects To measure the worst-case cache effects of asyncpre-zeroing we run our workloads while simultaneouslyzero-filling pages on a separate core sharing the same L3cache (ie on the same socket) at a high rate of 025M pagesper second (1GBps) with and without non-temporal mem-ory stores Figure 10 shows that using non-temporal hintssignificantly brings down the overhead of async pre-zeroing(eg from 27 to 6 for omnetpp) the remaining overheadis due to additional memory traffic generated by the pre-zeroing thread Further we note that this experiment presents

Workload           MMU Overhead (%)   Time (seconds)
                                      4KB     HawkEye-PMU    HawkEye-G
random (4GB)       60                 582     328 (1.77×)    413 (1.41×)
sequential (4GB)   <1                 517     535            532
Total                                 1099    863 (1.27×)    945 (1.16×)
cg.D (16GB)        39                 1952    1202 (1.62×)   1450 (1.35×)
mg.D (24GB)        <1                 1363    1364           1377
Total                                 3315    2566 (1.29×)   2827 (1.17×)

Table 9. Comparison between HawkEye-PMU and HawkEye-G. Values in parentheses represent speedup over baseline pages.

Further, we note that this zero-filling experiment presents a highly exaggerated behavior of async pre-zeroing on worst-case workloads. In practice, the zeroing thread is rate-limited (e.g., to at most 10k pages per second), and the expected cache-interference overheads would be proportionally smaller.

Comparison between HawkEye-PMU and HawkEye-G. Even though HawkEye-G is based on simple, approximate estimations of TLB contention, it is reasonably accurate in identifying TLB-sensitive processes and memory regions in most cases. However, HawkEye-PMU performs better than HawkEye-G in some cases. Table 9 demonstrates two such examples, where four workloads, all with high access-coverage but different MMU overheads, are executed in two sets; each set contains one TLB-sensitive and one TLB-insensitive workload. The 4KB column represents the case where no huge pages are promoted. As discussed earlier (§2.4), MMU overheads depend on the complex interaction between the hardware and the access pattern. In this case, despite having high access-coverage in their VA regions, sequential and mg.D have negligible MMU overheads (i.e., they are TLB-insensitive). While HawkEye-PMU correctly identifies the process with higher MMU overheads for huge page allocation, HawkEye-G treats the two processes similarly (due to imprecise estimations) and may allocate huge pages to the TLB-insensitive process. Consequently, HawkEye-PMU may perform up to 36% better than HawkEye-G. Bridging the gap between the two approaches in a hardware-independent manner is interesting future work.
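As an illustration of the quantity HawkEye-PMU tracks, the sketch below estimates the fraction of a process's CPU cycles spent in hardware page walks using Linux's perf_event interface. This is our own hedged reconstruction, not the paper's kernel code: the raw event encoding for "page-walk cycles" is microarchitecture-specific (e.g., Intel's DTLB_LOAD_MISSES.WALK_DURATION), so it is passed in as a parameter rather than hard-coded.

```c
/* Approximate MMU overhead = page-walk cycles / total cycles. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_counter(pid_t pid, unsigned type, unsigned long long config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;            /* PERF_TYPE_RAW or PERF_TYPE_HARDWARE */
    attr.config = config;
    attr.inherit = 1;            /* follow children of the process */
    return (int)syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}

/* walk_event: CPU-specific raw encoding of the page-walk-cycles event. */
double mmu_overhead(pid_t pid, unsigned long long walk_event, unsigned secs)
{
    int wfd = open_counter(pid, PERF_TYPE_RAW, walk_event);
    int cfd = open_counter(pid, PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
    long long walk = 0, cycles = 0;

    sleep(secs);                 /* sample over an interval */
    read(wfd, &walk, sizeof(walk));
    read(cfd, &cycles, sizeof(cycles));
    close(wfd);
    close(cfd);
    return cycles > 0 ? (double)walk / (double)cycles : 0.0;
}
```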

5 Related Work

MMU overheads have been a topic of much research in recent years, and several solutions have been proposed, including architecture-based, OS-based and/or compiler-based approaches [28, 34, 49, 56, 58, 62–64].

Hardware Support. Most proposals for hardware support aim to reduce TLB-miss frequency or accelerate TLB-miss processing. Large multi-level TLBs and support for multiple page sizes to minimize the number of TLB misses are already common in general-purpose processors today [15, 20, 21, 40]. Further, modern processors employ page-walk caches to avoid memory lookups in the TLB-miss path [31, 35]. POM-TLB services a TLB miss using a single memory lookup and further leverages regular data caches to speed up address translation [63]. Direct segments [32, 46] were proposed to completely avoid TLB-miss processing cost through special segmentation hardware.

OS-level challenges of memory fragmentation have also been considered in hardware designs. CoLT, or Coalesced-Large-reach TLBs [61], was initially proposed to increase TLB reach using base pages, based on the observation that OSs naturally provide contiguous mappings at smaller granularity; this approach was later extended to page-walk caches and huge pages [35, 40, 60]. Techniques to support huge pages in non-contiguous physical memory have also been proposed [44]. Hardware optimizations are important but complementary to HawkEye.

Software Support. FreeBSD's huge page management is largely influenced by previous work on superpages [57]. Carrefour-LP [47] showed that ad-hoc usage of huge pages can degrade performance on NUMA systems, due to additional remote memory accesses and traffic imbalance across node interconnects, and proposed hardware-profiling based techniques to overcome these problems. Anti-fragmentation techniques that cluster pages by their mobility type, proposed by Gorman et al. [48], have been adopted by Linux to support the allocation of huge pages in long-running systems. Illuminator [59] highlighted critical drawbacks of Linux's fragmentation-mitigation techniques and their implications for performance and latency, and proposed techniques to solve these problems. Mechanisms for dealing with physical memory fragmentation and NUMA systems address important but orthogonal problems, and their solutions can be adopted to improve the robustness of HawkEye. Compiler or application hints can also be used to assist OSs in prioritizing huge page mappings for certain parts of the address space [27, 56], for which Linux already provides an interface through the madvise system call [16] (a short sketch of this interface follows).
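For concreteness, this is how an application supplies such a hint on Linux via madvise(2) [16]; a minimal sketch, assuming transparent huge pages are configured in "madvise" mode (the function name is ours):

```c
/* Hint that a hot region should be backed by transparent huge pages. */
#include <stddef.h>
#include <sys/mman.h>

void *alloc_hot_region(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED)
        madvise(p, len, MADV_HUGEPAGE);   /* see madvise(2) [16] */
    return p;
}
```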

An alternative approach for supporting huge pages has also been explored via libhugetlbfs [22], where the user is given more control over huge page allocation. However, such an approach requires manual intervention to reserve huge pages in advance, and considers each application in isolation. Windows and OS X support huge pages only through this reservation-based approach, to avoid the issues associated with transparent huge page management [17, 55, 59]. We believe that insights from HawkEye can be leveraged to improve huge page support in these important systems.

6 Conclusions

To summarize: transparent huge page management is essential but complex. Effective and efficient algorithms are required in the OS to automatically balance the tradeoffs and provide performant and stable system behavior. We expose some important subtleties related to huge page management in existing proposals, and propose a new set of algorithms to address them in HawkEye.

References

[1] 2000. Clearing pages in the idle loop. https://www.mail-archive.com/freebsd-hackers@freebsd.org/msg13993.html
[2] 2000. Linux Page Zeroing Strategy. https://yarchive.net/comp/linux/page_zeroing_strategy.html
[3] 2007. Memory part 5: What programmers can do. https://lwn.net/Articles/255364/
[4] 2010. Mysteries of Windows Memory Management Revealed: Part Two. https://blogs.msdn.microsoft.com/tims/2010/10/29/pdc10-mysteries-of-windows-memory-management-revealed-part-two/
[5] 2012. khugepaged eating 100% CPU. https://bugzilla.redhat.com/show_bug.cgi?id=879801
[6] 2012. Recommendation to disable huge pages for Hadoop. https://developer.amd.com/wordpress/media/2012/10/Hadoop_Tuning_Guide-Version5.pdf
[7] 2014. Arch Linux becomes unresponsive from khugepaged. http://unix.stackexchange.com/questions/161858/arch-linux-becomes-unresponsive-from-khugepaged
[8] 2014. CORAL Benchmark Codes. https://asc.llnl.gov/CORAL-benchmarks/#hacc
[9] 2014. Recommendation to disable huge pages for NuoDB. http://www.nuodb.com/techblog/linux-transparent-huge-pages-jemalloc-and-nuodb
[10] 2014. Why TokuDB Hates Transparent HugePages. https://www.percona.com/blog/2014/07/23/why-tokudb-hates-transparent-hugepages/
[11] 2015. The Black Magic Of Systematically Reducing Linux OS Jitter. http://highscalability.com/blog/2015/4/8/the-black-magic-of-systematically-reducing-linux-os-jitter.html
[12] 2015. Tales from the Field: Taming Transparent Huge Pages on Linux. https://www.perforce.com/blog/151016/tales-field-taming-transparent-huge-pages-linux
[13] 2016. C++ associative containers. https://github.com/sparsehash/sparsehash
[14] 2016. Remove PG_ZERO and zeroidle (page-zeroing) entirely. https://news.ycombinator.com/item?id=12227874
[15] 2017. Hugepages. https://wiki.debian.org/Hugepages
[16] 2017. MADVISE. Linux Programmer's Manual. http://man7.org/linux/man-pages/man2/madvise.2.html
[17] 2018. Large-Page Support in Windows. https://docs.microsoft.com/en-gb/windows/desktop/Memory/large-page-support
[18] 2019. CGROUPS. Linux Programmer's Manual. http://man7.org/linux/man-pages/man7/cgroups.7.html
[19] 2019. FreeBSD Pre-Faulting and Zeroing Optimizations. https://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/prefault-optimizations.html
[20] 2019. Intel Haswell. http://www.7-cpu.com/cpu/Haswell.html
[21] 2019. Intel Skylake. http://www.7-cpu.com/cpu/Skylake.html
[22] 2019. Libhugetlbfs: Linux man page. https://linux.die.net/man/7/libhugetlbfs
[23] 2019. Recommendation to disable huge pages for MongoDB. https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
[24] 2019. Recommendation to disable huge pages for Redis. http://redis.io/topics/latency
[25] 2019. Recommendation to disable huge pages for VoltDB. https://docs.voltdb.com/AdminGuide/adminmemmgt.php
[26] 2019. Transparent Hugepage Support. https://www.kernel.org/doc/Documentation/vm/transhuge.txt
[27] Mohammad Agbarya, Idan Yaniv, and Dan Tsafrir. 2018. Memomania: From Huge to Huge-Huge Pages. In Proceedings of the 11th ACM International Systems and Storage Conference (SYSTOR '18). ACM, New York, NY, USA, 112. https://doi.org/10.1145/3211890.3211918
[28] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. 2017. Do-It-Yourself Virtual Memory Translation. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 457–468. https://doi.org/10.1145/3079856.3080209
[29] Nadav Amit, Dan Tsafrir, and Assaf Schuster. 2014. VSwapper: A Memory Swapper for Virtualized Environments. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 349–366. https://doi.org/10.1145/2541940.2541969
[30] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS Parallel Benchmarks: Summary and Preliminary Results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91). ACM, New York, NY, USA, 158–165. https://doi.org/10.1145/125826.125925
[31] Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don't Walk (the Page Table). In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM, New York, NY, USA, 48–59. https://doi.org/10.1145/1815961.1815970
[32] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient Virtual Memory for Big Memory Servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 237–248. https://doi.org/10.1145/2485922.2485943
[33] Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. CoRR abs/1508.03619 (2015). http://arxiv.org/abs/1508.03619
[34] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha Manne. 2008. Accelerating Two-dimensional Page Walks for Virtualized Systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XIII). ACM, New York, NY, USA, 26–35. https://doi.org/10.1145/1346281.1346286
[35] Abhishek Bhattacharjee. 2013. Large-reach Memory Management Unit Caches. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 383–394. https://doi.org/10.1145/2540708.2540741
[36] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08). ACM, New York, NY, USA, 72–81. https://doi.org/10.1145/1454115.1454128
[37] Daniel Bovet and Marco Cesati. 2005. Understanding The Linux Kernel. O'Reilly & Associates Inc.
[38] Josiah L. Carlson. 2013. Redis in Action. Manning Publications Co., Greenwich, CT, USA.
[39] Jonathan Corbet. 2010. Memory compaction. https://lwn.net/Articles/368869/
[40] Guilherme Cox and Abhishek Bhattacharjee. 2017. Efficient Address Translation for Architectures with Multiple Page Sizes. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17). ACM, New York, NY, USA, 435–448. https://doi.org/10.1145/3037697.3037704
[41] Cort Dougan, Paul Mackerras, and Victor Yodaiken. 1999. Optimizing the Idle Task and Other MMU Tricks. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (OSDI '99). USENIX Association, Berkeley, CA, USA, 229–237. http://dl.acm.org/citation.cfm?id=296806.296833
[42] Niall Douglas. 2011. User Mode Memory Page Allocation: A Silver Bullet For Memory Allocation? CoRR abs/1105.1811 (2011). http://arxiv.org/abs/1105.1811
[43] Ulrich Drepper. 2007. What Every Programmer Should Know About Memory.
[44] Yu Du, Miao Zhou, Bruce R. Childers, Daniel Mossé, and Rami G. Melhem. 2015. Supporting superpages in non-contiguous physical memory. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 223–234.
[45] M. Franklin, D. Yeung, Xue Wu, A. Jaleel, K. Albayraktaroglu, B. Jacob, and Chau-Wen Tseng. 2005. BioBench: A Benchmark Suite of Bioinformatics Applications. In IEEE International Symposium on Performance Analysis of Systems and Software, 2005 (ISPASS). 2–9. https://doi.org/10.1109/ISPASS.2005.1430554
[46] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2014. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 178–189. https://doi.org/10.1109/MICRO.2014.37
[47] Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston, Alexandra Fedorova, and Vivien Quéma. 2014. Large Pages May Be Harmful on NUMA Systems. In Proceedings of the 2014 USENIX Annual Technical Conference (USENIX ATC '14). USENIX Association, Berkeley, CA, USA, 231–242. http://dl.acm.org/citation.cfm?id=2643634.2643659
[48] Mel Gorman and Patrick Healy. 2008. Supporting Superpage Allocation Without Additional Hardware Support. In Proceedings of the 7th International Symposium on Memory Management (ISMM '08). ACM, New York, NY, USA, 41–50. https://doi.org/10.1145/1375634.1375641
[49] Mel Gorman and Patrick Healy. 2012. Performance Characteristics of Explicit Superpage Support. In Proceedings of the 2010 International Conference on Computer Architecture (ISCA '10). Springer-Verlag, Berlin, Heidelberg, 293–310. https://doi.org/10.1007/978-3-642-24322-6_24
[50] Mel Gorman and Andy Whitcroft. 2006. The What, The Why and the Where To of Anti-Fragmentation. In Linux Symposium, 141.
[51] Fei Guo, Seongbeom Kim, Yury Baskakov, and Ishan Banerjee. 2015. Proactively Breaking Large Pages to Improve Memory Overcommitment Performance in VMware ESXi. In Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '15). ACM, New York, NY, USA, 39–51. https://doi.org/10.1145/2731186.2731187
[52] Fan Guo, Yongkun Li, Yinlong Xu, Song Jiang, and John C. S. Lui. 2017. SmartMD: A High Performance Deduplication Engine with Mixed Pages. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC '17). USENIX Association, Berkeley, CA, USA, 733–744. http://dl.acm.org/citation.cfm?id=3154690.3154759
[53] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (Sept. 2006), 1–17. https://doi.org/10.1145/1186736.1186737
[54] Vasileios Karakostas, Osman S. Unsal, Mario Nemirovsky, Adrián Cristal, and Michael M. Swift. 2014. Performance analysis of the memory management unit under scale-out workloads. In 2014 IEEE International Symposium on Workload Characterization (IISWC). 1–12.
[55] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 705–721. http://dl.acm.org/citation.cfm?id=3026877.3026931
[56] Joshua Magee and Apan Qasem. 2009. A Case for Compiler-driven Superpage Allocation. In Proceedings of the 47th Annual Southeast Regional Conference (ACM-SE 47). ACM, New York, NY, USA, Article 82, 4 pages. https://doi.org/10.1145/1566445.1566553
[57] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. 2002. Practical, Transparent Operating System Support for Superpages. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 89–104. https://doi.org/10.1145/844128.844138
[58] Ashish Panwar, Naman Patel, and K. Gopinath. 2016. A Case for Protecting Huge Pages from the Kernel. In Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys '16). ACM, New York, NY, USA, Article 15, 8 pages. https://doi.org/10.1145/2967360.2967371
[59] Ashish Panwar, Aravinda Prasad, and K. Gopinath. 2018. Making Huge Pages Actually Useful. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 679–692. https://doi.org/10.1145/3173162.3173203
[60] Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel H. Loh. 2014. Increasing TLB reach by exploiting clustering in page translations. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). 558–567.
[61] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced Large-Reach TLBs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 258–269. https://doi.org/10.1109/MICRO.2012.32
[62] Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee. 2015. Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways? In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 1–12. https://doi.org/10.1145/2830772.2830773
[63] Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. 2017. Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 469–480. https://doi.org/10.1145/3079856.3080210
[64] Indira Subramanian, Clifford Mather, Kurt Peterson, and Balakrishna Raghunath. 1998. Implementation of Multiple Pagesize Support in HP-UX. In USENIX Annual Technical Conference. 105–119.
[65] John R. Tramm, Andrew R. Siegel, Tanzima Islam, and Martin Schulz. [n. d.]. XSBench - The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis. In PHYSOR 2014 - The Role of Reactor Physics toward a Sustainable Future. Kyoto.
[66] Koji Ueno and Toyotaro Suzumura. 2012. Highly Scalable Graph Search for the Graph500 Benchmark. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC '12). ACM, New York, NY, USA, 149–160. https://doi.org/10.1145/2287076.2287104
[67] Carl A. Waldspurger. 2002. Memory Resource Management in VMware ESX Server. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 181–194. https://doi.org/10.1145/844128.844146
[68] Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards Practical Page Coloring-based Multicore Cache Management. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys '09). ACM, New York, NY, USA, 89–102. https://doi.org/10.1145/1519065.1519076





[Figure 6: Access-coverage across the application's virtual address space, MMU overhead (%) over time, and huge page promotions over time for (a) Graph500 and (b) XSBench, comparing Linux, Ingens, HawkEye-G and HawkEye-PMU. HawkEye is more efficient in promoting huge pages due to fine-grained decision making.]

[Figure 7: Huge page promotions and their impact on TLB overhead over time for (a) Linux-2MB, (b) Ingens, (c) HawkEye-PMU and (d) HawkEye-G, for three instances each of Graph500 and XSBench.]

set is executed twice after fragmenting the system, to assess the impact of huge page policies on the order of execution. For the TLB-insensitive workload, we execute a lightly loaded Redis server with 40M 1KB-sized keys, servicing 10K requests per second. Figure 8 shows the performance speedup over 4KB pages. The (Before) and (After) configurations show whether the TLB-sensitive workload is launched before Redis or after. Because Linux promotes huge pages in process launch order, the two configurations (Before) and (After) are intended to cover its two different behaviours; Ingens and HawkEye are agnostic to process creation order.

[Figure 8: Performance speedup over baseline pages of TLB-sensitive applications (cactusADM, tigr, Graph500, lbm_s, SVM, XSBench, cg.D) when they are executed alongside a lightly stressed Redis server in different orders (Before/After), for Linux, Ingens, HawkEye-PMU and HawkEye-G.]

[Figure 9: Performance compared to Linux in a virtualized system when HawkEye is applied at the host, guest, and both layers (HawkEye-Host, HawkEye-Guest, HawkEye-Both).]

The impact of execution order is clearly visible for Linux. In the (Before) case, Linux with THP support improves performance over baseline pages by promoting huge pages in TLB-sensitive applications. However, in the (After) case, it promotes huge pages in Redis, resulting in poor performance for the TLB-sensitive workloads. While the execution order does not matter in Ingens, its proportional huge page promotion strategy favors large-memory workloads (Redis in this case). Further, our client requests randomly selected keys, which makes the Redis server access all its huge pages uniformly. Consequently, Ingens promotes most huge pages in Redis, leading to suboptimal performance of the TLB-sensitive workloads. In contrast, HawkEye promotes huge pages based on MMU overhead measurements (HawkEye-PMU) or access-coverage based estimation of MMU overheads (HawkEye-G). This leads to a 15–60% performance improvement over 4KB pages irrespective of the order of execution.

Performance in virtualized systems: For a virtualized system, we use the KVM hypervisor running with Ubuntu 16.04 and fragment the system prior to running the workloads. Evaluation is presented for three different configurations, i.e., when HawkEye is deployed at the host, the guest, and both layers. Each of these configurations requires running the same set of applications differently (see Table 6 for details). Figure 9 shows that HawkEye provides an 18–90% speedup compared to Linux in virtual environments. Notice that in some cases (e.g., cg.D), the performance improvement is higher in a virtualized system than on bare-metal. This is due to the high MMU overheads involved in traversing nested page tables while servicing TLB misses of guest applications, which presents higher opportunities to be exploited by HawkEye.
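To make the nested-paging cost concrete: with an n-level guest page table and an m-level host page table, a worst-case two-dimensional page walk requires n*m + n + m memory references, since each of the n guest page-table pointers must itself be translated through the m host levels. For the 4-level tables used on x86-64 this is 4*4 + 4 + 4 = 24 references, versus 4 for native execution [34], which is why eliminating TLB misses pays off disproportionately in VMs.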

Configuration    Details
HawkEye-Host     Two VMs, each with 24 vCPUs and 48GB memory. VM-1 runs Redis; VM-2 runs TLB-sensitive workloads.
HawkEye-Guest    A single VM with 48 vCPUs and 80GB memory, running Redis and TLB-sensitive workloads.
HawkEye-Both     Two VMs, each with 24 vCPUs. VM-1 (30GB) runs Redis; VM-2 (60GB) runs both Redis and TLB-sensitive workloads.

Table 6: Experimental setup for configurations used to evaluate a virtualized system.

Bloat vs. performance: We next study the relationship between memory bloat and performance; we revisit an experiment similar to the one summarized in §2.1. We populate a Redis instance with 8M key-value pairs (10B keys, 4KB values) and then delete 60% of randomly selected keys (see Table 7). While Linux-4KB (no THP) is memory efficient (no bloat), its performance is low (7% lower throughput than Linux-2MB, i.e., with THP). Linux-2MB delivers high performance but consumes more memory, which remains unavailable to the rest of the system even when memory pressure increases. Ingens can be configured to balance the memory vs. performance tradeoff. For example, Ingens-90 (Ingens with a 90% utilization threshold) is memory efficient, while Ingens-50 (Ingens with a 50% utilization threshold) favors performance and allows more memory bloat. In either case, it is unable to recover from bloat that may have already been generated (e.g., during its aggressive phase). By switching from aggressive huge page allocation under low memory pressure to a memory-conserving strategy at runtime, HawkEye is able to achieve both low MMU overheads and high memory efficiency, depending on the state of the system.

Fast page faults: Table 8 shows the performance of different strategies for the workloads chosen for these experiments; their performance depends on the efficiency of the OS page fault handler. We measure Redis throughput when 2MB-sized values are inserted. SparseHash [13] is a hash-map library in C++, while HACC-IO [8] is a parallel I/O benchmark used with an in-memory file system. We also measure the spin-up time of two virtual machines, KVM and a Java Virtual Machine (JVM), where both are configured to allocate all memory during initialization.

Performance benefits of async page pre-zeroing with base pages (HawkEye-4KB) are modest; in this case, the other page fault overheads (apart from zeroing) dominate performance.

Kernel                     Self-tuning   Memory Size   Throughput
Linux-4KB                  No            16.2GB        106.1K
Linux-2MB                  No            33.2GB        113.8K
Ingens-90                  No            16.3GB        106.8K
Ingens-50                  No            33.1GB        113.4K
HawkEye (no mem pressure)  Yes           33.2GB        113.6K
HawkEye (mem pressure)     Yes           16.2GB        105.8K

Table 7: Redis memory consumption and throughput.
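HawkEye's bloat recovery rests on detecting zero-filled base pages inside allocated huge pages under memory pressure, so that they can be de-duplicated to a shared zero page. As a minimal illustrative sketch (not HawkEye's actual kernel code; the word-at-a-time scan loop and the 2MB/4KB granularities are our framing of the idea), the core check reduces to:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BASE_PAGE_SIZE 4096
#define SUBPAGES_PER_HUGE_PAGE 512   /* 2MB / 4KB */

/* Returns true if a 4KB base page contains only zero bytes. A
 * word-at-a-time scan keeps the check cheap; a kernel would run
 * it from a background thread only when free memory falls below
 * a watermark, to keep the fast path unaffected. */
static bool page_is_zero(const void *page)
{
    const uint64_t *p = page;
    for (size_t i = 0; i < BASE_PAGE_SIZE / sizeof(uint64_t); i++)
        if (p[i] != 0)
            return false;
    return true;
}

/* Count reclaimable (zero-filled) base pages within one 2MB huge
 * page; a high count makes the huge page a candidate for breaking
 * and de-duplication under memory pressure. */
static unsigned count_zero_subpages(const void *huge_page)
{
    unsigned n = 0;
    for (size_t off = 0; off < SUBPAGES_PER_HUGE_PAGE * (size_t)BASE_PAGE_SIZE;
         off += BASE_PAGE_SIZE)
        if (page_is_zero((const char *)huge_page + off))
            n++;
    return n;
}
```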

Workload             Linux-4KB   Linux-2MB   Ingens-90   HawkEye-4KB   HawkEye-2MB
Redis (45GB)         23.3        43.7        19.2        23.6          55.1
SparseHash (36GB)    50.1        17.2        51.5        46.6          10.6
HACC-IO (6GB)        6.5         4.5         6.6         6.5           4.2
JVM Spin-up (36GB)   37.7        18.6        52.7        29.8          1.37
KVM Spin-up (36GB)   40.6        9.7         41.8        30.2          0.70

Table 8: Performance implications of asynchronous page zeroing. Values for Redis represent throughput (higher is better); all other values represent time in seconds (lower is better).

[Figure 10: Performance overhead (%) of async pre-zeroing with caching stores and with non-temporal (nt) stores, for NPB, PARSEC, Graph500, cactusADM, PageRank and omnetpp. The first two bars (NPB and PARSEC) represent the average of all workloads in the respective benchmark suite.]

However, page-zeroing cost becomes significant with huge pages. Consequently, HawkEye with huge pages (HawkEye-2MB) improves the performance of Redis and SparseHash by 1.26× and 1.62× over Linux. The spin-up time of virtual machines is purely dominated by page faults; hence we observe a dramatic reduction in the spin-up time of VMs with async pre-zeroing of huge pages: 13.8× and 13.6× over Linux. Notice that without huge pages (only base pages), this reduction is only 1.34× and 1.26×. All these workloads have high spatial locality of page faults, and hence the Ingens strategy of utilization-based promotion has a negative impact on performance due to a higher number of page faults.

Async pre-zeroing enables memory sharing in virtualized environments: Finally, we note that async pre-zeroing within VMs enables memory sharing across VMs: the free memory of a VM returns to the host through pre-zeroing in the VM and same-page merging at the host. This can have the same net effect as ballooning, as the free memory in a VM gets shared with other VMs. In our experiments with memory-overcommitted systems, we have confirmed this behaviour: the overall performance of an overcommitted system where the VMs are running HawkEye matches the performance achieved through para-virtual interfaces like memory balloon drivers. For example, in an experiment involving a mix of latency-sensitive key-value stores and HPC workloads (see Figure 11), with a total peak memory consumption of around 150GB (1.5× of total memory), HawkEye-G provides 2.3× and 1.42× higher throughput for Redis and MongoDB, along with a significant reduction in tail latency. For PageRank, performance degrades slightly due to a higher number of COW page faults caused by unnecessary same-page merging. These results are very close to those obtained when memory ballooning is enabled on all VMs. We believe that this is a potential positive of our design and may offer an alternative method to efficiently manage memory in overcommitted systems. It is well known that balloon drivers and other para-virtual interfaces are riddled with compatibility problems [29], and a fully-virtual solution to this problem would be highly desirable. While our experiments show early promise in this direction, we leave an extensive evaluation for future work.

[Figure 11: Performance normalized to the case of no ballooning in an overcommitted virtualized system, for Linux, Linux-Ballooning and HawkEye, across PageRank, Redis, MongoDB, CC, Memcached and SPD on two NUMA nodes.]

Performance overheads of HawkEye: We evaluated HawkEye with non-fragmented systems and with several workloads that don't benefit from huge pages, and found that performance with HawkEye is very similar to that with Linux or Ingens in these scenarios, confirming that HawkEye's algorithms for access-bit tracking, asynchronous huge page promotion, and memory de-duplication add negligible overheads over existing mechanisms in Linux. However, the async pre-zeroing thread requires special attention, as it may become a source of noticeable performance overhead for certain types of workloads, as discussed next.

Overheads of async pre-zeroing: A primary concern with async pre-zeroing is its potential detrimental effect due to memory and cache interference (§3.1). Recall that we employ memory stores with non-temporal hints to avoid these effects. To measure the worst-case cache effects of async pre-zeroing, we run our workloads while simultaneously zero-filling pages on a separate core sharing the same L3 cache (i.e., on the same socket) at a high rate of 0.25M pages per second (1GB/s), with and without non-temporal memory stores. Figure 10 shows that using non-temporal hints significantly brings down the overhead of async pre-zeroing (e.g., from 27% to 6% for omnetpp); the remaining overhead is due to the additional memory traffic generated by the pre-zeroing thread. Further, we note that this experiment presents a highly exaggerated behavior of async pre-zeroing on worst-case workloads. In practice, the zeroing thread is rate-limited (e.g., to at most 10K pages per second) and the expected cache-interference overheads would be proportionally smaller.
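To make the non-temporal zeroing concrete, here is a minimal user-space sketch of the idea using SSE2 streaming stores. The function name, unrolling factor, and user-space setting are our illustration; HawkEye's zeroing thread lives in the kernel and uses the analogous non-temporal store paths there.

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence */
#include <stddef.h>

#define PAGE_SIZE 4096

/* Zero one page with non-temporal (streaming) stores, which bypass
 * the cache hierarchy and so avoid evicting the application's hot
 * data while the background thread pre-zeroes free pages.
 * `page` must be 16-byte aligned (page-aligned buffers are). */
static void zero_page_nt(void *page)
{
    const __m128i zero = _mm_setzero_si128();
    char *p = page;

    for (size_t off = 0; off < PAGE_SIZE; off += 64) {
        /* Four 16-byte streaming stores per 64-byte cache line group. */
        _mm_stream_si128((__m128i *)(p + off),      zero);
        _mm_stream_si128((__m128i *)(p + off + 16), zero);
        _mm_stream_si128((__m128i *)(p + off + 32), zero);
        _mm_stream_si128((__m128i *)(p + off + 48), zero);
    }
    _mm_sfence();  /* make the streamed stores globally visible */
}
```

A caching-store variant (a plain memset) would pull every zeroed line into the cache and evict application data; Figure 10 quantifies exactly this difference between caching and nt-stores.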

Workload           MMU Overhead   4KB    HawkEye-PMU    HawkEye-G
random (4GB)       60%            582    328 (1.77×)    413 (1.41×)
sequential (4GB)   < 1%           517    535            532
Total                             1099   863 (1.27×)    945 (1.16×)
cg.D (16GB)        39%            1952   1202 (1.62×)   1450 (1.35×)
mg.D (24GB)        < 1%           1363   1364           1377
Total                             3315   2566 (1.29×)   2827 (1.17×)

Table 9: Comparison between HawkEye-PMU and HawkEye-G. Time is in seconds; values in parentheses represent speedup over baseline pages.

Comparison between HawkEye-PMU and HawkEye-G: Even though HawkEye-G is based on simple and approximate estimations of TLB contention, it is reasonably accurate in identifying TLB-sensitive processes and memory regions in most cases. However, HawkEye-PMU performs better than HawkEye-G in some cases. Table 9 demonstrates two such examples, where four workloads, all with high access-coverage but different MMU overheads, are executed in two sets; each set contains one TLB-sensitive and one TLB-insensitive workload. The 4KB column represents the case where no huge pages are promoted. As discussed earlier (§2.4), MMU overheads depend on the complex interaction between the hardware and the access pattern. In this case, despite having high access-coverage in their VA regions, sequential and mg.D have negligible MMU overheads (i.e., they are TLB-insensitive). While HawkEye-PMU correctly identifies the process with higher MMU overheads for huge page allocation, HawkEye-G treats them similarly (due to imprecise estimations) and may allocate huge pages to the TLB-insensitive process. Consequently, HawkEye-PMU may perform up to 36% better than HawkEye-G. Bridging this gap between the two approaches in a hardware-independent manner is interesting future work.
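For intuition on the HawkEye-PMU side, MMU overhead can be estimated from hardware performance counters as the fraction of CPU cycles spent walking page tables. A rough user-space sketch using Linux's perf_event_open follows; the raw event encoding below is a placeholder (walk-cycle events such as DTLB_LOAD_MISSES.WALK_DURATION are microarchitecture-specific), and the dummy workload stands in for the region being profiled.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Placeholder raw encoding for "cycles spent in page walks";
 * the real event/umask must be looked up per microarchitecture. */
#define RAW_WALK_CYCLES 0x1008

static int open_counter(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    /* pid = 0, cpu = -1: measure the calling process on any CPU. */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

static void run_workload(void)
{
    /* Stand-in for the region of interest: stride through a large
     * array one 4KB page at a time to generate TLB pressure. */
    static uint64_t buf[1 << 21];
    for (size_t i = 0; i < (1 << 21); i += 512)
        buf[i] += i;
}

int main(void)
{
    int cycles = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
    int walks  = open_counter(PERF_TYPE_RAW, RAW_WALK_CYCLES);

    ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(walks, PERF_EVENT_IOC_ENABLE, 0);
    run_workload();
    ioctl(walks, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t c = 0, w = 0;
    read(cycles, &c, sizeof(c));
    read(walks, &w, sizeof(w));
    /* MMU overhead = walk cycles as a percentage of total cycles. */
    printf("Estimated MMU overhead: %.1f%%\n", c ? 100.0 * w / c : 0.0);
    return 0;
}
```

HawkEye-G, by contrast, needs no such counters: it estimates the same quantity from access-bit derived access-coverage, which is hardware-independent but, as Table 9 shows, less precise.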

5 Related Work

MMU overheads have been a topic of much research in recent years, and several solutions have been proposed, including solutions that are architecture-based, OS-based, and/or compiler-based [28, 34, 49, 56, 58, 62–64].

Hardware Support: Most proposals for hardware support aim to reduce TLB-miss frequency or accelerate TLB-miss processing. Large multi-level TLBs and support for multiple page sizes to minimize the number of TLB misses are already common in general-purpose processors today [15, 20, 21, 40]. Further, modern processors employ page-walk caches to avoid memory lookups in the TLB-miss path [31, 35]. POM-TLB services a TLB miss using a single memory lookup and further leverages regular data caches to speed up address translation [63]. Direct segments [32, 46] were proposed to completely avoid TLB-miss processing cost through special segmentation hardware. OS-level challenges of memory fragmentation have also been considered in hardware designs: CoLT, or Coalesced Large-Reach TLBs [61], was initially proposed to increase TLB reach using base pages, based on the observation that OSs naturally provide contiguous mappings at smaller granularity. This approach was further extended to page-walk caches and huge pages [35, 40, 60]. Techniques to support huge pages in non-contiguous memory have also been proposed [44]. Hardware optimizations are important but complementary to HawkEye.

Software Support: FreeBSD's huge page management is largely influenced by previous work on superpages [57]. Carrefour-LP [47] showed that ad-hoc usage of huge pages can degrade performance in NUMA systems due to additional remote memory accesses and traffic imbalance across node interconnects, and proposed hardware-profiling based techniques to overcome these problems. Anti-fragmentation techniques for clustering pages based on their mobility type, proposed by Gorman et al. [48], have been adopted by Linux to support the allocation of huge pages in long-running systems. Illuminator [59] highlighted critical drawbacks of Linux's fragmentation mitigation techniques and their implications on performance and latency, and proposed techniques to solve these problems. Mechanisms for dealing with physical memory fragmentation and NUMA systems are important but orthogonal problems, and their proposed solutions can be adopted to improve the robustness of HawkEye. Compiler or application hints can also be used to assist OSs in prioritizing huge page mappings for certain parts of the address space [27, 56], for which an interface is already provided by Linux through the madvise system call [16] (see the sketch below).

An alternative approach for supporting huge pages has also been explored via libhugetlbfs [22], where the user is provided more control over huge page allocation. However, such an approach requires manual intervention for reserving huge pages in advance and considers each application in isolation. Windows and OS X support huge pages only through this reservation-based approach, to avoid issues associated with transparent huge page management [17, 55, 59]. We believe that insights from HawkEye can be leveraged to improve huge page support in these important systems.
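For concreteness, a minimal sketch of the madvise-based hinting interface mentioned above (the region size here is an arbitrary illustrative choice):

```c
#define _GNU_SOURCE
#include <sys/mman.h>

#define REGION_SIZE (64UL << 20)  /* 64MB, illustrative */

int main(void)
{
    /* Anonymous mapping; THP can back its 2MB-aligned chunks
     * with huge pages where physical contiguity allows. */
    void *buf = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    /* Hint that this range should preferentially use transparent
     * huge pages; MADV_NOHUGEPAGE expresses the opposite hint. */
    madvise(buf, REGION_SIZE, MADV_HUGEPAGE);

    /* ... fault in and use the memory ... */

    munmap(buf, REGION_SIZE);
    return 0;
}
```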

6 Conclusions

To summarize: transparent huge page management is essential but complex. Effective and efficient algorithms are required in the OS to automatically balance the tradeoffs and provide performant and stable system behavior. We expose some important subtleties related to huge page management in existing proposals and propose a new set of algorithms to address them in HawkEye.

References

[1] 2000. Clearing pages in the idle loop. https://www.mail-archive.com/freebsd-hackers@freebsd.org/msg13993.html
[2] 2000. Linux Page Zeroing Strategy. https://yarchive.net/comp/linux/page_zeroing_strategy.html
[3] 2007. Memory part 5: What programmers can do. https://lwn.net/Articles/255364/
[4] 2010. Mysteries of Windows Memory Management Revealed, Part Two. https://blogs.msdn.microsoft.com/tims/2010/10/29/pdc10-mysteries-of-windows-memory-management-revealed-part-two
[5] 2012. khugepaged eating 100% CPU. https://bugzilla.redhat.com/show_bug.cgi?id=879801
[6] 2012. Recommendation to disable huge pages for Hadoop. https://developer.amd.com/wordpress/media/2012/10/Hadoop_Tuning_Guide-Version5.pdf
[7] 2014. Arch Linux becomes unresponsive from khugepaged. http://unix.stackexchange.com/questions/161858/arch-linux-becomes-unresponsive-from-khugepaged
[8] 2014. CORAL Benchmark Codes. https://asc.llnl.gov/CORAL-benchmarks/#hacc
[9] 2014. Recommendation to disable huge pages for NuoDB. http://www.nuodb.com/techblog/linux-transparent-huge-pages-jemalloc-and-nuodb
[10] 2014. Why TokuDB Hates Transparent HugePages. https://www.percona.com/blog/2014/07/23/why-tokudb-hates-transparent-hugepages
[11] 2015. The Black Magic Of Systematically Reducing Linux OS Jitter. http://highscalability.com/blog/2015/4/8/the-black-magic-of-systematically-reducing-linux-os-jitter.html
[12] 2015. Tales from the Field: Taming Transparent Huge Pages on Linux. https://www.perforce.com/blog/151016/tales-field-taming-transparent-huge-pages-linux
[13] 2016. C++ associative containers. https://github.com/sparsehash/sparsehash
[14] 2016. Remove PG_ZERO and zeroidle (page-zeroing) entirely. https://news.ycombinator.com/item?id=12227874
[15] 2017. Hugepages. https://wiki.debian.org/Hugepages
[16] 2017. MADVISE. Linux Programmer's Manual. http://man7.org/linux/man-pages/man2/madvise.2.html
[17] 2018. Large-Page Support in Windows. https://docs.microsoft.com/en-gb/windows/desktop/Memory/large-page-support
[18] 2019. CGROUPS. Linux Programmer's Manual. http://man7.org/linux/man-pages/man7/cgroups.7.html
[19] 2019. FreeBSD Pre-Faulting and Zeroing Optimizations. https://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/prefault-optimizations.html
[20] 2019. Intel Haswell. http://www.7-cpu.com/cpu/Haswell.html
[21] 2019. Intel Skylake. http://www.7-cpu.com/cpu/Skylake.html
[22] 2019. Libhugetlbfs: Linux man page. https://linux.die.net/man/7/libhugetlbfs
[23] 2019. Recommendation to disable huge pages for MongoDB. https://docs.mongodb.com/manual/tutorial/transparent-huge-pages
[24] 2019. Recommendation to disable huge pages for Redis. http://redis.io/topics/latency
[25] 2019. Recommendation to disable huge pages for VoltDB. https://docs.voltdb.com/AdminGuide/adminmemmgt.php
[26] 2019. Transparent Hugepage Support. https://www.kernel.org/doc/Documentation/vm/transhuge.txt
[27] Mohammad Agbarya, Idan Yaniv, and Dan Tsafrir. 2018. Memomania: From Huge to Huge-Huge Pages. In Proceedings of the 11th ACM International Systems and Storage Conference (SYSTOR '18). ACM, New York, NY, USA, 112. https://doi.org/10.1145/3211890.3211918
[28] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. 2017. Do-It-Yourself Virtual Memory Translation. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 457–468. https://doi.org/10.1145/3079856.3080209
[29] Nadav Amit, Dan Tsafrir, and Assaf Schuster. 2014. VSwapper: A Memory Swapper for Virtualized Environments. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 349–366. https://doi.org/10.1145/2541940.2541969
[30] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS Parallel Benchmarks: Summary and Preliminary Results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91). ACM, New York, NY, USA, 158–165. https://doi.org/10.1145/125826.125925
[31] Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don't Walk (the Page Table). In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM, New York, NY, USA, 48–59. https://doi.org/10.1145/1815961.1815970
[32] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient Virtual Memory for Big Memory Servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 237–248. https://doi.org/10.1145/2485922.2485943
[33] Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. CoRR abs/1508.03619 (2015). http://arxiv.org/abs/1508.03619
[34] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha Manne. 2008. Accelerating Two-dimensional Page Walks for Virtualized Systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XIII). ACM, New York, NY, USA, 26–35. https://doi.org/10.1145/1346281.1346286
[35] Abhishek Bhattacharjee. 2013. Large-reach Memory Management Unit Caches. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 383–394. https://doi.org/10.1145/2540708.2540741
[36] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08). ACM, New York, NY, USA, 72–81. https://doi.org/10.1145/1454115.1454128
[37] Daniel Bovet and Marco Cesati. 2005. Understanding the Linux Kernel. O'Reilly & Associates, Inc.
[38] Josiah L. Carlson. 2013. Redis in Action. Manning Publications Co., Greenwich, CT, USA.
[39] Jonathan Corbet. 2010. Memory compaction. https://lwn.net/Articles/368869/
[40] Guilherme Cox and Abhishek Bhattacharjee. 2017. Efficient Address Translation for Architectures with Multiple Page Sizes. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17). ACM, New York, NY, USA, 435–448. https://doi.org/10.1145/3037697.3037704
[41] Cort Dougan, Paul Mackerras, and Victor Yodaiken. 1999. Optimizing the Idle Task and Other MMU Tricks. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (OSDI '99). USENIX Association, Berkeley, CA, USA, 229–237. http://dl.acm.org/citation.cfm?id=296806.296833
[42] Niall Douglas. 2011. User Mode Memory Page Allocation: A Silver Bullet For Memory Allocation? CoRR abs/1105.1811 (2011). http://arxiv.org/abs/1105.1811
[43] Ulrich Drepper. 2007. What Every Programmer Should Know About Memory.
[44] Yu Du, Miao Zhou, Bruce R. Childers, Daniel Mossé, and Rami G. Melhem. 2015. Supporting superpages in non-contiguous physical memory. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 223–234.
[45] M. Franklin, D. Yeung, Xue Wu, A. Jaleel, K. Albayraktaroglu, B. Jacob, and Chau-Wen Tseng. 2005. BioBench: A Benchmark Suite of Bioinformatics Applications. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005). 2–9. https://doi.org/10.1109/ISPASS.2005.1430554
[46] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2014. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 178–189. https://doi.org/10.1109/MICRO.2014.37
[47] Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston, Alexandra Fedorova, and Vivien Quéma. 2014. Large Pages May Be Harmful on NUMA Systems. In Proceedings of the 2014 USENIX Annual Technical Conference (USENIX ATC '14). USENIX Association, Berkeley, CA, USA, 231–242. http://dl.acm.org/citation.cfm?id=2643634.2643659
[48] Mel Gorman and Patrick Healy. 2008. Supporting Superpage Allocation Without Additional Hardware Support. In Proceedings of the 7th International Symposium on Memory Management (ISMM '08). ACM, New York, NY, USA, 41–50. https://doi.org/10.1145/1375634.1375641
[49] Mel Gorman and Patrick Healy. 2012. Performance Characteristics of Explicit Superpage Support. In Proceedings of the 2010 International Conference on Computer Architecture (ISCA '10). Springer-Verlag, Berlin, Heidelberg, 293–310. https://doi.org/10.1007/978-3-642-24322-6_24
[50] Mel Gorman and Andy Whitcroft. 2006. The What, The Why and The Where To of Anti-Fragmentation. In Linux Symposium. 141.
[51] Fei Guo, Seongbeom Kim, Yury Baskakov, and Ishan Banerjee. 2015. Proactively Breaking Large Pages to Improve Memory Overcommitment Performance in VMware ESXi. In Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '15). ACM, New York, NY, USA, 39–51. https://doi.org/10.1145/2731186.2731187
[52] Fan Guo, Yongkun Li, Yinlong Xu, Song Jiang, and John C. S. Lui. 2017. SmartMD: A High Performance Deduplication Engine with Mixed Pages. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC '17). USENIX Association, Berkeley, CA, USA, 733–744. http://dl.acm.org/citation.cfm?id=3154690.3154759
[53] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (Sept. 2006), 1–17. https://doi.org/10.1145/1186736.1186737
[54] Vasileios Karakostas, Osman S. Unsal, Mario Nemirovsky, Adrián Cristal, and Michael M. Swift. 2014. Performance analysis of the memory management unit under scale-out workloads. In 2014 IEEE International Symposium on Workload Characterization (IISWC). 1–12.
[55] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 705–721. http://dl.acm.org/citation.cfm?id=3026877.3026931
[56] Joshua Magee and Apan Qasem. 2009. A Case for Compiler-driven Superpage Allocation. In Proceedings of the 47th Annual Southeast Regional Conference (ACM-SE 47). ACM, New York, NY, USA, Article 82, 4 pages. https://doi.org/10.1145/1566445.1566553
[57] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. 2002. Practical, Transparent Operating System Support for Superpages. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 89–104. https://doi.org/10.1145/844128.844138
[58] Ashish Panwar, Naman Patel, and K. Gopinath. 2016. A Case for Protecting Huge Pages from the Kernel. In Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys '16). ACM, New York, NY, USA, Article 15, 8 pages. https://doi.org/10.1145/2967360.2967371
[59] Ashish Panwar, Aravinda Prasad, and K. Gopinath. 2018. Making Huge Pages Actually Useful. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 679–692. https://doi.org/10.1145/3173162.3173203
[60] Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel H. Loh. 2014. Increasing TLB reach by exploiting clustering in page translations. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). 558–567.
[61] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced Large-Reach TLBs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 258–269. https://doi.org/10.1109/MICRO.2012.32
[62] Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee. 2015. Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways? In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 1–12. https://doi.org/10.1145/2830772.2830773
[63] Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. 2017. Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 469–480. https://doi.org/10.1145/3079856.3080210
[64] Indira Subramanian, Clifford Mather, Kurt Peterson, and Balakrishna Raghunath. 1998. Implementation of Multiple Pagesize Support in HP-UX. In USENIX Annual Technical Conference. 105–119.
[65] John R. Tramm, Andrew R. Siegel, Tanzima Islam, and Martin Schulz. [n. d.]. XSBench - The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis. In PHYSOR 2014 - The Role of Reactor Physics toward a Sustainable Future. Kyoto.
[66] Koji Ueno and Toyotaro Suzumura. 2012. Highly Scalable Graph Search for the Graph500 Benchmark. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC '12). ACM, New York, NY, USA, 149–160. https://doi.org/10.1145/2287076.2287104
[67] Carl A. Waldspurger. 2002. Memory Resource Management in VMware ESX Server. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 181–194. https://doi.org/10.1145/844128.844146
[68] Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards Practical Page Coloring-based Multicore Cache Management. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys '09). ACM, New York, NY, USA, 89–102. https://doi.org/10.1145/1519065.1519076

  • Abstract
  • 1 Introduction
  • 2 Motivation
    • 21 Address Translation Overhead vs Memory Bloat
    • 22 Page fault latency vs Number of page faults
    • 23 Huge page allocation across multiple processes
    • 24 How to capture address translation overheads
      • 3 Design and Implementation
        • 31 Asynchronous page pre-zeroing
        • 32 Managing bloat vs performance
        • 33 Fine-grained huge page promotion
        • 34 Huge page allocation across multiple processes
        • 35 Limitations and discussion
          • 4 Evaluation
          • 5 Related Work
          • 6 Conclusions
          • References
Page 10: HawkEye: Efficient Fine-grained OS Support for …sbansal/pubs/hawkeye.pdfHawkEye: Efficient Fine-grained OS Support for Huge Pages Ashish Panwar Indian Institute of Science ashishpanwar@iisc.ac.in

0

15

30

45

60

75

cactusADM tigr Graph500 lbm_s SVM XSBench cgD cactusADM tigr Graph500 lbm_s SVM XSBench cgDBefore After

s

peed

up

Chart Title

Linux Ingens HawkEye-PMU HawkEye-G

Figure 8 Performance speedup over baseline pages of TLB sensitive applications when they are executed alongside a lightlystressed Redis server in different orders

1

12

14

16

18

2

Nor

mal

ized

Per

form

ance

HawkEye-Host HawkEye-Guest HawkEye-Both

Figure 9 Performance compared to Linux in a virtualizedsystem when HawkEye is applied at the host guest and bothlayers

The impact of execution order is clearly visible for LinuxIn the (Before) case Linux with THP support improves per-formance over baseline pages by promoting huge pages inTLB sensitive applications However in the (After) case itpromotes huge pages in Redis resulting in poor performancefor TLB sensitive workloads While the execution order doesnot matter in Ingens its proportional huge page promotionstrategy favors large memory workloads (Redis in this case)Further our client requests randomly selected keys whichmakes the Redis server access all its huge pages uniformlyConsequently Ingens promotes most huge pages in Redisleading to suboptimal performance of TLB sensitive work-loads In contrast HawkEye promotes huge pages based onMMU overhead measurements (HawkEye-PMU) or access-coverage based estimation of MMU overheads (HawkEye-G)This leads to 15ndash60 performance improvement over 4KBpages irrespective of the order of executionPerformance in virtualized systems For a virtualizedsystemwe use the KVMhypervisor runningwith Ubuntu1604and fragment the system prior to running the workloadsEvaluation is presented for three different configurationsie when HawkEye is deployed at the host guest and boththe layers Each of these configurations requires running thesame set of applications differently (see Table 6 for details)Figure 9 shows that HawkEye provides 18ndash90 speedup

compared to Linux in virtual environments Notice that insome cases (eg cgD) performance improvement is higherin a virtualized system compared to bare-metal This is due tohigh MMU overheads involved in traversing nested page ta-bles while servicing TLB misses of guest applications whichpresents higher opportunities to be exploited by HawkEye

Configuration Details

HawkEye-Host Two VMs each with 24 vCPUs and 48GB memoryVM-1 runs Redis VM-2 runs TLB sensitive workloads

HawkEye-Guest Single VM with 48 vCPUs and 80GB memoryrunning Redis and TLB sensitive workloads

HawkEye-Both Two VMs each with 24 vCPUs VM-1 (30GB) runs RedisVM-2 (60GB) runs both Redis and TLB sensitive workloads

Table 6 Experimental setup for configurations used to eval-uate a virtualized system

Bloat vs performance We next study the relationship be-tween memory bloat and performance we revisit an experi-ment similar to the one summarized in sect21 We populate aRedis instance with 8M (10B key 4KB value) key-value pairsand then delete 60 randomly selected keys (see Table 7)While Linux-4KB (no THP) is memory efficient (no bloat)its performance is low (7 lower throughput than Linux-2MB ie with THP) Linux-2MB delivers high performancebut consumes more memory which remains unavailable tothe rest of system even when memory pressure increasesIngens can be configured to balance the memory vs perfor-mance tradeoff For example Ingens-90 (Ingens with 90utilization threshold) is memory efficient while Ingens-50(Ingens with 50 utilization threshold) favors performanceand allows more memory bloat In either case it is unableto recover from bloat that may have already been gener-ated (eg during its aggressive phase) By switching fromaggressive huge page allocations under low memory pres-sure to memory conserving strategy at runtime HawkEye isable to achieve both low MMU overheads and high memoryefficiency depending on the state of the systemFast page faults Table 8 shows the performance of differentstrategies for the workloads chosen for these experimentstheir performance depends on the efficiency of the OS pagefault handler We measure Redis throughput when 2MB-sized values are inserted SparseHash [13] is a hash-maplibrary in C++ while HACC-IO [8] is a parallel IO benchmarkused with an in-memory file system We also measure thespin-up time of two virtual machines KVM and a Java VirtualMachine (JVM) where both are configured to allocate allmemory during initialization

Performance benefits of async page pre-zeroing with basepages (HawkEye-4KB) aremodest in this case the other page

Kernel Self-tuning Memory Size ThroughputLinux-4KB No 162GB 1061KLinux-2MB No 332GB 1138KIngens-90 No 163GB 1068KIngens-50 No 331GB 1134KHawkEye (no mem pressure) Yes 332GB 1136KHawkEye (mem pressure) Yes 162GB 1058K

Table 7 Redis memory consumption and throughput

Workload OS ConfigurationLinux4KB

Linux2MB

Ingens90

HawkEye4KB

HawkEye2MB

Redis (45GB) 233 437 192 236 551SparseHash (36GB) 501 172 515 466 106HACC-IO (6GB) 65 45 66 65 42JVM Spin-up (36GB) 377 186 527 298 137KVM Spin-up (36GB) 406 97 418 302 070

Table 8 Performance implications of asynchronous pagezeroing Values for Redis represent throughput (higher isbetter) all other values represent time in seconds (lower isbetter)

0

10

20

30

40

NPB PARSEC Graph500 cactusADM PageRank omnetpp

overhead with caching-stores with nt-stores

Figure 10 Performance overhead of async pre-zeroing withand without caching instructions The first two bars (NPBand Parsec) represent the average of all workloads in therespective benchmark suite

fault overheads (apart from zeroing) dominate performanceHowever page-zeroing cost becomes significant with hugepages Consequently HawkEye with huge pages (HawkEye-2MB) improves the performance of Redis and SparseHashby 126times and 162times over Linux The spin-up time of virtualmachines is purely dominated by page faults Hence weobserve a dramatic reduction in the spin-up time of VMswith async pre-zeroing of huge pages 138times and 136times overLinux Notice that without huge pages (only base pages)this reduction is only 134times and 126times All these workloadshave high spatial locality of page faults and hence the Ingensstrategy of utilization-based promotion has a negative impacton performance due to a higher number of page faultsAsync pre-zeroing enables memory sharing in virtual-ized environments Finally we note that async pre-zeroingwithin VMs enables memory sharing across VMs the freememory of a VM returns to the host through pre-zeroing inthe VM and same-page merging at the host This can havethe same net effect as ballooning as the free memory in aVM gets shared with other VMs In our experiments withmemory over-committed systems we have confirmed thisbehaviour the overall performance of an overcommittedsystem where the VMs are running HawkEye matches the

0

05

1

15

2

25

PageRank Redis MongoDB CC Memcached SPDNode 0 Node 1

Nor

mal

ized

Per

form

ance

Linux Linux-Ballooning HawkEye

Figure 11 Performance normalized to the case of no bal-looning in an overcommitted virtualized system

performance achieved through para-virtual interfaces likememory balloon drivers For example in an experiment in-volving a mix of latency-sensitive key-value stores and HPCworkloads (see Figure 11) with a total peakmemory consump-tion of around 150GB (15times of total memory) HawkEye-Gprovides 23times and 142times higher throughput for Redis andMongoDB along with significant tail latency reduction ForPageRank performance degrades slightly due to a highernumber of COW page faults due to unnecessary same-pagemerging These results are very close to the results obtainedwhen memory ballooning is enabled on all VMs We believethat this is a potential positive of our design and may offeran alternative method to efficiently manage memory in over-committed systems It is well-known that balloon-driversand other para-virtual interfaces are riddled with compatibil-ity problems [29] and a fully-virtual solution to this problemwould be highly desirable While our experiments show earlypromise in this direction we leave an extensive evaluationfor future workPerformance overheads ofHawkEyeWe evaluatedHawk-Eye with non-fragmented systems and with several work-loads that donrsquot benefit from huge pages and find that theperformance with HawkEye is very similar to performancewith Linux or Ingens in these scenarios confirming thatHawkEyersquos algorithms for access-bit tracking asynchronoushuge page promotion and memory de-duplication add negli-gible overheads over existingmechanisms in Linux Howeverthe async pre-zeroing thread requires special attention as itmay become a source of noticeable performance overheadfor certain types of workloads as discussed nextOverheads of async pre-zeroing A primary concern withasync pre-zeroing is its potential detrimental effect due tomemory and cache interference (sect31) Recall that we em-ploy memory stores with non-temporal hints to avoid theseeffects To measure the worst-case cache effects of asyncpre-zeroing we run our workloads while simultaneouslyzero-filling pages on a separate core sharing the same L3cache (ie on the same socket) at a high rate of 025M pagesper second (1GBps) with and without non-temporal mem-ory stores Figure 10 shows that using non-temporal hintssignificantly brings down the overhead of async pre-zeroing(eg from 27 to 6 for omnetpp) the remaining overheadis due to additional memory traffic generated by the pre-zeroing thread Further we note that this experiment presents

Workload MMUOverhead

Time (seconds)4KB HawkEye-PMU HawkEye-G

random(4GB) 60 582 328(177times) 413(141times)sequential(4GB) lt 1 517 535 532Total 1099 863(127times) 945(116times)cgD(16GB) 39 1952 1202(162times) 1450(135times)mgD(24GB) lt 1 1363 1364 1377Total 3315 2566(129times) 2827(117times)

Table 9 Comparison between HawkEye-PMU andHawkEye-G Values in parentheses represent speedup overbaseline pagesa highly-exaggerated behavior of async pre-zeroing onworst-case workloads In practice the zeroing thread is rate-limited(eg at most 10k pages per second) and the expected cache-interference overheads would be proportionally smallerComparison betweenHawk-PMUandHawkEye-G Eventhough HawkEye-G is based on simple and approximatedestimations of TLB contention it is reasonably accurate inidentifying TLB sensitive processes and memory regionsin most cases However HawkEye-PMU performs betterthan HawkEye-G in some cases Table 9 demonstrates twosuch examples where four workloads all with high access-coverage but different MMU overheads are executed in twosets each set contains one TLB sensitive and one TLB insensi-tive workload The 4KB column represents the case where nohuge pages are promoted As discussed earlier (sect24) MMUoverheads depend on the complex interaction between thehardware and access pattern In this case despite havinghigh access-coverage in VA regions sequential and mgDhave negligible MMU overheads (ie they are TLB insen-sitive) While HawkEye-PMU correctly identifies the pro-cess with higher MMU overheads for huge page allocationHawkEye-G treats them similarly (due to imprecise estima-tions) and may allocate huge pages to the TLB insensitiveprocess Consequently HawkEye-PMU may perform up to36 better than HawkEye-G Bridging this gap between thetwo approaches in a hardware independent manner is aninteresting future work

5 Related WorkMMU overheads have been a topic of much research in re-cent years and several solutions have been proposed includ-ing solutions that are architecture-based OS-based andorcompiler-based [28 34 49 56 58 62ndash64]Hardware SupportMost proposals for hardware supportaim to reduce TLB-miss frequency or accelerate TLB-missprocessing Large multi-level TLBs and support for multiplepage sizes to minimize the number of TLB-misses is alreadycommon in general-purpose processors today [15 20 21 40]Further modern processors employ page-walk-caches toavoid memory lookups in the TLB-miss path [31 35] POM-TLB services a TLB miss using a single memory lookup andfurther leverages regular data caches to speed up addresstranslation [63] Direct segments [32 46] were proposed tocompletely avoid TLB-miss processing cost through special

segmentation hardware OS-level challenges of memory frag-mentation have also been considered in hardware designsCoLT or Coalesced-Large-reach TLBs [61] were initially pro-posed to increase TLB reach using base pages based on theobservation that OSs naturally provide contiguous mappingsat smaller granularity This approach was further extendedto page-walk caches and huge pages [35 40 60] Techniquesto support huge pages in non-contiguous memory have alsobeen proposed [44] Hardware optimizations are importantbut complementary to HawkEyeSoftware Support FreeBSDrsquos huge page management islargely influenced by previous work on superpages [57]Carrefour-LP [47] showed that ad-hoc usage of huge pagescan degrade performance in NUMA systems due to addi-tional remote memory accesses and traffic imbalance acrossnode interconnects and proposed hardware-profiling basedtechniques to overcome these problems Anti-fragmentationtechniques for clustering pages based on mobility type ofpages proposed by Gorman et al [48] have been adoptedby Linux to support the allocation of huge pages in long-running systems Illuminator [59] highlighted critical draw-backs of Linuxrsquos fragmentation mitigation techniques andtheir implications on performance and latency and proposedtechniques to solve these problems Mechanisms of dealingwith physical memory fragmentation and NUMA systemsare important but orthogonal problems and their proposedsolutions can be adopted to improve the robustness of Hawk-Eye Compiler or application hints can also be used to assistOSs in prioritizing huge page mapping for certain parts ofthe address space [27 56] for which an interface is alreadyprovided by Linux through the madvise system call [16]An alternative approach for supporting huge pages has

also been explored via libhugetlbfs [22] where the useris provided more control on huge page allocation Howeversuch an approach requires manual intervention for reserv-ing huge pages in advance and considers each applicationin isolation Windows and OS X support huge pages onlythrough this reservation-based approach to avoid issues asso-ciated with transparent huge page management [17 55 59]We believe that insights from HawkEye can be leveraged toimprove huge page support in these important systems

6 ConclusionsTo summarize transparent huge page management is es-sential but complex Effective and efficient algorithms arerequired in the OS to automatically balance the tradeoffs andprovide performant and stable system behavior We exposesome important subtleties related to huge page managementin existing proposals and propose a new set of algorithmsto address them in HawkEye

References[1] 2000 Clearing pages in the idle loop httpswwwmail-archivecom

freebsd-hackersfreebsdorgmsg13993html

[2] 2000 Linux Page Zeroing Strategy httpsyarchivenetcomplinuxpagezeroingstrategyhtml

[3] 2007 Memory part 5 What programmers can do httpslwnnetArticles255364

[4] 2010 Mysteries of Windows Memory Management RevealedPart Two httpsblogsmsdnmicrosoftcomtims20101029pdc10-mysteries-of-windows-memory-management-revealed-part-two

[5] 2012 khugepaged eating 100 CPU httpsbugzillaredhatcomshowbugcgiid=879801

[6] 2012 Recommendation to disable huge pages for Hadoophttpsdeveloperamdcomwordpressmedia201210HadoopTuningGuide-Version5pdf

[7] 2014 Arch Linux becomes unresponsive from khugepagedhttpunixstackexchangecomquestions161858arch-linux-becomes-unresponsive-from-khugepaged

[8] 2014 CORAL Benchmark Codes httpsascllnlgovCORAL-benchmarkshacc

[9] 2014 Recommendation to disable huge pages for NuoDBhttpwwwnuodbcomtechbloglinux-transparent-huge-pages-jemalloc-and-nuodb

[10] 2014 Why TokuDB Hates Transparent HugePageshttpswwwperconacomblog20140723why-tokudb-hates-transparent-hugepages

[11] 2015 The Black Magic Of Systematically Reducing Linux OS Jit-ter httphighscalabilitycomblog201548the-black-magic-of-systematically-reducing-linux-os-jitterhtml

[12] 2015 Tales from the Field Taming Transparent Huge Pages onLinux httpswwwperforcecomblog151016tales-field-taming-transparent-huge-pages-linux

[13] 2016 C++ associative containers httpsgithubcomsparsehashsparsehash

[14] 2016 Remove PG_ZERO and zeroidle (page-zeroing) entirely httpsnewsycombinatorcomitemid=12227874

[15] 2017 Hugepages httpswikidebianorgHugepages[16] 2017 MADVISE Linux Programmerrsquos Manual httpman7orglinux

man-pagesman2madvise2html[17] 2018 Large-Page Support in Windows httpsdocsmicrosoftcom

en-gbwindowsdesktopMemorylarge-page-support[18] 2019 CGROUPS Linux Programmerrsquos Manual httpman7orglinux

man-pagesman7cgroups7html[19] 2019 FreeBSD Pre-Faulting and Zeroing Optimizations

httpswwwfreebsdorgdocenUSISO8859-1articlesvm-designprefault-optimizationshtml

[20] 2019 Intel Haswell httpwww7-cpucomcpuHaswellhtml[21] 2019 Intel Skylake httpwww7-cpucomcpuSkylakehtml[22] 2019 Libhugetlbfs Linux man page httpslinuxdienetman7

libhugetlbfs[23] 2019 Recommendation to disable huge pages for MongoDB https

docsmongodbcommanualtutorialtransparent-huge-pages[24] 2019 Recommendation to disable huge pages for Redis httpredisio

topicslatency[25] 2019 Recommendation to disable huge pages for VoltDB https

docsvoltdbcomAdminGuideadminmemmgtphp[26] 2019 Transparent Hugepage Support httpswwwkernelorgdoc

Documentationvmtranshugetxt[27] Mohammad Agbarya Idan Yaniv and Dan Tsafrir 2018 Memomania

From Huge to Huge-Huge Pages In Proceedings of the 11th ACM In-ternational Systems and Storage Conference (SYSTOR rsquo18) ACM NewYork NY USA 112ndash112 httpsdoiorg10114532118903211918

[28] Hanna Alam Tianhao Zhang Mattan Erez and Yoav Etsion 2017Do-It-Yourself Virtual Memory Translation In Proceedings of the44th Annual International Symposium on Computer Architecture (ISCArsquo17) ACM New York NY USA 457ndash468 httpsdoiorg10114530798563080209

[29] Nadav Amit Dan Tsafrir and Assaf Schuster 2014 VSwapper AMemory Swapper for Virtualized Environments In Proceedings of the19th International Conference on Architectural Support for ProgrammingLanguages and Operating Systems (ASPLOS rsquo14) ACM New York NYUSA 349ndash366 httpsdoiorg10114525419402541969

[30] D H Bailey E Barszcz J T Barton D S Browning R L Carter LDagum R A Fatoohi P O Frederickson T A Lasinski R S SchreiberH D Simon V Venkatakrishnan and S KWeeratunga 1991 The NASParallel BenchmarksSummary and Preliminary Results In Proceedingsof the 1991 ACMIEEE Conference on Supercomputing (Supercomputingrsquo91) ACM New York NY USA 158ndash165 httpsdoiorg101145125826125925

[31] Thomas W Barr Alan L Cox and Scott Rixner 2010 TranslationCaching Skip DonrsquoT Walk (the Page Table) In Proceedings of the37th Annual International Symposium on Computer Architecture (ISCArsquo10) ACM New York NY USA 48ndash59 httpsdoiorg10114518159611815970

[32] Arkaprava Basu Jayneel Gandhi Jichuan Chang Mark D Hill andMichael M Swift 2013 Efficient Virtual Memory for Big MemoryServers In Proceedings of the 40th Annual International Symposium onComputer Architecture (ISCA rsquo13) ACM New York NY USA 237ndash248httpsdoiorg10114524859222485943

[33] Scott Beamer Krste Asanovic and David A Patterson 2015 TheGAP Benchmark Suite CoRR abs150803619 (2015) arXiv150803619httparxivorgabs150803619

[34] Ravi Bhargava Benjamin Serebrin Francesco Spadini and SrilathaManne 2008 Accelerating Two-dimensional Page Walks for Vir-tualized Systems In Proceedings of the 13th International Confer-ence on Architectural Support for Programming Languages and Op-erating Systems (ASPLOS XIII) ACM New York NY USA 26ndash35httpsdoiorg10114513462811346286

[35] Abhishek Bhattacharjee 2013 Large-reach Memory ManagementUnit Caches In Proceedings of the 46th Annual IEEEACM InternationalSymposium on Microarchitecture (MICRO-46) ACM New York NYUSA 383ndash394 httpsdoiorg10114525407082540741

[36] Christian Bienia Sanjeev Kumar Jaswinder Pal Singh and Kai Li 2008The PARSEC Benchmark Suite Characterization and ArchitecturalImplications In Proceedings of the 17th International Conference onParallel Architectures and Compilation Techniques (PACT rsquo08) ACMNew York NY USA 72ndash81 httpsdoiorg10114514541151454128

[37] Daniel Bovet and Marco Cesati 2005 Understanding The Linux KernelOreilly amp Associates Inc

[38] Josiah L Carlson 2013 Redis in Action Manning Publications CoGreenwich CT USA

[39] Jonathan Corbet 2010 Memory compaction httpslwnnetArticles368869

[40] Guilherme Cox and Abhishek Bhattacharjee 2017 Efficient AddressTranslation for Architectures with Multiple Page Sizes In Proceed-ings of the Twenty-Second International Conference on ArchitecturalSupport for Programming Languages and Operating Systems (ASPLOSrsquo17) ACM New York NY USA 435ndash448 httpsdoiorg10114530376973037704

[41] Cort Dougan Paul Mackerras and Victor Yodaiken 1999 Opti-mizing the Idle Task and Other MMU Tricks In Proceedings of theThird Symposium on Operating Systems Design and Implementation(OSDI rsquo99) USENIX Association Berkeley CA USA 229ndash237 httpdlacmorgcitationcfmid=296806296833

[42] Niall Douglas 2011 User Mode Memory Page Allocation A Sil-ver Bullet For Memory Allocation CoRR abs11051811 (2011)arXiv11051811 httparxivorgabs11051811

[43] Ulrich Drepper 2007 What Every Programmer Should Know AboutMemory

[44] Yu Du Miao Zhou Bruce R Childers Daniel Mosseacute and Rami GMelhem 2015 Supporting superpages in non-contiguous physicalmemory 2015 IEEE 21st International Symposium on High Performance

Computer Architecture (HPCA) (2015) 223ndash234[45] M Franklin D Yeung n Xue Wu A Jaleel K Albayraktaroglu B

Kernel                      Self-tuning   Memory Size   Throughput
Linux-4KB                   No            16.2 GB       1061K
Linux-2MB                   No            33.2 GB       1138K
Ingens-90                   No            16.3 GB       1068K
Ingens-50                   No            33.1 GB       1134K
HawkEye (no mem pressure)   Yes           33.2 GB       1136K
HawkEye (mem pressure)      Yes           16.2 GB       1058K

Table 7. Redis memory consumption and throughput.

                     OS Configuration
Workload             Linux-4KB   Linux-2MB   Ingens-90   HawkEye-4KB   HawkEye-2MB
Redis (45GB)         23.3        43.7        19.2        23.6          55.1
SparseHash (36GB)    50.1        17.2        51.5        46.6          10.6
HACC-IO (6GB)        6.5         4.5         6.6         6.5           4.2
JVM Spin-up (36GB)   37.7        18.6        52.7        29.8          13.7
KVM Spin-up (36GB)   40.6        9.7         41.8        30.2          0.70

Table 8. Performance implications of asynchronous page zeroing. Values for Redis represent throughput (higher is better); all other values represent time in seconds (lower is better).

[Figure 10: bar chart; y-axis: overhead (%), 0 to 40; workloads: NPB, PARSEC, Graph500, cactusADM, PageRank, omnetpp; bars: with caching-stores vs. with nt-stores.]

Figure 10. Performance overhead of async pre-zeroing with and without caching instructions. The first two bars (NPB and PARSEC) represent the average of all workloads in the respective benchmark suite.

With base pages, other page-fault overheads (apart from zeroing) dominate performance. However, page-zeroing cost becomes significant with huge pages. Consequently, HawkEye with huge pages (HawkEye-2MB) improves the performance of Redis and SparseHash by 1.26× and 1.62× over Linux. The spin-up time of virtual machines is purely dominated by page faults; hence, we observe a dramatic reduction in the spin-up time of VMs with async pre-zeroing of huge pages: 13.8× and 1.36× over Linux (KVM and JVM spin-up, respectively). Notice that without huge pages (i.e., with only base pages), this reduction is only 1.34× and 1.26×. All these workloads have high spatial locality of page faults, and hence the Ingens strategy of utilization-based promotion has a negative impact on performance due to a higher number of page faults.

Async pre-zeroing enables memory sharing in virtualized environments. Finally, we note that async pre-zeroing within VMs enables memory sharing across VMs: the free memory of a VM returns to the host through pre-zeroing in the VM and same-page merging at the host. This can have the same net effect as ballooning, as the free memory in a VM gets shared with other VMs. In our experiments with memory-overcommitted systems, we have confirmed this behaviour.
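The host-side half of this sharing is standard same-page merging. As a hedged illustration (ours, not HawkEye's implementation; the 64 MB region size is arbitrary), the sketch below registers a zero-filled region with Linux KSM via madvise(MADV_MERGEABLE); once a guest pre-zeroes its free pages, the host can collapse them into a single shared zero page in the same way:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    size_t len = 64UL << 20;  /* 64 MB region, illustrative */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    memset(buf, 0, len);  /* pre-zeroed pages: all identical, all mergeable */

    /* Register the range with KSM; requires CONFIG_KSM and
       /sys/kernel/mm/ksm/run set to 1. */
    if (madvise(buf, len, MADV_MERGEABLE) != 0)
        perror("madvise(MADV_MERGEABLE)");

    /* ksmd now de-duplicates these pages into one shared zero page;
       a later write breaks the sharing via copy-on-write. */
    return 0;
}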

[Figure 11: bar chart; y-axis: normalized performance, 0 to 2.5; workloads: PageRank, Redis, MongoDB, CC, Memcached, SPD, grouped across Node 0 and Node 1; bars: Linux, Linux-Ballooning, HawkEye.]

Figure 11. Performance normalized to the case of no ballooning in an overcommitted virtualized system.

The overall performance of an overcommitted system where the VMs are running HawkEye matches the performance achieved through para-virtual interfaces like memory balloon drivers. For example, in an experiment involving a mix of latency-sensitive key-value stores and HPC workloads (see Figure 11), with a total peak memory consumption of around 150GB (1.5× of total memory), HawkEye-G provides 2.3× and 1.42× higher throughput for Redis and MongoDB, along with a significant reduction in tail latency. For PageRank, performance degrades slightly due to a higher number of copy-on-write (COW) page faults caused by unnecessary same-page merging. These results are very close to the results obtained when memory ballooning is enabled on all VMs. We believe that this is a potential positive of our design and may offer an alternative method to efficiently manage memory in overcommitted systems. It is well known that balloon drivers and other para-virtual interfaces are riddled with compatibility problems [29], and a fully-virtual solution to this problem would be highly desirable. While our experiments show early promise in this direction, we leave an extensive evaluation for future work.

Performance overheads of HawkEye. We evaluated HawkEye on non-fragmented systems and with several workloads that don't benefit from huge pages, and found that performance with HawkEye is very similar to that of Linux or Ingens in these scenarios, confirming that HawkEye's algorithms for access-bit tracking, asynchronous huge page promotion and memory de-duplication add negligible overheads over existing mechanisms in Linux. However, the async pre-zeroing thread requires special attention, as it may become a source of noticeable performance overhead for certain types of workloads, as discussed next.

Overheads of async pre-zeroing. A primary concern with async pre-zeroing is its potential detrimental effect due to memory and cache interference (§3.1). Recall that we employ memory stores with non-temporal hints to avoid these effects. To measure the worst-case cache effects of async pre-zeroing, we run our workloads while simultaneously zero-filling pages on a separate core sharing the same L3 cache (i.e., on the same socket) at a high rate of 0.25M pages per second (1GB/s), with and without non-temporal memory stores. Figure 10 shows that using non-temporal hints significantly brings down the overhead of async pre-zeroing (e.g., from 27% to 6% for omnetpp); the remaining overhead is due to additional memory traffic generated by the pre-zeroing thread.
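The non-temporal zeroing that Figure 10 evaluates can be sketched in a few lines. The fragment below is a user-space approximation (the function name zero_page_nt is ours; an in-kernel pre-zeroing thread would use the kernel's own movnti-based primitives plus a rate limiter):

#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence */
#include <stddef.h>

#define PAGE_SIZE 4096

/* Zero one page with streaming (non-temporal) stores. The stores write
   around the cache hierarchy, so the pre-zeroing thread does not evict
   the co-running workload's L3 lines. `page` must be 16-byte aligned,
   which holds for any page-aligned address. */
static void zero_page_nt(void *page) {
    __m128i zero = _mm_setzero_si128();
    char *p = (char *)page;
    for (size_t i = 0; i < PAGE_SIZE; i += sizeof(__m128i))
        _mm_stream_si128((__m128i *)(p + i), zero);
    _mm_sfence();  /* drain write-combining buffers before the page is reused */
}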

Workload           MMU Overhead (%)   Time (seconds)
                                      4KB    HawkEye-PMU    HawkEye-G
random (4GB)       60                 582    328 (1.77×)    413 (1.41×)
sequential (4GB)   <1                 517    535            532
Total                                 1099   863 (1.27×)    945 (1.16×)
cg.D (16GB)        39                 1952   1202 (1.62×)   1450 (1.35×)
mg.D (24GB)        <1                 1363   1364           1377
Total                                 3315   2566 (1.29×)   2827 (1.17×)

Table 9. Comparison between HawkEye-PMU and HawkEye-G. Values in parentheses represent speedup over baseline (4KB) pages.

Further, we note that this experiment presents a highly-exaggerated behavior of async pre-zeroing on worst-case workloads. In practice, the zeroing thread is rate-limited (e.g., to at most 10K pages per second) and the expected cache-interference overheads would be proportionally smaller.

Comparison between HawkEye-PMU and HawkEye-G. Even though HawkEye-G is based on simple and approximate estimations of TLB contention, it is reasonably accurate in identifying TLB-sensitive processes and memory regions in most cases. However, HawkEye-PMU performs better than HawkEye-G in some cases. Table 9 demonstrates two such examples, where four workloads, all with high access-coverage but different MMU overheads, are executed in two sets; each set contains one TLB-sensitive and one TLB-insensitive workload. The 4KB column represents the case where no huge pages are promoted. As discussed earlier (§2.4), MMU overheads depend on the complex interaction between the hardware and the access pattern. In this case, despite having high access-coverage in their VA regions, sequential and mg.D have negligible MMU overheads (i.e., they are TLB-insensitive). While HawkEye-PMU correctly identifies the process with higher MMU overheads for huge page allocation, HawkEye-G treats both processes similarly (due to imprecise estimations) and may allocate huge pages to the TLB-insensitive process. Consequently, HawkEye-PMU may perform up to 36% better than HawkEye-G. Bridging the gap between the two approaches in a hardware-independent manner is an interesting direction for future work.
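To make the measurement side concrete, the sketch below (ours, not HawkEye's code) reads a generic dTLB-load-miss count for the calling process through Linux's perf_event_open(2) interface, the kind of hardware-counter plumbing on which a HawkEye-PMU-style estimator can be built. The page-walk-cycle events a real MMU-overhead estimate needs are model-specific raw events, so the portable cache event here is only a stand-in:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    /* Generic encoding: dTLB read misses. */
    attr.config = PERF_COUNT_HW_CACHE_DTLB |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* Monitor this process on any CPU. */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the memory-intensive phase to be profiled ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("dTLB load misses: %lld\n", misses);
    /* A HawkEye-PMU-style estimate would instead divide page-walk
       cycles by total cycles over the sampling window. */
    close(fd);
    return 0;
}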

5 Related Work

MMU overheads have been a topic of much research in recent years, and several solutions have been proposed, including solutions that are architecture-based, OS-based and/or compiler-based [28, 34, 49, 56, 58, 62–64].

Hardware Support. Most proposals for hardware support aim to reduce TLB-miss frequency or to accelerate TLB-miss processing. Large multi-level TLBs, and support for multiple page sizes to minimize the number of TLB misses, are already common in general-purpose processors today [15, 20, 21, 40]. Further, modern processors employ page-walk caches to avoid memory lookups in the TLB-miss path [31, 35]. POM-TLB services a TLB miss using a single memory lookup, and further leverages regular data caches to speed up address translation [63]. Direct segments [32, 46] were proposed to completely avoid TLB-miss processing cost through special segmentation hardware.

OS-level challenges of memory fragmentation have also been considered in hardware designs: CoLT, or Coalesced Large-Reach TLBs [61], was initially proposed to increase TLB reach using base pages, based on the observation that OSs naturally provide contiguous mappings at smaller granularity. This approach was further extended to page-walk caches and huge pages [35, 40, 60]. Techniques to support huge pages in non-contiguous memory have also been proposed [44]. Hardware optimizations are important but complementary to HawkEye.

Software Support. FreeBSD's huge page management is largely influenced by previous work on superpages [57]. Carrefour-LP [47] showed that ad-hoc usage of huge pages can degrade performance on NUMA systems, due to additional remote memory accesses and traffic imbalance across node interconnects, and proposed hardware-profiling-based techniques to overcome these problems. Anti-fragmentation techniques that cluster pages based on their mobility type, proposed by Gorman et al. [48], have been adopted by Linux to support the allocation of huge pages in long-running systems. Illuminator [59] highlighted critical drawbacks of Linux's fragmentation-mitigation techniques and their implications for performance and latency, and proposed techniques to solve these problems. Mechanisms for dealing with physical memory fragmentation and NUMA systems are important but orthogonal problems, and their proposed solutions can be adopted to improve the robustness of HawkEye. Compiler or application hints can also be used to assist OSs in prioritizing huge page mappings for certain parts of the address space [27, 56]; Linux already provides an interface for such hints through the madvise system call [16].
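As a concrete illustration of that interface, the following minimal user-space sketch (ours, not from the paper; the 1 GB region size is arbitrary) asks the kernel to prefer transparent huge pages for a specific address range:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = 1UL << 30;  /* a 1 GB "hot" region, illustrative */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint: prioritize THP for this range. This is advisory and has
       no effect if transparent huge pages are disabled system-wide. */
    if (madvise(buf, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");
    return 0;
}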

An alternative approach for supporting huge pages has also been explored via libhugetlbfs [22], where the user is given more control over huge page allocation. However, such an approach requires manual intervention for reserving huge pages in advance, and considers each application in isolation. Windows and OS X support huge pages only through this reservation-based approach, to avoid the issues associated with transparent huge page management [17, 55, 59]. We believe that insights from HawkEye can be leveraged to improve huge page support in these important systems.
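For contrast with the transparent approach, here is a minimal sketch (ours) of Linux's reservation-based interface, which fails unless huge pages have been reserved in advance (e.g., by the administrator via /proc/sys/vm/nr_hugepages):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = 2UL << 20;  /* one 2 MB huge page */
    /* Explicitly request a page from the pre-reserved hugetlbfs pool;
       unlike THP, this returns MAP_FAILED if no pool page is available. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

    ((char *)buf)[0] = 1;  /* touch it: the region is backed by one TLB entry */
    munmap(buf, len);
    return 0;
}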

6 Conclusions

To summarize: transparent huge page management is essential but complex. Effective and efficient algorithms are required in the OS to automatically balance the trade-offs involved and provide performant and stable system behavior. We expose some important subtleties related to huge page management in existing proposals, and propose a new set of algorithms to address them in HawkEye.

References

[1] 2000. Clearing pages in the idle loop. https://www.mail-archive.com/freebsd-hackers@freebsd.org/msg13993.html
[2] 2000. Linux Page Zeroing Strategy. https://yarchive.net/comp/linux/page_zeroing_strategy.html
[3] 2007. Memory part 5: What programmers can do. https://lwn.net/Articles/255364/
[4] 2010. Mysteries of Windows Memory Management Revealed: Part Two. https://blogs.msdn.microsoft.com/tims/2010/10/29/pdc10-mysteries-of-windows-memory-management-revealed-part-two/
[5] 2012. khugepaged eating 100% CPU. https://bugzilla.redhat.com/show_bug.cgi?id=879801
[6] 2012. Recommendation to disable huge pages for Hadoop. https://developer.amd.com/wordpress/media/2012/10/Hadoop_Tuning_Guide-Version5.pdf
[7] 2014. Arch Linux becomes unresponsive from khugepaged. http://unix.stackexchange.com/questions/161858/arch-linux-becomes-unresponsive-from-khugepaged
[8] 2014. CORAL Benchmark Codes. https://asc.llnl.gov/CORAL-benchmarks/#hacc
[9] 2014. Recommendation to disable huge pages for NuoDB. http://www.nuodb.com/techblog/linux-transparent-huge-pages-jemalloc-and-nuodb
[10] 2014. Why TokuDB Hates Transparent HugePages. https://www.percona.com/blog/2014/07/23/why-tokudb-hates-transparent-hugepages/
[11] 2015. The Black Magic Of Systematically Reducing Linux OS Jitter. http://highscalability.com/blog/2015/4/8/the-black-magic-of-systematically-reducing-linux-os-jitter.html
[12] 2015. Tales from the Field: Taming Transparent Huge Pages on Linux. https://www.perforce.com/blog/151016/tales-field-taming-transparent-huge-pages-linux
[13] 2016. C++ associative containers. https://github.com/sparsehash/sparsehash
[14] 2016. Remove PG_ZERO and zeroidle (page-zeroing) entirely. https://news.ycombinator.com/item?id=12227874
[15] 2017. Hugepages. https://wiki.debian.org/Hugepages
[16] 2017. MADVISE. Linux Programmer's Manual. http://man7.org/linux/man-pages/man2/madvise.2.html
[17] 2018. Large-Page Support in Windows. https://docs.microsoft.com/en-gb/windows/desktop/Memory/large-page-support
[18] 2019. CGROUPS. Linux Programmer's Manual. http://man7.org/linux/man-pages/man7/cgroups.7.html
[19] 2019. FreeBSD Pre-Faulting and Zeroing Optimizations. https://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/prefault-optimizations.html
[20] 2019. Intel Haswell. http://www.7-cpu.com/cpu/Haswell.html
[21] 2019. Intel Skylake. http://www.7-cpu.com/cpu/Skylake.html
[22] 2019. Libhugetlbfs: Linux man page. https://linux.die.net/man/7/libhugetlbfs
[23] 2019. Recommendation to disable huge pages for MongoDB. https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
[24] 2019. Recommendation to disable huge pages for Redis. http://redis.io/topics/latency
[25] 2019. Recommendation to disable huge pages for VoltDB. https://docs.voltdb.com/AdminGuide/adminmemmgt.php
[26] 2019. Transparent Hugepage Support. https://www.kernel.org/doc/Documentation/vm/transhuge.txt
[27] Mohammad Agbarya, Idan Yaniv, and Dan Tsafrir. 2018. Memomania: From Huge to Huge-Huge Pages. In Proceedings of the 11th ACM International Systems and Storage Conference (SYSTOR '18). ACM, New York, NY, USA, 112. https://doi.org/10.1145/3211890.3211918
[28] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. 2017. Do-It-Yourself Virtual Memory Translation. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 457–468. https://doi.org/10.1145/3079856.3080209
[29] Nadav Amit, Dan Tsafrir, and Assaf Schuster. 2014. VSwapper: A Memory Swapper for Virtualized Environments. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 349–366. https://doi.org/10.1145/2541940.2541969
[30] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS Parallel Benchmarks: Summary and Preliminary Results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91). ACM, New York, NY, USA, 158–165. https://doi.org/10.1145/125826.125925
[31] Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don't Walk (the Page Table). In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM, New York, NY, USA, 48–59. https://doi.org/10.1145/1815961.1815970
[32] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient Virtual Memory for Big Memory Servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 237–248. https://doi.org/10.1145/2485922.2485943
[33] Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. CoRR abs/1508.03619 (2015). arXiv:1508.03619 http://arxiv.org/abs/1508.03619
[34] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha Manne. 2008. Accelerating Two-dimensional Page Walks for Virtualized Systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XIII). ACM, New York, NY, USA, 26–35. https://doi.org/10.1145/1346281.1346286
[35] Abhishek Bhattacharjee. 2013. Large-reach Memory Management Unit Caches. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 383–394. https://doi.org/10.1145/2540708.2540741
[36] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08). ACM, New York, NY, USA, 72–81. https://doi.org/10.1145/1454115.1454128
[37] Daniel Bovet and Marco Cesati. 2005. Understanding The Linux Kernel. O'Reilly & Associates Inc.
[38] Josiah L. Carlson. 2013. Redis in Action. Manning Publications Co., Greenwich, CT, USA.
[39] Jonathan Corbet. 2010. Memory compaction. https://lwn.net/Articles/368869/
[40] Guilherme Cox and Abhishek Bhattacharjee. 2017. Efficient Address Translation for Architectures with Multiple Page Sizes. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17). ACM, New York, NY, USA, 435–448. https://doi.org/10.1145/3037697.3037704
[41] Cort Dougan, Paul Mackerras, and Victor Yodaiken. 1999. Optimizing the Idle Task and Other MMU Tricks. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (OSDI '99). USENIX Association, Berkeley, CA, USA, 229–237. http://dl.acm.org/citation.cfm?id=296806.296833
[42] Niall Douglas. 2011. User Mode Memory Page Allocation: A Silver Bullet For Memory Allocation? CoRR abs/1105.1811 (2011). arXiv:1105.1811 http://arxiv.org/abs/1105.1811
[43] Ulrich Drepper. 2007. What Every Programmer Should Know About Memory.
[44] Yu Du, Miao Zhou, Bruce R. Childers, Daniel Mossé, and Rami G. Melhem. 2015. Supporting superpages in non-contiguous physical memory. 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA) (2015), 223–234.
[45] M. Franklin, D. Yeung, Xue Wu, A. Jaleel, K. Albayraktaroglu, B. Jacob, and Chau-Wen Tseng. 2005. BioBench: A Benchmark Suite of Bioinformatics Applications. In IEEE International Symposium on Performance Analysis of Systems and Software, 2005 (ISPASS 2005), Vol. 00, 2–9. https://doi.org/10.1109/ISPASS.2005.1430554
[46] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2014. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 178–189. https://doi.org/10.1109/MICRO.2014.37
[47] Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston, Alexandra Fedorova, and Vivien Quéma. 2014. Large Pages May Be Harmful on NUMA Systems. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference (USENIX ATC '14). USENIX Association, Berkeley, CA, USA, 231–242. http://dl.acm.org/citation.cfm?id=2643634.2643659
[48] Mel Gorman and Patrick Healy. 2008. Supporting Superpage Allocation Without Additional Hardware Support. In Proceedings of the 7th International Symposium on Memory Management (ISMM '08). ACM, New York, NY, USA, 41–50. https://doi.org/10.1145/1375634.1375641
[49] Mel Gorman and Patrick Healy. 2012. Performance Characteristics of Explicit Superpage Support. In Proceedings of the 2010 International Conference on Computer Architecture (ISCA '10). Springer-Verlag, Berlin, Heidelberg, 293–310. https://doi.org/10.1007/978-3-642-24322-6_24
[50] Mel Gorman and Andy Whitcroft. 2006. The What, The Why and the Where To of Anti-Fragmentation. In Linux Symposium, 141.
[51] Fei Guo, Seongbeom Kim, Yury Baskakov, and Ishan Banerjee. 2015. Proactively Breaking Large Pages to Improve Memory Overcommitment Performance in VMware ESXi. In Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '15). ACM, New York, NY, USA, 39–51. https://doi.org/10.1145/2731186.2731187
[52] Fan Guo, Yongkun Li, Yinlong Xu, Song Jiang, and John C. S. Lui. 2017. SmartMD: A High Performance Deduplication Engine with Mixed Pages. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC '17). USENIX Association, Berkeley, CA, USA, 733–744. http://dl.acm.org/citation.cfm?id=3154690.3154759
[53] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (Sept. 2006), 1–17. https://doi.org/10.1145/1186736.1186737
[54] Vasileios Karakostas, Osman S. Unsal, Mario Nemirovsky, Adrián Cristal, and Michael M. Swift. 2014. Performance analysis of the memory management unit under scale-out workloads. 2014 IEEE International Symposium on Workload Characterization (IISWC) (2014), 1–12.
[55] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 705–721. http://dl.acm.org/citation.cfm?id=3026877.3026931
[56] Joshua Magee and Apan Qasem. 2009. A Case for Compiler-driven Superpage Allocation. In Proceedings of the 47th Annual Southeast Regional Conference (ACM-SE 47). ACM, New York, NY, USA, Article 82, 4 pages. https://doi.org/10.1145/1566445.1566553
[57] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. 2002. Practical, Transparent Operating System Support for Superpages. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 89–104. https://doi.org/10.1145/844128.844138
[58] Ashish Panwar, Naman Patel, and K. Gopinath. 2016. A Case for Protecting Huge Pages from the Kernel. In Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys '16). ACM, New York, NY, USA, Article 15, 8 pages. https://doi.org/10.1145/2967360.2967371
[59] Ashish Panwar, Aravinda Prasad, and K. Gopinath. 2018. Making Huge Pages Actually Useful. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 679–692. https://doi.org/10.1145/3173162.3173203
[60] Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel H. Loh. 2014. Increasing TLB reach by exploiting clustering in page translations. 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA) (2014), 558–567.
[61] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced Large-Reach TLBs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 258–269. https://doi.org/10.1109/MICRO.2012.32
[62] Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee. 2015. Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways? In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 1–12. https://doi.org/10.1145/2830772.2830773
[63] Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. 2017. Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 469–480. https://doi.org/10.1145/3079856.3080210
[64] Indira Subramanian, Clifford Mather, Kurt Peterson, and Balakrishna Raghunath. 1998. Implementation of Multiple Pagesize Support in HP-UX. In USENIX Annual Technical Conference, 105–119.
[65] John R. Tramm, Andrew R. Siegel, Tanzima Islam, and Martin Schulz. [n. d.]. XSBench: The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis. In PHYSOR 2014: The Role of Reactor Physics toward a Sustainable Future. Kyoto.
[66] Koji Ueno and Toyotaro Suzumura. 2012. Highly Scalable Graph Search for the Graph500 Benchmark. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC '12). ACM, New York, NY, USA, 149–160. https://doi.org/10.1145/2287076.2287104
[67] Carl A. Waldspurger. 2002. Memory Resource Management in VMware ESX Server. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 181–194. https://doi.org/10.1145/844128.844146
[68] Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards Practical Page Coloring-based Multicore Cache Management. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys '09). ACM, New York, NY, USA, 89–102. https://doi.org/10.1145/1519065.1519076

  • Abstract
  • 1 Introduction
  • 2 Motivation
    • 21 Address Translation Overhead vs Memory Bloat
    • 22 Page fault latency vs Number of page faults
    • 23 Huge page allocation across multiple processes
    • 24 How to capture address translation overheads
      • 3 Design and Implementation
        • 31 Asynchronous page pre-zeroing
        • 32 Managing bloat vs performance
        • 33 Fine-grained huge page promotion
        • 34 Huge page allocation across multiple processes
        • 35 Limitations and discussion
          • 4 Evaluation
          • 5 Related Work
          • 6 Conclusions
          • References
Page 12: HawkEye: Efficient Fine-grained OS Support for …sbansal/pubs/hawkeye.pdfHawkEye: Efficient Fine-grained OS Support for Huge Pages Ashish Panwar Indian Institute of Science ashishpanwar@iisc.ac.in

Workload MMUOverhead

Time (seconds)4KB HawkEye-PMU HawkEye-G

random(4GB) 60 582 328(177times) 413(141times)sequential(4GB) lt 1 517 535 532Total 1099 863(127times) 945(116times)cgD(16GB) 39 1952 1202(162times) 1450(135times)mgD(24GB) lt 1 1363 1364 1377Total 3315 2566(129times) 2827(117times)

Table 9 Comparison between HawkEye-PMU andHawkEye-G Values in parentheses represent speedup overbaseline pagesa highly-exaggerated behavior of async pre-zeroing onworst-case workloads In practice the zeroing thread is rate-limited(eg at most 10k pages per second) and the expected cache-interference overheads would be proportionally smallerComparison betweenHawk-PMUandHawkEye-G Eventhough HawkEye-G is based on simple and approximatedestimations of TLB contention it is reasonably accurate inidentifying TLB sensitive processes and memory regionsin most cases However HawkEye-PMU performs betterthan HawkEye-G in some cases Table 9 demonstrates twosuch examples where four workloads all with high access-coverage but different MMU overheads are executed in twosets each set contains one TLB sensitive and one TLB insensi-tive workload The 4KB column represents the case where nohuge pages are promoted As discussed earlier (sect24) MMUoverheads depend on the complex interaction between thehardware and access pattern In this case despite havinghigh access-coverage in VA regions sequential and mgDhave negligible MMU overheads (ie they are TLB insen-sitive) While HawkEye-PMU correctly identifies the pro-cess with higher MMU overheads for huge page allocationHawkEye-G treats them similarly (due to imprecise estima-tions) and may allocate huge pages to the TLB insensitiveprocess Consequently HawkEye-PMU may perform up to36 better than HawkEye-G Bridging this gap between thetwo approaches in a hardware independent manner is aninteresting future work

5 Related WorkMMU overheads have been a topic of much research in re-cent years and several solutions have been proposed includ-ing solutions that are architecture-based OS-based andorcompiler-based [28 34 49 56 58 62ndash64]Hardware SupportMost proposals for hardware supportaim to reduce TLB-miss frequency or accelerate TLB-missprocessing Large multi-level TLBs and support for multiplepage sizes to minimize the number of TLB-misses is alreadycommon in general-purpose processors today [15 20 21 40]Further modern processors employ page-walk-caches toavoid memory lookups in the TLB-miss path [31 35] POM-TLB services a TLB miss using a single memory lookup andfurther leverages regular data caches to speed up addresstranslation [63] Direct segments [32 46] were proposed tocompletely avoid TLB-miss processing cost through special

segmentation hardware OS-level challenges of memory frag-mentation have also been considered in hardware designsCoLT or Coalesced-Large-reach TLBs [61] were initially pro-posed to increase TLB reach using base pages based on theobservation that OSs naturally provide contiguous mappingsat smaller granularity This approach was further extendedto page-walk caches and huge pages [35 40 60] Techniquesto support huge pages in non-contiguous memory have alsobeen proposed [44] Hardware optimizations are importantbut complementary to HawkEyeSoftware Support FreeBSDrsquos huge page management islargely influenced by previous work on superpages [57]Carrefour-LP [47] showed that ad-hoc usage of huge pagescan degrade performance in NUMA systems due to addi-tional remote memory accesses and traffic imbalance acrossnode interconnects and proposed hardware-profiling basedtechniques to overcome these problems Anti-fragmentationtechniques for clustering pages based on mobility type ofpages proposed by Gorman et al [48] have been adoptedby Linux to support the allocation of huge pages in long-running systems Illuminator [59] highlighted critical draw-backs of Linuxrsquos fragmentation mitigation techniques andtheir implications on performance and latency and proposedtechniques to solve these problems Mechanisms of dealingwith physical memory fragmentation and NUMA systemsare important but orthogonal problems and their proposedsolutions can be adopted to improve the robustness of Hawk-Eye Compiler or application hints can also be used to assistOSs in prioritizing huge page mapping for certain parts ofthe address space [27 56] for which an interface is alreadyprovided by Linux through the madvise system call [16]An alternative approach for supporting huge pages has

also been explored via libhugetlbfs [22] where the useris provided more control on huge page allocation Howeversuch an approach requires manual intervention for reserv-ing huge pages in advance and considers each applicationin isolation Windows and OS X support huge pages onlythrough this reservation-based approach to avoid issues asso-ciated with transparent huge page management [17 55 59]We believe that insights from HawkEye can be leveraged toimprove huge page support in these important systems

6 ConclusionsTo summarize transparent huge page management is es-sential but complex Effective and efficient algorithms arerequired in the OS to automatically balance the tradeoffs andprovide performant and stable system behavior We exposesome important subtleties related to huge page managementin existing proposals and propose a new set of algorithmsto address them in HawkEye

References[1] 2000 Clearing pages in the idle loop httpswwwmail-archivecom

freebsd-hackersfreebsdorgmsg13993html

[2] 2000 Linux Page Zeroing Strategy httpsyarchivenetcomplinuxpagezeroingstrategyhtml

[3] 2007 Memory part 5 What programmers can do httpslwnnetArticles255364

[4] 2010 Mysteries of Windows Memory Management RevealedPart Two httpsblogsmsdnmicrosoftcomtims20101029pdc10-mysteries-of-windows-memory-management-revealed-part-two

[5] 2012 khugepaged eating 100 CPU httpsbugzillaredhatcomshowbugcgiid=879801

[6] 2012 Recommendation to disable huge pages for Hadoophttpsdeveloperamdcomwordpressmedia201210HadoopTuningGuide-Version5pdf

[7] 2014 Arch Linux becomes unresponsive from khugepagedhttpunixstackexchangecomquestions161858arch-linux-becomes-unresponsive-from-khugepaged

[8] 2014 CORAL Benchmark Codes httpsascllnlgovCORAL-benchmarkshacc

[9] 2014 Recommendation to disable huge pages for NuoDBhttpwwwnuodbcomtechbloglinux-transparent-huge-pages-jemalloc-and-nuodb

[10] 2014 Why TokuDB Hates Transparent HugePageshttpswwwperconacomblog20140723why-tokudb-hates-transparent-hugepages

[11] 2015 The Black Magic Of Systematically Reducing Linux OS Jit-ter httphighscalabilitycomblog201548the-black-magic-of-systematically-reducing-linux-os-jitterhtml

[12] 2015 Tales from the Field Taming Transparent Huge Pages onLinux httpswwwperforcecomblog151016tales-field-taming-transparent-huge-pages-linux

[13] 2016 C++ associative containers httpsgithubcomsparsehashsparsehash

[14] 2016 Remove PG_ZERO and zeroidle (page-zeroing) entirely httpsnewsycombinatorcomitemid=12227874

[15] 2017 Hugepages httpswikidebianorgHugepages[16] 2017 MADVISE Linux Programmerrsquos Manual httpman7orglinux

man-pagesman2madvise2html[17] 2018 Large-Page Support in Windows httpsdocsmicrosoftcom

en-gbwindowsdesktopMemorylarge-page-support[18] 2019 CGROUPS Linux Programmerrsquos Manual httpman7orglinux

man-pagesman7cgroups7html[19] 2019 FreeBSD Pre-Faulting and Zeroing Optimizations

httpswwwfreebsdorgdocenUSISO8859-1articlesvm-designprefault-optimizationshtml

[20] 2019 Intel Haswell httpwww7-cpucomcpuHaswellhtml[21] 2019 Intel Skylake httpwww7-cpucomcpuSkylakehtml[22] 2019 Libhugetlbfs Linux man page httpslinuxdienetman7

libhugetlbfs[23] 2019 Recommendation to disable huge pages for MongoDB https

docsmongodbcommanualtutorialtransparent-huge-pages[24] 2019 Recommendation to disable huge pages for Redis httpredisio

topicslatency[25] 2019 Recommendation to disable huge pages for VoltDB https

docsvoltdbcomAdminGuideadminmemmgtphp[26] 2019 Transparent Hugepage Support httpswwwkernelorgdoc

Documentationvmtranshugetxt[27] Mohammad Agbarya Idan Yaniv and Dan Tsafrir 2018 Memomania

From Huge to Huge-Huge Pages In Proceedings of the 11th ACM In-ternational Systems and Storage Conference (SYSTOR rsquo18) ACM NewYork NY USA 112ndash112 httpsdoiorg10114532118903211918

[28] Hanna Alam Tianhao Zhang Mattan Erez and Yoav Etsion 2017Do-It-Yourself Virtual Memory Translation In Proceedings of the44th Annual International Symposium on Computer Architecture (ISCArsquo17) ACM New York NY USA 457ndash468 httpsdoiorg10114530798563080209

[29] Nadav Amit Dan Tsafrir and Assaf Schuster 2014 VSwapper AMemory Swapper for Virtualized Environments In Proceedings of the19th International Conference on Architectural Support for ProgrammingLanguages and Operating Systems (ASPLOS rsquo14) ACM New York NYUSA 349ndash366 httpsdoiorg10114525419402541969

[30] D H Bailey E Barszcz J T Barton D S Browning R L Carter LDagum R A Fatoohi P O Frederickson T A Lasinski R S SchreiberH D Simon V Venkatakrishnan and S KWeeratunga 1991 The NASParallel BenchmarksSummary and Preliminary Results In Proceedingsof the 1991 ACMIEEE Conference on Supercomputing (Supercomputingrsquo91) ACM New York NY USA 158ndash165 httpsdoiorg101145125826125925

[31] Thomas W Barr Alan L Cox and Scott Rixner 2010 TranslationCaching Skip DonrsquoT Walk (the Page Table) In Proceedings of the37th Annual International Symposium on Computer Architecture (ISCArsquo10) ACM New York NY USA 48ndash59 httpsdoiorg10114518159611815970

[32] Arkaprava Basu Jayneel Gandhi Jichuan Chang Mark D Hill andMichael M Swift 2013 Efficient Virtual Memory for Big MemoryServers In Proceedings of the 40th Annual International Symposium onComputer Architecture (ISCA rsquo13) ACM New York NY USA 237ndash248httpsdoiorg10114524859222485943

[33] Scott Beamer Krste Asanovic and David A Patterson 2015 TheGAP Benchmark Suite CoRR abs150803619 (2015) arXiv150803619httparxivorgabs150803619

[34] Ravi Bhargava Benjamin Serebrin Francesco Spadini and SrilathaManne 2008 Accelerating Two-dimensional Page Walks for Vir-tualized Systems In Proceedings of the 13th International Confer-ence on Architectural Support for Programming Languages and Op-erating Systems (ASPLOS XIII) ACM New York NY USA 26ndash35httpsdoiorg10114513462811346286

[35] Abhishek Bhattacharjee 2013 Large-reach Memory ManagementUnit Caches In Proceedings of the 46th Annual IEEEACM InternationalSymposium on Microarchitecture (MICRO-46) ACM New York NYUSA 383ndash394 httpsdoiorg10114525407082540741

[36] Christian Bienia Sanjeev Kumar Jaswinder Pal Singh and Kai Li 2008The PARSEC Benchmark Suite Characterization and ArchitecturalImplications In Proceedings of the 17th International Conference onParallel Architectures and Compilation Techniques (PACT rsquo08) ACMNew York NY USA 72ndash81 httpsdoiorg10114514541151454128

[37] Daniel Bovet and Marco Cesati 2005 Understanding The Linux KernelOreilly amp Associates Inc

[38] Josiah L Carlson 2013 Redis in Action Manning Publications CoGreenwich CT USA

[39] Jonathan Corbet 2010 Memory compaction httpslwnnetArticles368869

[40] Guilherme Cox and Abhishek Bhattacharjee 2017 Efficient AddressTranslation for Architectures with Multiple Page Sizes In Proceed-ings of the Twenty-Second International Conference on ArchitecturalSupport for Programming Languages and Operating Systems (ASPLOSrsquo17) ACM New York NY USA 435ndash448 httpsdoiorg10114530376973037704

[41] Cort Dougan Paul Mackerras and Victor Yodaiken 1999 Opti-mizing the Idle Task and Other MMU Tricks In Proceedings of theThird Symposium on Operating Systems Design and Implementation(OSDI rsquo99) USENIX Association Berkeley CA USA 229ndash237 httpdlacmorgcitationcfmid=296806296833

[42] Niall Douglas 2011 User Mode Memory Page Allocation A Sil-ver Bullet For Memory Allocation CoRR abs11051811 (2011)arXiv11051811 httparxivorgabs11051811

[43] Ulrich Drepper 2007 What Every Programmer Should Know AboutMemory

[44] Yu Du Miao Zhou Bruce R Childers Daniel Mosseacute and Rami GMelhem 2015 Supporting superpages in non-contiguous physicalmemory 2015 IEEE 21st International Symposium on High Performance

Computer Architecture (HPCA) (2015) 223ndash234[45] M Franklin D Yeung n Xue Wu A Jaleel K Albayraktaroglu B

Jacob and n Chau-Wen Tseng 2005 BioBench A Benchmark Suite ofBioinformatics Applications In IEEE International Symposium on Per-formance Analysis of Systems and Software 2005 ISPASS 2005(ISPASS)Vol 00 2ndash9 httpsdoiorg101109ISPASS20051430554

[46] Jayneel Gandhi Arkaprava Basu Mark D Hill and Michael M Swift2014 Efficient Memory Virtualization Reducing Dimensionality ofNested Page Walks In Proceedings of the 47th Annual IEEEACM Inter-national Symposium on Microarchitecture (MICRO-47) IEEE ComputerSociety Washington DC USA 178ndash189 httpsdoiorg101109MICRO201437

[47] Fabien Gaud Baptiste Lepers Jeremie Decouchant Justin FunstonAlexandra Fedorova and Vivien Queacutema 2014 Large Pages MayBe Harmful on NUMA Systems In Proceedings of the 2014 USENIXConference on USENIX Annual Technical Conference (USENIX ATCrsquo14)USENIX Association Berkeley CA USA 231ndash242 httpdlacmorgcitationcfmid=26436342643659

[48] Mel Gorman and Patrick Healy 2008 Supporting Superpage Alloca-tion Without Additional Hardware Support In Proceedings of the 7thInternational Symposium on Memory Management (ISMM rsquo08) ACMNew York NY USA 41ndash50 httpsdoiorg10114513756341375641

[49] Mel Gorman and Patrick Healy 2012 Performance Characteristics ofExplicit Superpage Support In Proceedings of the 2010 InternationalConference on Computer Architecture (ISCArsquo10) Springer-Verlag BerlinHeidelberg 293ndash310 httpsdoiorg101007978-3-642-24322-624

[50] Mel Gorman and Andy Whitcroft 2006 The What The Why and theWhere To of Anti-Fragmentation In Linux Symposium 141

[51] Fei Guo Seongbeom Kim Yury Baskakov and Ishan Banerjee 2015Proactively Breaking Large Pages to Improve Memory Overcom-mitment Performance in VMware ESXi In Proceedings of the 11thACM SIGPLANSIGOPS International Conference on Virtual ExecutionEnvironments (VEE rsquo15) ACM New York NY USA 39ndash51 httpsdoiorg10114527311862731187

[52] Fan Guo Yongkun Li Yinlong Xu Song Jiang and John C S Lui2017 SmartMD A High Performance Deduplication Engine withMixed Pages In Proceedings of the 2017 USENIX Conference on UsenixAnnual Technical Conference (USENIX ATC rsquo17) USENIX AssociationBerkeley CA USA 733ndash744 httpdlacmorgcitationcfmid=31546903154759

[53] John L Henning 2006 SPEC CPU2006 Benchmark DescriptionsSIGARCH Comput Archit News 34 4 (Sept 2006) 1ndash17 httpsdoiorg10114511867361186737

[54] Vasileios Karakostas Osman S Unsal Mario Nemirovsky AdriaacutenCristal and Michael M Swift 2014 Performance analysis of thememory management unit under scale-out workloads 2014 IEEE In-ternational Symposium on Workload Characterization (IISWC) (2014)1ndash12

[55] Youngjin Kwon Hangchen Yu Simon Peter Christopher J Rossbachand Emmett Witchel 2016 Coordinated and Efficient Huge PageManagement with Ingens In Proceedings of the 12th USENIX Con-ference on Operating Systems Design and Implementation (OSDIrsquo16)USENIX Association Berkeley CA USA 705ndash721 httpdlacmorgcitationcfmid=30268773026931

[56] Joshua Magee and Apan Qasem 2009 A Case for Compiler-drivenSuperpage Allocation In Proceedings of the 47th Annual SoutheastRegional Conference (ACM-SE 47) ACM New York NY USA Article82 4 pages httpsdoiorg10114515664451566553

[57] Juan Navarro Sitararn Iyer Peter Druschel and Alan Cox 2002 Prac-tical Transparent Operating System Support for Superpages SIGOPSOper Syst Rev 36 SI (Dec 2002) 89ndash104 httpsdoiorg101145844128844138

[58] Ashish Panwar Naman Patel and K Gopinath 2016 A Case forProtecting Huge Pages from the Kernel In Proceedings of the 7th ACMSIGOPS Asia-Pacific Workshop on Systems (APSys rsquo16) ACM New YorkNY USA Article 15 8 pages httpsdoiorg10114529673602967371

[59] Ashish Panwar Aravinda Prasad and K Gopinath 2018 Making HugePages Actually Useful In Proceedings of the Twenty-Third InternationalConference on Architectural Support for Programming Languages andOperating Systems (ASPLOS rsquo18) ACM New York NY USA 679ndash692httpsdoiorg10114531731623173203

[60] Binh Pham Abhishek Bhattacharjee Yasuko Eckert and Gabriel HLoh 2014 Increasing TLB reach by exploiting clustering in page trans-lations 2014 IEEE 20th International Symposium on High PerformanceComputer Architecture (HPCA) (2014) 558ndash567

[61] Binh Pham Viswanathan Vaidyanathan Aamer Jaleel and AbhishekBhattacharjee 2012 CoLT Coalesced Large-Reach TLBs In Proceed-ings of the 2012 45th Annual IEEEACM International Symposium onMicroarchitecture (MICRO-45) IEEE Computer Society WashingtonDC USA 258ndash269 httpsdoiorg101109MICRO201232

[62] Binh Pham Jaacuten Veselyacute Gabriel H Loh and Abhishek Bhattacharjee2015 Large Pages and Lightweight Memory Management in Virtual-ized Environments Can You Have It Both Ways In Proceedings of the48th International Symposium on Microarchitecture (MICRO-48) ACMNew York NY USA 1ndash12 httpsdoiorg10114528307722830773

[63] Jee Ho Ryoo Nagendra Gulur Shuang Song and Lizy K John 2017Rethinking TLB Designs in Virtualized Environments A Very LargePart-of-Memory TLB In Proceedings of the 44th Annual InternationalSymposium on Computer Architecture (ISCA rsquo17) ACM New York NYUSA 469ndash480 httpsdoiorg10114530798563080210

[64] Indira Subramanian Clifford Mather Kurt Peterson and BalakrishnaRaghunath 1998 Implementation of Multiple Pagesize Support inHP-UX In USENIX Annual Technical Conference 105ndash119

[65] John R Tramm Andrew R Siegel Tanzima Islam and Martin Schulz[n d] XSBench - The Development and Verification of a PerformanceAbstraction for Monte Carlo Reactor Analysis In PHYSOR 2014 - TheRole of Reactor Physics toward a Sustainable Future Kyoto

[66] Koji Ueno and Toyotaro Suzumura 2012 Highly Scalable Graph Searchfor the Graph500 Benchmark In Proceedings of the 21st InternationalSymposium on High-Performance Parallel and Distributed Computing(HPDC rsquo12) ACM New York NY USA 149ndash160 httpsdoiorg10114522870762287104

[67] Carl AWaldspurger 2002 Memory ResourceManagement in VMwareESX Server SIGOPS Oper Syst Rev 36 SI (Dec 2002) 181ndash194 httpsdoiorg101145844128844146

[68] Xiao Zhang Sandhya Dwarkadas and Kai Shen 2009 Towards Practi-cal Page Coloring-based Multicore Cache Management In Proceedingsof the 4th ACM European Conference on Computer Systems (EuroSysrsquo09) ACM New York NY USA 89ndash102 httpsdoiorg10114515190651519076

  • Abstract
  • 1 Introduction
  • 2 Motivation
    • 21 Address Translation Overhead vs Memory Bloat
    • 22 Page fault latency vs Number of page faults
    • 23 Huge page allocation across multiple processes
    • 24 How to capture address translation overheads
      • 3 Design and Implementation
        • 31 Asynchronous page pre-zeroing
        • 32 Managing bloat vs performance
        • 33 Fine-grained huge page promotion
        • 34 Huge page allocation across multiple processes
        • 35 Limitations and discussion
          • 4 Evaluation
          • 5 Related Work
          • 6 Conclusions
          • References
Page 13: HawkEye: Efficient Fine-grained OS Support for …sbansal/pubs/hawkeye.pdfHawkEye: Efficient Fine-grained OS Support for Huge Pages Ashish Panwar Indian Institute of Science ashishpanwar@iisc.ac.in

[2] 2000 Linux Page Zeroing Strategy httpsyarchivenetcomplinuxpagezeroingstrategyhtml

[3] 2007 Memory part 5 What programmers can do httpslwnnetArticles255364

[4] 2010 Mysteries of Windows Memory Management RevealedPart Two httpsblogsmsdnmicrosoftcomtims20101029pdc10-mysteries-of-windows-memory-management-revealed-part-two

[5] 2012 khugepaged eating 100 CPU httpsbugzillaredhatcomshowbugcgiid=879801

[6] 2012 Recommendation to disable huge pages for Hadoophttpsdeveloperamdcomwordpressmedia201210HadoopTuningGuide-Version5pdf

[7] 2014 Arch Linux becomes unresponsive from khugepagedhttpunixstackexchangecomquestions161858arch-linux-becomes-unresponsive-from-khugepaged

[8] 2014 CORAL Benchmark Codes httpsascllnlgovCORAL-benchmarkshacc

[9] 2014 Recommendation to disable huge pages for NuoDBhttpwwwnuodbcomtechbloglinux-transparent-huge-pages-jemalloc-and-nuodb

[10] 2014 Why TokuDB Hates Transparent HugePageshttpswwwperconacomblog20140723why-tokudb-hates-transparent-hugepages

[11] 2015 The Black Magic Of Systematically Reducing Linux OS Jit-ter httphighscalabilitycomblog201548the-black-magic-of-systematically-reducing-linux-os-jitterhtml

[12] 2015 Tales from the Field Taming Transparent Huge Pages onLinux httpswwwperforcecomblog151016tales-field-taming-transparent-huge-pages-linux

[13] 2016 C++ associative containers httpsgithubcomsparsehashsparsehash

[14] 2016 Remove PG_ZERO and zeroidle (page-zeroing) entirely httpsnewsycombinatorcomitemid=12227874

[15] 2017 Hugepages httpswikidebianorgHugepages[16] 2017 MADVISE Linux Programmerrsquos Manual httpman7orglinux

man-pagesman2madvise2html[17] 2018 Large-Page Support in Windows httpsdocsmicrosoftcom

en-gbwindowsdesktopMemorylarge-page-support[18] 2019 CGROUPS Linux Programmerrsquos Manual httpman7orglinux

man-pagesman7cgroups7html[19] 2019 FreeBSD Pre-Faulting and Zeroing Optimizations

httpswwwfreebsdorgdocenUSISO8859-1articlesvm-designprefault-optimizationshtml

[20] 2019 Intel Haswell httpwww7-cpucomcpuHaswellhtml[21] 2019 Intel Skylake httpwww7-cpucomcpuSkylakehtml[22] 2019 Libhugetlbfs Linux man page httpslinuxdienetman7

libhugetlbfs[23] 2019 Recommendation to disable huge pages for MongoDB https

docsmongodbcommanualtutorialtransparent-huge-pages[24] 2019 Recommendation to disable huge pages for Redis httpredisio

topicslatency[25] 2019 Recommendation to disable huge pages for VoltDB https

docsvoltdbcomAdminGuideadminmemmgtphp[26] 2019 Transparent Hugepage Support httpswwwkernelorgdoc

Documentationvmtranshugetxt[27] Mohammad Agbarya Idan Yaniv and Dan Tsafrir 2018 Memomania

From Huge to Huge-Huge Pages In Proceedings of the 11th ACM In-ternational Systems and Storage Conference (SYSTOR rsquo18) ACM NewYork NY USA 112ndash112 httpsdoiorg10114532118903211918

[28] Hanna Alam Tianhao Zhang Mattan Erez and Yoav Etsion 2017Do-It-Yourself Virtual Memory Translation In Proceedings of the44th Annual International Symposium on Computer Architecture (ISCArsquo17) ACM New York NY USA 457ndash468 httpsdoiorg10114530798563080209

[29] Nadav Amit Dan Tsafrir and Assaf Schuster 2014 VSwapper AMemory Swapper for Virtualized Environments In Proceedings of the19th International Conference on Architectural Support for ProgrammingLanguages and Operating Systems (ASPLOS rsquo14) ACM New York NYUSA 349ndash366 httpsdoiorg10114525419402541969

[30] D H Bailey E Barszcz J T Barton D S Browning R L Carter LDagum R A Fatoohi P O Frederickson T A Lasinski R S SchreiberH D Simon V Venkatakrishnan and S KWeeratunga 1991 The NASParallel BenchmarksSummary and Preliminary Results In Proceedingsof the 1991 ACMIEEE Conference on Supercomputing (Supercomputingrsquo91) ACM New York NY USA 158ndash165 httpsdoiorg101145125826125925

[31] Thomas W Barr Alan L Cox and Scott Rixner 2010 TranslationCaching Skip DonrsquoT Walk (the Page Table) In Proceedings of the37th Annual International Symposium on Computer Architecture (ISCArsquo10) ACM New York NY USA 48ndash59 httpsdoiorg10114518159611815970

[32] Arkaprava Basu Jayneel Gandhi Jichuan Chang Mark D Hill andMichael M Swift 2013 Efficient Virtual Memory for Big MemoryServers In Proceedings of the 40th Annual International Symposium onComputer Architecture (ISCA rsquo13) ACM New York NY USA 237ndash248httpsdoiorg10114524859222485943

[33] Scott Beamer Krste Asanovic and David A Patterson 2015 TheGAP Benchmark Suite CoRR abs150803619 (2015) arXiv150803619httparxivorgabs150803619

[34] Ravi Bhargava Benjamin Serebrin Francesco Spadini and SrilathaManne 2008 Accelerating Two-dimensional Page Walks for Vir-tualized Systems In Proceedings of the 13th International Confer-ence on Architectural Support for Programming Languages and Op-erating Systems (ASPLOS XIII) ACM New York NY USA 26ndash35httpsdoiorg10114513462811346286

[35] Abhishek Bhattacharjee 2013 Large-reach Memory ManagementUnit Caches In Proceedings of the 46th Annual IEEEACM InternationalSymposium on Microarchitecture (MICRO-46) ACM New York NYUSA 383ndash394 httpsdoiorg10114525407082540741

[36] Christian Bienia Sanjeev Kumar Jaswinder Pal Singh and Kai Li 2008The PARSEC Benchmark Suite Characterization and ArchitecturalImplications In Proceedings of the 17th International Conference onParallel Architectures and Compilation Techniques (PACT rsquo08) ACMNew York NY USA 72ndash81 httpsdoiorg10114514541151454128

[37] Daniel Bovet and Marco Cesati 2005 Understanding The Linux KernelOreilly amp Associates Inc

[38] Josiah L Carlson 2013 Redis in Action Manning Publications CoGreenwich CT USA

[39] Jonathan Corbet 2010 Memory compaction httpslwnnetArticles368869

[40] Guilherme Cox and Abhishek Bhattacharjee 2017 Efficient AddressTranslation for Architectures with Multiple Page Sizes In Proceed-ings of the Twenty-Second International Conference on ArchitecturalSupport for Programming Languages and Operating Systems (ASPLOSrsquo17) ACM New York NY USA 435ndash448 httpsdoiorg10114530376973037704

[41] Cort Dougan Paul Mackerras and Victor Yodaiken 1999 Opti-mizing the Idle Task and Other MMU Tricks In Proceedings of theThird Symposium on Operating Systems Design and Implementation(OSDI rsquo99) USENIX Association Berkeley CA USA 229ndash237 httpdlacmorgcitationcfmid=296806296833

[42] Niall Douglas 2011 User Mode Memory Page Allocation A Sil-ver Bullet For Memory Allocation CoRR abs11051811 (2011)arXiv11051811 httparxivorgabs11051811

[43] Ulrich Drepper 2007 What Every Programmer Should Know AboutMemory

[44] Yu Du Miao Zhou Bruce R Childers Daniel Mosseacute and Rami GMelhem 2015 Supporting superpages in non-contiguous physicalmemory 2015 IEEE 21st International Symposium on High Performance

Computer Architecture (HPCA) (2015) 223ndash234[45] M Franklin D Yeung n Xue Wu A Jaleel K Albayraktaroglu B

Jacob and n Chau-Wen Tseng 2005 BioBench A Benchmark Suite ofBioinformatics Applications In IEEE International Symposium on Per-formance Analysis of Systems and Software 2005 ISPASS 2005(ISPASS)Vol 00 2ndash9 httpsdoiorg101109ISPASS20051430554

[46] Jayneel Gandhi Arkaprava Basu Mark D Hill and Michael M Swift2014 Efficient Memory Virtualization Reducing Dimensionality ofNested Page Walks In Proceedings of the 47th Annual IEEEACM Inter-national Symposium on Microarchitecture (MICRO-47) IEEE ComputerSociety Washington DC USA 178ndash189 httpsdoiorg101109MICRO201437

[47] Fabien Gaud Baptiste Lepers Jeremie Decouchant Justin FunstonAlexandra Fedorova and Vivien Queacutema 2014 Large Pages MayBe Harmful on NUMA Systems In Proceedings of the 2014 USENIXConference on USENIX Annual Technical Conference (USENIX ATCrsquo14)USENIX Association Berkeley CA USA 231ndash242 httpdlacmorgcitationcfmid=26436342643659

[48] Mel Gorman and Patrick Healy 2008 Supporting Superpage Alloca-tion Without Additional Hardware Support In Proceedings of the 7thInternational Symposium on Memory Management (ISMM rsquo08) ACMNew York NY USA 41ndash50 httpsdoiorg10114513756341375641

[49] Mel Gorman and Patrick Healy 2012 Performance Characteristics ofExplicit Superpage Support In Proceedings of the 2010 InternationalConference on Computer Architecture (ISCArsquo10) Springer-Verlag BerlinHeidelberg 293ndash310 httpsdoiorg101007978-3-642-24322-624

[50] Mel Gorman and Andy Whitcroft 2006 The What The Why and theWhere To of Anti-Fragmentation In Linux Symposium 141

[51] Fei Guo Seongbeom Kim Yury Baskakov and Ishan Banerjee 2015Proactively Breaking Large Pages to Improve Memory Overcom-mitment Performance in VMware ESXi In Proceedings of the 11thACM SIGPLANSIGOPS International Conference on Virtual ExecutionEnvironments (VEE rsquo15) ACM New York NY USA 39ndash51 httpsdoiorg10114527311862731187

[52] Fan Guo Yongkun Li Yinlong Xu Song Jiang and John C S Lui2017 SmartMD A High Performance Deduplication Engine withMixed Pages In Proceedings of the 2017 USENIX Conference on UsenixAnnual Technical Conference (USENIX ATC rsquo17) USENIX AssociationBerkeley CA USA 733ndash744 httpdlacmorgcitationcfmid=31546903154759

[53] John L Henning 2006 SPEC CPU2006 Benchmark DescriptionsSIGARCH Comput Archit News 34 4 (Sept 2006) 1ndash17 httpsdoiorg10114511867361186737

[54] Vasileios Karakostas Osman S Unsal Mario Nemirovsky AdriaacutenCristal and Michael M Swift 2014 Performance analysis of thememory management unit under scale-out workloads 2014 IEEE In-ternational Symposium on Workload Characterization (IISWC) (2014)1ndash12

[55] Youngjin Kwon Hangchen Yu Simon Peter Christopher J Rossbachand Emmett Witchel 2016 Coordinated and Efficient Huge PageManagement with Ingens In Proceedings of the 12th USENIX Con-ference on Operating Systems Design and Implementation (OSDIrsquo16)USENIX Association Berkeley CA USA 705ndash721 httpdlacmorgcitationcfmid=30268773026931

[56] Joshua Magee and Apan Qasem. 2009. A Case for Compiler-driven Superpage Allocation. In Proceedings of the 47th Annual Southeast Regional Conference (ACM-SE 47). ACM, New York, NY, USA, Article 82, 4 pages. https://doi.org/10.1145/1566445.1566553

[57] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. 2002. Practical, Transparent Operating System Support for Superpages. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 89–104. https://doi.org/10.1145/844128.844138

[58] Ashish Panwar, Naman Patel, and K. Gopinath. 2016. A Case for Protecting Huge Pages from the Kernel. In Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys '16). ACM, New York, NY, USA, Article 15, 8 pages. https://doi.org/10.1145/2967360.2967371

[59] Ashish Panwar, Aravinda Prasad, and K. Gopinath. 2018. Making Huge Pages Actually Useful. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 679–692. https://doi.org/10.1145/3173162.3173203

[60] Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel H. Loh. 2014. Increasing TLB reach by exploiting clustering in page translations. 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA) (2014), 558–567.

[61] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced Large-Reach TLBs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 258–269. https://doi.org/10.1109/MICRO.2012.32

[62] Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee. 2015. Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways? In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 1–12. https://doi.org/10.1145/2830772.2830773

[63] Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. 2017. Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 469–480. https://doi.org/10.1145/3079856.3080210

[64] Indira Subramanian, Clifford Mather, Kurt Peterson, and Balakrishna Raghunath. 1998. Implementation of Multiple Pagesize Support in HP-UX. In USENIX Annual Technical Conference, 105–119.

[65] John R. Tramm, Andrew R. Siegel, Tanzima Islam, and Martin Schulz. [n. d.]. XSBench - The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis. In PHYSOR 2014 - The Role of Reactor Physics toward a Sustainable Future. Kyoto.

[66] Koji Ueno and Toyotaro Suzumura. 2012. Highly Scalable Graph Search for the Graph500 Benchmark. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC '12). ACM, New York, NY, USA, 149–160. https://doi.org/10.1145/2287076.2287104

[67] Carl A. Waldspurger. 2002. Memory Resource Management in VMware ESX Server. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 181–194. https://doi.org/10.1145/844128.844146

[68] Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards Practical Page Coloring-based Multicore Cache Management. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys '09). ACM, New York, NY, USA, 89–102. https://doi.org/10.1145/1519065.1519076
