Proceedings of

Third International Workshop on

Data Management on New Hardware (DaMoN 2007)

Anastasia Ailamaki, Qiong Luo (Editors)

Second International Workshop on

Performance and Evaluation of Data Management Systems (ExpDB 2007)

Philippe Bonnet, Stefan Manegold (Editors)

Sponsored by

June 15, 2007, Beijing International Convention Center (BICC), Beijing, China


Contents

Program ............................................................................. iii

DaMoN Foreword ...................................................................... v

ExpDB Foreword ...................................................................... vii

Invited Talks

How do DBMS take advantage of future computer systems? (DaMoN) ...................... ix
    Honesty Young (IBM China Research Lab)

From Moore to Metcalf - The Network as the Next Database Platform (ExpDB) ........... ix
    Michael J. Franklin (University of California, Berkeley)

Multi-core, Multi-threading, and Deep Memory Hierarchies

Pipelined Hash-Join on Multithreaded Architectures (DaMoN) .......................... 1
    Philip Garcia (University of Wisconsin - Madison), Henry Korth (Lehigh University)

Parallel Buffers for Chip Multiprocessors (DaMoN) ................................... 9
    John Cieslewicz (Columbia University), Ken Ross (Columbia University), Ioannis Giannakakis (Columbia University)

A General Framework for Improving Query Processing Performance on Multi-Level Memory Hierarchies (DaMoN) ... 19
    Bingsheng He (HKUST), Yinan Li (Peking University), Qiong Luo (HKUST), Dongqing Yang (Peking University)

Query Processing on Unconventional Processors

Vectorized Data Processing on the Cell Broadband Engine (DaMoN) ..................... 29
    Sándor Héman (CWI), Niels Nes (CWI), Marcin Zukowski (CWI), Peter Boncz (CWI)

In-Memory Grid Files on Graphics Processors (DaMoN) ................................. 35
    Ke Yang (HKUST), Bingsheng He (HKUST), Rui Fang (HKUST), Mian Lu (HKUST), Naga Govindaraju (Microsoft Corporation), Qiong Luo (HKUST), Pedro Sander (HKUST), Jiaoying Shi (Zhejiang University)

Trends and Workload Characterization

The five-minute rule twenty years later, and how flash memory changes the rules (DaMoN) ... 43
    Goetz Graefe (HP Labs)

Architectural Characterization of XQuery Workloads on Modern Processors (ExpDB) ..... 53
    Rubao Lee (ICT, Chinese Academy of Sciences), Bihui Duan (ICT, Chinese Academy of Sciences), Taoying Liu (ICT, Chinese Academy of Sciences)


Program

This year, DaMoN and ExpDB audiences are united into a joint fun-filled day with several excellent technical talks and two very interesting keynotes.

8:30 - 8:45    Registration
8:45 - 9:00    Welcome & Opening remarks
9:00 - 10:00   DaMoN Invited talk by Honesty Young (IBM China Research Lab) on "How do DBMS take advantage of future computer systems?"
10:00 - 10:30  Coffee break
10:30 - 12:00  Session 1 (DaMoN): Multi-core, Multi-threading, and Deep Memory Hierarchies
12:00 - 1:30   Lunch break
1:30 - 2:30    Session 2 (DaMoN): Query Processing on Unconventional Processors
2:30 - 2:45    Short break
2:45 - 3:45    Session 3 (DaMoN/ExpDB): Trends and Workload Characterization
3:45 - 4:00    Short break
4:00 - 5:00    ExpDB Invited talk by Michael J. Franklin (University of California, Berkeley) on "From Moore to Metcalf - The Network as the Next Database Platform"
5:00 - 5:30    Reflections and Feedback


DaMoN Foreword

The DaMoN workshop takes place for the third time, in cooperation with the ACM SIGMOD/PODS 2007 conference in Beijing, China. The second DaMoN workshop took place in cooperation with the ACM SIGMOD/PODS 2006 conference in Chicago, Illinois, USA. The first DaMoN workshop took place in cooperation with the ACM SIGMOD/PODS 2005 conference in Baltimore, Maryland, USA.

Objective

The aim of this one-day workshop is to bring together researchers who are interested in optimizing database performance on modern computing infrastructure by designing new data management techniques and tools.

Topics of Interest

The continued evolution of computing hardware and infrastructure imposes new challenges and bottlenecks on program performance. As a result, traditional database architectures that focus solely on I/O optimization increasingly fail to utilize hardware resources efficiently. CPUs with superscalar out-of-order execution, simultaneous multi-threading, multi-level memory hierarchies, and future storage hardware (such as MEMS) impose a great challenge to optimizing database performance. Consequently, exploiting the characteristics of modern hardware has become an important topic of database systems research.

The goal is to make database systems adapt automatically to the sophisticated hardware characteristics, thus maximizing performance transparently to applications. To achieve this goal, the data management community needs interdisciplinary collaboration with computer architecture, compiler, and operating systems researchers. This involves rethinking traditional data structures, query processing algorithms, and database software architectures to adapt to the advances in the underlying hardware infrastructure.

Workshop Co-Chairs
Anastasia Ailamaki (Carnegie Mellon University, [email protected])
Qiong Luo (Hong Kong University of Science and Technology, [email protected])

Program Committee
Christiana Amza (University of Toronto)
Peter Boncz (CWI Amsterdam)
Philippe Bonnet (University of Copenhagen)
Shimin Chen (Intel Research)
Bettina Kemme (McGill University)
Jun Rao (IBM)
Ken Ross (Columbia University)
Jingren Zhou (Microsoft)

Anastasia Ailamaki
Qiong Luo


ExpDB Foreword

The ExpDB workshop takes place for the second time, in cooperation with the ACM SIGMOD/PODS 2007 conference in Beijing, China. The first ExpDB workshop took place in cooperation with the ACM SIGMOD/PODS 2006 conference in Chicago, Illinois, USA.

Objective

The first goal of this workshop is to present insights gained from experimental results in the area of data management systems. The second goal is to promote the scientific validation of experimental results in the database community and to facilitate the emergence of an accepted methodology for gathering, reporting, and sharing performance measures in the data management community.

Current conferences and journals do not encourage the submission of mostly (or purely) experimental results. It is often difficult or impossible to reproduce the experimental results being published, either because the source code of research prototypes is not made available or because the experimental framework is under-documented. Most performance studies have limited depth because of space limitations, and their validity is limited in time because assumptions made in the experimental framework become obsolete.

Topics of Interest

ExpDB is meant as a forum for presenting quantitative evaluations of various data management techniques and systems. We invite the submission of original results from researchers, practitioners, and developers. Of particular interest are:

• performance comparisons between competing techniques,

• studies revisiting published results,

• unexpected performance results on rare but interesting cases,

• scalability experiments,

• contributions quantifying the performance of deployed applications of data management systems.

Workshop Co-Chairs
Philippe Bonnet (University of Copenhagen, Denmark, [email protected])
Stefan Manegold (CWI Amsterdam, The Netherlands, [email protected])

Program Committee
Gustavo Alonso (ETH Zurich, Switzerland)
Mehmet Altinel (IBM Almaden Research Center, USA)
Laurent Amsaleg (IRISA, France)
David DeWitt (University of Wisconsin, Madison, USA)
Stavros Harizopoulos (MIT, USA)
Björn Þór Jónsson (Reykjavik University, Iceland)
Carl-Christian Kanne (Universität Mannheim, Germany)
Paul Larson (Microsoft, USA)
Ioana Manolescu (INRIA Futurs, France)
Matthias Nicola (IBM Silicon Valley Lab., USA)
Raghunath Othayoth Nambiar (Hewlett-Packard, USA)
Meikel Poess (Oracle, USA)
Kian Lee Tan (NUS, Singapore)
Jens Teubner (Technische Universität München, Germany)
Anthony Tomasic (CMU, USA)
Jingren Zhou (Microsoft, USA)

Philippe Bonnet
Stefan Manegold


Invited Talks

How do DBMS take advantage of future computer systems? (DaMoN)

Speaker Honesty Young (IBM China Research Lab)

Abstract Historically, CMOS scaling has automatically provided a certain level of performance enhancement. However, that "free" performance enhancement from device scaling will come to an end, even though CMOS scaling will continue for several more generations. Multi-core has been one architectural feature used to improve chip-level performance. Partially because of the power dissipation limit, each core of a multi-core chip becomes simpler and smaller and offers weaker single-thread performance. In this talk, we will explain how to avoid potential performance bottlenecks when running typical DBMS software on a massive multi-core chip. For a high-end transaction system, the main memory cost is easily several times the CPU cost; the storage cost is even higher than the main memory cost. We will examine how potential future memory technologies (such as phase-change memory) may impact computer system architecture. A new class of high-volume transaction systems is emerging. Each transaction is relatively simple, but the potential revenue for each transaction may be very low. Thus, the transaction systems designed for banking-like applications may not be suitable for this new type of application. We will describe the problem and encourage researchers and practitioners to come up with cost-effective solutions.

Biography Dr. Honesty Young earned his Ph.D. in Computer Science from the University of Wisconsin-Madison. Currently he is the Deputy Director and the CTO of IBM China Research Lab. He helped build the first parallel database prototype inside IBM. He led an effort that achieved leadership TPC database benchmark results. He has initiated and managed projects in storage appliances and controllers. He spent a year at IBM Research Division Headquarters as a member of the technical staff. Dr. Young has published more than 40 journal and conference papers, including one best paper and one invited paper. He was the Industrial Program Chair of the Parallel and Distributed Information Systems (PDIS) conference, taught two tutorials at key conferences, and served on the program committees of eight conferences. He is an IBM Master Inventor.

From Moore to Metcalf - The Network as the Next Database Platform (ExpDB)

Speaker Michael J. Franklin (University of California, Berkeley)

Abstract Database systems architecture has traditionally been driven by Moore's Law and Shugart's Law, which dictate the continued exponential improvement of both processing and storage. In an increasingly interconnected world, however, Metcalf's Law is what will drive the need for database systems innovation going forward. Metcalf's Law states that the value of a network grows with the square of the number of participants, meaning that networked applications will become increasingly ubiquitous. Stream query processing is one emerging approach that enables database technology to be better integrated into the fabric of network-intensive environments. For many applications, this technology can provide orders-of-magnitude performance improvements over traditional database systems, while retaining the benefits of SQL-based application development. Increasingly, stream processing has been moving from the research lab into the real world. In this talk, I'll survey the state of the art in stream query processing and related technologies, discuss some of the implications for database system architectures, and provide my views on the future role of this technology from both a research and a commercial perspective.

Biography Michael Franklin is a Professor of Computer Science at the University of California, Berkeley and is a Co-Founder and CTO of Amalgamated Insight, Inc., a technology start-up in Foster City, CA. At Berkeley his research focuses on the architecture and performance of distributed data management and information systems. His recent projects cover the areas of wireless sensor networks, XML message brokers, data stream processing, scientific grid computing, and data management for the digital home. He worked for several years as a database systems developer prior to attending graduate school at the University of Wisconsin, Madison, where he received his Ph.D. in 1993. He was program committee chair of the 2005 ICDE conference and the 2002 ACM SIGMOD conference, and has served on the editorial boards of the ACM Transactions on Database Systems, ACM Computing Surveys, and the VLDB Journal. He is a Fellow of the Association for Computing Machinery, a recipient of the National Science Foundation CAREER Award, and a recipient of the ACM SIGMOD "Test of Time" award.


Pipelined Hash-Join on Multithreaded Architectures

Philip Garcia
University of Wisconsin-Madison

Madison, WI 53706 [email protected]

Henry F. Korth
Lehigh University

Bethlehem, PA 18015 [email protected]

ABSTRACT
Multi-core and multithreaded processors present both opportunities and challenges in the design of database query processing algorithms. Previous work has shown the potential for performance gains, but also that, in adverse circumstances, multithreading can actually reduce performance. This paper examines the performance of a pipeline of hash-join operations when executing on multithreaded and multi-core processors. We examine the optimal number of threads to execute and the partitioning of the workload across those threads. We then describe a buffer-management scheme that minimizes cache conflicts among the threads. Additionally, we compare the performance of full materialization of the output at each stage in the pipeline versus passing pointers between stages.

1. INTRODUCTION
Recently, multi-core and multithreaded processors have reached the mainstream market. Unfortunately, software designs must be restructured to exploit the new architectures fully. Doing so presents both opportunities and challenges in the design of query-processing algorithms. In this paper, we describe some of the challenges presented to database system designers by modern computer architectures. We then propose parallelization techniques that speed up individual database operations and improve overall throughput, while avoiding some of the problems, such as those described in [18], that can limit performance gains on multithreaded processors.

This study builds on the work in [9, 7, 24], but instead of focusing solely on optimizing a single join operation, we examine a pipeline of join operations on uniform heterogeneous multithreaded (UHM) processors, an architectural model that we describe in Section 2.1. The techniques we develop and evaluate are applicable beyond join, and relate to other data-intensive operations. By accounting for the heterogeneous threading model of modern processors and the efficient sharing of data offered by them, we develop query processing algorithms that are more efficient and allow for more accurate runtime estimates, which can then be used by query optimizers.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Proceedings of the Third International Workshop on Data Management on New Hardware (DaMoN 2007), June 15, 2007, Beijing, China.
Copyright 2007 ACM 978-1-59593-772-8 ...$5.00.

In this paper, we make the following observations:

• Assigning threads to specific "processor thread slots" allows for high performance and throughput.

• Single-die UHM architectures can share data among threads more efficiently than SMP architectures.

• Writing pointers to a buffer instead of writing the full tuple does not save as much work as previously thought.

• Hardware and software prefetching can result in large performance gains within query pipelines.

• Properly scheduling threads on an SMT processor can significantly improve query pipeline runtimes.

• To exploit a multithreaded processor fully, a query pipeline should generate more threads than the architecture can execute concurrently.

• A large memory bandwidth is required to keep all of the processing units busy in multi-core systems.

In Section 2, we describe the changes in computer architectures that motivate this work. Then, we discuss the implications of these new architectures on database systems and describe the specific database query-processing issues on which we focus. In Section 4, we propose a threading model to help take advantage of these processors, and finally, in Section 5, we discuss the results of our study and speculate how this model will perform on future UHM processors.

2. PROCESSOR ARCHITECTURE
Computer architectures are continuously evolving to take advantage of the rapidly increasing number of transistors that can fit on a single processor die. These new architectures include larger caches, increased memory and cache latencies (in terms of CPU cycles), the ability to execute multiple threads on the same core simultaneously, and the packaging of multiple cores (processors) on the same die. These new features interact in complex ways that make traditional simulations difficult. We have therefore chosen to run our tests on real hardware. This provides a more realistic view of both the processor and the main-memory subsystem.


                   P4 Prescott    Xeon Northwood    Core Duo
Number of cores    1              2                 2
Clock speed        2.8 GHz        3 GHz             2 GHz
FSB speed          800 MHz        533 MHz           667 MHz
L1 size            16 KB          8 KB              32 KB
L2 size            1 MB           512 KB            2 MB (shared)
L3 size            -              1 MB              -

Table 1: Details of the processors used

We ran our tests on a dual 3.0 GHz Xeon Northwood processor, a 2.0 GHz Core Duo (Yonah) processor, and a 2.8 GHz Pentium 4 Prescott, as shown in Table 1. All of the machines ran Debian GNU/Linux with kernel version 2.6. We focused on the results obtained on the Pentium 4 processor and, unless otherwise noted, all results given are for it.

In this section, we discuss some of the details of multithreaded architectures and their impact on database query processing.

2.1 Multithreaded Architectures
Multithreaded processor architectures are being designed not only to enable the highest performance per unit die area, but also to obtain the highest performance per watt of power consumed [6, 5, 19, 3]. To achieve these goals, computer architects are no longer focusing on increasing instruction-level parallelism and clock frequencies, and instead are designing new architectures that can exploit thread-level parallelism (TLP). These architectures manifest themselves in two ways: chip multiprocessors (CMP) and multithreaded processors. CMP systems are a logical extension of SMP systems, but with the multiple cores integrated on a single processor die. However, many CMP systems differ from traditional SMP systems in that the cores share one or more levels of cache. Multithreaded processors, on the other hand, allow the system to execute multiple threads simultaneously on the same processor core. One of the more popular forms of multithreading is simultaneous multithreading (SMT); however, other methods are possible [8, 23, 16, 22].

Many of these new multithreaded and CMP processors belong to a class of processors called uniform heterogeneous multithreaded (UHM) processors [21]. This class of architectures allows multiple threads (of the same instruction set) to share limited resources in order to maximize utilization. In this model, not all hardware-thread contexts are equivalent, and the behavior of one thread can adversely affect the behavior of another. This effect is generally due to shared caches, but it could also be caused by poor instruction mixes. UHM architectures should not be confused with heterogeneous multiprocessors in which the processor units themselves vary significantly or have differing instruction sets, such as a graphics coprocessor (footnote 1).

Multithreaded processors have become the standard for high-performance microcomputing. The major vendors of high-performance processors are currently focusing on dual and multi-core designs [2, 1, 14, 3], and many are shipping processors that use multithreading and/or SMT technology [16, 14, 3] to accelerate their processors.

Today's high-end database servers often contain 2-16 processors that are each capable of executing two threads. Within the next few years, it is likely that a single microprocessor will contain many more cores that are each capable of executing multiple threads (using fine-grained multithreading or SMT) [6]. Many of these architectures (such as the Sun Niagara processor [3]) will implement multiple simple cores that sacrifice single-thread performance but yield substantially more throughput per watt and/or die area [6, 5, 3].

Footnote 1: See [11] for an example of database processing on a graphics co-processor.

2.2 Impact on Database System Design
The architectural changes that we have discussed force a re-examination of database system design. Concurrent database transactions generate inter-query parallelism, but that increased parallelism can result in cache contention when threads or cores share one or more levels of the processor's cache. This puts a higher premium on intra-query parallelism (see, e.g., [12]), which current database systems do not exploit to the same degree as inter-query parallelism.

The rapidly expanding number of concurrently executing threads in a UHM architecture [21], combined with increasing memory latency (in terms of cycles), means database systems must be capable of executing an increasing number of threads at once to keep up with the growing thread-level parallelism offered by modern computer architectures.

We propose a threading model that breaks down a query into not just a series of pipeline operations (where each stage executes a thread), but into a series of operations that themselves can be broken down and executed by multiple threads. This allows the system to choose a level of threading that is appropriate for both the workload presented to it and the architectural features of the machine on which it is running. Additionally, on UHM systems, the system can choose the thread context on which to schedule a thread, in order to make the greatest use of the resources available at the time.
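The paper does not show code for placing threads onto particular hardware contexts. As a rough illustration only, the sketch below pins pthreads workers to chosen logical CPUs on Linux using the GNU extension pthread_setaffinity_np; the slot numbers and the worker function are hypothetical, and the mapping from logical CPU ids to cores and SMT contexts must be taken from the actual platform.

/* Illustrative sketch only (not the authors' code): binding worker threads
 * to specific hardware thread contexts on Linux with pthread_setaffinity_np. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void *worker(void *arg) {
    (void)arg;
    /* ... execute one pipeline operator's share of the work ... */
    return NULL;
}

/* Create a worker and restrict it to logical CPU `slot`. */
static int spawn_on_slot(pthread_t *tid, int slot) {
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(slot, &cpus);
    if (pthread_create(tid, NULL, worker, NULL) != 0)
        return -1;
    return pthread_setaffinity_np(*tid, sizeof(cpu_set_t), &cpus);
}

int main(void) {
    pthread_t t0, t1;
    /* Hypothetical placement: two threads of the same operator on the two
     * SMT contexts of one core (often logical CPUs 0 and 1 on Intel SMT). */
    spawn_on_slot(&t0, 0);
    spawn_on_slot(&t1, 1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}

Binding each worker to a known context is what lets a scheduler reason about which threads share a core's cache and execution resources.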

While much work has been done on optimizing query pipelines, much of this work has focused on either uniprocessor or SMP systems that assume a homogeneous threading model. New designs with UHM processors must first decide on which physical processor to execute the thread, and separately decide both on which core within the processor, and on which thread within the core, to run. New schedulers must take into account how many threads are currently executing on the core, as well as what each thread on the core is doing. Much of the work on query pipeline optimization has also not taken into account the effects of using software prefetch instructions within the pipeline to improve performance further, with exceptions being [7, 9].

In this study, we examine intra-query parallelism within multiple hash-join operations. By breaking down each join into parallelizable threads, we have shown that both response time and throughput can be improved.

2.3 Prior Work
The work we describe here differs from earlier work [9, 7, 24] in several significant ways. In earlier work, software prefetching was examined in a single-threaded simulation [7], and was later extended to run on real machines [24, 9]. The prior work of Zhou et al. [24] examined a single hash-join operation on an SMT processor; however, this work was done on the Northwood variant of the Pentium 4, which doesn't fully support software prefetching, so a form of preloading data was used instead of prefetching. [9] further built upon the model in [7, 24] and was designed such that multiple threads could perform a single hash join. That work, however, did not consider a pipeline of operations, and additionally required an initial partitioning that can result in suboptimal performance.



In this paper, we consider a larger problem domain (pipelines) and a richer processing model aimed at UHM processors. This work differentiates itself by studying not the algorithms involved, but rather the impact of architecture on the end result. Through executing an example database pipeline, we can observe the interaction of program structures with the system architecture. By doing this we gain valuable insight into how to best design query pipeline execution strategies, and how to best choose an appropriate platform for query processing systems.

3. PROBLEM DESCRIPTION
We chose to examine a pipeline of two joins; however, our algorithm can easily be extended to support more general n-way joins. For this study, we examine the performance of the query pipeline when running on various computer architectures. We also examine the performance of our threading model as a function of the number, size, and type of data stored in the buffers used to share data among the threads.

An important consideration in query-pipeline processing is the buffer size used and the number of buffers that are allocated to facilitate inter-process communication. We show that the buffer size has a major effect on overall algorithm performance, as do prefetching attempts (made by both hardware and software).

Another important consideration is the issue of whether or not to materialize pointers. This becomes doubly important in a query pipeline consisting of operations O1, O2, ..., Om because the data must be brought into cache for the first join (operation Oi) and are possibly reused in the next join (operation Oi+j) (footnote 2). Because of this, materializing the output requires memory to store both the input relation and the output relation. This results in a larger overall cache footprint (footnote 3), although there is no deterministic way to tell how much larger this is on current computer architectures (due to streaming prefetch buffers, memory access patterns, prefetch instructions, etc.). Recent research [9] has also shown that the time required to copy small amounts of data (<100 bytes) that is already loaded in cache can be prohibitively costly, and should therefore be avoided when possible. We examine the cost of materializing the relation fully at every stage of the pipeline in the Appendix.
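To make the materialization-versus-pointers trade-off concrete, here is a schematic sketch of the two buffer-entry layouts being compared; the struct and constant names are invented for illustration and are not taken from the authors' implementation.

/* Schematic sketch (invented names, not the paper's code) of the two
 * inter-operator buffer layouts: full tuple materialization versus
 * pointer passing. */
#include <stddef.h>
#include <string.h>

#define TUPLE_SIZE 100           /* bytes per output tuple (illustrative) */
#define ENTRIES_PER_BUFFER 60    /* tuples or pointers per buffer (illustrative) */

typedef struct {                 /* full materialization: each result tuple */
    size_t count;                /* is copied into the buffer               */
    char tuples[ENTRIES_PER_BUFFER][TUPLE_SIZE];
} mat_buffer_t;

typedef struct {                 /* pointer passing: only the address of    */
    size_t count;                /* each result tuple is stored             */
    const void *tuple_ptrs[ENTRIES_PER_BUFFER];
} ptr_buffer_t;

/* Producer-side append for each scheme; returns 0 when the buffer is full. */
static int mat_append(mat_buffer_t *b, const void *tuple) {
    if (b->count == ENTRIES_PER_BUFFER) return 0;
    memcpy(b->tuples[b->count++], tuple, TUPLE_SIZE);   /* copy the tuple */
    return 1;
}

static int ptr_append(ptr_buffer_t *b, const void *tuple) {
    if (b->count == ENTRIES_PER_BUFFER) return 0;
    b->tuple_ptrs[b->count++] = tuple;                  /* store only a pointer */
    return 1;
}

With full materialization every result tuple is copied into the buffer, enlarging the cache footprint; with pointer passing only an address is stored, but the consumer may later chase pointers to tuples that have already left the cache.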

Our hash-join algorithm is modified from the Grace algorithm [15]. Our algorithm was designed under the assumption that the system performing the join has sufficient free main memory to hold the entire set of input relations, temporary structures (such as hash tables), as well as output relations. This execution model has been shown to be valid for systems with sufficient main memory and sufficient disk I/O performance [4, 7]. By doing an in-memory join, we are able to focus our analysis on the effects that both the main-memory/cache hierarchy and UHM processors have on query-pipeline performance. Disk accesses would not only distort those results, but make those results less applicable to modern systems with large main memories.

Our system implements the form of software pipelining described in [7]. We chose to focus on software pipelining, as it was shown to outperform both group prefetching and cache-sized partitioning [9, 7, 20].

Footnote 2: For our tests, i = 1 and j = 1.
Footnote 3: This is assuming, of course, that the size of each tuple in the output relation is greater than the size of a pointer.

[Figure 1 shows a query tree: O1 is an index join of relations A and B on A.name = B.name; O2 is a hash join of O1's output with relation C on B.bkey = C.bkey; O3 projects A.a, B.b, and C.c.]

Figure 1: Example pipeline where O1 is an index join, O2 is a hash join, and O3 is a projection.

Using the software-prefetch optimized code results in faster runtimes; however, on multithreaded processors (running multithreaded algorithms), it has been shown that the speedup from multithreading is smaller when using the prefetch-optimized code, because there are fewer stall cycles to overlap with execution [9]. Software prefetching still results in the best overall performance (even on multithreaded architectures), and it is therefore important that our measurements use this algorithm rather than the standard hash-join algorithm, as the latter would overestimate the performance benefits of multithreading.
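The software-pipelined prefetching scheme of [7] is more elaborate than can be shown here; the following simplified sketch only illustrates the underlying idea of issuing a prefetch for a bucket several iterations before it is probed, using GCC's __builtin_prefetch. The toy hash function, single-slot table layout, and prefetch distance are illustrative assumptions, not the paper's algorithm.

/* Simplified illustration of software prefetching in a hash-join probe loop:
 * the bucket needed a few iterations from now is prefetched while the
 * current probe is processed, overlapping memory latency with useful work. */
#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t key; const void *payload; } bucket_t;

#define PREFETCH_DISTANCE 8                 /* how far ahead to prefetch; tuned empirically */

static uint32_t hash_key(uint32_t k) {      /* toy multiplicative hash */
    return k * 2654435761u;
}

/* Count matches of probe_keys against a table of (table_mask + 1) buckets. */
size_t probe(const bucket_t *table, size_t table_mask,
             const uint32_t *probe_keys, size_t n) {
    size_t matches = 0;
    for (size_t i = 0; i < n; i++) {
        /* Prefetch the bucket that iteration i + PREFETCH_DISTANCE will touch. */
        if (i + PREFETCH_DISTANCE < n) {
            size_t future = hash_key(probe_keys[i + PREFETCH_DISTANCE]) & table_mask;
            __builtin_prefetch(&table[future], 0 /* read */, 1 /* low locality */);
        }
        /* Process the current probe; its bucket was prefetched earlier. */
        size_t slot = hash_key(probe_keys[i]) & table_mask;
        if (table[slot].key == probe_keys[i])
            matches++;                      /* a real join would emit an output tuple here */
    }
    return matches;
}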

Our algorithm differs from prior work [7, 9, 24] in that we do not first partition the relations. Our previous results have shown that the size of the partition does not affect the throughput of the probe phase of the algorithm when prefetching is used [9]. Because the data are no longer partitioned, we must use a different method of breaking up the workload among multiple threads than that in [9]. We modified the system to use a series of buffers for both input and output so that multiple threads can cooperate to execute a single join concurrently.

4. THREADS AND BUFFERS
Our threading model is based on using both control parallelization (through the pipelining of the query operation) and single-program multiple-data (SPMD) parallelism [13]. This allows our model to allocate multiple threads for each operation in the query pipeline.

4.1 Threading Considerations
We use a buffer-management scheme to allocate data from the input relations to the various threads and also to allow forwarding output data to operations further in the pipeline. We found that the number of buffers used and their size can affect system performance. Using buffers with fewer entries allows the working set to be smaller and better able to fit within the processor's data cache; however, this requires each thread to acquire more buffers, potentially resulting in slower performance. Conversely, using buffers with more entries means the system spends a smaller percentage of its time obtaining buffers, but the memory footprint of each buffer is larger, which could result in poor sharing of the processor's cache among the threads.


[Figure 2 shows threads of operator Oi-1 writing their output into a shared pool of buffers (Buffer 1 through Buffer N) that threads of operator Oi read as input.]

Figure 2: Example of two pipeline operations, Oi-1 and Oi, sharing a set of buffers.

Another important issue to consider is the number of threads that are allocated for each operation in the pipeline. Even when we concern ourselves only with join operations, earlier joins can often take significantly longer than later ones, depending on the selectivity of the earlier joins. This effect is coupled with the fact that pipelined workloads are not always evenly distributed.

Figure 1 shows an example pipeline in which pipeline operation O1 may generate output tuples at a varying rate, because there may be many tuples generated for common names but far fewer for the less common names (footnote 4). This would cause operation O2 to have a varying workload and to alternate between periods of idling (due to lack of input data) and busy periods where it has sufficient work to allow it to take advantage of multiple threads or processors.

Additionally, it is important that databases running on UHM processors schedule multiple threads carefully, to avoid one thread adversely affecting the performance of another. In [18], it was noted that enabling SMT on Intel's Netburst processor can be detrimental to database performance. This is often caused by the "cache thrashing" behavior of a single thread. For example, when one thread is running a large scan, it could cause a concurrent thread to experience more cache misses than if the two operations were serialized.

4.2 The Buffer-Management System
Our threading model helps to solve these issues by letting each join operation in the pipeline be handled by multiple threads, while allowing many of the threads to sleep when they are not needed. This allows the operations that need the processing resources the most to utilize them, while other operations wait until their input is ready.

We implemented a buffer system designed for unbalanced workloads. This was accomplished by waking threads up upon availability of work, and putting them to sleep when no new work is available. The buffer manager uses a producer-consumer queue that shares buffers in common. We used the pthreads library [17] for threading and inter-process communication.

The buffer manager (Figure 2) contains a finite number of buffers that it allocates to the producer and consumer threads. A buffer consists of a collection of tuples or pointers, and we choose both the number of buffers to allocate and the number of tuples or pointers that each buffer contains. Each buffer can be used by one thread at a time, regardless of whether the thread is writing to or reading from the buffer. The assignment of a buffer to a thread is made by the buffer system. This avoids any need for concurrency control (e.g., locking) while a thread is executing.

Footnote 4: While our system does not currently support the index join or projection operations used in Figure 1, our threading model could easily apply there as well.

[Figure 3 shows the test pipeline as a tree: O1 builds a hash table over relation A (50 MB); O2 is a hash join on A.bkey = B.bkey taking O1's table and relation B (100 MB); O3 builds a hash table over relation C (200 MB); O4 is a hash join on B.ckey = C.ckey combining the outputs of O2 and O3.]

Figure 3: Pipeline used for our tests, where each entry in A matched exactly two entries in B, each entry in B matched exactly two entries in C, and tuple sizes were the same for all three relations.


This buffer management system allows the system to allocate multiple threads for each operation in the pipeline (Oi and Oi-1, shown in the figure). Because each thread is constrained to execute on a single thread context, the system accounts for imbalances in workloads among operations as well as variances within a workload, allowing the processor's resources to be utilized more effectively. The system accomplishes this by creating more threads for each operation than there are execution slots available on the processor. The system executes only those threads that currently have a buffer allocated to them. Limiting the size and number of buffers prevents any particular operation Oi-1 from getting too far ahead of its dependent operation Oi. By limiting the number of buffers and their size, we can ensure that the output data produced by operation Oi-1 is still in the processor's cache when it is consumed by operation Oi. Additionally, limiting the number and size of buffers can prevent pipeline threads from running alongside concurrent threads in the system that could cause the cache to thrash.
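The buffer manager itself is not listed in the paper. The sketch below shows one conventional pthreads realization of the behavior described in this section: a fixed pool of buffers, producers sleeping when no empty buffer exists, and consumers sleeping when no filled buffer exists. All names and the list-based structure are assumptions, not the authors' code.

/* Sketch (assumed structure) of a producer-consumer buffer pool in the
 * spirit of Section 4.2: buffers circulate between a "free" list and a
 * "full" list; threads sleep on condition variables until work exists. */
#include <pthread.h>
#include <stddef.h>

#define ENTRIES_PER_BUFFER 60

typedef struct buffer {
    struct buffer *next;
    size_t count;
    void  *entries[ENTRIES_PER_BUFFER];   /* tuples or tuple pointers */
} buffer_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  free_avail, full_avail;
    buffer_t *free_list;                  /* empty buffers, ready for a producer */
    buffer_t *full_list;                  /* filled buffers, ready for a consumer */
} buffer_pool_t;

static buffer_t *pop(buffer_t **list) { buffer_t *b = *list; *list = b->next; return b; }
static void push(buffer_t **list, buffer_t *b) { b->next = *list; *list = b; }

void pool_init(buffer_pool_t *p, buffer_t *bufs, int n) {
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->free_avail, NULL);
    pthread_cond_init(&p->full_avail, NULL);
    p->free_list = p->full_list = NULL;
    for (int i = 0; i < n; i++) { bufs[i].count = 0; push(&p->free_list, &bufs[i]); }
}

/* Producer side: sleep until an empty buffer exists, then own it exclusively. */
buffer_t *acquire_free(buffer_pool_t *p) {
    pthread_mutex_lock(&p->lock);
    while (p->free_list == NULL)
        pthread_cond_wait(&p->free_avail, &p->lock);
    buffer_t *b = pop(&p->free_list);
    pthread_mutex_unlock(&p->lock);
    return b;
}

/* Producer hands a filled buffer to the consumers and wakes one of them. */
void submit_full(buffer_pool_t *p, buffer_t *b) {
    pthread_mutex_lock(&p->lock);
    push(&p->full_list, b);
    pthread_cond_signal(&p->full_avail);
    pthread_mutex_unlock(&p->lock);
}

/* Consumer side: sleep until a filled buffer exists. */
buffer_t *acquire_full(buffer_pool_t *p) {
    pthread_mutex_lock(&p->lock);
    while (p->full_list == NULL)
        pthread_cond_wait(&p->full_avail, &p->lock);
    buffer_t *b = pop(&p->full_list);
    pthread_mutex_unlock(&p->lock);
    return b;
}

/* Consumer returns a drained buffer to the free list and wakes a producer. */
void release_free(buffer_pool_t *p, buffer_t *b) {
    b->count = 0;
    pthread_mutex_lock(&p->lock);
    push(&p->free_list, b);
    pthread_cond_signal(&p->free_avail);
    pthread_mutex_unlock(&p->lock);
}

Because a buffer belongs to exactly one thread between acquire and release, the operator code that fills or drains it needs no further locking, matching the claim above that no concurrency control is needed while a thread is executing.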

5. EXPERIMENTAL RESULTS
Figure 3 shows the example query pipeline that we used for all of our tests. We used this pipeline because it is simple and has an imbalance in the workload between operations O2 and O4. This allows us to examine the effectiveness of the buffer-management system.


[Figure 4 plots total cycles (in billions, 0-8) against the number of entries per buffer (20-190), with one curve per number of buffers in use (2, 3, 5, and 7).]

Figure 4: Pentium 4 results. The numbers on the right represent how many buffers were in use.


5.1 Number and Size of Buffers
To determine the ideal number of buffers to use, we ran a series of tests using multiple buffers of varying sizes. Running a single thread for each operation in the pipeline did not fully utilize processor resources. This is because operation O2 processes approximately half as many tuples as O4. To help alleviate this problem, we allocated two threads to O4 and only a single thread to O2. By doing this, we allowed the more expensive operation to utilize the majority of the processor's resources.

Parallelizing a hash-join operation (O2 and O4 in Figure 3) is fairly straightforward because multiple threads can share the hash table (as it is read-only). The input/output buffer system described in Section 4 protects the input and output data so that only one thread can write to a buffer at a time, ensuring correctness (footnote 5). However, not all stages of the pipeline can be parallelized easily (for example, the build operations O1 and O3). The general problem of parallelizing highly dependent database operations is left for future work.
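As a sketch of how this intra-operator parallelism can be wired together (again an assumption about structure, reusing the hypothetical buffer-pool types from the earlier sketch rather than the authors' code), several worker threads can be started for one probe operator, all reading the same read-only hash table while drawing their input and output buffers from the shared pools:

#include <pthread.h>

/* Hypothetical types and helper, e.g. from the buffer-pool sketch above. */
typedef struct hash_table  hash_table_t;
typedef struct buffer_pool buffer_pool_t;
extern void probe_buffers(const hash_table_t *ht,
                          buffer_pool_t *in, buffer_pool_t *out);

typedef struct {
    const hash_table_t *ht;    /* shared and read-only after the build phase */
    buffer_pool_t *in, *out;   /* per-operator input and output buffer pools */
} probe_arg_t;

static void *probe_worker(void *p) {
    probe_arg_t *a = p;
    probe_buffers(a->ht, a->in, a->out);   /* loops until the input is exhausted */
    return NULL;
}

/* Start nthreads probe workers for one join operator, e.g. two for the more
 * expensive join and one for the cheaper one, as described above. */
void start_probe_workers(pthread_t *tids, int nthreads, probe_arg_t *arg) {
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, probe_worker, arg);
}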

Our tests were run both for the case where the intermediate buffers contain the full output tuple and for the case where they contain only pointers to the tuple. We found that passing pointers was marginally faster (but, except for very large tuples, only marginally faster) than passing the full resultant tuples. Therefore, in this section we focus on the case of passing pointers between the pipeline stages. For a more detailed comparison of the performance when using pointers versus full materialization, see the Appendix.

We found that we needed at least one more buffer than the number of readers and writers concurrently executing. Figure 4 shows that at least three buffers are needed to utilize the processor effectively. This is logical, as we executed three threads (of which only two executed simultaneously), so having only two buffers causes extra contention for shared objects between threads. Extrapolating these results to future architectures, we should have at least one more buffer than we have threads concurrently executing. These figures also show that adding further buffers seems to do little to help or hinder performance.

Footnote 5: While this ensures correctness, it is important to note that when multiple threads execute a hash-join operation, the order of the input tuples is not preserved in the output. However, this is rarely a problem, and an additional thread could be used to piece the buffers back together in order if necessary.

[Figure 5 plots speedup factor (1.0-1.4) against tuple size (30, 60, and 100 bytes), with one series for pointer passing and one for full materialization.]

Figure 5: Multithreaded speedup factors for tuple-pointers and full materialization.

[Figure 6 plots speedup factor (1.0-1.35) against tuple size (30, 60, and 100 bytes), with one series for 1p2t and one for 2p2t.]

Figure 6: Xeon results: 1p2t means two threads on one processor, and 2p2t two threads, each on its own processor.


5.2 Multithreaded Speedup
To quantify SMT's speedup on data-intensive algorithms, we ran our tests with and without SMT support. Figure 5 shows the speedup when we ran the query pipeline with both threads enabled versus with only one thread enabled. This graph also illustrates SMT's greater benefit when copying the larger amount of data needed when fully materializing the output, rather than passing only pointers.

These numbers are similar to the speedups seen in [9] during the probe phase of the software-pipelining optimized hash join. However, our results achieve this speedup across the entire hash-join operation (including the build phase), and additionally account for a much finer-grained level of parallelism than that used in [9].

Performance was poor on the SMP/SMT Xeon. This is due partially to the Xeon's slower memory subsystem, but more importantly it is due to the Xeon's inability to properly handle prefetch instructions, as explained in [9].


[Figure 7 plots speedup factor (1.0-1.6) against tuple size (30, 60, and 100 bytes) for the 2p2t configuration.]

Figure 7: Multithreaded speedup factors obtained on the Core Duo processor (two processors, one thread per processor).

Architecture       30      60      100
Pentium 4 1p1t     6.32    3.17    2.12
Pentium 4 1p2t     4.98    2.56    1.73
Xeon 1p1t          11.95   6.13    4.17
Xeon 1p2t          9.71    5.06    3.38
Xeon 2p2t          8.98    4.75    3.22
Core Duo 1p1t      4.98    2.57    1.68
Core Duo 2p2t      3.33    1.75    1.22

Table 2: Wall-clock runtimes (in seconds) for tuple sizes of 30, 60, and 100 bytes. 1p2t means two threads on one processor, and 2p2t two threads, each on its own processor.

The comparison between these two architectures illustrates how much of an effect minor architectural changes can have on overall system performance. Figure 6 shows the speedups obtained when splitting the threads up among the different contexts available on this system. It is important to note, when comparing these results to the other architectures, that the lack of software prefetching causes excessive data-cache misses, which SMT processors can effectively overlap with processing from the additional thread [22, 8, 23, 16, 9].

A rather surprising result of these experiments was that the dual-processor algorithm was only marginally faster than the SMT algorithm, running about 10% faster when using 30-byte tuples and only 6% faster when using 100-byte tuples. One of the reasons for the Xeon's poor multiprocessor performance is its bus-based architecture, which quickly became overloaded as buffers were moved between the two processors. Performance could be improved somewhat if two separate queues were used, with each processor producing distinct output. For example, if processor P1 obtained data D1, it would perform both operation O1 and O2 on it, while P2 would obtain D2 and likewise perform the necessary operations upon it. This would prevent Di from having to be sent across the bus for further processing on a different processing node. Such considerations are not necessary when executing upon many SMT or CMP processors, and aren't as important on processors that utilize point-to-point interconnects.

Figure 7 shows the speedups obtained on the Core Duo processor when executing the example query pipeline. Table 2 reveals that this platform outperformed the others, despite having the slowest clock speed. The performance on this machine is due to the Core Duo's more efficient microarchitecture combined with its larger and faster cache. The multi-core speedup on the Core Duo was between 1.35 and 1.5, which is likely due to the limited memory bandwidth available on this mobile platform. This modest speedup also suggests that query optimizers should balance the workload among multiple physical processors to prevent any single core from being memory-bound.

6. CONCLUSION
In this paper, we have examined the impact of UHM processors on pipeline operations. Specifically, we studied the running of the hash-join algorithm in a pipelined fashion on a UHM processor. This overview of the effects of pipeline operations shows that a simple, naïve approach to threading will not yield optimal performance on new processors. The need for highly parallel code to run on multithreaded processors [3, 14, 1, 2], combined with the increasing processor/memory gap and the heterogeneous threading abilities of modern and future processors, has fueled the need for further research into query pipelines.

By examining the effect of multithreading, we have seen impressive gains in the performance of hash-join operations within queries and have shown the importance of combining SPMD techniques with traditional "unstructured" threading techniques (such as running a separate thread for each operation) when executing query pipelines. These techniques allow the system better control over the execution of algorithms and are necessary to achieve the greatest throughput on these processors.

The issues of thread allocation and buffer management are of reduced complexity in our work due to the relative simplicity of the multi-core and multithreaded architectures we tested and our focus on a relatively short pipeline of two joins. A higher number of cores, with a higher number of concurrent threads per core, opens the possibility of:

1. Devoting more threads (and buffers) to each operation (greater horizontal parallelism).

2. Deepening the pipeline to include more joins as well as other operations (greater vertical parallelism).

These considerations add to the complexity of the buffer system.

Future work is still needed to expand our threading model to support other operations (such as merge join, sort, selection, etc.) and to examine performance on other multithreaded processors [3, 14, 1, 2]. This work should also focus on more parallel architectures than the two-threaded Pentium 4, taking into account the fast inter-core communication presented by these new UHM processors. By examining the performance of such operations, we can both stress the ability of our threading model to distribute evenly the workloads of multiple pipeline operations and determine more accurate estimates for ideal buffer sizes based upon processor cache size and the number of threads that can run on a given processor.

Future work is also needed to parallelize traditionally serial algorithms, including techniques to allow multiple threads to write to linked data structures simultaneously and efficiently. As the number of threads executing the probe phase increases, the percentage of time spent waiting in these serial algorithms will become excessively large. These stalls in the pipelines will become increasingly important, as current techniques don't allow multiple threads to execute simultaneously. Finally, it will be necessary to develop good cost predictors for these parallel algorithms so that future database query optimizers have appropriate cost estimates on which to base their choice of overall execution strategy for the entire query.



7. REFERENCES
[1] Intel multi-core processor architecture development backgrounder. Intel White Paper, 2005.
[2] Multi-core processors – the next evolution in computing. AMD White Paper, 2005.
[3] Throughput computing: Changing the economics and ecology of the data center with innovative SPARC® technology. Sun Microsystems White Paper, November 2005.
[4] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a modern processor: Where does time go? In Proc. 25th Int'l Conf. on Very Large Data Bases, pages 266-277, 1999.
[5] D. Burger and J. R. Goodman. Billion-transistor architectures: There and back again. IEEE Computer, 37:22-28, Mar. 2004.
[6] D. Carmean. Data management challenges on new computer architectures. In First Int'l Workshop on Data Management on New Hardware (DaMoN), June 2005. Oral presentation.
[7] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry. Improving hash join performance through prefetching. In Proc. Int'l Conf. on Data Engineering, 2004.
[8] S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen. Simultaneous multithreading: A platform for next-generation processors. IEEE Micro, 17(5):12-18, September 1997.
[9] P. Garcia and H. F. Korth. Hash-join algorithms on modern multithreaded computer architectures. In ACM Int'l Conf. on Computing Frontiers, May 2006.
[10] P. C. Garcia. Optimizing database algorithms for modern computer architectures. Master's thesis, August 2005. http://www.cse.lehigh.edu/~pcg2/thesis.pdf.
[11] N. K. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High performance graphics co-processor sorting for large database management. In Proc. ACM SIGMOD Int'l Conf. on the Management of Data, June 2006.
[12] G. Graefe. Encapsulation of parallelism in the Volcano query processing system. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, pages 102-111, 1990.
[13] W. D. Hillis and G. L. Steele, Jr. Data parallel algorithms. Commun. ACM, 29(12):1170-1183, 1986.
[14] R. Kalla, B. Sinharoy, and J. M. Tendler. IBM Power5 chip: A dual-core multithreaded processor. 2004.
[15] M. Kitsuregawa, H. Tanaka, and T. Moto-Oka. Application of hash to data base machine and its architecture. In New Generation Computing, volume 1, pages 63-74, 1983.
[16] D. T. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, J. A. Miller, and M. Upton. Hyper-threading technology architecture and microarchitecture. Intel Technology Journal, (Q1):4-15, 2002.
[17] F. Mueller. Pthreads library interface, 1993.
[18] S. Oks. Be aware: To hyper or not to hyper. Slava Oks' weblog, http://blogs.msdn.com/slavao/archive/2005/11/12/492119.aspx, Nov. 2005.
[19] P. S. Otellini. Multi-core enables performance without power penalties. Intel Developer Forum keynote, http://www.embedded-controleurope.com/pdf/ecedec05p26.pdf, 2005.
[20] A. Shatdal, C. Kant, and J. Naughton. Cache conscious algorithms for relational query processing. In Proceedings of 20th Int'l Conf. on Very Large Data Bases, pages 510-524, 1994.
[21] D. Towner and D. May. The 'uniform heterogeneous multi-threaded' processor architecture. In A. Chalmers, M. Mirmehdi, and H. Muller, editors, Communicating Process Architectures – 2001, pages 103-116. IOS Press, September 2001.
[22] D. M. Tullsen, S. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. In Proc. ACM/IEEE Int'l Symposium on Computer Architecture, pages 191-202, 1996.
[23] D. M. Tullsen, S. Eggers, and H. M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd Annual Int'l Symposium on Computer Architecture, June 1995.
[24] J. Zhou, J. Cieslewicz, K. A. Ross, and M. Shah. Improving database performance on simultaneous multithreading processors. In VLDB '05: Proceedings of the 31st Int'l Conf. on Very Large Data Bases, pages 49-60, 2005.

APPENDIX

Pointers Versus Tuple Materialization
We discuss here in more detail our comparison between passing full output tuples between pipeline stages versus passing only pointers to those tuples.

Our results showed that the performance of the system was similar when we passed pointers in the pipeline and when we used full materialization [10]. This counters the previous belief that using pointers, rather than copying the full tuples to a new buffer, saves a significant amount of time that would otherwise be spent doing useful work. There are several reasons for this somewhat surprising result:

• Operation O2 in Figure 3 (the only one copying data to the buffers) operates on half as many tuples as O4; therefore the maximum possible speedup (where the time to run O2 is zero) is 33%. When more operations utilize the buffers, the speedup may be greater.

• The time spent processing a single tuple in O2 (discounting main-memory latency, which is hidden by the software prefetching and split evenly across the two processors) is less than the time it takes to process a tuple in O4, because O4 must copy data from three input relations.


[Figure 8 plots total cycles (in billions, 0-9) against the number of entries per buffer (20-190), with one curve per number of buffers in use (2, 3, 5, and 7).]

Figure 8: Cycles to run the multithreaded join pipeline on the Pentium 4 when fully materializing the intermediate relation. The numbers on the right represent how many buffers were in use by the system.

• When the buffers fully materialize the data, the data are read back "in order" during operation O4, resulting in superior cache performance and eliminating excessive pointer chasing.

• Due to the latency-hiding nature of software-prefetch instructions, when pointers are used to pass values, data stalls are likely to occur, as there is less useful work available in the rest of the hash-join algorithm to hide all of the cache-miss latency with prefetching.

• Hyperthreading allows multiple in-cache memory copy operations to occur simultaneously, hiding some of the extra time required to materialize the full tuple. Therefore the speedup from using pointers would be greater for the single-threaded algorithm.

These reasons help explain why using pointers to pass databetween the operations does not result in as significant aspeedup as initially expected. We also compared our re-sults when using larger tuples. In Figure 9, we see that thespeedups obtained due to using pointers are much greater,approaching the theoretical maximum of 33%. Thus, forlarge tuple sizes using pointers is a much more effective wayto handle inter-process communication.
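To make this trade-off concrete, the following minimal C sketch (our own illustration with hypothetical tuple and buffer sizes, not the code used in these experiments) contrasts the two ways a pipeline stage can hand results to the next stage: copying the full tuple into the inter-stage buffer versus enqueuing only a pointer to it. The only difference between the two variants is where the per-tuple copy cost is paid, which is the effect measured in Figure 9.

/* Illustrative sketch only; tuple and buffer sizes are hypothetical. */
#include <stdint.h>
#include <string.h>

#define TUPLE_SIZE 100                 /* bytes; one of the sizes swept in Figure 9 */
#define BUF_TUPLES 1024

typedef struct { uint8_t payload[TUPLE_SIZE]; } tuple_t;

/* (a) Full materialization: the producer copies each output tuple into the
 *     inter-stage buffer, so the consumer later reads the data sequentially. */
typedef struct { tuple_t tuples[BUF_TUPLES]; int count; } mat_buffer_t;

static void emit_materialized(mat_buffer_t *b, const tuple_t *t) {
    memcpy(&b->tuples[b->count++], t, sizeof *t);   /* pays one copy per tuple */
}

/* (b) Pointer passing: only a pointer is enqueued; the consumer must chase
 *     the pointer back to wherever the tuple actually lives. */
typedef struct { const tuple_t *ptrs[BUF_TUPLES]; int count; } ptr_buffer_t;

static void emit_pointer(ptr_buffer_t *b, const tuple_t *t) {
    b->ptrs[b->count++] = t;           /* cheap to enqueue, but later reads scatter */
}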

As UHM processors become more common, it will become even more important to use pointers to pass data between pipeline stages. On future architectures, it is likely that we will have more threads running on each processor core simultaneously. Under this model, context switches will occur on data cache misses. Because of this, memory latency can be hidden better than on current systems. This will enable non-latent threads to run while another thread is stalled.6

Figure 9: Speedups obtained by using pointers (speedup factor versus tuple size, in bytes).

Thus, while our intuition about the merits of pointer passing does not hold true for our experiments, our data indicate a need to re-examine this issue in the context of future architectures.

6 While this is also true on the Pentium 4, where two threads share the CPU, the overall system throughput will increase as the number of simultaneous threads that a processor can execute increases.


Parallel Buffers for Chip Multiprocessors

John Cieslewicz∗† (Columbia University)
Kenneth A. Ross† (Columbia University)
Ioannis Giannakakis† (Columbia University)

ABSTRACT
Chip multiprocessors (CMPs) present new opportunities for improving database performance on large queries. Because CMPs often share execution, cache, or bandwidth resources among many hardware threads, implementing parallel database operators that efficiently share these resources is key to maximizing performance. A crucial aspect of this parallelism is managing concurrent, shared input and output to the parallel operators. In this paper we propose and evaluate a parallel buffer that enables intra-operator parallelism on CMPs by avoiding contention between hardware threads that need to concurrently read or write to the same buffer. The parallel buffer handles parallel input and output coordination as well as load balancing so individual operators do not need to reimplement that functionality.

1. INTRODUCTION
Modern database systems process queries by constructing query plans. A plan consists of a collection of operators connected to each other by data buffers. The output from one operator is fed as input to another operator.

Recently, microprocessor designs have shifted from fast uniprocessors that exploit instruction level parallelism to chip multiprocessors that exploit thread level parallelism. Because of power and design issues that reduce the performance improvement obtainable from faster uniprocessors, improved performance now depends on taking advantage of on-chip parallelism by writing applications with a high degree of thread level parallelism [11].

In this paper, we focus on query plans running on a chip multiprocessor. On such a machine, many concurrent threads may cooperate to collaboratively perform a single database operation [20, 6]. The advantage of using the available parallelism in this way is improved locality. Instructions and some data structures are shared, leading to good cache behavior. In contrast, treating each thread as a parallel processor that performs independent tasks can lead to cache interference [20].

∗Supported by a U.S. Department of Homeland Security Graduate Research Fellowship.
†Supported by NSF grant IIS-0534389.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Proceedings of the Third International Workshop on Data Management on New Hardware (DaMoN 2007), June 15, 2007, Beijing, China.
Copyright 2007 ACM 978-1-59593-772-8 ...$5.00.

In a collaborative design, an operator would be executed by all threads for a certain time-slice. The time-slice needs to be long enough to amortize the initial compulsory misses on the instruction and data caches, as well as the context-switch costs. A time-slice might end when either (a) a time-window expires, (b) the input is fully consumed, or (c) the output buffer becomes full.

A critical question for a system employing parallel operators is the design of buffers for passing data between operators. A naïve choice can lead to hot-spot points of contention that prevent the operator from working at its full capacity [3]. For example, if all threads write output records to a common output array, then there will be contention for the mutex on the pointer to the current position within the array.

One way to avoid output contention is to give each thread its own output array [3, 20]. While this choice avoids contention, it has other disadvantages. The next operator that consumes the data has to be aware of the partitioned nature of its input. This can lead to a relatively complex implementation for all operators, because they have to take into account run-time information such as the number of available threads, which may vary during the course of query execution. An output partition could, in the worst case, grow much faster than the others. For instance, consider a parallel range selection operator where each thread runs on a portion of the data that has been partitioned by the attribute being selected. As a result, each output thread must be pessimistic in allocating output space consistent with worst-case behavior. Such allocation can waste memory resources, particularly when there are many threads.

There is no guarantee that the separate output partitions will be balanced. Therefore, the consuming operator also becomes responsible for load balancing. If the operator does not balance the load, then it is possible for a single slow thread to cause all other threads to stall for long periods, substantially reducing the effective data parallelism. We take a closer look at load balancing in Section 5.4.

Simply put, a challenge for multithreaded database and other similar data intensive workloads is that each thread may consume different amounts of input, generate different amounts of output, and take significantly different amounts of time to run. In this paper, we propose a buffer structure that avoids these pitfalls, while still minimizing contention for a shared buffer. Our solution has the following desirable properties:

• The output and input structures are the same, so that arbitrary operators can be composed.

• The buffer is allocated from a single array, so that memory can be allocated in proportion to the expected output of all threads rather than the worst-case output.

• Data records are processed in chunks. Mutex or atomic operations are required only for whole chunks. By choosing sufficiently large chunks, contention can be minimized.

• Parallel utilization is high. In particular, no thread is stalled for longer than the smaller of the input-chunk processing time and the time to generate one output-chunk. Utilization is high even when there is an imbalance in the rate of progress made by the various threads.

• Operators use simple get_chunk and put_chunk abstractions, and do not have to re-implement locking or load-balancing functions; a usage sketch appears after this list.
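As a rough illustration of this programming model, the sketch below shows an operator (similar in spirit to the test operator of Section 5.1, copying every input tuple to the output) written against get_chunk and put_chunk. The chunk descriptor type, the exact signatures, and the constants are our assumptions for illustration; the paper specifies only the abstraction, not this API.

/* Hypothetical API sketch; only get_chunk/put_chunk are named in the text. */
#include <string.h>

#define TUPLE_SIZE   16        /* bytes (the size used in Section 5.2)        */
#define CHUNK_TUPLES 128       /* tuples per chunk (example value)            */

struct parallel_buffer;                        /* opaque buffer handle        */
typedef struct { char *base; int count; } chunk_t;

extern int get_chunk(struct parallel_buffer *in,  chunk_t *c);  /* 0 => finalized */
extern int put_chunk(struct parallel_buffer *out, chunk_t *c);  /* 0 => finalized */

/* Copy every input tuple to the output (selectivity 1.0). */
void copy_operator(struct parallel_buffer *in, struct parallel_buffer *out) {
    chunk_t ic, oc;
    int used = 0;
    if (!put_chunk(out, &oc)) return;          /* claim the first output chunk   */
    while (get_chunk(in, &ic)) {               /* claim input chunks until done  */
        for (int i = 0; i < ic.count; i++) {
            if (used == CHUNK_TUPLES) {        /* current output chunk is full   */
                if (!put_chunk(out, &oc)) return;  /* no space left: finalize    */
                used = 0;
            }
            memcpy(oc.base + used * TUPLE_SIZE,
                   ic.base + i  * TUPLE_SIZE, TUPLE_SIZE);
            used++;
        }
    }
}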

We evaluate our parallel buffer data structure on real hardware, the Sun Microsystems UltraSPARC T1. The T1 is a chip multiprocessor with eight cores and four hardware threads per core, for a total of 32 threads on one chip.

2. RELATED WORK
Parallelism in databases has been well studied; however, most research, and therefore the lessons learned, predate chip multiprocessors. DeWitt and Gray [5] and DeWitt et al. [4] advocate shared-nothing parallelism for database operations. A key part of this argument is that interference limits the performance of shared-memory systems. Modern commodity chip multiprocessors exhibit shared-memory parallelism, sharing some levels of the memory hierarchy. Therefore, managing interference between hardware threads will be paramount to achieving good query performance.

Graefe [8, 7] advocates creating intra-operator parallelism via partitioning. On a shared memory system, static partitioning makes sense when the coordination overhead between processors is high, but this coordination overhead, such as cache coherency, is lower on a chip multiprocessor because all communication is done on chip. A problem with data partitioning is that it is static and can be sensitive to skew in the data, resulting in sub-optimal load balancing. Whereas a conventional shared memory multiprocessor system avoids close coordination between threads or processes because of the high cost of coordination and data sharing, chip multiprocessors benefit from it because cooperative threads can better share on-chip resources and any coordination is done at on-chip speeds.

A recent study by Hardavellas et al. [10] explored database systems on chip multiprocessors. This work found that OLAP workloads on chips similar to the UltraSPARC T1 exhibit good throughput. In this study, parallelism in database operations was achieved by increasing the number of concurrent clients accessing the database (inter-query parallelism). The good throughput was found when the number of clients saturated the system. When few concurrent clients were connected, throughput was poor because some hardware threads were idled. This highlights a significant performance pitfall of parallel architectures: not taking advantage of the parallelism [11].

Our proposal to exploit intra-operator parallelism for OLAP aims to keep all hardware threads busy, regardless of the number of concurrent clients, thus yielding high system utilization and throughput. Additionally, by exploiting intra-operator parallelism, it is easier to manage resource sharing and thus improve performance, because all hardware threads are working on the same task. In contrast, in an inter-query parallelism model, threads sharing memory, cache, and execution resources may conflict. Understanding and managing inter-operator or inter-query interference is fraught with complications.

Parallel queue data structures on shared memory systems have been studied in the past [16, 15, 14, 18]. These investigations focus on creating general purpose queue structures that allow concurrent enqueue and dequeue operations. This is most useful in situations with multiple concurrent producers and consumers. In our parallel buffer structure described in Section 3, we leverage the semantics of database processing to guarantee that only concurrent enqueue or dequeue operations to a buffer occur during a time slice, but not both.

Other work has suggested inserting buffers between operators [21] and processing blocks of input at a time and materializing the intermediate results instead of pipelining [12, 17]. Advantages of this block processing approach include the efficient reuse of instructions and data structures, such as an index. With block processing, fewer instruction and data cache misses occur because the operator's instructions and data structures remain cache resident. Work by Boncz et al. [2] shows that processing a vector of tuples can lead to more than an order of magnitude performance improvement compared to previous Volcano-style query execution. In pipelined query execution, an operator may process only one tuple, yet must pay the cost of many cache misses for both its data structures and instructions. On a chip multiprocessor with shared cache and execution resources, block processing is even more important because it allows for easier concurrent management of these shared resources. The parallel buffer proposed in this paper will help enable multithreaded block processing on chip multiprocessors.

3. PARALLEL BUFFER
We assume records have fixed length, and allocate an array that is capable of holding a large number M of records. We divide the array into chunks of size c, and assume M is a multiple of c. The buffer structure maintains a count h of the number of chunks in the array that are in use. Additionally, the structure contains p additional count variables d1, . . . , dp, where p is the maximum number of thread contexts available on the system. The use of these counters will be described below.

The buffer is in a stable state when every chunk from p + 1 to h is fully occupied. In a stable state, the variable di denotes the number of records present in chunk i; chunks 1 through p may be partially occupied.

To write data to a buffer, a thread atomically increments the chunk counter h, and uses chunk number h as the destination for output records. Each thread will be accessing a different chunk, and so actual data output does not need to be regulated by locks or mutexes. Only accesses to h are controlled, using the atomic increment instruction.


Figure 1: Parallel buffer example with three threads where threads terminate because the output buffer capacity is reached. Each chunk can hold two tuples. (a) Each thread starts writing to its first chunk. (b) Chunk counter incremented as threads write to new chunks. (c) Threads exhaust output buffer space. (d) Threads top-up and update counters to reflect first chunks' actual counts.

Once an output chunk is full, the thread tries to obtain a new chunk in the same fashion. When no more chunks are available, i.e., h = M/c, a flag is returned to the operator to signal this fact. A thread in this state is said to be finalized, and will stall until all other threads are also finalized.

Reading data from a buffer proceeds in a similar fashion to writing. When reading a stable buffer, a thread will be told how full its chunk is based on the di values. A thread that requests a new input chunk and finds none available also enters a finalized state. Such a thread (say it is thread number j) performs a top-up operation, in which its current output chunk is topped up using data from chunk j. The count dj in the output buffer is set according to how many records remain in chunk j. The top-up operations restore the buffer to a stable state.
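The core of the write path can be sketched in a few lines of C. The field names, the example constants, and the use of a GCC-style atomic builtin are our assumptions; the design only requires that the chunk counter h be advanced atomically and that a finalization flag be visible to all threads.

/* Minimal sketch of claiming an output chunk; not the paper's actual code. */
#include <stdbool.h>

#define M           1048576     /* buffer capacity in records (example)      */
#define CHUNK_SIZE  128         /* chunk size c in records (example)         */
#define MAX_THREADS 32          /* p on the UltraSPARC T1                    */

typedef struct {
    char *data;                  /* one contiguous array holding M records    */
    int   h;                     /* number of chunks handed out so far        */
    volatile bool finalized;     /* set once input/output is exhausted        */
    int   d[MAX_THREADS];        /* occupancy of the first p (partial) chunks */
} parallel_buffer_t;

/* Returns the claimed chunk index, or -1 if the thread must finalize. */
static int claim_output_chunk(parallel_buffer_t *b) {
    if (b->finalized)
        return -1;                                    /* finalization induced  */
    int chunk = __sync_fetch_and_add(&b->h, 1);       /* atomic increment of h */
    if (chunk >= M / CHUNK_SIZE) {                    /* no chunks left        */
        b->finalized = true;                          /* start finalization    */
        return -1;
    }
    return chunk;   /* thread now owns records [chunk*CHUNK_SIZE, +CHUNK_SIZE) */
}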

Figure 1 demonstrates the filling of an output buffer when threads terminate because the capacity of the output buffer is reached. Note that finalization begins when one thread fails to obtain a new output chunk. Finalization prevents threads from obtaining new input chunks, thereby guaranteeing termination of all threads quickly. This can leave holes in the buffer, as shown in Figure 1, requiring a top-up operation to return the buffer to a stable state. Figure 2 demonstrates the filling of an output buffer when threads terminate because the input is exhausted.

Finalization can happen due to a full output buffer or an empty input buffer. Once one thread becomes finalized, we induce finalization in all other threads by preventing them from obtaining new input or output chunks. For example, if one thread has found the output buffer full, then no other threads can get new input chunks. When another thread tries to get a new input chunk, it sees that the finalization flag has been set, and instead enters the top-up phase and finalizes itself.

The advantage of coordinating finalization between input and output is that we can bound the idle time of all threads to the smaller of the time taken to process one input chunk, and the time taken to generate one output chunk. Without coordination, it might be possible for one thread to continue operating on the input for a very long time, even after all other threads have finalized due to running out of output chunks. If the thread's operator is very selective, for example, many input records would need to be consumed to generate an output record. While this remaining thread is making progress, it harms overall utilization because p − 1 threads remain idle.

A dual problem can occur in the absence of input/output coordination. Suppose p − 1 threads have finalized due to running out of input, but one remaining thread is generating a lot of output for each record in its chunk. This one thread would keep working, even though it is forcing p − 1 threads to remain idle.

Finalization can also be externally induced, for example by an interrupt at the end of an operator's time-slice. Since we schedule operators one at a time, a buffer is used either for input or for output, but not both at the same time. Once a buffer has begun to be used as input, it cannot be used for output (by an upstream operator) until it has been fully emptied. Double-buffering can be used if upstream operators need to be rescheduled before downstream operators have consumed all of the previous records. Double-buffering is actually desirable, because our coordination mechanism can leave a buffer in a state with just a few remaining records.

The contiguous nature of the parallel buffer data structure is a natural fit for the output of table scan operations often found at the leaves of a query plan, with the exception that the final chunks may need an inexpensive top-up operation.

4. MODELING BUFFER CONTENTION
The appropriate chunk size can be determined theoretically. We will first present a simple model that provides insight into issues related to contention and chunk size. Then we present a probabilistic model that provides a better estimation of the necessary chunk size to eliminate contention.

The total time to process a chunk is T = cr, where c is the size of a chunk in tuples and r is the time to process a tuple. If the time to perform the necessary mutually exclusive operations for each chunk processed is L, then

(p − 1)L ≤ T − L    (1)


Figure 2: Parallel buffer example with three threads where threads terminate because the input is exhausted. The output is left unfilled, but in a stable state. Each chunk can hold two tuples. (a) Each thread starts writing to its first chunk. (b) Chunk counter incremented as threads write to new chunks. (c) Threads exhaust input and must terminate in current state. (d) Threads top-up and update counters to reflect first chunks' actual counts.

where p is the number of hardware threads. As more hardware threads are added, the potential contention for locking or atomic operations increases. We find that the chunk size that avoids contention is

c = T/r ≥ pL/r    (2)

Equation 2 shows that the chunk size must increase as more hardware threads are used or as the time to process a tuple decreases. Simply stated, either more threads or a shorter tuple processing time results in a shorter amount of time until some hardware thread must execute the critical section of the parallel buffer code. Therefore more tuples must be processed by each thread between subsequent executions of the critical section, hence the larger chunk size.

This simple model could underestimate the necessary chunk size. The solution to Equation 2 assumes the best case that the concurrent accesses to the critical section are spread out evenly in time. The problem is that even uniformly distributed accesses might not be equally spread out. When some accesses cluster, all of the participating threads slow down significantly. The period during which more accesses cause contention also extends. We need to model the probability that two accesses to the critical section will conflict. We model this as a statistical process.

Consider the locked resource to be a line extending to infinity, indexed by time. When a thread needs to execute the critical section, it reserves a line segment of size L starting at the current point in time. If a thread finds the lock already taken, it must reserve a line segment of size L starting at the end of the last reservation on the line. The cost of contention is the overlap of requests on this line. We can approximate this effect discretely by dividing the line into N buckets, where N = t/L. That is, we discretize the number of locking slots available over the duration of our experiment (time slice), t. If one thread acquires the lock f times per second and p threads are used, a total of pft locks will be acquired during the experiment. The discrete probabilistic model, therefore, is to place balls in a uniformly random manner into the N buckets and count how many have two or more balls (contention).

E[0-ball buckets] = N (1 − 1/N)^(pft)    (3)

E[1-ball buckets] = pft (1 − 1/N)^(pft−1)    (4)

We can use Equations 3 and 4 to estimate the number of buckets with two or more balls and therefore the amount of contention. If we want to limit the amount of contention to a fraction b of the buckets, we need to solve:

N − N (1 − 1/N)^(pft) − pft (1 − 1/N)^(pft−1) = bN    (5)

We can simplify by introducing a variable X:

X = (1 − 1/N)^(pft−1) = (1 − L/t)^(pft−1)    (6)

Thus, Equation 5 becomes:

(1 − b) = X (1 − L/t − pfL)    (7)

We will assume that N ≫ pft, which means that there are many more locking slots available than lock requests. This is reasonable since we are trying to configure the system to have few slots with more than one request. With this assumption we can approximate X as

X ≈ 1 − (pft − 1) L/t    (8)

As t goes to infinity, Equation 7 can be rewritten as:

(1 − b) = (1 − pfL)^2    (9)

f = (1 − √(1 − b)) / (pL)    (10)

Based on our earlier terminology, f = 1/(cr), so

c = pL / (r (1 − √(1 − b)))    (11)


When b = 0.1, (1 − √(1 − b)) ≈ 0.05. In this case, the chunk size estimate c is about 20 times greater than the estimate with the simpler model. The appropriate chunk size for a buffer is one that eliminates contention in both the buffer's upstream and downstream operators; this is simply the maximum of the chunk sizes found by performing the above analysis on the operators that share a buffer. We will examine the two chunk size models empirically in Section 5.
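To illustrate how far apart the two estimates can be, the short standalone C program below plugs example numbers into Equations 2 and 11. The per-tuple cost r is purely hypothetical; L is borrowed from the roughly 88-cycle atomic increment reported later in Section 5.3, and p = 32 matches the T1.

/* Back-of-the-envelope chunk-size calculator; parameter values are examples. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double p = 32.0;    /* hardware threads                                  */
    double L = 88.0;    /* cycles per atomic increment (Section 5.3)         */
    double r = 30.0;    /* cycles to process one tuple (assumed)             */
    double b = 0.1;     /* tolerated fraction of contended locking slots     */

    double c_simple = p * L / r;                            /* Equation 2  */
    double c_prob   = p * L / (r * (1.0 - sqrt(1.0 - b)));  /* Equation 11 */

    printf("simple model:        c >= %.0f tuples\n", ceil(c_simple));
    printf("probabilistic model: c ~  %.0f tuples\n", ceil(c_prob));
    return 0;   /* with these inputs, roughly 94 versus about 1830 tuples   */
}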

5. EXPERIMENTS
To validate our chunk size model, we performed experiments on real hardware using a machine with a Sun UltraSPARC T1 processor. The specifications of our test platform can be found in Table 1. The T1 has some unique characteristics. For one, the cores are much simpler than those found on other commodity processors: the pipeline is a shallow six stages, instructions are issued in order, and no hardware prefetching occurs. This simpler core does, however, support four hardware threads that share the core in a fair manner. A context switch occurs on each clock cycle and an instruction is then issued from the least recently used, ready thread. This sharing has important implications for performance on the T1. When all threads are ready, each issues an instruction every fourth cycle, which means that the effective clockrate seen by each thread is one quarter that of the core's clockrate. In the event of longer latency instructions or events (e.g., a cache miss), having other threads ready to run keeps the core from becoming idle. The T1 also foregoes branch prediction, instead relying on the other threads to issue instructions to fill the pipeline until a branch is resolved. These characteristics suggest the importance of keeping all threads on all cores busy to achieve optimum performance, particularly for data dependent applications, such as databases, that often cause many long latency cache misses.

5.1 Setup
The parallel buffer data structure was implemented in C and is very lightweight, requiring less than 200 lines of code. To test the parallel performance of the buffer, an operator was created that reads from an input buffer and writes to an output buffer. This operator allowed many performance parameters to be specified, including the amount of work per tuple, size of the input and output tuples, chunk size in the input and output buffer, and the selectivity. The selectivity is the number of output tuples produced for each input tuple read. In our test operator, selectivity was simulated using a random number test, but in practice it would be data dependent.

For all experiments, the results are averages of four runs using the same parameters. We also implemented a version of the parallel buffer that did not use the top-up procedure, but instead kept a counter for each chunk in the buffer to keep track of the occupancy of each chunk. This design allows partially filled chunks anywhere in the array, but at the expense of additional storage for the counters. This data structure performed nearly identically to the data structure described in Section 3. Therefore we provide results using only the parallel buffer defined in Section 3, because it is more space efficient. Additionally, because all but the first chunk processed by each thread are a fixed size, compile-time optimizations such as loop unrolling and instruction reordering can improve performance. Although we did not find a significant performance difference on the Sun T1, other architectures with deeper pipelines, branch misprediction penalties, and out-of-order execution may benefit more from these optimizations.

Table 1: Specifications of the Sun UltraSPARC T1.
Operating System: Solaris 10 11/06
Cores (Threads/core): 8 (4)
RAM: 8GB
Shared L2 Cache: 3MB, 12-way associative; hit latency: 21 cycles; miss latency: 90-155 cycles1
L1 Data Cache: 8KB per core, shared by 4 threads
L1 Instruction Cache: 16KB per core, shared by 4 threads
On-chip bandwidth: 132GB/s
Off-chip bandwidth: 25GB/s over 4 DDR2
Compiler: Sun C 5.8

To minimize the amount of time spent starting and stopping threads, the threads were created and joined recursively. For example, the initial thread would create two child threads, begin processing, and then join its two children when processing completes. Those children would also create additional threads, and so on in a recursive manner until the target number of threads has been created. When one thread created all of the threads in a linear fashion, the time until all threads were working was longer, which resulted in an uneven amount of processing by each thread. Getting all threads to work as quickly as possible is important to achieving good processor utilization and thus maximizing the parallelism available on the processor.
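A sketch of this recursive start-up using POSIX threads is shown below; the paper does not name the threading API, so pthreads and the binary fan-out numbering are our assumptions.

/* Recursive thread creation sketch: thread i spawns threads 2i+1 and 2i+2,
 * starts working immediately, and joins its children afterwards. */
#include <pthread.h>

#define NUM_THREADS 32

extern void do_work(int thread_id);       /* operator body, defined elsewhere */

static void *worker(void *arg) {
    int id = (int)(long)arg;
    pthread_t kids[2];
    int nkids = 0;
    for (int k = 1; k <= 2; k++) {        /* spawn up to two children         */
        int child = 2 * id + k;
        if (child < NUM_THREADS)
            pthread_create(&kids[nkids++], NULL, worker, (void *)(long)child);
    }
    do_work(id);                          /* begin processing before joining  */
    for (int k = 0; k < nkids; k++)
        pthread_join(kids[k], NULL);
    return NULL;
}

/* The initial thread simply calls worker((void *)0L) to start the tree. */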

5.2 Buffer Performance
We performed a number of experiments, varying the parameters of our test operator described above. Figure 3 shows the throughput of the test configuration when all 32 hardware threads were used to copy every input tuple from the input buffer to the output buffer (selectivity of 1.0) using different chunk sizes. The tuple size was 16 bytes. This graph clearly shows that the cost of contention is over an order of magnitude in lost throughput. This graph also shows the estimates provided by the simple and probabilistic chunk size models. The simple model provides an estimate that is very close to a chunk size that provides the best possible performance. The probabilistic model with b = 0.1 estimates a chunk size that is well into the performance plateau of chunk sizes that provide the same operator throughput. The two models provide a range of good chunk sizes for parallelism without significant contention. In practice, one can expect to have low contention if a chunk size somewhat greater than the simple model estimate is chosen.

A closer analysis of the two models in the context of the experiments sheds some light on their accuracy. In the probabilistic model, our discretization of the experiment's running time into buckets means that we consider any two requests to the same bucket to be contentious. In reality, unless the requests arrive at the exact same time, the overlap is not total. On average, one might expect a contentious operation to overlap with half of another operation. This is one reason why the probabilistic model may produce an overestimate. The probabilistic model was also proposed because uniformly distributed accesses might not be evenly spread out. But in practice, if the tuple processing times are uniformly distributed, one can expect the chunk processing finishing times to be evenly spread out. If each chunk takes roughly the same amount of time to finish, then the simple model does provide the chunk size necessary to avoid contention.

1 The miss latency varies with the workload and with the load on the memory controllers [11].

Figure 3: The dashed line is the simple estimate and the solid line is the probabilistic estimate with b = 0.1.

Figure 4: Scaling Performance.

The effects of contention are clearly demonstrated by Figure 4. As the number of threads concurrently accessing the parallel buffer increases, the performance penalty due to contention increases. This confirms predictions made by the chunk size model that the chunk size necessary to avoid contention increases as the amount of thread level parallelism increases. The penalty due to contention is so severe that fewer threads without contention outperform more threads that do have contention. This graph also shows that choosing a larger chunk size also helps amortize the cost of a more expensive atomic or locking operation over more tuples. In the case of one thread, there is no contention, but the throughput improves for larger chunk sizes because of this amortization.

Figure 5: Performance based on tuple size. Larger tuples yield more cache misses and slower processing, but less contention.

As the size of the tuples stored in the parallel buffer increases, the number of cache misses incurred during processing each tuple increases. Figure 5 shows the results. Because the time to process each chunk increases, contention is reduced. For example, contention appears to be absent for 64-byte records at a chunk size of 16, while contention remains an issue for 16-byte records for chunks containing 100 records. The UltraSPARC T1 has 64 byte L2 cache lines, so 16, 32, and 64 byte tuples represent 1/4, 1/2, and 1 cache misses per tuple read or written, respectively.

Figure 6: Throughput as work per tuple increases.

As the amount of work performed per tuple is increased, throughput naturally decreases, as shown in Figure 6. The "baseline" in this experiment is 32 threads processing 16 byte tuples with a selectivity of 1.0. Extra work was then added to the processing of each tuple. The increased time per tuple and, therefore, increased time per chunk also results in less contention. Figure 6 shows that as the work per tuple increases, a smaller chunk size is necessary to eliminate contention.

5.3 Mutexes vs. Atomic Operations
Incrementing the chunk counter must be done atomically to ensure that each thread obtains a unique chunk to read from or write to. Two techniques may be used to achieve this atomicity.


Figure 7: Incrementing the chunk counter atomically using a mutex vs. atomic operations.

First, threading libraries provide mutexes that can be used to provide exclusive access to particular variables and critical sections of code. When one thread has locked a mutex, all other threads that request that lock must wait for the first thread to release the mutex. In the case of incrementing the chunk counter, we acquire the mutex (it is shared among all of the threads), increment the counter, remembering the new value, and then release the mutex.

Another way of atomically incrementing the chunk counter is to use atomic operations provided by the architecture's instruction set. Most microarchitectures provide some type of atomic operation on which synchronization objects, such as the mutexes described above, can be built. Some microarchitectures, including the Sun T1, provide more advanced atomic operations that can be used to perform atomic arithmetic and logical operations.2 Using an atomic operation, if available, to increment the chunk counter has a number of advantages. First, acquiring and releasing a mutex requires using atomic operations anyway, so the atomic increment operation is unlikely to be slower. Second, using a mutex requires invoking the threading library, which at the very least means that more instructions will be executed, lowering performance. A third issue is that when a thread attempts to acquire a mutex but fails, it may be put to sleep until the mutex becomes available. This means that the mutex implementation interacts with the system scheduler, incurring an even higher overhead. For a very lightweight operation such as incrementing the chunk counter, the overhead of acquiring a mutex can be significant, especially when there is contention between threads acquiring the mutex.
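The two alternatives look roughly as follows in C; the paper's code targets Sun C on Solaris, so the specific primitives shown here (POSIX mutexes and a GCC-style atomic builtin) are stand-ins for whatever the actual implementation used.

/* Sketch of the two ways to advance the shared chunk counter atomically. */
#include <pthread.h>

static int chunk_counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* (a) Mutex-protected increment: correct, but it invokes the threading
 *     library and may involve the scheduler when the lock is contended. */
static int next_chunk_mutex(void) {
    pthread_mutex_lock(&counter_lock);
    int mine = chunk_counter++;
    pthread_mutex_unlock(&counter_lock);
    return mine;
}

/* (b) Hardware atomic increment: a single read-modify-write, with no
 *     library call and no scheduler interaction. */
static int next_chunk_atomic(void) {
    return __sync_fetch_and_add(&chunk_counter, 1);
}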

In all of our experiments we have used the lighter-weight atomic increment operation instead of mutexes. Figure 7 shows a comparison of buffer performance using the atomic increment and a mutex. The experimental parameters are the same as those in the experiment from Figure 3. In a simple experiment, we measured the single-threaded latency incurred while performing the atomic increment using a mutex to be about 128 cycles, compared to 88 cycles when using the atomic increment operation. Figure 7 demonstrates that a larger chunk size is required for the mutex approach to achieve a performance comparable to the implementation using atomic operations. This result follows from the mutex's higher latency, which increases the chance of contention at lower chunk sizes. For sufficiently large chunk sizes, the throughput obtained by both approaches is nearly identical. This is because the time to process a chunk dominates the time required to atomically increment the chunk counter.

2 Though the T1 ISA does not have, for example, an atomic add instruction, such an operation can be built using provided atomic primitives.

5.4 Load Balancing
Achieving high performance on a chip multiprocessor requires keeping all of the hardware thread contexts busy so that the processor is fully utilized. In the context of database operations that exploit intra-operator parallelism, the key to keeping the processor fully utilized is load balancing. Threads that complete their work and then wait for other threads to also finish lower overall performance, whereas threads that complete their work and then find other work to complete help maximize overall performance. In this section we examine some examples of skew that can occur with other approaches to parallelism and demonstrate how the proposed parallel buffer structure achieves good load balancing even in the presence of significant skew.

A common method of partitioning data for parallel processing is to partition the tuples based on a hash of some combination of attribute values [4]. In the case of the Sun T1, we might want 32 partitions - one for each hardware thread. Partitioning, however, is sensitive to skew in the data, which can cause some partitions to be much larger than others. In a simple experiment we used multiplicative hashing [13] to partition input into 32 partitions. The input distributions consisted of 2^24 tuples and were generated using techniques similar to those found in Gray et al. [9].
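For reference, a multiplicative hash of this kind can be sketched as below; the multiplier and key width are the conventional Knuth-style choices and are our own illustration rather than the exact function used in the experiment.

/* Multiplicative hashing into 32 partitions (illustrative sketch). */
#include <stdint.h>

#define NUM_PARTITIONS 32            /* one partition per hardware thread */

static inline int partition_of(uint32_t key) {
    /* 2654435761 is roughly 2^32 divided by the golden ratio, the usual
     * multiplicative-hashing constant. */
    uint32_t h = key * 2654435761u;
    return (int)(h >> (32 - 5));     /* top log2(32) = 5 bits pick the partition */
}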

When the input values were distinct, the sizes of the partitions were very similar, with a standard deviation of just 1.8 tuples. For distributions such as Zipf and self-similar, the amount of skew was higher. For Zipf, the measured standard deviation was about 1000 tuples when the values were chosen from a range as large as the size of the input, but increased significantly as the range was decreased, causing values to repeat more frequently. For the self-similar distribution, the standard deviation was almost 300000 tuples and the largest partition was about five times greater than the smallest partition. Even in the presence of moderate amounts of skew, threads assigned to smaller partitions will finish early and wait for other threads to complete, under-utilizing the processor. Using the proposed parallel buffer, threads continue to work on new chunks until the buffer is exhausted, thus keeping all threads busy performing useful work.

Another significant problem with the partitioned approach is skew introduced during query processing. Even if an initial partitioning of the input is well balanced, some partitions may contain tuples that fail to pass a selection condition, while other partitions contain tuples that have large join products, bloating their join output relative to other partitions. Solutions to this problem include repartitioning and variable sized buffers for the partitions between operators, which we argue is much more complicated than the single, unified parallel buffer proposed in this paper.

A different form of skew involves the amount of time required to process a tuple. Some tuples take longer to process than others. Consider the example of a hash join.


Figure 8: Performance using parallel buffers vs. static partitioning. (a) Constant amount of work per tuple. (b) Skewed amount of work per tuple.

If a tuple hashes to an empty bucket, processing stops because the tuple does not participate in the join. In contrast, if a tuple hashes to an occupied bucket, then the values in that bucket must be interrogated along with any potential overflow buckets. In the partitioned approach, even if the partitions are of equal size, the amount of processing time may be skewed. To compare partitioned processing with using the parallel buffer, we create a skewed scenario. Each tuple in the first 1/32 of the input requires twice as long to process as the rest of the input. In the partitioned approach, this first 1/32 of the input corresponds to the partition processed by the first thread. Using a parallel buffer, all of the threads share in processing this more expensive input.

Figure 8 shows the effects of processing time skew on the performance of both the partitioned and parallel buffer approaches. The work is introduced in the same manner as in the experiment associated with Figure 6. When work per tuple is constant, as in Figure 8a, the partitioned and parallel buffer approaches perform similarly when the buffer chunk size is sufficiently large. This is good because it means that the buffer infrastructure has negligible overhead compared to coarse-grained partitioning. The benefit of using a parallel buffer is considerable when significant skew is introduced in the manner described above. The difference between the baseline buffer performance (Figure 8a) and the buffer performance with skew (Figure 8b) is less than 1/32, which is what we would expect since 1/32 of the work is twice as expensive. In contrast, Figure 8b shows that the difference between the parallel buffer and partitioned processing for the skewed workload is almost 30%, which means that many threads are idle, resulting in lower processor utilization.

The skewed performance might be expected to equal that of the partitioned approach with twice the work for each tuple. However, this does not happen because of the way that four threads share one core on the Sun T1. In the skewed case, one thread is doing more work and issuing more instructions, which may fill holes where the other threads cannot issue instructions because of delays. In the case where all threads have equal work, they compete evenly for execution resources. Also, once other threads terminate early, the slower thread that is processing the more expensive partition never conflicts with other threads and can always issue instructions when ready. Therefore the partitioned approach's skewed performance is somewhat better than might be expected, but still significantly worse than the parallel buffer.

The advantage of the parallel buffer is that each thread will process as many input chunks as it is able and write to as many output chunks as needed. No adjustment is needed if one thread produces more output than other threads. Similarly, no load balancing steps are required within the plan. Keeping threads busy with work is obviously important, but there are situations, such as the exhaustion of input tuples or of space to write output, that will require that a thread terminate. Ensuring that all threads terminate quickly when one thread is forced to finish is also important to maintaining high processor utilization and is the focus of the next section.

5.5 Thread Finalization
Efficiency during the finalization of threads using the parallel buffer data structure is also important to performance. If some threads take a much longer time to stop work, the processor could be underutilized while a majority of threads wait for all of the threads to terminate. Section 3 describes this condition and how our parallel buffer avoids this problem. Figure 9 shows the finishing times of all 32 hardware threads during an experiment, adjusted to the time that the first thread finishes. The amount of time between the first and last thread termination is much less than 1% of the total execution time and represents the time to process about 30 chunks. Though this overhead is not as low as suggested in Section 3, it still ensures that the processor is fully utilized during almost all processing.

We suspect that the reason we observe a 30-chunk window rather than a smaller one is that there is contention on the timing counter used to perform the measurement for this experiment, forcing the threads to serialize their access to the counter. The true finishing times (in the absence of a measurement or other serialization point) would show an approximately cumulative normal distribution, something not apparent in Figure 9.

Figure 9: Difference in thread finishing times from the first thread to finish (microseconds versus thread number).

6. CONCLUSION AND FUTURE WORK
Achieving good performance on chip multiprocessors requires applications to exhibit sufficient thread level parallelism to saturate the available hardware threads, while also managing shared resources efficiently. Database operations exhibit a high degree of parallelism, but a challenge is in coordinating the input and output to a parallel operator in a manner that avoids contention between threads. In this paper we present a new parallel buffer data structure that helps to enable intra-operator parallelism. This buffer provides unified input or output to a parallel data structure. Based on a theoretical analysis and experimental validation, processing portions of the input and generating the output in sufficiently large chunks can eliminate contention between threads. The appropriate chunk size can be determined via a theoretical model, which we have verified experimentally.

Another advantage of our data structure is that it also provides load balancing between threads running a parallel operator. Because every thread consumes and produces chunks of tuples, the amount of input processed or output generated by any one thread can adapt to the speed of that thread. This is in contrast to a per-thread buffer or static partitioning. Those techniques are sensitive to skew and may result in underutilization. An example of this underutilization occurs when some threads drain their input buffers quickly and others process input more slowly. With a unified parallel buffer, individual operators do not need to address this load balancing problem.

The parallel buffer is also compatible with row-wise or column-wise storage. Column-wise storage has been shown to be particularly beneficial for OLAP workloads [1, 19]. The parallel buffer only requires fixed size elements. Whether this is a full record or only a single attribute does not affect the load balancing and contention avoidance properties of the data structure. If multiple columns are required as input or output, multiple buffers may be used or a single buffer with multiple arrays could be used. In the latter case, the columns in each chunk would represent values from the same records. We have not implemented the buffer data structure for a particular data layout, but as future work we will investigate the best way to use parallel buffers for column- and row-based data.

In future work, we plan to implement real database operators and validate complete system performance. The parallel buffers presented here form the core of the necessary infrastructure for managing load balancing and parallelism. Operator implementation can thus focus on achieving good threaded performance, by choosing efficient algorithms that share cache-resident data structures and avoid inter-thread interference.

7. ACKNOWLEDGEMENTS
We thank the anonymous peer reviewers for their constructive comments.

8. REFERENCES

[1] P. A. Boncz, S. Manegold, and M. L. Kersten. Database architecture optimized for the new bottleneck: Memory access. In VLDB, 1999.

[2] P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-pipelining query execution. In CIDR, pages 225-237, 2005.

[3] J. Cieslewicz, J. W. Berry, B. Hendrickson, and K. A. Ross. Realizing parallelism in database operations: insights from a massively multithreaded architecture. In DaMoN, 2006.

[4] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen. The Gamma database machine project. IEEE Trans. Knowl. Data Eng., 2(1):44-62, 1990.

[5] D. J. DeWitt and J. Gray. Parallel database systems: The future of high performance database systems. Commun. ACM, 35(6):85-98, 1992.

[6] P. Garcia and H. F. Korth. Database hash-join algorithms on multithreaded computer architectures. In Conf. Computing Frontiers, pages 241-252, 2006.

[7] G. Graefe. Query evaluation techniques for large databases. ACM Comput. Surv., 25(2):73-170, 1993.

[8] G. Graefe. Volcano - an extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng., 6(1):120-135, 1994.

[9] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly generating billion-record synthetic databases. In SIGMOD, 1994.

[10] N. Hardavellas, I. Pandis, R. Johnson, N. Mancheril, A. Ailamaki, and B. Falsafi. Database servers on chip multiprocessors: Limitations and opportunities. In CIDR, pages 79-87, 2007.

[11] J. L. Hennessy and D. A. Patterson. Computer Architecture. Morgan Kaufmann, 4th edition, 2007.

[12] M. L. Kersten, S. Manegold, P. A. Boncz, and N. Nes. Macro- and micro-parallelism in a DBMS. In Euro-Par, pages 6-15, 2001.

[13] D. E. Knuth. The Art of Computer Programming, volume 3. Addison-Wesley, 2nd edition, 1998.

[14] E. Ladan-Mozes and N. Shavit. An optimistic approach to lock-free FIFO queues. In DISC, pages 117-131, 2004.

[15] M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In PODC, pages 267-275, 1996.

[16] M. M. Michael and M. L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared memory multiprocessors. J. Parallel Distrib. Comput., 51(1):1-26, 1998.

[17] S. Padmanabhan, T. Malkemus, R. C. Agarwal, and A. Jhingran. Block oriented processing of relational database operations in modern computer architectures. In ICDE, pages 567-574, 2001.

[18] C.-H. Shann, T.-L. Huang, and C. Chen. A practical nonblocking queue algorithm using compare-and-swap. In ICPADS, pages 470-475, 2000.

[19] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-Store: A column-oriented DBMS. In VLDB, pages 553-564, 2005.

[20] J. Zhou, J. Cieslewicz, K. A. Ross, and M. Shah. Improving database performance on simultaneous multithreading processors. In VLDB, pages 49-60, 2005.

[21] J. Zhou and K. A. Ross. Buffering database operations for enhanced instruction cache performance. In SIGMOD Conference, pages 191-202, 2004.


A General Framework for Improving Query Processing Performance on Multi-Level Memory Hierarchies

Bingsheng He†, Yinan Li‡, Qiong Luo†, Dongqing Yang‡
†Hong Kong Univ. of Science and Technology  ‡Peking University
{saven,luo}@cse.ust.hk  {liyinan,dqyang}@pku.edu.cn

ABSTRACT
We propose a general framework for improving the query processing performance on multi-level memory hierarchies. Our motivation is that (1) the memory hierarchy is an important performance factor for query processing, (2) both the memory hierarchy and database systems are becoming increasingly complex and diverse, and (3) increasing the amount of tuning does not always improve the performance. Therefore, we categorize multiple levels of memory performance tuning and quantify their performance impacts. As a case study, we use this framework to improve the in-memory performance of storage models, B+-trees, nested-loop joins and hash joins. Our empirical evaluation verifies the usefulness of the proposed framework.

1. INTRODUCTION
For the last two decades, processor speeds have been growing at a much faster rate (60% per year) than memory speeds (10% per year) [1]. Due to this widening speed gap, the memory hierarchy has become an important factor for the overall performance of relational query processing [3, 10]. Meanwhile, both relational database systems and hardware platforms are becoming increasingly complex and diverse. It is important and challenging to automatically and consistently achieve a good query processing performance across platforms.

In this paper, we propose a general framework to quantify the relationships between the performance improvement and the automaticity of in-memory query processing techniques. Intuitively, an algorithm that knows much about a specific memory hierarchy can utilize this knowledge to improve its efficiency, but it may require a large amount of tuning due to its dependency on platform-specific parameters, and its performance may also differ on different platforms. Considering these issues, we categorize in our framework the automaticity of an algorithm by the amount of knowledge about the memory hierarchy.

A memory hierarchy has quite a few parameters that affect the query processing performance. The common ones include (1) the number of levels of the hierarchy and (2) the capacity, block size, associativity, and access latency of each level. Other characteristics include prefetching and non-blocking data transfers between two adjacent levels of the memory hierarchy. Some of these characteristics are correlated and others are independent.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Proceedings of the Third International Workshop on Data Management on New Hardware (DaMoN 2007), June 15, 2007, Beijing, China.
Copyright 2007 ACM 978-1-59593-772-8 ...$5.00.

So far, various cache-conscious techniques [1, 10, 31] have considered one or two of these parameters individually and have demonstrated a high performance with suitable parameter values and fine tuning on a specific memory hierarchy. In contrast, there has emerged initial work on cache-oblivious algorithms [5, 7, 11, 12, 18], which assume no knowledge about a specific memory hierarchy and usually have provable upper bounds on the number of block transfers between any two adjacent levels of an arbitrary memory hierarchy.

Considering both the memory hierarchy characteristics and the existing algorithms, we define the tuning levels in our framework corresponding to the memory hierarchy characteristics and study the performance of the algorithms at different tuning levels. Specifically, we start from a cache-oblivious algorithm, which requires no tuning, and gradually add more knowledge about the memory hierarchy and thus more tuning to the algorithm. Finally, we compare the performance of these algorithms at each tuning level and across platforms. The algorithms we studied include in-memory storage models, the B+-tree, the non-indexed nested-loop join and the hash join. Our empirical evaluation verifies the usefulness of the proposed framework.

In brief, this paper makes the following three contributions. First, we propose a general framework for improving the query processing performance on multilevel memory hierarchies. To the best of our knowledge, this is the first work on quantifying the correlations between the performance improvement and the amount of tuning for the memory hierarchy. Second, we use our framework to study four common data structures and algorithms for in-memory query processing. As a result, we develop a series of algorithms that carry different degrees of tuning for in-memory databases. Third, we empirically evaluate the in-memory performance of the algorithms. Our results demonstrate the effectiveness of our framework.

The remainder of this paper is organized as follows. In Section 2, we briefly review the background and related work. In Section 3, we present our framework. In Section 4, we use our framework to study in-memory storage models, B+-trees, nested-loop joins and hash joins. We experimentally verify our framework in Section 5. Finally, we conclude in Section 6.

2. PRELIMINARY AND RELATED WORK
In this section, we first introduce the background on the memory hierarchy. Next, we review the related work on cache-conscious and cache-oblivious techniques.

2.1 Memory hierarchies
The memory hierarchy in modern computers typically contains multiple levels of memory, from bottom up: disks, the main memory, the L2 cache, the L1 cache and registers. Each level has a larger capacity and a slower access speed than its higher levels. We use the cache and the memory to represent any two adjacent levels in the memory hierarchy.

We summarize the following static characteristics of a memory hierarchy.

P0. The number of levels in the hierarchy.

P1. The cache configuration: <C, B, A>, where C is the cache capacity in bytes, B the cache line size in bytes, and A the degree of set-associativity.

P2. The transfer latency of the cache, l.

P3. The transfer characteristics between two adjacent levels: whether software prefetching is supported, and the non-blocking capability to support multiple transfers simultaneously. We use the number of concurrent transfers supported, D, to quantify the non-blocking capability.

Compared with the static characteristics, dynamic ones such as the number of concurrent threads are more difficult to capture, but are important in multi-task systems such as databases [22, 23]. In this study, we focus on the static characteristics and leave the study of the dynamic characteristics as future work.

The notations used throughout this paper are summarized in Table 1. For readability, we simply use C, B, A, l, d, D (i.e., without subscript i) whenever we refer to any level of the memory hierarchy without explicitly specifying a level.

Table 1: Notations used in this paper

Pi: Characteristics of the memory hierarchy, 0 ≤ i ≤ 3
Γi: The knowledge about the memory hierarchy, Γ0 = φ or Γi ⊆ {P0, ..., Pi}, 1 ≤ i ≤ 3
Ti: The level of tuning corresponding to Γi in our framework, 0 ≤ i ≤ 3
L: Number of levels in the memory hierarchy considered for tuning
Ci: Cache capacity of the ith level (bytes)
Bi: Cache line size of the ith level (bytes)
Ai: Cache associativity of the ith level
li: Access latency of the ith level for random accesses (ns)
di: Prefetching distance (number of cache blocks to prefetch ahead)
Di: Number of concurrent transfers supported by the non-blocking capability
R, S: Outer and inner relations of the join
r, s: Tuple sizes of R and S (bytes)
|R|, |S|: Cardinalities of R and S
||R||, ||S||: Sizes of R and S (bytes)

2.2 Cache-centric query processing

Due to the widening speed gap between the processor and the main memory, the CPU caches, especially the L2 cache, have become a bottleneck for in-memory relational query processing [3, 10]. Consequently, many contributions have focused on optimizing the L2 cache performance using cache-centric techniques, including cache-conscious [10, 13, 31] and cache-oblivious ones [7, 24].

Cache-conscious techniques have been the leading approach to optimizing the cache performance. Specialized data structures, such as cache-conscious B+-trees [9, 30], R-trees [27] and storage models [2, 20], have been proposed to reduce cache misses. Typical cache-conscious techniques, including blocking [31], data partitioning [28, 31], compression [9], data clustering [31], prefetching [13–16], staging [23] and buffering [33], were proposed to improve the cache behavior of traditional database workloads. Most of these studies optimize a single level of a multi-level memory hierarchy, e.g., the L2 data cache.

With the same focus on reducing cache stalls, cache-oblivious algorithms require neither knowledge of the cache parameters nor any tuning on them. Representatives of existing cache-oblivious techniques include recursive partitioning [18] and buffering [11, 18] for temporal locality, and recursive clustering [5] for spatial locality. For relational query processing, He et al. [24, 25] proposed cache-oblivious join algorithms, including nested-loop joins with and without indexes, sort-merge joins and hash joins. Both theoretical results and empirical evaluation show that cache-oblivious algorithms can match the performance of their manually optimized, cache-conscious counterparts [7, 12, 24].

In contrast with the existing cache-centric techniques, which are tuned based on a certain amount of knowledge about a specific memory hierarchy, we propose a general framework to quantify the correlations between the amount of tuning and the potential performance gain. The framework serves as a guide for optimizing a cache-centric algorithm given a certain amount of knowledge about a specific memory hierarchy.

3. FRAMEWORK

In this section, we present our framework and a cost model for applying the framework to an algorithm on the memory hierarchy. Our framework quantifies the performance impact of tuning with a certain amount of knowledge about the memory hierarchy. The basic idea of our categorization is that the more knowledge about the memory hierarchy is considered, the more tuning can be involved to improve the performance. Given a certain amount of knowledge about the memory hierarchy, we can apply certain kinds of optimizations to improve the query processing performance.

3.1 Categorization

Figure 1 illustrates the spectrum of tuning for the memory hierarchy. Techniques on the left of the spectrum require less tuning than the ones on the right. Based on our categorization of the characteristics of a memory hierarchy, we divide the spectrum into the following four levels of memory performance tuning.

T0 No knowledge about the cache parameters, i.e., cache-oblivious.

T1 Having the knowledge of the cache capacity and/or cache block size of the target level.

T2 Having the knowledge of the latency, in addition to T1.

T3 Having the knowledge of the software prefetching and/or non-blocking capability of the target level, in addition to T2.

T1–T3 implicitly have the knowledge of P0. At T1, we can determine the highest level of cache that can hold the working set of the algorithm. Let this level of cache be x. Thus, T1–T3 know the number of levels of the cache considered for tuning, L = x.

In our categorization, Ti (0 ≤ i ≤ 3) requires a set of characteristics of the target level of cache, Γi. Thus, the levels of tuning in our framework are in a total order according to the required amount of knowledge about the memory hierarchy. A Ti with a larger i value requires more information about the memory hierarchy and involves more tuning, i.e., T0 < T1 < T2 < T3. Tuning T0 is cache-oblivious, since it requires no tuning on hardware-dependent parameters. In contrast, T1–T3 are cache-conscious. For


Figure 1: The spectrum of tuning. T0 on the left of the spectrum is cache-oblivious, and T1–T3 on the right of the spectrum are cache-conscious. The bottom of the figure shows existing techniques belonging to each level of tuning, ranging from no tuning to fine tuning: recursive clustering and recursive partitioning (T0), blocking and buffering (T1), tiling and grouping (T2), and software prefetching (T3).

instance, if T3 applies software prefetching to a technique, it requires the tuning based on the cache capacity and the block size (T1) as well as the latency (T2). The block size is used to determine the size of the data to be prefetched. The cache capacity and the latency are used to determine the prefetch distance so that the memory latency is fully hidden by prefetching.

We have categorized existing techniques according to our framework, as shown in Table 2. The majority of cache-conscious techniques belong to T1 and T2.

Table 2: Categorizing existing cache-centric techniques

Γ0 (T0): CO B+-trees [5, 6], funnel sort [11], CO nested-loop join [24, 25] and storage models [6]
Γ1 (T1): Blocking [31], buffering [33], partitioning [10, 31], compression [9, 27], clustering [17]
Γ2 (T2): Tiling [19], grouping [2, 20, 29, 30]
Γ3 (T3): Loop unrolling [28], prefetching [13–16]

In this tuning hierarchy, a higher level of tuning can potentially achieve higher performance at the price of a larger amount of tuning. Given a technique at tuning level Ti, we consider higher tuning levels, Tj (j > i), to optimize this technique with more knowledge about the memory hierarchy. We briefly describe the basic use of each level of tuning: (a) T0 mainly uses the divide-and-conquer methodology to improve cache locality. (b) T1 packs the data into the cache or a cache block. Additionally, we can estimate the number of cache misses on each level of cache. (c) T2 determines the most significant levels of cache for the total execution time and applies techniques to those levels of cache. (d) T3 applies software prefetching to hide the cache stalls. As we will demonstrate in Section 4, we apply our framework to four case studies. We optimize each base technique with the knowledge about the memory hierarchy and without much modification to the base technique.

3.2 Determining the target levels of caches

Because a memory hierarchy consists of multiple levels, we need to determine the target level for a cache-conscious algorithm. Note that cache-oblivious techniques do not require determining the target levels due to their automaticity.

Since the complexity of tuning dramatically increases with the number of levels of caches considered, we choose the one or two levels of caches that are most significant for the overall performance to be the target levels of caches for tuning. The following are two representative cases:

Case Γ = {P0, P1}. Since we do not know the latency information, we cannot determine which levels of caches are significant for the overall performance. In practice, we choose the lowest level of caches that cannot hold the working set of the algorithm, (L − 1), to be the target level, because the lower levels have a larger latency (even though the actual latency is unknown) and are likely to be significant in the overall performance.

Case Γ ⊇ {P0, P1, P2}. With the latency information of the memory hierarchy, we develop our overall cache performance model for a memory hierarchy. This model estimates the overall cache performance, which is defined to be the total cache stalls of an algorithm on all levels of the memory hierarchy. Suppose the cost function Fi(P1, ..., P3, T0, ..., T3) gives the number of cache misses on the ith level of the memory hierarchy caused by the algorithm. Since caches at different levels of the memory hierarchy are independent of each other, the cost functions for two distinct cache levels may differ due to the different levels of tuning applied. Eq. 1 gives the cache stalls on the ith level. Thus, the overall cache performance of the algorithm on the entire memory hierarchy is given in Eq. 2.

τi = Fi(P1, ..., P3, T0, ..., T3) × l_{i+1}    (1)

τ = Σ_{i=1}^{L−1} τi    (2)

To determine the target levels, we rank the levels of caches according to τi. The larger τi is, the more significant the ith level of caches.
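To make the ranking step concrete, the following C sketch computes τi for each cache level from miss-count estimates and picks the level with the largest stall contribution as the tuning target. The structure, field names and numeric values are purely illustrative assumptions of ours, not figures from the paper.

    #include <stdio.h>

    /* Illustrative per-level estimate for the target-level selection of
     * Section 3.2: misses is F_i (from the cost functions of Table 3) and
     * miss_latency is l_{i+1}, the latency of the next lower level. */
    typedef struct {
        const char *name;     /* e.g. "L1", "L2"                    */
        double misses;        /* F_i, estimated number of misses    */
        double miss_latency;  /* l_{i+1}, obtained by calibration   */
    } level_est;

    int main(void) {
        /* hypothetical numbers, for illustration only */
        level_est lv[] = { { "L1", 4.0e8,  42.0 },    /* L1 misses served by L2  */
                           { "L2", 5.0e7, 300.0 } };  /* L2 misses served by RAM */
        double tau[2], total = 0.0;
        int best = 0;
        for (int i = 0; i < 2; i++) {
            tau[i] = lv[i].misses * lv[i].miss_latency;   /* Eq. (1) */
            total += tau[i];                              /* Eq. (2) */
            if (tau[i] > tau[best]) best = i;
        }
        printf("total stalls = %.3g, target level = %s\n", total, lv[best].name);
        return 0;
    }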

4. CASE STUDIES

In this section, we use in-memory storage models, the B+-tree, the nested-loop join without indexes (NLJ) and the hash join as case studies to illustrate the applicability of our framework. We start the tuning process at a certain level of tuning and apply the upper levels of tuning to the algorithm. Additionally, we have developed cost functions for each algorithm at different tuning levels. These cost functions are used in our cost model to determine the significant level of cache.

4.1 Storage models

We consider two kinds of storage models, the array for static data and the linked list (LL) for dynamic data. In particular, we consider optimizing the scan on these storage models.

Array scan. Since the array has good spatial locality on any level of the memory hierarchy, T1 and T2 do not yield any performance improvement on the array scan. We consider T3 to see whether software prefetching helps reduce the number of cache misses for loading the array. Given the prefetching distance, d, the algorithm issues a prefetching instruction on the (i + d)th cache block before processing the ith cache block.
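As an illustration of the T3 array scan, the sketch below issues one prefetch per cache block, d blocks ahead of the block being processed. It assumes a GCC-style __builtin_prefetch intrinsic and a 64-byte block size; the tuple layout and function name are ours.

    #include <stddef.h>

    #define CACHE_LINE 64                          /* illustrative block size B */

    typedef struct { int a1, a2; } tuple_t;        /* hypothetical 8-byte tuple */

    /* T3 array scan: before processing the i-th cache block, issue a
     * prefetch for the (i + d)-th block, where d is the prefetch distance. */
    long scan_with_prefetch(const tuple_t *rel, size_t n, size_t d)
    {
        long hits = 0;
        size_t tuples_per_block = CACHE_LINE / sizeof(tuple_t);
        for (size_t i = 0; i < n; i++) {
            if (i % tuples_per_block == 0) {
                const char *ahead = (const char *)&rel[i] + d * CACHE_LINE;
                __builtin_prefetch(ahead, /*rw=*/0, /*locality=*/0);
            }
            if (rel[i].a1 == 1) hits++;            /* predicate evaluation */
        }
        return hits;
    }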

LL scan. At T1, the algorithm determines the suitable node size for the linked list according to the cache block size of the lowest level of caches, so that the spatial locality of each node is maximized. At T2, the algorithm determines the suitable node size for the linked list according to the cache block size of the target level. At T3, the algorithm determines the suitable prefetching distance. Similar to the array scan, the algorithm prefetches the (i + d/z)th node (z is the node size in number of cache blocks) when it processes the ith node. The algorithm keeps a jump-pointer array, J, to maintain the addresses of the nodes in the linked list. The idea


Figure 2: The jump-pointer array for the linked list. The jump-pointer array J holds one entry per node LL0, LL1, LL2, ..., LLn of the linked list LL.

Figure 3: The VEB layout. Cut 1 at the middle level splits the tree into subtrees T1, T2, ..., Tt.

of the jump-pointer array is shown in Figure 2. J[i] stores the start address of the ith node in the linked list.

With the jump-pointer array, we apply the prefetching technique to scan the linked list. Given the prefetching distance, d, the algorithm issues prefetch instructions for the node at J[i + d/z] in order to prefetch the (i + d/z)th node.
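A corresponding sketch for the T3 linked-list scan follows. The node layout is illustrative, the jump-pointer array J is assumed to have been filled while the list was built, and, as in the array-scan sketch, a GCC-style __builtin_prefetch stands in for the platform's prefetch instruction.

    #include <stddef.h>

    #define CACHE_LINE 64                          /* illustrative block size B */

    typedef struct node { int payload[15]; struct node *next; } node_t;

    /* T3 linked-list scan: J[] holds the address of every node, so the
     * (i + d/z)-th node can be prefetched while the i-th node is processed
     * (z = node size in cache blocks). */
    long ll_scan_with_prefetch(node_t * const *J, size_t n, size_t d)
    {
        long sum = 0;
        size_t z = (sizeof(node_t) + CACHE_LINE - 1) / CACHE_LINE;
        size_t ahead = (d / z) ? (d / z) : 1;
        for (size_t i = 0; i < n; i++) {
            if (i + ahead < n)
                __builtin_prefetch(J[i + ahead], 0, 0);
            sum += J[i]->payload[0];               /* visit the i-th node */
        }
        return sum;
    }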

4.2 B+-trees

We start with a cache-oblivious B+-tree at T0. A CO B+-tree [6] consists of two arrays. One stores the data leaves of the B+-tree, and the other stores the directory. The index nodes in the directory are organized into a binary tree stored in the van Emde Boas (VEB) layout [5] without considering any specific memory parameters.

The VEB layout proceeds as follows. Let h be the number of levels in the tree. We split the tree at the middle level (Cut 1 in Figure 3) and obtain around N^{1/2} subtrees, each of which contains roughly N^{1/2} index nodes (T1, ..., Tt in Figure 3). The resulting layout of the tree is obtained by recursively storing each subtree in the order T1, ..., Tt. The recursion stops when the subtree contains only one node. In the VEB layout, an index node and its child nodes are stored close to each other. Thus, the spatial locality of the tree index is improved.
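The following sketch shows one way to compute a VEB ordering for a complete binary tree whose nodes carry 0-based heap (BFS) numbers; splitting the h levels into floor(h/2) top levels and the rest is our choice and may differ in detail from the layout used in [5, 6].

    #include <stddef.h>
    #include <stdio.h>

    /* Emit the nodes of a complete binary tree (BFS-numbered, root r,
     * h levels) in van Emde Boas order: the top half of the levels first,
     * then each bottom subtree, all of them recursively. */
    static void veb_layout(size_t r, int h, size_t *out, size_t *pos)
    {
        if (h == 1) { out[(*pos)++] = r; return; }
        int top = h / 2, bottom = h - top;
        veb_layout(r, top, out, pos);                  /* top subtree */
        size_t first_leaf = r;                         /* leftmost node on the */
        for (int i = 1; i < top; i++)                  /* bottom level of the  */
            first_leaf = 2 * first_leaf + 1;           /* top subtree          */
        size_t n_leaves = (size_t)1 << (top - 1);
        for (size_t i = 0; i < n_leaves; i++) {        /* bottom subtrees      */
            size_t leaf = first_leaf + i;
            veb_layout(2 * leaf + 1, bottom, out, pos);
            veb_layout(2 * leaf + 2, bottom, out, pos);
        }
    }

    int main(void) {
        size_t out[15], pos = 0;                       /* complete tree, 4 levels */
        veb_layout(0, 4, out, &pos);
        for (size_t i = 0; i < pos; i++) printf("%zu ", out[i]);
        printf("\n");
        return 0;
    }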

At T1, each tree node has the exact size of B. It is a small binary tree stored in the VEB layout. Additionally, the entire tree is stored according to the VEB layout. This idea of a "tree within a tree" is similar to the fractal tree structure [16]. The difference is that we use the VEB layout to store each node, so that the node has good spatial locality for all levels of caches above the target level. At T2 and T3, the B+-tree is similar to that at T1, except that the node size is the cache block size of the target level of cache at T2, and is (D × B) at T3.

4.3 Nested-loop joins

We use the cache-oblivious non-indexed nested-loop join (CO NLJ) [24] as the NLJ at T0. CO NLJ first divides each of the inner and outer relations (denoted as S and R, respectively) into two equal-sized sub-relations. Next, it performs joins on the pairs of inner and outer sub-relations. This partitioning and joining process goes on recursively until it reaches the base case, when |S| is no larger than the base case size, CS (default CS = 1). It then applies the tuple-based non-indexed nested-loop join algorithm to evaluate the base case.
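A compact sketch of this recursive scheme is given below. The tuple layout, the predicate and the emit() callback are illustrative assumptions; the base-case size CS is passed in so that the same code covers T0 (CS = 1) as well as T1/T2 (CS chosen so that the inner sub-relation fits the target cache).

    #include <stddef.h>

    typedef struct { int key; int payload; } tuple_t;   /* illustrative schema */

    /* Base case: plain tuple-at-a-time nested-loop join; emit() is a
     * hypothetical callback that consumes one result pair. */
    static void nlj_base(const tuple_t *R, size_t nr, const tuple_t *S, size_t ns,
                         void (*emit)(const tuple_t *, const tuple_t *))
    {
        for (size_t i = 0; i < nr; i++)
            for (size_t j = 0; j < ns; j++)
                if (R[i].key < S[j].key)                 /* non-equijoin predicate */
                    emit(&R[i], &S[j]);
    }

    /* Cache-oblivious NLJ: halve both relations recursively and join the
     * four sub-relation pairs until the inner part is small enough. */
    void co_nlj(const tuple_t *R, size_t nr, const tuple_t *S, size_t ns,
                size_t CS, void (*emit)(const tuple_t *, const tuple_t *))
    {
        if (ns <= CS || nr <= 1) { nlj_base(R, nr, S, ns, emit); return; }
        size_t hr = nr / 2, hs = ns / 2;
        co_nlj(R,      hr,      S,      hs,      CS, emit);
        co_nlj(R,      hr,      S + hs, ns - hs, CS, emit);
        co_nlj(R + hr, nr - hr, S,      hs,      CS, emit);
        co_nlj(R + hr, nr - hr, S + hs, ns - hs, CS, emit);
    }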

At T1 or T2, the algorithm sets the base case to be the cache capacity of the target level, depending on the level of cache to be optimized. Thus, the inner relation of the base case can fit into the cache. At T3, the algorithm sets the prefetching distance to be a small constant so that prefetching does not interfere with the cache locality tuned at T1. The optimal prefetching distance is l/w, given the computation time on each cache block, w. With this prefetching distance, the prefetching can fully hide the cache stalls.

4.4 Hash joins

We start with the simple hash join. We use two techniques to improve its cache performance: cache partitioning [10, 31] and prefetching [13]. The former belongs to T1 or T2, depending on its target level of cache, whereas the latter belongs to T3.

Since these two techniques are independent of each other, we have implemented two variants of cache-optimized hash joins: (1) we first implement the partitioned hash join and then apply prefetching within the join on each partition pair; (2) we apply prefetching in both the partitioning and the probing.
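To illustrate the cache-partitioning component, the sketch below performs one histogram-based partitioning pass over a relation. The schema, the hash function (key mod fanout) and the function name are ours; the fanout would be chosen so that one partition of the inner relation plus its hash table fits the target cache level.

    #include <stdlib.h>
    #include <string.h>

    typedef struct { int key; int payload; } tuple_t;   /* illustrative schema */

    /* One cache-partitioning pass (T1/T2 of the hash join): scatter the
     * tuples of a relation into `fanout` partitions on a hash of the key.
     * Out-of-memory handling is omitted for brevity. */
    size_t *partition_pass(const tuple_t *in, size_t n, size_t fanout, tuple_t *out)
    {
        size_t *hist = calloc(fanout + 1, sizeof(size_t));  /* partition bounds */
        for (size_t i = 0; i < n; i++)                       /* 1. histogram     */
            hist[(size_t)in[i].key % fanout + 1]++;
        for (size_t p = 1; p <= fanout; p++)                 /* 2. prefix sums   */
            hist[p] += hist[p - 1];
        size_t *cursor = malloc(fanout * sizeof(size_t));
        memcpy(cursor, hist, fanout * sizeof(size_t));
        for (size_t i = 0; i < n; i++) {                     /* 3. scatter       */
            size_t p = (size_t)in[i].key % fanout;
            out[cursor[p]++] = in[i];
        }
        free(cursor);
        return hist;   /* hist[p]..hist[p+1] delimits partition p in `out` */
    }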

4.5 Summary

We derive the cost function of each level of tuning for each case, as shown in Table 3. These cost functions are used to determine the target levels of caches.

5. EVALUATION

We verified the usefulness of the proposed framework by implementing and evaluating the case studies in Section 4.

5.1 Experimental setup

Our empirical study was conducted on three machines of different architectures, namely P4, AMD and Ultra-Sparc. Some features of these machines are listed in Table 4. The row DTLB gives the number of entries in the data TLB (Translation Look-aside Buffer). The Ultra-Sparc does not support hardware prefetching of data from the main memory [32], whereas both P4 and AMD do. AMD performs prefetching for ascending sequential accesses only [4], whereas P4 supports prefetching for both ascending and descending accesses [26].

Table 4: Machine characteristics

Name       P4                 AMD                     Ultra-Sparc
OS         Windows XP         Linux 2.6.15            Solaris 8
Processor  Intel P4 1.8GHz    AMD Opteron 1.8GHz      Ultra-Sparc III 900MHz
L1 DCache  <32K, 64, 4>       <64K, 64, 2>            <64K, 32, 4>
L2 cache   <2M, 64, 8>        <1M, 128, 16>           <8M, 64, 8>
DTLB       64                 1024                    64
Memory     1.0 GB             15.0 GB                 8.0 GB

We performed calibration on these machines to obtain the cache latencies. For instance, the latency of the L1 and L2 caches on P4 is 12 and 42 cycles, respectively.

Workload design. The workloads in our study contain (1) one selection query on table R, and (2) two join queries, each on two tables R and S. Each of tables R and S consists of n integer attributes, a1, a2, ..., and an. Each field was a randomly generated 4-byte integer. We varied n to scale the tuple size of the table up or down. These workloads are similar to those in the previous study [3].

We consider the following selection query, “SELECT R.a1 FROM R WHERE R.a1 = 1 and ... and R.an = 1”. We used a full scan on R to evaluate this selection query. All fields of the table are involved in the predicate so that an entire tuple is brought into the cache for the evaluation of the predicate.

The join queries considered in our experiments are “SELECT R.a1 FROM R, S WHERE <predicate>”. There are two


Table 3: Cost functions. z is the size of an index entry in the B+-tree node (bytes).

T0:  Array scan: ||R||/B;  LL scan: |R|;  NLJs: 3·||R||·||S||/(C·B);  B+-Trees: 2·log_{B/z}|R|;  Hash join (1): |R|·(1 + s/B) + ||R||/B;  Hash join (2): |R|·(1 + s/B) + ||R||/B
T1:  Array scan: ||R||/B;  LL scan: ||R||/B;  NLJs: ||R||·||S||/(C·B);  B+-Trees: log_{B/z}|R|;  Hash join (1): (||R||+||S||)·log_{C/B}|S|;  Hash join (2): (||R||+||S||)·log_{C/B}|S|
T2:  Array scan: ||R||/B;  LL scan: ||R||/B;  NLJs: ||R||·||S||/(C·B);  B+-Trees: log_{B/z}|R|;  Hash join (1): (||R||+||S||)·log_{C/B}|S|;  Hash join (2): (||R||+||S||)·log_{C/B}|S|
T3:  Array scan: ||R||/(D·B);  LL scan: ||R||/(D·B);  NLJs: ||R||·||S||/(D·C·B);  B+-Trees: log_{(D·B)/z}|R|;  Hash join (1): (||R||+||S||)·log_{C/B}|S|;  Hash join (2): (1/D)·(||R||+||S||)·log_{C/B}|S|

predicates: R.a1 = S.a1 for the equijoin, and R.a1 < S.a1 and ... and R.an < S.an for the non-equijoin. We used the non-indexed NLJ algorithm to evaluate the non-equijoin, and the hash join as well as the B+-tree to evaluate the equijoin.

Metrics. Table 5 lists the main performance metrics used in our experiments. We used the C/C++ function clock() to obtain the total execution time on all three platforms. In addition, we used a hardware profiling tool, PCL [8], to count cache misses on P4 only, because we did not have privileges to perform profiling on AMD or Ultra-Sparc.

Table 5: Performance metrics

TOT_CYC: Total execution time on all three platforms in milliseconds (ms)
L1_DCM: Number of L1 data cache misses on P4 in billions (10^9)
L2_DCM: Number of L2 data cache misses on P4 in millions (10^6)
TLB_DM: Number of TLB misses on P4 in millions (10^6)

5.2 Results

We present the experimental results on the three platforms. In general, the results on AMD are similar to those on P4. Additionally, the prefetching technique achieves a considerable performance improvement on P4 and AMD, whereas it yields little improvement on Ultra-Sparc.

5.2.1 Storage models

Array. Figure 4 shows the execution time of the array scan with software prefetching when |R| = 8M. We varied the tuple size. For each tuple size, we varied the prefetching distance in number of L2 cache lines. Software prefetching improves the array scan on P4 and AMD, whereas it has little performance impact on Ultra-Sparc.

Since hardware prefetching is enabled on P4 and AMD, software prefetching does not necessarily improve the performance. When the tuple size is small, the memory stalls are fully hidden by sufficient computation in the presence of hardware prefetching. For example, when r = 8B, each cache line contains 8 tuples. Software prefetching then degrades the performance on P4 due to its computation overhead. In contrast, when the tuple size is large, software prefetching further improves the scan performance in addition to hardware prefetching. Figure 5 shows the performance of the array scan with the tuple size varied. The performance improvement of software prefetching increases as the tuple size increases. On P4, software prefetching starts to improve the performance of the array scan when the tuple size is larger than 16 bytes. On AMD, software prefetching improves the performance of the array scan when the tuple size is larger than 8 bytes. One possible reason is that AMD has a larger memory latency than P4.

Software prefetching requires tuning of the prefetching distance in order to achieve the best performance on P4 and AMD. When the prefetching distance is small, the memory stalls are not completely hidden. When the prefetching distance is larger than the L1 cache capacity, a performance slowdown occurs due to cache

Figure 4: Array scan at the tuning level T3: varying the prefetching distance. Panels: (a) r = 8B on P4, (b) r = 64B on P4, (c) r = 8B on AMD, (d) r = 64B on AMD, (e) r = 8B on Ultra-Sparc, (f) r = 64B on Ultra-Sparc. Each panel plots time (ms) against the prefetching distance (number of cache lines).

thrashing in the L1 cache. A similar performance slowdown is observed when the prefetching distance is larger than the number of cache lines in the L2 cache. Thus, to develop an efficient prefetching scheme, we need (1) the latency and the cache block size to determine whether software prefetching can help hide the memory stalls, and (2) the cache capacity to avoid prefetching too many cache lines. This validates our framework in that T3 includes the lower levels of tuning, T1 and T2.

Linked list. We first investigated the performance of the linked list scan without software prefetching. We varied the node size and found that the stable node size is 128B on P4 and AMD, and 64B on Ultra-Sparc. At the stable node size, the execution time becomes stable.

We next evaluated the prefetching technique on the linked list scan. The execution time of the linked list scan with software


Figure 5: Array scan with and without prefetching: varying the tuple size. Panels: (a) on P4, (b) on AMD. Each panel plots time (ms) against the tuple size (bytes), with and without prefetching.

prefetching is shown in Figure 6. Due to the random nature of the linked list scan, hardware prefetching has little performance impact. The software prefetching technique helps reduce the cache stalls on P4 and AMD, whereas it does not help on Ultra-Sparc. The performance improvement on a relation with a large tuple size is larger than that with a small tuple size. This is because memory stalls are more significant, and software prefetching hides them better, on a relation with a large tuple size. Similar to the array scan, software prefetching on the linked list requires tuning with the cache block size, the cache capacity and the latency in order to determine the suitable prefetching distance.

5.2.2 B+-trees

We used B+-tree indexing to evaluate the equijoin query. The measurements are shown in Figure 7. The reported results were obtained when |R| = 200K, |S| = 32M and r = s = 8 bytes. |R| was much smaller than |S|, since we focused on the spatial locality of the B+-tree index. This setting is comparable to the previous studies [29, 30]. Since the tree index is static, we did not store pointers in its internal nodes and used implicit addressing as in CSS-trees [29]. In this implementation, the performance of our B+-trees at T1 was similar to that of CSS-trees.

On all platforms, T3 is the best among all variants; T0 is 20%–30% slower than T3. The reason for this is that the B+-tree at T0 has good spatial locality with the VEB layout using implicit addressing. In our experiments, the CO B+-tree is a binary tree. Each internal node is 4 bytes. Suppose an L2 cache line is 128 bytes (the cache line size on P4 or AMD); it can hold 32 nodes. Ignoring cache block alignment, a subtree of five levels can fit into one cache line. This good spatial locality of the VEB layout greatly reduces the cache misses of the index probes.

Comparing the performance gain of each level of tuning, we find that T2 has little performance impact on the B+-trees. Finally, the software prefetching technique, T3, considerably improves the overall performance (except on Ultra-Sparc). The performance improvement is due to (1) the software prefetching hiding the cache stalls, and (2) the small tree height. The performance improvement is 20%–30%, which is comparable to that shown in previous studies on simulators [15].

We investigated the cache performance of the index probes on P4. Figure 8 shows the time breakdown of the index probes at different levels of tuning. T1–T3 have a stable busy time, and T0 has a larger busy time than the other levels of tuning due to the larger amount of computation required by the VEB layout. T0–T2 have a similar cache performance. Among all levels of tuning, T3 has the best cache performance.

Figure 6: Linked list scan at the tuning level T3: varying the prefetching distance. Panels: (a) r = 8B on P4, (b) r = 64B on P4, (c) r = 8B on AMD, (d) r = 64B on AMD, (e) r = 8B on Ultra-Sparc, (f) r = 64B on Ultra-Sparc. Each panel plots time (ms) against the prefetching distance (number of cache lines).

5.2.3 NLJs

Figure 9 shows the time comparison for non-indexed nested-loop joins with different levels of tuning. The reported results were obtained when ||R|| = ||S|| = 32M bytes and r = s = 128 bytes (both relations have 256K tuples).

According to our cost model, T1 chooses the L2 cache as its target level, whereas T2 chooses the L1 cache as its target level. T1 is even slower than T0, because T1 chooses the incorrect target level in the absence of the knowledge of the latency. Note that the L1 cache is the most significant level of cache according to our cost function in Table 3. This is evidence that a higher level of tuning does not guarantee higher performance. We illustrate this result by comparing the performance of T1 when the target level of cache is the L2, the L1 or the TLB, as shown in Figure 10.

T3 applies software prefetching to both the outer and the inner relations in the join. The prefetching distance was set to one on all three platforms. The join performance improvement from software prefetching is insignificant, because the blocking technique has already achieved good cache locality.

5.2.4 Hash joins

Figure 11 shows the performance of hash joins when |R| = |S| = 8M and r = s = 8 bytes. Note that in Figure 11 (a) and (d), a prefetching distance of zero means that the result is obtained from the simple hash join (without prefetching). We do not show the results on Ultra-Sparc, because the performance impact of soft-


Figure 7: B+-trees. Panels: (a) P4, (b) AMD, (c) Ultra-Sparc. Each panel plots the elapsed time (ms) of T0, T1(L2), T2(L2) and T3(L2).

Figure 8: B+-trees: time breakdown on P4. The elapsed time (ms) of T0, T1(L2), T2(L2) and T3(L2) is broken down into TLB_DM, L1_DCM, L2_DCM and busy components.

ware prefetching is insignificant.

We summarize the results in two aspects. First, both partitioning and prefetching improve the performance of the hash join. The prefetching hash join achieves the best performance when the prefetching distance is 16 on P4 and 4 on AMD. The partitioned hash join achieves the best performance when the partition granularity is 4K and 16K tuples, respectively. The performance trend with a single technique is concave. Thus, the suitable settings of these techniques require tuning according to the cache capacity.

Second, the performance improvements of applying prefetching only, applying partitioning only and applying both techniques over the simple hash join are 40%, 47% and 30%, respectively, on P4, and 40%, 85% and 69%, respectively, on AMD. The cumulative performance improvement of the two techniques can be smaller than that of applying a single technique. This indicates that prefetching hurts the performance of the optimized partitioned hash join.

5.2.5 Summary

Through the four case studies, we observed the following three results. First, among the four levels of tuning in our framework, T3 utilizes the complete knowledge of a specific memory hierar-

Figure 9: Non-indexed nested-loop joins. Panels: (a) P4, (b) AMD, (c) Ultra-Sparc. Each panel plots the elapsed time (ms) of T0, T1(L2), T2(L1) and T3(L1).

Figure 10: Non-indexed nested-loop joins at T1. For each platform (P4, AMD, Ultra-Sparc), the elapsed time (ms) of T1 is shown with the L2, the L1 or the TLB as the target level.

chy and achieves the best performance on all four case studies except the hash join. Second, T1 and T2 in our framework do not necessarily improve the performance over T0, due to possibly ineffective tuning based on incomplete knowledge of the memory hierarchy. Third, T0 achieves a performance comparable to its higher levels of tuning: its execution time is less than twice that of the fine-tuned algorithm.

6. CONCLUSION

As the memory hierarchy becomes an important factor in the performance of database applications, it is imperative to improve the memory performance of relational query processing. Cache-oblivious techniques optimize all levels of any memory hierarchy without knowledge of the cache parameters of a specific memory hierarchy, whereas cache-conscious techniques can potentially achieve better performance with careful tuning based on the cache characteristics. Considering the strengths and weaknesses of both kinds of techniques, we propose a general framework to quantify the performance impact of different degrees of tuning. By studying several basic data structures and algorithms in query processing, we show that our framework is useful in this process of tuning.

As future work, we are interested in extending our framework to


Figure 11: Hash joins. Panels: (a) T3 (prefetching only) on P4, (b) T1 (also T2, partitioning only) on P4, (c) T3 (partitioning + prefetching) on P4, (d) T3 (prefetching only) on AMD, (e) T1 (also T2, partitioning only) on AMD, (f) T3 (partitioning + prefetching) on AMD. The prefetching-only panels plot build and probe time (sec) against the prefetching distance; the partitioned panels plot partition and join time (sec) against the partition size (K tuples).

the dynamic characteristics of the memory hierarchy on chip multiprocessors [21]. We are also interested in adapting cache-conscious algorithms to the runtime dynamics of the memory hierarchy based on hardware profiles.

7. ACKNOWLEDGEMENT

We thank the anonymous reviewers for their comments on earlier versions of this paper. This work was supported by grants DAG05/06.EG11, HKUST6263/04E, and 617206, all from the Hong Kong Research Grants Council.

8. REFERENCES

[1] A. Ailamaki. Database architectures for new hardware. In ICDE '05: Proceedings of the 21st International Conference on Data Engineering, page 1148, Washington, DC, USA, 2005. IEEE Computer Society.
[2] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving relations for cache performance. In VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases, pages 169–180, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[3] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a modern processor: Where does time go? In VLDB '99: Proceedings of the 25th International Conference on Very Large Data Bases, pages 266–277, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[4] AMD Corp. Software Optimization Guide for AMD64 Processors, 2005.
[5] M. A. Bender, E. D. Demaine, and M. Farach-Colton. Cache-oblivious B-trees. In FOCS '00: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, page 399, Washington, DC, USA, 2000. IEEE Computer Society.
[6] M. A. Bender, Z. Duan, J. Iacono, and J. Wu. A locality-preserving cache-oblivious dynamic dictionary. J. Algorithms, 53(2):115–136, 2004.
[7] M. A. Bender, M. Farach-Colton, and B. C. Kuszmaul. Cache-oblivious string B-trees. In PODS '06: Proceedings of the Twenty-fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 233–242, New York, NY, USA, 2006. ACM Press.
[8] R. Berrendorf, H. Ziegler, and B. Mohr. PCL: Performance Counter Library. http://www.fz-juelich.de/zam/PCL/, 2002.
[9] P. Bohannon, P. McIlroy, and R. Rastogi. Main-memory index structures with fixed-size partial keys. In SIGMOD '01: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 163–174, New York, NY, USA, 2001. ACM Press.
[10] P. A. Boncz, S. Manegold, and M. L. Kersten. Database architecture optimized for the new bottleneck: Memory access. In VLDB '99: Proceedings of the 25th International Conference on Very Large Data Bases, pages 54–65, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[11] G. S. Brodal and R. Fagerberg. Cache oblivious distribution sweeping. In ICALP '02: Proceedings of the 29th International Colloquium on Automata, Languages and Programming, pages 426–438, London, UK, 2002. Springer-Verlag.
[12] G. S. Brodal, R. Fagerberg, and K. Vinther. Engineering a cache-oblivious sorting algorithm. In ALENEX/ANALC, pages 4–17, 2004.
[13] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry. Improving hash join performance through prefetching. In ICDE '04: Proceedings of the 20th International Conference on Data Engineering, page 116, Washington, DC, USA, 2004. IEEE Computer Society.
[14] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry. Inspector joins. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases, pages 817–828. VLDB Endowment, 2005.
[15] S. Chen, P. B. Gibbons, and T. C. Mowry. Improving index performance through prefetching. SIGMOD Rec., 30(2):235–246, 2001.
[16] S. Chen, P. B. Gibbons, T. C. Mowry, and G. Valentin. Fractal prefetching B+-trees: optimizing both cache and disk performance. In SIGMOD '02: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 157–168, New York, NY, USA, 2002. ACM Press.
[17] T. M. Chilimbi, M. D. Hill, and J. R. Larus. Cache-conscious structure layout. In PLDI '99: Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, pages 1–12, New York, NY, USA, 1999. ACM Press.
[18] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, page 285, Washington, DC, USA, 1999. IEEE Computer Society.
[19] A. Ghoting, G. Buehrer, S. Parthasarathy, D. Kim, A. Nguyen, Y.-K. Chen, and P. Dubey. Cache-conscious frequent pattern mining on a modern processor. In VLDB, 2005.
[20] R. A. Hankins and J. M. Patel. Data morphing: An adaptive, cache-conscious storage technique. In VLDB, pages 417–428, 2003.
[21] N. Hardavellas, I. Pandis, R. Johnson, N. Mancheril, S. Harizopoulos, A. Ailamaki, and B. Falsafi. Database servers on chip multiprocessors: Limitations and opportunities. In CIDR '07: Proceedings of the Third International Conference on Innovative Data Systems Research, Asilomar, CA, USA, 2007.
[22] S. Harizopoulos and A. Ailamaki. Improving instruction cache performance in OLTP. ACM Trans. Database Syst., 31(3):887–920, 2006.
[23] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. QPipe: A simultaneously pipelined relational query engine. In SIGMOD Conference, pages 383–394, 2005.
[24] B. He and Q. Luo. Cache-oblivious nested-loop joins. In CIKM '06: Proceedings of the ACM Fifteenth Conference on Information and Knowledge Management, 2006.
[25] B. He and Q. Luo. Cache-oblivious query processing. In CIDR '07: Proceedings of the Third International Conference on Innovative Data Systems Research, Asilomar, CA, USA, 2007.
[26] Intel Corp. Intel(R) Itanium(R) 2 Processor Reference Manual for Software Development and Optimization, 2004.
[27] K. Kim, S. K. Cha, and K. Kwon. Optimizing multidimensional index trees for main memory access. SIGMOD Rec., 30(2):139–150, 2001.
[28] S. Manegold, P. Boncz, and M. Kersten. Optimizing main-memory join on modern hardware. IEEE Transactions on Knowledge and Data Engineering, 14(4):709–730, 2002.
[29] J. Rao and K. A. Ross. Cache conscious indexing for decision-support in main memory. In VLDB '99: Proceedings of the 25th International Conference on Very Large Data Bases, pages 78–89, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[30] J. Rao and K. A. Ross. Making B+-trees cache conscious in main memory. In SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 475–486, New York, NY, USA, 2000. ACM Press.
[31] A. Shatdal, C. Kant, and J. F. Naughton. Cache conscious algorithms for relational query processing. In VLDB '94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 510–521, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
[32] Sun Corp. UltraSPARC(R) III Cu User's Manual, 1997.
[33] J. Zhou and K. A. Ross. Buffering accesses to memory-resident index structures. In VLDB '03: Proceedings of the 29th International Conference on Very Large Data Bases, pages 405–416, 2003.


Vectorized Data Processing on the Cell Broadband Engine

Sándor Héman    Niels Nes    Marcin Zukowski    Peter Boncz

CWI, Kruislaan 413, Amsterdam, The Netherlands

{Firstname.Lastname}@cwi.nl

ABSTRACT

In this work, we research the suitability of the Cell Broadband Engine for database processing. We start by outlining the main architectural features of Cell and use micro-benchmarks to characterize the latency and throughput of its memory infrastructure. Then, we discuss the challenges of porting RDBMS software to Cell: (i) all computations need to be SIMD-ized, (ii) all performance-critical branches need to be eliminated, (iii) a very small and hard limit on program code size should be respected.

While we argue that conventional database implementations, i.e. row-stores with Volcano-style tuple pipelining, are a hard fit to Cell, it turns out that the three challenges are quite easily met in databases that use column-wise processing. We managed to implement a proof-of-concept port of the vectorized query processing model of MonetDB/X100 on Cell by running the operator pipeline on the PowerPC, but having it execute the vectorized primitives (data parallel) on its SPE cores. A performance evaluation on TPC-H Q1 shows that vectorized query processing on Cell can beat conventional PowerPC and Itanium2 CPUs by a factor of 20.

1. INTRODUCTION

The Cell Broadband Engine [9] is a new heterogeneous multi-core CPU architecture that combines a traditional PowerPC core with multiple mini-cores (SPEs) that have limited but SIMD- and stream-optimized functionality. Cell is produced in volume for the Sony Playstation3, and is also sold in blades by IBM for high-performance computing applications (we used both incarnations). The Playstation3 Cell runs at 3.2GHz and offers 6 SPEs, providing a computational power of 6 x 25.6 = 154 GFLOPs, which compares favorably to “classical” contemporary CPUs, which provide up to 10 GFLOPs.

In this paper, we research the suitability of the Cell Broadband Engine for database processing. This is especially interesting for highly compute-intensive analysis applications, like data warehousing, OLAP and data mining. Therefore,


Figure 1: Cell Broadband Engine Architecture

we use the TPC-H data warehousing benchmark to evaluate the efficiency of running database software on Cell.

There turned out to be three main challenges in porting RDBMS software to Cell:

(i) all computations need to be SIMD-ized, as the SPEs support SIMD instructions only. While there has been work on using SIMD in database systems [15], this work needs to be extended to enable full database operation on Cell. We address this need partly in Section 4, by contributing a new method to process grouped aggregates (i.e. SELECT .. GROUP BY) using SIMD instructions.

(ii) all performance-critical if-then-else branches need to be eliminated, as SPEs combine a high branch penalty with a lack of branch prediction. Some database processing techniques, such as buffered execution of relational operators [16] and predicated selection [13], but also vectorized execution [3], can be used to reduce the impact of these branch misses.

(iii) there is a very small yet hard limit on program code size, as in each SPE data plus code should not exceed 256KB. It turned out to be impossible to run “conventional” database engines – such as Postgres – on the SPEs, as their code measures MBs rather than KBs. The code size challenge implies that on Cell, a database system must not only manage the data cache, but also its own instruction cache!

We report here on our initial experiences of porting the vectorized query processing model of MonetDB/X100 [3] to Cell. This port uses the PowerPC to run the relational operator pipeline, but executes its data-intensive vectorized primitives data-parallel on the SPEs, using a small run-time system that manages the transfer of data and instructions. At the time of this writing, the port consists only of this run-time system, together with the operators and primitives needed by TPC-H Query 1.

Outline & Contributions. Section 2 summarizes the Cell architecture, and characterizes its programmable DMA memory infrastructure using micro-benchmarks. In Section 3 we discuss how various DBMS software architectures


Figure 2: DMA read bandwidth micro-benchmarks (logarithmic scale). Panels: (a) DMA read bandwidth as a function of bytes per transfer, for a varying number of SPEs; (b) List-DMA read bandwidth as a function of bytes per transfer element, for varying DMA-list lengths (1 SPE); (c) List-DMA read bandwidth as a function of bytes per transfer element, for a varying number of SPEs (list length 128). Each panel plots bandwidth (GB/s) against the read granularity (bytes).

could be mapped to Cell hardware. We cover three main processing models: classical Volcano-style NSM tuple pipelining, column-wise materialization (MonetDB) and vectorized query execution (MonetDB/X100). Section 4 shows how various vectorized relational database operators can be implemented using SIMD instructions. Experiments with TPC-H Query 1 show that the vectorized query processing used in MonetDB/X100 can be a factor of 20 faster on Cell than on contemporary CPU architectures. Wrapping up, we discuss related work in Section 5 and conclude in Section 6.

2. CELL ARCHITECTURE

Figure 1 shows a diagram of the Cell architecture. To the left is the PowerPC Processor Element (PPE), which is a general-purpose CPU, good at executing control-intensive code such as operating systems and application logic. The remaining eight cores are equivalent Synergistic Processing Elements (SPEs). The SPEs are optimized for compute-intensive tasks, and operate independently from the PPE. However, they do depend on the PPE to run an operating system and, in most cases, the main thread of an application. The SPEs and PPE are connected using a 128-byte Element Interconnect Bus (EIB) that is connected to a 2-channel memory controller, with each channel being able to deliver 12.6GB/s of data, resulting in a theoretical maximum memory bandwidth of 25.2GB/s.

Our main experimentation platform, a Sony Playstation 3 (PS3) game console, differs slightly from this architecture in that it has two of the eight SPEs disabled. The PS3 contains a Cell processor running at 3.2GHz and 256MB RAM, and runs the Linux operating system.

The SPE is an independent processor that runs threads spawned by the PPE. It consists of a processing core, the Synergistic Processing Unit (SPU), a Memory Flow Controller (MFC), and a 256KB local storage memory area (LS) that must keep both data and code. There is no instruction cache, which implies that code must fit in the LS. Although SPEs share the effective address (EA) space of the PPE, they cannot access main memory directly. All data an SPE wishes to operate on needs to be explicitly loaded into the LS by means of DMA transfers. Once the data is in the LS, the SPU can use it by explicitly loading it into one of its 128 128-bit registers. The SPE instruction set differs from the PowerPC instruction set and consists of 128-bit SIMD instructions only, of which it can execute two per clock cycle. The SPEs are designed for high frequency (with a pipeline depth of 18), but lack branch prediction logic. While it is possible to provide explicit branch hints, if this is not done a branch costs 20 cycles, which implies that branches should be avoided in performance-critical code paths.
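Because every unhinted branch costs around 20 cycles, per-tuple if-then-else control flow is replaced by predication. The plain-C sketch below shows the idea for a selection primitive, in the spirit of predicated selection [13] and vectorized execution [3]; on the SPEs the same pattern would be expressed with the 128-bit SIMD compare and select instructions. The column layout and names are illustrative.

    #include <stddef.h>
    #include <stdint.h>

    /* Branch-free (predicated) selection: instead of branching per tuple,
     * the comparison result is used as an arithmetic value (0 or 1) that
     * advances the output cursor only for qualifying tuples. */
    size_t select_lt(const int32_t *col, size_t n, int32_t bound, uint32_t *out_idx)
    {
        size_t k = 0;
        for (size_t i = 0; i < n; i++) {
            out_idx[k] = (uint32_t)i;        /* always write the candidate   */
            k += (col[i] < bound);           /* keep it only if it qualifies */
        }
        return k;                            /* number of selected tuples    */
    }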

Summing up, the SPE architecture requires careful engineering by the programmer to ensure that efficient branch-free SIMD code is generated, and to use parallel algorithms to exploit all SPEs, while keeping a careful eye on code size or even employing some dynamic code management scheme (see Section 2.2).

DMA Engine. An interesting aspect of SPE programming is the DMA engine it exposes. The explicit memory access programming it enforces poses some extra work for the application developer, compared to normal cache-based memory access. Explicit memory access can, however, be an advantage for database software, as it provides full control of data placement and transfer, such that advance knowledge of data access patterns can be exploited. Previous work on data management using cache-less architectures has demonstrated that this is quite workable [6, 4]. The DMA engine allows requesting multiple (at most 256) memory blocks in one go (“List-DMA”). The practical minimal transfer unit (and alignment unit) is 128 bytes, while the maximum is 16KB. Each DMA transfer moves data between main memory and the LS in an asynchronous fashion, supporting both a polling and a signaling programming model.
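To make the programming model concrete, the following is a minimal sketch of a blocking SPE-side read of one block from main memory into the LS. It assumes the spu_mfcio.h interface of the Cell SDK (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all) as we recall it; the buffer size and the effective-address parameter are illustrative.

    #include <spu_mfcio.h>   /* Cell SDK MFC intrinsics (assumed available) */
    #include <stdint.h>

    #define CHUNK 2048       /* >= 2KB transfers get close to peak bandwidth */

    /* LS buffer; DMA addresses must be 128-byte aligned. */
    static char buf[CHUNK] __attribute__((aligned(128)));

    /* Fetch CHUNK bytes starting at effective address `ea` into the LS and
     * wait for completion.  A real kernel would double-buffer, i.e. overlap
     * the DMA of block i+1 with the processing of block i. */
    void fetch_block(uint64_t ea)
    {
        const unsigned tag = 0;               /* DMA tag group 0..31 */
        mfc_get(buf, ea, CHUNK, tag, 0, 0);   /* enqueue the transfer */
        mfc_write_tag_mask(1 << tag);         /* select tag group     */
        mfc_read_tag_status_all();            /* block until done     */
    }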

2.1 Memory Micro-Benchmarks

Potentially, the List-DMA feature allows for efficient scatter-gather algorithms that gather input data from a large number of random memory locations or scatter data output over a series of random locations. On normal cached memory architectures, such algorithms perform badly if the randomly accessed memory range exceeds the cache size, even if all cache lines are used fully. The reason is that optimum memory bandwidth is only achieved when sequential access triggers built-in hardware memory prefetching (e.g. a memory latency of 100ns and a cache line size of 64 bytes produce 640MB/s of random throughput, while sequential bandwidth on modern PCs gets to 4GB/s with prefetching). An example of a gather algorithm is hash join. For such algorithms, it is currently beneficial to perform additional partitioning steps (a scatter operation) to first make the randomly accessed range fit the CPU cache [10].

We use micro-benchmarks to investigate whether the Cell DMA engine offers alternative ways of expressing data-intensive algorithms (e.g. is cache-, or rather LS-partitioning, required


at all for hash-based algorithms?).

Sequential Access. We conducted a micro-benchmark where we iteratively transfer a large region of main memory into the LS, in consecutive DMA transfers of x bytes, for varying x. Figure 2(a) shows that DMA latency dominates at small transfer sizes, but good bandwidth is achieved with memory blocks ≥ 2KB, giving one SPE a maximum of 6GB/s (these sequential accesses use only a single memory channel). If multiple SPEs perform the same micro-benchmark, we observe that already around 1KB transfers the SPEs fight for bandwidth. When all SPEs demand large sequential blocks simultaneously, we achieve a total of 20GB/s memory read bandwidth, relatively close to the theoretical maximum of 25.6GB/s.

List-DMA allows a list of (size, effective address) pairs to be instantiated in the LS and passed to the MFC for processing in one go. Figure 2(b) shows the single-SPE bandwidth as a function of the transfer size per list element, where each list element reads from a random, 128-bit aligned location. The linear bandwidth increase up to 128-byte transfers is simply caused by the fact that all data transfers over the EIB have a minimum 128-byte granularity. For transfer sizes below 128 bytes, one is simply wasting bandwidth. We also see that beyond that point, list-DMA bandwidth continues to improve with larger transfer sizes and approaches 10GB/s, surpassing the 6GB/s achieved with sequential access. The reason is that these random transfers use both memory channels. The figure furthermore shows that increasing the DMA list length keeps improving performance.

Figure 2(c) shows that when more than one SPE performs scatter/gather DMA at the same time, the bus again gets saturated, achieving a peak of 22GB/s memory bandwidth already at 128-byte transfer sizes.

The fact that random List-DMA is able to achieve high bandwidth indicates that Cell algorithms may indeed forgo LS-partitioning and work in a scatter/gather fashion directly on RAM. However, three caveats apply. First, it is essential that algorithms use a 128-byte granularity for memory access, thus making full use of the EIB “cache lines” to avoid bandwidth waste. Secondly, as SPEs should operate in parallel, the per-SPE usable throughput is limited to roughly 3GB/s (i.e. 1 byte per cycle). Given the 2-per-cycle throughput of the SIMD instructions, which may process two 16-byte inputs, there is a distinct danger of becoming LS bandwidth-bound (16 bytes/cycle max). Finally, as the LS is not a coherent cache, RAM-based scatter/gather algorithms must explicitly prevent the same memory locations from being updated multiple times by the same DMA-List command, which can significantly increase their complexity.

2.2 Code Management

As the SPE local storage (LS) is limited to 256KB and needs to be shared between code and data, major software products such as database systems simply will not fit. The Octopiler research compiler, being developed by IBM [5], tries to hide code size limitations by automatically partitioning code into small enough chunks. It embeds in each SPE program a small runtime system called the partition manager, and translates calls to functions outside the current partition into calls into the partition manager. At runtime, when called, the partition manager brings in the desired partition – swapping out the current one – using a DMA data transfer, and then calls into the newly loaded function. Additionally, this compiler promises a 32KB software cache that allows transparent access to RAM-resident arrays with a 12-instruction latency, among many other advanced features. Regrettably, however, these features are not yet available in the IBM compilers currently distributed. Thus, code partitioning on Cell remains a programmer responsibility, so we discuss various ways to do this.

Separate Binaries. Each SPE can only run one single-threaded program at a given time, and no operating system code runs in between. Such a program is spawned by the PPE as an SPE thread, which transfers the SPE binary to any number of SPEs and runs it to completion. The simplest approach to partitioning is thus to explicitly compile the program into separate binaries and let these be spawned as SPE threads by the PPE whenever appropriate. Regrettably, this approach is slow, taking approximately 3,804,000 cycles (1.2ms) on average per SPE, so it is only viable when one partition will run for at least 100 milliseconds (in which time 300MB of data should be processed). A second disadvantage is that it is static: the partitions need to be defined at system compile-time, making it hard to, e.g., adapt to the needs of a run-time query plan.
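As an illustration of the PPE-side spawning described here, a sketch using the libspe2 interface (this is an assumption about the SDK interface, not code from the paper); error handling and fan-out to multiple SPEs are omitted:

    #include <libspe2.h>

    extern spe_program_handle_t my_partition;   /* one separately compiled SPE binary */

    /* Spawn an SPE thread that runs the given partition to completion. */
    int run_partition(void *argp)
    {
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        spe_program_load(ctx, &my_partition);              /* transfer the SPE binary */
        spe_context_run(ctx, &entry, 0, argp, NULL, NULL); /* blocks until the SPE stops */
        spe_context_destroy(ctx);
        return 0;
    }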

Overlays. While the IBM compiler does not yet deliver the advanced features it promises, it does allow for code overlays: small libraries of SPE code that are compiled into the main binary, but do not get loaded into the SPE upon thread creation. Only when a function from a certain overlay is called is the overlay loaded into the SPE at run-time, and the function executed. At roughly 775 cycles, this approach is much faster than separate binaries, but it is still static (and even if the overlay is already loaded, the function call overhead is still a hefty 236 cycles). Also, overlay technology depends on inter-procedural analysis and thus cannot deal with late binding and function de-referencing, which are often used to implement DBMS execution engines.

Manual Loading. It is perfectly possible for an SPE program to issue a DMA memory transfer and, upon completion, call into that location. This makes it a viable strategy for a database system to add code management to the list of database (optimization) tasks. Like the Octopiler partition manager, each SPE could run a small runtime system that waits for code and data requests from the PPE. When a request comes in, it loads the data and code (if not already present in the LS) and executes the required operation. The overhead of this approach is similar to the overlay approach. A limitation is that the code snippets that get transferred need to be stand-alone functions (i.e., functions that do not rely on relocation and/or call other functions). The advantage of manual loading is that, unlike overlays, it is able to deal with late binding – as said, an important feature for porting database systems. In our database experiments on Cell, we therefore used the manual loading approach to code management.
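A heavily simplified sketch of the SPE side of such manual loading: DMA a stand-alone, relocation-free code blob from RAM into an LS buffer and branch into it through a function pointer (the buffer size and names are illustrative; a real runtime would also handle data transfers and completion signalling to the PPE):

    #include <spu_mfcio.h>

    #define CODE_BUF_SIZE (16 * 1024)
    static char code_buf[CODE_BUF_SIZE] __attribute__((aligned(128)));

    typedef void (*primitive_fn)(void *args);

    /* Fetch 'size' bytes of stand-alone SPE code from effective address 'ea'
     * into the LS and execute it with the given argument block. */
    void load_and_run(unsigned long long ea, unsigned int size, void *args)
    {
        mfc_get(code_buf, ea, size, 0, 0, 0);
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();         /* wait for the code transfer */
        /* note: a synchronization instruction may be needed before executing
         * freshly written LS contents */
        ((primitive_fn) code_buf)(args);   /* branch into the loaded function */
    }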

3. DBMS ARCHITECTURE ON CELL

3.1 Classical NSM Tuple Pipelining

Conventional relational database architecture uses a disk-based storage manager with an NSM layout. Query execution uses a Volcano-style [7] iterator class hierarchy, where all relational operators (Scan, Select, Join, Aggregation) are instantiated as objects that implement an open(), next(), close() method interface. A full query plan is a tree of such objects, and the result is generated by calling next() on the root of the tree, which pulls data up by calling next() on its children (and so on), finally producing a single result tuple. This is repeated until no more tuples are returned.
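For concreteness, a minimal rendering of such an iterator interface in C; the names are generic and not taken from any particular engine:

    typedef struct tuple tuple;            /* opaque tuple representation */

    typedef struct op {
        void   (*open)(struct op *self);
        tuple *(*next)(struct op *self);   /* returns NULL when exhausted */
        void   (*close)(struct op *self);
        struct op *child;                  /* e.g. a Select pulls from its child Scan */
        void   *state;                     /* operator-private state */
    } op;

    /* Driving a plan: pull result tuples from the root until the stream ends. */
    void run_plan(op *root)
    {
        tuple *t;
        root->open(root);
        while ((t = root->next(root)) != NULL)
            ;   /* consume t */
        root->close(root);
    }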

Conventional relational database systems have a large code base that does not fit the 256KB LS; we therefore think it is an absolute necessity that Cell compilers support automatic code partitioning (see Section 2.2). As this is not yet the case, we did not attempt to port a real system.

Even if code partitioning were available, the cost of crossing code partitions could be a problem. It may be possible to overcome this problem by inserting Buffer() operators [17] in the query plan, which force the next() method of their child operator to be called many times, buffering the results, before passing them up higher in the pipeline.

Finally, Cell may be an interesting platform for compiled query execution, where a query plan generator emits (C/C++) program code that is compiled Just-in-Time (JIT). It has been shown that query-specific code generation can be more efficient than a query interpreter [12, 3]. On Cell, an additional advantage is that the generated binary is much smaller than the full DBMS, and likely fits in the LS.
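To make this concrete, a hypothetical fragment of the kind of query-specific C code a plan generator might emit for a selection-plus-sum step; the column names and shapes are illustrative only, not MonetDB/X100 output:

    /* Generated for: sum(l_extendedprice) where l_shipdate <= :cutoff.
     * Column arrays and their length n are bound by the generator. */
    double q_sum_extprice(const int *l_shipdate, const double *l_extendedprice,
                          int n, int cutoff)
    {
        double sum = 0.0;
        int i;
        for (i = 0; i < n; i++)
            if (l_shipdate[i] <= cutoff)
                sum += l_extendedprice[i];
        return sum;
    }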

As the query interpreter is an important cause of the large number of if-then-else branches in database code, JIT compilation likely reduces the effect of costly branches on Cell. Data-related branches can be addressed using predicated programming techniques [13]; the SIMD instruction set of Cell also provides explicit intrinsics for predication.

Finally, it is known that a wide range of database operations can be accelerated with SIMD instructions [15]. As SIMD instructions apply an operation to X consecutive values from the same column, as a preparation X NSM horizontal records need to be packed vertically to put them into a SIMD register. It has been shown that on-the-fly conversion of horizontal (NSM) layout into small vertical arrays can improve performance [11]. However, the memory-intensive work of navigating through an NSM disk block, following data layout offsets to gather column data, is not a strength of the SPEs. As mentioned, a vertical (DSM) storage scheme, as used in MonetDB, stores the data in a SIMD-friendly vertical layout upfront, avoiding the need for packing. Note that packing can also be avoided by using column-wise storage only within a disk block (i.e., PAX [1]).
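A minimal illustration in plain C of the packing step this refers to: gathering one attribute of X = 4 consecutive NSM records into a small vertical array that can then be loaded as one SIMD vector (the record layout is hypothetical):

    /* Hypothetical NSM record; in a real block, field offsets come from the
     * page's layout information. */
    struct nsm_rec { int quantity; float extprice; float discount; };

    /* Pack the extprice attribute of 4 consecutive records into a contiguous,
     * 16-byte array, ready to be used as a single SIMD register. */
    void pack4_extprice(const struct nsm_rec *recs, float out[4])
    {
        int i;
        for (i = 0; i < 4; i++)
            out[i] = recs[i].extprice;
    }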

3.2 MonetDB: Column Materialization

MonetDB is an open-source DBMS using vertically fragmented storage (DSM), supporting both SQL and XQuery (see monetdb.cwi.nl). Internally, it implements a physical column algebra [2] using a column-wise materialization strategy, which means that each operator reads one or more input columns, represented as contiguous arrays in RAM, and stores the output column back in RAM. MonetDB can be ported to Cell by running its algebra interpreter on the PPE and having it execute the column operations (data parallel) on the SPEs.

Arguably, column-wise processing overcomes all three main Cell efficiency challenges (full SIMD-ization, branch elimination, instruction cache management): (i) column algebras carry out one basic action on all values of a column sequentially, and thus have high instruction locality, which makes explicit SPE instruction cache management easy; (ii) column-wise storage yields array-loop-intensive code patterns that can often be compiled into SIMD instructions automatically; (iii) finally, column-wise execution lessens the performance impact of branches caused by the query algebra interpreter, as interpretation decisions are made for whole columns rather than per tuple.

Figure 3: MonetDB/X100 on Cell. [The figure shows TPC-H Q1 as an operator tree (Scan, Select, Project, Aggregate) over the lineitem columns (shipdate, returnflag, linestatus, quantity, extprice, discount, tax); operators process sets of tuples represented as aligned vectors, while primitives such as map_select_lte, map_hash, map_sub, map_add, map_mul, map_aggr_sum_col_vec_flt4 and selpack_4flt_cols process entire vectors at a time; vectors contain multiple values of a single attribute, fit in the SPE LS, and move between RAM and the LS via DMA transfers.]

The full materialization strategy of MonetDB causes problems, however, if queries produce substantial intermediate results. In the case of our example query, TPC-H Q1, this indeed occurs, as the query starts with a selection that keeps 95% of the tuples. In the case of Cell, we will see in Section 4.1 that this causes the SPEs to generate huge DMA traffic, such that performance becomes bus limited.

3.3 MonetDB/X100: Vectorized Processing

In contrast to MonetDB, the MonetDB/X100 system [3] allows for Volcano-style pipelining (avoiding materialization of intermediates). For disk storage, it can use both horizontal (PAX) and vertical (DSM) storage. Figure 3 shows an operator tree being evaluated within MonetDB/X100 in a pipelined fashion, using the traditional open(), next(), close() interface. However, each next() call within MonetDB/X100 does not return a single tuple, as is the case in most conventional DBMSs, but a collection of vectors, with each vector containing a small horizontal slice of a single column. Vectorization of the iterator pipeline allows MonetDB/X100 primitives, which are responsible for computing core functionality such as addition and multiplication, to be implemented as simple loops over vectors. This results in function call overheads being amortized over a full vector of values instead of a single tuple, and allows compilers to produce data-parallel code that can be executed efficiently on modern CPUs. The vector size is configurable (typically 100-1000) and should be tuned such that all vectors needed for a query fit in the CPU cache (or, for Cell, the LS).

Cell Port. The vectorized query processing model of MonetDB/X100 can be mapped onto Cell by running the relational operator pipeline (Scan, Select, Aggregation, etc.) on the PPE. When a next() method needs to compute primitives, it sends a primitive request to all SPEs (data parallel on the vectors). Double buffering can be applied by executing a request only when a subsequent request is issued, using the time in between to initiate DMA for code and data (if needed). As for code loading, we applied the manual loading approach described in Section 2.2 to load MonetDB/X100 vectorized primitives on demand onto the SPEs.

In vectorized query processing, rather than writing all intermediate results to RAM, the result vectors coming out of vectorized primitives are kept in the local SPE memories, for use as input to the next operation. The main database system running on the PPE allocates vector registers to the primitive function inputs and outputs found in the query plan. These vector registers are symbolic representations of memory areas in the LS. At plan generation time, the total number of vectors and their types are known, so a suitable vector size and vector register allocation can be chosen.

The Cell port of MonetDB/X100 is currently in a proof-of-concept stage. For the experiments presented in the next section, we hand-coded the PPE query plan and register allocation, and ported only the MonetDB/X100 primitives we needed to the SPE. We also re-used this vectorized query processing code to emulate full column materialization (MonetDB), using an alternative plan that writes out each primitive output into a RAM-resident result column.

4. VECTORIZED SIMD PROCESSING

Projection. To integrate SIMD processing into a database kernel, it is advisable to let the compiler do as much of the work as possible. Implementing database operators as branch- and dependency-free loops is crucial to make that possible. In MonetDB/X100, this holds automatically for most projection-related primitives, which are simple loops over vectors of the form:

    for (i = 0; i < n; i++)
        res[i] = input1[i] OP input2[i];

If we take, for example, addition of two floating-point vectors, this gets compiled as if the code were explicitly SIMD-ized as follows (handling of trailing tuples left out):

    vector float *input1, *input2, *res;
    for (i = 0; i < (n/4); i++)
        res[i] = spu_add(input1[i], input2[i]);

This adds four pairs of floats in parallel, obtaining a throughput of 1 tuple per SPE cycle.

Selection. To be able to exploit SIMD, the input vector arrays need to be aligned and organized sequentially in memory, so that multiple data items can be loaded into a SIMD register. The constraint of sequential input data, however, is violated by the way X100 originally implements selections, namely by passing an optional selection vector to primitives: an integer array containing the offsets of those tuples within the vector that pass the selection. This gives primitives of the form:

    for (j = 0; j < n; j++) {
        int i = sel[j];
        res[i] = input1[i] OP input2[i];
    }

which decreases Cell throughput by a factor of 20 (in the case of a 100% selection). To avoid this, instead of a positional selection vector we decided to use bit-mask selection vectors on Cell, which are aligned to the data vectors and contain bit-masks of either zeros or ones for non-selected and selected tuples, respectively. This has the advantage that code remains SIMD-izable, as non-selected tuples can be quickly masked out whenever needed. A disadvantage is that non-selected tuples are still being processed and occupy space in the LS. Thus, if the selectivity is high, it may be better to compact the vectors, making them densely populated again.
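A minimal sketch of how such a bit-mask selection vector could be applied inside a projection primitive, assuming SPU intrinsics (spu_intrinsics.h) and full-width lane masks (all-ones for selected tuples, all-zeros otherwise); the primitive name and the choice of zeroing non-selected lanes are illustrative, not taken from MonetDB/X100:

    #include <spu_intrinsics.h>

    /* Add two float columns, keeping results only in the selected lanes. */
    void map_add_flt_sel(vector float *res, const vector float *a,
                         const vector float *b,
                         const vector unsigned int *mask, int n)
    {
        int i;
        for (i = 0; i < n / 4; i++) {
            vector float sum = spu_add(a[i], b[i]);
            /* take 'sum' where mask bits are 1, zero elsewhere; the loop stays
             * branch- and dependency-free, so it remains SIMD-friendly */
            res[i] = spu_sel(spu_splats(0.0f), sum, mask[i]);
        }
    }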

SOA vs AOS. The representation used so far, of horizontally aligned vectors, is called Structure of Arrays (SOA). SIMD operations can also be applied using an Array of Structures (AOS) layout. These SIMD data layout concepts roughly correspond to column-wise versus row-wise database storage. For some operations, the AOS representation is more convenient. One example is the compaction of selections, which introduces a data dependency that limits SPE throughput severely. While paying this cost of selection, it is thus better to compact multiple SOA input columns at once using SIMD, producing one AOS output. An example of such a reorganization is shown in Figure 4, where four input vectors are combined into a single quadword vector, with each quadword containing a value from each of the four input columns, throwing away non-selected tuples in the process. Currently, we only consider packing data of the same type together. For each supported data type we can have a pre-generated selpack primitive, but also a pack (a version without selection) and an unpack. The query optimizer should decide the proper data layout (selection/dense, AOS/SOA).

Figure 4: selpack SOA into AOS with selection. [The figure shows four input vectors a, b, c, d of eight values each and a selection bit-vector; the values at the selected positions (a1, b1, c1, d1 and a6, b6, c6, d6) are packed into SIMD quadwords of the result.]

Hash Aggregation. For the mapping of database operations to SIMD instructions, we build on [15], which described projections, selections, joins, and index lookup. Still missing from this list were grouped aggregates, which form an important part of our example query (TPC-H Q1). Grouped aggregates can be SIMD-ized if multiple aggregates of the same type need to be computed. In the case of TPC-H Q1 (see Figure 3), there are four floating-point SUMs, one integer SUM, and one COUNT (which can be treated as an integer SUM of a constant column filled with ones).

In this case, we can pack the four float columns to be aggregated into a quadword vector (values) in AOS layout, using the selpack operation, and do the following:

    vector int *grp;
    vector float *values;
    for (i = 0; i < n; i++) {
        int id = si_to_int(grp[i]);
        aggr[id] = spu_add(aggr[id], values[i]);
    }

This updates four aggregate results in parallel. Here we assume that an aggregate group-ID has previously been computed and is available in the AOS int vector grp. For space reasons, we omit a detailed discussion of SIMD hashing here; SIMD-based cuckoo hashing on Cell is discussed in [14].


Table 1: TPC-H Q1, average elapsed time in msec (SF=1); columns give the evaluation strategy.

    Platform                  column at-a-time   vector at-a-time   vector SIMD
    Itanium2 1.3GHz                3400                 311             n.a.
    PPE + 1 SPE (3.2GHz)            493                 459               95
    PPE + 2 SPEs (3.2GHz)           280                 229               47
    PPE + 3 SPEs (3.2GHz)           202                 153               32
    PPE + 4 SPEs (3.2GHz)           178                 115               24
    PPE + 5 SPEs (3.2GHz)           142                  92               19
    PPE + 6 SPEs (3.2GHz)           129                  77               16

4.1 Evaluation: TPC-H Q1

Table 1 lists the initial results of our Cell experiments with Query 1 on the SF-1 TPC-H dataset (6M lineitem tuples, RAM resident). This query is a good measure of the computational power of a database system, as it is a simple Select-Project-Aggregate query that consumes a large input table, produces almost no output, and performs quite a few computations. Our main result is the "vector SIMD" column, which shows the (almost perfect) parallel scaling of our SIMD implementation of vectorized query processing. For comparison, we reproduce results from [3] obtained on a 1.3GHz Itanium2: Cell is 20 times faster than MonetDB/X100 on Itanium2 (16 vs 311 msec). An important requirement for obtaining such a speedup is proper use of SIMD-friendly code: just compiling the standard MonetDB/X100 primitives for the Cell ("vector at-a-time") is 5 times slower, but still beats the same code on Itanium2 by a factor of 4. While the original MonetDB strategy of full materialization (which causes huge memory traffic on Q1) brings the Itanium2 memory subsystem to its knees (3.4 sec), the 25.6GB/s Cell memory infrastructure holds up well (129 msec), though scaling is sub-linear. From this data, we speculate that massive scatter/gather algorithms on Cell are likely to yield bandwidth-bound results. Such results might be acceptable, but are certainly not optimal (129 vs 16 msec here).

5. RELATED WORK

The only paper that explicitly touches upon Cell in the context of data management (hashing, in this case) is [14], where the computational power of the SPEs is used for quick hash function computation. Our work builds strongly on [15], which describes the applicability of SIMD instructions to database workloads. We extend this work by proposing the use of an Array-of-Structures (AOS) data layout to perform grouped aggregation in a SIMD-ized fashion as well. Database workloads on a network processor, which, similar to the Cell SPEs, lacks a hardware cache, are analyzed in [6]. Both [17] and [8] try to improve instruction-cache reuse, by reusing its contents on buffered data from within the same query, or on data from other queries, respectively. Work on automatic SPE code partitioning and management is conducted by the IBM compiler team [5]. As this compiler is not yet available, we created our own code management runtime for our Cell database system.

6. CONCLUSION & FUTURE WORK

In this paper we have taken a sneak preview into a possible future of query processing on heterogeneous multi-core CPUs, by using the Cell for database purposes. We made a case for column-wise query processing on Cell, as it reduces the branchiness of code, allows for better instruction locality, and produces code that is amenable to efficient (and sometimes even automatic) SIMD translation. However, we have also shown that care needs to be taken not to materialize intermediate results in main memory, to avoid bus contention. These ideas correspond to the vectorized query processing model used in MonetDB/X100, parts of which were ported to enable these experiments. However, the default MonetDB/X100 primitive functions turned out to yield suboptimal SIMD translations on the SPEs. We added support for an AOS (Array of Structures) vector data layout, which allowed us to better SIMD-ize the selection and aggregation primitives.

We experimented with a limited set of operators here, but we believe that, with careful engineering of parallel algorithms, more complex operators such as joins and aggregations that exceed LS capacity can benefit from the exceptional computational power of Cell as well. So far, we have only run main-memory-resident queries. Given its enormous throughput, it is an interesting question whether Cell can be kept in balance with secondary storage when processing data beyond main memory.

7. REFERENCES

[1] A. Ailamaki, D. DeWitt, M. Hill, and M. Skounakis. Weaving Relations for Cache Performance. In Proc. VLDB, 2001.
[2] P. Boncz and M. Kersten. MIL primitives for querying a fragmented world. VLDB Journal, 8(2):101–119, 1999.
[3] P. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-Pipelining Query Execution. In Proc. CIDR, 2005.
[4] J. Cieslewicz, J. W. Berry, B. Hendrickson, and K. A. Ross. Realizing parallelism in database operations: insights from a massively multithreaded architecture. In DaMoN, 2006.
[5] A. E. Eichenberger et al. Using Advanced Compiler Technology to Exploit the Performance of the Cell Broadband Engine Architecture. IBM Systems Journal, 45(1):59–84.
[6] B. T. Gold, A. Ailamaki, L. Huston, and B. Falsafi. Accelerating database operations using a network processor. In DaMoN, 2005.
[7] G. Graefe. Volcano - an extensible and parallel query evaluation system. IEEE TKDE, 6(1):120–135, 1994.
[8] S. Harizopoulos and A. Ailamaki. STEPS Towards Cache-Resident Transaction Processing. In Proc. VLDB, 2004.
[9] IBM Corporation. Cell Broadband Engine Programming Handbook, 2006.
[10] S. Manegold, P. Boncz, N. Nes, and M. Kersten. Cache-Conscious Radix-Decluster Projections. In Proc. VLDB, Toronto, Canada, 2004.
[11] S. Padmanabhan, T. Malkemus, R. C. Agarwal, and A. Jhingran. Block oriented processing of relational database operations in modern computer architectures. In Proc. ICDE, 2001.
[12] J. Rao, H. Pirahesh, C. Mohan, and G. M. Lohman. Compiled Query Execution Engine using JVM. In Proc. ICDE, 2006.
[13] K. A. Ross. Conjunctive selection conditions in main memory. In Proc. PODS, Washington, DC, USA, 2002.
[14] K. A. Ross. Efficient hash probes on modern processors. In Proc. ICDE, 2006.
[15] J. Zhou and K. A. Ross. Implementing database operations using SIMD instructions. In Proc. SIGMOD, 2002.
[16] J. Zhou and K. A. Ross. Buffering accesses to memory-resident index structures. In Proc. VLDB, 2003.
[17] J. Zhou and K. A. Ross. Buffering database operations for enhanced instruction cache performance. In Proc. SIGMOD, 2004.


In-Memory Grid Files on Graphics Processors

Ke Yang, Bingsheng He, Rui Fang, Mian Lu, Naga Govindaraju*, Qiong Luo, Pedro Sander, Jiaoying Shi†
HKUST, China: {keyang, saven, rayfang, mianlu, luo, psander}@cse.ust.hk
*Microsoft Corporation, USA: [email protected]
†Zhejiang University: [email protected]

ABSTRACT

Recently, graphics processing units, or GPUs, have become a viable alternative as commodity, parallel hardware for general-purpose computing, due to their massive data-parallelism, high memory bandwidth, and improved general-purpose programming interface. In this paper, we explore the use of the GPU for the grid file, a traditional multidimensional access method. Considering the hardware characteristics of GPUs, we design a massively multi-threaded GPU-based grid file for static, memory-resident multidimensional point data. Moreover, we propose a hierarchical grid file variant to handle data skew efficiently. Our implementations on the NVIDIA G80 GTX graphics card achieve two to eight times higher performance than their CPU counterparts on a single PC.

1. INTRODUCTION

Multidimensional access methods, such as grid files [16] and R-trees [10], usually involve more complex data structures as well as more computation- and data-intensive operations than single-dimensional ones. For such multidimensional access methods, new-generation graphics processors (GPUs) are a promising hardware platform due to their high memory bandwidth and massively parallel computation. For instance, an NVIDIA G80 GTX graphics card has 16 multiprocessors, each containing 8 processors and supporting up to 512 threads. The observed overall performance is 330 GFLOPS, and the device memory (of size 768MB) has a bandwidth of 86GB/sec.

Encouraged by the hardware features of GPUs, we study their use for the grid file, a representative multidimensional point access method. As a first step, we look at static multidimensional point data, such as those in On-Line Analytical Processing (OLAP) environments or in CAD (Computer-Aided Design). These environments are query-intensive and have infrequent reorganizations of the data. Furthermore, we assume such data are brought into the GPU device memory from the main memory before access and are device-memory resident throughout the query time. Targeting in-memory static data, we present a GPU acceleration of the grid file for multidimensional point queries. To handle data skew efficiently, we adopt a hierarchical strategy by recursively constructing a sub-grid for a skewed cell that contains a large number of points. We have implemented the grid file on the CPU and the GPU. Our implementations achieve a two to eight times speedup on the G80 GPU compared with their CPU counterparts.

This paper makes the following three contributions. First, we adapt the traditional, CPU-based grid file structure to fit in-memory parallel environments, and provide a massively multithreaded GPU-based design. Second, we propose a hierarchical grid file variation that handles skewed data efficiently. This variation works both on the CPU and on the GPU, with a more significant performance improvement on the GPU due to the GPU's inherent parallelism. Third, we empirically evaluate our GPU-based implementations in comparison with their CPU-based counterparts using an off-the-shelf PC equipped with the G80 graphics card.

The remainder of the paper is organized as follows. In Section 2, we briefly review the grid file structure, database processing on GPUs, and the programming features of new-generation GPUs. In Section 3, we describe the mapping of the basic grid structure onto GPUs. In Section 4, we describe the hierarchical grid structure for skew handling. We present our experimental results in Section 5 and conclude in Section 6.

2. BACKGROUND

Multidimensional Access Methods

Multidimensional access falls into two categories [6]: point access, which searches multidimensional points, and spatial access, which handles extended objects such as polyhedra. As a start on studying multidimensional access, we focus on point access methods. The following are three typical kinds of point access queries and their examples:

Exact match query. In such a query, the values of all attributes are given in equality predicates, and the query result is the record that exactly matches all the attribute values. E.g., find the student seated at Row 5, Column 3.

Partial match query. Such a query is a generalization of the exact match query. All predicates in a partial match query are equality predicates, but some attributes of the data points are absent from the query. Therefore, a partial match query retrieves all records that match on the specified attributes. E.g., find all students seated in Row 5.

Range query. A range query specifies a d-dimensional query box using range predicates and retrieves all records whose attributes represent a d-dimensional point located in the query box. E.g., find all students seated between Rows 1 to 3 and Columns 2 to 5.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Proceedings of the Third International Workshop on Data Management on New Hardware (DaMoN 2007), June 15, 2007, Beijing, China. Copyright 2007 ACM 978-1-59593-772-8 ...$5.00.

There have been various multidimensional point access methods, including hashing-based methods [4][13][14][27], tree-structured methods [5][22][24] and space-filling curves [23]. Generally, the storage of hashing-based structures can be easily distributed, which is suitable for parallelization. Moreover, hashing-based methods require a constant access time to retrieve a record, whereas tree structures often require time logarithmic in the data size. Considering the GPU's parallel computation and high-bandwidth memory access features, we investigate hashing-based methods further.

Grid Files

According to the classification by Gaede and Günther [6], the grid file [16] is a hashing-based multidimensional access method, even though the units of each dimension of a grid can be determined by range partitioning. There have been a number of variants of the grid file [3][12][18][25][28], as well as some parallelized methods [15][17]. In its basic form, a grid file superimposes a d-dimensional orthogonal grid on the d-dimensional data space. It partitions the space into hyper-rectangular cells using splitting hyper-planes that are parallel to the axes. These splitting hyper-planes follow the data distribution, and the splitting positions may not be uniform across dimensions. As such, the positions along each axis indicating where to split are maintained in an ordered array called a scale. Finally, there is a grid directory that associates each grid cell with a bucket in the storage. Figure 1 illustrates a 2D grid file. When accessing a point through the grid, we first use the scales to locate the cell that the point falls in, then follow the pointer in the cell to the bucket in the storage, and scan the bucket for a match.

Traditional grid files are able to handle dynamic insertions and deletions. They split overflowed cells and merge under-occupied cells to distribute the data over the buckets evenly, so that a single record retrieval can be answered in at most two storage accesses. However, the splitting may lead to superlinear directory growth [21], and the merging can result in deadlocks [12][24]. Since we deal with static, memory-resident data in this paper, we leave dynamic insertions and deletions as future work.

Database Processing on GPUs

There has been intensive research on general-purpose computing using the GPU (GPGPU) [19]. One branch of particular interest to us is GPU-based database processing [2][7][8][9][26]. The existing work mainly utilizes the 3D graphics pipeline by drawing primitives such as a quad with OpenGL/DirectX programs. Our work, in contrast, exploits the most recent advance in graphics hardware, and is implemented as general-purpose computing programs without utilizing any graphics APIs. Free from the constraints imposed by the graphics pipeline, our design and implementation are much more practical and flexible. Furthermore, to the best of our knowledge, this work is the first to develop multidimensional index structures on GPUs.

Programming on New Generation GPUs

We take the NVIDIA GeForce 8800 series (G80) graphics card, which became available on the market in November 2006 and is used in our implementation, as an example to introduce the new generation of GPUs. In comparison with its most advanced predecessors, the G80 has made significant improvements for general-purpose processing. The computing resource consists of tens of SIMD multiprocessors, each of which contains a group of processors and registers that support a massive number of concurrent threads. Processors in the same multiprocessor share a cache called the shared memory, which is fully exposed to the programmer. The device memory can be accessed as textures in traditional graphics applications; furthermore, it can be accessed as global memory in general-purpose computing programs, in a way similar to main memory. In addition, there is a constant memory that is shared by all multiprocessors and is cached on each multiprocessor.

The G80 card is released with the Compute Unified Device Architecture (CUDA) [1] for general data-parallel computing. CUDA provides a programming interface as an extension of the C language, with a runtime library for multithreaded parallel computing. This API treats the GPU as a general-purpose computing device as opposed to a programmable graphics pipeline, and thus allows non-graphics researchers to utilize the GPU hardware features easily. Specifically, CUDA allows the programmer to specify the usage of GPU resources such as the number of thread blocks (groups), and to write kernel programs that are executed by all threads.

3. BASIC STRUCTURES

In this section, we present the mapping of the grid file structure on the GPU and describe its construction and query processing.

3.1 Construction

To build a static grid file from a given data set, we first partition the data space so as to balance the bucket sizes of the cells as much as possible. Denote the numbers of splits along the d dimensions as p1, p2, ..., pd, respectively; then the total number of grid cells is c = p1 · p2 · ... · pd. Given the average number of records per bucket, H, we have c · H = N, where N is the number of records of relation R.

We build a grid for R and rearrange R such that the records belonging to the same cell are clustered into one bucket. First, we obtain a scale in each dimension i (i = 1, 2, ..., d) by sorting R along that dimension and sampling pi quantiles as the elements of the scale. Then, for each record, we use the scales to identify the bucket it belongs to. This is done in the LocateCell routine, which performs a binary search on each ordered scale array for the location of the record. Second, to get the starting position of each bucket in R, we build a histogram of the number of records in each bucket. This is done by scanning R once and using the scales to identify the bucket each record belongs to. Third, a prefix sum routine translates the histogram into bucket offsets in the rearranged relation. The records are then scanned again and scattered into the corresponding buckets at the given starting offsets. After the construction, the rearranged R stores the buckets and the grid directory entries contain the bucket offsets, as illustrated in Figure 1. In this example, d = 2, N = 10, H = 2, and p1 = p2 = 2.
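A plain-C sketch of the LocateCell routine described above (the names, the row-major cell numbering, and the scale layout with p[i]-1 split positions per dimension are our assumptions; in the GPU implementation the same logic runs inside a kernel, one query per thread):

    /* Return the id of the grid cell that a d-dimensional point falls in.
     * scales[i] is the sorted array of p[i]-1 split positions of dimension i. */
    int locate_cell(const unsigned int *keys, const unsigned int *const *scales,
                    const int *p, int d)
    {
        int cell = 0, i;
        for (i = 0; i < d; i++) {
            int lo = 0, hi = p[i] - 1;           /* partition index in [0, p[i]-1] */
            while (lo < hi) {                    /* binary search on the scale */
                int mid = (lo + hi) / 2;
                if (keys[i] < scales[i][mid]) hi = mid; else lo = mid + 1;
            }
            cell = cell * p[i] + lo;             /* row-major cell numbering */
        }
        return cell;
    }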

Figure 1. Structure of a 2D grid file.


We have implemented the construction with the GPU co-processing the sorting step. On a relation consisting of 16 million 2D records, the pure CPU-based construction takes about 12 seconds, of which 8 seconds are spent on sorting. We employ the GPU sort primitive [11] and reduce the sorting time to 3 seconds.

3.2 Query Processing

After a grid file is ready on the GPU, we can process the three kinds of multidimensional point queries using the grid, namely exact match queries, partial match queries, and range queries. Note that a partial match query is similar to an exact match query on the equality predicates and to a range query on an unbounded attribute, where the range box extends throughout the entire domain of the attribute. As a result, our implementation of partial match queries is similar to that of the other two types. In the following, we mainly discuss exact match and range queries.

Since the GPU is a data-parallel computing device, we use a pool of threads, which is tuned to fully utilize the hardware computation power, to handle a large number of queries in parallel. Each thread takes charge of a query independently, and after finishing one, it handles another. Specifically, the thread with index t starts its i-th query by reading the (t + iT)-th query from the device memory, where T is the number of threads. This assignment strategy enables coalesced reads [1], which speed up device memory access.

For an exact match query, a thread scans the bucket corresponding to the cell that contains the query record to search for a match. This search is achieved by a LocateCell call followed by a sequential scan of the bucket. The termination condition of the scan, i.e., the boundary of the bucket, is marked by the offset of the next bucket. For a range query, a thread scans all the buckets that correspond to the cells overlapping the query box. Given the two end points, L and H, of the major diagonal of the box, the thread calls LocateCell on L and H to obtain the two corresponding end cells, CL and CH. The coordinates of CL and CH bound the cells whose coordinates fall in the multidimensional range. The thread then sequentially scans these cells. For the points in the boundary cells (those having at least one coordinate equal to that of an end cell), the thread further performs a point-level test to check whether they are located in the query box.

To avoid conflicts among multiple threads that concurrently write query results to the shared output region, we perform the write in a three-step scheme similar to that in our previous work [11]. For completeness, we briefly present the scheme here. In the first step, each thread executes the query and counts the number of query results it generates; in this step, each thread only outputs the count but does not produce the actual result records. Then a prefix sum routine gathers these local counts and translates them into an array of global write locations, each of which contains the start position for the corresponding thread's output. In the last step, each thread computes the query result records and writes them to its slot in the global memory.
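A plain-C sketch of the per-thread exact-match loop just described, with the (t + iT) query assignment and the next-bucket-offset scan boundary; in the paper this logic runs as a CUDA kernel with T threads, and locate_cell() is the routine sketched in Section 3.1 (the record layout of an id plus d keys follows Section 5.1; all other names are illustrative):

    typedef struct { int id; unsigned int key[4]; } rec;   /* d <= 4 in this sketch */

    int locate_cell(const unsigned int *keys, const unsigned int *const *scales,
                    const int *p, int d);   /* as sketched in Section 3.1 */

    /* Thread t of T processes queries t, t+T, t+2T, ...; for each query it
     * locates the cell and scans that cell's bucket for an exact match. */
    void exact_match_thread(int t, int T, int nq, const rec *queries,
                            const rec *data, const int *bucket_offset,
                            const unsigned int *const *scales, const int *p,
                            int d, int *result_id)
    {
        int q, r, i;
        for (q = t; q < nq; q += T) {
            int cell = locate_cell(queries[q].key, scales, p, d);
            int end  = bucket_offset[cell + 1];   /* boundary = offset of next bucket */
            result_id[q] = -1;                    /* -1 means no match found */
            for (r = bucket_offset[cell]; r < end; r++) {
                int match = 1;
                for (i = 0; i < d; i++)
                    if (data[r].key[i] != queries[q].key[i]) { match = 0; break; }
                if (match) { result_id[q] = data[r].id; break; }
            }
        }
    }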

4. SKEW HANDLING

4.1 Hierarchical Grid File

When partitioning the data space in the construction process of a grid file (described in Section 3.1), the data distribution over the resulting buckets is not necessarily balanced after a single partitioning pass. For example, the two buckets starting at offsets 0 and 6 in Figure 1 are more crowded than the other two. As a result of data skew, querying a grid cell that corresponds to a crowded bucket is more expensive than querying a less crowded one. This imbalance has significant performance implications on the GPU, because the processors of the GPU are SIMD. Therefore, we propose a hierarchical scheme to further divide the crowded cells. This division is done recursively until the bucket size of a resulting cell is below a given threshold.

Our proposed hierarchical grid structure is similar to two existing schemes, the multilevel grid file [28] and the buddy tree [24]. The common idea among the three schemes is to divide crowded regions recursively. However, our scheme has two major differences from the existing work. First, both existing structures cover only those cells that contain data points, and maintain a directory entry for each non-empty cell. In contrast, our hierarchical grid covers the entire data space, and locates cells through shared scales. This structure is relatively simpler and more suitable for bulk loading in a parallel computing environment. Second, as dynamic maintenance techniques, the two existing methods split an overflowed bucket into two at each level, so the structures contain a relatively large number of levels in the tree or in the grid. In comparison, our hierarchical grid is a static structure, and the number of levels of sub-grids in a crowded cell is relatively small.

To build a hierarchical grid for relation R, we first perform the splitting and the data rearrangement of R in the same way as described in Section 3.1. We then store the information about the grid, such as the scales, the bucket sizes and offsets, in the directory. Next we check the size of each bucket in this grid: if the size of a bucket is larger than a given threshold, we perform another round of splitting, and the information of the new sub-grid is appended to the directory. The offset of the parent bucket is then redirected to the offset of the sub-grid in the directory. To distinguish the offset of a sub-grid from that of a bucket, we add a sign-bit flag to the offset of the sub-grid. The pseudo code for constructing a hierarchical grid is given in Figure 2. The average bucket size H is pre-specified as a constant. For simplicity, the routine PNum here assumes the same number of splits on all dimensions, i.e. p1 = p2 = ... = pd = p. This can be generalized to cases where p1, p2, ..., pd are functions of p.


Figure 2. Pseudo code for building a hierarchical grid file:

    int PNum(int size)   // decide the number of partitions on a dimension
    {
        p = 2;                          // number of partitions
        while (size/H > p^d) p++;       // p^d: total number of cells
        return p;
    }

    BuildGrid(void* dir, rec R[n], int p)   // build a grid directory for R
    {
        compute scales, bucket offsets and sizes; append them to dir;
        rearrange R;
        for each bucket i with size[i]/H > 2^d {    // the threshold to split
            BuildGrid(dir, R + offset[i], PNum(size[i]));
            offset[i] = 0 - dir's current offset;
        }
    }

An example hierarchical grid is illustrated in Figure 3, obtained by splitting the original grid in Figure 1. After identifying the two crowded cells, we construct a sub-grid for each cell and change the offset in the storage into the flagged offset in the directory, i.e., -12 and -24.

Figure 3. Structure of a hierarchical grid file.

4.2 Query Processing

Query processing over a hierarchical grid is similar to that over an original grid, except that each thread recursively decodes the offset of a sub-grid in the directory until it reaches the final bucket. Pseudo code is given in Figure 4. Note that recursion is not supported in a GPU kernel program, due to hardware limitations. In our implementation, we therefore rewrite the code as a while loop with offset[i] < 0 as the condition. Furthermore, flow control instructions can cause threads to diverge, and different execution paths have to be serialized. Since our grid hierarchy usually has few levels, the kernel requires only a small number of branches (fewer than five in our tests). Additionally, the hierarchy greatly improves the worst cases of bucket imbalance and thus effectively limits the serialization cost.

Figure 4. Pseudo code for query in a hierarchical grid file:

    Search(rec q, int cur)   // search q from the current position of the directory
    {
        i = LocateCell(dir + cur, q);   // bucket id
        if (offset[i] > 0)              // a bucket
            scan the bucket for a match;
        else                            // a sub-grid
            Search(q, -offset[i]);      // grid offset
    }

5. EXPERIMENTS

5.1 Experimental setup

We have implemented and tested our algorithms on a PC with a G80 GPU and an Intel P4 Dual-Core processor running Windows XP. The hardware configuration of the PC is shown in Table 1.

Table 1. Hardware configuration

                              GPU                    CPU
    Processors                16 × 8 @ 1350MHz       3.2GHz (Dual-core)
    DRAM (MB)                 768                    1024
    DRAM latency (cycle)      200-300                300
    Bus width (bit)           384                    64
    Memory clock (GHz)        1.8                    0.8

Each multiprocessor on the GPU has a piece of constant memory of 64KB. Accesses to the constant memory are cached; the constant cache on each multiprocessor is 8KB. Since the scales of the grid file are frequently accessed, we store the scales of the first level of the grid in the constant memory for fast access. In our GPU programs, the configuration parameters are tuned for the best performance; we use a configuration of 5120 thread blocks, each containing 256 threads.

We consider two kinds of workloads in our experiments, the exact match query and the range query. For exact match queries, we first generate uniform datasets with the number of tuples and the number of dimensions varied. We then test on skewed data sets, some synthetic and some real-world. For range queries, we use a uniform dataset and vary the selectivity of the range query. The synthetic skewed data follows a Gaussian distribution with the standard deviation parameter varied; the smaller the standard deviation, the more skewed the data distribution. The real-world skewed data sets are from 3D point cloud models, which have been used extensively in graphics and computational geometry studies. The data structure for a d-dimensional record is an integer id followed by d 32-bit unsigned integer keys. We set the average bucket size to H = 8 in all experiments. In each experiment, the time costs for the CPU and GPU executions are measured separately for comparison.


5.2 Results

5.2.1 Exact match query

Figure 5 demonstrates the performance of evaluating 1 million exact match queries on a relation with the number of tuples varied. As the number of tuples increases, the performance speedup of the GPU-based algorithm over the CPU-based one increases slightly. In particular, the speedup is 6.5x on the data set of one million tuples and 7.7x on the one of 16 million tuples. We also test the GPU grid file without the optimization of storing scales in the constant memory; the comparison shows that this optimization greatly reduces memory stalls and improves the overall performance by 40% on average.

Figure 6 shows the performance of evaluating 1 million exact match queries on a relation of 16 million tuples with the number of dimensions varied. Because the overhead of LocateCell is proportional to the number of scale arrays, the time cost increases with the number of dimensions. The GPU-based grid file is 2-5 times faster than the CPU-based grid file.

Figure 7 shows the measurements on the synthetic skewed data, with the standard deviation ranging from 10^3 to 10^7. Both the CPU- and GPU-based implementations suffer when the data skew is severe. As the data becomes less skewed, the maximum level of the hierarchical grid decreases accordingly. The GPU-based hierarchical grid file is generally more than five times faster than the CPU-based one.

Finally, we evaluate the exact match query on skewed data using 3D point cloud models. We use two models, Sphere and Dragon, as shown in Figure 8. We vary the number of points in each model from 1 to 16 million, and issue 1 million queries with random search keys. The performance results on these two models are shown in Figure 9 and Figure 10, respectively. On both the CPU and the GPU, we compare the performance with and without the hierarchical scheme, denoted as "Y" or "N" for with or without a hierarchy, respectively. In general, the GPU versions are 2x-5x faster than their CPU counterparts.

For the Sphere model (Figure 9), the performance of the grid file with the hierarchical scheme is similar to that without, both on the CPU and on the GPU. This similarity arises because the Sphere model is a uniformly distributed point cloud within a sphere, which has low skewness; the maximum level of the grid file is one. This shows that our partition scheme can handle slightly skewed data in a single-level grid. Similar to the speedup of the GPU over the CPU on the uniform data set, the speedup on the Sphere model increases slightly as the number of points in the model increases.

The Dragon model is the point cloud on the surface of a dragon, which has high correlation and skewness. The maximum level in this grid file is 3. The CPU-based grid file gains an improvement of 1.2x-1.5x by utilizing a hierarchy, whereas the GPU-based grid file gains 2.3x-4.5x. The main reason for this difference is that the GPU benefits more from load balance than the CPU does. On the GPU, when threads are severely load-unbalanced and thus greatly diverged, the less loaded threads have to wait for the busy threads. Since the hierarchy helps balance the load, the time wasted waiting among threads is largely reduced.

Figure 5. Exact match query on uniform data sets with the number of tuples varied (x-axis: number of tuples in millions; y-axis: time in sec; series: CPU, GPU w/o optimization, GPU w/ optimization).

Figure 6. Exact match query on uniform data sets with the number of dimensions varied (x-axis: dimension of space, 2D-5D; y-axis: time in sec; series: CPU, GPU w/o optimization, GPU w/ optimization).

5.2.2 Range query

Figure 11 shows the performance of evaluating 100k range queries against the grid of a relation of 16 million randomly generated 2D tuples. Both the width and the length of the query rectangles are varied from 0.1% to 1% of the integer range. As the selectivity increases, the execution time of both the CPU- and GPU-based grid files increases almost linearly. The GPU-based grid file is around 4x-6x faster than the CPU-based one.

5.2.3 Discussion

We have shown that our GPU-based algorithm outperforms its CPU counterpart in all our tests, with a 2x-8x speedup. The reasons for the performance improvement are as follows: (1) We utilize the GPU as a parallel device with a large number (more than 1 million) of lightweight threads. This massive-threading model matches the query-intensive workloads well. (2) Our GPU-based grid file structure is relatively simple, and the record type is regular. This storage structure fits the array-based GPU memory access. (3) Each single query operation is relatively simple, and the hierarchical structure further improves the load balance. Such a workload takes full advantage of SIMD GPU processing and alleviates the high cost of branches and inter-thread load imbalance. For these reasons, the GPU is more suitable for grid files than a multi-core CPU, which is equipped with a powerful instruction set but executes a small number of heavyweight threads.


Figure 7. Exact match query on skewed data sets with the standard deviation of the Gaussian distribution varied (x-axis: standard deviation; left y-axis: time in sec for the CPU and GPU series; right y-axis: max level of the hierarchical grid).

Figure 8. Visualization of the 3D real-world datasets: (left) Sphere; (right) Dragon.

Figure 9. Exact match query on the Sphere model with the number of points varied (x-axis: number of points in millions; y-axis: time in sec; series: CPU N, GPU N, CPU Y, GPU Y, where Y/N denotes with/without hierarchy).

Figure 10. Exact match query on the Dragon model with the number of points varied (axes and series as in Figure 9).

Figure 11. Range query on a uniform data set with the selectivity varied (x-axis: selectivity, 0.1%-1.0%; y-axis: time in sec; series: CPU, GPU).

6. CONCLUSIONS AND FUTURE WORK

We have developed in-memory grid files on the GPU and have shown that the new generation of GPUs is a well-suited parallel platform for accelerating this traditional multidimensional point access method. Moreover, we have proposed a static hierarchical grid file structure that handles skewed data efficiently. Experimental results show that our GPU algorithms greatly outperform their CPU counterparts in processing exact match and range queries, and that they work well up to five dimensions. As future work, we plan to study alternatives for dynamic insertion and deletion operations for grid files. We are also interested in designing multidimensional spatial access methods, such as R-trees [10], on GPUs.

7. ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their insightful comments and suggestions. We also thank the people at the NVIDIA CUDA Forum, especially Mark Harris, for their help with G80 implementation issues. Finally, we thank Dr. Lidan Shou of Zhejiang University for his lectures on multidimensional access methods.

8. REFERENCES [1] NVIDIA CUDA (Compute Unified Device Architecture),

http://developer.nvidia.com/object/cuda.html. [2] Bandi, N., Sun, C., Agrawal, D. and El Abbadi, A.,

Hardware acceleration in commercial databases: A case study of spatial operations. VLDB, 2004.

[3] Blanken, H., Ijbema, A., Meek, P. and van den Akker, B., The generalized grid file: Description and performance aspects. In Proc. 6th IEEE Int. Conf. on Data Eng., pp. 380-388. 1990.

[4] Fagin, R., Nievergelt, J., Pippenger, N. and Strong, R., Extendible hashing: A fast access method for dynamic files. ACM Trans. Database Systems 4 (3), 315-344. 1979.

[5] Finkel, R. and Bentley, J. L., Quad trees: A data structure for retrieval of composite keys. Acta Informatica 4(1), 1-9. 1974.

[6] Gaede V, Gunther O, Multidimensional Access Methods. ACM Computing Surveys,1998, 30(2).

40

Page 53: Proceedings of Third International Workshop on Data ...homepages.cwi.nl/~manegold/DaMoN-ExpDB_2007/DaMoN... · Contents Program .......................................................................................................

[7] Govindaraju, N., Gray, J., Kumar, R. and Manocha, D., GPUTeraSort: high performance graphics coprocessor sorting for large database management. SIGMOD, 2006.

[8] Govindaraju, N., Lloyd, B., Wang, W., Lin, M. and Manocha, D., Fast computation of database operations using graphics processors. SIGMOD, 2004.

[9] Govindaraju, N., Raghuvanshi, N. and Manocha, D., Fast and approximate stream mining of quantiles and frequencies using graphics processors. SIGMOD, 2005.

[10] Guttman, A. R-trees: A dynamic index structure for spatial searching. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 47-54. 1984.

[11] He, B., Yang, K., Fang, R., Lu, M., Govindaraju, N., Luo, Q. and Sander, P., Relational Joins on Graphics Processors. Technical report, Department of Computer Science and Engineering, HKUST, March 2007.

[12] Hinrichs, K., Implementation of the grid file: Design concepts and experience. BIT 25, 569-592. 1985.

[13] Kriegel, H.-P. and Seeger. B., Multidimensional order preserving linear hashing with partial expansions. In Proc. Int. Conf. on Database Theory, Number 243 in LNCS, Berlin/Heidelberg/New York. Springer-Verlag. 1986.

[14] Kriegel, H.-P. and Seeger, B., Multidimensional quantile hashing is very efficient for non-uniform record distributions. In Proc. 3rd IEEE Int. Conf. on Data Eng., pp. 10-17. 1987.

[15] Li, J., Rotem, D., Srivastava, J., Algorithms for Loading Parallel Grid Files. SIGMOD Conference 1993: 347-356

[16] Nievergelt, J., Hinterberger, H., and Sevcik, K. C., The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Systems 9 (1), 38-71, 1984.

[17] Mohammed, S., Srinivasan, B., Bozyigit, M. and Phu, D., Novel parallel join algorithms for grid files. 3rd International Conference on High Performance Computing, Dec 1996, pp. 144-149.

[18] Ouksel, M., The interpolation based grid file. In Proc. 4th ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, pp. 20-27. 1985.

[19] Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., A. E. Lefohn and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, Volume 26, 2007.

[20] Rao, J. and Ross, K. A., Cache conscious indexing for decision-support in main memory. VLDB, 1999.

[21] Regnier, M. Analysis of the grid file algorithms. BIT 25, 335-357. 1985.

[22] Robinson, J. T. The K-D-B tree: A search structure for large multidimensional dynamic indexes. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 10-18. 1981.

[23] Sagan, H., Space-Filling Curves. Berlin/Heidelberg/New York: Springer-Verlag, 1994.

[24] Seeger, B. and Kriegel, H.-P., The buddy-tree: An efficient and robust access method for spatial data base systems. In Proc 16th Int. Conf. on Very Large Data Bases, pp. 590-601. 1990.

[25] Six, H. and Widmayer, P., Spatial searching in geometric databases. In Proc.4th IEEE Int. Conf. on Data Eng., pp. 496-503. 1988.

[26] Sun, C., Agrawal, D. and El Abbadi, A., Hardware acceleration for spatial selections and joins. SIGMOD, 2003.

[27] Tamminen, M. The extendible cell method for closest point problems. BIT 22, 27-41. 1982.

[28] Whang, K.-Y. and Krishnamurthy, R., Multilevel grid files. Yorktown Heights, NY: IBM Research Laboratory. 1985.


The five-minute rule twenty years later, and how flash memory changes the rules

Goetz Graefe, HP Labs, Palo Alto, CA

Abstract

In 1987, Gray and Putzolu presented the five-minute rule, which was reviewed and renewed ten years later in 1997. With the advent of flash memory in the gap between traditional RAM main memory and traditional disk systems, the five-minute rule now applies to large pages appropriate for today's disks and their fast transfer bandwidths, and it also applies to flash disks holding small pages appropriate for their fast access latency.

Flash memory fills the gap between RAM and disks in terms of many metrics: acquisition cost, access latency, transfer bandwidth, spatial density, and power consumption. Thus, within a few years, flash memory will likely be used heavily in operating systems, file systems, and database systems. Research into appropriate system architectures is urgently needed.

The basic software architectures for exploiting flash in these systems are called "extended buffer pool" and "extended disk" here. Based on the characteristics of these software architectures, an argument is presented why operating systems and file systems on one hand and database systems on the other hand will best benefit from flash memory by employing different software architectures.

1 Introduction In 1987, Gray and Putzolu published their now-famous five-minute rule [GP 87] for trading off memory and I/O capacity. Their calculation compares the cost of holding a record (or page) permanently in memory with the cost to perform disk I/O each time the record (or page) is accessed, using appropriate fractions of prices for RAM chips and for disk drives. The name of their rule refers to the break-even interval between accesses. If a record (or page) is accessed more often, it should be kept in memory; otherwise, it should remain on disk and read when needed.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee pro-vided that copies are not made or distributed for profit or commer-cial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on serv-ers or to redistribute to lists, requires prior specific permission and/or a fee. Proceedings of the Third International Workshop on Data Management on New Hardware (DaMoN 2007), June 15, 2007, Beijing, China. Copyright 2007 ACM 978-1-59593-772-8 ...$5.00.

Based on then-current prices and performance characteristics of Tandem equipment, they found that the price of RAM memory to hold a record of 1 KB was about equal to the (fractional) price of a disk drive required to access such a record every 400 seconds, which they rounded to 5 minutes. The break-even interval is about inversely proportional to the record size. Gray and Putzolu gave 1 hour for records of 100 bytes and 2 minutes for pages of 4 KB.

The five-minute rule was reviewed and renewed ten years later in 1997 [GG 97]. Lots of prices and performance parameters had changed, e.g., the price for RAM memory had tumbled from $5,000 to $15 per megabyte. Nonethe-less, the break-even interval for pages of 4 KB was still around 5 minutes. The first purpose of this paper is to re-view the five-minute rule after another ten years.

Of course, both prior papers acknowledge that prices and performance vary among technologies and devices at any point in time, e.g., RAM for mainframes versus mini-computers, SCSI versus IDE disks, etc. Therefore, inter-ested readers are invited to re-evaluate the appropriate for-mulas for their environments and equipment. The values used in this paper, e.g., in Table 1, are meant to be typical for today’s technologies rather than universally accurate.

                      RAM                 Flash disk        SATA disk
Price and capacity    $3 for 8x64 Mbit    $999 for 32 GB    $80 for 250 GB
Access latency        -                   0.1 ms ?          12 ms average
Transfer bandwidth    -                   66 MB/s API       300 MB/s API
Active power          -                   1 W               10 W
Idle power            -                   0.1 W             8 W
Sleep power           -                   0.1 W             1 W

Table 1. Prices and performance of flash and disks.

In addition to quantitative changes in prices and performance, qualitative changes already underway will affect the software and hardware architectures of servers and in particular of database systems. Database software will change radically with the advent of new technologies: virtualization with hardware and software support as well as higher utilization goals for physical machines, many-core processors and transactional memory supported both in programming environments and in hardware [LR 07], deployment in containers housing 1,000s of processors and many TB of data [H 07], and flash memory that fills the gap between traditional RAM and traditional rotating disks.


Flash memory falls between traditional RAM and per-sistent mass storage based on rotating disks in terms of ac-quisition cost, access latency, transfer bandwidth, spatial density, power consumption, and cooling costs [GF 07]. Table 1 and some derived metrics in Table 2 illustrate this point. (From dramexchange.com, dvnation.com, buy.com, seagate.com, and samsung.com; all 4/11/2007).

Given that the number of CPU instructions possible during the time required for one disk I/O has steadily in-creased, an intermediate memory in the storage hierarchy is very desirable. Flash memory seems to be a highly probable candidate, as has been observed many times by now.

Many architecture details remain to be worked out. For example, in the hardware architecture, will flash memory be accessible via a DIMM memory slot, via a SATA disk in-terface, or via yet another hardware interface? Given the effort and delay in defining a new hardware interface, adap-tations of existing interfaces are likely.

A major question is whether flash memory is consid-ered a special part of main memory or a special part of per-sistent storage. Asked differently: if a system includes 1 GB traditional RAM, 8 GB flash memory, and 250 GB tradi-tional disk, does the software treat it as 250 GB persistent storage and a 9 GB buffer pool, or as 258 GB persistent storage and a 1 GB buffer pool? The second purpose of this paper is to answer this question, and in fact to argue for different answers in file systems and in database systems.

                                   NAND Flash        SATA disk
Price and capacity                 $999 for 32 GB    $80 for 250 GB
Price per GB                       $31.20            $0.32
Time to read a 4 KB page           0.16 ms           12.01 ms
4 KB reads per second              6,200             83
Price per 4 KB read per second     $0.16             $0.96
Time to read a 256 KB page         3.98 ms           12.85 ms
256 KB reads per second            250               78
Price per 256 KB read per second   $3.99             $1.03

Table 2. Relative costs for flash memory and disks.

Many design decisions depend on the answer to this question. For example, if flash memory is part of the buffer pool, pages must be considered "dirty" if their contents differ from the equivalent page in persistent storage. Synchronizing the file system or checkpointing a database must force disk writes in those cases. If flash memory is part of persistent storage, these write operations are not required.

Designers of operating systems and file systems will want to employ flash memory as extended buffer pool (ex-tended RAM memory), whereas database systems will

benefit from flash memory as extended disk (extended per-sistent storage). Multiple aspects of file systems and of da-tabase systems consistently favor these two designs.

Moreover, the characteristics of flash memory suggest some substantial differences in the management of B-tree pages and their allocation. Beyond optimization of page sizes, we argue that B-trees will use different units of I/O for flash memory and for disks. Presenting the case for this design is the third purpose of this paper.

2 Assumptions Forward-looking research always relies on many as-

sumptions. This section attempts to list the assumptions that lead to our conclusions. Some of the assumptions seem fairly basic while others are more speculative.

One of our assumptions is that file systems and database systems assign to flash memory a role in the storage hierarchy between RAM and the disk drive. Both software systems favor pages with some probability that they will be touched in the future but not with sufficient probability to warrant keeping them in RAM. The estimation and administration of such probabilities follows the usual lines, e.g., LRU.

We assume that the administration of such information employs data structures in RAM memory, even for pages whose contents have been removed from RAM to flash memory. For example, the LRU chain in a file system's buffer pool might cover both RAM memory and the flash memory, or there might be two separate LRU chains. A page is loaded into RAM and inserted at the head of the first chain when it is needed by an application. When it reaches the tail of the first chain, the page is moved to flash memory and its descriptor to the head of the second LRU chain. When it reaches the tail of the second chain, the page is moved to disk and removed from the LRU chain. Other replacement algorithms would work mutatis mutandis.
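To make the two-chain scheme concrete, the following minimal Python sketch (not from the paper; capacities are counted in pages, and the read_from_disk callback and write-back of dirty pages are placeholders) keeps one LRU chain per level and demotes pages RAM -> flash -> disk:

    from collections import OrderedDict

    class TwoLevelLRU:
        """Toy two-chain LRU: pages demoted from the RAM chain enter the flash
        chain; pages demoted from the flash chain fall back to disk. Both
        chains are ordinary in-RAM data structures, as assumed in Section 2."""
        def __init__(self, ram_capacity, flash_capacity):
            self.ram = OrderedDict()     # page_id -> contents, MRU at the end
            self.flash = OrderedDict()   # pages currently staged on flash
            self.ram_capacity = ram_capacity
            self.flash_capacity = flash_capacity

        def access(self, page_id, read_from_disk):
            if page_id in self.ram:               # RAM hit: move to MRU position
                self.ram.move_to_end(page_id)
                return self.ram[page_id]
            if page_id in self.flash:             # flash hit: promote to RAM
                contents = self.flash.pop(page_id)
            else:                                 # miss: fetch from disk
                contents = read_from_disk(page_id)
            self.ram[page_id] = contents
            if len(self.ram) > self.ram_capacity:          # demote RAM tail to flash
                victim, data = self.ram.popitem(last=False)
                self.flash[victim] = data
                if len(self.flash) > self.flash_capacity:  # demote flash tail to disk
                    self.flash.popitem(last=False)         # (write-back omitted)
            return contents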

Such fine-grained LRU replacement of individual pages is in contrast to assigning entire files, directories, tables, or databases to different storage units. It seems that page replacement is the appropriate granularity in buffer pools. Moreover, proven methods exist to load and replace buffer pool contents entirely automatically, without assistance by tuning tools and without directives by users or administrators. An extended buffer pool in flash memory should exploit the same methods as a traditional buffer pool. For truly comparable and competitive performance and administration costs, a similar approach seems advisable when flash memory is used as an extended disk.

2.1 File systems In our research, we assumed a fairly traditional file sys-

tem. Many file systems differ in some way or another from this model, but it seems that most usage of file systems still follows this model in general.


Each file is a large byte stream. Files are often read in their entirety, their contents manipulated in memory, and the entire file replaced if it is updated at all. Archiving, ver-sion retention, hierarchical storage management, data movement using removable media, etc. all seem to follow this model as well.

Based on that model, space allocation on disk attempts to employ contiguous disk blocks for each file. Metadata are limited to directories, a few standard tags such as a crea-tion time, and data structures for space management.

Consistency of these on-disk data structures is achieved by careful write ordering, fairly quick write-back of updated data blocks, and expensive file system checks after any less-than-perfect shutdown or media removal. In other words, we assume the absence of transactional guarantees and transactional logging, at least for file contents. If log-based recovery is supported for file contents such as individual pages or records within pages, a number of our arguments need to be revisited.

2.2 Database systems We assume fairly traditional database systems with B-

tree indexes as the “work horse” storage structure. Similar tree structures capture not only traditional clustered and non-clustered indexes but also bitmap indexes, columnar storage, contents indexes, XML indexes, catalogs (meta-data), and allocation data structures.

With respect to transactional guarantees, we assume traditional write-ahead logging of both contents changes (such as inserting or deleting a record) and structural changes (such as splitting B-tree nodes). Efficient log-based recovery after failures is enabled by checkpoints that force dirty data from the buffer pool to persistent storage.

Variations such as “second chance” checkpoints or fuzzy checkpoints are included in our assumptions. In addi-tion, “non-logged” (allocation-only logged) execution is permitted for some operations such as index creation. These operations require appropriate write ordering and a “force” buffer pool policy [HR 83].

2.3 Flash memory We assume that hardware and device drivers hide

many implementation details such as the specific hardware interface to flash memory. For example, flash memory might be mounted on the computer’s mother board, on a memory DIMM slot, on a PCI board, or within a standard disk enclosure. In all cases, we assume DMA transfers (or something better) between RAM and flash memory. More-over, we assume that either there is efficient DMA data transfer between flash and disk or there is a transfer buffer in RAM. The size of such transfer buffer should be, in a first approximation, about equal to the product of transfer bandwidth and disk latency. If it is desirable that disk writes

should never delay disk reads, the increased write-behind latency must be included in the calculation.
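For example, with the approximate Table 1 values of 300 MB/s disk transfer bandwidth and 12 ms average disk access latency, this first approximation suggests a transfer buffer of about 300 MB/s × 0.012 s ≈ 3.6 MB; even a few such buffers would add only modestly to the RAM requirements assumed here.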

We also assume that transfer bandwidths of flash memory and disk are comparable. While flash write band-width has lagged behind read bandwidth, some products claim a difference of less than a factor of two, e.g., Sam-sung’s Flash-based solid state disk also used in Table 1. If necessary, the transfer bandwidth can be increased by use of array arrangements as well known for disk drives[CLG 94]. Even redundant arrangement of flash memory may prove advantageous in some cases [CLG 94].

Since the reliability of current NAND flash suffers af-ter 100,000 – 1,000,000 erase-and-write cycles, we assume that some mechanisms for “wear leveling” are provided. These mechanisms ensure that all pages or blocks of pages are written similarly often. It is important to recognize the similarity between wear leveling algorithms and log-structured file systems [OD 89, W 01], although the former also move stable, unchanged data such that their locations can also absorb some of the erase-and-write cycles.

Also note that traditional disk drives do not support more write operations, albeit for different reasons. For example, 6 years of continuous and sustained writing at 100 MB/sec overwrites an entire 250 GB disk less than 80,000 times. In other words, assuming a log-structured file system as appropriate for RAID-5 or RAID-6 arrays, the reliability of current NAND flash seems comparable. Similarly, overwriting a 32 GB flash disk 100,000 times at 30 MB/s takes about 3½ years.
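These two figures can be checked with a quick back-of-the-envelope computation (a sketch using the round numbers from the text; exact results depend on the assumed year length and unit conventions):

    SECONDS_PER_YEAR = 365.25 * 24 * 3600

    # 6 years of sustained writing at 100 MB/s over a 250 GB disk:
    disk_overwrites = 100e6 * 6 * SECONDS_PER_YEAR / 250e9
    print(round(disk_overwrites))      # ~75,700, i.e., fewer than 80,000 overwrites

    # overwriting a 32 GB flash disk 100,000 times at 30 MB/s:
    flash_years = 32e9 * 100_000 / 30e6 / SECONDS_PER_YEAR
    print(round(flash_years, 1))       # ~3.4 years, i.e., about 3½ years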

In addition to wear leveling, we assume that there is an asynchronous agent that moves fairly stale data from flash memory to disk and immediately erases the freed up space in flash memory to prepare it for write operations without further delay. This activity also has an immediate equiva-lence in log-structured file systems, namely the clean-up activity that prepares space for future log writing. The dif-ference is disk contents must merely be moved, whereas flash contents must also be erased before the next write operation at that location.

In either file systems or database systems, we assume separate mechanisms for page tracking and page replace-ment. A traditional buffer pool, for example, provides both, but it uses two different data structures for these two pur-poses. The standard design relies on a LRU list for page replacement and on a hash table for tracking pages, i.e., which pages are present in the buffer pool and in which buffer frames. Alternative algorithms and data structures also separate page tracking and replacement management.

We assume that the data structures for the replacement algorithm are small, high-traffic data structures and are therefore kept in RAM memory. We also assume that page tracking must be as persistent as the data; thus, a buffer pool's hash table is re-initialized during a system reboot, but page tracking information for pages on a persistent store such as a disk must be stored with the data.


As mentioned above, we assume page replacement on demand. In addition, there may be automatic policies and mechanisms for prefetch, read-ahead, write-behind.

Based on these considerations, we assume that the con-tents of a flash memory are pretty much the same, whether the flash memory extends the buffer pool or the disk. The central question is therefore not what to keep in cache but how to manage flash memory contents and its lifetime.

In database systems, flash memory can also be used for recovery logs, because its short access times permit very fast transaction commit. However, limitations in write bandwidth discourage such use. Perhaps systems with dual logs can combine low latency and high bandwidth, one log on a traditional disk and one log on an array of flash chips.

2.4 Other hardware In all cases, we assume RAM memory of a substantial size, although probably less than flash memory or disk. The relative sizes should be governed by the "five-minute rule" [GP 87]. Note that, despite similar transfer bandwidth, the short access latency of flash memory compared to disk results in surprising retention times for data in RAM memory, as discussed below.

Finally, we assume sufficient processing bandwidth as provided by modern many-core processors. Moreover, we believe that forthcoming transactional memory (in hardware and in the software run-time system) permits highly concur-rent maintenance of complex data structures. For example, page replacement heuristics might employ priority queues rather than bitmaps or linked lists. Similarly, advanced lock management might benefit from more complex data struc-tures. Nonetheless, we do not assume or require data struc-tures more complex than those already in common use for page replacement and location tracking.

3 The five-minute rule If flash memory is introduced as an intermediate level

in the memory hierarchy, relative sizing of memory levels requires renewed consideration.

Tuning can be based on purchasing cost, total cost of ownership, power, mean time to failure, mean time to data loss, or a combination of metrics. Following Gray and Putzolu [GP 87], we focus here on purchasing cost. Other metrics and appropriate formulas to determine relative sizes can be derived similarly, e.g., by replacing dollar costs with energy use for caching and for moving data.

Gray and Putzolu introduced the formula BreakEvenIntervalInSeconds = (PagesPerMBofRAM / AccessesPerSecondPerDisk) × (PricePerDiskDrive / PricePerMBofRAM) [GG 97, GP 87]. It is derived using formulas for the costs of RAM to hold a page in the buffer pool and of a (fractional) disk to perform I/O every time a page is needed, equating these two costs, and solving the equation for the interval between accesses.

Assuming modern RAM, a disk drive using pages of 4 KB, and the values from Table 1 and Table 2, this produces (256 / 83) × ($80 / $0.047) = 5,248 seconds = 90 minutes = 1½ hours¹. This compares to 2 minutes (for pages of 4 KB) 20 years ago.
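The calculations in this section can be reproduced with a few lines of Python (a sketch, not the authors' code; the inputs are the approximate values of Tables 1 and 2, so the results match the text only up to rounding):

    def break_even_seconds(pages_per_mb, accesses_per_sec, device_price, price_per_mb):
        """Gray and Putzolu's break-even interval between accesses."""
        return (pages_per_mb / accesses_per_sec) * (device_price / price_per_mb)

    # RAM vs. SATA disk, 4 KB pages: 256 pages/MB, 83 reads/s, $80 disk, $0.047/MB RAM
    print(break_even_seconds(256, 83, 80, 0.047))      # ~5,250 s, about 90 minutes

    # RAM vs. 32 GB flash disk, 4 KB pages: 6,200 reads/s, $999 flash drive
    print(break_even_seconds(256, 6200, 999, 0.047))   # ~880 s, about 15 minutes

    # flash vs. SATA disk, 4 KB pages: flash priced per MB at roughly $999 / 32,768 MB
    print(break_even_seconds(256, 83, 80, 999 / 32768))  # ~8,100 s, over 2 hours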

If there is a surprise in this change, it is that the break-even interval has grown by less than two orders of magnitude. Recall that RAM memory was estimated in 1987 at about $5,000 per MB whereas today the cost is about $0.05 per MB, a difference of five orders of magnitude. On the other hand, disk prices have also tumbled ($15,000 per disk in 1987), and disk latency and bandwidth have improved considerably (from 15 accesses per second to about 100 on SATA and about 200 on high-performance SCSI disks).

For RAM and flash disks of 32 GB, the break-even interval is (256 / 6,200) × ($999 / $0.047) = 876 seconds = 15 minutes. If today's price for flash disks includes a "novelty premium" and comes down closer to the price of raw flash memory, say to $400 (a price also anticipated by Gray and Fitzgerald [GF 07]), then the break-even interval is 351 seconds = 6 minutes.

An important consequence is that in systems tuned us-ing economic considerations, turn-over in RAM memory is about 15 times faster (90 minutes / 6 minutes) if flash mem-ory rather than a traditional disk is the next level in the stor-age hierarchy. Much less RAM is required resulting in lower costs for purchase, power, and cooling.

Perhaps most interestingly, applying the same formula to flash and disk gives (256 / 83) × ($80 / $0.03) = 8,070 seconds = 2¼ hours. Thus, all active data will remain in RAM and flash memory.

Without doubt, 2 hours is longer than any common checkpoint interval, which implies that dirty pages in flash are forced to disk not by page replacement but always by checkpoints. Pages that are updated frequently must be written much more frequently (due to checkpoints) than is optimal based on Gray and Putzolu's formula.

In 1987, Gray and Putzolu speculated 20 years into the future and anticipated a "five-hour rule" for RAM and disks. For records of 1 KB, today's devices suggest 20,978 seconds or a bit less than 6 hours. Their prediction was amazingly accurate.

Page size     1 KB      4 KB      16 KB     64 KB     256 KB
RAM-SATA      20,978    5,248     1,316     334       88
RAM-flash     2,513     876       467       365       339
Flash-SATA    32,253    8,070     2,024     513       135
RAM-$400      1,006     351       187       146       136
$400-SATA     80,553    20,155    5,056     1,281     337

Table 3. Break-even intervals [seconds].

All break-even intervals are different for larger page sizes, e.g., 64 KB or even 256 KB. Table 3 shows the break-even intervals, including ones cited above, for a variety of page sizes and combinations of storage technologies.

1 The “=” sign often indicates rounding in this paper.


“$400” stands for a 32 GB NAND flash drive available in the future for $400 rather than for $999 today.

The old five-minute rule for RAM and disk now ap-plies to page sizes of 64 KB (334 seconds). Five minutes had been the approximate break-even interval for 1 KB in 1987 [GP 87] and for 8 KB in 1997 [GG 97]. This trend reflects the different rates of improvement in disk access latency and transfer bandwidth.

The five-minute break-even interval also applies to RAM and today’s expensive flash memory for page sizes of 64 KB and above (365 and 339 seconds). As the price pre-mium for flash memory decreases, so does the break-even interval (146 and 136 seconds).

The two new five-minute rules promised in the abstract are indicated with values in bold italics in Table 3. We will come back to this table and these rules in the discussion on optimal node sizes for B-tree indexes.

4 Page movement In addition to I/O to and from RAM memory, a three-

level memory hierarchy also requires data movement be-tween flash memory and disk storage.

The pure mechanism for moving pages can be realized in hardware, e.g., by DMA transfer, or it might require an indirect transfer via RAM memory. The former case prom-ises better performance, whereas the latter design can be realized entirely in software without novel hardware. On the other hand, hybrid disk manufacturers might have cost-effective hardware implementations already available.

The policy for page movement is governed or derived from demand-paging and LRU replacement. As discussed above, replacement policies in both file systems and data-base systems may rely on LRU and can be implemented with appropriate data structures in RAM memory. As with buffer management in RAM memory, there may be differ-ences due to prefetch, read-ahead, and write-behind, which in database systems may be directed by hints from the query execution layer, whereas file systems must detect page access patterns and worthwhile read-ahead actions without the benefit of such hints.

If flash memory is part of the persistent storage, page movement between flash memory and disk is very similar to page movement during defragmentation, both in file sys-tems and in database systems. Perhaps the most significant difference is how page movement and current page loca-tions are tracked in these two kinds of systems.

5 Tracking page locations The mechanisms for tracking page locations are quite

different in file systems and database systems. In file sys-tems, pointer pages keep track of data pages or of runs of contiguous data pages. Moving an individual page may require breaking up a run. It always requires updating and then writing a pointer page.

In database systems, most data is stored in B-tree indexes, including clustered and non-clustered indexes on tables, materialized views, and metadata catalogs. Bitmap indexes, columnar storage, and master-detail clustering can readily and efficiently be represented in B-tree indexes [G 07]. Tree structures derived from B-trees are also used for binary large objects ("blobs") and are similar to the storage structures of some file systems [CDR 89, S 81].

For B-trees, moving an individual page can range from very expensive to very cheap. The most efficient mecha-nisms are usually found in utilities for defragmentation or reorganization. Cost or efficiency result from two aspects of B-tree implementation, namely maintenance of neighbor pointers and logging for recovery.

First, if physical neighbor pointers are maintained in each B-tree page, moving a single page requires updating two neighbors in addition to the parent node. If the neighbor pointers are logical using “fence keys,” only the parent page requires an update during a page movement[G 04]. If the parent page is in memory, perhaps even pinned in the buffer pool, recording the new location is rather like updating an in-memory indirection array. The pointer change in the parent page is logged in the recovery log, but there is no need to force the log immediately to stable storage because this change is merely a structural change, not a database contents change.

Second, database systems log changes in the physical database, and in the extreme case both the deleted page image and the newly created page image are logged. Thus, an inefficient implementation produces two full log pages whenever a single data page moves from one location to another. A more efficient implementation only logs alloca-tion actions and delays de-allocation of the old page image until the new image is safely written in its intended location [G 04]. In other words, moving a page from one location, e.g., on persistent flash memory, to another location, e.g., on disk, requires only a few bytes in the database recovery log.

The difference between file systems and database systems is the efficiency of updates enabled by the recovery log. In a file system, the new page location must be saved as soon as possible by writing a new image of the pointer page. In a database system, only a few short log records must be added to the log buffer. Thus, the overhead for a page movement in a file system is writing an entire pointer page using a random access, whereas a database system adds a log record of a few dozen bytes to the log buffer that will eventually be written using large sequential write operations.

If a file system uses flash memory as persistent store, moving a page between a flash memory location and an on-disk location adds substantial overhead. Thus, we believe that file system designers will prefer flash memory as ex-tension to the buffer pool rather than extension of the disk, thus avoiding this overhead.


A database system, however, has built-in mechanisms that can easily track page movements. These mechanisms are inherent in the “work horse” data structure, B-tree in-dexes. In comparison to file systems, these mechanisms permit very efficient page movement. Each page movement requires only a fraction of a sequential write (in the recov-ery log) rather than a full random write.

Moreover, the database mechanisms are also very reli-able. Should a failure occur during a page movement, data-base recovery is driven by the recovery log, whereas a file system requires checking the entire storage during reboot.

6 Checkpoint processing To ensure fast recovery after a system failure, database

systems employ checkpoints. Their effect is that recovery only needs to consider database activity later than the most recent checkpoint plus some limited activity explicitly indi-cated in the checkpoint information. This effect is achieved partially by writing dirty pages in buffer pool.

If pages in flash memory are considered part of the buffer pool, dirty pages must be written to disk during database checkpoints. Common checkpoint intervals are measured in seconds or minutes. Alternatively, if checkpoints are not truly points but intervals, it is even reasonable to flush pages and perform checkpoint activities continuously, starting the next one as soon as one finishes. Thus, many writes to flash memory will soon require a write to disk, and flash memory as intermediate level in the memory hierarchy fails to absorb write activity. This effect may be exacerbated if, as discussed in the previous section, RAM memory is kept small due to the presence of flash memory.

If, on the other hand, flash memory is part of the per-sistent storage, writing to flash memory is sufficient. Write-through to disk is required only as part of page replacement, i.e., when a page’s usage suggests placement on disk rather than in flash memory. Thus, checkpoints do not incur the cost of moving data from flash memory to disk.

Checkpoints might even be faster in systems with flash memory because dirty pages in RAM memory need to be written merely to flash memory, not to disk. Given the very fast random access in flash memory relative to disk drives, this difference might speed up checkpoints significantly.

To summarize, database systems benefit if the flash memory is managed as part of the persistent storage. In contrast, traditional file systems do not have system-wide checkpoints that flush the recovery log and any dirty data in the buffer pool. Instead, they rely on carefully writing modified file system pages due to the lack of a recovery log protecting file contents.

7 Page sizes In addition to the tuning based on the five-minute rule,

another optimization based on access performance is sizing of B-tree nodes. The optimal page size combines a short

access time with a high reduction in remaining search space. Assuming binary search within each B-tree node, the latter is measured by the logarithm of records within a node. This measure was called a node’s “utility” in our ear-lier work [GG 97]. This optimization is essentially equiva-lent to one described in the original research on B-trees [BM 70].

Page size    Records/page    Node utility    Access time    Utility/time
4 KB         140             7               12.0 ms        0.58
16 KB        560             9               12.1 ms        0.75
64 KB        2,240           11              12.2 ms        0.90
128 KB       4,480           12              12.4 ms        0.97
256 KB       8,960           13              12.9 ms        1.01
512 KB       17,920          14              13.7 ms        1.02
1 MB         35,840          15              15.4 ms        0.97

Table 4. Page utility for B-tree nodes on disk.

Table 4 illustrates this optimization for records of 20 bytes, typical if prefix and suffix truncation [BU 77] are employed, and nodes filled at about 70%.
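The entries of Tables 4 and 5 can be approximated with the following sketch (not the authors' code; latency and bandwidth are the rough Table 1 values, records per page follow from 20-byte records at 70% fill, and 1 MB is treated as 1,000 KB in the transfer-time term, so the results match the tables only up to rounding):

    import math

    # (access latency in ms, transfer bandwidth in MB/s), roughly as in Table 1
    DISK, FLASH = (12.0, 300.0), (0.1, 66.0)

    def utility_per_time(page_kb, latency_ms, bandwidth_mb_s,
                         record_bytes=20, fill=0.70):
        records = page_kb * 1024 * fill / record_bytes      # records per node
        utility = math.floor(math.log2(records))            # binary-search "utility"
        access_ms = latency_ms + page_kb / bandwidth_mb_s   # latency + transfer time
        return utility / access_ms

    for kb in (4, 16, 64, 128, 256, 512, 1024):
        print(f"disk  {kb:4d} KB: {utility_per_time(kb, *DISK):.2f}")
    for kb in (1, 2, 4, 8, 16, 64):
        print(f"flash {kb:4d} KB: {utility_per_time(kb, *FLASH):.2f}")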

Not surprisingly, the optimal page size for B-tree in-dexes on modern high-bandwidth disks is much larger than traditional database systems have employed. The access time dominates for all small page sizes, such that additional byte transfer and thus additional utility are almost free.

B-tree nodes of 256 KB are very near optimal. For those, Table 3 indicates a break-even time for RAM and disk of 88 seconds. For a $400-flash disk and a traditional rotating hard disk, Table 3 indicates 337 seconds or just over 5 minutes. This is the first of the two five-minute rules promised in the abstract.

Page size    Records/page    Node utility    Access time    Utility/time
1 KB         35              5               0.11 ms        43.4
2 KB         70              6               0.13 ms        46.1
4 KB         140             7               0.16 ms        43.6
8 KB         280             8               0.22 ms        36.2
16 KB        560             9               0.34 ms        26.3
64 KB        2,240           11              1.07 ms        10.3

Table 5. Page utility for B-tree nodes on flash memory.

Table 5 illustrates the same calculations for B-trees on flash memory. Due to the lack of mechanical seeking and rotation, the transfer time dominates even for small pages. The optimal page size for B-trees on flash memory is 2 KB, much smaller than for traditional disk drives.

In Table 3, the break-even interval for pages of 4 KB is 351 seconds. This is the second five-minute rule promised in the abstract.

The implication of two different optimal page sizes is that a uniform node size for B-trees on flash memory and traditional rotating hard disks is sub-optimal. Optimizing page sizes for both media requires a change in buffer man-agement, space allocation, and some of the B-tree logic.

Fortunately, O’Neil already designed a space allocation scheme for B-trees in which neighboring leaf nodes usually


reside within the same contiguous extent of pages [O 92]. When a new page is needed for a node split, another page within the same extent is allocated. When an extent overflows, half its pages are moved to a newly allocated extent.

Using O’Neil’s SB-trees, extents of 256 KB are the units of transfer between flash memory and disk, whereas pages of 4 KB are the unit of transfer between RAM and flash memory.

Similar notions of self-similar B-trees have also been proposed for higher levels in the memory hierarchy, e.g., in the form of B-trees of cache lines for the indirection vector within a large page [L 01]. Given that there are at least 3 levels of B-trees and 3 node sizes now, i.e., cache lines, flash memory pages, and disk pages, research into cache-oblivious B-trees [BDC 05] might be very promising.

8 Query processing Self-similar designs apply both to data structures such

as B-trees and to algorithms. For example, sort algorithms already employ algorithms similar to traditional external merge sort in multiple ways, not only to merge runs on disk but also to merge runs in memory, where the initial runs in memory are sized to limit run creation to the CPU cache [G 06, NBC 95].

The same technique might be applied three times in-stead of two, i.e., runs in memory are merged into runs in flash memory, and for very large sort operations, runs on flash memory are merged into runs on disk. Read-ahead, forecasting, write-behind, and page sizes all deserve a new look in a multi-level memory hierarchy consisting of cache, RAM, flash memory, and traditional disk drives. These page sizes can then inform the break-even calculation for page retention versus I/O and thus guide the optimal capaci-ties at each memory level.
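A deliberately simplified sketch of such a three-level cascade follows (in-memory lists stand in for cache-, flash-, and disk-resident runs; a real implementation would of course stream runs through I/O buffers rather than materialize them):

    import heapq

    def three_level_sort(records, mem_run, flash_fanin):
        """Toy cascade of external merge sort across a memory hierarchy:
        cache-sized runs are merged into larger 'flash' runs, which are
        merged once more into the final 'disk' run. mem_run is the number
        of records per in-memory run; flash_fanin is how many memory runs
        are merged into each flash run."""
        mem_runs = [sorted(records[i:i + mem_run])
                    for i in range(0, len(records), mem_run)]
        flash_runs = [list(heapq.merge(*mem_runs[i:i + flash_fanin]))
                      for i in range(0, len(mem_runs), flash_fanin)]
        return list(heapq.merge(*flash_runs))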

It may be surmised that a variation of this sort algo-rithm will not only be fast but also energy-efficient. While energy efficiency has always been crucial for battery-powered devices, research into energy-efficient query proc-essing on server machines is only now beginning [RSK 07]. For example, both for flash memory and for disks, the en-ergy-optimal page sizes might well differ from the perform-ance-optimal page sizes.

The I/O pattern of external merge sort is similar (albeit in the opposite direction) to the I/O pattern of external partition sort as well as to the I/O pattern of partitioning during hash join and hash aggregation. The latter algorithms, too, require re-evaluation and re-design in a three-level memory hierarchy, or even a four-level memory hierarchy if CPU caches are also considered [SKN 94].

Flash memory with its very fast access times may revive interest in index-based query execution [DNB 93, G 03]. Optimal page sizes and turn-over times are those derived in the earlier sections.

9 Record and object caches Page sizes in database systems have grown over the

years, although not as fast as disk transfer bandwidth. On the other hand, small pages require less buffer pool space for each root-to-leaf search. For example, consider an index with 20,000,000 entries. With index pages of 128 KB and 4,500 records, a root-to-leaf search requires 2 nodes and thus 256 KB in the buffer pool, although half of that (the root node) can probably be shared with other transactions. With index pages of 8 KB and 280 records per page, a root-to-leaf search requires 3 nodes or 24 KB in the buffer pool, about one order of magnitude less.
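The node counts in this example follow from the smallest B-tree depth whose fan-out covers all entries; a short check (a hypothetical helper that ignores root under-filling and differences in leaf versus non-leaf fan-out):

    import math

    def btree_levels(entries, records_per_page):
        # smallest depth d with records_per_page ** d >= entries
        return math.ceil(math.log(entries) / math.log(records_per_page))

    print(btree_levels(20_000_000, 4_500))  # 2 nodes on a root-to-leaf path (128 KB pages)
    print(btree_levels(20_000_000, 280))    # 3 nodes (8 KB pages)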

In the traditional database architecture, the default page size is a compromise between efficient index search (using large B-tree nodes as discussed above and already in the original B-tree papers [BM 70]) and moderate buffer pool requirements for each index search. Nonetheless, the example above requires 24 KB in the buffer pool for finding a record of perhaps only 20 bytes, and it requires 8 KB of the buffer pool for retaining these 20 bytes in memory. An alternative design employs large on-disk pages and a record cache that serves applications, because record caches minimize memory needs yet provide the desired data retention.

The introduction of flash memory with its fast access latency and its small optimal page size may render record caches obsolete. With the large on-disk pages in flash memory and only small pages in the in-memory buffer pool, the desired compromise can be achieved without the need for two separate data structures, i.e., a transacted B-tree and a separate record cache.

In object-oriented applications that assemble complex objects from many tables and indexes in a relational data-base, the problem may be either better or worse, depending on the B-tree technology employed. If traditional indexes are used with a separate B-tree for each record format, as-sembling a complex object in memory requires many root-to-leaf searches and thus many B-tree nodes in the buffer pool. If records from multiple indexes can be interleaved within a single B-tree based on their common search keys and sort order [G 07, H 78], e.g., on object identifier plus appropriate additional keys, very few or even a single B-tree search may suffice. Moreover, the entire complex ob-ject may be retained in a single page within the buffer pool.

10 Directions for future work Several directions for future research suggest them-

selves. We plan on pursuing multiple of these in the future. First, the analyses in this paper are focused on purchas-

ing costs. Other costs could be taken into consideration in order to capture total cost of ownership. Perhaps most inter-estingly, a focus on energy consumption may lead to differ-ent break-even points or even entirely different conclusions. Along with CPU scheduling, algorithms for staging data in the memory hierarchy, including buffer pool replacement


and compression, may be the software techniques with the highest impact on energy consumption.

Second, the five-minute rule applies to permanent data and their management in a buffer pool. The optimal reten-tion time for temporary data such as run files in sorting and overflow files in hash join and hash aggregation may be different. For sorting, as for B-tree searches, the goal should be to maximize the number of comparisons per unit of I/O time or per unit of energy spent on I/O. Focused research may lead to new insights about query processing in multi-level memory hierarchies.

Third, Gray and Putzolu offered further rules of thumb, e.g., the ten-byte rule for trading memory and CPU power. These rules also warrant revisiting for both costs and energy. Compared to 1987, the most fundamental change may be that CPU power should be measured not in instructions but in cache line replacements. Trading off space and time seems like a new problem in this environment.

Fourth, what are the best data movement policies? One extreme is a database administrator explicitly moving entire files, tables and indexes between flash memory and traditional disk. Another extreme is automatic movement of individual pages, controlled by a replacement policy such as LRU. Intermediate policies may focus on the roles of individual pages within a database or on the current query processing activity. For example, catalog pages may be moved after schema changes to facilitate fast recompilation of all cached query execution plans, and upper B-tree levels may be prefetched and cached in RAM memory or in flash memory during execution of query plans relying on index navigation.

Fifth, what are secondary effects of introducing flash memory into the memory hierarchy of a database server? For example, short access times permit a lower multi-programming level, because only short I/O operations must be “hidden” by asynchronous I/O and context switching. A lower multi-programming level in turn may reduce conten-tion for memory in sort and hash operations and for locks and latches (concurrency control for in-memory data struc-tures). Should this effect prove significant, effort and com-plexity of using a fine granularity of locking may be re-duced.

Sixth, how will flash memory affect in-memory data-base systems? Will they become more scalable, affordable, and popular based on memory inexpensively extended with flash memory rather than RAM memory? Will they become less popular due to very fast traditional database systems using flash memory instead of (or in addition to) disks? Can a traditional code base using flash memory instead of tradi-tional disks compete with a specialized in-memory database system in terms of performance, total cost of ownership, development and maintenance costs, time to market of fea-tures and releases, etc.?

Finally, techniques similar to generational garbage col-lection may benefit storage hierarchies. Selective reclama-tion applies not only to unreachable in-memory objects but

also to buffer pool pages and favored locations on perma-nent storage. Such research also may provide guidance for log-structured file systems, for wear leveling for flash memory, and for write-optimized B-trees on RAID storage.

11 Summary and conclusions In summary, the 20-year-old “five minute rule” for

RAM and disks still holds, but for ever larger disk pages. Moreover, it should be augmented by two new five-minute rules, one for small pages moving between RAM and flash memory and one for large pages moving between flash memory and disks. For small pages moving between RAM and disk, Gray and Putzolu were amazingly accurate in predicting a five-hour break-even point 20 years into the future.

Research into flash memory and its place in system ar-chitectures is urgent and important. Within a few years, flash memory will be used to fill the gap between tradi-tional RAM memory and traditional disk drives in many operating systems, file systems, and database systems.

Flash memory can be used to extend RAM or to extend persistent storage. These models are called “extended buffer pool” and “extended disk” here. Both models may seem viable in operating systems, file systems, and in database systems. Due to the characteristics of these systems, how-ever, they will employ different usage models.

In both models, contents of RAM and of flash will be governed by LRU-like replacement algorithms that attempt to keep the most valuable pages in RAM and the least valu-able pages on traditional disks. The linked list or other data structure implementing the replacement policy for the flash memory will be maintained in RAM.

Operating systems and file systems will employ flash memory mostly as transient memory, e.g., as a fast backup store for virtual memory and as a secondary file system cache. Both of these applications fall into the extended buffer pool model. During an orderly system shutdown, the flash memory contents might be written to persistent stor-age. During a system crash, however, the RAM-based de-scription of flash memory contents will be lost and must be reconstructed by a contents analysis very similar to a tradi-tional file system check. Alternatively, flash memory con-tents can be voided and be reloaded on demand.

Database systems, on the other hand, will employ flash memory as persistent storage, using the extended disk model. The current contents will be described in persistent data structures, e.g., parent pages in B-tree indexes. Tradi-tional durability mechanisms, in particular logging and checkpoints, ensure consistency and efficient recovery after system crashes. An orderly system shutdown has no need to write flash memory contents to disk.

There are two reasons for these different usage models for flash memory. First, database systems rely on regular checkpoints during which dirty pages in the buffer pool are flushed to persistent storage. If a dirty page is moved from RAM to the extended buffer pool in flash memory, it cre-


ates substantial overhead during the next checkpoint. A free buffer must be found in RAM, the page contents must be read from flash memory into RAM, and then the page must be written to disk. Adding such overhead to checkpoints is not attractive in database systems with frequent checkpoints. Operating systems and file systems, on the other hand, do not rely on checkpoints and thus can exploit flash memory as extended buffer pool.

Second, the principal persistent data structures of data-bases, B-tree indexes, provide precisely the mapping and location tracking mechanisms needed to complement fre-quent page movement and replacement. Thus, tracking a data page when it moves between disk and flash relies on the same data structure maintained for efficient database search. In addition to avoiding buffer descriptors etc. for pages in flash memory, avoiding indirection in locating a page also makes database searches as efficient as possible.

Finally, as the ratio of access latencies and transfer bandwidth is very different for flash memory and for disks, different B-tree node sizes are optimal. O'Neil's SB-tree exploits two node sizes as needed in a multi-level storage hierarchy. The required inexpensive mechanisms for moving individual pages are the same as those required when moving pages between flash memory and disk.

Acknowledgements This paper is dedicated to Jim Gray, who has suggested

this research and has helped me and many others many times in many ways. – Barb Peters, Lily Jow, Harumi Kuno, José Blakeley, Mehul Shah, and the reviewers sug-gested multiple improvements after reading earlier versions of this paper.

References [BDC 05] Michael A. Bender, Erik D. Demaine, Martin

Farach-Colton: Cache-Oblivious B-Trees. SIAM J. Comput. 35(2): 341-358 (2005).

[BM 70] Rudolf Bayer, Edward M. McCreight: Organiza-tion and Maintenance of Large Ordered Indexes. SIG-FIDET Workshop 1970: 107-141.

[BU 77] Rudolf Bayer, Karl Unterauer: Prefix B-Trees. ACM TODS 2(1): 11-26 (1977).

[CDR 89] Michael J. Carey, David J. DeWitt, Joel E. Richardson, Eugene J. Shekita: Storage Management in EXODUS. Object-Oriented Concepts, Databases, and Applications 1989: 341-369.

[CLG 94] Peter M. Chen, Edward L. Lee, Garth A. Gibson, Randy H. Katz, David A. Patterson: RAID: High-Performance, Reliable Secondary Storage ACM Com-put. Surv. 26(2): 145-185 (1994).

[DNB 93] David J. DeWitt, Jeffrey F. Naughton, Joseph Burger: Nested Loops Revisited. PDIS 1993: 230-242.

[G 03] Goetz Graefe: Executing Nested Queries. BTW 2003: 58-77.

[G 04] Goetz Graefe: Write-Optimized B-Trees. VLDB 2004: 672-683.

[G 06] Goetz Graefe: Implementing Sorting in Database Systems. ACM Comput. Surv. 38(3): (2006).

[G 07] Goetz Graefe: Master-detail clustering using merged indexes. Informatik – Forschung und Entwicklung (2007).

[GF 07] Jim Gray, Bob Fitzgerald: FLASH Disk Opportunity for Server-Applications. http://research.microsoft.com/~gray/papers/FlashDiskPublic.doc.

[GG 97] Jim Gray, Goetz Graefe: The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb. SIGMOD Record 26(4): 63-68 (1997).

[GP 87] Jim Gray, Gianfranco R. Putzolu: The 5 Minute Rule for Trading Memory for Disk Accesses and The 10 Byte Rule for Trading Memory for CPU Time. SIGMOD 1987: 395-398.

[H 78] Theo Härder: Implementing a Generalized Access Path Structure for a Relational Database System. ACM TODS 3(3): 285-298 (1978).

[H 07] James Hamilton: An Architecture for Modular Data Centers. CIDR 2007.

[HR 83] Theo Härder, Andreas Reuter: Principles of Trans-action-Oriented Database Recovery. ACM Comput. Surv. 15(4): 287-317 (1983).

[L 01] David B. Lomet: The Evolution of Effective B-tree Page Organization and Techniques: A Personal Account. SIGMOD Record 30(3): 64-69 (2001).

[LR 07] James R. Larus, Ravi Rajwar: Transactional Mem-ory. Synthesis Lectures on Computer Architecture, Morgan & Claypool (2007).

[NBC 95] Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, David B. Lomet: AlphaSort: A Cache-Sensitive Parallel External Sort VLDB J. 4(4): 603-627 (1995).

[OD 89] John K. Ousterhout, Fred Douglis: Beating the I/O Bottleneck: A Case for Log-Structured File Systems. Operating Systems Review 23(1): 11-28 (1989).

[O 92] Patrick E. O'Neil: The SB-Tree: An Index-Sequential Structure for High-Performance Sequential Access. Acta Inf. 29(3): 241-265 (1992).

[RSK 07] Suzanne Rivoire, Mehul Shah, Partha Rangana-than, Christos Kozyrakis: JouleSort: A Balanced En-ergy-Efficiency Benchmark. SIGMOD 2007.

[S 81] Michael Stonebraker: Operating System Support for Database Management. CACM 24(7): 412-418 (1981).

[SKN 94] Ambuj Shatdal, Chander Kant, Jeffrey F. Naugh-ton: Cache Conscious Algorithms for Relational Query Processing. VLDB 1994: 510-521.

[W 01] David Woodhouse: JFFS: the Journaling Flash File System. Ottawa Linux Symposium, Red Hat Inc, 2001.


Architectural Characterization of XQuery Workloads on Modern Processors

Rubao Lee, Bihui Duan, Taoying Liu

Research Centre for Grid and Service Computing,

Institute of Computing Technology, Chinese Academy of Sciences

PO Box 2704, Beijing, China

{lirubao,duanbihui,liutaoying}@software.ict.ac.cn

ABSTRACT

As XQuery rapidly emerges as the standard for querying XML documents, it is very important to understand the architectural characteristics and behaviors of such workloads. A lot of effort is focused on the implementation, optimization, and evaluation of XQuery tools. However, little or no prior work studies the architectural and memory system behaviors of XQuery workloads on modern hardware platforms. This makes it unclear whether modern CPU techniques, such as multi-level caches and hardware branch predictors, can support such workloads well enough.

This paper presents a detailed characterization of the architectural behavior of XQuery workloads. We examine four XQuery tools on three hardware platforms (AMD, Intel, and Sun) using well-designed XQuery queries. We report measured architectural data, including the L1/L2 cache misses, TLB misses, and branch mispredictions. We believe that the information will be useful in understanding XQuery workloads and analyzing the potential architectural optimization opportunities of improving XQuery performance.

1. INTRODUCTION The wide spread of XML storage and web services creates a lot of applications that need to query and process XML documents. Compared with traditional scientific computing applications and OLTP/DSS workloads [2], such applications run on a larger range of hardware platforms, including web servers, desktops, and even mobile devices. As XQuery rapidly emerges as the standard for querying XML documents, it is very important to understand the architectural characteristics and behaviors of running such workloads on different hardware platforms.

However, the architectural research community and the XQuery research community have not yet joined efforts to clarify how XQuery workloads run on modern processors. Architectural researchers have focused on studying the architectural characteristics of database workloads, including query processing and transaction processing. Furthermore,

architecturally optimized database algorithms and schemes [24][25][26] have been presented on the basis of insights given by this DBMS characterization work. Compared with this joint effort of architecture research and database research, researching XQuery from the viewpoint of architecture is just beginning.

Although XQuery shares many common concepts with SQL, its execution has many specific features that differ from those in relational databases. Intuitively, XQuery is basically computing and memory bound since (1) XML documents are relatively small (<100MB) in typical applications and thus I/O is not the dominating factor, and (2) executions of XQuery queries are often time-consuming, as they need to manipulate a lot of nodes resident in memory. Path navigation is the cornerstone of XQuery, which creates different memory-accessing patterns from a DBMS's tuple-based query execution. In addition, current XQuery engines are often written in an object-oriented language and run on a virtual machine, such as typical Java-based systems [20][22]. These factors make it necessary to characterize the CPU and memory behavior of XQuery workloads.

This paper is our first step toward understanding the architectural characteristics of XQuery workloads on modern processors. We present a detailed characterization of their architectural behavior. We examined four XQuery tools on three hardware platforms (AMD, Intel, and Sun) using XQuery queries that we designed. We report measured architectural data, including L1/L2 cache misses, TLB misses, and branch mispredictions. We believe that this information will be useful in understanding XQuery workloads and in analyzing potential architectural opportunities for improving XQuery performance.

The remainder of this paper is organized as follows. Section 2 briefly lists related work. Section 3 introduces the XQuery workloads we designed. Section 4 introduces the target hardware platforms and the corresponding measurement tools. All test results are presented in Section 5. We conclude in Section 6.

2. RELATED WORK Many papers characterize the architectural behavior of DBMS queries, including OLTP and DSS workloads, such as [1][2][3][4][5][6]. Some papers examine Java workloads, such as [7][8][9]. To our knowledge, almost no prior work is specifically focused on characterizing XQuery workloads. In [10], the authors report measured architectural characteristics for XML processing on an Intel Xeon platform. In addition, work on benchmarking XQuery is related to this paper, such as [16][17][18][19][28][29].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Proceedings of the Second International Workshop on Performance and Evaluation of Data Management Systems (ExpDB 2007), June 15, 2007, Beijing, China. Copyright 2007 ACM 978-1-59593-773-5/07/06 ...$5.00.


3. XQUERY WORKLOAD DESIGN In this section, we introduce the XQuery workloads used in our experiments, which consist of a data part and a query part.

3.1 Documents We design three categories of XML documents. Their visual shapes are illustrated by Figure 1.

3.1.1 Rectangle XML Rectangle XML has two parameters: the width and the height. The width means how many child nodes the root has, while the height means how many levels the tree has under the root node.

We distinguish two kinds of Rectangle XML: wide and narrow. A wide Rectangle XML has a very large width but a very small height, while a narrow one has a very small width but a very large height. Although narrow Rectangle XML is rare in practice, we expect the difference between the two shapes to expose different memory-system behaviors, given their structural difference.

The concrete values of the two parameters are as follows.

Wide Rectangle XML:

Width: [10000,20000,40000,80000,100000]

Height: [1,2,4,6,8,10]

Narrow Rectangle XML:

Width: [1,2,4,6,8,10]

Height: [500,1000,5000,10000,20000,40000]

In a Rectangle XML document, the root is labeled root. Nodes at level x are all labeled tx. No node has attributes.

3.1.2 Triangle XML In a Triangle XML document, each non-leaf node has two child nodes, so the tree is a complete binary tree. Triangle XML has only one parameter: the height. The concrete values of the height are as follows.

Triangle XML:

Height: [15,16,17,18,19,20]

In a Triangle XML document, the root is labeled t1. Nodes at level x are all labeled tx. No node has attributes.

3.1.3 List XML (300k.xml) Unlike Rectangle XML and Triangle XML, List XML carries payloads. It is defined as follows:

(1) The root node has 300000 child nodes and each child node has a text child node whose value is a random integer between 1 and 300000.

(2) The root node is labeled result and all 300000 child nodes are labeled t.
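
The paper does not include the generator used to produce these documents; the following is a minimal Python sketch of one plausible construction of the three shapes described above. The exact Rectangle layout (here, each of the root's width children heads a chain of height levels, so every level below the root holds width nodes) and the output file names are assumptions for illustration, not the authors' implementation.

    # Hypothetical generator for the documents of Section 3.1; file names and
    # the exact Rectangle layout are assumptions made for illustration only.
    import random

    def rectangle_xml(width, height):
        # Root labeled "root"; each of its `width` children heads a chain of
        # `height` levels whose nodes at level x are labeled tx; no attributes.
        parts = ["<root>"]
        for _ in range(width):
            parts.extend("<t%d>" % x for x in range(1, height + 1))
            parts.extend("</t%d>" % x for x in range(height, 0, -1))
        parts.append("</root>")
        return "".join(parts)

    def triangle_xml(height):
        # Complete binary tree; the root is t1 and all nodes at level x are tx.
        def subtree(level):
            if level > height:
                return ""
            child = subtree(level + 1)
            return "<t%d>%s%s</t%d>" % (level, child, child, level)
        return subtree(1)

    def list_xml(n=300000, max_value=300000, seed=0):
        # Root "result" with n children "t", each holding a random integer text node.
        rng = random.Random(seed)
        items = "".join("<t>%d</t>" % rng.randint(1, max_value) for _ in range(n))
        return "<result>%s</result>" % items

    if __name__ == "__main__":
        open("rect_w10000_h4.xml", "w").write(rectangle_xml(10000, 4))   # hypothetical names
        open("tri_h15.xml", "w").write(triangle_xml(15))
        open("300k.xml", "w").write(list_xml())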

3.2 Queries In our experiments, we use only three queries, which cover the basic operations of XQuery: path navigation, selection, and sorting. Our major goal is not to evaluate the language-support capabilities of particular tools, but to understand and analyze the architectural behavior of basic XQuery workloads.

Q1: Retrieving all leaf nodes This query is actually an XPath query. It is executed on Rectangle XML and Triangle XML. The query has the form doc("xmlname")//tn, where n is the level of the leaf nodes in the corresponding document "xmlname".

Q2: Selection This query is executed only on List XML. The selectivity of the where clause in the query is 50%, since the text values are uniformly distributed between 1 and 300000 and the predicate selects values below 150000.

for $t in doc("300k.xml")/result//t

where number($t/child::node()) < 150000

return $t

Q3: Sorting This query is executed only on List XML.

for $t in doc("300k.xml")/result//t

order by number($t/child::node())

return $t

Figure 1: visual shapes of the three kinds of XML documents. In the graphs, a circle means an element node, while a rectangle means a text node (only existing in List XML).

Table 1: Characteristics of three CPUs. (The data is from corresponding official websites [11][14][15].)

Characteristic | AMD Sempron 2500+ | Intel Pentium P4 2.8GHz | Sun UltraSPARC T1
L1 cache organization | Split instruction/data caches | Split instruction/data caches | Split instruction/data caches
L1 cache size | 64 KB each for instructions/data | 8 KB for data, 96 KB trace cache | 16 KB instruction cache per core; 8 KB data cache per core
L1 cache associativity | 2-way set associative | 4-way set associative | 4-way set associative
L1 block size | 64 bytes | 64 bytes | 32 bytes (instruction cache), 16 bytes (data cache)
L2 cache organization | Unified (instruction and data) | Unified (instruction and data) | Unified (instruction and data)
L2 cache size | 256 KB | 512 KB | 3 MB (shared by cores)
L2 cache associativity | 16-way set associative | 8-way set associative | 12-way set associative
L2 block size | 64 bytes | 128 bytes | 64 bytes


4. HARDWARE & SOFTWARE

4.1 Hardware Platforms and Measuring Tools

Our experiments were executed on three hardware platforms. The specifications of the involved CPUs are briefly given in Table 1.

4.1.1 AMD This system contains an AMD Sempron 2500+ CPU and 2048 MB of DDR400 RAM, and runs FreeBSD 6.2. We use the pmcstat tool [12] to count hardware events.

This system is our primary platform in the experiments, on which we measure the following architectural characteristics:

(1) L2 miss rate
(2) Frequency of branch mispredictions
(3) L1 I-TLB misses per 1000 instructions
(4) L1 D-TLB misses per 1000 instructions
(5) L1 I-Cache misses per 1000 instructions
(6) L1 D-Cache misses per 1000 instructions

To measure these targets, we count the events listed in Table 2. We use the indirect method [13] to measure L2 cache requests/misses and to calculate the L2 miss rate.

Table 2: events used by pmcstat on AMD Sempron

Characteristics | Event name (used by the pmcstat tool)
Branches | Branches (common alias)
Branch mispredictions | Branch-mispredicts (common alias)
Instructions | Instructions (common alias)
L1 I-TLB misses | K8-ic-l1-itlb-miss-and-l2-itlb-hit + K8-ic-l1-itlb-miss-and-l2-itlb-miss
L1 D-TLB misses | K8-ic-l1-dtlb-miss-and-l2-dtlb-hit + K8-ic-l1-dtlb-miss-and-l2-dtlb-miss
L1 I-Cache misses | K8-ic-refill-from-l2 + K8-ic-refill-from-system
L1 D-Cache misses | K8-dc-refill-from-l2 + K8-dc-refill-from-system
L2 Cache accesses | L1 I-TLB misses + L1 D-TLB misses + L1 I-Cache misses + L1 D-Cache misses
L2 Cache misses | K8-ic-refill-from-system + K8-dc-refill-from-system + K8-ic-l1-itlb-miss-and-l2-itlb-miss + K8-ic-l1-dtlb-miss-and-l2-dtlb-miss
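
As a concrete reading of Table 2, the derived quantities can be computed from the raw pmcstat counts roughly as sketched below. This is a minimal sketch of the indirect calculation as we understand it from the table; the function and parameter names are ours, and nothing here drives pmcstat itself.

    # Sketch of the derived AMD metrics implied by Table 2; the inputs are the
    # (already summed) event counts that a pmcstat run would report.
    def per_1000_instructions(count, instructions):
        # e.g. L1 D-Cache misses per 1000 retired instructions
        return 1000.0 * count / instructions

    def l2_miss_rate(l1_icache_misses, l1_dcache_misses,
                     l1_itlb_misses, l1_dtlb_misses,
                     ic_refill_from_system, dc_refill_from_system,
                     itlb_l2_misses, dtlb_l2_misses):
        # "L2 Cache accesses" row: every L1 cache or L1 TLB miss counts as an L2 access.
        l2_accesses = (l1_icache_misses + l1_dcache_misses
                       + l1_itlb_misses + l1_dtlb_misses)
        # "L2 Cache misses" row: refills from system memory plus L1 TLB misses
        # that also miss in the L2 TLB.
        l2_misses = (ic_refill_from_system + dc_refill_from_system
                     + itlb_l2_misses + dtlb_l2_misses)
        return l2_misses / float(l2_accesses)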

4.1.2 Intel This system contains an Intel Pentium P4 2.8GHz CPU and 512 MB of DDR266 memory. Although this CPU supports Hyper-Threading technology, we disabled this feature in our experiments. As on the AMD system, we run FreeBSD 6.2 and use the pmcstat tool on this system.

On this system, we only measure the following architectural characteristics using the events listed in Table 3 [11]:

(1) L2 miss rate
(2) Frequency of branch mispredictions

Table 3: events used by pmcstat on Intel P4

Characteristics | Event name (used by the pmcstat tool)
L2 cache accesses | P4-bsq-cache-reference
L2 misses | P4-bsq-cache-reference,mask=rd-2ndl-miss+wr-2ndl-miss
Branches | Branches (common alias)
Branch mispredictions | Branch-mispredicts (common alias)

4.1.3 Sun UltraSPARC T1 This system is a Sun Fire T1000 Server with 8 cores and 8 GB RAM, running Solaris 10. On this system, we use the collect and er_print commands provided by Sun Studio to measure instructions and L1/L2 instruction/data cache misses [27].

4.2 Software We examined the following XQuery tools:

• Berkeley DB XML v2.3.10 (Bdb-xml for short) [23]

We ran this software only on our AMD and Intel systems. We downloaded the source tarball, which includes the sources of all the components (Berkeley DB, Xerces, and XQilla), and built them on our FreeBSD systems. The query execution command was "dbxml -s ourquery", where "ourquery" is a script containing a query command and a quit command. For our experiments, we did not use the container concept but built our queries on XML files directly.

• Galax v0.5.0 (Galax for short) [21]

We ran it on all systems. We used the Linux binary version on our FreeBSD systems without any modifications, and on the Sun system we used the Solaris binary version downloaded from the official website. The query execution command was "Galax-run -print-xml off our.xquery".

• Saxon-B 8.9 for Java (Saxon for short) [20]

This tool is written in Java. We ran it only on the Sun system. The query execution command was "java net.sf.Saxon.Query our.xquery".

• GNU Qexo v1.9.1 (Qexo for short) [22]

This tool is also written in Java. We ran it only on the Sun system. We downloaded the executable jar file and executed it directly; the command was "java -jar kawa.jar our.xquery".
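
For completeness, a rough Python sketch of how such runs could be scripted is given below. The command strings follow those quoted above; the wrapper itself, the dictionary keys, and the wall-clock timing are our additions and are not how the paper collected its counter data (that was done with pmcstat and Sun Studio, as described in Section 4.1).

    # Hypothetical driver around the query commands quoted in this subsection.
    import subprocess
    import time

    COMMANDS = {
        "bdb-xml": ["dbxml", "-s", "ourquery"],                  # script: query + quit
        "galax":   ["Galax-run", "-print-xml", "off", "our.xquery"],
        "saxon":   ["java", "net.sf.Saxon.Query", "our.xquery"],
        "qexo":    ["java", "-jar", "kawa.jar", "our.xquery"],
    }

    def run(tool):
        # Discard query output and return elapsed wall-clock time in seconds.
        start = time.monotonic()
        subprocess.run(COMMANDS[tool], check=True, stdout=subprocess.DEVNULL)
        return time.monotonic() - start

    if __name__ == "__main__":
        for tool in COMMANDS:
            print("%-8s %.2f s" % (tool, run(tool)))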

5. RESULTS In this section, we first report results on AMD and Intel systems for Bdb-xml and Galax, which include the following three parts:

(1) The L2 Cache miss rates on both AMD and Intel systems (subsection 5.1)

(2) The frequency of branch mispredictions on both AMD and Intel systems (subsection 5.2)

(3) The L1 Cache misses and TLB misses on only AMD system (subsection 5.3)

Then, we report results of running Q1 on wide Rectangle XML documents on the Sun UltraSPARC T1 system for Galax, Saxon, and Qexo (subsection 5.4), which include the following parts:

(1) The total count of instructions
(2) The L1 I-Cache/D-Cache misses per 1000 instructions
(3) The L2 I-Cache/D-Cache misses per 1000 instructions



5.1 L2 Cache Miss Rate

5.1.1 AMD On this system, we measure Sempron’s L2 Cache miss rates for running Q1 on Rectangle XML and Triangle XML, and running Q2/Q3 on List XML.

5.1.1.1 Rectangle XML Figure 2 shows the L2 miss rates of executing Q1 on wide Rectangle XML in Bdb-xml, while Figure 3 shows them in Galax. Comparing the two graphs, we can see that Bdb-xml has a lower L2 miss rate than Galax for wide XML documents. Moreover, as the width increases from 10000 to 100000, the L2 miss rate decreases for Bdb-xml but increases for Galax (except for heights 1 and 2).

Figure 4 shows the L2 miss rates of executing Q1 on narrow Rectangle XML in Bdb-xml, while Figure 5 shows them in Galax. Bdb-xml has a very high L2 miss rate in this situation, especially when the height is larger than 5000 (up to 50%). Interestingly, the miss rate in Bdb-xml is minimized when the height equals 1000.

As the Galax graphs show, Galax has a relatively stable L2 miss rate (8%-13%) whether the Rectangle XML is wide or narrow. Bdb-xml, however, does not handle narrow shapes as well as wide ones (its miss rate ranges from 2% to 50%).

Figure 2: L2 miss rates of executing Q1 on wide Rectangle XML documents in Bdb-xml (varying width and height)

Figure 3: L2 miss rates of executing Q1 on wide Rectangle XML documents in Galax (varying width and height)

Figure 4: L2 miss rates of executing Q1 on narrow Rectangle XML documents in Bdb-xml (varying width and height)

Figure 5: L2 miss rates of executing Q1 on narrow Rectangle XML documents in Galax (varying width and height)

5.1.1.2 Triangle XML Figure 6 shows that Bdb-xml and Galax have opposite L2 cache miss trends on Triangle XML as the height of the tree increases. Bdb-xml fits Triangle XML better than Galax, as it does wide Rectangle XML.

Figure 6: L2 miss rates of executing Q1 on Triangle XML documents in Bdb-xml and in Galax (varying height)


5.1.1.3 List XML Figure 7 shows the L2 cache miss rates when executing Q2 (selection) and Q3 (sorting) on List XML in Bdb-xml and Galax. Although Bdb-xml has a very low L2 cache miss rate (<2%) for selection, its sorting suffers a high L2 cache miss rate (>14%). Galax shows the opposite behavior for the two operations.

Figure 7: L2 miss rates of executing Q2 (selection) and Q3 (sorting) on the List XML document in Bdb-xml and in Galax.

5.1.2 Intel On this system, we only report results of executing Q1 on Rectangle XML. Figures 8 and 9 show the L2 miss rates of Bdb-xml and Galax for executing Q1 on wide shapes; Figures 10 and 11 show them on narrow ones. We summarize the results as follows:

(1) The four graphs show L2 miss rate trends similar to the corresponding AMD results (subsection 5.1.1.1).

(2) Except for the case of Bdb-xml on narrow Rectangle XML, the P4 shows lower L2 miss rates than the Sempron, which reflects the benefit of its larger L2 cache (512 KB vs. 256 KB).

(3) Surprisingly, for executing Q1 on narrow Rectangle XML in Bdb-xml, Figure 10 shows a worse L2 miss rate for the P4 than Figure 4 does for the Sempron, despite the P4's larger L2 cache. For the maximized document (width=10, height=40000), the L2 miss rate of the P4 reaches 87%.

Figure 8: L2 miss rates of executing Q1 on wide Rectangle XML documents in Bdb-xml (varying width and height)

Figure 9: L2 miss rates of executing Q1 on wide Rectangle XML documents in Galax (varying width and height)

Figure 10: L2 miss rates of executing Q1 on narrow Rectangle XML documents in Bdb-xml (varying width and height)

Figure 11: L2 miss rates of executing Q1 on narrow Rectangle XML documents in Galax (varying width and height)

5.2 Branch Mispredictions

5.2.1 AMD

5.2.1.1 Rectangle XML Figures 12-15 show branch misprediction rates of executing Q1 on wide and narrow Rectangle XML in Bdb-xml and in Galax on our AMD system. We summarize the results as follows:

(1) Galax shows similar branch misprediction behavior whether the XML document is wide or narrow (compare Figure 13 with Figure 15).


(2) However, the XML shape has a significant influence on Bdb-xml. Comparing Figure 12 with Figure 14, we can see that Bdb-xml fares better on narrow documents than on wide ones.

Figure 12: branch misprediction rates of executing Q1 on wide Rectangle XML documents in Bdb-xml (varying width and height)

Figure 13: branch misprediction rates of executing Q1 on wide Rectangle XML documents in Galax (varying width and height)

Figure 14: branch misprediction rates of executing Q1 on narrow Rectangle XML documents in Bdb-xml (varying width and height)

Figure 15: branch misprediction rates of executing Q1 on narrow Rectangle XML documents in Galax (varying width and height)

5.2.1.2 Triangle XML Figure 16 shows the difference between Bdb-xml and Galax when executing Q1 on Triangle XML. From the graph, we can see that Bdb-xml has higher branch misprediction rates than Galax in this situation.

Figure 16: branch misprediction rates of executing Q1 on Triangle XML documents in Bdb-xml and in Galax (varying height)

5.2.1.3 List XML Figure 17 shows the branch behavior on List XML. Bdb-xml has higher misprediction rates than Galax for both selection and sorting.

Figure 17: branch misprediction rates of executing Q2 (selection) and Q3 (sorting) on the List XML document in Bdb-xml and in Galax.


5.2.2 Intel On this system, we only report results for executing Q1 on Rectangle XML. Figures 18-21 show branch misprediction rates of executing Q1 on wide and narrow Rectangle XML in Bdb-xml and in Galax on our P4 system. These graphs are similar to the corresponding AMD Sempron results, except for lower misprediction rates, which shows that the Pentium P4 has a more effective branch predictor than the AMD Sempron.

Figure 18: branch misprediction rates of executing Q1 on wide Rectangle XML documents in Bdb-xml (varying width and height)

Figure 19: branch misprediction rates of executing Q1 on wide Rectangle XML documents in Galax (varying width and height)

Figure 20: branch misprediction rates of executing Q1 on narrow Rectangle XML documents in Bdb-xml (varying width and height)

Figure 21: branch misprediction rates of executing Q1 on narrow Rectangle XML documents in Galax (varying width and height)

5.3 L1 Cache & TLB Misses In this subsection, we present our measured results about L1 cache and TLB behaviors of Bdb-xml and Galax on our AMD system. We examine five fixed queries in turn:

(1) wide: Executing Q1 on the maximized wide Rectangle XML document. The width is 100000 and the height is 10.

(2) narrow: Executing Q1 on the maximized narrow Rectangle XML document. The width is 10 and the height is 40000.

(3) triangle: Executing Q1 on the maximized Triangle XML document. The height is 20.

(4) selection: Executing Q2 on the List XML document.
(5) sorting: Executing Q3 on the List XML document.

We measure five architectural characteristics available on our AMD system: total instructions, L1 instruction cache misses per 1000 instructions, L1 data cache misses per 1000 instructions, L1 instruction TLB misses per 1000 instructions, and L1 data TLB misses per 1000 instructions.

Figure 22 shows the total instruction counts for executing the five queries in Bdb-xml and in Galax. We can see that, except for the query "narrow", Bdb-xml needs fewer instructions than Galax, especially for the queries "wide" and "sorting".

Figure 22: count of instructions of executing the corresponding query in Bdb-xml and in Galax on Sempron


Figure 23 shows the difference between Bdb-xml and Galax for the query "wide". Except for L1 data cache misses, Galax behaves better than Bdb-xml.

Figure 24 shows the striking difference between Bdb-xml and Galax when executing the query "narrow". The graph reveals two points. First, for both tools, L1 instruction cache misses and L1 instruction TLB misses are very low. Second, Bdb-xml has very poor L1 data cache behavior in this situation. This result is consistent with the L2 cache miss rates reported above for Bdb-xml on both the AMD and Intel systems, and reflects Bdb-xml's poor memory-access pattern for narrow Rectangle XML documents.

Figure 25 shows the comparison for the query "triangle", which is similar to the query "wide".

Figure 26 shows the difference between Bdb-xml and Galax when executing "selection" on List XML. From the graph, we can see that Bdb-xml has poor L1 instruction cache behavior (more than 25 misses per 1000 instructions).

Figure 27 shows the comparison for the query "sorting". For every metric shown in the graph, Bdb-xml is worse than Galax. However, as shown in Figure 22, Bdb-xml needs far fewer instructions to finish sorting than Galax (3.2E+10 vs 2.5E+11).

Figure 23: L1 behaviors of executing "wide" in Bdb-xml and in Galax on Sempron

Figure 24: L1 behaviors of executing "narrow" in Bdb-xml and in Galax on Sempron

Figure 25: L1 behaviors of executing "triangle" in Bdb-xml and in Galax on Sempron

Figure 26: L1 behaviors of executing "selection" in Bdb-xml and in Galax on Sempron

Figure 27: L1 behaviors of executing "sorting" in Bdb-xml and in Galax on Sempron

We summarize this group of measurements into three points. First, from the architectural viewpoint, Bdb-xml is worse than Galax, especially with respect to instruction locality: Bdb-xml has more L1 I-Cache and L1 I-TLB misses than Galax. Second, from the algorithmic viewpoint, Bdb-xml is nevertheless more effective than Galax (except in the case of narrow Rectangle XML), since it needs fewer instructions to finish the corresponding queries.


Last, Bdb-xml cannot handle deep XML documents (with large heights) well, as shown in Figures 4, 10, and 24: it suffers significant data cache misses at both L1 and L2.

5.4 Results on UltraSPARC T1 In this subsection, we present measured results on the UltraSPARC T1 for Saxon, Qexo, and Galax, in turn. The data are only for executing Q1 on wide Rectangle XML documents. Limited by the tool we use on this system, we report the following characteristics:

(1) Count of total instructions
(2) L1 instruction cache misses per 1000 instructions
(3) L1 data cache misses per 1000 instructions
(4) L2 instruction cache misses per 1000 instructions (unavailable for Galax)
(5) L2 data cache misses per 1000 instructions

5.4.1 Saxon Figure 28 shows the instruction counts of Saxon for wide Rectangle XML documents of varying dimensions.

Figure 28: count of instructions of executing Q1 on wide Rectangle XML documents in Saxon (varying width and height)

Figure 29 and Figure 30 show the L1 instruction cache misses and L1 data cache misses of Saxon. Comparing the two graphs, we can see that the width has a more dramatic influence on instruction cache misses than on data cache misses.

Figure 29: L1 instruction cache misses per 1000 instructions of executing Q1 on wide Rectangle XML documents in Saxon (varying width and height)

Figure 30: L1 data cache misses per 1000 instructions of executing Q1 on wide Rectangle XML documents in Saxon (varying width and height)

Figure 31 and Figure 32 show the L2 cache behavior of Saxon. The results are as expected: the UltraSPARC T1 provides a 3 MB L2 cache, so misses at this level are few.

Figure 31: L2 instruction cache misses per 1000 instructions of executing Q1 on wide Rectangle XML documents in Saxon (varying width and height)

Figure 32: L2 data cache misses per 1000 instructions of executing Q1 on wide Rectangle XML documents in Saxon (varying width and height)

5.4.2 Qexo Figure 33 shows the count of instructions of Qexo.


Figure 33: count of instructions of executing Q1 on wide Rectangle XML documents in Qexo (varying width and height)

Figure 34 and Figure 35 show the L1 cache behaviors of Qexo. Comparing Figure 34 with Figure 29 for Saxon, we can see that Qexo has fewer L1 instruction cache misses than Saxon and is less sensitive to the change of width. As shown in Figure 35 and Figure 30, Saxon and Qexo have similar L1 data cache behaviors, although Saxon suffers more misses than Qexo.

Figure 34: L1 instruction cache misses per 1000 instructions of executing Q1 on wide Rectangle XML documents in Qexo (varying width and height)

Figure 35: L1 data cache misses per 1000 instructions of executing Q1 on wide Rectangle XML documents in Qexo (varying width and height)

Figure 36 and Figure 37 show the L2 cache behaviors of Qexo, for the instruction cache and the data cache respectively. As in Figures 31 and 32, misses at this cache level are rare.

Figure 36: L2 instruction cache misses per 1000 instructions of executing Q1 on wide Rectangle XML documents in Qexo (varying width and height)

Figure 37: L2 data cache misses per 1000 instructions of executing Q1 on wide Rectangle XML documents in Qexo (varying width and height)

5.4.3 Galax Figure 38 shows the instruction counts of Galax. Compared with Saxon and Qexo, Galax needs the most instructions to finish the query executions. In addition, the change of width has a more significant influence on Galax than on Saxon and Qexo.

Figure 38: count of instructions of executing Q1 on wide Rectangle XML documents in Galax (varying width and height)


Figures 39 and 40 show the L1 cache behaviors of Galax. Galax has fewer instruction cache misses but more data cache misses than Saxon and Qexo.

We compare Galax on the T1 with Galax on the Sempron for the maximized wide Rectangle XML document in Table 4. The results show the benefit of the Sempron's larger L1 caches (both instruction and data) compared with the T1.

Table 4: Galax: UltraSPARC T1 vs Sempron (wide Rectangle XML: width=100000, height=10)

Metrics | T1 | Sempron
Count of instructions | 1.16E+11 | 8.6E+10
L1 IC misses per 1k instructions | 3.47 | 0.72
L1 DC misses per 1k instructions | 36.5 | 15.99

Figure 39: L1 instruction cache misses per 1000 instructions of executing Q1 on wide Rectangle XML documents in Galax (varying width and height)

Figure 40: L1 data cache misses per 1000 instructions of executing Q1 on wide Rectangle XML documents in Galax (varying width and height)

Figure 41 shows the L2 data cache behavior of Galax. A distinct change appears when the width increases from 40000 to 80000.

Figure 41: L2 data cache misses per 1000 instructions of executing Q1 on wide Rectangle XML documents in Galax (varying width and height)

6. CONCLUSION This paper is our first step toward understanding the architectural behavior of XQuery workloads on modern processors. We report detailed measurements of architectural characteristics for executing basic XQuery operations on three modern hardware platforms: AMD's Sempron, Intel's Pentium P4, and Sun's UltraSPARC T1. We examine four XQuery tools: Berkeley DB XML, Galax, Saxon-B, and GNU Qexo. Our measured architectural behaviors include L1 cache misses, L2 cache misses, TLB misses, and branch misprediction rates. We believe that these data can be useful in understanding the specific features of XQuery, comparing XQuery workloads with RDBMS query workloads, and analyzing potential optimization opportunities for XQuery implementations.

Our future work covers the following aspects:

(1) Testing more operations, more complex XML structures, and more software: Currently we only consider path navigation, selection, and sorting in XQuery, and we only study Rectangle XML, Triangle XML, and List XML. We will further consider existing XQuery application benchmarks [16][17][29] and micro-benchmarks [19][28], and broaden our measurements.

(2) Testing concurrent XQuery workloads with write operations: We plan to use concurrent XQuery workloads to study architectural behaviors on platforms with chip-multiprocessor and simultaneous multithreading technology. Although XQuery is a query-oriented language, many tools (e.g., Berkeley DB XML) support modification and transaction processing. We hope to understand the behavior of the L2 cache shared by multiple cores when executing concurrent XQuery instances.

7. ACKNOWLEDGMENTS We thank Professor Kai Li and Dr. Zhiwei Xu for their valuable suggestions, and thank the ExpDB workshop reviewers for their comments. This work is supported in part by the National Science Foundation of China (Grant No. 90412010), China Ministry of Science and Technology 863 Program (Grant No. 2006AA01A106), and the China National 973 Program (Grant No. 2005CB321807).


8. REFERENCES

[1] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a modern processor: Where does time go? In Proc. VLDB, 1999.

[2] L. A. Barroso, K. Gharachorloo, and E. Bugnion. Memory System Characterization of Commercial Workloads. In Proc. ISCA, 1998.

[3] R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In Proc. ISCA, 1996.

[4] K. Keeton, D. A. Patterson, Y. Q. He, R. C. Raphael, and W. E. Baker. Performance characterization of a quad Pentium pro SMP using OLTP workloads. In Proc.ISCA, 1998.

[5] J. L. Lo, L. A. Barroso, S. J. Eggers, K. Gharachorloo, H. M. Levy, and S. S. Parekh. An analysis of database workload performance on simultaneous multithreaded processors. In Proc. ISCA, 1998

[6] P. Trancoso, J.L. Larriba-Pey, Z. Zhang, and J. Torellas. The memory performance of DSS commercial workloads in shared-memory multiprocessors. In Proc. HPCA, 1997

[7] M. Karlsson, K. E. Moore, E. Hagersten, and D. A. Wood. Memory System Behavior of Java-Based Middleware. In Proc.HPCA, 2003.

[8] Y. Luo and L. K. John. Workload Characterization of Multithreaded Java Servers. In IEEE International Symposium on Performance Analysis of Systems and Software, 2001.

[9] Y. Shuf, M. J. Serrano, M. Gupta, and J. P. Singh. Characterizing the Memory Behavior of Java Workloads: A Structured View and Opportunities for Optimizations. In Proc. SIGMETRICS, 2001

[10] P. Apparao, R. Iyer, R. Morin, N. Nayak, and M. Bhat. Architectural Characterization of an XML-centric Commercial Server Workload. In Proc.ICPP, 2004

[11] IA-32 Intel(R) Architecture Optimization Reference Manual, http://developer.intel.com/design/pentium4/manuals/

[12] PMC based Performance Measurement in FreeBSD, http://people.freebsd.org/~jkoshy/projects/perf-measurement/

[13] Basic Performance Measurements for AMD Athlon™ 64 and AMD Opteron™ Processors, http://developer.amd.com/articles.jsp?id=90&num=1

[14] AMD Sempron Processor Family, http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_11599,00.html

[15] OpenSPARC T1 Documents, http://opensparct1.sunsource.net/

[16] A. R. Schmidt, F. Waas, M. L. Kersten, D. Florescu, M. J. Carey, I. Manolescu, and R. Busse. Why and How to Benchmark XML Databases. SIGMOD Record, 3(30):27-32, 2001.

[17] B. Yao, T. Ozsu, and N. Khandelwal. XBench benchmark and performance testing of XML DBMSs. In Proc.ICDE,2004.

[18] S. Manegold. An Empirical Evaluation of XQuery Processors. In ExpDB, 2006

[19] I. Manolescu, C. Miachon, and P. Michiels. Towards micro-benchmarking XQuery. In ExpDB, 2006.

[20] The Saxon project web site. http://saxon.sourceforge.net

[21] The Galax web site. http://www.galaxquery.org

[22] The Qexo web site. http://www.gnu.org/software/qexo/

[23] The Berkeley DB XML web site. http://www.oracle.com/database/berkeley-db/xml/index.html

[24] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. QPipe: A Simultaneously Pipelined Relational Query Engine. In Proc. SIGMOD, 2005.

[25] S. Harizopoulos and A. Ailamaki. Improving instruction cache performance in OLTP. TODS, 31(3):887-920, 2006.

[26] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry. Improving Hash Join Performance through Prefetching. In Proc. ICDE, 2004.

[27] Sun Studio Performance Analyzer. http://developers.sun.com/sunstudio/analyzer_index.html

[28] K. Runapongsa, J. M. Patel, H. V. Jagadish, and S. Al-Khalifa. The Michigan Benchmark: A Microbenchmark for XML Querying Systems. In EEXTT, 2002.

[29] S. Bressan, G. Dobbie, Z. Lacroix, M. Lee, Y. Li, U. Nambiar, and B. Wadhwa. XOO7: Applying OO7 benchmark for XML querying tools. In Proc. CIKM, 2001.


designed by Nikos Ailamakis. designed by Ioana Manolescu.

DaMoN/ExpDB 2007 is brought to you by Anastasia Ailamaki, Qiong Luo, Philippe Bonnet & Stefan Manegold.