

Query Evaluation Techniques for Large Databases

GOETZ GRAEFE

Portland State University, Computer Science Department, P. O. Box 751, Portland, Oregon 97207-0751

Database management systems will continue to manage large data volumes. Thus, efficient algorithms for accessing and manipulating large sets and sequences will be required to provide acceptable performance. The advent of object-oriented and extensible database systems will not solve this problem. On the contrary, modern data models exacerbate the problem: In order to manipulate large sets of complex objects as efficiently as today's database systems manipulate simple records, query processing algorithms and software will become more complex, and a solid understanding of algorithm and architectural issues is essential for the designer of database management software.

This survey provides a foundation for the design and implementation of query execution facilities in new database management systems. It describes a wide array of practical query evaluation techniques for both relational and postrelational database systems, including iterative execution of complex query evaluation plans, the duality of sort- and hash-based set-matching algorithms, types of parallel query execution and their implementation, and special operators for emerging database application domains.

Categories and Subject Descriptors: E.5 [Data]: Files; H.2.4 [Database Management]: Systems—query processing

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Complex query evaluation plans, dynamic query evaluation plans, extensible database systems, iterators, object-oriented database systems, operator model of parallelization, parallel algorithms, relational database systems, set-matching algorithms, sort-hash duality

INTRODUCTION

Effective and efficient management of large data volumes is necessary in virtually all computer applications, from business data processing to library information retrieval systems, multimedia applications with images and sound, computer-aided design and manufacturing, real-time process control, and scientific computation. While database management systems are standard tools in business data processing, they are only slowly being introduced to all the other emerging database application areas.

In most of these new application domains, database management systems have traditionally not been used for two reasons. First, restrictive data definition and manipulation languages can make application development and maintenance unbearably cumbersome. Research into semantic and object-oriented data models and into persistent database programming languages has been addressing this problem and will eventually lead to acceptable solutions.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1993 ACM 0360-0300/93/0600-0073 $01.50

ACM Computing Surveys, Vol. 25, No. 2, June 1993


CONTENTS

INTRODUCTION
1. ARCHITECTURE OF QUERY EXECUTION ENGINES
2. SORTING AND HASHING
   2.1 Sorting
   2.2 Hashing
3. DISK ACCESS
   3.1 File Scans
   3.2 Associative Access Using Indices
   3.3 Buffer Management
4. AGGREGATION AND DUPLICATE REMOVAL
   4.1 Aggregation Algorithms Based on Nested Loops
   4.2 Aggregation Algorithms Based on Sorting
   4.3 Aggregation Algorithms Based on Hashing
   4.4 A Rough Performance Comparison
   4.5 Additional Remarks on Aggregation
5. BINARY MATCHING OPERATIONS
   5.1 Nested-Loops Join Algorithms
   5.2 Merge-Join Algorithms
   5.3 Hash Join Algorithms
   5.4 Pointer-Based Joins
   5.5 A Rough Performance Comparison
6. UNIVERSAL QUANTIFICATION
7. DUALITY OF SORT- AND HASH-BASED QUERY PROCESSING ALGORITHMS
8. EXECUTION OF COMPLEX QUERY PLANS
9. MECHANISMS FOR PARALLEL QUERY EXECUTION
   9.1 Parallel versus Distributed Database Systems
   9.2 Forms of Parallelism
   9.3 Implementation Strategies
   9.4 Load Balancing and Skew
   9.5 Architectures and Architecture Independence
10. PARALLEL ALGORITHMS
   10.1 Parallel Selections and Updates
   10.2 Parallel Sorting
   10.3 Parallel Aggregation and Duplicate Removal
   10.4 Parallel Joins and Other Binary Matching Operations
   10.5 Parallel Universal Quantification
11. NONSTANDARD QUERY PROCESSING ALGORITHMS
   11.1 Nested Relations
   11.2 Temporal and Scientific Database Management
   11.3 Object-oriented Database Systems
   11.4 More Control Operators
12. ADDITIONAL TECHNIQUES FOR PERFORMANCE IMPROVEMENT
   12.1 Precomputation and Derived Data
   12.2 Data Compression
   12.3 Surrogate Processing
   12.4 Bit Vector Filtering
   12.5 Specialized Hardware
SUMMARY AND OUTLOOK
ACKNOWLEDGMENTS
REFERENCES

Second, data volumes might be so large or complex that the real or perceived performance advantage of file systems is considered more important than all other criteria, e.g., the higher levels of abstraction and programmer productivity typically achieved with database management systems. Thus, object-oriented database management systems that are designed for nontraditional database application domains and extensible database management system toolkits that support a variety of data models must provide excellent performance to meet the challenges of very large data volumes, and techniques for manipulating large data sets will find renewed and increased interest in the database community.

The purpose of this paper is to survey efficient algorithms and software architectures of database query execution engines for executing complex queries over large databases. A "complex" query is one that requires a number of query-processing algorithms to work together, and a "large" database uses files with sizes from several megabytes to many terabytes, which are typical for database applications at present and in the near future [Dozier 1992; Silberschatz et al. 1991]. This survey discusses a large variety of query execution techniques that must be considered when designing and implementing the query execution module of a new database management system: algorithms and their execution costs, sorting versus hashing, parallelism, resource allocation and scheduling issues in complex queries, special operations for emerging database application domains such as statistical and scientific databases, and general performance-enhancing techniques such as precomputation and compression. While many, although not all, techniques discussed in this paper have been developed in the context of relational database systems, most of them are applicable to and useful in the query processing facility for any database management system and any data model, provided the data model permits queries over "bulk" data types such as sets and lists.


Figure 1. Query processing in a database system. (The figure shows a stack of layers: User Interface, Database Query Language, Query Optimizer, Query Execution Engine, Files and Indices, I/O Buffer, Disk.)

It is assumed that the reader possesses basic textbook knowledge of database query languages, in particular of relational algebra, and of file systems, including some basic knowledge of index structures. As shown in Figure 1, query processing fills the gap between database query languages and file systems. It can be divided into query optimization and query execution. A query optimizer translates a query expressed in a high-level query language into a sequence of operations that are implemented in the query execution engine or the file system. The goal of query optimization is to find a query evaluation plan that minimizes the most relevant performance measure, which can be the database user's wait for the first or last result item, CPU, I/O, and network time and effort (time and effort can differ due to parallelism), memory costs (as maximum allocation or as time-space product), total resource usage, even energy consumption (e.g., for battery-powered laptop systems or spacecraft), a combination of the above, or some other performance measure. Query optimization is a special form of planning, employing techniques from artificial intelligence such as plan representation, search including directed search and pruning, dynamic programming, branch-and-bound algorithms, etc. The query execution engine is a collection of query execution operators and mechanisms for operator communication and synchronization; it employs concepts from algorithm design, operating systems, networks, and parallel and distributed computation. The facilities of the query execution engine define the space of possible plans that can be chosen by the query optimizer.

A general outline of the steps required for processing a database query is shown in Figure 2. Of course, this sequence is only a general guideline, and different database systems may use different steps or merge multiple steps into one. After a query or request has been entered into the database system, be it interactively or by an application program, the query is parsed into an internal form. Next, the query is validated against the metadata (data about the data, also called schema or catalogs) to ensure that the query contains only valid references to existing database objects. If the database system provides a macro facility such as relational views, referenced macros and views are expanded into the query [Stonebraker 1975]. Integrity constraints might be expressed as views (externally or internally) and would also be integrated into the query at this point in most systems [Metro 1989]. The query optimizer then maps the expanded query expression into an optimized plan that operates directly on the stored database objects. This mapping process can be very complex and might require substantial search and cost estimation effort. (Optimization is not discussed in this paper; a survey can be found in Jarke and Koch [1984].) The optimizer's output is called a query execution plan, query evaluation plan, QEP, or simply plan. Using a simple tree traversal algorithm, this plan is translated into a representation ready for execution by the database's query execution engine; the result of this translation can be compiled machine code or a semicompiled or interpreted language or data structure.

This survey discusses only read-only queries explicitly; however, most of the techniques are also applicable to update requests. In most database management systems, update requests may include a search predicate to determine which database objects are to be modified. Standard query optimization and execution techniques apply to this search.


Figure 2. Query processing steps. (The figure shows a sequence of steps: Parsing, Query Validation, View Resolution, Optimization, Plan Compilation, Execution.)

The actual update procedure can be either applied in a second phase, a method called deferred updates, or merged into the search phase if there is no danger of creating ambiguous update semantics.¹

The problem of ensuring ACID semantics for updates—making updates Atomic (all-or-nothing semantics), Consistent (translating any consistent database state into another consistent database state), Isolated (from other queries and requests), and Durable (persistent across all failures)—is beyond the scope of this paper; suitable techniques have been described by many other authors, e.g., Bernstein and Goodman [1981], Bernstein et al. [1987], Gray and Reuter [1991], and Haerder and Reuter [1983].

Most research into providing ACID semantics focuses on efficient techniques for processing very large numbers of relatively small requests. For example, increasing the balance of one account and decreasing the balance of another account require exclusive access to only two database records and writing some information to an update log. Current research and development efforts in transaction processing target hundreds and even thousands of small transactions per second [Davis 1992; Serlin 1991].

¹ A standard example for this danger is the "Halloween" problem: Consider the request to "give all employees with salaries greater than $30,000 a 3% raise." If (i) these employees are found using an index on salaries, (ii) index entries are scanned in increasing salary order, and (iii) the index is updated immediately as index entries are found, then each qualifying employee will get an infinite number of raises.


Query processing, on the other hand, focuses on extracting information from a large amount of data without actually changing the database. For example, printing reports for each branch office with average salaries of employees under 30 years old requires shared access to a large number of records. Mixed requests are also possible, e.g., for crediting monthly earnings to a stock account by combining information about a number of sales transactions. The techniques discussed here apply to the search effort for such a mixed request, e.g., for finding the relevant sales transactions for each stock account.

Embedded queries, i.e., database queries that are contained in an application program written in a standard programming language such as Cobol, PL/1, C, or Fortran, are also not addressed specifically in this paper because all techniques discussed here can be used for interactive as well as embedded queries. Embedded queries usually are optimized when the program is compiled, in order to avoid the optimization overhead when the program runs. This method was pioneered in System R, including mechanisms for storing optimized plans and invalidating stored plans when they become infeasible, e.g., when an index is dropped from the database [Chamberlin et al. 1981b]. Of course, the cut between compile-time and run-time can be placed at any other point in the sequence in Figure 2.

Recursive queries are omitted from this survey, because the entire field of recursive query processing—optimization rules and heuristics, selectivity and cost estimation, algorithms and their parallelization—is still developing rapidly (suffice it to point to two recent surveys by Bancilhon and Ramakrishnan [1986] and Cacace et al. [1993]).

The present paper surveys query execution techniques; other surveys that pertain to the wide subject of database systems have considered data models and query languages [Gallaire et al. 1984; Hull and King 1987; Jarke and Vassiliou 1985; McKenzie and Snodgrass 1991; Peckham and Maryanski 1988], access methods [Comer 1979; Enbody and Du 1988; Faloutsos 1985; Samet 1984; Sockut and Goldberg 1979], compression techniques [Bell et al. 1989; Lelewer and Hirschberg 1987], distributed and heterogeneous systems [Batini et al. 1986; Litwin et al. 1990; Sheth and Larson 1990; Thomas et al. 1990], concurrency control and recovery [Barghouti and Kaiser 1991; Bernstein and Goodman 1981; Gray et al. 1981; Haerder and Reuter 1983; Knapp 1987], availability and reliability [Davidson et al. 1985; Kim 1984], query optimization [Jarke and Koch 1984; Mannino et al. 1988; Yu and Chang 1984], and a variety of other database-related topics [Adam and Wortmann 1989; Atkinson and Buneman 1987; Katz 1990; Kemper and Wallrath 1987; Lyytinen 1987; Teorey et al. 1986]. Bitton et al. [1984] have discussed a number of parallel-sorting techniques, only a few of which are really used in database systems. Mishra and Eich's [1992] recent survey of relational join algorithms compares their behavior using diagrams derived from one by Kitsuregawa et al. [1983] and also describes join methods using index structures and join methods for distributed systems. The present survey is much broader in scope as it also considers system architectures for complex query plans and for parallel execution, selection and aggregation algorithms, the relationship of sorting and hashing as it pertains to database query processing, special operations for nontraditional data models, and auxiliary techniques such as compression.

Section 1 discusses the architecture of query execution engines. Sorting and hashing, the two general approaches to managing and matching elements of large sets, are described in Section 2. Section 3 focuses on accessing large data sets on disk. Section 4 begins the discussion of actual data manipulation methods with algorithms for aggregation and duplicate removal, continued in Section 5 with binary matching operations such as join and intersection and in Section 6 with operations for universal quantification. Section 7 reviews the many dualities between sorting and hashing and points out their differences that have an impact on the performance of algorithms based on either one of these approaches. Execution of very complex query plans with many operators and with nontrivial plan shapes is discussed in Section 8. Section 9 is devoted to mechanisms for parallel execution, including architectural issues and load balancing, and Section 10 discusses specific parallel algorithms. Section 11 outlines some nonstandard operators for emerging database applications such as statistical and scientific database management systems. Section 12 is a potpourri of additional techniques that enhance the performance of many algorithms, e.g., compression, precomputation, and specialized hardware. The final section contains a brief summary and an outlook on query processing research and its future.

For readers who are more interested in some topics than others, most sections are fairly self-contained. Moreover, the hurried reader may want to skip the derivation of cost functions;² their results and effects are summarized later in diagrams.

1. ARCHITECTURE OF QUERY EXECUTION ENGINES

This survey focuses on useful mechanisms for processing sets of items. These items can be records, tuples, entities, or objects.

² In any case, our cost functions cover only a limited, though important, aspect of query execution cost, namely I/O effort.


Furthermore, most of the techniques discussed in this survey apply to sequences, not only sets, of items, although most query processing research has assumed relations and sets. All query processing algorithm implementations iterate over the members of their input sets; thus, sets are always represented by sequences. Sequences can be used to represent not only sets but also other one-dimensional "bulk" types such as lists, arrays, and time series, and many database query processing algorithms and techniques can be used to manipulate these other bulk types as well as sets. The important point is to think of these algorithms as algebra operators consuming zero or more inputs (sets or sequences) and producing one (or sometimes more) outputs. A complete query execution engine consists of a collection of operators and mechanisms to execute complex expressions using multiple operators, including multiple occurrences of the same operator. Taken as a whole, the query processing algorithms form an algebra which we call the physical algebra of a database system.

The physical algebra is equivalent to, but quite different from, the logical algebra of the data model or the database system. The logical algebra is more closely related to the data model and defines what queries can be expressed in the data model; for example, the relational algebra is a logical algebra. A physical algebra, on the other hand, is system specific. Different systems may implement the same data model and the same logical algebra but may use very different physical algebras. For example, while one relational system may use only nested-loops joins, another system may provide both nested-loops join and merge-join, while a third one may rely entirely on hash join algorithms. (Join algorithms are discussed in detail later in the section on binary matching operators and algorithms.)

Another significant difference between logical and physical algebras is the fact that specific algorithms and therefore cost functions are associated only with physical operators, not with logical algebra operators. Because of the lack of an algorithm specification, a logical algebra expression is not directly executable and must be mapped into a physical algebra expression. For example, it is impossible to determine the execution time for the left expression in Figure 3, i.e., a logical algebra expression, without mapping it first into a physical algebra expression such as the query evaluation plan on the right of Figure 3. This mapping process can be trivial in some database systems but usually is fairly complex in real database systems because it involves algorithm choices and because logical and physical operators frequently do not map directly into one another, as shown in the following four examples. First, some operators in the physical algebra may implement multiple logical operators. For example, all serious implementations of relational join algorithms include a facility to output fewer than all attributes, i.e., a relational delta-project (a projection without duplicate removal) is included in the physical join operator. Second, some physical operators implement only part of a logical operator. For example, a duplicate removal algorithm implements only the "second half" of a relational projection operator. Third, some physical operators do not exist in the logical algebra. Concretely, a sort operator has no place in pure relational algebra because it is an algebra of sets, and sets are, by their definition, unordered. Finally, some properties that hold for logical operators do not hold, or only with some qualifications, for the counterparts in physical algebra. For example, while intersection and union are entirely symmetric and commutative, algorithms implementing them (e.g., nested loops or hybrid hash join) do not treat their two inputs equally.

The difference between logical and physical algebras can also be looked at in a different way. Any database system raises the level of abstraction above files and records; to do so, there are some logical type constructors such as tuple, relation, set, list, array, pointer, etc.


Figure 3. Logical and physical algebra expressions. (On the left, the logical expression: an Intersection of Set A and Set B. On the right, the physical plan: a Merge-Join (Intersect) whose two inputs are sort operators over File Scan A and File Scan B.)

Each logical type constructor is complemented by some operations that are permitted on instances of such types, e.g., attribute extraction, selection, insertion, deletion, etc.

On the physical or representation level, there is typically a smaller set of representation types and structures, e.g., file, record, record identifier (RID), and maybe very large byte arrays [Carey et al. 1986]. For manipulation, the representation types have their own operations, which will be different from the operations on logical types. Multiple logical types and type constructors can be mapped to the same physical concept. There may also be situations in which one logical type constructor can be mapped to multiple physical concepts, e.g., a set depending on its size. The mapping from logical types to physical representation types and structures is called physical database design. Query optimization is the mapping from logical to physical operations, and the query execution engine is the implementation of operations on physical representation types and of mechanisms for coordination and cooperation among multiple such operations in complex queries. The policies for using these mechanisms are part of the query optimizer.

Synchronization and data transfer between operators is the main issue to be addressed in the architecture of the query execution engine. Imagine a query with two joins, and consider how the result of the first join is passed to the second one. The simplest method is to create (write) and read a temporary file. The need for temporary files, whether they are kept in the buffer or not, is a direct result of executing an operator's input subplans completely before starting the operator. Alternatively, it is possible to create one process for each operator and then to use interprocess communication mechanisms (e.g., pipes) to transfer data between operators, leaving it to the operating system to schedule and suspend operator processes as pipes are full or empty. While such data-driven execution removes the need for temporary disk files, it introduces another cost, that of operating system scheduling and interprocess communication. In order to avoid both temporary files and operating system scheduling, Freytag and Goodman [1989] proposed writing rule-based translation programs that transform a plan represented as a tree structure into a single iterative program with nested loops and other control structures. However, the required rule set is not simple, in particular for algorithms with complex control logic such as sorting, merge-join, or even hybrid hash join (to be discussed later in the section on matching).

The most practical alternative is to implement all operators in such a way that they schedule each other within a single operating system process. The basic idea is to define a granule, typically a single record, and to iterate over all granules comprising an intermediate query result.³ Each time an operator needs another granule, it calls its input (operator) to produce one.

³ It is possible to use multiple granule sizes within a single query-processing system and to provide special operators with the sole purpose of translating from one granule size to another. An example is a query processing system that uses records as an iteration granule except for the inputs of merge-join (see later in the section on binary matching), for which it uses "value packets," i.e., groups of records with equal join attribute values.


This call is a simple procedure call, much cheaper than interprocess communication since it does not involve the operating system. The calling operator waits (just as any calling routine waits) until the input operator has produced an item. That input operator, in a complex query plan, might require an item from its own input to produce an item; in that case, it calls its own input (operator) to produce one. Two important features of operators implemented in this way are that they can be combined into arbitrarily complex query evaluation plans and that any number of operators can execute and schedule each other in a single process without assistance from or interaction with the underlying operating system. This model of operator implementation and scheduling resembles very closely those used in relational systems, e.g., System R (and later SQL/DS and DB2), Ingres, Informix, and Oracle, as well as in experimental systems, e.g., the E programming language used in EXODUS [Richardson and Carey 1987], Genesis [Batory et al. 1988a; 1988b], and Starburst [Haas et al. 1989; 1990]. Operators implemented in this model are called iterators, streams, synchronous pipelines, row-sources, or similar names in the "lingo" of commercial systems.

To make the implementation of operators a little easier, it makes sense to separate the functions (a) to prepare an operator for producing data, (b) to produce an item, and (c) to perform final housekeeping. In a file scan, these functions are called open, next, and close procedures; we adopt these names for all operators. Table 1 gives a rough idea of what the open, next, and close procedures for some operators do, as well as the principal local state that needs to be saved from one invocation to the next. (Later sections will discuss sort and join operations in detail.) The first three examples are trivial, but the hash join operator shows how an operator can schedule its inputs in a nontrivial manner. The interesting observations are that (i) the entire query plan is executed within a single process, (ii) operators produce one item at a time on request, (iii) this model effectively implements, within a single process, (special-purpose) coroutines and demand-driven data flow, (iv) items never wait in a temporary file or buffer between operators because they are never produced before they are needed, (v) therefore this model is very efficient in its time-space-product memory costs, (vi) iterators can schedule any tree, including bushy trees (see below), and (vii) no operator is affected by the complexity of the whole plan, i.e., this model of operator implementation and synchronization works for simple as well as very complex query plans. As a final remark, there are effective ways to combine the iterator model with parallel query processing, as will be discussed in Section 9.
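As a concrete illustration of this protocol, the following minimal sketch hand-wires two such operators in C: a scan over an in-memory array (standing in for a file scan) and a selection on top of it. All names, the sample data, and the predicate are illustrative only and not taken from any particular system; the point is that each call to select_next pulls only as many items from its input as are needed to deliver one qualifying item, so nothing is materialized between the operators.

#include <stdio.h>
#include <stdbool.h>

/* ---- scan operator: produces one item per next() call ---- */
static const int table_data[] = { 4, 17, 8, 23, 42, 15, 16 };
static int scan_pos;

static void scan_open(void)      { scan_pos = 0; }
static bool scan_next(int *item) { if (scan_pos >= 7) return false;
                                   *item = table_data[scan_pos++]; return true; }
static void scan_close(void)     { /* nothing to release in this sketch */ }

/* ---- select operator: calls next on its input until an item qualifies ---- */
static bool qualifies(int item)    { return item > 15; }
static void select_open(void)      { scan_open(); }
static bool select_next(int *item) { while (scan_next(item))
                                         if (qualifies(*item)) return true;
                                     return false; }
static void select_close(void)     { scan_close(); }

int main(void) {
    int item;
    select_open();
    while (select_next(&item))   /* each call pulls just enough input items */
        printf("%d\n", item);    /* prints 17, 23, 42, 16 */
    select_close();
    return 0;
}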

Since query plans are algebra expressions, they can be represented as trees. Query plans can be divided into prototypical shapes, and query execution engines can be divided into groups according to which shapes of plans they can evaluate. Figure 4 shows prototypical left-deep, right-deep, and bushy plans for a join of four inputs. Left-deep and right-deep plans are different because join algorithms use their two inputs in different ways; for example, in the nested-loops join algorithm, the outer loop iterates over one input (usually drawn as left input) while the inner loop iterates over the other input. The set of bushy plans is the most general as it includes the sets of both left-deep and right-deep plans. These names are taken from Graefe and DeWitt [1987]; left-deep plans are also called "linear processing trees" [Krishnamurthy et al. 1986] or "plans with no composite inner" [Ono and Lohman 1990].
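The three plan shapes can be made explicit with a tiny tree structure. The sketch below (illustrative names only, not from any real system) builds the left-deep, right-deep, and bushy trees of Figure 4 for the four inputs A, B, C, and D and prints them as nested join expressions.

#include <stdio.h>

typedef struct Plan {
    const char *label;                 /* "Join" or a base input name */
    const struct Plan *left, *right;   /* NULL for base inputs (leaves) */
} Plan;

static void print_plan(const Plan *p) {
    if (!p->left) { printf("%s", p->label); return; }
    printf("(");  print_plan(p->left);
    printf(" %s ", p->label);
    print_plan(p->right);  printf(")");
}

int main(void) {
    Plan A = {"A",0,0}, B = {"B",0,0}, C = {"C",0,0}, D = {"D",0,0};

    /* left-deep: every join's right input is a base input */
    Plan l1 = {"Join",&A,&B}, l2 = {"Join",&l1,&C}, l3 = {"Join",&l2,&D};
    /* right-deep: every join's left input is a base input */
    Plan r1 = {"Join",&C,&D}, r2 = {"Join",&B,&r1}, r3 = {"Join",&A,&r2};
    /* bushy: both inputs of the top join are themselves join results */
    Plan b1 = {"Join",&A,&B}, b2 = {"Join",&C,&D}, b3 = {"Join",&b1,&b2};

    print_plan(&l3); printf("\n");   /* (((A Join B) Join C) Join D) */
    print_plan(&r3); printf("\n");   /* (A Join (B Join (C Join D))) */
    print_plan(&b3); printf("\n");   /* ((A Join B) Join (C Join D)) */
    return 0;
}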

For queries with common subexpressions, the query evaluation plan is not a tree but an acyclic directed graph (DAG). Most systems that identify and exploit common subexpressions execute the plan equivalent to a common subexpression separately, saving the intermediate result in a temporary file to be scanned repeatedly and destroyed after the last scan.


Table 1. Examples of Iterator Functions

Iterator | Open | Next | Close | Local State
Print | open input | call next on input; format the item on screen | close input | (none)
Scan | open file | read next item | close file | open file descriptor
Select | open input | call next on input until an item qualifies | close input | (none)
Hash join (without overflow resolution) | allocate hash directory; open left "build" input; build hash table calling next on build input; close build input; open right "probe" input | call next on probe input until a match is found | close probe input; deallocate hash directory | hash directory
Merge-Join (without duplicates) | open both inputs | get next item from input with smaller key until a match is found | close both inputs | (none)
Sort | open input; build all initial run files calling next on input; close input; merge run files until only one merge step is left | determine next output item; read new item from the correct run file | destroy remaining run files | merge heap, open file descriptors for run files
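To show the nontrivial input scheduling of the hash join row of Table 1, here is a minimal sketch (integer keys, a fixed-size chained hash table, no overflow resolution; the two input arrays and all names are illustrative placeholders for real input iterators): open consumes the entire build input into the hash table, and each next call then draws items from the probe input until one of them finds a match.

#include <stdio.h>
#include <stdbool.h>

#define BUCKETS 8

typedef struct { int key; int next; } Slot;   /* chained hash table entry */
static int  bucket[BUCKETS];                  /* head slot per bucket, -1 = empty */
static Slot slots[16];
static int  slot_count;

/* ---- placeholder inputs ---- */
static const int build_keys[] = { 3, 7, 11, 4 };
static const int probe_keys[] = { 5, 7, 2, 11, 11, 9, 3 };
static int probe_pos;

static void hashjoin_open(void) {
    for (int b = 0; b < BUCKETS; b++) bucket[b] = -1;
    slot_count = 0;
    for (int i = 0; i < 4; i++) {             /* consume the entire build input */
        int b = build_keys[i] % BUCKETS;
        slots[slot_count] = (Slot){ build_keys[i], bucket[b] };
        bucket[b] = slot_count++;
    }
    probe_pos = 0;
}

static bool hashjoin_next(int *match) {       /* one matching key per call */
    while (probe_pos < 7) {
        int key = probe_keys[probe_pos++];
        for (int s = bucket[key % BUCKETS]; s != -1; s = slots[s].next)
            if (slots[s].key == key) { *match = key; return true; }
    }
    return false;                             /* probe input exhausted */
}

static void hashjoin_close(void) { /* a real system would deallocate the hash directory here */ }

int main(void) {
    int k;
    hashjoin_open();
    while (hashjoin_next(&k)) printf("match on key %d\n", k);   /* 7, 11, 11, 3 */
    hashjoin_close();
    return 0;
}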

Figure 4. Left-deep, bushy, and right-deep plans. (The figure shows three join trees over the four inputs A, B, C, and D: a left-deep tree, a bushy tree, and a right-deep tree.)

Each plan fragment that is executed as a unit is indeed a tree. The alternative is a "split" iterator that can deliver data to multiple consumers, i.e., that can be invoked as iterator by multiple consumer iterators. The split iterator paces its input subtree as fast as the fastest consumer requires it and holds items until the slowest consumer has consumed them. If the consumers request data at about the same rate, the split operator does not require a temporary spool file; such a file and its associated I/O cost are required only if the data rate required by the consumers diverges above some predefined threshold.

Among the implementations of iterators for query processing, one group can be called "stored-set oriented" and the other "algebra oriented." In System R, an example for the first group, complex join plans are constructed using binary join iterators that "attach" one more set (stored relation) to an existing intermediate result [Astrahan et al. 1976; Lorie and Nilsson 1979], a design that supports only left-deep plans. This design led to a significant simplification of the System R optimizer which could be based on dynamic programming techniques, but it ignores the optimal plan for some queries [Selinger et al. 1979].⁴ A similar design was used, although not strictly required by the design of the execution engine, in the Gamma database machine [DeWitt et al. 1986; 1990; Gerber 1986]. On the other hand, some systems use binary operators for which both inputs can be intermediate results, i.e., the output of arbitrarily complex subplans. This design is more general as it also permits bushy plans. Examples for this approach are the second query processing engine of Ingres based on Kooi's thesis [Kooi 1980; Kooi and Frankforth 1982], the Starburst execution engine [Haas et al. 1989], and the Volcano query execution engine [Graefe 1993b]. The tradeoff between left-deep and bushy query evaluation plans is reduction of the search space in the query optimizer against generality of the execution engine and efficiency for some queries. Right-deep plans have only recently received more interest and may actually turn out to be very efficient, in particular in systems with ample memory [Schneider 1990; Schneider and DeWitt 1990].

The remainder of this section provides more details of how iterators are implemented in the Volcano extensible query processing system. We use this system repeatedly as an example in this survey because it provides a large variety of mechanisms for database query processing, but mostly because its model of operator implementation and scheduling resembles very closely those used in many relational and extensible systems. The purpose of this section is to provide implementation concepts from which a new query processing engine could be derived.

Figure 5 shows how iterators are represented in Volcano.

⁴ Since each operator in such a query execution system will access a permanent relation, the name "access path selection" used for System R optimization, although including and actually focusing on join optimization, was entirely correct and more descriptive than "query optimization."

A box stands for a record structure in Volcano's implementation language (C [Kernighan and Ritchie 1978]), and an arrow represents a pointer. Each operator in a query evaluation plan consists of two record structures, a small structure of four pointers and a state record. The small structure is the same for all algorithms. It represents the stream or iterator abstraction and can be invoked with the open, next, and close procedures. The purpose of state records is similar to that of activation records allocated by compiler-generated code upon entry into a procedure. Both hold values local to the procedure or the iterator. Their main difference is that activation records reside on the stack and vanish upon procedure exit, while state records must persist from one invocation of the iterator to the next, e.g., from the invocation of open to each invocation of next and the invocation of close. Thus, state records do not reside on the stack but in heap space.

The type of state records is different for each iterator as it contains iterator-specific arguments and local variables (state) while the iterator is suspended, e.g., currently not active between invocations of the operator's next procedure. Query plan nodes are linked together by means of input pointers, which are also kept in the state records. Since pointers to functions are used extensively in this design, all operator code (i.e., the open, next, and close procedures) can be written in such a way that the names of input operators and their iterator procedures are not "hard-wired" into the code, and the operator modules do not need to be recompiled for each query. Furthermore, all operations on individual items, e.g., printing, are imported into Volcano operators as functions, making the operators independent of the semantics and representation of items in the data streams they are processing. This organization using function pointers for input operators is fairly standard in commercial database management systems.


Figure 5. Two operators in a Volcano query plan. (The figure shows two linked operator records: a filter operator, with pointers to its open-filter, next-filter, and close-filter procedures and a state record holding its arguments (here a print() function), its input pointer, and its local state; below it, a file scan operator, with pointers to open-filescan, next-filescan, and close-filescan and a state record holding a predicate() argument, no input, and its local state.)

In order to make this discussion more concrete, Figure 5 shows two operators in a query evaluation plan that prints selected records from a file. The purpose and capabilities of the filter operator in Volcano include printing items of a stream using a print function passed to the filter operator as one of its arguments. The small structure at the top gives access to the filter operator's iterator functions (the open, next, and close procedures) as well as to its state record. Using a pointer to this structure, the open, next, and close procedures of the filter operator can be invoked, and their local state can be passed to them as a procedure argument. The filter's iterator functions themselves, e.g., open-filter, can use the input pointer contained in the state record to invoke the input operator's functions, e.g., open-file-scan. Thus, the filter functions can invoke the file scan functions as needed and can pace the file scan according to the needs of the filter.
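The following sketch mirrors the organization of Figure 5 in C without reproducing Volcano's actual code: each operator is a small record holding open/next/close function pointers plus a pointer to a state record, and the filter's state record holds the pointer to its input operator, so nothing about the file scan is hard-wired into the filter's code. All type and function names are illustrative assumptions.

#include <stdio.h>
#include <stdbool.h>

typedef struct Operator {
    void (*open)(struct Operator *);
    bool (*next)(struct Operator *, int *item);
    void (*close)(struct Operator *);
    void *state;                       /* operator-specific state record */
} Operator;

/* ---- file-scan-like operator over an array ---- */
typedef struct { const int *data; int len, pos; } ScanState;

static void scan_open(Operator *op)  { ((ScanState *)op->state)->pos = 0; }
static bool scan_next(Operator *op, int *item) {
    ScanState *s = op->state;
    if (s->pos >= s->len) return false;
    *item = s->data[s->pos++];
    return true;
}
static void scan_close(Operator *op) { (void)op; }

/* ---- filter operator: predicate and input kept in its state record ---- */
typedef struct { Operator *input; bool (*predicate)(int); } FilterState;

static void filter_open(Operator *op)  { FilterState *s = op->state; s->input->open(s->input); }
static bool filter_next(Operator *op, int *item) {
    FilterState *s = op->state;
    while (s->input->next(s->input, item))     /* pace the input as needed */
        if (s->predicate(*item)) return true;
    return false;
}
static void filter_close(Operator *op) { FilterState *s = op->state; s->input->close(s->input); }

static bool is_even(int x) { return x % 2 == 0; }

int main(void) {
    int data[] = { 7, 12, 5, 20, 9, 4 };
    ScanState ss = { data, 6, 0 };
    Operator scan = { scan_open, scan_next, scan_close, &ss };

    FilterState fs = { &scan, is_even };
    Operator filter = { filter_open, filter_next, filter_close, &fs };

    int item;
    filter.open(&filter);
    while (filter.next(&filter, &item))
        printf("%d\n", item);          /* prints 12, 20, 4 */
    filter.close(&filter);
    return 0;
}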

In this section, we have discussed general physical algebra issues and synchronization and data transfer between operators. Iterators are relatively straightforward to implement and are suitable building blocks for efficient, extensible query processing engines. In the following sections, we consider individual operators and algorithms including a comparison of sorting and hashing, detailed treatment of parallelism, special operators for emerging database applications such as scientific databases, and auxiliary techniques such as precomputation and compression.

2. SORTING AND HASHING

Before discussing specific algorithms, two general approaches to managing sets of data are introduced. The purpose of many query-processing algorithms is to perform some kind of matching, i.e., bringing items that are "alike" together and performing some operation on them. There are two basic approaches used for this purpose, sorting and hashing. This pair permeates many aspects of query processing, from indexing and clustering over aggregation and join algorithms to methods for parallelizing database operations. Therefore, we discuss these approaches first in general terms, without regard to specific algorithms. After a survey of specific algorithms for unary (aggregation, duplicate removal) and binary (join, semi-join, intersection, division, etc.) matching problems in the following sections, the duality of sort- and hash-based algorithms is discussed in detail.

2.1 Sorting

Sorting is used very frequently in database systems, both for presentation to the user in sorted reports or listings and for query processing in sort-based algorithms such as merge-join. Therefore, the performance effects of the many algorithmic tricks and variants of external sorting deserve detailed discussion in this survey. All sorting algorithms actually used in database systems use merging, i.e., the input data are written into initial sorted runs and then merged into larger and larger runs until only one run is left, the sorted output. Only in the unusual case that a data set is smaller than the available memory can in-memory techniques such as quicksort be used. An excellent reference for many of the issues discussed here is Knuth [1973], who analyzes algorithms much more accurately than we do in this introductory survey.

In order to ensure that the sort module interfaces well with the other operators, e.g., file scan or merge-join, sorting should be implemented as an iterator, i.e., with open, next, and close procedures as all other operators of the physical algebra. In the Volcano query-processing system (which is based on iterators), most of the sort work is done during open-sort [Graefe 1990a; 1993b]. This procedure consumes the entire input and leaves appropriate data structures for next-sort to produce the final sorted output. If the entire input fits into the sort space in main memory, open-sort leaves a sorted array of pointers to records in I/O buffer memory which is used by next-sort to produce the records in sorted order. If the input is larger than main memory, the open-sort procedure creates sorted runs and merges them until only one final merge phase is left. The last merge step is performed in the next-sort procedure, i.e., when demanded by the consumer of the sorted stream, e.g., a merge-join. The input to the sort module must be an iterator, and sort uses open, next, and close procedures to request its input; therefore, sort input can come from a scan or a complex query plan, and the sort operator can be inserted into a query plan at any place or at several places.
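A compressed sketch of this division of labor (not Volcano's code; in-memory arrays stand in for run files, the run length stands in for the memory size, and all names are illustrative): open_sort consumes its entire input, cutting it into memory-sized runs and quicksorting each one, while next_sort performs the final merge lazily, handing back one record per call to its consumer.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define RUN_LEN 4        /* stands in for the number of records that fit in memory */
#define MAX_RUNS 16

static int runs[MAX_RUNS][RUN_LEN];
static int run_size[MAX_RUNS], run_pos[MAX_RUNS], num_runs;

static int cmp_int(const void *a, const void *b) { return *(const int *)a - *(const int *)b; }

/* input iterator (placeholder): yields items from an array */
static const int input[] = { 9, 2, 14, 7, 1, 30, 5, 11, 28, 3 };
static int in_pos;
static bool input_next(int *x) { if (in_pos >= 10) return false; *x = input[in_pos++]; return true; }

static void open_sort(void) {
    int x;
    while (input_next(&x)) {                  /* consume the entire input */
        if (run_size[num_runs] == RUN_LEN) num_runs++;
        runs[num_runs][run_size[num_runs]++] = x;
    }
    num_runs++;
    for (int r = 0; r < num_runs; r++)        /* quicksort each memory load into a run */
        qsort(runs[r], run_size[r], sizeof(int), cmp_int);
}

static bool next_sort(int *out) {             /* final merge, one item per call */
    int best = -1;                            /* a real implementation would use a merge heap */
    for (int r = 0; r < num_runs; r++)
        if (run_pos[r] < run_size[r] &&
            (best < 0 || runs[r][run_pos[r]] < runs[best][run_pos[best]]))
            best = r;
    if (best < 0) return false;
    *out = runs[best][run_pos[best]++];
    return true;
}

int main(void) {
    int x;
    open_sort();
    while (next_sort(&x)) printf("%d ", x);   /* prints 1 2 3 5 7 9 11 14 28 30 */
    printf("\n");
    return 0;
}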

Table 2, which summarizes a taxonomy of parallel sort algorithms [Graefe 1990a], indicates some main characteristics of database sort algorithms. The first few items apply to any database sort and will be discussed in this section. The questions pertaining to parallel inputs and outputs and to data exchange will be considered in a later section on parallel algorithms, and the last question regarding substitute sorts will be touched on in the section on surrogate processing.

All sort algorithms try to exploit the duality between main-memory merge sort and quicksort. Both of these algorithms are recursive divide-and-conquer algorithms. The difference is that merge sort first divides physically and then merges on logical keys, whereas quicksort first divides on logical keys and then combines physically by trivially concatenating sorted subarrays. In general, one of the two phases—dividing and combining—is based on logical keys, whereas the other arranges data items only physically. We call these the logical and the physical phases. Sorting algorithms for very large data sets stored on disk or tape are also based on dividing and combining. Usually, there are two distinct subalgorithms, one for sorting within main memory and one for managing subsets of the data set on disk or tape. The choices for mapping logical and physical phases to dividing and combining steps are independent for these two subalgorithms. For practical reasons, e.g., ensuring that a run fits into main memory, the disk management algorithm typically uses physical dividing and logical combining (merging). A point of practical importance is the fan-in or degree of merging, but this is a parameter rather than a defining algorithm property.

There are two alternative methods for creating initial runs, also called "level-0 runs" here. First, an in-memory sort algorithm can be used, typically quicksort. Using this method, each run will have the size of allocated memory, and the number of initial runs W will be W = ⌈R/M⌉ for input size R and memory size M. (Table 3 summarizes variables and their meaning in cost calculations in this survey.) Second, runs can be produced using replacement selection [Knuth 1973]. Replacement selection starts by filling memory with items which are organized into a priority heap, i.e., a data structure that efficiently supports the operations insert and remove-smallest.


Table 2. A Taxonomy of Database Sorting Algorithms

Determinant | Possible Options
Input division | Logical keys (partitioning) or physical division
Result combination | Logical keys (merging) or physical concatenation
Main-memory sort | Quicksort or replacement selection
Merging | Eager or lazy or semi-eager; lazy and semi-eager with or without optimizations
Read-ahead | No read-ahead or double-buffering or read-ahead with forecasting
Input | Single-stream or parallel
Output | Single-stream or parallel
Number of data exchanges | One or multiple
Data exchange | Before or after local sort
Sort objects | Original records or key-RID pairs (substitute sort)

Next, the item with the smallest key is removed from the priority heap and written to a run file and then immediately replaced in the priority heap with another item from the input. With high probability, this new item has a key larger than the item just written and therefore will be included in the same run file. Notice that if this is the case, the first run file will be larger than memory. Now the second item (the currently smallest item in the priority heap) is written to the run file and is also replaced immediately in memory by another item from the input. This process repeats, always keeping the memory and the priority heap entirely filled. If a new item has a key smaller than the last key written, the new item cannot be included in the current run file and is marked for the next run file. In comparisons among items in the heap, items marked for the current run file are always considered "smaller" than items marked for the next run file. Eventually, all items in memory are marked for the next run file, at which point the current run file is closed, and a new one is created.

Using replacement selection, run files are typically larger than memory. If the input is already sorted or almost sorted, there will be only one run file. This situation could arise, for example, if a file is sorted on field A but should be sorted on A as major and B as the minor sort key. If the input is sorted in reverse order, which is the worst case, each run file will be exactly as large as memory. If the input is random, the average run file will be twice the size of memory, except the first few runs (which get the process started) and the last run. On the average, the expected number of runs is about W = ⌈R/(2 × M)⌉ + 1, i.e., about half as many runs as created with quicksort. A more detailed discussion and an analysis of replacement selection were provided by Knuth [1973].
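The run-formation logic just described can be condensed into the following sketch (integer keys, a tiny heap capacity standing in for the memory size M, and input and output as simple placeholders rather than real files; all names are illustrative): the heap orders items first by the run they are destined for and then by key, an item smaller than the last key written is tagged for the next run, and a run is closed only when no item tagged for the current run remains in memory.

#include <stdio.h>

#define HEAP_CAPACITY 4                        /* stands in for "memory size M" */

typedef struct { int run; int key; } Entry;    /* run number breaks comparison ties */

static Entry heap[HEAP_CAPACITY];
static int heap_size;

static int less(Entry a, Entry b) { return a.run < b.run || (a.run == b.run && a.key < b.key); }
static void swap_entries(int i, int j) { Entry t = heap[i]; heap[i] = heap[j]; heap[j] = t; }

static void heap_push(Entry e) {
    int i = heap_size++;
    heap[i] = e;
    while (i > 0 && less(heap[i], heap[(i - 1) / 2])) { swap_entries(i, (i - 1) / 2); i = (i - 1) / 2; }
}
static Entry heap_pop(void) {
    Entry top = heap[0];
    heap[0] = heap[--heap_size];
    for (int i = 0;;) {
        int l = 2 * i + 1, r = 2 * i + 2, m = i;
        if (l < heap_size && less(heap[l], heap[m])) m = l;
        if (r < heap_size && less(heap[r], heap[m])) m = r;
        if (m == i) break;
        swap_entries(i, m); i = m;
    }
    return top;
}

int main(void) {
    /* hypothetical input stream; in a real sort this comes from the input iterator */
    int input[] = { 17, 3, 99, 42, 5, 88, 23, 61, 2, 77, 45, 9 };
    int n = sizeof input / sizeof input[0], next = 0, current_run = 0;

    while (heap_size < HEAP_CAPACITY && next < n)      /* fill memory from the input */
        heap_push((Entry){ 0, input[next++] });

    while (heap_size > 0) {
        Entry e = heap_pop();
        if (e.run > current_run) {      /* no item left for this run: close it, start the next */
            current_run = e.run;
            printf("\n");
        }
        printf("run %d: %d\n", current_run, e.key);    /* "write" the smallest key to the run */
        if (next < n) {                                /* replace the written item immediately */
            int k = input[next++];
            /* a key smaller than the last key written must go to the next run file */
            heap_push((Entry){ k < e.key ? current_run + 1 : current_run, k });
        }
    }
    return 0;
}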

An additional difference between quicksort and replacement selection is the resulting I/O pattern during run creation. Quicksort results in bursts of reads and writes for entire memory loads from the input file and to initial run files, while replacement selection alternates between individual read and write operations. If only a single device is used, quicksort may result in faster I/O because fewer disk arm movements are required. However, if different devices are used for input and temporary files, or if the input comes as a stream from another operator, the alternating behavior of replacement selection may permit more overlap of I/O and processing and therefore result in faster sorting.

The problem with replacement selection is memory management. If input items are kept in their original pages in the buffer (in order to save copying data, a real concern for large data volumes), each page must be kept in the buffer until its last record has been written to a run file. On the average, half a page's records will be in the priority heap.


Table 3. Variables, Their Meaning, and Units

Variable | Description | Units
M | Memory size | pages
R, S | Inputs or their sizes | pages
C | Cluster or unit of I/O | pages
F, K | Fan-in or fan-out | (none)
W | Number of level-0 run files | (none)
L | Number of merge levels | (none)

Thus, the priority heap must be reduced to half the size (the number of items in the heap is one half the number of records that fit into memory), canceling the advantage of longer and fewer run files. The solution to this problem is to copy records into a holding space and to keep them there while they are in the priority heap and until they are written to a run file. If the input items are of varying sizes, memory management is more complex than for quicksort because a new item may not fit into the space vacated in the holding space by the last item written into a run file. Solutions to this problem will introduce memory management overhead and some amount of fragmentation, i.e., the size of runs will be less than twice the size of memory. Thus, the advantage of having fewer runs must be balanced with the different I/O pattern and the disadvantage of more complex memory management.

The level-0 runs are merged into level-1 runs, which are merged into level-2 runs, etc., to produce the sorted output. During merging, a certain amount of buffer memory must be dedicated to each input run and the merge output. We call the unit of I/O a cluster in this survey, which is a number of pages located contiguously on disk. We indicate the cluster size with C, which we measure in pages just like memory and input sizes. The number of I/O clusters that fit in memory is the quotient of memory size and cluster size. The maximal merge fan-in F, i.e., the number of runs that can be merged at one time, is this quotient minus one cluster for the output. Thus, F = ⌊M/C⌋ − 1. Since the sizes of runs grow by a factor F from level to level, the number of merge levels L, i.e., the number of times each item is written to a run file, is logarithmic with the input size, namely L = ⌈log_F(W)⌉.

There are four considerations that can improve the merge efficiency. The first two issues pertain to scheduling of I/O operations. First, scans are faster if read-ahead and write-behind are used; therefore, double-buffering using two pages of memory per input run and two for the merge output might speed the merge process [Salzberg 1990; Salzberg et al. 1990]. The obvious disadvantage is that the fan-in is cut in half. However, instead of reserving 2 × F + 2 clusters, a predictive method called forecasting can be employed in which the largest key in each input buffer is used to determine from which input run the next cluster will be read. Thus, the fan-in can be set to any number in the range ⌊M/(2 × C)⌋ − 2 ≤ F ≤ ⌊M/C⌋ − 1. One or two read-ahead buffers per input disk are sufficient, and F = ⌊M/C⌋ − 3 will be reasonable in most cases because it uses maximal fan-in with one forecasting input buffer and double-buffering for the merge output.

Second, if the operating system and the I/O hardware support them, using large cluster sizes for the run files is very beneficial. Larger cluster sizes will reduce the fan-in and therefore may increase the number of merge levels.⁵ However, each merging level is performed much faster because fewer I/O operations and disk seeks and latency delays are required. Furthermore, if the unit of I/O is equal to a disk track, rotational latencies can be avoided entirely with a sufficiently smart disk controller.

⁵ In files storing permanent data, large clusters (units of I/O) containing many records may also create artificial buffer contention (if much more disk space is copied into the buffer than truly necessary for one record) and "false sharing" in environments with page (cluster) locks, i.e., artificial concurrency conflicts. Since run files in a sort operation are not shared but temporary, these problems do not exist in this context.


Usually, relatively small fan-ins with large cluster sizes are the optimal choice, even if the sort requires multiple merge levels [Graefe 1990a]. The precise tradeoff depends on disk seek, latency, and transfer times. It is interesting to note that the optimal cluster size and fan-in basically do not depend on the input size.

As a concrete example, consider sorting a file of R = 50 MB = 51,200 KB using M = 160 KB of memory. The number of runs created by quicksort will be W = ⌈51200/160⌉ = 320. Depending on the disk access and transfer times (e.g., 25 ms disk seek and latency, 2 ms transfer time for a page of 4 KB), C = 16 KB will typically be a good cluster size for fast merging. If one cluster is used for read-ahead and two for the merge output, the fan-in will be F = ⌊160/16⌋ − 3 = 7. The number of merge levels will be L = ⌈log_7(320)⌉ = 3. If a 16 KB I/O operation takes T = 33 ms, the total I/O time, including a factor of two for writing and reading at each merge level, for the entire sort will be 2 × L × ⌈R/C⌉ × T = 10.56 min.
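The arithmetic of this example is easy to re-derive; the short Python sketch below simply evaluates the formulas given above with the example's parameters (the variable names are illustrative only).

    import math

    R = 51_200      # input size in KB (50 MB)
    M = 160         # memory size in KB
    C = 16          # cluster size in KB
    T = 0.033       # seconds per 16 KB I/O operation

    W = math.ceil(R / M)                    # initial runs from quicksort: 320
    F = M // C - 3                          # fan-in with forecasting buffer: 7
    L = math.ceil(math.log(W, F))           # merge levels: 3
    io_seconds = 2 * L * math.ceil(R / C) * T
    print(W, F, L, io_seconds / 60)         # 320 7 3 10.56 (minutes)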

An entirely different approach to determining optimal cluster sizes and the amount of memory allocated to forecasting and read-ahead is based on processing and I/O bandwidths and latencies. The cluster sizes should be set such that the I/O bandwidth matches the processing bandwidth of the CPU. Bandwidths for both I/O and CPU are measured here in records or bytes per unit time; instructions per unit time (MIPS) are irrelevant. It is interesting to note that the CPU's processing bandwidth is largely determined by how fast the CPU can assemble new pages, in other words, how fast the CPU can copy records within memory. This performance measure is usually ignored in modern CPU and cache designs geared towards high MIPS or MFLOPS numbers [Ousterhout 1990].

Tuning the sort based on bandwidth and latency proceeds in three steps. First, the cluster size is set such that the processing and I/O bandwidths are equal or very close to equal. If the sort is I/O bound, the cluster size is increased for less disk access overhead per record and therefore faster I/O; if the sort is CPU bound, the cluster size is decreased to slow the I/O in favor of a larger merge fan-in. Next, in order to ensure that the two processing components (I/O and CPU) never (or almost never) have to wait for one another, the amount of space dedicated to read-ahead is determined as the I/O time for one cluster multiplied by the processing bandwidth. Typically, this will result in one cluster of read-ahead space per disk used to store and read input runs into a merge. Of course, in order to make read-ahead effective, forecasting must be used. Finally, the same amount of buffer space is allocated for the merge output (access latency times bandwidth) to ensure that merge processing never has to wait for the completion of output I/O. It is an open issue whether these two alternative approaches to tuning cluster size and read-ahead space result in different allocations and sorting speeds or whether one of them is more effective than the other.

The third and fourth merging issues focus on using (and exploiting) the maximal fan-in as effectively and often as possible. Both issues require adjusting the fan-in of the first merge step using the formula given below, either the first merge step of all merge steps or, in semi-eager merging [Graefe 1990a], the first merge step after the end of the input has been reached. This adjustment is used for only one merge step, called the initial merge here, not for an entire merge level.

The third issue to be considered is that the number of runs W is typically not a power of F; therefore, some merges proceed with fewer than F inputs, which creates the opportunity for some optimization. Instead of always merging runs of only one level together, the optimal strategy is to merge as many runs as possible using the smallest run files available. The only exception is the fan-in of the first merge, which is determined to ensure that all subsequent merges will use the full fan-in F.


Figure 6. Naive and optimized merging.

Let us explain this idea with the example shown in Figure 6. Consider a sort with a maximal fan-in F = 10 and an input file that requires W = 12 initial runs. Instead of merging only runs of the same level as shown in Figure 6, merging is delayed until the end of the input has been reached. In the first merge step, only 3 of the 12 runs are combined, and the result is then merged with the other 9 runs, as shown in Figure 6. The I/O cost (measured by the number of memory loads that must be written to any of the runs created) for the first strategy is 12 + 10 + 2 = 24, while for the second strategy it is 12 + 3 = 15. In other words, the first strategy requires 60% more I/O to temporary files than the second one. The general rule is to merge just the right number of runs after the end of the input file has been reached, and to always merge the smallest runs available for merging. More detailed examples are given in Graefe [1990a]. One consequence of this optimization is that the merge depth L, i.e., the number of run files a record is written to during the sort or the number of times a record is written to and read from disk, is not uniform for all records. Therefore, it makes sense to calculate an average merge depth (as required in cost estimation during query optimization), which may be a fraction. Of course, there are much more sophisticated merge optimizations, e.g., cascade and polyphase merges [Knuth 1973].

Fourth, some operations, for example merge-join (to be discussed in the section on matching), require multiple sorted inputs, and sort output can be passed directly from the final merge into the next operation (as is natural when using iterators); therefore, memory must be divided among multiple final merges. Thus, the final fan-in f and the "normal" fan-in F should be specified separately in an actual sort implementation. Using a final fan-in of 1 also allows the sort operator to produce output into a very slow operator, e.g., a display operation that allows scrolling by a human user, without occupying a lot of buffer memory for merging input runs over an extended period of time.⁶

Considering the last two optimization options for merging, the following formula determines the fan-in of the first merge. Each merge with normal fan-in F will reduce the number of run files by F − 1 (removing F runs, creating one new one). The goal is to reduce the number of runs from W to f and then to 1 (the final output). Thus, the first merge should reduce the number of runs to f + k(F − 1) for some integer k. In other words, the first merge should use a fan-in of F0 = ((W − f − 1) mod (F − 1)) + 2. In the example of Figure 6, (12 − 10 − 1) mod (10 − 1) + 2 results in a fan-in for the initial merge of F0 = 3. If the sort of Figure 6 were the input into a merge-join and if a final fan-in of 5 were desired, the initial merge should proceed with a fan-in of F0 = (12 − 5 − 1) mod (10 − 1) + 2 = 8.
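The formula for the initial fan-in can be checked with a few lines of code; the sketch below also simulates the subsequent full merges to confirm that exactly f runs remain for the final merge (a toy verification, not a sort implementation).

    def initial_fan_in(W, F, f=1):
        # Fan-in of the initial merge so all later merges use the full fan-in F.
        return (W - f - 1) % (F - 1) + 2

    def runs_before_final_merge(W, F, f=1):
        runs = W - initial_fan_in(W, F, f) + 1   # after the initial merge
        while runs > f:
            runs -= F - 1                        # each full merge removes F - 1 runs
        return runs

    print(initial_fan_in(12, 10, 1))             # 3, as in Figure 6
    print(initial_fan_in(12, 10, 5))             # 8, for a desired final fan-in of 5
    print(runs_before_final_merge(12, 10, 5))    # 5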

If multiple sort operations produce input data for a common consumer operator, e.g., a merge-join, the two final fan-ins should be set proportionally to the sizes of the two inputs.

⁶ There is a similar case of resource sharing among the operators producing a sort's input and the run generation phase of the sort. We will come back to these issues later in the section on executing and scheduling complex queries and plans.


For example, if two merge-join inputs are 1 MB and 9 MB, and if 20 clusters are available for inputs into the two final merges, then 2 clusters should be allocated for the first and 18 clusters for the second input (1/9 = 2/18).

Sorting is sometimes criticized because it requires, unlike hybrid hashing (discussed in the next section), that the entire input be written to run files and then retrieved for merging. This difference has a particularly large effect for files only slightly larger than memory, e.g., 1.25 times the size of memory. Hybrid hashing determines dynamically how much of the input data truly must be written to temporary disk files. In the example, only slightly more than one quarter of the memory size must be written to temporary files on disk while the remainder of the file remains in memory. In sorting, the entire file (1.25 memory sizes) is written to one or two run files and then read for merging. Thus, sorting seems to require five times more I/O for temporary files in this example than hybrid hashing. However, this is not necessarily true. The simple trick is to write initial runs in decreasing (reverse) order. When the input is exhausted, and merging in increasing order commences, buffer memory is still full of useful pages with small sort keys that can be merged immediately without I/O and that never have to be written to disk. The effect of writing runs in reverse order is comparable to that of hybrid hashing, i.e., it is particularly effective if the input is only slightly larger than the available memory.

To demonstrate the effect of cluster size optimization (the second of the four merging issues discussed above), we sorted 100,000 100-byte records, about 10 MB, with the Volcano query processing system, which includes all merge optimizations described above with the exception of read-ahead and forecasting. (This experiment and a similar one were described in more detail earlier [Graefe 1990a; Graefe et al. 1993].) We used a sort space of 40 pages (160 KB) within a 50-page (200 KB) I/O buffer, varying the cluster size from 1 page (4 KB) to 15 pages (60 KB). The initial run size was 1,600 records, for a total of 63 initial runs. We counted the number of I/O operations and the transferred pages for all run files, and calculated the total I/O cost by charging 25 ms per I/O operation (for seek and rotational latency) and 2 ms for each transferred page (assuming 2 MB/sec transfer rate). As can be seen in Table 4 and Figure 7, there is an optimal cluster size with minimal I/O cost. The curve is not as smooth as might have been expected from the approximate cost function because the curve reflects all real-system effects such as rounding (truncating) the fan-in if the cluster size is not an exact divisor of the memory size, the effectiveness of merge optimization varying for different fan-ins, and internal fragmentation in clusters. The detailed data in Table 4, however, reflect the trends that larger clusters and smaller fan-ins clearly increase the amount of data transferred but not the number of I/O operations (disk and latency time) until the fan-in has shrunk to very small values, e.g., 3. It is clearly suboptimal to always choose the smallest cluster size (1 page) in order to obtain the largest fan-in and fewest merge levels. Furthermore, it seems that the range of cluster sizes that result in near-optimal total I/O costs is fairly large; thus it is not as important to determine the exact value as it is to use a cluster size "in the right ball park." The optimal fan-in is typically fairly small; however, it is not e or 3 as derived by Bratbergsengen [1984] under the (unrealistic) assumption that the cost of an I/O operation is independent of the amount of data being transferred.
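The cost model behind Table 4 is just the per-operation and per-page charges stated above; the sketch below recomputes the last column from the measured operation and page counts (copied from Table 4) and confirms that the minimum lies at a cluster size of 10 pages, not 1.

    # (cluster size in 4 KB pages, I/O operations, pages transferred), from Table 4
    rows = [(1, 6874, 6874), (2, 4298, 8596), (3, 3176, 9528), (4, 2406, 9624),
            (5, 1984, 9920), (6, 2132, 12792), (7, 1980, 13860), (8, 1718, 13744),
            (9, 1732, 15588), (10, 1490, 14900), (11, 1798, 19778), (12, 1686, 20232),
            (13, 1628, 21164), (14, 2182, 30548), (15, 2070, 31050)]

    for size, ops, pages in rows:
        cost = ops * 0.025 + pages * 0.002    # 25 ms per operation, 2 ms per page
        print(f"cluster size {size:2d}: {cost:8.3f} sec")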

2.2 Hashing

For many matching tasks, hashing is an alternative to sorting. In general, when equality matching is required, hashing should be considered because the expected complexity of set algorithms based on hashing is O(N) rather than O(N log N) as for sorting.


Table 4. Effect of Cluster Size Optimizations

Cluster Size   Fan-in   Average   Disk         Pages Transferred   Total I/O
[x 4 KB]                Depth     Operations   [x 4 KB]            Cost [sec]
     1           40     1.376       6874            6874            185.598
     2           20     1.728       4298            8596            124.642
     3           13     1.872       3176            9528             98.456
     4           10     1.936       2406            9624             79.398
     5            8     2.000       1984            9920             69.440
     6            6     2.520       2132           12792             78.884
     7            5     2.760       1980           13860             77.220
     8            5     2.760       1718           13744             70.438
     9            4     3.000       1732           15588             74.476
    10            4     3.000       1490           14900             67.050
    11            3     3.856       1798           19778             84.506
    12            3     3.856       1686           20232             82.614
    13            3     3.856       1628           21164             83.028
    14            2     5.984       2182           30548            115.646
    15            2     5.984       2070           31050            113.850

Figure 7. Effect of cluster size optimizations: total I/O cost [sec] versus cluster size [x 4 KB].

Of course, this makes intuitive sense if hashing is viewed as radix sorting on a virtual key [Knuth 1973].

Hash-based query processing algorithms use an in-memory hash table of database objects to perform their matching task. If the entire hash table (including all records or items) fits into memory, hash-based query processing algorithms are very easy to design, understand, and implement, and they outperform sort-based alternatives. Note that for binary matching operations, such as join or intersection, only one of the two inputs must fit into memory. However, if the required hash table is larger than memory, hash table overflow occurs and must be dealt with.

There are basically two methods for managing hash table overflow, namely avoidance and resolution. In either case, the input is divided into multiple partition files such that partitions can be processed independently from one another, and the concatenation of the results of all partitions is the result of the entire operation. Partitioning should ensure that the partition files are of roughly even size and can be done using either hash partitioning or range partitioning, i.e., based on keys estimated to be quantiles.


Usually, partition files can be processed using the original hash-based algorithm. The maximal partitioning fan-out F, i.e., the number of partition files created, is determined by the memory size M divided by the cluster size C minus one cluster for the partitioning input, i.e., F = ⌊M/C⌋ − 1, just like the fan-in for sorting.

In hash table overflow avoidance, the input set is partitioned into F partition files before any in-memory hash table is built. If it turns out that fewer partitions than have been created would have been sufficient to obtain partition files that will fit into memory, bucket tuning (collapsing multiple small buckets into larger ones) and dynamic destaging (determining which buckets should stay in memory) can improve the performance of hash-based operations [Kitsuregawa et al. 1989a; Nakayama et al. 1988].

Algorithms based on hash table overflow resolution start with the assumption that overflow will not occur, but resort to basically the same set of mechanisms as hash table overflow avoidance once it does occur. No real system uses this naive hash table overflow resolution because so-called hybrid hashing is as efficient but more flexible. Hybrid hashing combines in-memory hashing and overflow resolution [DeWitt et al. 1984; Shapiro 1986]. Although invented for relational join and known as hybrid hash join, hybrid hashing is equally applicable to all hash-based query processing algorithms.

Hybrid hash algorithms start out with the (optimistic) premise that no overflow will occur; if it does, however, they partition the input into multiple partitions of which only one is written immediately to temporary files on disk. The other F − 1 partitions remain in memory. If another overflow occurs, another partition is written to disk. If necessary, all F partitions are written to disk. Thus, hybrid hash algorithms use all available memory for in-memory processing, but at the same time are able to process large input files by overflow resolution. Figure 8 shows the idea of hybrid hash algorithms.

Figure 8. Hybrid hashing. The hash directory points to hash buckets; as many buckets as possible are kept in memory, and the remaining buckets are spooled to partition files on disk.

As many hash buckets as possible are kept in memory, e.g., as linked lists as indicated by solid arrows. The other hash buckets are spooled to temporary disk files, called the overflow or partition files, and are processed in later stages of the algorithm. Hybrid hashing is useful if the input size R is larger than the memory size M but smaller than the memory size multiplied by the fan-out F, i.e., M < R < F × M.

In order to predict the number of I/O operations (which actually is not necessary for execution because the algorithm adapts to its input size but may be desirable for cost estimation during query optimization), the number of required partition files on disk must be determined. Call this number K, which must satisfy 0 ≤ K ≤ F. Presuming that the assignment of buckets to partitions is optimal and that each partition file is equal to the memory size M, the amount of data that may be written to K partition files is equal to K × M. The number of required I/O buffers is 1 for the input and K for the output partitions, leaving M − (K + 1) × C memory for the hash table. The optimal K for a given input size R is the minimal K for which K × M + (M − (K + 1) × C) > R. Solving this inequality and taking the smallest such K results in K = ⌈(R − M + C)/(M − C)⌉.


The minimal possible I/O cost, including a factor of 2 for writing and reading the partition files and measured in the amount of data that must be written or read, is 2 × (R − (M − (K + 1) × C)). To determine the I/O time, this amount must be divided by the cluster size and multiplied with the I/O time for one cluster.

For example, consider an input of R = 240 pages, a memory of M = 80 pages, and a cluster size of C = 8 pages. The maximal fan-out is F = ⌊80/8⌋ − 1 = 9. The number of partition files that need to be created on disk is K = ⌈(240 − 80 + 8)/(80 − 8)⌉ = 3. In other words, in the best case, K × C = 3 × 8 = 24 pages will be used as output buffers to write K = 3 partition files of no more than M = 80 pages, and M − (K + 1) × C = 80 − 4 × 8 = 48 pages of memory will be used as the hash table. The total amount of data written to and read from disk is 2 × (240 − (80 − 4 × 8)) = 384 pages. If writing or reading a cluster of C = 8 pages takes 40 msec, the total I/O time is 384/8 × 40 = 1.92 sec.
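The same numbers can be obtained by evaluating the formulas directly, as the following few lines show (a restatement of the example, nothing more).

    import math

    R, M, C = 240, 80, 8                    # input, memory, cluster size in pages
    F = M // C - 1                          # maximal fan-out: 9
    K = math.ceil((R - M + C) / (M - C))    # partition files on disk: 3
    hash_table = M - (K + 1) * C            # memory left for the hash table: 48
    io_pages = 2 * (R - hash_table)         # pages written and read: 384
    io_seconds = io_pages / C * 0.040       # 40 ms per 8-page cluster: 1.92
    print(F, K, hash_table, io_pages, io_seconds)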

In the calculation of K, we assumed an optimal assignment of hash buckets to partition files. If buckets were assigned in the most straightforward way, e.g., by dividing the hash directory into F equal-size regions and assigning the buckets of one region to a partition as indicated in Figure 8, all partitions would be of nearly the same size, and either all or none of them would fit into their output cluster and therefore into memory. In other words, once hash table overflow occurred, all input would be written to partition files. Thus, we presumed in the earlier calculations that hash buckets were assigned more intelligently to output partitions.

There are three ways to assign hash buckets to partitions. First, each time a hash table overflow occurs, a fixed number of hash buckets is assigned to a new output partition. In the Gamma database machine, the number of disk partitions is chosen "such that each bucket [bucket here means what is called an output partition in this survey] can reasonably be expected to fit in memory" [DeWitt and Gerber 1985], e.g., 10% of the hash buckets in the hash directory for a fan-out of 10 [Schneider 1990]. In other words, the fan-out is set a priori by the query optimizer based on the expected (estimated) input size. Since the page size in Gamma is relatively small, only a fraction of memory is needed for output buffers, and an in-memory hash table can be used even while output partitions are being written to disk. Second, in bucket tuning and dynamic destaging [Kitsuregawa et al. 1989a; Nakayama 1988], a large number of small partition files is created and then collapsed into fewer partition files no larger than memory. In order to obtain a large number of partition files and, at the same time, retain some memory for a hash table, the cluster size is set quite small, e.g., C = 1 page, and the fan-out is very large though not maximal, e.g., F = M/C/2. In the example above, F = 40 output partitions with an average size of R/F = 6 pages could be created, even though only K = 3 output partitions are required. The smallest partitions are assigned to fill an in-memory hash table of size M − K × C = 80 − 3 × 1 = 77 pages. Hopefully, the dynamic destaging rule (when an overflow occurs, assign the largest partition still in memory to disk) ensures that, indeed, the smallest partitions are retained in memory. The partitions assigned to disk are collapsed into K = 3 partitions of no more than M = 80 pages, to be processed in K = 3 subsequent phases. In binary operations such as intersection and relational join, bucket tuning is quite effective for skew in the first input, i.e., if the hash value distribution is nonuniform and if the partition files are of uneven sizes. It avoids spooling parts of the second (typically larger) input to temporary partition files because the partitions in memory can be matched immediately using a hash table in the memory not required as output buffer and because a number of small partitions have been collapsed into fewer, larger partitions, increasing the memory available for the hash table. For skew in the second input, bucket tuning and dynamic destaging have no advantage.


Another disadvantage of bucket tuning and dynamic destaging is that the cluster size has to be relatively small, thus requiring a large number of I/O operations with disk seeks and rotational latencies to write data to the overflow files. Third, statistics gathered before hybrid hashing commences can be used to assign hash buckets to partitions [Graefe 1993a].

Unfortunately, it is possible that one or more partition files are larger than memory. In that case, partitioning is used recursively until the file sizes have shrunk to memory size. Figure 9 shows how a hash-based algorithm for a unary operation, such as aggregation or duplicate removal, partitions its input file over multiple recursion levels. The recursion terminates when the files fit into memory. In the deepest recursion level, hybrid hashing may be employed.

If the partitioning (hash) function is good and creates a uniform hash value distribution, the file size in each recursion level shrinks by a factor equal to the fan-out, and therefore the number of recursion levels L is logarithmic with the size of the input being partitioned. After L partitioning levels, each partition file is of size R′ = R/F^L. In order to obtain partition files suitable for hybrid hashing (with M < R′ < F × M), the number of full recursion levels L, i.e., levels at which hybrid hashing is not applied, is L = ⌊log_F(R/M)⌋. The I/O cost of the remaining step using hybrid hashing can be estimated using the hybrid hash formula above with R replaced by R′ and multiplying the cost by F^L because hybrid hashing is used for this number of partition files. Thus, the total I/O cost for partitioning an input and using hybrid hashing in the deepest recursion level is

2 × R × L + 2 × F^L × (R′ − (M − K × C))
    = 2 × (R × (L + 1) − F^L × (M − K × C))
    = 2 × (R × (L + 1) − F^L × (M − ⌈(R′ − M)/(M − C)⌉ × C)).
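Evaluating this cost formula for arbitrary inputs takes only a few lines; the sketch below follows the fan-out, recursion-depth, and K formulas of this section and assumes R > M and uniform partitioning (sizes in pages; names are illustrative).

    import math

    def hash_io_pages(R, M, C):
        # Total I/O volume for recursive partitioning with hybrid hashing at
        # the deepest level; R, M, C in pages, R > M assumed.
        F = M // C - 1                                     # partitioning fan-out
        L = 0 if R <= F * M else math.floor(math.log(R / M, F))
        R_prime = R / F**L                                 # file size after L levels
        K = max(math.ceil((R_prime - M) / (M - C)), 0)     # overflow files per file
        return 2 * (R * (L + 1) - F**L * (M - K * C))

    print(hash_io_pages(10_000, 80, 8))    # 48336 pages for a 10,000-page input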

Figure 9. Recursive partitioning.

A major problem with hash-based algorithms is that their performance depends on the quality of the hash function. In many situations, fairly simple hash functions will perform reasonably well. Remember that the purpose of using hash-based algorithms usually is to find database items with a specific key or to bring like items together; thus, methods as simple as using the value of a join key as a hash value will frequently perform satisfactorily. For string values, good hash values can be determined by using binary "exclusive or" operations or by determining cyclic redundancy check (CRC) values as used for reliable data storage and transmission. If the quality of the hash function is a potential problem, universal hash functions should be considered [Carter and Wegman 1979].
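As an illustration of the simple "exclusive or" approach mentioned above (a sketch only; CRCs or universal hash functions give stronger guarantees), a string can be folded into a hash value as follows.

    def xor_hash(key: bytes, table_size: int) -> int:
        h = 0
        for b in key:
            # fold each byte in with a rotate-and-xor step, keeping 32 bits
            h = ((h << 5) ^ (h >> 27) ^ b) & 0xFFFFFFFF
        return h % table_size

    print(xor_hash(b"Hardware", 1024))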

If the partitioning is skewed, the recursion depth may be unexpectedly high, making the algorithm rather slow. This is analogous to the worst-case performance of quicksort, O(N²) comparisons for an array of N items, if the partitioning pivots are chosen extremely poorly and do not divide arrays into nearly equal subarrays.

Skew is the major danger for inferior performance of hash-based query-processing algorithms. There are several ways to deal with skew. For hash-based algorithms using overflow avoidance, bucket tuning and dynamic destaging are quite effective. Another method is to obtain statistical information about hash values and to use it to carefully assign hash buckets to partitions.


Such statistical information can be kept in the form of histograms and can either come from permanent system catalogs (metadata), from sampling the input, or from previous recursion levels. For example, for an intermediate query processing result for which no statistical parameters are known a priori, the first partitioning level might have to proceed naively, pretending that the partitioning hash function is perfect, but the second and further recursion levels should be able to use statistics gathered in earlier levels to ensure that each partitioning step creates even partitions, i.e., that the data is partitioned with maximal effectiveness [Graefe 1993a]. As a final resort, if skew cannot be managed otherwise, or if distribution skew is not the problem but duplicates are, some systems resort to algorithms that are not affected by data or hash value skew. For example, Tandem's hash join algorithm resorts to nested-loops join (to be discussed later) [Zeller and Gray 1990].
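A simple way to use such histograms is a greedy assignment of hash buckets to partitions, always placing the next-largest bucket into the currently smallest partition; the sketch below is one possible heuristic of this kind, not a description of any cited system.

    import heapq

    def assign_buckets(bucket_counts, fan_out):
        # bucket_counts: hash bucket number -> estimated item count, e.g.,
        # gathered while writing the previous partitioning level.
        partitions = [(0, p, []) for p in range(fan_out)]   # (size, id, buckets)
        heapq.heapify(partitions)
        for bucket, count in sorted(bucket_counts.items(), key=lambda kv: -kv[1]):
            size, p, members = heapq.heappop(partitions)    # smallest partition
            members.append(bucket)
            heapq.heappush(partitions, (size + count, p, members))
        return {p: members for _, p, members in partitions}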

As for sorting, larger cluster sizes result in faster I/O at the expense of smaller fan-outs, with the optimal fan-out being fairly small [Graefe 1993a; Graefe et al. 1993]. Thus, multiple recursion levels are not uncommon for large files, and statistics gathered on one level to limit skew effects on the next level are a realistic method for large files to control the performance penalties of uneven partitioning.

3. DISK ACCESS

All query evaluation systems have to access base data stored in the database. For databases in the megabyte to terabyte range, base data are typically stored on secondary storage in the form of rotating random-access disks. However, deeper storage hierarchies including optical storage, (maybe robot-operated) tape archives, and remote storage servers will also have to be considered in future high-functionality high-volume database management systems, e.g., as outlined by Stonebraker [1991]. Research into database systems supporting and exploiting a deep storage hierarchy is still in its infancy.

On the other hand, some researchers have considered in-memory or main-memory databases, motivated both by the desire for faster transaction and query-processing performance and by the decreasing cost of semiconductor memory [Analyti and Pramanik 1992; Bitton et al. 1987; Bucheral et al. 1990; DeWitt et al. 1984; Gruenwald and Eich 1991; Kumar and Burger 1991; Lehman and Carey 1986; Li and Naughton 1988; Severance et al. 1990; Whang and Krishnamurthy 1990]. However, for most applications, an analysis by Gray and Putzolu [1987] demonstrated that main memory is cost effective only for the most frequently accessed data. The time interval between accesses with equal disk and memory costs was five minutes for their values of memory and disk prices and was expected to grow as main-memory prices decrease faster than disk prices. For the purposes of this survey, we will presume a disk-based storage architecture and will consider disk I/O one of the major costs of query evaluation over large databases.

3.1 File Scans

The first operator to access base data is the file scan, typically combined with a built-in selection facility. There is not much to be said about file scan except that it can be made very fast using read-ahead, particularly large-chunk ("track-at-a-crack") read-ahead. In some database systems, e.g., IBM's DB2, the read-ahead size is coordinated with the free space left for future insertions during database reorganization. If a free page is left after every 15 full data pages, the read-ahead unit of 16 pages (64 KB) ensures that overflow records are immediately available in the buffer.

Efficient read-ahead requires contiguous file allocation, which is supported by many operating systems. Such contiguous disk regions are frequently called extents. The UNIX operating system does not provide contiguous files, and many database systems running on UNIX use "raw" devices instead, even though this means that the database management system must provide operating-system functionality such as file structures, disk space allocation, and buffering.

The disadvantages of large units of I/O are buffer fragmentation and the waste of I/O and bus bandwidth if only individual records are required. Permitting different page sizes may seem to be a good idea, even at the added complexity in the buffer manager [Carey et al. 1986; Sikeler 1988], but this does not solve the problem of mixed sequential scans and random record accesses within one file. The common solution is to choose a middle-of-the-road page size, e.g., 8 KB, and to support multipage read-ahead.

3.2 Associative Access Using Indices

In order to reduce the number of accesses to secondary storage (which is relatively slow compared to main memory), most database systems employ associative search techniques in the form of indices that map key or attribute values to locator information with which database objects can be retrieved. The best known and most often used database index structure is the B-tree [Bayer and McCreight 1972; Comer 1979]. A large number of extensions to the basic structure and its algorithms have been proposed, e.g., B+-trees for faster scans, fast loading from a sorted file, increased fan-out and reduced depth by prefix and suffix truncation, B*-trees for better space utilization in random insertions, and top-down B-trees for better locking behavior through preventive maintenance [Guibas and Sedgewick 1978]. Interestingly, B-trees seem to be having a renaissance as a research subject, in particular with respect to improved space utilization [Baeza-Yates and Larson 1989], concurrency control [Srinivasan and Carey 1991], recovery [Lanka and Mays 1991], parallelism [Seeger and Larson 1991], and on-line creation of B-trees for very large databases [Srinivasan and Carey 1992]. On-line reorganization and modification of storage structures, though not a new idea [Omiecinski 1985], is likely to become an important research topic within database research over the next few years as databases become larger and larger and are spread over many disks and many nodes in parallel and distributed systems.

While most current database system implementations only use some form of B-tree, an amazing variety of index structures has been described in the literature [Becker et al. 1991; Beckmann et al. 1990; Bentley 1975; Finkel and Bentley 1974; Guenther and Bilmes 1991; Gunther and Wong 1987; Gunther 1989; Guttman 1984; Henrich et al. 1989; Hoel and Samet 1992; Hutflesz et al. 1988a; 1988b; 1990; Jagadish 1991; Kemper and Wallrath 1987; Kolovson and Stonebraker 1991; Kriegel and Seeger 1987; 1988; Lomet and Salzberg 1990a; Lomet 1992; Neugebauer 1991; Robinson 1981; Samet 1984; Six and Widmayer 1988]. One of the few multidimensional index structures actually implemented in a complete database management system are R-trees in Postgres [Guttman 1984; Stonebraker et al. 1990b].

Table 5 shows some example index structures classified according to four characteristics, namely their support for ordering and sorted scans, their dynamic-versus-static behavior upon insertions and deletions, their support for multiple dimensions, and their support for point data versus range data. We omitted hierarchical concatenation of attributes and uniqueness, because all index structures can be implemented to support these. The indication "no range data" for multidimensional index structures indicates that range data are not part of the basic structure, although they can be simulated using twice the number of dimensions. We included a reference or two with each structure; we selected original descriptions and surveys over the many subsequent papers on special aspects such as performance analyses, multidisk and multiprocessor implementations, page placement on disk, concurrency control, recovery, order-preserving hashing, mapping range data of N dimensions into point data of 2N dimensions, etc.; this list suggests the wealth of subsequent research, in particular on B-trees, linear hashing, and refined multidimensional index structures.


Table 5. Classification of Some Index Structures

Structure            Ordered   Dynamic   Multi-Dim.   Range Data   References
ISAM                 Yes       No        No           No           [Larson 1981]
B-trees              Yes       Yes       No           No           [Bayer and McCreight 1972; Comer 1979]
Quad-tree            Yes       Yes       Yes          No           [Finkel and Bentley 1974; Samet 1984]
kD-trees             Yes       Yes       Yes          No           [Bentley 1975]
KDB-trees            Yes       Yes       Yes          No           [Robinson 1981]
hB-trees             Yes       Yes       Yes          No           [Lomet and Salzberg 1990a]
R-trees              Yes       Yes       Yes          Yes          [Guttman 1984]
Extendible Hashing   No        Yes       No           No           [Fagin et al. 1979]
Linear Hashing       No        Yes       No           No           [Litwin 1980]
Grid Files           Yes       Yes       Yes          No           [Nievergelt et al. 1984]

Storage structures typically thought of as index structures may be used as primary structures to store actual data or as redundant structures ("access paths") that do not contain actual data but pointers to the actual data items in a separate data file. For example, Tandem's NonStop SQL system uses B-trees for actual data as well as for redundant index structures. In this case, a redundant index structure contains not absolute locations of the data items but keys used to search the primary B-tree. If indices are redundant structures, they can still be used to cluster the actual data items, i.e., the order or organization of index entries determines the order of items in the data file. Such indices are called clustering indices; other indices are called nonclustering indices. Clustering indices do not necessarily contain an entry for each data item in the primary file, but only one entry for each page of the primary file; in this case, the index is called sparse. Nonclustering indices must always be dense, i.e., there are the same number of entries in the index as there are items in the primary file.

The common theme for all index structures is that they associatively map some attribute of a data object to some locator information that can then be used to retrieve the actual data object. Typically, in relational systems, an attribute value is mapped to a tuple or record identifier (TID or RID). Different systems use different approaches, but it seems that most new designs do not firmly attach the record lookup to the index scan.

There are several advantages to separating index scan and record lookup. First, it is possible to scan an index only without ever retrieving records from the underlying data file. For example, if only salary values are needed (e.g., to determine the count or sum of all salaries), it is sufficient to access the salary index only without actually retrieving the data records. The advantages are that (i) fewer I/Os are required (consider the number of I/Os for retrieving N successive index entries and those to retrieve N index entries plus N full records, in particular if the index is nonclustering [Mackert and Lohman 1989]) and (ii) the remaining I/O operations are basically sequential along the leaves of the index (at least for B+-trees; other index types behave differently). The optimizers of several commercial relational products have recently been revised to recognize situations in which an index-only scan is sufficient. Second, even if none of the existing indices is sufficient by itself, multiple indices may be "joined" on equal RIDs to obtain all attributes required for a query (join algorithms are discussed below in the section on binary matching).


For example, by matching entries in indices on salaries and on names by equal RIDs, the correct salary-name pairs are established. If a query requires only names and salaries, this "join" has made accessing the underlying data file obsolete. Third, if two or more indices apply to individual clauses of a query, it may be more effective to take the union or intersection of RID lists obtained from two index scans than using only one index (algorithms for union and intersection are also discussed in the section on binary matching). Fourth, joining two tables can be accomplished by joining the indices on the two join attributes followed by record retrievals in the two underlying data sets; the advantage of this method is that only those records will be retrieved that truly contribute to the join result [Kooi 1980]. Fifth, for nonclustering indices, sets of RIDs can be sorted by physical location, and the records can be retrieved very efficiently, reducing substantially the number of disk seeks and their seek distances. Obviously, several of these techniques can be combined. In addition, some systems such as Rdb/VMS and DB2 use very sophisticated implementations of multi-index scans that decide dynamically, i.e., during run-time, which indices to scan, whether scanning a particular index reduces the resulting RID list sufficiently to offset the cost of the index scan, and whether to use bit vector filtering for the RID list intersection (see a later section on bit vector filtering) [Antoshenkov 1993; Mohan et al. 1990].

Record access performance for nonclustering indices can be addressed without performing the entire index scan first (as required if all RIDs are to be sorted) by using a "window" of RIDs. Instead of obtaining one RID from the index scan, retrieving the record, getting the next RID from the index scan, etc., the lookup operator (sometimes called "functional join") could load N RIDs, sort them into a priority heap, retrieve the most conveniently located record, get another RID, insert it into the heap, retrieve a record, etc. Thus, a functional join operator using a window always has N open references to items that must be retrieved, giving the functional join operator significant freedom to fetch items from disk efficiently. Of course, this technique works most effectively if no other transactions or operators use the same disk drive at the same time.
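A windowed lookup of this kind fits in a few lines; in the sketch below the page number of a RID stands in for its physical location, and fetch_record is an assumed record-retrieval callback, both illustrative rather than any system's interface.

    import heapq
    from itertools import count

    def functional_join(rids, fetch_record, window=64):
        heap, tie = [], count()      # tie-breaker keeps heap entries comparable
        for rid in rids:
            heapq.heappush(heap, (rid.page_no, next(tie), rid))
            if len(heap) >= window:  # window full: fetch the closest record
                _, _, next_rid = heapq.heappop(heap)
                yield fetch_record(next_rid)
        while heap:                  # drain the remaining open references
            _, _, next_rid = heapq.heappop(heap)
            yield fetch_record(next_rid)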

This idea has been generalized to assemble complex objects. In object-oriented systems, objects can contain pointers to (identifiers of) other objects or components, which in turn may contain further pointers, etc. If multiple objects and all their unresolved references can be considered concurrently when scheduling disk accesses, significant savings in disk seek times can be achieved [Keller et al. 1991].

3.3 Buffer Management

I/O cost can be further reduced by caching data in an I/O buffer. A large number of buffer management techniques have been devised; we give only a few references. Effelsberg and Haerder [1984] survey many of the buffer management issues, including those pertaining to issues of recovery, e.g., write-ahead logging. In a survey paper on the interactions of operating systems and database management systems, Stonebraker [1981] pointed out that the "standard" buffer replacement policy, LRU (least recently used), is wrong for many database situations. For example, a file scan reads a large set of pages but uses them only once, "sweeping" the buffer clean of all other pages, even if they might be useful in the future and should be kept in memory. Sacco and Schkolnick [1982; 1986] focused on the nonlinear performance effects of buffer allocation to many relational algorithms, e.g., nested-loops join. Chou [1985] and Chou and DeWitt [1985] combined these two ideas in their DBMIN algorithm which allocates a fixed number of buffer pages to each scan, depending on its needs, and uses a local replacement policy for each scan appropriate to its reference pattern. A recent study into buffer allocation is by Faloutsos et al. [1991] and Ng et al. [1991] on using marginal gain for buffer allocation.


A very promising research direction for buffer management in object-oriented database systems is the work by Palmer and Zdonik [1991] on saving reference patterns and using them to predict future object faults and to prevent them by prefetching the required pages.

The interactions of index retrieval and buffer management were studied by Sacco [1987] as well as Mackert and Lohman [1989], and several authors studied database buffer management and virtual memory provided by the operating system [Sherman and Brice 1976; Stonebraker 1981; Traiger 1982].

On the level of buffer manager implementation, most database buffer managers do not provide read and write interfaces to their client modules but fixing and unfixing, also called pinning and unpinning. The semantics of fixing is that a fixed page is not subject to replacement or relocation in the buffer pool, and a client module may therefore safely use a memory address within a fixed page. If the buffer manager needs to replace a page but all its buffer frames are fixed, some special action must occur such as dynamic growth of the buffer pool or transaction abort.

The iterator implementation of query evaluation algorithms can exploit the buffer's fix/unfix interface by passing pointers to items (records, objects) fixed in the buffer from iterator to iterator. The receiving iterator then owns the fixed item; it may unfix it immediately (e.g., after a predicate fails), hold on to the fixed record for a while (e.g., in a hash table), or pass it on to the next iterator (e.g., if a predicate succeeds). Because the iterator control and interaction of operators ensure that items are never produced and fixed before they are required, the iterator protocol is very efficient in its buffer usage.

Some implementors, however, have felt that intermediate results should not be materialized or kept in the database system's I/O buffer, e.g., in order to ease implementation of transaction (ACID) semantics, and have designed a separate memory management scheme for intermediate results and items passed from iterator to iterator. The cost of this decision is additional in-memory copying as well as the possible inefficiencies associated with, in effect, two buffer and memory managers.

4. AGGREGATION AND DUPLICATE REMOVAL

Aggregation is a very important statistical concept to summarize information about large amounts of data. The idea is to represent a set of items by a single value or to classify items into groups and determine one value per group. Most database systems support aggregate functions for minimum, maximum, sum, count, and average (arithmetic mean). Other aggregates, e.g., geometric mean or standard deviation, are typically not provided, but may be constructed in some systems with extensibility features. Aggregation has been added to both relational calculus and algebra and adds the same expressive power to each of them [Klug 1982].

Aggregation is typically supported in two forms, called scalar aggregates and aggregate functions [Epstein 1979]. Scalar aggregates calculate a single scalar value from a unary input relation, e.g., the sum of the salaries of all employees. Scalar aggregates can easily be determined using a single pass over a data set. Some systems exploit indices, in particular for minimum, maximum, and count.

Aggregate functions, on the other hand, determine a set of values from a binary input relation, e.g., the sum of salaries for each department. Aggregate functions are relational operators, i.e., they consume and produce relations. Figure 10 shows the output of the query "count of employees by department." The "by-list" or grouping attributes are the key of the new relation, the Department attribute in this example.

Algorithms for aggregate functions require grouping, e.g., employee items may be grouped by department, and then one output item is calculated per group.


Shoe        9
Hardware    7

Figure 10. Count of employees by department.

This grouping process is very similar to duplicate removal in which equal data items must be brought together, compared, and removed. Thus, aggregate functions and duplicate removal are typically implemented in the same module. There are only two differences between aggregate functions and duplicate removal. First, in duplicate removal, items are compared on all their attributes, but only on the attributes in the by-list of aggregate functions. Second, an identical item is immediately dropped from further consideration in duplicate removal whereas in aggregate functions some computation is performed before the second item of the same group is dropped. Both differences can easily be dealt with using a switch in an actual algorithm implementation. Because of their similarity, duplicate removal and aggregation are described and used interchangeably here.
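The switch between the two operations can be as small as one conditional, as in this in-memory sketch (the helper names are hypothetical, and the worked call below uses a tiny three-employee input rather than the counts of Figure 10).

    def group_or_distinct(items, group_key, aggregate=None):
        # One module for both operations: with aggregate=None it performs
        # duplicate removal; otherwise it folds items of a group together.
        groups = {}
        for item in items:
            key = group_key(item)
            if key not in groups:
                groups[key] = item
            elif aggregate is not None:
                groups[key] = aggregate(groups[key], item)
            # else: an identical item is simply dropped
        return list(groups.values())

    employees = [("Shoe", "Ann"), ("Shoe", "Bob"), ("Hardware", "Cat")]
    counts = group_or_distinct(((dept, 1) for dept, _ in employees),
                               group_key=lambda t: t[0],
                               aggregate=lambda acc, t: (acc[0], acc[1] + t[1]))
    print(counts)    # [('Shoe', 2), ('Hardware', 1)]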

In most existing commercial relational systems, aggregation and duplicate removal algorithms are based on sorting, following Epstein's [1979] work. Since aggregation requires that all data be consumed before any output can be produced, and since main memories were significantly smaller 15 years ago when the prototypes of these systems were designed, these implementations used temporary files for output, not streams and iterator algorithms. However, there is no reason why aggregation and duplicate removal cannot be implemented using iterators exploiting today's memory sizes.

4.1 Aggregation Algorithms Based on Nested Loops

There are three types of algorithms for aggregation and duplicate removal, based on nested loops, sorting, and hashing. The first algorithm, which we call nested-loops aggregation, is the most simple-minded one. Using a temporary file to accumulate the output, it loops for each input item over the output file accumulated so far and either aggregates the input item into the appropriate output item or creates a new output item and appends it to the output file. Obviously, this algorithm is quite inefficient for large inputs, even if some performance enhancements can be applied.⁷ We mention it here because it corresponds to the algorithm choices available for relational joins and other binary matching problems (discussed in the next section), which are the nested-loops join and the more efficient sort-based and hash-based join algorithms. As for joins and binary matching, where the nested-loops algorithm is the only algorithm that can evaluate any join predicate, the nested-loops aggregation algorithm can support unusual aggregations where the input items are not divided into disjoint equivalence classes but where a single input item may contribute to multiple output items. While such aggregations are not supported in today's database systems, classifications that do not divide the input into equivalence classes can be useful in both commercial and scientific applications. If the number of classifications is small enough that all output items can be kept in memory, the performance of this algorithm is acceptable. However, for the more standard database aggregation problems, sort-based and hash-based duplicate removal and aggregation algorithms are more appropriate.

⁷ The possible improvements are (i) looping over pages or clusters rather than over records of input and output items (block nested loops), (ii) speeding the inner loop by an index (index nested loops), a method that has been used in some commercial relational systems, and (iii) bit vector filtering to determine without inner loop or index lookup that an item in the outer loop cannot possibly have a match in the inner loop. All three of these issues are discussed later in this survey as they apply to binary operations such as joins and intersection.
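For completeness, a naive nested-loops aggregation looks like the sketch below; the output list stands in for the temporary output file of the description above, and the names are illustrative only.

    def nested_loops_aggregation(items, group_key, aggregate):
        output = []                              # stands in for the temporary file
        for item in items:
            for i, out in enumerate(output):     # inner loop over the output so far
                if group_key(out) == group_key(item):
                    output[i] = aggregate(out, item)
                    break
            else:                                # no matching group: new output item
                output.append(item)
        return output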


4.2 Aggregation Algorithms Based on Sorting

Sorting will bring equal items together, and duplicate removal will then be easy. The cost of duplicate removal is dominated by the sort cost, and the cost of this naive duplicate removal algorithm based on sorting can be assumed to be that of the sort operation. For aggregation, items are sorted on their grouping attributes.

This simple method can be improved by detecting and removing duplicates as early as possible, easily implemented in the routines that write run files during sorting. With such "early" duplicate removal or aggregation, a run file can never contain more items than the final output (because otherwise it would contain duplicates!), which may speed up the final merges significantly [Bitton and DeWitt 1983].

As for any external sort operation, the optimizations discussed in the section on sorting, namely read-ahead using forecasting, merge optimizations, large cluster sizes, and reduced final fan-in for binary consumer operations, are fully applicable when sorting is used for aggregation and duplicate removal. However, to limit the complexity of the formulas, we derive I/O cost formulas without the effects of these optimizations.

The amount of I/O in sort-based aggregation is determined by the number of merge levels and the effect of early duplicate removal on each merge step. The total number of merge levels is unaffected by aggregation; in sorting with quicksort and without optimized merging, the number of merge levels is L = ⌈log_F(R/M)⌉ for input size R, memory size M, and fan-in F. In the first merge levels, the likelihood is negligible that items of the same group end up in the same run file, and we therefore assume that the sizes of run files are unaffected until their sizes would exceed the size of the final output. Runs on the first few merge levels are of size M × F^i for level i, and runs of the last levels have the same size as the final output. Assuming the output cardinality (number of items) is G times less than the input cardinality (G = R/O), where G is called the average group size or the reduction factor, only the last ⌈log_F(G)⌉ merge levels, including the final merge, are affected by early aggregation because in earlier levels more than G runs exist, and items from each group are distributed over all those runs, giving a negligible chance of early aggregation.

In the first merge levels, all input items participate, and the cost for these levels can be determined without explicitly calculating the size and number of run files on these levels. In the affected levels, the size of the output runs is constant, equal to the size of the final output O = R/G, while the number of run files decreases by a factor equal to the fan-in F in each level. The number of affected levels that create run files is L2 = ⌈log_F(G)⌉ − 1; the subtraction of 1 is necessary because the final merge does not create a run file but the output stream. The number of unaffected levels is L1 = L − L2. The number of input runs is W/F^i on level i (recall the number of initial runs W = R/M from the discussion of sorting). The total cost,⁸ including a factor of 2 for writing and reading, is

2 × R × L1 + 2 × O × Σ_{i=L1}^{L−1} W/F^i
    = 2 × R × L1 + 2 × O × W × (1/F^L1 − 1/F^L) / (1 − 1/F).

For example, consider aggregating R = 100 MB of input into O = 1 MB of output (i.e., reduction factor G = 100) using a system with M = 100 KB memory and fan-in F = 10. Since the input is W = 1,000 times the size of memory, L = 3 merge levels will be needed. The last L2 = log_F(G) − 1 = 1 merge level into temporary run files will permit early aggregation. Thus, the total I/O will be

⁸ Using Σ_{i=0}^{K} a^i = (1 − a^{K+1})/(1 − a) and Σ_{i=K}^{L} a^i = Σ_{i=0}^{L} a^i − Σ_{i=0}^{K−1} a^i = (a^K − a^{L+1})/(1 − a).


2 × 100 × 2 + 2 × 1 × 1000 × (1/10^2 − 1/10^3) / (1 − 1/10)
    = 400 + 2 × 1000 × 0.009/0.9
    = 420 MB,

which has to be divided by the cluster size used and multiplied by the time to read or write a cluster to estimate the I/O time for aggregation based on sorting. Naive separation of sorting and subsequent aggregation would have required reading and writing the entire input file three times, for a total of 600 MB of I/O. Thus, early aggregation realizes almost 30% savings in this case.
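The cost formula with early aggregation is easy to evaluate for other parameters; the sketch below reproduces the 420 MB figure (sizes in MB, names illustrative).

    import math

    def sort_agg_io_mb(R, M, F, G):
        O = R / G                              # output size
        W = R / M                              # number of initial runs
        L = math.ceil(math.log(W, F))          # total merge levels
        L2 = max(math.ceil(math.log(G, F)) - 1, 0)   # levels with early aggregation
        L1 = L - L2                            # unaffected levels
        return 2 * R * L1 + 2 * O * W * (1 / F**L1 - 1 / F**L) / (1 - 1 / F)

    print(sort_agg_io_mb(R=100, M=0.1, F=10, G=100))   # 420.0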

Aggregate queries may require that duplicates be removed from the input set to the aggregate functions,⁹ e.g., if the SQL distinct keyword is used. If such an aggregate function is to be executed using sorting, early aggregation can be used only for the duplicate removal part. However, the sort order used for duplicate removal can be suitable to permit the subsequent aggregation as a simple filter operation on the duplicate removal's output stream.

4.3 Aggregation Algorithms Based on Hashing

Hashing can also be used for aggregation by hashing on the grouping attributes. Items of the same group (or duplicate items in duplicate removal) can be found and aggregated when inserting them into the hash table. Since only output items, not input items, are kept in memory, hash table overflow occurs only if the output does not fit into memory. However, if overflow does occur, the partition files (all partitioning files in any one recursion level) will basically be as large as the entire input because once a partition is being written to disk, no further aggregation can occur until the partition files are read back into memory.

⁹ Consider two queries, both counting salaries per department. In order to determine the number of (salaried) employees per department, all salaries are counted without removing duplicate salary values. On the other hand, in order to assess salary differentiation in each department, one might want to determine the number of distinct salary levels in each department. For this query, only distinct salaries are counted, i.e., duplicate department-salary pairs must be removed prior to counting. (The text above refers to the latter type of query.)

The amount of I/O for hash-based aggregation depends on the number of partitioning (recursion) levels required before the output (not the input) of one partition fits into memory. This will be the case when partition files have been reduced to the size G × M. Since the partitioning files shrink by a factor of F at each level (presuming hash value skew is absent or effectively counteracted), the number of partitioning (recursion) levels is ⌈log_F(R/G/M)⌉ = ⌈log_F(O/M)⌉ for input size R, output size O, reduction factor G, and fan-out F. The costs at each level are proportional to the input file size R. The total I/O volume for hashing with overflow avoidance, including a factor of 2 for writing and reading, is

    2 × R × ⌈log_F(O/M)⌉.

The last partitioning level may use hybrid hashing, i.e., it may not involve I/O for the entire input file. In that case, L = ⌊log_F(O/M)⌋ complete recursion levels involving all input records are required, partitioning the input into files of size R′ = R/F^L. In each remaining hybrid hash aggregation, the size limit for overflow files is M × G because such an overflow file can be aggregated in memory. The number of partition files K must satisfy K × M × G + (M − K × C) × G ≥ R′, meaning K = ⌈(R′/G − M)/(M − C)⌉ partition files will be created. The total I/O cost for hybrid hash aggregation is

    2 × R × L + 2 × F^L × (R′ − (M − K × C) × G)
        = 2 × (R × (L + 1) − F^L × (M − K × C) × G)
        = 2 × (R × (L + 1) − F^L × (M − ⌈(R′/G − M)/(M − C)⌉ × C) × G).
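The corresponding sketch for hybrid hash aggregation simply evaluates the formula above. The function name and the integer loop computing ⌊log_F(O/M)⌋ are illustrative choices, and the model assumes no hash value skew and optimal assignment of buckets to partitions.

```python
import math

def hybrid_hash_agg_io(R, O, M, C, F):
    """I/O volume for hybrid hash aggregation: R input size, O output size,
    M memory, C cluster size, F partitioning fan-out (all sizes in one unit)."""
    if O <= M:
        return 0                            # output fits in memory: no overflow I/O
    G = R / O                               # reduction factor
    L = 0                                   # complete levels, floor(log_F(O / M))
    while F ** (L + 1) <= O / M:
        L += 1
    R1 = R / F ** L                         # partition file size after L levels
    K = math.ceil((R1 / G - M) / (M - C))   # overflow files per partition
    return 2 * (R * (L + 1) - F ** L * (M - K * C) * G)

# Parameters as in Figure 11, with reduction factor G = 100:
print(hybrid_hash_agg_io(R=100, O=1, M=0.1, C=0.008, F=10))   # about 200 (MB)
```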


Figure 11. Performance of sort- and hash-based aggregation. (I/O volume in MB plotted against group size or reduction factor, from 1 to 1,000, for sorting without early aggregation, sorting with early aggregation, hashing without hybrid hashing, and hashing with hybrid hashing.)

As for sorting, if an aggregate query requires duplicate removal for the input set to the aggregate function,¹⁰ the group size or reduction factor of the duplicate removal step determines the performance of hybrid hash duplicate removal. The subsequent aggregation can be performed after the duplicate removal as an additional operation within each hash bucket or as a simple filter operation on the duplicate removal's output stream.

4.4 A Rough Performance Comparison

It is interesting to note that the performance of both sort-based and hash-based aggregation is logarithmic and improves with increasing reduction factors. Figure 11 compares the performance of sort- and hash-based aggregation¹¹ using the formulas developed above for 100 MB input data, 100 KB memory, clusters of 8 KB, fan-in or fan-out of 10, and varying group sizes or reduction factors. The output size is the input size divided by the group size.

It is immediately obvious in Figure 11 that sorting without early aggregation is not competitive because it does not limit the sizes of run files, confirming the results of Bitton and DeWitt [1983]. The other algorithms all exhibit similar, though far from equal, performance improvements for larger reduction factors. Sorting with early aggregation improves once the reduction factor is large enough to affect not only the final but also previous merge steps. Hashing without hybrid hashing improves in steps as the number of partitioning levels can be reduced, with "step" points where G = F^i for some i. Hybrid hashing exploits all available memory to improve performance and generally outperforms overflow avoidance hashing. At points where overflow avoidance hashing shows a step, hybrid hashing has no effect, and the two hashing schemes have the same performance.

¹⁰ See footnote 9.
¹¹ Aggregation by nested-loops methods is omitted from Figure 11 because it is not competitive for large data sets.

While hash-based aggregation and duplicate removal seem superior in this rough analytical performance comparison, recall that the cost formula for sort-based aggregation does not include the effects of replacement selection or the merge optimizations discussed earlier in the section on sorting; therefore, Figure 11 shows an upper bound for the I/O cost of sort-based aggregation and duplicate removal. Furthermore, since the cost formula for hashing presumes optimal assignments of hash buckets to output partitions, the real costs of sort- and hash-based aggregation will be much more similar than they appear in Figure 11.


The important point is that both their costs are logarithmic with the input size, improve with the group size or reduction factor, and are quite similar overall.

4.5 Additional Remarks on Aggregation

Some applications require multilevel aggregation. For example, a report generation language might permit a request like "sum (employee.salary by employee.id by employee.department by employee.division)" to create a report with an entry for each employee and a sum for each department and each division. In fact, specifying such reports concisely was the driving design goal for the report generation language RPG. In SQL, this requires multiple cursors within an application program, one for each level of detail. This is very undesirable for two reasons. First, the application program performs what is essentially a join of three inputs. Such joins should be provided by the database system, not required to be performed within application programs. Second, the database system more likely than not executes the operations for these cursors independently from one another, resulting in three sort operations on the employee file instead of one.

If complex reporting applications are to be supported, the query language should support direct requests (perhaps similar to the syntax suggested above), and the sort operator should be implemented such that it can perform the entire operation in a single sort and one final pass over the sorted data. An analogous algorithm based on hashing can be defined; however, if the aggregated data are required in sort order, sort-based aggregation will be the algorithm of choice.

For some applications, exact aggregate functions are not required; reasonably close approximations will do. For example, exploratory (rather than final precise) data analysis is frequently very useful in "approaching" a new set of data [Tukey 1977]. In real-time systems, precision and response time may be reasonable tradeoffs. For database query optimization, approximate statistics are a sufficient basis for selectivity estimation, cost calculation, and comparison of alternative plans. For these applications, faster algorithms can be designed that rely either on a single sequential scan of the data (no run files, no overflow files) or on sampling [Astrahan et al. 1987; Hou and Ozsoyoglu 1991; 1993; Hou et al. 1991].

5. BINARY MATCHING OPERATIONS

While aggregation is essential to condense information, there are a number of database operations that combine information from two inputs, files, or sets and therefore are essential for database systems' ability to provide more than reliable shared storage and to perform inferences, albeit limited. A group of operators that all do basically the same task are called the one-to-one match operations here because an input item contributes to the output depending on its match with one other item. The most prominent among these operations is the relational join. Mishra and Eich [1992] have recently written a survey of join algorithms, which includes an interesting analysis and comparison of algorithms focusing on how data items from the two inputs are compared with one another. The other one-to-one match operations are left and right semi-joins; left, right, and symmetric outer-joins; left and right anti-semi-joins; symmetric anti-join; intersection; union; left and right differences; and symmetric or anti-difference.¹²

¹² The anti-semi-join of R and S is R ANTI-SEMIJOIN S = R − (R SEMIJOIN S), i.e., the items in R without matches in S. The (symmetric) anti-join contains those items from both inputs that do not have matches, suitably padded as in outer joins to make them union compatible. Formally, the (symmetric) anti-join of R and S is R ANTI-JOIN S = (R ANTI-SEMIJOIN S) ∪ (S ANTI-SEMIJOIN R) with the tuples of the two union arguments suitably extended with null values. The symmetric or anti-difference is the union of the two differences. Formally, the anti-difference of R and S is (R ∪ S) − (R ∩ S) = (R − S) ∪ (S − R) [Maier 1983]. Among these three operations, the anti-semi-join is probably the most useful one, as in the query to "find the courses that don't have any enrollment."


Figure 12. Binary one-to-one matching. (R and S are the two inputs; A denotes the items of R without a match, B the matching items, and C the items of S without a match.)

Output     Match on all attributes    Match on some attributes
A          Difference                 Anti-semi-join
B          Intersection               Join, semi-join
C          Difference                 Anti-semi-join
A, B       —                          Left outer join
A, C       Symmetric difference       Anti-join
B, C       —                          Right outer join
A, B, C    Union                      Symmetric outer join

Figure 12 shows the basic principle underlying all these operations, namely separation of the matching and nonmatching components of two sets, called R and S in the figure, and production of appropriate subsets, possibly after some transformation and combination of records as in the case of a join. If the sets R and S have different schemas as in relational joins, it might make sense to think of the set B as two sets B_R and B_S, i.e., the matching elements from R and S. This distinction permits a clearer definition of left semi-join and right semi-join, etc. Since all these operations require basically the same steps and can be implemented with the same algorithms, it is logical to implement them in one general and efficient module. For simplicity, only join algorithms are discussed here. Moreover, we discuss algorithms for only one join attribute since the algorithms for multi-attribute joins (and their performance) are not different from those for single-attribute joins.

Since set operations such as intersection and difference will be used and must be implemented efficiently for any data model, this discussion is relevant to relational, extensible, and object-oriented database systems alike. Furthermore, binary matching problems occur in some surprising places. Consider an object-oriented database system that uses a table to map logical object identifiers (OIDs) to physical locations (record identifiers or RIDs). Resolving a set of OIDs to RIDs can be regarded (as well as optimized and executed) as a semi-join of the mapping table and the set of OIDs, and all conventional join strategies can be employed. Another example that can occur in a database management system for any data model is the use of multiple indices in a query: the pointer (OID or RID) lists obtained from the indices must be intersected (for a conjunction) or united (for a disjunction) to obtain the list of pointers to items that satisfy the whole query. Moreover, the actual lookup of the items using the pointer list can be regarded as a semi-join of the underlying data set and the list, as in Kooi's [1980] thesis and the Ingres product [Kooi and Frankforth 1982] and a recent study by Shekita and Carey [1990]. Finally, many path expressions in object-oriented database systems such as "employee.department.manager.office.location" can frequently be interpreted, optimized, and executed as a sequence of one-to-one match operations using existing join and semi-join algorithms.


Thus, even if relational systems were completely abolished and replaced by object-oriented database systems, set matching and join techniques developed in the relational context would continue to be important for the performance of database systems.

Most of today's commercial database systems use only nested loops and merge-join because an analysis performed in connection with the System R project determined that of all the join methods considered, one of these two always provided either the best or very close to the best performance [Blasgen and Eswaran 1976; 1977]. However, the System R study did not consider hash join algorithms, which are now regarded as more efficient in many cases.

There continues to be a strong interest in join techniques, although the interest has shifted over the last 20 years from basic algorithmic concerns to parallel techniques and to techniques that adapt to unpredictable run-time situations such as data skew and changing resource availability. Unfortunately, many newly proposed techniques fail a very simple test (which we call the "Guy Lohman test for join techniques" after the first person who pointed this test out to us), making them problematic for nontrivial queries. The crucial test question is: Does this new technique apply to joining three inputs without interrupting data flow between the join operators? For example, a technique fails this test if it requires materializing the entire intermediate join result for random sampling of both join inputs or for obtaining exact knowledge about both join input sizes. Given its importance, this test should be applied to both proposed query optimization and query execution techniques.

For the I/O cost formulas given here, we assume that the left and right inputs have R and S pages, respectively, and that the memory size is M pages. We assume that the algorithms are implemented as iterators and omit the cost of reading stored inputs and writing an operation's output from the cost formulas because both inputs and output may be iterators, i.e., these intermediate results are never written to disk, and because these costs are equal for all algorithms.

5.1 Nested-Loops Join Algorithms

The simplest and, in some sense, most direct algorithm for binary matching is the nested-loops join: for each item in one input (called the outer input), scan the entire other input (called the inner input) and find matches. The main advantage of this algorithm is its simplicity. Another advantage is that it can compute a Cartesian product and any theta-join of two relations, i.e., a join with an arbitrary two-relation comparison predicate. However, Cartesian products are avoided by query optimizers because their outputs tend to contain many data items that will eventually not satisfy a query predicate verified later in the query evaluation plan.

Since the inner input is scanned repeatedly, it must be stored in a file, i.e., a temporary file if the inner input is produced by a complex subplan. This situation does not change the cost of nested loops; it just replaces the first read of the inner input with a write.

Except for very small inputs, the performance of nested-loops join is disastrous because the inner input is scanned very often, once for each item in the outer input. There are a number of improvements that can be made to this naive nested-loops join. First, for one-to-one match operations in which a single match carries all necessary information, e.g., semi-join and intersection, a scan of the inner input can be terminated after the first match for an item of the outer input. Second, instead of scanning the inner input once for each item from the outer input, the inner input can be scanned once for each page of the outer input, an algorithm called block nested-loops join [Kim 1980]. Third, the performance can be improved further by filling all of memory except K pages with pages of the outer input and by using the remaining K pages to scan the inner input and to save pages of the inner input in memory.


Finally, scans of the inner input can be made a little faster by scanning the inner input alternatingly forward and backward, thus reusing the last page of the previous scan and therefore saving one I/O per inner scan. The I/O cost for this version of nested-loops join is the product of the number of scans (determined by the size of the outer input) and the cost per scan of the inner input, plus K I/Os because the first inner scan has to scan or save the entire inner input. Thus, the total cost for scanning the inner input repeatedly is ⌈R/(M − K)⌉ × (S − K) + K. This expression is minimized if K = 1 and R ≥ S, i.e., the larger input should be the outer.
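The following minimal sketch shows the block nested-loops idea: buffer the outer input in memory-sized blocks and scan the stored inner input once per block. All names and the in-memory list standing in for the inner file are illustrative assumptions; buffer management, early termination for semi-joins, and the forward/backward scanning trick are omitted.

```python
def block_nested_loops_join(outer, inner, block_size, match):
    """Join two iterables of records with an arbitrary predicate `match`
    (so theta-joins work too), scanning the inner input once per outer block."""
    inner = list(inner)              # stands in for the stored or temporary inner file
    def join_block(block):
        for s in inner:              # one full scan of the inner input per block
            for r in block:
                if match(r, s):
                    yield (r, s)
    block = []
    for r in outer:
        block.append(r)
        if len(block) == block_size:
            yield from join_block(block)
            block = []
    if block:                        # last, partially filled block
        yield from join_block(block)
```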

If the critical performance measure is not the amount of data read in the repeated inner scans but the number of I/O operations, more than one page should be moved in each I/O, even if more memory has to be dedicated to the inner input and less to the outer input, thus increasing the number of passes over the inner input. If C pages are moved in each I/O on the inner input and M − C pages for the outer input, the number of I/Os is ⌈R/(M − C)⌉ × (S/C) + 1, which is minimized if C = M/2. In other words, in order to minimize the number of large-chunk I/O operations, the cluster size should be chosen as half the available memory size [Hagmann 1986].

Finally, index nested-loops join exploits a permanent or temporary index on the inner input's join attribute to replace file scans by index lookups. In principle, each scan of the inner input in naive nested-loops join is used to find matches, i.e., to provide associativity. Not surprisingly, since all index structures are designed and used for the purpose of associativity, any index structure supporting the join predicate (such as =, <, etc.) can be used for index nested-loops join. The fastest indices for exact match queries are hash indices, but any index structure can be used, ordered or unordered (hash), single- or multi-attribute, single- or multidimensional. Therefore, indices on frequently used join attributes (keys and foreign keys in relational systems) may be useful. Index nested-loops join is also used sometimes with indices built on the fly, i.e., indices built on intermediate query processing results.

A recent investigation by DeWitt et al. [1993] demonstrated that index nested-loops join can be the fastest join method if one of the inputs is so small and the other indexed input is so large that the number of index and data page retrievals, i.e., about the product of the index depth and the cardinality of the smaller input, is smaller than the number of pages in the larger input.

Another interesting idea using two ordered indices, e.g., a B-tree on each of the two join columns, is to switch roles of inner and outer join inputs after each index lookup, which leads to the name "zig-zag join." For example, for a join predicate R.a = S.a, a scan in the index on R.a finds the lower join attribute value in R, which is then looked up in the index on S.a. A continuing scan in the index on S.a yields the next possible join attribute value, which is looked up in the index on R.a, etc. It is not immediately clear under which circumstances this join method is most efficient.
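A rough sketch of the zig-zag idea follows, using two sorted key lists with binary search standing in for B-tree scans and lookups. The function name is made up, duplicate join attribute values are not handled, and a real implementation would of course operate on index pages rather than Python lists.

```python
from bisect import bisect_left

def zigzag_join_keys(r_keys, s_keys):
    """Return join attribute values present in both sorted, duplicate-free lists,
    switching the roles of 'outer' and 'inner' after each lookup."""
    i = j = 0
    matches = []
    while i < len(r_keys) and j < len(s_keys):
        j = bisect_left(s_keys, r_keys[i], j)        # probe the S index with the R key
        if j < len(s_keys) and s_keys[j] == r_keys[i]:
            matches.append(r_keys[i])
            i, j = i + 1, j + 1
        elif j < len(s_keys):
            i = bisect_left(r_keys, s_keys[j], i)    # switch roles: probe R with the S key
    return matches
```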

For complex queries, N-ary joins are sometimes written as a single module, i.e., a module that performs index lookups into indices of multiple relations and joins all relations simultaneously. However, it is not clear how such a multi-input join implementation is superior to multiple index nested-loops joins.

5.2 Merge-Join Algorithms

The second commonly used join method is the merge-join. It requires that both inputs are sorted on the join attribute. Merging the two inputs is similar to the merge process used in sorting. An important difference, however, is that one of the two merging scans (the one which is advanced on equality, usually called the inner input) must be backed up when both inputs contain duplicates of a join attribute value and when the specific one-to-one match operation requires that all matches be found, not just one match. Thus, the control logic for merge-join variants for join and semi-join are slightly different.


Some systems include the notion of a "value packet," meaning all items with equal join attribute values [Kooi 1980; Kooi and Frankforth 1982]. An iterator's next call returns a value packet, not an individual item, which makes the control logic for merge-join much easier. If (or after) both inputs have been sorted, the merge-join algorithm typically does not require any I/O, except when "value packets" are larger than memory. (See footnote 1.)
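A minimal sketch of merge-join over inputs already sorted on the join key follows, consuming both sides as value packets (groups of equal keys) so that the inner scan never has to be backed up explicitly; the names and the single-column key function are illustrative assumptions.

```python
from itertools import groupby

def merge_join(left, right, key=lambda rec: rec[0]):
    """Merge-join two iterables sorted on `key`, yielding matching record pairs."""
    lgroups, rgroups = groupby(left, key), groupby(right, key)
    lk, lpacket = next(lgroups, (None, None))
    rk, rpacket = next(rgroups, (None, None))
    while lpacket is not None and rpacket is not None:
        if lk < rk:
            lk, lpacket = next(lgroups, (None, None))
        elif lk > rk:
            rk, rpacket = next(rgroups, (None, None))
        else:
            rrows = list(rpacket)            # materialize the right value packet
            for l in lpacket:
                for r in rrows:
                    yield (l, r)
            lk, lpacket = next(lgroups, (None, None))
            rk, rpacket = next(rgroups, (None, None))
```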

An input may be sorted because a stored database file was sorted, an ordered index was used, an input was sorted explicitly, or the input came from an operation that produced sorted output, e.g., another merge-join. The last point makes merge-join an efficient algorithm if items from multiple sources are matched on the same join attribute(s) in multiple binary steps because sorting intermediate results is not required for later merge-joins, which led to the concept of interesting orderings in the System R query optimizer [Selinger et al. 1979]. Since set operations such as intersection and union can be evaluated using any sort order, as long as the same sort order is present in both inputs, the effect of interesting orderings for one-to-one match operators based on merge-join can always be exploited for set operations.

A combination of nested-loops join and merge-join is the heap-filter merge-join [Graefe 1991]. It first sorts the smaller inner input by the join attribute and saves it in a temporary file. Next, it uses all available memory to create sorted runs from the larger outer input using replacement selection. As discussed in the section on sorting, there will be about W = R/(2 × M) + 1 such runs for outer input size R. These runs are not written to disk; instead, they are joined immediately with the sorted inner input using merge-join. Thus, the number of scans of the inner input is reduced to about one half when compared to block nested loops. On the other hand, when compared to merge-join, it saves writing and reading temporary files for the larger outer input.

Another derivation of merge-join is the hybrid join used in IBM's DB2 product [Cheng et al. 1991], combining elements from index nested-loops join, merge-join, and techniques joining sorted lists of index leaf entries. After sorting the outer input on its join attribute, hybrid join uses a merge algorithm to "join" the outer input with the leaf entries of a preexisting B-tree index on the join attribute of the inner input. The result file contains entire tuples from the outer input and record identifiers (RIDs, physical addresses) for tuples of the inner input. This file is then sorted on the physical locations, and the tuples of the inner relation can then be retrieved from disk very efficiently. This algorithm is not entirely new as it is a special combination of techniques explored by Blasgen and Eswaran [1976; 1977], Kooi [1980], and Whang et al. [1984; 1985]. Blasgen and Eswaran considered the manipulation of RID lists but concluded that either merge-join or nested-loops join is the optimal choice in almost all cases; based on this study, only these two algorithms were implemented in System R [Astrahan et al. 1976] and subsequent relational database systems. Kooi's optimizer treated an index similarly to a base relation and the lookup of data records from index entries as a join; this naturally permitted joining two indices or an index with a base relation as in hybrid join.

5.3 Hash Join Algorithms

Hash join algorithms are based on the idea of building an in-memory hash table on one input (the smaller one, frequently called the build input) and then probing this hash table using items from the other input (frequently called the probe input). These algorithms have only recently found greater interest [Bratbergsengen 1984; DeWitt et al. 1984; DeWitt and Gerber 1985; DeWitt et al. 1986; Fushimi et al. 1986; Kitsuregawa et al. 1983; 1989a; Nakayama et al. 1988; Omiecinski 1991; Schneider and DeWitt 1989; Shapiro 1986; Zeller and Gray 1990]. One reason is that they work very fast, i.e., without any temporary files, if the build input does indeed fit into memory, independently of the size of the probe input.


However, they require overflow avoidance or resolution methods for larger build inputs, and suitable methods were developed and experimentally verified only in the mid-1980's, most notably in connection with the Grace and Gamma database machine projects [DeWitt et al. 1986; 1990; Fushimi et al. 1986; Kitsuregawa et al. 1983].

In hash-based join methods, build and probe inputs are partitioned using the same partitioning function, e.g., the join key value modulo the number of partitions. The final join result can be formed by concatenating the join results of pairs of partitioning files. Figure 13 shows the effect of partitioning the two inputs of a binary operation such as join into hash buckets and partitions. (This figure was adapted from a similar diagram by Kitsuregawa et al. [1983]. Mishra and Eich [1992] recently adapted and generalized it in their survey and comparison of relational join algorithms.) Without partitioning, each item in the first input must be compared with each item in the second input; this would be represented by complete shading of the entire diagram. With partitioning, items are grouped into partition files, and only pairs in the series of small rectangles (representing the partitions) must be compared.

Figure 13. Effect of partitioning for join operations.

Figure 14. Recursive partitioning in binary operations.

If a build partition file is still larger than memory, recursive partitioning is required. Recursive partitioning is used for both build- and probe-partitioning files using the same hash and partitioning functions. Figure 14 shows how both input files are partitioned together. The partial results obtained from pairs of partition files are concatenated to form the result of the entire match operation. Recursive partitioning stops when the build partition fits into memory. Thus, the recursion depth of partitioning for binary match operators depends only on the size of the build input (which therefore should be chosen to be the smaller input) and is independent of the size of the probe input. Compared to sort-based binary matching operators, i.e., variants of merge-join in which the number of merge levels is determined for each input file individually, hash-based binary matching operators are particularly effective when the input sizes are very different [Bratbergsengen 1984; Graefe et al. 1993].
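The sketch below illustrates this recursive partitioning for a hash join in the Grace style: if the build input fits the stated memory budget, build and probe an in-memory hash table; otherwise partition both inputs with the same (level-dependent) hash function and recurse. All names, the record-count memory limit, and the fallback depth are illustrative assumptions; hybrid hashing and skew handling are omitted.

```python
def grace_hash_join(build, probe, key, memory_limit, fanout=8, level=0):
    """Yield (build, probe) record pairs whose join keys are equal."""
    build = list(build)
    if len(build) <= memory_limit or level > 10:   # fall back rather than recurse forever
        table = {}
        for b in build:                            # build an in-memory hash table
            table.setdefault(key(b), []).append(b)
        for p in probe:                            # probe it with the other input
            for b in table.get(key(p), []):
                yield (b, p)
        return
    def split(records):                            # same partitioning function for both inputs
        parts = [[] for _ in range(fanout)]
        for rec in records:
            parts[hash((key(rec), level)) % fanout].append(rec)
        return parts
    for bpart, ppart in zip(split(build), split(probe)):
        yield from grace_hash_join(bpart, ppart, key, memory_limit, fanout, level + 1)
```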

The I/O cost for binary hybrid hash operations can be determined by the number of complete levels (i.e., levels without hash table) and the fraction of the input remaining in memory in the deepest recursion level. For memory size M, cluster size C, partitioning fan-out F = ⌊M/C⌋ − 1, build input size R, and probe input size S, the number of complete levels is L = ⌊log_F(R/M)⌋, after which the build input partitions should be of size R′ = R/F^L. The I/O cost for the binary operation is the cost of partitioning the build input divided by the size of the build input and multiplied by the sum of the input sizes.


Adapting the cost formula for unary hashing discussed earlier, the total amount of I/O for a recursive binary hash operation is

    2 × (R × (L + 1) − F^L × (M − ⌈(R′ − M)/(M − C)⌉ × C)) / R × (R + S),

which can be approximated with 2 × log_F(R/M) × (R + S). In other words, the cost of binary hash operations on large inputs is logarithmic; the main difference to the cost of merge-join is that the recursion depth (the logarithm) depends only on one file, the build input, and is not taken for each file individually.

As for all operations based on partitioning, partitioning (hash) value skew is the main danger to effectiveness. When using statistics on hash value distributions to determine which buckets should stay in memory in hybrid hash algorithms, the goal is to avoid as much I/O as possible with the least memory "investment." Thus, it is most effective to retain those buckets in memory with few build items but many probe items or, more formally, the buckets with the smallest value for r_i/(r_i + s_i), where r_i and s_i indicate the total size of a bucket's build and probe items [Graefe 1993b].
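A small sketch of that memory-assignment heuristic: given per-bucket statistics, keep resident the buckets with the smallest r_i/(r_i + s_i) until the memory budget for in-memory hash-table buckets is exhausted. The names and the (bucket, build size, probe size) tuple layout are assumptions for illustration.

```python
def buckets_to_keep(stats, memory):
    """stats: iterable of (bucket_id, build_size, probe_size); memory: budget
    for resident build data. Returns the bucket ids to keep in memory."""
    kept, used = [], 0
    for bucket_id, r_i, s_i in sorted(stats, key=lambda t: t[1] / (t[1] + t[2])):
        if used + r_i <= memory:       # cheapest memory 'investment' per probe item first
            kept.append(bucket_id)
            used += r_i
    return kept
```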

5.4 Pointer-Based Joins

Recently, links between data items have found renewed interest, be it in object-oriented systems in the form of object identifiers (OIDs) or as access paths for faster execution of relational joins. In a sense, links represent a limited form of precomputed results, somewhat similar to indices and join indices, and have the usual cost-versus-benefit tradeoff between query performance enhancement and maintenance effort. Kooi [1980] modeled the retrieval of actual records after index searches as "TID joins" (tuple identifiers permitting direct record access) in his query optimizer for Ingres; together with standard join commutativity and associativity rules, this model permitted exploring joins of indices of different relations (joining lists of key-TID pairs) or joins of one relation with another relation's index. In the Genesis data model and database system, Batory et al. [1988a; 1988b] modeled joins in a functional way, borrowing from research into the database languages FQL [Buneman et al. 1982; Buneman and Frankel 1979], DAPLEX [Shipman 1981], and Gem [Tsur and Zaniolo 1984; Zaniolo 1983] and permitting pointer-based join implementations in addition to traditional value-based implementations such as nested-loops join, merge-join, and hybrid hash join.

Shekita and Carey [1990] recently analyzed three pointer-based join methods based on nested-loops join, merge-join, and hybrid hash join. Presuming relations R and S, with a pointer to an S tuple embedded in each R tuple, the nested-loops join algorithm simply scans through R and retrieves the appropriate S tuple for each R tuple. This algorithm is very reminiscent of unclustered index scans and performs similarly poorly for larger set sizes. Their conclusion on naive pointer-based join algorithms is that "it is unwise for object-oriented database systems to support only pointer-based join algorithms."

The merge-join variant starts with sorting R on the pointers (i.e., according to the disk addresses they point to) and then retrieves all S items in one elevator pass over the disk, reading each S page at most once. Again, this idea was suggested before for unclustered index scans, and variants similar to heap-filter merge-join [Graefe 1991] and complex object assembly using a window and priority heap of open references [Keller et al. 1991] can be designed.
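A sketch of this pointer-based merge-join variant: sort R on the embedded pointers and fetch S in a single pass so each S page is read at most once. The pointer layout (page number, slot) and the `read_s_page` callback are hypothetical stand-ins for the storage layer.

```python
def pointer_merge_join(r_tuples, read_s_page, pointer=lambda r: r["s_rid"]):
    """Yield (r, s) pairs; `pointer(r)` returns (page_no, slot) of the S tuple."""
    r_sorted = sorted(r_tuples, key=pointer)    # sort R on the disk addresses it points to
    last_page_no, page = None, None
    for r in r_sorted:
        page_no, slot = pointer(r)
        if page_no != last_page_no:             # at most one read per distinct S page
            page = read_s_page(page_no)
            last_page_no = page_no
        yield (r, page[slot])
```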

The hybrid hash join variant partitions only relation R on pointer values, ensuring that R tuples with S pointers to the same page are brought together, and then retrieves S pages and tuples. Notice that the two relations' roles are fixed by the direction of the pointers, whereas for standard hybrid hash join the smaller relation should be the build input.


Figure 15. Performance of alternative join methods. (I/O count in thousands plotted against the size of R, with S = 10 × R, for the pointer join from S to R, nested loops, merge-join with two sorts, hashing not using hybrid hashing, and the pointer join from R to S.)

Differently than standard hybrid hash join, relation S is not partitioned. This algorithm performs somewhat faster than pointer-based merge-join if it keeps some partitions of R in memory and if sorting writes all R tuples into runs before merging them.

Pointer-based join algorithms tend to outperform their standard value-based counterparts in many situations, in particular if only a small fraction of S actually participates in the join and can be selected effectively using the pointers in R. Historically, due to the difficulty of correctly maintaining pointers (nonessential links), they were rejected as a relational access method in System R [Chamberlin et al. 1981a] and subsequently in basically all other systems, perhaps with the exception of Kooi's modified Ingres [Kooi 1980; Kooi and Frankforth 1982]. However, they were reevaluated and implemented in the Starburst project, both as a test of Starburst's extensibility and as a means of supporting "more object-oriented" modes of operation [Haas et al. 1990].

5.5 A Rough Performance Comparison

Figure 15 shows an approximate performance comparison using the cost formulas developed above for block nested-loops join; merge-join with sorting both inputs without optimized merging; hash join without hybrid hashing, bucket tuning, or dynamic destaging; and pointer joins with pointers from R to S and from S to R without grouping pointers to the same target page together. This comparison is not precise; its sole purpose is to give a rough idea of the relative performance of the algorithm groups, deliberately ignoring the many tricks used to improve and fine-tune the basic algorithms. The relation sizes vary; S is always 10 times larger than R. The memory size is 100 KB; the cluster size is 8 KB; merge fan-in and partitioning fan-out are 10; and the number of R-records per cluster is 20.

It is immediately obvious in Figure 15 that nested-loops join is unsuitable for medium-size and large relations, because the cost of nested-loops join is proportional to the size of the Cartesian product of the two inputs. Both merge-join (sorting) and hash join have logarithmic cost functions; the sudden rise in merge-join and hash join cost around R = 1000 is due to the fact that additional partitioning or merging levels become necessary at that point. The sort-based merge-join is not quite as fast as hash join because the merge levels are determined individually for each file, including the bigger S file, while only the smaller build relation R determines the partitioning depth of hash join.


Pointer joins are competitive with hash and merge-joins due to their linear cost function, but only when the pointers are embedded in the smaller relation R. When S-records point to R-records, the cost of the pointer join is even higher than for nested-loops join.

The important point of Figure 15 is to illustrate that pointer joins can be very efficient or very inefficient, that one-to-one match algorithms based on nested-loops join are not competitive for medium-size and large inputs, and that sort- and hash-based algorithms for one-to-one match operations both have logarithmic cost growth. Of course, this comparison is quite naive since it uses only the simplest form of each algorithm. Thus, a comparison among alternative algorithms in a query optimizer must use the precise cost function for the available algorithm variant.

6. UNIVERSAL QUANTIFICATION¹³

Universal quantification permits queries such as "find the students who have taken all database courses"; the difference to one-to-one match operations is that a student qualifies because his or her transcript matches an entire set of courses, not only one item as in an existentially quantified query (e.g., "find students who have taken a (at least one) database course") that can be executed using a semi-join. In the past, universal quantification has been largely ignored for four reasons. First, typical database applications, e.g., record-keeping and accounting applications, rarely require universal quantification. Second, it can be circumvented using a complex expression involving a Cartesian product. Third, it can be circumvented using complex aggregation expressions. Fourth, there seemed to be a lack of efficient algorithms.

The first reason will not remain true for database systems supporting logic programming, rules, and quantifiers, and algorithms for universal quantification will become more important. The second reason is valid; however, the substitute expressions are very slow to execute because of the Cartesian product. The third reason is also valid, but replacing a universal quantifier may require very complex aggregation clauses that are easy to "get wrong" for the database user. Furthermore, they might be too complex for the optimizer to recognize as universal quantification and to execute with a direct algorithm. The fourth reason is not valid; universal quantification algorithms can be very efficient (in fact, as fast as semi-join, the operator for existential quantification), useful for very large inputs, and easy to parallelize. In the remainder of this section, we discuss sort- and hash-based direct and indirect (aggregation-based) algorithms for universal quantification.

¹³ This section is a summary of earlier work [Graefe 1989; Graefe and Cole 1993].

In the relational world, universal quantification is expressed with the universal quantifier in relational calculus and with the division operator in relational algebra. We will explain algorithms for universal quantification using relational terminology. The running example in this section uses the relations Student (student-id, name, major), Course (course-no, title), Transcript (student-id, course-no, grade), and Requirement (major, course-no) with the obvious key attributes. The query to find the students who have taken all courses can be expressed in relational algebra as

    π_{student-id, course-no} Transcript ÷ π_{course-no} Course.

The projection of the Transcript relation is called the dividend, the projection of the Course relation the divisor, and the result relation the quotient. The quotient attributes are those attributes of the dividend that do not appear in the divisor. The dividend relation semi-joined with the divisor relation and projected on the quotient attributes, in the example the set of student-ids of Students who have taken at least one course, is called the set of quotient candidates here.


Some universal quantification queries seem to require relational division but actually do not. Consider the query for the students who have taken all courses required for their major. This query can be answered with a sequence of one-to-one match operations. A join of Student and Requirement projected on the student-id and course-no attributes minus the Transcript relation can be projected on student-ids to obtain a set of students who have not taken all their requirements. An anti-semi-join of the Student relation with this set finds the students who have satisfied all their requirements. This sequence will have acceptable performance because its required set-matching algorithms (join, difference, anti-semi-join) all belong to the family of one-to-one match operations, for which efficient algorithms are available as discussed in the previous section.

Division algorithms differ not only in their performance but also in how they fit into complex queries. Prior to the division, selections on the dividend, e.g., only Transcript entries with "A" grades, or on the divisor, e.g., only the database courses, may be required. Restrictions on the dividend can easily be enforced without much effect on the division operation, while restrictions on the divisor can imply a significant difference for the query evaluation plan. Subsequent to the division operation, the resulting quotient relation (e.g., a set of student-ids) may be joined with another relation, e.g., the Student relation to obtain student names. Thus, obtaining the quotient in a form suitable for further processing (e.g., join or semi-join with a third relation) can be advantageous.

Typically, universal quantification can easily be replaced by aggregations. (Intuitively, all universal quantification can be replaced by aggregation. However, we have not found a proof for this statement.) For example, the example query about database courses can be restated as "find the students who have taken as many database courses as there are database courses." When specifying the aggregate function, it is important to count only database courses both in the dividend (the Transcript relation) and in the divisor (the Course relation). Counting only database courses might be easy for the divisor relation, but requires a semi-join of the dividend with the divisor relation to propagate the restriction on the divisor to the dividend if it is not known a priori whether or not referential integrity holds between the dividend's divisor attributes and the divisor, i.e., whether or not there are divisor attribute values in the dividend that cannot be found in the divisor. For example, course-nos in the Transcript relation that do not pertain to database courses (and are therefore not in the divisor) must be removed from the dividend by a semi-join with the divisor. In general, if the divisor is the result of a prior selection, any referential integrity constraints known for stored relations will not hold and must be explicitly enforced using a semi-join. Furthermore, in order to ensure correct counting, duplicates have to be removed from either input if the inputs are projections on nonkey attributes.
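As a sketch of this aggregation-based approach, the following illustrative function restricts the dividend to divisor values (the semi-join), removes duplicates, counts matches per quotient candidate, and keeps the candidates whose count equals the divisor's cardinality; the (quotient value, divisor value) pair representation is an assumption made for the example.

```python
def division_by_aggregation(dividend, divisor):
    """dividend: iterable of (quotient_value, divisor_value) pairs,
    e.g., (student_id, course_no); divisor: iterable of divisor values."""
    divisor = set(divisor)                      # also removes divisor duplicates
    counts = {}
    for quot, div in set(dividend):             # duplicate removal on the dividend
        if div in divisor:                      # semi-join with the divisor
            counts[quot] = counts.get(quot, 0) + 1
    return {quot for quot, n in counts.items() if n == len(divisor)}
```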

There are four methods to compute the quotient of two relations, a sort- and a hash-based direct method, and sort- and hash-based aggregation. Table 6 shows this classification of relational division algorithms. Methods for sort- and hash-based aggregation and the possible sort- or hash-based semi-join have already been discussed, including their variants for inputs larger than memory and their cost functions. Therefore, we focus here on the direct division algorithms.

The sort-based direct method, proposed by Smith and Chang [1975] and called naive division here, sorts the divisor input on all its attributes and the dividend relation with the quotient attributes as major and the divisor attributes as minor sort keys. It then proceeds with a merging scan of the two sorted inputs to determine which items belong in the quotient.


Table 6. Classification of Relational Division Algorithms

                              Based on Sorting                         Based on Hashing
Direct                        Naive division                           Hash-division
Indirect by semi-join         Sorting with duplicate removal,          Hash-based duplicate removal,
  and aggregation             merge-join, sorting with aggregation     hybrid hash join, hash-based aggregation

Notice that the scan can be programmed such that it ignores duplicates in either input (in case those had not been removed yet in the sort) as well as dividend items that do not refer to items in the divisor. Thus, neither a preceding semi-join nor explicit duplicate removal steps are necessary for naive division. The I/O cost of naive division is the cost of sorting the two inputs plus the cost of repeated scans of the divisor input.

Figure 16 shows two tables, a dividend and a divisor, properly sorted for naive division. Concurrent scans of the "Jack" tuples (only one) in the dividend and of the entire divisor determine that "Jack" is not part of the quotient because he has not taken the "Readings in Databases" course. A continuing scan through the "Jill" tuples in the dividend and a new scan of the entire divisor include "Jill" in the output of the naive division. The fact that "Jill" has also taken an "Intro to Graphics" course is ignored by a suitably general scan logic for naive division.

Dividend                            Divisor
Student    Course                   Course
Jack       Intro to Databases       Intro to Databases
Jill       Intro to Databases       Readings in Databases
Jill       Intro to Graphics
Jill       Readings in Databases

Figure 16. Sorted inputs into naive division.
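A minimal sketch of naive division over the sorted inputs of Figure 16 follows; it assumes the dividend is a list of (quotient value, divisor value) pairs sorted on both columns, that the divisor list is sorted and duplicate free, and that dividend duplicates or values missing from the divisor may simply be skipped, as described above.

```python
def naive_division(dividend, divisor):
    """Merging scan of a sorted dividend against a sorted, duplicate-free divisor."""
    quotient, i = [], 0
    while i < len(dividend):
        candidate = dividend[i][0]
        j = 0                                    # fresh scan of the divisor per candidate
        while i < len(dividend) and dividend[i][0] == candidate:
            if j < len(divisor) and dividend[i][1] == divisor[j]:
                j += 1                           # matched the next divisor value
            i += 1                               # duplicates and extra values are skipped
        if j == len(divisor):                    # all divisor values were matched
            quotient.append(candidate)
    return quotient

# Figure 16 example: Jack is excluded, Jill qualifies.
dividend = [("Jack", "Intro to Databases"),
            ("Jill", "Intro to Databases"),
            ("Jill", "Intro to Graphics"),
            ("Jill", "Readings in Databases")]
divisor = ["Intro to Databases", "Readings in Databases"]
print(naive_division(dividend, divisor))         # ['Jill']
```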

The hash-based direct method, called hash-division, uses two hash tables, one for the divisor and one for the quotient candidates. While building the divisor table, a unique sequence number is assigned to each divisor item. After the divisor table has been built, the dividend is consumed. For each quotient candidate, a bit map is kept with one bit for each divisor item. The bit map is indexed with the sequence numbers assigned to the divisor items. If a dividend item does not match with an item in the divisor table, it can be ignored immediately. Otherwise, a quotient candidate is either found or created, and the bit corresponding to the matching divisor item is set. When the entire dividend has been consumed, the quotient consists of those quotient candidates for which all bits are set.

This algorithm can ignore duplicates in the divisor (using hash-based duplicate removal during insertion into the divisor table) and automatically ignores duplicates in the dividend as well as dividend items that do not refer to items in the divisor (e.g., the AI course in the example). Thus, neither prior semi-join nor duplicate removal are required. However, if both inputs are known to be duplicate free, the bit maps can be replaced by counters. Furthermore, if referential integrity is known to hold, the divisor table can be omitted and replaced by a single counter. Hash-division, including these variants, has been implemented in the Volcano query execution engine and has shown better performance than the other three algorithms [Graefe 1989; Graefe and Cole 1993]. In fact, the performance of hash-division is almost equal to a hash-based join or semi-join of dividend and divisor relations (a semi-join corresponds to existential quantification), making universal quantification and relational division realistic operations and algorithms to use in database applications.
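A compact sketch of hash-division, with the bit maps held as Python integers; dictionary and variable names are illustrative, and the overflow-handling variants (divisor and quotient partitioning, counters under referential integrity) described above are omitted.

```python
def hash_division(dividend, divisor):
    """dividend: iterable of (quotient_value, divisor_value) pairs;
    divisor: iterable of divisor values. Returns the quotient values."""
    divisor_table = {}                          # divisor value -> sequence number
    for d in divisor:                           # duplicate divisor items keep one number
        divisor_table.setdefault(d, len(divisor_table))
    all_bits = (1 << len(divisor_table)) - 1
    candidates = {}                             # quotient value -> bit map (as an int)
    for quot, div in dividend:
        seq = divisor_table.get(div)
        if seq is None:
            continue                            # dividend item with no divisor match
        candidates[quot] = candidates.get(quot, 0) | (1 << seq)
    return [q for q, bits in candidates.items() if bits == all_bits]
```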

The aspect of hash-division that makes it an efficient algorithm is that the set of matches between a quotient candidate and the divisor is represented efficiently using a bit map.


Bit maps are one of the standard data structures to represent sets, and just as bit maps can be used for a number of set operations, the bit maps associated with each quotient candidate can also be used for a number of operations similar to relational division. For example, Carlis [1986] proposed a generalized division operator called "HAS" that included relational division as a special case. The hash-division algorithm can easily be extended to compute quotient candidates in the dividend that match a majority or given fraction of divisor items as well as (with one more bit in each bit map) quotient candidates that do or do not match exactly the divisor items.

For real queries containing a division, consider the operation that frequently follows a division. In the example, a user is typically not really interested in student-ids only but in information about the students. Thus, in many cases, relational division results will be used to select items from another relation using a semi-join. The sort-based algorithms produce their output sorted, which will facilitate a subsequent (semi-) merge-join. The hash-based algorithms produce their output in hash order; if overflow occurred, there is no predictable order at all. However, both aggregation-based and direct hash-based algorithms use a hash table on the quotient attributes, which may be used immediately for a subsequent (semi-) join. It seems quite straightforward to use the same hash table for the aggregation and a subsequent join as well as to modify hash-division such that it removes quotient candidates from the quotient table that do not belong to the final quotient and then performs a semi-join with a third input relation.

If the two hash tables do not fit into memory, the divisor table or the quotient table or both can be partitioned, and individual partitions can be held on disk for processing in multiple steps. In divisor partitioning, the final result consists of those items that are found in all partial results; the final result is the intersection of all partial results. For example, if the Course relations in the example above are partitioned into undergraduate and graduate courses, the final result consists of the students who have taken all undergraduate courses and all graduate courses, i.e., those that can be found in the division result of each partition. In quotient partitioning, the entire divisor must be kept in memory for all partitions. The final result is the concatenation (union) of all partial results. For example, if Transcript items are partitioned by odd and even student-ids, the final result is the union (concatenation) of all students with odd student-id who have taken all courses and those with even student-id who have taken all courses. If warranted by the input data, divisor partitioning and quotient partitioning can be combined.

Hash-division can be modified into an algorithm for duplicate removal. Consider the problem of removing duplicates from a relation R(X, Y) where X and Y are suitably chosen attribute groups. This relation can be stored using two hash tables, one storing all values of X (similar to the divisor table) and assigning each of them a unique sequence number, the other storing all values of Y and bit maps that indicate which X values have occurred with each Y value. Consider a brief example for this algorithm: Say relation R(X, Y) contains 1 million tuples, but only 100,000 tuples if duplicates were removed. Let X and Y be each 100 bytes long (total record size 200), and assume there are 4,000 unique values of each X and Y. For the standard hash-based duplicate removal algorithm, 100,000 × 200 bytes of memory are needed for duplicate removal without use of temporary files. For the redesigned hash-division algorithm, 2 × 4,000 × 100 bytes are needed for data values, 4,000 × 4 for unique sequence numbers, and 4,000 × 4,000 bits for bit maps. Thus, the new algorithm works efficiently with less than 3 MB of memory while conventional duplicate removal requires slightly more than 19 MB of memory, or seven times more than the duplicate removal algorithm adapted from hash-division.
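The arithmetic of this example is easy to check; the snippet below just restates it (byte counts as in the text, with the bit maps rounded to whole bytes).

```python
unique_x = unique_y = 4_000
distinct_records = 100_000
record_bytes = 200                               # 100 bytes for X plus 100 bytes for Y

standard = distinct_records * record_bytes       # ordinary hash-based duplicate removal
adapted = (2 * unique_x * 100                    # X and Y data values
           + unique_x * 4                        # unique sequence numbers
           + unique_x * unique_y // 8)           # one 4,000-bit map per Y value
print(standard, adapted)                         # 20000000 vs 2816000 bytes (~19 MB vs <3 MB)
```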


Clearly, choosing attribute groups X and Y to find attribute groups with relatively few unique values is crucial for the performance and memory efficiency of this new algorithm. Since such knowledge is not available in most systems and queries (even though some efficient and helpful algorithms exist, e.g., Astrahan et al. [1987]), optimizer heuristics for choosing this algorithm might be difficult to design and verify.

To summarize the discussion on universal quantification algorithms, aggregation can be used in systems that lack direct division algorithms, and hash-division performs universal quantification and relational division generally, i.e., it covers cases with duplicates in the inputs and with referential integrity violations, and efficiently, i.e., it permits partitioning and using hybrid hashing techniques similar to hybrid hash join, making universal quantification (division) as fast as existential quantification (semi-join). As will be discussed later, it can also be effectively parallelized.

7. DUALITY OF SORT- AND HASH-BASED QUERY PROCESSING ALGORITHMS¹⁴

We conclude the discussion of individual query processing by outlining the many existing similarities and dualities of sort- and hash-based query-processing algorithms as well as the points where the two types of algorithms differ. The purpose is to contribute to a better understanding of the two approaches and their tradeoffs. We try to discuss the approaches in general terms, ignoring whether the algorithms are used for relational join, union, intersection, aggregation, duplicate removal, or other operations. Where appropriate, however, we indicate specific operations.

Table 7 gives an overview of the features that correspond to one another.

¹⁴ Parts of this section have been derived from Graefe et al. [1993], which also provides experimental evidence for the relative performance of sort- and hash-based query processing algorithms and discusses simple cases of transferring tuning ideas from one type of algorithm to the other. The discussion of this section is continued in Graefe [1993a; 1993c].

Both approaches permit in-memory versions for small data sets and disk-based versions for larger data sets. If a data set fits into memory, quicksort is the sort-based method to manage data sets while classic (in-memory) hashing can be used as a hashing technique. It is interesting to note that both quicksort and classic hashing are also used in memory to operate on subsets after "cutting" an entire large data set into pieces. The cutting process is part of the divide-and-conquer paradigm employed for both sort- and hash-based query-processing algorithms. This important similarity of sorting and hashing has been observed before, e.g., by Bratbergsengen [1984] and Salzberg [1988]. There exists, however, an important difference. In the sort-based algorithms, a large data set is divided into subsets using a physical rule, namely into chunks as large as memory. These chunks are later combined using a logical step, merging. In the hash-based algorithms, large inputs are cut into subsets using a logical rule, by hash values. The resulting partitions are later combined using a physical step, i.e., by simply concatenating the subsets or result subsets. In other words, a single-level merge in a sort algorithm is a dual to partitioning in hash algorithms. Figure 17 illustrates this duality and the opposite directions.

This duality can also be observed in the behavior of a disk arm performing the I/O operations for merging or partitioning. While writing initial runs after sorting them with quicksort, the I/O is sequential. During merging, read operations access the many files being merged and require random I/O capabilities. During partitioning, the I/O operations are random, but when reading a partition later on, they are sequential.

For both approaches, sorting and hashing, the amount of available memory limits not only the amount of data in a basic unit processed using quicksort or classic hashing, but also the number of basic units that can be accessed simultaneously. For sorting, it is well known that merging is limited to the quotient of memory size and buffer space required for each run, called the merge fan-in.

ACM Computing Surveys, Vol 25, No. 2, June 1993

Page 44: Query evaluation techniques for large databasesweb.eecs.umich.edu/~jag/eecs584/papers/queryproc_graefe.pdf · efficient algorithms and software archi-tectures of database query execution


Table 7. Duality of Sort- and Hash-Based Algorithms

Aspect | Sorting | Hashing
In-memory algorithm | Quicksort | Classic (in-memory) hash
Divide-and-conquer paradigm | Physical division, logical combination | Logical division, physical combination
Large inputs | Single-level merge | Partitioning
I/O patterns | Sequential write, random read | Random write, sequential read
Temporary files accessed simultaneously | Fan-in | Fan-out
I/O optimizations | Read-ahead, forecasting; double-buffering, striping merge output | Write-behind; double-buffering, striping partitioning input
Very large inputs | Multilevel merge; merge levels; nonoptimal final fan-in | Recursive partitioning; recursion depth; nonoptimal hash table size
Optimizations | Merge optimizations | Bucket tuning
Better use of memory | Reverse runs & LRU; replacement selection | Hybrid hashing; single input in memory
Aggregation and duplicate removal | Aggregation in replacement selection | Aggregation in hash table
Algorithm phases | Run generation, intermediate and final merge | Initial and intermediate partitioning, in-memory (hybrid) hashing
Resource sharing | Eager merging; lazy merging | Depth-first partitioning; breadth-first partitioning
Partitioning skew and effectiveness | Merging run files of different sizes | Uneven output file sizes
"Item value" | log (run size) | log (build partition size / original build input size)
Bit vector filtering | For both inputs and on each merge level? | For both inputs and on each recursion level
Interesting orderings, multiple joins | Multiple merge-joins without sorting intermediate results | N-ary partitioning and joins
Interesting orderings: grouping/aggregation followed by join | Sorted grouping on foreign key useful for subsequent join | Grouping while building the hash table in hash join
Interesting orderings in index structures | B-trees feeding into a merge-join | Merging in hash value order

Figure 17. Duality of partitioning and merging.

Similarly, partitioning is limited to the same fraction, called the fan-out since the limitation is encountered while writing partition files.

In order to keep the merge process active at all times, many merge implementations use read-ahead controlled by forecasting, trading reduced I/O delays for a reduced fan-in. In the ideal case, the bandwidths of I/O and processing (merging) match, and I/O latencies for both the merge input and output are hidden by read-ahead and double-buffering, as mentioned earlier in the section on sorting. The dual to read-ahead during merging is write-behind during partitioning, i.e., keeping a free output buffer that can be allocated to an output file while the previous page for that file is being written to disk. There is no dual to forecasting because it is trivial that the next




output partition to write to is the one for which an output cluster has just filled up. Both read-ahead in merging and write-behind in partitioning are used to ensure that the processor never has to wait for the completion of an I/O operation. Another dual is double-buffering and striping over multiple disks for the output of sorting and the input of partitioning.

Considering the limitation on fan-in and fan-out, additional techniques must be used for very large inputs. Merging can be performed in multiple levels, each combining multiple runs into larger ones. Similarly, partitioning can be repeated recursively, i.e., partition files are repartitioned, the results repartitioned, etc., until the partition files fit into main memory. In sorting and merging, the runs grow in each level by a factor equal to the fan-in. In partitioning, the partition files decrease in size by a factor equal to the fan-out in each recursion level. Thus, the number of levels during merging is equal to the recursion depth during partitioning. There are two exceptions to be made regarding hash value distributions and relative sizes of inputs in binary operations such as join; we ignore those for now and will come back to them later.
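As a hedged illustration of the correspondence just described, the number of merge levels and the partitioning recursion depth can both be estimated with a logarithm; the page counts and fan-in/fan-out arguments below are hypothetical stand-ins for real cost-model inputs.

```python
import math

def merge_levels(input_pages, memory_pages, fan_in):
    # Initial runs are roughly one memory-load each (larger with
    # replacement selection); every merge level multiplies run size by
    # the fan-in, so the level count is logarithmic in the run count.
    initial_runs = math.ceil(input_pages / memory_pages)
    return max(0, math.ceil(math.log(initial_runs, fan_in)))

def recursion_depth(input_pages, memory_pages, fan_out):
    # Dually, each partitioning level divides partition files by the
    # fan-out until they fit into memory.
    return max(0, math.ceil(math.log(input_pages / memory_pages, fan_out)))
```

With equal fan-in and fan-out and the same input and memory sizes, the two functions return the same number, which is the symmetry claimed above.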

If merging is done in the most naive way, i.e., merging all runs of a level as soon as their number reaches the fan-in, the last merge on each level might not be optimal. Similarly, if the highest possible fan-out is used in each partitioning step, the partition files in the deepest recursion level might be smaller than memory, and less than the entire memory is used when processing these files. Thus, in both approaches the memory resources are not used optimally in the most naive versions of the algorithms.

In order to make best use of the final merge (which, by definition, includes all output items and is therefore the most expensive merge), it should proceed with the maximal possible fan-in. Making best use of the final merge can be ensured by merging fewer runs than the maximal fan-in after the end of the input file has been reached (as discussed in the earlier section on sorting). There is no direct dual in hash-based algorithms for this optimization. With respect to memory utilization, the fact that a partition file and therefore a hash table might actually be smaller than memory is the closest to a dual. Utilizing memory more effectively and using less than the maximal fan-out in hashing has been addressed in research on bucket tuning [Kitsuregawa et al. 1989a] and on histogram-driven recursive hybrid hash join [Graefe 1993a].

The development of hybrid hash algorithms [DeWitt et al. 1984; Shapiro 1986] was a consequence of the advent of large main memories that had led to the consideration of hash-based join algorithms in the first place. If the data set is only slightly larger than the available memory, e.g., 10% larger or twice as large, much of the input can remain in memory and is never written to a disk-resident partition file. To obtain the same effect for sort-based algorithms, if the database system's buffer manager is sufficiently smart or receives and accepts appropriate hints, it is possible to retain some or all of the pages of the last run written in memory and thus achieve the same effect of saving I/O operations. This effect can be used particularly easily if the initial runs are written in reverse (descending) order and scanned backward for merging. However, if one does not believe in buffer hints or prefers to absolutely ensure these I/O savings, then using a final memory-resident run explicitly in the sort algorithm and merging it with the disk-resident runs can guarantee this effect.

Another well-known technique to use memory more effectively and to improve sort performance is to generate runs twice as large as main memory using a priority heap for replacement selection [Knuth 1973], as discussed in the earlier section on sorting. If the runs' sizes are doubled, their number is cut in half. Therefore, merging can be reduced by some amount, namely log_F(2) = 1/log_2(F) merge levels. This optimization for sorting has no direct dual in the




realm of hash-based query-processing algorithms.
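The following sketch of run generation by replacement selection (our simplified rendering, with a record count standing in for memory size) shows how a priority heap lets records that still fit the current run be emitted immediately, which is what makes runs average about twice the memory size on randomly ordered input.

```python
import heapq

def replacement_selection(records, memory_capacity):
    # Sketch: returns a list of sorted runs; keys are assumed comparable.
    it = iter(records)
    heap = []
    for r in it:
        heapq.heappush(heap, (0, r))            # (run number, key)
        if len(heap) == memory_capacity:
            break
    runs, current_run, current_no = [], [], 0
    while heap:
        run_no, r = heapq.heappop(heap)
        if run_no != current_no:                # only next-run items remain
            runs.append(current_run)
            current_run, current_no = [], run_no
        current_run.append(r)
        nxt = next(it, None)
        if nxt is not None:
            # A record smaller than the key just written cannot join the
            # current run anymore; tag it for the following run.
            tag = current_no if nxt >= r else current_no + 1
            heapq.heappush(heap, (tag, nxt))
    if current_run:
        runs.append(current_run)
    return runs
```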

If two sort operations produce input data for a binary operator such as a merge-join and if both sort operators' final merges are interleaved with the join, each final merge can employ only half the memory. In hash-based one-to-one match algorithms, only one of the two inputs resides in and consumes memory beyond a single input buffer, not both as in two final merges interleaved with a merge-join. This difference in the use of the two inputs is a distinct advantage of hash-based one-to-one match algorithms that does not have a dual in sort-based algorithms.

Interestingly, these two differences of sort- and hash-based one-to-one match algorithms cancel each other out. Cutting the number of runs in half (on each merge level, including the last one) by using replacement selection for run generation exactly offsets this disadvantage of sort-based one-to-one match operations.

Run generation using replacement selection has a second advantage over quicksort; this advantage has a direct dual in hashing. If a hash table is used to compute an aggregate function using grouping, e.g., sum of salaries by department, hash table overflow occurs only if the operation's output does not fit in memory. Consider, for example, the sum of salaries by department for 100,000 employees in 1,000 departments. If the 1,000 result records fit in memory, classic hashing (without overflow) is sufficient. On the other hand, if sorting based on quicksort is used to compute this aggregate function, the input must fit into memory to avoid temporary files.15 If replacement selection is used for run generation, however, the same behavior as with classic hashing is easy to achieve.

15 A scheme using quicksort and avoiding temporary I/O in this case can be devised but would be extremely cumbersome; we do not know of any report or system with such a scheme.
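A minimal sketch of the hash-based alternative for the salary example, assuming input rows of the form (department, salary): only the roughly 1,000 group totals must fit in memory, regardless of how many input rows stream through.

```python
def hash_aggregate_sum(rows):
    # Classic in-memory hash aggregation: one hash-table entry per group.
    totals = {}
    for dept, salary in rows:
        totals[dept] = totals.get(dept, 0) + salary
    return totals

# Example: 100,000 (department, salary) rows collapse to 1,000 totals.
# print(hash_aggregate_sum([("toys", 100), ("books", 80), ("toys", 120)]))
```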

If an iterator interface is used for both its input and output, and therefore multiple operators overlap in time, a sort operator can be divided into three distinct algorithm phases. First, input items are consumed and sorted into initial runs. Second, intermediate merging reduces the number of runs such that only one final merge step is left. Third, the final merge is performed on demand from the consumer of the sorted data stream. During the first phase, the sort iterator has to share resources, most notably memory and disk bandwidth, with its producer operators in a query evaluation plan. Similarly, the third phase must share resources with the consumers.

In many sort implementations, namely those using eager merging, the first and second phase interleave as a merge step is initiated whenever the number of runs on one level becomes equal to the fan-in. Thus, some intermediate merge steps cannot use all resources. In lazy merging, which starts intermediate merges only after all initial runs have been created, the intermediate merges do not share resources with other operators and can use the entire memory allocated to a query evaluation plan; thus, intermediate merges can be more effective in lazy merging than in eager merging.
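The three phases and the lazy-merging policy can be sketched in iterator form as follows; the class and parameter names are ours, not taken from any particular system, and the final merge is deliberately deferred to the first next() call so that phase three overlaps with the consumer, as described above.

```python
import heapq

class SortIterator:
    def __init__(self, input_iterator, memory_capacity, fan_in):
        self.input, self.memory, self.fan_in = input_iterator, memory_capacity, fan_in
        self.runs, self.final_merge = [], None

    def open(self):
        # Phase 1: consume input and create initial runs
        # (shares resources with producer operators).
        buf = []
        for item in self.input:
            buf.append(item)
            if len(buf) == self.memory:
                self.runs.append(sorted(buf))
                buf = []
        if buf:
            self.runs.append(sorted(buf))
        # Phase 2: lazy intermediate merging, started only after all
        # initial runs exist, until one final merge step remains.
        while len(self.runs) > self.fan_in:
            merged = list(heapq.merge(*self.runs[:self.fan_in]))
            self.runs = self.runs[self.fan_in:] + [merged]

    def next(self):
        # Phase 3: final merge on demand (shares resources with consumers).
        if self.final_merge is None:
            self.final_merge = heapq.merge(*self.runs)
        return next(self.final_merge, None)
```

An eager variant would trigger the merge inside the input loop as soon as fan_in runs exist, which is exactly the interleaving of phases one and two criticized above.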

Hash-based query processing algorithms exhibit three similar phases. First, the first partitioning step executes concurrently with the input operator or operators. Second, intermediate partitioning steps divide the partition files to ensure that they can be processed with hybrid hashing. Third, hybrid and in-memory hash methods process these partition files and produce output passed to the consumer operators. As in sorting, the first and third phases must share resources with other concurrent operations in the same query evaluation plan.

The standard implementation of hash-based query processing algorithms for very large inputs uses recursion, i.e., the original algorithm is invoked for each partition file (or pair of partition files). While conceptually simple, this method has the disadvantage that output is




produced before all intermediate partitioning steps are complete. Thus, the operators that consume the output must allocate resources to receive this output, typically memory (e.g., a hash table). Further intermediate partitioning steps will have to share resources with the consumer operators, making them less effective. We call this direct recursive implementation of hash-based partitioning depth-first partitioning and consider its behavior as well as its resource sharing and performance effects a dual to eager merging in sorting. The alternative schedule is breadth-first partitioning, which completes each level of partitioning before starting the next one. Thus, hybrid and in-memory hashing are not initiated until all partition files have become small enough to permit hybrid and in-memory hashing, and intermediate partitioning steps never have to share resources with consumer operators. Breadth-first partitioning is a dual to lazy merging, and it is not surprising that they are both equally more effective than depth-first partitioning and eager merging, respectively.
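A small sketch of the two partitioning schedules, with in-memory lists standing in for partition files and a hypothetical consume callback standing in for the consumer operator; it assumes a reasonably well-behaved hash function (skew is discussed next).

```python
def partition(records, fan_out, level):
    # One partitioning step; the level is mixed into the hash so that
    # recursion makes progress on the same keys.
    buckets = [[] for _ in range(fan_out)]
    for r in records:
        buckets[hash((level, r)) % fan_out].append(r)
    return buckets

def depth_first(records, fan_out, memory, level=0, consume=print):
    # Produces output as soon as the first small partition is reached,
    # so later partitioning steps share memory with the consumer.
    if len(records) <= memory:
        consume(records)
        return
    for b in partition(records, fan_out, level):
        depth_first(b, fan_out, memory, level + 1, consume)

def breadth_first(records, fan_out, memory, consume=print):
    # Completes each partitioning level before starting the next; the
    # consumer sees no output until every partition fits in memory.
    pending, level = [records], 0
    while any(len(p) > memory for p in pending):
        nxt = []
        for p in pending:
            nxt.extend(partition(p, fan_out, level) if len(p) > memory else [p])
        pending, level = nxt, level + 1
    for p in pending:
        consume(p)
```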

It is well known that partitioning skew reduces the effectiveness of hash-based algorithms. Thus, the situation shown in Figure 18 is undesirable. In the extreme case, one of the partition files is as large as the input, and an entire partitioning step has been wasted. It is less well recognized that the same issue also pertains to sort-based query processing algorithms [Graefe 1993c]. Unfortunately, in order to reduce the number of merge steps, it is often necessary to merge files from different merge levels and therefore of different sizes. In other words, the goals of optimized merging and of maximal merge effectiveness do not always match, and very sophisticated merge plans, e.g., polyphase merging, might be required [Knuth 1973].

The same effect can also be observed if "values" are attached to items in runs and in partition files. Values should reflect the work already performed on an item. Thus, the value should increase with run sizes in sorting while the value

Figure 18. Partitioning skew.

must increase as partition files get smaller in hash-based query processing algorithms. For sorting, a suitable choice for such a value is the logarithm of the run size [Graefe 1993c]. The value of a sorted run is the product of the run's size and the size's logarithm. The optimal merge effectiveness is achieved if each item's value increases in each merge step by the logarithm of the fan-in, and the overall value of all items increases with this logarithm multiplied with the data volume participating in the merge step. However, only if all runs in a merge step are of the same size will the value of all items increase with the logarithm of the fan-in.

In hash-based query processing, the corresponding value is the fraction of a partition size relative to the original input size [Graefe 1993c]. Since only the build input determines the number of recursion levels in binary hash partitioning, we consider only the build partition. If the partitioning is skewed, i.e., output partition files are not of uniform length, the overall effectiveness of the partitioning step is not optimal, i.e., equal to the logarithm of the partitioning fan-out. Thus, preventing or managing skew in partitioning hash functions is very important [Graefe 1993a].

Bit vector filtering, which will be discussed later in more detail, can be used for both sort- and hash-based one-to-one match operations, although it has been used mainly for parallel joins to date. Basically, a bit vector filter is a large array of bits initialized by hashing items in the first input of a one-to-one match operator and used to detect items in the second input that cannot possibly have a




match in the first input. In effect, bit vector filtering reduces the second input to the items that truly participate in the binary operation plus some "false passes" due to hash collisions in the bit vector filter. In a merge-join with two sort operations, if the bit vector filter is used before the second sort, bit vector filtering is as effective as in hybrid hash join in reducing the cost of processing the second input. In merge-join, it can also be used symmetrically as shown in Figure 19. Notice that for the right input, bit vector filtering reduces the sort input size, whereas for the left input, it only reduces the merge-join input. In recursive hybrid hash join, bit vector filtering can be used in each recursion level. The effectiveness of bit vector filtering increases in deeper recursion levels, because the number of distinct data values in each partition file decreases, thus reducing the number of hash collisions and false passes if bit vector filters of the same size are used in each recursion level. Moreover, it can be used in both directions, i.e., to reduce the second input using a bit vector filter based on the first input and to reduce the first input (in the next recursion level) using a bit vector filter based on the second input. The same effect could be achieved for sort-based binary operations requiring multilevel sorting and merging, although to do so implies switching back and forth between the two sorts for the two inputs after each merge level. Not surprisingly, switching back and forth after each merge level would be the dual to the partitioning process of both inputs in recursive hybrid hash join. However, sort operators that switch back and forth on each merge level are not only complex to implement but may also inhibit the merge optimization discussed earlier.
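The following sketch shows a simple single-hash bit vector filter of the kind described here (real systems may size the vector differently or use several hash functions); the filter is built from the first input's join keys and used to drop second-input rows that cannot possibly match, at the cost of some false passes.

```python
class BitVectorFilter:
    # Sketch of a one-hash bit vector filter; the default size is an
    # arbitrary illustrative choice.
    def __init__(self, num_bits=1 << 20):
        self.num_bits = num_bits
        self.bits = bytearray(num_bits // 8)

    def add(self, key):
        h = hash(key) % self.num_bits
        self.bits[h // 8] |= 1 << (h % 8)

    def might_contain(self, key):
        h = hash(key) % self.num_bits
        return bool(self.bits[h // 8] & (1 << (h % 8)))

def filter_probe_input(build_keys, probe_rows, key_of):
    bvf = BitVectorFilter()
    for k in build_keys:
        bvf.add(k)
    # Rows that fail the test cannot join; rows that pass may still be
    # "false passes" caused by hash collisions in the bit vector.
    return [row for row in probe_rows if bvf.might_contain(key_of(row))]
```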

The final entries in Table 7 concern interesting orderings used in the System R query optimizer [Selinger et al. 1979] and presumably other query optimizers as well. A strong argument in favor of sorting and merge-join is the fact that merge-join delivers its output in sorted order; thus, multiple merge-joins on the

Figure 19. Merge-join with symmetric bit vector filtering. [The plan scans Input 1, builds Bit Vector 1, sorts, and probes Bit Vector 2 before the merge-join; it scans Input 2, probes Bit Vector 1, builds Bit Vector 2, and sorts before the merge-join.]

same attribute can be performed without sorting intermediate join results. For joining three relations, as shown in Figure 20, pipelining data from one merge-join to the next without sorting translates into a 3:4 advantage in the number of sort operations compared to two joins on different join keys, because the intermediate result O1 does not need to be sorted. For joining N relations on the same key, only N sorts are required instead of 2 × N − 2 for joins on different attributes. Since set operations such as the union or intersection of N sets can always be performed using a merge-join algorithm without sorting intermediate results, the effect of interesting orderings is even more important for set operations than for relational joins.

Hash-based algorithms tend to produce their outputs in a very unpredictable order, depending on the hash function and on overflow management. In order to take advantage of multiple joins on the same attribute (or of multiple intersections, etc.) similar to the advantage derived from interesting orderings in sort-based query processing, the equality of attributes has to be exploited during the logical step of hashing, i.e., during partitioning. In other words, such set operations and join queries can be executed effectively by a hash join algorithm that recursively partitions N inputs concurrently. The recursion terminates when N − 1 inputs fit into memory and when the Nth input is used to probe N − 1




Figure 20. The effect of interesting orderings. [Joins on different attributes require sorting the intermediate result O1 on b, i.e., four sorts in total; two merge-joins on the same attribute a need only three sorts of the base inputs I1, I2, and I3.]

Figure 21. Partitioning in a multi-input hash join.

hash tables. Thus, the basic operation of this N-ary join (intersection, etc.) is an N-ary join of an N-tuple of partition files, not pairs as in binary hash join with one build and one probe file for each partition. Figure 21 illustrates recursive partitioning for a join of three inputs. Instead of partitioning and joining a pair of inputs and pairs of partition files as in traditional binary hybrid hash join, there are file triples (or N-tuples) at each step.

However, N-ary recursive partitioning is cumbersome to implement, in particular if some of the "join" operations are actually semi-join, outer join, set intersection, union, or difference. Therefore, until a clean implementation method for hash-based N-ary matching has been found, it might well be that this distinction, joins on the same or on different attributes, contributes to the right choice between sort- and hash-based algorithms for complex queries.

Another situation with interesting orderings is an aggregation followed by a join. Many aggregations condense information about individual entities; thus, the aggregation operation is performed on a relation representing the "many" side of a many-to-one relationship or on the relation that represents relationship instances of a many-to-many relationship. For example, students' grade point averages are computed by grouping and averaging transcript entries in a many-to-many relationship called transcript between students and courses. The important point to note here and in many similar situations is that the grouping attribute is a foreign key. In order to relate the aggregation output with other information pertaining to the entities about which information was condensed, aggregations are frequently followed by a join. If the grouping operation is based on sorting (on the grouping attribute, which very frequently is a foreign key), the natural sort order of the aggregation output can be exploited for an efficient merge-join without sorting.

While this seems to be an advantage of sort-based aggregation and join, this combination of operations also permits a special trick in hash-based query processing [Graefe 1993b]. Hash-based aggregation is based on identifying items of the same group while building the hash table. At the end of the operation, the hash table contains all output items hashed on the grouping attribute. If the grouping attribute is the join attribute in the next operation, this hash table can immediately be probed with the other




join input. Thus, the combined aggregation-join operation uses only one hash table, not two hash tables as two separate operations would do. The differences to two separate operations are that only one join input can be aggregated efficiently and that the aggregated input must be the join's build input. Both issues could be addressed by symmetric hash joins with a hash table on each of the inputs, which would be as efficient as sorting and grouping both join inputs.
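For the transcript example above, a hedged sketch of the combined aggregation-join: the grouping hash table, keyed on the foreign key student_id, doubles as the build side of the subsequent join; the row formats are assumptions made only for illustration.

```python
def aggregate_then_join(transcript_rows, student_rows):
    # Group and average transcript entries by student_id, then probe the
    # same hash table with the student rows; one hash table serves both
    # the aggregation and the join.
    table = {}                        # student_id -> (sum of grades, count)
    for student_id, grade in transcript_rows:
        s, c = table.get(student_id, (0.0, 0))
        table[student_id] = (s + grade, c + 1)
    for student_id, name in student_rows:
        if student_id in table:
            s, c = table[student_id]
            yield name, s / c         # the student's grade point average
```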

A third use of interesting orderings is the positive interaction of (sorted, B-tree) index scans and merge-join. While it has not been reported explicitly in the literature, the leaves and entries of two hash indices can be merge-joined just like those of two B-trees, provided the same hash function was used to create the indices. For example, it is easy to imagine "merging" the leaves (data pages) of two extendible hash indices [Fagin et al. 1979], even if the key cardinalities and distributions are very different.

In summary, there exist many dualities between sorting using multilevel merging and recursive hash table overflow management. Two special cases exist which favor one or the other, however. First, if two join inputs are of different size (and the query optimizer can reliably predict this difference), hybrid hash join outperforms merge-join because only the smaller of the two inputs determines what fraction of the input files has to be written to temporary disk files during partitioning (or how often each record has to be written to disk during recursive partitioning), while each file determines its own disk I/O in sorting [Bratbergsengen 1984]. For example, sorting the larger of two join inputs using multiple merge levels is more expensive than writing a small fraction of that file to hash overflow files. This performance advantage of hashing grows with the relative size difference of the two inputs, not with their absolute sizes or with the memory size.

Second, if the hash function is very poor, e.g., because of a prior selection on the join attribute or a correlated attribute, hash partitioning can perform very poorly and create significantly higher costs than sorting and merge-join. If the quality of the hash function cannot be predicted or improved (tuned) dynamically [Graefe 1993a], sort-based query-processing algorithms are superior because they are less vulnerable to nonuniform data distributions. Since both cases, join of differently sized files and skewed hash value distributions, are realistic situations in database query processing, we recommend that both sort- and hash-based algorithms be included in a query-processing engine and chosen by the query optimizer according to the two cases above. If both cases arise simultaneously, i.e., a join of differently sized inputs with unpredictable hash value distribution, the query optimizer has to estimate which one poses the greater danger to system performance and choose accordingly.

The important conclusion from these dualities is that neither the absolute input sizes nor the absolute memory size nor the input sizes relative to the memory size determine the choice between sort- and hash-based query-processing algorithms. Instead, the choice should be governed by the sizes of the two inputs into binary operators relative to each other and by the danger of performance impairments due to skewed data or hash value distributions. Furthermore, because neither algorithm type outperforms the other in all situations, both should be available in a query execution engine for a choice to be made in each case by the query optimizer.

8. EXECUTION OF COMPLEX QUERY PLANS

When multiple operators such as aggregations and joins execute concurrently in a pipelined execution engine, physical resources such as memory and disk bandwidth must be shared by all operators. Thus, optimal scheduling of multiple operators and the division and allocation of resources in a complex plan are important issues.

In earlier relational execution engines, these issues were largely ignored for two




reasons. First, only left-deep trees were used for query execution, i.e., the right (inner) input of a binary operator had to be a scan. In other words, concurrent execution of multiple subplans in a single query was not possible. Second, under the assumption that sorting was needed at each step and considering that sorting for nontrivial file sizes requires that the entire input be written to temporary files at least once, concurrency and the need for resource allocation were basically absent. Today's query execution engines consider more join algorithms that permit extensive pipelining, e.g., hybrid hash join, and more complex query plans, including bushy trees. Moreover, today's systems support more concurrent users and use parallel-processing capabilities. Thus, resource allocation for complex queries is of increasing importance for database query processing.

Some researchers have considered resource contention among multiple query processing operators with the focus on buffer management. The goal in these efforts was to assign disk pages to buffer slots such that the benefit of each buffer slot would be maximized, i.e., the number of I/O operations avoided in the future. Sacco and Schkolnick [1982; 1986] analyzed several database algorithms and found that their cost functions exhibit steps when plotted over available buffer space, and they suggested that buffer space should be allocated at the low end of a step for the least buffer use at a given cost. Chou [1985] and Chou and DeWitt [1985] took this idea further by combining it with separate page replacement algorithms for each relation or scan, following observations by Stonebraker [1981] on operating system support for database systems, and with load control, calling the resulting algorithm DBMIN. Faloutsos et al. [1991] and Ng et al. [1991] generalized this goal and used the classic economic concepts of decreasing marginal gain and balanced marginal gains for maximal overall gain. Their measure of gain was the reduction in the number of page faults. Zeller and Gray [1990] designed a hash join algorithm that adapts to the current memory and buffer contention each time a new hash table is built. Most recently, Brown et al. [1992] have considered resource allocation tradeoffs among short transactions and complex queries.

Schneider [1990] and Schneider and DeWitt [1990] were the first to systematically examine execution schedules and costs for right-deep trees, i.e., query evaluation plans with multiple binary hash joins for which all build phases proceed concurrently or at least could proceed concurrently (notice that in a left-deep plan, each build phase receives its data from the probe phase of the previous join, limiting left-deep plans to two concurrent joins in different phases). Among the most interesting findings are that through effective use of bit vector filtering (discussed later), memory requirements for right-deep plans might actually be comparable to those of left-deep plans [Schneider 1991]. This work has recently been extended by Chen et al. [1992] to bushy plans interpreted and executed as multiple right-deep subplans.

For binary matching iterators to be used in bushy plans, we have identified several concerns. First, some query-processing algorithms include a point at which all data are in temporary files on disk and at which no intermediate result data reside in memory. Such "stop" points can be used to switch efficiently between different subplans. For example, if two subplans produce and sort two merge-join inputs, stopping work on the first subplan and switching to the second one should be done when the first sort operator has all its data in sorted runs and when only the final merge is left but no output has been produced yet. Figure 22 illustrates this point in time. Fortunately, this timing can be realized naturally in the iterator implementation of sorting if input runs for the final merge are opened in the first call of the next procedure, not at the end of the open phase. A similar stop point is available in hash join when using overflow avoidance.

Second, since hybrid hashing produces some output data before the memory contents (output buffers and hash table) can




Figure 22. The stop point during sorting.

be discarded and since, therefore, such a stop point does not occur in hybrid hash join, implementations of hybrid hash join and other binary match operations should be parameterized to permit overflow avoidance as a run time option to be chosen by the query optimizer. This dynamic choice will permit the query optimizer to force a stop point in some operators while using hybrid hash in most operations.

Third, binary-operator implementations should include a switch that controls which subplan is initiated first. In Table 1 with algorithm outlines for iterators' open, next, and close procedures, the hash join open procedure executes the entire build-input plan first before opening the probe input. However, there might be situations in which it would be better to open the probe input before executing the build input. If the probe input does not hold any resources such as memory between open and next calls, initiating the probe input first is not a problem. However, there are situations in which it creates a big benefit, in particular in bushy query evaluation plans and in parallel systems to be discussed later.

Fourth, if multiple operators are active concurrently, memory has to be divided among them. If two sorts produce input data for a merge-join, which in turn passes its output into another sort using quicksort, memory should be divided proportionally to the sizes of the three files involved. We believe that for multiple sorts producing data for multiple merge-joins on the same attribute, proportional memory division will also work best. If a sort in its run generation phase shares resources with other operations, e.g., a sort following two sorts in their final merges and a merge-join, it should also use resources proportional to its input size. For example, if two merge-join inputs are of the same size and if the merge-join output which is sorted immediately following the merge-join is as large as the two inputs together, the two final merges should each use one quarter of memory while the run generation (quicksort) should use one half of memory.
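The proportional division described here is simple arithmetic; the small sketch below (a hypothetical helper, with sizes in arbitrary units) reproduces the quarter/quarter/half split of the example.

```python
def memory_shares(total_memory, input_sizes):
    # Divide memory among concurrently active operators in proportion to
    # the data volume each must process.
    total = sum(input_sizes)
    return [total_memory * s / total for s in input_sizes]

# Two merge-join inputs of equal size plus a subsequent sort whose input
# is as large as both together:
print(memory_shares(1.0, [1, 1, 2]))   # [0.25, 0.25, 0.5]
```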

Fifth, in recursive hybrid hash join, the recursion levels should be executed level by level. In the most straightforward recursive algorithm, recursive invocation of the original algorithm for each output partition results in depth-first partitioning, and the algorithm produces output as soon as the first leaf in the recursion tree is reached. However, if the operator that consumes the output requires memory as soon as it receives input, for example, hybrid hash join (ii) in Figure 23 as soon as hybrid hash join (i) produces output, the remaining partitioning operations in the producer operator (hybrid hash join (i)) must share memory with the consumer operator (hybrid hash join (ii)), effectively cutting the partitioning fan-out in the producer in half. Thus, hash-based recursive matching algorithms should proceed in three distinct phases—consuming input and initial partitioning, partitioning into files suitable for hybrid hash join, and final hybrid hash join for all partitions—with phase two completed entirely before phase three commences. This sequence of partitioning steps was introduced as breadth-first partitioning in the previous section as opposed to depth-first partitioning used in the most straightforward recursive algorithms. Of course, the topmost operator in a query evaluation plan does not have a consumer operator with which it shares resources; therefore, this operator should use depth-first partitioning in order to provide a better response time, i.e., earlier delivery of the first data item.




Figure 23. Plan for joining three inputs.

Sixth, the allocation of resources other than memory, e.g., disk bandwidth and disk arms for seeking in partitioning and merging, is an open issue that should be addressed soon, because the different improvement rates in CPU and disk speeds will increase the importance of disk performance for overall query processing performance. One possible alleviation of this problem might come from disk arrays configured exclusively for performance, not for reliability. Disk arrays might not deliver the entire performance gain the large number of disk drives could provide if it is not possible to disable a disk array's parity mechanisms and to access specific disks within an array, particularly during partitioning and merging.

Finally, scheduling bushy trees in multiprocessor systems is not entirely understood yet. While all considerations discussed above apply in principle, multiprocessors permit truly concurrent execution of multiple subplans in a bushy tree. However, it is a very hard problem to schedule two or more subplans such that their result streams are available at the right times and at the right rates, in particular in light of the unavoidable errors in selectivity and cost estimation during query optimization [Christodoulakis 1984; Ioannidis and Christodoulakis 1991].

The last point, estimation errors, leads us to suspect that plans with 30 (or even 100) joins or other operations cannot be optimized completely before execution. Thus, we suspect that a technique reminiscent of Ingres Decomposition [Wong and Youssefi 1976; Youssefi and Wong 1979] will prove to be more effective. One of the principal ideas of Ingres Decomposition is a repetitive cycle consisting of three steps. First, the next step is selected, e.g., a selection or join. Second, the chosen step is executed into a temporary table. Third, the query is simplified by removing predicates evaluated in the completed execution step and replacing one range variable (relation) in the query with the new temporary table. The justification and advantage of this approach are that all earlier selectivities are known for each decision, because the intermediate results are materialized. The disadvantage is that data flow between operators cannot be exploited, resulting in a significant cost for writing and reading intermediate files. For very complex queries, we suggest modifying Decomposition to decide on and execute multiple steps in each cycle, e.g., 3 to 9 joins, instead of executing only one selection or join as in Ingres. Such a hybrid approach might very well combine the advantages of a priori optimization, namely, in-memory data flow between iterators, and optimization with exactly known intermediate result sizes.

An optimization and execution environment even further tuned for very complex queries would anticipate possible outcomes of executing subplans and provide multiple alternative subsequent plans. Figure 24 shows the structure of such a dynamic plan for a complex query. First, subplan A is executed, and statistics about its result are gathered while it is saved on disk. Depending on these statistics, either B or C is executed next. If B is chosen and executed, one of D, E, and F will complete the query; in the case of C instead of B, it will be G or H. Notice that each letter A–H can be an arbitrarily complex subplan, although probably not more than 10 operations due to the limitations of current selectivity estimation methods. Unfortunately, realization of such sophisticated query optimizers will require further research, e.g., into determination of when separate cases are warranted and limitation of the possibly exponential growth in the number of subplans.




Figure 24. A decision tree of partial plans.

9. MECHANISMS FOR PARALLEL QUERY EXECUTION

Considering that all high-performance computers today employ some form of parallelism in their processing hardware, it seems obvious that software written to manage large data volumes ought to be able to exploit parallel execution capabilities [DeWitt and Gray 1992]. In fact, we believe that five years from now it will be argued that a database management system without parallel query execution will be as handicapped in the market place as one without indices.

The goal of parallel algorithms and systems is to obtain speedup and scaleup, and speedup results are frequently used to demonstrate the accomplishments of a design and its implementation. Speedup considers additional hardware resources for a constant problem size; linear speedup is considered optimal. In other words, N times as many resources should solve a constant-size problem in 1/N of the time. Speedup can also be expressed as parallel efficiency, i.e., a measure of how close a system comes to linear speedup. For example, if solving a problem takes 1,200 seconds on a single machine and 100 seconds on 16 machines, the speedup is somewhat less than linear. The parallel efficiency is (1 × 1,200)/(16 × 100) = 75%.

An alternative measure for a parallel system's design and implementation is scaleup, in which the problem size is altered with the resources. Linear scaleup is achieved when N times as many resources can solve a problem with N times as much data in the same amount of time. Scaleup can also be expressed using parallel efficiency, but since speedup and scaleup are different, it should always be clearly indicated which parallel efficiency measure is being reported.

A third measure for the success of a parallel algorithm based on Amdahl's law is the fraction of the sequential program for which linear speedup was attained, defined by p = f × s/d + (1 − f) × s for sequential execution time s, parallel execution time p, and degree of parallelism d. Resolved for f, this is f = (s − p)/(s − s/d) = ((s − p)/s)/((d − 1)/d). For the example above, this fraction is f = ((1,200 − 100)/1,200)/((16 − 1)/16) = 97.78%. Notice that this measure gives much higher percentage values than the parallel efficiency calculated earlier; therefore, the two measures should not be confused.
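Both measures are easy to compute; the sketch below reproduces the numbers of the example (75% parallel efficiency versus a 97.78% Amdahl fraction), underscoring why the two should not be confused.

```python
def parallel_efficiency(seq_time, par_time, degree):
    # How close the measured speedup comes to linear speedup.
    return seq_time / (degree * par_time)

def amdahl_fraction(seq_time, par_time, degree):
    # Fraction of the sequential program that attained linear speedup,
    # from p = f*s/d + (1 - f)*s solved for f.
    return ((seq_time - par_time) / seq_time) / ((degree - 1) / degree)

print(parallel_efficiency(1200, 100, 16))   # 0.75
print(amdahl_fraction(1200, 100, 16))       # ~0.9778
```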

For query processing problems involving sorting or hashing in which multiple merge or partitioning levels are expected, the speedup can frequently be more than linear, or superlinear. Consider a sorting problem that requires two merge levels in a single machine. If multiple machines are used, the sort problem can be partitioned such that each machine sorts a fraction of the entire data amount. Such partitioning will, in a good implementation, result in linear speedup. If, in addition, each machine has its own memory such that the total memory in the system grows with the size of the machine, fewer than two merge levels will suffice, making the speedup superlinear.

9.1 Parallel versus Distributed Database Systems

It might be useful to start the discussion of parallel and distributed query processing with a distinction of the two concepts.

In the database literature, "distributed" usually implies "locally autonomous," i.e., each participating system is a complete database management system in itself, with access control, metadata (catalogs), query processing, etc. In other words, each node in a distributed database management system can function entirely on its own, whether or not the other nodes are present or accessible. Each node




performs its own access control, and cooperation of each node in a distributed transaction is voluntary. Examples of distributed (research) systems are R* [Haas et al. 1982; Traiger et al. 1982], distributed Ingres [Epstein and Stonebraker 1980; Stonebraker 1986a], and SDD-1 [Bernstein et al. 1981; Rothnie et al. 1980]. There are now several commercial distributed relational database management systems. Ozsu and Valduriez [1991a; 1991b] have discussed distributed database systems in much more detail. If the cooperation among multiple database systems is only limited, the system can be called a "federated database system" [Sheth and Larson 1990].

In parallel systems, on the other hand, there is only one locus of control. In other words, there is only one database management system that divides individual queries into fragments and executes the fragments in parallel. Access control to data is independent of where data objects currently reside in the system. The query optimizer and the query execution engine typically assume that all nodes in the system are available to participate in efficient execution of complex queries, and participation of nodes in a given transaction is either presumed or controlled by a global resource manager, but is not based on voluntary cooperation as in distributed systems. There are several parallel research prototypes, e.g., Gamma [DeWitt et al. 1986; DeWitt et al. 1990], Bubba [Boral 1988; Boral et al. 1990], Grace [Fushimi et al. 1986; Kitsuregawa et al. 1983], and Volcano [Graefe 1990b; 1993b; Graefe and Davison 1993], and products, e.g., Tandem's NonStop SQL [Englert et al. 1989; Zeller 1990], Teradata's DBC/1012 [Neches 1984; 1988; Teradata 1983], and Informix [Davison 1992].

Both distributed database systems and parallel database systems have been designed in various kinds, which may create some confusion. Distributed systems can be either homogeneous, meaning that all participating database management systems are of the same type (the hardware and the operating system may even be of the same types), or heterogeneous, meaning that multiple database management systems work together using standardized interfaces but are internally different.16 Furthermore, distributed systems may employ parallelism, e.g., by pipelining datasets between nodes with the receiver already working on some items while the producer is still sending more. Parallel systems can be based on shared-memory (also called shared-everything), shared-disk (multiple processors sharing disks but not memory), distributed-memory (without sharing disks, also called shared-nothing), or hierarchical computer architectures consisting of multiple clusters, each with multiple CPUs and disks and a large shared memory. Stonebraker [1986b] compared the first three alternatives using several aspects of database management, and came to the conclusion that distributed memory is the most promising database management system platform. Each of these approaches has advantages and disadvantages; our belief is that the hierarchical architecture is the most general of these architectures and should be the target architecture for new database software development [Graefe and Davison 1993].

9.2 Forms of Parallelism

There are several forms of parallelism that are interesting to designers and implementors of query processing systems. Interquery parallelism is a direct result of the fact that most database management systems can service multiple requests concurrently. In other words, multiple queries (transactions) can be executing concurrently within a single database management system. In this form of parallelism, resource contention

16 In some organizations, two different database management systems may run on the same (fairly large) computer. Their interactions could be called "nondistributed heterogeneous." However, since the rules governing such interactions are the same as for distributed heterogeneous systems, the case is usually ignored in research and system design.




is of great concern, in particular, contention for memory and disk arms.

The other forms of parallelism are all based on the use of algebraic operations on sets for database query processing, e.g., selection, join, and intersection. The theory and practice of exploiting other "bulk" types such as lists for parallel database query execution are only now developing. Interoperator parallelism is basically pipelining, or parallel execution of different operators in a single query. For example, the iterator concept discussed earlier has also been called "synchronous pipelines" [Pirahesh et al. 1990]; there is no reason not to consider asynchronous pipelines in which operators work independently connected by a buffering mechanism to provide flow control.

Interoperator parallelism can be used in two forms, either to execute producers and consumers in pipelines, called vertical interoperator parallelism here, or to execute independent subtrees in a complex bushy-query evaluation plan concurrently, called horizontal interoperator or bushy parallelism here. A simple example for bushy parallelism is a merge-join receiving its input data from two sort processes. The main problem with bushy parallelism is that it is hard or impossible to ensure that the two subplans start generating data at the right time and generate them at the right rates. Note that the right time does not necessarily mean the same time, e.g., for the two inputs of a hash join, and that the right rates are not necessarily equal, e.g., if two inputs of a merge-join have different sizes. Therefore, bushy parallelism presents too many open research issues and is hardly used in practice at this time.

The final form of parallelism in database query processing is intraoperator parallelism in which a single operator in a query plan is executed in multiple processes, typically on disjoint pieces of the problem and disjoint subsets of the data. This form, also called parallelism based on fragmentation or partitioning, is enabled by the fact that query processing focuses on sets. If the underlying data represented sequences or time series in a scientific database management system, partitioning into subsets to be operated on independently would not be feasible or would require additional synchronization when putting the independently obtained results together.

Both vertical interoperator parallelism and intraoperator parallelism are used in database query processing to obtain higher performance. Beyond the obvious opportunities for speedup and scaleup that these two concepts offer, they both have significant problems. Pipelining does not easily lend itself to load balancing because each process or processor in the pipeline is loaded proportionally to the amount of data it has to process. This amount cannot be chosen by the implementor or the query optimizer and cannot be predicted very well. For intraoperator, partitioning-based parallelism, load balance and performance are optimal if the partitions are all of equal size; however, this can be hard to achieve if value distributions in the inputs are skewed.

9.3 Implementation Strategies

The purpose of the query execution engine is to provide mechanisms for query execution from which the query optimizer can choose—the same applies for the means and mechanisms for parallel execution. There are two general approaches to parallelizing a query execution engine, which we call the bracket and operator models and which are used, for example, in the Gamma and Volcano systems, respectively.

In the bracket model, there is a generic process template that can receive and send data and can execute exactly one operator at any point of time. A schematic diagram of a template process is shown in Figure 25, together with two possible operators, join and aggregation. In order to execute a specific operator, e.g., a join, the code that makes up the generic template "loads" the operator into its place (by switching to this operator's code) and initiates the operator which then controls execution; network I/O on the receiving




Figure 25. Bracket model of parallelization.

and sending sides is performed as a service to the operator on its request and initiation and is implemented as procedures to be called by the operator. The number of inputs that can be active at any point of time is limited to two since there are only unary and binary operators in most database systems. The operator is surrounded by generic template code, which shields it from its environment, for example, the operator(s) that produce its input and consume its output. For parallel query execution, many templates are executed concurrently in the system, using one process per template. Because each operator is written with the implicit assumption that this operator controls all activities in its process, it is not possible to execute two operators in one process without resorting to some thread or coroutine facility, i.e., a second implementation level of the process concept.

In a query-processing system using the bracket model, operators are coded in such a way that network I/O is their only means of obtaining input and delivering output (with the exception of scan and store operators). The reason is that each operator is its own locus of control, and network flow control must be used to coordinate multiple operators, e.g., to match two operators' speed in a producer-consumer relationship. Unfortunately, this coordination requirement also implies that passing a data item from one operator to another always involves expensive interprocess communication system calls, even in the cases

when an entire query is evaluated on a single CPU (and could therefore be evaluated in a single process, without interprocess communication and operating system involvement) or when data do not need to be repartitioned among nodes in a network. An example for the latter is the query "joinCselAselB" in the Wisconsin Benchmark, which requires joining three inputs on the same attribute [DeWitt 1991], or any other query that permits interesting orderings [Selinger et al. 1979], i.e., any query that uses the same join attribute for multiple binary joins. Thus, in queries with multiple operators (meaning almost all queries), interprocess communication and its overhead are mandatory in the bracket model rather than optional.

An alternative to the bracket model is the operator model. Figure 26 shows a possible parallelization of a join plan using the operator model, i.e., by inserting "parallelism" operators into a sequential plan, called exchange operators in the Volcano system [Graefe 1990b; Graefe and Davison 1993]. The exchange operator is an iterator like all other operators in the system with open, next, and close procedures; therefore, the other operators are entirely unaffected by the presence of exchange operators in a query evaluation plan. The exchange operator does not contribute to data manipulation; thus, on the logical level, it is a "no-op" that has no place in a logical query algebra such as the relational algebra. On the physical level of algorithms and processes, however, it provides control not provided by any of the normal operators, i.e., process management, data redistribution, and flow control. Therefore, it is a control operator or a meta-operator. Separation of data manipulation from process control and interprocess communication can be considered an important advantage of the operator model of parallel query processing, because it permits design, implementation, and execution of new data manipulation algorithms such as N-ary hybrid hash join [Graefe 1993a] without regard to the execution environment.
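As a rough single-process stand-in for the exchange operator's role (not Volcano's actual implementation), the sketch below wraps any child iterator behind the usual open/next/close interface, moving items through a bounded queue that provides flow control; a thread plays the part of the producer process, and all names are hypothetical.

```python
import queue
import threading

class Exchange:
    # Control operator sketch: it manipulates no data, only moves it
    # between a producer and its consumer with flow control.
    def __init__(self, child_iterator, buffer_size=1024):
        self.child = child_iterator
        self.port = queue.Queue(maxsize=buffer_size)   # bounded = flow control

    def open(self):
        def produce():
            for item in self.child:
                self.port.put(item)
            self.port.put(StopIteration)               # end-of-stream marker
        self.worker = threading.Thread(target=produce)
        self.worker.start()

    def next(self):
        item = self.port.get()
        return None if item is StopIteration else item

    def close(self):
        self.worker.join()
```

In a real system the queue and thread would be replaced by ports and processes, and an argument such as the desired degree of parallelism would control how many producer processes the operator creates.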




Figure 26. Operator model of parallelization.

A second issue important to point out is that the exchange operator only provides mechanisms for parallel query processing; it does not determine or presuppose policies for using its mechanisms. Policies for parallel processing such as the degree of parallelism, partitioning functions, and allocation of processes to processors can be set either by a query optimizer or by a human experimenter in the Volcano system as they are still subject to intense research. The design of the exchange operator permits execution of a complex query in a single process (by using a query plan without any exchange operators, which is useful in single-processor environments) or with a number of processes by using one or more exchange operators in the query evaluation plan. The mapping of a sequential plan to a parallel plan by inserting exchange operators permits one process per operator as well as multiple processes for one operator (using data partitioning) or multiple operators per process, which is useful for executing a complex query plan with a moderate number of processes. Earlier parallel query execution engines did not provide this degree of flexibility; the bracket model used in the Gamma design, for example, requires a separate process for each operator [DeWitt et al. 1986].

Figure 27 shows the processes created by the exchange operators in the previous figure, with each circle representing a process. Note that this set of processes is only one possible parallelization, which makes sense if the joins are on the same join attributes. Furthermore, the degrees of data parallelism, i.e., the number of processes in each process group, can be controlled using an argument to the exchange operator.

Figure 27. Processes created by exchange operators.

There is no reason to assume that thetwo models differ significantly in theirperformance if implemented with similarcare. Both models can be implementedwith a minimum of control overhead andcan be combined with any partitioningscheme for load balancing. The only dif-ference with respect to performance isthat the operator model permits multipledata manipulation operators such as joinin a single process, i.e., operator synchro-nization and data transfer between oper-ators with a single procedure call with-out operating system involvement. Theimportant advantages of the operatormodel are that it permits easy paral-lelization of an existing sequential sys-tem as well as development and mainte-nance of operators and algorithms in afamiliar and relatively simple single-pro-cess environment [ Graefe and Davison1993].


The bracket and operator models bothprovide pipelining and partitioning aspart of pipelined data transfer betweenprocess groups. For most algebraic opera-tors used in database query processing,these two forms of parallelism are suffi-cient. However, not all operations can beeasily supported by these two models.For example, in a transitive closure oper-ator, newly inferred data is equal to in-put data in its importance and role forcreating further data. Thus, to paral-lelize a single transitive closure operator,the newly created data must also be par-titioned like the input data. Neitherbracket nor operator model immediatelyallow for this need. Hence, for transitiveclosure operators, intraoperator paral-lelism based on partitioning requires thatthe processes exchange data among

themselves outside of the streamparadigm.

The transitive closure operator is notthe only operation for which this restric-tion holds. Other examples include thecomplex object assembly operator de-scribed by Keller et al. [1991] and opera-tors for numerical optimizations as mightbe used in scientific databases. Bothmodels, the bracket model and the opera-tor model, could be extended to provide ageneral and efficient solution to intraop-erator data exchange for intraoperatorparallelism.

9.4 Load Balancing and Skew

For optimal speedup and scaleup, piecesof the processing load must be assignedcarefully to individual processors anddisks to ensure equal completion timesfor all pieces. In interoperator paral-lelism, operators must be grouped to en-sure that no one processor becomes thebottleneck for an entire pipeline. Bal-anced processing loads are very hard toachieve because intermediate set sizescannot be anticipated with accuracy andcertainty in database query optimization.Thus, no existing or proposed query-processing engine relies solely on inter-operator parallelism. In intraoperatorparallelism, data sets must be parti-

tioned such that the processing load isnearly equal for each processor. Noticethat in particular for binary operationssuch as join, equal processing loads canbe different from equal-sized partitions.

There are several research efforts developing techniques to avoid skew or to limit the effects of skew in parallel query processing, e.g., Baru and Frieder [1989], DeWitt et al. [1991b], Hua and Lee [1991], Kitsuregawa and Ogawa [1990], Lakshmi and Yu [1988; 1990], Omiecinski [1991], Seshadri and Naughton [1992], Walton [1989], Walton et al. [1991], and Wolf et al. [1990; 1991]. However, all of these methods have their drawbacks, for example, additional requirements for local processing to determine quantiles.

Skew management methods can be divided into basically two groups. First, skew avoidance methods rely on determining suitable partitioning rules before data is exchanged between processing nodes or processes. For range partitioning, quantiles can be determined or estimated from sampling the data set to be partitioned, from catalog data, e.g., histograms, or from a preprocessing step. Histograms kept on permanent base data have only limited use for intermediate query processing results, in particular, if the partitioning attribute or a correlated attribute has been used in a prior selection or matching operation. However, for stored data they may be very beneficial. Sampling implies that the entire population is available for sampling because the first memory load of an intermediate result may be a very poor sample for partitioning decisions. Thus, sampling might imply that the data flow between operators be halted and an entire intermediate result be materialized on disk to ensure proper random sampling and subsequent partitioning. However, if such a halt is required anyway for processing a large set, it can be used for both purposes. For example, while creating and writing initial run files without partitioning in a parallel sort, quantiles can be determined or estimated and used in a combined partitioning and merging step.
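The skew-avoidance idea for range partitioning can be illustrated with a small, hypothetical sketch (the function names are not from any cited system): boundaries are estimated as quantiles of a sample, and each key is then routed by comparing it against the boundaries. In a real system the sample would come from catalog histograms, a preprocessing step, or the run-formation phase of a parallel sort as described above.

```python
import bisect

def estimate_range_boundaries(sample, num_partitions):
    """Estimate range-partitioning boundaries (quantiles) from a sample so
    that each partition can expect roughly the same number of items."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]

def assign_partition(key, boundaries):
    """Route a key to its partition by binary search over the boundaries."""
    return bisect.bisect_right(boundaries, key)

boundaries = estimate_range_boundaries([7, 42, 3, 19, 88, 64, 25, 51], 4)
print(boundaries)                       # [19, 42, 64]
print(assign_partition(30, boundaries)) # partition 1
```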


Figure 28. Skew limit, confidence, and sample size per partition (sample size per partition plotted against the skew limit, for 95% and 99.9% confidence and for 2 and 1,024 partitions).

Second, skew resolution repartitionssome or all of the data if an initial parti-tioning has resulted in skewed loads.Repartitioning is relatively easy inshared-memory machines, but can alsobe done in distributed-memory architec-tures, albeit at the expense of more net-work activity. Skew resolution can bebased on rehashing in hash partitioningor on quantile adjustment in range parti-tioning. Since hash partitioning tends tocreate fairly even loads and since net-work bandwidth will increase in the nearfuture within distributed-memory ma-chines as well as in local- and wide-areanetworks, skew resolution is a reason-able method for cases in which a priorprocessing step cannot be exploited togather the information necessary forskew avoidance as in the sort exampleabove.

In their recent research into sampling for load balancing, DeWitt et al. [1991b] and Seshadri and Naughton [1992] have shown that stratified random sampling can be used, i.e., samples are selected randomly not from the entire distributed data set but from each local data set at each site, and that even small sets of samples ensure reasonably balanced loads. Their definition of skew is the quotient of the size of the largest partition and the average partition size, i.e., the sum of the sizes of all partitions divided by the degree of parallelism. In other words, a skew of 1.0 indicates a perfectly even distribution. Figure 28 shows the required sample sizes per partition for various skew limits, degrees of parallelism, and confidence levels. For example, to ensure a maximal skew of 1.5 among 1,000 partitions with 95% confidence, 110 random samples must be taken at each site. Thus, relatively small samples suffice for reasonably safe skew avoidance and load balancing, making precise methods unnecessary. Typically, only tens of samples per partition are needed, not several hundreds of samples at each site.
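As a small illustration of this skew metric (a hypothetical helper, not code from the cited studies):

```python
def skew(partition_sizes):
    """Skew = size of the largest partition / average partition size;
    1.0 indicates a perfectly even distribution."""
    average = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / average

print(skew([110, 90, 100, 100]))    # 1.1
print(skew([100, 100, 100, 100]))   # 1.0, perfectly balanced
```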

For allocation of active processing elements, i.e., CPUs and disks, the bandwidth considerations discussed briefly in the section on sorting can be generalized for parallel processes. In principle, all stages of a pipeline should be sized such that they all have bandwidths proportional to their respective data volumes in order to ensure that no stage in the pipeline becomes a bottleneck and slows the other ones down. The latency almost unavoidable in data transfer between pipeline stages should be hidden by the use of buffer memory equal in size to the product of bandwidth and latency.

9.5 Architectures and Architecture Independence

Many database research projects have investigated hardware architectures for parallelism in database systems.


Stonebraker [1986b] compared shared-nothing (distributed-memory), shared-disk (distributed-memory with multiported disks), and shared-everything (shared-memory) architectures for database use based on a number of issues including scalability, communication overhead, locking overhead, and load balancing. His conclusion at that time was that shared-everything excels in none of the points considered; shared-disk introduces too many locking and buffer coherency problems; and shared-nothing has the significant benefit of scalability to very high degrees of parallelism. Therefore, he concluded that overall shared-nothing is the preferable architecture for database system implementation. (Much of this section has been derived from Graefe et al. [1992] and Graefe and Davison [1993].)

Bhide [1988] and Bhide and Stone-braker [1988] compared architectural al-ternatives for transaction processing andconcluded that a shared-everything(shared-memory) design achieves the bestperformance, up to its scalability limit.To achieve higher performance, reliabil-ity, and scalability, Bhide suggestedconsidering shared-nothing (distributed-memory) machines with shared-every-thing parallel nodes. The same idea ismentioned in equally general terms byPirahesh et al. [1990] and Boral et al.[1990], but none of these authors elabo-rate on the idea’s generality or potential.Kitsuregawa and Ogawa’s [1990] newdatabase machine SDC uses multipleshared-memory nodes (plus custom hard-ware such as the Omega network and ahardware sorter), although the effect ofthe hardware design on operators otherthan join is not evaluated in their article.

Customized parallel hardware was investigated but largely abandoned after Boral and DeWitt’s [1983] influential analysis that compared CPU and I/O speeds and their trends. Their analysis concluded that I/O, not processing, is the most likely bottleneck in future high-performance query execution. Subsequently, both Boral and DeWitt embarked on new database machine projects, Bubba and Gamma, that executed customized software on standard processors with local disks [Boral et al. 1990; DeWitt et al. 1990]. For scalability and availability, both projects used distributed-memory hardware with single-CPU nodes and investigated scaling questions for very large configurations.

The XPRS system, on the other hand, has been based on shared memory [Hong and Stonebraker 1991; Stonebraker et al. 1988a; 1988b]. Its designers believe that modern bus architectures can handle up to 2,000 transactions per second, and that shared-memory architectures provide automatic load balancing and faster communication than shared-nothing machines and are equally reliable and available for most errors, i.e., media failures, software, and operator errors [Gray 1990]. However, we believe that attaching 250 disks to a single machine as necessary for 2,000 transactions per second [Stonebraker et al. 1988b] requires significant special hardware, e.g., channels or I/O processors, and it is quite likely that the investment for such hardware can have greater impact on overall system performance if spent on general-purpose CPUs or disks. Without such special hardware, the performance limit for shared-memory machines is probably much lower than 2,000 transactions per second. Furthermore, there already are applications that require larger storage and access capacities.

Richardson et al. [1987] performed an analytical study of parallel join algorithms on multiple shared-memory “clusters” of CPUs. They assumed a group of clusters connected by a global bus with multiple microprocessors and shared memory in each cluster. Disk drives were attached to the busses within clusters. Their analysis suggested that the best performance is obtained by using only one cluster, i.e., a shared-memory architecture. We contend, however, that their results are due to their parameter settings, in particular small relations (typically 100 pages of 32 KB), slow CPUs (e.g., 5 μsec for a comparison, about 2–5 MIPS), a slow global network


(a bus with typically 100 Mbit/sec), and a modest number of CPUs in the entire system (128). It would be very interesting to see the analysis with larger relations (e.g., 1–10 GB), a faster network, e.g., a modern hypercube or mesh with hardware routing, and consideration of bus load and bus contention in each cluster, which might lead to multiple clusters being the better choice. On the other hand, communication between clusters will remain a significant expense. Wong and Katz [1983] developed the concept of “local sufficiency” that might provide guidance in declustering and replication to reduce data movement between nodes. Other work on declustering and limiting declustering includes Copeland et al. [1988], Fang et al. [1986], Ghandeharizadeh and DeWitt [1990], Hsiao and DeWitt [1990], and Hua and Lee [1990].

Finally, there are several hardware designs that attempt to overcome the shared-memory scaling problem, e.g., the DASH project [Anderson et al. 1988], the Wisconsin Multicube [Goodman and Woest 1988], and the Paradigm project [Cheriton et al. 1991]. However, these designs follow the traditional separation of operating system and application program. They rely on page or cache-line faulting and do not provide typical database concepts such as read-ahead and dataflow. Lacking separation of mechanism and policy in these designs almost makes it imperative to implement dataflow and flow control for database query processing within the query execution engine. At this point, none of these hardware designs has been experimentally tested for database query processing.

New software systems designed to exploit parallel hardware should be able to exploit both the advantages of shared memory, namely efficient communication, synchronization, and load balancing, and of distributed memory, namely scalability to very high degrees of parallelism and reliability and availability through independent failures. Figure 29 shows a general hierarchical architecture, which we believe combines these advantages. The important point is the combination of local busses within shared-memory parallel machines and a global interconnection network among machines. The diagram is only a very general outline of such an architecture; many details are deliberately left out and unspecified. The network could be implemented using a bus such as an ethernet, a ring, a hypercube, a mesh, or a set of point-to-point connections. The local busses may or may not be split into code and data or by address range to obtain less contention and higher bus bandwidth and hence higher scalability limits for the use of shared memory. Design and placement of caches, disk controllers, terminal connections, and local- and wide-area network connections are also left open. Tape drives or other backup devices would be connected to local busses.

Modularity is a very important consideration for such an architecture. For example, it should be possible to replace all CPU boards with upgraded models without having to replace memories or disks. Considering that new components will change communication demands, e.g., faster CPUs might require more local bus bandwidth, it is also important that the allocation of boards to local busses can be changed. For example, it should be easy to reconfigure a machine with 4 × 16 CPUs into one with 8 × 8 CPUs.

Beyond the effect of faster communication and synchronization, this architecture can also have a significant effect on control overhead, load balancing, and resulting response time problems. Investigations in the Bubba project at MCC demonstrated that large degrees of parallelism may reduce performance unless load imbalance and overhead for startup, synchronization, and communication can be kept low [Copeland et al. 1988]. For example, when placing 100 CPUs either in 100 nodes or in 10 nodes of 10 CPUs each, it is much faster to distribute query plans to all CPUs and much easier to achieve reasonably balanced loads in the second case than in the first case.


Figure 29. A hierarchical-memory architecture (shared-memory nodes, each with several CPUs, memory, and disks attached to a local bus, connected by a global interconnection network).

Within each shared-memory parallel node, load imbalance can be dealt with either by compensating allocation of resources, e.g., memory for sorting or hashing, or by relatively efficient reassignment of data to processors.

Many of today’s parallel machines arebuilt as one of the two extreme cases ofthis hierarchical design: a distributed-memory machine uses single-CPU nodes,while a shared-memory machine consistsof a single node. Software designed forthis hierarchical architecture will run oneither conventional design as well as agenuinely hierarchical machine and willallow the exploration of tradeoffs in therange of alternatives in between. Themost recent version of Volcano’s ex-change operator is designed for hierar-chical memory, demonstrating that theoperator model of parallelization also of-fers architecture- and topology-indepen-dent parallel query evaluation [Graefeand Davison 1993]. In other words, theparallelism operator is the only operatorthat needs to “understand” the underly-ing architecture, while all data manipu-lation operators can be implementedwithout concern for parallelism, data dis-tribution, and flow control.

10. PARALLEL ALGORITHMS

In the previous section, mechanisms for parallelizing a database query execution engine were discussed. In this section, individual algorithms and their special cases for parallel execution are considered in more detail. Parallel database query processing algorithms are typically based on partitioning an input using range or hash partitioning. Either form of partitioning can be combined with sort- and hash-based query processing algorithms; in other words, the choices of partitioning scheme and local algorithm are almost always entirely orthogonal.

When building a parallel system, thereis sometimes a question whether it isbetter to parallelize a slower sequentialalgorithm with better speedup behavioror a fast sequential algorithm with infe-rior speedup behavior. The answer to thisquestion depends on the design goal andthe planned degree of parallelism. In thefew single-user database systems in use,the goal has been to minimize responsetime; for this goal, a slow algorithm withlinear speedup implemented on highlyparallel hardware might be the rightchoice. In multi-user systems, the goaltypically is to minimize resource con-sumption in order to maximize through-put. For this goal, only the best sequen-tial algorithms should be parallelized. Forexample, Boral and DeWitt [1983] con-cluded that parallelism is no substitutefor effective and efficient indices. For anew parallel algorithm with impressive


speedup behavior, the question ofwhether or not the underlying sequentialalgorithm is the most efficient choiceshould always be considered.

10.1 Parallel Selections and Updates

Since disk I/O is a performance bottleneck in many systems, it is natural to parallelize it. Typically, either asynchronous I/O or one process per participating I/O device is used, be it a disk or an array of disks under a single controller. If a selection attribute is also the partitioning attribute, fewer than all disks will contain selection results, and the number of processes and activated disks can be limited. Notice that parallel selection can be combined very effectively with local indices, i.e., indices covering the data of a single disk or node. In general, it is most efficient to maintain indices close to the stored data sets, i.e., on the same node in a parallel database system.

For updates of partitioning attributesin a partitioned data set, items may needto move between disks and sites, just asitems may move if a clustering attributeis updated. Thus, updates of partitioningattributes may require setting up datatransfers from old to new locations ofmodified items in order to maintain theconsistency of the partitioning. The factthat updating partitioning attributes ismore expensive is one reason why im-mutable (or nearly immutable) iden-tifiers or keys are usually used aspartitioning attributes.

10.2 Parallel Sorting

Since sorting is the most expensive oper-ation in many of today’s database man-agement systems, much research hasbeen dedicated to parallel sorting[Baugsto and Greipsland 1989; Beck etal. 1988; Bitton and Friedland 1982;Graefe 1990a; Iyer and Dias 1990; Kit-suregawa et al, 1989b; Lorie and Young1989; Menon 1986; Salzberg et al. 1990].There are two dimensions along whichparallel sorting methods can be classi-

fied: the number of their parallel inputs(e.g., scan or subplans executed in paral-lel) and the number of parallel outputs

(consumers) [Graefe 1990a]. As sequential input or output restrict the throughput of parallel sorts, we assume a multiple-input multiple-output parallel sort here, and we further assume that the input items are partitioned randomly with respect to the sort attribute and that the output items should be range-partitioned and sorted within each range.

Considering that data exchange is expensive, both in terms of communication and synchronization delays, each data item should be exchanged only once between processes. Thus, most parallel sort algorithms consist of a local sort and a data exchange step. If the data exchange step is done first, quantiles must be known to ensure load balancing during the local sort step. Such quantiles can be obtained from histograms in the catalogs or by sampling. It is not necessary that the quantiles be precise; a reasonable approximation will suffice.

If the local sort is done first, the final local merging should pass data directly into the data exchange step. On each receiving site, multiple sorted streams must be merged during the data exchange step. One of the possible problems is that all producers of sorted streams first produce low key values, limiting performance by the speed of the first (single!) consumer; then all producers switch to the next consumer, etc.

If a different partitioning strategy than range partitioning is used, sorting with subsequent partitioning is not guaranteed to be deadlock free in all situations. Deadlock will occur if (1) multiple consumers are fed by multiple producers, (2) each producer produces a sorted stream, and each consumer merges multiple sorted streams, (3) some key-based partitioning rule other than range partitioning is used, i.e., hash partitioning, (4) flow control is enabled, and (5) the data distribution is particularly unfortunate.

Figure 30 shows a scenario with twoproducer and two consumer processes,


i.e., both the producer operators and the consumer operators are executed with a degree of parallelism of two. The circles in Figure 30 indicate processes, and the arrows indicate data paths. Presume that the left sort produces the stream 1, 3, 5, 7, ..., 999, 1002, 1004, 1006, 1008, ..., 2000 while the right sort produces 2, 4, 6, 8, ..., 1000, 1001, 1003, 1005, 1007, ..., 1999. The merge operations in the consumer processes must receive the first item from each producer process before they can create their first output item and remove additional items from their input buffers. However, the producers will need to produce 500 items each (and insert them into one consumer’s input buffer, all 500 for one consumer) before they will send their first item to the other consumer. The data exchange buffer needs to hold 1,000 items at one point of time, 500 on each side of Figure 30. If flow control is enabled and if the exchange buffer (flow control slack) is less than 500 items, deadlock will occur.

The reason deadlock can occur in thissituation is that the producer processesneed to ship data in the order obtainedfrom their input subplan (the sort in Fig-ure 30) while the consumer processesneed to receive data in sorted order asrequired by the merge. Thus, there aretwo sides which both require absolutecontrol over the order in which data passover the process boundary. If the tworequirements are incompatible, an un-bounded buffer is required to ensurefreedom from deadlock.

In order to avoid deadlock, it must beensured that one of the five conditionslisted above is not satisfied. The secondcondition is the easiest to avoid, andshould be focused on. If the receivingprocesses do not perform a merge, i.e.,the individual input streams are notsorted, deadlock cannot occur because theslack given in the flow control must besomewhere, either at some producer orsome consumer or several of them, andthe process holding the slack can con-tinue to process data, thus preventingdeadlock.

Figure 30. Scenario with possible deadlock.

Figure 31. Deadlock-free scenario.

Our recommendation is to avoid theabove situation, i.e., to ensure that such

query plans are never generated by theoptimizer. Consider for which purposessuch a query plan would be used. Thetypical scenario is that multiple pro-cesses perform a merge join of two in-puts, and each (or at least one) input issorted by several producer processes. Analternative scenario that avoids the prob-lem is shown in Figure 31. Result dataare partitioned and sorted as in the pre-vious scenario. The important differenceis that the consumer processes do notmerge multiple sorted incoming streams.

Figure 32. Deadlock danger due to a binary operator in the consumer.

One of the conditions for the deadlock problem illustrated in Figure 30 is that there are multiple producers and multiple consumers of a single logical data stream. However, a very similar deadlock situation can occur with single-process producers if the consumer includes an operation that depends on ordering, typically merge-join. Figure 32 illustrates the problem with a merge-join operation executed in two consumer processes. Notice that the left and right producers in Figure 32 are different inputs of the merge-join, not processes executing the same operators as in Figures 30 and 31. The consumer in Figure 32 is still one operator executed by two processes. Presume that the left sort produces the stream 1, 3, 5, 7, ..., 999, 1002, 1004, 1006, 1008, ..., 2000 while the right sort produces 2, 4, 6, 8, ..., 1000, 1001, 1003, 1005, 1007, ..., 1999. In this case, the merge-join has precisely the same effect as the merging of two parts of one logical data stream in Figure 30. Again, if the data exchange buffer (flow control slack) is too small, deadlock will occur. Similar to the deadlock avoidance tactic in Figure 31, deadlock in Figure 32 can be avoided by placing the sort operations into the consumer processes rather than into the producers. However, there is an additional solution for the scenario in Figure 32, namely, moving only one of the sort operators, not both, into the consumer processes.

If moving a sort operation into the con-sumer process is not realistic, e.g., be-cause the data already are sorted whenthey are retrieved from disk as in a B-treescan, alternative parallel executionstrategies must be found that do not re-quire repartitioning and merging ofsorted data between the producers andconsumers. There are two possible cases.In the first case, if the input data are notonly sorted but also already partitionedsystematically, i.e., range or hash parti-tioned, on the attribute(s) considered bythe consumer operator, e.g., the by-list ofan aggregate function or the join at-tribute, the process boundary and dataexchange could be removed entirely. Thisimplies that the producer operator, e.g.,the B-tree scan, and the consumer, e.g.,the merge-join, are executed by the sameprocess group and therefore with thesame degree of parallelism.

In the second case, although sortedon the relevant attribute within eachpartition, the operator’s data could bepartitioned either round-robin or on a

different attribute. For a join, afragment-and-replicate matching strat-egy could be used [Epstein et al. 1978;Epstein and Stonebraker 1980; Lehmanet al. 1985], i.e., the join should executewithin the same threads as the operatorproducing sorted output while the secondinput is replicated to all instances of the


Figure 33. Merge depth as a function of the degree of parallelism (total memory size M = 40, input size R = 125,000).

join. Note that fragment-and-replicatemethods do not work correctly for semi-join, outer join, difference, and union, i.e.,when an item is replicated and is in-serted (incorrectly) multiple times intothe global output. A second solution thatworks for all operators, not only joins, isto execute the consumer of the sorteddata in a single thread. Recall that mul-tiple consumers are required for a dead-lock to occur. A third solution that iscorrect for all operators is to send dummyitems containing the largest key seen sofar from a producer to a consumer if nodata have been exchanged for a predeter-mined amount of time (data volume, keyrange). In the examples above, if a pro-ducer must send a key to all consumersat least after every 100 data items pro-cessed in the producer, the requiredbuffer space is bounded, and deadlockcan be avoided. In some sense, this solu-tion is very simple; however, it requiresthat not only the data exchange mecha-nism but also sort-based algorithms suchas merge-join must “understand” dummyitems. Another solution is to exchange alldata without regard to sort order, i.e., toomit merging in the data exchange mech-anism, and to sort explicitly after repar-titioning is complete. For this sort, re-placement selection might be more effec-tive than quicksort for generating initialruns because the runs would probablybe much larger than twice the size ofmemory.

A final remark on deadlock avoidance:Since deadlock can only occur if the con-sumer process merges, i.e., not only theproducer but also the consumer operatortry to determine the order in which datacross process boundaries, the deadlockproblem only exists in a query executionengine based on sort-based set-processingalgorithms. If hash-based algorithmswere used for aggregation, duplicate re-moval, join, semi-join, outer join, inter-section, difference, and union, the needfor merging and therefore the danger ofdeadlock would vanish.

An interesting parallel sorting method with balanced communication and without the possibility of deadlock in spite of local sort followed by data exchange (if the data distribution is known a priori) is to sort locally only by the position within the final partition and then exchange data guaranteeing a balanced data flow. This method might be best seen in an example: Consider 10 partitions with key values from 0 to 999 in a uniform distribution. The goal is to have all key values between 0 and 99 sorted on site 0, between 100 and 199 sorted on site 1, etc. First, each partition is sorted locally at its original site, without data exchange, on the last two digits only, ignoring the first digit. Thus, each site has a sequence such as 200, 301, 401, 902, 2, 603, 804, 605, 105, 705, ..., 999, 399. Now each site sends data to its correct final destination.


Notice that each site sends data simultaneously to all other sites, creating a balanced data flow among all producers and consumers. While this method seems elegant, its problem is that it requires fairly detailed distribution information to ensure the desired balanced data flow.
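The following hedged sketch simulates this method sequentially for the example above (ten sites, keys 0–999, site i responsible for keys i*100 to i*100+99); it is an illustration of the idea, not an implementation from the literature. Each receiving site merges its incoming, already-ordered streams to obtain its sorted range.

```python
import heapq

def balanced_parallel_sort(partitions, num_sites=10):
    """Sketch of the balanced parallel sort described above for keys 0..999."""
    # Phase 1: each site sorts locally by the position within the final
    # partition only, i.e., by the last two digits, ignoring the first digit.
    locally_sorted = [sorted(p, key=lambda k: k % 100) for p in partitions]

    # Phase 2: each site streams its items to their final destinations.
    # Because every producer emits keys for every destination in position
    # order, the data flow between all sites is balanced.
    streams = [[[] for _ in range(num_sites)] for _ in range(num_sites)]
    for src, keys in enumerate(locally_sorted):
        for key in keys:
            streams[key // 100][src].append(key)

    # Each destination merges its incoming streams, which are already in
    # key order, to obtain its sorted key range.
    return [list(heapq.merge(*streams[dst])) for dst in range(num_sites)]

sites = balanced_parallel_sort([[200, 301, 902, 105], [401, 2, 603, 999]])
print(sites[0], sites[9])   # [2] [902, 999]
```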

In shared-memory machines, memory must be divided over all concurrent sort processes. Thus, the more processes are active, the less memory each one can get. The importance of this memory division is the limitation it puts on the size of initial runs and on the fan-in in each merge process. In other words, large degrees of parallelism may impede performance because they increase the number of merge levels. Figure 33 shows how the number of merge levels grows with increasing degrees of parallelism, i.e., decreasing memory per process and merge fan-in. For input size R, total memory size M, and P parallel processes, the merge depth L is L = log_(M/P-1)((R/P)/(M/P)) = log_(M/P-1)(R/M). The optimal degree of parallelism must be determined considering the tradeoff between parallel processing and large fan-ins, somewhat similar to the tradeoff between fan-in and cluster size. Extending this argument using the duality of sorting and hashing, too much parallelism in hash partitioning on shared-memory machines can also be detrimental, both for aggregation and for binary matching [Hong and Stonebraker 1993].
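A quick numeric check of this formula, matching the parameters of Figure 33 (a hypothetical helper, shown only to make the tradeoff tangible):

```python
import math

def merge_depth(R, M, P):
    """L = log_(M/P-1)((R/P)/(M/P)) = log_(M/P-1)(R/M); sizes in pages."""
    fan_in = M / P - 1            # merge fan-in available to each process
    return math.log(R / M, fan_in)

# With M = 40 and R = 125,000 as in Figure 33, the merge depth grows
# from about 2.2 at P = 1 to roughly 11 at P = 13.
for P in (1, 3, 5, 7, 9, 11, 13):
    print(P, round(merge_depth(125_000, 40, P), 1))
```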

10.3 Parallel Aggregation and Duplicate Removal

Parallel algorithms for aggregation andduplicate removal are best divided into alocal step and a global step. First, dupli-cates are eliminated locally, and thendata are partitioned to detect and re-move duplicates from different originalsites. For aggregation, local and globalaggregate functions may differ. For ex-ample, to perform a global count, thelocal aggregation counts while the globalaggregation sums local counts into aglobal count.
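A minimal sketch of the two-phase scheme for a global COUNT (hypothetical helper names; a real system would interleave the exchange step with the global aggregation):

```python
from collections import Counter

def parallel_count(partitions):
    """Local aggregation counts at each site; the global aggregation then
    sums the local counts per group into the global count."""
    local_counts = [Counter(items) for items in partitions]   # local step
    global_counts = Counter()
    for counts in local_counts:                               # global step
        global_counts.update(counts)    # update() adds the partial counts
    return global_counts

print(parallel_count([["a", "b", "a"], ["b", "b", "c"]]))
# Counter({'b': 3, 'a': 2, 'c': 1})
```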

For local hash-based aggregation, a special technique might improve performance. Instead of creating overflow files locally to resolve hash table overflow, items can be moved directly to their final site. Hopefully, this site can aggregate them immediately into the local hash table because a similar item already exists. In many recent distributed-memory machines, it is faster to ship an item to another site than to do a local disk I/O. In fact, some distributed-memory vendors attach disk drives not to the primary processing nodes but to special “I/O nodes” because network delay is negligible compared to I/O time, e.g., in Intel’s iPSC/2 and its subsequent parallel architectures.

The advantage is that disk I/O is required only when the aggregation output size does not fit into the aggregate memory available on all machines, while the standard local aggregation-exchange-global aggregation scheme requires local disk I/O if any local output size does not fit into a local memory. The difference between the two is determined by the degree to which the original input is already partitioned (usually not at all), making this technique very beneficial.

10.4 Parallel Joins and Other Binary Matching Operations

Binary matching operations such as join, semi-join, outer join, intersection, union, and difference are different than the previous operations exactly because they are binary. For bushy parallelism, i.e., a join for which two subplans create the two inputs independently from one another in parallel, we might consider symmetric hash join algorithms. Instead of differentiating between build and probe inputs, the symmetric hash join uses two hash tables, one for each input. When a data item (or packet of items) arrives, the join algorithm first determines which input it came from and then joins the new data item with the hash table built from the other input as well as inserting the new data item into its own hash table such that data items from the other input arriving later can be joined correctly. Such a symmetric hash join algorithm has been used in XPRS, a shared-memory high-performance


extensible-relational database system [Hong and Stonebraker 1991; 1993; Stonebraker et al. 1988a; 1988b] as well as in Prisma/DB, a shared-nothing main-memory database system [Wilschut 1993; Wilschut and Apers 1993]. The advantage of symmetric matching algorithms is that they are independent of the data rates of the inputs; their disadvantage is that they require that both inputs fit in memory, although one hash table can be dropped when one input is exhausted.
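A hedged sketch of the symmetric hash join idea on a single join key (this is an illustration, not the XPRS or Prisma/DB code): each arriving item first probes the other input's hash table and is then inserted into its own, so results are produced regardless of which input delivers data faster.

```python
from collections import defaultdict

class SymmetricHashJoin:
    """One hash table per input; insert() processes one arriving item and
    returns the join results it produces immediately."""

    def __init__(self):
        self.tables = (defaultdict(list), defaultdict(list))

    def insert(self, side, key, item):
        """side is 0 (left) or 1 (right)."""
        other = 1 - side
        results = [(item, match) if side == 0 else (match, item)
                   for match in self.tables[other][key]]   # probe other table
        self.tables[side][key].append(item)                 # then insert
        return results

join = SymmetricHashJoin()
print(join.insert(0, "k1", "left-1"))    # [] since there is no match yet
print(join.insert(1, "k1", "right-1"))   # [('left-1', 'right-1')]
```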

For parallelizing a single binarymatching operation, there are basically

two techniques, called here symmetricpartitioning and fragment and replicate.In both cases, the global result is theunion (concatenation) of all local results.Some algorithms exploit the topology ofcertain architectures, e.g., ring- or cube-based communication networks [Baruand Frieder 1989; Omiecinski and Lin1989].

In the symmetric partitioning meth-ods, both inputs are partitioned on theattributes relevant to the operation (i.e.,the join attribute for joins or all at-tributes for set operations), and then theoperation is performed at each site. Boththe Gamma and the Teradata databasemachines use this method. Notice thatthe partitioning method (usually hashed)and the local join method are indepen-dent of each other; Gamma and Graceuse hash joins while Teradata usesmerge-join.

In the fragment-and-replicate meth-ods, one input is partitioned, and theother one is broadcast to all sites. Typi-cally, the larger input is partitioned bynot moving it at all, i.e., the existingpartitions are processed at their loca-tions prior to the binary matching opera-tion. Fragment-and-replicate methodswere considered the join algorithms ofchoice in early distributed database sys-tems such as R*, SDD-1, and distributedIngres, because communication costsovershadowed local processing costs andbecause it was cheaper to send a smallinput to a small number of sites than topartition both a small and a large input.Note that fragment-and-replicate meth-ods do not work correctly for semi-join,

outer join, difference, and union, namely,when an item is replicated and is in-serted into the output (incorrectly) multi-ple times.

A technique for reducing network traffic during join processing in distributed database systems uses redundant semi-joins [Bernstein et al. 1981; Chiu and Ho 1980; Gouda and Dayal 1981], an idea that can also be used in distributed-memory parallel systems. For example, consider the join on a common attribute A of relations R and S stored on two different nodes in a network, say r and s. The semi-join method transfers a duplicate-free projection of R on A to s, performs a semi-join there to determine the items in S that actually participate in the join result, and ships these items to r for the actual join. In other words, based on the relational algebra law that

R JOIN S = R JOIN (S SEMIJOIN R),

cost savings of not shipping all of S were realized at the expense of projecting and shipping the R.A column and executing the semi-join. Of course, this idea can be used symmetrically to reduce R or S or both, and all operations (projection, duplicate removal, semi-join, and final join) can be executed in parallel on both r and s or on more than two nodes using the parallel join strategies discussed earlier in this section. Furthermore, there are probabilistic variants of this idea that use bit vector filtering instead of semi-joins, discussed later in its own section.
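The semi-join program can be sketched as follows (a hypothetical helper; relations are represented as lists of dictionaries, and the shipping steps are only indicated by comments):

```python
def semijoin_reduced_join(R, S, key_r, key_s):
    """R JOIN S computed as R JOIN (S SEMIJOIN R) to save network traffic."""
    # Step 1 (at site r): duplicate-free projection of R on the join column,
    # shipped to site s.
    r_keys = {t[key_r] for t in R}

    # Step 2 (at site s): S SEMIJOIN R; only the S items that can
    # participate in the join are shipped back to site r.
    s_reduced = [t for t in S if t[key_s] in r_keys]

    # Step 3 (at site r): ordinary local join, here a simple hash join.
    table = {}
    for t in s_reduced:
        table.setdefault(t[key_s], []).append(t)
    return [{**r, **s} for r in R for s in table.get(r[key_r], [])]

R = [{"a": 1, "x": "r1"}, {"a": 2, "x": "r2"}]
S = [{"a": 1, "y": "s1"}, {"a": 3, "y": "s3"}]
print(semijoin_reduced_join(R, S, "a", "a"))
# [{'a': 1, 'x': 'r1', 'y': 's1'}]
```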

Roussopoulos and Kang [1991] recently showed that symmetric semi-joins are particularly useful. Using the equalities (for a join of relations R and S on attribute A)

R JOIN S = R JOIN (S SEMIJOIN π_A R)
         = (R SEMIJOIN π_A (S SEMIJOIN π_A R)) JOIN (S SEMIJOIN π_A R)       (a)
         = (R SEMIJOIN π_A (S SEMIJOIN π_A R)) JOIN (S SEMIJOIN-BAR π_A R),  (b)


they designed a four-step procedure to compute the join of two relations stored at two sites. First, the first relation’s join attribute column R.A is sent duplicate free to the other relation’s site, s. Second, the first semi-join is computed at s, and either the matching values (term (a) above) or the nonmatching values (term (b) above; see footnote 17) of the join column S.A are sent back to the first site, r. The choice between (a) and (b) is made based on the number of matching and nonmatching values of S.A. Third, site r determines which items of R will participate in the join R JOIN S, i.e., R SEMIJOIN S. Fourth, both input sites send exactly those items that will participate in the join R JOIN S to the site that will compute the final result, which may or may not be one of the two input sites. Of course, this two-site algorithm can be used across any number of sites in a parallel query evaluation system.

Typically, each data item is exchanged only once across the interconnection network in a parallel algorithm. However, for parallel systems with small communication overhead, in particular for shared-memory systems, and in parallel processing systems with processors without local disk(s), it may be useful to spread each overflow file over all available nodes and disks in the system. The disadvantage of the scheme may be communication overhead; however, the advantages of load balancing and cumulative bandwidth while reading a partition file have led to the use of this scheme both in the Gamma and SDC database machines, called bucket spreading in the SDC design [DeWitt et al. 1990; Kitsuregawa and Ogawa 1990].

For parallel non-equi-joins, a symmet-ric fragment-and-replicate method hasbeen proposed by Stamos and Young[Stamos and Young 1989]. As shown inFigure 34, processors are organized intorows and columns. One input relation ispartitioned over rows, and partitions are

17 SEMIJOIN-BAR stands for the anti-semi-join, which determines those items in the first input that do not have a match in the second input.

Figure 34. Symmetric fragment-and-replicate join.

replicated within each row, while theother input is partitioned and replicatedover columns. Each item from one input“meets” each item from the other inputat exactly one site, and the global joinresult is the concatenation of all localjoins.
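The placement rule of this scheme can be sketched as follows (a hypothetical helper; the row and column assignments would normally come from hash or round-robin partitioning functions):

```python
def grid_sites(r_key, s_key, rows, cols):
    """Symmetric fragment-and-replicate on a rows x cols processor grid:
    an R item is replicated to all sites of its row, an S item to all
    sites of its column, so each (R, S) pair meets at exactly one site."""
    r_row = hash(r_key) % rows
    s_col = hash(s_key) % cols
    r_sites = {(r_row, c) for c in range(cols)}   # replication of the R item
    s_sites = {(r, s_col) for r in range(rows)}   # replication of the S item
    return r_sites & s_sites                      # exactly one common site

print(grid_sites("some R item", "some S item", rows=3, cols=4))
```

Note that the scheme uses rows times cols processors and that neither input needs to be partitioned on the join attribute, which is why it also works for non-equi-joins.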

Avoiding partitioning as well as broad-casting for many joins can be accom-plished with a physical database designthat considers frequently performed joinsand distributes and replicates data overthe nodes of a parallel or distributed sys-tem such that many joins already havetheir input data suitably partitioned.Katz and Wong formalized this notion aslocal sufficiency [Katz and Wong 1983;Wong and Katz 1983]; more recent re-search on the issue was performed in theBubba project [Copeland et al. 1988].

For joins in distributed systems, a thirdclass of algorithms, called fetch-as-needed, was explored. The idea of thesealgorithms is that one site performs thejoin by explicitly requesting (fetching)only those items from the other inputneeded to perform the join [Daniels andNg 1982; Williams et al. 1982]. If oneinput is very small, fetching only thenecessary items of the larger input mightseem advantageous. However, this algo-rithm is a particularly poor implementa-tion of a semi-join technique discussedabove. Instead of requesting items or val-ues one by one, it seems better to firstproject all join attribute values, ship(stream) them across the network, per-form the semi-join using any local binarymatching algorithm, and then stream ex-actly those items back that will be re-


quired for the join back to the first site.The difference between the semi-jointechnique and fetch-as-needed is that thesemi-join scans the first input twice, onceto extract the join values and once toperform the real join, while fetch asneeded needs to work on each data itemonly once.

10.5 Parallel Universal Quantification

In our earlier discussion on sequentialuniversal quantification, we discussedfour algorithms for universal quantifica-tion or relational division, namely, naivedivision (a direct, sort-based algorithm),hash-division (direct, hash based), andsort- and hash-based aggregation (indi-rect ) algorithms, which might requiresemi-joins and duplicate removal in theinputs.

For naive division, pipelining can beused between the two sort operators andthe division operator. However, both quo-tient partitioning and divisor partition-ing can be employed as described belowfor hash-division.

For algorithms based on aggregation,both pipelining and partitioning can beapplied immediately using standardtechniques for parallel query execution.While partitioning seems to be a promis-ing approach, it has an inherent problemdue to the possible need for a semi-join.Recall that in the example for universalquantification using Transcript andCourse relations, the join attribute in thesemi-join (course-no) is different than thegrouping attribute in the subsequent ag-gregation (student-id). Thus, the Tran-script relation has to be partitioned twice,once for the semi-join and once for theaggregation.

For hash-division, pipelining has onlylimited promise because the entiredivision is performed within a singleoperator. However, both partitioningstrategies discussed earlier for hash tableoverflow can be employed for parallel ex-ecution, i.e., quotient partitioning and di-visor partitioning [Graefe 1989; Graefeand Cole 1993].

For hash-division with quotient par-titioning, the divisor table must be

replicated in the main memory of all par-ticipating processors. After replication,all local hash-division operators workcompletely independent of each other.Clearly, replication is trivial for shared-memory machines, in particular since asingle copy of the divisor table can beshared without synchronization amongmultiple processes once it is complete.
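The independence of the local operators under quotient partitioning can be sketched as follows; the hash-division routine below is a simplified, hedged reconstruction of the algorithm described in the earlier section on universal quantification (a bitmap per quotient candidate), not a quotation of it, and all names are illustrative.

```python
def hash_division(dividend, divisor):
    """Simplified hash-division: dividend is a list of (quotient_value,
    divisor_value) pairs; the result is the set of quotient values that
    occur with every divisor value."""
    bit_of = {d: i for i, d in enumerate(divisor)}          # divisor table
    full_mask = (1 << len(divisor)) - 1
    quotient_table = {}                                      # value -> bitmap
    for q, d in dividend:
        if d in bit_of:
            quotient_table[q] = quotient_table.get(q, 0) | (1 << bit_of[d])
    return {q for q, bits in quotient_table.items() if bits == full_mask}

def parallel_hash_division(dividend_partitions, divisor):
    """Quotient partitioning: the divisor table is replicated, the dividend
    is partitioned on the quotient attribute, and each local operator runs
    independently; the global result is the union of the local results."""
    result = set()
    for partition in dividend_partitions:    # conceptually one per processor
        result |= hash_division(partition, divisor)
    return result

# Dividend partitioned on the quotient attribute (student); student 's1'
# took both courses, 's2' only one.
print(parallel_hash_division([[("s1", "c1"), ("s1", "c2")], [("s2", "c1")]],
                             divisor=["c1", "c2"]))   # {'s1'}
```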

When using divisor partitioning, theresulting partitions are processed in par-allel instead of in phases as discussed forhash table overflow. However, instead oftagging the quotient items with phasenumbers, processor network addressesare attached to the data items, and thecollection site divides the set of all incom-ing data items over the set of processornetwork addresses. In the case that thecentral collection site is a bottleneck, thecollection step can be decentralized usingquotient partitioning.

11. NONSTANDARD QUERY PROCESSING ALGORITHMS

In this section, we briefly review thequery processing needs of data modelsand database systems for nonstandardapplications. In many cases, the logicaloperators defined for new data modelscan use existing algorithms, e.g., for in-tersection. The reason is that for process-ing, bulk data types such as array, set,bag (multi-set), or list are represented assequences similar to the streams used inthe query processing techniques dis-cussed earlier, and the algorithms to ma-nipulate these bulk types are equal tothe ones used for sets of tuples, i.e., rela-tions. However, some algorithms are gen-uinely different from the algorithms wehave surveyed so far. In this section, wereview operators for nested relations,temporal and scientific databases,object-oriented databases, and moremeta-operators for additional query pro-cessing control.

There are several reasons for integrat-ing these operators into an algebraicquery-processing system. First, it per-mits efficient data transfer from thedatabase to the application embodied inthese operators. The interface between


database operators is designed to be as efficient as possible; the same efficient interface should also be used for applications. Second, operator implementors can take advantage of the control provided by the meta-operators. For example, an operator for a scientific application can be implemented in a single-process environment and later parallelized with the exchange operator. Third, query optimization based on algebraic transformation rules can cover all operators, including operations that are normally considered database application code. For example, using algebraic optimization tools such as the EXODUS and Volcano optimizer generators [Graefe and DeWitt 1987; Graefe et al. 1992; Graefe and McKenna 1993], optimization rules that can move an unusual database operator in a query plan are easy to implement. For a sampling operator, a rule might permit transforming an algebra expression to query a sample instead of sampling a query result.

11.1 Nested Relations

Nested relations, or Non-First-Normal-Form (NF2) relations, permit relation-valued attributes in addition to atomic values such as integers and strings used in the normal or “flat” relational model. For example, in an order-processing application, the set of individual line items on each order could be represented as a nested relation, i.e., as part of an order tuple. Figure 35 shows an NF2 relation with two tuples with two and three nested tuples and the equivalent normalized relations, which we call the master and detail relations. Nested relations can be used for all one-to-many relationships but are particularly well suited for the representation of “weak entities” in the Entity-Relationship (ER) Model [Chen 1976], i.e., entities whose existence and identification depend on another entity as for order entries in Figure 35. In general, nested subtuples may include relation-valued attributes, with arbitrary nesting depth. The advantages of the NF2 model are that component relationships can be represented more naturally than in the fully normalized model; many frequent join operations can be avoided, and structural information can be used for physical clustering. Its disadvantage is the added complexity, in particular, in storage management and query processing.

Several algebras for nested relations have been defined, e.g., Deshpande and Larson [1991], Ozsoyoglu et al. [1987], Roth et al. [1988], Schek and Scholl [1986], Tansel and Garnett [1992]. Our discussion here focuses not on the conceptual design of NF2 algebras but on algorithms to manipulate nested relations.

Two operations required in NF2 database systems are operations that transform an NF2 relation into a normalized relation with atomic attributes only and vice versa. The first operation is frequently called unnest or flatten; the opposite direction is called the nest operation. The unnest operation can be performed in a single scan over the NF2 relation that includes the nested subtuples; both normalized relations in Figure 35 and their join can be derived readily enough from the NF2 relation. The nest operation requires grouping of tuples in the detail relation and a join with the master relation. Grouping and join can be implemented using any of the algorithms for aggregate functions and binary matching discussed earlier, i.e., sort- and hash-based sequential and parallel methods. However, in order to ensure that unnest and nest operations are exact inverses of each other, some structural information might have to be preserved in the unnest operation. Ozsoyoglu and Wang [1992] present a recent investigation of “keying methods” for this purpose.
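A single-scan unnest of the Figure 35 relation can be sketched as follows (the dictionary representation and field names are illustrative only):

```python
def unnest(orders):
    """Flatten the nested orders relation into master and detail relations
    in one scan over the nested tuples."""
    master, detail = [], []
    for order in orders:
        master.append((order["order_no"], order["customer_no"], order["date"]))
        for item in order["items"]:
            detail.append((order["order_no"], item["part_no"], item["count"]))
    return master, detail

nested = [{"order_no": 110, "customer_no": 911, "date": 910902,
           "items": [{"part_no": 4711, "count": 8},
                     {"part_no": 2345, "count": 7}]}]
print(unnest(nested))
# ([(110, 911, 910902)], [(110, 4711, 8), (110, 2345, 7)])
```

The nest operation, by contrast, requires the grouping of detail tuples and the join with the master relation mentioned above.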

All operations defined for flat relations can also be defined for nested relations, in particular: selection, join, and set operations (union, intersection, difference). For selections, additional power is gained with selection conditions on subtuples and sets of subtuples using set comparisons or existential or universal quantification.


Nested (NF2) relation:

Order-No  Customer-No  Date     Items: (Part-No, Count)
110       911          910902   (4711, 8), (2345, 7)
112       912          910902   (9876, 3), (2222, 1), (2357, 9)

Equivalent flat master and detail relations:

Order-No  Customer-No  Date          Order-No  Part-No  Quantity
110       911          910902        110       4711     8
112       912          910902        110       2345     7
                                      112       9876     3
                                      112       2222     1
                                      112       2357     9

Figure 35. Nested relation and equivalent flat relations.

In principle, since a nested relation is a relation, any relational calculus and algebra expression should be permitted for it. In the example in Figure 35, there may be a selection of orders in which the ordered quantity of all items is more than 100, which is a universal quantification. The algorithms for selections with quantifiers are similar to the ones discussed earlier for flat relations, e.g., relational semi-join and division, but are easier to implement because the grouping process built into the flat-relational algorithms is inherent in the nested tuple structure.

For joins, similar considerations apply.Matching algorithms discussed earliercan be used in principle. They may bemore complex if the join predicate in-volves subrelations, and algorithm com-binations may be required that are de-rived from a flat-relation query over flatrelations equivalent to the NF 2 queryover the nested relations. However, thereshould be some performance improve-ments possible if the grouping of valuesin the nested relations can be exploited,as for example, in the join algorithmsdescribed by Rosenthal et al. [1991].

Deshpande and Larson [ 1992] investi-gated join algorithms for nested relationsbecause “the purpose of nesting in orderto store precomputed joins is defeatedif it is unnested every time a join is

performed on a subrelation.” Their algo-rithm, (parallel) partitioned nested-hashed-loops, joins one relation’s subre-lations with a second, flat relation bycreating an in-memory hash table withthe flat relation. If the flat relation islarger than memory, memory-sized seg-ments are loaded one at a time, and thenested relation is scanned repeatedly.Since an outer tuple of the nested rela-tion might have matches in multiple seg-ments of the flat relation, a final mergingpass is required. This join algorithm isreminiscent of hash-division, the flat re-lation taking the role of the divisor andthe nested tuples replacing the quotienttable entries with their bit maps.

Sort-based join algorithms for nestedrelations require either flattening thenested tuples or scanning the sorted flatrelation for each nested tuple, somewhatreminiscent of naive division. Neither al-ternative seems very promising for large


Sort semantics and appropriate sort algorithms including duplicate removal and grouping have been considered by Saake et al. [1989] and Kuespert et al. [1989]. Other researchers have focused on storage and retrieval methods for nested relations and operations possible with single scans [Dadam et al. 1986; Deppisch et al. 1986; Deshpande and Van Gucht 1988; Hafez and Ozsoyoglu 1988; Ozsoyoglu and Wang 1992; Scholl et al. 1987; Scholl 1988].

11.2 Temporal and Scientific Database Management

For a variety of reasons, management and manipulation of statistical, temporal, and scientific data are gaining interest in the database research community. Most work on temporal databases has focused on semantics and representation in data models and query languages [McKenzie and Snodgrass 1991; Snodgrass 1990]; some work has considered special storage structures, e.g., Ahn and Snodgrass [1988], Lomet and Salzberg [1990b], Rotem and Segev [1987], Severance and Lohman [1976], algebraic operators, e.g., temporal joins [Gunadhi and Segev 1991], and optimization of temporal queries, e.g., Gunadhi and Segev [1990], Leung and Muntz [1990; 1992], Segev and Gunadhi [1989]. While logical query algebras require extensions to accommodate time, only some storage structures and algorithms, e.g., multidimensional indices, differential files, and versioning, and the need for approximate selection and matching (join) predicates are new in the query execution algorithms for temporal databases.

A number of operators can be identified that both add functionality to database systems used to process scientific data and fit into the database query processing paradigm. DeWitt et al. [1991a] considered algorithms for join predicates that express proximity, i.e., join predicates of the form R.A − c1 < S.B < R.A + c2 for some constants c1 and c2. Such join predicates are very different from the usual use of relational join.

They do not reestablish relationships based on identifying keys but match data values that express a dimension in which distance can be defined, in particular, time. Traditionally, such join predicates have been considered non-equi-joins and were evaluated by a variant of nested-loops join. However, such "band joins" can be executed much more efficiently by a variant of merge-join that keeps a "window" of inner relation tuples in memory or by a variant of hash join that uses range partitioning and assigns some build tuples to multiple partition files. A similar partitioning model must be used for parallel execution, requiring multi-cast for some tuples. Clearly, these variants of merge-join and hash join will outperform nested loops for large inputs, unless the band is so wide that the join result approaches the Cartesian product.
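A minimal sketch of the merge-join variant follows, assuming both inputs arrive sorted on the band attributes and that c1 and c2 are nonnegative; the "window" of inner tuples is kept in a deque, and the attribute names are illustrative.

    from collections import deque

    # Band join with predicate  r["A"] - c1 < s["B"] < r["A"] + c2,
    # assuming R is sorted ascending on "A", S ascending on "B", and
    # c1, c2 >= 0.  A sliding window of S tuples is kept in memory.
    def band_join(R, S, c1, c2):
        window, s_iter = deque(), iter(S)
        s_next = next(s_iter, None)
        for r in R:
            lo, hi = r["A"] - c1, r["A"] + c2
            # admit S tuples that may match this or a later R tuple
            while s_next is not None and s_next["B"] < hi:
                window.append(s_next)
                s_next = next(s_iter, None)
            # drop S tuples that can never match again (lo only grows)
            while window and window[0]["B"] <= lo:
                window.popleft()
            for s in window:
                yield (r, s)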

For storage and management of the massive amounts of data resulting from scientific experiments, database techniques are very desirable. Operators for processing time series in scientific databases are based on an interpretation of a stream between operators not as a set of items (as in most database applications) but as a sequence in which the order of items in a stream has semantic meaning. For example, data reduction using interpolation as well as extrapolation can be performed within the stream paradigm. Similarly, digital filtering [Hamming 1977] also fits the stream-processing protocol very easily.

Interpolation, extrapolation, and digital filtering were implemented in the Volcano system with a single algorithm (physical operator) to verify this fit, including their optimization and parallelization [Graefe and Wolniewicz 1992; Wolniewicz and Graefe 1993]. Another promising candidate is visualization of single-dimensional arrays such as time series.

Problems that do not fit the stream paradigm, e.g., many matrix operations such as transformations used in linear algebra, Laplace or Fast Fourier Transform, and slab (multidimensional subarray) extraction, are not as easy to integrate into database query processing systems.


Some of them seem to fit better into the storage management subsystem than into the algebraic query execution engine. For example, slab extraction has been integrated into the NetCDF storage and access software [Rew and Davis 1990; Unidata 1991]. However, it is interesting to note that sorting is a suitable algorithm for permuting the linear representation of a multidimensional array, e.g., to modify the hierarchy of dimensions in the linearization (row- vs. column-major linearization). Since the final position of all elements can be predicted from the beginning of the operation, such "sort" algorithms can be based on merging or range partitioning (which is yet another example of the duality of sort- and hash- (partitioning-) based data manipulation algorithms).
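To illustrate why the destination of every element is known in advance, the following sketch permutes a row-major linearization of an n1 x n2 array into column-major order by sorting on a computed target position; since that position is a pure function of the source offset, range partitioning or merging could compute the same permutation. The function name and representation are illustrative.

    # Permute a row-major linearization of an n1 x n2 array into
    # column-major order.  The element at row-major offset i lies at
    # (row, col) = divmod(i, n2); its column-major offset is col*n1 + row.
    def row_to_column_major(elements, n1, n2):
        keyed = []
        for i, x in enumerate(elements):
            row, col = divmod(i, n2)
            keyed.append((col * n1 + row, x))
        keyed.sort()                      # could equally be range-partitioned
        return [x for _, x in keyed]

    # Example: a 2 x 3 array stored as [a11, a12, a13, a21, a22, a23]
    # becomes [a11, a21, a12, a22, a13, a23].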

11.3 Object-Oriented Database Systems

Research into query processing for extensible and object-oriented systems has been growing rapidly in the last few years. Most proposals or implementations use algebras for query processing, e.g., Albert [1991], Cluet et al. [1989], Graefe and Maier [1988], Guo et al. [1991], Mitschang [1989], Shaw and Zdonik [1989a; 1989b; 1990], Straube and Ozsu [1989], Vandenberg and DeWitt [1991], Yu and Osborn [1991]. These algebras resemble relational algebra in the sense that they focus on bulk data types but are generalized to support operations on arrays, lists, etc., user-defined operations (methods) on instances, heterogeneous bulk types, and inheritance. The use of algebras permits several important conclusions. First, naive execution models that execute programs as if all data were in memory are not the only alternative. Second, data manipulation operators can be designed and implemented that go beyond data retrieval and permit some amount of data reduction, aggregation, and even inference. Third, algebraic execution techniques including the stream paradigm and parallel execution can be used in object-oriented data models and database systems.

Fourth, algebraic optimization techniques will continue to be useful.

Associative operations are an important part of all object-oriented algebras because they permit reducing large amounts of data to the interesting subset of the database suitable for further consideration and processing. Thus, set-processing and set-matching algorithms as discussed earlier in this survey will be found in object-oriented systems, implemented in such a way that they can operate on heterogeneous sets. The challenge for query optimization is to map a complex query involving complex behavior and complex object structures to primitives available in a query execution engine. Translating an initial request with abstract data types and encapsulated behavior coded in a computationally complete language into an internal form that both captures the entire query's semantics and allows effective query optimization is still an open research issue [Daniels et al. 1991; Graefe and Maier 1988].

Beyond the associative indices discussed earlier, object-oriented systems can also benefit from special relationship indices, i.e., indices that contain condensed information about interobject references. In principle, these index structures are similar to join indices [Valduriez 1987] but can be generalized to support multiple levels of referencing. Examples of indices in object-oriented database systems include the work of Maier and Stein [1986] in the Gemstone object-oriented database system product, Bertino [1990; 1991] and Bertino and Kim [1989] in the Orion project, and Kemper et al. [1991] and Kemper and Moerkotte [1990a; 1990b] in the GOM project. At this point, it is too early to decide which index structures will be the most useful because the entire field of query processing in object-oriented systems is still developing rapidly, from query languages to algebra design, algorithm repertoire, and optimization techniques. Other areas of intense current research interest are buffer management and clustering of objects on disk.


One of the big performance penalties in object-oriented database systems is "pointer chasing" (using OID references), which may involve object faults and disk read operations at widely scattered locations, also called "goto's on disk." In order to reduce I/O costs, some systems use what amounts to main-memory databases or map the entire database into virtual memory. For systems with an explicit database on disk and an in-memory buffer, there are various techniques to detect object faults; some commercial object-oriented database systems use hardware mechanisms originally conceived and implemented for virtual-memory systems. While such hardware support makes fault detection faster, it does not address the problem of expensive I/O operations. In order to reduce actual I/O cost, read-ahead and planned buffering must be used. Palmer and Zdonik [1991] recently proposed keeping access patterns or sequences and activating read-ahead if accesses equal or similar to a stored pattern are detected. Another recent proposal for efficient assembly of complex objects uses a window (a small set) of open references and resolves, at any point of time, the most convenient one by fetching this object or component from disk; this approach has shown dramatic improvements in disk seek times and makes complex object retrieval more efficient and more independent of object clustering [Keller et al. 1991]. Policies and mechanisms for efficient parallel complex object assembly are an important challenge for the developers of next-generation object-oriented database management systems [Maier et al. 1992].

11.4 More Control Operators

The exchange operator used for parallel query processing is not a normal operator in the sense that it does not manipulate, select, or transform data. Instead, the exchange operator provides control of query processing in a way orthogonal to what a query does and what algorithms it uses. Therefore, we call it a meta-operator or control operator.

There are several other control operators that can be used in database query processing, and we survey them briefly in this section.

In situations in which an intermediate result is used repeatedly, e.g., a nested-loops join with a composite inner input, either the intermediate result is derived many times, or it is saved in a temporary file during its first derivation and then retrieved from this file while serving subsequent requests. This situation arises not only with nested-loops join but also with other algorithms, e.g., sort-based universal quantification [Smith and Chang 1975]. Thus, it might be useful to encapsulate this functionality in a new algorithm, which we call the store-and-scan operator.

The store-and-scan operator permits three generalizations. First, if the first consumption of the intermediate result might actually not need it entirely, e.g., a nested-loops semi-join which terminates each inner scan after the first match, the operator should be switched to derive only the necessary data items (which implies leaving the input plan ready to produce more data later) or to save the entire intermediate result in the temporary file right away in order to permit release of all resources in the subplan. Second, subsequent scans might permit starting not at the beginning of the temporary file but at some later point. This version is useful if many duplicates exist in the inputs of one-to-one matching algorithms based on merge-join. Third, in some execution strategies for correlated SQL subqueries, the plan corresponding to the inner block is executed once for each tuple in the outer block. The tuples of the outer block provide different correlation values, although each value may occur repeatedly. In order to ensure that the inner plan is executed only once for each outer correlation value, the store-and-scan operator could retain information about which part of its temporary file corresponds to which correlation value and restrict each scan appropriately.
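A minimal sketch of the basic operator follows; an in-memory list stands in for the temporary file, the first scan is served directly from the input plan while spooling, and later scans are served from the spool without re-executing the subplan. The class and method names are illustrative, and the three generalizations above are omitted.

    # Store-and-scan: materialize an intermediate result on first use,
    # replay it on subsequent uses.  A list stands in for the temporary file.
    class StoreAndScan:
        def __init__(self, input_iterator):
            self._input = input_iterator
            self._spool = []            # items produced so far
            self._exhausted = False

        def scan(self):
            for item in list(self._spool):   # replay what is already spooled
                yield item
            while not self._exhausted:       # then continue the input plan
                try:
                    item = next(self._input)
                except StopIteration:
                    self._exhausted = True
                    return
                self._spool.append(item)
                yield item

    inner = StoreAndScan(iter(range(3)))     # e.g., inner input of a join
    first_scan = list(inner.scan())          # derives and spools the input
    second_scan = list(inner.scan())         # served from the spool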

Another use of a temporary file is to support common subexpressions, which can be executed efficiently with an operator that passes the result of a common subexpression to multiple consumers, as mentioned briefly in the section on the architecture of query execution engines.


The problem is that multiple consumers, typically demand-driven and demand-driving their inputs, will request items of the common subexpression result at different times or rates. The two standard solutions are either to execute the common subexpression into a temporary file and let each consumer scan this file at will, or to determine which consumer will be the first to require the result of the common subexpression, to execute the common subexpression as part of this consumer, and to create a file with the common subexpression result as a by-product of the first consumer's execution. Instead, we suggest a new meta-operator, which we call the split operator, to be placed at the top of the common subexpression's plan and which can serve multiple consumers at their own paces. It automatically performs buffering to account for different paces, uses temporary disk space if the discrepancies are too wide, and is suitably parameterized to permit both standard solutions described above.
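A minimal sketch of such a split operator under simplifying assumptions: consumers are purely demand-driven and identified by an index, items are buffered only until the slowest consumer has read them, and the spill to temporary disk mentioned above is omitted. All names are illustrative.

    from collections import deque

    # Split: serve one common-subexpression result to several consumers
    # at their own paces.  Items are kept until every consumer has read them.
    class Split:
        def __init__(self, source, n_consumers):
            self._source = source            # iterator over the common subexpression
            self._buffer = deque()
            self._base = 0                   # absolute index of buffer[0]
            self._pos = [0] * n_consumers    # next absolute index per consumer

        def next(self, consumer):
            while self._pos[consumer] - self._base >= len(self._buffer):
                self._buffer.append(next(self._source))   # StopIteration ends the stream
            item = self._buffer[self._pos[consumer] - self._base]
            self._pos[consumer] += 1
            while self._buffer and min(self._pos) > self._base:
                self._buffer.popleft()       # every consumer has passed this item
                self._base += 1
            return item

    s = Split(iter("abc"), n_consumers=2)
    first = [s.next(0), s.next(0)]           # consumer 0 runs ahead: ['a', 'b']
    second = [s.next(1)]                     # consumer 1 catches up later: ['a']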

In query processing systems, data flow is usually paced or driven from the top, the consumer. The leftmost diagram of Figure 36 shows the control flow of normal iterators. (Notice that the arrows in Figure 36 show the control flow; the data flow of all diagrams in Figure 36 is assumed to be upward. In data-driven data flow, control and data flows point in the same direction; in demand-driven data flow, their directions oppose each other.) However, in real-time systems that capture data from experiments, this approach may not be realistic because the data source, e.g., a satellite receiver, has to be able to unload data as they arrive. In such systems, data-driven operators, shown in the second diagram of Figure 36, might be more appropriate. To combine the algorithms implemented and used for query processing with such real-time data capture requirements, one could design data flow translation control operators.

The first such operator, which we call the active scheduler, can be used between a demand-driven producer and a data-driven consumer. In this case, neither operator will schedule the other; therefore, an active scheduler that demands items from the producer and forces them onto the consumer will glue these two operators together. An active-scheduler schematic is shown in the third diagram of Figure 36. The opposite case, a data-driven producer and a demand-driven consumer, has two operators, each trying to schedule the other one. A second flow control operator, called the passive scheduler, can be built that accepts procedure calls from either neighbor and resumes the other neighbor in a coroutine fashion to ensure that the resumed neighbor will eventually demand the item the scheduler just received. The final diagram of Figure 36 shows the control flow of a passive scheduler. (Notice that this case is similar to the bracket model of parallel operator implementations discussed earlier, in which an operating system or networking software layer had to be placed between data manipulation operators and perform buffering and flow control.)
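The active scheduler is little more than a driving loop. A sketch follows, assuming the demand-driven producer is an iterator and the data-driven consumer exposes push-style consume and close entry points; both interfaces are illustrative. A passive scheduler would instead be written with coroutines, which this sketch does not show.

    # Active scheduler: demand items from a demand-driven producer and
    # force them onto a data-driven consumer; neither schedules the other.
    def active_scheduler(producer, consume, close):
        for item in producer:    # demand the next item
            consume(item)        # push it into the consumer
        close()                  # signal end of stream

    # Example with trivial stand-ins for producer and consumer:
    received = []
    active_scheduler(iter([1, 2, 3]), received.append, lambda: received.append("eof"))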

Finally, for very complex queries, it might be useful to break the data flow between operators at some point, for two reasons. First, if too many operators run in parallel, contention for memory or temporary disks might be too intense, and none of the operators will run as efficiently as possible. A long series of hybrid hash joins in a right-deep query plan illustrates this situation. Second, due to the inherent error in selectivity estimation during query optimization [Ioannidis and Christodoulakis 1991; Mannino et al. 1988], it might be worthwhile to execute only a subset of a plan, verify the correctness of the estimation, and then resume query processing with another few steps. After a few processing steps have been performed, their result size and other statistical properties such as minimum and maximum and approximate number of duplicate values can be easily determined while saving the result on temporary disk.


Figure 36. Operators, schedulers, and control flow. (The figure shows four control-flow diagrams: a standard iterator, a data-driven operator, an active scheduler, and a passive scheduler; data flow in all diagrams is upward.)


In principle, this was done in Ingres' original optimization method called Decomposition, except that Ingres performed only one operation at a time before optimizing the remaining query [Wong and Youssefi 1976; Youssefi and Wong 1979]. We propose alternating more slowly between optimization and execution, i.e., to perform a "reasonable" number of steps between optimizations, where reasonable may be three to ten selections and joins depending on errors and error propagation in selectivity estimation. Stopping the data flow and resuming after additional optimization could very well turn out to be the most reliable technique for very large complex queries.

Implementation of this technique could be embodied in another control operator, the choose-plan operator first described in Graefe and Ward [1989]. Its current implementation executes zero or more subplans and then invokes a decision function provided by the optimizer that decides which of multiple equivalent plans to execute depending on intermediate result statistics, current system load, and run-time values of query parameters unknown at optimization time. Unfortunately, further research is needed to develop techniques for placing such operators in very complex query plans. One possible purpose of the subplans executed prior to a decision could be to sample the values in the database. A very interesting research direction quantifies the value of sampling by analyzing the resulting improvement in the decision quality [Seppi et al. 1989].


12. ADDITIONAL TECHNIQUES FOR PERFORMANCE IMPROVEMENT

In this section, we consider some additional techniques that have been proposed in the literature or used in real systems and that have not been discussed in earlier sections of this survey. In particular, we consider precomputation, data compression, surrogate processing, bit vector filters, and specialized hardware. Recently proposed techniques that have not been fully developed are not discussed here, e.g., "racing" equivalent plans and terminating the ones that seem not competitive after some small amount of time.

12.1 Precomputation and Derived Data

It is trivial to answer a query for which the answer is already known; therefore, precomputation of frequently requested information is an obvious idea. The problem with keeping preprocessed information in addition to base data is that it is redundant and must be invalidated or maintained on updates to the base data.

Precomputation and derived data such as relational views are duals. Thus, concepts and algorithms designed for one will typically work well for the other. The main difference is the database user's view: precomputed data are typically used after a query optimizer has determined that they can be used to answer a user query against the base data, while derived data are known to the user and can be queried without regard to the fact that they actually must be derived at run-time from stored base data.


Not surprisingly, since derived data are likely to be referenced and requested by users and application programs, precomputation of derived data has been investigated both for relational and object-oriented data models.

Indices are the simplest form of precomputed data since they are a redundant and, in a sense, precomputed selection. They represent a compromise between a nonredundant database and one with complex precomputed data because they can be maintained relatively efficiently.

The next more sophisticated form of precomputation are inversions as provided in System R's "0th" prototype [Chamberlin et al. 1981a], view indices as analyzed by Roussopoulos [1991], two-relation join indices as proposed by Valduriez [1987], or domain indices as used in the ANDA project (called VALTREE there) [Deshpande and Van Gucht 1988] in which all occurrences of one domain (e.g., part number) are indexed together, and each index entry contains a relation identification with each record identifier. With join or domain indices, join queries can be answered very fast, typically faster than using multiple single-relation indices. On the other hand, single-clause selections and updates may be slightly slower if there are more entries for each indexed key.

For binary operators, there is a spectrum of possible levels of precomputation (as suggested by J. A. Blakeley), explored predominantly for joins. The simplest form of precomputation in support of binary operations is individual indices, e.g., clustering B-trees that ensure and maintain sorted relations. On the other extreme are completely materialized join results. Intermediate levels are pointer-based joins [Shekita and Carey 1990] (discussed earlier in the section on matching) and join indices [Valduriez 1987]. For each form of precomputed result, the required redundant data structures must be maintained each time the underlying base data are updated, and larger retrieval speedup might be paid for with larger maintenance overhead.

Babb [1982] explored storing only results of outer joins, but not the normalized base relations, in the content-addressable file store (CAFS) and called this encoding join normal form. Blakeley et al. [1989], Blakeley and Martin [1990], Larson and Yang [1985], Medeiros and Tompa [1985], Tompa and Blakeley [1988], and Yang and Larson [1987] investigated storing and maintaining materialized views in relational database systems. Their hope was to speed relational query processing by using derived data, possibly without storing all base data, and ensuring that their maintenance overhead would be less than their benefits in faster query processing. For example, Blakeley and Martin [1990] demonstrated that for a single join there exists a large range of retrieval and update mixes in which materialized views outperform both join indices and hybrid hash join. This investigation should be extended, however, for more complex queries, e.g., joins of three and four inputs, and for queries in object-oriented systems and emerging database applications.

Hanson [1987] compared query modification (i.e., query evaluation from base relations) against the maintenance costs of materialized views and considered in particular the cost of immediate versus deferred updates. His results indicate that for modest update rates, materialized views provide better system performance. Furthermore, for modest selectivities of the view predicate, deferred-view maintenance using differential files [Severance and Lohman 1976] outperforms immediate maintenance of materialized views. However, Hanson also did not include multi-input joins in his study.

Sellis [1987] analyzed caching of results in a query language called Quel+ (which is a subset of Postquel [Stonebraker et al. 1990b]) over a relational database with procedural (QUEL) fields. He also considered the case of limited space on secondary storage used for caching query results, and replacement algorithms for query results in the cache when the space becomes insufficient.


Links between records (pointers of some sort, e.g., record, tuple, or object identifiers) are another form of precomputation. Links are particularly effective for system performance if they are combined with clustering (assignment of records to pages). Database systems for the hierarchical and network models have used physical links and clustering, but supported basically only queries and operations that were "precomputed" in this way. Some researchers tried to overcome this restriction by building relational query engines on top of network systems, e.g., Chen and Kuck [1984], Rosenthal and Reiner [1985], Zaniolo [1979]. However, with performance improvements in the relational world, these efforts seem to have been abandoned. With the advent of extensible and object-oriented database management systems, combining links and ad hoc query processing might become a more interesting topic again. A recent effort for an extensible-relational system is Starburst's pointer-based joins discussed earlier [Haas et al. 1990; Shekita and Carey 1990].

In order to ensure good performance for its extensive rule-processing facilities, Postgres uses precomputation and caching of the action parts of production rules [Stonebraker 1987; Stonebraker et al. 1990a; 1990b]. For automatic maintenance of such derived data, persistent "invalidation locks" are stored for detection of invalid data after updates to the base data.

Finally, the Cactis project focused on maintenance of derived data in object-oriented environments [Hudson and King 1989]. The conclusions of this project include that incremental maintenance coupled with a fairly simple adaptive clustering algorithm is an efficient way to propagate updates to derived data.

One issue that many investigations into materialized views ignore is the fact that many queries do not require views in their entirety. For example, if a relational student information system includes a view that computes each student's grade point average from the enrollment data, most queries using this view will select only a single student, not all students at the school. Thus, if the view definition is merged into the query before query optimization, as discussed in the introduction, only one student's grade point average, not the entire view, will be computed for each query. Obviously, the treatment of this difference will affect an analysis of costs and benefits of materialized views.

12.2 Data Compression

A number of researchers have investigated the effect of compression on database systems and their performance [Graefe and Shapiro 1991; Lynch and Brownrigg 1981; Ruth and Keutzer 1972; Severance 1983]. There are two types of compression in database systems. First, the amount of redundancy can be reduced by prefix and suffix truncation, in particular in indices, and by use of encoding tables (e.g., color combination "9" means "red car with black interior"). Second, compression schemes can be applied to attribute values, e.g., adaptive Huffman coding or Ziv-Lempel methods [Bell et al. 1989; Lelewer and Hirschberg 1987]. This type of compression can be exploited most effectively in database query processing if all attributes of the same domain use the same encoding, e.g., the "Part-No" attributes of data sets representing parts, orders, shipments, etc., because common encodings permit comparisons without decompression.
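As a small sketch of the second kind of scheme, consider a hypothetical per-domain dictionary encoder: because every "Part-No" attribute is encoded through the same code table, equality comparisons, joins, and duplicate removal can operate directly on the shorter codes, and only the final (small) result needs to be decoded. The class, attribute names, and data are illustrative.

    # One shared dictionary per domain; codes are comparable for equality
    # without decompression (code order need not follow value order).
    class DomainDictionary:
        def __init__(self):
            self._codes, self._values = {}, []

        def encode(self, value):
            if value not in self._codes:
                self._codes[value] = len(self._values)
                self._values.append(value)
            return self._codes[value]

        def decode(self, code):
            return self._values[code]

    part_no = DomainDictionary()
    orders    = [{"Part-No": part_no.encode(p)} for p in (4711, 2345, 9876)]
    shipments = [{"Part-No": part_no.encode(p)} for p in (2345, 9876, 1111)]
    # Join on compressed values; decode only the final, small result.
    matches = [part_no.decode(o["Part-No"])
               for o in orders for s in shipments
               if o["Part-No"] == s["Part-No"]]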

Most obviously, compression can reduce the amount of disk space required for a given data set. Disk space savings have a number of ramifications on I/O performance. First, the reduced data space fits into a smaller physical disk area; therefore, the seek distances and seek times are reduced. Second, more data fit into each disk page, track, and cylinder, allowing more intelligent clustering of related objects into physically near locations. Third, the unused disk space can be used for disk shadowing to increase reliability, availability, and I/O performance [Bitton and Gray 1988].


Fourth, compressed data can be transferred faster to and from disk. In other words, data compression is an effective means to increase disk bandwidth (not by increasing physical transfer rates but by increasing the information density of transferred data) and to relieve the I/O bottleneck found in many high-performance database management systems [Boral and DeWitt 1983]. Fifth, in distributed database systems and in client-server situations, compressed data can be transferred faster across the network than uncompressed data. Uncompressed data require either more network time or a separate compression step. Finally, retaining data in compressed form in the I/O buffer allows more records to remain in the buffer, thus increasing the buffer hit rate and reducing the number of I/Os. The last three points are actually more general. They apply to the entire storage hierarchy of tape, disk, controller caches, local and remote main memories, and CPU caches.

For query processing, compression can be exploited far beyond improved I/O performance because decompression can often be delayed until a relatively small data set is presented to the user or an application program. First, exact-match comparisons can be performed on compressed data. Second, projection and duplicate removal can be performed without decompressing data. The situation for aggregation is a little more complex since the attribute on which arithmetic is performed typically must be decompressed. Third, neither the join attributes nor other attributes need to be decompressed for most joins. Since keys and foreign keys are from the same domain, and if compression schemes are fixed for each domain, a join on compressed key values will give the same results as a join on normal uncompressed key values. It might seem unusual to perform a merge-join in the order of compressed values, but it nonetheless is possible and will produce correct results.

There are a number of benefits from processing compressed data. First, materializing output records is faster because records are shorter, i.e., less copying is required. Second, for inputs larger than memory, more records fit into memory. In hybrid hash join and duplicate removal, for instance, the fraction of the file that can be retained in the hash table and thus be joined without any I/O is larger. During sorting, the number of records in memory and thus per run is larger, leading to fewer runs and possibly fewer merge levels. Third, and very interestingly, skew is less likely to be a problem. The goal of compression is to represent the information with as few bits as possible. Therefore, each bit in the output of a good compression scheme has close to maximal information content, and bit columns seen over the entire file are unlikely to be skewed. Furthermore, bit columns will not be correlated. Thus, the compressed key values can be used to create a hash value distribution that is almost guaranteed to be uniform, i.e., optimal for hashing in memory and partitioning to overflow files as well as to multiple processors in parallel join algorithms.

We believe that data compression is undervalued in current query processing research, mostly because it was not realized that many operations can often be performed faster on compressed data than on uncompressed data, and we hope that future database management systems make extensive use of data compression. Considering the current growth rates in CPU and I/O performance, it might even make sense to exploit data compression on the fly for hash table overflow resolution.

12.3 Surrogate Processing

Another very useful technique in query processing is the use of surrogates for intermediate results. A surrogate is a reference to a data item, be it a logical object identifier (OID) used in object-oriented systems or a physical record identifier (RID) or location. Instead of keeping a complete record in memory, only the fields that are used immediately are kept, and the remainder is replaced by a surrogate, which has in principle the same effect as compression.


While this technique has traditionally been used to reduce main-memory requirements, it can also be employed to improve board- and CPU-level caching [Nyberg et al. 1993].

The simplest case in which surrogate processing can be exploited is in avoiding copying. Consider a relational join; when two items satisfy the join predicate, a new tuple is created from the two original ones. Instead of copying the data fields, it is possible to create only a pair of RIDs or pointers to the original records if they are kept in memory. If a record is 50 times larger than an RID, e.g., 8 vs. 400 bytes, the effort spent on copying bytes is reduced by that factor.
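A minimal sketch of this idea for an in-memory hash join follows: matching records are reported as pairs of record identifiers (list positions stand in for RIDs), and no data fields are copied until a later operator dereferences the surviving surrogates. The names and fields are illustrative.

    # Join R and S on `key`, emitting surrogate pairs (rid_r, rid_s)
    # instead of copied result tuples.
    def rid_hash_join(R, S, key):
        index = {}
        for rid_s, s in enumerate(S):
            index.setdefault(s[key], []).append(rid_s)
        for rid_r, r in enumerate(R):
            for rid_s in index.get(r[key], []):
                yield (rid_r, rid_s)

    # A later operator dereferences only the surviving pairs, e.g.:
    # [(R[i]["name"], S[j]["price"]) for i, j in rid_hash_join(R, S, "part")]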

Copying is already a major part of the CPU time spent in many query processing systems, but it is becoming more expensive for two reasons. First, many modern CPU designs and implementations are optimized for an impressive number of instructions per second but do not provide the performance improvements in mundane tasks such as moving bytes from one memory location to another [Ousterhout 1990]. Second, many modern computer architectures employ multiple CPUs accessing shared memory over one bus because this design permits fast and inexpensive parallelism. Although alleviated by local caches, bus contention is the major bottleneck and limitation to scalability in shared-memory parallel machines. Therefore, reductions in memory-to-memory copying in database query execution engines permit higher useful degrees of parallelism in shared-memory machines.

A second example of surrogate processing was mentioned earlier in connection with indices. To evaluate a conjunction with multiple clauses, each of which is supported by an index, it might be useful to perform an intersection of RID-lists to reduce the number of records needed before actual data are accessed.

A third case is the use of indices and RIDs to evaluate joins, for example, in the query processing techniques used in Ingres [Kooi 1980; Kooi and Frankforth 1982] and IBM's hybrid join [Cheng et al. 1991] discussed in the section on binary matching.

Surrogate processing has also been used in parallel systems, in particular distributed-memory implementations, to reduce network traffic. For example, Lorie and Young [1989] used RIDs to reduce the communication time in parallel sorting by sending (sort key, RID) pairs to a central site, which determines each record's global rank, and then repartitioning and merging records very quickly by their rank alone without further data comparisons.

Another form of surrogates are encodings with lossy compressions, such as superimposed coding used for efficient access methods [Bloom 1970; Faloutsos 1985; Sacks-Davis and Ramamohanarao 1983; Sacks-Davis et al. 1987]. Berra et al. [1987] and Chung and Berra [1988] considered indexing and retrieval organizations for very large (relational) knowledge bases and databases. They employed three techniques: concatenated code words (CCWs), superimposed code words (SCWs), and transformed inverted lists (TILs). TILs are normal index structures for all attributes of a relation that permit answering conjunctive queries by bitwise anding. CCWs and SCWs use hash values of all attributes of a tuple and either concatenate such hash values or bitwise or them together. The resulting code words are then used as keys in indices. In their particular architecture, Berra et al. and Chung and Berra consider associative memory and optical computing to search efficiently through such indices, although conventional software techniques could be used as well.

12.4 Bit Vector Filtering

In parallel systems, bit vector filters have been used very effectively for what we call here "probabilistic semi-joins." Consider a relational join to be executed on a distributed-memory machine with repartitioning of both input relations on the join attribute. It is clear that communication effort could be reduced if only the tuples that actually contribute to the join result, i.e., those with a match in the other relation, needed to be shipped across the network.


To accomplish this, distributed database systems were designed to make extensive use of semi-joins, e.g., SDD-1 [Bernstein et al. 1981].

A faster alternative to semi-joins, which, as discussed earlier, require basically the same computational effort as natural joins, is the use of bit vector filters [Babb 1979], also called Bloom filters [Bloom 1970]. A bit vector filter with N bits is initialized with zeroes, and all items in the first (preferably the smaller) input are hashed on their join key to 0, ..., N − 1. For each item, one bit in the bit vector filter is set to one; hash collisions are ignored. After the first join input has been exhausted, the bit vector filter is used to filter the second input. Data items of the second input are hashed on their join key value, and only items for which the bit is set to one can possibly participate in the join. There is some chance for false passes in the case of collisions, i.e., items of the second input pass the bit vector filter although they actually do not participate in the join, but if the bit vector filter is sufficiently large, the number of false passes is very small.

In general, if the number of bits is about twice the number of items in the first input, bit vector filters are very effective. If many more bits are available, the bit vector filter can be split into multiple subvectors, or multiple bits can be set for each item using multiple hash functions, reducing the number of false passes. Babb [1979] analyzed the use of multiple bit vector filters in detail.
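A minimal single-hash-function sketch of such a filter and of its use as a probabilistic semi-join follows: the build input sets bits, and probe items whose bit is zero are discarded before any further (e.g., network or join) cost is incurred. The data and attribute name are illustrative, and a production filter would typically use more bits or several hash functions.

    # Bit vector (Bloom) filter with one hash function.
    class BitVectorFilter:
        def __init__(self, n_bits):
            self.n_bits = n_bits
            self.bits = bytearray((n_bits + 7) // 8)

        def insert(self, key):
            h = hash(key) % self.n_bits
            self.bits[h // 8] |= 1 << (h % 8)

        def may_contain(self, key):
            h = hash(key) % self.n_bits
            return bool(self.bits[h // 8] & (1 << (h % 8)))

    build_input = [{"key": k} for k in (3, 7, 42)]
    probe_input = [{"key": k} for k in range(100)]

    bvf = BitVectorFilter(n_bits=64)
    for t in build_input:
        bvf.insert(t["key"])
    survivors = [t for t in probe_input if bvf.may_contain(t["key"])]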

The Gamma relational database machine demonstrated the effectiveness of bit vector filtering in relational join processing on distributed-memory hardware [DeWitt et al. 1986; 1988; 1990; Gerber 1986]. When scanning and redistributing the build input of a join, the Gamma machine creates a bit vector filter that is then distributed to the scanning sites of the probe input. Based on the bit vector filter, a large fraction of the probe tuples can often be discarded before incurring network costs. The decision whether to create one bit vector filter for the entire build input or to create a bit vector filter for each of the join sites depends on the space available for bit vector filters and the communication costs for bit arrays.

Mullin [1990] generalized bit vector filtering to sending bit vector filters back and forth between sites. In his words, "the central notion is to send small but optimally information-dense Bloom filters between sites as long as these filters serve to reduce the volume of tuples which need to be transmitted by more than their own size." While this procedure achieves very low communication costs, it ignores the I/O cost at each site if the reduced relations must be scanned from disk in each step. Qadah [1988] discussed a limited form of this idea using only two bit vector filters and augmenting it with bit vector filter compression.

While bit vector filtering is typically used only for joins, it is equally applicable to all other one-to-one match operators, including semi-join, outer join, intersection, union, and difference. For operators that include nonmatching items in their output, e.g., outer joins and unions, part of the result can be obtained before network transfer, based solely on the bit vector filter. For one-to-one match operations other than join, e.g., outer join and union, bit vector filters can also be used, but the algorithm must be modified to ensure that items that do not pass the bit vector filter are properly included in the operation's output stream. For parallel relational division (universal quantification), bit vector filtering can be used on the divisor attributes to eliminate most of the dividend items that do not pertain to any divisor item. Thus, our earlier assessment that universal quantification can be performed as fast as existential quantification (a semi-join of dividend and divisor relations) even extends to special techniques used to boost join performance.

Bit vector filtering can also be exploited in sequential systems.


Consider a merge-join with sort operations on both inputs. If the bit vector filter is built based on the input of the first sort, i.e., the bit vector filter is completed when all data have reached the first sort operator, this bit vector filter can then be used to reduce the input into the second sort operator on the (presumably larger) second input. Depending on how the sort operation is organized into phases, it might even be possible to create a second bit vector filter from the second merge-join input and use it to reduce the first join input while it is being merged.

For sequential hash joins, bit vector filters can be used in two ways. First, they can be used to filter items of the probe input using a bit vector filter created from items of the build input. This use of bit vector filters is analogous to bit vector filter usage in parallel systems and for merge-join. In Rdb/VMS and DB2, bit vector filters are used when intersecting large RID lists obtained from multiple indices on the same table [Antoshenkov 1993; Mohan et al. 1990]. Second, new bit vector filters can be created and used for each partition in each recursion level. In the Volcano query-processing system, the operator implementing hash join, intersection, etc. uses the space used as anchor for each bucket's linked list for a small bit vector filter after the bucket has been spilled to an overflow file. Only those items from the probe input that pass the bit vector filter are written to the probe overflow file. This technique is used in each recursion level of overflow resolution. Thus, during recursive partitioning, relatively small bit vector filters can be used repeatedly and at increasingly finer granularity to remove items from the probe input that do not contribute to the join result. Bit vectors could also be used to remove items from the build input using bit vector filters created from the probe input; however, since the probe input is presumed the larger input and hash collisions in the bit vector filter would make the filter less effective, it may or may not be an effective technique.

With some modifications of the standard algorithm, bit vector filters can also be used in hash-based duplicate removal. Since bit vector filters can only determine safely which item has not been seen yet, but not which item has been seen yet (due to possible hash collisions), bit vector filters cannot be used in the most direct way in hash-based duplicate removal. However, hash-based duplicate removal can be modified to become similar to a hash join or actually a hash-based set intersection. Consider a large file R and a partitioning fan-out F. First, R is partitioned into F/2 partitions. For each partition, two files are created; thus, this step uses the entire fan-out to create a total of F files. Within each partition, a bit vector filter is used to determine whether an item belongs into the first or the second file of that partition. If an item is guaranteed to be unique, i.e., there is no earlier item indicated in the bit vector filter, the item is assigned to the first file, and a bit in the bit vector filter is set. Otherwise, the item is assigned to the partition's second file. At the end of this partitioning step, there are F files, half of them guaranteed to be free of duplicate data items. The possible size of the duplicate-free files is limited by the size of the bit vector filters; therefore, this step should use the largest bit vector filters possible. After the first partitioning step, each partition's pair of files is intersected using the duplicate-free file as probe input. Recall that duplicate removal for a join's build input can be accomplished easily and inexpensively while building the in-memory hash table. Remaining duplicates with one copy in the duplicate-free (probe) file and another copy in the other file (the build input) in the hash table are found when the probe input is matched against the hash table. This algorithm performs very well if many output items depend on only one input item and if the bit vectors are quite large. In that case, the duplicate-free partition files are very large, and the smaller partition file with duplicates can be processed very efficiently.
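A sketch of the first partitioning step under simplifying assumptions (one byte per filter bit and Python's built-in hash, both illustrative): an item whose bit is still unset is provably unseen and goes to its partition's duplicate-free file; all other items go to the companion file, which is later intersected against the duplicate-free file with the latter as probe input.

    # First step of bit-vector-based duplicate removal: split each of the
    # F/2 partitions into a duplicate-free file and a "maybe duplicate" file.
    def partition_for_duplicate_removal(items, fan_out, filter_bits):
        n_parts = fan_out // 2
        unique_files = [[] for _ in range(n_parts)]
        maybe_files = [[] for _ in range(n_parts)]
        filters = [bytearray(filter_bits) for _ in range(n_parts)]
        for item in items:
            p = hash(item) % n_parts                 # partitioning hash
            b = hash((p, item)) % filter_bits        # filter hash
            if filters[p][b] == 0:
                filters[p][b] = 1
                unique_files[p].append(item)         # provably not seen before
            else:
                maybe_files[p].append(item)          # possible duplicate (or collision)
        return unique_files, maybe_files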

In order to find and exploit a dual in the realm of sorting and merge-join to bit vector filtering in each recursion level of recursive hash join, sorting of multiple inputs must be divided into individual merge levels.


In other words, for a merge-join of inputs R and S, the sort activity should switch back and forth between R and S, level by level, creating and using a new bit vector filter in each merge level. Unfortunately, even with a sophisticated sort implementation that supports this use of bit vector filters in each merge level, recursive hybrid hash join will make more effective use of bit vector filters because the inputs are partitioned, thus reducing the number of distinct values in each partition in each recursion level.

12.5 Specialized Hardware

Specialized hardware was considered by a number of researchers, e.g., in the forms of hardware sorters and logic-per-track selection. A relatively recent survey of database machine research is given by Su [1988]. Most of this research was abandoned after Boral and DeWitt's [1983] influential analysis that compared CPU and I/O speeds and their trends. They concluded that I/O is most likely the bottleneck in future high-performance query execution, not processing. Therefore, they recommended moving from research on custom processors to techniques for overcoming the I/O bottleneck, e.g., by use of parallel readout disks, disk caching and read-ahead, and indexing to reduce the amount of data to be read for a query. Other investigations also came to the conclusion that parallelism is no substitute for effective storage structures and query execution algorithms [DeWitt and Hawthorn 1981; Neches 1984]. An additional very strong argument against custom VLSI processors is that microprocessor speed is currently improving so rapidly that it is likely that, by the time a special hardware component has been designed, fabricated, tested, and integrated into a larger hardware and software system, the next generation of general-purpose CPUs will be available and will be able to execute database functions programmed in a high-level language at the same speed as the specialized hardware component.

Furthermore, it is not clear what specialized hardware would be most beneficial to design, in particular in light of today's directions toward extensible database systems and emerging database application domains. Therefore, we do not favor specialized database hardware modules beyond general-purpose processing, storage, and communication hardware dedicated to executing database software.

SUMMARY AND OUTLOOK

Database management systems provide three essential groups of services. First, they maintain both data and associated metadata in order to make databases self-contained and self-explanatory, at least to some extent, and to provide data independence. Second, they support safe data sharing among multiple users as well as prevention and recovery of failures and data loss. Third, they raise the level of abstraction for data manipulation above the primitive access commands provided by file systems with more or less sophisticated matching and inference mechanisms, commonly called the query language or query-processing facility. We have surveyed execution algorithms and software architectures used in providing this third essential service.

Query processing has been explored extensively in the last 20 years in the context of relational database management systems and is slowly gaining interest in the research community for extensible and object-oriented systems. This is a very encouraging development, because if these new systems have increased modeling power over previous data models and database management systems but cannot execute even simple requests efficiently, they will never gain widespread use and acceptance. Databases will continue to manage massive amounts of data; therefore, efficient query and request execution will continue to represent both an important research direction and an important criterion in investment decisions in the "real world." In other words, new database management systems should provide greater modeling power (this is widely accepted and intensely pursued), but also competitive or better performance than previous systems.


We hope that this survey will contribute to the use of efficient and parallel algorithms for query processing tasks in new database management systems.

A large set of query processing algorithms has been developed for relational systems. Sort- and hash-based techniques have been used for physical-storage design, for associative index structures, for algorithms for unary and binary matching operations such as aggregation, duplicate removal, join, intersection, and division, and for parallel query processing using hash- or range-partitioning. Additional techniques such as precomputation and compression have been shown to provide substantial performance benefits when manipulating large volumes of data. Many of the existing algorithms will continue to be useful for extensible and object-oriented systems, and many can easily be generalized from sets of tuples to more general pattern-matching functions. Some emerging database applications will require new operators, however, both for translation between alternative data representations and for actual data manipulation.

The most promising aspect of current research into database query processing for new application domains is that the concept of a fixed number of parametrized operators, each performing a part of the required data manipulation and each passing an intermediate result to the next operator, is versatile enough to meet the new challenges. This concept permits specification of database queries and requests in a logical algebra as well as concise representation of database programs in a physical algebra. Furthermore, it allows algebraic optimizations of requests, i.e., optimizing transformations of algebra expressions and cost-sensitive translations of logical into physical expressions. Finally, it permits pipelining between operators to exploit parallel computer architectures and partitioning of stored data and intermediate results for most operators, in particular for operators on sets but also for other bulk types such as arrays, lists, and time series.

We can hope that much of the existing relational technology for query optimization and parallel execution will remain relevant and that research into extensible optimization and parallelization will have a significant impact on future database applications such as scientific data. For database management systems to become acceptable for new application domains, their performance must at least match that of the file systems currently in use. Automatic optimization and parallelization may be crucial contributions to achieving this goal, in addition to the query execution techniques surveyed here.

ACKNOWLEDGMENTS

José A. Blakeley, Cathy Brand, Rick Cole, Diane Davison, David Helman, Ann Linville, Bill McKenna, Gail Mitchell, Shengsong Ni, Barb Peters, Leonard Shapiro, the students of "Readings in Database Systems" at the University of Colorado at Boulder (Fall 1991) and "Database Implementation Techniques" at Portland State University (Winter 1993), David Maier's weekly reading group at the Oregon Graduate Institute (Winter 1992), the anonymous referees, and the Computing Surveys editors Shamkant Navathe and Dick Muntz gave many valuable comments on earlier drafts of this survey, which have improved the paper very much.

This paper is based on research partially supported by the National Science Foundation with grants IRI-8996270, IRI-8912618, IRI-9006348, IRI-9116547, IRI-9119446, and ASC-9217394, ARPA with contract DAAB 07-91-C-Q518, Texas Instruments, Digital Equipment Corp., Intel Supercomputer Systems Division, Sequent Computer Systems, ADP, and the Oregon Advanced Computing Institute (OACIS).

REFERENCES

ADAM, N. R., AND WORTMANN, J. C. 1989. Security-control methods for statistical databases: A comparative study. ACM Comput. Surv. 21, 4 (Dec.), 515.
AHN, I., AND SNODGRASS, R. 1988. Partitioned storage for temporal databases. Inf. Syst. 13, 4, 369.


ALBERT, J. 1991. Algebraic properties of bag data types. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 211.
ANALYTI, A., AND PRAMANIK, S. 1992. Fast search in main memory databases. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 215.
ANDERSON, D. P., TZOU, S. Y., AND GRAHAM, G. S. 1988. The DASH virtual memory system. Tech. Rep. 88/461, Univ. of California, Berkeley, CS Division, Berkeley, Calif.
ANTOSHENKOV, G. 1993. Dynamic query optimization in Rdb/VMS. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York.
ASTRAHAN, M. M., BLASGEN, M. W., CHAMBERLIN, D. D., ESWARAN, K. P., GRAY, J. N., GRIFFITHS, P. P., KING, W. F., LORIE, R. A., MCJONES, P. R., MEHL, J. W., PUTZOLU, G. R., TRAIGER, I. L., WADE, B. W., AND WATSON, V. 1976. System R: A relational approach to database management. ACM Trans. Database Syst. 1, 2 (June), 97.
ASTRAHAN, M. M., SCHKOLNICK, M., AND WHANG, K. Y. 1987. Approximating the number of unique values of an attribute without sorting. Inf. Syst. 12, 1, 11.
ATKINSON, M. P., AND BUNEMAN, O. P. 1987. Types and persistence in database programming languages. ACM Comput. Surv. 19, 2 (June), 105.
BABB, E. 1982. Joined Normal Form: A storage encoding for relational databases. ACM Trans. Database Syst. 7, 4 (Dec.), 588.
BABB, E. 1979. Implementing a relational database by means of specialized hardware. ACM Trans. Database Syst. 4, 1 (Mar.), 1.
BAEZA-YATES, R. A., AND LARSON, P. A. 1989. Performance of B+-trees with partial expansions. IEEE Trans. Knowledge Data Eng. 1, 2 (June), 248.
BANCILHON, F., AND RAMAKRISHNAN, R. 1986. An amateur's introduction to recursive query processing strategies. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 16.
BARGHOUTI, N. S., AND KAISER, G. E. 1991. Concurrency control in advanced database applications. ACM Comput. Surv. 23, 3 (Sept.), 269.
BARU, C. K., AND FRIEDER, O. 1989. Database operations in a cube-connected multicomputer system. IEEE Trans. Comput. 38, 6 (June), 920.
BATINI, C., LENZERINI, M., AND NAVATHE, S. B. 1986. A comparative analysis of methodologies for database schema integration. ACM Comput. Surv. 18, 4 (Dec.), 323.
BATORY, D. S., BARNETT, J. R., GARZA, J. F., SMITH, K. P., TSUKUDA, K., TWICHELL, B. C., AND WISE, T. E. 1988a. GENESIS: An extensible database management system. IEEE Trans. Softw. Eng. 14, 11 (Nov.), 1711.
BATORY, D. S., LEUNG, T. Y., AND WISE, T. E. 1988b. Implementation concepts for an extensible data model and data language. ACM Trans. Database Syst. 13, 3 (Sept.), 231.
BAUGSTO, B., AND GREIPSLAND, J. 1989. Parallel sorting methods for large data volumes on a hypercube database computer. In Proceedings of the 6th International Workshop on Database Machines (Deauville, France, June 19-21).
BAYER, R., AND MCCREIGHT, E. 1972. Organization and maintenance of large ordered indices. Acta Informatica 1, 3, 173.
BECK, M., BITTON, D., AND WILKINSON, W. K. 1988. Sorting large files on a backend multiprocessor. IEEE Trans. Comput. 37, 7 (July), 769.
BECKER, B., SIX, H. W., AND WIDMAYER, P. 1991. Spatial priority search: An access technique for scaleless maps. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 128.
BECKMANN, N., KRIEGEL, H. P., SCHNEIDER, R., AND SEEGER, B. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 322.
BELL, T., WITTEN, I. H., AND CLEARY, J. G. 1989. Modelling for text compression. ACM Comput. Surv. 21, 4 (Dec.), 557.
BENTLEY, J. L. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (Sept.), 509.
BERNSTEIN, P. A., AND GOODMAN, N. 1981. Concurrency control in distributed database systems. ACM Comput. Surv. 13, 2 (June), 185.
BERNSTEIN, P. A., GOODMAN, N., WONG, E., REEVE, C. L., AND ROTHNIE, J. B. 1981. Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst. 6, 4 (Dec.), 602.
BERNSTEIN, P. A., HADZILACOS, V., AND GOODMAN, N. 1987. Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading, Mass.
BERRA, P. B., CHUNG, S. M., AND HACHEM, N. I. 1987. Computer architecture for a surrogate file to a very large data/knowledge base. IEEE Comput. 20, 3 (Mar.), 25.
BERTINO, E. 1991. An indexing technique for object-oriented databases. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 160.
BERTINO, E. 1990. Optimization of queries using nested indices. In Lecture Notes in Computer Science, vol. 416. Springer-Verlag, New York.
BERTINO, E., AND KIM, W. 1989. Indexing techniques for queries on nested objects. IEEE Trans. Knowledge Data Eng. 1, 2 (June), 196.
BHIDE, A. 1988. An analysis of three transaction processing architectures. In Proceedings of the International Conference on Very Large Data Bases (Los Angeles, Aug.). VLDB Endowment, 339.

ACM Computing Surveys, Vol 25, No. 2, June 1993

Page 88: Query evaluation techniques for large databasesweb.eecs.umich.edu/~jag/eecs584/papers/queryproc_graefe.pdf · efficient algorithms and software archi-tectures of database query execution

160 * Goetz Graefe

BHIDE, A,, AND STONEBRAKER, M. 1988. Aperfor-mance comparison of two architectures for fast

transaction processing. In proceedings of theIEEE Conference on Data Englneermg. IEEE,

New York, 536.

BITTON, D., AND DEWITT, D. J. 1983. Duplicate

record elimination in large data files. ACMTrans. Database Syst. 8, 2 (June), 255,

BITTON-FRIEDLAND, D. 1982. Design, analysis,

and implementation of parallel external sortingalgorithms Ph D. Thesis, Univ. of Wiscon-

sin—Madison.

BITTON, D , AND GRAY, J 1988. Dmk shadowing,In Proceedings of the International Conference

on Very Large Data Bases. (Los Angeles, Aug.).VLDB Endowment, 331

BITTON, D., DEWITT, D J., HSIAO, D. K. AND MENON,J. 1984 A taxonomy of parallel sorting.ACM Comput. SurU. 16, 3 (Sept.), 287.

BITTON, D., HANRAHAN, M. B., AND TURBWILL, C.1987, Performance of complex queries in main

memory database systems. In Proceedt ngs of

the IEEE Con ference on Data Engineering.IEEE, New York.

BLAKELEY, J. A., AND MARTIN. N. L. 1990 Joinindex, materialized view, and hybrid hash-join:A performance analysis In Proceedings of theIEEE Conference on Data Englneermg IEEE,

New York.

BM.KELEY, J. A, COBURN, N., AND LARSON, P. A.1989. Updating derived relations: Detectingirrelevant and autonomously computable up-

dates. ACM Trans Database Syst 14,3 (Sept.),369

BLAS~EN, M., AND ESWARAN, K. 1977. Storage andaccess in relational databases. IBM Syst. J.16, 4, 363.

BLASGEN, M., AND ESWARAN, K. 1976. On theevaluation of queries in a relational databasesystem IBM Res. Rep RJ 1745, IBM, San Jose,

Calif.

BLOOM, B H. 1970 Space/time tradeoffs in hashcoding with allowable errors Cornmzm. ACM13, 7 (July), 422.

BORAL, H. 1988. Parallehsm in Bubba. In Pro-

ceedings of the International Symposium onDatabases m Parallel and Distr~buted Systems

(Austin, Tex., Dec.), 68.

BORAL, H., AND DEWITT, D. J. 1983. Databasemachines: An idea whose time has passed? Acritique of the future of database machines. InProceedings of the International Workshop onDatabase Machines. Reprinted in Parallel Ar-

ch I tectures for Database Systems. IEEE Com-puter Society Press, Washington, D. C., 1989.

BORAL, H., ALEXANDER, W., CLAY, L., COPELAND, G.,DANFORTH, S., FRANKLIN, M., HART, B., SMITH,M., AND VALDURIEZ, P. 1990. PrototypingBubba, A Highly Parallel Database System.IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.),4.

BRATBERGSENGEN, K, 1984. Hashing methods

and relational algebra operations. In Proceed-ings of the International Conference on Very

Large Data Bases. VLDB Endowment, 323.

BROWN, K. P., CAREY, M. J., DEWITT, D, J., MEHTA,

M., AND NAUGHTON, J. F. 1992. Schedulingissues for complex database workloads, Com-

puter Science Tech. Rep. 1095, Univ. ofWisconsin—Madison.

BUCHERAL, P., THEVERIN, J. M., AND VAL~URIEZ, P.1990. Efficient main memory data manage-ment using the DBGraph storage model. InProceedings of the International Conference onVery Large Data Bases. VLDB Endowment, 683.

BUNEMAN, P., AND FRANKEL, R. E. 1979. FQL—A

Functional Query Language. In Proceedings of

ACM SIGMOD Conference. ACM, New York,

52.

BUNEMAN, P,, FRANKRL, R, E., AND NIKHIL, R. 1982

An implementation technique for database

query languages, ACM Trans. Database Syst.

7, 2 (June), 164.

CACACE, F., CERI, S., .AND HOUTSMA, M A. W. 1992.A survey of parallel execution strategies for

transitive closures and logic programs To ap-pear in Dwtr. Parall. Databases.

CAREY, M. J,, DEWITT, D. J., RICHARDSON, J. E., ANDSHEKITA, E. J. 1986. Object and file manage-

ment in the EXODUS extensible database sys-tem. In proceedings of the International

Conference on Very Large Data Bases. VLDBEndowment, 91.

CARLIS, J. V. 1986. HAS: A relational algebra

operator, or divided M not enough to conquer.In Proceedings of the IEEE Conference on DataEngineering. IEEE, New York, 254.

CARTER, J. L., AND WEGMAN, M. N. 1979, Univer-

sal classes of hash functions. J. Cornput. Syst.Scl. 18, 2, 143.

CHAMB~RLIN, D. D., ASTRAHAN, M M., BLASGEN, M.W., GsA-I-, J. N., KING, W. F., LINDSAY, B. G.,LORI~, R., MEHL, J. W., PRICE, T. G,, PUTZOLO,F , SELINGER, P. G., SCHKOLNIK, M., SLUTZ, D.R., TKAIGER, I. L , WADE, B. W., AND YOST, R, A.198 la. A history and evaluation of System R.

Cornmun ACM 24, 10 (Oct.), 632.

CHAMBERLAIN, D. D,, ASTRAHAN, M. M., KING, W F.,

LORIE, R. A,, MEHL, J. W., PRICE, T. G.,

SCHKOLNIK, M., SELINGER, P G., SLUTZ, D. R.,WADE, B. W., AND YOST, R. A. 1981b. SUp-port for repetitive transactions and ad hocqueries in System R. ACM Trans. DatabaseSyst. 6, 1 (Mar), 70.

CHEN, P. P. 1976. The entity relationship model—Toward a umtied view of data. ACM Trans.Database Syst. 1, 1 (Mar.), 9.

C!HEN, H , AND KUCK, S. M. 1984. Combining re-lational and network retrieval methods. Inproceedings of ACM SIGMOD Conference,ACM, New York, 131,

CHEN, M. S., Lo, M. L., Yu, P. S., AND YOUNG, H C1992. Using segmented right-deep trees for

ACM Computmg Surveys, Vol. 25, No. 2, June 1993

Page 89: Query evaluation techniques for large databasesweb.eecs.umich.edu/~jag/eecs584/papers/queryproc_graefe.pdf · efficient algorithms and software archi-tectures of database query execution

Query Evaluation Techniques * 161

the execution of pipelined hash joins. In Pro-ceedings of the International Conference on Very

Large Data Bases (Vancouver, BC, Canada).

VLDB Endowment, 15.

CHEN~, J., HADERLE, D., HEDGES, R., IYER, B. R.,

MESSINGER, T., MOHAN, C., ANI) WANG, Y.1991. An efficient hybrid join algorithm: ADB2 prototype. In Proceedings of the IEEE

Conference on Data Engineering. IEEE, NewYork, 171.

CHERITON, D. R., GOOSEN, H. A., mn BOYLE, P. D.1991 Paradigm: A highly scalable shared-

memory multicomputer. IEEE Comput. 24, 2

(Feb.), 33.

CHIU, D. M., AND Ho, Y, C. 1980. A methodologyfor interpreting tree queries into optimal semi-join expressions. In Proceedings of ACM SIG-

MOD Conference. ACM, New York, 169.

CHOU, H, T. 1985. Buffer management ofdatabase systems. Ph.D. thesis, Univ. ofWisconsin—Madison.

CHOU, H. T., AND DEWITT, D. J. 1985. An evalua-

tion of buffer management strategies for rela-tional database systems. In Proceedings of theInternational Conference on Very Large Data

Bases (Stockholm, Sweden, Aug.), VLDB En-

dowment, 127. Reprinted in Readings inDatabase Systems. Morgan-Kaufman, San

Mateo, Calif., 1988.

CHRISTODOULAKIS, S. 1984. Implications of cer-tain assumptions in database performanceevaluation. ACM Trans. Database Syst. 9, 2

(June), 163.

CHUNG, S. M., AND BERRA, P. B. 1988. A compari-son of concatenated and superimposed code

word surrogate files for very large data/knowl-edge bases. In Lecture Notes in Computer Sci-

ence, vol. 303. Springer-VerIag, New York, 364.

CLUET, S., DIZLOBEL, C., LECLUSF,, C., AND RICHARD,P. 1989. Reloops, an algebra based querylanguage for an object-oriented database sys-

tem. In Proceedings of the Ist International

Conference on Deductive and Object-OrzentedDatabases (Kyoto, Japan, Dec. 4-6).

COMER, D. 1979. The ubiquitous B-tree. ACMComput. Suru. 11, 2 (June), 121.

COPELAND, G., ALEXANDER, W., BOUGHTER, E., ANDKELLER, T. 1988. Data placement in Bubba.In Proceedings of ACM SIGMOD Conference.

ACM, New York, 99.

DADAM, P., KUESPERT, K., ANDERSON, F., BLANKEN,H,, ERBE, R., GUENAUER, J., LUM, V., PISTOR, P.,

AND WALCH, G. 1986. A database manage-ment prototype to support extended NF 2 rela-

tions: An integrated view on flat tables andhierarchies. In Proc.edmgs of ACM SIGMODConference. ACM, New York, 356.

DMWELS, D., AND NG, P. 1982. Distributed querycompilation and processing in R*. IEEEDatabase Eng. 5, 3 (Sept.).

DmIELS, S., GRAEFE, G., KELLER, T., MAIER, D.,

SCHMIDT, D., AND VANCE, B. 1991. Query op-timization in revelation, an overview. IEEEDatabase Eng. 14, 2 (June).

DAVIDSON, S. B., GARCIA-M• LINA, H., AND SKEEN, D.

1985. Consistency in partitioned networks,ACM Comput. Surv. 17, 3 (Sept.), 341.

DAVIS, D. D. 1992. Oracle’s parallel punch forOLTP. Datamation (Aug. 1), 67.

DAVISON, W. 1992. Parallel index building in In-formix OnLine 6.0. In Proceedings of ACM

SIGMOD Conference. ACM, New York, 103.

DEPPISCH, U., PAUL, H. B., AND SCHEK, H. J. 1986.A storage system for complex objects. In Pro-

ceedings of the International Workshop on Ob-ject-Or[ented Database Systems (Pacific Grove,Calif., Sept.), 183.

DESHPANDE, V., AND LARSON, P. A. 1992. The de-

sign and implementation of a parallel join algo-rithm for nested relations on shared-memorymultiprocessors. In Proceedings of the IEEE

Conference on Data Engineering. IEEE, NewYork, 68,

DESHPANDE, V., AND LARSON, P, A, 1991. An alge-

bra for nested relations with support for nullsand aggregates. Computer Science Dept., Univ.of Waterloo, Waterloo, Ontario, Canada.

DESHPANDE, A.j AND VAN GUCHT, D. 1988. An im-

plementation for nested relational databases.

In proceedings of the I?lternattonal Conference

on Very Large Data Bases (Los Angeles, Calif.,Aug. ) VLDB Endowment, 76.

DEWITT, D. J. 1991. The Wisconsin benchmark:Past, present, and future. In Database andTransaction Processing System PerformanceHandbook. Morgan-Kaufman, San Mateo, Calif,

DEWITT, D. J., AND GERBER, R. H. 1985, Multi-processor hash-based Join algorithms. In F’ro-

ceedings of the International Conference on Very

Large Data Bases (Stockholm, Sweden, Aug.).VLDB Endowment, 151.

DEWITT, D. J., AND GRAY, J. 1992. Parallel

database systems: The future of high-perfor-mance database systems. Commun. ACM 35, 6

(June), 85.

DEWITT, D. J., AND HAWTHORN, P. B. 1981. Aperformance evaluation of database machine

architectures. In Proceedings of the Interns-tional Conference on Very Large Data Bases

(Cannes, France, Sept.). VLDB Endowment,199.

DEWITT, D. J., GERBER, R. H., GRAEFE, G., HEYTENS,M. L., KUMAR, K. B., AND MURALI%ISHNA, M.1986. GAMMA-A high performance dataflowdatabase machine. In Proceedings of the Inter-national Conference on Very Large Data Bases.VLDB Endowment, 228. Reprinted in Read-ings in Database Systems. Morgan-Kaufman,San Mateo, Calif., 1988.

DEWITT, D. J., GHANDEHARIZADEH, S., AND SCHNEI-DER, D. 1988. A performance analysis of theGAMMA database machine. In Proceedings of

ACM Computing Surveys, Vol. 25, No. 2, June 1993

Page 90: Query evaluation techniques for large databasesweb.eecs.umich.edu/~jag/eecs584/papers/queryproc_graefe.pdf · efficient algorithms and software archi-tectures of database query execution

162 * Goetz Graefe

ACM SIGMOD Conference. ACM, New York,

350

DEWITT, D. J., GHANDEHARIZADEH, S., SCHNEIDER,

D., BRICKER, A., HSIAO, H. I.. AND RASMUSSEN,R. 1990. The Gamma database machine pro-

ject. IEEE Trans. Knowledge Data Eng. 2, 1

(Mar.). 44.

DEWITT, D. J., KATZ, R., OLKEN, F., SHAPIRO, L.,

STONEBRAKER, M., AND WOOD, D. 1984. Im-plementation techniques for mam memorydatabase systems In ProceecZings of ACM SIG-

MOD Conference. ACM, New York, 1.

DEWITT, D., NAUGHTON, J., AND BURGER, J. 1993.

Nested loops revisited. In proceedings of Paral-

lel and Distributed In forrnatlon Systems [San

Diego, Calif., Jan.).

DEWITT, D. J., NAUGHTON, J. E., AND SCHNEIDER,

D. A. 1991a. An evaluation of non-equijoin

algorithms. In Proceedings of the InternationalConference on Very Large Data Bases

(Barcelona, Spain). VLDB Endowment, 443.

DEWITT, D., NAUGHTON, J., AND SCHNEIDER, D.1991b Parallel sorting on a shared-nothing

architecture using probabilistic splitting. InProceedings of the International Conference onParallel and Dlstnbuted Information Systems

(Miami Beach, Fla , Dec.)

DOZIER, J. 1992. Access to data in NASAS

Earth observing systems. In proceedings of

ACM SIGMOD Conference. ACM, New York, 1.

EFFELSBERG, W., AND HAERDER, T. 1984. Princi-

ples of database buffer management. ACM

Trans. Database Syst. 9, 4 (Dee), 560.

EN~ODY, R. J., AND Du, H. C. 1988. Dynamic

hashing schemes ACM Comput. Suru. 20, 2

(June), 85.

ENGLERT, S., GRAY, J., KOCHER, R., AND SHAH, P.

1989. A benchmark of nonstop SQL release 2demonstrating near-linear speedup and scaleup

on large databases. Tandem Computers Tech.Rep. 89,4, Tandem Corp., Cupertino, Calif.

EPSTEIN, R. 1979. Techniques for processing of

aggregates in relational database systems

UCB/ERL Memo. M79/8, Univ. of California,Berkeley, Calif.

EPSTEIN, R., AND STON~BRAE~R, M. 1980. Analy-

sis of dmtrlbuted data base processing strate-gies. In Proceedings o} the International Con-ference on Very Large Data Bases UWmtreal,Canada, Oct.). VLDB Endowment, 92.

EPSTEIN, R , STONE~RAKER, M., AND WONG, E. 1978

Distributed query processing in a relationaldatabase system. In Proceedings of ACM

SIGMOD Conference. ACM, New York.

FAGIN, R,, NUNERGELT, J., PIPPENGER, N., AND

STRONG, H. R. 1979. Extendible hashing: Afast access method for dynamic tiles. ACMTrans. Database Syst. 4, 3 (Sept.), 315.

FALOUTSOS, C 1985. Access methods for text.ACM Comput. Suru. 17, 1 (Mar.), 49.

FALOUTSOS, C,, NG, R., AND SELLIS, T. 1991. Pre-

dictive load control for flexible buffer alloca-

tion. In Proceedings of the InternationalConference on Very Large Data Bases

(Barcelona, Spain). VLDB Endowment, 265.

FANG, M. T,, LEE, R. C. T., AND CHANG, C, C. 1986.The idea of declustering and its applications. InProceedings of the International Conference on

Very Large Data Bases (Kyoto, Japan, Aug.).VLDB Endowmentj 181.

FINKEL, R. A., AND BENTLEY, J. L. 1974. Quadtrees: A data structure for retrieval on compos-ite keys. Acts Inform atzca 4, 1,1.

FREYTAG, J. C., AND GOODMAN, N. 1989 On thetranslation of relational queries into iteratme

programs. ACM Trans. Database Syst. 14, 1(Mar.), 1.

FUSHIMI, S., KITSUREGAWA, M., AND TANAKA, H.1986. An overview of the system software of a

parallel relational database machine GRACE.In Proceedings of the Intern atLona! Conference

on Very Large Data Bases Kyoto, Japan, Aug.).ACM, New York, 209.

GALLAIRE, H., MINRER, J., AND NICOLAS, J M. 1984.Logic and databases A deductive approach.

ACM Comput. Suru. 16, 2 (June), 153

GERBER, R. H. 1986. Dataflow query process-ing using multiprocessor hash-partitioned

algorithms. Ph.D. thesis, Univ. of

Wisconsin—Madison.

GHANDEHARIZADEH, S., AND DEWITT, D. J. 1990.Hybrid-range partitioning strategy: A newdeclustermg strategy for multiprocessor

database machines. In Proceedings of the Inter-national Conference on Very Large Data Bases

(Brisbane, Australia). VLDB Endowment, 481.

GOODMAN, J. R., AND WOEST, P. J 1988. TheWisconsin Multicube: A new large-scale cache-coherent multiprocessor. Computer ScienceTech Rep. 766, Umv. of Wisconsin—Madison

GOUDA, M. G.. AND DAYAL, U. 1981. Optimalsemijoin schedules for query processing in local

distributed database systems. In Proceedingsof ACM SIGMOD Conference. ACM, New York,164,

GRAEFE, G. 1993a. Volcano, An extensible andparallel dataflow query processing system.IEEE Trans. Knowledge Data Eng. To bepublished.

GRAEFE, G. 1993b. Performance enhancementsfor hybrid hash join Available as ComputerScience Tech. Rep. 606, Univ. of Colorado,Boulder.

GRAEFE, G. 1993c Sort-merge-join: An ideawhose time has passed? Revised in PortlandState Univ. Computer Science Tech. Rep. 93-4.

GRAEFII, G. 1991. Heap-filter merge join A newalgorithm for joining medium-size inputs. IEEETrans. Softw. Eng. 17, 9 (Sept.). 979.

GRAEFE, G. 1990a. Parallel external sorting inVolcano. Computer Science Tech. Rep. 459,Umv. of Colorado, Boulder.

ACM Computmg Surveys, Vol. 25, No 2, June 1993

Page 91: Query evaluation techniques for large databasesweb.eecs.umich.edu/~jag/eecs584/papers/queryproc_graefe.pdf · efficient algorithms and software archi-tectures of database query execution

Query Evaluation Techniques ● 163

GRAEFE, G. 1990b. Encapsulation of parallelismin the Volcano query processing system. Inproceedings of ACM SIGMOD Conference,ACM, New York, 102.

GRAEFE) G. 1989. Relational division: Four algo-rithms and their performance. In Proceedings

of the IEEE Conference on Data Engineering.IEEE, New York, 94.

GRAEFE, G., AND COLE, R. L. 1993. Fast algo-rithms for universal quantification in large

databases. Portland State Univ. and Univ. of

Colorado at Boulder.

GRAEFE, G., AND DAVISON, D. L. 1993. Encapsula-tion of parallelism and architecture-indepen-

dence in extensible database query processing.IEEE Trans. Softw. Eng. 19, 7 (July).

GRAEFE, G., AND DEWITT, D. J. 1987. TheEXODUS optimizer generator, In Proceedings

of ACM SIGMOD Conference. ACM, New York,

160.

GRAEFE, G., .miI) MAIER, D. 1988. Query opti-mization in object-oriented database systems:A prospectus, In Advances in Object-OrientedDatabase Systems, vol. 334. Springer-Verlag,

New York, 358.

GRAEFE, G., AND MCKRNNA, W. J. 1993. The Vol-cano optimizer generator: Extensibility and ef-

ficient search. In Proceedings of the IEEE Con-

ference on Data Engineering. IEEE, New York.

GRAEIW, G., AND SHAPIRO, L. D. 1991. Data com-pression and database performance. In Pro-ceedings of the ACM/IEEE-Computer Science

Symposium on Applied Computing.

ACM\ IEEE, New York.

GRAEFE, G., AND WARD, K. 1989. Dynamic query

evaluation plans. In Proceedings of ACM

SIGMOD Conference. ACM, New York, 358.

GRAEEE,G., ANDWOLNIEWICZ,R. H. 1992. Alge-braic optimization and parallel execution ofcomputations over scientific databases. In Pro-

ceedings of the Workshop on Metadata Manage-ment in Sclentrfzc Databases (Salt Lake City,

Utah, Nov. 3-5).

GRAEFE, G., COLE, R. L., DAVISON, D. L., MCKENNA,

W. J., AND WOLNIEWICZ, R. H. 1992. Extensi-ble query optimization and parallel execution

in Volcano. In Query Processing for AduancedDatabase Appkcatlons. Morgan-Kaufman, San

Mateo, Calif.

GRAEEE, G., LHVWLLE, A., AND SHAPIRO, L. D. 1993.Sort versus hash revisited. IEEE Trans.Knowledge Data Eng. To be published.

GRAY, J. 1990. A census of Tandem system avail-

ability between 1985 and 1990, TandemComputers Tech. Rep. 90.1, Tandem Corp.,Cupertino, Calif.

GWY, J., AND PUTZOLO, F. 1987. The 5 minuterule for trading memory for disc accesses andthe 10 byte rule for trading memory for CPUtime. In Proceedings of ACM SIGMOD Confer-ence. ACM, New York, 395.

GWY) J., AND REUTER, A. 1991. Transaction Pro-

cessing: Concepts and Techniques. Morgan-Kaufman, San Mateo, Calif,

GRAY, J., MCJONES, P., BLASGEN, M., LINtEAY, B.,

LORIE, R., PRICE, T., PUTZOLO, F., AND TRAIGER,I. 1981. The recovery manager of the Sys-

tem R database manager. ACM Comput, Sw-u.13, 2 (June), 223.

GRUENWALD, L., AND EICH, M, H. 1991. MMDB

reload algorithms. In Proceedings of ACMSIGMOD Conference, ACM, New York, 397.

GUENTHER, O., AND BILMES, J. 1991. Tree-based

access methods for spatial databases: Imple-mentation and performance evaluation. IEEETrans. Knowledge Data Eng. 3, 3 (Sept.), 342.

GUIBAS, L., AND SEI)GEWICK, R. 1978. A dichro-matic framework for balanced trees. In Pro-

ceedings of the 19th SymposL urn on the Founda

tions of Computer Science.

GUNADHI, H., AND SEGEW, A. 1991. Query pro-

cessing algorithms for temporal intersectionjoins. In Proceedings of the IEEE Conference onData Engtneermg. IEEE, New York, 336.

GITNADHI, H., AND SEGEV) A. 1990. A frameworkfor query optimization in temporal databases.

In Proceedings of the 5th Zntcrnatzonal Confer-

ence on Statistical and Scten tific DatabaseManagement.

GUNTHER, O. 1989. The design of the cell tree:

An object-oriented index structure for geomet-ric databases. In Proceedings of the IEEE Con-

ference on Data Engineering. IEEE, New York,

598.

GUNTHER, O., AND WONG, E. 1987 A dual space

representation for geometric data. In Proceed-

ings of the International Conference on VeryLarge Data Bases (Brighton, England, Aug.).

VLDB Endowment, 501.

Guo, M., SLT, S. Y. W., AND LAM, H. 1991. An

association algebra for processing object-

oriented databases. In proceedings of the IEEE

Conference on Data Engmeermg, IEEE, NewYork, 23.

GUTTMAN, A. 1984. R-Trees: A dynamic index

structure for spatial searching. In Proceedingsof ACM SIGMOD Conference. ACM, New York,47. Reprinted in Readings in Database Sys-tems. Morgan-Kaufman, San Mateo, Ccdif.,1988.

Hfis, L., CHANG, W., LOHMAN, G., MCPHERSON, J.,

WILMS, P. F., LAPIS, G., LINDSAY, B., PIRAHESH,H., CAREY, M. J., AND SHEKITA, E. 1990.

Starburst mid-flight: As the dust clears. IEEETrans. Knowledge Data Eng. 2, 1 (Mar.), 143.

Hfis, L., FREYTAG, J. C., LOHMAN, G., ANDPIRAHESH, H. 1989. Extensible query pro-cessing in Starburst. In Proceedings of ACMSIGMOD Conference. ACM, New York, 377.

H.%+s, L. M., SELINGER, P. G., BERTINO, E., DANI~LS,D., LINDSAY, B., LOHMAN, G., MASUNAGA, Y.,Mom, C., NG, P., WILMS, P., AND YOST, R.

ACM Computing Surveys, Vol. 25, No. 2, June 1993

Page 92: Query evaluation techniques for large databasesweb.eecs.umich.edu/~jag/eecs584/papers/queryproc_graefe.pdf · efficient algorithms and software archi-tectures of database query execution

164 ● Goetz Graefe

1982. R*: A research project on distributedrelational database management. IBM Res. Di-vision, San Jose, Calif.

HAERDER, T., AND REUTER, A. 1983. Principles oftransaction-oriented database recovery. ACM

Comput. Suru. 15, 4 (Dec.).

HAFEZ, A., AND OZSOYOGLU, G. 1988. Storage

structures for nested relations. IEEE Database

Eng. 11, 3 (Sept.), 31.

HAGMANN, R. B. 1986. An observation ondatabase buffering performance metrics. In

Proceedings of the International Conference onVery Large Data Bases (Kyoto, Japan, Aug.).

VLDB Endowment, 289.

HAMMING, R. W. 1977. Digital Filters. Prentice-

Hall, Englewood Cliffs, N.J.

HANSON, E. N. 1987. A performance analysis ofview materialization strategies. In Proceedings

of ACM SIGMOD Conference. ACM, New York,440.

HENRICH, A., Stx, H. W., AND WIDMAYER, P. 1989.The LSD tree: Spatial access to multi-

dimensional point and nonpoint objects. InProceedings of the International Conference on

Very Large Data Bases (Amsterdam, The

Netherlands). VLDB Endowment, 45.

HOEL, E. G., AND SAMET, H. 1992. A qualitative

comparison study of data structures for large

linear segment databases. In %oceedtngs of

ACM SIGMOD Conference. ACM. New York,205.

HONG, W., AND STONEBRAKER, M. 1993. Opti-

mization of parallel query execution plans inXPRS. Distrib. Parall. Databases 1, 1 (Jan.), 9.

HONG, W., AND STONEBRAKRR, M. 1991. Opti-

mization of parallel query execution plans inXPRS. In Proceedings of the International Con-

ference on Parallel and Distributed InformationSystems (Miami Beach, Fla., Dec.).

Hou, W. C., AND OZSOYOGLU, G. 1993. Processing

time-constrained aggregation queries in CASE-

DB. ACM Trans. Database Syst. To bepublished.

Hou, W. C., AND OZSOYOGLU, G. 1991. Statisticalestimators for aggregate relational algebra

queries. ACM Trans. Database Syst. 16, 4

(Dec.), 600.Hou, W. C., OZSOYOGLU, G., AND DOGDU, E. 1991.

Error-constrained COUNT query evaluation inrelational databases. In Proceedings of ACMSIGMOD Conference. ACM, New York, 278.

HSIAO, H. I., ANDDEWITT, D. J. 1990. Chaineddeclustering: A new availability strategy formultiprocessor database machines. In Proceed-ings of the IEEE Conference on Data Engineer-ing. IEEE, New York, 456.

HUA, K. A., m~ LEE, C. 1991. Handling dataskew in multicomputer database computers us-ing partition tuning. In Proceedings of the In-ternational Conference on Very Large DataBases (Barcelona, Spain). VLDB Endowment,525.

HUA, K. A., AND LEE, C. 1990. An adaptive data

placement scheme for parallel database com-

puter systems. In Proceedings of the Interna-tional Conference on Very Large Data Bases

(Brisbane, Australia). VLDB Endowment, 493,

HUDSON, S. E., AND KING, R. 1989. Cactis: A self-

adaptive, concurrent implementation of an ob-

ject-oriented database management system.ACM Trans. Database Svst. 14, 3 (Sept.), 291,

HULL, R., AND KING, R. 1987. Semantic database

modeling: Survey, applications, and research

issues. ACM Comput. Suru, 19, 3 (Sept.), 201.

HUTFLESZ, A., SIX, H. W., AND WIDMAYER) P. 1990.

The R-File: An efficient access structure forproximity queries. In proceedings of the IEEE

Conference on Data Engineering. IEEE, NewYork, 372.

HUTFLESZ, A., SIX, H. W., AND WIDMAYER, P. 1988a,

Twin grid files: Space optimizing accessschemes. In Proceedings of ACM SIGMODConference. ACM, New York, 183.

HUTFLESZ, A., Sm, H, W., AND WIDMAYER, P, 1988b.

The twin grid file: A nearly space optimal index

structure. In Lecture Notes m Computer Sci-ence, vol. 303, Springer-Verlag, New York, 352.

IOANNIDIS, Y. E,, AND CHRISTODOULAIiIS, S. 1991,

On the propagation of errors in the size of joinresults. In Proceedl ngs of ACM SIGMOD Con-

ference. ACM, New York, 268.

IYKR, B. R., AND DIAS, D. M. 1990, System issues

in parallel sorting for database systems. InProceedings of the IEEE Conference on DataEngineering. IEEE, New York, 246.

JAGADISH, H. V. 1991, A retrieval technique for

similar shapes. In Proceedings of ACM

SIGMOD Conference. ACM, New York, 208.

JARKE, M., AND KOCH, J. 1984. Query optimiza-

tion in database systems. ACM Cornput. Suru.

16, 2 (June), 111.

JARKE, M., AND VASSILIOU, Y. 1985. A framework

for choosing a database query language. ACMComput. Sure,. 17, 3 (Sept.), 313.

KATz, R. H. 1990. Towards a unified framework

for version modeling in engineering databases.ACM Comput. Suru. 22, 3 (Dec.), 375.

KATZ, R. H., AND WONG, E. 1983. Resolving con-flicts in global storage design through replica-tion. ACM Trans. Database S.vst. 8, 1 (Mar.),110.

KELLER, T., GRAEFE, G., AND MAIE~, D. 1991. Ef-

ficient assembly of complex objects. In Proceed-

ings of ACM SIGMOD Conference. ACM. NewYork, 148.

KEMPER, A., AND MOERKOTTE G. 1990a. Accesssupport in object bases. In Proceedings of ACMSIGMOD Conference. ACM, New York, 364.

KEMPER, A., AND MOERKOTTE, G. 1990b. Ad-vanced query processing in object bases usingaccess support relations. In Proceedings of theInternational Conference on Very Large Data

ACM Computing Surveys, Vol. 25, No. 2, June 1993

Page 93: Query evaluation techniques for large databasesweb.eecs.umich.edu/~jag/eecs584/papers/queryproc_graefe.pdf · efficient algorithms and software archi-tectures of database query execution

Query Evaluation Techniques 9 165

Bases (Brisbane, Australia). VLDB Endow-

ment, 290.

KEMPER, A.j AND WALI.RATH, M. 1987. An analy-

sis of geometric modeling in database systems.

ACM Comput. Suru. 19, 1 (Mar.), 47.KEMPER,A., KILGER, C., AND MOERKOTTE, G. 1991.

Function materialization in object bases. InProceedings of ACM SIGMOD Conference.

ACM, New York, 258.KRRNIGHAN,B. W., ANDRITCHIE,D. M. 1978. VW

C Programming Language. Prentice-Hall,Englewood Cliffs, N.J.

KIM, W. 1984. Highly available systems for

database applications. ACM Comput. Suru. 16,1 (Mar.), 71.

KIM, W. 1980. A new way to compute the prod-

uct and join of relations. In Proceedings ofACM SIGMOD Conference. ACM, New York,179.

KITWREGAWA,M., AND OGAWA,Y, 1990. Bucketspreading parallel hash: A new, robust, paral-lel hash join method for skew in the superdatabase computer (SDC). In Proceedings of

the International Conference on Very LargeData Bases (Brisbane, Australia). VLDB En-

dowment, 210.

KI’rSUREGAWA, M., NAKAYAMA, M., AND TAKAGI, M.1989a. The effect of bucket size tuning in the

dynamic hybrid GRACE hash join method. InProceedings of the International Conference on

Very Large Data Bases (Amsterdam, TheNetherlands). VLDB Endowment, 257.

KITSUREGAWA7 M., TANAKA, H., AND MOTOOKA, T.

1983. Application of hash to data base ma-chine and its architecture. New Gener. Com-

p~t. 1, 1, 63.

KITSUREGAWA, M., YANG, W., AND FUSHIMI, S.

1989b. Evaluation of 18-stage ~i~eline hard-. .ware sorter. In Proceedings of the 6th In tern a-

tional Workshop on Database Machines

(Deauville, France, June 19-21).

KLUIG, A. 1982, Equivalence of relational algebra

and relational calculus query languages having

aggregate functions. J. ACM 29, 3 (July), 699.

KNAPP, E. 1987. Deadlock detection in dis-

tributed databases. ACM Comput. Suru. 19, 4

(Dec.), 303.

KNUTH, D, 1973. The Art of Computer Program-ming, Vol. III, Sorting and Searching.Addison-Wesley, Reading, Mass.

KOLOVSON, C. P., ANTD STONEBRAKER M. 1991.

Segment indexes: Dynamic indexing tech-niques for multi-dimensional interval data. In

Proceechngs of ACM SIGMOD Conference.

ACM, New York, 138.

KC)OI. R. P. 1980. The optimization of queries inrelational databases. Ph.D. thesis, Case West-ern Reserve Univ., Cleveland, Ohio.

KOOI, R. P., AND FRANKFORTH, D. 1982. Query

optimization in Ingres. IEEE Database Eng. 5,3 (Sept.), 2.

KFUEGEL, H. P., AND SEEGER, B. 1988. PLOP-Hashing: A grid file without directory. In Pro-

ceedings of the IEEE Conference on Data Engi-neering. IEEE, New York, 369.

KRIEGEL, H. P., AND SEEGER, B. 1987. Multidi-mensional dynamic hashing is very efficient fornonuniform record distributions. In Proceed-

ings of the IEEE Conference on Data Enginee-ring. IEEE, New York, 10.

KRISHNAMURTHY,R., BORAL, H., AND ZANIOLO, C.1986. Optimization of nonrecursive queries. In

Proceedings of the International Conference onVery Large Data Bases (Kyoto, Japan, Aug.).

VLDB Endowment, 128.

KUESPERT, K.j SAAKE, G.j AND WEGNER, L. 1989.Duplicate detection and deletion in the ex-

tended NF2 data model. In Proceedings of the3rd International Conference on the Founda-tions of Data Organization and Algorithms

(Paris, France, June).

KUMAR, V., AND BURGER, A. 1991. Performancemeasurement of some main memory databaserecovery algorithms. In Proceedings of theIEEE Conference on Data Engineering. IEEE,New York, 436.

LAKSHMI,M. S., AND Yu, P. S. 1990. Effectiveness

of parallel joins. IEEE Trans. Knowledge DataEng. 2, 4 (Dec.), 410.

LAIWHMI, M. S., AND Yu, P. S, 1988. Effect ofskew on join performance in parallel architec-

tures. In Proceedings of the International Sym-posmm on Databases in Parallel and Dis-tributed Systems (Austin, Tex., Dec.), 107.

LANKA, S., AND MAYS, E. 1991. Fully persistentB + -trees. In Proceedings of ACM SIGMOD

Conference. ACM, New York, 426.LARSON,P. A. 1981. Analysis of index-sequential

files with overflow chaining. ACM Trans.Database Syst. 6, 4 (Dec.), 671.

LARSON, P., AND YANG, H. 1985. Computing

queries from derived relations. In Proceedingsof the International Conference on Very LargeData Bases (Stockholm, Sweden, Aug.). VLDB

Endowment, 259.

LEHMAN, T. J., AND CAREY, M. J. 1986. Queryprocessing in main memory database systems.In Proceedings of ACM SIGMOD Conference.ACM, New York, 239.

LELEWER, D. A., AND HIRSCHBERG, D. S. 1987.Data compression. ACM Comput. Suru. 19, 3

(Sept.), 261.

LEUNG, T. Y. C., AND MUNTZ, R. R. 1992. Tempo-ral query processing and optimization in multi-

processor database machines. In Proceedings ofthe International Conference on Very LargeData Bases (Vancouver, BC, Canada). VLDBEndowment, 383.

LEUNG, T. Y. C., AND MUNTZ, R. R. 1990. Queryprocessing in temporal databases. In Proceed-ings of the IEEE Conference on Data Englneer-mg. IEEE, New York, 200.

ACM Com~utina Survevs, Vol 25, No. 2, June 1993

Page 94: Query evaluation techniques for large databasesweb.eecs.umich.edu/~jag/eecs584/papers/queryproc_graefe.pdf · efficient algorithms and software archi-tectures of database query execution

166 “ Goetz Graefe

LI, K., AND NAUGHTON, J. 1988. Multiprocessor

main memory transaction processing. In Pro-

ceedings of the Intcrnatlonal Symposium onDatabases m Parallel and Dlstrlbuted Systems

(Austin, Tex., Dec.), 177.

LITWIN, W. 1980. Linear hashing: A new tool for

file and table addressing. In Proceedings of the

International Conference on Ver<y Large DataBases (Montreal, Canada, Oct.). VLDB Endow-

ment, 212. Reprinted in Readings in DatabaseSystems. Morgan-Kaufman, San

Mateo, Calif

LJTWIN, W., MARK L., AND ROUSSOPOULOS, N. 1990.

Interoperability of multiple autonomousdatabases. ACM Comput. Suru. 22, 3 (Sept.).

267.

LOHMAN, G., MOHAN, C., HAAs, L., DANIELS, D.,

LINDSAY, B., SELINGER, P., AND WILMS, P. 1985.

Query processing m R’. In Query Processuzg mDatabase Systems. Springer, Berlin, 31.

LOMET, D. 1992. A review of recent work

on multi-attribute access methods. ACMSIGMOD Rec. 21, 3 (Sept.), 56.

LoM~T, D., AND SALZBER~, B 1990a The perfor-

mance of a multiversion access method. In Pro-ceedings of ACM SIGMOD Conference ACM,New York, 353

LOM~T, D. B , AND SALZEER~, B. 1990b. The hB-

tree A multlattrlbute mdexing method withgood guaranteed performance. ACM Trans.

Database Syst. 15, 4 (Dec.), 625.

LORIE, R. A., AND NILSSON, J, F, 1979. AU access

specification language for a relational database

management system IBM J. Res. Deuel. 23, 3

(May), 286

LDRIE, R, A,, AND YOUNG, H. C. 1989. A low com-

munication sort algorithm for a paralleldatabase machme. In Proceedings of the Inter-

national Conference on Very Large Data Bases(AmAwdam. The Netherlands). VLDB Endow-ment, 125.

LYNCH, C A., AND BROWNRIGG, E. B. 1981. Appli-

cation of data compression to a large biblio-graphic data base In Proceedings of the Inter-

national Conference on Very Large Data Base,~(Cannes, France, Sept.). VLDB Endowment,

435

L~~INEN, K. 1987. Different perspectives on in-formation systems: Problems and solutions.ACM Comput. Suru. 19, 1 (Mar.), 5.

MACVCEVZT,L. F., AND LOHMAN, G. M. 1989 Index

scans using a finite LRU buffer: A validated1/0 model. ACM Trans Database S.vst. 14, 3

(Sept.), 401.

MAJER, D, 1983. The Theory of RelationalDatabases. CS Press, Rockville, Md.

MAI~R, D., AND STEIN, J. 1986 Indexing m an

object-oriented database management. In Pro-ceedings of the Intern at[on al Workshop on Ob]ect-(hented Database Systems (Pacific Grove,Calif, Sept ), 171

MAIER, D., GRAEFE, G., SHAFIRO, L., DANIELS, S.,KELLER, T., AND VANCE, B. 1992 Issues in

distributed complex object assembly In Prc~-

ceedings of the Workshop on Distributed ObjectManagement (Edmonton, BC, Canada, Aug.).

MANNINO, M. V., CHU, P., ANrI SAGER, T. 1988.

Statistical profile estimation m database sys-

tems. ACM Comput. Suru. 20, 3 (Sept.).

MCKENZIE, L. E., AND SNODGRASS, R. T. 1991.Evaluation of relational algebras incorporating

the time dimension m databases. ACM Co~n-put. Suru. 23, 4 (Dec.).

MEDEIROS, C., AND TOMPA, F. 1985. Understand-ing the implications of view update pohcles, InProceedings of the International Conference on

Very Large Data Bases (Stockholm, Sweden,Aug.). VLDB Endowment, 316.

MENON, J. 1986. A study of sort algorithms formultiprocessor database machines. In Proceed-

I ngs of the International Conference on VeryLarge Data bases (Kyoto, Japan, Aug ) VLDBEndowment, 197

MISHRA, P., AND EICH, M. H. 1992. Join process-ing in relational databases. ACM Comput.

Suru. 24, 1 (Mar.), 63

MITSCHANG, B. 1989. Extending the relational al-

gebra to capture complex objects. In Proceed-ings of the International Conference on Very

Large Data Bases (Amsterdam, The Nether-lands). VLDB Endowment, 297.

MOHAN, C., HADERLE, D., WANG, Y., AND CHENG, J.

1990. Single table access using multiple in-

dexes: Optimization, execution and concur-

rency control techniques. In Lecture Notes mComputer Sc~ence, vol. 416. Springer-Verlag,

New York, 29.

MOTRO, A. 1989. An access authorization modelfor relational databases based on algebraic ma-

mpulation of view definitions In Proceedl ngsof the IEEE Conferen m on Data Engw-um-mgIEEE, New York, 339

MULLIN, J K. 1990. Optimal semijoins for dis-tributed database systems. IEEE Trans. Softw.Eng. 16, 5 (May), 558.

NAKAYAMA, M.. KITSUREGAWA. M., AND TAKAGI, M.1988. Hash-partitioned jom method using dy-

namic dcstaging strategy, In Proeeedmgs of theImternatronal Conference on Very Large Data

Bases (Los Angeles, Aug.). VLDB Endowment,468.

NECHES, P M. 1988. The Ynet: An interconnectstructure for a highly concurrent data basecomputer system. In Proceedings of the 2ndSymposium on the Frontiers of Massiuel.v Par-allel Computatl on (Fairfax, Virginia, Ott.).

NECHES, P M. 1984. Hardware support for ad-vanced data management systems. IEEE Comput. 17, 11 (Nov.), 29.

NEUGEBAUER, L. 1991 Optimization and evalua-tion of database queries mcludmg embeddedinterpolation procedures. In Proceedings of

ACM Computmg Surveys, Vol 25, No 2. June 1993

Page 95: Query evaluation techniques for large databasesweb.eecs.umich.edu/~jag/eecs584/papers/queryproc_graefe.pdf · efficient algorithms and software archi-tectures of database query execution

Query Evaluation Techniques ● 167

ACM SIGMOD Conference. ACM, New York,118.

NG, R., FALOUTSOS,C., ANDSELLIS,T. 1991. Flex-ible buffer allocation based on marginal gains.In Proceedings of ACM SIGMOD Conference.ACM, New York, 387.

NIEVERGELT, J., HINTERBERGER, H., AND SEVCIK,K. C. 1984. The grid file: An adaptable, sym-

metric multikey file structure. ACM Trans.Database Syst. 9, 1 (Mar.), 38.

NYBERG,C., BERCLAY,T., CVETANOVIC,Z., GRAY.J.,ANDLOMET,D. 1993. AlphaSort: A RISC ma-chine sort. Tech. Rep. 93.2. DEC San FranciscoSystems Center. Digital Equipment Corp., SanFrancisco.

OMIECINSK1,E. 1991. Performance analysis of aload balancing relational hash-join algorithmfor a shared-memory multiprocessor. In Pro-

ceedings of the International Conference on VeryLarge Data Bases (Barcelona, Spain). VLDBEndowment, 375.

OMIECINSKLE. 1985. Incremental file reorgani-zation schemes. In Proceedings of the Interna-

tional Conference on Very Large Data Bases(Stockholm, Sweden, Aug.). VLDB Endowment,346.

OMIECINSKI, E., AND LIN, E. 1989. Hash-basedand index-based join algorithms for cube andring connected multicomputers. IEEE Trans.

Knowledge Data Eng. 1, 3 (Sept.), 329.

ONO, K., AND LOHMAN, G. M. 1990. Measuringthe complexity of join enumeration in query

optimization. In Proceedings of the Interna-

tional Conference on Very Large Data Bases

(Brisbane, Australia). VLDB Endowment, 314.OUSTERHOUT,J. 1990. Why aren’t operating sys-

tems getting faster as fast as hardware. InUSENIX Summer Conference (Anaheim, Calif.,June). USENIX.

O~SOYOGLU, Z. M., AND WANG, J. 1992. A keyingmethod for a nested relational database man-

agement system. In Proceedings of the IEEEConference on Data Engineering. IEEE, NewYork, 438.

OZSOYOGLU,G., OZSOYOGLU,Z. M., ANDMATOS,V.1987. Extending relational algebra and rela-tional calculus with set-valued attributes andaggregate functions. ACM Trans. Database

Syst. 12, 4 (Dec.), 566.

Ozsu, M. T., AND VALDURIEZ, P. 1991a. Dis-tributed database systems: Where are we now.IEEE Comput. 24, 8 (Aug.), 68.

Ozsu, M. T., AND VALDURIEZ, P. 1991b. Principlesof Distributed Database Systems. Prentice-Hall,Englewood Cliffs, N.J.

PALMER, M., AND ZDONIK, S. B. 1991. FIDO: Acache that learns to fetch. In Proceedings of theInternational Conference on Very Large DataBases (Barcelona, Spain). VLDB Endowment,255.

PECKHAM, J., AND MARYANSKI, F. 1988. Semantic

data models. ACM Comput. Suru. 20, 3 (Sept.),

153.

PmAHESH, H., MOHAN, C., CHENG, J., LIU, T. S., AND

SELINGER, P. 1990. Parallelism in relationaldata base systems: Architectural issues and

design approaches. In Proceedings of the Inter-

national Symposwm on Databases m Parallel

and Distributed Systems (Dublin, Ireland,

July).

QADAH, G. Z. 1988. Filter-based join algorithms

on uniprocessor and dmtributed-memory multi-processor database machines. In Lecture Notes

m Computer Science, vol. 303. Springer-Verlag,New York, 388.

REW, R. K., AND DAVIS, G. P. 1990. The UnidataNetCDF: Software for scientific data access. Inthe 6th International Conference on InteractweInformation and Processing Swstems for Me-

teorology, Ocean ography,- aid Hydrology(Anaheim, Calif.).

RICHARDSON, J. E., AND CAREY, M. J. 1987. Pro-

gramming constructs for database system im-plementation m EXODUS. In Proceedings of

ACM SIGMOD Conference. ACM, New York,208.

RICHAR~SON,J. P., Lu, H., AND MIKKILINENI, K.1987. Design and evaluation of parallel

pipelined join algorithms. In Proceedings ofACM SIGMOD Conference. ACM, New York,399.

ROBINSON,J. T. 1981. The K-D-B-Tree: A searchstructure for large multidimensional dynamicindices. ln proceedings of ACM SIGMOD Con-ference. ACM, New York, 10.

ROSENTHAL,A., ANDREINER,D. S. 1985. Query-ing relational views of networks. In Query Pro-cessing in Database Systems. Springer, Berlin,109.

ROSENTHAL, A., RICH, C., AND SCHOLL, M. 1991.Reducing duplicate work in relational join(s): Amodular approach using nested relations. ETH

Tech. Rep., Zurich, Switzerland.

ROTEM, D., AND SEGEV, A. 1987. Physical organi-zation of temporal data. In Proceedings of theIEEE Conference on Data Engineering. IEEE,

New York, 547.

ROTH, M. A., KORTH, H. F., AND SILBERSCHATZ, A.1988. Extended algebra and calculus fornested relational databases. ACM Trans.Database Syst. 13, 4 (Dec.), 389.

ROTHNIE, J. B., BERNSTEIN, P. A., Fox, S., GOODMAN,N., HAMMER, M., LANDERS, T. A., REEVE, C.,

SHIPMAN, D. W., AND WONG, E. 1980. Intro-

duction to a system for distributed databases

(SDD-1). ACM Trans. Database Syst. 5, 1(Mar.), 1.

ROUSSOPOULOS, N. 1991. An incremental accessmethod for ViewCache: Concept, algorithms,and cost analysis. ACM Trans. Database Syst.16, 3 (Sept.), 535.

ROUSSOPOULOS, N., AND KANG, H. 1991. Apipeline N-way join algorithm based on the

ACM Computing Surveys, Vol. 25. No. 2, June 1993

Page 96: Query evaluation techniques for large databasesweb.eecs.umich.edu/~jag/eecs584/papers/queryproc_graefe.pdf · efficient algorithms and software archi-tectures of database query execution

168 “ Goetz Graefe

2-way semijoin program. IEEE Trans Knoul-

edge Data Eng. 3, 4 (Dec.), 486.

RUTH, S. S , AND KEUTZER, P J 1972. Data com-pression for business files. Datamatlon 18

(Sept.), 62.

SAAKD, G., LINNEMANN, V., PISTOR, P , AND WEGNER,L. 1989. Sorting, grouping and duplicate

elimination in the advanced information man-agement prototype. In Proceedl ngs of th e Inter-national Conference on Very Large Data BasesVLDB Endowment, 307 Extended version inIBM Sci. Ctr. Heidelberg Tech. Rep 8903.008,

March 1989.

S.4CX:0, G 1987 Index access with a finite buffer.

In Procecdmgs of the International Conference

on Very Large Data Bases (Brighton, England,Aug.) VLDB Endowment. 301.

SACCO, G. M., .4ND SCHKOLNIK, M. 1986, Buffermanagement m relational database systems.ACM Trans. Database Syst. 11, 4 (Dec.), 473.

SACCO, G M , AND SCHKOLNI~, M 1982. A mech-anism for managing the buffer pool m a rela-

tional database system using the hot set model.In Proceedings of the International Conferenceon Very Large Data Bases (Mexico City, Mex-

ico, Sept.). VLDB Endowment, 257.

SACXS-DAVIS,R., AND RAMMIOHANARAO,K. 1983.A two-level superimposed coding scheme forpartial match retrieval. Inf. Syst. 8, 4, 273.

S.ACXS-DAVIS, R., KENT, A., ANn RAMAMOHANAR~O, K1987. Multikey access methods based on su-

perimposed coding techniques. ACM TransDatabase Syst. 12, 4 (Dec.), 655

SALZBERG, B. 1990 Mergmg sorted runs usinglarge main memory Acts Informatica 27, 195

SALZBERG, B. 1988, Fde Structures: An Analytlc

Approach. Prentice-Hall, Englewood Cliffs, NJ.SALZBER~,B , TSU~~RMAN,A., GRAY, J., STEWART,

M., UREN, S., ANrJ VAUGHAN, B. 1990 Fast-Sort: A distributed single-input single-outputexternal sort In Proceeduzgs of ACM SIGMODConference ACM, New York, 94.

SAMET, H. 1984. The quadtree and related hier-

archical data structures. ACM Comput. Saru.16, 2 (June), 187,

SCHEK, H. J., ANII SCHOLL, M. H. 1986. The rela-tional model with relation-valued attributes.Inf. Syst. 11, 2, 137.

SCHNEIDER, D. A. 1991. Blt filtering and multi-way join query processing. Hewlett-PackardLabs, Palo Alto, Cahf. Unpublished Ms

SCHNEIDER, D. A. 1990. Complex query process-ing in multiprocessor database machines. Ph.D.thesis, Univ. of Wmconsin-Madison

SCHNF,mER. D. A., AND DEWITT, D. J. 1990.Tradeoffs in processing complex join queriesvia hashing in multiprocessor databasemachines. In Proceedings of the Interna-tional Conference on Very Large Data Bases(Brisbane, Austraha). VLDB Endowment, 469

SCHNEIDER, D., AND DEWITT, D. 1989. A perfor-

mance evaluation of four parallel join algo-

rithms in a shared-nothing multiprocessorenvironment. In Proceedings of ACM SIGMODConference ACM, New York, 110

SCHOLL, M. H, 1988. The nested relational model—Efficient support for a relational databaseinterface. Ph.D. thesis, Technical Umv. Darm-

stadt. In German.

SCHOLL, M., PAUL, H. B., AND SCHEK, H. J 1987Supporting flat relations by a nested relational

kernel. In Proceedings of the International

Conference on Very Large Data Bases (Brigh-

ton, England, Aug ) VLDB Endowment. 137.

SEEGER, B., ANU LARSON, P A 1991. Multl-disk

B-trees. In Proceedings of ACM SIGMOD Con-ference ACM, New York. 436.

,SEG’N, A . AND GtTN.ADHI. H. 1989. Event-Join op-

timization in temporal relational databases. InProceedings of the IrLternatlOnctl Conference on

Very Large Data Bases (Amsterdam, TheNetherlands), VLDB Endowment, 205.

SELINGiZR, P. G., ASTRAHAN, M. M , CHAMBERLAIN.D. D., LORIE, R. A., AND PRWE, T. G. 1979Access path selectlon m a relational database

management system In Proceedings of AdM

SIGMOD Conference. ACM, New York, 23.

Reprinted in Readings m Database Sy~temsMorgan-Kaufman, San Mateo, Calif., 1988.

SELLIS. T. K. 1987 Efficiently supporting proce-

dures in relational database systems In Pro-ceedl ngs of ACM SIGMOD Conference. ACM.

New York, 278.

SEPPI, K., BARNES, J,, AND MORRIS, C. 1989, A

Bayesian approach to query optimization mlarge scale data bases The Univ. of Texas atAustin ORP 89-19, Austin.

SERLIN, 0. 1991. The TPC benchmarks. InDatabase and Transaction Processing System

Performance Handbook. Morgan-Kaufman, SanMateo, Cahf

SESHAnRI, S , AND NAUGHTON, J F. 1992 Sam-

pling issues in parallel database systems InProceedings of the International Conferenceon Extending Database Technology (Vienna,Austria, Mar.).

SEVERANCE. D. G. 1983. A practitioner’s guide to

data base compression. Inf. S.yst. 8, 1, 51.

SE\rERANCE, D., AND LOHMAN, G 1976. Differen-tial files: Their application to the maintenanceof large databases ACM Trans. Database Syst.1.3 (Sept.).

SEWERANCE, C., PRAMANK S., AND WOLBERG, P.1990. Distributed linear hashing and parallelprojection in mam memory databases. In Pro-ceedings of the Internatl onal Conference on Very

Large Data Bases (Brmbane, Australia) VLDBEndowment, 674.

SHAPIRO, L. D. 1986. Join processing in database

systems with large main memories. ACMTrans. Database Syst. 11, 3 (Sept.), 239.

SHAW, G. M., AND ZDONIIL S. B. 1990. A query

ACM Camputmg Surveys, Vol. 25, No. 2, June 1993

Page 97: Query evaluation techniques for large databasesweb.eecs.umich.edu/~jag/eecs584/papers/queryproc_graefe.pdf · efficient algorithms and software archi-tectures of database query execution

Query Evaluation Techniques ● 169

algebra for object-oriented databases. In Pro.

ceedings of the IEEE Conference on Data Engl.

neering. IEEE, New York, 154.SHAW, G., AND Z~ONIK, S. 1989a. An object-

oriented query algebra. IEEE Database Eng.12, 3 (Sept.), 29.

SIIAW, G. M., AND ZDUNIK, S. B. 1989b. AU

object-oriented query algebra. In Proceedingsof the 2nd International Workshop on DatabaseProgramming Languages. Morgan-Kaufmann,San Mateo, Calif., 103.

SHEKITA, E. J., AND CAREY, M. J. 1990. A perfor-

mance evaluation of pointer-based joins. In

Proceedings of ACM SIGMOD (70nferen ce.ACM, New York, 300.

SHERMAN, S. W., AND BRICE, R. S. 1976. Perfor-mance of a database manager in a virtual

memory system. ACM Trans. Data base Syst.

1, 4 (Dec.), 317.SHETH, A. P., AND LARSON, J. A. 1990, Federated

database systems for managing distributed,

heterogeneous, and autonomous databases.ACM Comput. Surv. 22, 3 (Sept.), 183.

SHIPMAN, D. W. 1981. The functional data modeland the data 1anguage DAPLEX. ACM Trans.Database Syst. 6, 1 (Mar.), 140.

SIKELER, A. 1988. VAR-PAGE-LRU: A buffer re-placement algorithm supporting different pagesizes. In Lecture Notes in Computer Science,vol. 303. Springer-Verlag, New York, 336.

SILBERSCHATZ, A., STONEBRAKER, M., AND ULLMAN, J.

1991. Database systems: Achievements and

opportunities. Commun. ACM 34, 10 (Oct.),110,

SIX, H. W., AND WIDMAYER, P. 1988. Spatialsearching in geometric databases. In Proceed-ings of the IEEE Conference on Data Enginee-ring. IEEE, New York, 496.

SMITH,J. M., ANDCHANG,P. Y. T. 1975. Optimiz-ing the performance of a relational algebradatabase interface. Commun. ACM 18, 10(Oct.), 568.

SNODGRASS. R, 1990. Temporal databases: Status

and research directions. ACM SIGMOD Rec.19, 4 (Dec.), 83.

SOCKUT, G. H., AND GOLDBERG, R. P. 1979.Database reorganization—Principles and prac-tice. ACM Comput. Suru. 11, 4 (Dec.), 371.

SRINIVASAN, V., AND CAREY, M. J. 1992. Perfor-

mance of on-line index construction algorithms.In Proceedings of the International Conference

on Extending Database Technology (Vienna,Austria, Mar.).

SRINWASAN,V., AND CAREY, M. J. 1991. Perfor-mance of B-tree concurrency control algo-rithms. In Proceedings of ACM SIGMODConference. ACM, New York, 416.

STAMOS,J. W., ANDYOUNG,H. C. 1989. A sym-metric fragment and replicate algorithm fordistributed joins. Tech. Rep. RJ7 188, IBM Re-search Labs, San Jose, Calif.

STONEBRAKER, M. 1991. Managing persistent ob-

jects in a multi-level store. In Proceedings ofACM SIGMOD Conference. ACM, New York, 2.

STONEBRAKER,M. 1987. The design of the POST-GRES storage system, In Proceedings of the

International Conference on Very Large Data

Bases (Brighton, England, Aug.). VLDB En-dowment, 289. Reprinted in Readings inDatabase Systems. Morgan-Kaufman, San Ma-

teo, Calif,, 1988.

STONEBRAKER, M. 1986a. The case for shared-nothing. IEEE Database Eng. 9, 1 (Mar.),

STONEBRAKER, M, 1986b. The design and imple-

mentation of distributed INGRES. In TheINGRES Papers. Addison-Wesley, Reading,

Mass., 187.

STONEBRAKER, M. 1981. Operating system sup-port for database management. Comrnun. ACM

24, 7 (July), 412.

STONEBRAKER, M. 1975. Implementation of in-tegrity constraints and views by query modifi-

cation. In Proceedings of ACM SIGMOD Con-ference ACM, New York.

STONEBRAKER, M., AOKI, P., AND SELTZER, M.1988a. Parallelism in XPRS. UCB/ERL

Memorandum M89\ 16, Univ. of California,Berkeley.

STONEBRAWCR, M., JHINGRAN, A., GOH, J., ANDPOTAMIANOS, S. 1990a. On rules, procedures,caching and views in data base systems In

Proceedings of ACM SIGMOD Conference.

ACM, New York, 281

STONEBRAKER,M., KATZ, R., PATTERSON,D., ANDOUSTERHOUT,J. 1988b. The design of XPRS,In Proceedings of the International Conference

on Very Large Data Bases (Los Angeles, Aug.).

VLDB Endowment, 318.

STONEBBAKER, M., ROWE, L. A., AND HIROHAMA, M.1990b. The implementation of Postgres. IEEETrans. Knowledge Data Eng. 2, 1 (Mar.), 125.

STRAUBE, D. D., AND OLSU, M. T. 1989. Querytransformation rules for an object algebra,Dept. of Computing Sciences Tech, Rep. 89-23,Univ. of Alberta, Alberta, Canada.

Su, S. Y. W. 1988. Database Computers: Princ-

iples, Archltectur-es and Techniques. McGraw-

Hill, New York.

TANSEL, A. U., AND GARNETT, L. 1992. On Roth,Korth, and Silberschat,z’s extended algebra andcalculus for nested relational databases. ACMTrans. Database Syst. 17, 2 (June), 374.

TEOROy, T. J., YANG, D., AND FRY, J. P. 1986. Alogical design metb odology for relationaldatabases using the extended entity-relation-ship model. ACM Cornput. Suru. 18, 2 (June).197.

TERADATA. 1983. DBC/1012 Data Base Com-

puter, Concepts and Facilities. Teradata Corpo-ration, Los Angeles.

THOMAS, G., THOMPSON, G. R., CHUNG, C. W.,BARKMEYER, E., CARTER, F., TEMPLETON, M.,

ACM Computing Surveys, Vol. 25, No. 2, June 1993

Page 98: Query evaluation techniques for large databasesweb.eecs.umich.edu/~jag/eecs584/papers/queryproc_graefe.pdf · efficient algorithms and software archi-tectures of database query execution

170 “ Goetz Graefe

Fox, S., AND HARTMAN, B. 1990. Heteroge-

neous distributed database systems for produc-tion use, ACM Comput. Surv. 22. 3 (Sept.),237.

TOMPA, F. W., AND BLAKELEY, J A. 1988. Main-taining materialized views without accessing

base data. Inf Swt. 13, 4, 393.

TRAI~ER, 1. L. 1982. Virtual memory manage-

ment for data base systems ACM Oper. S.vst.

Reu. 16, 4 (Oct.), 26.

TRIAGER, 1. L., GRAY, J., GALTIERI, C A., AND

LINDSAY, B, G. 1982. Transactions and con-

sistency in distributed database systems. ACM

Trans. Database Syst. 7, 3 (Sept.), 323.

TSUR, S., AND ZANIOLO, C. 1984 An implementa-

tion of GEM—Supporting a semantic data

model on relational back-end. In Proceedings ofACM SIGMOD Conference. ACM, New York,

286.

TUKEY, J, W, 1977. Exploratory Data Analysls.Addison-Wesley, Reading, Mass,

UNTDATA 1991. NetCDF User’s Guide, An Inter-

face for Data Access, Verszon III. NCAR TechNote TS-334 + 1A, Boulder, Colo

VALDURIIIZ, P. 1987. Join indices. ACM Trans.

Database S.yst. 12, 2 (June). 218.

VANDEN~ERC, S. L., AND DEWITT, D. J. 1991. Al-

gebraic support for complex objects with ar-rays, identity, and inhentance. In Proceedmg.sof ACM SIGMOD Conference. ACM, New York,158,

WALTON, C B. 1989 Investigating skew andscalabdlty in parallel joins. Computer ScienceTech. Rep. 89-39, Umv. of Texas, Austin.

WALTON, C, B., DALE, A. G., AND JENEVEIN, R. M.

1991. A taxonomy and performance model ofdata skew effects in parallel joins. In Proceed-

ings of the Interns tlona 1 Conference on Very

Large Data Bases (Barcelona, Spain). VLDBEndowment, 537.

WHANG. K. Y.. AND KRISHNAMURTHY, R. 1990.Query optimization m a memory-resident do-mam relational calculus database system. ACMTrans. Database Syst. 15, 1 (Mar.), 67.

WHANG, K. Y., WIEDERHOLD G., AND SAGALOWICZ, D.1985 The property of separabdity and Its ap-plication to physical database design. In Query

Processing m Database Systems. Springer,Berlin, 297.

WHANG, K. Y., WIEDERHOLD, G., AND SAGLOWICZ, D.1984. Separability—An approach to physical

database design. IEEE Trans. Comput. 33, 3

(Mar.), 209.WILLIANLS,P., DANIELS, D., HAAs, L., LAPIS, G.,

LINDSAY,B., NG, P., OBERMARC~,R., SELINGER,P., WALKER, A., WILMS, P., AND YOST, R. 1982R’: An overview of the architecture. In Im-provmg Database Usabdlty and Responslue -ness. Academic Press, New York. Reprinted in

Readings m Database Systems. Morgan-Kauf-man, San Mateo, Calif., 1988

WILSCHUT, A. N. 1993. Parallel query execution

in a mam memory database system. Ph.D. the-sis, Univ. of Tweuk, The Netherlands.

WiLSCHUT, A. N., AND AP~RS, P. M. G. 1993.Dataflow query execution m a parallel main-memory environment. Distrlb. Parall.

Databases 1, 1 (Jan.), 103.

WOLF, J. L , DIAS, D. M , AND Yu, P. S. 1990. An

effective algorithm for parallelizing sort mergein the presence of data skew In Proceedl ngs of

the International Syrnpo.?lum on Data base~ 1n

Parallel and DLstrlbuted Systems (Dubhn,

Ireland, July)

WOLF, J. L., DIAS, D M., Yu, P. S . AND TUREK, ,J.

1991. An effective algorithm for parallelizmghash Joins in the presence of data skew. InProceedings of the IEEE Conference on Data

Engtneermg. IEEE, New York. 200

WOLNIEWWZ, R. H., AND GRAEFE, G. 1993 Al-gebralc optlmlzatlon of computations overscientific databases. In Proceedings of theInternational Conference on Very Large

Data Bases. VLDB Endowment.

WONG, E., ANO KATZ. R. H. 1983. Dlstributmg a

database for parallelism. In Proceedings of

ACM SIGMOD Conference. ACM, New York,

23,

WONG, E., AND Youssmv, K. 1976 Decomposition—A strategy for query processing. ACM TransDatabase Syst 1, 3 (Sept.), 223.

YANG, H., AND LARSON, P. A. 1987. Query trans-

formation for PSJ-queries. In proceedings of

the International Conference on Very LargeData Bases (Brighton, England, Aug. ) VLDB

Endowment, 245.

YOUSS~FI. K, AND WONG, E 1979. Query pro-cessing in a relational database management

system. In Proceedings of the Internatmnal

Conference on Very Large Data Bases (Rio de

Janeiro, Ott ). VLDB Endowment, 409.

Yu, C, T., AND CHANG) C. C. 1984. Distributedquery processing. ACM Comput. Surt, 16, 4(Dec.), 399.

YLT, L., AND OSBORN, S, L, 1991. An evaluation

framework for algebralc object-oriented query

models. In Proceedings of the IEEE Conferenceon Data Englneermg. VLDB Endowment, 670.

ZANIOIJ I, C. 1983 The database language Gem.

In Proceedings of ACM SIGMOD Conference.ACM, New York, 207. Rcprmted m Readcngb

zn Database Systems. Morgan-Kaufman, SanMateo, Calif., 1988.

ZANmLo, C. 1979 Design of relational views over

network schemas. In Proceedings of ACMSIGMOD Conference ACM, New York, 179

ZELLER, H. 1990. Parallel query execution mNonStop SQL. In Dtgest of Papers, 35th C’omp-

Con Conference. San Francisco.

ZELLER H,, AND GRAY, J. 1990 An adaptive hash

join algorithm for multiuser environments. InProceedings of the International Conference onVery Large Data Bases (Brisbane, Australia).VLDB Endowment, 186.

Received January 1992; final revision accepted February 1993
