WEB SEARCH FOR A PLANET: THE GOOGLE CLUSTER ARCHITECTURE

Luiz André Barroso, Jeffrey Dean, and Urs Hölzle
Google

Amenable to extensive parallelization, Google’s Web search application lets different queries run on different processors and, by partitioning the overall index, also lets a single query use multiple processors. To handle this workload, Google’s architecture features clusters of more than 15,000 commodity-class PCs with fault-tolerant software. This architecture achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.

Few Web services require as much computation per request as search engines. On average, a single query on Google reads hundreds of megabytes of data and consumes tens of billions of CPU cycles. Supporting a peak request stream of thousands of queries per second requires an infrastructure comparable in size to that of the largest supercomputer installations. Combining more than 15,000 commodity-class PCs with fault-tolerant software creates a solution that is more cost-effective than a comparable system built out of a smaller number of high-end servers.

Here we present the architecture of the Google cluster, and discuss the most important factors that influence its design: energy efficiency and price-performance ratio. Energy efficiency is key at our scale of operation, as power consumption and cooling issues become significant operational factors, taxing the limits of available data center power densities.

Our application affords easy parallelization: Different queries can run on different processors, and the overall index is partitioned so that a single query can use multiple processors. Consequently, peak processor performance is less important than its price/performance. As such, Google is an example of a throughput-oriented workload, and should benefit from processor architectures that offer more on-chip parallelism, such as simultaneous multithreading or on-chip multiprocessors.

Google architecture overview

Google’s software architecture arises from two basic insights. First, we provide reliability in software rather than in server-class hardware, so we can use commodity PCs to build a high-end computing cluster at a low-end price.



Second, we tailor the design for best aggregate request throughput, not peak server response time, since we can manage response times by parallelizing individual requests.

We believe that the best price/performance tradeoff for our applications comes from fashioning a reliable computing infrastructure from clusters of unreliable commodity PCs. We provide reliability in our environment at the software level, by replicating services across many different machines and automatically detecting and handling failures. This software-based reliability encompasses many different areas and involves all parts of our system design. Examining the control flow in handling a query provides insight into the high-level structure of the query-serving system, as well as insight into reliability considerations.

Serving a Google query

When a user enters a query to Google (such as www.google.com/search?q=ieee+society), the user’s browser first performs a domain name system (DNS) lookup to map www.google.com to a particular IP address. To provide sufficient capacity to handle query traffic, our service consists of multiple clusters distributed worldwide. Each cluster has a few thousand machines, and the geographically distributed setup protects us against catastrophic data center failures (like those arising from earthquakes and large-scale power failures). A DNS-based load-balancing system selects a cluster by accounting for the user’s geographic proximity to each physical cluster. The load-balancing system minimizes round-trip time for the user’s request, while also considering the available capacity at the various clusters.
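To make the selection policy concrete, here is a minimal Python sketch of proximity-first selection with a capacity check. The cluster names, round-trip-time estimates, capacity figures, and the 10 percent headroom rule are invented placeholders, not a description of Google’s actual DNS load balancer.

    # Hypothetical sketch of DNS-based cluster selection: prefer the
    # lowest-estimated-RTT cluster that still has spare serving capacity.
    # All names and numbers below are invented for illustration.
    CLUSTERS = {
        # name: (estimated_rtt_ms, current_load_qps, max_capacity_qps)
        "us-east": (30, 900, 1000),
        "us-west": (80, 400, 1000),
        "europe": (140, 200, 1000),
    }

    def pick_cluster(clusters=CLUSTERS):
        """Return the nearest cluster with headroom, else the least-loaded one."""
        candidates = [(rtt, name)
                      for name, (rtt, load, cap) in clusters.items()
                      if load < 0.9 * cap]          # keep ~10% headroom
        if candidates:
            return min(candidates)[1]               # nearest with spare capacity
        return min(clusters, key=lambda n: clusters[n][1] / clusters[n][2])

    print(pick_cluster())  # -> "us-west": us-east is near its capacity limit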

The user’s browser then sends a hypertext transport protocol (HTTP) request to one of these clusters, and thereafter, the processing of that query is entirely local to that cluster. A hardware-based load balancer in each cluster monitors the available set of Google Web servers (GWSs) and performs local load balancing of requests across a set of them. After receiving a query, a GWS machine coordinates the query execution and formats the results into a Hypertext Markup Language (HTML) response to the user’s browser. Figure 1 illustrates these steps.

Query execution consists of two major phases [1]. In the first phase, the index servers consult an inverted index that maps each query word to a matching list of documents (the hit list). The index servers then determine a set of relevant documents by intersecting the hit lists of the individual query words, and they compute a relevance score for each document. This relevance score determines the order of results on the output page.
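The first phase can be pictured with a toy inverted index. The following sketch is illustrative only: the index contents and the simple additive scoring rule are invented, and Google’s actual index format and relevance computation are far more elaborate.

    # Toy first phase: map each query word to its hit list, intersect the
    # hit lists, then score and order the surviving documents.
    # Index contents and the additive scoring rule are invented.
    INDEX = {
        "ieee":    {1: 0.9, 3: 0.4, 7: 0.6},   # word -> {docid: score part}
        "society": {1: 0.5, 2: 0.8, 7: 0.2},
    }

    def search(query_words, index=INDEX):
        hit_lists = [index.get(w, {}) for w in query_words]
        if not hit_lists:
            return []
        # Keep only docids that appear in every word's hit list.
        docids = set(hit_lists[0]).intersection(*hit_lists[1:])
        # Relevance score (here: a simple sum) determines result order.
        scored = sorted(((sum(h[d] for h in hit_lists), d) for d in docids),
                        reverse=True)
        return [d for _, d in scored]

    print(search(["ieee", "society"]))  # -> [1, 7]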

The search process is challenging because of the large amount of data: The raw documents comprise several tens of terabytes of uncompressed data, and the inverted index resulting from this raw data is itself many terabytes of data. Fortunately, the search is highly parallelizable by dividing the index into pieces (index shards), each having a randomly chosen subset of documents from the full index. A pool of machines serves requests for each shard, and the overall index cluster contains one pool for each shard. Each request chooses a machine within a pool using an intermediate load balancer; in other words, each query goes to one machine (or a subset of machines) assigned to each shard. If a shard’s replica goes down, the load balancer will avoid using it for queries, and other components of our cluster-management system will try to revive it or eventually replace it with another machine. During the downtime, the system capacity is reduced in proportion to the total fraction of capacity that this machine represented. However, service remains uninterrupted, and all parts of the index remain available.
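A minimal sketch of the per-shard replica pools just described. The replica names, the round-robin choice, and the health-state bookkeeping are assumptions made for illustration; they are not the actual intermediate load balancer.

    import itertools

    # Hypothetical shard pool: a query picks one healthy replica per shard;
    # a downed replica is skipped, so the shard stays available (at reduced
    # capacity) as long as any replica in its pool is up.
    class ShardPool:
        def __init__(self, replicas):
            self.replicas = list(replicas)
            self.healthy = set(replicas)     # maintained by health checks
            self._rr = itertools.cycle(self.replicas)

        def pick_replica(self):
            for _ in self.replicas:          # round-robin over healthy ones
                r = next(self._rr)
                if r in self.healthy:
                    return r
            raise RuntimeError("no healthy replica for this shard")

    pools = [ShardPool(["shard%d-rep%d" % (i, j) for j in range(3)])
             for i in range(4)]
    pools[2].healthy.discard("shard2-rep0")   # simulate one failed machine
    print([p.pick_replica() for p in pools])  # one machine per shard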

[Figure 1. Google query-serving architecture. A Google Web server coordinates the index servers, document servers, spell checker, and ad server.]

The final result of this first phase of query execution is an ordered list of document identifiers (docids). As Figure 1 shows, the second phase involves taking this list of docids and computing the actual title and uniform resource locator of these documents, along with a query-specific document summary. Document servers (docservers) handle this job, fetching each document from disk to extract the title and the keyword-in-context snippet (sketched after the list below). As with the index lookup phase, the strategy is to partition the processing of all documents by

• randomly distributing documents into smaller shards,
• having multiple server replicas responsible for handling each shard, and
• routing requests through a load balancer.
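Here is the promised sketch of the keyword-in-context step on a single docserver. The fixed character window and the first-match rule are arbitrary simplifications, not Google’s snippet algorithm.

    # Illustrative keyword-in-context snippet: show the first query word
    # that matches, with a fixed window of surrounding text. The window
    # size and first-match rule are arbitrary simplifications.
    def snippet(text, query_words, window=40):
        low = text.lower()
        for word in query_words:
            pos = low.find(word.lower())
            if pos >= 0:
                start = max(0, pos - window)
                end = min(len(text), pos + len(word) + window)
                return "..." + text[start:end] + "..."
        return text[:2 * window] + "..."     # no match: lead of document

    doc = "The IEEE Computer Society publishes IEEE Micro, among others."
    print(snippet(doc, ["society"]))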

The docserver cluster must have access to an online, low-latency copy of the entire Web. In fact, because of the replication required for performance and availability, Google stores dozens of copies of the Web across its clusters.

In addition to the indexing and document-serving phases, a GWS also initiates several other ancillary tasks upon receiving a query, such as sending the query to a spell-checking system and to an ad-serving system to generate relevant advertisements (if any). When all phases are complete, a GWS generates the appropriate HTML for the output page and returns it to the user’s browser.

Using replication for capacity and fault-tolerance

We have structured our system so that most accesses to the index and other data structures involved in answering a query are read-only: Updates are relatively infrequent, and we can often perform them safely by diverting queries away from a service replica during an update. This principle sidesteps many of the consistency issues that typically arise in using a general-purpose database.

We also aggressively exploit the very large amounts of inherent parallelism in the application: For example, we transform the lookup of matching documents in a large index into many lookups for matching documents in a set of smaller indices, followed by a relatively inexpensive merging step. Similarly, we divide the query stream into multiple streams, each handled by a cluster. Adding machines to each pool increases serving capacity, and adding shards accommodates index growth. By parallelizing the search over many machines, we reduce the average latency necessary to answer a query, dividing the total computation across more CPUs and disks. Because individual shards don’t need to communicate with each other, the resulting speedup is nearly linear. In other words, the CPU speed of the individual index servers does not directly influence the search’s overall performance, because we can increase the number of shards to accommodate slower CPUs, and vice versa. Consequently, our hardware selection process focuses on machines that offer an excellent request throughput for our application, rather than machines that offer the highest single-thread performance.
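This “many small lookups plus a cheap merge” structure is a scatter-gather pattern, sketched below. The per-shard results are invented, and search_shard stands in for what would be a network RPC to a shard replica; the point is the inexpensive final merge.

    import heapq
    import itertools

    # Scatter a query to every shard, then merge the per-shard best-first
    # (score, docid) lists. Shard contents are invented; search_shard
    # stands in for an RPC to one replica of each shard.
    def search_shard(shard, query):
        return shard[query]                  # already sorted best-first

    SHARDS = [
        {"ieee society": [(0.9, 11), (0.4, 12)]},
        {"ieee society": [(0.8, 27)]},
        {"ieee society": [(0.7, 35), (0.2, 36)]},
    ]

    def search_all(shards, query, k=3):
        per_shard = [search_shard(s, query) for s in shards]   # scatter
        merged = heapq.merge(*per_shard, reverse=True)         # cheap merge
        return [docid for _, docid in itertools.islice(merged, k)]

    print(search_all(SHARDS, "ieee society"))  # -> [11, 27, 35]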

In summary, Google clusters follow four key design principles:

• Software reliability. We eschew fault-tolerant hardware features such as redundant power supplies, a redundant array of inexpensive disks (RAID), and high-quality components, instead focusing on tolerating failures in software.

• Use replication for better request throughput and availability. Because machines are inherently unreliable, we replicate each of our internal services across many machines. Because we already replicate services across multiple machines to obtain sufficient capacity, this type of fault tolerance almost comes for free.

• Price/performance beats peak performance. We purchase the CPU generation that currently gives the best performance per unit price, not the CPUs that give the best absolute performance.

• Using commodity PCs reduces the cost of computation. As a result, we can afford to use more computational resources per query, employ more expensive techniques in our ranking algorithm, or search a larger index of documents.

Leveraging commodity parts

Google’s racks consist of 40 to 80 x86-based servers mounted on either side of a custom-made rack (each side of the rack contains twenty 2u or forty 1u servers). Our focus on price/performance favors servers that resemble mid-range desktop PCs in terms of their components, except for the choice of large disk drives.


Several CPU generations are in active service, ranging from single-processor 533-MHz Intel-Celeron-based servers to dual 1.4-GHz Intel Pentium III servers. Each server contains one or more integrated drive electronics (IDE) drives, each holding 80 Gbytes. Index servers typically have less disk space than document servers because the former have a more CPU-intensive workload. The servers on each side of a rack interconnect via a 100-Mbps Ethernet switch that has one or two gigabit uplinks to a core gigabit switch that connects all racks together.

Our ultimate selection criterion is cost per query, expressed as the sum of capital expense (with depreciation) and operating costs (hosting, system administration, and repairs) divided by performance. Realistically, a server will not last beyond two or three years, because of its disparity in performance when compared to newer machines. Machines older than three years are so much slower than current-generation machines that it is difficult to achieve proper load distribution and configuration in clusters containing both types. Given the relatively short amortization period, the equipment cost figures prominently in the overall cost equation.

Because Google servers are custom made, we’ll use pricing information for comparable PC-based server racks for illustration. For example, in late 2002 a rack of 88 dual-CPU 2-GHz Intel Xeon servers with 2 Gbytes of RAM and an 80-Gbyte hard disk was offered on RackSaver.com for around $278,000. This figure translates into a monthly capital cost of $7,700 per rack over three years. Personnel and hosting costs are the remaining major contributors to overall cost.
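Checking the amortization arithmetic behind that monthly figure, using straight-line depreciation over the three-year service life discussed above (dollar figures are from the text):

    # Straight-line amortization of the example rack over three years.
    rack_price_usd = 278_000        # late-2002 RackSaver.com example
    months = 3 * 12
    print(rack_price_usd / months)  # ~7,722 USD/month, i.e. ~$7,700/rack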

The relative importance of equipment cost makes traditional server solutions less appealing for our problem because they increase performance but decrease the price/performance. For example, four-processor motherboards are expensive, and because our application parallelizes very well, such a motherboard doesn’t recoup its additional cost with better performance. Similarly, although SCSI disks are faster and more reliable, they typically cost two or three times as much as an equal-capacity IDE drive.

The cost advantages of using inexpensive, PC-based clusters over high-end multiprocessor servers can be quite substantial, at least for a highly parallelizable application like ours. The example $278,000 rack contains 176 2-GHz Xeon CPUs, 176 Gbytes of RAM, and 7 Tbytes of disk space. In comparison, a typical x86-based server contains eight 2-GHz Xeon CPUs, 64 Gbytes of RAM, and 8 Tbytes of disk space; it costs about $758,000 [2]. In other words, the multiprocessor server is about three times more expensive but has 22 times fewer CPUs, three times less RAM, and slightly more disk space. Much of the cost difference derives from the much higher interconnect bandwidth and reliability of a high-end server, but again, Google’s highly redundant architecture does not rely on either of these attributes.

Operating thousands of mid-range PCs instead of a few high-end multiprocessor servers incurs significant system administration and repair costs. However, for a relatively homogenous application like Google, where most servers run one of very few applications, these costs are manageable. Assuming tools to install and upgrade software on groups of machines are available, the time and cost to maintain 1,000 servers isn’t much more than the cost of maintaining 100 servers, because all machines have identical configurations. Similarly, the cost of monitoring a cluster using a scalable application-monitoring system does not increase greatly with cluster size. Furthermore, we can keep repair costs reasonably low by batching repairs and ensuring that we can easily swap out components with the highest failure rates, such as disks and power supplies.

The power problem

Even without special, high-density packaging, power consumption and cooling issues can become challenging. A mid-range server with dual 1.4-GHz Pentium III processors draws about 90 W of DC power under load: roughly 55 W for the two CPUs, 10 W for a disk drive, and 25 W to power DRAM and the motherboard. With a typical efficiency of about 75 percent for an ATX power supply, this translates into 120 W of AC power per server, or roughly 10 kW per rack. A rack comfortably fits in 25 ft² of space, resulting in a power density of 400 W/ft². With higher-end processors, the power density of a rack can exceed 700 W/ft².
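Those figures chain together as follows. All numbers come from the text; the 80-server rack count corresponds to the forty-1u-servers-per-side configuration described earlier.

    # Rack power-density arithmetic, using the figures from the text.
    dc_watts = 55 + 10 + 25       # CPUs + disk + DRAM/motherboard = 90 W DC
    ac_watts = dc_watts / 0.75    # ~75%-efficient supply -> 120 W AC/server
    rack_watts = ac_watts * 80    # 80 servers/rack -> 9,600 W (~10 kW)
    print(rack_watts / 25)        # over 25 sq ft -> 384 W/sq ft (~400)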


Unfortunately, the typical power density for commercial data centers lies between 70 and 150 W/ft², much lower than that required for PC clusters. As a result, even low-tech PC clusters using relatively straightforward packaging need special cooling or additional space to bring down power density to that which is tolerable in typical data centers. Thus, packing even more servers into a rack could be of limited practical use for large-scale deployment as long as such racks reside in standard data centers. This situation leads to the question of whether it is possible to reduce the power usage per server.

Reduced-power servers are attractive for large-scale clusters, but you must keep some caveats in mind. First, reduced power is desirable, but, for our application, it must come without a corresponding performance penalty: What counts is watts per unit of performance, not watts alone. Second, the lower-power server must not be considerably more expensive, because the cost of depreciation typically outweighs the cost of power. The earlier-mentioned 10-kW rack consumes about 10 MWh of power per month (including cooling overhead). Even at a generous 15 cents per kilowatt-hour (half for the actual power, half to amortize uninterruptible power supply [UPS] and power distribution equipment), power and cooling cost only $1,500 per month. Such a cost is small in comparison to the depreciation cost of $7,700 per month. Thus, low-power servers must not be more expensive than regular servers to have an overall cost advantage in our setup.
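A quick check of the power-versus-depreciation comparison, using only figures stated in the text:

    # Monthly energy cost of the ~10-kW rack versus its depreciation.
    it_load_kwh = 10 * 24 * 30         # 10 kW for ~720 h -> 7,200 kWh
    total_kwh = 10_000                 # ~10 MWh/month including cooling
    print(total_kwh / it_load_kwh)     # ~1.4x implied cooling/overhead factor
    cost_usd = total_kwh * 0.15        # 15 cents/kWh (energy + UPS share)
    print(cost_usd)                    # -> 1500.0, vs. ~$7,700 depreciation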

Hardware-level application characteristics

Examining various architectural characteristics of our application helps illustrate which hardware platforms will provide the best price/performance for our query-serving system. We’ll concentrate on the characteristics of the index server, the component of our infrastructure whose price/performance most heavily impacts overall price/performance. The main activity in the index server consists of decoding compressed information in the inverted index and finding matches against a set of documents that could satisfy a query. Table 1 shows some basic instruction-level measurements of the index server program running on a 1-GHz dual-processor Pentium III system.

Table 1. Instruction-level measurements on the index server.

Characteristic                    Value
Cycles per instruction            1.1
Ratios (percentage)
  Branch mispredict               5.0
  Level 1 instruction miss*       0.4
  Level 1 data miss*              0.7
  Level 2 miss*                   0.3
  Instruction TLB miss*           0.04
  Data TLB miss*                  0.7

* Cache and TLB ratios are per instruction retired.

The application has a moderately high CPI, considering that the Pentium III is capable of issuing three instructions per cycle. We expect such behavior, considering that the application traverses dynamic data structures and that control flow is data dependent, creating a significant number of difficult-to-predict branches. In fact, the same workload running on the newer Pentium 4 processor exhibits nearly twice the CPI and approximately the same branch prediction performance, even though the Pentium 4 can issue more instructions concurrently and has superior branch prediction logic. In essence, there isn’t that much exploitable instruction-level parallelism (ILP) in the workload. Our measurements suggest that the level of aggressive out-of-order, speculative execution present in modern processors is already beyond the point of diminishing performance returns for such programs.

A more profitable way to exploit parallelism for applications such as the index server is to leverage the trivially parallelizable computation. Processing each query shares mostly read-only data with the rest of the system, and constitutes a work unit that requires little communication. We already take advantage of that at the cluster level by deploying large numbers of inexpensive nodes, rather than fewer high-end ones. Exploiting such abundant thread-level parallelism at the microarchitecture level appears equally promising. Both simultaneous multithreading (SMT) and chip multiprocessor (CMP) architectures target thread-level parallelism and should improve the performance of many of our servers.


Some early experiments with a dual-context (SMT) Intel Xeon processor show more than a 30 percent performance improvement over a single-context setup. This speedup is at the upper bound of improvements reported by Intel for their SMT implementation [3].

We believe that the potential for CMP systems is even greater. CMP designs, such as Hydra [4] and Piranha [5], seem especially promising. In these designs, multiple (four to eight) simpler, in-order, short-pipeline cores replace a complex high-performance core. The penalties of in-order execution should be minor given how little ILP our application yields, and shorter pipelines would reduce or eliminate branch mispredict penalties. The available thread-level parallelism should allow near-linear speedup with the number of cores, and a shared L2 cache of reasonable size would speed up interprocessor communication.

Memory system

Table 1 also outlines the main memory system performance parameters. We observe good performance for the instruction cache and instruction translation look-aside buffer, a result of the relatively small inner-loop code size. Index data blocks have no temporal locality, due to the sheer size of the index data and the unpredictability in access patterns for the index’s data blocks. However, accesses within an index data block do benefit from spatial locality, which hardware prefetching (or possibly larger cache lines) can exploit. The net effect is good overall cache hit ratios, even for relatively modest cache sizes.

Memory bandwidth does not appear to be a bottleneck. We estimate the memory bus utilization of a Pentium-class processor system to be well under 20 percent. This is mainly due to the amount of computation required (on average) for every cache line of index data brought into the processor caches, and to the data-dependent nature of the data fetch stream. In many ways, the index server’s memory system behavior resembles the behavior reported for the Transaction Processing Performance Council’s benchmark D (TPC-D) [6]. For such workloads, a memory system with a relatively modest-sized L2 cache, short L2 cache and memory latencies, and longer (perhaps 128-byte) cache lines is likely to be the most effective.

Large-scale multiprocessing

As mentioned earlier, our infrastructure consists of a massive cluster of inexpensive desktop-class machines, as opposed to a smaller number of large-scale shared-memory machines. Large shared-memory machines are most useful when the computation-to-communication ratio is low; communication patterns or data partitioning are dynamic or hard to predict; or when total cost of ownership dwarfs hardware costs (due to management overhead and software licensing prices). In those situations they justify their high price tags.

At Google, none of these requirements apply, because we partition index data and computation to minimize communication and evenly balance the load across servers. We also produce all our software in-house, and minimize system management overhead through extensive automation and monitoring, which makes hardware costs a significant fraction of the total system operating expenses. Moreover, large-scale shared-memory machines still do not handle individual hardware component or software failures gracefully, with most fault types causing a full system crash. By deploying many small multiprocessors, we contain the effect of faults to smaller pieces of the system. Overall, a cluster solution fits the performance and availability requirements of our service at significantly lower costs.

At first sight, it might appear that there are few applications that share Google’s characteristics, because there are few services that require many thousands of servers and petabytes of storage. However, many applications share the essential traits that allow for a PC-based cluster architecture. As long as an application is oriented toward price/performance and can run on servers that have no private state (so servers can be replicated), it might benefit from using a similar architecture. Common examples include high-volume Web servers or application servers that are computationally intensive but essentially stateless. All of these applications have plenty of request-level parallelism, a characteristic exploitable by running individual requests on separate servers. In fact, larger Web sites already commonly use such architectures.


At Google’s scale, some limits of massive server parallelism do become apparent, such as the limited cooling capacity of commercial data centers and the less-than-optimal fit of current CPUs for throughput-oriented applications. Nevertheless, using inexpensive PCs to handle Google’s large-scale computations has drastically increased the amount of computation we can afford to spend per query, thus helping to improve the Internet search experience of tens of millions of users.

Acknowledgments

Over the years, many others have made contributions to Google’s hardware architecture that are at least as significant as ours. In particular, we acknowledge the work of Gerald Aigner, Ross Biro, Bogdan Cocosel, and Larry Page.

References
1. S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Proc. Seventh World Wide Web Conf. (WWW7), International World Wide Web Conference Committee (IW3C2), 1998, pp. 107-117.
2. “TPC Benchmark C Full Disclosure Report for IBM eserver xSeries 440 using Microsoft SQL Server 2000 Enterprise Edition and Microsoft Windows .NET Datacenter Server 2003, TPC-C Version 5.0,” http://www.tpc.org/results/FDR/TPCC/ibm.x4408way.c5.fdr.02110801.pdf.
3. D. Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture: A Hypertext History,” Intel Technology J., vol. 6, issue 1, Feb. 2002.
4. L. Hammond, B. Nayfeh, and K. Olukotun, “A Single-Chip Multiprocessor,” Computer, vol. 30, no. 9, Sept. 1997, pp. 79-85.
5. L.A. Barroso et al., “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,” Proc. 27th ACM Int’l Symp. Computer Architecture, ACM Press, 2000, pp. 282-293.
6. L.A. Barroso, K. Gharachorloo, and E. Bugnion, “Memory System Characterization of Commercial Workloads,” Proc. 25th ACM Int’l Symp. Computer Architecture, ACM Press, 1998, pp. 3-14.

Luiz André Barroso is a member of the Systems Lab at Google, where he has focused on improving the efficiency of Google’s Web search and on Google’s hardware architecture. Barroso has a BS and an MS in electrical engineering from Pontifícia Universidade Católica, Brazil, and a PhD in computer engineering from the University of Southern California. He is a member of the ACM.

Jeffrey Dean is a distinguished engineer in the Systems Lab at Google and has worked on the crawling, indexing, and query serving systems, with a focus on scalability and improving relevance. Dean has a BS in computer science and economics from the University of Minnesota and a PhD in computer science from the University of Washington. He is a member of the ACM.

Urs Hölzle is a Google Fellow and in his previous role as vice president of engineering was responsible for managing the development and operation of the Google search engine during its first two years. Hölzle has a diploma from the Eidgenössische Technische Hochschule Zürich and a PhD from Stanford University, both in computer science. He is a member of IEEE and the ACM.

Direct questions and comments about this article to Urs Hölzle, 2400 Bayshore Parkway, Mountain View, CA 94043; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://computer.org/publications/dlib.
