Stephen McHenry - Chancellor of Site Reliability Engineering, Google


Description: "Woulda, Coulda, Shoulda: The World of Tera, Peta, and Exa…"

Transcript of Stephen McHenry - Chancellor of Site Reliability Engineering, Google

Page 1

Woulda, Coulda, Shoulda: The World of Tera, Peta & Exa

Stephen McHenry

Chancellor of Site Reliability Engineering

April 22, 2009

Page 2

Overview

•  Mission Statement
•  Some History
•  Planning for
   •  Failure
   •  Expansion
      •  Applications
      •  Infrastructure
      •  Hardware
•  The Future

Page 3

Google’s Mission

To organize the world’s information and make it universally accessible and useful

Page 4

Overview

•  Mission Statement
•  Some History
•  Planning for
   •  Failure
   •  Expansion
      •  Applications
      •  Infrastructure
      •  Hardware
•  The Future

Page 5

One of our earliest storage systems

Lego Disk Case

Page 6

A peek at google.stanford.edu (circa 1997)

Page 7

The Infamous “Corkboard”

Page 8

Many Corkboards (1999)

Page 9

A Data Center in 1999…

Page 10

Another Data Center, Spring 2000

Note the Cooling

Page 11

google.com (new data center 2001)

Page 12

google.com (3 days later)

Page 13

Current Data Center

Page 14

Overview

•  Mission Statement
•  Some History
•  The Challenge
•  Planning for
   •  Failure
   •  Expansion
      •  Applications
      •  Infrastructure
      •  Hardware
•  The Future

Page 15

Just For Reference

Terabyte – 10^12 Bytes – 1,000,000,000,000 Bytes

Petabyte – 10^15 Bytes – 1,000 Terabytes – 1,000,000,000,000,000 Bytes

Exabyte – 10^18 Bytes – 1 Million Terabytes – 1,000,000,000,000,000,000 Bytes

Zettabyte – 10^21 Bytes – 1 Billion Terabytes – 1,000,000,000,000,000,000,000 Bytes

Yottabyte – 10^24 Bytes – 1 Trillion Terabytes – 1,000,000,000,000,000,000,000,000 Bytes
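For intuition, here is a tiny editorial helper (not from the talk) that renders a byte count in these decimal units:

    # Express a byte count in the decimal (SI) units listed above.
    UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

    def human_bytes(n):
        for unit in UNITS:
            if n < 1000 or unit == UNITS[-1]:
                return f"{n:g} {unit}"
            n /= 1000

    print(human_bytes(5e18))   # '5 EB' -- the estimated yearly production c. 2003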

Page 16

How much information is out there?

How large is the Web?
•  Tens of billions of documents? Hundreds of billions?
•  ~10 KB/doc => 100s of Terabytes

Then there's everything else:
•  Email, personal files, closed databases, broadcast media, print, etc.
•  Estimated 5 Exabytes/year (growing at 30%)*
•  800 MB/year/person – ~90% in magnetic media

The Web is just a tiny starting point.

* Source: "How Much Information? 2003"

Page 17

Google takes its mission seriously

Started with the Web (HTML)

Added various document formats:
•  Images
•  Commercial data: ads and shopping (Froogle)
•  Enterprise (corporate data)
•  News
•  Email (Gmail)
•  Scholarly publications
•  Local information
•  Maps
•  Yellow pages
•  Satellite images
•  Instant messaging and VoIP
•  Communities (Orkut)
•  Printed media
•  …

Page 18

Ever-Increasing Computation Needs

(Diagram: a reinforcing cycle among more queries, more data, and better results.)

Every Google service sees continuing growth in computational needs:
•  More queries – more users, happier users
•  More data – bigger web, mailbox, blog, etc.
•  Better results – find the right information, and find it faster

Page 19

Overview

•  Mission Statement
•  Some History
•  The Challenge
•  Planning for
   •  Failure
   •  Expansion
      •  Applications
      •  Infrastructure
      •  Hardware
•  The Future

Page 20

When Your Data Center Reaches 170°F

Page 21

The Joys of Real Hardware

Typical first year for a new cluster:

•  ~0.5 overheating events (power down most machines in <5 mins, ~1–2 days to recover)
•  ~1 PDU failure (~500–1000 machines suddenly disappear, ~6 hours to come back)
•  ~1 rack-move (plenty of warning, ~500–1000 machines powered down, ~6 hours)
•  ~1 network rewiring (rolling ~5% of machines down over 2-day span)
•  ~20 rack failures (40–80 machines instantly disappear, 1–6 hours to get back)
•  ~5 racks go wonky (40–80 machines see 50% packet loss)
•  ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
•  ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
•  ~3 router failures (have to immediately pull traffic for an hour)
•  ~dozens of minor 30-second blips for DNS
•  ~1000 individual machine failures
•  ~thousands of hard drive failures
•  Plus slow disks, bad memory, misconfigured machines, flaky machines, etc.
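An editorial back-of-envelope sketch (not from the talk): plugging a few of those rates into expected-value arithmetic shows why software has to treat failure as the normal case. The 2000-machine cluster size and the per-event recovery times below are assumptions chosen to match the slide's ranges.

    # Editorial back-of-envelope; event rates from the slide above.
    CLUSTER_MACHINES = 2000
    HOURS_PER_YEAR = 24 * 365

    # (events/year, machines affected per event, hours down per event)
    events = [
        (1,    750, 6.0),   # PDU failure: ~500-1000 machines, ~6 h
        (1,    750, 6.0),   # rack move: ~500-1000 machines, ~6 h
        (20,    60, 3.5),   # rack failures: 40-80 machines, 1-6 h
        (1000,   1, 4.0),   # individual machine failures (~4 h each, assumed)
    ]

    lost = sum(n * m * h for n, m, h in events)       # machine-hours lost
    capacity = CLUSTER_MACHINES * HOURS_PER_YEAR      # machine-hours available
    print(f"lost: {lost:,.0f} machine-hours ({lost / capacity:.2%} of capacity)")
    # ~0.1% sounds small, but it means a couple of machines are down at any
    # given moment; replication and failover must be the default.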

Page 22

Overview

•  Mission Statement
•  Some History
•  The Challenge
•  Planning for
   •  Failure
   •  Expansion
      •  Applications
      •  Infrastructure
      •  Hardware
•  The Future

Page 23

Components of Web Search

Crawler (Spider): collects the documents
•  Tradeoff between size and speed
•  High networking bandwidth requirements
•  Be gentle to serving hosts while doing it

Indexer: generates the index – similar to the back of a book (but big!)
•  Requires several days on thousands of computers
•  More than 20 billion web documents (Web, Images, News, Usenet messages, …)
•  Pre-compute query-independent ranking (PageRank, etc.)

Query serving: processes user queries
•  Finding all relevant documents – search over tens of Terabytes, 1000s of times/second
•  Scoring – mix of query-dependent and query-independent factors

(Diagram – crawling process: maintain a list of links to explore; get a link from the list, fetch the page, parse the page to extract links, and add each new URL to the queue; expired pages are removed from the index. A sketch of this loop follows.)
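A minimal sketch of that crawl loop in Python (an editorial illustration; a real crawler adds politeness limits, robots.txt handling, deduplication, retries, and distributed work queues):

    # Editorial sketch of the crawl loop in the diagram above.
    from collections import deque
    from urllib.parse import urljoin
    import re
    import urllib.request

    LINK_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

    def crawl(seed_urls, max_pages=100):
        queue = deque(seed_urls)              # list of links to explore
        seen = set(seed_urls)
        while queue and max_pages > 0:
            url = queue.popleft()             # get link from list
            try:
                raw = urllib.request.urlopen(url, timeout=5).read()
            except (OSError, ValueError):
                continue                      # fetch failed; move on
            max_pages -= 1
            page = raw.decode("utf-8", errors="replace")
            yield url, page                   # hand the document to the indexer
            for href in LINK_RE.findall(page):    # parse page to extract links
                link = urljoin(url, href)
                if link.startswith("http") and link not in seen:
                    seen.add(link)            # add each new URL to the queue once
                    queue.append(link)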

Page 24

Google Query Serving Infrastructure

(Diagram: a query arrives at the Google Web Server, which consults misc. servers such as the spell checker and ad server, fans out to the index servers (index shards I0, I1, I2, … IN, each shard replicated), and then to the doc servers (doc shards D0, D1, … DM, each shard replicated). A toy sketch of this fan-out follows.)

Elapsed time: 0.25s, machines involved: 1000+
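An editorial sketch of that fan-out pattern (toy Python, not Google's code): partition the index into shards, query every shard in parallel, and merge the top-scoring hits. Shard replicas, omitted here, would let any one of several copies serve each shard.

    # Toy scatter/gather over index shards (editorial sketch).
    from concurrent.futures import ThreadPoolExecutor
    import heapq

    # Toy "index shards": each maps a term to (doc_id, score) postings.
    SHARDS = [
        {"exabyte": [(101, 0.9), (102, 0.4)]},
        {"exabyte": [(201, 0.7)], "petabyte": [(202, 0.8)]},
        {"petabyte": [(301, 0.6)]},
    ]

    def search_shard(shard, term, k):
        return heapq.nlargest(k, shard.get(term, []), key=lambda p: p[1])

    def search(term, k=2):
        with ThreadPoolExecutor() as pool:        # scatter: query every shard
            partials = pool.map(lambda s: search_shard(s, term, k), SHARDS)
        merged = [hit for part in partials for hit in part]
        return heapq.nlargest(k, merged, key=lambda p: p[1])   # gather + merge

    print(search("exabyte"))   # [(101, 0.9), (201, 0.7)]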

Page 25

Ads System

As challenging as search
•  But with some transactional semantics

Problem: find useful ads based on what the user is interested in at that moment
•  A form of mind reading

Two systems:
•  Ads for search results pages (search for tires or restaurants)
•  Ads for web browsing/email (or ‘content ads’)
   –  Extract a contextual meaning from web pages
   –  Do the same thing for data from a gazillion advertisers
   –  Match those up and score them
   –  Do it faster than the original content provider can respond to the web page!

Page 26

Example: Sunday NY Times

Page 27

Language Translation (by Machine)

Information is more useful if more people can understand it

Translation is a long-standing, challenging Artificial Intelligence problem

Key insight:
•  Transform it into a statistical modeling problem
•  Train it with tons of data!

(Chart: translation quality for Arabic-English and Chinese-English vs. training corpus size – doubling the training corpus size gives a ~0.5% higher score.)
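An editorial aside on the statistical formulation (the classical noisy-channel model from the machine-translation literature, not necessarily Google's exact system): given a foreign sentence f, choose the English sentence e that maximizes

    e* = argmax over e of  P(e) · P(f | e)

where P(e) is a language model estimated from huge amounts of English text and P(f | e) is a translation model estimated from aligned bilingual text. Both models get sharper with more data, which is why corpus size drives the score gains shown above.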

Page 28

Data + CPUs = Playground

Substantial fraction of internet available for processing

Easy-to-use teraflops/petabytes

Cool problems, great fun…

Page 29

Learning From Data

Searching for Britney Spears…

Page 30

Query Frequency Over Time

(Charts: query frequency over time for queries containing “eclipse”, “full moon”, “watermelon”, “opteron”, “summer olympics”, and “world series”.)

Page 31

WhiteHouse.gov/openforquestions

Page 32

A Simple Challenge For Our Computing Platform

1.  Create the world’s largest computing infrastructure

2.  Make sure we can afford it

Need to drive efficiency of the computing infrastructure to unprecedented levels:
•  indices containing more documents
•  updated more often
•  faster queries
•  faster product development cycles
•  …

Page 33

Overview

•  Mission Statement
•  Some History
•  The Challenge
•  Planning for
   •  Failure
   •  Expansion
      •  Applications
      •  Infrastructure
      •  Hardware
•  The Future

Page 34

Systems Infrastructure

Google File System (GFS)

MapReduce

BigTable

Page 35

GFS: Google File System

Planning – for unprecedented quantities of data storage & failure(s)

Google has unique FS requirements:
•  Huge read/write bandwidth
•  Reliability over thousands of nodes
•  Mostly operating on large data blocks
•  Need efficient distributed operations

GFS usage @ Google:
•  Many clusters
•  Filesystem clusters of up to 5000+ machines
•  Pools of 10000+ clients
•  5+ PB filesystems
•  40 GB/s read/write load in a single cluster
•  (in the presence of frequent HW failures)

Page 36

GFS Setup

•  Master manages metadata
•  Data transfers happen directly between clients/machines

(Diagram: clients and misc. servers ask the replicated GFS masters for metadata; data chunks C0, C1, C2, C3, C5, … are stored with replicas spread across Machine 1, Machine 2, … Machine N, and chunk data flows directly between clients and those machines. A sketch of the read path follows.)
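A hedged sketch of the read path this split implies (toy Python; the class and method names are invented for illustration, not the real GFS API): the client contacts the master only for chunk locations, then reads bytes directly from a chunkserver machine.

    # Toy GFS-style read path (editorial sketch; names are invented).
    CHUNK_SIZE = 64 * 2**20        # GFS used 64 MB chunks

    class ToyMaster:
        """Metadata only: (path, chunk index) -> list of replica machines."""
        def __init__(self, chunk_locations):
            self.chunk_locations = chunk_locations

        def locate(self, path, offset):
            idx = offset // CHUNK_SIZE
            return idx, self.chunk_locations[(path, idx)]

    def read(master, machines, path, offset, length):
        idx, replicas = master.locate(path, offset)   # small metadata RPC
        data = machines[replicas[0]][(path, idx)]     # bulk data, direct from machine
        start = offset % CHUNK_SIZE
        return data[start:start + length]

    master = ToyMaster({("/logs/a", 0): ["machine1", "machineN"]})
    machines = {"machine1": {("/logs/a", 0): b"hello world"}}
    print(read(master, machines, "/logs/a", 6, 5))    # b'world'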

Page 37

MapReduce – Large Scale Processing

Okay, GFS lets us store lots of data… now what?

We need to process that data in new and interesting ways!
•  Fast: locality optimization, optimized sorter, lots of tuning work done…
•  Robust: handles machine failure, bad records, …
•  Easy to use: little boilerplate, supports many formats, …
•  Scalable: can easily add more machines to handle more data or reduce the run-time
•  Widely applicable: can solve a broad range of problems
•  Monitoring: status page, counters, …

The Plan – Develop a robust compute infrastructure that allows rapid development of complex analyses, and is tolerant to failure(s)

Page 38

MapReduce – Large Scale Processing

MapReduce:
•  A framework to simplify large-scale computations on large clusters
•  Good for batch operations
•  User writes two simple functions: map and reduce
•  Underlying library/framework takes care of messy details
•  Greatly simplifies large, distributed data processing

Lots of uses inside Google: Sawmill (logs analysis), Search My History, search quality, spelling, web search indexing, Ads, Froogle, Google Earth, Google Local, Google News, Google Print, machine translation, and many other internal projects.
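The canonical example of the two user-written functions is word count. Below is a minimal single-machine sketch of the programming model (the real library distributes the map, shuffle, and reduce phases across thousands of machines and handles all the failure cases listed earlier):

    # Word count in the MapReduce style (single-process editorial sketch).
    from collections import defaultdict

    def map_fn(doc_id, text):
        for word in text.split():          # map: emit (word, 1) per occurrence
            yield word.lower(), 1

    def reduce_fn(word, counts):
        yield word, sum(counts)            # reduce: total the counts per word

    def run_mapreduce(inputs, map_fn, reduce_fn):
        groups = defaultdict(list)
        for doc_id, text in inputs:                  # map phase
            for key, value in map_fn(doc_id, text):
                groups[key].append(value)            # shuffle: group by key
        results = {}
        for key, values in groups.items():           # reduce phase
            for out_key, out_value in reduce_fn(key, values):
                results[out_key] = out_value
        return results

    docs = [("d1", "tera peta exa"), ("d2", "peta exa exa")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # {'tera': 1, 'peta': 2, 'exa': 3}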

Page 39

Large Scale Processing – (semi) Structured Data

Why not just use a commercial DB?
•  Scale is too large for most commercial databases
•  Even if it weren't, cost would be very high
   –  Building internally means the system can be applied across many projects for low incremental cost
•  Low-level storage optimizations help performance significantly
   –  Much harder to do when running on top of a database layer

Okay, traditional relational databases are woefully inadequate at this scale… now what?

The Plan – Build a large scale, distributed solution for semi-structured data, that is resistant to failure(s)

Page 40

Large Scale Processing – (semi) Structured Data

BigTable:

•  A large-scale storage system for semi-structured data
•  Database-like model, but data stored on thousands of machines
•  Fault-tolerant, persistent
•  Scalable
   –  Thousands of servers
   –  Terabytes of in-memory data
   –  Petabytes of disk-based data
   –  Millions of reads/writes per second, efficient scans
   –  Billions of URLs, many versions/page (~20K/version)
   –  Hundreds of millions of users, thousands of queries/sec
   –  100TB+ of satellite image data
•  Self-managing
   –  Servers can be added/removed dynamically
   –  Servers adjust to load imbalance
•  Design/initial implementation started beginning of 2004
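BigTable's data model, as described in the published paper, is a sparse, distributed, persistent map from (row key, column, timestamp) to an uninterpreted byte string. A toy in-memory sketch of that map, purely illustrative (it ignores tablets, SSTables, compactions, and distribution entirely):

    # Toy sketch of the BigTable data model:
    #   (row, column, timestamp) -> value, with multiple versions per cell.
    import time
    from collections import defaultdict

    class ToyBigtable:
        def __init__(self):
            # row -> column -> list of (timestamp, value), newest first
            self.rows = defaultdict(lambda: defaultdict(list))

        def put(self, row, column, value, ts=None):
            cell = self.rows[row][column]
            cell.append((ts if ts is not None else time.time(), value))
            cell.sort(reverse=True)            # keep newest version first

        def get(self, row, column, n_versions=1):
            return self.rows[row][column][:n_versions]

    # Row/column names below follow the BigTable paper's webtable example.
    t = ToyBigtable()
    t.put("com.cnn.www", "contents:", "<html>v1</html>", ts=1)
    t.put("com.cnn.www", "contents:", "<html>v2</html>", ts=2)
    t.put("com.cnn.www", "anchor:cnnsi.com", "CNN", ts=1)
    print(t.get("com.cnn.www", "contents:"))   # newest version only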

Page 41

BigTable Usage

Useful for structured/semi-structured data:
•  URLs – contents, crawl metadata, links, anchors, PageRank, …
•  Per-user data – user preference settings, recent queries/search results, …
•  Geographic data – physical entities, roads, satellite imagery, annotations, …

Production use or active development for ~70 projects:
•  Google Print
•  My Search History
•  Orkut
•  Crawling/indexing pipeline
•  Google Maps/Google Earth
•  Blogger
•  …

Currently ~500 BigTable cells. The largest BigTable cell manages ~3000 TB of data spread over several thousand machines (larger cells planned).

Page 42

Overview

•  Mission Statement
•  Some History
•  The Challenge
•  Planning for
   •  Failure
   •  Expansion
      •  Applications
      •  Infrastructure
      •  Hardware
•  The Future

Page 43

A Simple Challenge For Our Computing Platform

1.  Create the world’s largest computing infrastructure

2.  Make sure we can afford it

Need to drive efficiency of the computing infrastructure to unprecedented levels:
•  indices containing more documents
•  updated more often
•  faster queries
•  faster product development cycles
•  …

Page 44

Innovative Solutions Needed In Several Areas

Server design and architecture

Power efficiency

System software

Large scale networking

Performance tuning and optimization

System management and repairs automation

Page 45

Pictorial History

•  Brainstorming circa 2003
•  Container-based data centers
•  Battery per server instead of traditional UPS
   o  99.9% efficient backup power!
•  Application of best practices leads to PUE below 1.2 (see the worked example below)
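For reference (the standard definition of PUE, not something specific to this deck): PUE = total facility power ÷ IT equipment power. Worked against this facility's stated 10 MW IT load:

    PUE 1.2:  total = 1.2 × 10 MW = 12 MW  →  2 MW of cooling/distribution overhead
    PUE 2.0:  total = 2.0 × 10 MW = 20 MW  →  10 MW of overhead (a commonly cited
              figure for typical data centers of that era)

So driving PUE from ~2.0 down below 1.2 cuts the non-IT power bill by roughly 80%.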

Page 46

Pictorial History

Prototype arriving at Google, Jan 2005

Page 47

Pictorial History

The first crane was too small – Take 2

Page 48

Pictorial History

Google prototypes first airborne data center

Page 49

Pictorial History

And into the parking garage we go

Page 50

Data Center Vitals

•  Capacity: 10 MW IT load

•  Area: 75,000 sq ft total under roof

•  Overall power density: 133 W/sq ft

•  Prototype container delivered January 2005

•  Data center built 2004-2005

•  Construction completed September 2005

•  Went live November 21, 2005

Page 51

Additional Vitals

•  45 containers, approx. 40,000 servers

•  Single and 2-story on facing sides of hangar

•  Bridge crane for container handling

Page 52

Overview

•  Mission Statement
•  Some History
•  The Challenge
•  Planning for
   •  Failure
   •  Expansion
      •  Applications
      •  Infrastructure
      •  Hardware
•  The Future

Page 53

Planning for the Future

•  Manage Total Cost of Ownership

•  Reduce Water Usage

•  Reduce Power Consumption

•  Manage E-Waste

Page 54

Total Cost of Ownership - TCO

Earnings and sustainability are (often) aligned
•  Careful application of best practices leads to much lower energy use, which leads to lower TCO for facilities. Examples:
   o  Manage air flow – avoid hot/cold mixing
   o  Raise the inlet temperature
   o  Use free cooling (Belgium has no chillers!)
   o  Optimize power distribution
•  Don't need exotic technologies
•  But: need to break down traditional silos
   o  Between capex and opex
   o  Between facilities and IT
   o  Manage everyone by impact on TCO

Page 55

Water resources management is the next "elephant in the room" we are all going to have to address.

Page 56

A Great Wave Rising: The Coming U.S. Crisis in Water Policy

(Photos: Lake Powell – 53% full; Shasta Lake – from ESPN!)

Page 57

(Photos: Lake Oroville – new docks; Lake Mead historical levels; Lake Mead – 45% full.)

Lake Mead water could dry up by 2021*

* Scripps Institution of Oceanography, UCSD, Feb. 2008

Page 58

Georgia's Lake Lanier

(Photos: March 4, 2007 vs. February 11, 2008.)

Page 59

Lake Hartwell, GA – November 2008

Page 60

Water – The Next "Big Elephant"

Why?
•  Water resources are becoming (a lot) scarcer and more variable

How do data centers fit in?
•  For every 10 MW consumed, the average data center uses ~150,000 gallons of water per day for cooling.
•  Upstream of the data center, the same 10 MW of delivered power consumes 480,000 gallons of water per day to generate that power.

References:
•  U.S. Dept. of Energy – Energy Demands on Water Resources – Dec. 2006
•  National Renewable Energy Laboratory – Consumptive Water Use for U.S. Power Production – Dec. 2003
•  USGS – Water Use at Home – Jan. 2009
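Putting the slide's two water figures together (simple arithmetic, added for clarity):

    cooling at the data center:  ~150,000 gal/day
    upstream power generation:   ~480,000 gal/day
    total for 10 MW:             ~630,000 gal/day  →  ~63,000 gal/day per MW

About three quarters of the water is consumed upstream at the power plant, which is why using less power reduces water consumption more than any change to cooling alone, as the next slide notes.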

Page 61

Factoid: the typical 'water-less' DC uses about a third more water than the evaporatively cooled Google DC.

Using less power is the most significant factor for reducing water consumption.

(Chart: water consumption in gallons per day (gpd) by DC type.)

Page 62

Water Recycling:

(Photo: our data center in St. Ghislain, Belgium.)

Google's data center in Belgium uses 100% reclaimed water from an industrial canal.

Page 63

Power – Cutting waste / Smarter computing

Fact: The typical PC wastes half the electricity it uses
Fact: Over 60% of all corporate PCs are left on overnight

•  End-user devices are the largest portion of the IT footprint
•  Power efficiency is critical as billions of devices are deployed
•  The technology exists today to save energy and money

•  Buy power-efficient laptops / PCs / servers – Google saves $30 per server every year
•  Enable power management – power management suites: ROI < 1 year
•  Transition to lightweight devices – reduce power from 150 W to less than 5 W

Potential: 50% emissions reduction

Page 64

E-waste is a Growing Problem

•  Hazardous
•  High volume because of obsolescence
•  Ubiquitous (computers, appliances, consumer electronics, cell phones)

Solutions:
•  4 R's: Reduce, reuse, repair, recycle
•  Dispose of the remainder responsibly

Page 65

Thank you!