Post on 02-Feb-2016
1
An Overview of Cloud Computing @ Yahoo!
Raghu Ramakrishnan
Chief Scientist, Audience and Cloud Computing
Research Fellow, Yahoo! Research
Reflects many discussions with: Eric Baldeschwieler, Jay Kistler, Chuck Neerdaels, Shelton Shugar, and Raymie Stata
and joint work with the Sherpa team, in particular:
Brian Cooper, Utkarsh Srivastava, Adam Silberstein, Rodrigo Fonseca and Nick Puz in Y! Research
Chuck Neerdaels, P.P. Suryanarayanan and many others in CCDI
2
Questions
• What is cloud computing?
– Horizontal and functional services
• What’s it going to change?
– Software business models, science, life
• How many clouds will there be?
– 1, 2, 3, infinity
• What’s new in cloud computing?
– HPC grids, ASPs, hosted services, Multics (!)
– Emerging “cloud stack” to support a broad class of programs, including data intensive applications
3
SCENARIOS
Pie-in-the-sky
4
Living in the Clouds
• We want to start a new website, FredsList.com
• Our site will provide listings of items for sale, jobs, etc.
• As time goes on, we’ll add more features
– And illustrate how more cloud capabilities (and corresponding infrastructure components) are used as needed
• List of capabilities/components is illustrative, not exhaustive
• Our cloud provides a “dataset” abstraction
– FredsList doesn’t worry about the underlying components
5
Step 1: Listings Scenario
FredsList wants to store listings as (key, category, description)
FredsList.com application
Simple Web Service APIs
Database: PNUTS
DECLARE DATASET Listings AS (ID String PRIMARY KEY, Category String, Description Text)
1234323, transportation, For sale: one bicycle, barely used
5523442, childcare, Nanny available in San Jose
215534, wanted, Looking for issue 1 of Superman comic book
6
Step 2: System Evolution
Fred belatedly realizes prices are useful information!
FredsList.com application
Simple Web Service APIs
Database: PNUTS
ALTER DATASET Listings ADD (Price Float)
Schemas are flexible, and evolve
32138, camera, Nikon D40, USD 300
vs.
1234323, transportation, For sale: one bicycle, barely used
5523442, childcare, Nanny available in San Jose
215534, wanted, Looking for issue 1 of Superman comic book
Not every record in a dataset has values defined for all fields declared for the dataset
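A minimal sketch, in Python, of what an evolving schema with sparse records could look like. The `Dataset` class and its methods are hypothetical names invented for illustration, not the PNUTS API, and no type checking is attempted.

```python
class Dataset:
    """Toy dataset with a declared, evolvable set of fields (hypothetical)."""

    def __init__(self, name, fields, key):
        self.name = name
        self.fields = dict(fields)   # field name -> declared type
        self.key = key               # name of the primary-key field
        self.records = {}            # primary key -> sparse dict of values

    def add_field(self, field, ftype):
        # Similar in spirit to ALTER DATASET ... ADD; existing records are untouched.
        self.fields[field] = ftype

    def put(self, **values):
        unknown = set(values) - set(self.fields)
        if unknown:
            raise KeyError(f"undeclared fields: {sorted(unknown)}")
        self.records[values[self.key]] = values

    def get(self, key, field):
        # A field that was never set on this record reads as None.
        return self.records[key].get(field)


listings = Dataset("Listings", {"ID": str, "Category": str, "Description": str}, key="ID")
listings.put(ID="1234323", Category="transportation",
             Description="For sale: one bicycle, barely used")
listings.add_field("Price", float)
listings.put(ID="32138", Category="camera", Description="Nikon D40", Price=300.0)
```

The older bicycle listing has no price after the field is added, while the camera listing does; neither record had to be rewritten.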
7
Step 3: Search
FredsList’s customers quickly ask for keyword search
FredsList.com application
Simple Web Service APIs
Database: PNUTS
Search: Vespa
Messaging: Tribble
ALTER Listings SET Description SEARCHABLE
Example queries: “bicycle”, “dvd’s”, “nanny”
Federation of systems offering different capabilities
8
Step 4: Photos
FredsList decides to add photos/videos to listings
FredsList.com application
Simple Web Service APIs
Database: PNUTS
Search: Vespa
Storage: MObStor
Messaging: Tribble
ALTER Listings ADD Photo BLOB
Foreign key: photo → listing
Federation of systems offering different performance points
9
Step 5: Data Analysis
FredsList wants to analyze its listings to get statistics about category, do geocoding, etc.
FredsList.com application
Simple Web Service APIs
Database: PNUTS
Search: Vespa
Storage: MObStor
Messaging: Tribble
Compute: Grid
ALTER Listings MAKE ANALYZABLE
Foreign key: photo → listing
Batch export
• Pig query to analyze categories
• Hadoop program to geocode data
• Hadoop program to generate fancy pages for listings
10
Step 6: Performance
FredsList wants to reduce its data access latency
FredsList.com application
Simple Web Service APIs
Database: PNUTS
Search: Vespa
Storage: MObStor
Messaging: Tribble
Compute: Grid
Caching: memcached
ALTER Listings MAKE CACHEABLE
Foreign key: photo → listing
Batch export
And by now, Fred is global, and wants geo-replication!
11
Data Serving vs. Analysis
• Very different workloads, requirements
• Data from serving system is one of many kinds of data (click streams are another common kind, as are syndicated feeds) to be analyzed and integrated
• The result of analysis often goes right back into serving system
12
EYES TO THE SKIES
Motherhood-and-Apple-Pie
13
Why Clouds?
• On-demand infrastructure to create a fundamental shift in the OE curve:
– Do things we can’t do
– Build more robustly, more efficiently, more globally, more completely, more quickly, for a given budget
• Cloud services should do the heavy lifting of scaling & high-availability
– Today, this is done at the app-level, which is not productive
14
Requirements for Cloud Services
• Multitenant. A cloud service must support multiple, organizationally distant customers.
• Elasticity. Tenants should be able to negotiate and receive resources/QoS on-demand.
• Resource Sharing. Ideally, spare cloud resources should be transparently applied when a tenant’s negotiated QoS is insufficient, e.g., due to spikes.
• Horizontal scaling. It should be possible to add cloud capacity in small increments; this should be transparent to the tenants of the service.
• Metering. A cloud service must support accounting that reasonably ascribes operational and capital expenditures to each of the tenants of the service.
• Security. A cloud service should be secure in that tenants are not made vulnerable because of loopholes in the cloud.
• Availability. A cloud service should be highly available.
• Operability. A cloud service should be easy to operate, with few operators. Operating costs should scale linearly or better with the capacity of the service.
15
Types of Cloud Services
• Two kinds of cloud services:
– Horizontal (“Platform”) Cloud Services
• Functionality enabling tenants to build applications or new services on top of the cloud
– Functional Cloud Services
• Functionality that is useful in and of itself to tenants. E.g., various SaaS instances, such as Salesforce.com; Google Analytics and Yahoo!’s IndexTools; Yahoo! properties aimed at end-users and small businesses, e.g., flickr, Groups, Mail, News, Shopping
• Could be built on top of horizontal cloud services or from scratch
• Yahoo! has been offering these for a long while (e.g., Mail for SMB, Groups, Flickr, BOSS, Ad exchanges)
16
Opening Up Yahoo! Search
Phase 1: Giving site owners and developers control over the appearance of Yahoo! Search results.
Phase 2: BOSS takes Yahoo!’s open strategy to the next level by providing Yahoo! Search infrastructure and technology to developers and companies to help them build their own search experiences.
18
BOSS Offerings
BOSS offers two options for companies and developers and has partnered with top technology universities to drive search experimentation, innovation and research into next generation search.
API
A self-service, web services model for developers and start-ups to quickly build and deploy new search experiences.
CUSTOM
Working with 3rd parties to build a more relevant, brand/site specific web search experience.
This option is jointly built by Yahoo! and select partners.
ACADEMIC
Working with the following universities to allow for wide-scale research in the search field:
• University of Illinois Urbana Champaign
• Carnegie Mellon University
• Stanford University
• Purdue University
• MIT
• Indian Institute of Technology Bombay
• University of Massachusetts
(Slide courtesy Prabhakar Raghavan)
19
Partner Examples
20
Horizontal Cloud Services
• Horizontal cloud services are foundations on which tenants build applications or new services. They should be:
– Semantics-free. Must be “generic infrastructure,” and not tied to specific app-logic.
• May provide the ability to inject application logic through well-defined APIs
– Broadly applicable. Must be broadly applicable (i.e., it can't be intended for just one or two properties).
– Fault-tolerant over commodity hardware. Must be built using inexpensive commodity hardware, and should mask component failures.
• While each cloud service provides value, the power of the cloud paradigm will depend on a collection of well-chosen, loosely coupled services that collectively make it easy to quickly develop and operate innovative web applications.
22
Yahoo! Cloud Stack
Horizontal Cloud Services, by layer:
EDGE: YCS, YCPI, Brooklyn, …
BATCH: Hadoop, …
STORAGE: Sherpa, MObStor, …
APP: VM/OS, …
WEB: VM/OS, yApache
Other components shown: Serving Grid, PHP App Engine
Alongside the layers: Provisioning (Self-serve), Monitoring/Metering/Security, Data Highway
23
Yahoo! CCDI Thrust Areas
• Fast Provisioning and Machine Virtualization: On demand, deliver a set of hosts imaged with desired software and configured against standard services
– Multiple hosts may be multiplexed onto the same physical machine.
• Batch Storage and Processing: Scalable data storage optimized for batch processing, together with computational capabilities
• Operational Storage: Persistent storage that supports low-latency updates and flexible retrieval
• Edge Content Services: Support for dealing with network topology, communication protocols, caching, and BCP
Rest of today’s talk
24
Web Data Management
Large data analysis (Hadoop)
• Scan oriented workloads
• Focus on sequential disk I/O
• $ per cpu cycle
Structured record storage (PNUTS/Sherpa)
• CRUD
• Point lookups and short scans
• Index organized table and random I/Os
• $ per latency
Blob storage (SAN/NAS)
• Object retrieval and streaming
• Scalable file storage
• $ per GB
25
Hadoop: Batch Storage/Analysis
Why is batch processing important?
• Whether it’s
– response-prediction for advertising,
– machine-learned relevance for Search, or
– content optimization for audience,
data-intensive computing is increasingly central to everything Yahoo! does
– Hadoop is central to addressing this need
• Hadoop is a case-study in our cloud vision
– Processes enormous amounts of data
– Provides horizontal scaling and fault-tolerance for our users
– Allows those users to focus on their app logic
Stack: [Workflow], High-level query layer (Pig), Map-Reduce, HDFS
26
The World Has Changed
• Web serving applications need:
– Scalability!
• Preferably elastic
– Flexible schemas
– Geographic distribution
– High availability
– Reliable storage
• Web serving applications can do without:
– Complicated queries
– Strong transactions
27
MObStor
• Yahoo!’s next-generation globally replicated, virtualized media object storage service
• Better provisioning, easy migration, replication, better BCP, and performance
• New features (Evergreen URLs, CDN integration, REST API, …)
• The object metadata problem addressed using Sherpa, though MObStor is focused on blob storage.
28
Storage & Delivery Stack
29
PNUTS / SHERPA
To Help You Scale Your Mountains of Data
30
CCDI—Research Collaboration
Yahoo! Research
• Raghu Ramakrishnan
• Brian Cooper
• Utkarsh Srivastava
• Adam Silberstein
• Rodrigo Fonseca
CCDI
• Chuck Neerdaels
• P.P.S. Narayan
• Kevin Athey
• Toby Negrin
• Plus Dev/QA teams
31
Yahoo! Serving Storage Problem
– Small records – 100KB or less
– Structured records – lots of fields, evolving
– Extreme data scale - Tens of TB
– Extreme request scale - Tens of thousands of requests/sec
– Low latency globally - 20+ datacenters worldwide
– High Availability - outages cost $millions
– Variable usage patterns - as applications and users change
33
What is PNUTS/Sherpa?
• Parallel database
• Geographic replication
• Structured, flexible schema
• Hosted, managed infrastructure
CREATE TABLE Parts (ID VARCHAR, StockNumber INT, Status VARCHAR, …)
A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E
35
What Will It Become?
A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E
Indexes and views
36
Design Goals
Scalability
• Thousands of machines
• Easy to add capacity
• Restrict query language to avoid costly queries
Geographic replication
• Asynchronous replication around the globe
• Low-latency local access
High availability and fault tolerance
• Automatically recover from failures
• Serve reads and writes despite failures
Consistency
• Per-record guarantees
• Timeline model
• Option to relax if needed
Multiple access paths
• Hash table, ordered table
• Primary, secondary access
Hosted service
• Applications plug and play
• Share operational cost
37
Technology Elements
Applications
PNUTS API, Tabular API
PNUTS
• Query planning and execution
• Index maintenance
Distributed infrastructure for tabular data
• Data partitioning
• Update consistency
• Replication
YDOT FS
• Ordered tables
YDHT FS
• Hash tables
Tribble
• Pub/sub messaging
Zookeeper
• Consistency service
YCA: Authorization
38
Data Manipulation
• Per-record operations
– Get
– Set
– Delete
• Multi-record operations
– Multiget
– Scan
– Getrange
• Web service (RESTful) API
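The operations listed above can be sketched against a single in-memory table. This is an illustration only: the `Table` class, its method signatures, and the half-open range convention of `getrange` are assumptions, not the actual PNUTS web service API.

```python
from bisect import bisect_left


class Table:
    """In-memory stand-in for a record table (hypothetical, single process)."""

    def __init__(self):
        self._rows = {}

    # Per-record operations
    def get(self, key):
        return self._rows.get(key)

    def set(self, key, record):
        self._rows[key] = record

    def delete(self, key):
        self._rows.pop(key, None)

    # Multi-record operations
    def multiget(self, keys):
        # Keys that do not exist are simply absent from the result.
        return {k: self._rows[k] for k in keys if k in self._rows}

    def scan(self):
        # Full scan, returned in key order.
        return [(k, self._rows[k]) for k in sorted(self._rows)]

    def getrange(self, low, high):
        # Records with low <= key < high, in key order (assumed convention).
        keys = sorted(self._rows)
        start, stop = bisect_left(keys, low), bisect_left(keys, high)
        return [(k, self._rows[k]) for k in keys[start:stop]]


t = Table()
for name, price in [("Apple", 1), ("Banana", 2), ("Grape", 12), ("Lime", 9)]:
    t.set(name, {"Price": price})
t.delete("Banana")
```

Sorting on every `getrange` call keeps the sketch short; an ordered table would keep keys sorted on disk instead.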
39
Tablets—Hash Table
Name, Description, Price
Grape, Grapes are good to eat, $12
Lime, Limes are green, $9
Apple, Apple is wisdom, $1
Strawberry, Strawberry shortcake, $900
Orange, Arrgh! Don’t get scurvy!, $2
Avocado, But at what price?, $3
Lemon, How much did you pay for this lemon?, $1
Tomato, Is this a vegetable?, $14
Banana, The perfect fruit, $2
Kiwi, New Zealand, $8
Tablet boundaries in hash space: 0x0000, 0x2AF3, 0x911F, 0xFFFF
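A hedged sketch of how a key might be mapped to a hash-organized tablet. The boundary values come from the slide; the choice of hash function (two bytes of SHA-256) and the function names are assumptions made for illustration.

```python
import hashlib
from bisect import bisect_right

# Lower bounds of the tablets in a 16-bit hash space (values from the slide).
# Tablet i covers hashes from BOUNDARIES[i] up to the next boundary; the last
# tablet runs to 0xFFFF.
BOUNDARIES = [0x0000, 0x2AF3, 0x911F]


def key_hash(key: str) -> int:
    # Assumption: any stable hash would do; here, the first two bytes of SHA-256.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big")


def tablet_for_hash(h: int) -> int:
    # Index of the last boundary that is <= h.
    return bisect_right(BOUNDARIES, h) - 1


def tablet_for(key: str) -> int:
    return tablet_for_hash(key_hash(key))
```

Because neighbouring names hash to unrelated values, records such as Apple and Avocado will usually land in different tablets, which spreads load but rules out efficient range scans.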
40
Tablets—Ordered Table
Name, Description, Price
Apple, Apple is wisdom, $1
Avocado, But at what price?, $3
Banana, The perfect fruit, $2
Grape, Grapes are good to eat, $12
Kiwi, New Zealand, $8
Lemon, How much did you pay for this lemon?, $1
Lime, Limes are green, $9
Orange, Arrgh! Don’t get scurvy!, $2
Strawberry, Strawberry shortcake, $900
Tomato, Is this a vegetable?, $14
Tablet boundaries in key order: A, H, Q, Z
41
Flexible Schema
Posted date Listing id Item Price
6/1/07 424252 Couch $570
6/1/07 763245 Bike $86
6/3/07 211242 Car $1123
6/5/07 421133 Lamp $15
Optional columns, populated for only some listings: Color (Red), Condition (Good, Fair)
42
Detailed Architecture
Local region: Clients, REST API, Routers, Tablet Controller, Storage units
Tribble
Remote regions
43
Tablet Splitting and Balancing
• Each storage unit has many tablets (horizontal partitions of the table)
• Tablets may grow over time
• Overfull tablets split
• Storage unit may become a hotspot
• Shed load by moving tablets to other servers
Legend: Storage unit, Tablet
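Splitting and load shedding can be sketched as two small functions. The threshold, the use of tablet count as the load measure, and the function names are all assumptions for illustration; the slides do not say how PNUTS measures load or picks split points.

```python
MAX_TABLET_RECORDS = 4   # illustrative threshold, not a PNUTS setting


def split_if_overfull(tablet):
    """Split a sorted list of keys in half once it exceeds the threshold."""
    if len(tablet) <= MAX_TABLET_RECORDS:
        return [tablet]
    mid = len(tablet) // 2
    return [tablet[:mid], tablet[mid:]]


def rebalance(units):
    """Move tablets from the most loaded storage unit to the least loaded one
    until their tablet counts differ by at most one.

    `units` maps a storage unit name to its list of tablets and is modified
    in place.
    """
    while True:
        busiest = max(units, key=lambda u: len(units[u]))
        idlest = min(units, key=lambda u: len(units[u]))
        if len(units[busiest]) - len(units[idlest]) <= 1:
            return units
        units[idlest].append(units[busiest].pop())
```

For example, `rebalance({"su1": [["a"], ["b"], ["c"], ["d"]], "su2": []})` leaves two tablets on each unit.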
44
QUERY PROCESSING
45
Accessing Data
Storage units (SU)
1. Get key k
2. Get key k
3. Record for key k
4. Record for key k
46
Bulk Read
Scatter/gather server
Storage units (SU)
1. {k1, k2, … kn}
2. Get k1, Get k2, Get k3
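A minimal sketch of the scatter/gather idea: group the requested keys by the storage unit that holds them, fetch each group, and merge the answers. The `locate` callback stands in for the router, and the dictionaries stand in for storage units; both are assumptions. The per-unit fetches run sequentially here, whereas a real server would presumably issue them in parallel.

```python
from collections import defaultdict


def scatter_gather(keys, locate, storage_units):
    """Group keys by storage unit, fetch each group, and merge the results."""
    by_unit = defaultdict(list)
    for k in keys:
        by_unit[locate(k)].append(k)          # scatter: one batch per unit
    result = {}
    for unit, unit_keys in by_unit.items():
        store = storage_units[unit]
        for k in unit_keys:
            if k in store:
                result[k] = store[k]          # gather: merge what was found
    return result


storage_units = {
    "su1": {"k1": "a"},
    "su2": {"k2": "b", "k3": "c"},
}


def locate(key):
    # Stand-in for the router's key-to-storage-unit lookup.
    return "su1" if key == "k1" else "su2"
```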
47
Range Queries in YDOT
• Clustered, ordered retrieval of records
Router interval map:
MIN to Cantaloupe: Storage unit 1
Cantaloupe to Lime: Storage unit 3
Lime to Strawberry: Storage unit 2
Strawberry to MAX: Storage unit 1
Tablets (in key order):
Apple, Avocado, Banana, Blueberry
Cantaloupe, Grape, Kiwi, Lemon
Lime, Mango, Orange
Strawberry, Tomato, Watermelon
Query Grapefruit…Pear? is split into Grapefruit…Lime? and Lime…Pear?
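The router's job in a range query can be sketched as splitting the requested key range at tablet boundaries. The boundary keys and storage unit assignments follow the slide; the function name and the half-open range convention are assumptions.

```python
from bisect import bisect_right

# Interval map: tablet i starts at BOUNDARIES[i] and lives on UNITS[i].
# The empty string plays the role of the minimum key.
BOUNDARIES = ["", "Cantaloupe", "Lime", "Strawberry"]
UNITS = ["Storage unit 1", "Storage unit 3", "Storage unit 2", "Storage unit 1"]


def plan_range(low, high):
    """Split the key range [low, high) into per-tablet sub-ranges, in key order."""
    plan = []
    i = bisect_right(BOUNDARIES, low) - 1     # tablet containing `low`
    while i < len(BOUNDARIES) and low < high:
        tablet_end = BOUNDARIES[i + 1] if i + 1 < len(BOUNDARIES) else None
        if tablet_end is None or high <= tablet_end:
            sub_high = high                   # range ends inside this tablet
        else:
            sub_high = tablet_end             # continue in the next tablet
        plan.append((UNITS[i], low, sub_high))
        low = sub_high
        i += 1
    return plan
```

With this map, `plan_range("Grapefruit", "Pear")` yields one sub-range on Storage unit 3 (Grapefruit to Lime) and one on Storage unit 2 (Lime to Pear), matching the split shown on the slide.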
48
Updates
Components: Routers, Message brokers, Storage units (SU)
1. Write key k
2. Write key k
3. Write key k
4. (no label)
5. SUCCESS
6. Write key k
7. Sequence # for key k
8. Sequence # for key k
49
ASYNCHRONOUS REPLICATION AND CONSISTENCY
50
Asynchronous Replication
51
Consistency Model
• Goal: Make it easier for applications to reason about updates and cope with asynchrony
• What happens to a record with primary key “Alice”?
Timeline for Generation 1 of the record: Record inserted (v. 1), then Updates (v. 2 through v. 8), then Delete
As the record is updated, copies may get out of sync.
52
Example: Social Alice
[Figure: copies of Alice's (User, Status) record in the West and East regions, shown at successive moments with Status values Busy, Free, ??? and ______. Record Timeline: Busy, Free, Free.]
53
Consistency Model
Timeline, Generation 1: v. 1 through v. 8, with stale versions and the current version marked
Read
In general, reads are served using a local copy
54
Consistency Model
Timeline, Generation 1: v. 1 through v. 8, with stale versions and the current version marked
Read up-to-date
But application can request and get current version
55
Consistency Model
Timeline, Generation 1: v. 1 through v. 8, with stale versions and the current version marked
Read ≥ v.6
Or variations such as “read forward”: while copies may lag the master record, every copy goes through the same sequence of changes
56
Consistency Model
Timeline, Generation 1: v. 1 through v. 8, with stale versions and the current version marked
Write
Achieved via per-record primary copy protocol
(To maximize availability, record masterships are automatically transferred if a site fails)
Can be selectively weakened to eventual consistency (local writes that are reconciled using version vectors)
57
Consistency Model
Timeline, Generation 1: v. 1 through v. 8, with stale versions and the current version marked
Write if = v.7
ERROR
Test-and-set writes facilitate per-record transactions
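The read and write variants on the preceding slides can be sketched with a toy model of one record: a master copy holding the full version history and a single replica that may lag. The class and method names are hypothetical and chosen to mirror the slide labels (Read, Read up-to-date, Read ≥ v, Write, Write if = v); this is not the PNUTS client API.

```python
class VersionedRecord:
    """Master copy plus one lagging replica of a single record (toy model)."""

    def __init__(self, value):
        self.history = [value]        # history[i] holds version i + 1
        self.replica_version = 1      # version the local replica has applied

    @property
    def current_version(self):
        return len(self.history)

    def write(self, value):
        # All writes go through the master copy and get the next version number.
        self.history.append(value)
        return self.current_version

    def test_and_set_write(self, required_version, value):
        # "Write if = v": apply only if the record is still at required_version.
        if self.current_version != required_version:
            raise ValueError("version mismatch")
        return self.write(value)

    def replicate(self):
        # The replica catches up by applying the same sequence of versions.
        self.replica_version = self.current_version

    def read_any(self):
        # "Read": served from the local copy, possibly stale.
        v = self.replica_version
        return v, self.history[v - 1]

    def read_at_least(self, min_version):
        # "Read >= v": the local copy is acceptable only if it is new enough.
        if self.replica_version >= min_version:
            return self.read_any()
        return self.read_latest()

    def read_latest(self):
        # "Read up-to-date": always consult the master copy.
        v = self.current_version
        return v, self.history[v - 1]
```

A caller that read version 2 and then issues `test_and_set_write(2, ...)` succeeds only if nobody else wrote in between, which is the per-record transaction pattern the slide refers to.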
58
Consistency Techniques
• Per-record mastering
– Each record is assigned a “master region”
• May differ between records
– Updates to the record forwarded to the master region
– Ensures consistent ordering of updates
• Tablet-level mastering
– Each tablet is assigned a “master region”
– Inserts and deletes of records forwarded to the master region
– Master region decides tablet splits
• These details are hidden from the application
– Except for the latency impact!
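A rough sketch of per-record mastering, assuming three regions named E, W and C. The `Region` class, the `write` function and the synchronous propagation loop are simplifications invented for illustration; in the system described here, propagation to other regions is asynchronous.

```python
class Region:
    def __init__(self, name):
        self.name = name
        self.store = {}      # key -> (sequence number, value)


def write(regions, master_of, origin, key, value):
    """Apply a write that arrives at region `origin`.

    The write is forwarded to the record's master region, which assigns the
    next sequence number; the update is then propagated to the other regions.
    Returns (sequence number, whether the write had to be forwarded).
    """
    master = regions[master_of[key]]
    seq = master.store.get(key, (0, None))[0] + 1
    master.store[key] = (seq, value)
    for region in regions.values():          # asynchronous in a real system
        if region is not master:
            region.store[key] = (seq, value)
    forwarded = origin != master.name
    return seq, forwarded


regions = {name: Region(name) for name in ("E", "W", "C")}
master_of = {"A": "E", "B": "W"}             # each record has its own master
```

A write to record A arriving in region E is applied locally, while a write to record B arriving in E is forwarded to W first, which is where the extra latency mentioned above comes from.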
59
Mastering
A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E
Each region holds a copy of the table; the last column is the record's master region (E, W or C).
Tablet master
60
Bulk Insert/Update/Replace
Client
Source Data
Bulk manager
1. Client feeds records to bulk manager
2. Bulk loader transfers records to SU’s in batches
• Bypass routers and message brokers
• Efficient import into storage unit
61
Bulk Load in YDOT
• YDOT bulk inserts can cause performance hotspots
• Solution: preallocate tablets
62
Index Maintenance
• How to have lots of interesting indexes and views, without killing performance?
• Solution: Asynchrony!
– Indexes/views updated asynchronously when base table updated
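Asynchronous index maintenance can be sketched with a queue of pending index updates that is drained after the base-table write has already returned. The class and method names are hypothetical, and the explicit `apply_pending` call stands in for whatever background mechanism a real system would use.

```python
from collections import deque


class TableWithAsyncIndex:
    def __init__(self, index_field):
        self.rows = {}                 # primary key -> record
        self.index = {}                # field value -> set of primary keys
        self.index_field = index_field
        self.pending = deque()         # queued index updates

    def set(self, key, record):
        old = self.rows.get(key)
        self.rows[key] = record        # the base-table write completes here
        self.pending.append((key, old, record))

    def apply_pending(self):
        # Runs later, for example from a background worker; until then the
        # index lags behind the base table.
        while self.pending:
            key, old, new = self.pending.popleft()
            if old is not None:
                self.index.get(old.get(self.index_field), set()).discard(key)
            self.index.setdefault(new.get(self.index_field), set()).add(key)

    def lookup(self, value):
        return sorted(self.index.get(value, set()))
```

The trade-off is visible in use: a lookup issued right after `set` may miss the new record, but writers never wait on index updates.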
63
SHERPA
IN CONTEXT
64
Types of Record Stores
• Query expressiveness
Spectrum: Simple to Feature rich
S3: Object retrieval
PNUTS: Retrieval from single table of objects/records
Oracle: SQL
65
Types of Record Stores
• Consistency model
Spectrum: Best effort to Strong guarantees
S3: Eventual consistency
PNUTS: Timeline consistency
Oracle: ACID
Regions of the spectrum: Object-centric consistency, Program centric consistency
66
Types of Record Stores
• Data model
Spectrum: Flexibility, Schema evolution to Optimized for Fixed schemas
CouchDB, PNUTS, Oracle
Regions of the spectrum: Object-centric consistency, Consistency spans objects
67
Types of Record Stores
• Elasticity (ability to add resources on demand)
Spectrum: Inelastic to Elastic
Oracle: Limited (via data distribution)
PNUTS, S3: VLSD (Very Large Scale Distribution/Replication)
68
Data Stores Comparison
• User-partitioned SQL stores
– Microsoft Azure SDS
– Amazon SimpleDB
Versus PNUTS:
• More expressive queries
• Users must control partitioning
• Limited elasticity
• Multi-tenant application databases
– Salesforce.com
– Oracle on Demand
Versus PNUTS:
• Highly optimized for complex workloads
• Limited flexibility to evolving applications
• Inherit limitations of underlying data management system
• Mutable object stores
– Amazon S3
Versus PNUTS:
• Object storage versus record management
69
Application Design Space
Records Files
Get a few things
Scan everything
Sherpa MObStor
Everest Hadoop
YMDB
MySQL
Filer
Oracle
BigTable
70
Alternatives Matrix
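[Matrix comparing systems (rows) against criteria (columns); the cell markings are not preserved.]
Systems: Sherpa, Y! UDB, MySQL, Oracle, HDFS, BigTable, Dynamo, Cassandra
Criteria: Elastic, Operability, Global low latency, Availability, Structured access, Updates, Consistency model, SQL/ACID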
71
Further Reading
Efficient Bulk Insertion into a Distributed Ordered Table (SIGMOD 2008)
Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, Raghu Ramakrishnan
PNUTS: Yahoo!'s Hosted Data Serving Platform (VLDB 2008)
Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni
Asynchronous View Maintenance for VLSD Databases (SIGMOD 2009, to appear)
Parag Agrawal, Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava and Raghu Ramakrishnan
Cloud Storage Design in a PNUTShell (Beautiful Data, O’Reilly Media, 2009, to appear)
Brian F. Cooper, Raghu Ramakrishnan, and Utkarsh Srivastava
72
QUESTIONS?