CS 744: DATACENTER AS A COMPUTER

Post on 13-Jun-2022


Shivaram Venkataraman Fall 2019

With slides from Mosharaf Chowdhury and Ion Stoica

ANNOUNCEMENTS

-  Assignments
   -  Assignment zero is due!
   -  Form groups for Assignment 1 on Piazza

-  Class format
   -  Lecture
   -  Review
   -  Discussion

Scalable Storage Systems

Datacenter Architecture

Resource Management

Computational Engines

Machine Learning, SQL, Streaming, Graph

Applications

OUTLINE

-  Hardware Trends
-  Datacenter design
-  WSC workloads
-  Discussion

Why is One Machine Not Enough?

What’s in a Machine?

Interconnected compute and storage

Newer Hardware
- GPUs, FPGAs
- RDMA, NVLink

[Figure: interconnects inside a machine, labeled Memory Bus, Ethernet, SATA, PCIe v4]

Scale Up: Make More Powerful Machines

Moore’s law
–  Stated in 1965 by Intel co-founder Gordon Moore
–  Number of transistors on a microchip doubles every 2 years
–  Today “closer to 2.5 years” (Intel CEO Brian Krzanich)
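To see how much the slower cadence matters, the doubling rule can be compounded over a fixed horizon. The sketch below is only illustrative arithmetic: the function name and the 10-year horizon are assumptions chosen for the example, not figures from the slides.

```python
def transistor_growth(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth in transistor count after `years` of steady doubling."""
    return 2.0 ** (years / doubling_period_years)


# Compare the classic 2-year cadence with the "closer to 2.5 years" cadence.
decade_fast = transistor_growth(10, 2.0)
decade_slow = transistor_growth(10, 2.5)

print(f"10 years, doubling every 2.0 years: {decade_fast:.0f}x")
print(f"10 years, doubling every 2.5 years: {decade_slow:.0f}x")
```

Over a decade the half-year slip halves the cumulative gain (32x versus 16x).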

Dennard Scaling is the Problem

Suggested that power requirements are proportional to the area of transistors
–  Both voltage and current are proportional to length
–  Stated in 1974 by Robert H. Dennard (DRAM inventor)

Broken since 2005

“Adapting to Thrive in a New Economy of Memory Abundance,” Bresniker et al.
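The proportionality claim follows in one step if power is taken as voltage times current. That relation is a standard assumption, not spelled out on the slide; here L is a transistor's linear dimension and A its area:

```latex
P = V \cdot I, \qquad V \propto L, \quad I \propto L
\;\Longrightarrow\; P \propto L^{2} \propto A
\;\Longrightarrow\; \frac{P}{A} \approx \text{constant}
```

While this held, shrinking transistors kept power per unit area roughly constant, so more of them could be packed into the same chip without raising its power budget.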

Dennard Scaling is the Problem

Performance per-core is stalled
Number of cores is increasing

“Adapting to Thrive in a New Economy of Memory Abundance,” Bresniker et al.

Memory TRENDS

MEMORY TAKEAWAY

Growing +15% per year
Data access from memory is getting more expensive!

HDD CAPACITY

HDD BANDWIDTH

Disk bandwidth is not growing

SSDs

Performance:
–  Reads: 25us latency
–  Write: 200us latency
–  Erase: 1.5 ms

Steady state, when SSD full
–  One erase every 64 or 128 writes (depending on page size)

Lifetime: 100,000-1 million writes per page
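A rough way to read the steady-state figure is to spread one erase evenly over the 64 or 128 page operations that share it. The sketch below uses only the latencies listed above; the even-amortization model and the function name are simplifying assumptions, and a real SSD controller does not behave this uniformly.

```python
ERASE_US = 1500.0  # erase latency from the slide (1.5 ms)
WRITE_US = 200.0   # write latency from the slide
READ_US = 25.0     # read latency from the slide


def amortized_erase_us(ops_per_erase: int) -> float:
    """Erase time spread evenly over the operations that share one erase."""
    return ERASE_US / ops_per_erase


for ops in (64, 128):
    extra = amortized_erase_us(ops)
    print(f"1 erase per {ops} ops: +{extra:.1f} us per op "
          f"({extra / READ_US:.2f}x a read, {extra / WRITE_US:.2f}x a write)")
```

Under this model the erase adds roughly 12 to 23 us per operation, which is comparable to a full read.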

SSD VS HDD COST

Amazon EC2 (2014)

Machine        Memory (GB)   Compute Units (ECU)   Local Storage (GB)   Cost / hour
t1.micro       0.615         1                     0                    $0.02
m1.xlarge      15            8                     1680                 $0.48
cc2.8xlarge    60.5          88 (Xeon 2670)        3360                 $2.40

1 ECU = CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor

Amazon EC2 (2018)

Machine        Memory (GB)   Compute Units (ECU)        Local Storage (GB)   Cost / hour
t2.nano        0.5           1                          0                    $0.0058
r5d.24xlarge   768           96                         4x900 NVMe           $6.912
x1.32xlarge    2 TB          4 * Xeon E7                3.4 TB (SSD)         $13.338
p3.16xlarge    488 GB        8 Nvidia Tesla V100 GPUs   0                    $24.48

Amazon EC2 (2019)

Machine         Memory (GB)   Compute Units (ECU)        Local Storage (GB)   Cost / hour
t2.nano         0.5           1                          0                    $0.0058
r5d.24xlarge    768           96                         4x900 NVMe           $6.912
x1e.32xlarge    4 TB          4 * Xeon E7                3.4 TB (SSD)         $26.68
p3dn.24xlarge   768 GB        8 Nvidia Tesla V100 GPUs   0                    $31.21
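One way to compare the tables across years is to normalize hourly price by memory size. The sketch below does this for rows whose memory and price figures are unambiguous in the tables. The metric and the function name are illustrative assumptions, and the hourly price also pays for CPU and storage, so this is only a rough comparison.

```python
def price_per_gb_hour(memory_gb: float, cost_per_hour: float) -> float:
    """Hourly instance price divided by its memory size in GB."""
    return cost_per_hour / memory_gb


# (year, machine, memory in GB, cost per hour in USD), copied from the tables.
rows = [
    (2014, "t1.micro", 0.615, 0.02),
    (2014, "m1.xlarge", 15.0, 0.48),
    (2014, "cc2.8xlarge", 60.5, 2.40),
    (2019, "t2.nano", 0.5, 0.0058),
    (2019, "r5d.24xlarge", 768.0, 6.912),
]

for year, machine, memory_gb, cost in rows:
    rate = price_per_gb_hour(memory_gb, cost)
    print(f"{year} {machine:<13} ${rate:.4f} per GB-hour")
```

On these rows the 2014 machines land near $0.03 to $0.04 per GB-hour, while the 2019 rows are around $0.01.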

Ethernet Bandwidth

[Figure: Ethernet bandwidth over time, with the years 1995, 1998, 2002, and 2017 marked]

Growing 33-40% per year!
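Annual growth rates are easier to compare as doubling times. The sketch below converts the rates quoted on these slides (+15% per year in the memory takeaway, 33-40% per year for Ethernet) using the standard compound-growth formula; the function name is an assumption made for the example.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


for label, rate in [("+15% per year", 0.15),
                    ("+33% per year", 0.33),
                    ("+40% per year", 0.40)]:
    print(f"{label}: doubles about every {doubling_time_years(rate):.1f} years")
```

At +15% a quantity takes about five years to double, while at 33-40% it doubles in a little over two.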

TRENDS SUMMARY

-  CPU speed per core is flat
-  Memory bandwidth growing slower than capacity
-  SSD, NVMe replacing HDDs
-  Ethernet bandwidth growing
-  Scale up vs Scale out? (Discussion)

DATACENTER ARCHITECTURE

[Figure: two servers, with interconnects labeled Memory Bus, Ethernet, SATA, PCIe]

Datacenter Networks

Traditional hierarchical topology
–  Expensive
–  Difficult to scale
–  High oversubscription
–  Smaller path diversity
–  …

Core

Agg.

Edge
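Oversubscription in the hierarchical topology can be made concrete as the ratio of host-facing bandwidth to uplink bandwidth at each layer. The sketch below is a minimal illustration: the port counts and link speeds are hypothetical numbers chosen for the example, not values from the slides.

```python
def oversubscription_ratio(downlinks: int, downlink_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Total downstream bandwidth divided by total upstream bandwidth at one switch."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


# Hypothetical edge switch: 40 servers at 10 Gbps, 4 uplinks at 40 Gbps.
edge = oversubscription_ratio(40, 10.0, 4, 40.0)

# Hypothetical aggregation switch: 16 edge-facing 40 Gbps ports, 4 core uplinks at 40 Gbps.
agg = oversubscription_ratio(16, 40.0, 4, 40.0)

# Ratios multiply along the path from a server up to the core.
print(f"edge {edge:.1f}:1, aggregation {agg:.1f}:1, end to end {edge * agg:.1f}:1")
```

With these made-up numbers, servers in different parts of the tree can count on only one tenth of their link speed when everyone sends through the core at once.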

STORAGE HIERARCHY (v2)

Scale Out: Warehouse-Scale Computers

Single organization
Homogeneity (to some extent)
Cost efficiency at scale
–  Multiplexing across applications and services
–  Rent it out!

Many concerns
–  Infrastructure
–  Networking
–  Storage
–  Software
–  Power/Energy
–  Failure/Recovery
–  …

MAIN COMPONENTS OF WSC

SOFTWARE IMPLICATIONS

Workload Diversity

Reliability

Single organization

Storage Hierarchy

Three Categories of Software

1.  Platform-level
    –  Software and firmware present on every machine
2.  Cluster-level
    –  Distributed systems to enable everything
3.  Application-level
    –  User-facing applications built on top

BigData

WORKLOAD: Partition-Aggregate

Top-level Aggregator

Mid-level Aggregators

Workers
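The three tiers above can be sketched as a small top-k search: workers score their own shard, mid-level aggregators merge a few workers each, and the top-level aggregator merges the mid-level results. Everything in this sketch is an illustrative assumption (the substring-count scoring, the fan-out of 2, the function names, and the sample shards), not the design of any particular search system.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def worker(shard, query, k):
    """Score every document in one shard; return the local top-k (score, doc) pairs."""
    scored = [(doc.count(query), doc) for doc in shard]
    return heapq.nlargest(k, scored)


def aggregate(partials, k):
    """Merge several top-k lists into a single top-k list."""
    return heapq.nlargest(k, (item for part in partials for item in part))


def partition_aggregate(shards, query, k=2, fanout=2):
    # Workers: each handles one partition of the data, in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda shard: worker(shard, query, k), shards))
    # Mid-level aggregators: each merges the results of `fanout` workers.
    mids = [aggregate(partials[i:i + fanout], k)
            for i in range(0, len(partials), fanout)]
    # Top-level aggregator: merges the mid-level results into the final answer.
    return aggregate(mids, k)


shards = [["a cat", "cat cat"], ["dog"], ["cat dog cat cat"], ["bird", "cat"]]
result = partition_aggregate(shards, "cat")
print(result)
```

Because each tier forwards only its local top-k, the data sent upward stays small no matter how large the shards are.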

WORKLOAD: Map-Reduce

Map Stage, Reduce Stage
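A single-process word count shows the shape of the two stages. This is a minimal sketch, not the API of any MapReduce framework: the helper names, the explicit shuffle step, and the sample input are assumptions made for illustration.

```python
from collections import defaultdict


def map_stage(records, map_fn):
    """Apply map_fn to every record and collect the emitted (key, value) pairs."""
    return [pair for record in records for pair in map_fn(record)]


def shuffle(pairs):
    """Group values by key, as happens between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_stage(groups, reduce_fn):
    """Apply reduce_fn once per key to that key's list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_count_map(line):
    return [(word, 1) for word in line.split()]


def word_count_reduce(word, counts):
    return sum(counts)


lines = ["the quick fox", "the lazy dog", "the fox"]
counts = reduce_stage(shuffle(map_stage(lines, word_count_map)), word_count_reduce)
print(counts)
```

In a cluster setting the map calls run on many machines and the shuffle moves data over the network, which is where the bandwidth trends discussed earlier come into play.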

VIDEO ENCODING

MACHINE LEARNING

WORKLOAD PATTERNS

DATACENTER VS DESKTOP

-  Parallelism available
-  Workload churn
-  Platform homogeneity
-  Fault-free operation

DISCUSSION

-  Form groups of 4 students
-  Pick up a discussion form per group
-  Fill out responses at https://forms.gle/hhuKktMb5pKkotwc9

Discussion

Scale-up vs Scale-out

DISCUSSION

Differences between web-search and MapReduce

DISCUSSION

Microsoft Word vs. online document editor like Google Docs

DISCUSSION

NEXT STEPS

-  9/12 class on Storage Systems
-  Assignment 1 out Thursday