Cloud Analytics Data Warehousing - Marco Serafini · 2020. 9. 4. · •Analytical (OLAP)...

Cloud Analytics Data Warehousing Marco Serafini COMPSCI 532 Lecture 18


Cloud Analytics Data Warehousing

Marco Serafini

COMPSCI 532 Lecture 18


Trivia
• How does Amazon make money?
  • Selling books?
  • Entertainment?


Migrating to the Cloud

• ELASTICITY
  • Pay-as-you-go
  • Unlimited scale
• COST
  • HW procurement at scale
  • Cluster management at scale


Cloud Computing
• Shared resources
  • Multiple tenants sharing resources (with isolation)
  • Economy of scale
• Elastic provisioning
  • Can easily add and remove resources on the fly
  • Pay as you go only when used
• Different flavors
  • IaaS, PaaS, SaaS
  • Public, private cloud


Cloud Offerings
• Computing nodes
  • Example: AWS EC2
  • Full nodes with local storage and pre-installed OS
  • Very large number of instance types: compute optimized, memory optimized, storage optimized, with GPUs, burstable…
• Storage services
  • Example: AWS S3
  • Key-value stores (put/get), file systems
• Higher-level services
  • Example: DBMS


Storage Disaggregation
• Computing nodes (e.g. EC2)
  • Feature-rich machines
• Storage services (e.g. S3)
  • On cheaper, storage-heavy machines
  • Limited read/write interface
• Advantages for cloud provider
  • Provision storage and computation independently
• Advantages for users
  • Storage services cheaper
  • Network bandwidth ~ I/O bandwidth


Cloud Storage Types

STORAGE            PERFORMANCE  ACCESS        APPENDS  AVAILABILITY  PRICE
OBJECT (S3)        --           Shared        X        ✓             Low
FILE SYSTEM (EFS)  -            Shared        ✓        ✓             High
BLOCK (EBS)        +            Instance (*)  ✓        X             Mid
INSTANCE-LOCAL     ++           Instance      ✓        X             High (**)

(*) Can be detached from an instance and reattached to another
(**) Storage-heavy instances are expensive


From Shared-Nothing Architecture…

[Figure: four COMPUTE nodes, each attached to its own LS (local storage)]

Principle: move computation to data


…To Hybrid Architectures

[Figure: four COMPUTE nodes, each with its own LS (local storage), all connected to a shared STORAGE SERVICE]

• Compute nodes: arbitrary computation
• Storage service: read/write only
• Cannot move computation to data!


Scheduling Low-Priority Tasks
• Helps increase hardware utilization
• Spot instances
  • Allocated in real time based on live bidding
  • Can be revoked at any time (with notice)
• Serverless computing
  • Example: AWS Lambda
• Each of these services comes with its own pricing


Goals: Push-Button Analytics
• Easily parallelize single-threaded code
• Eliminate cluster management overhead
  • Deployment of nodes
  • Installation
  • Configuration
• Even cloud offerings have their complexities
  • Many instance types
  • Many services
• Solution: Serverless functions


Goal: Push-Button Analytics
• Use "serverless" components
  • No need to select a specific cluster size
  • System auto-scales up and down on demand
• Building blocks
  • Serverless functions (AWS Lambdas)
  • Cloud storage services (AWS S3)
• This paper implements MapReduce in this setting


Serverless Functions
• Single-threaded code
• Invoked through HTTP requests
• Cloud platform takes care of
  • Deployment
  • Load balancing
  • Performance isolation
• No need to
  • Deploy servers
  • Configure clusters


Challenges with Lambdas
• No local storage, need to use remote cloud storage
  • For example S3
• No function-to-function communication
  • Again need remote storage to share remote memory
• Short maximum running time


Remote vs. Local Storage


State and Fault Tolerance
• State is lost after execution
• Inputs and outputs need to be persisted
• Fault tolerance
  • Re-execute function
  • Require atomic writes to check what has succeeded
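
The re-execution scheme on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: `ObjectStore`, `put_if_absent` and `run_task` are hypothetical names, and an in-memory dictionary stands in for a remote storage service such as S3.

```python
import threading


class ObjectStore:
    """In-memory stand-in for a remote store (hypothetical, not the S3 API)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, value):
        """Atomically write value unless the key already exists."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def get(self, key):
        return self._data.get(key)


def run_task(store, task_id, func, arg):
    """Run a stateless task; skip the work if its output is already persisted."""
    out_key = f"output/{task_id}"
    if store.get(out_key) is not None:
        return store.get(out_key)         # an earlier attempt already succeeded
    result = func(arg)                    # local state is lost after this call
    store.put_if_absent(out_key, result)  # atomic: no partial output is visible
    return store.get(out_key)


store = ObjectStore()
first = run_task(store, "t1", lambda x: x * x, 7)
retry = run_task(store, "t1", lambda x: x * x, 7)  # re-execution after a suspected failure
```

Because the output key appears only once the write has completed, checking for that key is enough to decide which tasks still need to be re-executed.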


Registering Functions
• Registering a new Lambda function is slow
• Solution
  • Register a single generic Lambda function
  • Serialize the code that needs to be executed
  • Store the code (and the input data) on S3
  • Generic Lambda function loads code and executes it
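
A minimal sketch of the generic-function idea, under several assumptions: a dictionary named `storage` stands in for S3, `submit` and `generic_handler` are hypothetical names, and the standard-library `marshal` module is used as the serializer. `marshal` only handles simple functions with no closures or global references; a real system would likely need a more complete serializer.

```python
import builtins
import marshal
import types

storage = {}  # in-memory stand-in for S3 (assumption for this sketch)


def submit(key, func, arg):
    """Client side: serialize the function's code and its input, then store both."""
    storage[key + "/code"] = marshal.dumps(func.__code__)
    storage[key + "/input"] = marshal.dumps(arg)


def generic_handler(key):
    """The single registered function: load the code, run it, persist the output."""
    code = marshal.loads(storage[key + "/code"])
    arg = marshal.loads(storage[key + "/input"])
    func = types.FunctionType(code, {"__builtins__": builtins})
    storage[key + "/output"] = marshal.dumps(func(arg))


def word_count(text):
    return len(text.split())


submit("job-1", word_count, "move computation to data")
generic_handler("job-1")
result = marshal.loads(storage["job-1/output"])
```

Only `generic_handler` would need to be registered with the platform; new user code reaches it as data, so the slow registration step is paid once.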


Remote Storage Scalability


Semantics
• Map is easy
  • Execute one function per element of the list
• Map + single Reducer
  • E.g. parallel featurization + single-server ML
• MapReduce
  • Many Lambdas needed, many small intermediate files
  • Use Redis, an in-memory key-value store
• Parameter server
  • Use Redis
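
The MapReduce case can be sketched as below. This is an illustration, not the paper's implementation: a dictionary `kv_store` stands in for Redis, threads stand in for concurrent Lambda invocations, and `map_task`, `reduce_task` and `partition` are hypothetical names. Each map task writes one small intermediate entry per reducer, which is the "many small intermediate files" pattern.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

kv_store = {}      # stand-in for an in-memory key-value store such as Redis
NUM_REDUCERS = 2


def partition(word):
    # Deterministic partitioning (the built-in hash() is randomized per process)
    return sum(word.encode()) % NUM_REDUCERS


def map_task(task_id, line):
    """Count words in one input split; write one intermediate entry per reducer."""
    buckets = [Counter() for _ in range(NUM_REDUCERS)]
    for word in line.split():
        buckets[partition(word)][word] += 1
    for r, counts in enumerate(buckets):
        kv_store[f"inter/{task_id}/{r}"] = dict(counts)


def reduce_task(r, num_maps):
    """Merge the intermediate entries of partition r and persist the result."""
    total = Counter()
    for task_id in range(num_maps):
        total.update(kv_store[f"inter/{task_id}/{r}"])
    kv_store[f"out/{r}"] = dict(total)


lines = ["a b a", "b c", "a c c"]
with ThreadPoolExecutor() as pool:
    # The map phase finishes before any reducer starts
    list(pool.map(map_task, range(len(lines)), lines))
    list(pool.map(reduce_task, range(NUM_REDUCERS), [len(lines)] * NUM_REDUCERS))

word_counts = {}
for r in range(NUM_REDUCERS):
    word_counts.update(kv_store[f"out/{r}"])
```

With M map tasks and R reducers this produces M × R intermediate entries, which is why a low-latency store is a better fit for the shuffle than an object store.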


The Cost of Scaling Up
• Using more nodes does not always imply higher cost
• Lower latency → lower cost per node
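
A small worked example of why this can hold under pay-as-you-go pricing. The model is an assumption made for illustration: runtime follows an Amdahl-style formula, and the price and workload numbers are placeholders, not real cloud rates.

```python
def job_cost(num_nodes, single_node_hours, price_per_node_hour, parallel_fraction=1.0):
    """Return (total cost, runtime in hours) for a job on num_nodes nodes."""
    serial = 1.0 - parallel_fraction
    runtime = single_node_hours * (serial + parallel_fraction / num_nodes)
    return num_nodes * runtime * price_per_node_hour, runtime


# Perfectly parallel job: 4x the nodes, 1/4 the runtime, same total cost
cost_1, hours_1 = job_cost(1, 8, 2.0)
cost_4, hours_4 = job_cost(4, 8, 2.0)

# Half of the work is serial: extra nodes now raise the total cost
cost_4_half, hours_4_half = job_cost(4, 8, 2.0, parallel_fraction=0.5)
```

In the perfectly parallel case each node is billed for a quarter of the time, so the job finishes sooner at the same total cost; the second case shows when adding nodes does become more expensive.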


Data Warehousing Architectures


Data Warehousing
• Analytical (OLAP) relational queries
• Different architectures
  • Snowflake: shared-disk + caching at compute nodes
  • Redshift: shared-nothing, store all data at compute nodes
  • Redshift Spectrum: serverless workers executing on-demand and reading from S3
• Let’s discuss these architectures and compare them


Snowflake
• Shared-disk architecture
  • Data is stored on S3, all nodes can access it
  • But nodes keep a distributed cache
• Challenges
  • Heterogeneous workloads
    • No one-size-fits-all hardware configuration
  • Membership changes
    • Large data shuffles when a node fails/is removed
  • Online upgrade
    • It is similar to changing all the nodes in the system


Snowflake Architecture
• Data Storage
  • Based on S3: high throughput, high latency
  • Used also for intermediate data
• Virtual Warehouses
  • Responsible for query execution
  • Stateless (restarted in their entirety)
  • Shared cache (low latency on hot data, most data cold)
• Cloud Services
  • Query parsing, access control, optimization
  • Snapshot isolation with multi-versioning
  • Metadata on external key-value store
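
To illustrate how a compute-side cache can sit in front of shared storage, here is a minimal sketch. The LRU eviction policy, the `CachingNode` class and the dictionary standing in for S3 are assumptions made for this example, not details taken from the slides.

```python
from collections import OrderedDict


class CachingNode:
    """Compute node with a small local cache in front of shared remote storage."""

    def __init__(self, remote, capacity):
        self.remote = remote        # dict standing in for S3 (assumption)
        self.capacity = capacity    # number of objects that fit on local disk
        self.cache = OrderedDict()
        self.remote_reads = 0

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as most recently used
            return self.cache[key]
        self.remote_reads += 1              # cache miss: high-latency remote read
        value = self.remote[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used object
        return value


remote = {f"part-{i}": f"data-{i}" for i in range(100)}
node = CachingNode(remote, capacity=2)
for key in ["part-1", "part-2", "part-1", "part-1", "part-3", "part-1"]:
    node.read(key)
```

The cache holds no authoritative state: a replacement node starts empty and still returns correct data, only more slowly until the hot data is cached again.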


Snowflake Advantages
• Storage on S3 is cheaper
• Use expensive local disk only for hot data
• All services (except storage) are stateless
  • Simpler fault tolerance and membership change


Redshift
• Classical shared-nothing architecture
  • Initially based on PostgreSQL but heavily re-optimized for OLAP
  • Runs on EC2, explicit provisioning
  • All data pre-loaded on instance storage
  • Query compilation
• S3 for backup only


Redshift Spectrum
• Serverless query executor
  • Number of workers dynamically assigned
  • Stateless
• Reads data directly from S3
  • Scale out to leverage storage and computation bandwidth


Comparison Setup
• Benchmark: TPC-H
  • 1 TB uncompressed data
  • 1 execution of the query suite
• Configuration
  • Default: 4 nodes, memory optimized (r4.8xlarge)
  • Redshift: analogous node that offers SSD storage (dc2)
  • Athena: opaque


Comparison: Initialization Time
• Paid every time we shut down and restart the system
• Load metadata and (optionally) data


Comparison: Runtime
• Pre-loading pays off
  • Initialization delay is easily amortized
• Caching less helpful
• Cost
  • Athena: pay data scan only
  • Other systems: mainly running time
  • Spectrum: scan + running time


Comparison: Execution Cost
• RS can amortize loading costs
• Athena
  • Serverless
  • Pay per amount of data scanned
• RS Spectrum
  • Similar scheme as Athena
  • But must add RS cluster cost
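
The three pricing schemes can be written down as simple cost functions. The function names and all prices below are placeholders chosen for illustration; they are not real AWS rates and not figures from the comparison.

```python
def athena_cost(tb_scanned, price_per_tb):
    """Serverless: pay only for the amount of data scanned."""
    return tb_scanned * price_per_tb


def cluster_cost(hours, num_nodes, price_per_node_hour):
    """Provisioned cluster: pay mainly for running time."""
    return hours * num_nodes * price_per_node_hour


def spectrum_cost(tb_scanned, price_per_tb, hours, num_nodes, price_per_node_hour):
    """Spectrum: scan charge plus the cost of the Redshift cluster."""
    return (athena_cost(tb_scanned, price_per_tb)
            + cluster_cost(hours, num_nodes, price_per_node_hour))


# Placeholder inputs, NOT real prices or measured runtimes
example = spectrum_cost(tb_scanned=1.0, price_per_tb=5.0,
                        hours=0.5, num_nodes=4, price_per_node_hour=2.0)
```

Under these made-up inputs the scan charge is 5.0 and the cluster charge is 4.0, so the Spectrum-style total is 9.0.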


Storage Cost Per Day

EBS very expensive
Instance storage + S3 backup cheaper


Pushing Down Computation?
• One should always move computation to data
• But disaggregated storage cannot compute!

[Figure: four COMPUTE nodes, each with its own LS (local storage), all connected to a shared STORAGE SERVICE]

• Compute nodes: arbitrary computation
• Storage service: read/write only


S3 Select
• Computation on the storage layer
  • Simple selection and projection queries on structured data (e.g. CSV or Parquet)
  • Simple aggregations (e.g. sum)
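
The effect of pushing selection and projection into the storage layer can be simulated locally. This sketch does not call the real S3 Select API: `plain_get` and `select` are hypothetical stand-ins, the CSV data is made up, and a Python predicate takes the place of the SQL expression the real service accepts.

```python
import csv
import io

# A CSV object as it might sit in a storage service (made-up data)
OBJECT = "id,region,amount\n1,EU,10\n2,US,25\n3,EU,40\n"


def plain_get():
    """Read-only interface: the whole object crosses the network."""
    return OBJECT


def select(columns, predicate):
    """Storage-side selection and projection; returns only matching fields as CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.DictReader(io.StringIO(OBJECT)):
        if predicate(row):
            writer.writerow([row[c] for c in columns])
    return out.getvalue()


# Without pushdown: fetch everything, then filter at the compute node
full = plain_get()
local_total = sum(int(r["amount"])
                  for r in csv.DictReader(io.StringIO(full))
                  if r["region"] == "EU")

# With pushdown: only the 'amount' values of EU rows are transferred
pushed = select(["amount"], lambda r: r["region"] == "EU")
pushed_total = sum(int(line) for line in pushed.splitlines())
```

Both paths compute the same total, but the pushed-down result is a fraction of the object's size, which is where the bandwidth saving comes from.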


PushdownDB
• Stateless query execution with S3 Select
• Example: Bloom join
  • Standard hash join, but push down a Bloom filter to filter out rows that will not join
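
A Bloom join can be sketched as follows. This is a simplified illustration, not PushdownDB's implementation: `BloomFilter`, `storage_scan` and `bloom_join` are hypothetical names, the tables are made-up in-memory lists, and `storage_scan` merely simulates a scan that the storage layer would execute.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def storage_scan(table, key_col, bloom):
    """Stand-in for a pushed-down scan: drop rows whose key cannot join."""
    return [row for row in table if bloom.might_contain(row[key_col])]


def bloom_join(small, large, key_col):
    # Build phase: hash table and Bloom filter over the smaller table's keys
    hash_table, bloom = {}, BloomFilter()
    for row in small:
        hash_table.setdefault(row[key_col], []).append(row)
        bloom.add(row[key_col])
    # Probe phase: only rows that pass the filter are transferred and probed
    candidates = storage_scan(large, key_col, bloom)
    joined = [{**s, **l} for l in candidates for s in hash_table.get(l[key_col], [])]
    return joined, len(candidates)


customers = [{"cust": 1, "name": "Ada"}, {"cust": 2, "name": "Bob"}]
orders = [{"cust": c % 50, "order": c} for c in range(200)]
rows, transferred = bloom_join(customers, orders, "cust")
```

A Bloom filter can return false positives but never false negatives, so the hash-table probe still decides the final result; the filter only reduces how many rows leave the storage layer.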


TPC-H Results
• Great speedups with S3 Select