© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
DAT306 - How Amazon.com, with One of the World’s Largest Data Warehouses, Is Leveraging Amazon Redshift
Erik Selberg (selberg@amazon.com) and Abhishek Agrawal (abhagrwa@amazon.com)
November 14, 2013
Agenda
• Amazon Data Warehouse Overview
• Amazon Data Warehouse and Amazon Redshift Integration Project
• Amazon Redshift Best Practices
• Conclusion
Amazon Data Warehouse
• Authoritative repository of data for all of Amazon
• Petabytes of data
• Existing EDW is Oracle RAC; also using Amazon Elastic MapReduce and now Amazon Redshift
• Owns managing the hardware and software infrastructure
  – Apart from the Oracle DB, just Amazon IP
• Not part of AWS
Introducing the Elephant…
• Mission: Provide customers the best value
  – Leverage AWS only if it provides the best value
  – We aren’t moving 100% to Amazon Redshift
• Publish best practices
  – If AWS isn’t the best, we’ll say so
• There is a conflict of interest
Amazon Data Warehouse Architecture
[Architecture diagram: a Control Plane (ETL Manager) orchestrating data flows across the existing EDW, Amazon EMR, and Amazon Redshift]
Amazon Data Warehouse – Growth Story
• Petabytes of data
• Growth of data volume
  – YoY storage requirements have grown 67%
• Growth of processing volume
  – YoY processing demand has grown 47%
Amazon Data Warehouse – Cost per Job
• Our main efficiency metric – Cost per Job (CPJ)
    CPJ = ($CapEx + $DataCenter + $VendorSupport) / PeakJobsPerDay
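Purely for illustration (these numbers are invented, not Amazon’s): with $CapEx = $8M, $DataCenter = $1M, $VendorSupport = $1M, and 5,000 peak jobs per day,

    CPJ = ($8M + $1M + $1M) / 5,000 = $2,000 per job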
What Drives Cost per Job…
Up?
• Number of disks
  – Data gets bigger!
• Number of servers
• Short-sighted negotiations
  – 4th year support…
• Data Center costs (power, rent)

Down?
• Bidding
  – 2+ vendors
• Moore’s Law
  – Vendors fight this!
• Data design
• Software (e.g., DBMS)
Current State and Problems
• Existing EDW
  – Multiple multi-petabyte clusters (redundancy and jobs)
  – Why not <x>? CPJ not lower
• Data stored in SANs (not Exadata)
• Performs poorly on scans of 10T+
• Long procurement cycles (3-month minimum)
Amazon Data Warehouse and Amazon Redshift
Integration Project
• Spent 2013 evaluating Amazon Redshift for the Amazon data warehouse
  – Where does Amazon Redshift provide a better CPJ?
  – Can Amazon Redshift solve some pain (without introducing new pain)?
• Picked 10K jobs and 275 tables to copy
Current State of Affairs
• Biggest cluster size: 20+1 8XL
• Peak daily jobs: 7211 (using all 4 clusters)
• 4159 extracts
• 3052 loads
Some Results
• Benchmarking for 4159 jobs
  – Outperforming: 2719
  – Underperforming: 1440
  – Avg. runtime: 4:43 min in Amazon Redshift vs. 17:38 min in the existing EDW
• LOADs are slower
• EXTRACTs are faster
Job Type   RS Performance Category   Job Count by Category
EXTRACT    10X Faster                 945
EXTRACT    5X Faster                  487
EXTRACT    3X Faster                  393
EXTRACT    2X Faster                  301
EXTRACT    1X or same                 480
EXTRACT    2X Slower                 1150
LOAD       10X Faster                   7
LOAD       5X Faster                   15
LOAD       3X Faster                   23
LOAD       2X Faster                   23
LOAD       1X or same                  45
LOAD       2X Slower                  290
Amazon Redshift Integration Best Practices
• Integrating via Amazon S3 (Manifests)
• Primary key enforcement
• Idempotent loads
  – MERGE via INSERT/UPDATE
  – Mimic Trunc-Load [Backfills]
• Trunc-partition using sort keys
• Administration automation
• Ensuring data correctness
Integrating via Amazon S3
• S3 in the US Standard Region is eventually consistent!
• An S3 LIST might not give the entire list of data right after you save it (this WILL eventually happen to you!)
• Amazon Redshift loads everything it sees in a bucket
  – You may see all the data files while Amazon Redshift may not, which can cause missing data
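A minimal sketch of the manifest-based fix: instead of pointing COPY at a bucket prefix, point it at a manifest that pins the exact file set. The bucket, table, and file names here are hypothetical; COPY, CREDENTIALS, GZIP, and MANIFEST are real Amazon Redshift syntax.

    -- Hypothetical manifest file, s3://my-bucket/orders/batch.manifest:
    -- {
    --   "entries": [
    --     {"url": "s3://my-bucket/orders/part-0000.gz", "mandatory": true},
    --     {"url": "s3://my-bucket/orders/part-0001.gz", "mandatory": true}
    --   ]
    -- }

    -- COPY reads only the files named in the manifest; "mandatory": true
    -- makes the load fail loudly if a listed file is missing.
    COPY orders_stage
    FROM 's3://my-bucket/orders/batch.manifest'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    GZIP
    MANIFEST;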
Best Practices – Using Amazon S3
• Read/COPY
  – System table validation – STL_LOAD_ERRORS
  – Verify the files loaded are the ‘intended’ files
• Write/UNLOAD
  – System table validation – STL_UNLOAD_LOG
  – Verify all files that hold the data are on S3
• Manifests
  – Metadata to know exactly what to read from S3
  – Provide an authoritative reference to the data
  – Powerful in terms of user metadata format, encryption, etc.
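A sketch of those system-table checks. STL_LOAD_ERRORS, STL_LOAD_COMMITS, and STL_UNLOAD_LOG are real Amazon Redshift system tables, as are PG_LAST_COPY_ID() and PG_LAST_QUERY_ID(); the exact filters below are illustrative.

    -- After a COPY: any load errors?
    SELECT query, filename, line_number, err_reason
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 20;

    -- After a COPY: which files were actually committed?
    SELECT query, filename
    FROM stl_load_commits
    WHERE query = pg_last_copy_id();

    -- After an UNLOAD: which files were written, and how many rows each?
    SELECT query, path, line_count
    FROM stl_unload_log
    WHERE query = pg_last_query_id()
    ORDER BY path;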
Primary Key Enforcement
• Amazon Redshift does not enforce primary keys
  – You will need to do this yourself to ensure data quality
• Best practice (sketched below)
  – Introduce a temp table to check for duplicates in the incoming data
  – Validate against incoming data to catch offenders
  – Put the data in the target table and validate the target data in the same transaction, before commit
• Yes, this IS a lot of overhead
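A minimal sketch of the duplicate check, assuming hypothetical tables orders_stage (the incoming batch) and orders (the target), both keyed on order_id:

    BEGIN;

    -- 1. Catch duplicate keys within the incoming batch.
    SELECT order_id, COUNT(*)
    FROM orders_stage
    GROUP BY order_id
    HAVING COUNT(*) > 1;
    -- (if this returns rows, ROLLBACK and quarantine the batch)

    -- 2. Load, then re-validate the target before committing.
    INSERT INTO orders
    SELECT * FROM orders_stage;

    SELECT order_id, COUNT(*)
    FROM orders
    GROUP BY order_id
    HAVING COUNT(*) > 1;
    -- (rows here mean the load introduced duplicates: ROLLBACK)

    COMMIT;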
Idempotent Loads
• Idempotent loads – doing a load 2+ times has the same effect as doing it once
  – Needed to manage load failures
• MERGE – leverages the primary key, a row at a time
• TRUNC / INSERT – loads a partition at a time
MERGE
• No native Amazon Redshift MERGE support
• Merge is implemented as a multi-step process (sketched below)
  – Load the data into a temp table
  – Figure out the inserts and load them
  – Figure out the updates and modify the target table
  – Validate for duplicates
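A minimal sketch of the multi-step merge, reusing the hypothetical orders/orders_stage tables from above. Since Amazon Redshift had no native MERGE at the time, UPDATE-then-INSERT stands in for it; the column names are invented for illustration.

    BEGIN;

    -- 1. Stage the incoming batch.
    CREATE TEMP TABLE orders_stage (LIKE orders);
    COPY orders_stage
    FROM 's3://my-bucket/orders/batch.manifest'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    MANIFEST;

    -- 2. Updates: rows whose key already exists in the target.
    UPDATE orders
    SET    order_total = s.order_total,
           updated_at  = s.updated_at
    FROM   orders_stage s
    WHERE  orders.order_id = s.order_id;

    -- 3. Inserts: rows whose key is new.
    INSERT INTO orders
    SELECT s.*
    FROM   orders_stage s
    LEFT JOIN orders t ON s.order_id = t.order_id
    WHERE  t.order_id IS NULL;

    -- 4. Validate for duplicates (as on the primary key slide),
    --    then commit.
    COMMIT;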
TRUNC - INSERT
• Solution
  – Distribute randomly
  – Use sort keys to align data (mimics partitioning)
  – Selectively delete and insert (sketched below)
• Issues
  – Inserts land in an “unsorted” bucket – performance degrades without periodic VACUUM
  – Very slow (effectively a row at a time)
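A minimal sketch of mimicking a partition truncate-and-load, assuming a hypothetical fact table page_views sorted on activity_date, with staged data in page_views_stage:

    BEGIN;

    -- "Truncate" one logical partition: the sort key keeps this
    -- delete aligned to contiguous blocks.
    DELETE FROM page_views
    WHERE activity_date = '2013-11-14';

    -- Reload that partition from the staged data.
    INSERT INTO page_views
    SELECT *
    FROM page_views_stage
    WHERE activity_date = '2013-11-14';

    COMMIT;

    -- The inserted rows sit in the unsorted region until a VACUUM
    -- re-sorts them and reclaims the deleted rows' space.
    VACUUM page_views;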
Automating Administration
• Stored procs / Oracle workflow used to do admin tasks like retention, stats, etc.
• Solution
  – We introduced a software layer that prepares the administrative task statements based on defined inputs (examples below)
  – Executes them using a JDBC connection
  – Can schedule work like stats collection, vacuum, etc.
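The kinds of statements such a layer might emit, with a hypothetical table and retention window; DELETE, ANALYZE, and VACUUM are real Amazon Redshift commands.

    -- Retention: drop rows older than the configured window.
    DELETE FROM clickstream
    WHERE event_date < DATEADD(day, -90, CURRENT_DATE);

    -- Stats collection: refresh optimizer statistics.
    ANALYZE clickstream;

    -- Reclaim deleted space and re-sort the table.
    VACUUM clickstream;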
2013 Results
• CPJ is 55% less on Amazon Redshift in general
  – We can’t share the math, sorry; YMMV
  – Between Redshift and the Amazon data warehouse, known improvements get us to ~66%
  – Big wins are in big queries
  – Loads are slow and expensive
• Moved ~10K jobs to ~60 8XLs (4 clusters)
• We could move at most 45% of our work to Amazon Redshift with minimal changes
2014 Plan
• Focus on big tables (100T+)
  – Need to solve data expiry and backfill challenges
• Solve problems with CPU-bound workloads
• Interactive analytics (third-party vendor apps with Amazon Redshift + Oracle)