(BDT309) Delivering Results with Amazon Redshift, One Petabyte at a Time | AWS re:Invent 2014

November 12, 2014 | Las Vegas, NV | Erik Selberg ([email protected]) | Samar Sodhi (samars@amazon.com)

description

The Amazon Enterprise Data Warehouse team, responsible for data warehousing across all of Amazon's divisions, spent 2014 working with Amazon Redshift on its largest datasets, including web log traffic. The key goal of this project was to provide a viable, enterprise-grade solution that enabled full scans of 2 trillion rows in under an hour at load. Key to success was the automation of routine DW tasks that become complicated at scale: backfilling erroneous data, re-calculating statistics, re-sorting daily additions, and so forth. In this session, we discuss the scale and performance of a 100-node 1PB Amazon Redshift cluster, and describe some of the technical aspects and best practices of running 100-node clusters in an enterprise environment.

Transcript of (BDT309) Delivering Results with Amazon Redshift, One Petabyte at a Time | AWS re:Invent 2014

Page 12:

Use Case                               Goal       Benchmark
Scan 2.25 Trillion Rows (15 months)    60m        14m
Load 5 Billion Rows (1 day)            60m        10m
Load 150 Billion Rows (30 days)        24 hours   9.75 hours
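
As a rough check on the benchmark figures above, the measured times can be converted into throughput and compared with the goals. This is only a sketch: it assumes decimal units (1 trillion = 10^12 rows) and that the times are wall-clock. The `Benchmark` class and its field names are illustrative and not from the talk.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str
    rows: float        # rows scanned or loaded
    goal_min: float    # target time in minutes
    actual_min: float  # measured time in minutes

    def rows_per_second(self) -> float:
        return self.rows / (self.actual_min * 60)

    def speedup_vs_goal(self) -> float:
        return self.goal_min / self.actual_min


# Figures copied from the Use Case / Goal / Benchmark table.
BENCHMARKS = [
    Benchmark("Scan 2.25 trillion rows (15 months)", 2.25e12, 60, 14),
    Benchmark("Load 5 billion rows (1 day)", 5e9, 60, 10),
    Benchmark("Load 150 billion rows (30 days)", 150e9, 24 * 60, 9.75 * 60),
]

for b in BENCHMARKS:
    print(f"{b.name}: {b.rows_per_second():.3g} rows/s, "
          f"{b.speedup_vs_goal():.2f}x faster than goal")
```

Under these assumptions the full scan works out to roughly 2.7 billion rows per second, and each benchmark beats its goal by a factor of about 2.5 to 6.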

Page 14:

– VACUUM is slow, physical partitions do not exist

• Doesn’t allow for parallel loads into the same table

• 15 concurrent queries

– “Bad” queries can impact the entire cluster

Page 17:

2x

Page 18:

– COMPUPDATE (samples the data) – fast but not optimal
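
For context, COMPUPDATE is an option of the Redshift COPY command: when it is ON and the target table is empty, COPY samples the incoming data to choose column encodings. The sketch below only builds the SQL text. The helper name `build_copy_sql` and the table, bucket, and role ARN are hypothetical placeholders, and the talk does not say which authorization clause the team used.

```python
def build_copy_sql(table: str, s3_path: str, iam_role: str,
                   compupdate: bool = True) -> str:
    """Return a Redshift COPY statement with COMPUPDATE set explicitly.

    Plain string formatting is used for illustration only; do not pass
    untrusted input to this function.
    """
    setting = "ON" if compupdate else "OFF"
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"COMPUPDATE {setting};"
    )


# Hypothetical table, bucket, and role; replace with real values.
sql = build_copy_sql(
    "weblogs_daily",
    "s3://example-bucket/weblogs/2014-11-12/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(sql)
```

Passing `compupdate=False` emits `COMPUPDATE OFF`, which skips the sampling step and keeps whatever encodings the table already defines.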

Page 22:

FASTER                86.35%
  GREATER THAN 15X    14.91%
  10X TO 15X          18.42%
  5X TO 10X           25.73%
  3X TO 5X            19.88%
  2X TO 3X             7.02%
  1X TO 2X             3.80%
SAME                   8.47%
SLOWER                 5.65%
  1X TO 2X             1.75%

Page 23:

FASTER                14.85%
  3X TO 5X             0.56%
  2X TO 3X             3.64%
  1X TO 2X            10.64%
SAME                  19.05%
SLOWER                66.11%
  1X TO 2X            18.49%
  2X TO 3X             8.96%
  3X TO 5X             9.8%
  5X TO 10X           10.08%
  10X TO 15X           5.04%
  SLOWER THAN 15X     13.73%
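
A quick consistency check on the Page 23 distribution: the sub-buckets should add up to their headline percentage, to within rounding. This sketch uses only the figures listed above; the variable names are illustrative.

```python
# Sub-bucket percentages as listed under FASTER and SLOWER on Page 23.
faster = {"3X TO 5X": 0.56, "2X TO 3X": 3.64, "1X TO 2X": 10.64}
slower = {
    "1X TO 2X": 18.49,
    "2X TO 3X": 8.96,
    "3X TO 5X": 9.8,
    "5X TO 10X": 10.08,
    "10X TO 15X": 5.04,
    "SLOWER THAN 15X": 13.73,
}
headline = {"FASTER": 14.85, "SAME": 19.05, "SLOWER": 66.11}

faster_total = round(sum(faster.values()), 2)
slower_total = round(sum(slower.values()), 2)
overall = round(sum(headline.values()), 2)

print(f"FASTER buckets: {faster_total} (headline {headline['FASTER']})")
print(f"SLOWER buckets: {slower_total} (headline {headline['SLOWER']})")
print(f"Headline total: {overall}")
```

Both sub-totals land within 0.01 of their headline values, and the three headline percentages add up to about 100%.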

Page 24:

or

30 min

48 hours

48 hours

Daily (6B)            40 8XL nodes    100 8XL nodes
Vacuum                80 min          30 min
Stats Collection      90 sec          50 sec

Monthly (150B)        40 8XL nodes    100 8XL nodes
Vacuum (Deep Copy)    380 min         201 min
Stats Collection      22 min          4 min
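
To see how maintenance time scales from 40 to 100 nodes, each pair of timings above can be turned into a speedup and compared with the 2.5x increase in node count. This is a back-of-the-envelope sketch using only the table's figures; the names are illustrative, and a single measurement per cell is assumed.

```python
NODES_SMALL, NODES_LARGE = 40, 100
node_ratio = NODES_LARGE / NODES_SMALL  # 2.5x more nodes

# (task, time on 40 nodes, time on 100 nodes); each pair uses one unit.
TIMINGS = [
    ("Daily vacuum (min)", 80, 30),
    ("Daily stats collection (sec)", 90, 50),
    ("Monthly vacuum, deep copy (min)", 380, 201),
    ("Monthly stats collection (min)", 22, 4),
]


def speedup(t_small: float, t_large: float) -> float:
    """How many times faster the task ran on the larger cluster."""
    return t_small / t_large


for task, t40, t100 in TIMINGS:
    s = speedup(t40, t100)
    print(f"{task}: {s:.2f}x speedup, "
          f"{s / node_ratio:.0%} of linear scaling")
```

By this measure the daily vacuum and the monthly stats collection improve by more than the node ratio, while the daily stats collection and the monthly deep copy improve by less.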

Page 26:

http://bit.ly/awsevals