© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
DAT306 - How Amazon.com, with One of the World’s Largest Data Warehouses, Is Leveraging Amazon Redshift
Erik Selberg (selberg@amazon.com) and Abhishek Agrawal (abhagrwa@amazon.com)
November 14, 2013
Agenda
• Amazon Data Warehouse Overview
• Amazon Data Warehouse and Amazon Redshift Integration Project
• Amazon Redshift Best Practices
• Conclusion
Amazon Data Warehouse
• Authoritative repository of data for all of Amazon
• Petabytes of data
• Existing EDW is Oracle RAC; also using Amazon Elastic MapReduce and now Amazon Redshift
• Owns managing the hardware and software infrastructure
  – Apart from the Oracle DB, just Amazon IP
• Not part of AWS
Introducing the Elephant…
• Mission: Provide customers the best value
  – Leverage AWS only if it provides the best value
  – We aren’t moving 100% to Amazon Redshift
• Publish best practices
  – If AWS isn’t the best, we’ll say so
• There is a conflict of interest
Amazon Data Warehouse Architecture
[Architecture diagram: a Control Plane (ETL Manager) orchestrating data flows across the existing EDW, Amazon EMR, and Amazon Redshift]
Amazon Data Warehouse – Growth Story
• Petabytes of data
• Growth of data volume
  – YoY storage requirements have grown 67%
• Growth of processing volume
  – YoY processing demand has grown 47%
Amazon Data Warehouse – Cost per Job
• Our main efficiency metric – Cost per Job (CPJ)
    CPJ = ($CapEx + $DataCenter + $VendorSupport) / PeakJobsPerDay
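Purely for illustration (these numbers are invented, not Amazon’s): with $CapEx = $8M, $DataCenter = $1M, $VendorSupport = $1M, and 5,000 peak jobs per day,

    CPJ = ($8M + $1M + $1M) / 5,000 = $2,000 per job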
What Drives Cost per Job…
Up?
• Number of disks
  – Data gets bigger!
• Number of servers
• Short-sighted negotiations
  – 4th year support…
• Data Center costs (power, rent)

Down?
• Bidding
  – 2+ vendors
• Moore’s Law
  – Vendors fight this!
• Data design
• Software (e.g., DBMS)
Current State and Problems
• Existing EDW
  – Multiple multi-petabyte clusters (redundancy and jobs)
  – Why not <x>? CPJ not lower
• Data stored in SANs (not Exadata)
• Performs poorly on scans of 10T+
• Long procurement cycles (3-month minimum)
Amazon Data Warehouse and Amazon Redshift
Integration Project
• Spent 2013 evaluating Amazon Redshift for the Amazon data warehouse
  – Where does Amazon Redshift provide a better CPJ?
  – Can Amazon Redshift solve some pain (without introducing new pain)?
• Picked 10K jobs and 275 tables to copy
Current State of Affairs
• Biggest cluster size: 20+1 8XL
• Peak daily jobs: 7211 (using all 4 clusters)
• 4159 extracts
• 3052 loads
Some Results
• Benchmarking for 4159 jobs
  – Outperforming: 2719
  – Underperforming: 1440
  – Avg. runtime: 4:43 min in Amazon Redshift vs. 17:38 min in the existing EDW
• LOADs are slower
• EXTRACTs are faster
Job Type   RS Performance Category   Job Count by Category
EXTRACT    10X Faster                 945
EXTRACT    5X Faster                  487
EXTRACT    3X Faster                  393
EXTRACT    2X Faster                  301
EXTRACT    1X or same                 480
EXTRACT    2X Slower                 1150
LOAD       10X Faster                   7
LOAD       5X Faster                   15
LOAD       3X Faster                   23
LOAD       2X Faster                   23
LOAD       1X or same                  45
LOAD       2X Slower                  290
Amazon Redshift Integration Best Practices
• Integrating via Amazon S3 (Manifests)
• Primary key enforcement
• Idempotent loads
  – MERGE via INSERT/UPDATE
  – Mimic Trunc-Load [Backfills]
• Trunc-partition using sort keys
• Administration automation
• Ensuring data correctness
Integrating via Amazon S3
• S3 in the US Standard Region is eventually consistent!
• An S3 LIST might not give the entire list of data right after you save it (this WILL eventually happen to you!)
• Amazon Redshift loads everything it sees in a bucket
  – You may see all the data files while Amazon Redshift may not, which can cause missing data
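A minimal sketch of the manifest-based fix: instead of pointing COPY at a bucket prefix, point it at a manifest that pins the exact file set. The bucket, table, and file names here are hypothetical; COPY, CREDENTIALS, GZIP, and MANIFEST are real Amazon Redshift syntax.

    -- Hypothetical manifest file, s3://my-bucket/orders/batch.manifest:
    -- {
    --   "entries": [
    --     {"url": "s3://my-bucket/orders/part-0000.gz", "mandatory": true},
    --     {"url": "s3://my-bucket/orders/part-0001.gz", "mandatory": true}
    --   ]
    -- }

    -- COPY reads only the files named in the manifest; "mandatory": true
    -- makes the load fail loudly if a listed file is missing.
    COPY orders_stage
    FROM 's3://my-bucket/orders/batch.manifest'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    GZIP
    MANIFEST;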
Best Practices – Using Amazon S3
• Read/COPY
  – System table validation – STL_LOAD_ERRORS
  – Verify the files loaded are the ‘intended’ files
• Write/UNLOAD
  – System table validation – STL_UNLOAD_LOG
  – Verify all files that hold the data are on S3
• Manifests
  – Metadata to know exactly what to read from S3
  – Provide an authoritative reference to the data
  – Powerful in terms of user metadata format, encryption, etc.
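A sketch of those system-table checks. STL_LOAD_ERRORS, STL_LOAD_COMMITS, and STL_UNLOAD_LOG are real Amazon Redshift system tables, as are PG_LAST_COPY_ID() and PG_LAST_QUERY_ID(); the exact filters below are illustrative.

    -- After a COPY: any load errors?
    SELECT query, filename, line_number, err_reason
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 20;

    -- After a COPY: which files were actually committed?
    SELECT query, filename
    FROM stl_load_commits
    WHERE query = pg_last_copy_id();

    -- After an UNLOAD: which files were written, and how many rows each?
    SELECT query, path, line_count
    FROM stl_unload_log
    WHERE query = pg_last_query_id()
    ORDER BY path;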
Primary Key Enforcement
• Amazon Redshift does not enforce primary keys
  – You will need to do this yourself to ensure data quality
• Best practice (sketched below)
  – Introduce a temp table to check for duplicates in the incoming data
  – Validate against incoming data to catch offenders
  – Put the data in the target table and validate the target data in the same transaction, before commit
• Yes, this IS a lot of overhead
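A minimal sketch of the duplicate check, assuming hypothetical tables orders_stage (the incoming batch) and orders (the target), both keyed on order_id:

    BEGIN;

    -- 1. Catch duplicate keys within the incoming batch.
    SELECT order_id, COUNT(*)
    FROM orders_stage
    GROUP BY order_id
    HAVING COUNT(*) > 1;
    -- (if this returns rows, ROLLBACK and quarantine the batch)

    -- 2. Load, then re-validate the target before committing.
    INSERT INTO orders
    SELECT * FROM orders_stage;

    SELECT order_id, COUNT(*)
    FROM orders
    GROUP BY order_id
    HAVING COUNT(*) > 1;
    -- (rows here mean the load introduced duplicates: ROLLBACK)

    COMMIT;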
Idempotent Loads
• Idempotent loads – doing a load 2+ times has the same effect as doing it once
  – Needed to manage load failures
• MERGE – leverages the primary key, a row at a time
• TRUNC / INSERT – loads a partition at a time
MERGE
• No native Amazon Redshift MERGE support
• Merge is implemented as a multi-step process (sketched below)
  – Load the data into a temp table
  – Figure out the inserts and load them
  – Figure out the updates and modify the target table
  – Validate for duplicates
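A minimal sketch of the multi-step merge, reusing the hypothetical orders/orders_stage tables from above. Since Amazon Redshift had no native MERGE at the time, UPDATE-then-INSERT stands in for it; the column names are invented for illustration.

    BEGIN;

    -- 1. Stage the incoming batch.
    CREATE TEMP TABLE orders_stage (LIKE orders);
    COPY orders_stage
    FROM 's3://my-bucket/orders/batch.manifest'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    MANIFEST;

    -- 2. Updates: rows whose key already exists in the target.
    UPDATE orders
    SET    order_total = s.order_total,
           updated_at  = s.updated_at
    FROM   orders_stage s
    WHERE  orders.order_id = s.order_id;

    -- 3. Inserts: rows whose key is new.
    INSERT INTO orders
    SELECT s.*
    FROM   orders_stage s
    LEFT JOIN orders t ON s.order_id = t.order_id
    WHERE  t.order_id IS NULL;

    -- 4. Validate for duplicates (as on the primary key slide),
    --    then commit.
    COMMIT;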
TRUNC - INSERT
• Solution
  – Distribute randomly
  – Use sort keys to align data (mimics partitioning)
  – Selectively delete and insert (sketched below)
• Issues
  – Inserts land in an “unsorted” bucket – performance degrades without periodic VACUUM
  – Very slow (effectively a row at a time)
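A minimal sketch of mimicking a partition truncate-and-load, assuming a hypothetical fact table page_views sorted on activity_date, with staged data in page_views_stage:

    BEGIN;

    -- "Truncate" one logical partition: the sort key keeps this
    -- delete aligned to contiguous blocks.
    DELETE FROM page_views
    WHERE activity_date = '2013-11-14';

    -- Reload that partition from the staged data.
    INSERT INTO page_views
    SELECT *
    FROM page_views_stage
    WHERE activity_date = '2013-11-14';

    COMMIT;

    -- The inserted rows sit in the unsorted region until a VACUUM
    -- re-sorts them and reclaims the deleted rows' space.
    VACUUM page_views;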
Automating Administration
• Stored procs / Oracle workflow used to do admin tasks like retention, stats, etc.
• Solution
  – We introduced a software layer that prepares the administrative task statements based on defined inputs (examples below)
  – Executes them using a JDBC connection
  – Can schedule work like stats collection, vacuum, etc.
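The kinds of statements such a layer might emit, with a hypothetical table and retention window; DELETE, ANALYZE, and VACUUM are real Amazon Redshift commands.

    -- Retention: drop rows older than the configured window.
    DELETE FROM clickstream
    WHERE event_date < DATEADD(day, -90, CURRENT_DATE);

    -- Stats collection: refresh optimizer statistics.
    ANALYZE clickstream;

    -- Reclaim deleted space and re-sort the table.
    VACUUM clickstream;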
2013 Results
• CPJ is 55% less on Amazon Redshift in general
  – We can’t share the math, sorry; YMMV
  – Between Redshift and the Amazon data warehouse, known improvements get us to ~66%
  – Big wins are in big queries
  – Loads are slow and expensive
• Moved ~10K jobs to ~60 8XLs (4 clusters)
• We could move at most 45% of our work to Amazon Redshift with minimal changes
2014 Plan
• Focus on big tables (100T+)
  – Need to solve data expiry and backfill challenges
• Solve problems with CPU-bound workloads
• Interactive analytics (third-party vendor apps with Amazon Redshift + Oracle)