© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
November 30, 2016
Migrating Your Data Warehouse to Amazon Redshift
DAT202
Pavan Pothukuchi, Sr. Manager PM, Amazon Redshift
Ali Khan, Director of BI and Analytics, Scholastic
Laxmikanth Malladi, Principal Architect, Northbay Solutions
“It’s our biggest driver of growth in our biggest markets, and is a feature of the
company.” – Chris Lambert, Lyft CTO, on data mining in Redshift
“The doors were blown wide open to create custom dashboards for anyone to
instantly go in and see and assess what is going on in our ad delivery landscape,
something we have never been able to do until now.” – Bryan Blair, Vevo’s VP of Ad Operations
“Analytical queries are 10 times faster in Amazon Redshift than they
were with our previous data warehouse.”– Yuki Moritani, NTT Docomo Innovation Manager
“We have several petabytes of data and use a massive Redshift
cluster. Our data science team can get to the data faster and then
analyze that data to find new ways to reduce costs, market
products, and enable new business.”– Yuki Moritani, NTT Docomo Innovation Manager
“We saw a 2x performance improvement on a wide variety of
workloads. The more complex the queries, the higher the
performance improvement.” – Naeem Ali, Director of Software Development, Data
Science at Cablevision (Optimum)
“Over the last few years, we’ve tried all kinds of databases in search of more
speed, including $15k of custom hardware. Of everything we’ve tried,
Amazon Redshift won out each time.”– Periscope Data, Analyst’s Guide to Redshift
“We took Amazon Redshift for a test run the moment it was
released. It’s fast. It’s easy. Did I mention it’s ridiculously fast?
We’re using it to provide our analysts an alternative to Hadoop.”– Justin Yan, Data Scientist at Yelp
“The move to Redshift also significantly improved dashboard query
performance… Redshift performed ~200% faster than the
traditional SQL Server we had been using in the past.”
- Dean Donovan, Product Development at DiamondStream
“…[Redshift] performance has blown away everyone here (we
generally see 50-100x speedup over Hive)”
- Jie Li, Data Infrastructure at Pinterest
“450,000 online queries 98 percent faster than previous traditional data
center, while reducing infrastructure costs by 80 percent.”
- John O’Donovan, CTO, Financial Times
“We needed to load six months' worth of data, about 10 TB of data, for a
campaign. That type of load would have taken about 20 days with our previous
solution. By using Amazon Redshift, it only took six hours to load the data.”
- Zhong Hong, VP of Infrastructure, Vivaki (Publicis Groupe)
“We regularly process multibillion row datasets and we do that in a
matter of hours. We are heading to up to 10 times more data volumes in
the next couple of years, easily.”
- Bob Harris, CTO, Channel 4
“On our previous big data warehouse system, it took around 45
minutes to run a query against a year of data, but that number went
down to just 25 seconds using Amazon Redshift”
- Kishore Raja Director of Strategic Programs and R&D, Boingo Wireless
“Most competing data warehousing solutions would have cost us up
to $1 million a year. By contrast, Amazon Redshift costs us just
$100,000 all-in, representing a total cost savings of around 90%”
- Joel Cumming, Head of Data, Kik Interactive
“Annual costs of Redshift are equivalent to just the annual
maintenance of some of the cheaper on-premises options for
data warehouses.”
- Kevin Diamond, CTO, HauteLook (Nordstrom)
“Our data volume keeps growing, and we can support that
growth because Amazon Redshift scales so well. We wouldn’t
have that capability using the supporting on-premises hardware in
our previous solution.”
- Ajit Zadgaonkar, Director of Ops. and Infrastructure, Edmunds
“With Amazon Redshift and Tableau, anyone in the company can set up
any queries they like - from how users are reacting to a feature, to growth by
demographic or geography, to the impact sales efforts had in different areas”
- Jon Hoffman, Head of Engineering, Foursquare
Today’s agenda
• Amazon Redshift Overview
• Use cases and benefits
• Migration options
• Scholastic’s use case
• Architecture details
• Technical overview
• Key project learnings
Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
Amazon Redshift
a lot faster
a lot simpler
a lot cheaper
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical
representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any
vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
Forrester Wave™ Enterprise Data Warehouse Q4 ’15
Selected Amazon Redshift customers
Why migrate to Amazon Redshift?
• vs. transactional databases: 100x faster; scales from GBs to PBs; analyze data without storage constraints
• vs. MPP databases: 10x cheaper; easy to provision and operate; higher productivity
• vs. Hadoop: 10x faster; no programming; standard interfaces and integration to leverage BI tools, machine learning, streaming
Migration from Oracle @ Boingo Wireless
2000+ Commercial Wi-Fi locations
1 million+ Hotspots
90M+ ad engagements
100+ countries
Legacy DW: Oracle 11g based DW
Before migration
Rapid data growth slowed
analytics
Mediocre IOPS, limited memory,
vertical scaling
Admin overhead
Expensive (license, h/w, support)
After migration
180x performance improvement
7x cost savings
[Chart: latency in seconds, existing system vs. Redshift]
• Annual cost: Oracle Exadata $400,000; SAP HANA $300,000; Amazon Redshift $55,000
• Query performance, 1 year of data: 7,200 s on the existing system vs. 15 s on Redshift
• Data load performance, 1 million records: 2,700 s on the existing system vs. 15 s on Redshift
7x cheaper than Oracle Exadata; 180x faster than Oracle database
Migration from Greenplum @ NTT Docomo
68 million customers
10s of TBs per day of data across
mobile network
6PB of total data (uncompressed)
Data science for marketing
operations, logistics etc.
Legacy DW: Greenplum on-premises
After migration:
125 node DS2.8XL cluster
4,500 vCPUs, 30TB RAM
6 PB uncompressed
10x faster analytic queries
50% reduction in time for new BI
app. deployment
Significantly less ops. overhead
Migration from SQL on Hadoop @ Yahoo
Analytics for website/mobile events
across multiple Yahoo properties
On an average day
2B events
25M devices
Before migration: Hive – Found it to be
slow, hard to use, share and repeat
After migration:
21 node DC1.8XL (SSD)
50TB compressed data
100x performance improvement
Real-time insights
Easier deployment and
maintenance
[Chart: query latency in seconds, log scale, for CountDistinctDevices, Count AllEvents, FilterClauses, and Joins: Amazon Redshift vs. Impala]
Business Value and Productivity
Business Productivity Benefits
Analyze more data
Faster time to market
Get better insights
Match capacity with demand
How to Migrate?
Moving from Engine X to Amazon Redshift touches ETL scripts, SQL in reports, and ad hoc queries.
1. Schema conversion: map data types; choose compression encodings, sort keys, distribution keys; generate and apply DDL
2. Data migration: bulk load; capture updates
3. Schema & data transformation: convert SQL code; transformations
4. Assess gaps: stored procedures, functions
Convert schema in a few clicks
Sources include Oracle, Teradata,
Greenplum and Netezza
Automatic schema optimization
Converts application SQL code
Detailed assessment report
AWS Schema
Conversion Tool
(AWS SCT)
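SCT automates the data-type mapping step. As a rough illustration of what that mapping involves, here is a hand-rolled, hypothetical subset (these are not SCT's actual rules):

```python
# Hypothetical subset of an Oracle-to-Redshift type map (SCT's real rules are richer).
ORACLE_TO_REDSHIFT = {
    "VARCHAR2": "VARCHAR",
    "NUMBER": "NUMERIC",
    "DATE": "TIMESTAMP",       # Oracle DATE carries a time-of-day component
    "CLOB": "VARCHAR(65535)",  # Redshift has no LOB types
}

def map_column_type(oracle_type: str) -> str:
    """Map an Oracle column type to a Redshift type; lengths are dropped for brevity."""
    base = oracle_type.split("(")[0].strip().upper()
    return ORACLE_TO_REDSHIFT.get(base, "VARCHAR(256)")  # conservative fallback
```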
AWS Database Migration Service
Start your first migration in a few minutes
Sources include: Aurora, Oracle, SQL
Server, MySQL and PostgreSQL
Bulk load and continuous replication
Migrate a TB for $3
Fault tolerant
(AWS DMS)
AWS DMS: Change data capture
[Diagram: a replication instance sits between source and target; transactions t1 and t2 committed on the source during the bulk load are captured and applied as changes to the target afterward]
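DMS tasks are configured with table-mapping rules in JSON. A minimal sketch of a selection rule that includes every table in one schema (the schema name here is hypothetical):

```python
import json

def selection_rule(schema: str, table: str = "%", rule_id: str = "1") -> dict:
    """One DMS selection rule in the table-mapping JSON format."""
    return {
        "rule-type": "selection",
        "rule-id": rule_id,
        "rule-name": f"include-{schema.lower()}",
        "object-locator": {"schema-name": schema, "table-name": table},
        "rule-action": "include",
    }

# The string you would pass as the task's table mappings.
table_mappings = json.dumps({"rules": [selection_rule("SALES")]})
```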
Data integration partners
[Slide: logos of data integration and systems integrator partners around Amazon Redshift, and beyond]
Scholastic, Established 1920
Where were we?
Platform
13+ years old. IBM AS/400 DB2 and Microsoft SQL Server are the primary data
warehouse platforms. BI Platform is primarily Microsoft (SSRS, SSAS, Excel, SharePoint)
500+ direct users across every LOB and business function
20+ TB. 5,500+ DB2 workloads, 350+ SQL Server workloads, 15 SSAS cubes, 150+
SSRS reports
Challenges
Inflexible, multi-layered architecture – slow time to market
Inability to meet internal SLAs due to performance of daily ETL processes
Scalability limitations with SQL Server Analysis Services (SSAS) for reports
Limited ability to perform self-service Business Intelligence
Moving forward: Key decision factors
• Improved performance, scalability, availability,
logging, security
• Enablement of self service business intelligence
• Leverage the skill set of current team (Relational DB
& SQL)
• Integration with existing technology stack
• Alignment with the tech strategy (devops model,
Cloud First)
• Ability to support Big Data initiatives
• Team up with an experienced consulting partner
Why we chose AWS and Amazon Redshift
AWS was chosen for its agility, scalability, elasticity, and
security
Redshift
• Scalable, fast
• Managed service, cost-optimization models,
elastic
• SQL/relational matched skillset of team
S3 was chosen as location for ingestion process
NorthBay was chosen as the implementation partner for
their expertise in Big Data and Redshift migrations
How the project unfolded
Goals
• 3-month pilot to migrate a functional area in a key LOB
• Demonstrate immediate business value
• Use AWS Stack & Open Source for Data Movement from DB2
(No CDC/ETL tool)
Outcomes
• Core Framework for Migration
• ELT Architecture and Validation
• Visualization/Self-service capability through Tableau
Scholastic data cloud: Technical architecture
[Architecture diagram: Data Pipeline orchestrates the flow. An EMR cluster running Sqoop scripts extracts from the source DBs (AS/400 DB2 staging, SQL Server EDW) to an S3 output bucket; an EC2 instance runs the COPY command into Redshift (staging) and on into Redshift (enterprise data repository). Pipeline configurations live in an RDS MySQL instance and DynamoDB; a Lambda function saves pipeline stats; SNS topics publish pipeline status and failures as email notifications. Tableau is the reporting tool, replacing SSAS cubes and SSRS reports.]
Core Framework
• Jobs and Job Groups are defined as metadata in DynamoDB
• Control-M scheduler, Custom Application and Data Pipeline for
Orchestration
• ELT Process with EMR/Sqoop for Extraction. Load and Transform
the data through Redshift SQL scripts
• Core Framework enables
• Restart capability from point of failure
• Capturing of operational statistics (# of rows updated, etc.)
• Audit capability (which feed caused the Fact to change, etc.)
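The restart-from-point-of-failure capability can be sketched with job metadata of the kind the framework stores in DynamoDB (the field names here are hypothetical):

```python
# Each job is a metadata item as it might be stored in DynamoDB (illustrative schema).
jobs = [
    {"job_group": "lob1_daily", "seq": 1, "name": "extract_orders", "status": "SUCCEEDED", "rows": 120000},
    {"job_group": "lob1_daily", "seq": 2, "name": "load_staging", "status": "FAILED", "rows": 0},
    {"job_group": "lob1_daily", "seq": 3, "name": "transform_facts", "status": "PENDING", "rows": 0},
]

def restart_point(jobs: list) -> list:
    """Return the jobs still to run, in sequence order, skipping completed ones."""
    pending = [j for j in jobs if j["status"] != "SUCCEEDED"]
    return sorted(pending, key=lambda j: j["seq"])
```

On restart, the batch resumes at `load_staging` instead of re-extracting, and the `rows` field doubles as the operational statistic the framework captures.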
Extract
• Pre-create EMR resources at the start of Batch
• Achieve parallelism in Sqoop with mappers and Fair Scheduling
• Sqoop query to add additional fields like Batch_id, Updated_date etc
• Data extracts are split and compressed for optimized loading into Redshift
[Diagram: Data Pipeline drives the extract control flow; data flows from AS/400 DB2 through EMR with Sqoop into S3, with job metadata and KMS-encrypted credentials supporting the process]
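A Sqoop free-form query import of the kind described above might be assembled like this (host, bucket, and column names are hypothetical; Sqoop requires the `$CONDITIONS` placeholder in split queries):

```python
def build_sqoop_command(jdbc_url: str, query: str, split_col: str,
                        mappers: int, target_dir: str, batch_id: int) -> list:
    """Assemble a sqoop import argv list: parallel free-form query import to S3."""
    # Add audit fields (batch_id, updated_date) in the query itself, as the slide notes.
    full_query = (
        f"SELECT t.*, {batch_id} AS batch_id, CURRENT_TIMESTAMP AS updated_date "
        f"FROM ({query}) t WHERE $CONDITIONS"
    )
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--query", full_query,
        "--split-by", split_col,          # column used to partition work across mappers
        "--num-mappers", str(mappers),    # degree of extraction parallelism
        "--target-dir", target_dir,
        "--compress",                     # compressed splits load faster into Redshift
        "--compression-codec", "org.apache.hadoop.io.compress.GzipCodec",
    ]
```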
Load
• Truncate and load through Data Pipeline for staging tables
• Dynamic Workload Management (WLM) queues set up to allow maximum
resources during loading/transformation
• Check and terminate any locks on tables to allow truncation
• Capture metrics related to number of rows loaded, time taken, etc.
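The dynamic WLM setup can be expressed through Redshift's `wlm_json_configuration` parameter; a minimal sketch, with a hypothetical query group for the batch window:

```python
# One possible wlm_json_configuration value (queue names are hypothetical):
# a high-memory queue for the ELT query group during the batch window,
# and a default queue for everything else. Memory percentages must total 100.
wlm_config = [
    {"query_group": ["elt_batch"], "query_concurrency": 5, "memory_percent_to_use": 75},
    {"query_concurrency": 5, "memory_percent_to_use": 25},  # default queue
]
```

Swapping this configuration in for the batch window and back out afterward is what "dynamic" means on the slide above.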
[Diagram: Data Pipeline drives the load control flow; an EC2 instance copies data from S3 into the Redshift staging schema, with KMS-encrypted credentials]
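Loading the split, compressed extracts from S3 uses the Redshift COPY command; a sketch of building one (the table, bucket, and role names are hypothetical):

```python
def build_copy_command(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a COPY statement for gzipped, pipe-delimited split files under an S3 prefix."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "        # a prefix loads all split files in parallel
        f"IAM_ROLE '{iam_role}' "
        "GZIP DELIMITER '|' "
        "TIMEFORMAT 'auto' "
        "MAXERROR 0;"                 # fail fast on any bad row during staging loads
    )
```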
Transform
• Custom Application for building Dimensions and Facts
• SQL Scripts are stored in S3 and executed by ELT process
• SQL scripts refactored from SQL Server and AS400 scripts
• Non-Functional Requirements are achieved through Custom App
[Diagram: the custom app reads metadata and SQL scripts from S3 and executes the transform steps in Redshift, populating dimensions and then facts from staging]
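One way the ELT process could sequence the stored SQL scripts so dimensions build before the facts that reference them (the naming convention and S3 keys are hypothetical):

```python
def order_scripts(script_keys: list) -> list:
    """Order transform scripts: dimension scripts first, then facts, alphabetical within each."""
    rank = {"dim": 0, "fact": 1}
    def sort_key(key: str):
        prefix = key.split("/")[-1].split("_")[0]  # e.g. "dim" from "sql/dim_customer.sql"
        return (rank.get(prefix, 2), key)
    return sorted(script_keys, key=sort_key)
```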
Schema Design
• Modified Star Schema
• Natural Keys instead of generating unique identifiers
• Commonly used columns from Dimensions are copied over to
Facts
• Surrogate keys are eliminated except for a few cases
• Compression
• Define appropriate Distribution and Sort Keys
• Define primary key and Foreign keys
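Put together, the schema choices above might yield DDL like the following sketch (table, column names, and encodings are hypothetical; Redshift treats primary and foreign keys as informational only):

```python
# Illustrative fact-table DDL embodying the slide's choices: natural keys,
# a dimension column denormalized onto the fact, compression encodings,
# and explicit distribution/sort keys.
FACT_DDL = """
CREATE TABLE IF NOT EXISTS edw.sales_fact (
    order_id      BIGINT NOT NULL,            -- natural key from the source system
    customer_id   BIGINT NOT NULL,
    order_date    DATE   NOT NULL,
    product_name  VARCHAR(200) ENCODE LZO,    -- dimension column copied onto the fact
    amount        NUMERIC(12,2) ENCODE DELTA32K,
    PRIMARY KEY (order_id)                    -- informational only; not enforced
)
DISTKEY (customer_id)
SORTKEY (order_date);
"""
```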
Security
• AWS Key Management Service (KMS) is used for encrypting
access credentials to Source and Target databases
• Jenkins job to allow encrypting of credentials using KMS
directly by Database Administrators
• Amazon EMR, Jenkins resources are given KMS decrypt
permissions to allow connecting to Sources and Targets during
the ELT process
• Standard Security in Transit and at Rest throughout the process
• IAM federation through Enterprise Active Directory
Reporting
• Business users access facts/dimensions through Tableau
• Power users access staging tables through Tableau
• Data analysts access files in S3 using Hive/Presto
• Self-service capability across business users
[Diagram: business analysts and power users query Redshift (staging, facts/dimensions) through Tableau; data analysts query files in S3 through EMR with Presto/Hive]
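Giving data analysts access to the S3 files through Hive/Presto typically means defining an external table over them; a sketch with a hypothetical bucket and layout:

```python
# Illustrative Hive DDL for querying the raw extract files in place on S3.
EVENTS_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS staging_events (
    event_id   STRING,
    user_id    STRING,
    event_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3://example-bucket/staging/events/';
"""
```

Presto can then query the same table through the Hive metastore, with no second copy of the data.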
Workstream Effort
• Define Jobs and Job Groups specific to each
Workstream
• Create Redshift tables (Staging, Facts, Dimensions)
based on mapping from AS400 and best practices
learned
• Create new SQL scripts (based on the logic from
AS400/SQL Server code) for transformation
• Develop, Test and Deploy in 2-week Agile sprints
Key Lessons - Technical
• Isolate core framework with project specific code repositories
• Consolidating logging solution across Amazon S3, Amazon
Redshift, Amazon DynamoDB etc., was a challenge
• Make appropriate schema changes when migrating to new
platform
• Custom framework for gathering operational stats (e.g., # of
rows loaded)
• Start with test automation tools and Acceptance Test Driven
Development (ATDD) earlier in the project
Project timeline revisited
After the successful pilot:
• Executive Leadership accelerated timeline:
• Reduce project timeline by 50% (to 12 months) to
deliver value faster to LOBs
• Realize cost savings by eliminating the DB2 and
SQL Server platforms earlier
• Users wanted to be on the new platform!
• Scholastic & NorthBay partnered to create a
training curriculum to ensure a supply of skilled
staff would be available to our teams
Scaling up: 7 workstreams
• Developed a model for estimating effort and cost
(AWS costs & Labor per LOB migration)
• Running agile teams in parallel – employed Agile
coaches
• Enhanced the core framework to ensure it would
scale effectively when in use by multiple teams
simultaneously
• Building a Code repository for use by all teams
• Building CI / CD Frameworks
Where are we now?
• 4 of 7 LOBs migrated – framework enables complete migration of a
functional area within days/weeks as opposed to months. On track to
migrate and decommission entire legacy environment within next 6
months
• 10 weeks to migrate from an external vendor hosting data and providing
reports for one LoB
• Cost of Data Ingestion Framework is under $40/day (EC2, EMR, Data
Pipeline)
• First “Big Data” initiative in production, captures and processes an
average of 1.5 million e-reading events daily (peak: 7 million)
• Profile: LOB #1
• Loading ~5-6 Million rows/day (6-7GB/day)
• Processing over 1.5 billion rows within Redshift daily
• Complete ETL/ELT batch cycle performance improved by over 170%
Key lessons – project execution
• Essential to monitor and optimize AWS costs
• “Data Champion” / “Data Guide” partnership absolutely critical for
successful adoption of new platforms
• Importance of strong Agile coaches while scaling out Agile teams
• Criticality of choosing consulting partners (AWS & NorthBay)
who can ramp up and supply key resources fast and cycle off the
project when finished
• Creating new data platforms and migrating data into them is
easy, especially with AWS. Decommission of existing data
platforms is hard!
Thank you!
Remember to complete
your evaluations!
Related Sessions
Hear from other customers discussing their Amazon Redshift use cases:
• BDM402—Best Practices for Data Warehousing with Amazon Redshift (King.com)
• BDA304—What’s New with Amazon Redshift
• SVR308—Content and Data Platforms at Vevo: Rebuilding and Scaling from Zero in One Year
• GAM301—How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful
Player Insights
• BDA207—Fanatics: Deploying Scalable, Self-Service Business Intelligence on AWS
• BDM306—Netflix: Using Amazon S3 as the fabric of our big data ecosystem
• BDA203—Billions of Rows Transformed in Record Time Using Matillion ETL for Amazon Redshift
(GE Power and Water)
• BDM206—Understanding IoT Data: How to Leverage Amazon Kinesis in Building an IoT
Analytics Platform on AWS (Hello)
• STG307—Case Study: How Prezi Built and Scales a Cost-Effective, Multipetabyte Data Platform
and Storage Infrastructure on Amazon S3