Building a data warehouse with AWS Redshift, Matillion and Yellowfin
lynn-langit
Building a Data Warehouse on AWS
Amazon S3
Amazon Redshift
Collect | Process | Analyze | Store
Data → Answers
Visualize
@Lynn Langit
AWS Marketplace
Enterprise software store for business users who need simplified procurement
• 2000+ product listings to browse, test and buy software
• 1-click deployment to launch, in multiple regions around the world
• Pay-as-you-go pricing to use on demand
Advanced Analytics
Data Enablement
Business Intelligence
Building a Data Warehouse on AWS
Move data into Redshift from S3 for analysis
Amazon S3
Amazon Redshift
AWS Marketplace Partners
Matillion
Visualize
Yellowfin
Collect | Process | Analyze | Store
Data → Answers
Setup
Our Scenario and Source Files
File Types
-- Text - .csv
-- Compressed - .gz
File Categories
Details / Events
-- Flights
-- Weather
Metadata
-- Airports
-- Carriers
“In this scenario we will use Matillion ETL for Redshift to prepare two separate data sources ready for analysis. The sample data is US airport flight information from 1995 -> 2008. Every flight to or from a US airport (and whether it left on time or not) is included.
The second data set is weather data, taken from NOAA, including the daily weather readings for each US Airport.”
Loading data from S3 into Redshift
Using Matillion ETL for Redshift
• Create Instance (AMI/EC2) of Matillion/AWS Marketplace
• Connect Matillion to Redshift
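A load from S3 into Redshift normally comes down to a COPY command, whichever tool issues it. The sketch below only builds the SQL text for such a command covering the two file types in this scenario (.csv and .gz); it does not connect to a cluster. The bucket, table name and IAM role ARN are hypothetical placeholders, and the header-row option is an assumption about the files, not something the slides state.

```python
def build_copy_statement(table, s3_path, iam_role, gzip=False, skip_header=True):
    """Return the text of a Redshift COPY statement for CSV files in S3."""
    options = ["FORMAT AS CSV"]
    if skip_header:
        # Assumption: each file starts with one header row.
        options.append("IGNOREHEADER 1")
    if gzip:
        # Needed when the source files are gzip-compressed (.gz).
        options.append("GZIP")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        + "\n".join(options)
        + ";"
    )


# Hypothetical names, for illustration only.
print(build_copy_statement(
    table="flights",
    s3_path="s3://example-bucket/flights/",
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    gzip=True,
))
```

Pointing COPY at a prefix rather than a single file lets Redshift load the matching files in parallel across slices.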
Loading Data in Redshift
Table distribution styles
• Distribution Key - Same key to same location
• All - All data on every node
• Even - Round robin distribution
[Diagram: for each style, a two-node cluster (Node 1: Slice 1, Slice 2; Node 2: Slice 3, Slice 4) showing where rows with key1, key2, key3 and key4 are placed]
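The three distribution styles can be illustrated with a toy model of the two-node, four-slice cluster in the diagram. This is a sketch only: the CRC32 hash is a stand-in chosen for repeatability, not Redshift's actual hash function, and the sample rows and the `carrier` column are hypothetical.

```python
import zlib

# Cluster layout taken from the diagram: two nodes, two slices each.
NODES = {"Node 1": ["Slice 1", "Slice 2"], "Node 2": ["Slice 3", "Slice 4"]}
SLICES = [s for node_slices in NODES.values() for s in node_slices]


def distribute_key(rows, key):
    """KEY style: rows with the same key value land on the same slice."""
    placement = {s: [] for s in SLICES}
    for row in rows:
        # Stand-in hash (assumption); any deterministic hash shows the idea.
        index = zlib.crc32(str(row[key]).encode()) % len(SLICES)
        placement[SLICES[index]].append(row)
    return placement


def distribute_all(rows):
    """ALL style: every node holds a full copy of the table."""
    return {node: list(rows) for node in NODES}


def distribute_even(rows):
    """EVEN style: rows are dealt out round robin across the slices."""
    placement = {s: [] for s in SLICES}
    for i, row in enumerate(rows):
        placement[SLICES[i % len(SLICES)]].append(row)
    return placement


# Hypothetical sample rows.
carriers = ["AA", "DL", "UA", "WN", "AA", "DL", "UA", "WN"]
rows = [{"flight": i, "carrier": c} for i, c in enumerate(carriers)]

for name, rows_on_slice in distribute_even(rows).items():
    print(name, [r["flight"] for r in rows_on_slice])
```

KEY distribution helps joins on the key column because matching rows are already co-located; ALL suits small lookup tables such as airports or carriers; EVEN is a reasonable default when no join key stands out.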
Sort Keys
• Single Column - [ SORTKEY ( date ) ]
  • Queries that use 1st column (i.e. date) as primary filter
• Compound - [ COMPOUND SORTKEY ( date, region, country ) ]
  • Queries that use 1st column as primary filter, then other columns
• Interleaved - [ INTERLEAVED SORTKEY ( date, region, country ) ]
  • Queries that use different columns in filter
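To make the three sort key options concrete, the sketch below generates CREATE TABLE text for each. In Redshift's CREATE TABLE grammar the style keyword comes before SORTKEY (COMPOUND SORTKEY, INTERLEAVED SORTKEY). The table name, column names and column types are hypothetical; `flight_date` stands in for the slide's `date` column.

```python
def sortkey_clause(columns, style=None):
    """Build a table-level sort key clause; style is None, COMPOUND or INTERLEAVED."""
    if style not in (None, "COMPOUND", "INTERLEAVED"):
        raise ValueError(f"unknown sort key style: {style!r}")
    prefix = f"{style} " if style else ""
    return f"{prefix}SORTKEY ({', '.join(columns)})"


def create_table_ddl(table, column_defs, sort_columns, style=None):
    """Return CREATE TABLE text with the requested sort key."""
    cols = ",\n".join(f"    {name} {sql_type}" for name, sql_type in column_defs)
    return f"CREATE TABLE {table} (\n{cols}\n)\n{sortkey_clause(sort_columns, style)};"


# Hypothetical table definition.
columns = [
    ("flight_date", "DATE"),
    ("region", "VARCHAR(32)"),
    ("country", "VARCHAR(32)"),
]

print(create_table_ddl("flights_single", columns, ["flight_date"]))
print(create_table_ddl("flights_compound", columns,
                       ["flight_date", "region", "country"], "COMPOUND"))
print(create_table_ddl("flights_interleaved", columns,
                       ["flight_date", "region", "country"], "INTERLEAVED"))
```

A compound key favors queries that filter on the leading column first, while an interleaved key gives each column equal weight at the cost of more expensive maintenance.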
Time Series Data – Vacuum Operation
[Diagram: a table made up of a Sorted Region followed by an Unsorted Region]
• Append in Sort Key Order
• Sort Unsorted Region
• Merge
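The sort and merge steps can be pictured with plain Python lists. This is a toy model of the idea, not Redshift's internal implementation, and the date values are hypothetical.

```python
import heapq


def is_sorted(rows):
    """True when rows are already in sort key order."""
    return all(a <= b for a, b in zip(rows, rows[1:]))


def vacuum_sort(sorted_region, unsorted_region):
    """Sort the unsorted region, then merge it with the sorted region."""
    newly_sorted = sorted(unsorted_region)                  # Sort Unsorted Region
    return list(heapq.merge(sorted_region, newly_sorted))   # Merge


# Hypothetical time series keyed by date (ISO strings sort chronologically).
sorted_region = ["2008-01-01", "2008-01-02", "2008-01-03"]

# Appending in sort key order keeps the table sorted, so there is nothing to fix.
in_order = sorted_region + ["2008-01-04", "2008-01-05"]
print(is_sorted(in_order))

# Late-arriving rows break the order and need a sort and merge.
late_rows = ["2008-01-02", "2007-12-31"]
print(is_sorted(sorted_region + late_rows))
print(vacuum_sort(sorted_region, late_rows))
```

In this model, loading time series data in sort key order leaves no unsorted region to clean up, which is the point of the first bullet.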
Visualizing with Yellowfin
Automate – https://github.com/lynnlangit/AWSDataWarehouse