Redshift Chartio Event Presentation
![Page 1: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/1.jpg)
Amazon Redshift
Spend time with your data, not your database…
![Page 2: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/2.jpg)
Data Warehouse Challenges
Cost
Complexity
Performance
Rigidity
[Chart: Enterprise Data vs. Data in Warehouse, 1990 to 2020]
![Page 3: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/3.jpg)
Amazon Redshift powers Clickstream Analytics for Amazon.com
• Web log analysis for Amazon.com
  – Petabyte workload
  – Largest table: 400 TB
• Understand customer behavior
  – Who is browsing but not buying
  – Which products/features are winners
  – What sequence led to higher customer conversion
• Solution
  – Best scale-out solution: query across 1 week
  – Hadoop: query across 1 month
![Page 4: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/4.jpg)
Amazon Redshift benefits realized
• Performance
  – Scan 2.25 trillion rows of data: 14 minutes
  – Load 5 billion rows of data: 10 minutes
  – Backfill 150 billion rows of data: 9.75 hours
  – Pig to Amazon Redshift: 2 days to 1 hr
    • 10B row join with 700M rows
  – Oracle to Amazon Redshift: 90 hours to 8 hrs
• Cost
  – 1.6 PB cluster
  – 100 8xl HDD nodes
  – $180/hr
• Complexity
  – 20% time of one DBA
    • Backup
    • Restore
    • Resizing
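The figures on this slide can be turned into rates and unit costs with a little arithmetic. The sketch below is only a back-of-the-envelope reading of the slide's numbers; it assumes 1 PB = 1000 TB and 2 days = 48 hours, and the variable names are illustrative.

```python
# Back-of-the-envelope arithmetic on the slide's figures.
# Assumptions (not stated on the slide): 1 PB = 1000 TB, 2 days = 48 hours.

def rows_per_second(rows, seconds):
    """Average throughput for a job that processed `rows` in `seconds`."""
    return rows / seconds

scan_rate = rows_per_second(2.25e12, 14 * 60)        # scan: 2.25 trillion rows in 14 minutes
load_rate = rows_per_second(5e9, 10 * 60)            # load: 5 billion rows in 10 minutes
backfill_rate = rows_per_second(150e9, 9.75 * 3600)  # backfill: 150 billion rows in 9.75 hours

pig_speedup = (2 * 24) / 1   # Pig: 2 days down to 1 hour
oracle_speedup = 90 / 8      # Oracle: 90 hours down to 8 hours

cluster_tb = 1600            # 1.6 PB cluster
nodes = 100
cost_per_hour = 180          # dollars per hour for the whole cluster

tb_per_node = cluster_tb / nodes
cost_per_tb_hour = cost_per_hour / cluster_tb
cost_per_tb_year = cost_per_tb_hour * 24 * 365

print(f"scan: {scan_rate:,.0f} rows/s, load: {load_rate:,.0f} rows/s")
print(f"speedups: Pig {pig_speedup:.0f}x, Oracle {oracle_speedup:.2f}x")
print(f"cost: ${cost_per_tb_year:,.2f} per TB per year")
```

Under those assumptions the scan works out to roughly 2.7 billion rows per second, and the cluster to about $985 per TB per year.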
![Page 5: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/5.jpg)
Expanding Amazon Redshift Functionality
![Page 6: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/6.jpg)
Scalar User-Defined Functions (UDF)
• Scalar UDFs using Python 2.7
  – Return single result value for each input value
  – Executed in parallel across cluster
  – Syntax largely identical to PostgreSQL
  – We reserve function names beginning with f_ for customers
• Pandas, NumPy, SciPy pre-installed
  – Do matrix operations, build optimization algorithms, and run statistical analyses
  – Build end-to-end modeling workflow
• Import your own libraries
CREATE FUNCTION f_function_name
( [ argument_name arg_type, ... ] )
RETURNS data_type
{ VOLATILE | STABLE | IMMUTABLE }
AS $$
python_program
$$ LANGUAGE plpythonu;
![Page 7: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/7.jpg)
Scalar UDF Security
• Run in restricted container that is fully isolated
  – Cannot make system and network calls
  – Cannot corrupt your cluster or negatively impact its performance
• Current limitations
  – Can’t access file system: functions that write files won’t work
  – Don’t yet cache stable and immutable functions
  – Slower than built-in functions compiled to machine code
    • Haven’t fully optimized some cases, including nested functions
![Page 8: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/8.jpg)
Scalar UDF example - URL parsing
Rather than using complex regular expressions (e.g. to extract a host name from a URL)…

    SELECT REGEXP_REPLACE(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3') FROM table;

…you can use a built-in Python URL parsing library directly in your SQL

    CREATE FUNCTION f_hostname (url varchar)
    RETURNS varchar
    IMMUTABLE AS $$
        import urlparse
        return urlparse.urlparse(url).hostname
    $$ LANGUAGE plpythonu;

    SELECT f_hostname(url) FROM table;
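The two approaches on this slide can be compared outside the database. The following is a minimal local sketch, not part of the original deck: it uses Python 3's `urllib.parse` (the successor to the Python 2.7 `urlparse` module used in the UDF), and the helper names `hostname_regex` and `hostname_parsed` are made up for illustration.

```python
import re
from urllib.parse import urlparse  # Python 3 location of the Python 2.7 `urlparse` module

# The regular expression from the slide, applied with Python's re module.
HOST_RE = r'(https?)://([^@]*@)?([^:/]*)([/:].*|$)'

def hostname_regex(url):
    """Extract the host by replacing the whole URL with capture group 3."""
    return re.sub(HOST_RE, r'\3', url)

def hostname_parsed(url):
    """Extract the host with the standard URL parser, as the UDF body does."""
    return urlparse(url).hostname

for sample in ("https://user@example.com:8080/path?x=1",
               "http://example.com/index.html"):
    print(sample, hostname_regex(sample), hostname_parsed(sample))
```

Both helpers agree on these simple inputs; the parser version is shorter to read and does not depend on getting the capture groups right.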
![Page 9: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/9.jpg)
Scalar UDF example – Distance

    CREATE FUNCTION f_distance (orig_lat float, orig_long float, dest_lat float, dest_long float)
    RETURNS float
    STABLE AS $$
        import math
        r = 3963.1676  # earth's radius, in miles
        phi_orig = math.radians(orig_lat)
        phi_dest = math.radians(dest_lat)
        delta_lat = math.radians(dest_lat - orig_lat)
        delta_long = math.radians(dest_long - orig_long)
        a = math.sin(delta_lat/2) * math.sin(delta_lat/2) + math.cos(phi_orig) \
            * math.cos(phi_dest) * math.sin(delta_long/2) * math.sin(delta_long/2)
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
        d = r * c
        return d
    $$ LANGUAGE plpythonu;

Calculate approx distance in miles between origin and destination
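The UDF body is the haversine formula, so it can be sanity-checked as plain Python before it is registered in the cluster. A minimal sketch, assuming the same earth radius as the slide; the constant name `EARTH_RADIUS_MILES` and the test coordinates are illustrative and not from the deck.

```python
import math

EARTH_RADIUS_MILES = 3963.1676  # same radius as the slide's UDF

def f_distance(orig_lat, orig_long, dest_lat, dest_long):
    """Great-circle distance in miles between two (lat, long) points, in degrees."""
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    # Haversine term: squared half-chord length between the two points.
    a = (math.sin(delta_lat / 2) ** 2
         + math.cos(phi_orig) * math.cos(phi_dest) * math.sin(delta_long / 2) ** 2)
    # Angular distance in radians.
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_MILES * c

# One degree of longitude along the equator should equal radius * (1 degree in radians).
one_degree = f_distance(0.0, 0.0, 0.0, 1.0)
print(round(one_degree, 2), "miles per degree of longitude at the equator")
```

Checks like zero distance for identical points and symmetry between origin and destination are cheap to run locally and catch sign or unit mistakes early.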
![Page 10: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/10.jpg)
Redshift GitHub UDF Repository

| Script | Purpose |
| --- | --- |
| f_encryption.sql | Uses pyaes library to encrypt/decrypt strings using passphrase |
| f_next_business_day.sql | Uses pandas library to return dates which are US Federal Holiday aware |
| f_null_syns.sql | Uses python sets to match strings, similar to a SQL IN condition |
| f_parse_url_query_string.sql | Uses urlparse to parse the field-value pairs from a url query string |
| f_parse_xml.sql | Uses xml.etree.ElementTree to parse XML |
| f_unixts_to_timestamp.sql | Uses pandas library to convert a unix timestamp to UTC datetime |

github.com/awslabs/amazon-redshift-udfs
![Page 11: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/11.jpg)
Amazon Kinesis Firehose to Amazon Redshift
Load massive volumes of streaming data into Amazon Redshift
• Zero administration: Capture and deliver streaming data into Redshift without writing an application
• Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery
• Seamless elasticity: Scales to match data throughput without intervention
Capture and submit streaming data to Firehose
Firehose loads streaming data continuously into S3 and Redshift
Analyze streaming data using Chartio
![Page 12: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/12.jpg)
Amazon Kinesis Firehose to Amazon Redshift
• Uses your S3 bucket as an intermediate destination
• S3 bucket has ‘manifests’ folder
  – Holds manifest of files to be copied
• Issues COPY command synchronously
• Single delivery stream loads into a single Redshift cluster, database, and table
• Continuously issues COPY once previous one is finished
• Frequency of COPYs determined by how fast your cluster can load files
• No partial loads. If a single record fails, whole file or batch fails
• Info on skipped files delivered to S3 bucket as manifest in errors folder
• If it cannot reach the cluster, retries every 5 min for 60 min and then moves on to next batch of objects
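The retry behaviour described on this slide (one COPY at a time, retry every 5 minutes for 60 minutes, then skip to the next batch) can be pictured with a small simulation. This is only an illustrative model of the stated behaviour, not Firehose's implementation: the `deliver` function and the `try_copy` callback are hypothetical, time is simulated rather than slept, and COPY duration is ignored.

```python
RETRY_INTERVAL_MIN = 5   # from the slide: retry every 5 minutes
RETRY_WINDOW_MIN = 60    # from the slide: give up on the batch after 60 minutes

def deliver(batches, try_copy):
    """Simulate sequential COPY delivery with a simulated clock in minutes.

    try_copy(batch, minute) is a hypothetical callback that returns True when
    the cluster was reachable and the COPY for that batch succeeded.
    Returns (loaded, skipped, elapsed_minutes).
    """
    loaded, skipped, clock = [], [], 0
    for batch in batches:          # the next COPY starts only after the previous one ends
        start = clock
        while True:
            if try_copy(batch, clock):
                loaded.append(batch)
                break
            if clock - start >= RETRY_WINDOW_MIN:
                skipped.append(batch)  # per the slide, skipped files are reported in the errors folder
                break
            clock += RETRY_INTERVAL_MIN
    return loaded, skipped, clock

# Cluster unreachable for the first 15 simulated minutes, then healthy.
print(deliver(["batch-1", "batch-2"], lambda batch, minute: minute >= 15))
```

Because delivery is sequential, a batch that keeps failing delays every batch behind it for up to the full retry window.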
![Page 13: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/13.jpg)
Multi-Column Sort
• Compound sort keys
  – Filter data by one leading column
• Interleaved sort keys
  – Filter data by up to eight columns
  – No storage overhead, unlike an index or projection
  – Lower maintenance penalty
![Page 14: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/14.jpg)
Compound sort keys illustrated
• Four records fill a block, sorted by customer
• Records with a given customer are all in one block.
• Records with a given product are spread across four blocks.
[Figure: 4 × 4 grid of [cust_id, prod_id] pairs (cust_id 1 to 4 on one axis, prod_id 1 to 4 on the other), shown next to a table with columns cust_id, prod_id, and other columns, divided into blocks]
![Page 15: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/15.jpg)
Interleaved sort keys illustrated
• Records with a given customer are spread across two blocks.
• Records with a given product are also spread across two blocks.
• Both keys are given equal weight.
[Figure: 4 × 4 grid of [cust_id, prod_id] pairs (cust_id 1 to 4 on one axis, prod_id 1 to 4 on the other), shown next to a table with columns cust_id, prod_id, and other columns, divided into blocks]
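The block counts quoted on the two "illustrated" slides can be reproduced with a toy model: 16 records covering every (cust_id, prod_id) pair from 1 to 4, four records per block. The sketch below uses a Z-order (bit-interleaving) key as a stand-in for the interleaved layout; that choice is an assumption made for illustration, since the slides do not describe the internal ordering, and all function names are hypothetical.

```python
from itertools import product

BLOCK_SIZE = 4  # four records fill a block, as on the slides

def to_blocks(records, key):
    """Sort records by `key` and cut the sorted run into fixed-size blocks."""
    ordered = sorted(records, key=key)
    return [ordered[i:i + BLOCK_SIZE] for i in range(0, len(ordered), BLOCK_SIZE)]

def z_order(record):
    """Interleave the bits of (cust_id - 1) and (prod_id - 1) into one sort key."""
    cust, prod = record[0] - 1, record[1] - 1
    z = 0
    for bit in range(2):  # two bits are enough for the values 0..3
        z |= ((cust >> bit) & 1) << (2 * bit + 1)
        z |= ((prod >> bit) & 1) << (2 * bit)
    return z

def blocks_touched(blocks, column, value):
    """Count the blocks that contain at least one record with record[column] == value."""
    return sum(any(rec[column] == value for rec in block) for block in blocks)

records = list(product(range(1, 5), repeat=2))             # (cust_id, prod_id) pairs
compound = to_blocks(records, key=lambda r: (r[0], r[1]))  # cust_id leads, then prod_id
interleaved = to_blocks(records, key=z_order)

print("compound:    cust_id=1 ->", blocks_touched(compound, 0, 1),
      "block(s), prod_id=1 ->", blocks_touched(compound, 1, 1), "block(s)")
print("interleaved: cust_id=1 ->", blocks_touched(interleaved, 0, 1),
      "block(s), prod_id=1 ->", blocks_touched(interleaved, 1, 1), "block(s)")
```

In this model a filter on the leading column reads one block under the compound key and two under the interleaved key, while a filter on the second column reads four blocks versus two, which is the trade-off the slides describe.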
![Page 16: Redshift Chartio Event Presentation](https://reader035.fdocuments.in/reader035/viewer/2022062516/587a5ea21a28ab520b8b727d/html5/thumbnails/16.jpg)
Interleaved Sort Key Considerations
• Vacuum time can increase by 10-50% for interleaved sort keys vs. compound keys
• If data increases monotonically, such as dates, interleaved sort order will skew over time
  – You’ll need to run a vacuum operation to re-analyze the distribution and re-sort the data.
• Queries filtering on the leading sort column run faster using compound sort keys vs. interleaved