Redshift Chartio Event Presentation

Amazon Redshift

Spend time with your data, not your database…

Data Warehouse Challenges

Cost

Complexity

Performance

Rigidity

[Chart, 1990 to 2020: Enterprise Data vs. Data in Warehouse]

Amazon Redshift powers Clickstream Analytics for Amazon.com

• Web log analysis for Amazon.com
– Petabyte workload
– Largest table: 400 TB

• Understand customer behavior
– Who is browsing but not buying
– Which products/features are winners
– What sequence led to higher customer conversion

• Solution
– Best scale-out solution: query across 1 week
– Hadoop: query across 1 month

Amazon Redshift benefits realized

• Performance
– Scan 2.25 trillion rows of data: 14 minutes
– Load 5 billion rows of data: 10 minutes
– Backfill 150 billion rows of data: 9.75 hours
– Pig → Amazon Redshift: 2 days to 1 hr
  • 10B row join with 700M rows
– Oracle → Amazon Redshift: 90 hours to 8 hrs

• Cost
– 1.6 PB cluster
– 100 8xl HDD nodes
– $180/hr

• Complexity
– 20% time of one DBA
  • Backup
  • Restore
  • Resizing
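The figures on this slide can be turned into rates with a little arithmetic. The sketch below assumes that $180/hr is the price of the whole 100-node cluster, that 1 PB is 1,000 TB, and that a year is 8,760 hours; none of these assumptions is stated on the slide, and the variable names are illustrative.

```python
# Figures quoted on the slide.
CLUSTER_COST_PER_HOUR = 180.0   # USD, assumed to cover the whole cluster
NODES = 100
CLUSTER_TB = 1.6 * 1000         # 1.6 PB, assuming 1 PB = 1000 TB
HOURS_PER_YEAR = 24 * 365       # assumption: non-leap year, always on

# Cost per node-hour and per terabyte-year.
cost_per_node_hour = CLUSTER_COST_PER_HOUR / NODES
cost_per_tb_year = CLUSTER_COST_PER_HOUR * HOURS_PER_YEAR / CLUSTER_TB

# Throughput implied by the scan and load timings.
scan_rows_per_second = 2.25e12 / (14 * 60)
load_rows_per_second = 5e9 / (10 * 60)

# Speedups implied by the migration timings.
pig_speedup = (2 * 24) / 1.0    # 2 days down to 1 hour
oracle_speedup = 90 / 8.0       # 90 hours down to 8 hours

print("USD per node-hour: %.2f" % cost_per_node_hour)
print("USD per TB-year: %.1f" % cost_per_tb_year)
print("Scan rows/second: %.3g" % scan_rows_per_second)
print("Load rows/second: %.3g" % load_rows_per_second)
print("Pig speedup: %.0fx, Oracle speedup: %.2fx" % (pig_speedup, oracle_speedup))
```

Under those assumptions the cluster works out to roughly $1.80 per node-hour and a little under $1,000 per terabyte per year.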

Expanding Amazon Redshift Functionality

Scalar User-Defined Functions (UDF)

• Scalar UDFs using Python 2.7
– Return single result value for each input value
– Executed in parallel across cluster
– Syntax largely identical to PostgreSQL
– We reserve the f_ function name prefix for customers

• Pandas, NumPy, SciPy pre-installed
– Do matrix operations, build optimization algorithms, and run statistical analyses
– Build end-to-end modeling workflow

• Import your own libraries

CREATE FUNCTION f_function_name
  ( [ argument_name arg_type, ... ] )
RETURNS data_type
{ VOLATILE | STABLE | IMMUTABLE }
AS $$
  python_program
$$ LANGUAGE plpythonu;
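The syntax template above can be filled in programmatically. The following is a minimal sketch of a hypothetical helper, build_create_function, which is not part of Redshift or any library. It renders the statement from its parts. Rejecting names without the f_ prefix is a choice made in this sketch to follow the naming convention on the slide; it is not a Redshift requirement.

```python
import textwrap

VOLATILITY = ("VOLATILE", "STABLE", "IMMUTABLE")


def build_create_function(name, args, returns, volatility, body):
    """Render a CREATE FUNCTION statement for a scalar Python UDF.

    args is a list of (argument_name, arg_type) pairs; body is the Python
    source that goes between the $$ delimiters.
    """
    # Sketch-level convention check, not something Redshift enforces.
    if not name.startswith("f_"):
        raise ValueError("use the f_ prefix reserved for customer functions")
    if volatility not in VOLATILITY:
        raise ValueError("volatility must be one of %s" % (VOLATILITY,))
    arg_list = ", ".join("%s %s" % pair for pair in args)
    # Normalize the body's indentation, then indent it for readability.
    indented = textwrap.indent(textwrap.dedent(body).strip(), "    ")
    return (
        "CREATE FUNCTION %s (%s)\n"
        "RETURNS %s\n"
        "%s AS $$\n"
        "%s\n"
        "$$ LANGUAGE plpythonu;" % (name, arg_list, returns, volatility, indented)
    )


if __name__ == "__main__":
    print(build_create_function(
        "f_hostname",
        [("url", "varchar")],
        "varchar",
        "IMMUTABLE",
        "import urlparse\nreturn urlparse.urlparse(url).hostname",
    ))
```

The helper only builds text; running the statement against a cluster is left to whatever SQL client is in use.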

Scalar UDF Security

• Run in restricted container that is fully isolated
– Cannot make system and network calls
– Cannot corrupt your cluster or negatively impact its performance

• Current limitations
– Can’t access file system - functions that write files won’t work
– Don’t yet cache stable and immutable functions
– Slower than built-in functions compiled to machine code
  • Haven’t fully optimized some cases, including nested functions

Scalar UDF example - URL parsing

CREATE FUNCTION f_hostname (url varchar)
RETURNS varchar
IMMUTABLE AS $$
    import urlparse
    return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;

SELECT f_hostname(url) FROM table;

Rather than using complex regular expressions (e.g. to extract a host name from a URL)…

SELECT REGEXP_REPLACE(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3') FROM table;

…you can use a built-in Python URL parsing library directly in your SQL
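The two approaches on this slide can be compared locally before creating the UDF. This sketch applies the same pattern with Python's re module, which is an assumption, since Redshift's regular expression engine may differ in details. It also uses the standard URL parser, importing it from urllib.parse on Python 3 and from urlparse on Python 2.7 as in the UDF. The function names are illustrative.

```python
import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7, as used inside the UDF

# Same pattern as the REGEXP_REPLACE call on the slide.
HOST_PATTERN = re.compile(r'(https?)://([^@]*@)?([^:/]*)([/:].*|$)')


def hostname_via_regex(url):
    """Replace the whole URL with capture group 3, the host part."""
    return HOST_PATTERN.sub(r'\3', url)


def hostname_via_urlparse(url):
    """Let the standard library split the URL and return the host."""
    return urlparse(url).hostname


if __name__ == "__main__":
    for url in ("https://user@example.com:8080/path?x=1",
                "http://example.com/a@b"):
        print(url, hostname_via_regex(url), hostname_via_urlparse(url))
```

The two functions agree on the first URL. On the second, the regular expression treats everything before the "@" in the path as user information and returns the wrong host, while the parser still returns example.com. This is one reason to prefer the library.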

Scalar UDF example – Distance

CREATE FUNCTION f_distance (orig_lat float, orig_long float, dest_lat float, dest_long float)
RETURNS float
STABLE AS $$
    import math
    r = 3963.1676  # earth's radius, in miles
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    a = math.sin(delta_lat/2) * math.sin(delta_lat/2) + math.cos(phi_orig) \
        * math.cos(phi_dest) * math.sin(delta_long/2) * math.sin(delta_long/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = r * c
    return d
$$ LANGUAGE plpythonu;

Calculate approx distance in miles between origin and destination
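To check the formula outside the cluster, the body of f_distance can be run as an ordinary Python function. The sketch below is a direct transcription of the haversine calculation above, with one addition that is not on the slide: the intermediate value a is clamped to the range 0 to 1 so that floating-point rounding cannot produce a negative square root.

```python
import math

EARTH_RADIUS_MILES = 3963.1676  # same constant as the UDF


def distance_miles(orig_lat, orig_long, dest_lat, dest_long):
    """Approximate great-circle distance in miles (haversine formula)."""
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    a = (math.sin(delta_lat / 2) ** 2
         + math.cos(phi_orig) * math.cos(phi_dest)
         * math.sin(delta_long / 2) ** 2)
    # Addition not in the slide: guard against rounding just outside [0, 1].
    a = min(1.0, max(0.0, a))
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_MILES * c


if __name__ == "__main__":
    # Two points on the equator, half the globe apart.
    print(distance_miles(0.0, 0.0, 0.0, 180.0))
```

Two easy sanity checks are that the distance from a point to itself is zero and that two antipodal points on the equator are half a circumference apart, that is, pi times the radius.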

Redshift Github UDF Repository

Script                        Purpose
f_encryption.sql              Uses pyaes library to encrypt/decrypt strings using passphrase
f_next_business_day.sql       Uses pandas library to return dates which are US Federal Holiday aware
f_null_syns.sql               Uses python sets to match strings, similar to a SQL IN condition
f_parse_url_query_string.sql  Uses urlparse to parse the field-value pairs from a url query string
f_parse_xml.sql               Uses xml.etree.ElementTree to parse XML
f_unixts_to_timestamp.sql     Uses pandas library to convert a unix timestamp to UTC datetime

github.com/awslabs/amazon-redshift-udfs
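As an illustration of the kind of logic such a script wraps, here is a minimal sketch of query-string parsing with the standard library. It is not the code from the repository; the function name query_param and its behavior of returning the first value, or None when the field is missing, are assumptions made for this example.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2.7


def query_param(url, name):
    """Return the first value of query field `name`, or None if absent."""
    fields = parse_qs(urlparse(url).query)  # maps field -> list of values
    values = fields.get(name)
    return values[0] if values else None


if __name__ == "__main__":
    print(query_param("http://example.com/search?q=redshift&page=2", "q"))
```

Inside a UDF, the same body would sit between the $$ delimiters, with url and name declared as varchar arguments.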

Amazon Kinesis Firehose to Amazon Redshift

Load massive volumes of streaming data into Amazon Redshift

• Zero administration: Capture and deliver streaming data into Redshift without writing an application

• Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery

• Seamless elasticity: Scales to match data throughput without intervention

Capture and submit streaming data to Firehose

Firehose loads streaming data continuously into S3 and Redshift

Analyze streaming data using Chartio

• Uses your S3 bucket as an intermediate destination
• S3 bucket has ‘manifests’ folder that holds the manifest of files to be copied
• Issues COPY command synchronously
• Single delivery stream loads into a single Redshift cluster, database, and table
• Continuously issues COPY once previous one is finished
• Frequency of COPYs determined by how fast your cluster can load files
• No partial loads. If a single record fails, whole file or batch fails
• Info on skipped files delivered to S3 bucket as manifest in errors folder
• If it cannot reach the cluster, retries every 5 min for 60 min and then moves on to the next batch of objects
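The retry behavior in the last bullet can be modeled as a simple loop. The sketch below simulates that policy and is not Firehose code. The function copy_with_retries, its parameters, and the use of ConnectionError to signal an unreachable cluster are all assumptions made for illustration. The sleep function is injectable so the loop can be exercised without waiting.

```python
import time


def copy_with_retries(issue_copy, retry_interval_s=300, retry_window_s=3600,
                      sleep=time.sleep):
    """Call issue_copy() until it succeeds or the retry window is used up.

    Returns True if the batch was loaded, False if it was skipped so the
    caller can move on to the next batch.
    """
    waited = 0
    while True:
        try:
            issue_copy()
            return True
        except ConnectionError:
            if waited >= retry_window_s:
                return False  # give up on this batch
            sleep(retry_interval_s)
            waited += retry_interval_s


if __name__ == "__main__":
    def unreachable_cluster():
        raise ConnectionError("cluster unreachable")

    pauses = []
    loaded = copy_with_retries(unreachable_cluster, sleep=pauses.append)
    print(loaded, len(pauses))
```

With a cluster that never responds, this model pauses twelve times at five-minute intervals before skipping the batch. Whether the real service counts its attempts in exactly this way is not stated in the slides.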


Multi-Column Sort

• Compound sort keys
– Filter data by one leading column

• Interleaved sort keys
– Filter data by up to eight columns
– No storage overhead, unlike an index or projection
– Lower maintenance penalty

Compound sort keys illustrated

• Four records fill a block, sorted by customer

• Records with a given customer are all in one block.

• Records with a given product are spread across four blocks.

[Figure: table with columns cust_id, prod_id, and other columns, stored in blocks; the grid shows the [cust_id, prod_id] pairs]

               prod_id
               1      2      3      4
cust_id   1  [1,1]  [1,2]  [1,3]  [1,4]
          2  [2,1]  [2,2]  [2,3]  [2,4]
          3  [3,1]  [3,2]  [3,3]  [3,4]
          4  [4,1]  [4,2]  [4,3]  [4,4]

Interleaved sort keys illustrated

• Records with a given customer are spread across two blocks.

• Records with a given product are also spread across two blocks.

• Both keys are equal.

[Figure: table with columns cust_id, prod_id, and other columns, stored in blocks]
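The block counts on the two illustration slides can be reproduced with a small simulation. This is a sketch under stated assumptions: the 16 records are every [cust_id, prod_id] pair with ids 1 to 4, four records fill a block, and the interleaved order is modeled with a Z-order (bit-interleaved) key. The Z-order key is an illustrative stand-in, not a statement about how Redshift lays out data internally, and the helper names are hypothetical.

```python
from itertools import product

BLOCK_SIZE = 4  # "Four records fill a block"


def z_order(cust_id, prod_id, bits=2):
    """Interleave the bits of the two ids so neither column dominates."""
    key = 0
    for bit in range(bits):
        key |= (((cust_id - 1) >> bit) & 1) << (2 * bit + 1)
        key |= (((prod_id - 1) >> bit) & 1) << (2 * bit)
    return key


def to_blocks(records, sort_key):
    """Sort the records, then cut them into fixed-size blocks."""
    ordered = sorted(records, key=sort_key)
    return [ordered[i:i + BLOCK_SIZE]
            for i in range(0, len(ordered), BLOCK_SIZE)]


def blocks_touched(blocks, column, value):
    """Count blocks containing at least one record with column == value."""
    return sum(1 for block in blocks
               if any(record[column] == value for record in block))


CUST, PROD = 0, 1
records = list(product(range(1, 5), range(1, 5)))  # (cust_id, prod_id)

compound = to_blocks(records, sort_key=lambda r: (r[CUST], r[PROD]))
interleaved = to_blocks(records, sort_key=lambda r: z_order(r[CUST], r[PROD]))

if __name__ == "__main__":
    for name, blocks in (("compound", compound), ("interleaved", interleaved)):
        print(name,
              "cust_id=1 blocks:", blocks_touched(blocks, CUST, 1),
              "prod_id=1 blocks:", blocks_touched(blocks, PROD, 1))
```

In this model a filter on cust_id reads one block under the compound key and a filter on prod_id reads all four. Under the interleaved ordering either filter reads two blocks, which matches the bullets on both slides.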

Interleaved Sort Key Considerations

• Vacuum time can increase by 10-50% for interleaved sort keys vs. compound keys
• If data increases monotonically, such as dates, interleaved sort order will skew over time
– You’ll need to run a vacuum operation to re-analyze the distribution and re-sort the data
• Queries that filter on the leading sort column run faster with compound sort keys than with interleaved sort keys

SAN FRANCISCO

Questions/Comments? Please contact us at [email protected]
