Download - (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Transcript
Page 1: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

November 13, 2014 | Las Vegas, NV

Timon Karnezos, Director Infrastructure, Neustar

Vidhya Srinivasan, Sr. Manager Software Development, Amazon Redshift

Page 2: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Petabyte scale

Massively parallel

Relational data warehouse

Fully managed; zero admin

Page 3: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

10 GigE

(HPC)

IngestionBackupRestore

JDBC/ODBC

Page 4: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Ad Tech

Use Cases

Page 5: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 6: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 7: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

692.8s

34.9s

< 0.76%

Page 8: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 9: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

00 01 10 11

00

01

10

11

P

Space filling Curve for Two Dimensions

Page 10: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 11: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 12: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 13: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 14: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 15: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Frequency

Page 16: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 17: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Attribution

Page 18: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 19: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Overlap

Page 20: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 21: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Ad-hoc

Page 22: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 23: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 24: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 25: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

0.7B / day

2B / week

8B / month

21B / quarter

Page 26: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

-- Number of ads seen per user

WITH frequency_intermediate AS (

SELECT user_id ,

SUM(1) AS impression_count,

SUM(cost) AS cost ,

SUM(revenue) AS revenue

FROM impressions

WHERE record_date BETWEEN <...>

GROUP BY 1

)

-- Number of people who saw N ads

SELECT impression_count, SUM(1), SUM(cost), SUM(revenue)

FROM frequency_intermediate

GROUP BY 1;

Page 27: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 28: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 29: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

CREATE TABLE (

record_date date ENCODE NOT NULL ,

campaign_id bigint ENCODE NOT NULL ,

site_id bigint ENCODE NOT NULL ,

user_id bigint ENCODE NOT NULL DISTKEY,

impression_count int ENCODE NOT NULL ,

cost bigint ENCODE NOT NULL ,

revenue bigint ENCODE NOT NULL

) SORTKEY( , , , );

Page 30: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

WITH user_frequency AS (

SELECT user_id, campaign_id, site_id,

SUM(impression_count) AS frequency,

SUM(cost) AS cost ,

SUM(revenue) AS revenue

FROM frequency_intermediate

WHERE record_date BETWEEN <...>

GROUP BY 1,2,3

)

SELECT campaign_id, site_id, frequency,

SUM(1), SUM(cost), SUM(revenue)

FROM user_frequency

GROUP BY 1,2,3;

Page 31: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 32: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 33: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 34: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 35: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 36: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

-- Basic sessionization query, assemble user activity

-- that ended in a conversion into a timeline.

SELECT <...>

FROM impressions i

JOIN conversions c ON

i.user_id = c.user_id AND

i.record_date < c.record_date

ORDER BY i.record_date;

Page 37: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Position: 1

Position: 2

Position: 3

Page 38: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Hour offset: 3

Position: 1

Position: 2

Hour offset: 12

Hour offset: 16

Position: 3

Page 39: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

-- Sessionize user activity per conversion, partition by campaign (45-day lookback window)

SELECT c.record_date AS conversion_date ,

c.event_id AS conversion_id ,

i.campaign_id AS campaign_id ,

i.site_id AS site_id ,

i.user_id AS user_id ,

c.revenue AS conversion_revenue,

DATEDIFF('hour', i.record_date, c.record_date) AS hour_offset,

SUM(1) OVER (PARTITION BY i.user_id, i.campaign_id, c.event_id

ORDER BY i.record_date DESC ROWS UNBOUNDED PRECEDING) AS position

FROM impressions i

JOIN conversions c ON

i.user_id = c.user_id AND

i.campaign_id = c.campaign_id AND

i.record_date < c.record_date AND

i.record_date > (c.record_date - interval '45 days') AND

c.record_date BETWEEN <...>;

Page 40: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

-- Compute statistics on sessions (funnel placement, last-touch, site-count, etc...)

SELECT campaign_id ,

site_id ,

conversion_date,

AVG(position) AS average_position,

SUM(conversion_revenue * (position = 1)::int) AS lta_attributed ,

AVG(COUNT(DISTINCT site_id)

OVER (PARTITION BY i.user_id, i.campaign_id, c.event_id

ORDER BY i.record_date ASC

ROWS UNBOUNDED PRECEDING)) AS average_unique_preceding_site_count

FROM sessions

GROUP BY 1,2,3;

Page 41: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 42: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 43: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 44: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Site A Site B Site

C

Site A 20% 60%

Site B 90%

Site

C

CPM $0.06 $1.05 $9.50

Page 45: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Site A Site B Site

C

Site A 20% 60%

Site B 90%

Site

C

CPM $0.06 $1.05 $9.50

Page 46: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Site A Site B Site

C

Site A 20% 60%

Site B 90%

Site

C

CPM $0.06 $1.05 $9.50

Page 47: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

CREATE TABLE (

user_id bigint ENCODE NOT NULL DISTKEY,

site_id bigint ENCODE NOT NULL

) SORTKEY ( );

Page 48: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

WITH co_occurences AS (

SELECT

oi.site_id AS site1 ,

oi2.site_id AS site2

FROM overlap_intermediate oi

JOIN overlap_intermediate oi2 ON

oi.site_id > oi2.site_id AND

oi.ak_user_id = oi2.ak_user_id

)

SELECT site1, site2, SUM(1)

FROM co_occurences

GROUP BY 1,2;

Page 49: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

CREATE TABLE (

record_date date ENCODE NOT NULL ,

campaign_id bigint ENCODE NOT NULL ,

site_id bigint ENCODE NOT NULL ,

user_id bigint ENCODE NOT NULL DISTKEY

) SORTKEY ( , );

Page 50: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

WITH

site_overlap_intermediate AS (

SELECT user_id, site_id, campaign_id

FROM overlap_intermediate WHERE record_date BETWEEN <...> GROUP BY 1,2,3

),

site_co_occurences AS (

SELECT oi.campaign_id AS c_id, oi.site_id AS site1, oi2.site_id AS site2

FROM site_overlap_intermediate oi

JOIN site_overlap_intermediate oi2 ON

oi.site_id > oi2.site_id AND

oi.ak_user_id = oi2.ak_user_id AND

oi.campaign_id = oi2.campaign_id

)

SELECT c_id, site1, site2, SUM(1) FROM site_co_occurences GROUP BY 1,2,3;

Page 51: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 52: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 53: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 54: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 55: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 56: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

8 fact tables

26 dimension tables

7 mapping tables

Page 57: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

42 views

121 joins

1100 sloc

Page 58: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

$ pg_dump –Fc some_file --table=foo --table=bar

$ pg_restore --schema-only --clean –Fc some_file > schema.sql

$ pg_restore --data-only --table=foo –Fc some_file > foo.tsv

$ aws s3 cp schema.sql s3://metadata-bucket/YYYYMMDD/schema.sql

$ aws s3 cp foo.tsv s3://metadata-bucket/YYYYMMDD/foo.tsv

> \i schema.sql

> COPY foo FROM ‘s3://metadata-bucket/YYYYMMDD/foo.tsv’ <...>

# or combine ‘COPY <..> FROM <...> SSH’ and pg_restore/psql

Page 59: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

UNLOAD

('

SELECT i.*

FROM impressions i

JOIN client_to_campaign_mapping m ON

m.campaign_id = i.campaign_id

WHERE i.record_date >= '{{yyyy}}-{{mm}}-{{dd}}' - interval \'1 day\' AND

i.record_date < '{{yyyy}}-{{mm}}-{{dd}}' AND

m.client_id = <...>

‘)

TO 's3://{{bucket}}/us_eastern/{{yyyy}}/{{mm}}/{{dd}}/dsdk_events/{{vers}}/impressions/'

WITH CREDENTIALS 'aws_access_key_id={{key}};aws_secret_access_key={{secret}}'

DELIMITER ',' NULL '\\N' ADDQUOTES ESCAPE GZIP MANIFEST;

Page 60: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 61: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 62: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 63: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 64: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Workload Node Count Node Type Restore Maint. Exec.

Frequency

& Attribution

& Overlap

& Ad Hoc

16 dw2.8xlarge 2h 1h 6h

= $691.20

Page 65: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Workload Node Count Node Type Restore Maint. Exec.

Frequency 8 dw2.8xlarge 1.5h 0.5h 2.5h

Attribution 8 dw2.8xlarge 1.5h 0.5h 2h

Overlap 8 dw2.8xlarge 1h 0.5h 2.5h

Ad-hoc 8 dw2.8xlarge 0h 0.5h 1.5h

= $556.80 (-19%)

Page 66: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014
Page 67: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

http://bit.ly/awsevals