Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012


Session presented at Big Data Spain 2012 Conference 16th Nov 2012 ETSI Telecomunicacion UPM Madrid www.bigdataspain.org More info: http://www.bigdataspain.org/es-2012/conference/crunching-data-with-google-bigquery/jordan-tigani


Crunching Data with BigQuery Fast analysis of Big Data

Jordan Tigani, Software Engineer

[Slide: a wall of binary digits, ASCII-encoded text setting up the "Forty-two" punchline on the next slide]

Big Data at Google

72 hours

100 million gigabytes

SELECT
  kick_ass_product_plan AS strategy,
  AVG(kicking_factor) AS awesomeness
FROM lots_of_data
GROUP BY strategy

+-------------+-------------+
| strategy    | awesomeness |
+-------------+-------------+
| "Forty-two" |  1000000.01 |
+-------------+-------------+

1 row in result set (10.2 s)
Scanned 100GB

Regular expressions on 13 billion rows...

13 billion rows, 1 TB of data in 4 tables. FAST!

Google's Internal Technology:

Dremel

MapReduce is Flexible but Heavy

• Master constructs the plan and begins spinning up workers
• Mappers read from and write to distributed storage
• Map => Shuffle => Reduce
• Reducers read from and write to distributed storage

[Diagram: Master coordinating Mappers and Reducers over Distributed Storage; shown twice, with Stage 1's reducer output feeding Stage 2's mappers through distributed storage]

Dremel vs MapReduce

• MapReduce

o Flexible batch processing

o High overall throughput

o High latency

• Dremel

o Optimized for interactive SQL queries

o Very low latency

Dremel Architecture

[Diagram: serving tree, Mixer 0 over two Mixer 1 nodes over Leaf nodes, reading from Distributed Storage]

• Columnar storage
• Long-lived shared serving tree
• Partial reduction
• Diskless data flow

Simple Query

SELECT
  state, COUNT(*) count_babies
FROM [publicdata:samples.natality]
WHERE
  year >= 1980 AND year < 1990
GROUP BY state
ORDER BY count_babies DESC
LIMIT 10

[Diagram: the serving tree executing the query. Leaves scan state, year from distributed storage, apply WHERE year >= 1980 AND year < 1990 over O(~140M rows), and compute partial COUNT(*) GROUP BY state, O(50 states) each. Mixer 1 nodes merge the partial COUNT(*) GROUP BY state results, still O(50 states); Mixer 0 applies ORDER BY count_babies DESC and LIMIT 10.]
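The tree execution sketched on this slide can be simulated in a few lines of Python: each leaf computes a partial COUNT(*) GROUP BY over its shard, and mixers merge the partials before sorting and limiting. The shard rows and helper names below are illustrative, not BigQuery internals.

```python
from collections import Counter

# Illustrative shard data: (state, year) rows as each leaf would scan them.
shards = [
    [("CA", 1985), ("TX", 1982), ("CA", 1979)],
    [("NY", 1984), ("CA", 1988), ("TX", 1991)],
]

def leaf_scan(rows):
    """Leaf: apply the WHERE filter, then partial COUNT(*) GROUP BY state."""
    return Counter(state for state, year in rows if 1980 <= year < 1990)

def mixer_merge(partials):
    """Mixer: merge partial aggregates coming up from child nodes."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

merged = mixer_merge(leaf_scan(s) for s in shards)
# Root mixer: ORDER BY count DESC, LIMIT 10.
top10 = sorted(merged.items(), key=lambda kv: -kv[1])[:10]
print(top10)  # → [('CA', 2), ('TX', 1), ('NY', 1)]
```

The year predicate and partial aggregation both run at the leaves, so only tiny O(states) partials ever travel up the tree.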

Modeling Data

Example: Daily Weather Station Data

weather_station_data

station  lat    long    mean_temp  humidity  timestamp   year  month  day
9384     33.57  86.75   89.3       .35       1351005129  2011  04     19
2857     36.77  119.72  78.5       .24       1351005135  2011  04     19
3475     40.77  73.98   68         .35       1351015930  2011  04     19
etc...

Example: Daily Weather Station Data

station, lat, long, mean_temp, year, mon, day
999999, 36.624, -116.023, 63.6, 2009, 10, 9
911904, 20.963, -156.675, 83.4, 2009, 10, 9
916890, -18.133, 178.433, 76.9, 2009, 10, 9
943320, -20.678, 139.488, 73.8, 2009, 10, 9

CSV

Organizing BigQuery Tables

[Diagram: your source data loaded into one table per day: October 22, October 23, October 24]

Modeling Event Data: Social Music Store

logs.oct_24_2012_song_activities

USERNAME  ACTIVITY  COST  SONG           ARTIST      TIMESTAMP
Michael   LISTEN          Too Close      Alex Clare  1351065562
Michael   LISTEN          Gangnam Style  PSY         1351105150
Jim       LISTEN          Complications  Deadmau5    1351075720
Michael   PURCHASE  0.99  Gangnam Style  PSY         1351115962


Users Who Listened to More than 10 Songs/Day

SELECT
  UserId, COUNT(*) AS ListenActivities
FROM
  [logs.oct_24_2012_song_activities]
GROUP EACH BY
  UserId
HAVING
  ListenActivities > 10

How Many Songs Were Listened to in Total by Listeners of PSY?

SELECT
  UserId, COUNT(*) AS ListenActivities
FROM
  [logs.oct_24_2012_song_activities]
WHERE UserId IN (
  SELECT
    UserId
  FROM
    [logs.oct_24_2012_song_activities]
  WHERE artist = 'PSY')
GROUP EACH BY UserId
HAVING
  ListenActivities > 10
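As a rough illustration (not BigQuery itself), the GROUP EACH BY ... HAVING pattern above can be mimicked in Python over the four sample rows; the threshold is lowered from 10 to 1 so the tiny sample actually returns a row.

```python
from collections import Counter

# Rows from the sample log table above: (username, activity, song, artist).
rows = [
    ("Michael", "LISTEN", "Too Close", "Alex Clare"),
    ("Michael", "LISTEN", "Gangnam Style", "PSY"),
    ("Jim", "LISTEN", "Complications", "Deadmau5"),
    ("Michael", "PURCHASE", "Gangnam Style", "PSY"),
]

# COUNT(*) GROUP BY user over LISTEN rows, then HAVING count > threshold
# (the slide uses 10; 1 here so the four sample rows produce output).
listens = Counter(u for u, activity, _, _ in rows if activity == "LISTEN")
heavy_listeners = {u: n for u, n in listens.items() if n > 1}
print(heavy_listeners)  # → {'Michael': 2}
```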

Modeling Event Data: Nested and Repeated Values

{"UserID": "Michael",
 "Listens": [
   {"TrackId":1234, "Title":"Gangnam Style",
    "Artist":"PSY", "Timestamp":1351075700},
   {"TrackId":1234, "Title":"Too Close",
    "Artist":"Alex Clare", "Timestamp":1351075700}
 ],
 "Purchases": [
   {"Track":2345, "Title":"Gangnam Style",
    "Artist":"PSY", "Timestamp":1351075700, "Cost":0.99}
 ]}

JSON

Which Users Have Listened to Beyonce?

SELECT
  UserID,
  COUNT(ListenActivities.artist) WITHIN RECORD
    AS song_count
FROM
  [logs.oct_24_2012_songactivities]
WHERE
  UserID IN (SELECT UserID
    FROM [logs.oct_24_2012_songactivities]
    WHERE ListenActivities.artist = 'Beyonce');

What Position Are PSY Songs in Our Users' Daily Playlists?

SELECT
  UserID,
  POSITION(ListenActivities.artist)
FROM
  [sample_music_logs.oct_24_2012_songactivities]
WHERE
  ListenActivities.artist = 'PSY';

Average Position of Songs by PSY in All Daily Playlists?

SELECT
  AVG(POSITION(ListenActivities.artist))
FROM
  [sample_music_logs.oct_24_2012_songactivities],
  [sample_music_logs.oct_23_2012_songactivities],
  /* etc... */
WHERE
  ListenActivities.artist = 'PSY';

Summary: Choosing a BigQuery Data Model

• "Shard" your data using multiple tables
• Source data files
  o CSV format
  o Newline-delimited JSON
• Using nested and repeated records
  o Simplifies some types of queries
  o Often matches document database models

Developing with BigQuery

[Flow: upload your data to Google Cloud Storage, then load it into BigQuery]

POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs

"jobReference":{
  "projectId":"605902584318"},
"configuration":{
  "load":{
    "destinationTable":{
      "projectId":"605902584318",
      "datasetId":"my_dataset",
      "tableId":"widget_sales"},
    "sourceUris":[
      "gs://widget-sales-data/2012080100.csv"],
    "schema":{
      "fields":[{
        "name":"widget",
        "type":"string"},
        ...
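For illustration, the same load-job body can be assembled as a Python dict and serialized before POSTing; the project, dataset, and bucket names are the slide's own examples, and only the single widget field from the truncated schema is shown.

```python
import json

# The load-job request body from the slide, built as a Python dict.
load_job = {
    "jobReference": {"projectId": "605902584318"},
    "configuration": {
        "load": {
            "destinationTable": {
                "projectId": "605902584318",
                "datasetId": "my_dataset",
                "tableId": "widget_sales",
            },
            "sourceUris": ["gs://widget-sales-data/2012080100.csv"],
            "schema": {"fields": [{"name": "widget", "type": "string"}]},
        }
    },
}

# This JSON string is what gets POSTed to
# https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
body = json.dumps(load_job)
print(body[:30])
```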

Query Away!

POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs

"jobReference":{
  "projectId":"605902584318",
  "query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count
           FROM widget_sales",
  "maxResults":100,
  "apiVersion":"v2"
}

Libraries

• Python

• Java

• .NET

• Ruby

• JavaScript

• Go

• PHP

• Objective-C

Libraries - Example JavaScript Query

var request = gapi.client.bigquery.jobs.query({
  'projectId': project_id,
  'timeoutMs': '30000',
  'query': 'SELECT state, AVG(mother_age) AS theav ' +
           'FROM [publicdata:samples.natality] ' +
           'WHERE year=2000 AND ever_born=1 ' +
           'GROUP BY state ' +
           'ORDER BY theav DESC;'
});

request.execute(function(response) {
  console.log(response);
  $.each(response.result.rows, function(i, item) {
    ...

Custom Code and the Google Chart Tools API

Google Spreadsheets

Commercial Visualization Tools

Demo: Using BigQuery on BigQuery

• Full table scans FAST

• Aggregate Queries on Massive Datasets

• Supports Flat and Nested/Repeated Data Models

• It's an API

BigQuery - Aggregate Big Data Analysis in Seconds

Get started now:

http://developers.google.com/bigquery/

SELECT questions FROM audience

SELECT 'Thank You!'

FROM jordan

http://developers.google.com/bigquery

Schema definition

birth_record:
  parent_id_mother, parent_id_father, plurality, is_male, race, weight

parents:
  id, race, age, cigarette_use, state

Schema definition

birth_record:
  mother_race, mother_age, mother_cigarette_use, mother_state,
  father_race, father_age, father_cigarette_use, father_state,
  plurality, is_male, race, weight

Tools to prepare your data

• App Engine MapReduce
• Commercial ETL tools
  o Pervasive
  o Informatica
  o Talend
• UNIX command-line

Schema definition - sharding

birth_record_2011, birth_record_2012, birth_record_2013,
birth_record_2014, birth_record_2015, birth_record_2016

Each yearly shard has the same columns:
  mother_race, mother_age, mother_cigarette_use, mother_state,
  father_race, father_age, father_cigarette_use, father_state,
  plurality, is_male, race, weight

Visualizing your Data

BigQuery architecture

“ If you do a table scan over a 1TB table,

you're going to have a bad time. ”

Anonymous

16th century Italian Philosopher-Monk

• Reading 1 TB/second from disk: 10k+ disks
• Processing 1 TB/second: 5k processors

Goal: perform a 1 TB table scan in 1 second

Parallelize, parallelize, parallelize!

Data access: Column Store

Record Oriented Storage Column Oriented Storage
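A toy sketch of why column-oriented storage helps scans: a query that touches only mean_temp reads one contiguous array instead of pulling whole records. Plain Python lists stand in for the storage layout, reusing values from the weather rows earlier.

```python
# Row-oriented: each record is stored together, so any scan
# reads every field of every record.
rows = [
    {"station": 9384, "mean_temp": 89.3, "humidity": 0.35},
    {"station": 2857, "mean_temp": 78.5, "humidity": 0.24},
]

# Column-oriented: one array per column; a query touching only
# mean_temp reads a single contiguous array and skips the rest.
columns = {
    "station": [9384, 2857],
    "mean_temp": [89.3, 78.5],
    "humidity": [0.35, 0.24],
}

avg_temp = sum(columns["mean_temp"]) / len(columns["mean_temp"])
print(round(avg_temp, 1))  # → 83.9
```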

BigQuery Architecture

[Diagram: Mixer 0 at the root; three Mixer 1 nodes covering shards 0-8, 9-16, and 17-24; leaf shards (Shard 0, Shard 10, Shard 12, Shard 20, Shard 24 shown) reading from Distributed Storage (e.g. GFS)]

Running your Queries

BigQuery SQL Example: Simple aggregates

SELECT COUNT(foo), MAX(foo), STDDEV(foo)
FROM ...

BigQuery SQL Example: Complex Processing

SELECT ... FROM ...
WHERE REGEXP_MATCH(url, "\.com$")
  AND user CONTAINS 'test'

BigQuery SQL Example: Nested SELECT

SELECT COUNT(*) FROM
  (SELECT foo ..... )
GROUP BY foo

BigQuery SQL Example: Small JOIN

SELECT huge_table.foo

FROM huge_table

JOIN small_table

ON small_table.foo = huge_table.foo
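One common execution strategy for a small JOIN like this is a broadcast join: assuming (as the "small" qualifier suggests) that small_table fits in memory, each shard can keep a full copy of it and probe that copy while streaming its own slice of huge_table. A minimal Python sketch with made-up values:

```python
# Broadcast join sketch: the small table is held in memory on every
# shard; the huge table is streamed shard by shard.
small_table = {"widget", "gadget"}                     # small_table.foo values
huge_table_shard = ["widget", "doohickey", "gadget", "widget"]

# Equivalent of: SELECT huge_table.foo FROM huge_table
#                JOIN small_table ON small_table.foo = huge_table.foo
joined = [foo for foo in huge_table_shard if foo in small_table]
print(joined)  # → ['widget', 'gadget', 'widget']
```

Because every shard has the whole small table, no row of the huge table ever has to move between shards.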

BigQuery Architecture: Small Join

[Diagram: Mixer 0 over Mixer 1 nodes (shards 0-8, 17-24) over leaf shards (Shard 0, Shard 20, Shard 24 shown), reading from Distributed Storage (e.g. GFS)]

Other new features!

Batch queries!

• Don't need interactive queries for some jobs?

• priority: "BATCH"

• API

• Column-based datastore

• Full table scans FAST

• Aggregates

• Commercial tool support

• Use cases

That's it

SELECT questions FROM audience

SELECT 'Thank You!'

FROM ryan

http://developers.google.com/bigquery

@ryguyrg http://profiles.google.com/ryan.boyd


A Little Later ...

Row  wp_namespace  Revs
1    0             53697002
2    1             6151228
3    3             5519859
4    4             4184389
5    2             3108562
6    10            1052044
7    6             877417
8    14            838940
9    5             651749
10   11            192534
11   100           148135

Underlying table:

• Wikipedia page revision records

• Rows: 314 million

• Byte size: 35.7 GB

Query Stats:

• Scanned 7 GB of data
• < 5 seconds
• ~100M rows scanned / second

[Diagram: the serving tree executing the query. Leaves scan wp_namespace, revision_id from distributed storage at 10 GB/s, apply WHERE timestamp > CUTOFF, and compute partial COUNT(revision_id) GROUP BY wp_namespace. Mixer 1 nodes merge the partial counts; Mixer 0 applies ORDER BY Revs DESC.]

"Multi-stage" Query

SELECT
  LogEdits, COUNT(contributor_id) Contributors
FROM (
  SELECT
    contributor_id,
    INTEGER(LOG10(COUNT(*))) LogEdits
  FROM [publicdata:samples.wikipedia]
  GROUP EACH BY contributor_id)
GROUP BY LogEdits
ORDER BY LogEdits DESC
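The two stages can be simulated locally: the inner stage buckets each contributor by INTEGER(LOG10(COUNT(*))), and the outer stage counts contributors per bucket. The revision log below is made up for the sketch.

```python
import math
from collections import Counter

# Hypothetical revision log: one contributor_id per revision.
revisions = ["alice"] * 120 + ["bob"] * 15 + ["carol"] * 3 + ["dave"] * 1

# Inner stage: INTEGER(LOG10(COUNT(*))) per contributor.
edits = Counter(revisions)
log_edits = {user: int(math.log10(n)) for user, n in edits.items()}

# Outer stage: COUNT(contributor_id) GROUP BY LogEdits, ORDER BY LogEdits DESC.
histogram = Counter(log_edits.values())
print(sorted(histogram.items(), reverse=True))  # → [(2, 1), (1, 1), (0, 2)]
```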

[Diagram: the shuffled execution. Leaves scan contributor_id from distributed storage; Shufflers repartition rows by contributor_id; each partition computes COUNT(*) GROUP BY contributor_id, then COUNT(contributor_id) GROUP BY LogEdits; Mixer 1 nodes merge the partial counts and Mixer 0 applies ORDER BY LogEdits DESC.]

When to use EACH

• Shuffle definitely adds some overhead

• Poor query performance if used incorrectly

• GROUP BY
  o Groups << Rows => Unbalanced load
  o Example: GROUP BY state
• GROUP EACH BY
  o Groups ~ Rows
  o Example: GROUP EACH BY user_id
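The shuffle that EACH introduces can be sketched as hash partitioning: every row with the same key lands on the same worker, so each worker aggregates a disjoint set of groups and no cross-worker merge of overlapping groups is needed afterwards. The rows and worker count below are illustrative.

```python
from collections import Counter

rows = ["u1", "u2", "u3", "u1", "u4", "u2", "u1"]  # one user_id per event
NUM_WORKERS = 2

# Shuffle: hash-partition rows by key so every occurrence of a key
# lands on the same worker.
partitions = [[] for _ in range(NUM_WORKERS)]
for key in rows:
    partitions[hash(key) % NUM_WORKERS].append(key)

# Each worker runs GROUP BY over its partition; the groups are disjoint,
# so the final result is just the union of the per-worker counters.
per_worker = [Counter(p) for p in partitions]
combined = {k: v for c in per_worker for k, v in c.items()}
print(combined["u1"])  # → 3
```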