Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf
-
Upload
javier-ramirez -
Category
Software
-
view
492 -
download
0
description
Transcript of Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf
![Page 1: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/1.jpg)
by javier ramirez@supercoco9
Big Data Analyticswith
Google BigQuery
https://teowaki.com
https://datawaki.com
![Page 2: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/2.jpg)
![Page 3: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/3.jpg)
INPUT /
OUTPUT
Big Data's #1 Enemy
![Page 4: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/4.jpg)
Read one terabyte ofdata inone second
javier ramirez @supercoco9 https://teowaki.com
![Page 5: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/5.jpg)
data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures.
Ed Dumbill program chair for the O’Reilly Strata Conference
javier ramirez @supercoco9 https://teowaki.com
![Page 6: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/6.jpg)
bigdata is doing a fullscan to 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds
javier ramirez @supercoco9 https://teowaki.com
Javier Ramirezimpresionable teowaki founder
![Page 7: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/7.jpg)
javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013
REST API +
AngularJS web as an API client
![Page 8: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/8.jpg)
1. non intrusive metrics2. keep the history3. interactive queries4. cheap5. extra ball: real time
javier ramirez @supercoco9 https://teowaki.com
![Page 9: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/9.jpg)
javier ramirez @supercoco9 https://teowaki.com
![Page 10: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/10.jpg)
Apache HadoopApache Cassandra
Apache SparkApache Storm
Amazon Redshift
javier ramirez @supercoco9 https://teowaki.com
![Page 11: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/11.jpg)
bigdata is cool but...
expensive cluster
hard to set up and monitor
not interactive enough
![Page 12: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/12.jpg)
Our choice:
Google BigQuery
Data analysis as a service
http://developers.google.com/bigquery
javier ramirez @supercoco9 https://teowaki.com
![Page 13: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/13.jpg)
Based on Dremel
Specifically designed for interactive queries over petabytes of real-time data
javier ramirez @supercoco9 https://teowaki.com
![Page 14: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/14.jpg)
• Analysis of crawled web documents.• Tracking install data for applications on Android Market.• Crash reporting for Google products.• OCR results from Google Books.• Spam analysis.• Debugging of map tiles on Google Maps.• Tablet migrations in managed Bigtable instances.• Results of tests run on Google’s distributed build system.• Disk I/O statistics for hundreds of thousands of disks.• Resource monitoring for jobs run in Google’s data centers.• Symbols and dependencies in Google’s codebase.
What Dremel is used for in Google
![Page 15: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/15.jpg)
INDEXES
Data Scientists's#1 Enemy
![Page 16: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/16.jpg)
javier ramirez @supercoco9 https://teowaki.com
in BigQuery everything is a full-scan*
*Over a ridiculously fast distributed filesystem.Dremel design goal: 1TB/sec. It was exceeded
BigQuery delivers ~ 50Gb/Sec.
![Page 17: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/17.jpg)
Columnarstorage
javier ramirez @supercoco9 https://teowaki.com
![Page 18: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/18.jpg)
Colossus filesystemDistributed/redundantParallel readsUltra fast network
javier ramirez @supercoco9 https://teowaki.com
![Page 19: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/19.jpg)
highly distributed execution using a tree
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
![Page 20: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/20.jpg)
Getting started...
![Page 21: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/21.jpg)
create dataset and tables
![Page 22: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/22.jpg)
loading data
You can feed flat CSV-like files or nested JSON objects
javier ramirez @supercoco9 https://teowaki.com
![Page 23: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/23.jpg)
javier ramirez @supercoco9 https://teowaki.com
bq cli
bq load --nosynchronous_mode --encoding UTF-8 --field_delimiter 'tab' --max_bad_records 100 --source_format CSV api.stats 20131014T11-42-05Z.gz
![Page 24: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/24.jpg)
web console screenshot
javier ramirez @supercoco9 https://teowaki.com
![Page 25: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/25.jpg)
javier ramirez @supercoco9 https://teowaki.com
it's just sql, plus...
analytical SQL functions.correlations.
window functions.views.
JSON fields.timestamped tables.
![Page 26: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/26.jpg)
Things you always wanted to try but were too scared to
javier ramirez @supercoco9 https://teowaki.com
select count(*) from publicdata:samples.wikipedia
where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;
223,163,387Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)
![Page 27: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/27.jpg)
![Page 28: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/28.jpg)
Global Database of Events, Language and Tone
quarter billion rows30 yearsupdated daily
http://gdeltproject.org/data.html#googlebigquery
![Page 29: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/29.jpg)
SELECT Year, Actor1Name, Actor2Name, Count FROM (SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count, RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rankFROM (SELECT Actor1Name, Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name < Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode), (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name > Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode),WHERE Actor1Name IS NOT nullAND Actor2Name IS NOT nullGROUP EACH BY 1, 2, 3HAVING Count > 100)
WHERE rank=1ORDER BY Year
![Page 30: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/30.jpg)
![Page 31: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/31.jpg)
javier ramirez @supercoco9 https://teowaki.com
our most active user
![Page 32: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/32.jpg)
javier ramirez @supercoco9 https://teowaki.com
10 request we should be caching
![Page 33: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/33.jpg)
javier ramirez @supercoco9 http://teowaki.com
5 most created resources
select uri, count(*) total from stats where method = 'POST' group by URI;
![Page 34: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/34.jpg)
javier ramirez @supercoco9 http://teowaki.com
...but
/users/javier/shouts/users/rgo/shouts/teams/javier-community/links/teams/nosqlmatters-cgn/links
![Page 36: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/36.jpg)
what is it being used for?
![Page 37: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/37.jpg)
![Page 38: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/38.jpg)
![Page 39: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/39.jpg)
Analysing weather information
Finding patterns in e-commerce
Match online/offline behaviour
Log analysys
Analysing inventory/booking data
...
![Page 40: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/40.jpg)
warning: BigQuery is not open source and not for free
$26 per stored TB
$5 per processed TB*the 1st TB processed every month is free of charge
javier ramirez @supercoco9 https://teowaki.com
![Page 41: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/41.jpg)
![Page 42: Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf](https://reader036.fdocuments.in/reader036/viewer/2022062307/55838aded8b42a9e528b4868/html5/thumbnails/42.jpg)
Find related links at
https://teowaki.com/teams/javier-community/link-categories/bigquery-talk
Thanks
javier ramirez@supercoco9
https://teowaki.com
https://datawaki.com