War stories with Apache Spark - BI...
![Page 1: War stories with Apache Spark - BI Consultingbiconsulting.hu/.../2016budapestdata/...war_stories_with_apache_spa… · War stories with Apache Spark ... Spark TOOLS 0.5-2TB data ...](https://reader030.fdocuments.in/reader030/viewer/2022040609/5ecd342a3ba3fa0b2c4f3f98/html5/thumbnails/1.jpg)
War stories with Apache Spark
Mate Gulyas
CTO & Co-Founder
GULYÁS MÁTÉ
@gulyasm
Product placeholder
DATA PLATFORM at Enbritely
What do we do?
● DATA COLLECTION
● ANALYZE
● DATA PROCESSING
● ANTI FRAUD
● VIEWABILITY
● BRAND SAFETY
● REPORT + API
HOW DID WE GET HERE?
MONOLITHIC PYTHON ANALYTICS
EVALUATE BIG DATA TECHNOLOGIES
STARTED WORK ON DP
DP PRODUCTION READY
SAAS DP
DATA COLLECTION
The way to the access log
1. Click event attributes (created by JS tracker)
{
"session_id": "spark_meetup_jsmmmoq",
"timestamp": 1456080915621,
"type": "click"
}
2. The same event, Base64-encoded
eyJzZXNzaW9uX2lkIjoic3Bhcmtfb
WVldHVwX2pzbW1tb3EiLCJ0aW1l
c3RhbXAiOjE0NTYwODA5MTU2M
jEsInR5cGUiOiAiY2xpY2sifQo=
3. Access log format
TS CLIENT_IP STATUS "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
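The path from click event to access log line can be walked backwards in a few lines of Python. This is only a sketch of the idea, not the pipeline used in the talk: the function name `decode_event` and the timestamp, client IP, and status values in the sample line are made up for illustration, and it assumes the `event` parameter carries standard Base64, as in the slide's example string.

```python
import base64
import json
import re
from urllib.parse import unquote

# Base64 payload copied from the slide (the four wrapped lines joined together).
PAYLOAD = (
    "eyJzZXNzaW9uX2lkIjoic3Bhcmtfb"
    "WVldHVwX2pzbW1tb3EiLCJ0aW1l"
    "c3RhbXAiOjE0NTYwODA5MTU2M"
    "jEsInR5cGUiOiAiY2xpY2sifQo="
)

# Hypothetical access log line in the "TS CLIENT_IP STATUS "GET <url>"" layout.
# The TS, CLIENT_IP and STATUS values are invented for this example.
sample_line = '1456080915 203.0.113.7 200 "GET https://api.endpoint?event=' + PAYLOAD + '"'


def decode_event(log_line: str) -> dict:
    """Extract the `event` query parameter from a log line and decode it."""
    match = re.search(r'[?&]event=([^&\s"]+)', log_line)
    if match is None:
        raise ValueError("no event parameter found in log line")
    # unquote() undoes percent-encoding but leaves '+' alone, so Base64 stays intact.
    payload = unquote(match.group(1))
    return json.loads(base64.b64decode(payload))


event = decode_event(sample_line)
print(event["session_id"], event["timestamp"], event["type"])
```

In a Spark job, a function like this would typically be applied per log line in a map step; lines without an `event` parameter raise `ValueError` here so that malformed input is not silently dropped.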
DATA PROCESSING
Spark TOOLS
● 0.5-2 TB of data processed daily (1-10B rows)
● Ad-hoc batch queries over 20 TB of data
● 20+ node cluster
● Spent 4 months optimizing it
Luigi TOOLS
Luigi + enbrite.ly extensions = Gabo Luigi
WORKFLOW ENGINE
LESSONS LEARNED
LESSONS LEARNED
YOU WILL SPEND A LOT OF TIME ON TOOLING
Tools we created: GABO LUIGI
LESSONS LEARNED
OPTIMIZATION takes a LOT OF TIME
LESSONS LEARNED
OPTIMIZATION NEVER ENDS
LESSONS LEARNED
AUTOMATE PERFORMANCE OPTIMIZATION
PERFORMANCE MEASUREMENTS
● CLUSTER CONFIGURATION
● SPARK JOB CONFIGURATION
● DATA SET VARIATIONS
● IMPACT OF ALGORITHMS
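One way to automate measurements across these dimensions is to run the same job once per combination of settings and record the wall-clock time of each run. The sketch below is an assumption about how that could look, not the tooling described in the talk: `run_matrix`, the dimension names, their values, and `fake_job` are all hypothetical, and `fake_job` stands in for whatever actually submits a Spark job.

```python
import itertools
import time


def run_matrix(job, dimensions):
    """Call `job` once per combination of dimension values and time each call."""
    names = list(dimensions)
    results = []
    for values in itertools.product(*(dimensions[name] for name in names)):
        params = dict(zip(names, values))
        started = time.perf_counter()
        job(**params)
        elapsed = time.perf_counter() - started
        results.append({**params, "seconds": elapsed})
    return results


# Hypothetical measurement dimensions (cluster size, job setting, data set).
dimensions = {
    "executors": [4, 8],
    "shuffle_partitions": [200, 400],
    "dataset": ["one_day", "one_week"],
}


def fake_job(executors, shuffle_partitions, dataset):
    # Placeholder for submitting a real job; it only sleeps briefly.
    time.sleep(0.001)


results = run_matrix(fake_job, dimensions)
# Show the fastest combinations first.
for row in sorted(results, key=lambda r: r["seconds"])[:3]:
    print(row)
```

Storing rows like these per commit or per configuration change makes it possible to compare runs over time instead of relying on one-off manual timings.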
PERFORMANCE MEASUREMENTS
MARATHON
LESSONS LEARNED
DATA STORAGE IS THE BIGGEST OPTIMIZATION
LESSONS LEARNED
DON’T START WITH SCALA AND SPARK
LESSONS LEARNED
KEEP ANALYTICS CODE IN ONE REPOSITORY
LESSONS LEARNED
STRUCTURE YOUR CODE
LESSONS LEARNED
START WITH THE SMALLEST BIG DATA PROJECT
HOW DID WE GET HERE?
MONOLITHIC PYTHON ANALYTICS
EVALUATE BIG DATA TECHNOLOGIES
STARTED WORK ON DP
DP PRODUCTION READY
SAAS DP
LESSONS LEARNED
REUSE CODE
LESSONS LEARNED
REUSE KNOWLEDGE
Unified Data Processing Engine
NOT EVERY USE CASE IS A SPARK USE CASE