Spark Summit EU talk by Ruben Pulido and Behar Veliqi
-
Upload
spark-summit -
Category
Data & Analytics
-
view
302 -
download
0
Transcript of Spark Summit EU talk by Ruben Pulido and Behar Veliqi
![Page 1: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/1.jpg)
SPARK SUMMIT EUROPE 2016
FROM SINGLE-TENANT HADOOPTO 3000 TENANTS IN APACHE SPARK
RUBEN PULIDOBEHAR VELIQI
IBM WATSON ANALYTICS FOR SOCIAL MEDIAIBM GERMANY R&D LAB
!
""
#
![Page 2: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/2.jpg)
WHAT IS WATSON ANALYTICS FOR SOCIAL MEDIAPREVIOUS ARCHITECTURETHOUGHT PROCESS TOWARDS MULTITENANCYNEW ARCHITECTURELESSONS LEARNED
![Page 3: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/3.jpg)
WATSON ANALYTICS A data analytics solution for business users in the cloud
users
WATSON ANALYTICS FOR SOCIAL MEDIA
◉ Part of Watson Analytics◉ Allows users to
harvest and analyzesocial media content
◉ To understand how brands, products, services, social issues,… are perceived
![Page 4: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/4.jpg)
Cognos BIText Analytics
BigInsights
DB2
Evolving Topics
End to End OrchestrationContent is stored in HDFS…
…analyzed within our
components...
…stored in HDFS again...
…and loaded into DB2
Our previous architecture: a single-tenant “big data” ETL batch pipeline
![Page 5: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/5.jpg)
Photo by GhislainBonneau, gbphotodidactical.ca
What people think about Social Media data rates What Stream Processing is really good at
![Page 6: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/6.jpg)
Our requirement:Efficiently processing “trickle feeds” of data
What the size of a customer specific social media data-feed really is
![Page 7: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/7.jpg)
WATSON ANALYTICS FOR SOCIAL MEDIAPREVIOUS ARCHITECTURETHOUGHT PROCESS TOWARDS MULTITENANCYNEW ARCHITECTURELESSONS LEARNED
![Page 8: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/8.jpg)
Step 1: Revisit our analytics workload
Author featuresdetection
Concept level
sentiment detection
Concept detection
Everyone I ask says my PhoneAis way better than my brother’s PhoneB
Everyone I ask says my PhoneA is way better than my brother’s PhoneBPhoneA:Positive
Everyone I ask says my PhoneA is way better than my brother’s PhoneBPhoneB:Negative
My wife’s PhoneA died yesterday.
My son wants me to get him PhoneB.
Author: Is Married = true
Author: Has Children = true
Creation of author profiles
Is Married = trueHas Children = trueAuthor:Author: Is Married = true
Author: Has Children = true
![Page 9: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/9.jpg)
Step 2: Assign the right processing model for the right workloadDocument-level analyticsConcept detectionSentiment detectionExtracting author features
◉ Can easily work on a data stream◉ User gets additional value from
every document
Collection-level analyticsConsolidation of author profilesInfluence detection…
◉ Most algorithms need a batch of documents (or all documents)
◉ User value is getting the “big picture”
![Page 10: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/10.jpg)
◉Document-level analytics makes sense on a stream…
◉….document analysis for a single tenant often does not justify stream processing…
◉…but what if we process all documents for all tenants in a single system?
Step 3: Turn “trickle feeds” into a stream through multitenancy
![Page 11: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/11.jpg)
Step 4: Address Multitenancy Requirements
◉Minimize “switching costs” when analyzing data for different tenants
−Separated text analytics steps into: − tenant-specific− language-specific
−Optimized tenant-specific steps for “low latency” switching
◉ Avoid storing tenant state in processing components−Using Zookeeper as distributed configuration store
![Page 12: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/12.jpg)
WATSON ANALYTICS FOR SOCIAL MEDIAPREVIOUS ARCHITECTURETHOUGHT PROCESS TOWARDS MULTITENANCYNEW ARCHITECTURELESSONS LEARNED
![Page 13: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/13.jpg)
ANALYSIS PIPELINE
![Page 14: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/14.jpg)
CO
LLEC
TIO
N S
ERVI
CE
ANALYSIS PIPELINE
![Page 15: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/15.jpg)
DATA FETCHING
DOCUMENT ANALYSIS
EXPORT
AU
THO
R A
NA
LYSI
S
DAT
ASE
T C
REA
TIO
N
CO
LLEC
TIO
N S
ERVI
CE
![Page 16: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/16.jpg)
DATA FETCHING
DOCUMENT ANALYSIS
EXPORT
AU
THO
R A
NA
LYSI
S
DAT
ASE
T C
REA
TIO
N
CO
LLEC
TIO
N S
ERVI
CE
Stream Processing Batch Processing
![Page 17: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/17.jpg)
DATA FETCHING
DOCUMENT ANALYSIS
EXPORT
AU
THO
R A
NA
LYSI
S
DAT
ASE
T C
REA
TIO
N
CO
LLEC
TIO
N S
ERVI
CE
urls
raw
analyzed
KafkaHDFSHTTP
HDFS HDFS
Stream Processing Batch Processing
![Page 18: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/18.jpg)
DATA FETCHING
DOCUMENT ANALYSIS
EXPORT
AU
THO
R A
NA
LYSI
S
DAT
ASE
T C
REA
TIO
N
CO
LLEC
TIO
N S
ERVI
CE
urls
raw
analyzed
KafkaHDFSHTTP
HDFS HDFS
Stream Processing Batch Processing
![Page 19: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/19.jpg)
ORCHESTRATION
DATA FETCHING
DOCUMENT ANALYSIS
EXPORT
AU
THO
R A
NA
LYSI
S
DAT
ASE
T C
REA
TIO
N
CO
LLEC
TIO
N S
ERVI
CE
urls
raw
analyzed
KafkaHDFSHTTP
HDFS HDFS
Stream Processing Batch Processing
![Page 20: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/20.jpg)
master master masterZK ZK ZK
worker broker worker broker
worker broker worker broker
worker broker worker broker
HDFS
HDFS
HDFS
Orchestration Tier Data Processing Tier Storage Tier
![Page 21: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/21.jpg)
worker broker worker broker
worker broker worker broker
worker broker worker broker
HDFS
HDFS
HDFS
Data Processing Tier Storage Tier
master master masterZK ZK ZK
JOB-1JOB-2
JOB-N… status
errorCode# fetched
# analyzed…
Orchestration Tier
![Page 22: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/22.jpg)
Orchestration Tier Storage Tier
master master masterZK ZK ZK
HDFS
HDFS
HDFS
worker broker worker broker
Data Processing Tier
![Page 23: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/23.jpg)
Orchestration Tier Storage Tier
master master masterZK ZK ZK
HDFS
HDFS
HDFS
en de fr es … en de fr es …
en de fr es …
en de fr es … en de fr es …
worker broker worker broker
worker broker worker broker
worker broker worker broker
Data Processing Tier
• Expensive in Initialization
• High memory usage
à Not appropriate for . Caching
• Stable across tenants• One analysis engine
• per language • per Spark
workerà created at
. deployment time• Processing threads
co-use these engines
LANGUANGE SPECIFIC ANALYSIS
en de fr es …
![Page 24: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/24.jpg)
Orchestration Tier Storage Tier
master master masterZK ZK ZK
HDFS
HDFS
HDFS
en de fr es … en de fr es …
en de fr es …
en de fr es … en de fr es …
worker broker worker broker
worker broker worker broker
worker broker worker broker
Data Processing Tier
• Low memory usagecompared to languageanalysis engines
à LRU-cache of 100 . tenant specific. analysis engines. per Spark worker
• Very low tenant-switching overhead
TENANTSPECIFIC ANALYSIS
• Expensive in Initialization
• High memory usage
à Not appropriate for . Caching
• Stable across tenants• One analysis engine
• per language • per Spark
workerà created at
. deployment time• Processing threads
co-use these engines
LANGUANGE SPECIFIC ANALYSIS
en de fr es …
![Page 25: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/25.jpg)
READ
…raw
![Page 26: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/26.jpg)
READ
…raw
ANALYZE
create view Parent as extract regex /
(my | our)(sons?|daughters?|kids?|baby( boy|girl)?|babies|child|children)/
with flags 'CASE_INSENSITIVE' on D.text as match from Document D
AQL
![Page 27: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/27.jpg)
READ
…raw
ANALYZE
WRITEanalyzed
status = RUNNINGerrorCode = NONE# fetched = 5000
# analyzed = 4000…
![Page 28: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/28.jpg)
◉ More than 3000 tenants◉ Between 150-300 analysis jobs per day◉ Between 25K-4M documents analyzed per job◉ 3 data centers in Toronto, Dallas, Amsterdam◉ Cluster of 10 VMs: each with 16 Cores and 32G RAM
◉ Text-Analytics Throughput:◉ Full Stack: from Tokenization to Sentiment / Behavior◉ Mixed Sources: ~60K documents per minute◉ Tweets only: 150K Tweets per minute
![Page 29: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/29.jpg)
DEPLOYMENT
ACTIVEREST-API
GitHub CI UberJAR
DeploymentContainer
UPLOAD TO HDFSGRACEFUL SHUTDOWNSTART APPS
…offsetSettingStrategy: "keepWithNewPartitions",segment: {
app: “fetcher”,spark.cores.max: 10,spark.executor.memory: "3G",spark.streaming.stopGracefullyOnShutdown: true,…
}…
![Page 30: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/30.jpg)
WATSON ANALYTICS FOR SOCIAL MEDIAPREVIOUS ARCHITECTURETHOUGHT PROCESS TOWARDS MULTITENANCYNEW ARCHITECTURELESSONS LEARNED
![Page 31: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/31.jpg)
◉ “Driver Only” – No Stream or Batch Processing◉ Uses Spark’s Fail-Over mechanisms◉ No implementation of custom master election necessary
ORCHESTRATION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
…// update progress on zookeeper// trigger the batch phases
![Page 32: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/32.jpg)
KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
![Page 33: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/33.jpg)
KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
◉ wrong workload for Spark`s micro-batch based stream-processing
◉ Hard to find appropriate scheduling interval◉ builds high scheduling delays◉ hard to distribute equally across the cluster◉ No benefit of Spark settings like
maxRatePerPartition or backpressure.enabled◉ sometimes leading to OOM
![Page 34: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/34.jpg)
KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
NON-SPARK APPLICATION
◉ Run the fetcher as separate applicationoutside of Spark – “long running” thread
◉ Fetch documents from social media providers◉ Write to raw documents to Kafka◉ Let the pipeline analyze these documents◉ backpressure.enabled = true
◉ wrong workload for Spark`s micro-batch based stream-processing
◉ Hard to find appropriate scheduling interval◉ builds high scheduling delays◉ hard to distribute equally across the cluster◉ No benefit of Spark settings like
maxRatePerPartition or backpressure.enabled◉ sometimes leading to OOM
![Page 35: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/35.jpg)
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
Job-ID
• Jobs don’t influence each other• Run fully in parallel• Cluster is not fully/equally utilized
![Page 36: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/36.jpg)
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
worker broker worker broker
worker broker worker broker
JOB-1JOB-2
Job-ID
“Random” + All URLs At Once
• Jobs don’t influence each other• Run fully in parallel• Cluster is not fully/equally utilized
• Cluster is fully utilized• Workload equally distributed• Jobs run sequentially
![Page 37: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/37.jpg)
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
worker broker worker broker
worker broker worker broker
JOB-1JOB-2
worker broker worker broker
worker broker worker broker
JOB-1JOB-2
Job-ID
”Random” + “Next” URLs
• Jobs don’t influence each other• Run fully in parallel• Cluster is not fully/equally utilized
• Cluster is fully utilized• Workload equally distributed• Jobs run sequentially
• Cluster is fully utilized• Workload equally distributed• Jobs run in parallel
“Random” + All URLs At Once
![Page 38: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/38.jpg)
◉ Grouping happens in User Memory region◉ When a large batch of documents is processed:
◉ Spark does does not spill to disk◉ OOM errors
à configure a small streaming batch interval or . set maxRatePerPartition to limit the events of the batch
OOM
Write documents for each job to a separate HDFS directory.First version: Group documents using Scala collections API
UserMemoryRegion
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
![Page 39: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/39.jpg)
◉ Group and write job documents through DataFrame APIs ◉ Uses the Execution Memory region for computations◉ Spark can prevent OOM errors by borrowing space from the
Storage memory region or by spilling to disk if needed.
Spark Execution
MemoryRegion
No OOM
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
Write documents for each job to a separate HDFS directory.Second version: Let Spark do the grouping and write to HDFS
![Page 40: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/40.jpg)
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
worker worker
worker worker
JOB-1
JOB-2
FIFO SCHEDULER
◉ First job gets available resources while its stages have tasks to launch
◉ Big jobs “block” smaller jobs started later
worker worker
worker worker
JOB-1JOB-2
◉ Tasks between jobs assigned “round robin”
◉ Smaller jobs submitted while a big job is running can start receiving resources right away
FAIR SCHEDULER
Within-Application Job Scheduling
![Page 41: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/41.jpg)
◉ Watson Analytics for Social Media allows harvesting and analyzing social media data
◉ Becoming a real-time multi-tenant analytics pipeline required:– Splitting analytics into tenant-specific and language-specific– Aggregating all tenants trickle-feeds into a single stream– Ensuring low-latency context switching between tenants– Removing state from processing components
◉ New pipeline based on Spark, Kafka and Zookeeper– robust– fulfills latency and throughput requirements– additional analytics
SUMMARY
![Page 42: Spark Summit EU talk by Ruben Pulido and Behar Veliqi](https://reader036.fdocuments.in/reader036/viewer/2022081323/586f75821a28ab10258b60ad/html5/thumbnails/42.jpg)
SPARK SUMMIT EUROPE 2016
THANK YOU.RUBEN PULIDOwww.linkedin.com/in/ruben-pulido @_rubenpulido
www.watsonanalytics.comFREE TRIAL
BEHAR VELIQIwww.linkedin.com/in/beharveliqi @beveliqi