Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser
-
Upload
spark-summit -
Category
Data & Analytics
-
view
215 -
download
0
Transcript of Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser
![Page 1: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/1.jpg)
Brad Kaiser, IBM/TWCCraig Ingram, IBM/TWC
Supporting Highly Multitenant Spark Notebook Workloads
#EUdev8
![Page 2: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/2.jpg)
Hosting Multitenant Spark Notebooks Is Hard But It Doesn't Have To Be• Our Journey• Best Practices• New Work
2#EUdev8
![Page 3: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/3.jpg)
Our Journey
3#EUdev8
![Page 4: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/4.jpg)
Who we are
4#EUdev8
![Page 5: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/5.jpg)
IBM's Commitment to Open Source
5#EUdev8
• Contribute intellectual and technical capital to the Apache Spark community.
• Make the core technology enterprise-and cloud-ready.
• Build data science skills to drive intelligence into business applications — https://cognitiveclass.ai/
• http://spark.tc
![Page 6: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/6.jpg)
Mission• Provide:
– a secure, performant, stable cluster.– interactive analytics, visualizations, and reports.– collaboration and sharing with other data scientists,
engineers, and consumers.– job scheduling capabilities.– a quick and easy way to get started with Spark.
6#EUdev8
![Page 7: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/7.jpg)
Goals• Support hundreds of analysts/data scientists
using Spark– Quick kernel creation (<10s from notebook creation to
available sc).– Utilize cluster resources efficiently.– Elastically scale based on load.
7#EUdev8
![Page 8: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/8.jpg)
Lessons Learned at TWC
8#EUdev8
![Page 9: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/9.jpg)
One Big Cluster
9#EUdev8
• Spark collocated with Cassandra
• Fast• Stable
![Page 10: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/10.jpg)
One Big Cluster - Cons
10#EUdev8
• Outside analysts start using our cluster
• Provided notebook services• Interrupted our perfectly
scheduled jobs• Used a lot of resources
causing Cassandra to crash
![Page 11: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/11.jpg)
Add Smaller Clusters
11#EUdev8
• We built some smaller clusters
• Still platform agnostic• Other teams couldn't
affect our production cluster
![Page 12: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/12.jpg)
Add Smaller Clusters - Cons
12#EUdev8
• Hassle to set up• Required a lot of
maintenance• Sat idle
![Page 13: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/13.jpg)
EMR• Analysts make ad hoc clusters for their needs• No maintenance from us• Learning curve for analysts• They tend to leave them running
13#EUdev8
![Page 14: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/14.jpg)
Lessons Learned - IBM
14#EUdev8
![Page 15: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/15.jpg)
Data Science Experience• Collaborative environment on the front end
– Collaboration Tools– Shared Data Sets– Flows– GitHub Integration
• Multiple compute environments on the back end– DSX on the cloud: compute runs on IBM cloud– DSX Local: compute runs on private cloud or Z
15#EUdev8
![Page 16: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/16.jpg)
Lessons Learned - IBM• You need kernel remoting
– Allows advanced collaborative tools in the application tier– Allows resource consolidation in the analytics tier
• Resource consolidation puts stress on the analytics tier– Starvation– Management of cached data– Performance bottlenecks (example: Spark web UI)
16#EUdev8
![Page 17: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/17.jpg)
Best Practices
17#EUdev8
![Page 18: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/18.jpg)
Best Practices• Use kernel remoting• Use fewer, bigger clusters• Know your workloads• Isolate users• Schedule resources efficiently
18#EUdev8
![Page 19: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/19.jpg)
Use Kernel Remoting• Running all of your notebook kernels on the
same server is a bottle neck• Run your kernels distributed on the cluster• You can run a lot more notebooks
19#EUdev8
![Page 20: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/20.jpg)
Jupyter Enterprise Gateway• New Open Source project from IBM• Goals:
– Allow hundreds of notebook users to share a single Spark cluster…
– …with enterprise-level security and performance.• Used in IBM Analytics Engine (GA)• developer.ibm.com/code/openprojects/jupyter-
enterprise-gateway-2
20#EUdev8
![Page 21: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/21.jpg)
Jupyter Enterprise Gateway:How it Works
21#EUdev8
Spark Cluster
Security
Laye
r Jupyter Enterprise Gateway• Multitenant• Remote Kernel Lifecycle Management
YARNWorkers
Impersonation: Alice’s kernel
runs under Alice’s user ID.
Spark
ExecutorsSpark ExecutorsSpark Executors
Yarn Container
Jupyter Kernel
Spark Driver
Spark
ExecutorsSpark ExecutorsSpark Executors
Yarn Container
Jupyter Kernel
Spark Driver
Secure
Secure
Secure Secure
![Page 22: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/22.jpg)
Use Fewer, Bigger Clusters• Better resource utilization through statistical multiplexing• Improved security and auditing due to centralization
– Hive, Ranger, and Atlas are common in the ecosystem.– Many new, platform specific solutions to address this problem.
• Easier collaboration between users– Shared notebooks with interactive visualizations and markdown
support.– GitHub integration for versioning and external sharing.– Catalog based data discovery and sharing.– Governance and auditing support.
22#EUdev8
![Page 23: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/23.jpg)
Know your workloads• What is the main resource they use?• Overprovision the hardware• When to scale up and down?
– CPU load– YARN/RM Queue Stats (depth, waiting jobs, available
CPU/mem, preemptions)• If containers are getting preempted, it’s due to queues filling
up.
23#EUdev8
![Page 24: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/24.jpg)
Isolate Users• Don't have a generic user account• Catalog and Governance
– Hive– Atlas– Ranger
• You don't want users embedding keys and passwords in their notebooks.
24#EUdev8
![Page 25: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/25.jpg)
Schedule Resources Efficiently
25#EUdev8
![Page 26: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/26.jpg)
YARN Queues• Take advantage of YARN’s hierarchical queue system to manage
and organize resources.– Over-allocate queues for better resource utilization and sharing.– Take advantage of node labels for users that have priority jobs
that require an SLA.– Intra-queue preemption and asynchronous container allocation
should be available in YARN 3.0.
26#EUdev8
![Page 27: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/27.jpg)
YARN Queues
27#EUdev8
![Page 28: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/28.jpg)
Dynamic Allocation• Dynamic allocation lets you take advantage of varying activity
– Proactively scales the number of executors based on the scheduler's backlog.
– Removes idle executors after a timeout.– Be sure to set a sensible number of initial executors
and minimum executor floor and let them ramp up on demand.
• Static allocation best for known workloads
28#EUdev8
![Page 29: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/29.jpg)
New Work
29#EUdev8
![Page 30: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/30.jpg)
Improvements to Spark• Alleviate the tradeoffs inherent in current best
practices.– Recover cached data when shutting down idle
executors– Proactively shut down executors to prevent
starvation
30#EUdev8
![Page 31: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/31.jpg)
Recovering Cached Data• Replicates cached data to remaining executors
before shutting them down. • Ameliorates cache issues with dynamic
allocation• Useful in shared spark notebook environments
31#EUdev8
![Page 32: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/32.jpg)
Benchmarks
32#EUdev8
![Page 33: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/33.jpg)
Benchmarks
33#EUdev8
![Page 34: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/34.jpg)
Check out my PR– SPARK-21097– github.com/apache/spark/pull/19041
34#EUdev8
![Page 35: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/35.jpg)
Preventing Starvation• Eliminate issues where users are unable to run
anything due to other users taking up all of the cluster’s resources.
• Especially useful in shared spark notebook environments where idle resources can be reclaimed easily.
• Preemption can solve this.35#EUdev8
![Page 36: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/36.jpg)
Enter Preemption• Requests containers associated with over-
allocated queues to shut down.• Handle YARN's PreemptionMessage in a way
that best suits the workload.• Pick the right executors to terminate.
36#EUdev8
![Page 37: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/37.jpg)
Keep up with the JIRA• SPARK-21122• PR coming soon
37#EUdev8
![Page 38: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/38.jpg)
Call to action• Look at our JIRAs• Try out our PRs
38#EUdev8
![Page 39: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/39.jpg)
Shout-outs!!!• Our notebook workload simulator, benchmark,
and tracing tools.– spark-bench - github.com/SparkTC/spark-bench
• Check out Emily Curtin’s talk tomorrow about spark-bench.– spark-tracing - github.com/SparkTC/spark-tracing
• Matthew Schauer's baby awaiting open-source approval.
• Special thanks to Vijay Bommireddipalli and Fred Reiss for their guidance and support!
39#EUdev8
![Page 41: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/41.jpg)
Extra Material
41#EUdev8
![Page 42: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/42.jpg)
YARN Asynchronous Scheduling• Enable asynchronous scheduling of containers in YARN.
– yarn.scheduler.capacity.schedule-asynchronously.enable– YARN-7327 and YARN-5139
42#EUdev8
![Page 43: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/43.jpg)
References• Some Icons provided by Icons8
43#EUdev8
![Page 44: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/44.jpg)
Quick Settings• Use spark.yarn.jars or spark.yarn.archive.
• Running Spark from tmpfs did not improve performance.
• Support for multiple/new versions of Spark.
44#EUdev8
Move to backup
![Page 45: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/45.jpg)
Disable unused credential providers• spark.yarn.security.credentials.hive.enabled• spark.yarn.security.credentials.hbase.enabled
45#EUdev8
conf avg stdevdefault 8.745 0.051nohive 8.7 0.05nohbase 7.87 0.05
Move to backup
![Page 46: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/46.jpg)
Problem Domain• Security
– UI and service protection– Data governance and auditing
• Stability• Performance
46#EUdev8
![Page 47: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/47.jpg)
What’s next…• spark-on-k8s• Scheduler improvements• Executor startup time reduction
47#EUdev8
![Page 48: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/48.jpg)
• Hundreds of notebook users leads to a highly multitenant and interactive workload.
• Challenge: Give each user the illusion of having a large cluster all to herself
48#EUdev8
Many users at once
Burstyoffered load
Latency is important
![Page 49: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/49.jpg)
What platforms are out there…Hosted Solutions• Data Science Experience
(DSX)/Watson Data Platform (WDP)• databricks• Google Cloud Platform• Microsoft Azure HDInsight• and others!!!
49#EUdev8
![Page 50: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/50.jpg)
What platforms are out there…Self-Hosted Solutions• HDP – Hadoop Data Platform• CDH – Cloudera Distribution
Including Hadoop• MapR• Mesosphere
50#EUdev8
![Page 51: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/51.jpg)
Reasons to Self Host at TWC• Cloud Agnostic• Flexibility• Sensitive Data• Fewer Options in 2014• Cassandra Colocation• Cost…maybe?
51#EUdev8
![Page 52: Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a64760b7f8b9afc4d8b45d1/html5/thumbnails/52.jpg)
Potential Downfalls• In a self-hosted environment, everything is up to
you.– Security– Stability and Performance– Scalability
• Compute• Storage
– Monitoring– Alerting– Logging
52#EUdev8