Transcript of IT-DB-SAS Internal Training, 22 November 2018
Piotr Mrowczynski
Cloud-native Spark on Kubernetes
IT-DB-SAS Internal Training
Credit - https://www.edureka.co/blog/hadoop-ecosystem
What is Apache Spark?
Apache Spark is an open-source distributed and parallel processing framework that allows for sophisticated analytics, real-time streaming and machine learning on large datasets
Spark K8S Operator Demo: cern.ch/spark-user-guide
[Diagram: Spark Operator deployment (a long-running operator for batch) on the cloud infrastructure, with Kubernetes as resource manager and Spark drivers/executors running as pods (containers)]
- Create cluster (Ref: Spark/K8S Cluster Admin Guide)
Let's kick off "Create cluster", as it can take up to 10 minutes.
Current state of the art for Apache Spark at CERN
• Spark running on top of HDFS/YARN (a distributed filesystem for big data and a cluster resource manager). Processing usually happens on the same cluster of machines as storage (data locality).
• Physical machines are allocated, which means no resource elasticity, no isolation between users, and compute coupled with data storage.
• Stable workloads from monitoring, security, and experiment operations (NXCals), with data in HDFS/Kafka.
• Ad-hoc workloads from physics communities interested in analysing data at scale using external storages (EOS/S3). Physics analysis is limited to the capacity and resources of the shared Spark/Hadoop cluster; some workloads (>100 TB, possibly 1 PB) are not possible on the shared cluster without a major upgrade or disruption of other workflows.
Use-case investigated
• Ad-hoc workloads from physics communities interested in analysing data at scale using external storages (EOS/S3).
Data and operational challenges
1. Growth of data brings new challenges
Mass Storage of LHC data for analysis at CERN - EOS
Currently >250 PB; 12.3 PB ingested in October 2017 alone
2. Reproducible analysis
Physics analysis has to be reproducible! YARN clusters are difficult to provision, and difficult to set up with the required environment.
Addressing the challenges
Literature study
"Cloud-native is an approach to building and running applications that exploits the advantages of the cloud computing model" [1]

Companies have attempted cloud-native Spark deployments using Kubernetes; a native resource-scheduler implementation had its experimental release in Apache Spark in March 2018.

Research from Google supports the hypothesis that, in cluster computing, disk locality is becoming irrelevant.

Databricks and Accenture Labs benchmarked Spark with data in external storages against HDFS.

[1] "Kubernetes becomes the first project to graduate from the Cloud Native Computing Foundation." https://www.forbes.com/sites/janaki-rammsv/2018/03/07/kubernetes-becomes-the-first-project-to-graduate-from-the-cloud-native-computing-foundation.
Separating Compute and Storage for Elastic Big Data

Data locality (Spark on HDFS):
- assumes disk bandwidth is higher than storage throughput over the network
- limited elasticity (both for computation and storage)
- avoids high network bandwidth for analysis on large datasets

Data external, over the network (Spark with external storage: EOS, S3, GCS + Kafka, HDFS):
- assumes mass storage throughput is higher than cluster disk bandwidth
- allows compute not to be bound to storage (with cloud elasticity)
- assumes the network delivers data to the CPU faster than local disks; may generate high traffic
Possible solution for storage elasticity
- Mass storage services are easier/cheaper to scale.
- Data stored on disk can be large, while compute nodes can be adjusted to compute needs.
Possible solution for storage and compute elasticity, reproducibility
- OpenStack CaaS (Containers-as-a-Service) with Kubernetes provides compute elasticity and an authentication mechanism (certificates).
- Separation of monolithic services simplifies operational effort.
- Plain spark-submit against Kubernetes only allows creating drivers and executors (not so friendly to manage).
- The Spark K8S Operator brings the feature parity known from YARN (managing Spark applications).
- The Spark Operator gives idiomatic submission, application restart policies, cron support, custom volume/config/secret mounts, and pod affinity/anti-affinity [1].

[1] https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
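For illustration, the custom resource the operator watches has roughly the following shape. This is a sketch only: the field names follow the upstream spark-on-k8s-operator project of that era, and the image name and file paths are placeholders, not the service's actual values.

```python
import json

# Rough shape of a SparkApplication custom resource, as consumed by the
# spark-on-k8s-operator. Field names are approximate; image and paths
# are placeholders.
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1alpha1",
    "kind": "SparkApplication",
    "metadata": {"name": "spark-pi", "namespace": "default"},
    "spec": {
        "type": "Scala",
        "mode": "cluster",                        # driver runs inside K8s
        "image": "registry.example/spark:2.3.0",  # placeholder image
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples.jar",
        "restartPolicy": {"type": "Never"},       # operator-managed restarts
        "driver": {"cores": 1, "memory": "1g"},
        "executor": {"instances": 2, "cores": 1, "memory": "1g"},
    },
}

manifest = json.dumps(spark_app, indent=2)  # kubectl accepts JSON as well as YAML
```

Creating, updating, or deleting such a resource is what triggers the operator's event handling described later in the deck.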
Containers-as-a-Service for compute elasticity

[Diagram: Spark Operator deployment (a long-running operator for batch) on the cloud infrastructure, with Kubernetes as resource manager and Spark drivers/executors running as pods (containers)]
- openstack coe cluster: Create/Resize/Delete/GetConfiguration for Kubernetes clusters
- Helm/kubectl: Create/Delete/Update the Spark Operator to control batch/stream jobs
- sparkctl: Submit batch/stream jobs to K8S via the Operator (cluster mode)
- spark-submit: Create the driver on behalf of the user (customised by the operator)
- Container Registry: Fetch containers with the required software
Workflow Reproducibility

[Diagram: the same Dockerfiles, with required software packages and configurations, deployed both on cloud infrastructure and on bare-metal infrastructure, each with Kubernetes as resource manager]
- Kubernetes and the use of containers allow Spark developers to build isolated and reproducible environments
- Encapsulate software packages and libraries; mount ConfigMaps and volumes
Spark K8S Operator Demo: cern.ch/spark-user-guide
[Diagram: Spark Operator deployment (a long-running operator for batch) on the cloud infrastructure]
- Create cluster (Ref: Spark/K8S Cluster Admin Guide)
- Install Spark Operator (Ref: Spark/K8S Cluster Admin Guide)
- Submit jobs (Ref: Spark/K8S User Guide)
Spark K8S Operator: Submitting an application
- The Spark Operator watches for create/delete/update events on SparkApplication resources.
- The Spark Operator executes spark-submit with the required configurations.
- Data is uploaded to / read from an external storage service; monitoring is realised with internal/external monitoring tools.
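Under the hood, the operator shells out to spark-submit. A sketch of the kind of invocation it assembles for a cluster-mode job, using the Kubernetes-related configuration keys introduced in Spark 2.3 (the API server URL, image, and jar path here are placeholders):

```python
# Build the spark-submit command line for a cluster-mode job on K8s.
# "spark.kubernetes.container.image" and "--master k8s://..." are the
# Spark 2.3 Kubernetes settings; URL, image and jar are placeholders.
def build_spark_submit(master_url, image, app_jar, main_class, executors=2):
    conf = {
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
    }
    cmd = ["spark-submit",
           "--master", f"k8s://{master_url}",
           "--deploy-mode", "cluster",
           "--class", main_class]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)  # application jar comes last
    return cmd

cmd = build_spark_submit("https://kube-api.example:6443",
                         "registry.example/spark:2.3.0",
                         "local:///opt/spark/examples/jars/spark-examples.jar",
                         "org.apache.spark.examples.SparkPi")
```

The operator's value-add, as the previous slides note, is everything around this command: restart policies, cron scheduling, and mounting of volumes, configs, and secrets.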
Spark K8S Operator: Running an application
- The Spark Application Controller handles restarts and resubmissions.
- The Submission Runner executes spark-submit.
- The Spark Pod Monitor reports pod updates to the controller.
- A Mutating Admission Webhook handles customising Docker images and node affinities.
OpenStack Containers-as-a-Service
- Magnum: responsible for Kubernetes deployment
- Heat: composition of applications in the cloud
- Nova: resource provisioning (Nova VMs, or Nova Ironic for bare metal)
- Neutron: network provisioning
Spark K8S Operator Demo: cern.ch/spark-user-guide
- Create cluster (Ref: Spark/K8S Cluster Admin Guide)
- Install Spark Operator (Ref: Spark/K8S Cluster Admin Guide)
- Submit jobs (Ref: Spark/K8S User Guide)
For Spark-Pi you are ready to go. For TPC-DS, go to the IT-DB KeePass to get the CS3.CERN.CH (S3) credentials.
Evaluation of Spark on Kubernetes
Evaluating Persistent Storage for Cloud-native Spark (data over the network)

Feature                     | Network Volume Mounts                          | Native Storage Connectors | Object Storage Connectors          | Monitoring Storage
Example                     | CephFS, CVMFS, AzureDisk, EFS                  | HDFS, EOS, Kafka          | GCS, S3, Azure                     | InfluxDB, Stackdriver, Prometheus
Use-case                    | Software packages, streaming checkpoints, logs | Data processing           | Data processing, logs, checkpoints | Events, statistics
Spark transactional writes  | No                                             | Yes                       | Possible                           | -
Eventual-consistency writes | No                                             | No                        | Yes                                | -
Memory Management and Cloud-native Spark: How Kubernetes fits Spark memory management
- The kubelet monitors the memory and disk available to the node, to prevent node instability.
- On MemoryPressure or a system out-of-memory condition, it reclaims resources from running containers (OOMKilled errors).
- MemoryOverheadFactor reserves memory for off-heap allocations, for non-JVM processes (e.g. a higher limit in the case of Python), and for the processes required to operate the container.
- OOMKills are tuned via spark.memory.fraction and spark.memory.storageFraction.
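To make the knobs above concrete, here is a back-of-the-envelope sketch of how they compose, using the default values from Spark of this era (overhead factor 0.10 with a 384 MiB floor, 300 MiB of reserved JVM memory, spark.memory.fraction=0.6, spark.memory.storageFraction=0.5). This is illustrative arithmetic, not the exact JVM accounting:

```python
# Sketch: how Spark's memory settings map onto a Kubernetes pod limit.
# Defaults assumed: memory overhead factor 0.10 (min 384 MiB),
# 300 MiB reserved JVM memory, spark.memory.fraction = 0.6,
# spark.memory.storageFraction = 0.5. All sizes in MiB.

def pod_memory_limit(executor_memory_mib, overhead_factor=0.10):
    """Memory the K8s pod must be granted: JVM heap + overhead.
    Exceeding this limit is what produces OOMKilled errors."""
    overhead = max(384, int(executor_memory_mib * overhead_factor))
    return executor_memory_mib + overhead

def unified_memory(executor_memory_mib, fraction=0.6, storage_fraction=0.5):
    """Split the heap into the unified (execution + storage) pool."""
    usable = executor_memory_mib - 300       # minus reserved memory
    unified = usable * fraction              # spark.memory.fraction
    storage = unified * storage_fraction     # spark.memory.storageFraction
    return unified, storage

# 7 GiB executors, as in the TPC-DS setup later in this deck:
limit = pod_memory_limit(7 * 1024)           # 7168 + 716 = 7884 MiB
unified, storage = unified_memory(7 * 1024)  # ~4121 MiB unified, ~2060 MiB storage
```

Raising spark.memory.fraction shrinks the headroom left for user data structures on the heap, which is one way overly aggressive tuning ends in OOMKills.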
Persistent Storage for Cloud-native Spark: Large-scale data reduction in Kubernetes
"Dimuon Mass Calculation on 20 TB, max 180 executors"
[Plots: job execution duration trend and throughput trend]
- After optimisations (read-ahead, speculation): very high resource utilisation and a significant improvement in execution time
- The workload scaled near-linearly with an increasing number of executors
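The physics kernel behind this reduction is tiny: in the massless approximation, a muon pair's invariant mass is m = sqrt(2 pT1 pT2 (cosh Δη − cos Δφ)). A self-contained sketch of that formula (the production job instead reads ROOT files at scale, e.g. via the hadoop-xrootd connector mentioned later):

```python
import math

def dimuon_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of a muon pair, massless approximation:
    m^2 = 2 * pT1 * pT2 * (cosh(eta1 - eta2) - cos(phi1 - phi2))."""
    return math.sqrt(2.0 * pt1 * pt2 *
                     (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2)))

# Two back-to-back 45.6 GeV muons reconstruct to ~91.2 GeV (the Z peak):
m = dimuon_mass(45.6, 0.0, 0.0, 45.6, 0.0, math.pi)
```

In the Spark job this function is just the per-row map applied to candidate muon pairs; the 20 TB problem is entirely in the I/O, which is why the read-ahead and speculation tuning dominates the results above.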
Persistent Storage for Cloud-native Spark: Large-scale data reduction in Kubernetes
"Dimuon Mass Calculation on 20 TB, max 180 executors"
- InfluxDB was used to monitor cluster usage; CPU/network usage was fairly stable
Network impact of over-tuning the read-ahead
"Dimuon Mass Calculation with max available resources"
- 20 GB/s cluster read throughput for the duration of the calculation, across 500 VMs (>200 hypervisors), with a high read-ahead parameter
Overhead of Cloud-native Spark: An attempt to assess what a user can expect with our on-premise YARN and cloud K8S
- Very small scale: 100 GB EOS TPC-DS SQL benchmark
- Spark 2.3, 8 executors, 2 CPU / 7 GB RAM each
- Metrics collected per query and exported using sparkMeasure [1]

[1] https://github.com/cerndb/sparkMeasure
Target use cases / workloads for Spark on Kubernetes
- Physics analysis with Spark:
  - Physics analysis with the EP-SFT (ROOT team) RDataFrame approach
  - CMS Big Data Project with the hadoop-xrootd connector and the spark-root package
- Usage of Spark as a compute engine to analyse data stored in external storage systems (non-physics)
- Stable streaming workloads with Kafka
Service Deliverables / Future work
- Contribution to the upstream Spark on Kubernetes operator
- Helm charts to deploy the Spark operator, customising Spark for CERN infrastructure / use cases
  - https://gitlab.cern.ch/db/spark-service/spark-service-charts
- Monitoring of Spark workloads for efficient usage of compute resources
- Consultancy for onboarding users, and curated examples for various use cases
- Support of Spark on Kubernetes
- Easier interaction with the service: a web application? SWAN?
Service Deliverables / Future work
- Spark Streaming, and mounting CVMFS to driver/executor
- Monitoring of Spark-on-Kubernetes workloads
- Easier interaction with the service for cluster creation and Spark job submission; integration with Jupyter Notebooks (CERN SWAN) for interactive/batch analysis
- Working with the user community on the adoption of the new compute model
- Federated multi-cloud clusters?
- Benchmarking resource scheduling (start-up time of K8S and YARN for provisioning and setting up executors)?
Conclusions
- OpenStack provisions Kubernetes clusters of tens of nodes in an automated fashion (CaaS).
- Kubernetes allows reproducible Spark with limited operational effort, using Docker containers and microservices.
- Plain spark-submit against Kubernetes only allows creating drivers and executors (not so friendly to manage).
- The Spark K8S Operator provides management of Spark applications similar to that of the YARN ecosystem.
Conclusions

Feature  | Network Volume Mounts                  | Native Storage Connectors | Object Storage Connectors                                                | Monitoring Storage
Example  | CephFS, CVMFS                          | EOS, Kafka                | S3 [1]                                                                   | InfluxDB
Use-case | In evaluation for logs and checkpoints | Data processing           | Logs, checkpoints (streaming use case), data processing (in evaluation)  | Events, statistics

- Native storage connectors are best for data processing
- Network volume mounts and object storage connectors are good for logs and checkpoints
- InfluxDB is good for monitoring data
Conclusions
- Spark/K8S, like Spark/YARN (both as native resource schedulers), handles sustained ad-hoc large-scale workloads from external storages well. Spark/YARN had ~40 nodes; Spark/K8S had 1000 VMs (~200 nodes).
- Without data locality, the network can be a serious bottleneck.
Conclusions
- Kubernetes, like YARN, controls executor memory; similarly, proper tuning is needed to avoid OOMKilled errors.
- With proper data-access tuning and speculation enabled, adding resources (executors) scaled the workload near-linearly.
- TPC-DS showed the overhead of cloud-native Spark/K8S compared to Spark/YARN in a similar workload setup; the shared Spark/YARN cluster had spikes due to other heavy users.
Thank you! Questions?
Special thanks for support to my supervisors at CERN: Prasanth Kothuri, Vaggelis Motesnitsalis
Backup slides
State-of-the-art Conclusions
[Quadrant diagram positioning the deployments along several axes: data locality vs data externality, batch/stream vs interactive, cluster vs datacenter networking, and on-premise/performance-effective vs cloud-native/cost-effective. Spark/YARN with HDFS sits on the data-locality, on-premise side; Spark/YARN with external storage in between; the Spark/K8S Operator on the data-external, cloud-native side; Spark/K8S marked as "Future?"]
• Stable streaming workloads with data in Kafka
• Ad-hoc workloads from physics and other communities interested in analyzing data at scale using external storages (EOS/S3).
Example of a large-scale ad-hoc / external-storage workload
- Physics data mining: detect particle interactions (data), compare with theory predictions (simulation)
- Particle detection analysis
- Large-scale data reduction facility
S3, EOS and HDFS
- EOS [1]: uses a namespace; files are replicated; optimized for ROOT file access; a storage service. (Diagram: two EOS nodes, each holding a replica of File 1.)
- HDFS: uses a namespace; files are chunked into equal blocks and replicated; a compute service. (Diagram: three HDFS datanodes holding blocks 1-3, each block replicated on two nodes.)
- S3: objects are replicated; objects can have custom ACLs and metadata; a storage service. (Diagram: Ceph nodes holding Object ID1 and its "file" metadata.)

[1] AJ Peters and J Lukasz, "Exabyte Scale Storage at CERN", 2011.
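The HDFS layout pictured above can be sketched in a few lines. This is illustrative only, not HDFS's actual placement policy (which is rack-aware): chunk a file into fixed-size blocks and assign each block to `replication` datanodes round-robin.

```python
# Illustrative sketch of HDFS-style chunking and replication.
# Node names are placeholders; real HDFS placement is rack-aware.

BLOCK_SIZE = 128 * 1024 * 1024  # common default HDFS block size, 128 MiB

def place_blocks(file_size, datanodes, replication=3, block_size=BLOCK_SIZE):
    """Return {block_index: [datanodes holding a replica]}."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(n_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

# A 300 MiB file on three datanodes -> 3 blocks, each replicated 3x:
layout = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3"])
```

The contrast with EOS and S3 is that those services replicate whole files or objects rather than equal-size blocks, which is one source of the replica-skew effects discussed on the next backup slide.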
Kubernetes and Data Processing
- Allows Spark developers to build isolated and reproducible environments (software packages, libraries, mounted ConfigMaps and volumes) with the use of containers
- Scales horizontally; a self-healing system
- Multi-cloud support: Kubernetes federation across clouds
- Elastic resource provisioning in the cloud; resource autoscaling in public clouds
- Large and industry-grade community
- Automates deployment, scaling, and management of containerized applications
"Dimuon Mass Calculation" in YARN: ROOT-files dataset skew in HDFS vs mass storage
- Our 20 TB dataset (6100 ROOT files) was stored both in HDFS and in the EOS mass storage
- In HDFS, most servers held a large number of block replicas for the dataset
- In EOS, most servers held a small number of file replicas for the dataset; a large number of replicas sat only on a small subset of servers
- It is not easy to keep HDFS balanced at mass scale and high write rates