Transcript of IT-DB-SAS Internal Training, 22 November 2018
Piotr Mrowczynski
Cloud-native Spark on Kubernetes
IT-DB-SAS Internal Training
Credit - https://www.edureka.co/blog/hadoop-ecosystem
What is Apache Spark?
Apache Spark is an open-source distributed and parallel processing framework that allows for sophisticated analytics, real-time streaming and machine learning on large datasets
Spark K8S Operator Demo: cern.ch/spark-user-guide
[Diagram: Spark Operator deployment (a long-running operator for batch) on the cloud infrastructure, with Kubernetes as resource manager and Spark drivers/executors running as pods (containers)]
- Create cluster (Ref: Spark/K8S Cluster Admin Guide)
Let's kick off "Create cluster", as it can take up to 10 minutes.
Current state of the art for Apache Spark at CERN
• Spark running on top of HDFS/YARN (a distributed filesystem for big data and a cluster resource manager). Processing usually happens on the same cluster of machines as storage (data locality).
• Physical machines are allocated, which means no resource elasticity, no isolation between users, and compute coupled with data storage.
• Stable workloads from monitoring, security, and experiment operations (NXCals), with data in HDFS/Kafka.
• Ad-hoc workloads from physics communities interested in analysing data at scale using external storages (EOS/S3). Physics analysis is limited to the capacity and resources of the shared Spark/Hadoop cluster; some workloads (>100 TB, possibly 1 PB) are not possible on the shared cluster without a major upgrade or disruption of other workflows.
Use-case investigated
• Ad-hoc workloads from physics communities interested in analysing data at scale using external storages (EOS/S3).
Data and operational challenges
1. Growth of data brings new challenges
Mass Storage of LHC data for analysis at CERN - EOS
Currently >250 PB; 12.3 PB ingested in October 2017 alone
2. Reproducible analysis
Physics analysis has to be reproducible! YARN clusters are difficult to provision, and difficult to set up with the required environment.
Addressing the challenges
Literature study
"Cloud-native is an approach to building and running applications that exploits the advantages of the cloud computing model" [1]

Companies have attempted cloud-native Spark deployments using Kubernetes; a native resource-scheduler implementation had its experimental release in Apache Spark in March 2018.

Research from Google supports the hypothesis that, in cluster computing, disk locality is becoming irrelevant.

Databricks and Accenture Labs benchmarked Spark with data in external storages against HDFS.

[1] "Kubernetes becomes the first project to graduate from the Cloud Native Computing Foundation." https://www.forbes.com/sites/janaki-rammsv/2018/03/07/kubernetes-becomes-the-first-project-to-graduate-from-the-cloud-native-computing-foundation.
Separating Compute and Storage for Elastic Big Data

Data locality (Spark on HDFS):
- assumes disk bandwidth is higher than storage throughput over the network
- limited elasticity (both for computation and storage)
- avoids high network bandwidth for analysis on large datasets

Data external, over the network (Spark with external storage: EOS, S3, GCS + Kafka, HDFS):
- assumes mass storage throughput is higher than cluster disk bandwidth
- allows compute not to be bound to storage (with cloud elasticity)
- assumes the network delivers data to the CPU faster than local disks; may generate high traffic
Possible solution for storage elasticity
- Mass storage services are easier/cheaper to scale.
- Data stored on disk can be large, while compute nodes can be adjusted to compute needs.
Possible solution for storage and compute elasticity, reproducibility
- OpenStack CaaS (Containers-as-a-Service) with Kubernetes provides compute elasticity and an authentication mechanism (certificates).
- Separation of monolithic services simplifies operational effort.
- Plain spark-submit against Kubernetes only allows creating drivers and executors (not so friendly to manage).
- The Spark K8S Operator brings the feature parity known from YARN (managing Spark applications).
- The Spark Operator gives idiomatic submission, application restart policies, cron support, custom volume/config/secret mounts, and pod affinity/anti-affinity [1].

[1] https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
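For illustration, the custom resource the operator watches has roughly the following shape. This is a sketch only: the field names follow the upstream spark-on-k8s-operator project of that era, and the image name and file paths are placeholders, not the service's actual values.

```python
import json

# Rough shape of a SparkApplication custom resource, as consumed by the
# spark-on-k8s-operator. Field names are approximate; image and paths
# are placeholders.
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1alpha1",
    "kind": "SparkApplication",
    "metadata": {"name": "spark-pi", "namespace": "default"},
    "spec": {
        "type": "Scala",
        "mode": "cluster",                        # driver runs inside K8s
        "image": "registry.example/spark:2.3.0",  # placeholder image
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples.jar",
        "restartPolicy": {"type": "Never"},       # operator-managed restarts
        "driver": {"cores": 1, "memory": "1g"},
        "executor": {"instances": 2, "cores": 1, "memory": "1g"},
    },
}

manifest = json.dumps(spark_app, indent=2)  # kubectl accepts JSON as well as YAML
```

Creating, updating, or deleting such a resource is what triggers the operator's event handling described later in the deck.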
Containers-as-a-Service for compute elasticity

[Diagram: Spark Operator deployment (a long-running operator for batch) on the cloud infrastructure, with Kubernetes as resource manager and Spark drivers/executors running as pods (containers)]
- openstack coe cluster: Create/Resize/Delete/GetConfiguration for Kubernetes clusters
- Helm/kubectl: Create/Delete/Update the Spark Operator to control batch/stream jobs
- sparkctl: Submit batch/stream jobs to K8S via the Operator (cluster mode)
- spark-submit: Create the driver on behalf of the user (customised by the operator)
- Container Registry: Fetch containers with the required software
Workflow Reproducibility

[Diagram: the same Dockerfiles, with required software packages and configurations, deployed both on cloud infrastructure and on bare-metal infrastructure, each with Kubernetes as resource manager]
- Kubernetes and the use of containers allow Spark developers to build isolated and reproducible environments
- Encapsulate software packages and libraries; mount ConfigMaps and volumes
Spark K8S Operator Demo: cern.ch/spark-user-guide
[Diagram: Spark Operator deployment (a long-running operator for batch) on the cloud infrastructure]
- Create cluster (Ref: Spark/K8S Cluster Admin Guide)
- Install Spark Operator (Ref: Spark/K8S Cluster Admin Guide)
- Submit jobs (Ref: Spark/K8S User Guide)
Spark K8S Operator: Submitting an application
- The Spark Operator watches for create/delete/update events on SparkApplication resources.
- The Spark Operator executes spark-submit with the required configurations.
- Data is uploaded to / read from an external storage service; monitoring is realised with internal/external monitoring tools.
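Under the hood, the operator shells out to spark-submit. A sketch of the kind of invocation it assembles for a cluster-mode job, using the Kubernetes-related configuration keys introduced in Spark 2.3 (the API server URL, image, and jar path here are placeholders):

```python
# Build the spark-submit command line for a cluster-mode job on K8s.
# "spark.kubernetes.container.image" and "--master k8s://..." are the
# Spark 2.3 Kubernetes settings; URL, image and jar are placeholders.
def build_spark_submit(master_url, image, app_jar, main_class, executors=2):
    conf = {
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
    }
    cmd = ["spark-submit",
           "--master", f"k8s://{master_url}",
           "--deploy-mode", "cluster",
           "--class", main_class]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)  # application jar comes last
    return cmd

cmd = build_spark_submit("https://kube-api.example:6443",
                         "registry.example/spark:2.3.0",
                         "local:///opt/spark/examples/jars/spark-examples.jar",
                         "org.apache.spark.examples.SparkPi")
```

The operator's value-add, as the previous slides note, is everything around this command: restart policies, cron scheduling, and mounting of volumes, configs, and secrets.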
Spark K8S Operator: Running an application
- The Spark Application Controller handles restarts and resubmissions.
- The Submission Runner executes spark-submit.
- The Spark Pod Monitor reports pod updates to the controller.
- A Mutating Admission Webhook handles customising Docker images and node affinities.
OpenStack Containers-as-a-Service
- Magnum: responsible for Kubernetes deployment
- Heat: composition of applications in the cloud
- Nova: resource provisioning (Nova VMs, or Nova Ironic for bare metal)
- Neutron: network provisioning
Spark K8S Operator Demo: cern.ch/spark-user-guide
- Create cluster (Ref: Spark/K8S Cluster Admin Guide)
- Install Spark Operator (Ref: Spark/K8S Cluster Admin Guide)
- Submit jobs (Ref: Spark/K8S User Guide)
For Spark-Pi you are ready to go. For TPC-DS, go to the IT-DB KeePass to get the CS3.CERN.CH (S3) credentials.
Evaluation of Spark on Kubernetes
Evaluating Persistent Storage for Cloud-native Spark (data over the network)

Feature                     | Network Volume Mounts                          | Native Storage Connectors | Object Storage Connectors          | Monitoring Storage
Example                     | CephFS, CVMFS, AzureDisk, EFS                  | HDFS, EOS, Kafka          | GCS, S3, Azure                     | InfluxDB, Stackdriver, Prometheus
Use-case                    | Software packages, streaming checkpoints, logs | Data processing           | Data processing, logs, checkpoints | Events, statistics
Spark transactional writes  | No                                             | Yes                       | Possible                           | -
Eventual-consistency writes | No                                             | No                        | Yes                                | -
Memory Management and Cloud-native Spark: How Kubernetes fits Spark memory management
- The kubelet monitors the memory and disk available to the node, to prevent node instability.
- On MemoryPressure or a system out-of-memory condition, it reclaims resources from running containers (OOMKilled errors).
- MemoryOverheadFactor reserves memory for off-heap allocations, for non-JVM processes (e.g. a higher limit in the case of Python), and for the processes required to operate the container.
- OOMKills are tuned via spark.memory.fraction and spark.memory.storageFraction.
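To make the knobs above concrete, here is a back-of-the-envelope sketch of how they compose, using the default values from Spark of this era (overhead factor 0.10 with a 384 MiB floor, 300 MiB of reserved JVM memory, spark.memory.fraction=0.6, spark.memory.storageFraction=0.5). This is illustrative arithmetic, not the exact JVM accounting:

```python
# Sketch: how Spark's memory settings map onto a Kubernetes pod limit.
# Defaults assumed: memory overhead factor 0.10 (min 384 MiB),
# 300 MiB reserved JVM memory, spark.memory.fraction = 0.6,
# spark.memory.storageFraction = 0.5. All sizes in MiB.

def pod_memory_limit(executor_memory_mib, overhead_factor=0.10):
    """Memory the K8s pod must be granted: JVM heap + overhead.
    Exceeding this limit is what produces OOMKilled errors."""
    overhead = max(384, int(executor_memory_mib * overhead_factor))
    return executor_memory_mib + overhead

def unified_memory(executor_memory_mib, fraction=0.6, storage_fraction=0.5):
    """Split the heap into the unified (execution + storage) pool."""
    usable = executor_memory_mib - 300       # minus reserved memory
    unified = usable * fraction              # spark.memory.fraction
    storage = unified * storage_fraction     # spark.memory.storageFraction
    return unified, storage

# 7 GiB executors, as in the TPC-DS setup later in this deck:
limit = pod_memory_limit(7 * 1024)           # 7168 + 716 = 7884 MiB
unified, storage = unified_memory(7 * 1024)  # ~4121 MiB unified, ~2060 MiB storage
```

Raising spark.memory.fraction shrinks the headroom left for user data structures on the heap, which is one way overly aggressive tuning ends in OOMKills.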
Persistent Storage for Cloud-native Spark: Large-scale data reduction in Kubernetes
"Dimuon Mass Calculation on 20 TB, max 180 executors"
[Plots: job execution duration trend and throughput trend]
- After optimisations (read-ahead, speculation): very high resource utilisation and a significant improvement in execution time
- The workload scaled near-linearly with an increasing number of executors
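The physics kernel behind this reduction is tiny: in the massless approximation, a muon pair's invariant mass is m = sqrt(2 pT1 pT2 (cosh Δη − cos Δφ)). A self-contained sketch of that formula (the production job instead reads ROOT files at scale, e.g. via the hadoop-xrootd connector mentioned later):

```python
import math

def dimuon_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of a muon pair, massless approximation:
    m^2 = 2 * pT1 * pT2 * (cosh(eta1 - eta2) - cos(phi1 - phi2))."""
    return math.sqrt(2.0 * pt1 * pt2 *
                     (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2)))

# Two back-to-back 45.6 GeV muons reconstruct to ~91.2 GeV (the Z peak):
m = dimuon_mass(45.6, 0.0, 0.0, 45.6, 0.0, math.pi)
```

In the Spark job this function is just the per-row map applied to candidate muon pairs; the 20 TB problem is entirely in the I/O, which is why the read-ahead and speculation tuning dominates the results above.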
Persistent Storage for Cloud-native Spark: Large-scale data reduction in Kubernetes
"Dimuon Mass Calculation on 20 TB, max 180 executors"
- InfluxDB was used to monitor cluster usage; CPU/network usage was fairly stable
Network impact of over-tuning the read-ahead
"Dimuon Mass Calculation with max available resources"
- 20 GB/s cluster read throughput for the duration of the calculation, across 500 VMs (>200 hypervisors), with a high read-ahead parameter
Overhead of Cloud-native Spark: An attempt to assess what a user can expect with our on-premise YARN and cloud K8S
- Very small scale: 100 GB EOS TPC-DS SQL benchmark
- Spark 2.3, 8 executors, 2 CPU / 7 GB RAM each
- Metrics collected per query and exported using sparkMeasure [1]

[1] https://github.com/cerndb/sparkMeasure
Target use cases / workloads for Spark on Kubernetes
- Physics analysis with Spark:
  - Physics analysis with the EP-SFT (ROOT team) RDataFrame approach
  - CMS Big Data Project with the hadoop-xrootd connector and the spark-root package
- Usage of Spark as a compute engine to analyse data stored in external storage systems (non-physics)
- Stable streaming workloads with Kafka
Service Deliverables / Future work
- Contribution to the upstream Spark on Kubernetes operator
- Helm charts to deploy the Spark operator, customising Spark for CERN infrastructure / use cases
  - https://gitlab.cern.ch/db/spark-service/spark-service-charts
- Monitoring of Spark workloads for efficient usage of compute resources
- Consultancy for onboarding users, and curated examples for various use cases
- Support of Spark on Kubernetes
- Easier interaction with the service: a web application? SWAN?
Service Deliverables / Future work
- Spark Streaming, and mounting CVMFS to driver/executor
- Monitoring of Spark-on-Kubernetes workloads
- Easier interaction with the service for cluster creation and Spark job submission; integration with Jupyter Notebooks (CERN SWAN) for interactive/batch analysis
- Working with the user community on the adoption of the new compute model
- Federated multi-cloud clusters?
- Benchmarking resource scheduling (start-up time of K8S and YARN for provisioning and setting up executors)?
Conclusions
- OpenStack provisions Kubernetes clusters of tens of nodes in an automated fashion (CaaS).
- Kubernetes allows reproducible Spark with limited operational effort, using Docker containers and microservices.
- Plain spark-submit against Kubernetes only allows creating drivers and executors (not so friendly to manage).
- The Spark K8S Operator provides management of Spark applications similar to that of the YARN ecosystem.
Conclusions

Feature  | Network Volume Mounts                  | Native Storage Connectors | Object Storage Connectors                                                | Monitoring Storage
Example  | CephFS, CVMFS                          | EOS, Kafka                | S3 [1]                                                                   | InfluxDB
Use-case | In evaluation for logs and checkpoints | Data processing           | Logs, checkpoints (streaming use case), data processing (in evaluation)  | Events, statistics

- Native storage connectors are best for data processing
- Network volume mounts and object storage connectors are good for logs and checkpoints
- InfluxDB is good for monitoring data
Conclusions
- Spark/K8S, like Spark/YARN (both as native resource schedulers), handles sustained ad-hoc large-scale workloads from external storages well. Spark/YARN had ~40 nodes; Spark/K8S had 1000 VMs (~200 nodes).
- Without data locality, the network can be a serious bottleneck.
Conclusions
- Kubernetes, like YARN, controls executor memory; similarly, proper tuning is needed to avoid OOMKilled errors.
- With proper data-access tuning and speculation enabled, adding resources (executors) scaled the workload near-linearly.
- TPC-DS showed the overhead of cloud-native Spark/K8S compared to Spark/YARN in a similar workload setup; the shared Spark/YARN cluster had spikes due to other heavy users.
Thank you! Questions?
Special thanks for support to my supervisors at CERN: Prasanth Kothuri, Vaggelis Motesnitsalis
Backup slides
State-of-the-art Conclusions
[Quadrant diagram positioning the deployments along several axes: data locality vs data externality, batch/stream vs interactive, cluster vs datacenter networking, and on-premise/performance-effective vs cloud-native/cost-effective. Spark/YARN with HDFS sits on the data-locality, on-premise side; Spark/YARN with external storage in between; the Spark/K8S Operator on the data-external, cloud-native side; Spark/K8S marked as "Future?"]
• Stable streaming workloads with data in Kafka
• Ad-hoc workloads from physics and other communities interested in analyzing data at scale using external storages (EOS/S3).
Example of a large-scale ad-hoc / external-storage workload
- Physics data mining: detect particle interactions (data), compare with theory predictions (simulation)
- Particle detection analysis
- Large-scale data reduction facility
S3, EOS and HDFS
- EOS [1]: uses a namespace; files are replicated; optimized for ROOT file access; a storage service. (Diagram: two EOS nodes, each holding a replica of File 1.)
- HDFS: uses a namespace; files are chunked into equal blocks and replicated; a compute service. (Diagram: three HDFS datanodes holding blocks 1-3, each block replicated on two nodes.)
- S3: objects are replicated; objects can have custom ACLs and metadata; a storage service. (Diagram: Ceph nodes holding Object ID1 and its "file" metadata.)

[1] AJ Peters and J Lukasz, "Exabyte Scale Storage at CERN", 2011.
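The HDFS layout pictured above can be sketched in a few lines. This is illustrative only, not HDFS's actual placement policy (which is rack-aware): chunk a file into fixed-size blocks and assign each block to `replication` datanodes round-robin.

```python
# Illustrative sketch of HDFS-style chunking and replication.
# Node names are placeholders; real HDFS placement is rack-aware.

BLOCK_SIZE = 128 * 1024 * 1024  # common default HDFS block size, 128 MiB

def place_blocks(file_size, datanodes, replication=3, block_size=BLOCK_SIZE):
    """Return {block_index: [datanodes holding a replica]}."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(n_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

# A 300 MiB file on three datanodes -> 3 blocks, each replicated 3x:
layout = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3"])
```

The contrast with EOS and S3 is that those services replicate whole files or objects rather than equal-size blocks, which is one source of the replica-skew effects discussed on the next backup slide.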
Kubernetes and Data Processing
- Allows Spark developers to build isolated and reproducible environments (software packages, libraries, mounted ConfigMaps and volumes) with the use of containers
- Scales horizontally; a self-healing system
- Multi-cloud support: Kubernetes federation across clouds
- Elastic resource provisioning in the cloud; resource autoscaling in public clouds
- Large and industry-grade community
- Automates deployment, scaling, and management of containerized applications
"Dimuon Mass Calculation" in YARN: ROOT-files dataset skew in HDFS vs mass storage
- Our 20 TB dataset (6100 ROOT files) was stored both in HDFS and in the EOS mass storage
- In HDFS, most servers held a large number of block replicas for the dataset
- In EOS, most servers held a small number of file replicas for the dataset; a large number of replicas sat only on a small subset of servers
- It is not easy to keep HDFS balanced at mass scale and high write rates