Apache Spark in the Cloud
Zbyněk Roubalík
Senior Quality Engineer, Red Hat
February 15, 2018
Technologies
● Apache Spark
● Docker
● Kubernetes
● OpenShift
Apache Spark in the Cloud
aka
How to create and deploy Apache Spark applications to cloud native environments like OpenShift
What is cloud native?
● Containerized
● Dynamically orchestrated
● Microservice oriented
● www.cncf.io/about/faq
Containers
● A container image is a lightweight, stand-alone,
executable package of a piece of software that
includes everything needed to run it: code, runtime,
system tools, system libraries, settings.
● https://www.docker.com/what-container
VM vs Containers
Containers
● Cloud vs standard deployment model
● Pets vs Cattle
● Developers + Operations (Admins) → DevOps
● Docker
Kubernetes
● Container cluster manager
Kubernetes
● Backed by etcd, a distributed, clustered key-value store
● The smallest deployable unit is a Pod
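
To make the Pod idea concrete, here is a minimal sketch that builds a single-container Pod manifest as a Python dictionary and serializes it to JSON. The helper `pod_manifest`, the Pod name `hello`, and the image `example.com/hello:latest` are hypothetical and used only for illustration.

```python
import json


def pod_manifest(name: str, image: str) -> dict:
    """Build a minimal single-container Pod manifest (illustrative helper)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        # A Pod wraps one or more containers that are scheduled together.
        "spec": {"containers": [{"name": name, "image": image}]},
    }


# Hypothetical name and image, chosen only for this example.
manifest = pod_manifest("hello", "example.com/hello:latest")
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

A manifest like this could be saved to a file and handed to the cluster's command-line client; a real deployment would normally add labels, resource limits, and so on.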
OpenShift
● Open source container application platform
● Focused on applications (not just containers as a concept) and developer experience
OpenShift
● Sits on top of Kubernetes
● Source code, build, and deployment management
● S2I (Source-to-Image)
● Application lifecycle management (CI/CD)
● Service catalog (language runtimes, middleware, databases)
● Security
OpenShift architecture
Apache Spark
● Fast and general engine for large-scale data processing
● Distributed computation system
● Provides high-level APIs in Java, Scala, Python and R
● Supports a rich set of tools for Big Data, AI, ML
  ● Spark SQL for SQL and structured data processing
  ● MLlib for machine learning
  ● GraphX for graph processing
  ● Spark Streaming
  ● ...
General Spark architecture
How to interact with Spark
● Run an application
● Start a REPL
  ● Scala
  ● Python
  ● R
The fundamental Spark abstraction
Resilient distributed dataset (RDD)
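● RDDs are partitioned, lazy, and immutable homogeneous collections
  ● partitioned: the data is split into chunks spread across the cluster
  ● lazy: transformations are only computed when an action needs a result
  ● immutable: a transformation returns a new RDD instead of changing the existing one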
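
The three properties can be illustrated with a toy model in plain Python. This is a sketch for intuition only: `ToyRDD` is a hypothetical class, not Spark's implementation, and it runs on a single machine. It mimics the shape of the RDD API (`parallelize`, `map`, `filter`, `count`, `collect`) so the behavior is easy to compare.

```python
from typing import Callable, Iterable, List, Tuple


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, NOT Spark's implementation)."""

    def __init__(self, partitions: List[list], ops: Tuple = ()):
        self._partitions = partitions  # partitioned: data lives in chunks
        self._ops = tuple(ops)         # lazy: steps are recorded, not executed

    @classmethod
    def parallelize(cls, data: Iterable, num_partitions: int) -> "ToyRDD":
        items = list(data)
        # Deal the items round-robin into num_partitions chunks.
        parts = [items[i::num_partitions] for i in range(num_partitions)]
        return cls(parts)

    # Transformations return a NEW ToyRDD (immutable) and do no work yet (lazy).
    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(self._partitions, self._ops + (("map", fn),))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(self._partitions, self._ops + (("filter", pred),))

    def _compute(self, part: list) -> list:
        # Replay the recorded steps on one partition.
        for kind, fn in self._ops:
            if kind == "map":
                part = [fn(x) for x in part]
            else:
                part = [x for x in part if fn(x)]
        return part

    # Actions trigger the actual computation, partition by partition.
    def count(self) -> int:
        return sum(len(self._compute(p)) for p in self._partitions)

    def collect(self) -> list:
        return [x for p in self._partitions for x in self._compute(p)]


calls = []  # records every element the mapped function actually touches


def noisy_double(x: int) -> int:
    calls.append(x)
    return x * 2


numbers = ToyRDD.parallelize(range(10), 3)
doubled = numbers.map(noisy_double)        # nothing has been computed yet
calls_before_action = len(calls)
doubled_count = doubled.count()            # the action runs the map now
evens = numbers.filter(lambda x: x % 2 == 0)
```

In this sketch `numbers` is never modified: `doubled` and `evens` are separate objects that share the same partitions and differ only in their recorded steps.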
Resilient distributed dataset in action
What is a Spark application?
simple.py application
● Counts the even numbers in a dataset
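
The slide's code is not preserved in this transcript, so the following is only a sketch of what an even-number count could look like in PySpark, not the author's original `simple.py`. The input range, the application name, and the helper names `is_even`, `count_even`, and `main` are assumptions. The `pyspark` import is kept inside `main()` so the counting logic can be read and tested without a Spark installation.

```python
def is_even(n: int) -> bool:
    """Predicate applied to every element of the dataset."""
    return n % 2 == 0


def count_even(rdd) -> int:
    """Count the even elements of an RDD.

    filter() is a lazy transformation; count() is the action that
    triggers the distributed computation.
    """
    return rdd.filter(is_even).count()


def main() -> None:
    # Imported here so the helpers above work without PySpark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="simple")  # application name is an assumption
    try:
        numbers = sc.parallelize(range(1, 1001))  # assumed sample input
        print("Even numbers:", count_even(numbers))
    finally:
        sc.stop()
```

To run this against a cluster, add a call to `main()` at the end of the file and submit it with `spark-submit`.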
A slightly more complex application
Designing a Spark microservice
On-demand batch processing
Continuous batch processing
Stream processing
OpenShift architecture - recall
Spark on OpenShift
Oshinko - Integrating Spark and OpenShift
Demo time
Takeaways
● Containers
● Kubernetes
● OpenShift
● Apache Spark
● Oshinko tooling
Thank you!
www.github.com/radanalyticsio