Big Data SMACK: A Guide to Apache Spark, Mesos, Akka, Cassandra, and Kafka
© 2017 Mesosphere, Inc. All Rights Reserved.
June 2017
DC/OS AND FAST DATA (THE SMACK STACK)
Benjamin Hindman - @benh
Elizabeth K. Joseph - @pleia2
TRADITIONAL APPLICATION
MODERN APPLICATION
Data
Service
Latency
Users: 3+ billion internet & smartphone users
Data growth (CAGR): 40%+
Source: KPCB Internet Trends 2016, EMC Digital Universe 2014
ARCHITECTURAL SHIFT
TODAY’S REINFORCING TRENDS
MICROSERVICES
CONTAINERIZATION
CONTAINER ORCHESTRATION
BIG DATA & ANALYTICS
TODAY’S REINFORCING TRENDS
MICROSERVICES
CONTAINERIZATION
CONTAINER ORCHESTRATION
FAST BIG DATA & ANALYTICS
FROM BIG DATA TO FAST DATA
Batch | Micro-Batch | Event Processing
Days | Hours | Minutes | Seconds | Microseconds
Reports what has happened using descriptive analytics
Solves problems using predictive and prescriptive analytics
● Billing, Chargeback
● Product recommendations
● Real-time Pricing and Routing
● Real-time Advertising
● Predictive User Interface
ON THE EDGE, AND STILL REALLY BIG!
A380-1000: 10,000 sensors in each wing; produces more than 7Tb of IoT data per day [1]
[1] https://goo.gl/2S4q5N
MODERN APPLICATION -> FAST DATA BUILT-IN
Data Ingestion
Request/Response
Devices
Client
Sensors
Message Queue/Bus
Microservices
Distributed Storage
Analytics (Streaming)
Use Cases:
● Anomaly detection
● Personalization
● IoT Applications
● Predictive Analytics
● Machine Learning
THE FOUNDATIONS OF FAST DATA
Sensors & Sources
Data Processing
Modern Apps
Message Queue
MESSAGE QUEUES
Message Brokers
● Apache Kafka
● ØMQ, RabbitMQ, Disque
Log-based Queues
● fluentd, Logstash, Flume
see also queues.io
APACHE KAFKA
Typical Use: A reliable buffer for stream processing
Why Kafka?
● High-throughput, distributed, persistent publish-subscribe messaging system
● Created by LinkedIn; used in production by 100+ web-scale companies [1]
[1] https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
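To make "persistent publish-subscribe" concrete, here is a minimal in-memory sketch of an append-only log with per-subscriber offsets. It does not use the Kafka client API; the class `TopicLog` and its methods `publish`, `poll`, and `commit` are hypothetical names chosen for illustration, and a real broker adds partitions, replication, and durable storage.

```python
from collections import defaultdict


class TopicLog:
    """A toy, in-memory stand-in for one partition of a persistent log (hypothetical)."""

    def __init__(self):
        self._records = []                 # append-only; reading never removes records
        self._offsets = defaultdict(int)   # next offset to read, per consumer group

    def publish(self, record):
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return unread records for a group without advancing its offset."""
        start = self._offsets[group]
        return self._records[start:start + max_records]

    def commit(self, group, count):
        """Advance the group's offset after it has processed `count` records."""
        self._offsets[group] = min(self._offsets[group] + count, len(self._records))


log = TopicLog()
for reading in ("temp=20", "temp=21", "temp=22"):
    log.publish(reading)

# Two independent subscribers each read the stream at their own pace.
analytics_batch = log.poll("analytics")
billing_batch = log.poll("billing", max_records=2)
log.commit("billing", len(billing_batch))
```

Because records stay in the log after they are read, a slow or restarted subscriber can pick up where its committed offset left off, which is what makes the log usable as a buffer in front of a stream processor.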
DELIVERY GUARANTEES
● At most once—Messages may be lost but are never redelivered
● At least once—Messages are never lost but may be redelivered (Kafka)
● Exactly once—Messages are delivered once and only once (this is what everyone actually wants, but no one can deliver!)
Murphy’s Law of Distributed Systems:
Anything that can go wrong, will go wrong … partially!
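The difference between these guarantees can be sketched with a small simulation. The code below is an illustration under stated assumptions, not how any particular broker works: `deliver_at_least_once` and `process_idempotently` are hypothetical helpers, and the "lost acknowledgement" is modelled as a simple set of message ids. It shows at-least-once delivery producing a duplicate, and a consumer that deduplicates by message id so each message takes effect only once.

```python
def deliver_at_least_once(messages, ack_lost_on):
    """Simulate a sender that retries when it does not see an acknowledgement.

    `ack_lost_on` is a set of message ids whose first acknowledgement is lost,
    so the sender redelivers those messages once.
    """
    delivered = []
    for msg_id, payload in messages:
        delivered.append((msg_id, payload))       # first delivery
        if msg_id in ack_lost_on:
            delivered.append((msg_id, payload))   # redelivery: the ack never arrived
    return delivered


def process_idempotently(deliveries):
    """Apply each message id at most once by remembering the ids already seen."""
    seen = set()
    applied = []
    for msg_id, payload in deliveries:
        if msg_id in seen:
            continue          # duplicate caused by a retry; skip it
        seen.add(msg_id)
        applied.append(payload)
    return applied


messages = [(1, "debit 10"), (2, "debit 5"), (3, "credit 7")]
deliveries = deliver_at_least_once(messages, ack_lost_on={2})
applied = process_idempotently(deliveries)
```

Here `deliveries` contains message 2 twice, while `applied` contains each payload once. Deduplicating on the consumer side is one common way to get an exactly-once effect on top of at-least-once delivery.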
STREAMING ANALYTICS
Microbatching
● Apache Spark (Streaming)
Native Streaming
● Apache Flink
● Apache Storm/Heron
● Apache Apex
● Apache Samza
APACHE SPARK (STREAMING)
Typical Use: distributed, large-scale data processing; micro-batching
Why Spark Streaming?
● Micro-batching processes the stream in small, frequent batches, which keeps latency low
● Well-defined role means it fits in well with other pieces of the pipeline
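Micro-batching can be sketched without Spark: events are grouped by arrival time into fixed-width intervals, and each interval is then processed as a small batch. The function `micro_batches`, the one-second interval, and the sample events are assumptions made for illustration; this is not the Spark Streaming API.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, value) events into fixed-width batches.

    Returns a list of (batch_start, values) pairs ordered by batch_start.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_start = (timestamp // batch_interval) * batch_interval
        batches[batch_start].append(value)
    return sorted(batches.items())


# Hypothetical sensor readings: (arrival time in seconds, value).
events = [(0.2, 3), (0.7, 5), (1.1, 2), (2.4, 8), (2.9, 1)]

# Each batch is processed as a unit; here the "job" is a simple sum.
per_batch_sums = [(start, sum(values)) for start, values in micro_batches(events, 1.0)]
```

The batch interval is the main tuning knob in this model: a shorter interval lowers latency but leaves less work per batch to amortize scheduling overhead.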
DISTRIBUTED STORAGE
NoSQL
● ArangoDB
● MongoDB
● Apache Cassandra
● Apache HBase
SQL
● MemSQL
Filesystems
● Quobyte
● HDFS
Time-Series Datastores
● InfluxDB
● OpenTSDB
● KairosDB
● Prometheus
see also iot-a.info
APACHE CASSANDRA
Typical Use: No-dependency, time series database
Why Cassandra?
● A top-level Apache project born at Facebook and built on ideas from Amazon’s Dynamo and Google’s Bigtable
● Offers continuous availability, linearly scalable performance, operational simplicity and easy data distribution
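One idea behind "easy data distribution" in Dynamo-style stores is consistent hashing: nodes and keys are hashed onto a ring, and each key belongs to the next node around the ring. The sketch below is a simplified illustration and not Cassandra's actual partitioner; the `HashRing` class, the MD5-based `token` function, and the node names are assumptions, and real deployments add virtual nodes and replication.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (an integer derived from MD5)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """A minimal consistent-hash ring with one token per node (hypothetical)."""

    def __init__(self, nodes):
        self._ring = sorted((token(node), node) for node in nodes)

    def add(self, node):
        bisect.insort(self._ring, (token(node), node))

    def owner(self, key):
        """Return the first node at or after the key's token, wrapping around."""
        tokens = [t for t, _ in self._ring]
        index = bisect.bisect_left(tokens, token(key)) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(["n1", "n2", "n3"])
keys = [f"sensor-{i}" for i in range(100)]
before = {key: ring.owner(key) for key in keys}

ring.add("n4")
after = {key: ring.owner(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```

Every key in `moved` now belongs to `n4`; keys owned by the other nodes stay where they were. That is the property that lets a cluster grow without reshuffling all of its data.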
A GOOD STACK ...
Data Ingestion
Request/Response
Devices
Client
Sensors
Use Cases:
● Anomaly detection
● Personalization
● IoT Applications
● Predictive Analytics
● Machine Learning
Message Queue/Bus
Microservices
Distributed Storage
Analytics (Streaming)
how do we operate these distributed systems?
most organizations have many stateless, independent (micro)services; the distributed systems I’m talking about here are ...
how do we scale the operations of distributed systems?
SMACK STACK
Apache Spark: distributed, large-scale data processing
Apache Mesos: cluster resource manager
Akka: toolkit for message driven applications
Apache Cassandra: distributed, highly-available database
Apache Kafka: distributed, highly-available messaging system
distributed systems are hard to operate
Bare metal storage (or someone else’s problem)
Drive down job latency and drive up utilization
Run multiple versions simultaneously
Upgrade complicated data systems
DATA & ANALYTICS DAY 2 OPS CHALLENGES
1: download
2: deploy
3: monitor
4: maintain
5: upgrade → goto 1
fault tolerance + high availability + latency + bandwidth + CPU/mem resources + … = configuration
[Diagram: cluster nodes n1 through n9]
(1) express
(2) orchestrate
first, debug ...
second, fix (scale, patch, etc) ...
then, debug again ...
finally, write code so it never happens again ...
thesis: distributed systems should (be able to) operate themselves; deploy, monitor, upgrade ...
why: (1) operators have inadequate knowledge of distributed system needs/semantics to make optimal decisions (even after reading the book)
why: (2) execution needs/semantics can’t easily or efficiently be expressed to the underlying system, and vice versa
[Diagram: cluster nodes n1 through n9; (1) express, (2) orchestrate]
configuration spectrum:
coarse-grained fine-grained
coarse-grained: easiest to express (how most of us would do it), but worst resource utilization
fine-grained: hardest to express (if even possible), but best resource utilization
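The utilization gap between the two ends of the spectrum can be illustrated with a small calculation. This is a toy model under stated assumptions: nine nodes, three hypothetical frameworks, and demand measured in whole nodes. The function names `static_utilization` and `shared_utilization` are made up for this sketch.

```python
def static_utilization(partition_sizes, demands):
    """Coarse-grained: each framework owns a fixed slice of the cluster.

    A framework can use at most its own slice, even if other slices sit idle.
    """
    used = sum(min(size, demand) for size, demand in zip(partition_sizes, demands))
    return used / sum(partition_sizes)


def shared_utilization(total_nodes, demands):
    """Fine-grained: any framework may use any free node."""
    return min(total_nodes, sum(demands)) / total_nodes


partitions = [3, 3, 3]   # nine nodes split evenly across three frameworks
demands = [6, 1, 1]      # nodes each framework could use right now

coarse = static_utilization(partitions, demands)
fine = shared_utilization(sum(partitions), demands)
```

With these numbers the static split uses 5 of 9 nodes while the shared pool uses 8 of 9. The price of the shared pool is that someone, or something, has to decide at runtime which framework gets which node.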
why can’t Hadoop decide this for me?
applications “operate” themselves on Linux; when an application needs to “scale up” it asks the operating system to allocate more memory or create another thread ...
application
operating system
syscall interface:
● memory allocate
● clone/fork
● create file
● read, write
● ...
once upon a time … before virtual memory
applications take physical memory address range as an input
configuration:
● [0x0,0x1) ➝ calc.exe
● [0x1,0x4) ➝ winmine.exe
● [0x4,0x8) ➝ notepad.exe
[Diagram: physical memory, 0x0 to 0x8]
how: distributed systems need an interface to communicate with the underlying system, and vice versa
distributed system interface:
● resource allocation
● launch container/VM
● create storage
● attach/detach storage
● ...
(operating) system
vice versa: the operating system should be able to call back into the application
learning from history … bidirectional interface
application
operating system
callback interface:
● paging/swapping
● CPU deallocation
● ...
better than LRU, ask the application what pages to swap!
search for ‘scheduler activations’ and ‘Lithe composition’
distributed systems need bidirectional interface too
distributed system
(operating) system
callback interface:
● container/VM failed
● resource deallocation
● ...
tell the distributed system about “planned failures” (i.e., maintenance)
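A bidirectional interface of this kind might look roughly like the sketch below. It is not the Apache Mesos API: `Cluster`, `DistributedSystem`, `ReplicatedStore`, and every method name are hypothetical, chosen only to show calls flowing down (launch a container) and callbacks flowing up (resources offered, container failed, maintenance planned).

```python
from abc import ABC, abstractmethod


class DistributedSystem(ABC):
    """Callbacks the underlying system invokes on the distributed system."""

    @abstractmethod
    def resources_offered(self, node, cpus): ...

    @abstractmethod
    def container_failed(self, container_id): ...

    @abstractmethod
    def maintenance_planned(self, node): ...


class Cluster:
    """Downward interface: what the distributed system may ask the cluster to do."""

    def __init__(self, system):
        self.system = system
        self.containers = {}   # container_id -> node
        self._next_id = 0

    def launch_container(self, node):
        self._next_id += 1
        self.containers[self._next_id] = node
        return self._next_id

    def offer(self, node, cpus):
        self.system.resources_offered(node, cpus)

    def fail(self, container_id):
        del self.containers[container_id]
        self.system.container_failed(container_id)

    def schedule_maintenance(self, node):
        self.system.maintenance_planned(node)


class ReplicatedStore(DistributedSystem):
    """A toy system that keeps a target number of replicas running."""

    def __init__(self, replicas_wanted):
        self.replicas_wanted = replicas_wanted
        self.cluster = None
        self.running = set()
        self.draining = set()

    def resources_offered(self, node, cpus):
        # Accept an offer only if a replica is missing and the node is not draining.
        if len(self.running) < self.replicas_wanted and node not in self.draining and cpus >= 1:
            self.running.add(self.cluster.launch_container(node))

    def container_failed(self, container_id):
        self.running.discard(container_id)

    def maintenance_planned(self, node):
        self.draining.add(node)   # a "planned failure": avoid placing replicas here


store = ReplicatedStore(replicas_wanted=2)
cluster = Cluster(store)
store.cluster = cluster

cluster.schedule_maintenance("n3")
for node in ("n1", "n3", "n2"):
    cluster.offer(node, cpus=4)
placed_before = sorted(cluster.containers.values())

# The replica on n1 fails; the store replaces it when the next offer arrives.
failed_id = [cid for cid, node in cluster.containers.items() if node == "n1"][0]
cluster.fail(failed_id)
cluster.offer("n4", cpus=4)
placed_after = sorted(cluster.containers.values())
```

The store skips `n3` because it was told about the planned maintenance, and it replaces the failed replica without an operator in the loop. The placement logic lives in the distributed system, which is the component that knows its own needs.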
distributed system
Apache Mesos
Dogfooding: Apache Spark
Apache Mesos
reality is people are (already) building software that operates distributed systems ...
goal: provide distributed system* as software as a service (SaaS) to the rest of your internal organization or to sell to external organizations
solution: a control plane built out of ad hoc scripts, ancillary services, etc, that deploy, maintain, and upgrade said SaaS
common pattern: ad hoc control planes
* e.g., analytics via Spark, message queue via Kafka, key/value store via Cassandra
https://github.com/crunchydata/crunchy-containers
$ kubectl create -f $LOC/kitchensink-master-service.json
$ kubectl create -f $LOC/kitchensink-slave-service.json
$ kubectl create -f $LOC/kitchensink-pgpool-service.json
$ envsubst < $LOC/kitchensink-sync-slave-pv.json | kubectl create -f -
$ envsubst < $LOC/kitchensink-master-pv.json | kubectl create -f -
$ kubectl create -f $LOC/kitchensink-sync-slave-pvc.json
$ kubectl create -f $LOC/kitchensink-master-pvc.json
$ envsubst < $LOC/kitchensink-master-pod.json | kubectl create -f -
$ envsubst < $LOC/kitchensink-slave-dc.json | kubectl create -f -
$ envsubst < $LOC/kitchensink-sync-slave-pod.json | kubectl create -f -
$ envsubst < $LOC/kitchensink-pgpool-rc.json | kubectl create -f -
$ kubectl create -f $LOC/kitchensink-watch-sa.json
$ envsubst < $LOC/kitchensink-watch-pod.json | kubectl create -f -
https://schd.ws/hosted_files/cnd2016/8d/Containerizing%20PostgreSQL%20and%20Making%20it%20Cloud%20Native%20Ready%20-%20Jeff%20McCormick.pdf
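The command sequence above is itself a small ad hoc control plane: thirteen steps whose order and templating rules live only in the operator's head or in a script. A minimal Python sketch of the same sequence as data follows. It is a dry run that only builds command strings and never contacts a cluster. The names `STEPS` and `build_commands` are hypothetical and are not part of crunchy-containers or kubectl.

```python
# Each entry: (manifest suffix, whether the manifest is templated via envsubst).
# The order mirrors the slide; nothing here talks to a real cluster.
STEPS = [
    ("master-service.json", False),
    ("slave-service.json", False),
    ("pgpool-service.json", False),
    ("sync-slave-pv.json", True),
    ("master-pv.json", True),
    ("sync-slave-pvc.json", False),
    ("master-pvc.json", False),
    ("master-pod.json", True),
    ("slave-dc.json", True),
    ("sync-slave-pod.json", True),
    ("pgpool-rc.json", True),
    ("watch-sa.json", False),
    ("watch-pod.json", True),
]


def build_commands(loc, prefix="kitchensink"):
    """Return the shell commands in slide order, as strings (dry run only)."""
    commands = []
    for suffix, templated in STEPS:
        path = f"{loc}/{prefix}-{suffix}"
        if templated:
            # Templated manifests are expanded by envsubst and piped to kubectl.
            commands.append(f"envsubst < {path} | kubectl create -f -")
        else:
            commands.append(f"kubectl create -f {path}")
    return commands


if __name__ == "__main__":
    for line in build_commands("$LOC"):
        print("$ " + line)
```

Writing the steps down this way makes the hidden assumptions visible (ordering, which files need substitution), but it does not remove them, which is the point the next slides make.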
![Page 68: DC/OS AND FAST DATA (THE SMACK STACK) June 2017mesosphere.github.io/presentations/2017-06-06-nyc... · Apache Spark: distributed, large-scale data processing Apache Mesos: cluster](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec8d6b120162707222fa0a5/html5/thumbnails/68.jpg)
what happens if there’s a bug in the control plane?
what if my control plane has diverged from yours?
what happens when a new release of the distributed system invalidates an assumption the control plane previously made?
![Page 69: DC/OS AND FAST DATA (THE SMACK STACK) June 2017mesosphere.github.io/presentations/2017-06-06-nyc... · Apache Spark: distributed, large-scale data processing Apache Mesos: cluster](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec8d6b120162707222fa0a5/html5/thumbnails/69.jpg)
control planes should be built into the distributed system itself by the experts who built the distributed system in the first place!
as an industry we should strive to build a standard interface that distributed systems can leverage
a better world ...
![Page 70: DC/OS AND FAST DATA (THE SMACK STACK) June 2017mesosphere.github.io/presentations/2017-06-06-nyc... · Apache Spark: distributed, large-scale data processing Apache Mesos: cluster](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec8d6b120162707222fa0a5/html5/thumbnails/70.jpg)
vice versa:
abstractions exist for good reasons, but without sufficient communication they force sub-optimal outcomes …
![Page 71: DC/OS AND FAST DATA (THE SMACK STACK) June 2017mesosphere.github.io/presentations/2017-06-06-nyc... · Apache Spark: distributed, large-scale data processing Apache Mesos: cluster](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec8d6b120162707222fa0a5/html5/thumbnails/71.jpg)
control planes should be built into distributed systems themselves by the experts who built the distributed system in the first place!
as an industry we should strive to build a standard interface distributed systems can leverage
our standard interface should be bidirectional to avoid sub-optimal outcomes
a better world ...
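One way to picture a bidirectional interface is a toy exchange in which the platform tells the service what it can offer and the service tells the platform what it still needs. The sketch below is only an illustration under that assumption. `Offer`, `Request`, and `ToyService` are hypothetical names and do not correspond to the Mesos or DC/OS APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    """Platform -> service: resources the platform can hand out right now."""
    node: str
    cpus: float


@dataclass
class Request:
    """Service -> platform: what the service would like next."""
    cpus: float
    reason: str


@dataclass
class ToyService:
    """A service that carries its own (very small) control plane."""
    target_cpus: float
    held: list = field(default_factory=list)

    def held_cpus(self):
        return sum(offer.cpus for offer in self.held)

    def on_offer(self, offer):
        """Accept an offer only if it does not overshoot the target."""
        if self.held_cpus() + offer.cpus <= self.target_cpus:
            self.held.append(offer)
            return True
        return False

    def wants(self):
        """Tell the platform what is still missing, or None if satisfied."""
        missing = self.target_cpus - self.held_cpus()
        if missing > 0:
            return Request(cpus=missing, reason="below target capacity")
        return None


if __name__ == "__main__":
    service = ToyService(target_cpus=4.0)
    print(service.on_offer(Offer("node-a", 2.0)))  # fits under the target
    print(service.on_offer(Offer("node-b", 3.0)))  # would overshoot, declined
    print(service.wants())                         # service reports what is missing
```

The decision logic lives with the service, and information flows both ways: the platform does not have to guess what the service needs, and the service does not have to guess what the platform has.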
![Page 72: DC/OS AND FAST DATA (THE SMACK STACK) June 2017mesosphere.github.io/presentations/2017-06-06-nyc... · Apache Spark: distributed, large-scale data processing Apache Mesos: cluster](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec8d6b120162707222fa0a5/html5/thumbnails/72.jpg)
how do we scale the operations of distributed systems?
![Page 73: DC/OS AND FAST DATA (THE SMACK STACK) June 2017mesosphere.github.io/presentations/2017-06-06-nyc... · Apache Spark: distributed, large-scale data processing Apache Mesos: cluster](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec8d6b120162707222fa0a5/html5/thumbnails/73.jpg)
let them scale themselves!
![Page 74: DC/OS AND FAST DATA (THE SMACK STACK) June 2017mesosphere.github.io/presentations/2017-06-06-nyc... · Apache Spark: distributed, large-scale data processing Apache Mesos: cluster](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec8d6b120162707222fa0a5/html5/thumbnails/74.jpg)
OPERATING SYSTEMS ARE FOR APPLICATIONS
DC/OS
Spark
Jenkins
Riak
DataStax Enterprise
Confluent Kafka
ArangoDB
GitLab
Cassandra
JFrog Artifactory
Elasticsearch
MariaDB
Storm
HDFS
Zeppelin
MemSQL
“SaaS” Experience using DC/OS
![Page 75: DC/OS AND FAST DATA (THE SMACK STACK) June 2017mesosphere.github.io/presentations/2017-06-06-nyc... · Apache Spark: distributed, large-scale data processing Apache Mesos: cluster](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec8d6b120162707222fa0a5/html5/thumbnails/75.jpg)
DC/OS SERVICE MANAGES ITS OWN UPGRADES
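The slide does not say how a service carries out its own upgrade. One plausible strategy is a rolling upgrade that halts at the first unhealthy node, sketched below in Python. The function name `rolling_upgrade` and the caller-supplied `is_healthy` check are hypothetical and are not taken from DC/OS.

```python
def rolling_upgrade(nodes, new_version, is_healthy):
    """Upgrade nodes one at a time; stop at the first unhealthy node.

    nodes: dict mapping node name -> current version (mutated in place).
    is_healthy: callable(node_name, version) -> bool, supplied by the caller.
    Returns the list of node names that were upgraded successfully.
    """
    upgraded = []
    for name in sorted(nodes):
        previous = nodes[name]
        nodes[name] = new_version
        if not is_healthy(name, new_version):
            nodes[name] = previous  # roll this node back and halt
            break
        upgraded.append(name)
    return upgraded


if __name__ == "__main__":
    cluster = {"n1": "1.0", "n2": "1.0", "n3": "1.0"}
    done = rolling_upgrade(cluster, "2.0", lambda name, version: name != "n2")
    print(done, cluster)
```

Because the check is written by the people who know the system, it can encode rules an external script would have to guess at, such as when a node counts as healthy after a version change.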
![Page 76: DC/OS AND FAST DATA (THE SMACK STACK) June 2017mesosphere.github.io/presentations/2017-06-06-nyc... · Apache Spark: distributed, large-scale data processing Apache Mesos: cluster](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec8d6b120162707222fa0a5/html5/thumbnails/76.jpg)
DC/OS: AVOIDING CLOUD LOCK-IN #2

| CAPABILITY | AWS | AZURE | GCP | DC/OS |
| --- | --- | --- | --- | --- |
| Object Storage | S3 | Blob Storage | Cloud Storage | |
| Block Storage | Elastic Block Storage (EBS) | Page Blobs, Premium Storage | GCE Persistent Disks | |
| File Storage | Elastic File System | File Storage | ZFS / Avere | |
| Relational | RDS | SQL Database | Cloud SQL (MySQL) | |
| NoSQL | DynamoDB | DocumentDB | Datastore, Bigtable | |
| Full Text Search | CloudSearch | Log Analytics, Search | N/A | |
| Hadoop / Analytics | Elastic Map Reduce (EMR) | HDInsight | Dataproc, Dataflow | |
| Stream Processing / Ingest | Kinesis | Stream Analytics, Data Lake | Kinesis | |
| Data Warehouse | Redshift | SQL Data Warehouse | BigQuery | |
| Monitoring | CloudWatch | Application Insights, Portal | Stackdriver Monitoring | |
| Serverless | Lambda | Azure Functions | Google Cloud Functions | |

Capability groups: Storage, DB, Data & Analytics, Other
![Page 77: DC/OS AND FAST DATA (THE SMACK STACK) June 2017mesosphere.github.io/presentations/2017-06-06-nyc... · Apache Spark: distributed, large-scale data processing Apache Mesos: cluster](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec8d6b120162707222fa0a5/html5/thumbnails/77.jpg)
THANK YOU!
DEMO!
QUESTIONS?
@dcos
/groups/8295652
/dcos/dcos/examples/dcos/demos
chat.dcos.io
![Page 78: DC/OS AND FAST DATA (THE SMACK STACK) June 2017mesosphere.github.io/presentations/2017-06-06-nyc... · Apache Spark: distributed, large-scale data processing Apache Mesos: cluster](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec8d6b120162707222fa0a5/html5/thumbnails/78.jpg)
![Page 79: DC/OS AND FAST DATA (THE SMACK STACK) June 2017mesosphere.github.io/presentations/2017-06-06-nyc... · Apache Spark: distributed, large-scale data processing Apache Mesos: cluster](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec8d6b120162707222fa0a5/html5/thumbnails/79.jpg)
bigger picture:
abstractions exist for good reasons, but without sufficient communication they force sub-optimal outcomes …