Spark Seattle meetup - Breaking ETL barrier with Spark Streaming
![Page 1: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/1.jpg)
Breaking ETL barrier with Real-time reporting using Kafka, Spark Streaming
Santosh Sahoo
Architect at Concur
![Page 2: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/2.jpg)
About us
Concur (now part of SAP) provides travel and expense management services to businesses.
The Data Insights team is building solutions that give customers access to data, visualization, and reporting.
![Page 3: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/3.jpg)
Stack so far..
App -> OLTP -> ETL -> OLAP -> Report
![Page 4: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/4.jpg)
Numbers
7K OLTP database sources
14K OLAP reporting DBs
28K ETL jobs
300M rows (compacted), 2B row changes
Only ~20 failures a night
![Page 5: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/5.jpg)
Batch ETL challenges
Scheduled (high latency)
Processing time
Hard to scale
Not fault tolerant
Monolithic
High maintenance
![Page 6: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/6.jpg)
Moving forward
Scheduled (high latency) -> Streaming, real time
Hard to scale -> Scalable
Monolithic -> Modular
Not fault tolerant -> Fault tolerant
ACID consistent, normalized -> Eventual consistency
High maintenance (single tenant) -> Reduced maintenance overhead (multi tenant)
![Page 7: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/7.jpg)
Streaming Data Pipeline
Source -> Flow Manager -> Streaming Processor -> Storage -> Reporting
Applications
Mobile Devices
Sensors
IoT (Internet of Things)
Database log scraping
Alert
Message Queues
Kafka
Flume
Azure Event Hubs
AWS Kinesis
HDFS
Storm
Spark Streaming
Azure Stream Analytics
Samza
Flink
RDBMS
NoSQL
HDFS
Redshift
Custom App D3
Tableau
Cognos
Excel
![Page 8: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/8.jpg)
Spark Streaming
What? A data processing framework to build scalable fault-tolerant streaming applications.
Why? It lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
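The "reuse the same code for batch processing" point can be illustrated with a minimal sketch. This is plain Python standing in for Spark, not the Spark API: the event shape, `totals_by_type`, and `micro_batches` are hypothetical names invented for this example. The idea shown is only that one transformation function can serve both a full batch and a sequence of small batches.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

# Hypothetical event shape, assumed for illustration: {"type": str, "amount": int}
Event = Dict[str, object]


def totals_by_type(events: Iterable[Event]) -> Counter:
    """Business logic written once: sum amounts per expense type."""
    totals: Counter = Counter()
    for event in events:
        totals[event["type"]] += event["amount"]
    return totals


def micro_batches(events: List[Event], size: int) -> Iterator[List[Event]]:
    """Stand-in for a stream: yield the events in small consecutive chunks."""
    for start in range(0, len(events), size):
        yield events[start:start + size]


events = [
    {"type": "hotel", "amount": 120},
    {"type": "taxi", "amount": 30},
    {"type": "hotel", "amount": 80},
    {"type": "meal", "amount": 25},
    {"type": "taxi", "amount": 15},
]

# Batch mode: apply the function to the whole data set at once.
batch_totals = totals_by_type(events)

# Streaming mode: apply the same function to each micro-batch and merge.
running_totals: Counter = Counter()
for batch in micro_batches(events, size=2):
    running_totals.update(totals_by_type(batch))

print(batch_totals == running_totals)
```

In Spark itself the shared logic would typically be a function over RDDs applied to each batch of a DStream; the sketch only mirrors that shape.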
![Page 9: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/9.jpg)
Demo….
![Page 10: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/10.jpg)
Kafka - Flow Management
No nonsense logging
100K/s throughput vs 20K/s for RabbitMQ
Log compaction
Durable persistence
Partition tolerance
Replication
Best in class integration with Spark
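Log compaction, listed above, keeps the most recent record for each key so that a change log can shrink to the current state. A minimal sketch of the idea follows. It is a simplification written in plain Python, not Kafka code: the `compact` function and the sample keys are assumptions for illustration, and unlike Kafka it drops delete markers (tombstones) immediately instead of retaining them for a configurable period.

```python
from typing import Dict, List, Optional, Tuple

# (key, value); a value of None is treated as a delete marker (tombstone).
Record = Tuple[str, Optional[str]]


def compact(log: List[Record]) -> List[Record]:
    """Keep only the latest record per key, in log order, dropping deleted keys."""
    last_index: Dict[str, int] = {}
    for index, (key, _value) in enumerate(log):
        last_index[key] = index  # later records overwrite earlier positions
    return [
        record
        for index, record in enumerate(log)
        if last_index[record[0]] == index and record[1] is not None
    ]


# Hypothetical row-change log: five changes touching three rows.
change_log: List[Record] = [
    ("row1", "v1"),
    ("row2", "v1"),
    ("row1", "v2"),
    ("row3", "v1"),
    ("row2", None),  # row2 deleted
]

compacted = compact(change_log)
print(compacted)  # [('row1', 'v2'), ('row3', 'v1')]
```

The same mechanism is one way many row changes can reduce to a much smaller set of current rows.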
![Page 11: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/11.jpg)
Spark Streaming Architecture
[Diagram labels: Source, Receiver, Driver (Master), Workers, Executors, Tasks, data blocks D1 to D4, Replication, WAL (write-ahead log), DataStore]
DStream: Discretized Stream of RDDs
RDD: Resilient Distributed Dataset
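"Discretized" means the continuous stream is cut into small batches, one per batch interval, and each batch is processed as an RDD. The sketch below shows only the slicing step in plain Python. It is not the Spark API: `discretize` and the sample timestamps are assumptions made for illustration, and the sketch omits the empty batches that a real DStream would still produce for intervals with no data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def discretize(
    events: List[Tuple[float, str]], batch_interval: float
) -> Dict[int, List[str]]:
    """Group (timestamp_seconds, payload) events by batch interval number."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        batch_number = int(timestamp // batch_interval)
        batches[batch_number].append(payload)
    return dict(sorted(batches.items()))


# Hypothetical arrivals: (seconds since start, payload)
arrivals = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]

batches = discretize(arrivals, batch_interval=1.0)
print(batches)  # {0: ['a', 'b'], 1: ['c'], 3: ['d']}
```

A shorter batch interval lowers latency but creates more, smaller batches to schedule.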
![Page 12: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/12.jpg)
Optimized Direct Kafka API
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
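As described in the linked post, the direct approach has no receiver: for each batch the driver decides which Kafka offset range to read from each partition, and the offsets are tracked by Spark Streaming itself. The following is a minimal sketch of that planning step in plain Python. `OffsetRange` here is a local dataclass and `plan_batch` is a hypothetical helper, neither is the Spark or Kafka API, and the offsets are made-up numbers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class OffsetRange:
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def plan_batch(consumed: Dict[int, int], latest: Dict[int, int]) -> List[OffsetRange]:
    """Build one offset range per partition that has new data since the last batch."""
    ranges = []
    for partition in sorted(latest):
        start = consumed.get(partition, 0)  # assumption: unseen partitions start at 0
        end = latest[partition]
        if end > start:
            ranges.append(OffsetRange(partition, start, end))
    return ranges


# Hypothetical state: offsets already processed vs. latest offsets in Kafka.
consumed = {0: 100, 1: 40}
latest = {0: 150, 1: 40, 2: 7}

ranges = plan_batch(consumed, latest)
for r in ranges:
    print(r)

# After the batch succeeds, advance the tracked offsets.
consumed.update({r.partition: r.until_offset for r in ranges})
```

Because a batch is defined by fixed offset ranges, re-running a failed batch reads the same records again, which is what makes recovery deterministic.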
![Page 13: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/13.jpg)
Architecture
![Page 14: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/14.jpg)
Reporting Next Gen Architecture
[Diagram labels: Expense, Travel, TTX, API; OLTP; Replication; Import; FTP, HTTP, SMTP; Service bus; Protobuf, JSON; Broker, Kafka; Stream Processor, Spark; Normalization; Extract; Compensate; Data {Quality, Correction, Analytics}; Migrate method; HDFS; Tachyon; Hive/Spark SQL; OLAP; HANA; HANA OLAP; Load balance, Failover; API/SQL; Reporting: Cognos, Tableau, ?]
![Page 15: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/15.jpg)
Can Spark Streaming survive Chaos Monkey?
http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
![Page 16: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/16.jpg)
Q&A
![Page 17: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/17.jpg)
concur.com/en-us/careers
We are hiring
![Page 18: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming](https://reader034.fdocuments.in/reader034/viewer/2022052603/55d0527ebb61ebb81d8b4823/html5/thumbnails/18.jpg)
Thank you!