Spark Summit EU talk by Brij Bhushan Ravat
Transcript of Spark Summit EU talk by Brij Bhushan Ravat
Productizing a Spark & Cassandra Based Solution in Telecom
Brij Bhushan Ravat, Chief Architect, Voucher Server - Charging System
Productizing a Spark and Cassandra Based Solution in Telecom | Public | © Ericsson AB 2016 | 2016-09-25 | Page 2
1 What is productizing?
2 A brief on the product – Voucher Server
3 Challenges & evolution
4 O&M challenges
5 Wrap up
Agenda
Evolution of a product
Tony Stark in captivity – The Need
Invention of the Arc Reactor – The Design
Armor Mark 1 – The Prototype
Iron Man – The Solution
J.A.R.V.I.S. – The Support Mechanism
Iron Man & Iron Patriot – Productized Solution
› First phase of development:
– designing a solution for a need
– prototyping it as a proof of concept
– completing the solution with all required features & good performance
› Second phase of development:
– building a mechanism to support usage of the solution
– so that every customer can use the solution with the same accuracy
› ‘Productizing’ is taking a solution to a level where anyone can:
– buy the solution off the shelf,
– install it themselves, and
– use it themselves
Productizing
Evolution of VS
Market forces – The Need
Change in database – The Design
NGVS Prototype – The Prototype
Next Generation VS – The Solution
Strong Support – The Support Mechanism
Multiple deployments – Productized Solution
› Generates ‘activation codes’
– an ‘activation code’ corresponds to a currency & a value
› Handles top-up requests of prepaid subscribers
– interprets the value of an ‘activation code’
– adds the corresponding value to the subscriber’s balance
Voucher Server
Voucher Server (VS)
[Diagram: voucher lifecycle around the Voucher Server]
1. Voucher generation – voucher codes are sent to a print shop
2. Voucher loading – printed vouchers are loaded from the retail shop
3. Voucher state change – a subscriber tops up via a voucher lookup (activation code) against the subscriber account
4. Voucher purge – subscribers’ consumed vouchers are purged
5. Report generation – value reports are generated for the operator
Design Change: Oracle to Cassandra
[Diagram: old VS – refill requests routed by the first digit of the activation code (0-4 to cluster 1, 5-9 to cluster 2), each cluster an Oracle RAC with its own voucher pool; new VS – refill requests routed round-robin across VS nodes backed by a single Cassandra cluster with a single voucher pool]
Old VS solution (Oracle based):
• Overall a good solution
• but limited capacity (one cluster holds 300M vouchers)
• and scaling is not immune to hotspots
Next Generation VS solution (Cassandra based):
• Highly scalable solution with good performance
• No hotspots
• but performance of report generation is a constraint
Report Generation: Feature
[Diagram: the Voucher Server handles refill, load, state change, and purge; an analytics solution generates reports, e.g. a voucher-state report: Unavailable 170M, Available 80M, Used 30M, Damaged 13M, Stolen 2M]
Voucher state distribution:
• Count vouchers in each state
• Voucher reconciliation
• Taking stock of inventory
Export voucher data:
• Fraud detection
• Market prediction
Challenges:
• Cassandra by itself does not provide a function to count fields with respect to their value
• Report generation requires a full-table scan (takes 5-6 hours)
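The count-by-state report described above is a distributed aggregation problem: each node counts its own slice of the table, and the partial counts are merged at the end. A minimal plain-Python sketch of that pattern (the partition data and state names are illustrative, not the product's schema):

```python
from collections import Counter

def count_states(partitions):
    """Merge per-partition state counts, the way a distributed
    framework would after a map over each node's token ranges."""
    total = Counter()
    for part in partitions:  # each 'part' is one node's slice of the table
        local = Counter(v["state"] for v in part)  # map side: local count
        total.update(local)                        # reduce side: merge
    return dict(total)

# Three hypothetical per-node slices of the voucher table
partitions = [
    [{"state": "available"}, {"state": "used"}],
    [{"state": "available"}, {"state": "damaged"}],
    [{"state": "used"}, {"state": "available"}],
]
print(count_states(partitions))  # {'available': 3, 'used': 2, 'damaged': 1}
```

Spark does exactly this at scale: the per-partition count is the map side, and the Counter merge is the reduce side, which is why a distributed framework fits this report far better than a single full-table scan.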
Report Generation: Design (without Spark)
[Diagram: Web GUI, a single Java process, and a five-node Cassandra cluster]
Design approach:
• The Web GUI triggers the report-generation task on a Java process
• The Java process interfaces with the Cassandra cluster and fetches voucher data from all nodes
• After the full-table scan, the entire data set is loaded into a single Java process on one node
• The Java process writes the voucher data into a report file
Alternative design:
• Implement distributed computing
• Use a framework for distributed computing
Report Generation: New Design
[Diagram: Web GUI, a Java process, and a Spark executor colocated with each Cassandra node]
New design approach:
• The Web GUI triggers the report-generation task on a Java process
• The Java process triggers a Spark job
• Spark spawns Spark executors on each node
• Each Spark executor is assigned tasks such that its local Cassandra instance holds all the tokens it requires (leveraging data locality)
• Each Spark executor generates a report file for the data it has fetched
Benefits:
• Minimum network latency
• Therefore, better performance
Data Locality: Spark-Cassandra Connector
[Diagram: the Java VS business logic drives Spark through the Spark Cassandra Connector Java wrapper v1.2 and the Spark Cassandra Connector v1.2; each of three Apache Cassandra v2.0.14 nodes hosts a colocated Spark Executor v1.2, and each node's Cassandra partitions map to several Spark partitions]
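The data locality shown on the last two slides can be sketched as a scheduling preference: each Spark partition corresponds to a Cassandra token range, and a task is preferably placed on a node that holds a replica of that range. A simplified plain-Python illustration (node names, ranges, and the replica map are made up; the real connector derives these from cluster metadata):

```python
def assign_tasks(token_ranges, replicas, nodes):
    """Assign each token range to a node holding a replica of it,
    falling back to any node when no replica-local executor exists."""
    assignment = {}
    for rng in token_ranges:
        local = [n for n in replicas[rng] if n in nodes]
        assignment[rng] = local[0] if local else nodes[0]  # prefer data-local
    return assignment

# Hypothetical 3-node cluster, RF=2: each range is replicated on two nodes
replicas = {
    "0-99":    ["node1", "node2"],
    "100-199": ["node2", "node3"],
    "200-299": ["node3", "node1"],
}
nodes = ["node1", "node2", "node3"]
plan = assign_tasks(list(replicas), replicas, nodes)
# Every range is scanned where a replica lives, so the full-table
# scan never has to pull voucher rows across the network
assert all(plan[r] in replicas[r] for r in replicas)
```

This locality is what produces the "minimum network latency" benefit from the previous slide: each executor reads only data its colocated Cassandra instance already owns.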
› Cassandra repair
› Lagging compaction
› Handling real-time traffic along with Spark jobs
Key O&M challenges
Cassandra Repair
Replication Factor: 3, Consistency Level: QUORUM
[Diagram: a three-node cluster in which Nodes 1 and 2 hold value B while Node 3 still holds value A after a mutation drop]
What causes inconsistency:
• Node 2 receives a db request (update value to B) and becomes the coordinator
• It applies the change to its own replica and sends the message to Node 1 and Node 3
• Node 1 applies the db update (from value A to B)
• Node 3 fails to receive the message (mutation drop), so on Node 3 the value remains ‘A’
Consequences of inconsistency (if not fixed):
• If the consistency level is QUORUM for both writing and reading, there is no impact while all 3 nodes are up and running
• Once a node with the latest value goes down, read requests can fail
• In some cases, purged data can also reappear
How to fix inconsistencies:
• Schedule ‘repair’ jobs to run periodically (e.g. weekly)
Challenges:
• Lack of awareness about ‘repair’ among customers
• At times, customers fail to run repair for prolonged periods
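The QUORUM behaviour on this slide can be simulated in a few lines: Cassandra reconciles replica responses by timestamp, so any quorum that includes one up-to-date replica returns the new value, while the stale copy survives silently until repair runs. (Node names and timestamps are illustrative.)

```python
# Replica state after the mutation drop: (write timestamp, value) per node.
# Node 3 missed the update from A to B.
replicas = {"node1": (2, "B"), "node2": (2, "B"), "node3": (1, "A")}

def read(nodes_contacted):
    """Return the newest value among the contacted replicas
    (Cassandra reconciles replica responses by timestamp)."""
    return max(replicas[n] for n in nodes_contacted)[1]

# QUORUM (any 2 of 3) always touches at least one up-to-date replica:
assert read(["node1", "node3"]) == "B"
assert read(["node2", "node3"]) == "B"
# But the stale copy is still there: a CL=ONE read that hits node3 alone
# returns the pre-update value, which is what periodic repair prevents.
assert read(["node3"]) == "A"
```

The simulation also shows why the problem is invisible in normal operation: as long as every read forms a quorum, the stale replica never wins the timestamp comparison, so customers see no symptom until nodes fail or tombstones expire.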
Lagging Compaction (C* sstables)
Replication Factor: 3, Consistency Level: QUORUM
[Diagram: three nodes, each accumulating many sstables]
Why compaction is required:
• A batch of updates is flushed into an sstable
• sstables are immutable, so every batch of updates creates a new sstable
• On each Cassandra node, sstable compaction runs in the background
• It merges individual sstables into a single sstable
• It frees up the disk space consumed by multiple entries of the same record
What is lagging compaction?
• In STCS (Size Tiered Compaction Strategy), by default Cassandra requires 4 sstables of similar size to initiate compaction
• An unusual burst of db operations in a short span of time results in an sstable of unusually large size
• Such large sstables are never selected for compaction because there is often no sstable of similar size
• Recurrence of this phenomenon causes high disk utilization
How to fix lagging compaction:
• Tune compaction to run on just 2 sstables, even with varied sizes
Challenges:
• Customers ignore the increasing disk utilization until it reaches 85%
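The bucketing behaviour behind lagging compaction can be sketched as follows. This is a deliberately simplified model of STCS (real STCS buckets by average size with bucket_low/bucket_high bounds); it shows why a lone oversized sstable is never picked under the default threshold of 4, and how lowering the threshold and widening the size tolerance, as the slide suggests, changes that:

```python
def pick_buckets(sstable_sizes_mb, min_threshold=4, size_ratio=1.5):
    """Simplified STCS: group sstables whose sizes are within
    'size_ratio' of each other; a bucket is compacted only when
    it holds at least 'min_threshold' tables."""
    buckets = []
    for size in sorted(sstable_sizes_mb):
        for b in buckets:
            if size <= b[0] * size_ratio:  # close enough in size: same tier
                b.append(size)
                break
        else:
            buckets.append([size])         # no similar tier: start a new one
    return [b for b in buckets if len(b) >= min_threshold]

sizes = [10, 11, 12, 500]  # a traffic burst produced one huge sstable
assert pick_buckets(sizes) == []  # default: the small tier has only 3 peers,
                                  # the 500 MB table has none, nothing compacts
assert pick_buckets(sizes, min_threshold=2) == [[10, 11, 12]]
assert pick_buckets(sizes, min_threshold=2, size_ratio=60) == [[10, 11, 12, 500]]
```

Lowering the threshold alone compacts the small tier; the oversized table only joins a bucket once the size tolerance is also widened, which is the slide's "even with varied sizes" part of the fix.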
Real-Time Traffic + Spark Batch
[Diagram: resource distribution between real-time refill requests, admin & other functions, and Spark-based report generation, at peak time and off-peak]
At peak time of operations:
• Resources are dedicated to refill-request traffic, which is processed in real time
• The other activities performed are those that take very few resources, such as voucher lookups
During off-peak operations:
• Refill-request traffic is low
• This window is utilized for admin functions, which require more resources
• However, Spark jobs can overload the system so much that even off-peak traffic starts timing out
• This requires tuning of ‘spark.executor.cores’
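One way to approach the ‘spark.executor.cores’ tuning the slide mentions is to budget cores per node explicitly, reserving headroom for real-time refill traffic. The helper and the numbers below are an illustrative sketch, not a recommendation from the talk:

```python
def executor_cores(total_cores, realtime_reserve, executors_per_node=1):
    """Pick a spark.executor.cores value that leaves 'realtime_reserve'
    cores per node free for real-time refill-request traffic."""
    usable = total_cores - realtime_reserve
    return max(1, usable // executors_per_node)  # at least 1 core per executor

# Hypothetical 16-core node: keep 10 cores for real-time traffic
cores = executor_cores(total_cores=16, realtime_reserve=10)
conf = {
    "spark.executor.cores": str(cores),  # the property named on the slide
}
assert conf["spark.executor.cores"] == "6"
```

The point of the sketch is the direction of the trade-off: capping executor cores below the node's total is what keeps the Spark batch job from starving off-peak refill requests into timeouts.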
› For ‘Cassandra repair’, monitor:
– ‘time elapsed since the last successful Cassandra repair’ w.r.t. ‘gc_grace_seconds’ for each node in the Cassandra cluster
› For ‘lagging compaction’, monitor:
– disk utilization
– sstable count
– number of sstables read per table
› For handling real-time traffic along with Spark jobs:
– monitor real-time traffic throughput & response time
– monitor CPU utilization
Solution
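The first monitoring rule above can be written down directly: flag a node when the time since its last successful repair approaches gc_grace_seconds (Cassandra's default is 10 days), since once that window passes, unrepaired deletions can let purged data reappear. The 0.5 margin is an illustrative choice:

```python
import time

def repair_overdue(last_repair_epoch, gc_grace_seconds, margin=0.5, now=None):
    """Flag a node whose last successful repair is older than a fraction
    ('margin') of gc_grace_seconds; past gc_grace itself, unrepaired
    deletions can resurrect purged data."""
    now = time.time() if now is None else now
    return (now - last_repair_epoch) > margin * gc_grace_seconds

GC_GRACE = 10 * 24 * 3600  # Cassandra's default gc_grace_seconds: 10 days
now = 1_000_000_000
assert not repair_overdue(now - 2 * 24 * 3600, GC_GRACE, now=now)  # 2 days ago
assert repair_overdue(now - 8 * 24 * 3600, GC_GRACE, now=now)      # 8 days ago
```

A check like this, run per node and surfaced in the product's monitoring GUI, addresses the "customers fail to run repair for prolonged periods" challenge without expecting the customer to know Cassandra internals.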
› First phase (technical issues)
– Need for high scalability
  › use Cassandra
– Need for efficiency in full-table scans
  › use Spark for report generation (leveraging data locality)
› Second phase (O&M issues)
– Lack of admin skills among end users leads to:
  › issues with Cassandra repair & compaction
  › issues when running real-time traffic alongside batch-processing jobs
Summary of challenges (of productizing)
› Active development
– Tune Cassandra & Spark to handle all usual as well as unusual usage patterns
– Along with the core business logic, develop tools that facilitate monitoring of vital statistics related to Cassandra & Spark operations
– Regularly train the ‘support team’ to strengthen their knowledge of Cassandra & Spark
› Strong support
– which can augment end users’ knowledge, and
– which can partner with end users for smooth operations of Cassandra & Spark
Solution for the challenges
Strong support
[Diagram: developer → customer support → end user]
• The developer trains the support team on O&M of Cassandra & Spark
• The support team partners with the end user for smooth operations of Cassandra & Spark
Strong Support helps R&D handle multiple deployments
[Diagram: a support layer sits between R&D and eleven customers, so R&D does not have to face each deployment directly]