WSO2 Product Release Webinar: WSO2 Data Analytics Server 3.0
Scaling Up WSO2 BAM for Billions of Requests and Terabytes of Data
Buddhika Chamith, Software Engineer – WSO2 BAM
Business Activity Monitoring
“The aggregation, analysis, and presentation of real-time information about activities inside organizations and involving customers and partners.” - Gartner
Aggregation
● Capturing data
● Data storage
● What data to capture?
Analysis
● Data operations
● Building KPIs
● Operate on large amounts of historic data or new data
● Building BI
Presentation
● Visualizing KPIs/BI
● Custom dashboards
● Visualization tools
● Not just dashboards!
Need for Scalability
BAM 2.x - Component Architecture
Data Agents
● Push data to BAM
● Collecting:
  ● Service data
  ● Mediation data
  ● Logs, etc.
● Various interceptors used:
  ● Axis2 Handlers
  ● Synapse Mediators
  ● Tomcat Valves
  ● Log4j Appenders
Performance Considerations
● Should be asynchronous
● Event batching
● SOAP?
● Apache Thrift (binary protocol)
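The two key points above, asynchrony and batching, can be sketched as a minimal publisher: events are queued without blocking the caller, and a flush sends them in batches. This is an illustrative sketch only; the class and method names are hypothetical, not the actual WSO2 data-agent SDK API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of an asynchronous, batching data agent. Instrumentation code
// enqueues events without blocking; a background thread would call flush()
// periodically and send each batch over the wire in one call.
public class BatchingAgent {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
    private final int batchSize;
    private final List<List<String>> sent = new CopyOnWriteArrayList<>(); // stand-in for network sends

    public BatchingAgent(int batchSize) {
        this.batchSize = batchSize;
    }

    // Hot path: non-blocking, drops the event if the buffer is full.
    public boolean publish(String event) {
        return queue.offer(event);
    }

    // Drain up to batchSize events and "send" them as a single batch.
    public void flush() {
        List<String> batch = new ArrayList<>(batchSize);
        queue.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            sent.add(batch); // a real agent would make one Thrift publish call here
        }
    }

    public List<List<String>> sentBatches() {
        return sent;
    }

    public static void main(String[] args) {
        BatchingAgent agent = new BatchingAgent(100);
        for (int i = 0; i < 250; i++) {
            agent.publish("event-" + i);
        }
        agent.flush();
        agent.flush();
        agent.flush();
        System.out.println(agent.sentBatches().size() + " batches sent"); // 3 batches: 100 + 100 + 50
    }
}
```

Batching amortizes the per-message network overhead, which is the same reason the agents favor a compact binary protocol over SOAP.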
Apache Thrift
● An RPC framework
● With a pluggable architecture for mixing different transports with different protocols
● Has multiple language bindings (Java, C++, Python, Perl, C#, etc.)
● We mainly use the Java binding
Not Just Performance...
● Load balancing
● Failover
● All available within a Java SDK library
● You can use it too
Data Receiver
● Captures and transfers data to subscribed sinks, not just the database
● Can be clustered
● Load balancing is handled on the client side
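Client-side load balancing with failover, as described above, can be sketched as a round-robin walk over the receiver nodes that skips any node that fails. Names here are illustrative, not the real SDK API; the send itself is abstracted as a predicate so the sketch stays self-contained.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Sketch of client-side round-robin load balancing with failover across
// a group of receiver nodes. Each send starts at the next node in the
// rotation; unreachable nodes are skipped until one send succeeds.
public class ReceiverGroup {
    private final List<String> receivers;
    private final AtomicInteger next = new AtomicInteger();

    public ReceiverGroup(List<String> receivers) {
        this.receivers = receivers;
    }

    // trySend stands in for an actual network publish; returns the node used.
    public String send(Predicate<String> trySend) {
        for (int attempt = 0; attempt < receivers.size(); attempt++) {
            String node = receivers.get(
                    Math.floorMod(next.getAndIncrement(), receivers.size()));
            if (trySend.test(node)) {
                return node; // delivered; later sends continue the rotation
            }
            // node unreachable: fail over to the next one
        }
        throw new IllegalStateException("all receivers down");
    }

    public static void main(String[] args) {
        ReceiverGroup group = new ReceiverGroup(
                List.of("tcp://receiver1:7611", "tcp://receiver2:7611"));
        // Simulate receiver1 being down: only receiver2 accepts the event.
        String used = group.send(node -> node.contains("receiver2"));
        System.out.println("delivered via " + used);
    }
}
```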
Data Bridge
Data Storage
● Apache Cassandra
● A NoSQL column-family implementation
● Scalable, HA, and no SPOF
● Very high write throughput and good read throughput
● Tunable consistency with data replication
Deployment – Storage Cluster
Receiver Cluster
Results
With a single receiver node allocated a 2 GB heap on a quad-core RHEL machine.
Disk Growth
Analyzer Engine
● Idea: distribute processing to multiple nodes to run in parallel
● Obvious choice: Hadoop
● Uses the MapReduce programming paradigm
Map Reduce
● Process multiple data chunks in parallel at mappers
● Aggregate map outputs having similar keys at reducers and store the result
● Let's think of a useful example...
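The classic useful example is word counting. The toy single-JVM version below shows the same three phases Hadoop runs at scale: map each chunk to (word, 1) pairs, group pairs by key, and reduce each group to a count. Hadoop distributes the chunks and the groups across machines; the logic is the same.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy MapReduce-style word count on a single JVM. In Hadoop, the chunks
// would be HDFS blocks processed by mapper tasks, and the per-key sums
// would be computed by reducer tasks on other nodes.
public class WordCount {
    public static Map<String, Long> count(List<String> chunks) {
        return chunks.parallelStream()                               // "mappers" run in parallel
                .flatMap(chunk -> Arrays.stream(chunk.split("\\s+"))) // map: emit one word per pair
                .collect(Collectors.groupingBy(word -> word,          // shuffle: group by key
                        Collectors.counting()));                      // reduce: sum per key
    }

    public static void main(String[] args) {
        List<String> chunks = List.of("to be or", "not to be");
        // Counts: to=2, be=2, or=1, not=1 (map iteration order is unspecified)
        System.out.println(count(chunks));
    }
}
```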
Hadoop Components
● JobTracker
● NameNode
● Secondary NameNode
● TaskTrackers
● DataNodes
It's Cool, But...
● Do we need to have a Hadoop cluster in order to try out BAM?
● Are we supposed to code Hadoop jobs to get BAM to summarize something?
● Answers:
1) No
2) No. OK, maybe very rarely at best.
Apache Hive
● You write SQL (almost)
● Let Hive convert it to MapReduce jobs
● So Hive does two things:
  ● Provides an abstraction over Hadoop MapReduce
  ● Submits the analytic jobs to Hadoop
● Hive may spawn a Hadoop JVM locally or delegate to a Hadoop cluster
A Typical Hive Script
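The original slide showed the script itself; the sketch below illustrates the typical shape of a BAM analytic script: map the Cassandra event store to a Hive table, map a summary table to an RDBMS, and run a SQL-like summarization between them. Stream, table, and column names here are hypothetical, and the storage-handler classes and properties follow the pattern of BAM's sample scripts, so exact names may differ in a given release.

```sql
-- Expose the Cassandra column family holding raw events as a Hive table.
-- (Hypothetical stream/column names; handler class per BAM sample scripts.)
CREATE EXTERNAL TABLE IF NOT EXISTS ServiceRequests
  (messageRowID STRING, serviceName STRING, responseTime BIGINT)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
  "cassandra.host"    = "127.0.0.1",
  "cassandra.ks.name" = "EVENT_KS",
  "cassandra.cf.name" = "org_wso2_bam_service_data");

-- Expose an RDBMS summary table for the dashboard to read.
CREATE EXTERNAL TABLE IF NOT EXISTS ServiceSummary
  (serviceName STRING, requestCount INT, avgResponseTime DOUBLE)
STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
TBLPROPERTIES (
  "mapred.jdbc.url"              = "jdbc:h2:repository/database/WSO2CARBON_DB",
  "hive.jdbc.primary.key.fields" = "serviceName");

-- The summarization itself: plain SQL that Hive turns into MapReduce jobs.
INSERT OVERWRITE TABLE ServiceSummary
SELECT serviceName, COUNT(*), AVG(responseTime)
FROM ServiceRequests
GROUP BY ServiceRequests.serviceName;
```

The GROUP BY is what becomes the map/shuffle/reduce work: mappers emit (serviceName, responseTime) pairs and reducers compute the per-service count and average.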
Results
Task Framework
● Runs Hive scripts periodically
● Schedules can be specified as cron expressions or predefined templates
● Handles task failover in case of node failure
● Uses ZooKeeper for coordination
ZooKeeper
● Can be run separately or embedded within BAM
Analyzer Cluster
Dashboard
● Making the dashboard scale
Deployment Patterns
Single Node
High Availability
Fully Distributed Setup
Summary
● BAM
● Need for scalability
● Scaling BAM components
● Results
● BAM deployment patterns