Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays


Transcript of Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Page 1: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Quick dive into the Big Data pool

without drowning

Demi Ben-Ari - VP R&D @ Panorays

Page 2: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

About Me

Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
● Co-Founder “Big Things” Big Data Community

In the Past:
● Sr. Data Engineer - Windward
● Team Leader & Sr. Java Software Engineer, Missile defense and Alert System - “Ofek” – IAF

Interested in almost every kind of technology – A True Geek

Page 3: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Agenda

● Basic Concepts

● Introduction to Big Data frameworks

● Distributed Systems => Problems

● Monitoring

● Conclusions

Page 4: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Say “Distributed”, Say “Big Data”, Say….

Page 5: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Some basic concepts

Page 6: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

What is Big Data (IMHO)?

● Systems involving the “3 Vs”: What are the right questions we want to ask?
○ Volume - How much?
○ Velocity - How fast?
○ Variety - What kind? (Difference)

Page 7: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

What is Big Data (IMHO)

● Some define it as the “7 Vs”, adding:
○ Variability (constantly changing)
○ Veracity (accuracy)
○ Visualization
○ Value

Page 8: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

What is Big Data (IMHO)

● Characteristics
○ Multi-region availability
○ Very fast and reliable response
○ No single point of failure

Page 9: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Why Not Relational Data

● Relational Model Provides
○ Normalized table schema
○ Cross table joins
○ ACID compliance (Atomicity, Consistency, Isolation, Durability)

● But at a very high cost
○ Big Data table joins - billions of rows - massive overhead
○ Sharding tables across systems is complex and fragile

● Modern applications have different priorities
○ Speed and availability take priority over consistency
○ Commodity server racks trump massive high-end systems
○ Real-world need for transactional guarantees is limited

Page 10: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

What strategies help manage Big Data?

● Distribute data across nodes
○ Replication
● Relax consistency requirements
● Relax schema requirements
● Optimize data to suit actual needs
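To make "distribute data across nodes" with replication a bit more concrete, here is a minimal sketch of hash-based placement on a ring. It is an illustration only, not how any particular database does it; the node names, the `replicas_for` helper, and the replication factor of 3 are assumptions made for this example.

```python
import hashlib
from bisect import bisect_right


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(key, nodes, replication_factor=3):
    """Return the nodes that should hold copies of `key`.

    Nodes are placed on a ring by hash; the key is stored on the first
    node clockwise from its own hash, plus the next nodes on the ring.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    ring = sorted((_hash(node), node) for node in nodes)
    positions = [position for position, _ in ring]
    start = bisect_right(positions, _hash(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical cluster of four nodes.
nodes = ["node-a", "node-b", "node-c", "node-d"]
owners = replicas_for("user:42", nodes)
```

Losing any single node still leaves two copies of every key, which is the basic idea behind "no single point of failure".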

Page 11: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

What is the NoSQL landscape?

● 4 broad classes of non-relational databases (DB-Engines)
○ Graph: data elements each relate to N others in graph / network
○ Key-Value: keys map to arbitrary values of any data type
○ Document: document sets (JSON) queryable in whole or part
○ Wide column Store (Column Family): keys mapped to sets of n-numbers of typed columns

● Three key factors to help understand the subject
○ Consistency: Get identical results, regardless which node is queried?
○ Availability: Respond to very high read and write volumes?
○ Partition tolerance: Still available when part of it is down?
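As a rough illustration of the four classes, the same hypothetical user record can be modeled with plain Python structures. This is only a sketch of the data shapes; real engines add storage, indexing, and query layers on top, and all names and values here are made up.

```python
# Key-Value: the store only sees an opaque value behind each key.
key_value = {"user:42": b'{"name": "Dana", "city": "Tel Aviv"}'}

# Document: nested JSON-like documents, queryable by field.
documents = [
    {"_id": 42, "name": "Dana", "address": {"city": "Tel Aviv"}},
    {"_id": 7, "name": "Noam", "address": {"city": "Haifa"}},
]


def find(docs, field, value):
    """Return the documents whose top-level `field` equals `value`."""
    return [doc for doc in docs if doc.get(field) == value]


# Wide column: row key -> column family -> any number of columns.
wide_column = {
    "user:42": {
        "profile": {"name": "Dana", "city": "Tel Aviv"},
        "activity": {"2016-01-01": "login", "2016-01-02": "upload"},
    }
}

# Graph: each element relates to N others (adjacency sets).
graph = {42: {7, 9}, 7: {42}, 9: set()}

matches = find(documents, "name", "Dana")
```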

Page 12: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

What is the CAP theorem?

● In distributed systems, consistency, availability and partition tolerance exist in a mutually dependent relationship. Pick any two.

[Diagram: CAP triangle with vertices Availability, Partition tolerance, and Consistency]

● MySQL, PostgreSQL, Greenplum, Vertica, Neo4J
● Cassandra, DynamoDB, Riak, CouchDB, Voldemort
● HBase, MongoDB, Redis, BigTable, BerkeleyDB

Legend: Graph, Key-Value, Wide Column, RDBMS

Page 13: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

DB Engines - Comparison

● http://db-engines.com/en/ranking

Page 14: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

DB Engines - Comparison

Page 15: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

What does DevOps really mean?

Development: Software Engineering, UX

Operations: System Admin, Database Admin

Page 16: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

What does DevOps really mean?

DevOps: Cross-functional teams

● Operators automating systems
● Developers operating systems

Page 17: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Introduction to Big Data

Frameworks

https://d152j5tfobgaot.cloudfront.net/wp-content/uploads/2015/02/yourstory_BigData.jpg

Page 18: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Characteristics of Hadoop

● A system to process very large amounts of unstructured and complex data at the desired speed

● A system to run on a large number of machines that don’t share any memory or disk

● A system to run on a cluster of machines that can be put together at relatively low cost and with easier maintenance

Page 19: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Hadoop Principles

● “A system to move the computation to where the data is”

● Key Concepts of Hadoop
○ Flexibility
○ Scalability
○ Low cost
○ Fault Tolerant

Page 20: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Hadoop Core Components

● HDFS - Hadoop Distributed File System
○ Provides a distributed data storage system that stores data in smaller blocks in a fail-safe manner
● MapReduce - Programming framework
○ Has the ability to take a query over a dataset, divide it, and run it in parallel on multiple nodes
● YARN - (Yet Another Resource Negotiator) MRv2
○ Splits the responsibilities of the MapReduce JobTracker
■ Resource Manager (Global)
■ Application Master (Per application)
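The "divide it and run it in parallel" idea behind MapReduce can be sketched on a single machine in plain Python. This is not the Hadoop API; the function names are made up for this example, and a thread pool stands in for the cluster nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every chunk."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk is mapped independently, as separate nodes would do.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data pool", "big data big problems"])
```

In a real cluster the chunks would be HDFS blocks, and the map tasks would be scheduled on the machines that already hold those blocks.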

Page 21: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Hadoop Ecosystem

Hadoop Core: HDFS, MapReduce / YARN, Hadoop Common

Hadoop Applications: Hive, Pig, HBase, Oozie, Zookeeper, Sqoop, Spark

Page 22: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Hadoop (+Spark) Distributions

Elastic MapReduce DataProc

Page 23: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

New Age BI Applications

● Able to understand various types of data
● Ability to clean the data
● Process data with applied rules locally and in a distributed environment
● Visualize sizeable data with speed
● Extend results by sharing within the enterprise

Page 24: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Big Data Analytics

● Processing large amounts of data without data movement
● Avoid data connectors if possible (run natively)
● Ability to understand a vast number of data types and data compressions
● Ability to process data on a variety of processing frameworks
● Distributed data processing
○ In-Memory a big plus
● Super fast visualization
○ In-Memory a big plus

Page 25: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

When to choose Hadoop?

● Large volumes of data to store and process
● Semi-Structured or Unstructured data
● Data is not well categorized
● Data contains a lot of redundancy
● Data arrives in streams or large batches
● Complex batch jobs arriving in parallel
● You don’t know how the data might be useful

Page 26: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Distributed Systems => Problems

https://imgflip.com/i/1ap5kr

http://kingofwallpapers.com/otter/otter-004.jpg

Page 27: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Monolith Structure

OS CPU Memory Disk

Processes Java Application Server

Database

Web Server

Load Balancer

Users - Other Applications

Monitoring System

UI

Many times...all of this was on a single physical server!

Page 28: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Distributed Microservices Architecture

Service A

Queue

DB

Service B

DB

Cache

Cache

DB

Service C

Web Server

DB

Analytics Cluster

Master

Slave Slave Slave

Monitoring System???

Page 29: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

MongoDB + Spark

Worker 1

Worker 2

….

….

Worker N

Spark Cluster

Master

Write

Read

Master

Sharded MongoDB

Replica Set

Page 30: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Cassandra + Spark

Worker 1

Worker 2

….

….

Worker N

Cassandra Cluster

Spark Cluster

Write

Read

Page 31: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Cassandra + Serving

Cassandra Cluster

Write

Read

UI Client (x4)

Web Service (x4)

Page 32: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Problems

● Multiple physical servers

● Multiple logical services

● Want Scaling => More Servers

● Even if you had all of the metrics
○ You’ll have an overflow of the data

● Your monitoring becomes a “Big Data” problem itself

Page 33: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

This is what “Distributed” really Means

The DevOps Guy

(It might be you)

Page 34: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Monitoring is Crucial

http://memeguy.com/photo/46871/you-are-being-monitored

Page 35: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Monitoring Operating System Metrics

Page 36: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Some help from “the Cloud”

Page 37: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

AWS’s CloudWatch / GCP StackDriver

Page 38: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Report to Where?

● We chose: Graphite (InfluxDB) + Grafana
● Can correlate System and Application metrics in one place :)
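As a sketch of what reporting an application metric can look like, the snippet below uses Graphite's plaintext protocol, where each line is `<metric path> <value> <unix timestamp>`. The metric name, the host, and the default Carbon port 2003 are assumptions for this example; adjust them to your own setup.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Send a single metric line to a Carbon listener over TCP.

    The host and port are assumptions; 2003 is the usual default for
    Carbon's plaintext listener.
    """
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical application metric with a fixed timestamp.
line = format_metric("app.spark.rows_written", 1200, 1500000000)
```

Sending system metrics and application metrics under a shared naming scheme is what makes it possible to correlate them on one Grafana dashboard.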

Page 39: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Monitoring Cassandra

Page 41: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Monitoring Cassandra

Page 42: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Monitoring Spark

Page 43: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Ways to Monitor Spark

● Grafana-spark-dashboards

○ Blog: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/

● Spark UI - Online on each application running
● Spark History Server - Offline (After application finishes)
● Spark REST API
○ Querying via internal tools to do ad-hoc monitoring
● Back to the basics: dstat, iostat, iotop, jstack
● Blog post by Tzach Zohar - “Tips from the Trenches”
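For the REST API option, a small ad-hoc check might look like the sketch below. It assumes the Spark monitoring REST API is reachable under `/api/v1` on the application UI or the History Server; the base URL and application id are placeholders, and the helper names are made up for this example.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_jobs(base_url, app_id):
    """Fetch the job list of one application from the Spark REST API.

    `base_url` is a placeholder such as "http://localhost:4040" for a
    running application's UI; adjust it to your deployment.
    """
    url = f"{base_url}/api/v1/applications/{app_id}/jobs"
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def summarize_jobs(jobs):
    """Count jobs per status (for example SUCCEEDED, FAILED, RUNNING)."""
    return dict(Counter(job.get("status", "UNKNOWN") for job in jobs))


# Offline sample in the shape of the API's job list, for illustration.
sample_jobs = [
    {"jobId": 0, "status": "SUCCEEDED"},
    {"jobId": 1, "status": "FAILED"},
    {"jobId": 2, "status": "SUCCEEDED"},
]
summary = summarize_jobs(sample_jobs)
```

A check like this can run on a schedule and push the counts to the same metrics store as everything else.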

Page 44: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Monitoring Your Data

https://memegenerator.net/instance/53617544

Page 45: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Data Questions: What should we measure?

● Did all of the computation occur?

○ Are there any data layers missing?

● How much data do we have? (Volume)

● Is all of the data in the Database?

● Data Quality Assurance

Page 46: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Data Answers!

● The method doesn’t really matter, as long as you:
○ Can follow the results over time
○ Know your data flow and know what might fail
○ Make it easy for anyone to add more monitoring (for the ones that add the new data each time…)
○ Don’t trust others to add monitoring (it will always end up being the DevOps’s “fault” -> no monitoring will be applied)
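One simple way to "follow the results over time" is to compare today's data volume with a recent baseline. The sketch below is only an illustration; the 50% tolerance and the row counts are made-up values, and `volume_alert` is a hypothetical helper.

```python
from statistics import mean


def volume_alert(history, today, tolerance=0.5):
    """Return True when today's volume deviates too far from the baseline.

    `history` holds recent daily row counts; `tolerance` is the allowed
    relative deviation from their average (0.5 means +/- 50%).
    """
    if not history:
        return False  # nothing to compare against yet
    baseline = mean(history)
    if baseline == 0:
        return today != 0
    return abs(today - baseline) / baseline > tolerance


# Hypothetical daily row counts written by a pipeline.
daily_rows = [1000, 1100, 950, 1050]
missing_data = volume_alert(daily_rows, today=300)
normal_day = volume_alert(daily_rows, today=1000)
```

The same check works for "is all of the data in the database?" if the count is taken from the storage layer instead of the pipeline.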

Page 48: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

ELK - Elasticsearch + Logstash + Kibana

http://www.digitalgov.gov/2014/05/07/analyzing-search-data-in-real-time-to-drive-decisions/

Page 49: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Monitoring Stack

Alerting

Metrics Collection

Datastore

Dashboard

Data Monitoring

Log Monitoring

Page 50: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Big Data - Are we there yet?

● “3 Vs”: What are the right questions we want to ask?
○ Volume - How much?
■ Can it run on a single machine in reasonable time?
○ Velocity - How fast?
■ Can a single machine handle the throughput?
○ Variety - What kind? (Difference)
■ Is your data not changing and varying?
● If the answer to most of the previous questions is “Yes”, think again about whether you want to add the complexity of “Big Data”
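A quick back-of-the-envelope calculation can help answer the volume question before reaching for a cluster. The numbers below (500 GB of data, 200 MB/s of sustained single-machine throughput) are assumptions chosen only to show the arithmetic.

```python
def hours_on_single_machine(data_gb, throughput_mb_per_s):
    """Estimate how long one machine needs to scan `data_gb` of data."""
    seconds = (data_gb * 1024) / throughput_mb_per_s
    return seconds / 3600


# Assumed workload: 500 GB scanned at 200 MB/s.
estimate = hours_on_single_machine(500, 200)
```

With these assumed numbers the scan takes well under an hour, so a single machine may be enough and the added complexity of a distributed framework is hard to justify.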

Page 51: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Conclusions

● Think carefully before going into the “Big Data pool”
○ See if you really have a problem that you’re trying to solve
○ It’s not a silver bullet
● Take measures to automate and monitor everything
● Having Clusters and distributed frameworks will cost a lot - eventually
● Fit your storage layer(s) to the needs

Page 52: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays

Questions?

https://www.stayathomemum.com.au/wp-content/uploads/2015/01/DDDDDD.jpg

Still feel like you’re drowning?

Page 54: Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays