Big data for cio 2015

Zohar Elkayam CTO, Brillix [email protected] Big Data For CIOs

Transcript of Big data for cio 2015

Page 1: Big data for cio 2015

Zohar Elkayam CTO, Brillix

[email protected]

Big Data For CIOs

Page 2: Big data for cio 2015

Who am I?

• Zohar Elkayam, CTO at Brillix

• DBA, team leader, and a senior consultant for over 17 years

• Oracle ACE Associate

• Involved with Big Data projects since 2011

• Blogger – www.realdbamagic.com

http://brillix.co.il

Page 3: Big data for cio 2015

About Brillix

• Brillix is a leading company that specializes in Data Management

• We provide professional services and consulting for Databases, Security and Big Data solutions


Page 4: Big data for cio 2015

Agenda: Big Data

• Big Data
  • Why
  • What
  • Where
  • Who and How

• A Big Data Solution: Hadoop

• NoSQL vs. RDBMS


Page 5: Big data for cio 2015

What is Big Data?


Page 6: Big data for cio 2015

"Big Data"??

Different definitions

“Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” - Teradata Magazine article, 2011

“Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” - The McKinsey Global Institute, 2012

“Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.” -Wikipedia, 2014


Page 7: Big data for cio 2015


Page 8: Big data for cio 2015

Success Stories


Page 9: Big data for cio 2015

More success stories


Page 10: Big data for cio 2015

MORE stories..

• Crime Prevention in Los Angeles

• Diagnosis and treatment of genetic diseases

• Investments in the financial sector

• Generation of personalized advertising

• Astronomical discoveries


Page 11: Big data for cio 2015

Examples of Big Data Use Cases Today

• MEDIA / ENTERTAINMENT: Viewers / advertising effectiveness
• COMMUNICATIONS: Location-based advertising
• EDUCATION & RESEARCH: Experiment sensor analysis
• CONSUMER PACKAGED GOODS: Sentiment analysis of what’s hot, problems
• HEALTH CARE: Patient sensors, monitoring, EHRs; Quality of care
• LIFE SCIENCES: Clinical trials; Genomics
• HIGH TECHNOLOGY / INDUSTRIAL MFG.: Mfg quality; Warranty analysis
• OIL & GAS: Drilling exploration sensor analysis
• FINANCIAL SERVICES: Risk & portfolio analysis; New products
• AUTOMOTIVE: Auto sensors reporting location, problems
• RETAIL: Consumer sentiment; Optimized marketing
• LAW ENFORCEMENT & DEFENSE: Threat analysis - social media monitoring, photo analysis
• TRAVEL & TRANSPORTATION: Sensor analysis for optimal traffic flows; Customer sentiment
• UTILITIES: Smart Meter analysis for network capacity,
• ON-LINE SERVICES / SOCIAL MEDIA: People & career matching; Web-site optimization

Page 12: Big data for cio 2015

Most Requested Uses of Big Data

• Log Analytics & Storage
• Smart Grid / Smarter Utilities
• RFID Tracking & Analytics
• Fraud / Risk Management & Modeling
• 360° View of the Customer
• Warehouse Extension
• Email / Call Center Transcript Analysis
• Call Detail Record Analysis

Page 13: Big data for cio 2015

The Challenge


Page 14: Big data for cio 2015

The Big Data Challenge


Page 15: Big data for cio 2015

Volume

• Big data comes in one size: Big.

• Size is measured in Terabytes (10^12 bytes), Petabytes (10^15), Exabytes (10^18), Zettabytes (10^21)

• The storing and handling of the data becomes an issue

• Producing value out of the data in a reasonable time is an issue

Page 16: Big data for cio 2015

Some numbers

• How much data in the world?
  • 800 Terabytes, 2000
  • 160 Exabytes, 2006 (1 EB = 10^18 B)
  • 4.5 Zettabytes, 2012 (1 ZB = 10^21 B)
  • 44 Zettabytes by 2020

• How much is a zettabyte?
  • 1,000,000,000,000,000,000,000 bytes
  • A stack of 1TB hard disks that is 25,400 km high
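The stack-height figure can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units and a drive thickness of 25.4 mm (one inch); the thickness is an assumption chosen for illustration, not a number given on the slide.

```python
# Decimal (SI) byte units, as used on the slide.
TB = 10**12
ZB = 10**21

# Assumption (not from the slide): each 1 TB drive is 25.4 mm (1 inch) thick.
DRIVE_THICKNESS_MM = 25.4

drives_per_zettabyte = ZB // TB                                           # number of 1 TB drives
stack_height_km = drives_per_zettabyte * DRIVE_THICKNESS_MM / 1_000_000   # mm -> km

# Implied average yearly growth between the 2012 and 2020 figures (8 years).
growth_per_year = (44 / 4.5) ** (1 / 8) - 1

print(f"{drives_per_zettabyte:,} drives, stack of about {stack_height_km:,.0f} km")
print(f"Implied growth 2012-2020: about {growth_per_year:.0%} per year")
```

Under these assumptions a zettabyte is a billion 1 TB drives, which stack to about 25,400 km, and the two world-data figures imply growth of roughly a third per year.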

Page 17: Big data for cio 2015

Growth Rate

• How much data is generated in a day?
  • 7 TB, Twitter
  • 10 TB, Facebook

Page 18: Big data for cio 2015

Data grows fast!


Page 19: Big data for cio 2015

Variety

• Big Data extends beyond structured data to include semi-structured and unstructured information: logs, text, audio and videos.

• A wide variety of rapidly evolving data types requires highly flexible stores and handling.

Page 20: Big data for cio 2015

Structured & Un-Structured

Un-Structured        | Structured
Objects              | Tables
Flexible             | Columns and Rows
Structure Unknown    | Predefined Structure
Textual and Binary   | Mostly Textual

Page 21: Big data for cio 2015

Big Data is ANY data

• Some has fixed structure

• Some is “bring your own structure”

• We want to find value in all of it

Unstructured, Semi-Structured and Structured

Page 22: Big data for cio 2015

Data Types by Industry


Page 23: Big data for cio 2015

Velocity

• The speed at which the data is being generated and collected

• Streaming data and large volume data movement

• High velocity of data capture requires rapid ingestion

• Might cause a backlog problem: data arrives faster than it can be processed
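A backlog builds up whenever data arrives faster than it can be ingested. The sketch below illustrates this with a toy model; the function name and the rates are hypothetical and chosen only for illustration.

```python
def backlog_over_time(arrival_per_hour, capacity_per_hour, hours):
    """Return the unprocessed backlog (same unit as the rates) after each hour."""
    backlog = 0
    history = []
    for _ in range(hours):
        # Whatever cannot be processed this hour is carried over to the next one.
        backlog = max(0, backlog + arrival_per_hour - capacity_per_hour)
        history.append(backlog)
    return history

# Hypothetical rates in GB/hour.
falling_behind = backlog_over_time(arrival_per_hour=120, capacity_per_hour=100, hours=5)
keeping_up = backlog_over_time(arrival_per_hour=80, capacity_per_hour=100, hours=5)

print(falling_behind)  # [20, 40, 60, 80, 100]
print(keeping_up)      # [0, 0, 0, 0, 0]
```

Even a modest gap between arrival rate and ingestion capacity grows without bound, which is why ingestion capacity has to be sized for the peak rate rather than the average.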

Page 24: Big data for cio 2015

Global Internet Device Forecast


Page 25: Big data for cio 2015


Internet of Things

Page 26: Big data for cio 2015

Veracity

• Quality of the data can vary greatly

• Data sources might be messy or corrupted


Page 27: Big data for cio 2015

So, What Defines Big Data?

• When we think that we can produce value from that data and want to handle it

• When the data is too big or moves too fast to handle in a sensible amount of time

• When the data doesn’t fit conventional database structure

• When the solution becomes part of the problem


Page 28: Big data for cio 2015


Page 29: Big data for cio 2015

Why Big Data Now?

• Because we have data:
  • Data is born already in digital form
  • 40% data growth per year

• Because we can:
  • $500 for a drive in which to store all the music of the world
  • 40 years of Moore's Law = large computational resources

• 64% of organizations invested in big data in 2013

• $34 billion invested in big data in 2013

“Because we reached dead end with logic”


Page 30: Big data for cio 2015

How to do Big Data


Page 31: Big data for cio 2015


Page 32: Big data for cio 2015

Big Data in Practice

• Big data is big: technological infrastructure solutions needed

• Big data is messy: data sources must be cleaned before use

• Big data is complicated: need developers and system admins to manage intake of data


Page 33: Big data for cio 2015

Big Data in Practice (cont.)

• Data must be broken out of silos in order to be mined, analyzed and transformed into value

• The organization must learn how to communicate and interpret the results of analysis


Page 34: Big data for cio 2015

Infrastructure Challenges

• Infrastructure that is built for:
  • Large-scale
  • Distributed
  • Data-intensive jobs that spread the problem across clusters of server nodes

Page 35: Big data for cio 2015

Infrastructure Challenges (cont.)

• Storage:
  • Efficient and cost-effective enough to capture and store terabytes, if not petabytes, of data
  • With intelligent capabilities to reduce your data footprint such as:
    • Data compression
    • Automatic data tiering
    • Data deduplication
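Two of the footprint-reduction techniques named above, compression and deduplication, can be sketched in a few lines. This is a minimal illustration using Python's standard `hashlib` and `zlib` modules; the function names and the sample data are hypothetical, and real storage systems do this at the block or file-system level.

```python
import hashlib
import zlib

def store_blocks(blocks):
    """Store each distinct block once (deduplication), compressed with zlib."""
    store = {}          # content hash -> compressed bytes
    references = []     # one hash per input block, in original order
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = zlib.compress(block)
        references.append(digest)
    return store, references

def read_blocks(store, references):
    """Rebuild the original sequence of blocks from the store."""
    return [zlib.decompress(store[digest]) for digest in references]

# Hypothetical data: one block repeated 50 times plus one unique block.
blocks = [b"GET /index.html 200\n" * 100] * 50 + [b"POST /login 302\n" * 100]
store, refs = store_blocks(blocks)

raw_size = sum(len(b) for b in blocks)
stored_size = sum(len(c) for c in store.values())
print(f"raw: {raw_size} bytes, stored: {stored_size} bytes, unique blocks: {len(store)}")
```

The list of references is what lets the original data be reconstructed exactly, so the saving comes without losing information.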

Page 36: Big data for cio 2015

Infrastructure Challenges (cont.)

• Network infrastructure that can quickly import large data sets and then replicate them to various nodes for processing

• Security capabilities that protect highly-distributed infrastructure and data

Page 37: Big data for cio 2015

Goals of Analytics


Page 38: Big data for cio 2015

Positions in Big Data management

• DevOps engineers handle the infrastructure: system admins and cluster managers

• Data scientists are in charge of producing value from the data

Page 39: Big data for cio 2015

Data Scientist


Page 40: Big data for cio 2015

Hadoop


Page 41: Big data for cio 2015

Apache Hadoop

• Open source project run by Apache (2006)

• Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure

• It has been the driving force behind the growth of the big data industry

• Get the public release from:
  • http://hadoop.apache.org/core/

Page 42: Big data for cio 2015

Hadoop Creation History


Page 43: Big data for cio 2015

Key points

• An open-source framework that uses a simple programming model to enable distributed processing of large data sets on clusters of computers.

• The complete technology stack includes
  • common utilities
  • a distributed file system
  • analytics and data storage platforms
  • an application layer that manages distributed processing, parallel computation, workflow, and configuration management

• More cost-effective for handling large unstructured data sets than conventional approaches, and it offers massive scalability and speed

Page 44: Big data for cio 2015

Why use Hadoop?

• Cost: Leverages commodity HW & open source SW

• Flexibility: Versatility with data, analytics & operation

• Scalability: Near linear performance up to 1000s of nodes

Page 45: Big data for cio 2015

What Hadoop Is Not?

• Hadoop does not replace DW or relational databases

• Hadoop is not for OLTP or real-time systems

• Very good for large amounts of data, not so much for smaller sets

• Designed for clusters – there is no Hadoop monster server (single server)

Page 46: Big data for cio 2015

Hadoop Cluster in Yahoo

Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)

Page 47: Big data for cio 2015

Hadoop under the Hood


Page 48: Big data for cio 2015

Hadoop Main Components

• HDFS: Hadoop Distributed File System – distributed file system that runs in a clustered environment.

• MapReduce – programming paradigm for running processes over a clustered environment.

Page 49: Big data for cio 2015

HDFS is...

• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System
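The idea of redundant storage on failure-prone commodity hardware can be sketched as splitting a file into blocks and keeping several copies of each block on different nodes. The block size (128 MB), the replication factor (3), the node names and the round-robin placement below are illustrative assumptions, not a description of how HDFS actually chooses nodes.

```python
import math

BLOCK_SIZE_MB = 128   # assumed block size for this sketch
REPLICATION = 3       # assumed number of copies kept per block

def place_blocks(file_size_mb, nodes):
    """Split a file into blocks and assign each block to REPLICATION distinct nodes.

    Placement is simple round-robin; a real cluster would also consider racks
    and free space, which this sketch ignores. Requires len(nodes) >= REPLICATION.
    """
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(REPLICATION)]
        for block_id in range(n_blocks)
    }

def readable_after_failure(placement, failed_node):
    """The file stays readable if every block still has at least one live replica."""
    return all(any(node != failed_node for node in replicas)
               for replicas in placement.values())

nodes = ["node1", "node2", "node3", "node4"]   # hypothetical cluster
placement = place_blocks(file_size_mb=1000, nodes=nodes)
print(len(placement), "blocks")
print(readable_after_failure(placement, "node2"))
```

Because every block lives on several machines, losing any single node leaves the whole file readable, which is what makes cheap hardware acceptable.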

Page 50: Big data for cio 2015

MapReduce is...

• A programming model for expressing distributed computations at a massive scale

• An execution framework for organizing and performing such computations

• An open-source implementation called Hadoop
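The programming model can be illustrated on a single machine with the classic word count. The sketch below is plain Python, not Hadoop code; the function names and input are hypothetical, and in a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

documents = ["big data is big", "data moves fast"]   # hypothetical input
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Each map call sees only its own record and each reduce call sees only one key, so neither needs shared state; that independence is what lets the framework spread the work over many machines.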


Page 51: Big data for cio 2015

MapReduce is good for...

• Embarrassingly parallel algorithms

• Summing, grouping, filtering, joining

• Off-line batch jobs on massive data sets

• Analyzing an entire large dataset


Page 52: Big data for cio 2015

MapReduce is OK for...

• Iterative jobs (e.g., graph algorithms)
  • Each iteration must read/write data to disk
  • IO and latency cost of an iteration is high

Page 53: Big data for cio 2015

MapReduce is NOT good for...

• Jobs that need shared state/coordination
  • Tasks are shared-nothing
  • Shared-state requires scalable state store

• Low-latency jobs

• Jobs on small datasets

• Finding individual records

Page 54: Big data for cio 2015

Spark

• Fast and general MapReduce-like engine for large-scale data processing

• Fast
  • In-memory data storage for very fast interactive queries
  • Up to 100 times faster than Hadoop

• General
  • Unified platform that can combine: SQL, Machine Learning, Streaming, Graph & Complex analytics

• Ease of use
  • Can be developed in Java, Scala or Python

• Integrated with Hadoop
  • Can read from HDFS, HBase, Cassandra, and any Hadoop data source.

Page 55: Big data for cio 2015

Key Concepts

Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on Disk
• Built through parallel transformations
• Automatically rebuilt on failure

Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
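The split between transformations and actions can be illustrated with a toy, single-machine class. `MiniRDD` below is hypothetical and is not the Spark API; it only mimics the idea that transformations record what to do, while actions trigger the actual computation.

```python
class MiniRDD:
    """A toy, single-machine stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source    # the original data
        self._steps = steps      # recorded transformations (the lineage)

    # Transformations: return a new MiniRDD and do no work yet.
    def map(self, fn):
        return MiniRDD(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._source, self._steps + (("filter", fn),))

    def _compute(self):
        # Replaying the recorded steps from the source is also how a lost
        # result could be rebuilt after a failure.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the computation and return a result.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

numbers = MiniRDD(range(10))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.collect())  # [0, 4, 16, 36, 64]
print(even_squares.count())    # 5
```

Nothing is computed when `map` and `filter` are called; the work happens only inside `collect` and `count`, and the original `numbers` object is left unchanged.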

Page 56: Big data for cio 2015

Unified Platform

• Continued innovation bringing new functionality, e.g.:
  • Java 8 (Closures, Lambda Expressions)
  • Spark SQL (SQL on Spark, not just Hive)
  • BlinkDB (Approximate Queries)
  • SparkR (R wrapper for Spark)

Page 57: Big data for cio 2015

Big Data and NoSQL


Page 58: Big data for cio 2015

The Challenge

• We want scalable, durable, high volume, high velocity, distributed data storage that can handle non-structured data and that will fit our specific need

• RDBMS is too generic and doesn’t cut it any more – it can do the job but it is not cost-effective for our uses

Page 59: Big data for cio 2015

The Solution: NoSQL

• Let’s take some parts of the standard RDBMS out and design the solution for our specific uses

• NoSQL databases have been around for ages under different names/solutions

Page 60: Big data for cio 2015

Example Comparison: RDBMS vs. Hadoop

                      | Typical Traditional RDBMS | Hadoop
Data Size             | Gigabytes                 | Petabytes
Access                | Interactive and Batch     | Batch – NOT Interactive
Updates               | Read / Write many times   | Write once, Read many times
Structure             | Static Schema             | Dynamic Schema
Scaling               | Nonlinear                 | Linear
Query Response Time   | Can be near immediate     | Has latency (due to batch processing)

Page 61: Big data for cio 2015

Hadoop And Relational Database

Hadoop
Best Used For:
• Structured or Not (Flexibility)
• Scalability of Storage/Compute
• Complex Data Processing
• Cheaper compared to RDBMS

Relational Database
Best Used For:
• Interactive OLAP Analytics (<1sec)
• Multistep Transactions
• 100% SQL Compliance

Best when used together

Page 62: Big data for cio 2015

The NOSQL Movement

• NOSQL is not a technology – it’s a concept

• We need high performance, scale out abilities or agile structure

• We are willing to sacrifice our sacred database cows: consistency, transactions, durability

• Over 150 different brands and solutions (http://nosql-database.org/).


Page 63: Big data for cio 2015

Is NoSQL an RDBMS Replacement?

NO

Well... Sometimes it is…

Page 64: Big data for cio 2015

NoSQL Taxonomy

Type Examples

Key-Value Store

Document Store

Column Store

Graph Store


Page 65: Big data for cio 2015

Key Value Store

• Distributed hash tables
• Very fast to get a single value
• Examples:
  • Amazon DynamoDB
  • Berkeley DB
  • Redis
  • Riak
  • Cassandra
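The "distributed hash table" idea can be sketched with plain dictionaries standing in for nodes. `TinyKeyValueStore` is a hypothetical class for illustration only, and the modulo-based placement is a simplification.

```python
import hashlib

class TinyKeyValueStore:
    """Keys are spread over several 'nodes' (plain dicts here) by hashing the key."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash, so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        # Only one node is consulted, which is why single-key reads are fast.
        return self._node_for(key).get(key, default)

store = TinyKeyValueStore(node_count=3)
store.put("user:42", {"name": "Dana"})    # hypothetical keys and values
store.put("session:abc", "active")
print(store.get("user:42"))
print(store.get("missing", "not found"))
```

With modulo placement, changing the number of nodes moves most keys to a different node; production systems commonly use consistent hashing to limit that reshuffling.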

Page 66: Big data for cio 2015

Document Store

• Similar to Key/Value, but value is a document
• JSON or something similar, flexible schema
• Agile technology
• Examples:
  • MongoDB
  • CouchDB
  • Couchbase
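A flexible schema means that each document carries its own structure. The sketch below uses plain Python dictionaries as JSON-like documents; the `find` helper and the sample data are hypothetical and do not correspond to any particular product's query API.

```python
import json

# Hypothetical documents: each one has a different set of fields.
documents = {
    "1": {"name": "Dana", "email": "dana@example.com"},
    "2": {"name": "Omer", "tags": ["admin", "dba"], "address": {"city": "Tel Aviv"}},
}

def find(docs, **criteria):
    """Return the ids of documents whose top-level fields match all criteria."""
    return [doc_id for doc_id, doc in docs.items()
            if all(doc.get(field) == value for field, value in criteria.items())]

# Adding a document with a brand-new field needs no schema change.
documents["3"] = {"name": "Dana", "last_login": "2015-01-01"}

matches = find(documents, name="Dana")
print(matches)  # ['1', '3']
print(json.dumps(documents["2"], sort_keys=True))
```

The price of this flexibility is that the application, not the database, has to cope with fields that may be missing from some documents.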

Page 67: Big data for cio 2015

What is a Column Store Database?

• Column store databases are database management systems that keep data in a columnar format for better analysis of single-column data (e.g. aggregation). Data is saved and handled as columns instead of rows.

• Examples:
  • HP Vertica
  • Pivotal (EMC) Greenplum
  • Hadoop HBase
  • Amazon’s SimpleDB
  • Cassandra

Page 68: Big data for cio 2015

Query Data

• When we query data, records are read in the order they are organized in the physical structure

• In a row store, even when we query a single column, we still need to read the entire table and extract the column

[Figure: a table of 4 rows (Row 1 to Row 4) by 4 columns (Col 1 to Col 4), with the queries "Select Col2 From MyTable" and "Select * From MyTable"]

Page 69: Big data for cio 2015

How Do Column Stores Keep Data

[Figure: organization in row store vs. organization in column store, for the query "Select Col2 From MyTable"]
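The difference between the two layouts can be sketched by counting how many stored values a "Select Col2" has to touch in each. The table contents, function names and in-memory lists below are hypothetical stand-ins for on-disk storage.

```python
# A hypothetical 4x4 table.
rows = [
    ("r1c1", "r1c2", "r1c3", "r1c4"),
    ("r2c1", "r2c2", "r2c3", "r2c4"),
    ("r3c1", "r3c2", "r3c3", "r3c4"),
    ("r4c1", "r4c2", "r4c3", "r4c4"),
]

# Row store: the values of one row sit next to each other.
row_store = [value for row in rows for value in row]

# Column store: the values of one column sit next to each other.
column_store = {f"Col{i + 1}": [row[i] for row in rows] for i in range(4)}

def select_col2_row_store(store, width=4):
    """Scan every stored value and keep the second field of each row."""
    values_read = len(store)
    return store[1::width], values_read

def select_col2_column_store(store):
    """Read only the values that belong to Col2."""
    column = store["Col2"]
    return column, len(column)

print(select_col2_row_store(row_store))        # same result, 16 values read
print(select_col2_column_store(column_store))  # same result, 4 values read
```

Both layouts return the same answer, but the column layout touches a quarter of the data here, and the gap widens as tables get wider.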

Page 70: Big data for cio 2015

Row Format vs. Column Format


Page 71: Big data for cio 2015

Graph Store

• Inspired by graph theory

• Data model: nodes, relationships, properties on both nodes and relationships

• Relational databases have a hard time representing a graph

• Examples:
  • Neo4j
  • InfiniteGraph
  • RDF
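The data model of nodes, relationships and properties can be sketched with plain Python structures. The names and data below are hypothetical, and the traversal is an ordinary breadth-first search rather than any product's query language.

```python
from collections import deque

# Hypothetical property graph: nodes and relationships both carry properties.
nodes = {
    "alice": {"label": "Person", "age": 34},
    "bob": {"label": "Person", "age": 41},
    "carol": {"label": "Person", "age": 29},
    "brillix": {"label": "Company"},
}
relationships = [
    ("alice", "KNOWS", "bob", {"since": 2010}),
    ("bob", "KNOWS", "carol", {"since": 2013}),
    ("carol", "WORKS_AT", "brillix", {"role": "DBA"}),
]

def neighbours(node, rel_type):
    """Follow outgoing relationships of one type from a node."""
    return [dst for src, kind, dst, _ in relationships
            if src == node and kind == rel_type]

def reachable(start, rel_type):
    """Breadth-first traversal: every node reachable through rel_type edges."""
    seen, queue = [], deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours(current, rel_type):
            if nxt not in seen and nxt != start:
                seen.append(nxt)
                queue.append(nxt)
    return seen

print(reachable("alice", "KNOWS"))  # ['bob', 'carol']
```

In a relational database each extra hop of such a traversal typically means another self-join, which is one reason graphs are awkward to represent there.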

Page 72: Big data for cio 2015

Graph Example


Page 73: Big data for cio 2015

Conclusion

• We do Big Data to gain Value. Without value, there is no Big Data

• Handling Big Data is a challenge – we talked about who uses it, when and where

• Hadoop is a solution for Big Data usages but it’s not a magical solution

• NoSQL, NewSQL and RDBMS are all solutions we can integrate for different usages

• New organizational positions: cluster devops and data scientist.


Page 74: Big data for cio 2015

Q&A


Page 75: Big data for cio 2015

Thank You

Zohar Elkayam

twitter: @[email protected]

www.realdbamagic.com
