Getting Started with Hadoop and BigInsights
Alan Fischer e Silva, Hadoop Sales Engineer, Nov 2015
© 2015 IBM Corporation

Transcript of "Getting Started with Hadoop and BigInsights" (files.meetup.com/9505562/Meetup v2.0.pdf)

Agenda

• Intro
• Q&A
• Break
• Hands-on Lab


Hadoop Timeline


In a Big Data World….

The technology exists now for us to:

• Store everything, for as long as we want

• Efficiently analyze everything, without sub-setting

• Connect tiny nuggets of valuable information buried in piles of worthless bytes


Apache Hadoop Modules

• Hadoop Common: common utilities that support all other modules.

• Hadoop Distributed File System (HDFS™):
  - File system that spans all the nodes in a Hadoop cluster for data storage.
  - Links together the file systems on many local nodes to make them into one big file system.

• Hadoop MapReduce:
  - Software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

• YARN (Yet Another Resource Negotiator):
  - Scheduling and resource management.

• Large Hadoop ecosystem of open-source Apache projects: Hive, Pig, HBase, Oozie, Sqoop, Flume, Zookeeper, Mahout, Spark, Avro


HDFS – Architecture

• Master/slave architecture

• Master: NameNode
  - Manages the file system namespace and metadata
    • FsImage
    • EditLog
  - Regulates access to files by clients

• Slave: DataNode
  - Many DataNodes per cluster
  - Manages storage attached to the nodes
  - Periodically reports status to the NameNode
  - Data is stored across multiple nodes
  - Nodes and components will fail, so for reliability data is replicated across multiple nodes

(Diagram: File1 is split into blocks a, b, c, d; the NameNode tracks them and each block is replicated across several DataNodes.)
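As a minimal sketch of how a client works with HDFS through the NameNode and DataNodes, the snippet below uses the standard Hadoop FileSystem API to write and read back a small file. The cluster URI and path are placeholder assumptions, not values from the deck.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS normally comes from core-site.xml; this URI is a placeholder
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        // Write: the client gets block placements from the NameNode, then streams to DataNodes
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}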


HDFS – Blocks

• HDFS is designed to support very large files

• Each file is split into blocks
  - Hadoop default: 64 MB
  - BigInsights default: 128 MB

• Behind the scenes, one HDFS block is backed by multiple operating system blocks

• If a file (or the last chunk of a file) is smaller than the block size, only the needed space is used. E.g., with 64 MB HDFS blocks, a 210 MB file is split as follows: 64 MB + 64 MB + 64 MB + 18 MB.
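To see how a file actually breaks into blocks and where the replicas live, a small sketch using the Hadoop FileSystem API; the file path is a placeholder and the cluster configuration is assumed to come from the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/bigfile.dat"));

        // One BlockLocation per HDFS block; a 210 MB file with 64 MB blocks yields 4 blocks
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}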


Replication of Data and Rack Awareness

(Diagram: a NameNode with rack-aware metadata. Racks: R1 holds DataNodes 1-4, R2 holds 5-8, R3 holds 9-11. Metadata for file.txt: block A on nodes 1, 5, 6; block B on 5, 9, 10; block C on 9, 1, 2. Each block is replicated across DataNodes in different racks.)

• Blocks of data are replicated to multiple nodes

• Behavior is controlled by the replication factor, configurable per file
  - Default is 3 replicas

• Replication is rack-aware to reduce inter-rack network hops/latency:
  - 1 copy in the first rack
  - 2nd and 3rd copies together in a separate rack
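Since the replication factor is a per-file setting, it can be adjusted through the same FileSystem API; a minimal sketch, with a placeholder path and a replication factor chosen only for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/important.dat");

        // Raise this one file's replication factor from the default (3) to 5
        boolean changed = fs.setReplication(file, (short) 5);

        System.out.println("Replication now: "
                + fs.getFileStatus(file).getReplication() + " (changed=" + changed + ")");
    }
}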


MapReduce Processing Summary

• MapReduce computation model
  - Data stored in a distributed file system spanning many inexpensive computers
  - Bring the function to the data
  - Distribute the application to the compute resources where the data is stored

• Scalable to thousands of nodes and petabytes of data

MapReduce application flow:
1. Map phase (break the job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
4. Return a single result set

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> val, Context context) {
    int sum = 0;
    for (IntWritable v : val) {
      sum += v.get();
      ...

(Diagram: the framework distributes the map tasks to the Hadoop data nodes in the cluster.)
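To complete the picture, a minimal driver sketch (not shown on the slide) that wires the mapper and reducer above into a job; it assumes TokenizerMapper and IntSumReducer are nested in this class, and the input/output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);   // mapper from the slide
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);    // reducer from the slide
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}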


Map Task

(Diagram: a client asks "How many times does 'Hello' appear in Wordcount.txt?" Using block locations from the NameNode, the MR Application Master runs a map task on each DataNode holding a block of the file: Data Node 1 counts "Hello" in block A, Data Node 5 in block B, and Data Node 9 in block C, producing partial counts of 8, 3 and 10.)


Reduce Task

(Diagram: the map tasks on Data Nodes 1, 5 and 9 emit their partial counts of 8, 3 and 10; a reduce task on Data Node 3 sums the "Hello" counts from the map tasks, for a total Count=21, and writes Results.txt to HDFS for the client.)


Hadoop Open Source Ecosystem Diagram

(Diagram layers: Data Storage; Resource Management (YARN); Processing Framework (MR v2); Data Access; Coordination (Zookeeper); Workflow Management.)


What is Apache Spark™?

• Spark is a fast and general engine for large-scale data processing

• Speed
  - In memory, Spark can run programs up to 100X faster than MapReduce
  - On disk, Spark can run programs up to 10X faster than MapReduce

• Application development
  - Java
  - Scala
  - Python
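As a small illustration of Java application development on Spark, a word count using the Spark 1.x Java API (the same Spark generation listed later in this deck); the HDFS input and output paths are placeholders.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read lines from HDFS, split into words, count each word
        JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/wordcount.txt");
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split("\\s+")));

        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<String, Integer>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile("hdfs:///user/demo/wordcount-out");
        sc.stop();
    }
}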


What is Apache Spark™?

• High-level capabilities
  - SQL
  - Streaming
  - Complex analytics (ML, R, etc.)

• Runs on multiple platforms
  - Hadoop, Mesos, standalone, cloud

• Connectivity
  - HDFS, Cassandra, HBase, S3, JDBC, ODBC


Yet Another Resource Negotiator (YARN)

• Generic scheduling and resource management
• Supports more than just MapReduce
• Support for more workloads (Hadoop 1.x MapReduce was primarily batch processing)


What is HBase?

• An open-source Apache Top-Level Project
  - An industry-leading implementation of Google's BigTable design
  - Considered "the Hadoop database"
  - HBase powers some of the leading sites on the Web

• A NoSQL data store
  - NoSQL stands for "Not Only SQL"
  - Flexible data model to accommodate semi-structured data
  - Cost-effective for handling petabytes of data
  - Traditional RDBMS sharding (partitioning) lacks flexibility

• Why HBase?
  - Key-value store, column oriented
  - Highly scalable: automatic partitioning, scales linearly and automatically with new nodes
  - Low latency: supports random read/write and small range scans
  - Highly available
  - Strong consistency
  - Flexible data model, very good for "sparse data" (no fixed columns)
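A minimal sketch of random read/write through the HBase Java client (assuming the HBase 1.x client API and an hbase-site.xml on the classpath); the table name, column family and row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) {

            // Write one cell: row key, column family, qualifier, value
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Random read of the same row
            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}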


How to Analyze Large Data Sets in Hadoop

• Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java

• To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop:
  - Pig
  - Hive


What is Hive?

• Developed by Facebook in 2007

• Provides an SQL-like interface, HiveQL (HQL):
  - DDL and DML are similar to SQL
  - HQL queries are translated into MapReduce jobs
  - Schema-on-read capability: projects a table structure onto existing data

• Not a true RDBMS:
  - Suited for batch-mode processing, not real-time, low-latency work
  - No transaction support, no single-row INSERT, no UPDATE or DELETE
  - Limited SQL support

(Diagram flow: SQL → Map → Reduce → Output)
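A minimal sketch of submitting HQL from Java through the HiveServer2 JDBC driver; the hostname, credentials, table name and HDFS location are placeholder assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Schema-on-read: the table definition is projected onto files already in HDFS
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS orders "
                    + "(id INT, amount DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                    + "LOCATION '/user/demo/orders'");

            // This HQL query is compiled into one or more MapReduce jobs
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT id, SUM(amount) FROM orders GROUP BY id")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1) + " -> " + rs.getDouble(2));
                }
            }
        }
    }
}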


Pig: Data Transformation

• Pig vs. MapReduce (comparison chart)


Data Ingestion - Structured

• Sqoop
  - Efficiently transfers bulk data between Hadoop clusters and relational databases

$ sqoop import --connect jdbc:db2://db2.my.com:50000/SAMPLE --username db2user --password db2pwd --table ORDERS --split-by tbl_primarykey --target-dir sqoopimports

$ sqoop export --connect jdbc:db2://db2.my.com:50000/SAMPLE --username db2user --password db2pwd --table ORDERS --export-dir /sqoop/dataFile.csv --split-by tbl_primarykey


Data Ingestion - Unstructured

• Flume
  - Distributed data collection service
  - Aggregates data from one or many sources into a centralized place
  - Great for logs, Twitter feeds, and unstructured data in general


Other Hadoop Related Projects

• Data serialization: Avro
  - Uses JSON schemas for defining data types; serializes data in a compact format

• Machine learning: Mahout
  - Library of scalable machine-learning algorithms

• Distributed coordination: ZooKeeper
  - Distributed configuration, synchronization, naming registry

• Job management: Oozie
  - Simplifies workflow and coordination of MapReduce jobs


Introduction to BigInsights & Open Data Platform Initiative (ODPi)


Goal of the Apache Software Foundation: Let 1000 Flowers Bloom!

• 249 Top-Level Projects, 40 incubating
• 2 million+ code commits
• IBM co-founded the ASF in 1999 and is a Gold Sponsor

•  The “Apache Way” is about fostering open innovation

•  Not a standards organization


Harmonize on Open Data Platform to Accelerate Big Data Solutions

• Over 30 members from leading companies around the world
• Provides a solid foundation of standard core Apache Hadoop components

http://opendataplatform.org/

Goal: Achieve standardization and interoperability of software from ODP members


Goal of the ODP: Enable Innovation to Flourish on a Common Platform

•  Complements the Apache Software Foundation’s governance model

•  ODP efforts focus on integration, testing, and certifying a standard core of Apache Hadoop ecosystem projects

•  Fixes for issues found in ODP testing will be contributed to the ASF projects in line with ASF processes

•  The ODP will not override or replace any aspect of ASF governance


IBM Open Data Platform as of V4.0
Open Data Platform (ODP) benefits and IBM open source project currency commitment

Component      IBM Open Platform V4.0 (ODPi)   Hortonworks HDP 2.2 (ODPi)   Cloudera CDH 5.3
Ambari         1.7                             1.7                          N/A
Flume          1.5.2                           1.5.2                        1.5.0
Hadoop/YARN    2.6                             2.6                          2.5.0
HBase          0.98.8                          0.98.4                       0.98.6
Hive           0.14                            0.14                         0.13
Knox           0.5.0                           0.5.0                        N/A
Oozie          4.0.1                           4.1                          4.0
Pig            0.14                            0.14                         0.12
Slider         0.60.0                          0.60.0                       N/A
Solr           4.10.3                          4.10.2                       4.4
Spark          1.2.1                           1.2                          1.2
Sqoop          1.4.5                           1.4.5                        1.4.4
Zookeeper      3.4.5                           3.4.5                        3.4.5


IBM Open Data Platform as of V4.1
Open Data Platform (ODP) benefits and IBM open source project currency commitment

Component      IBM Open Platform V4.1 (ODPi)   Hortonworks HDP 2.3 (ODPi)   Cloudera CDH 5.4.7
Ambari         2.1.0                           2.1.0                        N/A
Flume          1.5.2                           1.5.2                        1.5.0
Hadoop/YARN    2.7.1                           2.7.1                        2.6.0
HBase          1.1.1                           1.1.1                        1.0.0
Hive           1.2.1                           1.2.1                        1.1.0
Kafka          0.8.2.1                         0.8.2                        0.8.2
Knox           0.6.0                           0.6.0                        N/A
Oozie          4.2.1                           4.2                          4.1.0
Pig            0.15.0                          0.15.0                       0.12.0
Slider         0.80.0                          0.8.0                        N/A
Solr           5.1.0                           5.2.1                        4.10.3
Spark          1.4.1                           1.3.1                        1.3.0
Sqoop          1.4.6                           1.4.6                        1.4.5
Zookeeper      3.4.6                           3.4.6                        3.4.5


Enabling Personas with Capabilities

Personas: Business Analyst, Data Scientist, Administrator

Persona needs:
• Data Scientist: identify patterns, trends and insights with machine learning algorithms; apply statistical models to large-scale data
• Business Analyst: discover data for analysis; visualize data for action; reduce the learning curve by leveraging existing skills (SQL, spreadsheets)
• Administrator: manage workloads and schedule jobs to ensure performance; secure the environment to reduce risk

IBM value:
• Customer insight: a large financial services company analyzed 4 billion tweets and identified 110 million client profiles that matched with at least 90 percent precision
• Complete and fast: Big SQL runs 100% of Hadoop-DS queries, with 3.6x faster query time over Impala (audited Hadoop-DS benchmark)
• Performance: 4x improvement in running MapReduce jobs (STAC report)


Packaging Structure

(Diagram: IBM BigInsights for Apache Hadoop packaging)
• IBM Open Platform with Apache Hadoop (base)
• IBM BigInsights Analyst: Big SQL, BigSheets
• IBM BigInsights Data Scientist: Big SQL, BigSheets, Text Analytics, Machine Learning on Big R, Big R
• IBM BigInsights Enterprise Management: POSIX distributed filesystem, multi-workload / multi-tenant scheduling


Text Analytics


BigInsights and Text Analytics

• Distills structured info from unstructured text
  - Sentiment analysis
  - Consumer behavior
  - Illegal or suspicious activities
  - …

•  Parses text and detects meaning with annotators

•  Understands the context in which the text is analyzed

•  Features pre-built extractors for names, addresses, phone numbers, etc

•  Multiple Languages

Example unstructured text (document, email, etc.): "Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands' striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas, made the save. Winger Andres Iniesta scored for Spain for the win." → Classification and Insight


Text Analytics Tooling

• Web-based tool to define rules to extract data and derive information from unstructured text
• Graphical interface to describe the structure of various textual formats, from log file data to natural language


BigSheets


BigSheets: browser-based analytics tool for Big Data

• Explore, visualize, transform unstructured and structured data

• Visual data cleansing and analysis

• Filter and enrich content

• Visualize data

• Export data into common formats

No programming knowledge needed!


Geospatial Analytics

• New for Version 4

• Latitude/longitude inputs in WKT format

• Over 35 geospatial functions, such as:
  - Area
  - Distance
  - Contains
  - Crosses
  - Difference
  - Union
  - … many more!


Big R and Scalable Machine Learning


Challenges with Running Large-Scale Analytics

Traditional approach: analyze small subsets of information (only a portion of all available information gets analyzed).

Big data approach: analyze all information (all available information analyzed).


User Experience for Big R

Steps shown (annotated R session):
1. Connect to the BigInsights cluster
2. Create a data frame proxy to a large data file
3. Data transformation step
4. Run scalable linear regression on the cluster


NEW: Underneath Big R’s ML Algorithms

• Cost-based optimizer
  - Automatic parallelization
  - Execution plan based on data characteristics and Hadoop configuration

• High-level declarative language with R-like syntax

• Key benefits:
  - Automatic performance tuning
  - Protects data-science investment as the platform progresses

• 5+ years of development in IBM Research


Big R Machine Learning: Scalability and Performance

(Charts for bigr.lm: performance with data that fits in memory shows a 28X speedup over R, which goes out of memory; scalability with data larger than aggregate memory shows Big R scaling beyond cluster memory.)


3 Key Capabilities of Big R

1. Use of the familiar R language on Hadoop
   - Running native R functions
   - Existing R assets (code & CRAN)

2. NEW: Run scalable machine learning algorithms beyond R in Hadoop
   - Wide class of algorithms, and growing
   - R-like syntax for new algorithms & to customize existing algorithms

3. NEW: Leverage the scale of Hadoop for faster insights
   - Only IBM can use the entire cluster memory
   - Only IBM can spill to disk
   - Only IBM can run thousands of models in parallel


Big SQL


SQL on Hadoop = broad access to analytics on Hadoop

SQL is the most prevalent analytical skill available in most data-driven organizations


SQL on Hadoop Matters for Big Data Analytics
For BI tools like Cognos

Visualizations from Cognos 10.2.2


Hive is Really 3 Things… Storage Model, Metastore, and Execution Engine

(Diagram: Applications sit on top of the Hive SQL execution engine (open source), which runs on MapReduce; underneath are the Hive Metastore (open source) and the Hive storage model (open source) over formats such as CSV, tab-delimited, Parquet, RC and others.)


Big SQL Preserves the Open Source Foundation
Co-exists with Hive by using the same metastore and storage formats. No lock-in.

(Diagram: Applications can use either SQL execution engine: IBM Big SQL, a C/C++ MPP engine, or open-source Hive running on MapReduce. Both share the open-source Hive Metastore and Hive storage model over formats such as CSV, tab-delimited, Parquet, RC and others.)
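Because Big SQL is reached over a standard JDBC interface, existing SQL tooling can query the same Hive-defined tables. A minimal sketch follows; the DB2 JDBC driver class, port 51000, hostname, credentials and table name are assumptions for illustration, not values from the deck.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BigSqlQuery {
    public static void main(String[] args) throws Exception {
        // Big SQL is assumed here to use the IBM DB2 JDBC driver; details are placeholders
        Class.forName("com.ibm.db2.jcc.DB2Driver");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:db2://bigsql-head.example.com:51000/bigsql", "bigsql", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT id, SUM(amount) AS total FROM orders GROUP BY id")) {

            // The query runs against the same Hive-cataloged table used elsewhere in the cluster
            while (rs.next()) {
                System.out.println(rs.getInt("id") + " -> " + rs.getDouble("total"));
            }
        }
    }
}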


Ok… but… why would I want to do that?


IBM First to Produce Audited Benchmark Hadoop-DS (based on TPC-DS)

• Letters of attestation are available for both Hadoop-DS benchmarks, at 10TB and 30TB scale

• InfoSizing, Transaction Processing Performance Council certified auditors, verified the IBM results as well as the results on Cloudera Impala and Hortonworks Hive

• These results are for a non-TPC benchmark; a subset of the TPC-DS benchmark standard requirements was implemented


IBM Big SQL – Runs 100% of the queries

Key points:
• With Impala and Hive, many queries needed to be rewritten, some significantly
• Owing to various restrictions, some queries could not be rewritten or failed at run-time
• Rewriting queries in a benchmark scenario where the results are known is one thing; doing this against real databases in production is another

Other environments require significant effort at scale. Results for the 10TB scale are shown here.


Hadoop-DS Benchmark – Single-User Performance @ 10TB
Big SQL is 3.6x faster than Impala and 5.4x faster than Hive 0.13 for a single query stream using 46 common queries

Based on IBM internal tests comparing BigInsights Big SQL, Cloudera Impala and Hortonworks Hive (current versions available as of 9/01/2014) running on identical hardware. The test workload was based on the latest revision of the TPC-DS benchmark specification at 10TB data size. Successful executions measure the ability to execute queries a) directly from the specification without modification, b) after simple modifications, c) after extensive query rewrites. All minor modifications are either permitted by the TPC-DS benchmark specification or are of a similar nature. All queries were reviewed and attested by a TPC certified auditor. Development effort measured time required by a skilled SQL developer familiar with each system to modify queries so they will execute correctly. Performance test measured scaled query throughput per hour of 4 concurrent users executing a common subset of 46 queries across all 3 systems at 10TB data size. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera. Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.


But benchmarking against Hive isn't all that interesting anymore…


Hadoop-DS Performance Test Update: Spark SQL vs. Big SQL


Performance Test – Hadoop-DS (based on TPC-DS), 20 (physical node) cluster

• TPC-DS stands for Transaction Processing Performance Council – Decision Support (workload), an industry-standard benchmark for SQL

Configurations compared (20 nodes each, on IBM Open Platform V4.1):
• Spark SQL 1.5.0
• Big SQL V4.1

IBM Open Platform V4.1 shipped with Spark 1.4.1 but was upgraded to 1.5.0. *Not an official TPC-DS benchmark.


But first… is raw performance everything?


Big SQL Security – Best In Class

• Role-based access control
• Row-level security
• Column-level security
• Separation of duties / audit

(Diagram: roles such as BRANCH_A, BRANCH_B and FINANCE see different subsets of the data.)


Big SQL Runs More SQL Out of the Box

Porting effort: Big SQL 4.1: 1 hour; Spark SQL 1.5.0: 3-4 weeks

Big SQL is the only engine that can execute all 99 queries with minimal porting effort


Big SQL 4.1 vs. Spark 1.5.0, Single Stream @ 1TB

(Four slides of charts comparing Big SQL 4.1 and Spark SQL 1.5.0 query times for a single stream at 1TB; results are summarized on the following slide.)


Conclusions: Big SQL vs. Spark SQL @ 1TB TPC-DS

• Single-stream results:
  - Big SQL was faster than Spark SQL on 76 of 99 queries; when Big SQL was slower, it was only slower by 1.6X on average
  - Query vs. query, Big SQL was on average 5.5X faster
  - Removing the top 5 / bottom 5, Big SQL was 2.5X faster


But… that was just a single stream/user. What happens when you scale it?


But, … what happens when you scale it?

Scale: 1 TB
• Single stream: Big SQL was faster on 76 / 99 queries; Big SQL averaged 5.5X faster; removing top / bottom 5, Big SQL averaged 2.5X faster
• 4 concurrent streams: Spark SQL FAILED on 3 queries; Big SQL was 4.4X faster*

Scale: 10 TB
• Single stream: Big SQL was faster on 87 / 99 queries; Spark SQL FAILED on 7 queries; Big SQL averaged 6.2X faster*; removing top / bottom 5, Big SQL averaged 4.6X faster
• 4 concurrent streams: Big SQL elapsed time for the workload was better than linear; Spark SQL could not complete the workload (numerous issues); partial results were possible with only 2 concurrent streams

*Compares only queries that both Big SQL and Spark SQL could complete (benefits Spark SQL)

(Chart axes: more users across the columns, more data down the rows.)


Recommendation: Right Tool for the Right Job

Spark SQL
• Machine learning
• Simpler SQL
• Good performance
• Ideal tool for data scientists and discovery

Big SQL
• Migrating existing workloads to Hadoop
• Security
• Many concurrent users
• Best-in-class performance
• Ideal tool for BI / data analysts and production workloads

Not mutually exclusive: Big SQL and Spark SQL can co-exist in the cluster


QUESTIONS