Big data presentation (2014)

© 2014 IBM Corporation

Big Data

Xavier Constant (xavier.constant@es.ibm.com)

Lecture at EADA, International Master in Marketing (2014)


Big Data Concepts

Big Data Technology

Data Scientists


Traditional DW

[Diagram labels: Operational System (ERP, CRM, flat files / spreadsheets), ETL, Data Warehouse(s), ETL, Data Marts, BI Server, Reports / Dashboards]

BENEFITS
- Mature technology
- SQL language (declarative, non-technical)
- Skills & resources availability (programmers, DBAs...)

LIMITATIONS
- Big operational data volumes
  • Queries take too long or don't even finish
  • Admin complexity (partitions, archiving...)
- New data types
  • Free text, images, video, audio...
  • Data in real time (sensors, logs, geospatial data, etc.)
- New analysis types
  • Exploratory
  • Predictive


1 in 2 business leaders don't have access to data they need

83% of CIOs cited BI and analytics as part of their visionary plan

5.4X more likely that top performers use business analytics

80% of the world's data today is unstructured

90% of the world's data was created in the last two years

20% of available data can be processed by traditional systems

Source: GigaOM, Software Group, IBM Institute for Business Value

Intrinsic property of data... it grows


Characteristics of Big Data

Velocity is the game changer. It's NOT just how fast data is produced or changed, BUT the speed at which it must be analyzed, received, understood and processed.


Paradigm shifts enabled by big data I: Leverage more of the data being captured

[Figure: example for "Bank X"]

Paradigm shifts enabled by big data II: Reduce effort required to leverage data

Paradigm shifts enabled by big data III: Data leads the way, and sometimes correlations are good enough

[Figure labels: hypothesis-based correlation; "weird" correlation]

Paradigm shifts enabled by big data IV: Leverage data as it is captured




Complementary Analytics

Traditional approach: structured, analytical, logical
- Data Warehouse, fed by traditional databases: internal app data, transaction data, mainframe data, OLTP system data, ERP data
- Structured, repeatable, linear

New approach: creative, holistic thought, intuition
- Hadoop and Streams, fed by new sources: multimedia, web logs, social data, sensor data, images, RFID, text data, emails
- Unstructured, exploratory, dynamic


Types of Analytic Tools


Organisations are prioritising internal data sources

Untapped stores of internal data
- Size and scope of some internal data, such as detailed transactions and operational log data, have become too large and varied to manage within traditional systems
- New infrastructure components make them accessible for analysis
- Some data has been collected, but not analyzed, for years

Focus on customer insights
- Customers, influenced by digital experiences, often expect information provided to an organization will then be "known" during future interactions
- Combining disparate internal sources with advanced analytics creates insights into customer behavior and preferences (transactions, emails, call center interaction records)

Big data sources: respondents were asked which data sources are currently being collected and analyzed as part of active big data efforts within their organization.


Stages of Big Data adoption

Big data adoption: when segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors.

Total respondents n = 1061. Totals do not equal 100% due to rounding.


Hadoop workloads (% of respondents)

Workload                Today   In 18 months
Staging area            92      92
Online archive          92      92
Transformation engine   83      92
Ad hoc queries          58      67
Scheduled reports       42      67
Visual exploration      25      67
Data mining             58      83

Based on respondents that have implemented Hadoop. BI Leadership Forum, April 2012.


Key Big Data Use Cases

- Big Data Exploration: find, visualize and understand all big data to improve decision making
- Enhanced 360° View of the Customer: extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
- Operations Analysis: analyze a variety of machine data for improved business results
- Data Warehouse Modernization: integrate big data and data warehouse capabilities to increase operational efficiency
- Security / Intelligence Extension: lower risk, detect fraud and monitor cyber security in real time


Big Data Concepts

Big Data Technology

Data Scientists


Solution for Big Data

Data at rest
- Data to analyze are already stored (structured and unstructured)
- Examples: logs, Facebook, Twitter, etc.
- Solution: Hadoop (open source)

Data in motion
- Data are analyzed in real time, at the moment they are generated, without any previous storage
- Examples: sensors, RFID, etc.
- Solution: Streams, CEP solutions
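To make the "data in motion" idea concrete, here is a minimal sketch in plain Java. It is not the InfoSphere Streams API or any CEP product: the class name `RunningAverage` and the sensor readings are illustrative assumptions. Each reading updates an aggregate and is then dropped, so nothing is stored before it is analyzed.

```java
// Minimal sketch of "data in motion": each reading updates a running
// aggregate and is then discarded; nothing is written to storage.
// RunningAverage is a hypothetical name, not part of any streaming product.
class RunningAverage {
    private long count = 0;
    private double mean = 0.0;

    /** Folds one new reading into the running mean (incremental update). */
    void accept(double reading) {
        count++;
        mean += (reading - mean) / count;
    }

    double mean() {
        return mean;
    }

    long count() {
        return count;
    }

    public static void main(String[] args) {
        RunningAverage temperature = new RunningAverage();
        // Hypothetical sensor readings, arriving one at a time.
        for (double reading : new double[] {10.0, 20.0, 30.0}) {
            temperature.accept(reading);
            System.out.println("readings so far: " + temperature.count()
                    + ", running mean: " + temperature.mean());
        }
    }
}
```

The design point is that memory use stays constant however many readings arrive, which is what separates this style from "store first, query later".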


Hardware improvements through the years

CPU speeds
- 1990: 44 MIPS at 40 MHz
- 2000: 3,561 MIPS at 1.2 GHz
- 2010: 147,600 MIPS at 3.3 GHz

RAM memory
- 1990: 640K conventional memory (256K extended memory recommended)
- 2000: 64MB memory
- 2010: 8-32GB (and more)

Disk capacity
- 1990: 20MB
- 2000: 1GB
- 2010: 1TB

Disk latency (speed of reads and writes): not much improvement in the last 7-10 years, currently around 70-80 MB/sec


How long will it take to read 1TB of data?

1TB (at 80 MB/sec)
- 1 disk: 3.4 hours
- 10 disks: 20 min
- 1000 disks: 12 sec
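The figures above can be checked with a short calculation. The sketch below assumes 1 TB = 1,000,000 MB, a sustained 80 MB/sec per disk, and data spread evenly so that all disks are read in parallel. These are simplifying assumptions, not measurements, and the class name is illustrative. Under them, one disk needs 12,500 seconds (about 3.5 hours), ten disks about 21 minutes, and a thousand disks 12.5 seconds, in line with the rounded values on the slide.

```java
// Back-of-the-envelope check of the read times.
// Assumptions: 1 TB = 1,000,000 MB, 80 MB/s sustained per disk,
// data split evenly so that all disks are scanned in parallel.
class ReadTimeEstimate {
    static final double TERABYTE_IN_MB = 1_000_000.0;
    static final double MB_PER_SECOND_PER_DISK = 80.0;

    /** Seconds needed to scan 1 TB when it is split evenly over the given number of disks. */
    static double secondsToRead(int disks) {
        return TERABYTE_IN_MB / (MB_PER_SECOND_PER_DISK * disks);
    }

    public static void main(String[] args) {
        for (int disks : new int[] {1, 10, 1000}) {
            double seconds = secondsToRead(disks);
            System.out.println(disks + " disk(s): " + seconds + " s = "
                    + (seconds / 60.0) + " min = " + (seconds / 3600.0) + " h");
        }
    }
}
```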


Parallel Data Processing is the answer

It has been with us for a while
- GRID computing: spreads processing load
- Distributed workload: hard to manage applications, overhead on developer
- Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)


What is Apache Hadoop?

Apache open source software framework

Flexible, enterprise-class support for processing large volumes of data
- Inspired by Google technologies (MapReduce, GFS, BigTable, ...)
- Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
- Well-suited to batch-oriented, read-intensive applications
- Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
- CPU + local disks = "node"
- Nodes can be combined into clusters
- New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written


Design principles of Hadoop

New way of storing and processing the data
- Let the system handle most of the issues automatically
  • Failures
  • Scalability
  • Reduce communications
  • Distribute data and processing power to where the data is
  • Make parallelism part of the operating system
  • Meant for heterogeneous commodity hardware

Bring processing to data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handle
- Massive amounts of data through parallelism
- A variety of data (structured, unstructured, semi-structured)
- Using inexpensive commodity hardware

Reliability provided through replication


What is the Hadoop Distributed File System?

Driving principles
- Data is stored across the entire cluster (multiple nodes)
- Programs are brought to the data, not the data to the program
- Follows the divide-and-conquer paradigm

Data is stored across the entire cluster (the DFS)
- The entire cluster participates in the file system
- Blocks of a single file are distributed across the cluster
- A given block is typically replicated as well for resiliency

[Figure: a logical file is split into blocks 1 to 4; each block appears three times across the nodes of the cluster]
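The block distribution and replication described above can be sketched with a toy model. This is not the real HDFS placement policy, which also takes racks and free space into account. It assumes a simple round-robin rule, and the names `BlockPlacementSketch`, `place` and `survivesFailureOf` are illustrative. It shows why replicas make the loss of one node harmless.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of block placement with replication.
// Assumption: round-robin placement; real HDFS uses a rack-aware policy.
class BlockPlacementSketch {

    /** For each block, returns the indexes of the nodes that hold a replica. */
    static List<List<Integer>> place(int blocks, int nodes, int replication) {
        if (replication < 1 || replication > nodes) {
            throw new IllegalArgumentException("replication must be between 1 and the node count");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int block = 0; block < blocks; block++) {
            List<Integer> holders = new ArrayList<>();
            for (int replica = 0; replica < replication; replica++) {
                // Consecutive nodes, so the replicas of one block never share a node.
                holders.add((block + replica) % nodes);
            }
            placement.add(holders);
        }
        return placement;
    }

    /** True if every block still has at least one replica after the given node fails. */
    static boolean survivesFailureOf(int failedNode, List<List<Integer>> placement) {
        for (List<Integer> holders : placement) {
            boolean stillAvailable = false;
            for (int node : holders) {
                if (node != failedNode) {
                    stillAvailable = true;
                }
            }
            if (!stillAvailable) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 4 blocks, 5 nodes, 3 replicas per block, as in the figure.
        List<List<Integer>> replicated = place(4, 5, 3);
        System.out.println("placement: " + replicated);
        System.out.println("survives loss of node 2: " + survivesFailureOf(2, replicated));
        // With a single copy per block, losing node 2 loses a block.
        System.out.println("unreplicated survives: " + survivesFailureOf(2, place(4, 5, 1)));
    }
}
```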


Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce application
1. Map phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)

Return a single result set

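import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: emit <word, 1> for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text val, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(val.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts that the shuffle grouped under each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> val, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : val) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}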

[Figure labels: distribute map tasks to cluster; Hadoop Data Nodes; Shuffle; Result Set]


MapReduce Example

Count the number of word occurrences

Entry data
  Hello World Bye World
  Hello IBM

Map process
  Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  Map 2: <Hello, 1> <IBM, 1>

Shuffle process

Reduce process (final output)
  <Bye, 1>
  <IBM, 1>
  <Hello, 2>
  <World, 2>
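The three phases of this example can be traced without a cluster. The sketch below simulates map, shuffle and reduce in plain Java collections on the two input lines from the slide. It does not use the Hadoop API, and the class and method names are illustrative assumptions. The result is held in a sorted map, so the words come out in alphabetical order rather than in the order drawn on the slide.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of the word count example; not the Hadoop API.
class WordCountSimulation {

    /** Map phase: emit <word, 1> for every token of one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<String, Integer>(token, 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group the values emitted by all map tasks under their key. */
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    /** Reduce phase: sum the grouped values of each key. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    /** Runs one map task per input line, then shuffle, then reduce. */
    static Map<String, Integer> wordCount(String... lines) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String line : lines) {
            mapOutputs.add(map(line));
        }
        return reduce(shuffle(mapOutputs));
    }

    public static void main(String[] args) {
        // Prints {Bye=1, Hello=2, IBM=1, World=2}
        System.out.println(wordCount("Hello World Bye World", "Hello IBM"));
    }
}
```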


How to Analyze Large Data Sets in Hadoop

It's not just runtime. The development phase has to be taken into account.

Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java.

To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop:
- Pig
- Hive
- Jaql


Pig, Hive, Jaql: Similarities

- Reduced program size over Java
- Applications are translated to map and reduce jobs behind the scenes
- Extension points for extending existing functionality
- Interoperability with other languages
- Not designed for random reads/writes or low-latency queries

Pig, Hive, Jaql: Differences

Characteristic             Pig         Hive                                Jaql
Developed by               Yahoo       Facebook                            IBM
Language                   Pig Latin   HiveQL                              Jaql
Type of language           Data flow   Declarative (SQL dialect)           Data flow
Data structures supported  Complex     Better suited for structured data   JSON, semi-structured
Schema                     Optional    Not optional                        Optional

Hadoop Distributions


Example of Hadoop Ecosystem: IBM InfoSphere BigInsights

[Figure: IBM InfoSphere BigInsights platform stack, with a legend distinguishing IBM and open source components]

Layers and groups shown: Visualization & Discovery, Applications & Development, Advanced Analytic Engines, Workload Optimization, Runtime, File System, Data Store, Integration, Administration, Management, Security

Components shown: BigSheets, Dashboard & Visualization, Apps, Workflow, Monitoring, Text Analytics (Text Processing Engine & Extractor Library), MapReduce, Pig, Hive, Jaql, Big SQL, JDBC, R, Adaptive Algorithms, Adaptive MapReduce, Flexible Scheduler, Splittable Text Compression, Enhanced Security, Index, Integrated Installer, Admin Console, HDFS, GPFS-FPO, HBase, NameNode High Availability, ZooKeeper, Lucene, Oozie, Sqoop, Flume, HCatalog, Avro, Audit & History, Lineage, Guardium, Platform Computing, Cognos, Streams, Netezza, DB2, DataStage

Open Source frameworks I

Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primary data types and complex type definitions within a schema.

Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop cluster.

HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.

HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.

Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.


Open Source frameworks II

Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan, and query Lucene indexes.

Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.

R: A project for statistical computing.

Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.

ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization, and group services.





BigSheets

Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data

Helps non-programmers to work with a Hadoop cluster

Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.

Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team.


BigSheets Collection Sample

- Spreadsheet-like structures defined by user
- Based on data accessible through the BigInsights Web console, e.g. file system data, output from Web crawl, etc.

BigSheets Collection Operations

- Work with built-in "sheets" editor
- Add / delete columns
- Filter data
- Specify formulas to compute new values using spreadsheet-style syntax
- Apply built-in or custom macro functions
- ...

BigSheets Collection Graphic Visualization

- Built-in charting facility aids analysis
- Pie charts, bar charts, tag clouds, maps, etc.
- Hover over sections to reveal details





What is Big SQL?

Big SQL brings robust SQL support to the Hadoop ecosystem
- Scalable server architecture
- Comprehensive SQL-92 ANSI support
- Standards-compliant client drivers (JDBC & ODBC)
- Efficient handling of point queries
- Wide variety of data sources and file formats
- Extensive HBase focus
- Open source interoperability

Our driving design goals
- Existing queries should run with no or few modifications
- Existing JDBC and ODBC compliant tools should continue to function
- Queries should be executed as efficiently as the chosen storage mechanisms allow


Architecture

Big SQL shares catalogs with Hive via the Hive metastore
- Each can query the other's tables

SQL engine analyzes incoming queries
- Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
- Rewrites the query if necessary for improved performance
- Determines the appropriate storage handler for the data
- Produces an execution plan
- Executes and coordinates the query

[Figure: an application sends SQL through the JDBC / ODBC driver, over a network protocol, to the Big SQL Server on a head node of the BigInsights cluster. The Big SQL Server contains the SQL engine and storage handlers (delimited files, sequence files, HBase, RDBMS, ...). Other head nodes host the Name Node, the Job Tracker and the Hive Metastore. Each compute node runs a Task Tracker, a Data Node and a Region Server. Server layout and relative sizes are for illustrative purposes only.]





What is Text Analytics?

High-performance and scalable rule-based information extraction engine
- Distills structured information from unstructured data
- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build, test and refine rules
- Developer tools, an easy-to-use text analytics language, and a set of extractors for fast adoption
- Multilingual support, including support for DBCS languages

Developed at IBM Research since 2004 (System T)

BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development


Annotator Query Language (AQL)

Language to create rules for Text Analytics
- SQL-like language
- Fully declarative text analytics language
- Once compiled, produces an AOG plan to work on the data
- No "black boxes" or modules that can't be customized
- Tooling for easy customization, because you are abstracted from the programmatic details

Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and that are difficult to optimize for performance

create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
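For readers who do not know AQL, the AmountWithUnit rule can be approximated with a regular expression in plain Java. This is only an analogy, not how the Text Analytics engine works. The slide does not define the Number and Unit views, so the two sub-patterns below (a decimal number and a small list of units) are hypothetical stand-ins, as is the class name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex analogue of the AQL sequence pattern <Number> <Unit>.
// NUMBER and UNIT are hypothetical stand-ins for views the slide does not define.
class AmountWithUnitSketch {
    private static final String NUMBER = "\\d+(?:\\.\\d+)?";
    private static final String UNIT = "(?:MB|GB|TB|MHz|GHz)";

    // A number followed, after optional whitespace, by a unit.
    private static final Pattern AMOUNT_WITH_UNIT =
            Pattern.compile("\\b" + NUMBER + "\\s*" + UNIT + "\\b");

    /** Returns every "number unit" span found in the text, in order of appearance. */
    static List<String> extract(String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = AMOUNT_WITH_UNIT.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Prints [1 TB, 3.3 GHz]; the bare number 2010 is not followed by a unit.
        System.out.println(extract("In 2010 a disk held 1 TB and a CPU ran at 3.3 GHz"));
    }
}
```

A declarative rule language keeps the two building blocks as separate, reusable views, whereas the regular expression fuses them into one opaque pattern.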


Text Analytic Simple Example

Input text:

World Cup 2010 Highlights

Football World Cup 2010: one team distinguished well from the rest, winning the final. Early in the second half, Netherlands' striker, Arjen Robben, had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. Winner superiority was reflected when Winger Andres Iniesta scored for Spain for the win.

Extracted information:

Position   Player           Country
Striker    Arjen Robben     Netherlands
Keeper     Iker Casillas    Spain
Winger     Andres Iniesta   Spain


Text Analytic Real Example


One step beyond Watson





Big R

• Explore, visualize, transform and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
  - Partitioning of large data
  - Parallel cluster execution of R code
• Distributed machine learning
  - A scalable statistics engine that provides canned algorithms, and an ability to author new ones, all via R

[Figure: R clients, using IBM R packages, work against the data sources. Numbered callouts (1 to 3): pull data (summaries) to the R client, or push R functions right on the data; embedded R execution; scalable machine learning.]


Where Does Big Data Fit?

[Figure labels: Source Systems; Analytical database (DW); Analytical tools; 1. Extract, transform, load; 5. Explore data; 6. Parse, aggregate; 9. Report and mine data; "Capture in case it's needed"; "Capture only what's needed"]


Big Data Concepts

Big Data Technology

Data Scientists


Data scientist: the new cool guy in town

Article in Fortune: "The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."

McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.


Data Science is Multidisciplinary


Successful Data Scientist Characteristics


Data Scientist Qualities


How Long Does It Take For a Beginner to Become a Good Data Scientist?


www.kaggle.com


Kaggle ranking


Learn Big Data

Reading Materials - Online
- Understanding Big Data (free PDF book)
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
- Developing, publishing, and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
- Implementing IBM InfoSphere BigInsights on System x (Redbook)
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html

Resources
- Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
- InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
- Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
- DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights


Learn Big Data Technologies

BigDataUniversity.com
- Flexible on-line delivery allows learning at your place and at your pace
- Free courses, free study materials
- Cloud-based sandbox for exercises: zero setup
- Robust Course Management System and Content Distribution infrastructure


Traditional DW

BI Server

ERP

CRM

Data Marts

Reports Dashboards

Operational System

ETL ETL

BENEFITS

Mature Technology



LIMITATIONS




New data types



New analysis types

Exploratory

Predictive


Data Warehouse(s)








years















Bank X




















Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData


ERP Data


Linear


Dynamic

Text Data

emails

Hadoop andStreams

NewSources





17









analyzed for years





interactions





Big data sources

Respondents were

asked which data


being collected and

analyzed as part of


within their

organization



18

Big data adoption





Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive


Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months










Big Data Concepts

Big Data Technology

Data Scientists



Rest Data




Data in motion





















developer


data)







technology







bull Data formats





















10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44











Shuffle










sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes


MapReduce Example


Hello IBM


lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt


Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process




account





of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql









languages







Type of language







34





Figure: platform component diagram (legend: IBM, Open Source). Labels shown:
Integration, Streams, Netezza, Flume, DB2, DataStage, Sqoop, JDBC
Runtime, MapReduce, Adaptive MapReduce, Flexible Scheduler
File System, HDFS, GPFS-FPO
Data Store, HBase
Apps, BigSheets, Big SQL, Pig & Jaql, Hive, Pig, Jaql, R
Index, Lucene, Adaptive Algorithms
Administration, Admin Console, Workflow Monitoring, Management, Security, Enhanced Security, Audit & History, Lineage, NameNode High Avail
ZooKeeper, Oozie, HCatalog, Avro, Guardium, Platform Computing, Cognos







cluster






































BigSheets





















Add or delete columns

Filter data



syntax


functions












What is Big SQL?











mechanisms allow


Architecture






on the cluster




handler for data





Figure: Big SQL architecture
Application: SQL Language, JDBC / ODBC Driver
BigInsights Cluster
Head Node: Big SQL Server (SQL Engine; Storage Handlers: Del Files, SEQ, ...)
Head Node: Name Node
Head Node: Hive Metastore
Compute Nodes (repeated): Task Tracker, Data Node, Region Server




Text Analytics
















SQL-like language










as match














for the win










Big R




paradigm


programming



code


Learning




via R

Figure: Big R architecture
R Clients (IBM R Packages)
Cluster side: Scalable Machine Learning, Embedded R Execution (IBM R Packages), Data Sources
Callouts 1 to 3 include: Pull data (summaries) to R client; or push R functions right on the data
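
The two options named in the figure, pulling data to the client or pushing functions to the data, can be contrasted with a small sketch. It is written in Python rather than R and does not use any Big R API; the partitions, the function names, and the "values transferred" count are hypothetical stand-ins chosen only to make the trade-off visible.

```python
def pull_then_compute(partitions):
    """Pull: copy every row to the client, then compute the mean there."""
    rows = [x for part in partitions for x in part]     # all data moves
    return sum(rows) / len(rows), len(rows)             # (mean, values transferred)


def partial_sum_and_count(part):
    """Small summary of one partition: (sum, number of rows)."""
    return sum(part), len(part)


def push_then_combine(partitions, fn):
    """Push: run `fn` next to each partition and send back only its summary."""
    summaries = [fn(part) for part in partitions]       # runs "on the data"
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count, 2 * len(summaries)            # (mean, values transferred)


# Three toy partitions standing in for data spread over cluster nodes.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

print(pull_then_compute(partitions))                         # prints (5.0, 9)
print(push_then_combine(partitions, partial_sum_and_count))  # prints (5.0, 6)
```

Both paths return the same mean, but the pushed version ships two numbers per partition no matter how many rows each partition holds, which is the reason to push functions to large data.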



Analytical database (DW)
Source Systems
Analytical tools
5. Explore data
6. Parse, aggregate














scientist”


















www.kaggle.com


Kaggle ranking


Learn Big Data








Resources














and your pace


materials



Robust Course



infrastructure



















































Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive


Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months










Big Data Concepts

Big Data Technology

Data Scientists



Rest Data




Data in motion





















developer


data)







technology







bull Data formats





















10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44











Shuffle










sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes


MapReduce Example


Hello IBM


lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt


Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process




account





of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql









languages







Type of language







34





Streams

Netezza

Flume

DB2

DataStage


Runtime


File System

MapReduce

HDFS

Data StoreHBase


BigSheets JDBC



Pig amp Jaql Hive

Administration

Index


Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive


Admin Console

Sqoop

Adaptive Algorithms


Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro







cluster






































BigSheets





















Add / delete columns

Filter data



syntax


functions












What is Big SQL?











mechanisms allow


Architecture






on the cluster




handler for data





[Architecture diagram]

Client: Application, SQL Language, JDBC / ODBC Driver

BigInsights Cluster

Head Nodes (four shown): Big SQL Server (SQL Engine; Storage Handlers for Del Files, SEQ Files, ...), Name Node, Hive Metastore; one label is missing

Compute Nodes (three shown, with more implied): each has a Task Tracker, a Data Node and a Region Server
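One way to picture the SQL Engine, Storage Handlers and Hive Metastore boxes is as a lookup followed by a dispatch: the engine asks a catalog how a table is stored, picks the handler that can read that format, and aggregates the rows it returns. The sketch below is a toy model in Python under that assumption. It is not Big SQL's API; `CATALOG`, `STORAGE_HANDLERS`, the two reader functions and the sample tables are all hypothetical.

```python
import csv
import io


def read_delimited(raw: str, columns: list) -> list:
    """Handler for comma-delimited text: one row per line."""
    return [dict(zip(columns, row)) for row in csv.reader(io.StringIO(raw))]


def read_key_value(raw: str, columns: list) -> list:
    """Handler for tab-separated key/value lines (stand-in for a binary format)."""
    rows = []
    for line in raw.splitlines():
        key, value = line.split("\t", 1)
        rows.append(dict(zip(columns, (key, value))))
    return rows


# Format name -> reader function; plays the role of the storage handlers.
STORAGE_HANDLERS = {"delimited": read_delimited, "keyvalue": read_key_value}

# Table metadata; plays the role of the metastore. Data is inlined for the demo.
CATALOG = {
    "sales": {"format": "delimited", "columns": ["region", "amount"],
              "data": "north,10\nsouth,5\nnorth,7\n"},
    "clicks": {"format": "keyvalue", "columns": ["page", "hits"],
               "data": "home\t3\nhome\t4\nabout\t1"},
}


def select_sum(table: str, group_by: str, value: str) -> dict:
    """Roughly: SELECT group_by, SUM(value) FROM table GROUP BY group_by."""
    meta = CATALOG[table]
    handler = STORAGE_HANDLERS[meta["format"]]
    totals = {}
    for row in handler(meta["data"], meta["columns"]):
        totals[row[group_by]] = totals.get(row[group_by], 0) + int(row[value])
    return totals


sales_by_region = select_sum("sales", "region", "amount")
hits_by_page = select_sum("clicks", "page", "hits")
print(sales_by_region, hits_by_page)
```

The point of the design is that the query logic in `select_sum` never changes when a new file format is added; only a new handler is registered.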





Text Analytics
















SQL Like Language










as match














for the win










Big R




paradigm


programming



code


Learning




via R

R Clients

Scalable Machine Learning

Data Sources

Embedded R Execution

IBM R Packages (shown twice in the diagram)

Pull data (summaries) to R client

Or push R functions right on the data
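The difference between pulling data to the client and pushing functions to the data can be shown with a small example. The sketch below uses Python rather than R and models the cluster as a list of in-memory partitions; the partition contents and all function names are hypothetical, and it does not use any Big R call.

```python
# Three hypothetical partitions of one numeric column, as held on three nodes.
partitions = [[4.0, 8.0], [6.0], [2.0, 10.0, 6.0]]


def pull_then_compute(parts: list) -> float:
    """Pull every row to the client, then compute the mean locally."""
    rows = [x for part in parts for x in part]
    return sum(rows) / len(rows)


def local_summary(part: list) -> tuple:
    """Function pushed to the data: returns (sum, count) for one partition."""
    return (sum(part), len(part))


def push_then_combine(parts: list, fn) -> float:
    """Run fn next to each partition and combine only the small summaries."""
    summaries = [fn(part) for part in parts]
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


mean_pull = pull_then_compute(partitions)
mean_push = push_then_combine(partitions, local_summary)
print(mean_pull, mean_push)
```

Both routes give the same mean, but the second moves one (sum, count) pair per partition instead of every row, which is what makes it practical when the data does not fit on the client.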

1

2

3



Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate






Big Data Concepts

Big Data Technology

Data Scientists








scientist”


















www.kaggle.com


Kaggle ranking


Learn Big Data








Resources














and your pace


materials



Robust Course



infrastructure







































































MapReduce Example


Hello IBM


lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt


Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process




account





of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql









languages







Type of language







34












cluster






































BigSheets





















Add delete columns

Filter data



syntax


functions












What is Big SQL











mechanisms allow


Architecture






on the cluster




handler for data





[Diagram: Big SQL architecture. Labels: Application; SQL Language; JDBC / ODBC Driver; BigInsights Cluster; Head Node (Big SQL Server: SQL Engine, Storage Handlers for Del Files, SEQ, ...); Head Node (Name Node); Head Node (Hive Metastore); Compute Nodes, each with Task Tracker, Data Node, Region Server.]




































Diagram (BigInsights platform stack, shown again; legend: IBM / Open Source). Same component labels as the stack diagram above, now with BigSheets, Big SQL and Text Analytics
















SQL Like Language










as match














for the win








Diagram (BigInsights platform stack, shown again; legend: IBM / Open Source). Same component labels as the stack diagram above, now with BigSheets, Big SQL, Text Analytics and R

