Big data presentation (2014)

65
© 2014 IBM Corporation 1 Big Data Xavier Constant [email protected] Lecture at EADA International Master in Marketing (2014)

Transcript of Big data presentation (2014)

Page 1: Big data presentation (2014)

copy 2014 IBM Corporation1

Big Data

Xavier Constantxavierconstantesibmcom

Lecture at EADAInternational Master in Marketing (2014)

copy 2014 IBM Corporation2

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation3

Traditional DW

BI Server

ERP

CRM

Data Marts

Reports Dashboards

Operational System

ETL ETL

BENEFITS

Mature Technology

SQL Language (declarative non technical)

Skills amp resources availablity (programmers DBAshellip)

LIMITATIONS

Big operational data volumes

Queries take too long or donrsquot even finish

Admin complexity (partitions archivinghellip)

New data types

Free text images video audiohellip

Data in real time (sensors logs geospatial data etchellip)

New analysis types

Exploratory

Predictive

Flat filesSpread sheets

Data Warehouse(s)

copy 2014 IBM Corporation4

1 in 2business leaders donrsquot have access to data they need

83of CIOrsquos cited BI and analytics as part of their visionary plan

54Xmore likely that top performers use business analytics

80of the worldrsquos data today is unstructured

90of the worldrsquos

data was created in the last two

years

20of available data can

be processed by traditional systems

Source GigaOM Software Group IBM Institute for Business Value

Intrinsic Property of Data hellip it grows

copy 2014 IBM Corporation5

Characteristics of Big Data

Velocity is the game changer Itrsquos NOT just how

fast data is produced or changed BUT the

speed at which it must be analyzed

received understood and processed

copy 2014 IBM Corporation6

Paradigm shifts enabled by big data ILeverage more of the data being captured

copy 2014 IBM Corporation7

Paradigm shifts enabled by big data ILeverage more of the data being captured

Bank X

copy 2014 IBM Corporation8

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation9

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation10

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation11

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

Hypothesis based correlation Weird correlation

copy 2014 IBM Corporation12

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation13

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation14

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 2: Big data presentation (2014)

copy 2014 IBM Corporation2

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation3

Traditional DW

BI Server

ERP

CRM

Data Marts

Reports Dashboards

Operational System

ETL ETL

BENEFITS

Mature Technology

SQL Language (declarative non technical)

Skills amp resources availablity (programmers DBAshellip)

LIMITATIONS

Big operational data volumes

Queries take too long or donrsquot even finish

Admin complexity (partitions archivinghellip)

New data types

Free text images video audiohellip

Data in real time (sensors logs geospatial data etchellip)

New analysis types

Exploratory

Predictive

Flat filesSpread sheets

Data Warehouse(s)

copy 2014 IBM Corporation4

1 in 2business leaders donrsquot have access to data they need

83of CIOrsquos cited BI and analytics as part of their visionary plan

54Xmore likely that top performers use business analytics

80of the worldrsquos data today is unstructured

90of the worldrsquos

data was created in the last two

years

20of available data can

be processed by traditional systems

Source GigaOM Software Group IBM Institute for Business Value

Intrinsic Property of Data hellip it grows

copy 2014 IBM Corporation5

Characteristics of Big Data

Velocity is the game changer Itrsquos NOT just how

fast data is produced or changed BUT the

speed at which it must be analyzed

received understood and processed

copy 2014 IBM Corporation6

Paradigm shifts enabled by big data ILeverage more of the data being captured

copy 2014 IBM Corporation7

Paradigm shifts enabled by big data ILeverage more of the data being captured

Bank X

copy 2014 IBM Corporation8

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation9

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation10

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation11

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

Hypothesis based correlation Weird correlation

copy 2014 IBM Corporation12

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation13

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation14

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 3: Big data presentation (2014)

copy 2014 IBM Corporation3

Traditional DW

BI Server

ERP

CRM

Data Marts

Reports Dashboards

Operational System

ETL ETL

BENEFITS

Mature Technology

SQL Language (declarative non technical)

Skills amp resources availablity (programmers DBAshellip)

LIMITATIONS

Big operational data volumes

Queries take too long or donrsquot even finish

Admin complexity (partitions archivinghellip)

New data types

Free text images video audiohellip

Data in real time (sensors logs geospatial data etchellip)

New analysis types

Exploratory

Predictive

Flat filesSpread sheets

Data Warehouse(s)

copy 2014 IBM Corporation4

1 in 2business leaders donrsquot have access to data they need

83of CIOrsquos cited BI and analytics as part of their visionary plan

54Xmore likely that top performers use business analytics

80of the worldrsquos data today is unstructured

90of the worldrsquos

data was created in the last two

years

20of available data can

be processed by traditional systems

Source GigaOM Software Group IBM Institute for Business Value

Intrinsic Property of Data hellip it grows

copy 2014 IBM Corporation5

Characteristics of Big Data

Velocity is the game changer Itrsquos NOT just how

fast data is produced or changed BUT the

speed at which it must be analyzed

received understood and processed

copy 2014 IBM Corporation6

Paradigm shifts enabled by big data ILeverage more of the data being captured

copy 2014 IBM Corporation7

Paradigm shifts enabled by big data ILeverage more of the data being captured

Bank X

copy 2014 IBM Corporation8

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation9

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation10

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation11

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

Hypothesis based correlation Weird correlation

copy 2014 IBM Corporation12

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation13

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation14

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 4: Big data presentation (2014)

copy 2014 IBM Corporation4

1 in 2business leaders donrsquot have access to data they need

83of CIOrsquos cited BI and analytics as part of their visionary plan

54Xmore likely that top performers use business analytics

80of the worldrsquos data today is unstructured

90of the worldrsquos

data was created in the last two

years

20of available data can

be processed by traditional systems

Source GigaOM Software Group IBM Institute for Business Value

Intrinsic Property of Data hellip it grows

copy 2014 IBM Corporation5

Characteristics of Big Data

Velocity is the game changer Itrsquos NOT just how

fast data is produced or changed BUT the

speed at which it must be analyzed

received understood and processed

copy 2014 IBM Corporation6

Paradigm shifts enabled by big data ILeverage more of the data being captured

copy 2014 IBM Corporation7

Paradigm shifts enabled by big data ILeverage more of the data being captured

Bank X

copy 2014 IBM Corporation8

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation9

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation10

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation11

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

Hypothesis based correlation Weird correlation

copy 2014 IBM Corporation12

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation13

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation14

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 5: Big data presentation (2014)

copy 2014 IBM Corporation5

Characteristics of Big Data

Velocity is the game changer Itrsquos NOT just how

fast data is produced or changed BUT the

speed at which it must be analyzed

received understood and processed

copy 2014 IBM Corporation6

Paradigm shifts enabled by big data ILeverage more of the data being captured

copy 2014 IBM Corporation7

Paradigm shifts enabled by big data ILeverage more of the data being captured

Bank X

copy 2014 IBM Corporation8

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation9

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation10

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation11

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

Hypothesis based correlation Weird correlation

copy 2014 IBM Corporation12

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation13

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation14

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 6: Big data presentation (2014)

copy 2014 IBM Corporation6

Paradigm shifts enabled by big data ILeverage more of the data being captured

copy 2014 IBM Corporation7

Paradigm shifts enabled by big data ILeverage more of the data being captured

Bank X

copy 2014 IBM Corporation8

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation9

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation10

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation11

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

Hypothesis based correlation Weird correlation

copy 2014 IBM Corporation12

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation13

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation14

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 7: Big data presentation (2014)

copy 2014 IBM Corporation7

Paradigm shifts enabled by big data ILeverage more of the data being captured

Bank X

copy 2014 IBM Corporation8

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation9

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation10

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation11

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

Hypothesis based correlation Weird correlation

copy 2014 IBM Corporation12

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation13

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation14

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 8: Big data presentation (2014)

copy 2014 IBM Corporation8

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation9

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation10

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation11

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

Hypothesis based correlation Weird correlation

copy 2014 IBM Corporation12

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation13

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation14

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 9: Big data presentation (2014)

copy 2014 IBM Corporation9

Paradigm shifts enabled by big data IIReduce effort required to leverage data

copy 2014 IBM Corporation10

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation11

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

Hypothesis based correlation Weird correlation

copy 2014 IBM Corporation12

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation13

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation14

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 10: Big data presentation (2014)

copy 2014 IBM Corporation10

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation11

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

Hypothesis based correlation Weird correlation

copy 2014 IBM Corporation12

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation13

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation14

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 11: Big data presentation (2014)

copy 2014 IBM Corporation11

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

Hypothesis based correlation Weird correlation

copy 2014 IBM Corporation12

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation13

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation14

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 12: Big data presentation (2014)

copy 2014 IBM Corporation12

Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough

copy 2014 IBM Corporation13

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation14

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 13: Big data presentation (2014)

copy 2014 IBM Corporation13

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation14

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 14: Big data presentation (2014)

copy 2014 IBM Corporation14

Paradigm shifts enabled by big data IVLeverage data as it is captured

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 15: Big data presentation (2014)

copy 2014 IBM Corporation15

Complementary Analytics

Traditional ApproachStructured analytical logical

New ApproachCreative holistic thought intuition

Multimedia

Data Warehouse

Web Logs

Social Data

Sensor data

images

RFID

Internal AppData

TransactionData

MainframeData

OLTP SystemData

Traditional databases

ERP Data

StructuredRepeatable

Linear

UnstructuredExploratory

Dynamic

Text Data

emails

Hadoop andStreams

NewSources

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 16: Big data presentation (2014)

copy 2014 IBM Corporation16

Types of Analytic Tools

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 17: Big data presentation (2014)

copy 2014 IBM Corporation17

Organisations are prioritising internal data sources

17

Untapped stores of internal data

Size and scope of some internal data such as

detailed transactions and operational log data

have become too large and varied to manage

within traditional systems

New infrastructure components make them

accessible for analysis

Some data has been collected but not

analyzed for years

Focus on customer insights

Customers ndash influenced by digital experiences

ndash often expect information provided to an

organization will then be ldquoknownrdquo during future

interactions

Combining disparate internal sources with

advanced analytics creates insights into

customer behavior and preferences

(Transactions Emails Call center interaction records)

Big data sources

Respondents were

asked which data

sources are currently

being collected and

analyzed as part of

active big data efforts

within their

organization

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 18: Big data presentation (2014)

copy 2014 IBM Corporation18

Stages of Big Data adoption

18

Big data adoption

When segmented into four groups based on current levels of big data activity respondents showed significant consistency

in organizational behaviors Total respondents n = 1061

Totals do not equal 100 due to rounding

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 19: Big data presentation (2014)

copy 2014 IBM Corporation19

Hadoop workloads

92

92

83

58

42

25

58

92

92

92

67

67

67

83

Staging area

Online archive

Transformation Engine

Ad hoc queries

Scheduled reports

Visual exploration

Data mining

Today In 18 Months

Based on respondents that have implemented Hadoop BI Leadership Forum April 2012

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 20: Big data presentation (2014)

copy 2014 IBM Corporation20

Big Data ExplorationFind visualize understand all big data to improve decision making

Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources

Operations AnalysisAnalyze a variety of machinedata for improved business results

Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency

SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time

Key Big Data Use Cases

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 21: Big data presentation (2014)

copy 2014 IBM Corporation21

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 22: Big data presentation (2014)

copy 2014 IBM Corporation22

Solution for Big Data

Rest Data

ndash Data to analyze are already stored (structured and unstructured)

ndash Examples logs facebook twitter etc

ndash Solution Hadoop (open source)

Data in motion

ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage

ndash Examples Sensors RFID etc

ndash Solution Streams CEP solutions

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 23: Big data presentation (2014)

copy 2014 IBM Corporation23

Hardware improvements through the years

CPU Speedsndash 1990 - 44 MIPS at 40 MHz

ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz

RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)

Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB

Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 24: Big data presentation (2014)

copy 2014 IBM Corporation24

How long it will take to read 1TB of data

1TB (at 80Mb sec)ndash 1 disk - 34 hours

ndash 10 disks - 20 min

ndash 100 disks - 2 min

ndash 1000 disks - 12 sec

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 25: Big data presentation (2014)

copy 2014 IBM Corporation25

Parallel Data Processing is the answer

It was with us for a whilendash GRID computing - spreads processing load

ndash Distributed workload - hard to manage applications overhead on

developer

ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the

data)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 26: Big data presentation (2014)

copy 2014 IBM Corporation26

What is Apache Hadoop

Apache Open source software framework

Flexible enterprise-class support for processing large volumes of

data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)

ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search

technology

ndash Well-suited to batch-oriented read-intensive applications

ndash Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes

of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo

ndash Nodes can be combined into clusters

ndash New nodes can be added as needed without changing

bull Data formats

bull How data is loaded

bull How jobs are written

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 27: Big data presentation (2014)

copy 2014 IBM Corporation27

Design principles of Hadoop New way of storing and processing the data

ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware

Bring processing to Data

Hadoop = HDFS + MapReduce infrastructure

Optimized to handlendash Massive amounts of data through parallelism

ndash A variety of data (structured unstructured semi-structured)

ndash Using inexpensive commodity hardware

Reliability provided through replication

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 28: Big data presentation (2014)

copy 2014 IBM Corporation28

What is the Hadoop Distributed File System Driving principals

ndash Data is stored across the entire cluster (multiple nodes)

ndash Programs are brought to the data not the data to the program

ndash Follows the Divide and Conquer paradigm

Data is stored across the entire cluster (the DFS)

ndash The entire cluster participates in the file system

ndash Blocks of a single file are distributed across the cluster

ndash A given block is typically replicated as well for resiliency

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

0100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 29: Big data presentation (2014)

copy 2011 IBM Corporation29

Introduction to MapReduce

Scalable to thousands of nodes and petabytes of data

MapReduce Application

1 Map Phase(break job into small parts)

2 Shuffle(transfer interim output

for final processing)

3 Reduce Phase(boil all output down to

a single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt

private final static IntWritableone = new IntWritable(1)

private Text word = new Text()

public void map(Object key Text val ContextStringTokenizer itr =

new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())

contextwrite(word one)

public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita

private IntWritable result = new IntWritable()

public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)

sum += vget()

Distribute map

tasks to cluster

Hadoop Data Nodes

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 30: Big data presentation (2014)

copy 2011 IBM Corporation30

MapReduce Example

Hello World Bye World

Hello IBM

Reduce (final output)

lt Bye 1gt

lt IBM 1gt

lt Hello 2gt

lt World 2gt

Map 1lt Hello 1gt

lt World 1gt

lt Bye 1gt

lt World 1gt

Count number of words occurrences

Map 2lt Hello 1gt

lt IBM 1gt

Entry Data

Map

Process

Reduce

Process

Shuffle

Process

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 31: Big data presentation (2014)

copy 2014 IBM Corporation31

How to Analyze Large Data Sets in Hadoop

Its not just runtime Development phase has to be taken into

account

Although the Hadoop framework is implemented in Java

MapReduce applications do not need to be written in Java

To abstract complexities of Hadoop programming model a few

application development languages have emerged that build on top

of Hadoopndash Pig

ndash Hive

ndash Jaql

ndash Jaql

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 32: Big data presentation (2014)

copy 2014 IBM Corporation32

Pig Hive Jaql ndash Similarities

Reduced program size over Java

Applications are translated to map

and reduce jobs behind scenes

Extension points for extending

existing functionality

Interoperability with other

languages

Not designed for random

readswrites or low-latency queries

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 33: Big data presentation (2014)

Pig Hive Jaql ndash Differences

Characteristic Pig Hive Jaql

Developed by Yahoo Facebook IBM

Language Pig Latin HiveQL Jaql

Type of language

Data flow Declarative (SQL dialect) Data flow

Data structures supported

Complex Better suited for structured data

JSON semi structured

Schema Optional Not optional Optional

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 34: Big data presentation (2014)

Hadoop Distributions

34

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 35: Big data presentation (2014)

copy 2014 IBM Corporation35

Example of Hadoop Ecosystem

Visualization amp DiscoveryIntegration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

BigSheets JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Dashboard amp Visualization

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 36: Big data presentation (2014)

copy 2014 IBM Corporation36

Open Source frameworks I

Avro A data serialization system that includes a schema within each file A schema defines the data types that are

contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data

types and complex type definitions within a schema

Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop

cluster

HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data

sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications

are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column

families so that the elements of a column family are all stored together This approach is different from a row-oriented

relational database where all columns of a row are stored together

HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do

not need to know where or how your data is stored You can change how you write data while still supporting existing data in

older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service

that includes functions for both MapReduce and Pig

Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to

analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements

which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere

BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos

Business Intelligence software

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 37: Big data presentation (2014)

copy 2014 IBM Corporation37

Open Source frameworks II

Lucene A high-performance text search engine library that is written entirely in Java When you search within a

collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key

component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene

libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build

scan and query Lucene indexes

Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides

users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the

required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of

specific data in the file system

R A Project for Statistical Computing

Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop

systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and

export it to relational databases and enterprise data warehouses

Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper

maintains common objects that are needed in large cluster environments such as configuration information distributed

synchronization and group services

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 38: Big data presentation (2014)

copy 2014 IBM Corporation38

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

Big SQL

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 39: Big data presentation (2014)

copy 2014 IBM Corporation39

BigSheets

Browser based Analytical Tool that generates MapReduce Jobs

working over Hadoop Big Data

Helps non-programmers to work with Hadoop cluster

User models their big data as familiar spreadsheet-like tabular data

structures (collections) Once data is represented in a collection

business analysts can filter and enrich its content using built-in

functions and macros Furthermore analysts can combine data

residing in different collections as well as generate charts and new

ldquosheetsrdquo (collections) to visualize their data They can even export

data into a variety of common formats with a click of a button

Much of the technology included in Sheets was derived from the

BigSheets project of IBMrsquos Emerging Technologies team

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 40: Big data presentation (2014)

copy 2014 IBM Corporation40

BigSheets Collection Sample

Spreadsheet-like structures defined by user

Based on data accessible through BigInsights Web console ndash eg file

system data output from Web crawl etc

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 41: Big data presentation (2014)

copy 2014 IBM Corporation41

Big Sheets Collection Operations

Work with built-in ldquosheetsrdquo editor

Add delete columns

Filter data

Specify formulas to compute new

values using spreadsheet-style

syntax

Apply built-in or custom macro

functions

helliphelliphelliphellip

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 42: Big data presentation (2014)

copy 2014 IBM Corporation42

BigSheets Collection Graphic Visualization

Built-in charting facility aids analysis

Pie charts bar charts tag clouds maps etc

Hover over sections to reveal details

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 43: Big data presentation (2014)

copy 2014 IBM Corporation43

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

Text Analytics MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 44: Big data presentation (2014)

copy 2014 IBM Corporation44

What is Big SQL

Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture

ndash Comprehensive SQL92 ansi support

ndash Standards compliant client drivers (JDBC amp ODBC)

ndash Efficient handling of point queries

ndash Wide variety of data sources and file formats

ndash Extensive HBase focus

ndash Open source interoperability

Our driving design goalsndash Existing queries should run with no or few modifications

ndash Existing JDBC and ODBC compliant tools should continue to function

ndash Queries should be executed as efficiently as the chosen storage

mechanisms allow

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 45: Big data presentation (2014)

copy 2014 IBM Corporation45

Architecture

Big SQL shares catalogs with

Hive via the Hive metastorendash Each can query the others tables

SQL engine analyzes incoming

queriesndash Separates portion(s) to execute at

the server vs portion(s) to execute

on the cluster

ndash Re-writes query if necessary for

improved performance

ndash Determines appropriate storage

handler for data

ndash Produces execution plan

ndash Executes and coordinates query

Server layout and relative sizes

for illustrative purposes only

Application

SQL Language

JDBC ODBC Driver

BigInsights Cluster

Head Node

Big SQL Server

Head Node

Name Node

Head Node

Job TrackerNetwork Protocol

SQL Engine

Storage Handlers

Del

Files

SEQ

FilesHBase RDBMS bullbullbull

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Compute Node

Task

Tracker

Data

Node

Region

Server

bullbullbull

Compute Node

Task

Tracker

Data

Node

Region

Server

Head Node

Hive Metastore

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 46: Big data presentation (2014)

copy 2014 IBM Corporation46

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

R

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 47: Big data presentation (2014)

copy 2014 IBM Corporation47

What is Text Analytics

High Performance and Scalable rule based Information Extraction Engine

Distill structured information from unstructured data

- Rich annotator library supports multiple languages

Provides sophisticated tooling to help build test and refine rules

ndash Developer tools an easy to use text analytics language and a set of

extractors for fast adoption

ndash Multilingual support including support for DBCS languages

Developed at IBM Research since 2004 System T

BigInsights is the first time IBM opens up the Text Analytics Engine

technology for customization and development

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 48: Big data presentation (2014)

copy 2014 IBM Corporation48

Annotator Query Language (AQL)

Language to create rules for Text Analytics

SQL Like Language

Fully declarative text analytics language

Once compiled produced an AOG plan to work in the data

No ldquoblack boxesrdquo or modules that canrsquot be customized

Tooling for easy customization because you are abstracted from the

programmatic details

Competing solutions make use of locked up black-box modules that cannot be

customized which restricts flexibility and are difficult to optimize for performance

create view AmountWithUnit as

extract pattern ltNmatchgt ltUmatchgt

as match

from Number N Unit U

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 49: Big data presentation (2014)

copy 2014 IBM Corporation49

Text Analytic Simple Example

NetherlandsStrikerArjen Robben

Keeper SpainIker Casillas

WingerAndres Iniesta Spain

World Cup 2010 Highlights

Football World Cup 2010 one team distinguished well

from the rest winning the final Early in the second

half Netherlandsrsquo striker Arjen Robben had a chance

to score but the awesome keeper for Spain Iker

Casillas made the save Winner superiority was

reflected when Winger Andres Iniesta scored for Spain

for the win

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 50: Big data presentation (2014)

copy 2014 IBM Corporation5050

Text Analytic Real Example

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 51: Big data presentation (2014)

copy 2014 IBM Corporation5151

One step beyond Watson

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 52: Big data presentation (2014)

copy 2014 IBM Corporation52

Example of Hadoop Ecosystem

Dashboard amp Visualization

Integration

Workload Optimization

Streams

Netezza

Flume

DB2

DataStage

IBM InfoSphere BigInsights

Runtime

Advanced Analytic Engines

File System

MapReduce

HDFS

Data StoreHBase

Text Processing Engine amp Extractor Library)

JDBC

Applications amp Development

MapReduce

Pig amp Jaql Hive

Administration

Index

Splittable Text Compression

Enhanced Security

Flexible Scheduler

Jaql

Pig

ZooKeeper

Lucene

Oozie

Adaptive MapReduce

Hive

Integrated Installer

Admin Console

Sqoop

Adaptive Algorithms

Apps

Workflow Monitoring

Management

HCatalog

Security

Audit amp History

Lineage

Guardium

PlatformComputing

Cognos

IBMOpen Source

GPFS-FPO

NameNode High Avail

Avro

Visualization amp Discovery

BigSheets

Big SQL

Text Analytics

R

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 53: Big data presentation (2014)

copy 2014 IBM Corporation53

Big R

bull Explore visualize transform

and model big data using

familiar R syntax and

paradigm

bull Scale out R with MR

programming

ndash Partitioning of large data

ndash Parallel cluster execution of R

code

bull Distributed Machine

Learning

ndash A scalable statistics engine that

provides canned algorithms and

an ability to author new ones all

via R

R Clients

Scalable

Machine

Learning

Data Sources

Embedded R

Execution

IBM R Packages

IBM R Packages

Pull data

(summaries) to

R client

Or push R

functions

right on the

data

1

2

3

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 54: Big data presentation (2014)

copy 2014 IBM Corporation54

Where Does BigData Fit

Analytical database

(DW)

Source Systems

Analytical tools

5 Explore data

6 Parse aggregate

ldquoCapture in case itrsquos neededrdquo

1 Extract transform load

ldquoCapture only whatrsquos neededrdquo

9 Report and mine data

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 55: Big data presentation (2014)

copy 2014 IBM Corporation55

Big Data Concepts

Big Data Technology

Data Scientists

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 56: Big data presentation (2014)

copy 2014 IBM Corporation56

Data scientist ndash The new cool guy in town

Article in Fortune ldquoThe unemployment rate in

the US continues to be abysmal (91 in

July) but the tech world has spawned a

new kind of highly skilled nerdy-cool job

that companies are scrambling to fill data

scientistrdquo

McKinsey Global Institute ldquoBig data Reportrdquo

By 2018 the United States alone could

face a shortage of 140000 to 190000

people with deep analytical skills as well as

15 million managers and analysts with the

know-how to use the analysis of big data to

make effective decisions

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 57: Big data presentation (2014)

copy 2014 IBM Corporation57

Data Science is Multidisciplinary

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 58: Big data presentation (2014)

copy 2014 IBM Corporation58

Successful Data Scientist Characteristics

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 59: Big data presentation (2014)

copy 2014 IBM Corporation59

Data Scientist Qualities

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 60: Big data presentation (2014)

copy 2014 IBM Corporation60

How Long Does It Take For a Beginner to Become

a Good Data Scientist

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 61: Big data presentation (2014)

copy 2014 IBM Corporation61

wwwkagglecom

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 62: Big data presentation (2014)

copy 2014 IBM Corporation62

Kaggle ranking

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 63: Big data presentation (2014)

copy 2014 IBM Corporation63 copy 2013 IBM Corporation63

Learn Big Data

Reading Materials - Online

ndash Understanding Big Data ndash Free PDF Book

bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF

ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights

bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml

ndash Implementing IBM InfoSphere BigInsights on System x - Redbook

bull httpwwwredbooksibmcomredpiecesabstractssg248077html

Resources

ndash Big Data Information Center

bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml

ndash InfoSphere BigInsights

bull www-01ibmcomsoftwaredatainfospherebiginsights

ndash Stream Computing

bull www-01ibmcomsoftwaredatainfospherestream-computing

ndash DeveloperWorks forums demos

bull httpwwwibmcomdeveloperworkswikibiginsights

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 64: Big data presentation (2014)

copy 2014 IBM Corporation64

Learn Big Data Technologies

BigDataUniversitycom

Flexible on-line delivery

allows learning your place

and your pace

Free courses free study

materials

Cloud-based sandbox for

exercises ndash zero setup

Robust Course

Management System and

Content Distribution

infrastructure

copy 2014 IBM Corporation65

65

Page 65: Big data presentation (2014)

copy 2014 IBM Corporation65

65