How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights and Streams. Tom Deutsch, IBM. Vladimir Bacvanski, Founder, SciSpike ([email protected]). Stephen Brodsky, Technical Executive and Distinguished Engineer, IBM (sbrodsky@us.ibm.com). © 2011 IBM Corporation & SciSpike, August 24, 2011.

description

Do you wonder how to process huge amounts of data in a short amount of time? If so, this session is for you! You will learn why Apache Hadoop and Streams are the core frameworks that enable storing, managing, and analyzing vast amounts of data. You will learn the idea behind Hadoop's famous map-reduce algorithm and why it is at the heart of solutions that process massive amounts of data with flexible workloads and software-based scaling. We explore how to go beyond Hadoop with both real-time and batch analytics, usability, and manageability. For practical examples, we will use IBM InfoSphere BigInsights and Streams, which build on top of open source tooling when going beyond the basics and scaling up and out is needed.

Transcript of How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Page 1: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights and Streams

Tom Deutsch, IBM

Vladimir Bacvanski, Founder, SciSpike
[email protected]

Stephen Brodsky, Technical Executive and Distinguished Engineer, IBM
[email protected]

© 2011 IBM Corporation & SciSpike
August 24, 2011

Page 2: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Who are we?

Dr. Vladimir Bacvanski
– Consultant, trainer, and mentor focusing on making clients successful in adopting new data and software approaches
– Over 20 years of experience
– Founder of SciSpike, a training and consulting firm specializing in advanced software and data technologies

Stephen Brodsky, Ph.D.
– Distinguished Engineer and Technical Executive for IBM Big Data initiatives at the IBM Silicon Valley Laboratory
– Previously led the architecture for the Optim Data Studio product line and pureQuery and was a member of the architecture team for DB2 pureXML, Rational Application Developer (RAD), and WebSphere

Page 3: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Agenda

The “Big Data” challenge: smarter analytics for a smarter planet

How to do it?
– The big data challenge
– Foundations of Big Data approaches
– MapReduce and Hadoop
– Real-time data and stream processing
– Integration with existing systems

Page 4: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

The “Big Data” Challenge


Page 5: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

The World is Changing and Becoming More…

INSTRUMENTED

INTERCONNECTED

INTELLIGENT

The resulting explosion of information creates a need for a new kind of intelligence

…to help build a Smarter Planet

Page 6: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Information is Growing at a Phenomenal Rate . . . .

44x as much data and content over the coming decade

80% of the world’s data is unstructured

2009: 800,000 petabytes

2020: 35 zettabytes (35 billion terabytes)

Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
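The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, which the slide does not state explicitly; the variable names are illustrative only.

```python
# Assumption: SI units (1 PB = 10**15 bytes, 1 TB = 10**12 bytes, 1 ZB = 10**21 bytes).
PETABYTES_PER_ZETTABYTE = 10**6
TERABYTES_PER_ZETTABYTE = 10**9

data_2009_pb = 800_000   # petabytes in 2009, from the slide
data_2020_zb = 35        # zettabytes projected for 2020, from the slide

# Growth factor between 2009 and 2020, computed in petabytes.
growth_factor = data_2020_zb * PETABYTES_PER_ZETTABYTE / data_2009_pb

# The same 2020 volume expressed in terabytes.
data_2020_tb = data_2020_zb * TERABYTES_PER_ZETTABYTE

print(f"growth factor: {growth_factor}")     # roughly the 44x quoted on the slide
print(f"2020 volume:   {data_2020_tb:,} TB")
```

Under these assumptions the ratio comes out just under 44, and 35 zettabytes corresponds to 35 billion terabytes, consistent with the slide.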

Page 7: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

The BIG Data Challenge
• Manage and benefit from massive and growing amounts of data
• Handle varied data formats (structured, unstructured, semi-structured) and increased data velocity
• Exploit BIG Data in a timely and cost-effective fashion

Collect, Manage, Integrate, Analyze

Page 8: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

What clients are saying . . .
• Lots of potentially valuable data is dormant or discarded due to size/performance considerations
• Large volume of unstructured or semi-structured data is not worth integrating fully (e.g., Tweets, logs, . . .)
• Not clear what should be analyzed (exploratory, iterative)
• Information distributed across multiple systems and/or Internet
• Some information has a short useful lifespan
• Volumes can be extremely high
• Analysis needed in the context of existing information (not stand-alone)

Page 9: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Big Data Presents Big Opportunities

Extract insight from a high volume, variety and velocity of data in a timely and cost-effective manner

Variety: Manage and benefit from diverse data types and data structures

Velocity: Analyze streaming data and large volumes of persistent data

Volume: Scale from terabytes to zettabytes

Page 10: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Streams and Oceans of Information . . . .

Information streams
• High speed information flowing in real-time, often transient
• Information from sensors, instruments, etc.
• Information flowing from real-time logs and activity monitors
• Streaming content like audio and video
• High speed transactions like tickers, trades, or traffic systems

Information oceans
• Information stored outside conventional systems. Data may originate from the Web or different internal systems
• Collection of what has streamed
• Information from social media, logs, click streams, emails, etc.
• Unstructured or mixed schema documents like claims, forms, desktop applications, etc.
• Structured data from disparate systems

Page 11: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Applications for Big Data Analytics
• Finance
• Smarter Healthcare
• Multi-channel sales
• Homeland security
• Telecom
• Traffic Control
• Manufacturing
• Trading Analytics

Many more!

Page 12: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Use Case Example: Energy Company

Business scenario
• Analyze large volumes of public and private weather data for alternative energy business
• Existing high-performance computing hardware, limited staff

Technical challenges
• High data volume: 2+ PB
• Range of query types
  - Avg temp in given location? (Small result)
  - Geo pts where ice may form on wind turbines? (Large result, derived values: icing determined by humidity + temp.)
• Run on system with non-Hadoop apps

Page 13: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Use Case Example: Global Media Firm

Business scenario
• Identify unauthorized content streaming in digital media (piracy)
  - Quantify annual revenue loss
  - Analyze trends
• Monitor social media sites to identify dissemination of pirated content. Time sensitive!

Technical challenges
• High variety of unstructured and semi-structured data
• Initial focus: text analytics over 1 year’s worth of social media data. Look for live streaming URLs, sentiment, event info, etc.
• Complex rules to qualify & classify info
• Future potential for video analysis

Page 14: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

IBM Watson

IBM Watson is a breakthrough in analytic innovation, but it is only successful because of the quality of the information from which it is working.

Page 15: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Big Data and Watson
• Watson technology offers great potential for advanced business analytics
• Big Data technology is used to build Watson’s knowledge base
• Watson uses the Apache Hadoop open framework to distribute the workload for loading information into memory
  - Approx. 200M pages of text (to compete on Jeopardy!)

[Diagram: POS data, CRM data, and social media flow into InfoSphere BigInsights (advanced search and analysis) and Watson’s memory, producing distilled insight: spending habits, social relationships, buying trends]

Page 16: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Customer Engagements

Use patterns
• Customer sentiment analysis (cross-sell, up-sell, campaign management)
• Integrated retail and web customer behavior modeling
• Predictive modeling (credit card fraud)
• System log analytics (reduce operational risk)

Common requirements
• Extract business insight from large volumes of raw data (often outside operational systems)
• Integrate with other existing software
• Ready for enterprise use

[Diagram labels: Text, Blog, Weblog; Click streams; Log & transactions; Multi-channel sales; Consumer Insight; Biological Sequences; Next Gen Fraud Models; New Business Development; Operational system & streams data sources; Statistical Model Building; Text Analytics]

Page 17: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

The approach to crunching big data

Page 18: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

How to approach Big Data analytics?

InfoSphere BigInsights and InfoSphere Streams
• Analytics for data in-motion and at-rest
• Platform for processing large volumes of diverse data
• Complements and integrates with existing software solutions

Page 19: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Addressing the Key Requirements

Big Data Platform

1. Platform for V3 – Variety, Velocity, Volume
   • Variety - manage data & content “As Is”
   • Handle any velocity - low-latency streams and large volume batch
   • Volume - huge volumes of at-rest or streaming data

2. Analytics for V3
   • Analyze sources in their native format - text, data, rich content
   • Analyze all of the data - not just a subset
   • Dynamic analytics - automatic adjustments and actions

3. Ease of Use for Developers and Users
   • Developer UIs, common languages & automatic optimization
   • End-user UIs & visualization

4. Enterprise Class
   • Failure tolerance, Security and Privacy
   • Scale Economically

5. Extensive Integration Capabilities
   • Integrate wide variety of sources
   • Leverage enterprise integration technologies

Page 20: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Big Data Initiative
• InfoSphere BigInsights: analytic applications for volumes of diverse, persistent data (“Big Data”)
• Warehouse: traditional warehouse applications
• InfoSphere Streams: real-time streaming data

Page 21: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

BigInsights Summary

BigInsights = analytical platform for persistent “Big Data”
– Based on open source & IBM technologies

Distinguishing characteristics
– Built-in analytics . . . . Enhances business knowledge
– Enterprise software integration . . . . Complements and extends existing capabilities
– Production-ready platform . . . . Speeds time-to-value; simplifies development and maintenance

Page 22: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Big Data Platform Vision

Bringing Big Data to the Enterprise

[Diagram: Big Data Solutions sit on top of Big Data User Environments (Developers, End Users, Administrators), Big Data Enterprise Engines (Streaming Analytics, Internet Scale Analytics), and Open Source Foundational Components. Agents and Integration connect the platform to enterprise systems: Data Warehouse, Information Integration, Master Data Mgmt, Database, Content Analytics, Business Analytics, Marketing, Data Growth Management]

Page 23: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

InfoSphere BigInsights v 1.1

Platform for volume, variety, velocity -- V3
• Hadoop foundation

Analytics for V3
• Text analytics & tooling

Usability
• Web administrative console
• Integrated install
• Spreadsheet-style analytic tool

Enterprise Class
• Storage, security, cluster management

Integration
• Connectivity to DB2, Netezza

[Chart: editions plotted by enterprise class against breadth of capabilities. Edition labels: Apache Hadoop; Basic Edition; Enterprise Edition. Feature labels: free download; licensed; 24 x 7 Web support; text analytics; spreadsheet-style analytic tool; flexible job scheduler; web admin console, LDAP authentication; RDBMS, warehouse connectivity]

Page 24: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

BigInsights Platform: Key Ideas

Flexible, enterprise-class support for processing large volumes of data
– Based on Google’s MapReduce technology
– Inspired by Apache Hadoop; compatible with its ecosystem and distribution
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
– CPU + disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written

Page 25: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

The MapReduce Programming Model

"Map" step:
– Input split into pieces
– Worker nodes process individual pieces in parallel (under global control of the Job Tracker node)
– Each worker node stores its result in its local file system where a reducer is able to access it

"Reduce" step:
– Data is aggregated ("reduced" from the map steps) by worker nodes (under control of the Job Tracker)
– Multiple reduce tasks can parallelize the aggregation
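The two steps above can be imitated in a single process. The sketch below is a toy illustration, not Hadoop's API: `run_map_reduce`, the thread pool standing in for worker nodes, and the sample temperature readings are all assumptions made for the example. It also shows the grouping ("shuffle") stage that sits between map and reduce, which the slide does not name.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_map_reduce(splits, map_fn, reduce_fn, workers=4):
    """Toy single-machine MapReduce: threads stand in for worker nodes."""
    # Map step: each input split is processed independently and in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(lambda split: list(map_fn(split)), splits))

    # Shuffle: group intermediate values by key so each reducer sees one key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)

    # Reduce step: aggregate the values collected for each key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def emit_readings(split):
    """Map function: emit (station, temperature) pairs from one split."""
    for station, temperature in split:
        yield station, temperature


def max_temperature(station, temperatures):
    """Reduce function: keep the highest temperature per station."""
    return max(temperatures)


# Hypothetical input, already divided into two splits.
splits = [
    [("A", 12.0), ("B", 7.5)],
    [("A", 15.5), ("B", 3.0)],
]
result = run_map_reduce(splits, emit_readings, max_temperature)
print(result)
```

In a real cluster the splits live on different machines and the framework schedules the map and reduce tasks; only the two user-supplied functions resemble what a developer would write.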

Page 26: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

What is Hadoop?

Apache Hadoop = free, open source framework for data-intensive applications
– Inspired by Google technologies (MapReduce, GFS)
– Well-suited to batch-oriented, read-intensive applications
– Originally built to address scalability problems of Nutch, an open source Web search technology

Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
– CPU + disks of commodity box = Hadoop “node”
– Boxes can be combined into clusters
– New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written

Page 27: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Two Key Aspects of Hadoop

MapReduce framework
– How Hadoop understands and assigns work to the nodes (machines)

Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one big file system
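One way to picture a file system that spans many nodes is to split a file into fixed-size blocks and keep copies of each block on several machines. The sketch below is a simplified model under stated assumptions: the round-robin placement, the function name `place_blocks`, and the node names are invented for illustration. Real HDFS does store files as replicated blocks, but it uses its own placement policy rather than this one.

```python
def place_blocks(file_size_mb, block_size_mb, nodes, replication=3):
    """Assign each block of a file to `replication` distinct nodes.

    Assumption: simple round-robin placement, chosen only for illustration.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")

    # Ceiling division: a partially filled last block still needs a slot.
    block_count = -(-file_size_mb // block_size_mb)

    placement = {}
    for block in range(block_count):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


# Hypothetical 200 MB file, 64 MB blocks, four-node cluster.
layout = place_blocks(200, 64, ["node1", "node2", "node3", "node4"])
for block, holders in layout.items():
    print(f"block {block}: {holders}")
```

Because every block lives on more than one node, losing a single machine leaves each block readable elsewhere, which is the basis of the fault tolerance mentioned later in the deck.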

Page 28: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Logical MapReduce Example: Word Count

Content of Input Documents
  Document 1: Hello World Bye World
  Document 2: Hello IBM

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

Map 1 emits: < Hello, 1> < World, 1> < Bye, 1> < World, 1>

Map 2 emits: < Hello, 1> < IBM, 1>

Reduce (final output): < Bye, 1> < IBM, 1> < Hello, 2> < World, 2>
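The word count above can be run locally as plain Python. This is a minimal sketch rather than a Hadoop job: the names `map_words` and `reduce_counts` and the document keys `doc1` and `doc2` are assumptions, and counts are kept as integers instead of the strings used in the slide's pseudocode.

```python
from collections import defaultdict


def map_words(doc_name, contents):
    """Map: emit (word, 1) for every word in one document."""
    for word in contents.split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


# The two input documents from the slide.
documents = {
    "doc1": "Hello World Bye World",
    "doc2": "Hello IBM",
}

# Group intermediate pairs by word (the framework does this in Hadoop).
intermediate = defaultdict(list)
for name, text in documents.items():
    for word, count in map_words(name, text):
        intermediate[word].append(count)

# Reduce each group; sorting the keys makes the output order deterministic.
totals = dict(reduce_counts(word, counts) for word, counts in sorted(intermediate.items()))
print(totals)
```

The resulting totals match the reduce output listed on the slide: Bye 1, Hello 2, IBM 1, World 2.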

Page 29: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

How To Create MapReduce Jobs

MapReduce development in Java
– Low level, very flexible
– Time consuming development

Hive
– Open source language / Apache sub-project
– Provides a SQL-like interface to Hadoop

Pig
– Data flow language / Apache sub-project

Jaql
– A query language for JSON
– Useful for loosely structured data

Page 30: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Management Tools: Web Console

Graphically manage cluster, jobs, HDFS

Sample administration tasks
– Start/Stop Servers
– Add/Remove Servers
– Server Status Details (Log)

Page 31: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Spreadsheet-like Analysis Tool

Web-based analysis and visualization tool: BigSheets

Spreadsheet-like interface
– Define and manage long running data collection jobs
– Analyze content of the text on the pages that have been retrieved

Page 32: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Text Analytics

• Distill structured info from unstructured data
  – Sentiment analysis
  – Consumer behavior
  – Illegal or suspicious activities
  – . . .
• Pre-built library of text annotators for common business entities
• Rich language and tooling to build custom annotators
• Support for Western languages (English, Dutch/Flemish, French, German, Italian, Portuguese, or Spanish) plus select Asian languages (Japanese, Chinese)

Sample annotator types: "Acquisition", "Address", "Alliance", "AnalystEarningsEstimate", "City", "CompanyEarningsAnnouncement", "CompanyEarningsGuidance", "Continent", "Country", "County", "DateTime", "EmailAddress", "JointVenture", "Location", "Merger", "NotesEmailAddress", "Organization", "Person", "PhoneNumber", "StateOrProvince", "URL", "ZipCode"
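To make the idea of an annotator concrete, the sketch below extracts three of the entity types listed above using regular expressions. It is an illustration only: BigInsights annotators are built with IBM's own language and tooling, not Python, and the patterns here are deliberately simplistic assumptions (for example, the phone pattern only recognizes the NNN-NNN-NNNN form).

```python
import re

# Hypothetical, simplified patterns; real annotators are far more thorough.
ANNOTATORS = {
    "EmailAddress": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL": re.compile(r"https?://[^\s,;]+"),
    "PhoneNumber": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def annotate(text):
    """Return (label, matched_text, offset) tuples in order of appearance."""
    spans = []
    for label, pattern in ANNOTATORS.items():
        for match in pattern.finditer(text):
            spans.append((label, match.group(), match.start()))
    return sorted(spans, key=lambda span: span[2])


sample = (
    "Contact jane.doe@example.com or 555-123-4567, "
    "see http://example.com/offer for details."
)
annotations = annotate(sample)
for label, value, offset in annotations:
    print(f"{label:<13} {value} (offset {offset})")
```

Each annotation turns a piece of free text into a typed, structured record, which is the "distill structured info from unstructured data" step the slide describes.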

Page 33: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Eclipse-based Text Analytics Development

Page 34: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

So What Does This Result In?

Easy To Scale

Fault Tolerant and Self-Healing

Data Agnostic

Extremely Flexible

Page 35: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Working with streaming data: a new paradigm

Conventional processing: static data
[Diagram: queries are applied to stored data to produce results]

Real-time processing: streaming data
[Diagram: data flows through continuously running queries to produce results]
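The contrast between the two paradigms can be sketched with a Python generator: instead of running a query over a stored table, a standing computation consumes values as they arrive and emits an updated result each time. The example below is a generic illustration and does not use InfoSphere Streams; `moving_average`, `ticker`, and the sample prices are assumptions.

```python
from collections import deque


def moving_average(stream, window=3):
    """Standing query: emit the average of the last `window` values seen."""
    recent = deque(maxlen=window)   # older values fall out automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


def ticker():
    """Hypothetical price feed; a real source would never terminate."""
    for price in [10.0, 12.0, 11.0, 15.0]:
        yield price


averages = list(moving_average(ticker()))
print(averages)
```

Only the last few values are held in memory, so the same code works whether the feed delivers four prices or runs indefinitely.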

Page 36: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Real-Time Data with InfoSphere Streams

Streaming analytic applications
– Multiple input streams
– Advanced streaming analytics

Eclipse based IDE
– Define sources, apply operators, define intermediary and final output sinks
– User defined operators in Java or C++

Optimizing compiler automates deployment and connections
– Extremely low latency
– Cluster of up to 125 nodes

[Diagram labels: Source Adapters; Operator Repository; Sink Adapters; InfoSphere Streams Studio (IDE for Streams Processing Language); Automated, Optimized Deploy and Management (Scheduler)]

Page 37: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Scalable stream processing

InfoSphere Streams provides
– A programming model and IDE for defining data sources and software analytic modules called operators that are fused into process execution units (PEs)
– infrastructure to support the composition of scalable stream processing applications from these components
– deployment and operation of these applications across distributed x86 processing nodes, when scaled processing is required
– stream connectivity between data sources and PEs of a stream processing application
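The source, operator, and sink structure described above can be approximated with chained Python generators. This is a loose analogy under stated assumptions, not Streams Processing Language: the operator names, the sensor readings, and the 75 degree threshold are all hypothetical. Chaining the generators in one process is roughly comparable to fusing operators into a single execution unit.

```python
def source(readings):
    """Source operator: emit readings one at a time."""
    for reading in readings:
        yield reading


def filter_op(stream, predicate):
    """Filter operator: pass through only items matching the predicate."""
    for item in stream:
        if predicate(item):
            yield item


def map_op(stream, transform):
    """Transform operator: apply a function to every item."""
    for item in stream:
        yield transform(item)


def sink(stream, output):
    """Sink operator: drain the stream into an output collection."""
    for item in stream:
        output.append(item)


# Hypothetical sensor data and alert threshold.
readings = [
    {"sensor": "s1", "temp_c": 21.5},
    {"sensor": "s2", "temp_c": 80.0},
    {"sensor": "s1", "temp_c": 95.2},
]

alerts = []
pipeline = map_op(
    filter_op(source(readings), lambda r: r["temp_c"] > 75.0),
    lambda r: f"{r['sensor']} overheating: {r['temp_c']}",
)
sink(pipeline, alerts)
print(alerts)
```

Each item flows through the whole chain before the next one is read, so nothing is buffered between operators; a distributed runtime would instead place operators on different nodes and connect them with network streams.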

Page 38: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Merging the Traditional and Big Data Approaches

Traditional Approach: Structured & Repeatable Analysis
– Business Users: Determine what question to ask
– IT: Structures the data to answer that question
– Examples: Monthly sales reports, Profitability analysis, Customer surveys

Big Data Approach: Iterative & Exploratory Analysis
– IT: Delivers a platform to enable creative discovery
– Business: Explores what questions could be asked
– Examples: Brand sentiment, Product strategy, Maximum asset utilization

Page 39: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

BigInsights and the data warehouse: filtering and summarizing “Big Data”

[Diagram: BigInsights and the data warehouse]
• Broader analytic coverage
• Exploits IT investments while minimizing burden

Page 40: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

BigInsights as a “queryable archive” for growing data warehouses

[Diagram: BigInsights and the data warehouse]
• Offload “cold” or dated warehouse info but maintain access for further exploration
• Keep warehouse size manageable and focused on well-known business analytic needs
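The offloading idea can be sketched as a date-based split: recent rows stay in the warehouse, older rows move to an archive that can still be queried. The example below is a plain-Python illustration with hypothetical sales rows and a hypothetical cutoff date; it does not use any BigInsights or warehouse API.

```python
from datetime import date


def split_hot_cold(rows, cutoff):
    """Partition rows into recent ("hot") and dated ("cold") by sale date."""
    hot = [row for row in rows if row["sale_date"] >= cutoff]
    cold = [row for row in rows if row["sale_date"] < cutoff]
    return hot, cold


# Hypothetical warehouse contents.
rows = [
    {"sale_date": date(2008, 3, 14), "amount": 120.0},
    {"sale_date": date(2009, 11, 2), "amount": 75.5},
    {"sale_date": date(2011, 6, 30), "amount": 210.0},
]

warehouse, archive = split_hot_cold(rows, cutoff=date(2010, 1, 1))

# The archive remains queryable for exploratory questions.
archived_total = sum(row["amount"] for row in archive)
print(len(warehouse), len(archive), archived_total)
```

The warehouse keeps only the rows needed for well-known reporting, while the archived rows stay available for occasional exploration.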

Page 41: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Trends and directions

Enterprise software integration
– Data warehouses, RDBMSs
– ETL platforms
– Business intelligence tools
– Applications
– . . .

Diverse range of analytics
– Text
– Image / video (e.g., content-based user profiling)
– Predictive modeling (e.g., ranking and classification based on machine learning)
– . . .

Sophisticated, scalable infrastructure for processing massive data volumes
– High-performance file system with full POSIX compliance, granular security
– Fully recoverable and restartable workflows
– Parallel, distributed indexing for text (“BigIndex”)
– Read-optimized column store
– Tooling for administrators, programmers, analysts
– . . .

Page 42: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Integrating Relational, Streams, and BigInsights

[Diagram: three paths from data sources to results]
– Traditional / Relational Data Sources → Database & Warehouse (at-rest data, warehouse data analytics) → Results
– Non-Traditional / Non-Relational Data Sources → Streams (in-motion analytics, ultra low latency) → Results
– Varied data formats (semi-structured, unstructured...) → InfoSphere BigInsights (Big Data, batch oriented data analytics, massive scale) → Results

Page 43: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Typical Strategy for Analytics

[Diagram: Sources → ETL (Extract, Transform/subset, Load) → Data warehouse / marts → SQL, Analytics, Mining]

Page 44: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Emerging requirements for analytics

[Diagram: Structured sources → ETL, ELT (Extract, Transform/subset, Load) → Warehouses / marts → SQL, Analytics, Mining. Other sources → BigInsights Repository (MR, BI, Mining; Transform, Analyze)]

– Explore large volumes of “raw” or diverse data.
– Discover, analyze new insights with BigInsights

Page 45: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Conclusions

– Scale out to crunch petabytes
– We need a mix of technologies
  • Data at rest: MapReduce, Hadoop and beyond
  • Data in motion: stream processing
– To be successful, integrate with conventional technologies

Page 46: How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Getting in touch

Stephen Brodsky - IBM
– Email: [email protected]

InfoSphere BigInsights
– http://www-01.ibm.com/software/data/infosphere/biginsights/

InfoSphere Streams
– http://www-01.ibm.com/software/data/infosphere/streams/

Vladimir Bacvanski - SciSpike
– Email: [email protected]
– Blog: http://www.OnBuildingSoftware.com/
– Twitter: http://twitter.com/OnSoftware
– LinkedIn: http://www.linkedin.com/in/VladimirBacvanski