How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams


Tom Deutsch, IBM

Vladimir Bacvanski, Founder, SciSpike
[email protected]

Stephen Brodsky, Technical Executive and Distinguished Engineer, IBM
[email protected]

© 2011 IBM Corporation & SciSpike
August 24, 2011

Who are we?

Dr. Vladimir Bacvanski
– Consultant, trainer, and mentor focusing on making clients successful in adopting new data and software approaches
– Over 20 years of experience
– Founder of SciSpike, a training and consulting firm specializing in advanced software and data technologies

Stephen Brodsky, Ph.D.
– Distinguished Engineer and Technical Executive for IBM Big Data initiatives at the IBM Silicon Valley Laboratory
– Previously led the architecture for the Optim Data Studio product line and pureQuery and was a member of the architecture team for DB2 pureXML, Rational Application Developer (RAD), and WebSphere


Agenda

The “Big Data” challenge: smarter analytics for a smarter planet

How to do it?
– The big data challenge
– Foundations of Big Data approaches
– MapReduce and Hadoop
– Real-time data and stream processing
– Integration with existing systems


The “Big Data” Challenge


The World is Changing and Becoming More…

INSTRUMENTED

INTERCONNECTED

INTELLIGENT

The resulting explosion of information creates a need for a new kind of intelligence

…to help build a Smarter Planet

Information is Growing at a Phenomenal Rate . . . .

44x as much data and content over the coming decade
– 2009: 800,000 petabytes
– 2020: 35 zettabytes (35 billion terabytes)

80% of the world’s data is unstructured

Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
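The growth figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes decimal (SI) units, which the slide does not state; under that assumption the 2009 and 2020 volumes do work out to roughly the quoted 44x.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption; the slide
# does not say whether decimal or binary units are meant).
TERABYTE = 10**12
PETABYTE = 10**15
ZETTABYTE = 10**21

data_2009 = 800_000 * PETABYTE   # IDC estimate for 2009
data_2020 = 35 * ZETTABYTE       # IDC projection for 2020

growth_factor = data_2020 / data_2009
terabytes_2020 = data_2020 // TERABYTE

print(f"Growth factor 2009 -> 2020: {growth_factor:.2f}x")
print(f"2020 volume in terabytes: {terabytes_2020:,}")
```

The ratio comes out just under 44, and 35 zettabytes equals 35 billion terabytes, matching the parenthetical on the slide.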

The BIG Data Challenge
• Manage and benefit from massive and growing amounts of data
• Handle varied data formats (structured, unstructured, semi-structured) and increased data velocity
• Exploit BIG Data in a timely and cost-effective fashion

COLLECT

MANAGE

INTEGRATE

ANALYZE


What clients are saying . . .

Lots of potentially valuable data is dormant or discarded due to size/performance considerations

Large volume of unstructured or semi-structured data is not worth integrating fully (e.g. Tweets, logs, . . .)

Not clear what should be analyzed (exploratory, iterative)

Information distributed across multiple systems and/or Internet

Some information has a short useful lifespan

Volumes can be extremely high

Analysis needed in the context of existing information (not stand alone)

Big Data Presents Big Opportunities

Extract insight from a high volume, variety and velocity of data in a timely and cost-effective manner

Variety: Manage and benefit from diverse data types and data structures

Velocity: Analyze streaming data and large volumes of persistent data

Volume: Scale from terabytes to zettabytes

Streams and Oceans of Information . . . .

Information streams
– High speed information flowing in real-time, often transient
– Information from sensors, instruments, etc.
– Information flowing from real-time logs and activity monitors
– Streaming content like audio and video
– High speed transactions like tickers, trades, or traffic systems

Information oceans
– Information stored outside conventional systems. Data may originate from the Web or different internal systems
– Collection of what has streamed
– Information from social media, logs, click streams, emails, etc.
– Unstructured or mixed schema documents like claims, forms, desktop applications, etc.
– Structured data from disparate systems

Applications for Big Data Analytics

Finance

Smarter Healthcare

Multi-channel sales

Homeland security

Telecom

Traffic Control

Manufacturing

Trading Analytics

Many more!


Use Case Example: Energy Company

Business scenario
– Analyze large volumes of public and private weather data for alternative energy business
– Existing high-performance computing hardware, limited staff

Technical challenges
– High data volume: 2+ PB
– Range of query types
  - Avg temp in given location? (Small result)
  - Geo pts where ice may form on wind turbines? (Large result, derived values; icing determined by humidity + temp.)
– Run on system with non-Hadoop apps


Use Case Example: Global Media Firm

Business scenario
– Identify unauthorized content streaming in digital media (piracy)
  - Quantify annual revenue loss
  - Analyze trends
– Monitor social media sites to identify dissemination of pirated content. Time sensitive!

– High variety of unstructured and semi-structured data.
– Initial focus: text analytics over 1 year’s worth of social media data. Look for live streaming URLs, sentiment, event info, etc.
– Complex rules to qualify & classify info
– Future potential for video analysis

IBM Watson

IBM Watson is a breakthrough in analytic innovation, but it is only successful because of the quality of the information from which it is working.


Big Data and Watson

Watson technology offers great potential for advanced business analytics

Big Data technology is used to build Watson’s knowledge base

Watson uses the Apache Hadoop open framework to distribute the workload for loading information into memory.

[Diagram labels: POS Data; CRM Data; Social Media; InfoSphere BigInsights; Distilled Insight (spending habits, social relationships, buying trends); Approx. 200M pages of text (to compete on Jeopardy!); Advanced search and analysis; Watson’s Memory]


Customer Engagements

Use patterns
• Customer sentiment analysis (cross-sell, up-sell, campaign management)
• Integrated retail and web customer behavior modeling
• Predictive modeling (credit card fraud)
• System log analytics (reduce operational risk)

Common requirements
• Extract business insight from large volumes of raw data (often outside operational systems)
• Integrate with other existing software
• Ready for enterprise use

[Diagram labels: Text, Blog, Weblog; Click streams; Log & transactions; Multi-channel sales; Consumer Insight; Biological Sequences; Next Gen Fraud Models; New Business Development; Operational system & streams data sources; Statistical Model Building; Text Analytics]

The approach to crunching big data


How to approach Big Data analytics?

InfoSphere BigInsights and InfoSphere Streams
• Analytics for data in-motion and at-rest
• Platform for processing large volumes of diverse data
• Complements and integrates with existing software solutions


Addressing the Key Requirements

Big Data Platform

1. Platform for V3 – Variety, Velocity, Volume
– Variety - manage data & content “As Is”
– Handle any velocity - low-latency streams and large volume batch
– Volume - huge volumes of at-rest or streaming data

2. Analytics for V3
– Analyze sources in their native format - text, data, rich content
– Analyze all of the data - not just a subset
– Dynamic analytics - automatic adjustments and actions

3. Ease of Use for Developers and Users
– Developer UIs, common languages & automatic optimization
– End-user UIs & visualization

4. Enterprise Class
– Failure tolerance, Security and Privacy
– Scale Economically

5. Extensive Integration Capabilities
– Integrate wide variety of sources
– Leverage enterprise integration technologies


Big Data Initiative

[Diagram labels: InfoSphere BigInsights (analytic applications for volumes of diverse, persistent data, “Big Data”); Warehouse (traditional warehouse applications); InfoSphere Streams (real-time streaming data)]

BigInsights Summary

BigInsights = analytical platform for persistent “Big Data”
– Based on open source & IBM technologies

Distinguishing characteristics
– Built-in analytics . . . . Enhances business knowledge
– Enterprise software integration . . . . Complements and extends existing capabilities
– Production-ready platform . . . . Speeds time-to-value; simplifies development and maintenance


Big Data Platform Vision

Bringing Big Data to the Enterprise

[Diagram labels: Big Data Solutions; Big Data User Environments (Developers, End Users, Administrators); Big Data Enterprise Engines (Streaming Analytics, Internet Scale Analytics); Open Source Foundational Components; Agents; Integration; Data Warehouse; Information Integration; Master Data Mgmt; Database; Content Analytics; Business Analytics; Marketing; Data Growth Management]

InfoSphere BigInsights v 1.1

Platform for volume, variety, velocity -- V3
– Hadoop foundation

Analytics for V3
– Text analytics & tooling

Usability
– Web administrative console
– Integrated install
– Spreadsheet-style analytic tool

Enterprise Class
– Storage, security, cluster management

Integration
– Connectivity to DB2, Netezza

[Diagram: editions plotted against two axes, “Breadth of capabilities” and “Enterprise class”. Labels: Apache Hadoop; Basic Edition; Enterprise Edition; Licensed; Free download; 24 x 7 Web support; Text analytics; Spreadsheet-style analytic tool; Flexible job scheduler; Web admin console, LDAP authentication; RDBMS, warehouse connectivity]

BigInsights Platform: Key Ideas

Flexible, enterprise-class support for processing large volumes of data
– Based on Google’s MapReduce technology
– Inspired by Apache Hadoop; compatible with its ecosystem and distribution
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
– CPU + disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written

The MapReduce Programming Model

"Map" step:
– Input split into pieces
– Worker nodes process individual pieces in parallel (under global control of the Job Tracker node)
– Each worker node stores its result in its local file system where a reducer is able to access it

"Reduce" step:
– Data is aggregated ("reduced" from the map steps) by worker nodes (under control of the Job Tracker)
– Multiple reduce tasks can parallelize the aggregation
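The two steps above can be imitated in a single Python process. This is only a sketch of the data flow, not Hadoop's API: `run_mapreduce`, `partition`, and the sensor readings are hypothetical names and data invented for illustration, and the loops that run sequentially here would be parallel tasks on a real cluster.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reduce task for a key; crc32 keeps the choice stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_mapreduce(splits, mapper, reducer, num_reducers=2):
    # Map step: each split is processed independently (in parallel on a real
    # cluster). Each map task keeps its output in one bucket per reduce task.
    map_outputs = []
    for split in splits:
        buckets = [defaultdict(list) for _ in range(num_reducers)]
        for record in split:
            for key, value in mapper(record):
                buckets[partition(key, num_reducers)][key].append(value)
        map_outputs.append(buckets)

    # Reduce step: reduce task r fetches bucket r from every map task,
    # merges the values per key, and aggregates them.
    results = {}
    for r in range(num_reducers):
        merged = defaultdict(list)
        for buckets in map_outputs:
            for key, values in buckets[r].items():
                merged[key].extend(values)
        for key, values in merged.items():
            results[key] = reducer(key, values)
    return results

# Hypothetical sensor readings, already split into two input pieces.
splits = [
    [("station-a", 21.5), ("station-b", 18.0)],
    [("station-a", 23.1), ("station-c", 19.4), ("station-b", 17.2)],
]

max_temps = run_mapreduce(
    splits,
    mapper=lambda record: [record],            # emit (station, temperature) unchanged
    reducer=lambda key, values: max(values),   # keep the highest reading per station
)
print(sorted(max_temps.items()))
```

Because every key is routed to exactly one reduce task, the reduce tasks never need to coordinate with each other, which is what lets the aggregation run in parallel.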

What is Hadoop?

Apache Hadoop = free, open source framework for data-intensive applications
– Inspired by Google technologies (MapReduce, GFS)
– Well-suited to batch-oriented, read-intensive applications
– Originally built to address scalability problems of Nutch, an open source Web search technology

Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
– CPU + disks of commodity box = Hadoop “node”
– Boxes can be combined into clusters
– New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written


Two Key Aspects of Hadoop

MapReduce framework
– How Hadoop understands and assigns work to the nodes (machines)

Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one big file system
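One way to picture a file system that spans many nodes: HDFS stores a file as fixed-size blocks and keeps several copies of each block on different machines. The toy sketch below illustrates that idea only. The function `place_blocks`, the node names, and the round-robin placement are assumptions made for illustration; real HDFS uses its own placement policy, and the 64 MB block size and replication factor of 3 are common defaults rather than figures from this deck.

```python
from itertools import cycle

def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into blocks and assign each block to `replication` nodes.

    Round-robin placement is a simplification for illustration; it is not
    the placement policy real HDFS uses.
    """
    num_blocks = -(-file_size // block_size)  # ceiling division
    node_ring = cycle(nodes)
    return {
        block_id: [next(node_ring) for _ in range(replication)]
        for block_id in range(num_blocks)
    }

MB = 1024 * 1024
nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
layout = place_blocks(file_size=200 * MB, block_size=64 * MB, nodes=nodes)

for block_id, holders in layout.items():
    print(f"block {block_id}: {holders}")
```

A client sees one 200 MB file, while the blocks physically live on the local disks of several nodes.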


Logical MapReduce Example: Word Count

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

Content of Input Documents
  Hello World Bye World
  Hello IBM

Map 1 emits:
  < Hello, 1> < World, 1> < Bye, 1> < World, 1>

Map 2 emits:
  < Hello, 1> < IBM, 1>

Reduce (final output):
  < Bye, 1> < IBM, 1> < Hello, 2> < World, 2>
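The word-count pseudocode can be tried out directly in plain Python. This is a single-process sketch, not Hadoop code: `map_words`, `reduce_counts`, and the document names `doc1` and `doc2` are illustrative names, and the grouping loop stands in for the shuffle that the framework would perform between the map and reduce steps.

```python
from collections import defaultdict

def map_words(doc_name, contents):
    """Map: emit (word, 1) for every word in one document."""
    return [(word, 1) for word in contents.split()]

def reduce_counts(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return sum(counts)

documents = {
    "doc1": "Hello World Bye World",
    "doc2": "Hello IBM",
}

# Shuffle: group intermediate values by key before reducing.
grouped = defaultdict(list)
for name, text in documents.items():
    for word, count in map_words(name, text):
        grouped[word].append(count)

word_counts = {
    word: reduce_counts(word, counts)
    for word, counts in sorted(grouped.items())
}
print(word_counts)
```

With the two input documents from the slide, the counts agree with the final reduce output shown: Bye 1, IBM 1, Hello 2, World 2.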

How To Create MapReduce Jobs

MapReduce development in Java
– Low level, very flexible
– Time consuming development

Hive
– Open source language / Apache sub-project
– Provides a SQL-like interface to Hadoop

Pig
– Data flow language / Apache sub-project

Jaql
– A query language for JSON
– Useful for loosely structured data

Management Tools: Web Console

Graphically manage cluster, jobs, HDFS

Sample administration tasks
– Start/Stop Servers
– Add/Remove Servers
– Server Status Details (Log)


Spreadsheet-like Analysis Tool

Web-based analysis and visualization tool: BigSheets

Spreadsheet-like interface
– Define and manage long running data collection jobs
– Analyze content of the text on the pages that have been retrieved


Text Analytics

• Distill structured info from unstructured data
  • Sentiment analysis
  • Consumer behavior
  • Illegal or suspicious activities
  • . . .
• Pre-built library of text annotators for common business entities
• Rich language and tooling to build custom annotators
• Support for Western languages (English, Dutch/Flemish, French, German, Italian, Portuguese, or Spanish) plus select Asian languages (Japanese, Chinese)

Annotators shown on the slide: "Acquisition", "Address", "Alliance", "AnalystEarningsEstimate", "City", "CompanyEarningsAnnouncement", "CompanyEarningsGuidance", "Continent", "Country", "County", "DateTime", "EmailAddress", "JointVenture", "Location", "Merger", "NotesEmailAddress", "Organization", "Person", "PhoneNumber", "StateOrProvince", "URL", "ZipCode"
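To make "distilling structured info from unstructured data" concrete, here is a toy annotator written with Python regular expressions. It is not the BigInsights text analytics language or library: the three patterns only borrow annotator names from the slide, are deliberately simplistic, and the sample sentence is made up.

```python
import re

# Deliberately simple patterns; production annotators handle far more variation.
ANNOTATORS = {
    "EmailAddress": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL": re.compile(r"https?://[^\s,]+"),
    "PhoneNumber": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def annotate(text):
    """Return (annotation type, matched text, start offset) tuples in text order."""
    found = []
    for name, pattern in ANNOTATORS.items():
        for match in pattern.finditer(text):
            found.append((name, match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])

sample = "Contact jane.doe@example.com or 555-123-4567, details at http://example.com/info"
for annotation in annotate(sample):
    print(annotation)
```

Each tuple is a structured record (type, value, position) pulled out of free text, which is the kind of output downstream analytics can aggregate or join with other data.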

Eclipse-based Text Analytics Development


So What Does This Result In?

Easy To Scale

Fault Tolerant and Self-Healing

Data Agnostic

Extremely Flexible


Working with streaming data: a new paradigm

Conventional processing: static data
[Diagram: queries are run against stored data to produce results]

Real-time processing: streaming data
[Diagram: data flows through standing queries to produce results]
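The contrast can be sketched in a few lines of Python: rather than loading all data and then querying it, a standing query stays in place and emits a result as each new reading arrives. The names `moving_average` and `sensor_feed` and the readings are hypothetical; a real feed would be unbounded.

```python
from collections import deque

def moving_average(stream, window_size):
    """Standing query: emit the average of the last `window_size` readings."""
    window = deque(maxlen=window_size)
    for reading in stream:
        window.append(reading)       # the oldest reading drops out automatically
        if len(window) == window_size:
            yield sum(window) / window_size

def sensor_feed():
    """Stand-in for an unbounded source; a real feed would never end."""
    yield from [10, 20, 30, 40, 50]

averages = list(moving_average(sensor_feed(), window_size=3))
print(averages)
```

Only the current window is held in memory, so the query keeps working however much data has already flowed past.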


Real-Time Data with InfoSphere Streams

Streaming analytic applications
– Multiple input streams
– Advanced streaming analytics

Eclipse based IDE
– Define sources, apply operators, define intermediary and final output sinks
– User defined operators in Java or C++

Optimizing compiler automates deployment and connections
– Extremely low latency
– Cluster of up to 125 nodes

[Diagram labels: Source Adapters; Operator Repository; Sink Adapters; InfoSphere Streams Studio (IDE for Streams Processing Language); Automated, Optimized Deploy and Management (Scheduler)]


Scalable stream processing

InfoSphere Streams provides
– A programming model and IDE for defining data sources and software analytic modules called operators that are fused into process execution units (PEs)
– Infrastructure to support the composition of scalable stream processing applications from these components
– Deployment and operation of these applications across distributed x86 processing nodes, when scaled processing is required
– Stream connectivity between data sources and PEs of a stream processing application
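The source, operator, and sink vocabulary above maps naturally onto composable functions. The sketch below is plain Python, not the Streams Processing Language: `fuse` is a hypothetical helper that chains operators into one unit, loosely analogous to fusing operators into a PE, and the ticker data is invented.

```python
from functools import reduce

def source(ticks):
    """Source adapter: yields (symbol, price) tuples."""
    yield from ticks

def only_symbol(symbol):
    """Operator factory: keep tuples for one symbol."""
    def operator(stream):
        return (t for t in stream if t[0] == symbol)
    return operator

def price_above(threshold):
    """Operator factory: keep tuples whose price exceeds the threshold."""
    def operator(stream):
        return (t for t in stream if t[1] > threshold)
    return operator

def fuse(*operators):
    """Chain operators into one unit, loosely analogous to a processing element."""
    def fused(stream):
        return reduce(lambda s, op: op(s), operators, stream)
    return fused

def sink(stream):
    """Sink adapter: collect results (a real sink might write to a socket or file)."""
    return list(stream)

# Hypothetical ticker data.
ticks = [("IBM", 164.0), ("XYZ", 12.5), ("IBM", 171.2), ("IBM", 168.9)]
pe = fuse(only_symbol("IBM"), price_above(165.0))
alerts = sink(pe(source(ticks)))
print(alerts)
```

Because each operator only consumes and produces a stream, the same operators can be regrouped into different units without being rewritten, which is the property that makes distributing them across nodes practical.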


Merging the Traditional and Big Data Approaches

Traditional Approach: Structured & Repeatable Analysis
– Business Users determine what question to ask
– IT structures the data to answer that question
– Examples: monthly sales reports, profitability analysis, customer surveys

Big Data Approach: Iterative & Exploratory Analysis
– IT delivers a platform to enable creative discovery
– Business explores what questions could be asked
– Examples: brand sentiment, product strategy, maximum asset utilization

BigInsights and the data warehouse: filtering and summarizing “Big Data”

[Diagram labels: BigInsights; Data warehouse]
• Broader analytic coverage
• Exploits IT investments while minimizing burden

BigInsights as a “queryable archive” for growing data warehouses

[Diagram labels: BigInsights; Data warehouse]
• Offload “cold” or dated warehouse info but maintain access for further exploration
• Keep warehouse size manageable and focused on well-known business analytic needs
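The queryable-archive pattern amounts to splitting rows by age while keeping both halves searchable. The sketch below uses in-memory lists as stand-ins for the warehouse and the archive; the cutoff date, the row layout, and the helper names `split_hot_and_cold` and `query` are all assumptions made for illustration.

```python
from datetime import date

def split_hot_and_cold(rows, cutoff):
    """Keep rows dated on or after `cutoff` in the warehouse; archive the rest."""
    hot = [row for row in rows if row["order_date"] >= cutoff]
    cold = [row for row in rows if row["order_date"] < cutoff]
    return hot, cold

def query(rows, predicate):
    """The same query function works against either store."""
    return [row for row in rows if predicate(row)]

orders = [  # hypothetical warehouse rows
    {"id": 1, "order_date": date(2008, 3, 14), "amount": 120.0},
    {"id": 2, "order_date": date(2010, 11, 2), "amount": 75.5},
    {"id": 3, "order_date": date(2011, 6, 30), "amount": 310.0},
]

warehouse, archive = split_hot_and_cold(orders, cutoff=date(2010, 1, 1))
large_archived = query(archive, lambda row: row["amount"] > 100)
print([row["id"] for row in warehouse], [row["id"] for row in large_archived])
```

The warehouse shrinks to recent rows, yet the older rows can still be explored on demand.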

Trends and directions

Enterprise software integration
– Data warehouses, RDBMSs
– ETL platforms
– Business intelligence tools
– Applications
– . . .

Diverse range of analytics
– Text
– Image / video (e.g., content-based user profiling)
– Predictive modeling (e.g., ranking and classification based on machine learning)
– . . .

Sophisticated, scalable infrastructure for processing massive data volumes
– High-performance file system with full POSIX compliance, granular security
– Fully recoverable and restartable workflows
– Parallel, distributed indexing for text (“BigIndex”)
– Read-optimized column store
– Tooling for administrators, programmers, analysts
– . . .

Integrating Relational, Streams, and BigInsights

Traditional / Relational Data Sources
– Database & Warehouse: at-rest data, warehouse data analytics, results

Non-Traditional / Non-Relational Data Sources
– Streams: in-motion analytics, ultra low latency, results

Non-Traditional / Non-Relational Data Sources (varied data formats; semi-structured, unstructured...)
– InfoSphere BigInsights: Big Data, batch oriented data analytics, massive scale, results


Typical Strategy for Analytics

[Diagram labels: Sources; ETL (Extract, Transform/subset, Load); Data warehouse / marts; SQL Analytics, Mining]


Emerging requirements for analytics

[Diagram labels: Structured Sources; ETL, ELT (Extract, Transform/subset, Load); Warehouses / marts; SQL Analytics, Mining; Other Sources; BigInsights Repository; Transform, Analyze (MR BI, Mining)]

Explore large volumes of “raw” or diverse data.

Discover, analyze new insights with BigInsights


Conclusions

– Scale out to crunch petabytes
– We need a mix of technologies
  • Data at rest: MapReduce, Hadoop and beyond
  • Data in motion: stream processing
– To be successful, integrate with conventional technologies


Getting in touch

Stephen Brodsky – IBM
– Email: [email protected]

InfoSphere BigInsights
– http://www-01.ibm.com/software/data/infosphere/biginsights/

InfoSphere Streams
– http://www-01.ibm.com/software/data/infosphere/streams/

Vladimir Bacvanski - SciSpike
– Email: [email protected]
– Blog: http://www.OnBuildingSoftware.com/
– Twitter: http://twitter.com/OnSoftware
– LinkedIn: http://www.linkedin.com/in/VladimirBacvanski

