Big Data, IBM Power Event
Transcript of Big Data, IBM Power Event
© 2012 IBM Corporation
Big Data Analytics
IBM Power Event – Hindsgavl Slot, May 2, 2012
Flemming Bagger, Nordic Sales Leader for Big Data Analytics and Data Warehousing
Søren Ravn, Consulting IT Specialist for Big Data
Why is 2012 the YEAR of Big Data?
"Most enterprise data warehouse (EDW) and BI teams currently lack a clear understanding of big data technologies… They are increasingly asking the question, 'How can we use big data to deliver new insights?'" (Gartner, 2012)
Searches for "big data" on Gartner's website increased 981% between March 2011 and October 2011.
"Big Data: The next frontier for innovation, competition and productivity" (McKinsey Global Institute)
"2012 will be the year of 'big data'" (BBC, Nov 30, 2011)
"Big Data will be the CIO issue of 2012" (IDC Predictions 2012 report)
Insights from the IBM Global CEO Study 2010
Vast majority of CEOs experience the new economic environment as distinctly different.
Percentages shown as: to a large/very large extent / to some extent / not at all or to a limited extent.

Full sample:
- More volatile (deeper/faster cycles, more risk): 69% / 18% / 13%
- More uncertain (less predictable): 65% / 21% / 14%
- More complex (multi-faceted, interconnected): 60% / 22% / 18%
- Structurally different (sustained change): 53% / 21% / 26%

Nordics:
- More volatile: 68% / 19% / 13%
- More uncertain: 79% / 13% / 8%
- More complex: 41% / 31% / 28%
- Structurally different: 37% / 29% / 34%

Source: Q7 "To what extent is the new economic environment different?" Volatile n=1514; Uncertain n=1521; Complex n=1522; Structurally different n=1523; Nordics n=83
“Last year’s experience was a wake-up call, like looking into the dark with no light at the end of the tunnel.”
CEO, Industrial Products, The Netherlands
Which underprepared areas are the most critical for CMOs? (Marketing Priority Matrix, IBM Institute for Business Value)

[Scatter plot: x-axis shows the percent of CMOs selecting each factor among their "top five factors" impacting marketing; y-axis shows the percent of CMOs reporting underpreparedness for that factor. Factors plotted:]
1. Data explosion
2. Social media
3. Growth of channel and device choices
4. Shifting consumer demographics
5. Financial constraints
6. Decreasing brand loyalty
7. Growth market opportunities
8. ROI accountability
9. Customer collaboration and influence
10. Privacy considerations
11. Global outsourcing
12. Regulatory considerations
13. Corporate transparency

Source: Q7 "Which of the following market factors will have the most impact on your marketing organization over the next 3 to 5 years?" n1=1733; Q8 "How prepared are you to manage the impact of the top 5 market factors that will have the most impact on your marketing organization over the next 3 to 5 years?" n2=149 to 1141 (n2 = number of respondents who selected the factor as important in Q7)
Information is at the Center of a New Wave of Opportunity… and Organizations Need Deeper Insights

- 44x as much data and content over the coming decade: from 800,000 petabytes in 2009 to 35 zettabytes in 2020
- 80% of the world's data is unstructured
- 1 in 3 business leaders frequently make decisions based on information they don't trust, or don't have
- 1 in 2 business leaders say they don't have access to the information they need to do their jobs
- 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
- 83% of CIOs cited "business intelligence and analytics" as part of their visionary plans to enhance competitiveness
The Big Data Conundrum

- The gap between the data AVAILABLE to an organization and the data an organization can PROCESS keeps widening
- The percentage of available data an enterprise can analyze is decreasing in proportion to the growth of data available to that enterprise
- Quite simply, this means that as enterprises, we are becoming "more naive" about our business over time
What should a Big Data platform do?

- Analyze information in motion: streaming data analysis; large-volume data bursts and ad-hoc analysis
- Analyze a variety of information: novel analytics on a broad set of mixed information that could not be analyzed before
- Discover and experiment: ad-hoc analytics, data discovery and experimentation
- Analyze extreme volumes of information: cost-efficiently process and analyze petabytes of information; manage and analyze high volumes of structured, relational data
- Manage and plan: enforce data structure, integrity and control to ensure consistency for repeatable queries

The 3 Vs: Volume, Velocity, Variety
IBM Big Data Strategy: Move the Analytics Closer to the Data

- Netezza is for high-economic-value data that requires deep, extensive and frequent analysis, with results delivered in minutes
- Streams is for low-latency, real-time analysis of high-velocity data, with results delivered sub-second, after which the data is discarded or stored elsewhere
- BigInsights is for discovery and exploration on data of uncertain economic value, to identify patterns and correlations which can be proceduralised… it can also be used as a lower cost-per-terabyte store of data that is used or accessed in a non-time-critical manner
Why Didn't We Use All of the Big Data Before?
One customer... two data worlds

The traditional world is structured, repeatable and linear: monthly sales reports, profitability analysis, customer surveys. The new world adds content communities, collaboration, virtual worlds, blogs/micro-blogs and social networking.

For a single telco customer, the data spans:
- Customer: segment; social network; demographics (sex, age group, etc.); tenure; rate plan; credit rating, ARPU group
- Device: class; manufacturer; model; OS; media capability; keyboard type
- Transactions: voice, SMS, MMS; data and web sessions; click streams; purchases; downloads; signaling, authentication; probe/DPI
- Network: availability; throughput/speed; latency; location; facilities
- Interface: discovery; navigation; recommendations
- Product/Service: subscriptions; rate plans; media type; category/classification; price
- Usage and quality metrics: starts, stops; success rates; errors; throughput; setup time; connection time; usage; recency, frequency, monetary; latency
Complementary Approaches for Different Use Cases

Traditional approach (structured, analytical, logical):
- Style: structured, repeatable, linear; e.g. monthly sales reports, profitability analysis, customer surveys
- Platform: data warehouse fed by internal app data
- Traditional sources: transaction data, ERP data, mainframe data, OLTP system data

New approach (creative, holistic thought, intuition):
- Style: unstructured, exploratory, iterative; e.g. brand sentiment, product strategy, maximum asset utilization
- Platform: Hadoop, Streams
- New sources: web logs, social data, text data (emails), sensor data (images), RFID

Enterprise integration connects the two.
IBM Big Data Strategy: Move the Analytics Closer to the Data
InfoSphere Streams: Analyze all your data, all the time, just in time

What if you could get IMMEDIATE insight? What if you could analyze MORE kinds of data? What if you could do it with exceptional price/performance?

[Diagram: traditional data, sensor events and signals flow into Streams; analytic results gain more context over time; alerts and actions feed billing/transaction systems, real-time customer offers and threat-prevention systems; results also land in enterprise storage and warehousing.]
Traditional computing vs. stream computing

- Traditional computing: historical fact finding. Find and analyze information stored on disk. Batch paradigm, pull model; query-driven: queries are submitted against static data. Relies on databases and data warehouses. Databases find the needle in the haystack.
- Stream computing: real-time analysis of data in motion; analyzes data before you store it. A stream of structured or unstructured data flows through standing analytic operations in real time. Streams finds the needle as it's blowing by.

In short: traditional computing brings the data to the query; stream computing brings the query to the data.
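The contrast above can be sketched in a few lines of Python. This is a toy illustration, not Streams code: the record fields, threshold, and function names are invented for the example. The batch function scans stored data on demand; the standing query sees each record exactly once as it flows past.

```python
# Hypothetical sketch contrasting the two models described above.
# Traditional computing: the data is at rest; each query scans it.
# Stream computing: the query is at rest (a standing filter); data flows through it.
from typing import Iterable, Iterator

transactions = [
    {"id": 1, "amount": 40.0},
    {"id": 2, "amount": 950.0},
    {"id": 3, "amount": 12.5},
]

def batch_query(stored: list, threshold: float) -> list:
    # Batch / pull model: submit a query against stored, static data.
    return [t for t in stored if t["amount"] > threshold]

def standing_query(stream: Iterable, threshold: float) -> Iterator:
    # Streaming / push model: a standing query applied to each record as it arrives.
    for record in stream:          # records are seen once, "as they blow by"
        if record["amount"] > threshold:
            yield record           # emit an alert immediately; no storage needed

print(batch_query(transactions, 100.0))
print(list(standing_query(iter(transactions), 100.0)))
```

Both calls flag the same transaction, but the streaming version never needs the full dataset in memory or on disk, which is the point of analyzing data before you store it.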
InfoSphere Streams for superior real-time analytic processing:

- Easy to extend: built-in adaptors; users add capability with familiar C++ and Java
- Compile groups of operators into single processes: efficient use of cores; distributed execution; very fast data exchange; can be automatic or tuned; scaled with the push of a button
- Streams Processing Language (SPL), built for streaming applications: reusable operators; rapid application development; continuous "pipeline" processing
- Flexible and high-performance transport: very low latency; high data rates
- Use the data that gives you a competitive advantage: can handle virtually any data type; use data that is too expensive and time-sensitive for traditional approaches
- Easy to manage: automatic placement; extend applications incrementally without downtime; multi-user / multiple applications
- Dynamic analysis: programmatically change topology at runtime; create new subscriptions; create new port properties
IBM Big Data Strategy: Move the Analytics Closer to the Data
InfoSphere BigInsights – A Full Hadoop Stack

[Stack diagram, top to bottom:]
- User interface: development tooling (ODS); analytics visualization; management console
- Analytics: ML analytics; text analytics; Lucene; R
- Application: MapReduce; AdaptiveMR; Pig; Hive; Jaql; ZooKeeper; Avro; Flume; Oozie
- Storage: HDFS; HBase; GPFS-SNC
- Data sources/connectors: DB2 LUW; DB2 z; Informix; Netezza; Oracle; Teradata; Streams; DataStage
- Integrated install
What is Hadoop?

- Apache Hadoop: a free, open-source framework for data-intensive applications
  - Inspired by Google technologies (MapReduce, GFS)
  - Originally built to address scalability problems of Web search and analytics
  - Extensively used by Yahoo!
- Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
  - CPU + disks of a commodity box = a Hadoop node
  - Boxes can be combined into clusters
  - New nodes can be added without changing data formats, how data is loaded, or how jobs are written
- MapReduce framework (processing): how Hadoop understands and assigns work to the nodes (machines)
- Hadoop Distributed File System (HDFS) (storage): where Hadoop stores data; a file system that spans all the nodes in a Hadoop cluster, linking together the file systems on many local nodes to make them into one big file system
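The MapReduce model above can be illustrated with the classic word-count example. This is a single-process Python sketch of the programming model, not Hadoop's actual Java API; the function names and the in-memory shuffle are simplifications of what Hadoop distributes across a cluster.

```python
# Minimal single-process sketch of the MapReduce model Hadoop implements.
from collections import defaultdict
from typing import Iterator, Tuple, List

def map_phase(line: str) -> Iterator:
    # Mapper: emit a (key, value) pair for every word in its input split.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this between map and reduce).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key: str, values: List) -> Tuple:
    # Reducer: combine all values for one key into a final result.
    return (key, sum(values))

lines = ["big data big insights", "big data platform"]
pairs = (pair for line in lines for pair in map_phase(line))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'insights': 1, 'platform': 1}
```

Because the mapper and reducer only see key-value pairs, Hadoop can run thousands of copies of each in parallel over HDFS blocks, which is what makes the commodity-cluster scaling described above possible.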
Machine Learning Analytics

- SystemML: a machine-learning engine invented by IBM Research for native use on BigInsights
- Directly implementing ML algorithms on MapReduce is difficult:
  - Natural mathematical operators need to be re-expressed in terms of key-value pairs and map and reduce functions
  - Data characteristics dictate the optimal MapReduce implementation, so the user bears responsibility for efficient hand-coding
- Sample uses: finding non-obvious data correlations over Internet-scale data collections, e.g. topic modeling, recommender systems, ranking, …
Statistical and Predictive Analysis

- Framework for machine learning (ML) implementations on Big Data
  - Large, sparse data sets, e.g. 5B non-zero values
  - Runs on large BigInsights clusters with 1000s of nodes
- Productivity
  - Build and enhance predictive models directly on Big Data
  - High-level language: Declarative Machine Learning Language (DML); e.g. 1500 lines of Java code boils down to 15 lines of DML code
  - Parallel SPSS data mining algorithms implementable in DML
- Optimization
  - Compile algorithms into optimized parallel code, for different clusters and different data characteristics
  - E.g. 1 hr. execution (hand-coded) down to 10 mins
[Chart: execution time (seconds) vs. number of non-zeros (millions), comparing Java MapReduce, SystemML, and single-node R.]
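To make the productivity point concrete, here is a sketch of what "declarative" means, using NumPy as a stand-in for DML (this is not DML syntax, and the data is invented). The modeler writes the linear-algebra expression directly; in SystemML, the engine rather than the programmer then decides how to parallelize it for the cluster and the data characteristics.

```python
# Illustrative only: NumPy standing in for a declarative ML language.
# One line of linear algebra replaces pages of hand-written MapReduce plumbing.
import numpy as np

# Toy data: 4 observations, 2 features, plus a target vector.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

# Ordinary least squares, written as math rather than as map/reduce jobs:
#   beta = (X^T X)^-1 X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # [1. 2.]
```

Hand-coding the same computation on MapReduce would mean expressing each matrix product as key-value emissions and joins, which is exactly the burden the slide says DML removes.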
Customer Use Case: Log Analytics (storing computer logs and transaction data)

Business problem: the size and volume of log data generated by computer systems constrains the ability of many enterprises to create and maintain effective platforms for compliance and analysis.

IBM solution:
- Ingests all system logging at low latency (under 15 minutes) and re-assembles the transactions into a whole, providing exact details on system-component response times and trending
- Can store more than a year's worth of data
- An analytics layer can be delivered through a web front-end, with standard browser-based tooling for ad-hoc analytics
Log Analysis is a Big Data Problem

- Volume
  - Large number of devices
  - Logs generated at the hardware, firmware, OS and middleware levels
  - Aggregation over time for predictive analysis generates vast amounts of log data
- Velocity
  - Online analysis is needed to explore the data and discover meaningful correlations
- Variety
  - Log formats lack a unified structure; they vary across device types and firmware/middleware versions
  - Log data needs to be supplemented with additional data: performance and availability/fault data, reference data
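The "variety" point above is worth a sketch: before cross-log analysis can happen, records from different devices must be normalized onto one schema. Both log formats, the regular expressions, and the field names below are invented for illustration; real deployments face dozens of such formats.

```python
# Hypothetical sketch of the "variety" problem: two devices emit different
# log formats, so records must be normalized before cross-log analysis.
import re
from datetime import datetime

# Format A (a syslog-style line) and format B (key=value pairs) -- both invented.
LINE_A = "2012-05-02 10:15:01 router7 ERROR link down"
LINE_B = "ts=2012-05-02T10:15:03 dev=switch2 sev=WARN msg=high latency"

PAT_A = re.compile(r"(\S+ \S+) (\S+) (\S+) (.*)")
PAT_B = re.compile(r"ts=(\S+) dev=(\S+) sev=(\S+) msg=(.*)")

def normalize(line: str) -> dict:
    """Map either raw format onto one unified record schema."""
    m = PAT_B.match(line)
    if m:
        ts, dev, sev, msg = m.groups()
        when = datetime.fromisoformat(ts)
    else:
        m = PAT_A.match(line)
        ts, dev, sev, msg = m.groups()
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {"time": when, "device": dev, "severity": sev, "message": msg}

for line in (LINE_A, LINE_B):
    print(normalize(line))
```

Once every record carries the same fields, correlation across device types, and joins against performance and reference data, become ordinary queries.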
Log Analysis: why?

- IBM and its customers have huge amounts of log data: system logs and application logs
- We know there is valuable information hidden in these logs:
  - Anomaly detection: what kind of alerts should I add to my automated monitoring system?
  - Root cause analysis: what sequence of minor problems caused this major problem?
  - Resource planning: where do I need to add redundancy? When should a particular machine be replaced?
  - Marketing: how can I turn more of the visitors to my site into customers?
- But getting that information out requires extraction, transformation and complex statistical analysis at scale
Insight into your logs

- Import: log files, performance data, fault data and reference data (network topology, device dictionaries) from various source systems into HDFS
- Transform: identify record boundaries; extract information from text; identify patterns; find cross-log relationships and integrate across diverse data sources; build indexes
- Analyze: sessionization (identify which records are part of the same sessions); identify subsequences containing fault or performance issues; observe correlations; apply predictive operators
- Visualize: ad-hoc exploration with BigSheets; institutionalize the knowledge gleaned from ad-hoc exploration (network operating center dashboards, reports, alerts)
[Pipeline: import logs → transform → analyze → ad-hoc exploration, plus reports, dashboards and alerts. Roles along the pipeline: data analyst/programmer, analytics developer, end user.]
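Sessionization, the first analyze step named above, can be sketched in a few lines. This is a toy illustration: the 30-minute inactivity timeout, the record layout, and the function name are assumptions, not part of the IBM solution.

```python
# Hedged sketch of sessionization: group log records by user and split them
# into sessions whenever the gap between events exceeds an inactivity timeout.
from itertools import groupby

TIMEOUT = 30 * 60  # assumed: 30 minutes of inactivity ends a session

records = [  # (user, unix_timestamp), assumed already extracted from raw logs
    ("alice", 100), ("alice", 400), ("alice", 5000),
    ("bob", 200), ("bob", 300),
]

def sessionize(recs, timeout=TIMEOUT):
    sessions = []
    for user, group in groupby(sorted(recs), key=lambda r: r[0]):
        current = []
        last_ts = None
        for _, ts in group:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append((user, current))  # gap too large: close session
                current = []
            current.append(ts)
            last_ts = ts
        sessions.append((user, current))
    return sessions

print(sessionize(records))
```

Alice's third event arrives 4600 seconds after her second, so it opens a new session; once records are sessionized, fault subsequences and correlations can be searched per session rather than per raw line.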
Optimizing capital investments based on double-digit-petabyte analysis (Vestas)

Business challenge:
- Wind turbines are expensive and have a service life of ~25 years
- The existing process for turbine placement requires weeks of analysis, uses only a subset of the available data, and does not yield optimal results

Project objectives:
- Leverage a large volume of weather data to optimize placement of turbines (2+ PB today; ~20 PB by 2015)
- Reduce modeling time from weeks to hours
- Analyze data from turbines to optimize ongoing operations

Solution components:
- IBM InfoSphere BigInsights Enterprise Edition: GPFS-based file system capable of running Hadoop and non-Hadoop apps; powerful, extensible query support (Jaql); read-optimized column storage
- IBM xSeries hardware

The benefits:
- Clear fulfillment of Vestas business needs through IBM technology and expertise
- Reliability, security, scalability, and integration needs fulfilled
- Standard enterprise software support
- Single-vendor solution for software, hardware, storage, and support
The Big Data Challenge

- 7/25/2008: Google passes 1 trillion URLs
- $187/second: cost of the last eBay outage ($16,156,800/day)
- 789.4 PB: current size of YouTube
- 2/4/2011: IPv4 address space is exhausted; all 4.3 billion addresses have been allocated
- 340×10^36: size of the IPv6 address space
- 100 million gigabytes: size of Google's index
- 144 million: number of tweets per day
- 1.7 trillion: items at Facebook (90 PB of data)
- 4.3 billion: mobile devices
The Big Data Challenge

- The biggest big data challenge of our future: humans are limited, but sensors are unbounded; the "sensorization" of everything means everything is a sensor
- The problem: we don't know the future value of a data point today, and we cannot connect dots we don't have
Understand current state and desired state …
Current approaches might not be enough in the future
THINK
ibm.com/bigdata