Big Data, IBM Power Event
Transcript of Big Data, IBM Power Event
© 2012 IBM Corporation
Big Data Analytics
IBM Power Event – Hindsgavl Slot, May 2, 2012
Flemming Bagger, Nordic Sales Leader for Big Data Analytics and Data Warehousing
Søren Ravn, Consulting IT Specialist for Big Data
Why is 2012 the YEAR of Big Data?
"Most enterprise data warehouse (EDW) and BI teams currently lack a clear understanding of big data technologies… They are increasingly asking the question, 'How can we use big data to deliver new insights?'" (Gartner, 2012)
Searches for "big data" on Gartner's website increased 981% between March 2011 and October 2011.
"Big Data: The next frontier for innovation, competition and productivity" (McKinsey Global Institute)
"2012 will be the year of 'big data'" (BBC, Nov 30, 2011)
"Big Data will be the CIO issue of 2012" (IDC Predictions 2012 report)
Insights from the IBM Global CEO Study 2010
Vast majority of CEOs experience the new economic environment as distinctly different.
Percentages shown as: to a large/very large extent / to some extent / not at all or to a limited extent.

Full sample:
- More volatile (deeper/faster cycles, more risk): 69% / 18% / 13%
- More uncertain (less predictable): 65% / 21% / 14%
- More complex (multi-faceted, interconnected): 60% / 22% / 18%
- Structurally different (sustained change): 53% / 21% / 26%

Nordics:
- More volatile: 68% / 19% / 13%
- More uncertain: 79% / 13% / 8%
- More complex: 41% / 31% / 28%
- Structurally different: 37% / 29% / 34%

Source: Q7 "To what extent is the new economic environment different?" Volatile n=1514; Uncertain n=1521; Complex n=1522; Structurally different n=1523; Nordics n=83
“Last year’s experience was a wake-up call, like looking into the dark with no light at the end of the tunnel.”
CEO, Industrial Products, The Netherlands
Which underprepared areas are the most critical for CMOs? (Marketing Priority Matrix, IBM Institute for Business Value)

[Scatter plot: x-axis shows the percent of CMOs selecting each factor among their "top five factors" impacting marketing; y-axis shows the percent of CMOs reporting underpreparedness for that factor. Factors plotted:]
1. Data explosion
2. Social media
3. Growth of channel and device choices
4. Shifting consumer demographics
5. Financial constraints
6. Decreasing brand loyalty
7. Growth market opportunities
8. ROI accountability
9. Customer collaboration and influence
10. Privacy considerations
11. Global outsourcing
12. Regulatory considerations
13. Corporate transparency

Source: Q7 "Which of the following market factors will have the most impact on your marketing organization over the next 3 to 5 years?" n1=1733; Q8 "How prepared are you to manage the impact of the top 5 market factors that will have the most impact on your marketing organization over the next 3 to 5 years?" n2=149 to 1141 (n2 = number of respondents who selected the factor as important in Q7)
Information is at the Center of a New Wave of Opportunity… and Organizations Need Deeper Insights

- 44x as much data and content over the coming decade: from 800,000 petabytes in 2009 to 35 zettabytes in 2020
- 80% of the world's data is unstructured
- 1 in 3 business leaders frequently make decisions based on information they don't trust, or don't have
- 1 in 2 business leaders say they don't have access to the information they need to do their jobs
- 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
- 83% of CIOs cited "business intelligence and analytics" as part of their visionary plans to enhance competitiveness
The Big Data Conundrum

- The gap between the data AVAILABLE to an organization and the data an organization can PROCESS keeps widening
- The percentage of available data an enterprise can analyze is decreasing in proportion to the growth of data available to that enterprise
- Quite simply, this means that as enterprises, we are becoming "more naive" about our business over time
What should a Big Data platform do?

- Analyze information in motion: streaming data analysis; large-volume data bursts and ad-hoc analysis
- Analyze a variety of information: novel analytics on a broad set of mixed information that could not be analyzed before
- Discover and experiment: ad-hoc analytics, data discovery and experimentation
- Analyze extreme volumes of information: cost-efficiently process and analyze petabytes of information; manage and analyze high volumes of structured, relational data
- Manage and plan: enforce data structure, integrity and control to ensure consistency for repeatable queries

The 3 Vs: Volume, Velocity, Variety
IBM Big Data Strategy: Move the Analytics Closer to the Data

- Netezza is for high-economic-value data that requires deep, extensive and frequent analysis, with results delivered in minutes
- Streams is for low-latency, real-time analysis of high-velocity data, with results delivered sub-second, after which the data is discarded or stored elsewhere
- BigInsights is for discovery and exploration on data of uncertain economic value, to identify patterns and correlations which can be proceduralised… it can also be used as a lower cost-per-terabyte store of data that is used or accessed in a non-time-critical manner
Why Didn't We Use All of the Big Data Before?
One customer... two data worlds

The traditional world is structured, repeatable and linear: monthly sales reports, profitability analysis, customer surveys. The new world adds content communities, collaboration, virtual worlds, blogs/micro-blogs and social networking.

For a single telco customer, the data spans:
- Customer: segment; social network; demographics (sex, age group, etc.); tenure; rate plan; credit rating, ARPU group
- Device: class; manufacturer; model; OS; media capability; keyboard type
- Transactions: voice, SMS, MMS; data and web sessions; click streams; purchases; downloads; signaling, authentication; probe/DPI
- Network: availability; throughput/speed; latency; location; facilities
- Interface: discovery; navigation; recommendations
- Product/Service: subscriptions; rate plans; media type; category/classification; price
- Usage and quality metrics: starts, stops; success rates; errors; throughput; setup time; connection time; usage; recency, frequency, monetary; latency
Complementary Approaches for Different Use Cases

Traditional approach (structured, analytical, logical):
- Style: structured, repeatable, linear; e.g. monthly sales reports, profitability analysis, customer surveys
- Platform: data warehouse fed by internal app data
- Traditional sources: transaction data, ERP data, mainframe data, OLTP system data

New approach (creative, holistic thought, intuition):
- Style: unstructured, exploratory, iterative; e.g. brand sentiment, product strategy, maximum asset utilization
- Platform: Hadoop, Streams
- New sources: web logs, social data, text data (emails), sensor data (images), RFID

Enterprise integration connects the two.
IBM Big Data Strategy: Move the Analytics Closer to the Data
InfoSphere Streams: Analyze all your data, all the time, just in time

What if you could get IMMEDIATE insight? What if you could analyze MORE kinds of data? What if you could do it with exceptional price/performance?

[Diagram: traditional data, sensor events and signals flow into Streams; analytic results gain more context over time; alerts and actions feed billing/transaction systems, real-time customer offers and threat-prevention systems; results also land in enterprise storage and warehousing.]
Traditional computing vs. stream computing

- Traditional computing: historical fact finding. Find and analyze information stored on disk. Batch paradigm, pull model; query-driven: queries are submitted against static data. Relies on databases and data warehouses. Databases find the needle in the haystack.
- Stream computing: real-time analysis of data in motion; analyzes data before you store it. A stream of structured or unstructured data flows through standing analytic operations in real time. Streams finds the needle as it's blowing by.

In short: traditional computing brings the data to the query; stream computing brings the query to the data.
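The contrast above can be sketched in a few lines of Python. This is a toy illustration, not Streams code: the record fields, threshold, and function names are invented for the example. The batch function scans stored data on demand; the standing query sees each record exactly once as it flows past.

```python
# Hypothetical sketch contrasting the two models described above.
# Traditional computing: the data is at rest; each query scans it.
# Stream computing: the query is at rest (a standing filter); data flows through it.
from typing import Iterable, Iterator

transactions = [
    {"id": 1, "amount": 40.0},
    {"id": 2, "amount": 950.0},
    {"id": 3, "amount": 12.5},
]

def batch_query(stored: list, threshold: float) -> list:
    # Batch / pull model: submit a query against stored, static data.
    return [t for t in stored if t["amount"] > threshold]

def standing_query(stream: Iterable, threshold: float) -> Iterator:
    # Streaming / push model: a standing query applied to each record as it arrives.
    for record in stream:          # records are seen once, "as they blow by"
        if record["amount"] > threshold:
            yield record           # emit an alert immediately; no storage needed

print(batch_query(transactions, 100.0))
print(list(standing_query(iter(transactions), 100.0)))
```

Both calls flag the same transaction, but the streaming version never needs the full dataset in memory or on disk, which is the point of analyzing data before you store it.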
InfoSphere Streams for superior real-time analytic processing:

- Easy to extend: built-in adaptors; users add capability with familiar C++ and Java
- Compile groups of operators into single processes: efficient use of cores; distributed execution; very fast data exchange; can be automatic or tuned; scaled with the push of a button
- Streams Processing Language (SPL), built for streaming applications: reusable operators; rapid application development; continuous "pipeline" processing
- Flexible and high-performance transport: very low latency; high data rates
- Use the data that gives you a competitive advantage: can handle virtually any data type; use data that is too expensive and time-sensitive for traditional approaches
- Easy to manage: automatic placement; extend applications incrementally without downtime; multi-user / multiple applications
- Dynamic analysis: programmatically change topology at runtime; create new subscriptions; create new port properties
IBM Big Data Strategy: Move the Analytics Closer to the Data
InfoSphere BigInsights – A Full Hadoop Stack

[Stack diagram, top to bottom:]
- User interface: development tooling (ODS); analytics visualization; management console
- Analytics: ML analytics; text analytics; Lucene; R
- Application: MapReduce; AdaptiveMR; Pig; Hive; Jaql; ZooKeeper; Avro; Flume; Oozie
- Storage: HDFS; HBase; GPFS-SNC
- Data sources/connectors: DB2 LUW; DB2 z; Informix; Netezza; Oracle; Teradata; Streams; DataStage
- Integrated install
What is Hadoop?

- Apache Hadoop: a free, open-source framework for data-intensive applications
  - Inspired by Google technologies (MapReduce, GFS)
  - Originally built to address scalability problems of Web search and analytics
  - Extensively used by Yahoo!
- Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
  - CPU + disks of a commodity box = a Hadoop node
  - Boxes can be combined into clusters
  - New nodes can be added without changing data formats, how data is loaded, or how jobs are written
- MapReduce framework (processing): how Hadoop understands and assigns work to the nodes (machines)
- Hadoop Distributed File System (HDFS) (storage): where Hadoop stores data; a file system that spans all the nodes in a Hadoop cluster, linking together the file systems on many local nodes to make them into one big file system
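The MapReduce model above can be illustrated with the classic word-count example. This is a single-process Python sketch of the programming model, not Hadoop's actual Java API; the function names and the in-memory shuffle are simplifications of what Hadoop distributes across a cluster.

```python
# Minimal single-process sketch of the MapReduce model Hadoop implements.
from collections import defaultdict
from typing import Iterator, Tuple, List

def map_phase(line: str) -> Iterator:
    # Mapper: emit a (key, value) pair for every word in its input split.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this between map and reduce).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key: str, values: List) -> Tuple:
    # Reducer: combine all values for one key into a final result.
    return (key, sum(values))

lines = ["big data big insights", "big data platform"]
pairs = (pair for line in lines for pair in map_phase(line))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'insights': 1, 'platform': 1}
```

Because the mapper and reducer only see key-value pairs, Hadoop can run thousands of copies of each in parallel over HDFS blocks, which is what makes the commodity-cluster scaling described above possible.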
Machine Learning Analytics

- SystemML: a machine-learning engine invented by IBM Research for native use on BigInsights
- Directly implementing ML algorithms on MapReduce is difficult:
  - Natural mathematical operators need to be re-expressed in terms of key-value pairs and map and reduce functions
  - Data characteristics dictate the optimal MapReduce implementation, so the user bears responsibility for efficient hand-coding
- Sample uses: finding non-obvious data correlations over Internet-scale data collections, e.g. topic modeling, recommender systems, ranking, …
Statistical and Predictive Analysis

- Framework for machine learning (ML) implementations on Big Data
  - Large, sparse data sets, e.g. 5B non-zero values
  - Runs on large BigInsights clusters with 1000s of nodes
- Productivity
  - Build and enhance predictive models directly on Big Data
  - High-level language: Declarative Machine Learning Language (DML); e.g. 1500 lines of Java code boils down to 15 lines of DML code
  - Parallel SPSS data mining algorithms implementable in DML
- Optimization
  - Compile algorithms into optimized parallel code, for different clusters and different data characteristics
  - E.g. 1 hr. execution (hand-coded) down to 10 mins
[Chart: execution time (seconds) vs. number of non-zeros (millions), comparing Java MapReduce, SystemML, and single-node R.]
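To make the productivity point concrete, here is a sketch of what "declarative" means, using NumPy as a stand-in for DML (this is not DML syntax, and the data is invented). The modeler writes the linear-algebra expression directly; in SystemML, the engine rather than the programmer then decides how to parallelize it for the cluster and the data characteristics.

```python
# Illustrative only: NumPy standing in for a declarative ML language.
# One line of linear algebra replaces pages of hand-written MapReduce plumbing.
import numpy as np

# Toy data: 4 observations, 2 features, plus a target vector.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

# Ordinary least squares, written as math rather than as map/reduce jobs:
#   beta = (X^T X)^-1 X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # [1. 2.]
```

Hand-coding the same computation on MapReduce would mean expressing each matrix product as key-value emissions and joins, which is exactly the burden the slide says DML removes.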
Customer Use Case: Log Analytics (storing computer logs and transaction data)

Business problem: the size and volume of log data generated by computer systems constrains the ability of many enterprises to create and maintain effective platforms for compliance and analysis.

IBM solution:
- Ingests all system logging at low latency (under 15 minutes) and re-assembles the transactions into a whole, providing exact details on system-component response times and trending
- Can store more than a year's worth of data
- An analytics layer can be delivered through a web front-end, with standard browser-based tooling for ad-hoc analytics
Log Analysis is a Big Data Problem

- Volume
  - Large number of devices
  - Logs generated at the hardware, firmware, OS and middleware levels
  - Aggregation over time for predictive analysis generates vast amounts of log data
- Velocity
  - Online analysis is needed to explore the data and discover meaningful correlations
- Variety
  - Log formats lack a unified structure; they vary across device types and firmware/middleware versions
  - Log data needs to be supplemented with additional data: performance and availability/fault data, reference data
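The "variety" point above is worth a sketch: before cross-log analysis can happen, records from different devices must be normalized onto one schema. Both log formats, the regular expressions, and the field names below are invented for illustration; real deployments face dozens of such formats.

```python
# Hypothetical sketch of the "variety" problem: two devices emit different
# log formats, so records must be normalized before cross-log analysis.
import re
from datetime import datetime

# Format A (a syslog-style line) and format B (key=value pairs) -- both invented.
LINE_A = "2012-05-02 10:15:01 router7 ERROR link down"
LINE_B = "ts=2012-05-02T10:15:03 dev=switch2 sev=WARN msg=high latency"

PAT_A = re.compile(r"(\S+ \S+) (\S+) (\S+) (.*)")
PAT_B = re.compile(r"ts=(\S+) dev=(\S+) sev=(\S+) msg=(.*)")

def normalize(line: str) -> dict:
    """Map either raw format onto one unified record schema."""
    m = PAT_B.match(line)
    if m:
        ts, dev, sev, msg = m.groups()
        when = datetime.fromisoformat(ts)
    else:
        m = PAT_A.match(line)
        ts, dev, sev, msg = m.groups()
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {"time": when, "device": dev, "severity": sev, "message": msg}

for line in (LINE_A, LINE_B):
    print(normalize(line))
```

Once every record carries the same fields, correlation across device types, and joins against performance and reference data, become ordinary queries.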
Log Analysis: why?

- IBM and its customers have huge amounts of log data: system logs and application logs
- We know there is valuable information hidden in these logs:
  - Anomaly detection: what kind of alerts should I add to my automated monitoring system?
  - Root cause analysis: what sequence of minor problems caused this major problem?
  - Resource planning: where do I need to add redundancy? When should a particular machine be replaced?
  - Marketing: how can I turn more of the visitors to my site into customers?
- But getting that information out requires extraction, transformation and complex statistical analysis at scale
Insight into your logs

- Import: log files, performance data, fault data and reference data (network topology, device dictionaries) from various source systems into HDFS
- Transform: identify record boundaries; extract information from text; identify patterns; find cross-log relationships and integrate across diverse data sources; build indexes
- Analyze: sessionization (identify which records are part of the same sessions); identify subsequences containing fault or performance issues; observe correlations; apply predictive operators
- Visualize: ad-hoc exploration with BigSheets; institutionalize the knowledge gleaned from ad-hoc exploration (network operating center dashboards, reports, alerts)
[Pipeline: import logs → transform → analyze → ad-hoc exploration, plus reports, dashboards and alerts. Roles along the pipeline: data analyst/programmer, analytics developer, end user.]
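Sessionization, the first analyze step named above, can be sketched in a few lines. This is a toy illustration: the 30-minute inactivity timeout, the record layout, and the function name are assumptions, not part of the IBM solution.

```python
# Hedged sketch of sessionization: group log records by user and split them
# into sessions whenever the gap between events exceeds an inactivity timeout.
from itertools import groupby

TIMEOUT = 30 * 60  # assumed: 30 minutes of inactivity ends a session

records = [  # (user, unix_timestamp), assumed already extracted from raw logs
    ("alice", 100), ("alice", 400), ("alice", 5000),
    ("bob", 200), ("bob", 300),
]

def sessionize(recs, timeout=TIMEOUT):
    sessions = []
    for user, group in groupby(sorted(recs), key=lambda r: r[0]):
        current = []
        last_ts = None
        for _, ts in group:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append((user, current))  # gap too large: close session
                current = []
            current.append(ts)
            last_ts = ts
        sessions.append((user, current))
    return sessions

print(sessionize(records))
```

Alice's third event arrives 4600 seconds after her second, so it opens a new session; once records are sessionized, fault subsequences and correlations can be searched per session rather than per raw line.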
Optimizing capital investments based on double-digit-petabyte analysis (Vestas)

Business challenge:
- Wind turbines are expensive and have a service life of ~25 years
- The existing process for turbine placement requires weeks of analysis, uses only a subset of the available data, and does not yield optimal results

Project objectives:
- Leverage a large volume of weather data to optimize placement of turbines (2+ PB today; ~20 PB by 2015)
- Reduce modeling time from weeks to hours
- Analyze data from turbines to optimize ongoing operations

Solution components:
- IBM InfoSphere BigInsights Enterprise Edition: GPFS-based file system capable of running Hadoop and non-Hadoop apps; powerful, extensible query support (Jaql); read-optimized column storage
- IBM xSeries hardware

The benefits:
- Clear fulfillment of Vestas business needs through IBM technology and expertise
- Reliability, security, scalability, and integration needs fulfilled
- Standard enterprise software support
- Single-vendor solution for software, hardware, storage, and support
The Big Data Challenge

- 7/25/2008: Google passes 1 trillion URLs
- $187/second: cost of the last eBay outage ($16,156,800/day)
- 789.4 PB: current size of YouTube
- 2/4/2011: IPv4 address space is exhausted; all 4.3 billion addresses have been allocated
- 340×10^36: size of the IPv6 address space
- 100 million gigabytes: size of Google's index
- 144 million: number of tweets per day
- 1.7 trillion: items at Facebook (90 PB of data)
- 4.3 billion: mobile devices
The Big Data Challenge

- The biggest big data challenge of our future: humans are limited, but sensors are unbounded; the "sensorization" of everything means everything is a sensor
- The problem: we don't know the future value of a data point today, and we cannot connect dots we don't have
Understand current state and desired state …
Current approaches might not be enough in the future
THINK
ibm.com/bigdata