How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights and Streams
Tom Deutsch, IBM
Vladimir Bacvanski, Founder, SciSpike, [email protected]
Stephen Brodsky, Technical Executive and Distinguished Engineer, IBM
© 2011 IBM Corporation & SciSpike, August 24, 2011
Who are we?
Dr. Vladimir Bacvanski
– Consultant, trainer, and mentor focusing on making clients successful in adopting new data and software approaches
– Over 20 years of experience
– Founder of SciSpike, a training and consulting firm specializing in advanced software and data technologies
Stephen Brodsky, Ph.D.
– Distinguished Engineer and Technical Executive for IBM Big Data initiatives at the IBM Silicon Valley Laboratory
– Previously led the architecture for the Optim Data Studio product line and pureQuery, and was a member of the architecture team for DB2 pureXML, Rational Application Developer (RAD), and WebSphere
Agenda
The “Big Data” challenge: smarter analytics for a smarter planet
How to do it?
– The big data challenge
– Foundations of Big Data approaches
– MapReduce and Hadoop
– Real-time data and stream processing
– Integration with existing systems
The “Big Data” Challenge
The World is Changing and Becoming More…
INSTRUMENTED
INTERCONNECTED
INTELLIGENT
The resulting explosion of information creates a need for a new kind of intelligence
…to help build a Smarter Planet
Information is Growing at a Phenomenal Rate
– 44x as much data and content over the coming decade
– 80% of the world’s data is unstructured
– 2009: 800,000 petabytes
– 2020: 35 zettabytes (35 billion terabytes)
Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
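The growth figures on this slide agree with each other, which a few lines of arithmetic can confirm. The following is a minimal check, assuming decimal (SI) units where 1 zettabyte is 10^21 bytes; the variable names are illustrative only.

```python
# Unit sizes in bytes, assuming decimal (SI) prefixes.
TERABYTE = 10**12
PETABYTE = 10**15
ZETTABYTE = 10**21

data_2009 = 800_000 * PETABYTE   # 800,000 petabytes (IDC figure for 2009)
data_2020 = 35 * ZETTABYTE       # 35 zettabytes (IDC projection for 2020)

growth_factor = data_2020 / data_2009
terabytes_2020 = data_2020 // TERABYTE

print(f"Growth 2009 -> 2020: {growth_factor:.2f}x")   # 43.75x, which the slide rounds to 44x
print(f"2020 volume: {terabytes_2020:,} terabytes")    # 35,000,000,000, i.e. 35 billion
```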
The BIG Data Challenge
• Manage and benefit from massive and growing amounts of data
• Handle varied data formats (structured, unstructured, semi-structured) and increased data velocity
• Exploit BIG Data in a timely and cost-effective fashion
Collect, Manage, Integrate, Analyze
What clients are saying . . .
Lots of potentially valuable data is dormant or discarded due to size/performance considerations
Large volume of unstructured or semi-structured data is not worth integrating fully (e.g. Tweets, logs, . . .)
Not clear what should be analyzed (exploratory, iterative)
Information distributed across multiple systems and/or Internet
Some information has a short useful lifespan
Volumes can be extremely high
Analysis needed in the context of existing information (not stand alone)
Big Data Presents Big Opportunities
Extract insight from a high volume, variety and velocity of data in a timely and cost-effective manner
Variety: Manage and benefit from diverse data types and data structures
Velocity: Analyze streaming data and large volumes of persistent data
Volume: Scale from terabytes to zettabytes
Streams and Oceans of Information
Information streams
– High speed information flowing in real-time, often transient
– Information from sensors, instruments, etc.
– Information flowing from real-time logs and activity monitors
– Streaming content like audio and video
– High speed transactions like tickers, trades, or traffic systems
Information oceans
– Information stored outside conventional systems. Data may originate from the Web or different internal systems
– Collection of what has streamed
– Information from social media, logs, click streams, emails, etc.
– Unstructured or mixed schema documents like claims, forms, desktop applications, etc.
– Structured data from disparate systems
Applications for Big Data Analytics
Finance, Smarter Healthcare, Multi-channel sales, Homeland security, Telecom, Traffic Control, Manufacturing, Trading Analytics
Many more!
Use Case Example: Energy Company
Business scenario
– Analyze large volumes of public and private weather data for alternative energy business
– Existing high-performance computing hardware, limited staff
Technical challenges
– High data volume: 2+ PB
– Range of query types
  - Avg temp in given location? (Small result)
  - Geo pts where ice may form on wind turbines? (Large result, derived values: icing determined by humidity + temp.)
– Run on system with non-Hadoop apps
Use Case Example: Global Media Firm
Business scenario
– Identify unauthorized content streaming in digital media (piracy)
  - Quantify annual revenue loss
  - Analyze trends
– Monitor social media sites to identify dissemination of pirated content. Time sensitive!
Technical challenges
– High variety of unstructured and semi-structured data
– Initial focus: text analytics over 1 year’s worth of social media data. Look for live streaming URLs, sentiment, event info, etc.
– Complex rules to qualify & classify info
– Future potential for video analysis
IBM Watson
IBM Watson is a breakthrough in analytic innovation, but it is only successful because of the quality of the information from which it is working.
Big Data and Watson
– Watson technology offers great potential for advanced business analytics
– Big Data technology is used to build Watson’s knowledge base
– Watson uses the Apache Hadoop open framework to distribute the workload for loading information into memory
– Approx. 200M pages of text (to compete on Jeopardy!)
[Diagram: POS data, CRM data, and social media feed InfoSphere BigInsights, which yields distilled insight (spending habits, social relationships, buying trends); advanced search and analysis over Watson’s memory]
Customer Engagements
Use patterns
• Customer sentiment analysis (cross-sell, up-sell, campaign management)
• Integrated retail and web customer behavior modeling
• Predictive modeling (credit card fraud)
• System log analytics (reduce operational risk)
Common requirements
• Extract business insight from large volumes of raw data (often outside operational systems)
• Integrate with other existing software
• Ready for enterprise use
[Diagram labels: Text, Blog, Weblog; Click streams; Log & transactions; Multi-channel sales; Consumer Insight; Biological Sequences; Next Gen Fraud Models; New Business Development; Operational system & streams data sources; Statistical Model Building; Text Analytics]
The approach to crunching big data
How to approach Big Data analytics?
InfoSphere BigInsights and InfoSphere Streams
• Analytics for data in-motion and at-rest
• Platform for processing large volumes of diverse data
• Complements and integrates with existing software solutions
Addressing the Key Requirements
Big Data Platform
1. Platform for V3 – Variety, Velocity, Volume
– Variety: manage data & content “As Is”
– Handle any velocity: low-latency streams and large volume batch
– Volume: huge volumes of at-rest or streaming data
2. Analytics for V3
– Analyze sources in their native format: text, data, rich content
– Analyze all of the data, not just a subset
– Dynamic analytics: automatic adjustments and actions
3. Ease of Use for Developers and Users
– Developer UIs, common languages & automatic optimization
– End-user UIs & visualization
4. Enterprise Class
– Failure tolerance, Security and Privacy
– Scale Economically
5. Extensive Integration Capabilities
– Integrate wide variety of sources
– Leverage enterprise integration technologies
Big Data Initiative
– InfoSphere BigInsights: analytic applications for volumes of diverse, persistent data (“Big Data”)
– Warehouse: traditional warehouse applications
– InfoSphere Streams: real-time streaming data
BigInsights Summary
BigInsights = analytical platform for persistent “Big Data”
– Based on open source & IBM technologies
Distinguishing characteristics
– Built-in analytics: enhances business knowledge
– Enterprise software integration: complements and extends existing capabilities
– Production-ready platform: speeds time-to-value; simplifies development and maintenance
Big Data Platform Vision
Bringing Big Data to the Enterprise
[Diagram: Big Data Solutions sit on Big Data User Environments (Developers, End Users, Administrators), Big Data Enterprise Engines (Streaming Analytics, Internet Scale Analytics), and Open Source Foundational Components. Agents and Integration connect the platform to: Data Warehouse, Information Integration, Master Data Mgmt, Database, Content Analytics, Business Analytics, Marketing, Data Growth Management]
InfoSphere BigInsights v 1.1
Platform for volume, variety, velocity (V3)
– Hadoop foundation
Analytics for V3
– Text analytics & tooling
Usability
– Web administrative console
– Integrated install
– Spreadsheet-style analytic tool
Enterprise Class
– Storage, security, cluster management
Integration
– Connectivity to DB2, Netezza
[Diagram: Apache Hadoop, Basic Edition (free download), and Enterprise Edition (licensed) plotted by breadth of capabilities against enterprise class. Labels: web admin console, LDAP authentication; RDBMS, warehouse connectivity; text analytics; spreadsheet-style analytic tool; flexible job scheduler; 24 x 7 Web support]
BigInsights Platform: Key Ideas
Flexible, enterprise-class support for processing large volumes of data
– Based on Google’s MapReduce technology
– Inspired by Apache Hadoop; compatible with its ecosystem and distribution
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
– CPU + disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
• Data formats
• How data is loaded
• How jobs are written
The MapReduce Programming Model
"Map" step:
– Input split into pieces
– Worker nodes process individual pieces in parallel (under global control of the Job Tracker node)
– Each worker node stores its result in its local file system where a reducer is able to access it
"Reduce" step:
– Data is aggregated (“reduced” from the map steps) by worker nodes (under control of the Job Tracker)
– Multiple reduce tasks can parallelize the aggregation
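The two steps above can be sketched in a single process. This is a minimal illustration in plain Python, not Hadoop code: a thread pool stands in for the worker nodes, in-memory dictionaries stand in for the local file systems, and the reduce partitions are processed one after another even though they are independent. The function names (`run_mapreduce`, `map_max_temp`, `reduce_max_temp`) and the sample records are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_mapreduce(splits, map_fn, reduce_fn, num_reducers=2):
    """Run map_fn over each input split, then reduce_fn over each key's values."""
    # Map step: each split is processed independently; the thread pool
    # stands in for worker nodes.
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(lambda split: list(map_fn(split)), splits))

    # Shuffle: route every intermediate (key, value) pair to one reduce task.
    # All values for a given key land in the same partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            partitions[hash(key) % num_reducers][key].append(value)

    # Reduce step: partitions share no keys, so they could be reduced in
    # parallel; this sketch simply loops over them.
    results = {}
    for partition in partitions:
        for key, values in partition.items():
            results[key] = reduce_fn(key, values)
    return results


def map_max_temp(split):
    """Emit (station, temperature) for each 'station,temperature' record."""
    for record in split:
        station, temperature = record.split(",")
        yield station, float(temperature)


def reduce_max_temp(station, temperatures):
    """Keep the highest temperature reported for one station."""
    return max(temperatures)


splits = [
    ["A,10.5", "B,3.0"],
    ["A,12.0", "B,-1.5", "C,7.25"],
]
max_temps = run_mapreduce(splits, map_max_temp, reduce_max_temp)
print(sorted(max_temps.items()))
```

Hashing the key to choose a partition is one common way to guarantee that a single reduce task sees every value for a key; other partitioning schemes would work as long as they keep that property.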
What is Hadoop?
Apache Hadoop = free, open source framework for data-intensive applications
– Inspired by Google technologies (MapReduce, GFS)
– Well-suited to batch-oriented, read-intensive applications
– Originally built to address scalability problems of Nutch, an open source Web search technology
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
– CPU + disks of commodity box = Hadoop “node”
– Boxes can be combined into clusters
– New nodes can be added as needed without changing
• Data formats
• How data is loaded
• How jobs are written
Two Key Aspects of Hadoop
MapReduce framework
– How Hadoop understands and assigns work to the nodes (machines)
Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one big file system
Logical MapReduce Example: Word Count
Content of Input Documents
  Hello World Bye World
  Hello IBM
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
Map 1 emits: < Hello, 1> < World, 1> < Bye, 1> < World, 1>
Map 2 emits: < Hello, 1> < IBM, 1>
Reduce (final output): < Bye, 1> < IBM, 1> < Hello, 2> < World, 2>
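The word count pseudocode can be tried out directly. The following is a runnable Python rendering, offered as a sketch rather than Hadoop API code: the grouping loop stands in for the framework's shuffle between the two steps, and the names `map_words`, `reduce_counts`, `doc1`, and `doc2` are illustrative.

```python
from collections import defaultdict


def map_words(doc_name, contents):
    """Map: emit (word, "1") for every word in the document."""
    for word in contents.split():
        yield word, "1"


def reduce_counts(word, values):
    """Reduce: sum the string counts collected for one word."""
    return str(sum(int(v) for v in values))


documents = {
    "doc1": "Hello World Bye World",
    "doc2": "Hello IBM",
}

# Group intermediate pairs by key, as the framework would between the two steps.
grouped = defaultdict(list)
for name, contents in documents.items():
    for word, count in map_words(name, contents):
        grouped[word].append(count)

word_counts = {word: reduce_counts(word, values) for word, values in sorted(grouped.items())}
print(word_counts)  # {'Bye': '1', 'Hello': '2', 'IBM': '1', 'World': '2'}
```

Counts stay as strings here only to mirror the pseudocode's `ParseInt` and `AsString` calls; an idiomatic Python version would emit integers.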
How To Create MapReduce Jobs
MapReduce development in Java
– Low level, very flexible
– Time consuming development
Hive
– Open source language / Apache sub-project
– Provides a SQL-like interface to Hadoop
Pig
– Data flow language / Apache sub-project
Jaql
– A query language for JSON
– Useful for loosely structured data
Management Tools: Web Console
Graphically manage cluster, jobs, HDFS
Sample administration tasks
– Start/Stop Servers
– Add/Remove Servers
– Server Status Details (Log)
Spreadsheet-like Analysis Tool
Web-based analysis and visualization tool: BigSheets
Spreadsheet-like interface
– Define and manage long running data collection jobs
– Analyze content of the text on the pages that have been retrieved
Text Analytics
• Distill structured info from unstructured data
  – Sentiment analysis
  – Consumer behavior
  – Illegal or suspicious activities
  – . . .
• Pre-built library of text annotators for common business entities
• Rich language and tooling to build custom annotators
• Support for Western languages (English, Dutch/Flemish, French, German, Italian, Portuguese, or Spanish) plus select Asian languages (Japanese, Chinese)
Annotator examples: "Acquisition", "Address", "Alliance", "AnalystEarningsEstimate", "City", "CompanyEarningsAnnouncement", "CompanyEarningsGuidance", "Continent", "Country", "County", "DateTime", "EmailAddress", "JointVenture", "Location", "Merger", "NotesEmailAddress", "Organization", "Person", "PhoneNumber", "StateOrProvince", "URL", "ZipCode"
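To make "distill structured info from unstructured data" concrete, here is a toy annotator in plain Python. It is only a sketch of the idea and does not use the BigInsights annotator language or library: the three regular expressions are hypothetical, deliberately simple patterns, and the annotation names are borrowed from the list above purely as labels.

```python
import re

# Hypothetical, deliberately simple patterns; production annotators handle far more variation.
ANNOTATORS = {
    "EmailAddress": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL": re.compile(r"https?://[^\s,;]+"),
    "PhoneNumber": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def annotate(text):
    """Return (annotation_type, matched_text, start, end) tuples in document order."""
    annotations = []
    for name, pattern in ANNOTATORS.items():
        for match in pattern.finditer(text):
            annotations.append((name, match.group(), match.start(), match.end()))
    return sorted(annotations, key=lambda a: a[2])


sample = "Contact jane@example.com or 555-123-4567; details at http://example.com/info"
for annotation in annotate(sample):
    print(annotation)
```

Each result carries its character offsets, so downstream steps can treat free text as rows of typed, positioned values.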
Eclipse-based Text Analytics Development
So What Does This Result In?
Easy To Scale
Fault Tolerant and Self-Healing
Data Agnostic
Extremely Flexible
Working with streaming data: a new paradigm
Conventional processing: static data
– Queries are run against stored data to produce results
Real-time processing: streaming data
– Data flows through standing queries to produce results
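The contrast between the two paradigms can be shown in a few lines. In this Python sketch, assumed for illustration and unrelated to any Streams API, `query_static` runs once over data at rest, while `standing_query` is a fixed query that each arriving item flows through; `sensor_feed` is a hypothetical finite stand-in for an unbounded source.

```python
def query_static(stored_readings, threshold):
    """Conventional: the data is at rest; a query is run over it on demand."""
    return [r for r in stored_readings if r > threshold]


def standing_query(stream, threshold):
    """Streaming: the query is fixed; each arriving item flows through it."""
    for reading in stream:
        if reading > threshold:
            yield reading


def sensor_feed():
    """Hypothetical source; a real feed would be unbounded."""
    for value in [18.0, 25.5, 31.2, 19.9, 40.1]:
        yield value


static_result = query_static([18.0, 25.5, 31.2, 19.9, 40.1], threshold=30)
alerts = list(standing_query(sensor_feed(), threshold=30))
print(static_result, alerts)
```

Both calls select the same readings here; the difference is that the standing query produces each result as soon as its input arrives and never needs the full data set in memory.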
Real-Time Data with InfoSphere Streams
Streaming analytic applications
– Multiple input streams
– Advanced streaming analytics
Eclipse based IDE
– Define sources, apply operators, define intermediary and final output sinks
– User defined operators in Java or C++
Optimizing compiler automates deployment and connections
– Extremely low latency
– Cluster of up to 125 nodes
[Diagram: Source Adapters, Operator Repository, and Sink Adapters in InfoSphere Streams Studio (IDE for Streams Processing Language); Automated, Optimized Deploy and Management (Scheduler)]
Scalable stream processing
InfoSphere Streams provides
– A programming model and IDE for defining data sources and software analytic modules called operators that are fused into process execution units (PEs)
– Infrastructure to support the composition of scalable stream processing applications from these components
– Deployment and operation of these applications across distributed x86 processing nodes, when scaled processing is required
– Stream connectivity between data sources and PEs of a stream processing application
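The source, operator, and sink composition described here can be approximated with Python generators. This is a single-process sketch under stated assumptions: it is not written in Streams Processing Language, it does not model fusion into PEs or distribution across nodes, and the operator names and tick data are hypothetical.

```python
from collections import deque


def source(ticks):
    """Source adapter: emits (symbol, price) tuples one at a time."""
    for tick in ticks:
        yield tick


def filter_symbol(stream, symbol):
    """Operator: pass through only tuples for one symbol."""
    for sym, price in stream:
        if sym == symbol:
            yield sym, price


def moving_average(stream, window_size):
    """Operator: average of the most recent prices, up to window_size of them."""
    window = deque(maxlen=window_size)
    for sym, price in stream:
        window.append(price)
        yield sym, sum(window) / len(window)


def sink(stream):
    """Sink adapter: collects results; a real sink might write to a socket or file."""
    return list(stream)


ticks = [("IBM", 100.0), ("XYZ", 5.0), ("IBM", 102.0), ("IBM", 104.0), ("XYZ", 6.0)]
pipeline = moving_average(filter_symbol(source(ticks), "IBM"), window_size=2)
averages = sink(pipeline)
print(averages)  # [('IBM', 100.0), ('IBM', 101.0), ('IBM', 103.0)]
```

Because each operator only consumes and produces a stream, operators can be reordered or replaced without touching the others, which is the property that makes composition from a repository of operators practical.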
Merging the Traditional and Big Data Approaches
Traditional Approach: Structured & Repeatable Analysis
– Business users determine what question to ask
– IT structures the data to answer that question
– Examples: monthly sales reports, profitability analysis, customer surveys
Big Data Approach: Iterative & Exploratory Analysis
– IT delivers a platform to enable creative discovery
– Business explores what questions could be asked
– Examples: brand sentiment, product strategy, maximum asset utilization
BigInsights and the data warehouse: filtering and summarizing “Big Data”
[Diagram: BigInsights feeding the data warehouse]
• Broader analytic coverage
• Exploits IT investments while minimizing burden
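The filter-and-summarize pattern can be illustrated at small scale. In the Python sketch below, an in-memory SQLite database stands in for the data warehouse and a `Counter` stands in for the large-scale summarization job; the click-stream records, table name, and column names are all hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical raw click-stream records: (user, page, http_status).
raw_events = [
    ("u1", "/home", 200),
    ("u2", "/home", 200),
    ("u1", "/cart", 500),
    ("u3", "/cart", 200),
    ("u2", "/home", 404),
]

# Filter: keep successful requests only. Summarize: count views per page.
page_views = Counter(page for _, page, status in raw_events if status == 200)

# Load the compact summary into a relational table standing in for the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE page_views (page TEXT PRIMARY KEY, views INTEGER)")
warehouse.executemany("INSERT INTO page_views VALUES (?, ?)", page_views.items())
warehouse.commit()

summary_rows = warehouse.execute(
    "SELECT page, views FROM page_views ORDER BY page"
).fetchall()
print(summary_rows)  # [('/cart', 1), ('/home', 2)]
```

Only the two summary rows reach the relational side; the raw events stay in the big data store, which is how the warehouse gains coverage without absorbing the full volume.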
BigInsights as a “queryable archive” for growing data warehouses
[Diagram: BigInsights alongside the data warehouse]
• Offload “cold” or dated warehouse info but maintain access for further exploration
• Keep warehouse size manageable and focused on well-known business analytic needs
Trends and directions
Enterprise software integration
– Data warehouses, RDBMSs
– ETL platforms
– Business intelligence tools
– Applications
– . . .
Diverse range of analytics
– Text
– Image / video (e.g., content-based user profiling)
– Predictive modeling (e.g., ranking and classification based on machine learning)
– . . .
Sophisticated, scalable infrastructure for processing massive data volumes
– High-performance file system with full POSIX compliance, granular security
– Fully recoverable and restartable workflows
– Parallel, distributed indexing for text (“BigIndex”)
– Read-optimized column store
– Tooling for administrators, programmers, analysts
– . . .
Integrating Relational, Streams, and BigInsights
– Traditional / Relational Data Sources feed the Database & Warehouse: at-rest data, warehouse data analytics, Results
– Non-Traditional / Non-Relational Data Sources feed Streams: in-motion analytics, ultra low latency, Results
– Varied data formats (semi-structured, unstructured...) feed InfoSphere BigInsights: Big Data, batch oriented data analytics, massive scale, Results
Typical Strategy for Analytics
[Diagram: Sources; ETL (Extract, Transform/subset, Load); Data warehouse / marts; SQL Analytics, Mining]
Emerging requirements for analytics
[Diagram: Structured Sources; Extract, Transform/subset, Load; Warehouses / marts; SQL Analytics, Mining. Other Sources; BigInsights Repository; Transform, Analyze; ETL, ELT (MR BI, Mining)]
– Explore large volumes of “raw” or diverse data
– Discover, analyze new insights with BigInsights
Conclusions
– Scale out to crunch petabytes
– We need a mix of technologies
  • Data at rest: MapReduce, Hadoop and beyond
  • Data in motion: stream processing
– To be successful, integrate with conventional technologies
Getting in touch
Stephen Brodsky – IBM
– Email: [email protected]
InfoSphere BigInsights
– http://www-01.ibm.com/software/data/infosphere/biginsights/
InfoSphere Streams
– http://www-01.ibm.com/software/data/infosphere/streams/
Vladimir Bacvanski - SciSpike
– Email: [email protected]
– Blog: http://www.OnBuildingSoftware.com/
– Twitter: http://twitter.com/OnSoftware
– LinkedIn: http://www.linkedin.com/in/VladimirBacvanski