InfoSphere Streams
© 2013 IBM Corporation, Nov 2013
InfoSphere Streams
Tushar Kale
Big Data Evangelist – Streams
Information Management
Agenda
Overview
Architecture
Customer Use Cases
Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible.
Big Data = Variety, Velocity, and Volume
• Variety: Manage the complexity of multiple relational and non-relational data types and schemas
• Velocity: Streaming data and large-volume data movement
• Volume: Scale from terabytes to zettabytes
InfoSphere Streams
A Platform to Run In-Motion Analytics on BIG Data
• Volume: Handles up to petabytes of data per day
• Variety: Supports traditional as well as non-traditional data (audio, video, etc.)
• Velocity: Delivers insights with microsecond latencies
• Complex Analytics: Supports custom analytics written in C++/Java and warehouse analytic models
• Agility: A single instance can support multiple applications
[Figure labels: traditional / non-traditional data sources; millions of events per second; microsecond latency; powerful analytics; real-time delivery. Example uses: algo trading, telco churn prediction, smart grid, cyber security, government / law enforcement, ICU monitoring, environment monitoring.]
Stream Computing Illustrated
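[Figure: streams shown as sequences of tuples.]
• File-name tuples: (directory: "/img", filename: "farm"), (directory: "/img", filename: "bird"), (directory: "/opt", filename: "java"), (directory: "/img", filename: "cat")
• Image tuples: (height: 640, width: 480, data: [image]), (height: 1280, width: 1024, data: [image]), (height: 640, width: 480, data: [image])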
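One way to read the figure is as a small pipeline: file-name tuples flow in, tuples outside "/img" are dropped, and each remaining tuple becomes an image tuple. The sketch below is plain Python, not SPL. The `ASSUMED_SIZES` table is a hypothetical pairing of file names with the sizes shown on the slide, and the empty `data` field stands in for real image bytes.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical pairing of names and sizes; the slide does not say which
# image has which size.
ASSUMED_SIZES = {"farm": (640, 480), "bird": (1280, 1024), "cat": (640, 480)}


def file_names() -> Iterator[Dict[str, str]]:
    """Source: emits one tuple per file name, in the order shown on the slide."""
    yield {"directory": "/img", "filename": "farm"}
    yield {"directory": "/img", "filename": "bird"}
    yield {"directory": "/opt", "filename": "java"}
    yield {"directory": "/img", "filename": "cat"}


def only_images(stream: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Filter: pass through only tuples whose directory is /img."""
    for tup in stream:
        if tup["directory"] == "/img":
            yield tup


def to_image_tuples(stream: Iterable[Dict[str, str]]) -> Iterator[Dict[str, object]]:
    """Transform: replace each file-name tuple with an image tuple."""
    for tup in stream:
        height, width = ASSUMED_SIZES[tup["filename"]]
        # data is a placeholder; a real operator would load the image bytes.
        yield {"filename": tup["filename"], "height": height, "width": width, "data": b""}


images = list(to_image_tuples(only_images(file_names())))
```

Each stage consumes tuples one at a time, so nothing waits for the whole input to arrive; that is the property the slide is illustrating.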
What can Streams do for you?
Analyze and react to events as they are happening
Take advantage of more sources of data in “true” real time
Build models on your most up-to-the-second information that will help predict what happens next
Streams is middleware and a language for building and running analytic applications that operate on data in motion
• Scale – easily handles anything from a few events per second to multiple millions of events per second
• Reaction time – possible to get actionable results in much less than a second (< 20 microseconds possible)
Enables TRUE situational awareness
BIG Data – Extending the Warehouse
[Figure: two paths from the same data sources to results.]
• Data sources (both paths): traditional / relational and non-traditional / non-relational
• InfoSphere Streams: in-motion analytics at Internet scale, delivering ultra-low-latency results
• InfoSphere BigInsights, database & warehouse: at-rest data analytics at Internet scale, delivering results
• Other labels: data analytics, data operations & model building
Adaptive Analytics: Integrating Analytics on Data in Motion and Data at Rest
1. Data Ingest
2. Bootstrap/Enrich
3. Adaptive Analytics Model
[Figure labels:]
• InfoSphere Streams
• InfoSphere BigInsights, Database & Warehouse
• Data ingest, preparation, online analysis, model validation
• Data integration, data mining, machine learning, statistical modeling
• Visualization of real-time and historical insights
• Legend: data flow, control flow
Agenda
Overview
Architecture
Customer Use Cases
What are key differentiating technical capabilities of Streams?
• Performance and scaling:
  – Operator fusing and threading
  – Efficient use of cores
  – Distributed execution
  – Very fast data exchange
• Language built for streaming applications:
  – Reusable operators
  – Rapid application development
  – Continuous "pipeline" processing
• Flexible and high-performance transport:
  – Very low latency
  – High data rates
• Easy to extend:
  – Built-in adapters
  – Users add capability with familiar C++ and Java
• Use the data that gives you a competitive advantage:
  – Can handle virtually any data type
  – Use data that is too expensive and time sensitive for traditional approaches
• Easy to manage:
  – Automatic placement
  – Extend applications incrementally without downtime
  – Multi-user / multiple applications
• Dynamic analysis:
  – Programmatically change topology at runtime
  – Create new subscriptions
  – Create new port properties
InfoSphere Streams
• Streams Processing Language and IDE: Streams Studio, an Eclipse IDE for SPL
• Runtime Environment: highly scalable stream processing runtime
• Tools and Technology Integration: Streams Console & monitoring, built-in stream relational analytics, adapters, toolkits
Supported on x86 hardware, Red Hat Enterprise Linux Version 5 (5.3 and up)
Front Office 3.0
Terminology
• Application: data flow graph of operator instances connected to each other via stream connections
• Operator: reusable stream analytic
  – Input ports receive data; output ports produce data
  – Source: no input ports; Sink: no output ports
• Operator instance: a specific instantiation of an operator
• Stream: continuous series of tuples, generated by an operator instance's output port
• Stream connection: a stream connected to a specific operator instance input port
• PE (processing element): a runtime process that executes a set of operator instances
• Job: an application instance running on a set of hosts
Example (SPL):
    (stream<Type> A) as O1 = MySrc() {}
    () as O2 = MySink(A) {}
    () as O3 = MySink(A) {}
[Figure: operator instance O1 (MySrc) produces stream A; two stream connections deliver A to operator instances O2 and O3 (both MySink).]
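The terminology can be made concrete with a small model of the same three-operator graph. This is an illustrative Python sketch of the concepts (stream, stream connection, operator instance), not the Streams runtime or its API; the class names mirror the `MySrc` and `MySink` names from the slide, and the `seq` attribute is an assumption made for the example.

```python
from typing import Callable, Dict, List

Tuple_ = Dict[str, int]


class Stream:
    """A continuous series of tuples produced by one operator output port."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._connections: List[Callable[[Tuple_], None]] = []

    def connect(self, input_port: Callable[[Tuple_], None]) -> None:
        """Add a stream connection to one operator instance input port."""
        self._connections.append(input_port)

    def submit(self, tup: Tuple_) -> None:
        """Deliver a tuple to every connected input port."""
        for port in self._connections:
            port(tup)


class MySrc:
    """Source operator: no input ports, one output port."""

    def __init__(self, output: Stream) -> None:
        self.output = output

    def run(self, count: int) -> None:
        for seq in range(count):
            self.output.submit({"seq": seq})


class MySink:
    """Sink operator: one input port, no output ports."""

    def __init__(self) -> None:
        self.received: List[Tuple_] = []

    def process(self, tup: Tuple_) -> None:
        self.received.append(tup)


# Three operator instances and one stream, as in the SPL snippet.
a = Stream("A")
o1 = MySrc(a)
o2 = MySink()
o3 = MySink()
a.connect(o2.process)  # stream connection A -> O2
a.connect(o3.process)  # stream connection A -> O3
o1.run(3)
```

Note that `MySink` is one operator with two instances (O2 and O3), and that stream A has two stream connections, so both sinks see every tuple.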
InfoSphere Streams Programming Model
Application Programming (SPL)
Source Adapters, Sink Adapters, Operator Repository
Platform optimized compilation
Streams Core Analytical Capabilities: Streams Built-in Relational and Utility Operators
• The Join operator is used for correlating two streams
• The Functor operator is used for performing tuple-level manipulations
• The Aggregate operator is used for grouping and summarization of incoming tuples
• The Punctor operator is used for inserting punctuation marks in streams
• The Sort operator is used for imposing an order on incoming tuples in a stream
• The Barrier operator is used as a synchronization point
• The Delay operator is used to "artificially" slow down a stream
• The Split operator is used for dividing incoming tuples into separate streams for parallel processing
• And more!
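To give a feel for what three of these operators do, here is a minimal Python sketch of Functor-like, Split-like, and Aggregate-like behavior. It illustrates the semantics only; the function names, parameters, and the count-based tumbling window are assumptions chosen for the example and are not the SPL operators' actual parameters.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Tuple_ = Dict[str, object]


def functor(stream: Iterable[Tuple_],
            fn: Callable[[Tuple_], Optional[Tuple_]]) -> Iterator[Tuple_]:
    """Tuple-level manipulation: transform each tuple, or drop it if fn returns None."""
    for tup in stream:
        out = fn(tup)
        if out is not None:
            yield out


def split(stream: Iterable[Tuple_], n: int,
          key: Callable[[Tuple_], int]) -> List[List[Tuple_]]:
    """Divide tuples into n outputs so each can be processed in parallel."""
    outputs: List[List[Tuple_]] = [[] for _ in range(n)]
    for tup in stream:
        outputs[key(tup) % n].append(tup)
    return outputs


def aggregate(stream: Iterable[Tuple_], window_size: int,
              group_by: str, value: str) -> Iterator[Dict[object, float]]:
    """Group and sum over a tumbling window of window_size tuples."""
    window: List[Tuple_] = []
    for tup in stream:
        window.append(tup)
        if len(window) == window_size:
            sums: Dict[object, float] = defaultdict(float)
            for item in window:
                sums[item[group_by]] += float(item[value])
            yield dict(sums)
            window = []  # tumbling: the window empties after each result


trades = [
    {"id": 0, "sym": "IBM", "price": 10.0},
    {"id": 1, "sym": "IBM", "price": 20.0},
    {"id": 2, "sym": "XYZ", "price": 5.0},
    {"id": 3, "sym": "XYZ", "price": 7.0},
]
doubled = functor(trades, lambda t: {**t, "price": t["price"] * 2})
totals = list(aggregate(doubled, window_size=4, group_by="sym", value="price"))
partitions = split(trades, 2, key=lambda t: t["id"])
```

Because `functor` and `aggregate` are generators, they can be chained the way operators are chained in a data flow graph.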
Streams Core Adapter Capabilities: Streams Built-in Adapters and DB Toolkit
• The ODBCSource operator is used for reading data from databases, such as DB2, IDS, Oracle
• The ODBCAppend operator is used for writing data to databases, such as DB2, IDS, Oracle
• The ODBCEnrich operator is used for extending streaming data based on lookups performed from database tables
• The solidDBEnrich operator is used for extending streaming data based on lookups performed from in-memory database tables
• The FileSource operator is used for reading data from files in formats such as csv, line, or binary
• The FileSink operator is used for writing data to files in formats such as csv, line, or binary
• The TCPSink / UDPSink operators are used for writing data to sockets in formats such as csv, line, or binary
• The TCPSource / UDPSource operators are used for reading data from sockets in formats such as csv, line, or binary
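The adapter pattern above (read from a source, enrich by lookup, write to a sink) can be sketched in a few lines of Python. This is a conceptual analogue only: the functions below are hypothetical stand-ins for FileSource, an enrich-style lookup, and FileSink, an in-memory dictionary replaces the database table, and the field names are invented for the example.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List, TextIO


def file_source(text_file: TextIO, fieldnames: List[str]) -> Iterator[Dict[str, str]]:
    """Read csv lines and emit one tuple per line."""
    for row in csv.DictReader(text_file, fieldnames=fieldnames):
        yield dict(row)


def enrich(stream: Iterable[Dict[str, str]], table: Dict[str, str],
           key: str, new_field: str) -> Iterator[Dict[str, str]]:
    """Extend each tuple with a value looked up in a table (empty if missing)."""
    for tup in stream:
        extended = dict(tup)
        extended[new_field] = table.get(tup[key], "")
        yield extended


def file_sink(stream: Iterable[Dict[str, str]], text_file: TextIO,
              fieldnames: List[str]) -> int:
    """Write each tuple as one csv line and return the number written."""
    writer = csv.DictWriter(text_file, fieldnames=fieldnames)
    written = 0
    for tup in stream:
        writer.writerow(tup)
        written += 1
    return written


# In-memory files keep the sketch self-contained.
source = io.StringIO("c1,120\nc2,45\n")
customer_tiers = {"c1": "gold", "c2": "silver"}  # stands in for a database table
sink = io.StringIO()

calls = file_source(source, ["customer", "seconds"])
enriched = enrich(calls, customer_tiers, key="customer", new_field="tier")
count = file_sink(enriched, sink, ["customer", "seconds", "tier"])
```

Swapping the in-memory files for sockets or database cursors would change only the two end stages, which is the point of keeping adapters separate from analytics.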
Extensibility
• User-defined operators that extend the language
  – A reusable, generic operator model
  – Written in general-purpose programming languages (C++/Java)
• User-defined functions that extend the language
• Toolkits: sets of domain-specific operators/functions
  – Toolkits available as part of Streams: DB toolkit, data mining toolkit, financial toolkit
  – Streams Exchange on developerWorks: reusable assets and forum
• Developers in two categories
  – Application developers
  – Toolkit developers
Static vs. Dynamic Composition
• Static connections
  – Fully specified at application development time and do not change at run time
• Dynamic connections
  – Partially specified at application development time (name or properties)
  – Established at run time, as new jobs come and go
  – Specifications can also be updated at run time
• Dynamic application composition
  – Incremental deployment of applications
  – Dynamic adaptation of applications
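Dynamic connections by properties can be pictured as a matching problem that is re-evaluated whenever a job arrives, leaves, or changes its specification. The following Python sketch models that idea with a hypothetical `StreamRegistry` class; it is not the Streams runtime's mechanism or API, and the stream names and property keys are invented for illustration.

```python
from typing import Dict, Set, Tuple


class StreamRegistry:
    """Matches exported streams to importing jobs by property equality."""

    def __init__(self) -> None:
        self._exports: Dict[str, Dict[str, str]] = {}        # stream -> properties
        self._subscriptions: Dict[str, Dict[str, str]] = {}  # importer -> required properties

    def export(self, stream: str, properties: Dict[str, str]) -> None:
        """A job starts exporting a stream, or updates its properties."""
        self._exports[stream] = dict(properties)

    def withdraw(self, stream: str) -> None:
        """The exporting job is cancelled."""
        self._exports.pop(stream, None)

    def subscribe(self, importer: str, required: Dict[str, str]) -> None:
        """An importing job states, or updates, the properties it requires."""
        self._subscriptions[importer] = dict(required)

    def connections(self) -> Set[Tuple[str, str]]:
        """Return the (stream, importer) pairs that currently match."""
        return {
            (stream, importer)
            for stream, props in self._exports.items()
            for importer, required in self._subscriptions.items()
            if all(props.get(k) == v for k, v in required.items())
        }


registry = StreamRegistry()
registry.subscribe("alerts_job", {"kind": "trades"})
before_export = registry.connections()      # no exporter yet, so no connection

registry.export("nyse_feed", {"kind": "trades", "venue": "NYSE"})
after_export = registry.connections()       # connection appears at run time

registry.subscribe("alerts_job", {"kind": "quotes"})
after_update = registry.connections()       # updated specification no longer matches
```

A static connection, by contrast, would be a fixed pair written into the application and never recomputed.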
InfoSphere Streams Runtime Architecture
[Figure: InfoSphere Streams Runtime running on a cluster – 125 blades]
• Components running on management hosts
• Components running on application hosts
• Component labels: Streams Application Manager; Streams Resource Manager; Scheduler; Authorization and Authentication Service; Streams Web Service; Name Service; Root Service; Partition Service; Host Controller; Processing Element Container Agent
• Processing elements each run a subset of an SPL application (a collection of operators)
• streamtool: running anywhere inside the cluster
• Tooling labels: Language / Optimizing Compiler; Management APIs; Admin Config / Console; Eclipse IDE and Management Tools
InfoSphere Streams Runtime
• Streams is a distributed, multi-user, multi-instance system
  – Multiple instances can run at the same time
  – Can run jobs from multiple users
  – A security model is provided for authentication and authorization
• Application management
  – New jobs can be added/removed at any time
  – New and existing jobs can connect to each other
  – Scheduler assigns PEs to hosts based on load
• Resource management
  – Hosts & services configuration and state
  – System & application metrics
• Failure semantics
  – Recovery of management services state
  – PEs can be restarted or relocated upon failure
  – All connections will be re-established once a PE restarts
  – All state and in-transit tuples are lost
  – Checkpointing can be used to restore operator state
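The failure semantics are easier to see with a toy example: an operator that keeps state, takes a checkpoint, fails, and is restored. The Python sketch below is an assumption-laden illustration (the `CountingOperator` class and its `checkpoint`/`restore` methods are hypothetical, not the Streams checkpointing API). It shows that restored state reflects the last checkpoint and that tuples processed after it are lost.

```python
import copy
from typing import Dict


class CountingOperator:
    """Stateful operator: counts tuples per key."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def process(self, tup: Dict[str, str]) -> None:
        key = tup["key"]
        self.counts[key] = self.counts.get(key, 0) + 1

    def checkpoint(self) -> Dict[str, int]:
        """Return a snapshot that is independent of later updates."""
        return copy.deepcopy(self.counts)

    def restore(self, snapshot: Dict[str, int]) -> None:
        """Replace the current state with a previously taken snapshot."""
        self.counts = copy.deepcopy(snapshot)


operator = CountingOperator()
for key in ["a", "b", "a"]:
    operator.process({"key": key})
snapshot = operator.checkpoint()      # state saved: a=2, b=1

operator.process({"key": "b"})        # processed after the checkpoint
state_before_failure = dict(operator.counts)

# Simulated PE failure and restart: a fresh instance starts with empty state.
restarted = CountingOperator()
state_without_checkpoint = dict(restarted.counts)
restarted.restore(snapshot)
state_after_restore = dict(restarted.counts)
```

How often to checkpoint is a trade-off: more frequent snapshots lose less work on failure but cost more during normal processing.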
InfoSphere Streams Runtime – cont'd
• Runs on commodity hardware
  – From single node to blade centers to high-performance multi-rack clusters
• Adapts to changes:
  – In workloads
  – In resources
[Figure: five x86 hosts]
Streams Studio Eclipse IDE
Streams Console – Metrics
Agenda
Overview
Architecture
Customer Use Cases
Streaming Analytics in Action
• Stock Market: impact of weather on securities prices; analyze market data at ultra-low latencies
• Fraud Prevention: detecting multi-party fraud; real-time fraud prevention
• e-Science: space weather prediction; detection of transient events; synchrotron atomic research
• Transportation: intelligent traffic management
• Manufacturing: process control for microchip fabrication
• Natural Systems: wildfire management; water management
• Telephony: CDR processing; social analysis; churn prediction; geomapping
• Law Enforcement, Defense & Cyber Security: real-time multimodal surveillance; situational awareness; cyber security detection
• Health & Life Sciences: neonatal ICU monitoring; epidemic early warning system; remote healthcare monitoring
• Other: smart grid; text analysis; who's talking to whom?; ERP for commodities; FPGA acceleration
Smarter Faster Cheaper CDR Processing
InfoSphere Streams xDR Hub
Key Requirements: Price/Performance and Scaling
• 6 billion CDRs per day
• Deduplication over a 7-day window
• Processing latency reduced from 12 hours to a few seconds
• 6 machines (using ½ of processor capacity)
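Six billion CDRs per day averages out to roughly 69,000 records per second (6,000,000,000 / 86,400), so deduplication has to be an incremental lookup rather than a batch job. The sketch below shows one simple way to deduplicate over a 7-day window in Python. It is an illustration only, not the deployed design: the `Deduplicator` class is hypothetical, it assumes timestamps arrive in non-decreasing order, and it keeps every identifier in memory, which a system at this scale would likely replace with a more compact structure.

```python
from collections import OrderedDict

SEVEN_DAYS_SECONDS = 7 * 24 * 3600


class Deduplicator:
    """Reports whether a CDR identifier was already seen within the window."""

    def __init__(self, window_seconds: int = SEVEN_DAYS_SECONDS) -> None:
        self.window_seconds = window_seconds
        # Maps identifier -> first-seen timestamp, kept in arrival order.
        self._seen: "OrderedDict[str, int]" = OrderedDict()

    def is_new(self, cdr_id: str, timestamp: int) -> bool:
        """Return True for the first occurrence inside the window, else False."""
        # Evict identifiers that have aged out of the window.
        while self._seen:
            oldest_id = next(iter(self._seen))
            if timestamp - self._seen[oldest_id] < self.window_seconds:
                break
            self._seen.popitem(last=False)
        if cdr_id in self._seen:
            return False
        self._seen[cdr_id] = timestamp
        return True

    def __len__(self) -> int:
        return len(self._seen)


dedup = Deduplicator()
first = dedup.is_new("cdr-1", timestamp=0)
duplicate = dedup.is_new("cdr-1", timestamp=3600)
after_window = dedup.is_new("cdr-1", timestamp=SEVEN_DAYS_SECONDS)
average_rate_per_second = 6_000_000_000 / 86_400
```

Eviction happens on every call, so memory stays bounded by the number of distinct identifiers seen in any 7-day span.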
Telco: Beyond CDR processing, building on existing insight
[Figure labels:]
• Data sources: mobile network, customer interactions, weather, social media
• Analytics: call quality, network, campaign, location, call data, audio, churn, social, and others
• Business rules
• InfoSphere Streams
• Database & Warehouse
Surveillance and Physical Security: TerraEchos (Business Partner)
Use scenario
• State-of-the-art covert surveillance system based on the Streams platform
• Acoustic signals from buried fiber optic cables are monitored, analyzed and reported in real time for necessary action
• Currently designed to scale up to 1600 streams of raw binary data
Requirement
• Real-time processing of multi-modal signals (acoustics, video, etc.)
• Easy to expand, dynamic
• 3.5M data elements per second
Winner, 2010 IBM CTO Innovation Award
Cyber Security Analytics
[Figure labels:]
• Inputs: live packet capture; DNS / DHCP / Netflow sources; external C&C feeds (live DB queries)
• InfoSphere Streams: botnet behavior modeling, running in five processing element containers
• Results: botnet nodes / malware; IP/MAC identifying suspects
• Downstream: IT I/S firewalls; remediation infrastructure / ticketing
University of Ontario Institute of Technology (UOIT) and Sick Kids Hospital
IBM Data Baby: http://youtu.be/ZiqY7p1v950
Intelligent Transportation
Multimodal data streams
• GPS
• Counts, speeds, travel times
• Public transport
• Pollution measurements
• Weather conditions
Outputs
• Archiving of cleansed data
• Real-time traffic monitoring
• Real-time traffic information
• (Multimodal) travel planner
Only 4 x86 blade servers to process 250,000 GPS probes per second
[Figure labels:]
• Input: GPS data
• Streams: real-time transformation logic; real-time geo mapping; real-time speed & heading estimation; real-time aggregates & statistics
• Storage adapters; data warehouse (offline statistical analysis)
• Web server; Google Earth (interactive visualization)
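Speed and heading estimation from GPS probes can be illustrated with two consecutive fixes from the same vehicle. The Python sketch below uses the standard haversine distance and initial-bearing formulas on a spherical Earth. It is an assumption about how such a step could be computed, not the logic of the deployed application, and the probe field names (`lat`, `lon`, `t`) are invented for the example.

```python
import math
from typing import Dict, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def heading_deg(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Initial bearing in degrees clockwise from north, in the range [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlambda = math.radians(lon2 - lon1)
    x = math.sin(dlambda) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlambda)
    return (math.degrees(math.atan2(x, y)) + 360.0) % 360.0


def speed_and_heading(prev: Dict[str, float], cur: Dict[str, float]) -> Tuple[float, float]:
    """Return (speed in km/h, heading in degrees) between two probes.

    Each probe is a dict with lat and lon in degrees and t in seconds.
    """
    elapsed = cur["t"] - prev["t"]
    if elapsed <= 0:
        raise ValueError("probes must be in increasing time order")
    distance = haversine_m(prev["lat"], prev["lon"], cur["lat"], cur["lon"])
    speed_kmh = distance / elapsed * 3.6
    return speed_kmh, heading_deg(prev["lat"], prev["lon"], cur["lat"], cur["lon"])


# A vehicle moving due north by 0.001 degrees of latitude in 10 seconds.
north_speed, north_heading = speed_and_heading(
    {"lat": 0.0, "lon": 0.0, "t": 0.0},
    {"lat": 0.001, "lon": 0.0, "t": 10.0},
)
```

At 250,000 probes per second across 4 blades, each server handles about 62,500 probes per second on average, so per-probe work of this size has to stay cheap.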
THINK
Questions?