Real-Time Stream Processing CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.
IBM Streams at the Hadoop User Group (France)
Transcript of IBM Streams at the Hadoop User Group
© 2011 IBM Corporation
Imagine the Possibilities of Analyzing All Available Data
Real-time Traffic Flow Optimization
Fraud & risk detection
Accurate and timely threat detection
Predict and act on intent to purchase
Faster, More Comprehensive, Less Expensive
Understand and act on customer sentiment
Low-latency network analysis
Where is this data coming from?
Source: McKinsey & Company, May 2011
Every second of HD video generates more than 2,000 times as many bytes as required to store a single page of text.
Every day, the New York Stock Exchange captures 1 TB of trade information.
More than 30 million networked sensors, growing at a rate of more than 30% per year.
12 TB of tweets are created each day.
5 billion mobile phones were in use in 2010; only 12% were smartphones.
What is your business doing with it?
Why is Big Data important?
The gap between the data AVAILABLE to an organization and the data the organization can PROCESS is a missed opportunity: organizations are able to process less and less of the available data, so enterprises become "more blind" to new opportunities.
What does a Big Data platform do?
Analyze Information in Motion: streaming data analysis; large-volume data bursts and ad-hoc analysis
Analyze a Variety of Information: novel analytics on a broad set of mixed information that could not be analyzed before
Discover & Experiment: ad-hoc analytics, data discovery and experimentation
Analyze Extreme Volumes of Information: cost-efficiently process and analyze petabytes of information
Manage & Plan: manage and analyze high volumes of structured, relational data; enforce data structure, integrity and control to ensure consistency for repeatable queries
Complementary Approaches for Different Use Cases
Traditional Approach: structured, analytical, logical
– Structured, repeatable, linear
– Monthly sales reports, profitability analysis, customer surveys
– Traditional sources feeding the Data Warehouse: internal app data, transaction data, ERP data, mainframe data, OLTP system data
New Approach: creative, holistic thought, intuition
– Unstructured, exploratory, iterative
– Brand sentiment, product strategy, maximum asset utilization
– New sources feeding Hadoop / Streams: web logs, social data, text data (emails), sensor data (images), RFID
Enterprise Integration connects the two.
IBM Big Data Strategy: Move the Analytics Closer to the Data
Analytic applications: BI / Reporting, Exploration / Visualization, Functional Apps, Industry Apps, Predictive Analytics, Content Analytics
IBM Big Data Platform: Visualization & Discovery, Application Development, Systems Management, Accelerators
Engines: Hadoop System, Stream Computing, Data Warehouse
Information Integration & Governance
New analytic applications drive the requirements for a big data platform
• Integrate and manage the full variety, velocity and volume of data
• Apply advanced analytics to information in its native form
• Visualize all available data for ad-hoc analysis
• Development environment for building new analytic applications
• Workload optimization and scheduling
• Security and Governance
Most Client Use Cases Combine Multiple Technologies
Pre-processing
Ingest and analyze unstructured data types and convert to structured data
Combine structured and unstructured analysis
Augment data warehouse with additional external sources, such as social media
Combine high velocity and historical analysis
Analyze and react to data in motion; adjust models with deep historical analysis
Reuse structured data for exploratory analysis
Experimentation and ad-hoc analysis with structured data
IBM is in a lead position to exploit the Big Data opportunity
Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012 (February 2012)
IBM differentiation:
– Embracing open source
– Data in Motion (Streams) and Data at Rest (Hadoop/BigInsights)
– Tight integration with other Information Management products
– Bundled, scalable analytics technology
– Hardened Apache Hadoop for enterprise readiness
IBM’s unique strengths in Big Data
Big Data in Real-Time: ingest, analyze and act on massive volumes of streaming data; faster AND more cost-effective for specific use cases (10x the volume of data on the same hardware)
Fit-for-purpose analytics: analyzes a variety of data types in their native format (text, geospatial, time series, video, audio and more)
Enterprise Class: open source enhanced for reliability, performance and security; high-performance warehouse software and appliances; ease of use with end-user, admin and development UIs
Integration: integration into your IM architecture; pre-integrated analytic applications
Stream Computing: What is it good for? Analyze all your data, all the time, just in time.
[Diagram: traditional data, sensor events and signals feed stream computing, which produces analytic results, alerts, and active responses to threat-prevention systems, logging, and storage/warehousing for more context]
What if you could get IMMEDIATE insight?
What if you could analyze MORE kinds of data?
What if you could do it with exceptional performance?
What is Stream Processing?
Relational databases and warehouses find information stored on disk
Stream computing analyzes data before you store it
Databases find the needle in the haystack
Streams finds the needle as it’s blowing by
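The distinction can be made concrete with a small sketch in plain Python (an analogy, not SPL; the reading values and threshold are invented for illustration): the data is analyzed as it flows past, without ever being stored.

```python
# Illustrative sketch: catch the 'needle' while it is blowing by.
# Nothing is written to disk; each value is inspected exactly once.

def detect_in_flight(readings, threshold=100):
    """Yield alerts as tuples arrive -- analysis before storage."""
    for value in readings:
        if value > threshold:     # the needle, caught in motion
            yield value

alerts = list(detect_in_flight([12, 150, 7, 210, 99]))  # -> [150, 210]
```

A database would instead store all five readings and later run a query over them; the streaming version never materializes the haystack.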
Without Streams, applications must hand-build: intensive scripting; embedded SQL; file / storage management; record management embedded in application code; data buffering and locality; security; dynamic application composition; high availability; application management (checkpointing, performance optimization, monitoring, workload management, error and event handling); applications tied to specific hardware and infrastructure; multithreading and multiprocessing; debugging; migration from development to production; integration of best-of-breed commercial tools; code reusability; source / target interfaces.
With Streams: Streams provides a productive and reusable development environment, and the Streams runtime provides your application infrastructure.
"TerraEchos developers can deliver applications 45% faster due to the agility of Streams Processing Language."
– Alex Philp, TerraEchos
Streams
Achieve scale:
– By partitioning applications into software components
– By distributing across stream-connected hardware hosts
Infrastructure provides services for:
– Scheduling analytics across hardware hosts
– Establishing streaming connectivity
Example operators: Transform, Filter / Sample, Classify, Correlate, Annotate
Where appropriate, elements can be fused together for lower communication latency.
Continuous ingestion, continuous analysis.
How Streams Works
Scalable Stream Processing
Streams programming model: construct a graph
– A mathematical concept (not a line, bar, or pie chart!)
– Also called a network; familiar: a tree structure, for example, is a graph
– Consisting of operators and the streams that connect them: the vertices (nodes) and edges of the mathematical graph
– A directed graph: the edges have a direction (arrows)
Streams runtime model: distributed processes
– A single operator or multiple operators form a Processing Element (PE)
– Compiler and runtime services make it easy to deploy PEs on one machine, or across multiple hosts in a cluster when scaled-up processing is required
– All links and data transport are handled by runtime services, automatically, with manual placement directives where required
[Diagram: operators (OP) as vertices, connected by directed streams as edges]
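The operator-and-stream graph model can be sketched in plain Python (illustrative only; the operator names and values are made up): vertices are operators, edges are the streams connecting them, and data flows along the edge direction.

```python
# Minimal dataflow-graph analogy: three operators wired by streams.

def source():                       # vertex with no input port
    yield from [3, 1, 4, 1, 5]

def filt(stream):                   # operator: drop small tuples
    return (t for t in stream if t > 1)

def transform(stream):              # operator: derive a new value
    return (t * 10 for t in stream)

# Wiring the edges: the output stream of one operator feeds the next.
result = list(transform(filt(source())))   # -> [30, 40, 50]
```

In Streams the wiring is declared by stream name rather than by nesting function calls, and the runtime may place each operator on a different host.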
InfoSphere Streams Objects: Runtime View
Instance
– Runtime instantiation of InfoSphere Streams executing across one or more hosts
– Collection of components and services
Processing Element (PE)
– Fundamental execution unit that is run by the Streams instance
– Can encapsulate a single operator or many "fused" operators
Job
– A deployed Streams application executing in an instance
– Consists of one or more PEs
[Diagram: an instance contains jobs; jobs consist of PEs placed on nodes; streams (Stream 1 through Stream 5) connect operators within and across PEs]
InfoSphere Streams Objects: Development View
[Diagram: a Streams application is a graph of operators connected by streams; each stream carries tuples such as {directory: "/img", filename: "farm"} or {height: 640, width: 480, data: …}]
Operator
– The fundamental building block of the Streams Processing Language
– Operators process data from streams and may produce new streams
Stream
– An infinite sequence of structured tuples
– Can be consumed by operators on a tuple-by-tuple basis or through the definition of a window
Tuple
– A structured list of attributes and their types; each tuple on a stream has the form dictated by its stream type
Stream type
– Specification of the name and data type of each attribute in the tuple
Window
– A finite, sequential group of tuples
– Based on count, time, attribute value, or punctuation marks
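As a hedged illustration of the window concept (plain Python, not SPL; the window size and input values are invented), a count-based sliding window lets an operator aggregate over the last N tuples of an otherwise infinite stream:

```python
from collections import deque

def windowed_averages(stream, size=3):
    """Fire an aggregate over each full count-based sliding window."""
    win = deque(maxlen=size)          # finite, sequential group of tuples
    for tup in stream:
        win.append(tup)               # oldest tuple is evicted at maxlen
        if len(win) == size:          # window full: emit the aggregate
            yield sum(win) / size

list(windowed_averages([1, 2, 3, 4, 5]))   # -> [2.0, 3.0, 4.0]
```

Time-based, attribute-based, and punctuation-based windows follow the same idea with a different eviction/trigger rule.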
What is Streams Processing Language?
Designed for stream computing
– Define a streaming-data flow graph
– Rich set of data types to define tuple attributes
Declarative
– Operator invocations name the input and output streams
– Referring to streams by name is enough to connect the graph
Procedural support
– Full-featured C++/Java-like language
– Custom logic in operator invocations
– Expressions in attribute assignments and parameter definitions
Extensible
– User-defined data types
– Custom functions written in SPL or a native language (C++ or Java)
– Custom operators written in SPL
– User-defined operators written in C++ or Java
Some SPL Terms
An operator represents a class of manipulations
– of tuples from one or more input streams
– to produce tuples on one or more output streams
A stream connects to an operator on a port
– an operator defines input and output ports
An operator invocation
– is a specific use of an operator
– with specific assigned input and output streams
– with locally specified parameters, logic, etc.
Many operators have one input port and one output port; others have
– zero input ports: source adapters, e.g., TCPSource
– zero output ports: sink adapters, e.g., FileSink
– multiple output ports, e.g., Split
– multiple input ports, e.g., Join
A composite operator is a collection of operators
– an encapsulation of a subgraph of primitive operators (non-composite) and composite operators (nested)
– similar to a macro in a procedural language
[Diagram: an Aggregate operator invocation consuming an Employee Info stream on its input port and producing Salary Statistics on its output port; a composite operator encapsulating a subgraph of TCPSource, Split, Join, and FileSink]
Composite Operators
Every graph is encoded as a composite
– A composite is a graph of one or more operators
– A composite may have input and output ports
– Source code construct only: nothing to do with operator fusion (PEs)
Each stream declaration in the composite invokes a primitive operator or another composite operator
An application is a main composite
– No input or output ports
– Data flows in and out, but not on streams within a graph
– Streams may be exported to and imported from other applications running in the same instance
composite Main {
  graph
    stream … { }
    stream … { }
    . . .
}
Anatomy of an Operator Invocation
Operators share a common structure
– <> marks sections to fill in
Reading an operator invocation:
– Declare a stream stream-name
– with attributes from stream-type
– that is produced by MyOperator
– from the input(s) input-stream
– MyOperator behavior is defined by logic, parameters, windowspec, and configuration; output attribute assignments are specified in output

Syntax:

stream<stream-type> stream-name = MyOperator(input-stream; …) {
  logic  logic;
  param  parameters;
  output output;
  window windowspec;
  config configuration;
}

For the example:
– Declare the stream Sale with the attribute item, which is a raw string
– Join the Bid and Ask streams with sliding windows of 30 seconds on Bid and 50 tuples on Ask
– Match when items are equal and the Bid price is greater than or equal to the Ask price
– Output the item value on the Sale stream

Example:

stream<rstring item> Sale = Join(Bid; Ask) {
  window
    Bid: sliding, time(30);
    Ask: sliding, count(50);
  param
    match: Bid.item == Ask.item && Bid.price >= Ask.price;
  output
    Sale: item = Bid.item;
}
Streams V2.0 Data Types
(any)
– (primitive): boolean, enum, timestamp, blob
  • (string): rstring, ustring
  • (numeric)
    – (integral): signed int8, int16, int32, int64; unsigned uint8, uint16, uint32, uint64
    – (floatingpoint): float32, float64, float128
    – (decimal): decimal32, decimal64, decimal128
    – (complex): complex32, complex64, complex128
– (composite)
  • (collection): list, set, map
  • tuple
Stream and Tuple Types
Stream type (often called "schema")
– Definition of the structure of the data flowing through the stream
Tuple type definition
– tuple<sequence of attributes>, e.g., tuple<uint16 id, rstring name>
– Attribute: a type and a name
– Nesting: any attribute may be another tuple type
Stream type is a tuple type
– stream<sequence of attributes>, e.g., stream<uint16 id, rstring name>
Indirect stream type definitions
– Fully defined within the output stream declaration:
  stream<uint32 callerNum, … rstring endTime, list<uint32> mastIDs> Calls = Op(…)…
– Reference a tuple type:
  CallInfo = tuple<uint32 callerNum, … rstring endTime, list<uint32> mastIDs>;
  stream<CallInfo> InternationalCalls = Op(…) {…}
– Reference another stream:
  stream<uint32 callerNum, … rstring endTime, list<uint32> mastIDs> Calls = Op(…)…
  stream<Calls> RoamingCalls = Op(…) {…}
Collection Types
list: an array with bounds-checking, e.g., [0, 17, age-1, 99]
– Random access: any element can be accessed at any time
– Ordered, zero-based indexing: the first element is someList[0]
set: an unordered collection, e.g., {"cats", "yeasts", "plankton"}
– No duplicate element values
map: key-to-value mappings, e.g., {"Mon":0, "Sat":99, "Sun":-1}
– Unordered
Use type constructors to specify element types
– list<type>, set<type>: list<uint16>, set<rstring>
– map<key-type,value-type>: map<rstring[3],int8>
Collections can be nested to any number of levels
– map<int32, list<tuple<ustring name, int64 value>>>
– {1 : [{"Joe",117885}, {"Fred",923416}], 2 : [{"Max",117885}], -1 : []}
Bounded collections optimize performance
– list<int32>[5]: at most 5 (32-bit) integer elements
– Bounds also apply to strings: rstring[3] has at most 3 (8-bit) characters
The Functor Operator
Transforms input tuples into output tuples
– One input port
– One or more output ports
May filter tuples
– Parameter filter: a boolean expression
– If true, emit an output tuple; if false, do not
Arbitrary attribute assignments
– Full-blown expressions, including function calls
– Drop, add, transform attributes
– Omitted attributes are auto-assigned
Custom logic supported
– logic clause; may include state
– Applies to filter and assignments
stream<rstring name, uint32 age, uint64 salary> Person = Op(…){}

stream<rstring name, uint32 age, rstring login,
       tuple<boolean young, boolean rich> info>
  Adult = Functor(Person) {
    param
      filter : age >= 21u;
    output Adult :
      login = lower(name),
      info  = {young = (age < 30u),
               rich  = (salary > 100000ul)};
  }
The FileSink Operator
Writes tuples to a file
Has a single input port
– No output port: data goes to a file, not a Streams stream
Selected parameters
– file (mandatory)
  • Base for relative paths is the data subdirectory
  • Directories must already exist
– flush: flush the output buffer after a given number of tuples
– format: csv (comma-separated values), txt, line, binary, block

() as Sink = FileSink(StreamIn) {
  param
    file   : "/tmp/people.dat";
    format : csv;
    flush  : 20u;
}
Communication Between Streams Applications
Streams jobs exchange data with the outside world
– Source- and Sink-type operators
– These can also be used between Streams jobs (e.g., TCPSource/TCPSink)
Streams jobs can exchange data with each other
– Within one Streams instance
Supports dynamic application composition
– By name or based on properties (tags)
– One job exports a stream; another imports it
Implemented using two new pseudo-operators: Export and Import
[Diagram: Job 1 (source → operator → Export) exports a stream that is imported by Job 2 (Import → operator → sink)]
Application Design – Dynamic Stream Properties
An API is available for toolkit development
Can add/modify/delete
– Exported stream properties
– Imported stream subscription expressions
Dynamic Job Flow Control Bus pattern
– Operators within jobs interpret control-stream tuples
– Rewire the flow of data from job to job
[Diagram: an exported control stream carries flow-control tuples such as [A,B,C], later [A,C,D], that rewire the data stream among Jobs A, B, C, and D]
Streams Instance: stream1
Application Design – Multi-job Design
Application / Job Decomposition: Dynamic Job Submission + Stream Import / Export
– Job imagefeeder: DirectoryScan → ImageSource produces a stream of images plus file metadata, exported with properties name = "Feed", type = "Image", write = "ok"
– Job imagewriter: imports with subscription type == "Image" && write == "ok"; a Functor adds timestamp + filename, and ImageSink / FileSink write the images and metadata out
– Job greyscaler can be submitted later: it imports with subscription name == "Feed", converts the images to greyscale, and re-exports with properties name = "Grey", type = "Image", write = "ok"; imagewriter then receives both the original and greyscale feeds without any change
– Additional jobs (resizer, facial scan, alerter) can be submitted and wired in dynamically in the same way
Two Styles of Export/Import
Publish and subscribe (recommended approach):
– The exporting application publishes a stream with certain properties
– The importing application subscribes to an exported stream whose properties satisfy a specified condition
Point to point:
– The importing application names a specific stream of a specific exporting application
Dynamic publish and subscribe:
– Export properties and Import expressions can be altered during the execution of a job
– Allows dynamic data flows
– Alter the flow of data based on the data (history, trends, etc.)
() as ImageStream = Export(ImagesIn) {
  param
    properties : { streamName = "ImageFeed",
                   dataType   = "IplImage",
                   writeImage = "true" };
}

stream<IplImage image, rstring filename, rstring directory>
  ImagesIn = Import() {
    param
      subscription : dataType == "IplImage" && writeImage == "true";
}
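The property-matching idea behind publish and subscribe can be sketched in plain Python (an analogy only; the real Streams runtime evaluates full subscription expressions, and the property names here just mirror the SPL example above):

```python
# Hedged sketch: a subscription matches an exported stream when every
# required property has the required value.

def matches(properties, subscription):
    """subscription: a dict of required property values."""
    return all(properties.get(k) == v for k, v in subscription.items())

exported = {"streamName": "ImageFeed", "dataType": "IplImage",
            "writeImage": "true"}
wanted   = {"dataType": "IplImage", "writeImage": "true"}

matches(exported, wanted)                    # -> True
matches(exported, {"dataType": "video"})     # -> False
```

Because the match is evaluated at runtime, a newly submitted exporting job is picked up by existing importers automatically.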
Parallelization Patterns – Introduction
Problem statement
– A series of operations is to be performed on a piece of data (a tuple)
– How can the performance of these operations be improved?
Key question
– Reduce latency (for a single piece of data)?
– Or increase throughput (for the entire data flow)?
Three possible design patterns
– Serial path (pipeline)
– Parallel operators (task parallelization)
– Parallel paths (data parallelization)
Parallelization Patterns – Pipeline, Task
Pipeline (serial path): A → B → C → D
– Base pattern: inherent in the graph paradigm
– Results arrive at D in time T(A) + T(B) + T(C)
Parallel operators (task parallelization): {A, B, C} → M → D
– Process the tuple in operators A, B, and C at the same time
– Requires a merger M (e.g., Barrier) before operator D
– Results arrive at D in time Max(T(A), T(B), T(C)) + T(M)
– Use when the tuple latency requirement < T(A) + T(B) + T(C)
– Complexity of the merger depends on the behavior of operators A, B, and C
Parallelization Patterns – Parallel Pipelines
Parallel pipelines (data parallelization): several replicas of A → B → C feeding D
– A migration step from the pipeline pattern
– Can improve throughput, especially for variable-size data / processing time
Design decisions
– Are there latency and/or throughput requirements?
– Do the operators perform filtering, feature extraction, transformation?
– Is there an execution-order requirement?
– Is there a tuple-order requirement?
Recommendation: migrate from a single pipeline to parallel pipelines when possible
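One common answer to the tuple-order question is to tag tuples with a sequence number before the split and re-sort after the merge. A plain-Python sketch (illustrative; the pipeline stage and the input values are made up):

```python
def pipeline(tup):
    """Stand-in for the A -> B -> C pipeline applied to one tuple."""
    seq, value = tup
    return (seq, value * 2)

def parallel_pipelines(tuples, ways=3):
    """Round-robin split across pipeline replicas, then ordered merge."""
    lanes = [[] for _ in range(ways)]
    for i, t in enumerate(tuples):            # split the stream
        lanes[i % ways].append(t)
    done = [pipeline(t) for lane in lanes for t in lane]
    return [v for _, v in sorted(done)]       # restore tuple order

parallel_pipelines([(0, 1), (1, 2), (2, 3), (3, 4)])   # -> [2, 4, 6, 8]
```

In a real deployment the lanes would run on separate hosts; here they are sequential lists, so only the split/merge bookkeeping is demonstrated.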
Application Design – Multi-tier Design
N-tier design
– The number and purpose of tiers is a result of application design
Create well-defined interfaces between the tiers
Supports several overarching concepts
– Incremental development / testing
– Application / job / operator reuse
– Modular programming practices
Each tier in these examples may be made up of one or more jobs (programs)
Examples:
– Transport Adaptation → Ingestion → Reduction → Processing / Analytics → Transformation → Transport Adaptation
– Transport Adaptation → Ingestion → Processing / Analytics → Transport Adaptation
Application Design – High Availability
HA application design pattern
– The Source job exports a stream, enriched with a tuple ID
– Jobs 1 & 2 process in parallel and export final streams
– The Sink job imports both streams, discards duplicates, and alerts on missing tuples
[Diagram: Source feeds identical copies of Job 1 and Job 2 running on separate host pools of x86 hosts; both feed the Sink]
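The sink side of this pattern can be sketched in plain Python (an illustration under the assumption that tuple IDs are consecutive integers assigned by the Source):

```python
def ha_sink(tuples):
    """Merge two replica streams: keep first copy, flag missing IDs."""
    seen, out, missing = set(), [], []
    expected = 0
    for tid, payload in sorted(tuples):       # merge both replicas by ID
        if tid in seen:
            continue                          # duplicate from the twin job
        while expected < tid:                 # gap: alert on missing tuples
            missing.append(expected)
            expected += 1
        seen.add(tid)
        out.append(payload)
        expected = tid + 1
    return out, missing

ha_sink([(0, "a"), (1, "b"), (0, "a"), (3, "d")])  # -> (["a","b","d"], [2])
```

With both jobs healthy, every tuple arrives twice and `missing` stays empty; if one job fails, its twin's copies still reach the sink.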
IBM InfoSphere Streams
Agile development environment
– Eclipse IDE, Streams Live Graph, Streams Debugger
Distributed runtime environment
– Clustered runtime for massive scalability
– RHEL v5.x and v6.x, CentOS v6.x
– x86 & POWER multicore hardware
– Ethernet & InfiniBand
Sophisticated analytics with toolkits & adapters
– Toolkits: Database, Mining, Financial, Standard, Internet, Big Data (HDFS, Data Explorer), Advanced Text, Geospatial, Timeseries, Messaging, …, user-defined
– Over 50 samples
– Front Office 3.0
Toolkits and Operators to Speed and Simplify Development
Standard Toolkit (the default operators shipped with the product)
– Relational operators: Filter, Sort, Functor, Join, Punctor, Aggregate
– Adapter operators: FileSource, FileSink, DirectoryScan, TCPSource, TCPSink, UDPSource, UDPSink, Export, Import, MetricsSink
– Utility operators: Custom, Beacon, Throttle, Delay, Barrier, Pair, Split, DeDuplicate, Union, ThreadedSplit, DynamicFilter, Gate, JavaOp
Internet Toolkit
– InetSource: HTTP, FTP, HTTPS, FTPS, RSS, file
Database Toolkit
– ODBCAppend, ODBCEnrich, ODBCSource, SolidDBEnrich, DB2SplitDB, DB2PartitionedAppend
– Supports: DB2 LUW, IDS, solidDB, Netezza, Oracle, SQL Server, MySQL
Additional toolkits: Financial, Data Mining, Big Data, Text, …, plus user-defined toolkits that extend the language with user-defined operators and functions
User Defined Toolkits
Streams supports toolkits
– Reusable sets of operators and functions
– What can be included in a toolkit?
  • Primitive and composite operators
  • Native and SPL functions
  • Types
  • Tools, documentation, samples, data, etc.
– Versioning is supported
– Define dependencies on other versioned assets (toolkits, Streams)
– Create cross-domain and domain-specific accelerators
InfoSphere Streams Instance – Single Host: a quick peek inside
– Management services & applications on one host: Streams Web Service (SWS), Streams Application Manager (SAM), Streams Resource Manager (SRM), Authorization and Authentication Service (AAS), Scheduler, Name Server, Recovery DB
– File system
– Host Controller managing Processing Element Containers
InfoSphere Streams Instance – Multi-host, Management Services on a separate node: a quick peek inside
– Management host: SWS, SAM, SRM, AAS, Scheduler, Name Server, Recovery DB
– Shared file system
– Multiple application hosts: each runs a Host Controller managing Processing Element Containers
InfoSphere Streams Instance – Multi-host, Management Services on multiple hosts: a quick peek inside
– Management services spread across dedicated hosts: Streams Web Service, Streams App Manager, Streams Resource Manager, AAS, Scheduler, Name Server, Recovery DB
– Shared file system
– Multiple application hosts: each runs a Host Controller managing Processing Element Containers