SPL Enhancements InfoSphere Streams Version 3.0
© 2012 IBM Corporation
SPL Enhancements
InfoSphere Streams Version 3.0
Howard Nasgaard
SPL Compiler, SPL Runtime & Standard Toolkit Development
Important Disclaimer
THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY.
WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE.
IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION.
NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE EFFECT OF:
• CREATING ANY WARRANTY OR REPRESENTATION FROM IBM (OR ITS AFFILIATES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS); OR
• ALTERING THE TERMS AND CONDITIONS OF THE APPLICABLE LICENSE AGREEMENT GOVERNING THE USE OF IBM SOFTWARE.
The information on the new product is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information on the new product is for informational purposes only and may not be incorporated into any contract. The information on the new product is not a commitment, promise, or legal obligation to deliver any material, code or functionality. The development, release, and timing of any features or functionality described for our products remains at our sole discretion.
Agenda
Walk-through of new SPL and Standard Toolkit changes and additions
Problem
How do I ingest and work on XML data in a streams application?
XML Support
'xml' added as a first-class datatype
  – stream<xml x> ....
  – Checked for form (well-formedness)
Can also specify a schema
  – stream<xml<"mySchema"> x>
  – Checked for form and validity
Schema can be file or web-based URI
  – Recommend local file
    • Resolved against the data directory root if relative
XML Support - Conversion
rstring <-> xml
  – xml x = (xml) "<doc>...</doc>"r;
    • Checked for form
  – xml<"schema"> xs = (xml<"schema">)"<doc>...</doc>"r;
    • Checked for form and validity
  – rstring s = (rstring)x;
Form and validity checking done only when needed
  – xml x = (xml)xs;
    • Not checked
  – xs = (xml<"schema">)x;
    • Checked for validity
Validation failure at runtime will throw an exception
Set of built-in functions available to convert to xml
  – Can be used with return code in logic
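The "return code instead of exception" style of conversion can be modelled outside SPL. The following is a rough Python sketch, not the Streams implementation: convert_to_xml is a hypothetical name, and the standard-library parser used here only checks well-formedness ("form"), not schema validity.

```python
import xml.etree.ElementTree as ET


def convert_to_xml(text):
    """Return (ok, document); ok is False when text is not well-formed XML.

    Hypothetical helper for illustration: it checks form only and does no
    schema validation.
    """
    try:
        return True, ET.fromstring(text)
    except ET.ParseError:
        return False, None


# A well-formed document converts; a mismatched tag is reported, not raised.
ok, doc = convert_to_xml("<doc><a>1</a></doc>")
bad, nothing = convert_to_xml("<doc><a>1</doc>")
```

Logic that prefers a status flag can branch on the first element of the result, while a plain cast corresponds to letting the parse error propagate.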
XML Support - Conversion
xml <-> tuple
  – type T = tuple<int32 i, rstring s>;
  – mutable T t = {....};
  – mutable xml x = (xml)t;
  – Converted to xml in "Serialized Tuple Model" format
    • Schema provided with Streams (serializedTupleModel.xsd)
  – mutable T t = (T)x;
    • Validated against tuple model schema
No conversion from ustring to xml directly
  – Must go through rstring
XML Literals
New literal type added for XML
  – "<a b=\"hi\">x</a>"x;
String syntax extended to ease use in XML literals
  – ' (single quote) can now be used to delimit strings
    • '<a b="hi">x</a>'x;
  – Embedded newlines are now legal in string literals
    • '<a>
         <b>x</b>
       </a>'x;
  – Can be used in all string literals
XML Support - Encoding
XML literals are assumed to be in UTF-8 encoding
  – Source files are in UTF-8 so any explicit encoding must be too
  – '<?xml ... encoding="xxxxx"?> ...'x;
    • Compile-time error if xxxxx is not UTF-8
rstring expressions that contain XML data...
  – Are assumed to be encoded in UTF-8 if no encoding specified
  – Must contain valid characters if encoding is specified
    • Error raised at cast time if not
XML Support – Source/Sink Operators
Source operators can read attributes of xml type
  – stream<xml x> In = FileSource() {...}
  – xml is checked for form and validated (if there is a schema)
Sink operators can write attributes of xml type
  – () as Out = FileSink(stream<xml x> In) {...}
    • csv, txt as quoted xml literals
    • bin in serialized form
    • No validation (assumed valid)
Source can read "traditional" xml file in line or block format
  – Requires XMLParse operator to be useful
  – No validation in Source operators
Sink operators cannot write XML using line or block format
  – XML must first be converted to rstring or blob.
XML Support - XMLParse Operator
Converts xml data to tuples
Input attribute can be rstring, ustring, blob, xml
Input data can be in multiple lines/blocks with rstring, ustring or blob
  – "line" format can be used to read "traditional" XML files
  – If ustring, any encoding directive is ignored
  – Operator validates xml
Input in rstring, ustring or blob can contain multiple, sequential, XML documents
A window marker punctuation is generated at the end of each XML document
XMLParse will not produce attributes of xml type
XML Support – XMLParse Operator
Generates one or more output streams
Each stream
  – Corresponds to a subtree within the XML (i.e. an element)
  – Requires a "trigger" expression
Trigger expression
  – rstring containing an XPath expression that defines a node set
  – Tuples are generated for each node in the node set
  – Must start at the root of the document
    • param trigger : "/doc/a";
Two mechanisms to specify the mapping of XML to tuple content
  – Implicit: content derived from the output stream schema
  – Explicit: content specified in the output clauses
XML Support – Implicitly deriving tuple content
The tuple schema representation of an XML element is:
  – type Element = tuple<map<rstring, rstring> _attrs, rstring _text [, NestedTuples]*>
  – _attrs contains all the attribute name/value pairs
  – _text contains the text content between the open/close tag
  – Additional tuples or lists of tuples represent nested elements
Defining an output stream schema that follows this notion allows the XMLParse operator to generate a SAX parser that will extract the desired information
An example:
XML Support – XMLParse example
Things to note:
  – The nested tuples for sub-elements 'd' and 'e' do not have a map for attributes. Not needed.
  – The trigger expression "/a" always starts with '/'
<a b="1" c="vc1"> va1 <d>vd1</d> <e>ve1a</e> <e>ve1b</e></a>
type aElem = tuple<map<rstring,rstring> _attrs, rstring _text, tuple<rstring _text> d, list<tuple<rstring _text>> e>;
stream<aElem> O = XMLParse(…) { param trigger : "/a"; }
XML Support – Another XMLParse example
Things to note:
  – Map has been replaced by scalar b and list<scalar>[1] c
    • param flatten : attributes
    • Put XML attributes in SPL [list]scalar attributes of the same name
  – SPL attribute b has type int32
    • Everything in XML is considered rstring by default
    • Specifying a non-rstring type causes a conversion
<a b="1" c="vc1"> va1 <d>vd1</d> <e>ve1a</e> <e>ve1b</e></a>
type aElem = tuple<int32 b, list<rstring>[1] c, rstring _text, tuple<rstring _text> d, list<tuple<rstring _text>> e>;
stream<aElem> O = XMLParse(…) { param trigger : "/a"; flatten : attributes; }
XML Support – Still another XMLParse example
Things to note:
  – The nested tuple<rstring _text> d is reduced to rstring d
    • rstring SPL attributes not named _text are assumed to refer to the text content of a nested element by that name
  – The list<tuple<rstring _text>> e is reduced to list<rstring> e
  – The map is back? Why?
<a b="1" c="vc1"> va1 <d>vd1</d> <e>ve1a</e> <e>ve1b</e></a>
type aElem = tuple<map<rstring,rstring> _attrs, rstring _text, rstring d, list<rstring> e>;
stream<aElem> O = XMLParse(…) { param trigger : "/a"; flatten : elements; }
XML Support – Implicitly deriving tuple content
Reduction of maps/tuples to scalars is referred to as flattening
Reduction of maps/tuples to scalars can only be done for XML attributes OR elements, not both
  – rstring b could mean element b or attribute b.
  – You must tell the XMLParse operator which one you want
    • param flatten : attributes/elements/none (default none)
XML attribute or element content not represented in the tuple schema will be ignored
  – You do not need to fully represent the XML structure in the schema
My XML just happens to have an element named _text.
  – params textName and AttributeName can change the default values of _text and _attrs.
XML Support – Explicitly specifying tuple content
As with other operators, expressions in the output clause assign values to the output tuple attributes
Expressions use custom output functions to specify the mapping of XML data to SPL attributes
  – rstring XPath(rstring xpathExpn)
  – <tuple T> XPath (rstring xpathExpn, T tupleLiteral)
  – list<rstring> XPathList(rstring xPathExpn)
  – <any T> list<T> XPathList(rstring xpathExpn, T elements)
  – map<rstring,rstring> XPathMap (rstring xpathExpn)
Each of these functions requires an XPath expression relative to the trigger expression, or the containing expression
Examples, please!
XML Support – Example of using explicit specification
Things to note:
  – Trigger expression says output a tuple for each "e" subtree
  – XPath expression "text()" specifies what to get from the 'e' subtree, the 'e' element's text content in this case
  – Everything else in the XML is ignored
  – Two tuples would be output for the example XML
  – No naming convention for tuple attributes
<a b="1" c="vc1"> va1 <d>vd1</d> <e>ve1a</e> <e>ve1b</e></a>
stream<rstring s> O = XMLParse(...) {
  param trigger : "/a/e";
  output O : s = XPath("text()");
}
XML Support – Example of using explicit specification
Things to note:
  – Trigger expression says output a tuple for each "a" subtree
  – XPath expression "@b" specifies that we want the content of XML attribute 'b'
  – Must explicitly cast the output of custom output functions (COFs) if not rstring
  – One tuple would be output for the example XML
<a b="1" c="vc1"> va1 <d>vd1</d> <e>ve1a</e> <e>ve1b</e></a>
stream<int32 i> O = XMLParse(...) {
  param trigger : "/a";
  output O : i = (int32)XPath("@b");
}
XML Support – Example of using explicit specification
Things to note:
  – Trigger expression says output a tuple for each "a" subtree
  – XPath expression "e/text()" specifies that we want the content of XML element 'e' and the XPathList function returns a list of all 'e' contents
  – One tuple would be output for the example XML with two values in list l
<a b="1" c="vc1"> va1 <d>vd1</d> <e>ve1a</e> <e>ve1b</e></a>
stream<list<rstring> l> O = XMLParse(...) {
  param trigger : "/a";
  output O : l = XPathList("e/text()");
}
XML Support – Example of using explicit specification
Things to note:
  – XPath expression "@*" specifies we want all attributes
  – XPathMap function returns the map containing all the attributes
  – One tuple will be output with a map containing two key/value pairs
<a b="1" c="vc1"> va1 <d>vd1</d> <e>ve1a</e> <e>ve1b</e></a>
stream<map<rstring, rstring> attrs> O = XMLParse(...) {
  param trigger : "/a";
  output O : attrs = XPathMap("@*");
}
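The four explicit examples can be checked against the sample document with a few lines of Python. This is a sketch of the expected results only: Python's ElementTree does not support text() or @attribute selection in its XPath subset, so equivalent element lookups stand in for the XPath, XPathList and XPathMap custom output functions.

```python
import xml.etree.ElementTree as ET

DOC = '<a b="1" c="vc1"> va1 <d>vd1</d> <e>ve1a</e> <e>ve1b</e></a>'
root = ET.fromstring(DOC)

# trigger "/a/e", s = XPath("text()"): one tuple per 'e' element
per_e_tuples = [{"s": e.text} for e in root.findall("e")]

# trigger "/a", i = (int32)XPath("@b"): the attribute text is cast explicitly
i = int(root.get("b"))

# trigger "/a", l = XPathList("e/text()"): one tuple holding a list
l = [e.text for e in root.findall("e")]

# trigger "/a", attrs = XPathMap("@*"): all attributes as a map
attrs = dict(root.attrib)
```

The tuple counts match the notes on the slides: two tuples for the "/a/e" trigger and a single tuple for each "/a" trigger.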
XML Support – XMLParse Operator
Some other behavior
  – If an SPL attribute assignment does NOT contain XPath, XPathList or XPathMap, then the expression will be resolved from the input stream
  – If an SPL attribute assignment is omitted, the XMLParse operator will try to generate an implicit assignment using a default XPath or XPathList expression
  – The parsing parameter controls its error processing:
    • strict: logs an error and terminates the operator
    • permissive: logs an error and continues
XML Support – spl-schema-from-xml utility
Given complex XML, crafting either tuple schemas for implicit generation, or output clauses for explicit generation, could be difficult
Enter the spl-schema-from-xml utility
  – Given a representative XML document it will:
    • generate a set of typedefs for the tuple schema to support the full XML
    • optionally generate output clauses for each trigger specified
    • optionally generate a schema for the XML
    • optionally generate a composite operator wrapping the XMLParse operator
    • optionally generate a main composite with a source, sink and a call to the parser composite
  – You can tailor the output from the utility to suit your needs
  – You can tell it to flatten elements or attributes
Sample spl-schema-from-xml output
<a b="1" c="vc1"> va1 <d>vd1</d> <e>ve1a</e> <e>ve1b</e></a>
spl-schema-from-xml -o a.spl -t '/a' --composite Parse --mainComposite Main data/test1.xml
Sample spl-schema-from-xml output
use spl.XML::*;

composite Parse(input input0; output output0) {
  type
    static a_type = tuple<map<rstring, rstring> _attrs, rstring _text, a_d_type d, list<a_e_type> e>;
    static a_d_type = tuple<rstring _text>;
    static a_e_type = tuple<rstring _text>;
  graph
    stream<a_type> output0 = XMLParse(input0) {
      param
        trigger : "/a";
        parsing : permissive; // log and ignore errors
      output output0 :
        _attrs = XPathMap("@*"),
        _text = XPath("text()"),
        d = XPath("d", {_text = XPath("text()")}),
        e = XPathList("e", {_text = XPath("text()")}); // *trigger: /a
    }
}

composite Main() {
  graph
    stream<rstring s> Input = FileSource() {
      param
        file : "test1.xml";
        format : line;
    }
    stream<Parse.a_type> X0 = Parse(Input) { }
    () as O0 = FileSink(X0) {
      param file : "out0.dat";
    }
}
XML Support – Standard Library Support Functions
A number of new functions have been added to the library
Safe conversions from string to xml
  – <xml X, string T> void convertToXML(mutable X xmlResult, T input, mutable int32 error);
  – <xml X, string T> public boolean convertToXML (mutable X xmlResult, T input);
An XQuery engine is added as an alternative to the XMLParse operator
  – <xml X> public list<rstring> xquery (X input, rstring xqueryExpression);
  – <xml X> public list<rstring> xquery (X input, rstring xqueryExpression, mutable int32 error);
  – And numerous more flavors
  – All return a list of rstrings with the query results
XML Support – XQuery example
type T = tuple<int32 id, tuple<rstring b, list<int32> x, float64 d> a, rstring c>;
stream<T> OutTuples = Custom (Data) {
  logic onTuple Data: {
    // extract string 'c'
    mutable list<rstring> results = xquery(Data.xmlVar, "/something/bar/c/text()");
    mutable rstring s = results[0];
    // extract string 'b' attribute in 'a'
    mutable tuple<rstring b, list<int32> x, float64 d> a = {};
    results = xquery(Data.xmlVar, "/something/bar/a/@bdata");
    a.b = results[0];
    // extract list<int32> 'x' attribute in 'a'
    results = xquery(Data.xmlVar, "/something/bar/foo/text()");
    for (rstring r in results)
      appendM (a.x, (int32) r);
    // extract float64 'd' attribute in 'a'
    results = xquery(Data.xmlVar, "/something/bar/a/d/text()");
    a.d = (float64) results[0];
    // submit the final result
    submit ({id = Data.id, a = a, c = s}, OutTuples);
  }
}
XML Support – Database Toolkit
All database toolkit operators have been extended to support XML
  – XML converted to/from char data for DB that doesn't support XML
  – DB2 PureXML capabilities are accessible if using DB2 V9.7 or later
Problem
How do I use a for statement to iterate over, and modify, a list?
  – In SPL the for statement looks like
      list<rstring> l = [...];
      for (rstring entry in l) {
        l = "????";
      }
  – This doesn't work because entry is an rstring value, not an index into the list
  – You need a list of indexes with the same number of entries as the list you are iterating over
      for (int32 i in indexes) {
        l[i] = "????";
      }
More Efficient ‘for’ Loops
Introduce a set of 'range' functions
  – // return [0, ..., limit-1]
    list<int32> range(int32 limit);
  – // return [start, ..., limit-1]
    list<int32> range(int32 start, int32 limit);
  – // return [start, start+step, ... number < limit]
    list<int32> range(int32 start, int32 limit, int32 step);
  – // return [0, ..., size(l)-1]
    <list T> list<int32> range(T l);
Use:
  – mutable list<rstring> myList = ["hi", "there"];
    for (int32 i in range(myList)) {
      myList[i] = upper(myList[i]);
    }
Compiled into a plain C++ loop when used inside a for statement
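The four overloads can be modelled in Python to check what each one returns. This is only an illustration of the documented return values: spl_range is a hypothetical name, and Python's built-in range happens to share the half-open [start, limit) convention.

```python
def spl_range(first, limit=None, step=1):
    """Model of the SPL range overloads as index lists.

    spl_range(limit), spl_range(start, limit), spl_range(start, limit, step)
    or spl_range(some_list), which yields the valid indexes of that list.
    """
    if isinstance(first, list):
        return list(range(len(first)))
    if limit is None:
        return list(range(first))
    return list(range(first, limit, step))


# Index-based iteration allows the list to be modified in place.
my_list = ["hi", "there"]
for i in spl_range(my_list):
    my_list[i] = my_list[i].upper()
```

After the loop my_list holds the upper-cased strings, which is the effect the SPL example achieves with upper(myList[i]).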
More Efficient ‘for’ loops
SPL source:
logic ....: {
  mutable list<rstring> myList = ["hi", "there"];
  for (int32 i in range(myList)) {
    myList[i] = upper(myList[i]);
    println(myList[i]);
  }
}
Generated C++:
SPL::list<SPL::rstring > myList = ...;
SPL::int32 temp = myList.size();
for (SPL::int32 i = 0; i < temp; i++) {
  myList.at(i) = ...::upper(myList.at(i));
  ...::println(myList.at(i));
}
Other SPL Changes
SPADE to SPL translator removed
  – Must install Streams 2.x if you need translation
submit([tuple|punct], portNo) functions added
  – Enable dynamic port selection
  – Will raise an exception at runtime if port invalid
Return statement allowed in logic clause to enable simplification
  – Does not affect the normal processing of tuples in the generated primitive operator.
New Perl regex compatible functions added
  – regexMatchPerl
  – regexReplacePerl
  – Both rstring and ustring variants
Problem
I want to write a primitive operator with custom output functions that can be nested within an output assignment
  – You saw this in the examples of the XMLParse operator
Operator Model Changes
Allow Custom Output Functions to be nested within an expression
  – Recall from XMLParse:
    • output O : attr = XPathList("...", XPathList(...));
  – outputPortOpenSetType
    • <allowNestedCustomOutputFunctions> true/false
Allow Custom Output Functions to be used in a param expression
  – parameterType
    • <customOutputFunction> - name of a COF that can appear
Operator Model Changes
To support nested COFs and COFs in a param expression
  – Compiler will optionally generate an expression tree into the Operator Instance Model (OIM)
  – APIs provided in the OIM interface to walk the expression tree and query characteristics
  – APIs documented in html in doc/spl/operator/code-generation-api/perl
  – Also support for generation of C++ code from the expression tree
  – Use documented in the Toolkit Developer's Reference
Problem
I have a Streams application that Imports from, or Exports to, another streams application
I would like to dynamically update the Export properties or the Import subscription
Export Property/Import Subscription Update from SPL code
Allows SPL programs to query/update properties/subscriptions without having to use primitive operators.
  – getOutputPortExportProperties
  – setOutputPortExportProperties
  – getInputPortImportSubscription
  – setInputPortImportSubscription
Port must come from Import or go to Export operator that uses subscription/property
Triggers a disconnect/reconnect
Use:
  setInputPortImportSubscription('stock == "IBM"', 0u);
Problem
I provide a toolkit that is used in various countries. I would like to be able to load strings in a language appropriate for the locale in effect where my toolkit is used.
Within those strings I would like numeric values, for example, to be formatted in a locale sensitive way.
Localization Support
Utilizes ICU under the covers
Resource bundle creation and locale-sensitive loading mechanism
Translatable strings contained in XLIFF files
  – XML Localization Interchange File Format
Specify .xlf files in info.xml file
Resource bundles built during toolkit indexing
C++ header and Perl module generated
Standard library functions added to load resource
  – loadAndFormatResource
Localized strings available at compile-time and run-time
Localization sample program
Documented in the Toolkit Development Reference
XLIFF File
<xliff version="1.1" xmlns="urn:oasis:names:tc:xliff:document:1.1"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.1 _path_/xliff-core-1.1.xsd">
  <file datatype="plaintext" original="root.txt" source-language="en" target-language="en" xml:space="preserve">
    <body>
      <group>
        <trans-unit id="1" extraData="MESSAGE_1" resname="MSG0001">
          <source>English: A message emitted at compile time.</source>
        </trans-unit>
        <trans-unit id="2" extraData="MESSAGE_2" resname="MSG0002">
          <source>English: A message emitted at run time. A formatted value ''{0,number,currency}''.</source>
        </trans-unit>
      </group>
    </body>
  </file>
</xliff>
Usage:
<%
# Add a require for the Perl module that contains the subroutine that loads and formats the string.
require MyResource;
# Emit the message using a SPL helper method
SPL::CodeGen::println(MyResource::MESSAGE_1());
%>
// Add an include for the header that contains the macro which loads and formats the message
#include "MyResource.h"
// Tuple processing for non-mutating ports
void MY_OPERATOR::process(Tuple const & tuple, uint32_t port)
{
  const IPort0Type & t = static_cast<const IPort0Type &>(tuple);
  // Get the loaded and formatted message and initialize the output tuple
  SPL::rstring r = MESSAGE_2(t.get_i());
  // Add a message to the runtime log
  SPLAPPLOG(L_INFO, r, "test");
}
Standard Toolkit Changes
Problem
I find the Beacon operator somewhat limited in how I can use it as a stream generator. Is there another way of generating tuples?
Custom Operator as a Source
A Custom operator with no inputs can act as a source
(stream<int32 a> A; stream<int32 a> B) = Custom() {
  logic
    state : mutable int32 i = 9;
    onProcess : {
      for (int32 x in range(10)) {
        submit ({a = x + i}, A);
        submit ({a = 6 + x + i}, B);
        i++;
      }
    }
}
onProcess clause added
  – Only allowed in a Custom operator
  – Only allowed if there are no input ports
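To see what the onProcess example emits, the loop can be traced in Python. This is a plain model of the arithmetic, with two lists standing in for output ports A and B; on_process is a hypothetical name.

```python
def on_process():
    """Trace the SPL example: state i starts at 9 and grows by one per pass."""
    i = 9
    port_a, port_b = [], []
    for x in range(10):
        port_a.append({"a": x + i})      # submit ({a = x + i}, A)
        port_b.append({"a": 6 + x + i})  # submit ({a = 6 + x + i}, B)
        i += 1
    return port_a, port_b


stream_a, stream_b = on_process()
```

Because both x and i advance on every pass, each port receives ten tuples whose values rise in steps of two.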
Problem
My Streams application imports data from another application, but I am only interested in part of the data. I have to add a Functor to filter out a lot of the imported data. That wastes a lot of transport time.
Filter Support for Import
A new filter param added to the Import operator
  – boolean or rstring expression
type streamT = int64 value, rstring str, int32 x;
stream<streamT> I = Import () {
  param
    subscription : "a >= 55";
    filter : value < 0 && str == "foo";
}
Filtering will be performed at Export operator and only matching tuples will be seen at the Import operator
Filter Support for Import
Filter expressions are the same as subscription expressions:
  – int64, rstring, float64, lists of same
Export operator has a new parameter:
  – allowFilter: true/false;
If allowFilter is false, an Import with a filter parameter will not connect to the Export operator
New metric added to PE output port (per connection):
  – TuplesFilteredOut
  – Shows number of tuples not sent over connection
New functions to query/update filter expression
  – getInputPortImportFilterExpression
  – setInputPortImportFilterExpression
  – Update is asynchronous
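The saving comes from evaluating the filter on the exporting side, so rejected tuples never cross the connection. A minimal Python model of that behaviour follows; export_side_filter is a hypothetical name, and the counter plays the role of the TuplesFilteredOut metric.

```python
def export_side_filter(tuples, predicate):
    """Return (tuples actually sent, count of tuples filtered out)."""
    sent, filtered_out = [], 0
    for t in tuples:
        if predicate(t):
            sent.append(t)
        else:
            filtered_out += 1
    return sent, filtered_out


# Same condition as the Import example: value < 0 && str == "foo"
exported = [
    {"value": -5, "str": "foo", "x": 1},
    {"value": 3, "str": "foo", "x": 2},
    {"value": -1, "str": "bar", "x": 3},
]
sent, filtered_out = export_side_filter(
    exported, lambda t: t["value"] < 0 and t["str"] == "foo"
)
```

Only the first tuple is transported; the other two are counted but never sent.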
Problem
I have noticed, when using windows in my primitive operator, that they cache every tuple within the library.
It occurs to me that, at least with tumbling windows, it shouldn’t be necessary to cache the tuples.
Window Library Extended to Optimize Tumbling Windows
In Streams Version 2.0 the window library cached all tuples
  – Can use a lot of memory
In many cases it is not necessary to cache all the tuples
  – e.g. compute the average of attribute price in a tumbling window
    • Requires only a count and a running total
In Version 3.0 the window library is extended with Summarizers
  – Aggregate operator updated to use this optimization
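The idea behind a summarizer can be shown with the average example: only a count and a running total are kept, and nothing is cached per tuple. The sketch below is a Python model under that assumption; AverageSummarizer and tumbling_average are hypothetical names, not the window library's API.

```python
class AverageSummarizer:
    """Keeps a count and a running total instead of the window's tuples."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, price):
        self.count += 1
        self.total += price

    def result(self):
        return self.total / self.count if self.count else 0.0


def tumbling_average(prices, window_size):
    """Emit one average per full count-based tumbling window."""
    results = []
    summary = AverageSummarizer()
    for price in prices:
        summary.add(price)
        if summary.count == window_size:
            results.append(summary.result())
            summary = AverageSummarizer()  # window tumbles: start afresh
    return results


averages = tumbling_average([1.0, 2.0, 3.0, 10.0, 20.0, 30.0, 5.0], 3)
```

Memory use stays constant regardless of window size; the trailing partial window produces no output in this model.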
Problem
The Aggregate operator is very useful, however, if I want my own type of aggregation, I must re-write the whole operator.
User Extension of the Aggregate Operator
Version 3.0 adds a new aggregation function called "Custom"
  – Allows you to implement arbitrary aggregation functions without duplicating the Aggregate operator
Requires mutable state tuple
Takes three functions as arguments
  – init function
    • Takes state, returns boolean
  – process function
    • Takes attribute and state, returns boolean
  – result function
    • Takes state, returns attribute type
This is an operator change and not a language change
  – Nested calls must return a value
Example
type AvgContext = float64 sum, int32 count;
...
stream<float64 avg > B = Aggregate (Src) {
logic state : mutable AvgContext avgContext;
window Src : tumbling, count(3);
output B : avg = Custom(myInit(avgContext), process(price, avgContext), result(avgContext));
}
Example
type AvgContext = float64 sum, int32 count;
boolean myInit (mutable AvgContext c) {
c.sum = 0.0; c.count = 0; return false;
}
boolean process (float64 value, mutable AvgContext c) {
c.count++; c.sum += value; return true;
}
float64 result (AvgContext c) {
if (c.count == 0) return 0.0;
return c.sum/(float64)c.count;
}
More on Custom
Can be used more than once
  – Each unique aggregation needs its own functions and state
    • Can re-use functions with unique state
Can be used with partitionBy/groupBy
  – Does not use summarizers for tumbling windows (stores tuples)
    • Only one state, no partition/group information exchanged
    • User maintains state
Can be interleaved with pre-defined aggregation functions
  – output O : ticker=Any(ticker), myAvg=Custom(..process(price, ...), ...);
Problem
I would really like to augment the output tuple of a ???Source operator with attribute data that doesn’t come from the input file, socket, whatever.
Enhancements to FileSource/UDPSource/TCPSource
Source operators can now have an output clause
New Custom Output Functions added:
  – FileSource: TupleNumber(), FileName()
  – TCPSource: TupleNumber(), RemoteIP(), RemotePort(), LocalPort(), ServerPort()
  – UDPSource: TupleNumber(), RemoteIP(), RemotePort(), ServerPort()
Use:
  – output A: fName = FileName(), value = someFcn();
All other output stream attributes filled from data source as usual
Problem
I have attributes in my stream that I don’t want my ???Sink to write out.
What I’d really like to do is use some of those attributes to, say, help generate a unique file name that could be used by the FileSink.
Enhancements to FileSink
suppress : inputAttributes;
  – named attributes will not be written to file
Add closeMode : dynamic
  – file parameter expression allowed to reference input attributes
  – file expression is evaluated for each tuple, and if it changes, the existing file will be closed and a new file opened based on the value
New filename formatting specifiers
  – {localtime:strftimeString}
  – {gmtime:strftimeString}
Enhancements to FileSink
param file : "{localtime:%m.%d}_" + CompanyName + ".txt";
  – This would generate a filename with %m being the current month number, %d the current day in the month, an underscore (_), the value of the CompanyName input attribute, and then ".txt"
  – Each time the month, day or company name changes a new filename would be used
  – Useful with the append param
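The per-tuple evaluation of the file expression can be modelled in Python with the same strftime pattern. This is a sketch only: sink_file_name and route are hypothetical names, and the timestamps are passed in explicitly so the behaviour is reproducible rather than read from the clock.

```python
import time


def sink_file_name(now, company_name):
    """Model of "{localtime:%m.%d}_" + CompanyName + ".txt"."""
    return time.strftime("%m.%d", now) + "_" + company_name + ".txt"


def route(tuples):
    """Yield (file name, whether a new file is opened) for each tuple."""
    current = None
    for now, company in tuples:
        name = sink_file_name(now, company)
        yield name, name != current
        current = name


day1 = time.strptime("2012-03-05", "%Y-%m-%d")
day2 = time.strptime("2012-03-06", "%Y-%m-%d")
routing = list(route([(day1, "IBM"), (day1, "IBM"), (day2, "IBM"), (day2, "ACME")]))
```

A file switch happens whenever the computed name differs from the previous one, which is why the append parameter is useful: a name that recurs later can continue an earlier file instead of overwriting it.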
Enhancements to TCPSink
closeMode and suppress parameters added
  – never: current behaviour
  – dynamic: connect based on host/port in tuple
type FinalResult = tuple<rstring host, uint32 port, uint32 value>;
graph
  () as TCPSink1 = TCPSink(A) {
    param
      closeMode : dynamic;
      suppress : host, port;
      address : host;
      port : port;
      role : client;
      retryFailedSends : true;
  }
Need role : client, closeMode : dynamic, retryFailedSends : true
address and port can use runtime expressions
TCPSink will close old connection and re-open new one if host or port changes
Enhancements to FileSink
Improved write error detection/handling
New parameter: writeFailureAction
  – Optional; values are ignore (the default), log, terminate.
  – ignore: do nothing, and all future writes will fail.
  – log: the error is logged and the error condition is cleared. Further writes may again cause failures, if the underlying cause is not corrected. Even if the underlying cause is corrected, there will be gaps in the file due to the failed writes.
  – terminate: an error is logged, and the operator will terminate.
Closing a file (closeMode) will reset the error. Future writes should succeed if the underlying problem has been corrected.
Problem
I would like to have greater control over what my FileSink operator does when it encounters a write error
© 2012 IBM Corporation64
Enhancements to FileSink
New metric: nTupleWriteErrors
  – Number of tuple writes (not tuples) that had an error on the file stream after the write completed.
  – Due to buffering, write failure may not be detected immediately. Use param flush : 1u; to ensure quicker detection, but with a (possibly large) performance penalty
  – Once failure is detected, all future writes will fail unless the error condition is cleared.
Problem
I have some data that I would like to ingest, but it is in a compression format that is not supported by the existing operators.
I don’t want to have to write a whole Source operator in order to support it.
New Utility Operators
A new set of operators has been added allowing functionality of Standard Source/Sink operators to be extended
  – Parse: accept blob (e.g. csv), generate tuple
  – Format: accept tuple, generate blob (e.g. csv)
  – Decompress: decompress data in blob (gzip), generate blob
  – Compress: compress data in blob, generate blob (gzip)
  – CharacterTransform: convert from one encoding in blob to another encoding in blob
Allows supporting more data formats without having to write a complete Source/Sink operator
New Utility Operators
stream<…> Tuples = XXXSource () {
  param
    compression : gzip;
    encoding : "ISO_8859-9";
    format : csv;
}
This is equivalent to:
stream<blob data> Input = XXXSource() { … param format : block; blockSize : 4096u; }
stream<Input> Uncompressed = Decompress(Input) { param compression : gzip; }
stream<Input> Decoded = CharacterTransform (Uncompressed) {
  output Decoded : data = Convert ("ISO_8859-9", "UTF8", data);
}
stream<…> Tuples = Parse(Decoded) { … param format : csv; … }
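The same three-stage decomposition (decompress, transcode, parse) can be sketched with Python's standard library to show how the blob moves through the stages. The function names are hypothetical stand-ins for the Decompress, CharacterTransform and Parse operators, and the sample rows are made up for the demonstration.

```python
import csv
import gzip
import io


def decompress(blob):
    """Decompress stage: gzip blob in, raw blob out."""
    return gzip.decompress(blob)


def character_transform(blob, source_encoding, target_encoding):
    """CharacterTransform stage: re-encode the bytes of a blob."""
    return blob.decode(source_encoding).encode(target_encoding)


def parse_csv(blob):
    """Parse stage: UTF-8 csv blob in, one list of fields per line out."""
    return list(csv.reader(io.StringIO(blob.decode("utf-8"))))


# Sample input: gzip-compressed csv text encoded as ISO-8859-9.
text = "1,\u0130stanbul\n2,Ankara\n"
raw = gzip.compress(text.encode("iso-8859-9"))

rows = parse_csv(character_transform(decompress(raw), "iso-8859-9", "utf-8"))
```

Each stage takes a blob and hands one on, so a stage for an unsupported compression format could be swapped in without touching the source or the parser.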
New Utility Operators
stream<blob data> Formatted = Format(someStream) {
  param format : csv;
}
stream<Formatted> Encoded = CharacterTransform(Formatted) {
  output Encoded : data = Convert ("UTF8", "ISO_8859-9", data);
}
stream<Formatted> Compressed = Compress(Encoded) {
  param compression : gzip;
}
() as Nul = XXXSink(Compressed) {
  param format : block; blockSize : 4096u;
}
Problem
I have an operator which has a table I need to initialize at startup. I’d like to ensure that no tuples will flow to my operator prior to its initialization
Switch Operator
The Switch operator acts like an open/closed switch.
When open, tuples will wait until the switch closes
A control port will set the switch open or closed
Useful to allow an operator to finish initialization before processing a stream
After initialized, send a tuple to Switch to allow tuples to flow
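The electrical analogy (open means no flow) is easy to get backwards, so here is a small Python model of the behaviour. It is a sketch under one loud assumption: waiting tuples are held in a queue here, whereas the real operator may simply block upstream; the class and method names are hypothetical.

```python
class Switch:
    """Open switch holds tuples back; closing it lets them flow in order."""

    def __init__(self, forward, initially_open=True):
        self.forward = forward
        self.is_open = initially_open
        self.waiting = []  # assumption: queued here instead of blocking

    def on_data(self, tup):
        if self.is_open:
            self.waiting.append(tup)
        else:
            self.forward(tup)

    def on_control(self, close):
        """Control port: True closes the switch, False opens it."""
        self.is_open = not close
        if close:
            for tup in self.waiting:
                self.forward(tup)
            self.waiting.clear()


delivered = []
switch = Switch(delivered.append)
switch.on_data(1)
switch.on_data(2)
before_close = list(delivered)   # nothing flows while the switch is open
switch.on_control(True)          # initialization finished: close the switch
switch.on_data(3)
```

The downstream operator sees tuples 1, 2 and 3 in order, but only after the control tuple has arrived.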
Problem
I would like some way to ensure, when making a TCP connection using a TCPSource/TCPSink, that the format of the data being sent is compatible with what the Source is expecting.
Stream Schema Checking for TCPSource/TCPSink
param confirmWireFormat : true/false;
  – Default is false
If true, TCP server role will send information about data to be sent
  – Tuple schema, format, compression, encoding, hasDelay, contains punctuation
TCP client role will determine if the information is compatible
  – Returns go/nogo status and optional message
If not compatible, connection closed
Problem
When you release a new version of the product, you never tell me about the little things you added
Miscellaneous Changes
Import: subscription and filter support mod (%)
  – identifier % int64Lit compareOp int64Lit
  – param subscription : anId % 10ll >= 8ll;
Add param writePunctuations: boolean to *Sink with format bin to write punctuations to output
param readPunctuations with format bin for *Source will read punctuations from stream and submit downstream
Optional second parameter to DeDuplicate that generates the tuples that are duplicates