#SeizeTheData
Hewlett Packard Enterprise confidential information. This is a rolling (up to three-year) roadmap and is subject to change without notice.
This Roadmap contains Hewlett Packard Enterprise Confidential Information. If you have a valid Confidential Disclosure Agreement with Hewlett Packard Enterprise, disclosure of the Roadmap is subject to that CDA. If not, it is subject to the following terms: for a period of three years after the date of disclosure, you may use the Roadmap solely for the purpose of evaluating purchase decisions from HPE and use a reasonable standard of care to prevent disclosures. You will not disclose the contents of the Roadmap to any third party unless it becomes publicly known, is rightfully received by you from a third party without duty of confidentiality, or is disclosed with Hewlett Packard Enterprise's prior written approval.
Please give me your feedback
– Use the mobile app to complete a session survey:
  1. Access "My schedule"
  2. Click on the session detail page
  3. Scroll down to "Rate & review"
– If the session is not on your schedule, find it via the Discover app's "Session Schedule" menu, click on this session, and scroll down to "Rate & Review"
– If you have not downloaded our event app, go to your phone's app store and search for "Discover 2016 Las Vegas"
– Thank you for providing your feedback, which helps us enhance content for future events.
Session ID: Bxxxxx
Speakers: Mark Fay, Natalia Stavisky
Effectively managing & monitoring streaming data loads using Kafka and the Vertica Management Console
Mark Fay
Natalia Stavisky
Vertica & Kafka Integration
In a world of just-in-time inventory and on-demand services, the ability to quickly load and analyze tremendous amounts of data is more important than ever before. Last year, HPE Vertica addressed this growing need by integrating with Apache Kafka to offer scalable, real-time loading from Kafka sources. Today Vertica continues to leverage these strengths by adding flexibility, monitoring, and the ability to relay data back to Kafka. With the upcoming Frontloader release, Vertica has created a data ecosystem capable of supporting even the most demanding needs.
Agenda
1. Kafka Background
2. Vertica & Kafka Integration
3. Filtering & Parsing Enhancements
4. Closing the Loop: Vertica to Kafka Production
5. Scheduler CLI & Schema Enhancements
6. Monitoring Data Load with MC
Kafka Background
Apache Kafka Overview
A scalable, distributed message bus
– Apache project originating from LinkedIn
– Rich ecosystem of libraries and tools
– Highly optimized for low-latency streaming

Solves the data integration problem
– Producers decoupled from consumers
– O(N) instead of O(N²) data pipelines
– Throughput scalable independently of source & destination
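The O(N) vs. O(N²) claim above is quick arithmetic, sketched here in Python. The function names are illustrative, not part of Kafka:

```python
def pipelines_point_to_point(producers, consumers):
    # Without a bus, every producer maintains a pipeline to every consumer: O(N^2).
    return producers * consumers

def pipelines_with_bus(producers, consumers):
    # With a message bus, each endpoint connects only to the bus: O(N).
    return producers + consumers

# With 3 producers and 3 consumers, as in the diagram:
print(pipelines_point_to_point(3, 3))  # 9 pipelines
print(pipelines_with_bus(3, 3))        # 6 pipelines
```

The gap widens quickly: at 50 producers and 50 consumers, point-to-point needs 2,500 pipelines versus 100 through the bus.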
[Diagram: with point-to-point pipelines, producers A, B, C each connect to consumers X, Y, Z; with Kafka in the middle, all producers and consumers connect only to Kafka]
Apache Kafka Architecture
[Diagram: a topic spread across brokers A through N, each broker holding a partition of numbered offsets; a producer writes to the topic while consumers read offsets from the partitions]
Recap: Vertica & Kafka in 7.2 Excavator
Streaming Load Architecture
− Vertica schedules loads to continuously consume from any source via Kafka
− JSON, Avro, or custom data formats
− CLI-driven
− In-database monitoring
Breaking Things Down
Load Scheduler
– Implements continuous, exactly-once streaming
– Dynamically prioritizes resources to load from many topics

Microbatch Commands
– Loads a finite chunk of data
– Updates stream progress

Kafka UDx Plugin
– Pulls data from Kafka
– Converts Kafka messages to Vertica tuples
Kafka UDx Plugin
Extending Vertica's parallel load operators to load from Kafka
[Diagram: operator pipeline: Source emits raw bytes, Filter emits transformed bytes, Parse emits Vertica tuples, further operators transform tuples, and Store writes them to projections]
─ Vertica's execution is modeled as a series of data transformations pipelined through operators for processing
─ The user-defined extension (UDx) framework enables custom logic within this pipeline
─ The UDx writer worries about domain logic; Vertica worries about resource management, parallelism, node communication, fault tolerance...
Source: acquire bytes (files, HDFS, Kafka)
Filter: transform bytes (decryption, decompression)
Parse: convert bytes to tuples (JSON, Avro)
Store: write tuples to projections (WOS, ROS)
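The four operators above can be sketched as a toy Python pipeline. This illustrates the pipeline model only; the real UDx SDK is not Python, and all names here are invented for the example:

```python
import json

def source(messages):
    """Source: acquire raw bytes (here, from an in-memory Kafka-like list)."""
    for msg in messages:
        yield msg

def filter_stage(byte_stream, delimiter=b"$"):
    """Filter: transform bytes, e.g. append a record delimiter per message."""
    for chunk in byte_stream:
        yield chunk + delimiter

def parse(byte_stream, delimiter=b"$"):
    """Parse: convert delimited JSON bytes into tuples (dicts here)."""
    buffer = b"".join(byte_stream)
    for record in buffer.split(delimiter):
        if record:
            yield json.loads(record)

def store(tuples):
    """Store: write tuples to a target (a list standing in for a projection)."""
    return list(tuples)

messages = [b'{"a": "foo"}', b'{"b": "bar"}']
rows = store(parse(filter_stage(source(messages))))
print(rows)  # [{'a': 'foo'}, {'b': 'bar'}]
```

Each stage consumes the previous stage's output lazily, mirroring how Vertica pipelines data through operators rather than materializing each step.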
Microbatch Commands
Support "exactly once" through Vertica transactions
[Diagram: microbatch (µB) cycle: read Kafka data at offset X, insert the data into Vertica, store the new offsets in Vertica, then commit]
Scheduler SQL Statements
SELECT source, target_table, partition, start_offset
FROM stream_microbatch_history;
-- run a microbatch for each item returned

COPY target_table
SOURCE KafkaSource(
    stream='topic|0|0,topic|1|0',
    brokers='broker:port',
    duration=interval '10000 milliseconds',
    stop_on_eof=true)
PARSER KafkaJSONParser()
REJECTED DATA AS TABLE rejections_table
DIRECT NO COMMIT;

INSERT INTO stream_microbatch_history(…,*)
FROM (SELECT KafkaOffsets() OVER ()) AS microbatch_results;

COMMIT;
stream_microbatch_history table stores state about what to do next. Bootstrap with a SELECT query.
KafkaSource instructs Vertica nodes to load in parallel from Kafka for a period of time, starting at the specified <topic|partition|offset>’s
KafkaJSONParser converts Kafka JSON messages emitted by the source into Vertica tuples for storage
KafkaOffsets returns the ending offset for each <topic|partition|offset> read by the source. Next frame will start here.
Commit atomically persists the data and ending offsets. It’s all or nothing!
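A minimal Python sketch of why committing the data and the ending offsets in one transaction yields exactly-once delivery. The in-memory Database class is a stand-in for illustration, not Vertica's implementation:

```python
class Database:
    """Toy transactional store: data rows and stream offsets commit together."""
    def __init__(self):
        self.rows, self.offsets = [], {"topic|0": 0}

    def run_microbatch(self, kafka_messages, fail_before_commit=False):
        start = self.offsets["topic|0"]
        staged = kafka_messages[start:]          # load from the last stored offset
        new_offset = start + len(staged)
        if fail_before_commit:
            return                               # rollback: nothing persisted
        # COMMIT: data and ending offset become visible atomically.
        self.rows.extend(staged)
        self.offsets["topic|0"] = new_offset

db = Database()
stream = ["m0", "m1", "m2"]
db.run_microbatch(stream, fail_before_commit=True)  # crash: no partial state
db.run_microbatch(stream)                           # retry reloads from offset 0
db.run_microbatch(stream + ["m3"])                  # next batch loads only "m3"
print(db.rows)  # ['m0', 'm1', 'm2', 'm3']: no duplicates, no gaps
```

Because a failed batch persists neither rows nor offsets, a retry always restarts from a consistent point; storing offsets outside the transaction would allow duplicates or gaps.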
Scheduling
Static Scheduling Algorithm
Simple example:
– 5 topics
– Concurrency of 1
– Frame split into 5 equal parts
– 10 seconds total: 2 seconds each
[Diagram: frame split into five equal, fixed slots 1–5]
Hot topics become starved; lots of wasted time!
Dynamic Scheduling Algorithm
[Diagram: batches 1 through 5 start in turn; each later batch begins with a larger share of the remaining frame time]
To start, every batch gets an even portion of the frame.
If a batch ends early, split the leftover time evenly amongst the remaining batches. Corollary: batches that run later in the frame tend to have more time to run.
…but there’s still some wasted time at the end of the frame.
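The redistribution rule above can be sketched in Python. This is an illustrative model, not the scheduler's actual code; the frame duration and per-batch runtimes are made-up numbers:

```python
def allocate_frame(frame_ms, runtimes_ms):
    """Simulate one frame: each batch gets an even share of the remaining time;
    if a batch finishes early, its leftover is split among the later batches."""
    allotted = []
    remaining = frame_ms
    n = len(runtimes_ms)
    for i, need in enumerate(runtimes_ms):
        share = remaining / (n - i)      # even split of what's left
        used = min(need, share)
        allotted.append(share)
        remaining -= used
    return allotted, remaining           # remaining = wasted tail time

# 5 batches, 10-second frame; batches 1-4 finish quickly, batch 5 is hungry
allotted, wasted = allocate_frame(10_000, [500, 500, 500, 500, 9_000])
print(allotted)  # [2000.0, 2375.0, 3000.0, 4250.0, 8000.0]
```

Note how the later batches inherit progressively more time, matching the corollary above.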
Dynamic Scheduling Algorithm
[Diagram: the next frame runs batches in order 5, 4, 3, 1, 2]
Next time, sort by the runtime of the previous frame so that batches that ended early go first.
µB2 gets lots of time now!
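The sort step can likewise be sketched in Python. Again an illustrative model, with invented runtimes chosen so the resulting order matches the diagram:

```python
def order_for_next_frame(batch_ids, previous_runtimes_ms):
    """Sort batches by last frame's runtime, shortest first, so that quick
    batches run early and hungry batches inherit the accumulated leftovers."""
    return [b for b, _ in sorted(zip(batch_ids, previous_runtimes_ms),
                                 key=lambda pair: pair[1])]

# Batch 2 was the hot topic last frame; it now runs last, with the most time.
order = order_for_next_frame([1, 2, 3, 4, 5], [800, 9000, 700, 500, 100])
print(order)  # [5, 4, 3, 1, 2]
```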
Dynamic Scheduling in Action
– Scheduler configured to load two topics with a frame duration of 5 seconds
– Two producers continuously producing at dynamic rates (dotted lines)
– Vertica’s load rate for the topics keeps up with the produce rate (solid lines), roughly 5 seconds behind
– Net throughput rate remains constant as load resources shift from one topic to the other
Since 7.2 Excavator
Added Since 7.2 Excavator
– Multiple Kafka cluster support
  – Added capability within a scheduler configuration to set up multiple Kafka clusters
  – Kafka topics can be associated with clusters, allowing users to stream data into Vertica from anywhere
  – Single resource pool; single configuration
– Kafka version support
  – Added support for Kafka 0.9.x
  – Working with Confluent to keep up to date with Kafka's fast release cycles
  – 0.10 in the works
[Diagram: a single scheduler configuration connecting multiple Kafka clusters to Vertica]
User-Defined Filters and Parsers
Why only JSON & Avro in 7.2 Excavator?
– Kafka messages arrive with structure & metadata in the source
– Traditional parsers assume no structure; instead they discover that structure in the data stream
– The Kafka JSON & Avro parsers were specially designed to preserve & leverage that information without modifying the data stream

How can I use other formats? Inject a filter!
– KafkaInsertDelimiters(delimiter=E'$')
– KafkaInsertLengths()

Once filtered, data can be parsed using built-in parsers or your own custom UDParser.
User-Defined Filters and Parsers Example
KafkaInsertDelimiters(delimiter=E'$')
– Appends a delimiter character after each message
– Most built-in parsers look for a record boundary

COPY t SOURCE KafkaSource(stream='some_topic|0|-2', stop_on_eof=true, brokers='localhost:9092')
FILTER KafkaInsertDelimiters(delimiter=E'$')
RECORD TERMINATOR E'$' DIRECT;

KafkaInsertLengths()
– Prepends a uint32 length before each message
– Custom parsers can inspect lengths for efficient parsing

COPY t SOURCE KafkaSource(stream='some_topic|0|-2', stop_on_eof=true, brokers='localhost:9092')
FILTER KafkaInsertLengths()
PARSER MyCustomParser() DIRECT;
KafkaInsertDelimiters:
  Data in Kafka: Offset 0: {a:"foo"}, Offset 1: {b:"bar"}, Offset 2: {c:"baz"}
  Data emitted by SOURCE: {a:"foo"}{b:"bar"}{c:"baz"}
  Data emitted by FILTER: {a:"foo"}${b:"bar"}${c:"baz"}$

KafkaInsertLengths:
  Data in Kafka: Offset 0: {a:"foo"}, Offset 1: {b:"bar"}, Offset 2: {c:"baz"}
  Data emitted by SOURCE: {a:"foo"}{b:"bar"}{c:"baz"}
  Data emitted by FILTER: 9{a:"foo"}9{b:"bar"}10{c:"baz"}
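The two framing strategies above can be mimicked in a few lines of Python. This illustrates the byte-stream effect only; whether the real KafkaInsertLengths uses big-endian encoding is an assumption here:

```python
import struct

def insert_delimiters(messages, delimiter=b"$"):
    """Mimic KafkaInsertDelimiters: append a delimiter after each message."""
    return b"".join(m + delimiter for m in messages)

def insert_lengths(messages):
    """Mimic KafkaInsertLengths: prepend a uint32 length (big-endian assumed)."""
    return b"".join(struct.pack(">I", len(m)) + m for m in messages)

msgs = [b'{a:"foo"}', b'{b:"bar"}', b'{c:"baz"}']
print(insert_delimiters(msgs))                            # delimited stream
print(struct.unpack(">I", insert_lengths(msgs)[:4])[0])   # first length: 9
```

A downstream parser can then split on the delimiter, or read each length prefix to know exactly how many bytes to consume per record.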
KafkaAVROParser External Schema Support
– Avro documents have three parts
  – Schema: JSON blob describing the message(s) in the document
  – Object metadata: metadata for parsing the object (SpecificData vs GenericData)
  – The data (i.e., a Vertica row)
– Kafka Avro serializers typically do one document per Kafka message: lots of bloat!
– Remove bloat with two settings:
  – external_schema: specify the JSON header up front and omit it from your messages
  – with_metadata=false (the default) to omit metadata (parse using Avro GenericData)
– Kafka 0.10 schema registry not supported yet
[Diagram: one schema (JSON) shared by many metadata+data records]
Vertica → Kafka: KafkaExport UDT
Send query results to a Kafka topic in parallel!
– Input:
  – Partition (optional, NULL for round-robin)
  – Key (optional, NULL for unkeyed)
  – Message
– Output is the messages that failed to send & the reasons why (at-least-once semantics)
– Typical Kafka producer settings available to control performance & reliability
– INSERT … (SELECT …) for error management
CREATE TEMP TABLE kafka_rejections(
    partition integer,
    key varchar(128),
    message varchar(2000),
    reason varchar(1000));

INSERT INTO kafka_rejections
SELECT KafkaExport(
    partition, key, message
    USING PARAMETERS
        brokers='host1:9092,host2:9092',
        topic='foo',
        message_timeout_ms=5000,
        queue_buffering_max_ms=2000,
        queue_buffering_max_messages='10000')
OVER(PARTITION BEST) FROM export_src;
Vertica → Kafka: Notifiers
– Notifiers emit messages to external systems, starting with Kafka
– Data Collector hooks can trigger notifiers when a record is written
– Enables external monitoring of Vertica, with persistence!
CREATE NOTIFIER dc_to_kafka
ACTION 'kafka://localhost:9092'
MAXMEMORYSIZE '1GB';
Schema and CLI Enhancements: a more flexible, more [re-]usable Scheduler
– CLI reworked for more flexibility, maintainability & extensibility
– Separated configuration from state: no more worrying about configuring topics and having the entire offset history updated
– Better projection design to optimize scheduler operations
– More consistent CLI-to-configuration-schema mappings to make SQL-based monitoring easier
[Diagram: scheduler configuration components: Microbatch, Source, Cluster, Target, Load Spec]
From Old to New
Old CLI utilities:
– scheduler
– kafka-cluster
– topic
New CLI utilities:
– scheduler
– cluster
– source
– target
– load-spec
– microbatch
The old topic utility managed several different components; now each component has its own logical utility.
More Flexibility
– Configure clusters that reference Kafka brokers– vkconfig cluster --create --cluster kafka1 --hosts some-kafka-broker:9092
– Separation of Topic (now: Source) and Target:– vkconfig source --create --source topic1 --cluster kafka1 --partitions 3– vkconfig source --create --source topic2 --cluster kafka1 --partitions 5– vkconfig target --create --target-schema public --target-table tgt1
– Configure microbatches with N:1 source(s) → target
  – Reuse sources and targets as desired
  – Full N:M multiplexing with M microbatches
– vkconfig microbatch --create --microbatch mb1 --target-schema public --target-table tgt1 --add-source-cluster kafka1 --add-source topic1
– vkconfig microbatch --update --microbatch mb1 --add-source-cluster kafka1 --add-source topic2
Note: bold marks the unique keys used to reference the specific part of the configuration.
More [re-]usability
– COPY statements have many parameters, which are great for differing workloads.
– Sometimes, however, we want to reuse the same “load specification”:
– Introducing new CLI and configuration table: load spec
vkconfig load-spec --create --load-spec SPEC-1 --load-method DIRECT --parser KafkaJSONParser --parser-parameters flatten_tables=true
vkconfig microbatch --update --microbatch mb1 --load-spec SPEC-1
SPEC-1:
- Load DIRECT
- JSON format
- Flatten JSON
- No FILTERs

SPEC-2:
- Load TRICKLE
- Pipe-delimited CSV format
- Insert-delimiter FILTER
- Specific Kafka configs
The New CLI
vkconfig cluster --create --cluster kafka1 --hosts some-kafka-broker:9092
vkconfig source --create --source topic1 --cluster kafka1 --partitions 3
vkconfig source --create --source topic2 --cluster kafka1 --partitions 5
vkconfig target --create --target-schema public --target-table tgt1
vkconfig load-spec --create --load-spec SPEC-1 --load-method DIRECT --parser KafkaJSONParser --parser-parameters flatten_tables=true
vkconfig microbatch --create --microbatch mb1 --target-schema public --target-table tgt1 --add-source-cluster kafka1 --add-source topic1
vkconfig microbatch --update --microbatch mb1 --add-source-cluster kafka1 --add-source topic2
– Each component has its own CLI
– Each instance of a component is uniquely identifiable
– All components are reusable
– Each component is independently editable
– CRUD keywords for consistency
Upgrade Process
– The upgrade converts current scheduler settings & state to the new format
– Old config state is left intact for historical purposes, but is no longer used
– vkconfig scheduler --upgrade [--upgrade-to-schema <desired-schema>]
  – By default, the upgrade converts your 7.2.x configuration within the same schema
  – --upgrade-to-schema lets users move the upgraded schema to a new location
  – All objects have human-readable identifiers; the upgrade auto-generates names, which can be edited afterwards
Monitoring Kafka Loading with Vertica Management Console
Monitoring data load activities in MC – available in Frontloader 8.0
– Displays the history of data loading jobs, including COPY commands
– Shows the outcome of individual COPY commands
After configuring Kafka streams, Kafka loading means many, many COPY commands executed repeatedly over time…
How is Kafka loading different from other types of data loading?
Need to track and display many different pieces of data!
– Is the data flowing?
– What microbatches are defined in the database?
– Is the data getting processed by my microbatches?
– Is the Scheduler running?
– How many messages have been processed in the last hour? In the last frame?
– Are there any errors?
– Are there any rejections?
The MC now presents a separate view of Instance and Continuous types of loading
The MC user can easily focus on the type of loading tasks that they want to track
Continuous (Kafka) loading – your data flow at a glance
Monitoring Kafka loading: MC data collector streams
Explore the details…
[Screenshots: Scheduler and Microbatch detail views]
Explore the details…
[Screenshots: Microbatch errors and Microbatch rejections]
Suspend or Resume Topic Processing
Filtering Continuous Loads
Filtering out MC data collector monitoring streams
Filtering on the source
Benefits of Using MC to Monitor Kafka
Monitor the Scheduler: is it running?
Monitor microbatches: are they enabled?
Monitor microbatch message processing: is the data flowing as expected?
Easier to triage errors and rejected data
Wrap Up
– Enhancements to the integration
– Closed the loop:
  – Export Vertica records to Kafka
  – Write DC table data to Kafka
– Enhanced filtering & parsing capabilities
  – Any UDFilter & UDParser can be used, not just Kafka-specific ones
  – Native Vertica parsing
– Scheduler: extensible, relational schema design
  – Flexible
  – Easily monitored via SQL
– Scheduler CLI enhancements
– Monitoring
  – Browser-based access to the status of the scheduler and microbatches
  – Easy assessment of issues such as data not loading, errors, and rejections
Thank you
Contact information