Streaming ETL for All
© Rocana, Inc. All Rights Reserved. | 1
Joey Echeverria, Platform Technical Lead
San Francisco Hadoop Users Group, June 14th 2016
San Francisco, CA
Streaming ETL for All
Slides
http://bit.ly/streaming-etl-slides
Context
Joey
• Where I work: Rocana – Platform Technical Lead
• Where I used to work: Cloudera (’11-’15), NSA
• Distributed systems, security, data processing, big data
History
“Legacy” data architecture
[Diagram: Data Producers, Flume/Sqoop, HDFS (Avro/Parquet Files), MapReduce, Spark, Impala, Visualization/Query]
Stream data architecture
[Diagram: Data Producers, Kafka (Avro Serialized Records), Storm, Flink, Spark Streaming, Real-time Visualization, Kafka Consumers, HDFS (Avro/Parquet Files)]
Stream processing
A primer
Stream processing
• Filter
• Extract
• Project
• Aggregate
• Join
• Model
• Data transformation
Apache Storm
• "Distributed real-time computation system"
• Applications packaged into topologies (think MapReduce job)
• Topologies operate over streams of tuples
• Spout: source of a stream
• Bolt: arbitrary operation such as filtering, aggregating, joining, or executing arbitrary functions
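To make the spout and bolt vocabulary concrete, here is a minimal plain-Java sketch. It deliberately does not use Storm's API: `Spout`, `Bolt`, and `runTopology` are hypothetical names invented for illustration, and the "topology" is just a linear chain running in one thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Plain-Java illustration only; these are NOT Storm's interfaces.
final class TopologySketch {

    /** Hypothetical spout: a source that emits tuples until it is drained. */
    interface Spout {
        String nextTuple(); // returns null when there is nothing left
    }

    /** Hypothetical bolt: consumes one tuple and emits zero or more tuples. */
    interface Bolt {
        List<String> execute(String tuple);
    }

    /** Pulls tuples from the spout and pushes each one through the bolts in order. */
    static List<String> runTopology(Spout spout, List<Bolt> bolts) {
        List<String> output = new ArrayList<>();
        for (String tuple = spout.nextTuple(); tuple != null; tuple = spout.nextTuple()) {
            List<String> current = new ArrayList<>();
            current.add(tuple);
            for (Bolt bolt : bolts) {
                List<String> next = new ArrayList<>();
                for (String t : current) {
                    next.addAll(bolt.execute(t));
                }
                current = next;
            }
            output.addAll(current);
        }
        return output;
    }

    /** Wires a fixed-list spout to a filtering bolt and an upper-casing bolt. */
    static List<String> demo() {
        Iterator<String> source =
                Arrays.asList("info ok", "error disk", "error net").iterator();
        Spout spout = () -> source.hasNext() ? source.next() : null;
        Bolt onlyErrors = t -> t.startsWith("error")
                ? Arrays.asList(t)
                : new ArrayList<String>();
        Bolt upperCase = t -> Arrays.asList(t.toUpperCase());
        return runTopology(spout, Arrays.asList(onlyErrors, upperCase));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A real Storm topology distributes spouts and bolts across a cluster and handles acknowledgement and replay; none of that is modeled here.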
Apache Spark
• Supports batch and stream processing
• Continuous stream of records discretized into a DStream
• DStream: a sequence of RDDs (batches of records)
• Micro-batch
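The micro-batch idea can be sketched without Spark: records are grouped by the fixed time interval they arrive in, and each group is then processed as a small batch. The class and method names below are hypothetical, and the sketch assumes non-negative millisecond timestamps; it is not the DStream API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of discretizing a stream; not Spark's API.
final class MicroBatchSketch {

    /**
     * Groups record timestamps (epoch millis, assumed non-negative) into
     * consecutive fixed-length batches, keyed by the batch start time.
     */
    static Map<Long, List<Long>> discretize(List<Long> timestamps, long batchMillis) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long ts : timestamps) {
            long batchStart = (ts / batchMillis) * batchMillis;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(ts);
        }
        return batches;
    }
}
```

Each map entry plays the role of one batch in the sequence; latency is bounded below by the batch interval, which is the main trade-off against record-at-a-time systems.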
Apache Flink
• Supports batch and stream processing
• DataStream: unbounded collection of records
• Operations can apply to individual records or windows of records
• Supports record-at-a-time processing (like Storm)
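The difference between per-record and windowed operations can be shown in a few lines of plain Java. This is a sketch with hypothetical names, not Flink's DataStream API: `perRecord` runs as each record arrives, and `windowSums` emits one result per count-based tumbling window.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of record-at-a-time vs. windowed operations.
final class WindowSketch {

    /** Per-record operation: applied to each record as it arrives. */
    static int perRecord(int value) {
        return value * 2;
    }

    /**
     * Count-based tumbling window: emits the sum of every `size` consecutive
     * transformed records. A trailing partial window is dropped.
     */
    static List<Integer> windowSums(List<Integer> records, int size) {
        List<Integer> sums = new ArrayList<>();
        int sum = 0;
        int count = 0;
        for (int record : records) {
            sum += perRecord(record);
            count++;
            if (count == size) {
                sums.add(sum);
                sum = 0;
                count = 0;
            }
        }
        return sums;
    }
}
```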
Apache Kafka
• Pub-sub messaging system implemented as a distributed commit log
• Popular as a source and sink for data streams
• Scalability, durability, and easy-to-understand delivery guarantees
• Can do stream processing directly in Kafka consumers
• Kafka Streams
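The "commit log" model is what makes Kafka usable as both source and sink: records are appended once, and each consumer tracks its own read offset. The following in-memory sketch uses hypothetical names and is not the Kafka client API; it only illustrates that consumers read independently from the same log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration of an append-only log with per-consumer offsets.
final class CommitLogSketch {
    private final List<String> log = new ArrayList<>();
    private final Map<String, Integer> offsets = new HashMap<>();

    /** Appends a record and returns its offset in the log. */
    int append(String record) {
        log.add(record);
        return log.size() - 1;
    }

    /** Returns every record this consumer has not yet seen and advances its offset. */
    List<String> poll(String consumer) {
        int from = offsets.getOrDefault(consumer, 0);
        List<String> batch = new ArrayList<>(log.subList(from, log.size()));
        offsets.put(consumer, log.size());
        return batch;
    }
}
```

Because reading does not remove records, a second consumer (for example one writing to HDFS) can start later and still see the full history.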
Data transformation
Filter
filter
Extract
127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] "GET /index.html HTTP/1.0" 200 2326

ts: 1436576671000
body: <binary blob>
event_type_id: 100
...

extract

ts: 1436576671000
body: <binary blob>
event_type_id: 100
attributes: {
  ip: "127.0.0.1"
  user_agent: "Mozilla/5.0"
  user_id: "laura"
  date: "[31/March/2016]"
  request: "GET /index.html HTTP/1.0"
  status_code: "200"
  size: "2326"
}
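As a rough illustration of the extract step, the sketch below pulls the same attributes out of the example log line with `java.util.regex`. The field order in the pattern is inferred from the single example on the slide, and `ExtractSketch` is a hypothetical name; this is not how Rocana Transform implements extraction.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of "extract": raw text in, string attributes out.
final class ExtractSketch {
    // Assumed layout: ip, user agent, user id, [date], "request", status, size.
    private static final Pattern LOG_LINE = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) (\\[[^\\]]+\\]) \"([^\"]*)\" (\\d{3}) (\\d+)$");

    private static final String[] FIELDS = {
        "ip", "user_agent", "user_id", "date", "request", "status_code", "size"
    };

    /** Returns the extracted attributes, or an empty map if the line does not match. */
    static Map<String, String> extract(String line) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = LOG_LINE.matcher(line);
        if (m.matches()) {
            for (int i = 0; i < FIELDS.length; i++) {
                attributes.put(FIELDS[i], m.group(i + 1));
            }
        }
        return attributes;
    }
}
```

Every extracted value is still a string at this point, which matches the quoted `status_code` and `size` in the attributes map.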
Project
ts: 1436576671000
body: <binary blob>
event_type_id: 100
attributes: {
  ip: "127.0.0.1"
  user_agent: "Mozilla/5.0"
  user_id: "laura"
  date: "[31/March/2016]"
  request: "GET /index.html HTTP/1.0"
  status_code: "200"
  size: "2326"
}

project

ts: 1459444413000
ip: "127.0.0.1"
user_agent: "Mozilla/5.0"
user_id: "laura"
request: "GET /index.html HTTP/1.0"
status_code: 200
size: 2326
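Projection keeps only the fields of interest, promotes them to the top level, and converts them to proper types. A minimal sketch, with hypothetical names: it passes the timestamp through unchanged rather than deriving it from the date attribute, which the slide appears to do.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of "project": nested string attributes in, flat typed record out.
final class ProjectSketch {

    /**
     * Builds a flat record from the string attributes. The raw body and the
     * date attribute are dropped. Throws NumberFormatException if status_code
     * or size is missing or not numeric.
     */
    static Map<String, Object> project(long ts, Map<String, String> attributes) {
        Map<String, Object> flat = new LinkedHashMap<>();
        flat.put("ts", ts);
        flat.put("ip", attributes.get("ip"));
        flat.put("user_agent", attributes.get("user_agent"));
        flat.put("user_id", attributes.get("user_id"));
        flat.put("request", attributes.get("request"));
        flat.put("status_code", Integer.parseInt(attributes.get("status_code")));
        flat.put("size", Integer.parseInt(attributes.get("size")));
        return flat;
    }
}
```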
Problem
Who
• Developers
• Data engineers
• Sysadmins
• Analysts
Tools
The dark art of data science
• Feature engineering
• “Getting a mess of raw data that can be used as input to a machine learning algorithm” - @josh_wills
• Video from Midwest.io 2014
Data transformation for all
Rocana Transform
• Library
• Java
• Rocana configuration
  • JSON + comments + specific numeric types - excess quoting
Data model
• Event schema
  • id: A globally unique identifier for this event
  • ts: Epoch timestamp in milliseconds
  • event_type_id: ID indicating the type of the event
  • location: Location from which the event was generated
  • host: Hostname, IP, or other device identifier from which the event was generated
  • service: Service or process from which the event was generated
  • body: Raw event content in bytes
  • attributes: Event type-specific key/value pairs
Example event
{
  "id": "JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====",
  "event_type_id": 100,
  "ts": 1436576671000,
  "location": "aws/us-west-2a",
  "host": "example01.rocana.com",
  "service": "dhclient",
  "body": "<36>Jul 10 18:04:31 gs09.example.com dhclient[865] DHCPACK from …",
  "attributes": {
    "syslog_timestamp": "1436576671000",
    "syslog_process": "dhclient",
    "syslog_pid": "865",
    "syslog_facility": "3",
    "syslog_severity": "6",
    "syslog_hostname": "example01",
    "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)"
  }
}
Filter, extract, and flatten
Filter, extract, and flatten
• Filter out events without type id 100
• Filter out events without hostname prefix "ex"
• Extract a numeric prefix from the syslog message
• Flatten syslog attributes to top-level fields in a different avro schema
Filter, extract, and flatten
{
  load-event: {},

  // Filter by event_type_id
  filter: { expression: "${event_type_id == 100}" },

  // Extract hostname prefix
  regex: { ... },
  filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },

  // Extract a numeric prefix from the syslog message
  regex: { ... },

  // Build flattened record
  build-avro-record: { ... },

  // Accumulate output record
  accumulate-output: { value: "${output_record}" }
}
Extract hostname prefix
{
  load-event: {},
  filter: { expression: "${event_type_id == 100}" },
  regex: {
    pattern: "^(.{2}).*$",
    value: "${attr.syslog_hostname}",
    destination: "host_prefix"
  },
  filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
  ...
}
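For readers more comfortable with code than configuration, the two filters above correspond roughly to the following hand-written Java predicate. The regular expression is the one from the configuration; the class and method names are hypothetical, and how Rocana Transform treats a non-matching regex inside a later filter expression is not shown in the slides, so the sketch simply drops such events.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-coded equivalent of the event_type_id filter plus the hostname-prefix filter.
final class FilterSketch {
    private static final Pattern HOST_PREFIX = Pattern.compile("^(.{2}).*$");

    /** Keeps only events of type 100 whose syslog hostname starts with "ex". */
    static boolean keep(int eventTypeId, String syslogHostname) {
        if (eventTypeId != 100) {
            return false;
        }
        Matcher m = HOST_PREFIX.matcher(syslogHostname);
        // Hostnames shorter than two characters do not match and are dropped.
        return m.matches() && "ex".equals(m.group(1));
    }
}
```

The point of the configuration approach is that this logic can be changed without compiling and deploying code like the above.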
Extract numeric prefix
  ...
  filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
  regex: {
    pattern: "^([0-9]*)",
    value: "${attributes['syslog_message']}",
    destination: "msg",
    match-actions: {
      set-values: { extracted_field: "${msg.match.group.1}" }
    },
    no-match-actions: {
      set-values: { extracted_field: "" }
    }
  },
  ...
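One detail of the pattern is worth checking in isolation. In `java.util.regex`, `^([0-9]*)` matches any input because `*` accepts zero digits, so a message without a numeric prefix yields an empty group rather than a failed match. The sketch below demonstrates that behavior; whether Rocana's regex action treats an empty match the same way is an assumption, and `NumericPrefixSketch` is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demonstrates how the numeric-prefix pattern behaves in java.util.regex.
final class NumericPrefixSketch {
    private static final Pattern NUMERIC_PREFIX = Pattern.compile("^([0-9]*)");

    /** Returns the leading run of digits, or "" when the message has none. */
    static String extractedField(String message) {
        Matcher m = NUMERIC_PREFIX.matcher(message);
        // find() succeeds even with zero digits, so the fallback is rarely reached.
        return m.find() ? m.group(1) : "";
    }

    /** True if the pattern finds a match, which it does for any non-null input. */
    static boolean matches(String message) {
        return NUMERIC_PREFIX.matcher(message).find();
    }
}
```

If the intent is to distinguish "has a numeric prefix" from "has none", a pattern such as `^([0-9]+)` would make the no-match branch meaningful.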
Build flattened record
  ...
  build-avro-record: {
    schema-uri: "resource:avro-schemas/flattened-syslog.avsc",
    destination: "output_record",
    field-mapping: {
      ts: "${ts}",
      event_type_id: "${event_type_id}",
      source: "${source}",
      syslog_facility: "${convert:toInt(attributes['syslog_facility'])}",
      syslog_severity: "${convert:toInt(attributes['syslog_severity'])}",
      ...
      syslog_message: "${attributes['syslog_message']}",
      syslog_pid: "${convert:toInt(attributes['syslog_pid'])}",
      extracted_field: "${extracted_field}"
    },
  },
  ...
Extract metrics from log data
Extract metrics
• Input: HTTP status logs
• Extract request latency
• Extract counts by HTTP status code
• Metric types
  • Gauge: A value that varies over time (think latency, CPU %, etc.)
  • Counter: A value that accumulates over time (think event volume, status codes, etc.)
Example metric event
{
  "id": "JRHAIDMLCKLEAPMIQDHFLO3MXBBQ7NVBEJNDKZGS2XVSEINGGBHA====",
  "event_type_id": 107,
  "ts": 1436576671000,
  "location": "aws/us-west-2a",
  "host": "web01.rocana.com",
  "service": "httpd",
  "attributes": {
    "m.http.request.latency": "4.2000000000E1|g",
    "m.http.status.401.count": "1.0000000000E0|c"
  }
}
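The metric attribute values appear to encode a number in scientific notation followed by `|g` for a gauge or `|c` for a counter. That reading is inferred from this one example, not from a published format, so treat the following as a sketch. It parses a value and shows the difference in how the two metric types combine over time; all names are hypothetical.

```java
// Sketch of decoding "<value>|<type>" metric strings (format inferred from the example).
final class MetricSketch {

    /** Numeric part before the '|' separator. */
    static double value(String encoded) {
        return Double.parseDouble(encoded.substring(0, encoded.lastIndexOf('|')));
    }

    /** Type marker after the '|' separator: 'g' (gauge) or 'c' (counter). */
    static char type(String encoded) {
        return encoded.charAt(encoded.length() - 1);
    }

    /**
     * Folds a new observation into the previous value: counters accumulate,
     * gauges are replaced by the latest reading.
     */
    static double combine(double previous, String encoded) {
        return type(encoded) == 'c' ? previous + value(encoded) : value(encoded);
    }
}
```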
Extract metrics
{
  load-event: {},
  build-metric: {
    gauge-mapping: {
      http.request.latency: "${convert:toDouble(attributes['latency'])}"
    },
    destination: "latency_metric"
  },
  accumulate-output: { value: "${latency_metric}" },
  build-metric: {
    dynamic-counter-mapping: [
      "${string:format('http.status.%s.count', attributes['sc_status'])}", 1D
    ],
    destination: "status_metric"
  },
  accumulate-output: { value: "${status_metric}" }
}
Architecture
Architecture
[Diagram: Driver, Configuration file, Java action objects, Context, Variables]
1. Parse config
2. Initialize context
3. Execute actions
4. Read/write variables
5. Copy output
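The numbered flow can be sketched as a small driver loop. Everything here is hypothetical: `Context`, `Action`, and `run` are illustrative names, the `boolean` return from `execute` is an assumed way to model a filter stopping the chain, and step 1 (parsing the configuration into action objects) is taken as already done.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative driver loop; not Rocana Transform's actual classes.
final class DriverSketch {

    /** Hypothetical context: named variables plus accumulated output. */
    static final class Context {
        final Map<String, Object> variables = new HashMap<>();
        final List<Object> output = new ArrayList<>();
    }

    /** Hypothetical action: returns false to stop processing the current event. */
    interface Action {
        boolean execute(Context context);
    }

    /** Runs steps 2 to 5 for one event; `actions` stands in for the parsed config. */
    static List<Object> run(List<Action> actions, Map<String, Object> event) {
        Context context = new Context();           // 2. initialize context
        context.variables.putAll(event);
        for (Action action : actions) {            // 3. execute actions
            if (!action.execute(context)) {        // 4. actions read/write variables
                break;
            }
        }
        return new ArrayList<>(context.output);    // 5. copy output
    }

    /** Example chain: filter on type 100, derive a variable, accumulate it as output. */
    static List<Action> exampleActions() {
        Action filter = c -> Integer.valueOf(100).equals(c.variables.get("event_type_id"));
        Action upper = c -> {
            c.variables.put("host_upper", String.valueOf(c.variables.get("host")).toUpperCase());
            return true;
        };
        Action accumulate = c -> {
            c.output.add(c.variables.get("host_upper"));
            return true;
        };
        return Arrays.asList(filter, upper, accumulate);
    }
}
```

Keeping all state in the context is what lets the same action chain run inside any host, whether a Kafka consumer, a streaming framework, or a batch job.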
Custom actions
• Actions loaded at runtime using Java services framework
• Add your jar to the classpath
• Custom actions appear as top-level keywords just like regular actions
• Implement the execute() method of the Action interface
• Implement the build() method of the ActionBuilder interface
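A custom action might look roughly like the sketch below. The slides name the `Action` and `ActionBuilder` interfaces and their `execute()` and `build()` methods but do not show signatures, so the parameter types, the `keyword()` method, and the `uppercase` action are all assumptions. Discovery uses `java.util.ServiceLoader`, which is presumably what "Java services framework" refers to.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

// Assumed interface shapes; the real Rocana Transform signatures may differ.
final class CustomActionSketch {

    interface Action {
        void execute(Map<String, Object> variables);
    }

    interface ActionBuilder {
        String keyword();                          // hypothetical: top-level config keyword
        Action build(Map<String, String> config);
    }

    /** Hypothetical custom action: upper-cases one variable into another. */
    static final class UppercaseBuilder implements ActionBuilder {
        @Override
        public String keyword() {
            return "uppercase";
        }

        @Override
        public Action build(Map<String, String> config) {
            String from = config.get("value");
            String to = config.get("destination");
            return variables ->
                    variables.put(to, String.valueOf(variables.get(from)).toUpperCase());
        }
    }

    /**
     * Finds builders registered on the classpath. ServiceLoader reads provider
     * class names from META-INF/services/<fully qualified interface name>;
     * a real provider must be a public class with a public no-arg constructor.
     */
    static Map<String, ActionBuilder> discover() {
        Map<String, ActionBuilder> byKeyword = new HashMap<>();
        for (ActionBuilder builder : ServiceLoader.load(ActionBuilder.class)) {
            byKeyword.put(builder.keyword(), builder);
        }
        return byKeyword;
    }
}
```

Splitting builder from action lets configuration be validated once at parse time while the built action runs per event.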
Custom actions
• Parse custom log formats
  • Cisco ACS
  • Citrix
  • Juniper
  • Customer-specific formats
• Look up IP addresses in the MaxMind GeoIP2 database
• Reference dataset lookups
  • Device id to device name
Putting it all together
• Stream processing is causing us to re-think how we analyze data
• Limiting who can access the data transformation side increases costs and decreases velocity
• Reduce your reliance on developers to code custom pipelines
• Re-use transformation configuration in any stream processing framework or batch job
Coming soon
• Rocana Transform will be released under the ASL 2.0
• The configuration library is available today:
  • https://github.com/scalingdata/rocana-configuration
Questions?