(SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

November 13, 2014 | Las Vegas, NV Adi Krishnan, Sr. Product Manager Amazon Kinesis

Description

Amazon Kinesis is the AWS service for real-time streaming big data ingestion and processing. This talk gives a detailed exploration of Kinesis stream processing. We'll discuss in detail techniques for building and scaling Kinesis processing applications, including data filtering and transformation. Finally, we'll cover tips and techniques for emitting data into S3, DynamoDB, and Redshift.

Transcript of (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Page 1: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

November 13, 2014 | Las Vegas, NV

Adi Krishnan, Sr. Product Manager Amazon Kinesis

Page 2: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
Page 3: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
Page 4: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Scenarios Across Industry Segments

Scenarios: (1) Accelerated Ingest-Transform-Load, (2) Continual Metrics/KPI Extraction, (3) Responsive Data Analysis

Data Types: IT infrastructure and application logs, social media, financial market data, web clickstreams, sensors, geo/location data

Digital Ad Tech/Marketing: advertising data aggregation; advertising metrics like coverage, yield, conversion; analytics on user engagement with ads, optimized bid/buy engines

Software/Technology: IT server and app log ingestion; IT operational metrics dashboards; device/sensor operational intelligence

Financial Services: market/financial transaction order data collection; financial market data metrics; fraud monitoring, Value-at-Risk assessment, auditing of market order data

Consumer Online/E-Commerce: online customer engagement data aggregation; consumer engagement metrics like page views, CTR; customer clickstream analytics, recommendation engines

Page 5: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
Page 6: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Amazon Kinesis: Managed Service for streaming data ingestion and processing

[Architecture overview]
• Millions of sources producing 100s of terabytes per hour
• Front end: authentication, authorization
• Durable, highly consistent storage replicates data across three data centers (Availability Zones)
• Ordered stream of events supports multiple readers
• Consumers: aggregate and archive to S3; real-time dashboards and alarms; machine learning algorithms or sliding-window analytics; aggregate analysis in Hadoop or a data warehouse
• Inexpensive: $0.028 per million puts

Page 7: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Real-time Ingest
• Highly Scalable
• Durable
• Elastic
• Replay-able Reads

Continuous Processing
• Elastic
• Load-balancing incoming streams
• Fault-tolerance, Checkpoint / Replay
• Enable multiple processing apps in parallel

Enable data movement into Stores / Processing Engines

Managed Service

Low end-to-end latency

Page 8: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
Page 9: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Kinesis Stream

Managed Ability To Capture And Store Data

Page 10: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Putting Data into Kinesis

Simple Put interface to store data in Kinesis
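For illustration, a minimal put sketch using the Java SDK of this era; the stream name, partition key, and payload are placeholders:

import java.nio.ByteBuffer;
import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;

public class KinesisPutExample {
    public static void main(String[] args) {
        AmazonKinesisClient kinesis = new AmazonKinesisClient(); // default credential chain

        PutRecordRequest request = new PutRecordRequest();
        request.setStreamName("testStream");                     // placeholder stream name
        request.setPartitionKey("device-1234");                  // determines the target shard
        request.setData(ByteBuffer.wrap("payload".getBytes()));  // record data blob

        PutRecordResult result = kinesis.putRecord(request);
        System.out.println("Shard: " + result.getShardId()
                + ", sequence number: " + result.getSequenceNumber());
    }
}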

Page 11: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Best Practices: Putting Data in Kinesis
Determine Your Partition Key Strategy

• Kinesis as a managed buffer or a streaming map-reduce
• Ensure high cardinality for partition keys with respect to shards, to prevent a "hot shard" problem
– Generate random partition keys
• Streaming map-reduce: leverage partition keys for business-specific logic as applicable
– Partition key per billing customer, per DeviceId, per stock symbol (see the sketch below)
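A minimal sketch contrasting the two strategies; the helper names are illustrative:

import java.util.UUID;

public class PartitionKeys {
    // Managed buffer: a random, high-cardinality key spreads records
    // evenly across shards and avoids hot shards.
    static String randomKey() {
        return UUID.randomUUID().toString();
    }

    // Streaming map-reduce: a business key (here, per device) routes all
    // of one entity's records to the same shard, preserving their order.
    static String deviceKey(String deviceId) {
        return "device-" + deviceId;
    }
}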

Page 12: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Best Practices: Putting Data in Kinesis
Provisioning Adequate Shards

• For ingress needs
• Egress needs for all consuming applications: if more than 2 simultaneous consumers
• Include head-room for catching up with data in the stream in the event of application failures (see the sizing sketch below)
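As a worked sizing example, a sketch that assumes the per-shard limits documented at the time (1 MB/s ingress, 2 MB/s egress); the head-room factor is an illustrative choice:

public class ShardSizing {
    static final double INGRESS_MB_PER_SEC_PER_SHARD = 1.0;
    static final double EGRESS_MB_PER_SEC_PER_SHARD = 2.0;

    static int shardsNeeded(double ingressMBps, double egressMBps, double headroomFactor) {
        int forIngress = (int) Math.ceil(ingressMBps / INGRESS_MB_PER_SEC_PER_SHARD);
        int forEgress = (int) Math.ceil(egressMBps / EGRESS_MB_PER_SEC_PER_SHARD);
        // Take the larger requirement, then add head-room for catching up
        // with the stream after an application failure.
        return (int) Math.ceil(Math.max(forIngress, forEgress) * headroomFactor);
    }

    public static void main(String[] args) {
        // 10 MB/s in, 3 consumers each reading the full stream (30 MB/s out), 25% head-room.
        System.out.println(shardsNeeded(10, 30, 1.25)); // prints 19
    }
}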

Page 13: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Best Practices: Putting Data in Kinesis

Pre-Batch before Puts for better efficiency
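Before the log4j appender below, a minimal sketch of pre-batching by hand: packing several small messages into one put payload (the newline framing is an assumption, and the consumer must split on the same delimiter):

import java.nio.ByteBuffer;
import java.util.List;
import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

public class PreBatchPut {
    // One HTTP put now carries many messages instead of one.
    static void putBatched(AmazonKinesisClient kinesis, String stream,
                           String partitionKey, List<String> messages) {
        StringBuilder batch = new StringBuilder();
        for (String message : messages) {
            batch.append(message).append('\n');
        }
        PutRecordRequest request = new PutRecordRequest();
        request.setStreamName(stream);
        request.setPartitionKey(partitionKey);
        request.setData(ByteBuffer.wrap(batch.toString().getBytes()));
        kinesis.putRecord(request);
    }
}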

Page 14: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

# KINESIS appender
log4j.logger.KinesisLogger=INFO, KINESIS
log4j.additivity.KinesisLogger=false
log4j.appender.KINESIS=com.amazonaws.services.kinesis.log4j.KinesisAppender
# DO NOT use a trailing %n unless you want a newline to be transmitted to KINESIS after every message
log4j.appender.KINESIS.layout=org.apache.log4j.PatternLayout
log4j.appender.KINESIS.layout.ConversionPattern=%m
# mandatory properties for KINESIS appender
log4j.appender.KINESIS.streamName=testStream
# optional, defaults to UTF-8
log4j.appender.KINESIS.encoding=UTF-8
# optional, defaults to 3
log4j.appender.KINESIS.maxRetries=3
# optional, defaults to 2000
log4j.appender.KINESIS.bufferSize=1000
# optional, defaults to 20
log4j.appender.KINESIS.threadCount=20
# optional, defaults to 30 seconds
log4j.appender.KINESIS.shutdownTimeout=30

https://github.com/awslabs/kinesis-log4j-appender


Page 15: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

• Retry if the rise in input rate is temporary
• Reshard to increase the number of shards
• Monitor CloudWatch metrics: PutRecord.Bytes and GetRecords.Bytes keep track of shard usage

Metric             Units
PutRecord.Bytes    Bytes
PutRecord.Latency  Milliseconds
PutRecord.Success  Count

• Keep track of your metrics
• Log hashkey values generated by your partition keys
• Log Shard-Ids
• Determine which shards receive the most (hashkey) traffic

// Tag each put with a partition key, then log the shard it landed on.
putRecordRequest.setPartitionKey("myPartitionKey");
PutRecordResult putRecordResult = kinesis.putRecord(putRecordRequest);
String shardId = putRecordResult.getShardId();

Page 16: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Options (amazon-kinesis-scaling-utils):

• stream-name - The name of the stream to be scaled
• scaling-action - The action to be taken to scale. Must be one of "scaleUp", "scaleDown", or "resize"
• count - Number of shards by which to absolutely scale up or down, or resize to, or:
• pct - Percentage of the existing number of shards by which to scale up or down

https://github.com/awslabs/amazon-kinesis-scaling-utils
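The scaling utility wraps the low-level resharding APIs. For reference, a minimal manual split via the SplitShard API; the shard id and the full-range midpoint below are assumptions (read DescribeStream for the real hash-key range):

import java.math.BigInteger;
import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.SplitShardRequest;

public class ManualSplit {
    public static void main(String[] args) {
        AmazonKinesisClient kinesis = new AmazonKinesisClient();

        // Midpoint of the 128-bit hash-key space; valid only if the parent
        // shard still covers the full range.
        BigInteger midpoint = BigInteger.ONE.shiftLeft(127);

        SplitShardRequest split = new SplitShardRequest();
        split.setStreamName("testStream");
        split.setShardToSplit("shardId-000000000000");
        split.setNewStartingHashKey(midpoint.toString());
        kinesis.splitShard(split);
    }
}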

Page 17: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Sending & Reading Data from Kinesis Streams

Sending:
• HTTP POST
• AWS SDK
• AWS Mobile SDK
• LOG4J
• Flume
• Fluentd

Consuming:
• Get* APIs
• Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce

Page 18: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
Page 19: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Building Kinesis Applications: Kinesis Client Library
Open-source library for fault-tolerant, continuous processing apps

• Java client library, also available for Python developers
• Source available on GitHub
• Build the app with the Kinesis Client Library
• Deploy on your set of EC2 instances
• Every KCL application includes these components (see the bootstrap sketch below):
• Record processor factory: Creates the record processor
• Record processor: The processing unit that processes data from a shard of a Kinesis stream
• Worker: The processing unit that maps to each application instance
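A minimal bootstrap sketch for a KCL 1.x application; the application name, stream name, and MyRecordProcessor class are assumptions (the processor itself is sketched on the next page):

import java.util.UUID;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorFactory;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;

public class App {
    public static void main(String[] args) {
        KinesisClientLibConfiguration config = new KinesisClientLibConfiguration(
                "myKclApp",                    // application (and lease table) name
                "testStream",                  // stream to consume
                new DefaultAWSCredentialsProviderChain(),
                UUID.randomUUID().toString()); // worker id, unique per instance

        // The factory hands the KCL one record processor per owned shard.
        IRecordProcessorFactory factory = new IRecordProcessorFactory() {
            public IRecordProcessor createProcessor() {
                return new MyRecordProcessor();
            }
        };

        new Worker(factory, config).run();     // blocks; processes records until shutdown
    }
}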

Page 20: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

• The KCL uses the IRecordProcessor interface to communicate with your application

• A Kinesis application must implement the KCL's IRecordProcessor interface

• Contains the business logic for processing the data retrieved from the Kinesis stream
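A skeleton implementation, using the KCL 1.x interface and package names; the business logic and checkpoint error handling are placeholders:

import java.util.List;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
import com.amazonaws.services.kinesis.model.Record;

public class MyRecordProcessor implements IRecordProcessor {
    private String shardId;

    public void initialize(String shardId) {
        this.shardId = shardId;        // one processor instance per shard
    }

    public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) {
        for (Record record : records) {
            // Business logic goes here; record.getData() holds the payload.
        }
        try {
            checkpointer.checkpoint(); // mark progress once processing succeeds
        } catch (Exception e) {
            // Transient checkpoint failures can be retried.
        }
    }

    public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
        if (reason == ShutdownReason.TERMINATE) { // shard is closing (split/merge)
            try {
                checkpointer.checkpoint();
            } catch (Exception e) {
                // Ignore on shutdown.
            }
        }
    }
}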

Page 21: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

• One record processor maps to one shard and processes data records from that shard
• One worker maps to one or more record processors
• Balances shard-worker associations when worker / instance counts change
• Balances shard-worker associations when shards split or merge

Page 22: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Moving data into Amazon S3 and Redshift

Page 23: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Amazon Kinesis Connector Library
Customizable, open-source apps to connect Kinesis with S3, Redshift, DynamoDB

ITransformer
• Defines the transformation of records from the Amazon Kinesis stream to suit the user-defined data model

IFilter
• Excludes irrelevant records from the processing

IBuffer
• Buffers the set of records to be processed by specifying a size limit (number of records) and total byte count

IEmitter
• Makes client calls to other AWS services and persists the records stored in the buffer

[Diagram: a Kinesis stream feeding the connector pipeline, which emits to S3, DynamoDB, and Redshift]
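A minimal sketch of the ITransformer and IFilter stages, assuming the interface shapes of the 2014-era connector library and a plain-text event format of our own invention ("click|..." lines):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import com.amazonaws.services.kinesis.connectors.interfaces.IFilter;
import com.amazonaws.services.kinesis.connectors.interfaces.ITransformer;
import com.amazonaws.services.kinesis.model.Record;

// Decode raw Kinesis records into text lines for the pipeline.
public class ClickTransformer implements ITransformer<String, byte[]> {
    public String toClass(Record record) throws IOException {
        return StandardCharsets.UTF_8.decode(record.getData()).toString();
    }

    public byte[] fromClass(String record) throws IOException {
        // Re-encode for the emitter (e.g., newline-delimited S3 files).
        return (record + "\n").getBytes(StandardCharsets.UTF_8);
    }
}

// Keep only click events; everything else is dropped before buffering.
class ClickFilter implements IFilter<String> {
    public boolean keepRecord(String record) {
        return record.startsWith("click|");
    }
}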

Page 24: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
Page 25: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
Page 26: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
Page 27: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
Page 28: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014


Amazon Kinesis Connectors

• S3 Connector
– Batch writes files for archive into S3
– Uses sequence-based file naming scheme
• Redshift Connector
– Once written to S3, loads to Redshift
– Provides manifest support
– Supports user-defined transformers
• DynamoDB Connector
– BatchPut appends to a table
– Supports user-defined transformers

Page 29: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Best Practices: Processing Data From Kinesis
Build applications as part of an Auto Scaling group

• Helps with application availability
• Scales in response to incoming spikes in data volume, assuming shards have been provisioned
• Select scaling metrics based on the nature of the Kinesis application (see the sketch below)
– Instance metrics: CPU, memory, and others
– Kinesis metrics: PutRecord.Bytes, GetRecords.Bytes
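One hedged way to wire this up with the Java SDK: a scaling policy on the consumer Auto Scaling group, triggered by a CloudWatch alarm on the stream's PutRecord.Bytes metric. The group name, stream name, and threshold are assumptions:

import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyRequest;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class ScaleOnStreamTraffic {
    public static void main(String[] args) {
        // A policy that adds one instance to the consumer group.
        String policyArn = new AmazonAutoScalingClient().putScalingPolicy(
                new PutScalingPolicyRequest()
                        .withAutoScalingGroupName("kinesis-consumers")
                        .withPolicyName("scale-out-on-put-bytes")
                        .withAdjustmentType("ChangeInCapacity")
                        .withScalingAdjustment(1)).getPolicyARN();

        // Alarm on one minute of stream ingress and trigger the policy.
        new AmazonCloudWatchClient().putMetricAlarm(new PutMetricAlarmRequest()
                .withAlarmName("testStream-high-put-bytes")
                .withNamespace("AWS/Kinesis")
                .withMetricName("PutRecord.Bytes")
                .withDimensions(new Dimension().withName("StreamName").withValue("testStream"))
                .withStatistic(Statistic.Sum)
                .withPeriod(60)
                .withEvaluationPeriods(1)
                .withThreshold(50000000.0) // example: 50 MB/min into the stream
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                .withAlarmActions(policyArn));
    }
}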

Page 30: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Metric                  Units
PutRecord.Bytes         Bytes
PutRecord.Latency       Milliseconds
PutRecord.Success       Count
GetRecords.Bytes        Bytes
GetRecords.IteratorAge  Milliseconds
GetRecords.Latency      Milliseconds
GetRecords.Success      Count

Page 31: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Best Practices: Processing Data From Kinesis
Build a flush-to-S3 consumer app

• The app can specify three conditions that can trigger a buffer flush:
– Number of records
– Total byte count
– Time since last flush
• The buffer is flushed and the data is emitted to the destination when any of these thresholds is crossed (see the sketch after the configuration below).

# Flush when buffer exceeds 8 Kinesis records or the 1 KB size limit,
# or when time since last emit exceeds 10 minutes
bufferSizeByteLimit = 1024
bufferRecordCountLimit = 8
bufferMillisecondsLimit = 600000
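The flush decision itself is an OR over the three thresholds; a minimal sketch with the values from the configuration above (method and field names are illustrative):

public class FlushPolicy {
    static final long BUFFER_SIZE_BYTE_LIMIT = 1024;
    static final long BUFFER_RECORD_COUNT_LIMIT = 8;
    static final long BUFFER_MILLISECONDS_LIMIT = 600000;

    // Flush when ANY one of the three thresholds is crossed.
    static boolean shouldFlush(long bufferedBytes, long bufferedRecords, long lastFlushAtMillis) {
        return bufferedBytes >= BUFFER_SIZE_BYTE_LIMIT
                || bufferedRecords >= BUFFER_RECORD_COUNT_LIMIT
                || System.currentTimeMillis() - lastFlushAtMillis >= BUFFER_MILLISECONDS_LIMIT;
    }
}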

Page 32: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Best Practices: Processing Data From Kinesis

• In a KCL app, ensure data being processed is persisted to a durable store like DynamoDB or S3 prior to check-pointing (see the sketch below).
• Duplicates: Make the authoritative data repository (usually at the end of the data flow) resilient to duplicates. That way the rest of the system has a simple policy: keep retrying until you succeed.
• Idempotent processing: Use the number of records since the previous checkpoint to get repeatable results when record processors fail over.
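A sketch of the persist-then-checkpoint ordering from the first bullet; durableStoreWrite() is a placeholder for your S3 or DynamoDB write:

import java.util.List;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
import com.amazonaws.services.kinesis.model.Record;

public class PersistThenCheckpoint {
    static void process(List<Record> records, IRecordProcessorCheckpointer checkpointer) throws Exception {
        for (Record record : records) {
            durableStoreWrite(record); // may be retried; the store must tolerate duplicates
        }
        checkpointer.checkpoint();     // advance only after the whole batch is durable
    }

    static void durableStoreWrite(Record record) {
        // e.g., a DynamoDB put keyed by the record's sequence number,
        // which makes replays idempotent.
    }
}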

Page 33: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Best Practices: Processing Data From Kinesis

• Create a manifest file based on a custom set of input files
• Use a manifest stream with only one shard
• Adjust checkpoint frequency, connector buffer, and filter to align with your Redshift load models
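For reference, a Redshift COPY manifest is a small JSON file listing the S3 objects to load; the bucket and key names here are placeholders:

{
  "entries": [
    { "url": "s3://my-kinesis-archive/2014/11/13/part-0000", "mandatory": true },
    { "url": "s3://my-kinesis-archive/2014/11/13/part-0001", "mandatory": true }
  ]
}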

Page 34: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Amazon Kinesis Customer Scenarios

Page 35: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Collect all data of interest continuously

Page 36: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Faster time to market due to ease of deployment

Page 37: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Enable operators and partners to get to valuable data quickly

Page 38: (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

http://bit.ly/awsevals