(SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

November 13, 2014 | Las Vegas, NV

Adi Krishnan, Sr. Product Manager, Amazon Kinesis

Scenarios Across Industry Segments

Data Types: IT infrastructure, application logs, social media, financial market data, web clickstreams, sensors, geo/location data

Scenarios: Accelerated Ingest-Transform-Load, Continual Metrics/KPI Extraction, Responsive Data Analysis

Digital Ad Tech/Marketing

• Accelerated Ingest-Transform-Load: Advertising data aggregation

• Continual Metrics/KPI Extraction: Advertising metrics like coverage, yield, conversion

• Responsive Data Analysis: Analytics on user engagement with ads, optimized bid/buy engines

Software/Technology

• Accelerated Ingest-Transform-Load: IT server and app logs ingestion

• Continual Metrics/KPI Extraction: IT operational metrics dashboards

• Responsive Data Analysis: Devices/sensor operational intelligence

Financial Services

• Accelerated Ingest-Transform-Load: Market/financial transaction order data collection

• Continual Metrics/KPI Extraction: Financial market data metrics

• Responsive Data Analysis: Fraud monitoring, Value-at-Risk assessment, auditing of market order data

Consumer Online/E-Commerce

• Accelerated Ingest-Transform-Load: Online customer engagement data aggregation

• Continual Metrics/KPI Extraction: Consumer engagement metrics like page views, CTR

• Responsive Data Analysis: Customer clickstream analytics, recommendation engines

Amazon Kinesis

Managed Service for streaming data ingestion, and processing

• Millions of sources producing 100s of terabytes per hour

• Front End: Authentication, Authorization

• Durable, highly consistent storage replicates data across three data centers (availability zones)

• Ordered stream of events supports multiple readers

• Real-time dashboards and alarms

• Machine learning algorithms or sliding window analytics

• Aggregate analysis in Hadoop or a data warehouse

• Aggregate and archive to S3

• Inexpensive: $0.028 per million puts

Real-time Ingest

• Highly Scalable

• Durable

• Elastic

• Replay-able Reads

Continuous Processing FX

• Elastic

• Load-balancing incoming streams

• Fault-tolerance, Checkpoint / Replay

• Enable multiple processing apps in parallel

Enable data movement into Stores/ Processing Engines

Managed Service

Low end-to-end latency

Kinesis Stream

Managed Ability To Capture And Store Data

Putting Data into Kinesis

Simple Put interface to store data in Kinesis

Best Practices: Putting Data in Kinesis

Determine Your Partition Key Strategy

• Kinesis as a managed buffer or a streaming map-reduce

• Ensure a high cardinality for Partition Keys with respect to shards, to prevent a “hot shard” problem

– Generate Random Partition Keys

• Streaming Map-Reduce: Leverage Partition Keys for business specific logic as applicable

– Partition Key per billing customer, per DeviceId, per stock symbol
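As a rough illustration of why key cardinality matters: Kinesis hashes each partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The Java sketch below imitates that mapping under the assumption that the shards split the hash space into equal-width ranges, which need not hold after resharding. `PartitionKeySketch` and its methods are illustrative names, not SDK classes.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class PartitionKeySketch {
    // Size of the 128-bit hash key space: 2^128.
    private static final BigInteger HASH_SPACE = BigInteger.ONE.shiftLeft(128);

    /** MD5 of the partition key, read as an unsigned 128-bit integer. */
    static BigInteger hashKey(String partitionKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Index of the shard owning the key. ASSUMPTION: equal-width hash ranges. */
    static int shardIndex(String partitionKey, int shardCount) {
        return hashKey(partitionKey)
                .multiply(BigInteger.valueOf(shardCount))
                .divide(HASH_SPACE)
                .intValue();
    }

    /** Counts how many of the given keys land on each shard. */
    static int[] distribution(Iterable<String> keys, int shardCount) {
        int[] counts = new int[shardCount];
        for (String key : keys) {
            counts[shardIndex(key, shardCount)]++;
        }
        return counts;
    }
}
```

Feeding many distinct keys (random UUIDs, device IDs) through `distribution` should spread records over every shard, while a single fixed key sends every record to the same shard, which is the "hot shard" case.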

Best Practices: Putting Data in Kinesis

Provisioning Adequate Shards

• For ingress needs

• Egress needs for all consuming applications: if there are more than 2 simultaneous consumers

• Include head-room for catching up with data in stream in the event of application failures
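The sizing advice above can be turned into simple arithmetic. The sketch below assumes the per-shard limits AWS documented for Kinesis (1 MB/s and 1,000 records/s of writes, 2 MB/s of reads) and assumes every consumer reads the full stream; check the current service limits before relying on these numbers. `ShardSizingSketch`, `requiredShards` and the `headroom` multiplier are hypothetical names introduced for illustration.

```java
final class ShardSizingSketch {
    // ASSUMPTION: documented per-shard limits; verify against current service limits.
    static final double WRITE_MB_PER_SHARD = 1.0;
    static final double WRITE_RECORDS_PER_SHARD = 1000.0;
    static final double READ_MB_PER_SHARD = 2.0;

    /**
     * Estimates the shard count for a stream.
     *
     * @param ingressMBps   expected write volume in MB per second
     * @param recordsPerSec expected write rate in records per second
     * @param consumers     number of applications each reading the whole stream
     * @param headroom      multiplier >= 1.0 reserving capacity for catch-up
     */
    static int requiredShards(double ingressMBps, double recordsPerSec,
                              int consumers, double headroom) {
        double forBytesIn = ingressMBps / WRITE_MB_PER_SHARD;
        double forRecordsIn = recordsPerSec / WRITE_RECORDS_PER_SHARD;
        // Each consumer re-reads everything that was written.
        double forBytesOut = ingressMBps * consumers / READ_MB_PER_SHARD;
        double base = Math.max(forBytesIn, Math.max(forRecordsIn, forBytesOut));
        return Math.max(1, (int) Math.ceil(base * headroom));
    }
}
```

Under these assumptions egress only becomes the deciding factor with three or more consumers: at 4 MB/s of ingest, two consumers still fit in 4 shards, while three consumers need 6.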

Best Practices: Putting Data in Kinesis

Pre-Batch before Puts for better efficiency

# KINESIS appender
log4j.logger.KinesisLogger=INFO, KINESIS
log4j.additivity.KinesisLogger=false
log4j.appender.KINESIS=com.amazonaws.services.kinesis.log4j.KinesisAppender

# DO NOT use a trailing %n unless you want a newline to be transmitted to KINESIS after every message
log4j.appender.KINESIS.layout=org.apache.log4j.PatternLayout
log4j.appender.KINESIS.layout.ConversionPattern=%m

# mandatory properties for KINESIS appender
log4j.appender.KINESIS.streamName=testStream

#optional, defaults to UTF-8
log4j.appender.KINESIS.encoding=UTF-8

#optional, defaults to 3
log4j.appender.KINESIS.maxRetries=3
log4j.appender.KINESIS.bufferSize=1000
log4j.appender.KINESIS.threadCount=20

#optional, defaults to 30 seconds
log4j.appender.KINESIS.shutdownTimeout=30

https://github.com/awslabs/kinesis-log4j-appender

Best Practices: Putting Data in Kinesis

Pre-Batch before Puts for better efficiency

• Retry if rise in input rate is temporary

• Reshard to increase number of shards

• Monitor CloudWatch metrics: PutRecord.Bytes and GetRecords.Bytes metrics keep track of shard usage

Metric               Units
PutRecord.Bytes      Bytes
PutRecord.Latency    Milliseconds
PutRecord.Success    Count
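For the "retry if the rise is temporary" advice, one common approach is exponential backoff around the put call. The sketch below is a generic helper, not SDK code: `ThrottledException` is a local stand-in for whatever throttling error the client raises (in the AWS SDK for Java that would be `ProvisionedThroughputExceededException`), and `withBackoff`, `maxRetries` and `baseDelayMillis` are illustrative names.

```java
import java.util.function.Supplier;

final class PutRetrySketch {
    /** Local stand-in for a throttling error; NOT an SDK class. */
    static final class ThrottledException extends RuntimeException { }

    /**
     * Runs the put, retrying on throttling with delays of base, 2x base, 4x base...
     * Gives up after maxRetries retries and rethrows, which is the signal that the
     * overload is sustained and resharding is the better fix.
     */
    static <T> T withBackoff(Supplier<T> put, int maxRetries, long baseDelayMillis) {
        for (int attempt = 0; ; attempt++) {
            try {
                return put.get();
            } catch (ThrottledException e) {
                if (attempt >= maxRetries) {
                    throw e;
                }
                sleep(baseDelayMillis << attempt);
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", ie);
        }
    }
}
```

A caller would wrap the actual put in the supplier, for example `withBackoff(() -> client.putRecord(request), 3, 100)`, where `client` and `request` are whatever objects the application already holds.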

• Keep track of your metrics

• Log hashkey values generated by your partition keys

• Log Shard-Ids

• Determine which shards receive the most (hashkey) traffic.

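// Set the partition key on the request before the put
putRecordRequest.setPartitionKey(String.format("myPartitionKey"));

// After the put, log the shard that received the record
String shardId = putRecordResult.getShardId();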
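One simple way to act on the logged shard IDs is to tally them on the producer side. The sketch below is a minimal in-memory counter; `ShardTrafficSketch` and its methods are hypothetical names, and a real producer might publish these counts as custom metrics instead.

```java
import java.util.HashMap;
import java.util.Map;

final class ShardTrafficSketch {
    private final Map<String, Long> putsPerShard = new HashMap<>();

    /** Call with the shard ID returned by each successful put. */
    void record(String shardId) {
        putsPerShard.merge(shardId, 1L, Long::sum);
    }

    /** Number of puts seen so far for the given shard. */
    long putsFor(String shardId) {
        return putsPerShard.getOrDefault(shardId, 0L);
    }

    /** Shard with the most recorded puts, or null if nothing was recorded. */
    String hottestShard() {
        String hottest = null;
        long max = 0;
        for (Map.Entry<String, Long> entry : putsPerShard.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                hottest = entry.getKey();
            }
        }
        return hottest;
    }
}
```

If one shard's count keeps pulling away from the others, the partition key strategy is the first thing to revisit.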

Options:

• stream-name - The name of the Stream to be scaled

• scaling-action - The action to be taken to scale. Must be one of "scaleUp", "scaleDown" or "resize"

• count - Number of shards by which to absolutely scale up or down, or resize to, or:

• pct - Percentage of the existing number of shards by which to scale up or down

https://github.com/awslabs/amazon-kinesis-scaling-utils

Sending & Reading Data from Kinesis Streams

Sending: HTTP Post, AWS SDK, Fluentd, AWS Mobile

Consuming: Get* APIs, Kinesis Client Library, Connector Library, Apache, Amazon Elastic MapReduce

Building Kinesis Applications: Kinesis Client Library

Open Source library for fault-tolerant, continuous processing apps

• Java client library, also available for Python Developers

• Source available on Github

• Build app with Kinesis Client Library

• Deploy on your set of EC2 instances

• Every KCL application includes these components:

• Record processor factory: Creates the record processor

• Record processor: The processing unit that processes data from a shard of a Kinesis stream

• Worker: The processing unit that maps to each application instance

• The KCL uses the IRecordProcessor interface to communicate with your application

• A Kinesis application must implement the KCL's IRecordProcessor interface

• Contains the business logic for processing the data retrieved from the Kinesis stream

• One record processor maps to one shard and processes data records from that shard

• One worker maps to one or more record processors

• Balances shard-worker associations when worker / instance counts change

• Balances shard-worker associations when shards split or merge
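To make the shard-to-worker mapping concrete, the sketch below distributes shards across workers round-robin and is simply re-run whenever the worker list or shard list changes. This only illustrates the resulting mapping: the KCL coordinates workers through leases rather than a central assignment function, and `ShardBalanceSketch.assign` is a hypothetical helper, not part of the library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ShardBalanceSketch {
    /**
     * Spreads shards over workers so that no worker holds more than one shard
     * above any other. Each shard gets exactly one owner (one record processor).
     */
    static Map<String, List<String>> assign(List<String> shardIds, List<String> workerIds) {
        if (workerIds.isEmpty()) {
            throw new IllegalArgumentException("at least one worker is required");
        }
        Map<String, List<String>> assignment = new LinkedHashMap<>();
        for (String workerId : workerIds) {
            assignment.put(workerId, new ArrayList<>());
        }
        for (int i = 0; i < shardIds.size(); i++) {
            String owner = workerIds.get(i % workerIds.size());
            assignment.get(owner).add(shardIds.get(i));
        }
        return assignment;
    }
}
```

With four shards, two workers each end up with two record processors; adding a third worker, or splitting a shard, changes the input and the same call yields the rebalanced mapping.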

Moving data into Amazon S3, Redshift

Amazon Kinesis Connector Library

Customizable, Open Source Apps to Connect Kinesis with S3, Redshift, DynamoDB

ITransformer

• Defines the transformation of records from the Amazon Kinesis stream in order to suit the user-defined data model

IFilter

• Excludes irrelevant records from the processing.

IBuffer

• Buffers the set of records to be processed by specifying size limit (# of records) & total byte count

IEmitter

• Makes client calls to other AWS services and persists the records stored in the buffer.
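The four roles above form a pipeline: transform, filter, buffer, emit. The sketch below wires them together using plain `java.util.function` types rather than the library's own interfaces, so the signatures here are simplifications chosen for illustration and do not match `ITransformer`, `IFilter`, `IBuffer` or `IEmitter`. `ConnectorPipelineSketch` and `recordCountLimit` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

final class ConnectorPipelineSketch<T> {
    private final Function<String, T> transformer; // raw record -> data model
    private final Predicate<T> filter;             // true = keep the record
    private final int recordCountLimit;            // buffer size that triggers an emit
    private final Consumer<List<T>> emitter;       // writes a batch to the destination
    private final List<T> buffer = new ArrayList<>();

    ConnectorPipelineSketch(Function<String, T> transformer, Predicate<T> filter,
                            int recordCountLimit, Consumer<List<T>> emitter) {
        this.transformer = transformer;
        this.filter = filter;
        this.recordCountLimit = recordCountLimit;
        this.emitter = emitter;
    }

    /** Pushes one raw record through transform, filter and buffer. */
    void process(String rawRecord) {
        T item = transformer.apply(rawRecord);
        if (!filter.test(item)) {
            return; // irrelevant record, dropped before buffering
        }
        buffer.add(item);
        if (buffer.size() >= recordCountLimit) {
            flush();
        }
    }

    /** Emits whatever is buffered as one batch, then empties the buffer. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        emitter.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

Swapping the emitter is what turns the same pipeline into an S3, Redshift or DynamoDB writer in this simplified picture.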


Amazon Kinesis Connectors

• S3 Connector

– Batch writes files for archive into S3

– Uses sequence-based file naming scheme

• Redshift Connector

– Once written to S3, loads to Redshift

– Provides manifest support

– Supports user defined transformers

• DynamoDB Connector

– BatchPut appends to a table

– Supports user defined transformers

Best Practices: Processing Data From Kinesis

Build applications as part of an Auto Scaling group

• Helps with application availability

• Scales in response to incoming spikes in data volume, assuming shards have been provisioned

• Select scaling metrics based on nature of Kinesis application

– Instance metrics: CPU, Memory, and others

– Kinesis Metrics: PutRecord.Bytes, GetRecords.Bytes

Metric                    Units
PutRecord.Bytes           Bytes
PutRecord.Latency         Milliseconds
PutRecord.Success         Count
GetRecords.Bytes          Bytes
GetRecords.IteratorAge    Milliseconds
GetRecords.Latency        Milliseconds
GetRecords.Success        Count

Best Practices: Processing Data From Kinesis

Build a flush-to-S3 consumer app

• App can specify three conditions that can trigger a buffer flush:

– Number of records

– Total byte count

– Time since last flush

• The buffer is flushed and the data is emitted to the destination when any of these thresholds is crossed.

# Flush when buffer exceeds 8 Kinesis records, 1 KB size limit or when time since last emit exceeds 10 minutes
bufferSizeByteLimit = 1024
bufferRecordCountLimit = 8
bufferMillisecondsLimit = 600000
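The "any of three thresholds" rule can be sketched as a small policy object. The version below is an assumption-laden illustration, not the connector library's buffer: it treats a threshold as crossed when the value reaches the limit (`>=`), never flushes an empty buffer, and takes the clock as a parameter so it can be tested. `FlushPolicySketch` and its method names are hypothetical.

```java
final class FlushPolicySketch {
    private final long byteLimit;
    private final int recordLimit;
    private final long millisLimit;
    private long bufferedBytes;
    private int bufferedRecords;
    private long lastFlushMillis;

    FlushPolicySketch(long byteLimit, int recordLimit, long millisLimit, long nowMillis) {
        this.byteLimit = byteLimit;
        this.recordLimit = recordLimit;
        this.millisLimit = millisLimit;
        this.lastFlushMillis = nowMillis;
    }

    /** Accounts for one buffered record of the given size. */
    void add(int recordBytes) {
        bufferedBytes += recordBytes;
        bufferedRecords++;
    }

    /** True when any one of the three thresholds has been reached. */
    boolean shouldFlush(long nowMillis) {
        if (bufferedRecords == 0) {
            return false; // ASSUMPTION: nothing to emit, so no flush
        }
        return bufferedRecords >= recordLimit
                || bufferedBytes >= byteLimit
                || nowMillis - lastFlushMillis >= millisLimit;
    }

    /** Resets the counters after the buffer has been emitted. */
    void markFlushed(long nowMillis) {
        bufferedBytes = 0;
        bufferedRecords = 0;
        lastFlushMillis = nowMillis;
    }
}
```

With the sample values from the slide, `new FlushPolicySketch(1024, 8, 600000, System.currentTimeMillis())` would flush on the eighth record, on the first kilobyte, or after ten minutes, whichever comes first.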

Best Practices: Processing Data From Kinesis

• In KCL app, ensure data being processed is persisted to a durable store like DynamoDB or S3 prior to checkpointing.

• Duplicates: Make the authoritative data repository (usually at the end of the data flow) resilient to duplicates. That way the rest of the system has a simple policy: keep retrying until you succeed.

• Idempotent Processing: Use number of records since previous checkpoint, to get repeatable results when the record processors fail over.

• Creates a manifest file based on a custom set of input files

• Use a manifest stream with only one shard

• Adjust checkpoint frequency, connector buffer and filter to align with your Redshift load models
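The persist-then-checkpoint and duplicate-resilience advice above can be illustrated with a toy processor. In the sketch below an in-memory map keyed by sequence number stands in for the durable store, and a boolean flag simulates a crash between persisting and checkpointing; the class, its fields and the small integer sequence numbers are all hypothetical simplifications.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CheckpointSketch {
    /** Stand-in for a durable store; keyed by sequence number so replays are harmless. */
    final Map<Long, String> store = new LinkedHashMap<>();

    /** Sequence number of the last checkpointed record; -1 means no checkpoint yet. */
    long checkpoint = -1;

    /**
     * Persists every record in the batch first, then advances the checkpoint.
     *
     * @param batch                 sequence number -> payload, in stream order
     * @param crashBeforeCheckpoint simulates a failure after persisting
     */
    void processBatch(Map<Long, String> batch, boolean crashBeforeCheckpoint) {
        long lastSeq = checkpoint;
        for (Map.Entry<Long, String> record : batch.entrySet()) {
            // Idempotent write: a replayed record finds its key already present.
            store.putIfAbsent(record.getKey(), record.getValue());
            lastSeq = record.getKey();
        }
        if (crashBeforeCheckpoint) {
            return; // data is durable, checkpoint is stale: the batch will be replayed
        }
        checkpoint = lastSeq;
    }
}
```

Because the checkpoint only moves after the writes succeed, a failover can replay records but never skip them, and the keyed store absorbs the replayed duplicates.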


Amazon Kinesis Customer Scenarios

Collect all data of interest continuously

Faster time to market due to ease of deployment

Enable operators and partners to get to valuable data quickly

http://bit.ly/awsevals
