(SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014

Post on 30-Jun-2015


Amazon Kinesis is the AWS service for real-time streaming big data ingestion and processing. This talk gives a detailed exploration of Kinesis stream processing. We'll discuss in detail techniques for building and scaling Kinesis processing applications, including data filtration and transformation. Finally, we'll address tips and techniques for emitting data into S3, DynamoDB, and Redshift.


November 13, 2014 | Las Vegas, NV

Adi Krishnan, Sr. Product Manager, Amazon Kinesis

Scenarios Across Industry Segments

Scenarios: (1) Accelerated Ingest-Transform-Load, (2) Continual Metrics/KPI Extraction, (3) Responsive Data Analysis

Data Types: IT infrastructure, application logs, social media, financial market data, web clickstreams, sensors, geo/location data

Digital Ad Tech./Marketing
– 1: Advertising data aggregation
– 2: Advertising metrics like coverage, yield, conversion
– 3: Analytics on user engagement with ads, optimized bid/buy engines

Software/Technology
– 1: IT server, app logs ingestion
– 2: IT operational metrics dashboards
– 3: Devices/sensor operational intelligence

Financial Services
– 1: Market/financial transaction order data collection
– 2: Financial market data metrics
– 3: Fraud monitoring and Value-at-Risk assessment, auditing of market order data

Consumer Online/E-Commerce
– 1: Online customer engagement data aggregation
– 2: Consumer engagement metrics like page views, CTR
– 3: Customer clickstream analytics, recommendation engines

Amazon Kinesis
Managed Service for streaming data ingestion and processing

[Architecture diagram, Amazon Web Services]
• Millions of sources producing 100s of terabytes per hour
• Front end: authentication, authorization
• Durable, highly consistent storage replicates data across three data centers (availability zones)
• Ordered stream of events supports multiple readers:
– Aggregate and archive to S3
– Real-time dashboards and alarms
– Machine learning algorithms or sliding window analytics
– Aggregate analysis in Hadoop or a data warehouse

Inexpensive: $0.028 per million puts

Real-time Ingest

• Highly Scalable

• Durable

• Elastic

• Replay-able Reads

Continuous Processing FX

• Elastic

• Load-balancing incoming streams

• Fault-tolerance, Checkpoint / Replay

• Enable multiple processing apps in parallel

Enable data movement into Stores/ Processing Engines

Managed Service

Low end-to-end latency

Kinesis Stream

Managed Ability To Capture And Store Data

Putting Data into Kinesis

Simple Put interface to store data in Kinesis

Best Practices: Putting Data in Kinesis
Determine Your Partition Key Strategy

• Kinesis as a managed buffer or a streaming map-reduce
• Ensure a high cardinality for Partition Keys with respect to shards, to prevent a "hot shard" problem
– Generate Random Partition Keys
• Streaming Map-Reduce: Leverage Partition Keys for business specific logic as applicable
– Partition Key per billing customer, per DeviceId, per stock symbol
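
To make the "hot shard" point concrete, here is a minimal sketch of how partition keys land on shards. Kinesis maps each partition key to a 128-bit hash key using MD5; the sketch assumes the shards split that hash space into equal-width ranges, which is only true until shards are split or merged unevenly. The function names (`hash_key`, `shard_for`, `distribution`) are illustrative and not part of any AWS SDK.

```python
import hashlib
import uuid
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit hash key


def hash_key(partition_key: str) -> int:
    """Return the 128-bit integer hash key for a partition key (MD5)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming equal-width hash ranges."""
    return hash_key(partition_key) * shard_count // HASH_SPACE


def distribution(keys, shard_count: int) -> Counter:
    """Count how many records each shard index would receive."""
    return Counter(shard_for(key, shard_count) for key in keys)


# A single constant key sends every record to one shard (a "hot shard").
hot = distribution(["myPartitionKey"] * 10_000, shard_count=4)

# Random keys spread records across all shards.
spread = distribution((str(uuid.uuid4()) for _ in range(10_000)), shard_count=4)

print("constant key:", dict(hot))
print("random keys: ", dict(spread))
```

With a business key such as a customer ID or stock symbol, the same calculation shows how many distinct keys are needed before the load evens out across shards.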

Best Practices: Putting Data in Kinesis
Provisioning Adequate Shards

• For ingress needs
• For the egress needs of all consuming applications, particularly with more than 2 simultaneous consumers
• Include headroom for catching up with data in the stream in the event of application failures
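
The sizing rule above can be sketched as a small calculation. The per-shard limits used here (1 MB/s and 1,000 records/s in, 2 MB/s out) are the figures AWS published around the time of this talk and are an assumption of this sketch, not something stated on the slide; check current limits before relying on them. The function name and the `headroom` parameter are illustrative.

```python
import math

# Assumed per-shard limits (as published by AWS around the time of this talk).
WRITE_MB_PER_SEC = 1.0        # ingress capacity per shard
WRITE_RECORDS_PER_SEC = 1000  # put records per second per shard
READ_MB_PER_SEC = 2.0         # egress capacity per shard, shared by all consumers


def shards_needed(ingress_mb_s, records_s, consumers, headroom=0.0):
    """Estimate the shard count from ingress, egress, and catch-up headroom.

    Each consumer is assumed to read the full stream, so total egress is
    ingress multiplied by the number of consuming applications.
    """
    egress_mb_s = ingress_mb_s * consumers
    by_ingress = ingress_mb_s / WRITE_MB_PER_SEC
    by_records = records_s / WRITE_RECORDS_PER_SEC
    by_egress = egress_mb_s / READ_MB_PER_SEC
    required = max(by_ingress, by_records, by_egress) * (1.0 + headroom)
    return max(1, math.ceil(required))


# Two consumers: ingress and egress need the same number of shards.
print(shards_needed(ingress_mb_s=4, records_s=2000, consumers=2))
# Three consumers: egress becomes the constraint.
print(shards_needed(ingress_mb_s=4, records_s=2000, consumers=3))
# Three consumers plus 25% headroom for catching up after a failure.
print(shards_needed(ingress_mb_s=4, records_s=2000, consumers=3, headroom=0.25))
```

Under these assumed limits, egress only dominates once more than two applications read the whole stream, which is why the slide singles out that case.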

Best Practices: Putting Data in Kinesis

Pre-Batch before Puts for better efficiency

# KINESIS appender
log4j.logger.KinesisLogger=INFO, KINESIS
log4j.additivity.KinesisLogger=false
log4j.appender.KINESIS=com.amazonaws.services.kinesis.log4j.KinesisAppender
# DO NOT use a trailing %n unless you want a newline to be transmitted to KINESIS after every message
log4j.appender.KINESIS.layout=org.apache.log4j.PatternLayout
log4j.appender.KINESIS.layout.ConversionPattern=%m
# mandatory properties for KINESIS appender
log4j.appender.KINESIS.streamName=testStream
# optional, defaults to UTF-8
log4j.appender.KINESIS.encoding=UTF-8
# optional, defaults to 3
log4j.appender.KINESIS.maxRetries=3
# optional, defaults to 2000
log4j.appender.KINESIS.bufferSize=1000
# optional, defaults to 20
log4j.appender.KINESIS.threadCount=20
# optional, defaults to 30 seconds
log4j.appender.KINESIS.shutdownTimeout=30

https://github.com/awslabs/kinesis-log4j-appender
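
For producers that do not go through the log4j appender, the "pre-batch before puts" idea can be sketched as a small buffer that concatenates short messages into one payload and hands it to a put function. This is a minimal sketch: the `PreBatcher` class, the `send` callable, the newline separator, and the 16-byte limit in the usage lines are all hypothetical, and the consumer must be written to split the payload again.

```python
class PreBatcher:
    """Combine small messages into one payload before each put."""

    def __init__(self, send, max_bytes=50 * 1024, separator=b"\n"):
        self._send = send            # hypothetical callable that performs the put
        self._max_bytes = max_bytes  # illustrative payload size limit
        self._separator = separator
        self._parts = []
        self._size = 0

    def add(self, message: bytes) -> None:
        """Queue a message, flushing first if it would overflow the payload."""
        extra = len(message) + (len(self._separator) if self._parts else 0)
        if self._parts and self._size + extra > self._max_bytes:
            self.flush()
            extra = len(message)
        # A single message larger than max_bytes is still sent, on its own.
        self._parts.append(message)
        self._size += extra

    def flush(self) -> None:
        """Send whatever is queued as one payload."""
        if self._parts:
            self._send(self._separator.join(self._parts))
            self._parts = []
            self._size = 0


sent = []  # stands in for the stream; each entry is one put
batcher = PreBatcher(send=sent.append, max_bytes=16)
for line in [b"alpha", b"beta", b"gamma", b"delta"]:
    batcher.add(line)
batcher.flush()
print(sent)
```

Four messages become two puts here. The trade-off is added latency while messages wait in the buffer, so a real producer would usually also flush on a timer.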

Best Practices: Putting Data in Kinesis

Pre-Batch before Puts for better efficiency

• Retry if rise in input rate is temporary
• Reshard to increase number of shards
• Monitor CloudWatch metrics: PutRecord.Bytes and GetRecords.Bytes metrics keep track of shard usage

Metric               Units
PutRecord.Bytes      Bytes
PutRecord.Latency    Milliseconds
PutRecord.Success    Count
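
The "retry if the rise in input rate is temporary" advice can be sketched as a put wrapped in exponential backoff with jitter. This is a sketch under assumptions: `ThroughputExceeded` is a hypothetical stand-in for whatever throttling error your client raises, `put` is any callable that performs the write, and the retry count and delays are illustrative values.

```python
import random
import time


class ThroughputExceeded(Exception):
    """Hypothetical stand-in for a provisioned-throughput-exceeded error."""


def put_with_retry(put, record, max_retries=3, base_delay=0.1, sleep=time.sleep):
    """Call put(record), backing off exponentially when the shard is throttled.

    Re-raises the error once max_retries retries have been used up, which is
    the signal that the spike is not temporary and resharding may be needed.
    """
    for attempt in range(max_retries + 1):
        try:
            return put(record)
        except ThroughputExceeded:
            if attempt == max_retries:
                raise
            # Delay doubles each attempt; jitter keeps producers from retrying in step.
            jitter = 0.5 + random.random() / 2
            sleep(base_delay * (2 ** attempt) * jitter)


# Usage with a fake put that is throttled twice and then succeeds.
calls = {"n": 0}


def flaky_put(record):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ThroughputExceeded()
    return "ok:" + record


delays = []
result = put_with_retry(flaky_put, "r1", sleep=delays.append)
print(result, "after", calls["n"], "attempts")
```

Passing `sleep` as a parameter keeps the example testable without real waiting.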

• Keep track of your metrics
• Log hash key values generated by your partition keys
• Log shard IDs
• Determine which shards receive the most (hash key) traffic

// Set the partition key on the request before the put
putRecordRequest.setPartitionKey(String.format("myPartitionKey"));
// After the put, read the ID of the shard that received the record
String shardId = putRecordResult.getShardId();
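
Once shard IDs are being logged, finding the busiest shard is a counting exercise. The following is a minimal sketch that assumes the logged IDs have already been collected into a list; the sample IDs and the `hottest_shards` helper are illustrative.

```python
from collections import Counter


def hottest_shards(logged_shard_ids, top=1):
    """Return the shards that received the most puts, busiest first."""
    return Counter(logged_shard_ids).most_common(top)


# Shard IDs as they might appear in a producer log (sample data).
shard_log = [
    "shardId-000000000000",
    "shardId-000000000001",
    "shardId-000000000000",
    "shardId-000000000002",
    "shardId-000000000000",
]

for shard_id, puts in hottest_shards(shard_log, top=2):
    share = puts / len(shard_log)
    print(f"{shard_id}: {puts} puts ({share:.0%} of traffic)")
```

A shard whose share is far above 1/N for N shards is a candidate for a split or for a different partition key scheme.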

Options:
• stream-name - The name of the stream to be scaled
• scaling-action - The action to be taken to scale. Must be one of "scaleUp", "scaleDown" or "resize"
• count - Number of shards by which to absolutely scale up or down, or resize to, or:
• pct - Percentage of the existing number of shards by which to scale up or down

https://github.com/awslabs/amazon-kinesis-scaling-utils
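
One way to read the options above is as a small calculation of the target shard count. The sketch below is an interpretation of the option descriptions, not the scaling utility's actual code: the rounding up of percentages and the floor of one shard are assumptions, and `target_shard_count` is a hypothetical name.

```python
import math


def target_shard_count(current, action, count=None, pct=None):
    """Compute the shard count that a scaling request would aim for.

    Exactly one of count (absolute) or pct (relative to current) is given.
    """
    if (count is None) == (pct is None):
        raise ValueError("specify exactly one of count or pct")
    if action == "resize":
        if count is None:
            raise ValueError("resize needs an absolute count")
        target = count
    elif action in ("scaleUp", "scaleDown"):
        # Assumption: a percentage is rounded up to a whole number of shards.
        delta = count if count is not None else math.ceil(current * pct / 100)
        target = current + delta if action == "scaleUp" else current - delta
    else:
        raise ValueError("action must be scaleUp, scaleDown or resize")
    # Assumption: a stream never drops below one shard.
    return max(1, target)


print(target_shard_count(10, "scaleUp", count=2))
print(target_shard_count(10, "scaleDown", pct=25))
print(target_shard_count(10, "resize", count=4))
```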

Sending & Reading Data from Kinesis Streams

Sending: HTTP Post, AWS SDK, AWS Mobile SDK, LOG4J, Flume, Fluentd

Consuming: Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce

Building Kinesis Applications: Kinesis Client Library
Open Source library for fault-tolerant, continuous processing apps

• Java client library, also available for Python Developers

• Source available on Github

• Build app with Kinesis Client Library

• Deploy on your set of EC2 instances

• Every KCL application includes these components:

• Record processor factory: Creates the record processor

• Record processor: The processing unit that processes data from a shard of a Kinesis stream

• Worker: The processing unit that maps to each application instance

• The KCL uses the IRecordProcessor interface to communicate with your application

• A Kinesis application must implement the KCL's IRecordProcessor interface

• Contains the business logic for processing the data retrieved from the Kinesis stream

• One record processor maps to one shard and processes data records from that shard

• One worker maps to one or more record processors

• Balances shard-worker associations when worker / instance counts change

• Balances shard-worker associations when shards split or merge
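
The relationship between factory, record processor, worker, and shard can be sketched in a few classes. This is a toy model and not the KCL API: the class and method names are illustrative stand-ins, and the round-robin `balance` function is a simplification of however the library actually distributes shards among workers.

```python
class RecordProcessor:
    """Stand-in for a record processor: one instance per shard."""

    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.seen = []

    def process_records(self, records):
        # Business logic for the records of this shard goes here.
        self.seen.extend(records)


class RecordProcessorFactory:
    """Creates a record processor for each shard a worker is given."""

    def create(self, shard_id):
        return RecordProcessor(shard_id)


class Worker:
    """One worker per application instance, holding one processor per shard."""

    def __init__(self, worker_id, factory):
        self.worker_id = worker_id
        self._factory = factory
        self.processors = {}

    def assign(self, shard_ids):
        """Keep processors for retained shards, create new ones, drop the rest."""
        updated = {}
        for shard_id in shard_ids:
            existing = self.processors.get(shard_id)
            updated[shard_id] = existing if existing else self._factory.create(shard_id)
        self.processors = updated


def balance(shard_ids, worker_ids):
    """Assign shards to workers round-robin (a simplification)."""
    assignment = {worker_id: [] for worker_id in worker_ids}
    for index, shard_id in enumerate(sorted(shard_ids)):
        assignment[worker_ids[index % len(worker_ids)]].append(shard_id)
    return assignment


shards = ["s0", "s1", "s2", "s3"]
print(balance(shards, ["w1", "w2"]))        # two instances
print(balance(shards, ["w1", "w2", "w3"]))  # a third instance joins
```

Re-running `balance` whenever the worker list or the shard list changes mirrors the two rebalancing bullets above.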

Moving data into Amazon S3, Redshift

Amazon Kinesis Connector Library
Customizable, Open Source Apps to Connect Kinesis with S3, Redshift, DynamoDB

ITransformer

• Defines the transformation of records from the Amazon Kinesis stream in order to suit the user-defined data model

IFilter

• Excludes irrelevant records from the processing.

IBuffer

• Buffers the set of records to be processed by specifying size limit (# of records) & total byte count

IEmitter

• Makes client calls to other AWS services and persists the records stored in the buffer.
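
The four stages above form a pipeline: transform, filter, buffer, emit. The sketch below mirrors that flow with plain functions. It does not use the Connector Library itself; the JSON record layout, the "click" filter, and the count-only buffer are assumptions chosen to keep the example short.

```python
import json


def transform(raw: bytes) -> dict:
    """Transformer: turn a raw stream record into the user-defined model."""
    return json.loads(raw.decode("utf-8"))


def keep(record: dict) -> bool:
    """Filter: exclude records that are irrelevant to this application."""
    return record.get("type") == "click"


def run_pipeline(raw_records, transform, keep, buffer_limit, emit):
    """Transform and filter each record, buffer it, and emit full buffers."""
    buffer = []
    for raw in raw_records:
        record = transform(raw)
        if not keep(record):
            continue
        buffer.append(record)
        if len(buffer) >= buffer_limit:
            emit(buffer)  # emitter: persist the buffered records downstream
            buffer = []
    if buffer:
        emit(buffer)  # flush the remainder at the end of the input


raw_records = [
    b'{"type": "click", "id": 1}',
    b'{"type": "view", "id": 2}',
    b'{"type": "click", "id": 3}',
    b'{"type": "click", "id": 4}',
]
emitted = []  # stands in for S3, DynamoDB, or Redshift
run_pipeline(raw_records, transform, keep, buffer_limit=2, emit=emitted.append)
print(emitted)
```

Because each stage is passed in as a callable, swapping the destination only means changing `emit`.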

[Diagram: Kinesis connected to S3, DynamoDB, and Redshift]

Amazon Kinesis Connectors

• S3 Connector
– Batch writes files for archive into S3
– Uses sequence-based file naming scheme
• Redshift Connector
– Once written to S3, loads to Redshift
– Provides manifest support
– Supports user defined transformers
• DynamoDB Connector
– BatchPut appends to a table
– Supports user defined transformers

Best Practices: Processing Data From Kinesis
Build applications as part of an Auto Scaling group

• Helps with application availability
• Scales in response to incoming spikes in data volume, assuming shards have been provisioned
• Select scaling metrics based on nature of Kinesis application
– Instance metrics: CPU, memory, and others
– Kinesis metrics: PutRecord.Bytes, GetRecords.Bytes

Metric                    Units
PutRecord.Bytes           Bytes
PutRecord.Latency         Milliseconds
PutRecord.Success         Count
GetRecords.Bytes          Bytes
GetRecords.IteratorAge    Milliseconds
GetRecords.Latency        Milliseconds
GetRecords.Success        Count

Best Practices: Processing Data From Kinesis
Build a flush-to-S3 consumer app

• App can specify three conditions that can trigger a buffer flush:

– Number of records
– Total byte count
– Time since last flush

• The buffer is flushed and the data is emitted to the destination when any of these thresholds is crossed.

# Flush when buffer exceeds 8 Kinesis records, 1 KB size limit or when time since last emit exceeds 10 minutes
bufferSizeByteLimit = 1024
bufferRecordCountLimit = 8
bufferMillisecondsLimit = 600000
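
The three flush conditions can be sketched as a buffer that checks record count, byte count, and elapsed time. This is a minimal sketch using the limits from the configuration above as defaults; the `FlushBuffer` class is hypothetical, and treating a limit as crossed when it is reached (rather than strictly exceeded) is an assumption.

```python
import time


class FlushBuffer:
    """Buffer that signals a flush when any of three thresholds is reached."""

    def __init__(self, byte_limit=1024, record_limit=8, millis_limit=600_000,
                 clock=time.monotonic):
        self._byte_limit = byte_limit
        self._record_limit = record_limit
        self._millis_limit = millis_limit
        self._clock = clock  # injectable so the time rule can be tested
        self._records = []
        self._bytes = 0
        self._last_flush = clock()

    def add(self, record: bytes) -> None:
        self._records.append(record)
        self._bytes += len(record)

    def should_flush(self) -> bool:
        """True when the buffer is non-empty and any threshold is reached."""
        if not self._records:
            return False
        elapsed_ms = (self._clock() - self._last_flush) * 1000
        return (len(self._records) >= self._record_limit
                or self._bytes >= self._byte_limit
                or elapsed_ms >= self._millis_limit)

    def drain(self):
        """Return the buffered records and reset all counters."""
        records, self._records = self._records, []
        self._bytes = 0
        self._last_flush = self._clock()
        return records


buf = FlushBuffer()
for i in range(8):
    buf.add(b"record-%d" % i)
if buf.should_flush():
    batch = buf.drain()  # a real consumer would write this batch to S3
    print("flushing", len(batch), "records")
```

Small limits like these suit a demo. A production consumer would usually raise them so that each S3 object is reasonably large.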

Best Practices: Processing Data From Kinesis

• In a KCL app, ensure data being processed is persisted to a durable store like DynamoDB or S3 prior to checkpointing.
• Duplicates: Make the authoritative data repository (usually at the end of the data flow) resilient to duplicates. That way the rest of the system has a simple policy: keep retrying until you succeed.
• Idempotent Processing: Use the number of records since the previous checkpoint to get repeatable results when the record processors fail over.
• Creates a manifest file based on a custom set of input files
• Use a manifest stream with only one shard
• Adjust checkpoint frequency, connector buffer and filter to align with your Redshift load models
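
The first two practices, persist before checkpointing and make the final store resilient to duplicates, can be sketched together. In this toy model the store is an in-memory dictionary keyed by shard ID and sequence number, so replaying a batch overwrites rows instead of duplicating them. `IdempotentStore`, `process_batch`, and the simulated failure flag are all illustrative and are not KCL calls.

```python
class IdempotentStore:
    """Toy durable store: writing the same key twice leaves one row."""

    def __init__(self):
        self.rows = {}

    def put(self, shard_id, sequence_number, data):
        self.rows[(shard_id, sequence_number)] = data


def process_batch(shard_id, records, store, checkpoint, fail_before_checkpoint=False):
    """Persist every record first, then checkpoint the last sequence number."""
    for sequence_number, data in records:
        store.put(shard_id, sequence_number, data)
    if fail_before_checkpoint:
        raise RuntimeError("simulated failure before checkpoint")
    if records:
        checkpoint(records[-1][0])


store = IdempotentStore()
checkpoints = []
batch = [(1, "a"), (2, "b")]

# First attempt: data is persisted, but the worker dies before checkpointing.
try:
    process_batch("s0", batch, store, checkpoints.append, fail_before_checkpoint=True)
except RuntimeError:
    pass

# After failover the same batch is replayed from the previous checkpoint.
process_batch("s0", batch, store, checkpoints.append)
print(len(store.rows), "rows, checkpoints:", checkpoints)
```

Because the replay writes the same keys, the store still holds two rows, and the checkpoint advances only after the data is safely stored.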


Amazon Kinesis Customer Scenarios

Collect all data of interest continuously

Faster time to market due to ease of deployment

Enable operators and partners to get to valuable data quickly

http://bit.ly/awsevals