(SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
Transcript of (SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
November 13, 2014 | Las Vegas, NV
Adi Krishnan, Sr. Product Manager, Amazon Kinesis
Scenarios Across Industry Segments

Scenarios: (1) Accelerated Ingest-Transform-Load, (2) Continual Metrics/KPI Extraction, (3) Responsive Data Analysis
Data Types: IT infrastructure and application logs, social media, financial market data, web clickstreams, sensors, geo/location data

Digital Ad Tech/Marketing
• Advertising data aggregation
• Advertising metrics like coverage, yield, conversion
• Analytics on user engagement with ads, optimized bid/buy engines

Software/Technology
• IT server and app log ingestion
• IT operational metrics dashboards
• Device/sensor operational intelligence

Financial Services
• Market/financial transaction order data collection
• Financial market data metrics
• Fraud monitoring, value-at-risk assessment, auditing of market order data

Consumer Online/E-Commerce
• Online customer engagement data aggregation
• Consumer engagement metrics like page views, CTR
• Customer clickstream analytics, recommendation engines
Amazon Kinesis: Managed service for streaming data ingestion and processing

• Millions of sources producing 100s of terabytes per hour
• Front end: authentication and authorization
• Durable, highly consistent storage replicates data across three data centers (Availability Zones)
• Ordered stream of events supports multiple readers
• Aggregate and archive to S3
• Real-time dashboards and alarms
• Machine learning algorithms or sliding-window analytics
• Aggregate analysis in Hadoop or a data warehouse
• Inexpensive: $0.028 per million puts
Real-time Ingest
• Highly scalable
• Durable
• Elastic
• Replay-able reads

Continuous Processing
• Elastic
• Load-balancing incoming streams
• Fault tolerance, checkpoint/replay
• Enables multiple processing apps in parallel

Enables data movement into stores/processing engines
Managed service
Low end-to-end latency

Kinesis Stream: Managed ability to capture and store data
Putting Data into Kinesis
Simple Put interface to store data in Kinesis
Best Practices: Putting Data in Kinesis
Determine Your Partition Key Strategy
• Kinesis as a managed buffer or a streaming map-reduce
• Ensure high cardinality for partition keys with respect to shards, to prevent a "hot shard" problem
  – Generate random partition keys
• Streaming map-reduce: leverage partition keys for business-specific logic as applicable
  – Partition key per billing customer, per device ID, per stock symbol
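A quick way to see the hot-shard effect: Kinesis routes each record by the MD5 hash of its partition key into a 128-bit key space divided among shards. The sketch below mimics that routing; the shard count and key names are illustrative, not taken from the talk.

```python
import hashlib

NUM_SHARDS = 4    # assumed stream size for illustration
KEY_SPACE = 2 ** 128  # Kinesis hashes partition keys into a 128-bit key space

def shard_for_key(partition_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Mimic Kinesis routing: MD5-hash the key, map into equal shard ranges."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h * num_shards // KEY_SPACE

# Low-cardinality keys: every record lands on one shard (a "hot shard")
hot = {shard_for_key("all-events") for _ in range(1000)}

# High-cardinality keys (e.g. random or per-device) spread load across shards
spread = {shard_for_key(f"device-{i}") for i in range(1000)}

print(len(hot))     # 1 -- a single hot shard
print(len(spread))  # 4 -- all shards receive traffic
```

Random or naturally high-cardinality keys keep the hash values spread evenly, which is what prevents a single shard from absorbing all the traffic.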
Best Practices: Putting Data in Kinesis
Provision Adequate Shards
• For ingress needs
• For egress needs of all consuming applications, if more than 2 simultaneous consumers
• Include headroom for catching up with data in the stream in the event of application failures
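Using the documented per-shard limits (1 MB/s or 1,000 records/s in; 2 MB/s out, shared across all consumers), the sizing advice above reduces to a small calculation. The 25% headroom figure below is an illustrative assumption, not a number from the talk.

```python
import math

# Per-shard limits from the Kinesis documentation
INGRESS_MB_PER_SHARD = 1.0       # 1 MB/s in per shard
INGRESS_RECORDS_PER_SHARD = 1000  # or 1,000 records/s in
EGRESS_MB_PER_SHARD = 2.0        # 2 MB/s out, shared by all consumers

def shards_needed(in_mb_s, in_records_s, out_mb_s, headroom=1.25):
    """Size a stream for ingress, aggregate egress, and catch-up headroom."""
    ingress = max(in_mb_s / INGRESS_MB_PER_SHARD,
                  in_records_s / INGRESS_RECORDS_PER_SHARD)
    egress = out_mb_s / EGRESS_MB_PER_SHARD
    return math.ceil(max(ingress, egress) * headroom)

# e.g. 10 MB/s in, 5,000 records/s, two consumers each reading the full stream
print(shards_needed(10, 5000, 20))  # 13
```

Note that egress is aggregated across consumers: two applications each reading the full 10 MB/s stream need 20 MB/s of read capacity, which is why simultaneous consumers can drive the shard count above what ingress alone requires.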
Best Practices: Putting Data in Kinesis
Pre-Batch before Puts for better efficiency
# KINESIS appender
log4j.logger.KinesisLogger=INFO, KINESIS
log4j.additivity.KinesisLogger=false
log4j.appender.KINESIS=com.amazonaws.services.kinesis.log4j.KinesisAppender
# DO NOT use a trailing %n unless you want a newline to be transmitted to Kinesis after every message
log4j.appender.KINESIS.layout=org.apache.log4j.PatternLayout
log4j.appender.KINESIS.layout.ConversionPattern=%m
# mandatory properties for KINESIS appender
log4j.appender.KINESIS.streamName=testStream
#optional, defaults to UTF-8
log4j.appender.KINESIS.encoding=UTF-8
#optional, defaults to 3
log4j.appender.KINESIS.maxRetries=3
#optional, defaults to 2000
log4j.appender.KINESIS.bufferSize=1000
#optional, defaults to 20
log4j.appender.KINESIS.threadCount=20
#optional, defaults to 30 seconds
log4j.appender.KINESIS.shutdownTimeout=30
https://github.com/awslabs/kinesis-log4j-appender
Best Practices: Putting Data in Kinesis
• Retry if the rise in input rate is temporary
• Reshard to increase the number of shards
• Monitor CloudWatch metrics: PutRecord.Bytes and GetRecords.Bytes keep track of shard usage
Metric              Units
PutRecord.Bytes     Bytes
PutRecord.Latency   Milliseconds
PutRecord.Success   Count

• Keep track of your metrics
• Log hash-key values generated by your partition keys
• Log shard IDs
• Determine which shards receive the most (hash-key) traffic

String shardId = putRecordResult.getShardId();
putRecordRequest.setPartitionKey(String.format("myPartitionKey"));
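The retry guidance above is commonly implemented as exponential backoff with jitter. The sketch below is illustrative: `put_with_backoff` and the `ThroughputExceeded` exception are stand-ins, not SDK names (the real AWS SDK raises ProvisionedThroughputExceededException).

```python
import random
import time

class ThroughputExceeded(Exception):
    """Stand-in for the SDK's ProvisionedThroughputExceededException."""

def put_with_backoff(put_fn, record, max_retries=3, base_delay=0.1):
    """Retry a put while a temporary rise in input rate exceeds the stream's
    provisioned throughput, backing off exponentially with jitter."""
    for attempt in range(max_retries + 1):
        try:
            return put_fn(record)
        except ThroughputExceeded:
            if attempt == max_retries:
                raise  # sustained overload: reshard instead of retrying
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```

If the throttling persists across all retries, the rise in input rate is not temporary and the right response is to reshard, as the slide notes.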
Options:
• stream-name: the name of the stream to be scaled
• scaling-action: the action to be taken; must be one of "scaleUp", "scaleDown", or "resize"
• count: the number of shards by which to absolutely scale up or down, or resize to; or:
• pct: the percentage of the existing number of shards by which to scale up or down

https://github.com/awslabs/amazon-kinesis-scaling-utils
Sending & Reading Data from Kinesis Streams

Sending:
• HTTP POST
• AWS SDK
• LOG4J
• Flume
• Fluentd
• AWS Mobile SDK

Consuming:
• Get* APIs
• Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce
Building Kinesis Applications: Kinesis Client Library
Open-source library for fault-tolerant, continuous processing apps
• Java client library, also available for Python developers
• Source available on Github
• Build app with Kinesis Client Library
• Deploy on your set of EC2 instances
• Every KCL application includes these components:
• Record processor factory: Creates the record processor
• Record processor: The processing unit that processes data from a shard
of a Kinesis stream
• Worker: The processing unit that maps to each application instance
• The KCL uses the IRecordProcessor interface to communicate with your application
• A Kinesis application must implement the KCL's IRecordProcessor interface
• Contains the business logic for processing the data retrieved from the Kinesis stream
• One record processor maps to one shard and processes data records from
that shard
• One worker maps to one or more record processors
• Balances shard-worker associations when worker / instance counts change
• Balances shard-worker associations when shards split or merge
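The component model above can be illustrated with a minimal, self-contained sketch. This mimics the structure the KCL gives you (a worker owning one record processor per shard, each checkpointing after a batch); it is not the real KCL or amazon_kclpy API.

```python
# Conceptual sketch of the KCL structure: one worker per application
# instance, one record processor per shard, checkpoint after each batch.

class RecordProcessor:
    """Processing unit for a single shard (the IRecordProcessor role)."""
    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.checkpoint = None  # last processed sequence number

    def process_records(self, records):
        for seq, data in records:
            pass                       # business logic goes here
        self.checkpoint = records[-1][0]  # checkpoint after the batch

class Worker:
    """Maps one or more shards to record processors on this instance."""
    def __init__(self, shard_ids):
        self.processors = {s: RecordProcessor(s) for s in shard_ids}

    def run_once(self, batches):
        for shard_id, records in batches.items():
            self.processors[shard_id].process_records(records)

worker = Worker(["shardId-000", "shardId-001"])
worker.run_once({"shardId-000": [(1, b"a"), (2, b"b")],
                 "shardId-001": [(7, b"c")]})
print(worker.processors["shardId-000"].checkpoint)  # 2
```

In the real library the worker also leases shards, rebalances them as instances come and go, and persists checkpoints to DynamoDB; the sketch only shows the shard-to-processor mapping.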
Moving data into Amazon S3, Redshift
Amazon Kinesis Connector Library
Customizable, open-source apps to connect Kinesis with S3, Redshift, DynamoDB
ITransformer
• Defines the transformation of records from the Amazon Kinesis stream to suit the user-defined data model
IFilter
• Excludes irrelevant records from processing
IBuffer
• Buffers the set of records to be processed by specifying a size limit (number of records) and total byte count
IEmitter
• Makes client calls to other AWS services and persists the records stored in the buffer
Kinesis → S3, DynamoDB, Redshift
Amazon Kinesis Connectors
• S3 connector: batch-writes files for archive into S3
  – Uses a sequence-based file naming scheme
• Redshift connector: once written to S3, loads to Redshift
  – Provides manifest support
  – Supports user-defined transformers
• DynamoDB connector: BatchPut appends to a table
  – Supports user-defined transformers
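The four interfaces compose into a transform → filter → buffer → emit pipeline. This toy sketch mirrors that flow in Python for illustration; it is not the connector library's Java API, and the parsing, filter rule, and list-backed sink are invented for the example.

```python
# Toy pipeline mirroring the connector interfaces:
# transform -> filter -> buffer -> emit.

def transform(record: bytes) -> dict:          # ITransformer role
    return {"value": int(record)}

def keep(item: dict) -> bool:                  # IFilter role
    return item["value"] >= 0                  # exclude irrelevant records

class Buffer:                                  # IBuffer role
    def __init__(self, record_limit: int):
        self.record_limit = record_limit
        self.items = []
    def add(self, item):
        self.items.append(item)
    def full(self) -> bool:
        return len(self.items) >= self.record_limit

def emit(items, sink):                         # IEmitter role
    sink.extend(items)                         # stand-in for an S3/Redshift write

sink, buf = [], Buffer(record_limit=2)
for raw in [b"1", b"-5", b"2", b"3"]:
    item = transform(raw)
    if not keep(item):
        continue
    buf.add(item)
    if buf.full():
        emit(buf.items, sink)
        buf.items = []
print(sink)  # [{'value': 1}, {'value': 2}]
```

The separation matters because each stage is independently replaceable: the S3, Redshift, and DynamoDB connectors differ mainly in their emitter, while transformers and filters carry the user-defined logic.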
Best Practices: Processing Data From Kinesis
Build Applications as Part of an Auto Scaling Group
• Helps with application availability
• Scales in response to incoming spikes in data volume, assuming shards have been provisioned
• Select scaling metrics based on the nature of the Kinesis application
  – Instance metrics: CPU, memory, and others
  – Kinesis metrics: PutRecord.Bytes, GetRecords.Bytes
Metric                  Units
PutRecord.Bytes         Bytes
PutRecord.Latency       Milliseconds
PutRecord.Success       Count
GetRecords.Bytes        Bytes
GetRecords.IteratorAge  Milliseconds
GetRecords.Latency      Milliseconds
GetRecords.Success      Count
Best Practices: Processing Data From Kinesis
Build a Flush-to-S3 Consumer App
• The app can specify three conditions that can trigger a buffer flush:
  – Number of records
  – Total byte count
  – Time since last flush
• The buffer is flushed and the data is emitted to the destination when any of these thresholds is crossed.

# Flush when the buffer exceeds 8 Kinesis records or the 1 KB size limit,
# or when time since last emit exceeds 10 minutes
bufferSizeByteLimit = 1024
bufferRecordCountLimit = 8
bufferMillisecondsLimit = 600000
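The three triggers combine with OR: crossing any one of them flushes the buffer. A minimal check using the slide's example thresholds (the function name is illustrative):

```python
# The slide's example thresholds
BUFFER_SIZE_BYTE_LIMIT = 1024          # 1 KB
BUFFER_RECORD_COUNT_LIMIT = 8          # 8 Kinesis records
BUFFER_MILLISECONDS_LIMIT = 600_000    # 10 minutes

def should_flush(record_count, byte_count, last_flush_ms, now_ms):
    """Flush when ANY of the three thresholds is crossed."""
    return (record_count >= BUFFER_RECORD_COUNT_LIMIT
            or byte_count >= BUFFER_SIZE_BYTE_LIMIT
            or now_ms - last_flush_ms >= BUFFER_MILLISECONDS_LIMIT)

print(should_flush(3, 500, 0, 1_000))    # False -- nothing crossed
print(should_flush(8, 500, 0, 1_000))    # True  -- record count reached
print(should_flush(3, 500, 0, 700_000))  # True  -- 10 minutes elapsed
```

The time-based trigger is what bounds end-to-end latency for a trickle of records; the size and count triggers bound the cost of each S3 object write under heavy load.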
Best Practices: Processing Data From Kinesis
• In a KCL app, ensure that data being processed is persisted to a durable store, such as DynamoDB or S3, prior to checkpointing.
• Duplicates: make the authoritative data repository (usually at the end of the data flow) resilient to duplicates. That way the rest of the system has a simple policy: keep retrying until you succeed.
• Idempotent processing: use the number of records since the previous checkpoint to get repeatable results when record processors fail over.
• Create a manifest file based on a custom set of input files
• Use a manifest stream with only one shard
• Adjust checkpoint frequency, connector buffer, and filter to align with your Redshift load models
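One way to make the authoritative store resilient to duplicates, as advised above, is to key every write by the record's Kinesis sequence number so that a replay after a processor failover becomes a no-op. `DedupSink` below is a hypothetical stand-in for, say, a DynamoDB table written with a conditional put.

```python
# Sketch of a duplicate-tolerant sink: keying each write by the record's
# sequence number makes replayed batches idempotent.

class DedupSink:
    def __init__(self):
        self.rows = {}  # sequence number -> payload (stand-in for a table)

    def write(self, sequence_number, payload):
        # Idempotent write: a second put with the same key changes nothing,
        # like a DynamoDB conditional put on attribute_not_exists.
        self.rows.setdefault(sequence_number, payload)

sink = DedupSink()
for seq, data in [(101, "a"), (102, "b"), (101, "a")]:  # (101, "a") replayed
    sink.write(seq, data)
print(len(sink.rows))  # 2 -- the duplicate is absorbed
```

With the sink absorbing duplicates, every upstream component can follow the slide's simple policy of retrying until it succeeds.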
Amazon Kinesis Customer Scenarios
Collect all data of interest continuously
Faster time to market due to ease of deployment
Enable operators and partners to get to valuable data quickly
http://bit.ly/awsevals