Post on 18-Jul-2015
Real-Time Event Processing
Michael Garcia – Solutions Architect
mlgarcia@amazon.fr
@aws_actus
Evolution of real time decisions
[Figure: axes and labels are Decision impact (also proportional to risk), Decision rate, Business impact]
• 1990’s – “Should we advertise on the Superbowl? Should we run direct mail this qtr.?” Batch mode
• 2000’s – “How often can we run a permission-based email mktg. campaign?” Rules-based alerts
• 2010’s – Millions of decisions and actions taken, all in less than a blink of an eye
Plethora of Tools
Glacier, S3, DynamoDB, RDS, EMR, Redshift, Data Pipeline, Kinesis, Cassandra, CloudSearch, Kinesis-enabled app
Use the right tool for the right job
App/Web Tier
Client Tier
Database & Storage Tier
Amazon RDS
Amazon DynamoDB
Amazon ElastiCache
Amazon S3
Amazon Glacier
Amazon EMR
Amazon Redshift
Story of a Migration to Amazon Redshift
Nicolas Baron – CTO, FollowAnalytics
@nico_b
www.linkedin.com/in/nicolasbaron
nicolas@followanalytics.com
FollowAnalytics in a Few Words
Mobile Marketing Automation
SaaS Platform
Startup founded in Paris
HQ in San Francisco
Positioning: Fortune 1000 / SBF 120
SAP & Hummer Winblad
[Diagram labels: User Analytics, CRM Integration, Alerts, User Profile, Engagement, In App Message, Push Notification]
Adoption of Amazon Redshift
16 nodes – dw2.large – 2.5 TB (SSD)
• Tested on several billion log rows
• 90% of queries in under 10 seconds
Pro tip: schema design!
Before (MongoDB)             After (Amazon Redshift)
Millions                     Billions
Systematic pre-computation   < 10 seconds
Stages of Big Data Processing
Ingest → Store → Process → Visualize
Batch analysis – one set of tools (Minutes/Hours)
Real time analysis – another set of tools (Seconds)
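As a rough illustration of the batch versus real-time contrast, the sketch below computes the same page counts two ways: once over a complete data set, and once incrementally as events arrive. The event data and the names `batch_counts` and `StreamingCounter` are hypothetical and not taken from the slides.

```python
from collections import Counter

# Hypothetical click events: (user_id, page)
EVENTS = [("u1", "home"), ("u2", "home"), ("u1", "cart"), ("u3", "home")]


def batch_counts(events):
    """Batch style: wait for the full data set, then count in one pass."""
    return Counter(page for _, page in events)


class StreamingCounter:
    """Streaming style: update the answer as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        _, page = event
        self.counts[page] += 1
        return self.counts[page]  # an up-to-date answer is available at once


stream = StreamingCounter()
for event in EVENTS:
    stream.on_event(event)
```

Both approaches end with the same totals; the difference is when the answer becomes available.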
Types of Data Ingest
• Transactional
– Database reads/writes (structured data)
• File
– Logs (unstructured data)
Database
Cloud Storage
Data has to be extracted from multiple sources to be processed periodically
Types of Data Ingest
• Stream
– Click-stream logs
– Mobile analytics
– IoT
– Telemetry
– Any real-time data from any producer
Stream Storage
Data is streamed and can be processed continuously
What is a good ingest tool?
• Sequential streams are easier to process
• Need to scale
• Need to persist
• Architectural flexibility
• Real time!
Kafka or Kinesis
[Diagram: an ingest tool (Kafka or Kinesis) feeding processing]
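The properties listed above (sequential records, independent consumers, replay) can be sketched with a small append-only log. This is an in-memory toy, so it does not model persistence or scaling, and the class names `StreamLog` and `Consumer` are hypothetical rather than any Kafka or Kinesis API.

```python
class StreamLog:
    """Append-only in-memory log; each record gets a sequence number."""

    def __init__(self):
        self._records = []

    def put(self, data):
        self._records.append(data)
        return len(self._records) - 1  # sequence number of the new record

    def read(self, from_seq, limit=10):
        """Return up to `limit` records starting at `from_seq`, in order."""
        return self._records[from_seq:from_seq + limit]


class Consumer:
    """Tracks its own position, so consumers do not affect each other."""

    def __init__(self, log, start_seq=0):
        self.log = log
        self.next_seq = start_seq

    def poll(self, limit=10):
        batch = self.log.read(self.next_seq, limit)
        self.next_seq += len(batch)
        return batch


log = StreamLog()
for i in range(5):
    log.put(f"event-{i}")

realtime = Consumer(log)      # reads everything from the start
replayer = Consumer(log, 3)   # replays from sequence number 3
```

Because the log is never mutated by a read, adding a second consumer or replaying from an earlier position requires no coordination with the first.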
Amazon Kinesis
• Streams contain shards. Each shard ingests up to 1 MB/sec and up to 1,000 TPS
• Each shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by adding or removing shards
• Replay data within the 24-hour window
• Fully managed & low cost
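The slide does not say how a record ends up on a particular shard. In Kinesis, each record carries a partition key, which is hashed with MD5 into a 128-bit value and matched against each shard's hash key range. The sketch below illustrates that idea under the assumption that the shards split the hash space evenly; `shard_for_key` is a hypothetical helper, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index.

    Assumption: the shards divide the 128-bit hash space into
    `shard_count` equal, contiguous ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


# The same key always lands on the same shard, which preserves
# per-key ordering within that shard.
first = shard_for_key("user-42", 4)
second = shard_for_key("user-42", 4)
```

A consequence worth noting: a poorly distributed partition key concentrates traffic on a few shards, so the per-shard limits above can be hit well before the stream's total capacity is used.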
Hypothesis: 500 million tweets a day at 2.4 KB per tweet
13.4 MB/s
$577 / month
Source: dioncosales. Pricing example is for Amazon Kinesis only
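Combining the per-shard ingest limits from the Amazon Kinesis slide with the tweet hypothesis gives a rough shard count. The sketch below is a back-of-the-envelope calculation only; the function name `shards_needed` is hypothetical, and the monthly price is not reproduced because the slides do not state the unit prices behind it.

```python
import math

# Per-shard ingest limits quoted on the Amazon Kinesis slide
SHARD_INGEST_MB_PER_SEC = 1.0
SHARD_INGEST_RECORDS_PER_SEC = 1000


def shards_needed(mb_per_sec: float, records_per_sec: float) -> int:
    """Shards required to satisfy both the bandwidth and the record-rate limit."""
    by_bandwidth = math.ceil(mb_per_sec / SHARD_INGEST_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_INGEST_RECORDS_PER_SEC)
    return max(by_bandwidth, by_records, 1)


TWEETS_PER_DAY = 500_000_000
records_per_sec = TWEETS_PER_DAY / 86_400  # roughly 5,787 tweets per second

# 13.4 MB/s is the throughput figure given on the slide
required = shards_needed(13.4, records_per_sec)
```

With these inputs, bandwidth is the binding constraint (14 shards) rather than record rate (6 shards), so the stream would need 14 shards.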
“Amazon Kinesis also offloads a lot of developer burden in building a real-time, streaming data ingestion platform, and enables Supercell to focus on delivering games that delight players worldwide.”
Sami Yliharju, Services Lead
Which Stream Store Should I Use?
• Amazon Kinesis and Kafka have many similarities
– Multiple consumers
– Ordering of records
– Streaming MapReduce
– Low latency. Highly durable, available, and scalable
• Differences
– Record lifetime: 24 hours in Amazon Kinesis, configurable in Kafka
– Record size: 50 KB in Amazon Kinesis, configurable in Kafka
– Amazon Kinesis is a fully managed service – easier to provision, manage, and scale
What Database and Storage Should I Use?
• Data structure
• Query complexity
• Use case
• Workload
• Data characteristics: hot, warm, cold
Process
• Answering questions about data
• Questions
– Analytics: Think SQL/data warehouse
– Classification: Think sentiment analysis
• Who is asking them
– Data scientist
– Business owners
• When do you need them
– In seconds
– Weekly/Monthly
Processing Tools
• Batch/Interactive
– Amazon Redshift
– Amazon EMR
• Hive/Tez, Pig, Impala, Spark, Presto, …
• Stream/Real-time processing
– Apache Spark Streaming
– Apache Storm (+ Trident)
– Amazon Kinesis client and connector library
– AWS Lambda
Data Temperature vs Query Latency
[Figure: processing tools and data stores plotted by data temperature (hot to cold) against query latency (low to high). Labels: Spark Streaming, Storm, Kinesis App; Amazon Redshift; Spark, Impala, Presto; Hive; Amazon Kinesis/Kafka; Amazon DynamoDB; Amazon S3; HDFS; Native Client; Data; Answers]