ARC202:real world real time analytics

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. November 13, 2014 | Las Vegas, NV ARC202 Real-World Real-Time Analytics Gustavo Arjones | @arjones CTO, Socialmetrix Sebastian Montini | @sebamontini Solutions Architect, Socialmetrix

Transcript of ARC202:real world real time analytics

Page 1: ARC202:real world real time analytics

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

November 13, 2014 | Las Vegas, NV

ARC202

Real-World Real-Time Analytics

Gustavo Arjones | @arjones

CTO, Socialmetrix

Sebastian Montini | @sebamontini

Solutions Architect, Socialmetrix

Page 2: ARC202:real world real time analytics

• SaaS Company—since 2008

• Social media analytics: track and measure the activity of brands and personalities, providing information for market research and brand comparison

• Multilanguage technology (English, Portuguese, and Spanish)

• Leader in Latin America, with operations in 5 countries and customers in Latin America and the US

• One of 34 members of the Twitter Certified Program worldwide

Page 3: ARC202:real world real time analytics

Our customers

Page 4: ARC202:real world real time analytics
Page 5: ARC202:real world real time analytics
Page 6: ARC202:real world real time analytics

| Ranking | Brand 1, Q2 | Brand 1, Q3 | Brand 2, Q2 | Brand 2, Q3 | Brand 3, Q2 | Brand 3, Q3 |
|---|---|---|---|---|---|---|
| 1° | Flavor | Breakfast | Flavor | Flavor | Advertising | Flavor |
| 2° | Healthy | Flavor | Packaging | Brand I love | Flavor | Breakfast |
| 3° | Components | Components | Healthy | Packaging | Healthy | Healthy |
| 4° | Advertising | Healthy | Components | Addiction | Components | Advertising |
| 5° | Enquiries | Desire | Prices | Consumption | Prices | Components |
| TOTAL | 1,401 | 8,189 | 463 | 5,519 | 1,081 | 2,445 |

Share of topics

Which conversations are my brand and my competitors’ brands driving?

Page 7: ARC202:real world real time analytics

smx.io/reinvent #reinvent

Page 8: ARC202:real world real time analytics

Challenges

Page 9: ARC202:real world real time analytics

Challenges: Variety

• Different data sources

• Different APIs

• SLA

• Method (pull or push)

• Rate-limit, backoff strategy
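
The rate-limit and backoff bullet above can be illustrated with a minimal sketch of exponential backoff with jitter. This is not taken from the talk: the `RateLimited` exception, the `fetch` callable, and the default delays are all hypothetical stand-ins for whatever each data source's client actually exposes.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by a source client when the API says 'slow down'."""


def backoff_delays(base=1.0, cap=60.0, retries=5, rng=random.random):
    """Yield jittered delays: uniform in [0, min(cap, base * 2**attempt)]."""
    for attempt in range(retries):
        yield rng() * min(cap, base * (2 ** attempt))


def fetch_with_backoff(fetch, sleep=time.sleep, **kwargs):
    """Call fetch() until it succeeds or the retry budget is used up."""
    for delay in backoff_delays(**kwargs):
        try:
            return fetch()
        except RateLimited:
            sleep(delay)  # wait longer (on average) after each failure
    return fetch()  # final attempt; any exception now propagates to the caller
```

The random jitter spreads retries out so that many crawlers hitting the same limit do not all retry at the same instant.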

Page 10: ARC202:real world real time analytics

Challenges: Velocity

• Updates every second

• Top users and top hashtags every minute

• Post-event analysis is done in batch over the complete dataset

• Spikes of 20,000+ tweets per minute

[Chart annotations: Last TV Debate; Results Announced]
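
As a rough illustration of the per-minute "top hashtags" requirement, the sketch below counts hashtags in one-minute tumbling windows. It is an in-memory toy, not the streaming system described later in the talk; the `(epoch_seconds, text)` input shape and the function name are assumptions.

```python
from collections import Counter, defaultdict


def top_hashtags_per_minute(tweets, n=3):
    """tweets: iterable of (epoch_seconds, text) pairs (assumed input shape).

    Returns {minute_start_seconds: [(hashtag, count), ...]} with the n most
    frequent hashtags of each one-minute window.
    """
    windows = defaultdict(Counter)
    for ts, text in tweets:
        minute = int(ts) - int(ts) % 60  # start of the window this tweet falls in
        for token in text.split():
            if token.startswith("#") and len(token) > 1:
                windows[minute][token.lower()] += 1
    return {minute: counts.most_common(n) for minute, counts in sorted(windows.items())}
```

Tokenizing on whitespace is deliberately naive; a real pipeline would use the hashtag entities supplied by the source API where available.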

Page 11: ARC202:real world real time analytics

Challenges: Meaning

• Disambiguation

• Data enrichment

– Demographics

– Sentiment

– Influencers

• Human analysis

[Image labels: PAN; Orange Telecom; Oi Telecom / "Hi!"]
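
To make the disambiguation challenge concrete: going by the slide's "Oi Telecom / Hi!" labels, a brand name can also be an everyday word. The toy check below accepts a mention only when the brand co-occurs with domain context terms. The function name and the word list are hypothetical, and the talk does not say how Socialmetrix actually solves this.

```python
def looks_like_brand_mention(text, brand, context_terms, min_hits=1):
    """Return True if brand appears together with at least min_hits context terms."""
    tokens = {token.strip(".,!?#@").lower() for token in text.split()}
    if brand.lower() not in tokens:
        return False
    return len(tokens & context_terms) >= min_hits


# Hypothetical context vocabulary for a telecom brand
TELECOM_TERMS = {"telecom", "signal", "plan", "4g", "network"}
```

For example, "Oi has no signal again" passes the check, while the greeting "Oi! How are you?" does not.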

Page 12: ARC202:real world real time analytics

Challenges: Alert and report

• Clear and understandable UI

• Slice-and-dice for business users (not BI experts)

• Real-time alerts for anomalies
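
One simple way to raise real-time alerts for anomalies is a z-score over a sliding window of recent per-minute counts. The sketch below assumes that approach purely for illustration; the class name, window size, and threshold are made up, and the talk does not describe the detection method actually used.

```python
from collections import deque
from statistics import mean, pstdev


class SpikeDetector:
    """Flag a count that sits far above the recent average (z-score test)."""

    def __init__(self, window=30, threshold=3.0, warmup=5):
        self.history = deque(maxlen=window)  # most recent counts only
        self.threshold = threshold
        self.warmup = warmup  # observations needed before alerting

    def observe(self, count):
        """Record count and return True if it looks like a spike."""
        is_spike = False
        if len(self.history) >= self.warmup:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            is_spike = sigma > 0 and (count - mu) / sigma > self.threshold
        self.history.append(count)
        return is_spike
```

Feeding it counts of around 100 tweets per minute followed by 20,000 flags the jump, while ordinary fluctuations stay below the threshold.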

Page 13: ARC202:real world real time analytics

Architecture evolution

Page 14: ARC202:real world real time analytics

Drivers for architecture evolution

• More customers, bigger customers

• Add new features

• Keep costs under control

Page 15: ARC202:real world real time analytics

Architecture evolution

[Chart: active customers (axis 0 to 120) for architecture iterations #1 to #4]

Page 16: ARC202:real world real time analytics

Architecture—1st iteration

What we needed:

• Complete data isolation

• Trying different solutions/offerings

Page 17: ARC202:real world real time analytics

Architecture—1st iteration

What we did:

• All-in-one approach

• Multi-instance architecture

• Simple vertical scalability

• MySQL performance tuning

Page 18: ARC202:real world real time analytics

Architecture—1st iteration

What we've learned:

• Multi-instance is harder to administer, but minimizes the impact of instability on customers

• Vertical scalability: poor resource management

• MySQL schema changes translate into downtime

Page 19: ARC202:real world real time analytics

Architecture—2nd iteration

What we needed:

• Separation of responsibilities (crawling, processing)

• Horizontal scalability

• Fast provisioning

• Cost reduction

Page 20: ARC202:real world real time analytics

Architecture—2nd iteration

What we changed:

• Migrated to AWS

• RabbitMQ (Single Node)

• Replaced MySQL with Amazon RDS

• AWS CloudFormation

• Auto Scaling groups

Page 21: ARC202:real world real time analytics

Architecture—2nd iteration

What we've learned:

• PIOPS (Provisioned IOPS)

• Tuning the Auto Scaling policies can be hard

• AWS CloudFormation: great for migration, not enough for daily ops

Page 22: ARC202:real world real time analytics

Architecture—3rd iteration

What we needed:

• Deliver new features (near-real-time (NRT) processing, more complex analytics)

• Scale fast

• Be resilient against failure

• Adding and improving data sources

• Keep costs under control (always)

Page 23: ARC202:real world real time analytics

Architecture—3rd iteration

What we changed:

• Apache Storm

• RabbitMQ HA

• Amazon Elastic MapReduce (Hadoop/Hive)

• AWS CloudFormation + Chef

• Amazon Glacier + Amazon S3 lifecycle policies
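
For readers unfamiliar with Amazon S3 lifecycle policies, the sketch below builds a lifecycle configuration that transitions objects to Glacier after a number of days. The `raw/tweets/` prefix, the 90-day value, and the helper name are assumptions for illustration; the talk does not give the actual rules.

```python
def glacier_lifecycle(prefix, archive_after_days):
    """Build a lifecycle configuration that moves objects under prefix to Glacier."""
    rule_id = "archive-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


# Hypothetical prefix and retention period
config = glacier_lifecycle("raw/tweets/", 90)
# With boto3 installed and credentials configured, this dict could be passed as
# the LifecycleConfiguration argument of the S3 client's
# put_bucket_lifecycle_configuration call.
```

Keeping the rule as plain data makes it easy to review and to manage alongside other infrastructure definitions.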

Page 24: ARC202:real world real time analytics

Architecture—3rd iteration

What we've learned:

• Spot Instances + Reserved Instances

• Hive = SQL: SQL scripts are hard to test

• Bulk upserts on Amazon RDS can be expensive (PIOPS)

• Amazon DynamoDB is great, but expensive (for our use case)

Page 25: ARC202:real world real time analytics

Dashboard

Page 26: ARC202:real world real time analytics

Architecture—4th iteration

What we needed:

• Monitor millions of social media profiles

• Make data accessible (exploration, PoC)

• Improve UI response times

• Testing our data pipelines

• Reprocessing (faster)

Page 27: ARC202:real world real time analytics

Architecture—4th iteration

What we changed:

• Cassandra (DSE)

• MongoDB MMS

• Apache Spark

Page 28: ARC202:real world real time analytics

Architecture—4th iteration

What we've learned:

• Leverage AWS ecosystem

• DataStax AMI + OpsCenter integration

• MongoDB MMS: automation magic!

• Apache Spark unit testing + Amazon EC2 launch scripts

• Amazon EMR doesn’t have the latest stable versions

Page 29: ARC202:real world real time analytics
Page 30: ARC202:real world real time analytics

Architecture evolution

[Chart: costs (axis 0 to 160) and active customers (axis 0 to 120) for architecture iterations #1 to #4]

Page 31: ARC202:real world real time analytics

Lessons learned

Page 32: ARC202:real world real time analytics

Lessons learned

• Automate from day 1 (CloudFormation + Chef)

• Monitor system activity and understand your data patterns, e.g. with Logstash (ELK)

• Always have a Source of Truth (Amazon S3 + Glacier)

• Make your Source of Truth searchable

Page 33: ARC202:real world real time analytics

Lessons learned (II)

• Approximation is a good thing: HyperLogLog (HLL), Count-Min Sketch (CMS), Bloom filters

• Design your pipelines with reprocessing in mind

• Avoid framework explosion at all costs

• The AWS ecosystem allows rapid prototyping
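
As an example of the approximation structures named above, here is a minimal Count-Min Sketch (CMS) for estimating item frequencies, such as hashtag counts, in fixed memory. It is a teaching sketch with assumed `width` and `depth` defaults, not the implementation used by Socialmetrix.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency table; estimates never undercount, but may overcount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One salted hash per row gives `depth` different bucket positions.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.rows[row][col] for row, col in self._buckets(item))
```

Memory use is fixed at `width * depth` counters no matter how many distinct items arrive, which is the trade-off the slide points at: a small, bounded overcount in exchange for constant space.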

Page 34: ARC202:real world real time analytics

Socialmetrix NextGen

2015

Page 35: ARC202:real world real time analytics

Architecture evolution

[Chart: active customers (axis 0 to 120) for architecture iterations #1 to #4]

Page 36: ARC202:real world real time analytics

Architecture NextGen

• Reduce moving parts

• Apache Spark as central processing framework

– Real-time (micro-batch)

– Batch processing

• Kafka (message broker)

• Cassandra (time-series storage)

• Elasticsearch (content indexer)

Page 37: ARC202:real world real time analytics

To infinity … and beyond!

Architecture evolution

[Chart: active customers (axis 0 to 120) for architecture iterations #1 to #4 and NextGen]

Page 38: ARC202:real world real time analytics

Gustavo Arjones, CTO

@arjones

Sebastian Montini, Solutions Architect

@sebamontini

Let’s talk at Venetian—Titian Hallway

Feedback and Q&A

Page 39: ARC202:real world real time analytics

Please give us your feedback on this presentation

Join the conversation on Twitter with #reinvent

ARC202: Real-World Real-Time Analytics

Thank you!