About VisualDNA Architecture @ Rubyslava 2014

@ Rubyslava 2014 Michal Hariš : [email protected] - Technical Architect, joined VisualDNA in 2012

description

The journey we took at VisualDNA in transforming our architecture from LAMP through BATCH to LAMBDA.

Transcript of About VisualDNA Architecture @ Rubyslava 2014

Page 1: About VisualDNA Architecture @ Rubyslava 2014

@ Rubyslava 2014

Michal Hariš : [email protected] - Technical Architect, joined VisualDNA in 2012

Page 4: About VisualDNA Architecture @ Rubyslava 2014

Where were we 3 years ago
● 10 people working around one mysql table holding 50M+ user profiles
● LAMP Architecture

SCALABILITY ISSUES

DECISION TO GO BIG (DATA) !

Page 5: About VisualDNA Architecture @ Rubyslava 2014

Where were we 18 months ago
● 30-strong team, of that a single tech team of roughly 15 people
● Basically a batch architecture
  ● just not MySQL but CASSANDRA + HADOOP at the back
  ● http+php trackers with piped custom log batch process
  ● s3 upload every 5 min
  ● daily hdfs distcp
  ● POC = daily hadoop inference -> 6 node cassandra -> batch integrations
  ● POC was a daily batch job which on bad days took 30 hours
● One of the first commercial Cassandra clusters in the world
  ● very unstable

Page 6: About VisualDNA Architecture @ Rubyslava 2014

Where are we today
● Stack
  ● Java
  ● Scala
  ● Hadoop
  ● Cassandra
  ● Kafka
  ● Redis
  ● R
  ● AngularJS for the front-end

Page 7: About VisualDNA Architecture @ Rubyslava 2014

Where are we today
● Auto-scaling geo-located Tracker Clusters - well, almost auto-scaling
● Robust Streaming Infrastructure - aggregation of all data streams in central infrastructure
  ● bringing in 8.5k events/second at peak
  ● Real-time end-user products, scoring services, integrations with third parties where possible, pre-computation infrastructure that scales more predictively
  ● These are primary events which get multiplied by various speed-layer
● ETL Pipeline - offloading data streams and pre-computing materialised views onto HDFS > 30TB of primary data
  ● some data we keep only for the last 60 or 90 days, others we keep forever
● Decision Analytics Pipeline (or RD Pipe) > 100TB+ of secondary data
  ● Using feature-extraction machine learning methods

Page 8: About VisualDNA Architecture @ Rubyslava 2014

Where are we today
● Still one Cassandra ring, just bigger and more stable, 16 nodes, 250M+ active user profiles
● Lambda Architecture for real-time products like WHY Analytics
  ● RD Pipe is the "batch" layer (daily) that generates active profiles as a cassandra ("view layer")
  ● Primary Events are enriched for user profiles produced daily by the Enrichment service ("speed layer")
  ● Combination of probabilistic counters and Redis cubes calculates the current audience profiles for subscribed websites ("speed layer")
  ● API on top of the Redis cubes serves the current audience profiles for the front end suite of real-time analytics products ("serving layer")
  ● Audience Analytics product suite is the good looking bit - http://www.visualdna.com/why/
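The split between a daily batch view and a speed layer described on this slide can be illustrated with a minimal in-memory sketch. This is not VisualDNA's implementation: the class `LambdaViewSketch`, its method names, and the use of plain hash maps in place of Cassandra and the Redis cubes are all assumptions made for illustration. It also uses exact counts where the slide mentions probabilistic counters, and it assumes that each batch run covers every event seen so far, so the speed view can simply be reset.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a Lambda-style query path: answer = batch view + speed view.
final class LambdaViewSketch {
    // Batch view: per-segment counts recomputed by the (daily) batch job.
    private final Map<String, Long> batchView = new HashMap<>();
    // Speed view: per-segment counts for events that arrived since the last batch run.
    private final Map<String, Long> speedView = new HashMap<>();

    // Speed layer: fold each incoming event into the real-time counts.
    void onEvent(String segment) {
        speedView.merge(segment, 1L, Long::sum);
    }

    // Batch layer: swap in freshly recomputed counts and reset the speed view.
    // Assumption: the batch result already includes every event seen so far.
    void loadBatchView(Map<String, Long> recomputed) {
        batchView.clear();
        batchView.putAll(recomputed);
        speedView.clear();
    }

    // Serving layer: merge both views at query time.
    long query(String segment) {
        return batchView.getOrDefault(segment, 0L) + speedView.getOrDefault(segment, 0L);
    }

    public static void main(String[] args) {
        LambdaViewSketch views = new LambdaViewSketch();
        views.onEvent("travel");
        views.onEvent("travel");
        System.out.println(views.query("travel"));

        Map<String, Long> daily = new HashMap<>();
        daily.put("travel", 10L);
        views.loadBatchView(daily);
        views.onEvent("travel");
        System.out.println(views.query("travel"));
    }
}
```

The point of the design is that the batch view can always be rebuilt from the primary data, while the speed view only has to stay correct for the short window between two batch runs.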

Page 9: About VisualDNA Architecture @ Rubyslava 2014

Where are we today
● 120-strong team, of that tech is roughly 60:
  ● Sysadmin Team
  ● Architecture Tech Team
  ● Decision Analytics Tech Team
  ● Consumer Tech Team
  ● WHY Analytics Team

Page 10: About VisualDNA Architecture @ Rubyslava 2014

What have we learned
● Architecture:
  ● Updating json blobs in Cassandra columns is a trap
  ● Logging is better http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  ● Metrics are crucial in large distributed systems
    ● yammer metrics + graphite + icinga work well for infrastructure
    ● but complex event/anomaly detection and pattern analysis gives the edge
  ● Real-Time processing of Data Streams is not only cool, but scales well ... until you find a bottleneck in a single component which will limit the entire system
  ● Batch still matters
    ● but could be much faster than Hadoop, which suffers from too much redundant I/O and requires a coordinated ETL pipeline
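One way to read the "json blobs are a trap, logging is better" lesson is as a preference for appending immutable facts over rewriting a stored blob in place. The sketch below illustrates that reading only. The class `ProfileLogSketch`, the `AttributeSet` event type, and the in-memory list standing in for a real log are hypothetical and do not come from the talk.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: profile state is derived from an append-only log
// instead of being read, modified and written back as one blob.
final class ProfileLogSketch {
    // One immutable fact: attribute `key` of `userId` was set to `value`.
    static final class AttributeSet {
        final String userId;
        final String key;
        final String value;

        AttributeSet(String userId, String key, String value) {
            this.userId = userId;
            this.key = key;
            this.value = value;
        }
    }

    private final List<AttributeSet> log = new ArrayList<>();

    // Writers only ever append; nothing already in the log is modified.
    void append(String userId, String key, String value) {
        log.add(new AttributeSet(userId, key, value));
    }

    // Readers rebuild the current profile by replaying the log in order,
    // so a later event for the same key overrides an earlier one.
    Map<String, String> profileOf(String userId) {
        Map<String, String> profile = new LinkedHashMap<>();
        for (AttributeSet event : log) {
            if (event.userId.equals(userId)) {
                profile.put(event.key, event.value);
            }
        }
        return profile;
    }

    int size() {
        return log.size();
    }

    public static void main(String[] args) {
        ProfileLogSketch log = new ProfileLogSketch();
        log.append("u1", "segment", "traveller");
        log.append("u1", "segment", "foodie");
        System.out.println(log.profileOf("u1"));
    }
}
```

Because every change stays in the log, a derived view can be rebuilt or corrected later by replaying it, which an in-place blob update does not allow.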

Page 11: About VisualDNA Architecture @ Rubyslava 2014

What have we learned
● Engineering:
  ● the unix philosophy of building short, simple, clear, modular, and extendable code also applies to the design of distributed systems, not just an OS
  ● bad tests are better than no tests but they are still bad, and most tests only test the positive outcome
    ● the story of Math.abs() -> can actually return a negative number -> but none of the unit-tests anticipated this -> which is why metrics and systems with feedback control are crucial
● Process:
  ● It is possible to co-operate remotely even on complex and not well-defined systems - atm some of the architecture team is working remotely on a permanent basis
  ● QA is intrinsic to Architecture and local to products
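The Math.abs() story mentioned above refers to a real property of Java: `Math.abs(Integer.MIN_VALUE)` returns `Integer.MIN_VALUE`, because the positive counterpart does not fit in an `int`. The talk does not say where the call was used, so the hash-to-partition scenario below, along with the class and method names, is an assumption chosen only to show how a test that checks "typical" inputs would miss the problem.

```java
// Hypothetical sketch: mapping a hash code to a partition index.
final class AbsPartitionDemo {
    // Naive version: Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE,
    // so the remainder can be negative for exactly that one input.
    static int naivePartition(int hash, int partitions) {
        return Math.abs(hash) % partitions;
    }

    // Safer version: for a positive divisor, Math.floorMod always returns
    // a value in the range [0, partitions).
    static int safePartition(int hash, int partitions) {
        return Math.floorMod(hash, partitions);
    }

    public static void main(String[] args) {
        int hash = Integer.MIN_VALUE;
        System.out.println(Math.abs(hash));           // prints -2147483648
        System.out.println(naivePartition(hash, 10)); // prints -8
        System.out.println(safePartition(hash, 10));  // prints 2
    }
}
```

A unit test that only feeds in ordinary hash values passes for both versions, which matches the slide's point that positive-outcome tests are not enough and that runtime metrics are needed to catch such cases.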

Page 12: About VisualDNA Architecture @ Rubyslava 2014

Interesting issues we’re facing
1. SLAs vs. Start-up dynamics - Separate process (and to some degree architecture) for different levels of guarantee of service
2. Globally-distributed highly-available API for random access to our profiles - enabling decisions based on VDNA profiles on-demand
3. Our Lambda has a bottleneck at the enrichment point - although if we solve (2.) we will be half-way through
4. Complex data pooling attribution model
5. Cassandra still gives us some pain - it's the drivers! - an interesting read about consistency: http://aphyr.com/posts/294-call-me-maybe-cassandra/
6. Preserving start-up dynamics and culture in a company of 200+ with offices in several cities

Page 13: About VisualDNA Architecture @ Rubyslava 2014

We’re hiring for Bratislava office!

[email protected]

● We’re looking for engineers and analysts and more to be based in Bratislava