Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe


Patterns of the Lambda Architecture

Truth and Lies at the Edge of Scale

Flip Kromer — CSC

I’m Flip Kromer, Distinguished Engineer at CSC. If you are a large enterprise company looking to add Big Data capabilities — especially one involving legacy systems — we’re a big, stable company that specializes in turning technology into an enterprise-grade solution.

Pattern Set

This talk will equip you with two things. One is a set of patterns for how we design high-scale architectures to solve specific solution cases, now that extra infrastructure is nearly free.

Tradeoff Rules: PICK ANY TWO

Along with a set of tradeoff rules, along the lines of the pick-any-two trinity, but more sophisticated.

Lambda Architecture

So what is the Lambda Architecture? Here are two examples.

Search w/ Update

[Diagram: Build Indexes · A Ton of Text · Historical Index · Live Indexer · More Text · Recent Index · API]

In this system, we have a whole ton of historical text, with more arriving all the time, and want to allow immediate real-time search across the whole corpus.

Build Main Index

We will use a large periodic batch job to create indexes on the historical data. This takes a while — far longer than our recency demands allow — so we might as well have our elephants use clever algorithms and optimally organize the data for rapid retrieval.

Update Recent Index

Until the next stampede arrives with an updated index, as each new record arrives we will not only file it with the historical data but also use simple, fast indexing to make it immediately searchable. Merging new records directly would require stuffing them into the right place in the historical index, which eventually means moving records around, which demands far too much time and complexity to be workable.

Serve Result

The system that serves the data just pulls from both indexes in immediate time.


Lambda Architecture

Batch

Speed

Serving

We have a batch layer for the global corpus; a speed layer for recent results; and a serving layer for access.


Lambda Architecture

Global

Relevant

Immediate

The batch layer gives global truth; the speed layer gives relevant, recent results; and the serving layer gives immediate access.

[Diagram: Train Recommender · Visitor ⇒ History · History ⇒ Alsobuy · Visitor:Product · Visitor ⇒ Alsobuy · Update Recommendation · Fetch/Update History · Visitor:Product History · Webserver · Recommender]

Another familiar architecture is a high-scale recommender system — “Given that the user has looked at mod-style dresses and mason jars, show them these knitting needles”. This diagram shows a recommender, but most machine-learning systems look like this.

Build Model

You have one system process all the examples you’ve ever seen to produce a predictive model. The trained model it produces can then react immediately to all future examples as they occur.

Applies Model

In this system we’re going to have one system apply the model and store the recommendation.

Your operations team is better off with two systems that can fail without breaking the site than to have the apply-model step coupled to serving pages.

Serves Result

So that the web layer can just serve the result without being contaminated by the recommender system’s code.


Batch

Speed

Serving

Again, the same three pieces.

Lambda Arch Layers

• Batch layer -- Deep Global Truth (throughput)

• Speed layer -- Relevant Local Truth (throughput)

• Serving layer -- Rapid Retrieval (latency)

The speed layer cares about throughput; the serving layer cares about latency.

Lambda Arch: Technology

• Batch layer -- Hadoop, Spark, Batch DB Reports

• Speed layer -- Storm+Trident, Spark Streaming, Samza, AMQP, …

• Serving layer -- Web APIs, Static Assets, RPC, …

Lambda Architecture

Batch

Speed

Serving

Where does the name “lambda” come from? In my head it’s because the flow diagram…

Lambda Architecture

Batch

Speed

Serving

…looks like the shape of the character for lambda (λ).

Lambda Architecture

λ(v)

• Pure Function on immutable data

But really it means this new mindset of building a pure function (a lambda) on immutable data.

Ideal Data System

• Capacity -- Can process arbitrarily large amounts of data

• Affordability -- Cheap to run

• Simplicity -- Easy to build, maintain, debug

• Resilience -- Jobs/Processes fail&restart gracefully

• Responsiveness -- Low latency for delivering results

• Justification -- Incorporates all relevant data into result

• Comprehensive -- Answer questions about any subject

• Recency -- Promptly incorporates changes in world

• Accuracy -- Few approximations or avoidable errors

The laziest, and therefore best, knobs are the Capacity/Affordability ones. The pre-big-data era can be thought of as one where only those two exist. Big Data broke the handle off the Capacity knob, either because Affordability ramps too fast or because the speed of light starts threatening resilience, responsiveness, or recency.

* _Comprehensive_: complete; including all or nearly all elements or aspects of something
* _Concise_: giving a lot of information clearly and in a few words; brief but comprehensive


You would think that what mattered was correctness — justified true belief.


When you look at what we actually do, the non-negotiables are that it be manageable and economic, given that you must process arbitrarily large amounts of data. Truth is a nice-to-have.

Tradeoff Rules: PICK ANY TWO

At Scale: THIS AND THIS, AND TRY TO BE GOOD

Basically, given big data you have to accommodate any amount of data and produce static reports or queries that execute within the duration of human patience — so you must be fast and cheap, sacrificing good.

Patterns


The world you’re modeling changes — new sets of products are released, new and varied customers sign up, changes to the site drive new behavior — but it changes slowly. So it’s no big deal if the training stage is only run once a week over several hours.

The first example follows a pretty familiar general form I’ll call “Train / React”. You have one system process all the examples you’ve ever seen to produce a predictive model. The trained model it produces can then react immediately to all future examples as they occur.

Pattern: Train / React

• Model of the world lets you make immediate decisions

• World changes slowly, so we can re-build model at leisure

• Relax: Recency

• Batch layer: Train a machine learning model

• Speed layer: Apply that model

• Examples: most Machine Learning thingies

(Recommender)

Big fat job that only needs to run occasionally; the results of the job inform what happens immediately.
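A minimal sketch of the Train / React split, not from the talk itself; the function names and toy co-occurrence model are hypothetical. The batch layer chews through every purchase history it has ever seen; the speed layer is a cheap lookup that reacts to one live event.

```python
from collections import defaultdict
from itertools import permutations

def train_alsobuy(purchase_histories):
    """Batch layer: count product co-occurrences across all histories."""
    counts = defaultdict(lambda: defaultdict(int))
    for history in purchase_histories:
        # dict.fromkeys dedupes while keeping order deterministic
        for a, b in permutations(dict.fromkeys(history), 2):
            counts[a][b] += 1
    # Model: for each product, the products most often bought alongside it.
    return {p: sorted(co, key=co.get, reverse=True) for p, co in counts.items()}

def recommend(model, product, n=3):
    """Speed layer: apply the trained model to one live event, immediately."""
    return model.get(product, [])[:n]

histories = [["dress", "jar", "needles"], ["jar", "needles"], ["dress", "jar"]]
model = train_alsobuy(histories)
```

Note the shape of the pattern: `train_alsobuy` is slow and runs at leisure; `recommend` is a constant-time dictionary read, which is all the speed layer ever does.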

(Search w/ Update)

Pattern: Baseline / Delta

• Understanding the world takes a long time

• World changes much faster than that, and you care

• Relax: Simplicity, Accuracy

• Batch layer: Process the entire world

• Speed layer: Handle any changes since last big run

• Examples: Real-time Search index; Count Distinct; other Approximate Stream Algorithms

In Train / React, the world changes, but slowly; training in batch mode is just fine. In Baseline / Delta, the world changes so quickly that you can’t run the compute job fast enough.

So you are sacrificing simplicity — there are two systems where there was only one — and accuracy — the recent records won’t update global normalized frequencies.
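A minimal sketch of Baseline / Delta, under hypothetical names (not the talk's code): a batch-built inverted index over the historical corpus, a trivially simple live index for arrivals, and a serving function that pulls from both and merges.

```python
def build_batch_index(corpus):
    """Batch layer: invert the whole historical corpus (slow but thorough)."""
    index = {}
    for doc_id, text in corpus.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

class LiveIndexer:
    """Speed layer: simple, fast indexing for each record as it arrives."""
    def __init__(self):
        self.index = {}

    def add(self, doc_id, text):
        for word in text.lower().split():
            self.index.setdefault(word, set()).add(doc_id)

def search(word, batch_index, live):
    """Serving layer: pull from both indexes and merge, in immediate time."""
    return batch_index.get(word, set()) | live.index.get(word, set())

batch = build_batch_index({"d1": "a ton of text", "d2": "lambda architecture"})
live = LiveIndexer()
live.add("d3", "more text arriving")
```

When the next batch stampede finishes, `batch` is replaced wholesale and `live` is reset; nothing ever has to be merged into the middle of the historical index.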

Pagerank

[Diagram: Converge Pagerank over Friend Relations ⇒ User ⇒ Pagerank; retrieve Bob’s Facebook network and Bob’s friends’ pageranks; estimate Bob’s pagerank; serve via API]

But don’t bother updating Bob’s friends (or friends’ friends, or …)

(Lazy Propagation)

Pagerank

[Diagram: jellybean scores (48, 24, 42, 12, 6, …) flowing through a small follower graph]

This next example has an importantly different flavor. The core way that Google identifies important web pages is the “Pagerank” algorithm, which basically says “a page is interesting if other interesting pages link to it”. That’s recursive of course, but the math works out. You can do similar things on a social network like Twitter to find spammers and superstars, or among college football teams or World of Warcraft players to prepare a competitive ranking, or among buyers and sellers in a market to detect fraud.

To define a reputation ranking on say Twitter you simulate a game of multiple rounds.

New Record Appears

[Diagram: the same graph, with a new node of unknown score (“?”) attached]

Doing this is kinda literally what Hadoop was born to do, and it’s a simple Hadoop-101 level program.

Acting out all those rounds using every interaction we’ve ever seen takes a fair amount of time, though, and so a problem comes when we meet a new person.

This new person accrues some reputational jellybeans, and we don’t want to live in total ignorance of what their score is; and they dispatch some as well, which should change the scores of those they follow.

Update Using Local

[Diagram: the new node’s followers each pay out a proportional share: 12÷3 = 4, 24÷5 ≈ 5]

Well, we can roughly guess the score of the new node by having their followers pay out a jellybean share proportional to what they would have gotten in the last pagerank round.

“A Guess beats a Blank Stare”

* World rate of change not really relevant
* The solution is actually to tell a lie
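The slide's arithmetic (12÷3, 24÷5) can be sketched as a one-line estimator; the function name and data are hypothetical. Each follower pays out the share one extra followee would have received in the last batch round, and nobody else's score is touched.

```python
def estimate_new_node(follower_ids, rank, out_degree):
    """Speed layer: guess a brand-new node's score from its followers'
    last batch-computed scores, counting the new edge in the payout."""
    return sum(rank[f] / (out_degree[f] + 1) for f in follower_ids)

# Scores and out-degrees from the last batch run, before the new edge.
rank = {"a": 12, "b": 24}
out_degree = {"a": 2, "b": 4}
guess = estimate_new_node(["a", "b"], rank, out_degree)  # 12/3 + 24/5
```

The guess is wrong in the details (the followers' own scores are now stale), but it is a guess instead of a blank stare, which is the whole point.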

…Ignoring Correctness

[Diagram: the neighbors’ now-slightly-wrong scores are left alone (“meh”)]

But we’re not going to update the neighbors. You’d be concurrently updating an arbitrary number of outbound nodes, and then of course those nodes’ changes should rightfully propagate as well — this is why we play the multiple pagerank rounds in the first place.

What we do instead is lie. Look, planes don’t fall out of the sky if you get someone’s coolness quotient wrong in the first decimal point.

Batch Updates Graph

[Diagram: the next batch run recomputes every score exactly]

(A Guess beats a Blank)

This has an importantly different flavor:
* World rate of change not really relevant
* The solution is actually to tell a lie

Pattern: World/Local

• Understanding the world needs the full graph

• You can tell a little white lie reading immediate graph only

• Relaxing: Accuracy, Justification

• Batch layer: uses global graph information

• Speed layer: just reads immediate neighborhood

• Examples: “Whom to follow”, Clustering, anything at 2nd-degree (friend-of-a-friend)

The problem isn’t so much about the volume of data, it’s about how _far away_ that data is. You can’t justify doing that second-order query for three reasons:
* time
* compute resources
* computational risk

Pattern: Guess Beats Blank

• You can’t calculate a good answer quickly

• But Comprehensiveness is a must

• Relaxing: Accuracy, Justification

• Batch layer: finds the correct answer

• Speed layer: makes a reasonable guess

• Examples: Any time the sparse-data case is also the most valuable

In this case, we can’t sacrifice comprehensiveness — for every record that exists, we must return a relevant answer. So we sacrifice truthfulness — or more precisely, we sacrifice accuracy and justification.

Marine Corps’ 80% Rule

“Any decision made with more than 80% of the necessary information is hesitation”

— “The Marine Corps Way”, Santamaria & Martino

When there’s lots of data already, the imperfect result in the speed layer doesn’t have a huge effect. When there isn’t much data, it’s overwhelmingly better to fill in with an imperfect result. US Marine Corps: “Any decision made with more than 80% of the necessary information is hesitation.”


Security

Find Potential Evilness

[Diagram: Net Connections · Store Interactions · Connection Counts · Agents of Interest · Approximate Streaming Agg · Detected Evilnesses · “Agent of Interest?” · Dashboard]

In security, you have the data-breach-type problems — why is someone strip-mining computers in turn to a server in [name your own semi-friendly country]? — and the Bradley Manning-type problems — why is a GS-5 at a console in Kuwait downloading every single diplomatic dispatch?

Pattern: Slow Boil / Flash Fire

• Two tempos of data: months vs milliseconds

• Short-term data too much to store

• Long-term data too much to handle immediately

• Often accompanies Baseline / Deltas, Global / Local

• Examples:
• Trending Topics
• Insider Security

Global/Local: Why has a contractor sysadmin in Hawaii accessed powerpoint presos from every single group within our organization?
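A minimal two-tempo sketch, with hypothetical names and thresholds (the talk gives none): the batch layer distills months of history into a per-agent baseline; the speed layer keeps only a short sliding window and flags agents whose current rate far exceeds their baseline.

```python
from collections import deque

class FlashDetector:
    """Speed layer: a short sliding window of recent events, checked
    against a long-term per-agent baseline computed by the batch layer."""
    def __init__(self, baselines, window=100, factor=10):
        self.baselines = baselines  # batch layer: typical count per window
        self.factor = factor        # how far above baseline counts as a flash
        self.recent = deque(maxlen=window)

    def observe(self, agent):
        """Record one event; return True if this agent now looks suspicious."""
        self.recent.append(agent)
        count = sum(1 for a in self.recent if a == agent)
        return count > self.factor * self.baselines.get(agent, 1)

detector = FlashDetector({"gs5": 1}, window=10, factor=3)
flags = [detector.observe("gs5") for _ in range(5)]
```

The window is small enough to live in memory (short-term data is too much to store), and the baseline is coarse enough to compute offline (long-term data is too much to handle immediately).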

Banking, Oversimplified

[Diagram: Reconcile Accounts · Account Balances · Event Store · Transaction Update Records]

(CAP Tradeoffs)

[Diagram repeated, annotated: the reconciled batch answer is essential; the speed layer’s view is nice-to-have. The batch layer wins over the fast layer.]

(CAP Tradeoffs)

Pattern: C-A-P Tradeoffs

• C-A-P tradeoffs:

• Can’t depend on when data will roll in (Justification)

• Can’t live in ignorance (Comprehensiveness)

• Batch layer: The final answer

• Speed layer: Actionable views

• Examples: Security (Authorization vs Auditing), lots of counting problems

(Banking)

Pattern: Out-of-Order

• C-A-P tradeoffs:

• Can’t depend on when data will roll in (Justification)

• Can’t live in ignorance (Comprehensiveness)

• Batch layer: The final answer

• Speed layer: Actionable views

• Examples: Security (Authorization vs Auditing), lots of counting problems

(Banking)

Common Theme

The System Asymptotes to Truth over time

We keep seeing this common theme — you are building a system that approaches correctness over time. This leads to a best practice that I’ll call the improver pattern:

Scrape Product Web

• Scrapers: yield partial records

• Unifier: connects all identifiers for a common object

• Resolver: combines partial records into unified record

Entity Resolution

Pattern: Improver

• Improver: function(best guess, {new facts}) ~> new best guess

• Batch layer: f(blank, {all facts}) ~> best possible guess

• Speed layer: f(current best, {new fact}) ~> new best guess

• Batch and speed layer share same code & contract, asymptote to truth.

The way you build your resolver is such that both layers share the same code and contract, and the system asymptotes to truth.
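The improver contract above can be sketched in a few lines; the merge rule (newest non-null field wins) is a hypothetical stand-in for a real resolver. One pure function serves both layers: the batch layer folds every fact into a blank record, the speed layer folds each new fact into the current best guess.

```python
def improve(best_guess, facts):
    """Improver: function(best guess, {new facts}) ~> new best guess.
    Here: merge partial records, letting newly learned fields win."""
    merged = dict(best_guess)
    for fact in facts:
        merged.update({k: v for k, v in fact.items() if v is not None})
    return merged

all_facts = [{"title": "Knitting Needles"}, {"price": 7.99, "vendor": "Ma&Pa"}]
batch_view = improve({}, all_facts)                   # f(blank, {all facts})
speed_view = improve(batch_view, [{"price": 6.49}])   # f(current best, {new fact})
```

Because both layers run the same function, a speed-layer guess is never contradicted by the next batch run's logic, only refined by its fuller set of facts.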

Two Big Ideas

• Fine-grained control over architectural tradeoffs

• Truth lives at the edge, not the middle

Lets you trade off how quickly, how expensively, how true, how justified. A new paradigm for how, when, and where we handle truth.

Two Big Ideas

• Fine-grained control over architectural tradeoffs

• Approximate a pure function on all data

• What we do now that architecture is free

• Truth lives at the edge, not the middle


Two Big Ideas

• Fine-grained control over architectural tradeoffs

• Approximate a pure function on all data

• What we do now that architecture is free

• Truth lives at the edge, not the middle

• Data is syndicated forward from arrival to serving

• “Query at write time”


• Lambda architecture isn’t about speed layer / batch layer.

• It's about

• moving truth to the edge, not the center;

• enabling fine-grained tradeoffs against fundamental limits;

• decoupling consumer from infrastructure

• decoupling consumer from asynchrony

• …with profound implications for how you build your teams

λ Arch: Truth, not Plumbing

This way of doing it simplifies architecture: local interactions only; elimination of asynchrony. Which in turn profoundly simplifies development and operations, and allows you to structure your team like you do the architecture.

Lambda Architecture for a Dinky Little Blog

So far, we’ve talked about a bunch of reasons why you might be led to a lambda architecture. When there’s a new technology, people always first ask why they should do it differently, which is a wise thing to ask and a foolish thing to insist on. But let’s look at it from the other end: what life is like if this were the natural state of being. And to do so, let’s take the most unjustifiable case for a high-scale architecture: a blog engine.

Blog: Traditional Approach

• The familiar ORM Rails-style blog:

• Models: User, Article, Comment

• Views:

• /user/:id (user info, links to their articles and comments);

• /articles (list of articles);

• /articles/:id (article content, comments, author info)

User: id=3, name=“joeman”, homepage=http://…, photo=http://…, bio=“…”

Article: id=7, title=“The Crisis”, body=“These are…”, author_id=3, created_at=2014-08-08

Comment: id=12, body=“lol”, article_id=7, author_id=3

[Mockup: user page showing author photo, name, bio, and snippets of Joe’s 2 articles: “A Post about My Cat” and “Welcome to my blog”]

[Mockup: article page showing title, body, author sidebar (photo, name, bio), and comments: “First Post”, “lol”, “No comment”]

[Diagram: the Webserver assembles the “article show” and “user show” pages from the articles, users, and comments tables]

Traditional: Assemble on Read

DB models are the sole source of truth; denormalized; used directly by reader and writer. The view is constructed from spare parts at read time.

Syndicate on Write

[Diagram: Δarticle, Δuser, Δcomment flow from the articles, users, and comments models through Reporters (“Biographers”) into View Fragments for the show pages]


[Slide: a Data Engineer and a Web Engineer negotiate the report schema]

Web Engineer: What data model would you like to receive?

Data Engineer: {“title”:”…”, “body”:”…”,…}

• (…hack hack hack…) ⇒ /articles/v1/show.json

Web Engineer: lol um can I also have

Data Engineer: {“title”:”…”, “body”:”…”, “snippet”:…}

• (…hack hack hack…) ⇒ /articles/v2/show.json

Syndicated Data

• The Data is always _there_

• …but sometimes it’s more perfect than other times.

Syndicated Data

• Reports are cheap, single-concern, and faithful to the view.

• You start thinking like the customer, not the database

• All pages render in O(1): your imagination doesn’t have to fit inside a TCP timeout

• Data is immutable, flows are idempotent: interface change is safe

• Data is always _there_: asynchrony doesn’t affect consumers

• Everything is decoupled: way harder to break everything

One of the worst pains in asses is the query that takes 1500 milliseconds. It needs to be immediate, is usually mission-critical, and is expensive in all ways.

• Lambda architecture isn’t about speed layer / batch layer.

• It's about

• moving truth to the edge, not the center;

• enabling fine-grained tradeoffs against fundamental limits;

• decoupling consumer from infrastructure

• decoupling consumer from asynchrony

• …with profound implications for how you build your teams

λ Arch: Truth, not Plumbing

This way of doing it simplifies architecture: local interactions only; elimination of asynchrony. Which in turn profoundly simplifies development and operations, and allows you to structure your team like you do the architecture.

Changes update models

[Diagram: update-article, update-user, update-comment events produce Δarticle, Δuser, Δcomment; the models (article, user, comment) and a history log are updated]

Models stay the same: User, Article, Comment, updated directly. Reporters can subscribe to models. On update, a reporter receives the updated object and can do anything else it wants; typically, it creates a new report. Reports live in the target domain: faithful to the data consumer. In this case, they look very close to the information hierarchy of the rendered page. All pages render in O(1). Your imagination is not constrained by the length of a TCP timeout.
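The subscribe-and-syndicate flow can be sketched as a tiny pub/sub loop; the class and reporter names are hypothetical, not from the talk. A reporter fires on every model write and rebuilds its denormalized fragment, so serving a page is an O(1) dictionary lookup rather than a read-time join.

```python
class Reports:
    """Registry of reporters and the denormalized fragments they maintain."""
    def __init__(self):
        self.fragments = {}  # (fragment_name, object_id) -> dict
        self.reporters = []

    def subscribe(self, reporter):
        self.reporters.append(reporter)

    def update(self, model, obj):
        """Called on every model write; each reporter syndicates forward."""
        for reporter in self.reporters:
            reporter(self, model, obj)

def sidebar_user_reporter(reports, model, obj):
    """Maintains the "sidebar user" fragment whenever a user changes."""
    if model == "user":
        reports.fragments[("sidebar_user", obj["id"])] = {
            "name": obj["name"], "photo": obj["photo"]}

reports = Reports()
reports.subscribe(sidebar_user_reporter)
reports.update("user", {"id": 3, "name": "joeman", "photo": "http://…"})

# Serving a page is now an O(1) lookup, not a join:
sidebar = reports.fragments[("sidebar_user", 3)]
```

Each reporter is single-concern: adding a new view fragment means adding a new subscriber, never touching the models or the other reporters.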

Models Trigger Reporters

[Diagram: the same model updates now trigger reporters that maintain view fragments: compact article, user’s #articles, expanded user, user’s #comments, sidebar user, compact comment, expanded article, micro user]


Serve Report Fragments

[Diagram: the “show article” page is assembled from pre-built fragments: expanded article, compact article, user’s #articles, expanded user, sidebar user, user’s #comments, compact comment, micro user]

[Mockup: the rendered article page, as before]


[Mockup: the rendered article page alongside its pre-built report]

article show, rendered:

{"title":"Article Title","body":"Article Body Lorem [...]","author":{ ... },"comments": [ {"comment_id":1, "body":"First Post",...}, {"comment_id":2, "body":"lol",...}, ...]}

Serve Report Fragments

[Mockup: the rendered user page, assembled from the same pre-built fragments]

show user


Reports are Cheap

[Diagram: the same model updates and reporters now feed many cheap views: list articles, show article, list user’s articles, show user]


Two Big Ideas

• Fine-grained control over those architectural tradeoffs

• Truth lives at the edge, not the middle

Lets you trade off how quickly, how expensively, how true, how justified. A new paradigm for how, when, and where we handle truth.

Lambda Architecture Entity Resolution

Intake

[Diagram: parse-Amazon, parse-eBay, and parse-Ma&Pa-Electronics feeds arrive via Bulk, Stream, and RPC Callback; each Vendor Listing carries keywords, mfr & model, and ASIN into the Listings store]

Batch Layer: Resolve/Unify

[Diagram: the Product Resolver unifies Listings into Unified Products]

Improve

[Diagram: the Product Resolver fetches Unified Products, takes a new Vendor Listing (keywords, mfr & model, ASIN), and unifies products]

Update

[Diagram: the resolver fetches the affected Unified Products, then resolves & updates them with the new listing]

Cannot have Consistency

[Diagram: the same resolve-and-update flow]

Objections

• Three objections:

1. Why hasn’t it been done before?
2. Architecture Astronaut
3. I’m not at high scale

• Responses:

1. Chef/Puppet/Docker/etc
2. Chef/Puppet/Docker/etc
3. Shut Up

Objections

• Two APIs? Really?

• Yes. Guilty. That’s dumb and must be fixed.

• Spark or Samza, if you’re willing to only drink one flavor of Kool-Aid

• EZbake.io, a CSC / 42six project to attack this

• …but we shouldn’t be living at the low level anyhow

Objections

• Orchestration: “logical plan” (dataflow graph)

• Optimization/Allocation: “physical plan” (what goes where)

• Resource Projector: instantiates infrastructure (HTTP listeners, Trident streams, Oozie scheduling, ETL flows, cron jobs, etc)

• Transport Machineries: move data around, fulfilling locality/ordering/etc guarantees

• Data Processing: UDFs and operators