Hadoop Lecture for Harvard's CS 264 -- October 19, 2009

Transcript of Hadoop Lecture for Harvard's CS 264 -- October 19, 2009

Page 1

(More) Apache Hadoop

Philip Zeyliger (Math, Dunster ‘04)
[email protected]
@philz42 @cloudera

October 19, 2009
CS 264

Page 2

Who am I?

Software Engineer

Zak’s classmate

Worked at

(Interns)

Page 3

Outline

Review of last Wednesday

Your Homework

Data Warehousing

Some Hadoop Internals

Research & Hadoop

Short Break

Page 4

Last Wednesday

Page 5

The Basics

Clusters, not individual machines

Scale Linearly

Separate App Code from Fault-Tolerant Distributed Systems Code

Systems Programmers / Statisticians

Page 6

Some Big Numbers

Yahoo! Hadoop Clusters: > 82PB, >25k machines (Eric14, HadoopWorld NYC ’09)

Google: 40 GB/s GFS read/write load (Jeff Dean, LADIS ’09) [~3,500 TB/day]

Facebook: 4TB new data per day; DW: 4800 cores, 5.5 PB (Dhruba Borthakur, HadoopWorld)

Page 7

M-R Model

(diagram: the logical flow vs. the physical flow of a job)

Page 8

Important APIs

M/R Flow:

  InputFormat     data → K₁,V₁
  Mapper          K₁,V₁ → K₂,V₂
  Combiner        K₂, iter(V₂) → K₂,V₂
  Partitioner     K₂,V₂ → int
  Reducer         K₂, iter(V₂) → K₃,V₃
  OutputFormat    K₃,V₃ → data

Other:

  Writable, JobClient, *Context, Filesystem

(→ is 1:many)
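To make the Mapper row concrete before the grep job below, here is a simplified sketch of a RegexMapper-style class against the 0.18-era API (the real org.apache.hadoop.mapred.lib.RegexMapper that grep uses differs in details, e.g. it honors the group setting):

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// K₁,V₁ → K₂,V₂: (byte offset, line of text) → (matched text, 1)
public class MyRegexMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private Pattern pattern;

  public void configure(JobConf job) {
    // The same config key the grep example sets below.
    pattern = Pattern.compile(job.get("mapred.mapper.regex"));
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    Matcher m = pattern.matcher(line.toString());
    while (m.find()) {
      // → is 1:many: one input record may yield any number of outputs.
      output.collect(new Text(m.group()), new LongWritable(1));
    }
  }
}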

Page 9

the “grep” example:

public int run(String[] args) throws Exception {
  if (args.length < 3) {
    System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }

  Path tempDir = new Path("grep-temp-" +
      Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

  JobConf grepJob = new JobConf(getConf(), Grep.class);
  try {
    grepJob.setJobName("grep-search");

    FileInputFormat.setInputPaths(grepJob, args[0]);

    grepJob.setMapperClass(RegexMapper.class);
    grepJob.set("mapred.mapper.regex", args[2]);
    if (args.length == 4)
      grepJob.set("mapred.mapper.regex.group", args[3]);

    grepJob.setCombinerClass(LongSumReducer.class);
    grepJob.setReducerClass(LongSumReducer.class);

    FileOutputFormat.setOutputPath(grepJob, tempDir);
    grepJob.setOutputFormat(SequenceFileOutputFormat.class);
    grepJob.setOutputKeyClass(Text.class);
    grepJob.setOutputValueClass(LongWritable.class);

    JobClient.runJob(grepJob);

    JobConf sortJob = new JobConf(Grep.class);
    sortJob.setJobName("grep-sort");

    FileInputFormat.setInputPaths(sortJob, tempDir);
    sortJob.setInputFormat(SequenceFileInputFormat.class);

    sortJob.setMapperClass(InverseMapper.class);

    // write a single file
    sortJob.setNumReduceTasks(1);

    FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
    // sort by decreasing freq
    sortJob.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class);

    JobClient.runJob(sortJob);
  } finally {
    FileSystem.get(grepJob).delete(tempDir, true);
  }
  return 0;
}

Page 10

$ cat input.txt
adams dunster kirkland dunster
kirland dudley dunster
adams dunster winthrop

$ bin/hadoop jar hadoop-0.18.3-examples.jar grep input.txt output1 'dunster|adams'

$ cat output1/part-00000
4       dunster
2       adams

Page 11

Job 1 of 2:

JobConf grepJob = new JobConf(getConf(), Grep.class);
try {
  grepJob.setJobName("grep-search");

  FileInputFormat.setInputPaths(grepJob, args[0]);

  grepJob.setMapperClass(RegexMapper.class);
  grepJob.set("mapred.mapper.regex", args[2]);
  if (args.length == 4)
    grepJob.set("mapred.mapper.regex.group", args[3]);

  grepJob.setCombinerClass(LongSumReducer.class);
  grepJob.setReducerClass(LongSumReducer.class);

  FileOutputFormat.setOutputPath(grepJob, tempDir);
  grepJob.setOutputFormat(SequenceFileOutputFormat.class);
  grepJob.setOutputKeyClass(Text.class);
  grepJob.setOutputValueClass(LongWritable.class);

  JobClient.runJob(grepJob);
} ...

Page 12

Job 2 of 2 (implicit identity reducer):

JobConf sortJob = new JobConf(Grep.class);
sortJob.setJobName("grep-sort");

FileInputFormat.setInputPaths(sortJob, tempDir);
sortJob.setInputFormat(SequenceFileInputFormat.class);

sortJob.setMapperClass(InverseMapper.class);

// write a single file
sortJob.setNumReduceTasks(1);

FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
// sort by decreasing freq
sortJob.setOutputKeyComparatorClass(
    LongWritable.DecreasingComparator.class);

JobClient.runJob(sortJob);
} finally {
  FileSystem.get(grepJob).delete(tempDir, true);
}
return 0;
}

Page 13

The types there...

Job 1: (?, Text) → map → (Text, Long) → reduce over (Text, list(Long)) → (Text, Long)

Job 2: (Text, Long) → invert → (Long, Text)

Page 14

A Simple Join

People (keyed by Id):

  Id  Last        First
  1   Washington  George
  2   Lincoln     Abraham

Entry Log:

  Location  Id  Time
  Dunster   1   11:00am
  Dunster   2   11:02am
  Kirkland  2   11:08am

You want to track individuals throughout the day. How would you do this in M/R, if you had to?

Page 15

(white-board)
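One standard whiteboard answer is a reduce-side join. A minimal sketch against the 0.18-era API (the class names, comma-separated record layouts, and the "P:"/"L:" tags are mine, for illustration): both mappers key their records by Id, so the reducer sees each person's record next to that person's log entries.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SimpleJoin {
  // People: "1,Washington,George" → (1, "P:Washington,George")
  public static class PeopleMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable off, Text line,
                    OutputCollector<Text, Text> out, Reporter r) throws IOException {
      String[] f = line.toString().split(",");
      out.collect(new Text(f[0]), new Text("P:" + f[1] + "," + f[2]));
    }
  }

  // Entry log: "Dunster,1,11:00am" → (1, "L:Dunster,11:00am")
  public static class LogMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable off, Text line,
                    OutputCollector<Text, Text> out, Reporter r) throws IOException {
      String[] f = line.toString().split(",");
      out.collect(new Text(f[1]), new Text("L:" + f[0] + "," + f[2]));
    }
  }

  // All records for one Id meet here; emit one (person, entry) pair per log entry.
  public static class JoinReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text id, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter r) throws IOException {
      String person = null;
      List<String> entries = new ArrayList<String>();
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("P:")) person = v.substring(2);
        else entries.add(v.substring(2));
      }
      if (person == null) return;  // unmatched log entries; a real job would pick a policy
      for (String e : entries) {
        out.collect(new Text(person), new Text(e));
      }
    }
  }
}

Wiring each mapper to its own input file takes MultipleInputs in later releases, or one mapper that switches on the input path. Note also that the reducer receives values in no particular order, so a real job would secondary-sort the entries by time.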

Page 16

Your Homework

(this is the only lolcat in this lecture)

Page 17

Mental Challenges

Learn an algorithm

Adapt it to M/R Model

Practical Challenges

Learn Finicky Software

Debug an unfamiliar environment

Implement PageRank over Wikipedia Pages

Page 18

Tackle Parts Separately

Algorithm

Implementing in M/R (what are the type signatures? see the sketch after this list)

Starting a cluster on EC2

Small dataset

Large dataset

Advice
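For a hedged hint at those signatures, here is one PageRank iteration sketched against the 0.18-era API. The record layout (pageId TAB rank,outlink,outlink,...), the class names, and the 0.15/0.85 damping constants are illustrative assumptions, not the assignment's required format:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// One PageRank iteration. Input records (via KeyValueTextInputFormat):
//   pageId <TAB> rank,outlink1,outlink2,...
public class PageRankIteration {

  public static class RankMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text pageId, Text rankAndLinks,
                    OutputCollector<Text, Text> out, Reporter r) throws IOException {
      String[] parts = rankAndLinks.toString().split(",");
      double rank = Double.parseDouble(parts[0]);
      int nLinks = parts.length - 1;
      // Give each outlink its share of this page's rank (1:many).
      for (int i = 1; i < parts.length; i++) {
        out.collect(new Text(parts[i]), new Text(Double.toString(rank / nLinks)));
      }
      // Re-emit the link structure so the reducer can rebuild the record.
      int comma = rankAndLinks.toString().indexOf(',');
      String links = (comma < 0) ? "" : rankAndLinks.toString().substring(comma);
      out.collect(pageId, new Text("links:" + links));
    }
  }

  public static class RankReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text pageId, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter r) throws IOException {
      double sum = 0;
      String links = "";
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("links:")) {
          links = v.substring("links:".length());
        } else {
          sum += Double.parseDouble(v);
        }
      }
      double newRank = 0.15 + 0.85 * sum;  // damping factor assumed, not prescribed
      // Output has the same shape as the input, so iterations chain.
      out.collect(pageId, new Text(newRank + links));
    }
  }
}

Since the reducer writes records in the same shape it reads, a driver can run this job repeatedly until the ranks converge; dangling pages and rank normalization are left out of this sketch.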

Page 19

More Advice

Wealth of “Getting Started” materials online

Feel free to work together

Don’t be a perfectionist about it; data is dirty!

if (____ ≫ Java), use “streaming”
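For instance, Hadoop Streaming runs any executables as the map and reduce tasks, passing records over stdin/stdout. A sketch of the invocation, in the spirit of the earlier shell session (the streaming jar's exact location varies by release, and my_mapper.py / my_reducer.py are hypothetical scripts):

$ bin/hadoop jar contrib/streaming/hadoop-0.18.3-streaming.jar \
    -input input.txt -output output2 \
    -mapper ./my_mapper.py -reducer ./my_reducer.py \
    -file my_mapper.py -file my_reducer.py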

Page 20

Good Luck!

Page 21

Data Warehousing 101

Page 22

What is DW?

a.k.a. BI “Business Intelligence”

Provides data to support decisions

Not the operational/transactional database

e.g., answers “what has our inventory been over time?”, not “what is our inventory now?”

Page 23

Why DW?

Learn from data

Reporting

Ad-hoc analysis

e.g.: which trail mix should TJ’s discontinue? (and other important business questions)

(illustration: a page of Trader Joe’s “Fearless Flyer” ad copy)

Page 24

Traditionally...

Big databases

Schemas

Dimensional Modelling (Ralph Kimball)

Page 25

Magnetic

Agile

Deep

“MAD Skills”

MAD Skills: New Analysis Practices for Big Data

Jeffrey Cohen (Greenplum), Brian Dolan (Fox Interactive Media), Mark Dunlap (Evergreen Technologies), Joseph M. Hellerstein (U.C. Berkeley), Caleb Welton (Greenplum)

ABSTRACT

As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world’s largest advertising networks at Fox Interactive Media, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.

1. INTRODUCTION

If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. – Prof. Hal Varian, UC Berkeley, Chief Economist at Google [5]

mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills. – UrbanDictionary.com [22]

Standard business practices for large-scale data analysis center on the notion of an “Enterprise Data Warehouse” (EDW) that is queried by “Business Intelligence” (BI) software. BI tools produce reports and interactive interfaces that summarize data via basic aggregation functions (e.g., counts and averages) over various hierarchical breakdowns of the data into groups. This was the topic of significant academic research and industrial development throughout the 1990’s.

(Under revision. This version as of 3/20/2009. Contact author: Hellerstein.)

Traditionally, a carefully designed EDW is considered to have a central role in good IT practice. The design and evolution of a comprehensive EDW schema serves as the rallying point for disciplined data integration within a large enterprise, rationalizing the outputs and representations of all business processes. The resulting database serves as the repository of record for critical business functions. In addition, the database server storing the EDW has traditionally been a major computational asset, serving as the central, scalable engine for key enterprise analytics. The conceptual and computational centrality of the EDW makes it a mission-critical, expensive resource, used for serving data-intensive reports targeted at executive decision-makers. It is traditionally controlled by a dedicated IT staff that not only maintains the system, but jealously controls access to ensure that executives can rely on a high quality of service. [12]

While this orthodox EDW approach continues today in many settings, a number of factors are pushing towards a very different philosophy for large-scale data management in the enterprise. First, storage is now so cheap that small subgroups within an enterprise can develop an isolated database of astonishing scale within their discretionary budget. The world’s largest data warehouse from just over a decade ago can be stored on less than 20 commodity disks priced at under $100 today. A department can pay for 1-2 orders of magnitude more storage than that without coordinating with management. Meanwhile, the number of massive-scale data sources in an enterprise has grown remarkably: massive databases arise today even from single sources like clickstreams, software logs, email and discussion forum archives, etc. Finally, the value of data analysis has entered common culture, with numerous companies showing how sophisticated data analysis leads to cost savings and even direct revenue. The end result of these opportunities is a grassroots move to collect and leverage data in multiple organizational units. While this has many benefits in fostering efficiency and data-driven culture [14], it adds to the force of data decentralization that data warehousing is supposed to combat.

In this changed climate of widespread, large-scale data collection, there is a premium on what we dub MAD analysis skills. The acronym arises from three aspects of this environment that differ from EDW orthodoxy:

• Magnetic: Traditional EDW approaches “repel” new data sources, discouraging their incorporation until they are carefully cleansed and integrated. Given the ubiquity of data in modern organizations, a data warehouse…

Page 26

MADness is Enabling

Instrumentation

Collection

Storage (Raw Data)

ETL (Extract, Transform, Load)

RDBMS (Aggregates)

BI / Reporting

Traditional DW


Ad-hoc Queries?

Data Mining?

Page 27

Data Mining

Instrumentation

Collection

Storage (Raw Data)

ETL (Extract, Transform, Load)

RDBMS (Aggregates)

BI / Reporting

Traditional DW


Ad-hoc Queries

Page 28

Facebook’s DW (phase N)

Facebook Data Infrastructure, 2007 (diagram): a Scribe Tier and a MySQL Tier feed a Data Collection Server, which loads an Oracle Database Server.

Page 29

Facebook’s DW (phase M), M > N

Facebook Data Infrastructure, 2008 (diagram): the Scribe Tier and MySQL Tier now feed a Hadoop Tier, which loads Oracle RAC Servers.

Page 30

Short Break

Page 31

Hadoop Internals

Page 32

HDFS

Namenode

Datanodes

One Rack / A Different Rack

(diagram: block placement across the racks; a 3x64MB file at 3 replicas, a 4x64MB file at 3 replicas, and a small file at 7 replicas)
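Replication is a per-file setting, which is how the 7-replica small file above comes about; a sketch using the fs shell from the earlier examples (the path is made up):

$ bin/hadoop fs -setrep -w 7 /user/philz/small-file.txt

The -w flag waits until each block actually reaches 7 replicas.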

Page 33

HDFS Write Path

…file in the filesystem’s namespace, with no blocks associated with it. (Step 2.) The namenode performs various checks to make sure the file doesn’t already exist, and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise file creation fails and the client is thrown an IOException. The DistributedFileSystem returns a FSDataOutputStream for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.

As the client writes data (step 3.), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline—we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline. Similarly the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline. (Step 4.)

DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is only removed from the ack queue when it has been acknowledged by all the datanodes in the pipeline. (Step 5.)

If a datanode fails while data is being written to it, then the following actions are taken, which are transparent to the client writing the data. First the pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets. The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline and the remainder of the block’s data is written to the two good datanodes in the pipeline. The namenode notices that the block is under-replicated, and it arranges for a further replica to be created on another node. Subsequent blocks are then treated as normal.

It’s possible, but unlikely, that multiple datanodes fail while a block is being written. As long as dfs.replication.min replicas (default one) are written the write will succeed, and the block will be asynchronously replicated across the cluster until its target replication factor is reached (dfs.replication, which defaults to three).

Figure 3-3. A client writing data to HDFS
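From the client’s side, all of that pipelining hides behind a small API; a minimal sketch (the path is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up the cluster's site config
    FileSystem fs = FileSystem.get(conf);      // DistributedFileSystem against HDFS
    // create() kicks off the namenode checks and datanode pipeline described above.
    FSDataOutputStream out = fs.create(new Path("/user/philz/hello.txt"));
    out.writeBytes("hello, HDFS\n");
    out.close();                               // blocks until the pipeline acks
    fs.close();
  }
}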

Page 34

HDFS Failures?

Datanode crash?

Clients read another copy

Background rebalance

Namenode crash?

uh-oh

Page 35

M/R

Tasktrackers on the same machines as datanodes

One Rack / A Different Rack

(legend: job on stars / different job / idle)

Page 36

M/R

CHAPTER 6

How MapReduce Works

In this chapter we’ll look at how MapReduce in Hadoop works in detail. This knowledge provides a good foundation for writing more advanced MapReduce programs, which we will cover in the following two chapters.

Anatomy of a MapReduce Job Run

You can run a MapReduce job with a single line of code: JobClient.runJob(conf). It’s very short, but it conceals a great deal of processing behind the scenes. This section uncovers the steps Hadoop takes to run a job.

The whole process is illustrated in Figure 6-1. At the highest level there are four independent entities:

• The client, which submits the MapReduce job.

• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.

• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.

• The distributed filesystem (normally HDFS, covered in Chapter 3), which is used for sharing job files between the other entities.

Figure 6-1. How Hadoop runs a MapReduce job

Page 37

M/R Failures

Task fails

Try again?

Try again somewhere else?

Report failure

Retries possible because of idempotence
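Those retry policies are tunable per job in the 0.18-era API; a hedged sketch (MyJob stands in for your job class, and four attempts is the usual default):

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyJob.class);
conf.setMaxMapAttempts(4);               // attempts per map task before the job is failed
conf.setMaxReduceAttempts(4);            // likewise for reduce tasks
conf.setMaxTaskFailuresPerTracker(4);    // stop scheduling this job on a flaky tasktracker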

Page 38

Programming these systems...

Everything can fail

Inherently multi-threaded

Toolset still young

Mental models are different...

Page 39

Research & Hadoop

Page 40

Scheduling & Sharing

Mixed use: Batch / Interactive / Real-time

Isolation

Metrics: Latency, Throughput, Utilization (per resource)

Page 41

Scheduling

Fair and LATE Scheduling (Berkeley)

Nexus (Berkeley)

Quincy (MSR)

Page 42

Implementation

BOOM Project (Berkeley)

Overlog (Berkeley)

APPENDIX

A. NARADA IN OverLog

Here we provide an executable OverLog implementation of Narada’s mesh maintenance algorithms. Current limitations of the P2 parser and planner require slightly wordier syntax for some of our constructs. Specifically, handling of negation is still incomplete, requiring that we rewrite some rules to eliminate negation. Furthermore, our planner currently handles rules with collocated terms only. The OverLog specification below is directly parsed and executed by our current codebase.

/** Base tables */
materialize(member, infinity, infinity, keys(2)).
materialize(sequence, infinity, 1, keys(2)).
materialize(neighbor, infinity, infinity, keys(2)).

/* Environment table containing configuration values */
materialize(env, infinity, infinity, keys(2,3)).

/* Setup of configuration values */
E0 neighbor@X(X,Y) :- periodic@X(X,E,0,1), env@X(X,H,Y), H == "neighbor".

/** Start with sequence number 0 */
S0 sequence@X(X, Sequence) :- periodic@X(X, E, 0, 1), Sequence := 0.

/** Periodically start a refresh */
R1 refreshEvent@X(X) :- periodic@X(X, E, 3).

/** Increment my own sequence number */
R2 refreshSequence@X(X, NewSequence) :- refreshEvent@X(X), sequence@X(X, Sequence), NewSequence := Sequence + 1.

/** Save my incremented sequence */
R3 sequence@X(X, NewSequence) :- refreshSequence@X(X, NewSequence).

/** Send a refresh to all neighbors with my current membership */
R4 refresh@Y(Y, X, NewSequence, Address, ASequence, ALive) :- refreshSequence@X(X, NewSequence), member@X(X, Address, ASequence, Time, ALive), neighbor@X(X, Y).

/** How many member entries that match the member in a refresh message (but not myself) do I have? */
R5 membersFound@X(X, Address, ASeq, ALive, count<*>) :- refresh@X(X, Y, YSeq, Address, ASeq, ALive), member@X(X, Address, MySeq, MyTime, MyLive), X != Address.

/** If I have none, just store what I got */
R6 member@X(X, Address, ASequence, T, ALive) :- membersFound@X(X, Address, ASequence, ALive, C), C == 0, T := f_now().

/** If I have some, just update with the information I received if it has a higher sequence number. */
R7 member@X(X, Address, ASequence, T, ALive) :- membersFound@X(X, Address, ASequence, ALive, C), C > 0, T := f_now(), member@X(X, Address, MySequence, MyT, MyLive), MySequence < ASequence.

/** Update my neighbor’s member entry */
R8 member@X(X, Y, YSeq, T, YLive) :- refresh@X(X, Y, YSeq, A, AS, AL), T := f_now(), YLive := 1.

/** Add anyone from whom I receive a refresh message to my neighbors */
N1 neighbor@X(X, Y) :- refresh@X(X, Y, YS, A, AS, L).

/** Probing of neighbor liveness */
L1 neighborProbe@X(X) :- periodic@X(X, E, 1).
L2 deadNeighbor@X(X, Y) :- neighborProbe@X(X), T := f_now(), neighbor@X(X, Y), member@X(X, Y, YS, YT, L), T - YT > 20.
L3 delete neighbor@X(X, Y) :- deadNeighbor@X(X, Y).
L4 member@X(X, Neighbor, DeadSequence, T, Live) :- deadNeighbor@X(X, Neighbor), member@X(X, Neighbor, S, T1, L), Live := 0, DeadSequence := S + 1, T := f_now().

B. CHORD IN OverLog

Here we provide the full OverLog specification for Chord. This specification deals with lookups, ring maintenance with a fixed number of successors, finger-table maintenance and opportunistic finger table population, joins, stabilization, and node failure detection.

/* The base tuples */
materialize(node, infinity, 1, keys(1)).
materialize(finger, 180, 160, keys(2)).
materialize(bestSucc, infinity, 1, keys(1)).
materialize(succDist, 10, 100, keys(2)).
materialize(succ, 10, 100, keys(2)).
materialize(pred, infinity, 100, keys(1)).
materialize(succCount, infinity, 1, keys(1)).
materialize(join, 10, 5, keys(1)).
materialize(landmark, infinity, 1, keys(1)).
materialize(fFix, infinity, 160, keys(2)).
materialize(nextFingerFix, infinity, 1, keys(1)).
materialize(pingNode, 10, infinity, keys(2)).
materialize(pendingPing, 10, infinity, keys(2)).

/** Lookups */
L1 lookupResults@R(R,K,S,SI,E) :- node@NI(NI,N), lookup@NI(NI,K,R,E), bestSucc@NI(NI,S,SI), K in

Page 43

Debugging and Visualization

(figure: Swimlanes plots of per-task durations over time)

Figure 5: Summarized Swimlanes plot for RandomWriter (top) and Sort (bottom)

(figure: per-node Swimlanes plots of per-task durations over time)

Figure 6: Matrix-vector Multiplication before optimization (above), and after optimization (below)

4 Examples of Mochi’s Value

We demonstrate the use of Mochi’s visualizations (using mainly Swimlanes due to space constraints). All of the data is derived from log traces from the Yahoo! M45 [11] production cluster. The examples in § 4.1, § 4.2 involve 5-node clusters (4-slave, 1-master), and the example in § 4.3 is from a 25-node cluster. Mochi’s analysis and visualizations have run on real-world data from 300-node Hadoop production clusters, but we omit these results for lack of space; furthermore, at that scale, Mochi’s interactive visualization (zooming in/out and targeted inspection) is of more benefit, rather than a static one.

4.1 Understanding Hadoop Job Structure

Figure 5 shows the Swimlanes plots from the Sort and RandomWriter benchmark workloads (part of the Hadoop distribution), respectively. RandomWriter writes random key/value pairs to HDFS and has only Maps, while Sort reads key/value pairs in Maps, and aggregates, sorts, and outputs them in Reduces. From these visualizations, we see that RandomWriter has only Maps, while the Reduces in Sort take significantly longer than the Maps, showing most of the work occurs in the Reduces. The REP plot in Figure 4 shows that a significant fraction (~2/3) of the time along the critical paths (Cluster 5) is spent waiting for Map outputs to be shuffled to the Reduces, suggesting this is a bottleneck.

4.2 Finding Opportunities for Optimization

Figure 6 shows the Swimlanes from the Matrix-Vector Multiplication job of the HADI [12] graph-mining application for Hadoop. This workload contains two MR programs, as seen from the two batches of Maps and Reduces. Before optimization, the second node and first node do not run any Reduce in the first and second jobs respectively. The number of Reduces was then increased to twice the number of slave nodes,

Mochi (CMU)

Parallax (UW)

Page 44

Usability

Page 45

Performance

Need for benchmarks (besides GraySort)

Low-hanging fruit!

Page 46

Higher-Level Languages

Hive (a lot like SQL) (Facebook/Apache)

Pig Latin (Yahoo!/Apache)

DryadLINQ (Microsoft)

Sawzall (Google)

SCOPE (Microsoft)

JAQL (IBM)

Page 47

Optimizations

For a single query... For a single workflow... Across workflows...

Bring out last century’s DB research! (joins) And file system research too! (RAID)

HadoopDB (Yale)

Data Formats (yes, in ’09)

Page 48

New Datastore Models

File System

Bigtable, Dynamo, Cassandra, ...

Database

Page 49

New Computation Models

MPI

M/R

Online M/R

Dryad

Pregel for Graphs

Iterative ML Algorithms

Page 50

Hardware

Data Center Design (Hamilton, Barroso, Hölzle)

Energy-Efficiency

Network Topology and Hardware

What does flash mean in this context?

What about multi-core?

Larger-Scale Computing

Page 51

Synchronization, Coordination, and Consistency

Chubby, ZooKeeper, Paxos, ...

Eventual Consistency

Page 52

Applied Research (research using M/R)

“Unreasonable Effectiveness of Data”

WebTables (Cafarella)

Translation

ML...

Page 53

Conferences... (some in exotic locales)

SIGMOD

VLDB

ICDE

CIDR

HPTS

SOSP

LADIS

OSDI

SIGCOMM

HotCloud

NSDI

SC/ISC

SoCC

Others (ask a prof!)

Page 54

Parting Thoughts

Page 55

The Wheel

Don’t Re-invent

Focus on your data/problem

What about...

Reliability, Durability, Stability, Tooling

(illustration: another page of Trader Joe’s “Fearless Flyer” ad copy, with a cartoon)

Uh-oh. Looks like Joe’s been reinventing the wheel again.

“Look, there are lots of different types of wheels!” – Todd Lipcon

Re-invent!

Lots of new possibilities!

New Models! New implementations! Better optimizations!

Page 56

Conclusion

It’s a great time to be in Distributed Systems.

Participate! Build!

Collaborate!

Page 57

Questions?

[email protected]

(we’re hiring) (interns)