Hadoop Lecture for Harvard's CS 264 -- October 19, 2009
Apache Hadoop
Philip Zeyliger (Math, Dunster '04)
[email protected] · @philz42 · @cloudera
October 19, 2009 · CS 264
Who am I?
Software Engineer
Zak’s classmate
Worked at
(Interns)
Outline
Review of last Wednesday
Your Homework
Data Warehousing
Some Hadoop Internals
Research & Hadoop
Short Break
Last Wednesday
The Basics
Clusters, not individual machines
Scale Linearly
Separate App Code from Fault-Tolerant Distributed Systems Code
Systems Programmers Statisticians
Some Big Numbers
Yahoo! Hadoop Clusters: > 82PB, >25k machines (Eric14, HadoopWorld NYC ’09)
Google: 40 GB/s GFS read/write load (Jeff Dean, LADIS ’09) [~3,500 TB/day]
Facebook: 4TB new data per day; DW: 4800 cores, 5.5 PB (Dhruba Borthakur, HadoopWorld)
Physical Flow
M-R Model
Logical Flow
Logical
Physical
Important APIs
M/R Flow:
  Input Format   data → K₁,V₁
  Mapper         K₁,V₁ → K₂,V₂
  Combiner       K₂, iter(V₂) → K₂,V₂
  Partitioner    K₂,V₂ → int
  Reducer        K₂, iter(V₂) → K₃,V₃
  Out. Format    K₃,V₃ → data
Other:
  Writable, JobClient, *Context, Filesystem
→ is 1:many
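These signatures can be sketched as plain Java over in-memory collections. This is a toy model of the flow, not the Hadoop API; all names below are invented for illustration:

```java
import java.util.*;

// Toy model of the M/R type flow above:
// data -> (K1,V1) -> map -> (K2,V2) -> shuffle/group -> (K2, iter(V2))
// -> reduce -> (K3,V3). Here: word count over lines of text.
public class MiniMapReduce {

    // Mapper: K1,V1 -> many (K2,V2). Here: (line offset, line) -> (word, 1).
    static List<Map.Entry<String, Long>> map(long offset, String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.split("\\s+"))
            if (!word.isEmpty()) out.add(Map.entry(word, 1L));
        return out;
    }

    // Reducer (also usable as a combiner here): K2, iter(V2) -> V2.
    static long reduce(String key, Iterable<Long> values) {
        long sum = 0;
        for (long v : values) sum += v;
        return sum;
    }

    // Driver: run map over all records, group by key (the "shuffle"),
    // then run reduce once per key.
    public static Map<String, Long> run(List<String> lines) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Long> kv : map(offset, line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            offset += line.length() + 1;
        }
        Map<String, Long> result = new TreeMap<>();
        grouped.forEach((k, vs) -> result.put(k, reduce(k, vs)));
        return result;
    }

    public static void main(String[] args) {
        // prints {adams=1, dunster=2, winthrop=1}
        System.out.println(run(List.of("adams dunster", "dunster winthrop")));
    }
}
```

The real grep job below follows exactly this shape, with RegexMapper as the map step and LongSumReducer as both combiner and reducer.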
public int run(String[] args) throws Exception {
  if (args.length < 3) {
    System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }
  Path tempDir = new Path("grep-temp-" +
      Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
  JobConf grepJob = new JobConf(getConf(), Grep.class);
  try {
    grepJob.setJobName("grep-search");
    FileInputFormat.setInputPaths(grepJob, args[0]);
    grepJob.setMapperClass(RegexMapper.class);
    grepJob.set("mapred.mapper.regex", args[2]);
    if (args.length == 4)
      grepJob.set("mapred.mapper.regex.group", args[3]);
    grepJob.setCombinerClass(LongSumReducer.class);
    grepJob.setReducerClass(LongSumReducer.class);
    FileOutputFormat.setOutputPath(grepJob, tempDir);
    grepJob.setOutputFormat(SequenceFileOutputFormat.class);
    grepJob.setOutputKeyClass(Text.class);
    grepJob.setOutputValueClass(LongWritable.class);
    JobClient.runJob(grepJob);

    JobConf sortJob = new JobConf(Grep.class);
    sortJob.setJobName("grep-sort");
    FileInputFormat.setInputPaths(sortJob, tempDir);
    sortJob.setInputFormat(SequenceFileInputFormat.class);
    sortJob.setMapperClass(InverseMapper.class);
    // write a single file
    sortJob.setNumReduceTasks(1);
    FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
    // sort by decreasing freq
    sortJob.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class);
    JobClient.runJob(sortJob);
  } finally {
    FileSystem.get(grepJob).delete(tempDir, true);
  }
  return 0;
}
the “grep” example
$ cat input.txt
adams dunster kirkland dunster
kirland dudley dunster
adams dunster winthrop

$ bin/hadoop jar hadoop-0.18.3-examples.jar grep input.txt output1 'dunster|adams'

$ cat output1/part-00000
4       dunster
2       adams
JobConf grepJob = new JobConf(getConf(), Grep.class);
try {
  grepJob.setJobName("grep-search");
  FileInputFormat.setInputPaths(grepJob, args[0]);
  grepJob.setMapperClass(RegexMapper.class);
  grepJob.set("mapred.mapper.regex", args[2]);
  if (args.length == 4)
    grepJob.set("mapred.mapper.regex.group", args[3]);
  grepJob.setCombinerClass(LongSumReducer.class);
  grepJob.setReducerClass(LongSumReducer.class);
  FileOutputFormat.setOutputPath(grepJob, tempDir);
  grepJob.setOutputFormat(SequenceFileOutputFormat.class);
  grepJob.setOutputKeyClass(Text.class);
  grepJob.setOutputValueClass(LongWritable.class);
  JobClient.runJob(grepJob);
} ...
Job 1 of 2
JobConf sortJob = new JobConf(Grep.class);
sortJob.setJobName("grep-sort");
FileInputFormat.setInputPaths(sortJob, tempDir);
sortJob.setInputFormat(SequenceFileInputFormat.class);
sortJob.setMapperClass(InverseMapper.class);
// write a single file
sortJob.setNumReduceTasks(1);
FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
// sort by decreasing freq
sortJob.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class);
JobClient.runJob(sortJob);
} finally {
  FileSystem.get(grepJob).delete(tempDir, true);
}
return 0;
}
Job 2 of 2
(implicit identity reducer)
The types there...
?, Text
Text, Long
Long, Text
Text, list(Long)
Text, Long
A Simple Join

Id  Last        First
1   Washington  George
2   Lincoln     Abraham

Location  Id  Time
Dunster   1   11:00am
Dunster   2   11:02am
Kirkland  2   11:08am

You want to track individuals throughout the day.
How would you do this in M/R, if you had to?
People
Key
Entry Log
(white-board)
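One common whiteboard answer is a reduce-side join: map both tables to the person Id as the key, tagging each record with its source table, then have the reducer pair the name with the log entries. Sketched below as a toy in-memory model (invented names, not Hadoop code):

```java
import java.util.*;

// Toy reduce-side join: emit (id, tagged record) from both tables,
// group by id (the shuffle), then pair each person's single "name"
// record with all of that person's "log" records in the reducer.
public class ReduceSideJoin {
    public static Map<Integer, List<String>> join(
            Map<Integer, String> people,                 // Id -> "Last First"
            List<Map.Entry<Integer, String>> log) {      // (Id, "Location Time")
        Map<Integer, List<String>> grouped = new TreeMap<>();
        // "Map" over the People table: tag with "N:" (name).
        people.forEach((id, name) ->
            grouped.computeIfAbsent(id, k -> new ArrayList<>()).add("N:" + name));
        // "Map" over the Location log: tag with "L:" (log entry).
        for (Map.Entry<Integer, String> e : log)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .add("L:" + e.getValue());
        // "Reduce": for each id, join the name onto every log entry.
        Map<Integer, List<String>> out = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> e : grouped.entrySet()) {
            String name = null;
            List<String> entries = new ArrayList<>();
            for (String v : e.getValue()) {
                if (v.startsWith("N:")) name = v.substring(2);
                else entries.add(v.substring(2));
            }
            for (String entry : entries)
                out.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .add(name + " @ " + entry);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> people = Map.of(
            1, "Washington George", 2, "Lincoln Abraham");
        List<Map.Entry<Integer, String>> log = List.of(
            Map.entry(1, "Dunster 11:00am"),
            Map.entry(2, "Dunster 11:02am"),
            Map.entry(2, "Kirkland 11:08am"));
        System.out.println(join(people, log));
    }
}
```

In real Hadoop the tagging is usually done with a composite key or a tagged Writable value, but the shuffle-groups-by-key trick is the same.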
Your Homework
(this is the only lolcat in this lecture)
Mental Challenges
Learn an algorithm
Adapt it to M/R Model
Practical Challenges
Learn Finicky Software
Debug an unfamiliar environment
Implement PageRank over Wikipedia Pages
Tackle Parts Separately
Algorithm
Implementing in M/R (What are the type signatures?)
Starting a cluster on EC2
Small dataset
Large dataset
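For the "what are the type signatures?" question, here is one hedged sketch of a single PageRank iteration in the same toy in-memory style (invented names; a real Hadoop job would use Writables and also re-emit each page's adjacency list so the graph survives into the next iteration):

```java
import java.util.*;

// One PageRank iteration as a map/reduce pass (toy, in-memory).
// Map:    (page, outlinks) with current rank -> (outlink, rank/|outlinks|)
// Reduce: (page, iter(contributions))       -> (page, 0.15 + 0.85 * sum)
public class PageRankStep {
    public static Map<String, Double> step(Map<String, List<String>> links,
                                           Map<String, Double> ranks) {
        Map<String, Double> contrib = new TreeMap<>();      // "shuffle" + sum
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            double share = ranks.get(e.getKey()) / e.getValue().size();
            for (String target : e.getValue())
                contrib.merge(target, share, Double::sum);
        }
        Map<String, Double> next = new TreeMap<>();         // "reduce"
        for (String page : links.keySet())
            next.put(page, 0.15 + 0.85 * contrib.getOrDefault(page, 0.0));
        return next;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "A", List.of("B", "C"), "B", List.of("C"), "C", List.of("A"));
        System.out.println(step(links, Map.of("A", 1.0, "B", 1.0, "C", 1.0)));
        // B receives 1.0/2 from A, so its new rank is 0.15 + 0.85*0.5 = 0.575
    }
}
```

Iterating means chaining jobs: each iteration's output becomes the next iteration's input.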
Advice
More Advice
Wealth of “Getting Started” materials online
Feel free to work together
Don’t be a perfectionist about it; data is dirty!
if (____ ≫ Java), use “streaming”
Good Luck!
Data Warehousing 101
What is DW?
a.k.a. BI “Business Intelligence”
Provides data to support decisions
Not the operational/transactional database
e.g., answers “what has our inventory been over time?”, not “what is our inventory now?”
Why DW?
Learn from data
Reporting
Ad-hoc analysis
e.g.: which trail mix should TJ’s discontinue? (and other important business questions)
[Trader Joe's Fearless Flyer page, shown as an example of "important business questions": product blurbs for Blueberry Blast Yogurt Candies, Chocolate Crisps, Mini Peanut Butter Cups, Chocolate Chip Granola Bars, and Cherry Cider.]
Traditionally...
Big databases
Schemas
Dimensional Modelling (Ralph Kimball)
Magnetic
Agile
Deep
“MAD Skills”
MAD Skills: New Analysis Practices for Big Data
Jeffrey Cohen (Greenplum), Brian Dolan (Fox Interactive Media), Mark Dunlap (Evergreen Technologies), Joseph M. Hellerstein (U.C. Berkeley), Caleb Welton (Greenplum)

ABSTRACT
As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world's largest advertising networks at Fox Interactive Media, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.
1. INTRODUCTION
If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what's getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.
– Prof. Hal Varian, UC Berkeley, Chief Economist at Google [5]

mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.
– UrbanDictionary.com [22]

Standard business practices for large-scale data analysis center on the notion of an "Enterprise Data Warehouse" (EDW) that is queried by "Business Intelligence" (BI) software. BI tools produce reports and interactive interfaces that summarize data via basic aggregation functions (e.g., counts and averages) over various hierarchical breakdowns of the data into groups. This was the topic of significant academic research and industrial development throughout the 1990's.

(Under revision. This version as of 3/20/2009. Contact author: Hellerstein.)
Traditionally, a carefully designed EDW is considered to have a central role in good IT practice. The design and evolution of a comprehensive EDW schema serves as the rallying point for disciplined data integration within a large enterprise, rationalizing the outputs and representations of all business processes. The resulting database serves as the repository of record for critical business functions. In addition, the database server storing the EDW has traditionally been a major computational asset, serving as the central, scalable engine for key enterprise analytics. The conceptual and computational centrality of the EDW makes it a mission-critical, expensive resource, used for serving data-intensive reports targeted at executive decision-makers. It is traditionally controlled by a dedicated IT staff that not only maintains the system, but jealously controls access to ensure that executives can rely on a high quality of service. [12]

While this orthodox EDW approach continues today in many settings, a number of factors are pushing towards a very different philosophy for large-scale data management in the enterprise. First, storage is now so cheap that small subgroups within an enterprise can develop an isolated database of astonishing scale within their discretionary budget. The world's largest data warehouse from just over a decade ago can be stored on less than 20 commodity disks priced at under $100 today. A department can pay for 1-2 orders of magnitude more storage than that without coordinating with management. Meanwhile, the number of massive-scale data sources in an enterprise has grown remarkably: massive databases arise today even from single sources like clickstreams, software logs, email and discussion forum archives, etc. Finally, the value of data analysis has entered common culture, with numerous companies showing how sophisticated data analysis leads to cost savings and even direct revenue. The end result of these opportunities is a grassroots move to collect and leverage data in multiple organizational units. While this has many benefits in fostering efficiency and data-driven culture [14], it adds to the force of data decentralization that data warehousing is supposed to combat.

In this changed climate of widespread, large-scale data collection, there is a premium on what we dub MAD analysis skills. The acronym arises from three aspects of this environment that differ from EDW orthodoxy:

• Magnetic: Traditional EDW approaches "repel" new data sources, discouraging their incorporation until they are carefully cleansed and integrated. Given the ubiquity of data in modern organizations, a data ware-
MADness is Enabling
Instrumentation
Collection
Storage (Raw Data)
ETL (Extraction, Transform, Load)
RDBMS (Aggregates)
BI / Reporting
Traditional DW
}
Ad-hoc Queries?
Data Mining?
Data Mining
Instrumentation
Collection
Storage (Raw Data)
ETL (Extraction, Transform, Load)
RDBMS (Aggregates)
BI / Reporting
Traditional DW
}
Ad-hoc Queries
Facebook’s DW (phase N)
Facebook Data Infrastructure, 2007
Oracle Database Server
Data Collection Server
MySQL Tier · Scribe Tier
Wednesday, April 1, 2009
Facebook's DW (phase M), M > N
Facebook Data Infrastructure, 2008
MySQL Tier · Scribe Tier
Hadoop Tier
Oracle RAC Servers
Short Break
Hadoop Internals
HDFS
Namenode
Datanodes
One Rack A Different Rack
3×64MB file, 3 replicas
4×64MB file, 3 replicas
Small file, 7 replicas
HDFS Write Path
...file in the filesystem's namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn't already exist, and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise file creation fails and the client is thrown an IOException. The DistributedFileSystem returns a FSDataOutputStream for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.

As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline; we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline (step 4).

DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is only removed from the ack queue when it has been acknowledged by all the datanodes in the pipeline (step 5).

If a datanode fails while data is being written to it, then the following actions are taken, which are transparent to the client writing the data. First the pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets. The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline and the remainder of the block's data is written to the two good datanodes in the pipeline. The namenode notices that the block is under-replicated, and it arranges for a further replica to be created on another node. Subsequent blocks are then treated as normal.

It's possible, but unlikely, that multiple datanodes fail while a block is being written. As long as dfs.replication.min replicas (default one) are written, the write will succeed, and the block will be asynchronously replicated across the cluster until its target replication factor is reached (dfs.replication, which defaults to three).
Figure 3-3. A client writing data to HDFS
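The data queue / ack queue mechanics described above can be sketched as a toy simulation (invented names; the real DFSOutputStream is asynchronous and far more involved):

```java
import java.util.*;

// Toy model of the HDFS write path's two queues: packets move
// data queue -> pipeline -> ack queue; on a datanode failure the
// unacked packets go back to the front of the data queue and the
// pipeline shrinks by the failed node.
public class WritePipelineSketch {
    public static List<String> write(List<String> packets,
                                     List<String> pipeline,
                                     String failAfterPacket) {
        Deque<String> dataQueue = new ArrayDeque<>(packets);
        Deque<String> ackQueue = new ArrayDeque<>();
        List<String> datanodes = new ArrayList<>(pipeline);
        List<String> log = new ArrayList<>();
        while (!dataQueue.isEmpty()) {
            String p = dataQueue.poll();
            ackQueue.add(p);                                // await acks
            for (String dn : datanodes) log.add(dn + " stores " + p);
            if (p.equals(failAfterPacket) && datanodes.size() > 1) {
                // Failure: drop the failed node, requeue unacked packets.
                String failed = datanodes.remove(datanodes.size() - 1);
                log.add(failed + " FAILED; re-queueing " + ackQueue);
                while (!ackQueue.isEmpty())
                    dataQueue.addFirst(ackQueue.pollLast());
                failAfterPacket = null;                     // fail only once
                continue;
            }
            ackQueue.remove(p);                             // fully acked
        }
        return log;
    }

    public static void main(String[] args) {
        write(List.of("pkt1", "pkt2"), List.of("dn1", "dn2", "dn3"), "pkt1")
            .forEach(System.out::println);
    }
}
```

The write still succeeds with the shrunken pipeline; in real HDFS the namenode later re-replicates the under-replicated block.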
HDFS Failures?
Datanode crash?
Clients read another copy
Background rebalance
Namenode crash?
uh-oh
M/R
Tasktrackers on the same machines as datanodes
One Rack A Different Rack
Job on stars · Different job · Idle
M/R
CHAPTER 6
How MapReduce Works
In this chapter we'll look at how MapReduce in Hadoop works in detail. This knowledge provides a good foundation for writing more advanced MapReduce programs, which we will cover in the following two chapters.

Anatomy of a MapReduce Job Run
You can run a MapReduce job with a single line of code: JobClient.runJob(conf). It's very short, but it conceals a great deal of processing behind the scenes. This section uncovers the steps Hadoop takes to run a job.

The whole process is illustrated in Figure 6-1. At the highest level there are four independent entities:

• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
• The distributed filesystem (normally HDFS, covered in Chapter 3), which is used for sharing job files between the other entities.
Figure 6-1. How Hadoop runs a MapReduce job
Task fails
Try again?
Try again somewhere else?
Report failure
Retries possible because of idempotence
M/R Failures
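Because task attempts are idempotent, the framework can simply re-run a failed task, possibly on another machine. A generic sketch of that retry loop (hypothetical helper, not Hadoop's actual scheduler):

```java
import java.util.function.Supplier;

// Retrying is safe only because a task attempt is idempotent: each
// attempt writes to its own attempt-scoped output, and only a
// successful attempt gets committed, so running it twice changes nothing.
public class RetryingRunner {
    public static <T> T runWithRetries(Supplier<T> task, int maxAttempts) {
        RuntimeException last = new IllegalArgumentException("maxAttempts < 1");
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();   // in Hadoop: run on some tasktracker
            } catch (RuntimeException e) {
                last = e;            // try again, ideally somewhere else
            }
        }
        throw last;                  // report failure: the whole job fails
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        int v = runWithRetries(() -> {
            if (++attempts[0] < 3) throw new RuntimeException("flaky task");
            return 42;
        }, 4);
        System.out.println(v + " after " + attempts[0] + " attempts");
    }
}
```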
Programming these systems...
Everything can fail
Inherently multi-threaded
Toolset still young
Mental models are different...
Research &Hadoop
Scheduling & Sharing
Mixed use: Batch, Interactive, Real-time
Isolation
Metrics: Latency, Throughput, Utilization (per resource)
Scheduling
Fair and LATE Scheduling (Berkeley)
Nexus (Berkeley)
Quincy (MSR)
Implementation
BOOM Project (Berkeley)
Overlog (Berkeley)
APPENDIX
A. NARADA IN OverLog
Here we provide an executable OverLog implementation of Narada's mesh maintenance algorithms. Current limitations of the P2 parser and planner require slightly wordier syntax for some of our constructs. Specifically, handling of negation is still incomplete, requiring that we rewrite some rules to eliminate negation. Furthermore, our planner currently handles rules with collocated terms only. The OverLog specification below is directly parsed and executed by our current codebase.
/** Base tables */
materialize(member, infinity, infinity, keys(2)).
materialize(sequence, infinity, 1, keys(2)).
materialize(neighbor, infinity, infinity, keys(2)).

/* Environment table containing configuration values */
materialize(env, infinity, infinity, keys(2,3)).

/* Setup of configuration values */
E0 neighbor@X(X, Y) :- periodic@X(X, E, 0, 1), env@X(X, H, Y), H == "neighbor".

/** Start with sequence number 0 */
S0 sequence@X(X, Sequence) :- periodic@X(X, E, 0, 1), Sequence := 0.

/** Periodically start a refresh */
R1 refreshEvent@X(X) :- periodic@X(X, E, 3).

/** Increment my own sequence number */
R2 refreshSequence@X(X, NewSequence) :- refreshEvent@X(X), sequence@X(X, Sequence), NewSequence := Sequence + 1.

/** Save my incremented sequence */
R3 sequence@X(X, NewSequence) :- refreshSequence@X(X, NewSequence).

/** Send a refresh to all neighbors with my current membership */
R4 refresh@Y(Y, X, NewSequence, Address, ASequence, ALive) :- refreshSequence@X(X, NewSequence), member@X(X, Address, ASequence, Time, ALive), neighbor@X(X, Y).

/** How many member entries that match the member in a refresh message (but not myself) do I have? */
R5 membersFound@X(X, Address, ASeq, ALive, count<*>) :- refresh@X(X, Y, YSeq, Address, ASeq, ALive), member@X(X, Address, MySeq, MyTime, MyLive), X != Address.

/** If I have none, just store what I got */
R6 member@X(X, Address, ASequence, T, ALive) :- membersFound@X(X, Address, ASequence, ALive, C), C == 0, T := f_now().

/** If I have some, just update with the information I received if it has a higher sequence number. */
R7 member@X(X, Address, ASequence, T, ALive) :- membersFound@X(X, Address, ASequence, ALive, C), C > 0, T := f_now(), member@X(X, Address, MySequence, MyT, MyLive), MySequence < ASequence.

/** Update my neighbor's member entry */
R8 member@X(X, Y, YSeq, T, YLive) :- refresh@X(X, Y, YSeq, A, AS, AL), T := f_now(), YLive := 1.

/** Add anyone from whom I receive a refresh message to my neighbors */
N1 neighbor@X(X, Y) :- refresh@X(X, Y, YS, A, AS, L).

/** Probing of neighbor liveness */
L1 neighborProbe@X(X) :- periodic@X(X, E, 1).
L2 deadNeighbor@X(X, Y) :- neighborProbe@X(X), T := f_now(), neighbor@X(X, Y), member@X(X, Y, YS, YT, L), T - YT > 20.
L3 delete neighbor@X(X, Y) :- deadNeighbor@X(X, Y).
L4 member@X(X, Neighbor, DeadSequence, T, Live) :- deadNeighbor@X(X, Neighbor), member@X(X, Neighbor, S, T1, L), Live := 0, DeadSequence := S + 1, T := f_now().
B. CHORD IN OverLog
Here we provide the full OverLog specification for Chord. This specification deals with lookups, ring maintenance with a fixed number of successors, finger-table maintenance and opportunistic finger table population, joins, stabilization, and node failure detection.
/* The base tuples */
materialize(node, infinity, 1, keys(1)).
materialize(finger, 180, 160, keys(2)).
materialize(bestSucc, infinity, 1, keys(1)).
materialize(succDist, 10, 100, keys(2)).
materialize(succ, 10, 100, keys(2)).
materialize(pred, infinity, 100, keys(1)).
materialize(succCount, infinity, 1, keys(1)).
materialize(join, 10, 5, keys(1)).
materialize(landmark, infinity, 1, keys(1)).
materialize(fFix, infinity, 160, keys(2)).
materialize(nextFingerFix, infinity, 1, keys(1)).
materialize(pingNode, 10, infinity, keys(2)).
materialize(pendingPing, 10, infinity, keys(2)).
/** Lookups */
L1 lookupResults@R(R, K, S, SI, E) :- node@NI(NI, N), lookup@NI(NI, K, R, E), bestSucc@NI(NI, S, SI), K in
Debugging and Visualization
[Figure 5: Summarized Swimlanes plot for RandomWriter (top) and Sort (bottom). Task durations vs. time (s), one lane per task, all nodes: RandomWriter (100GB written, 4 hosts) shows JT_Map tasks only; Sort (20GB input, 4 hosts) shows JT_Map and JT_Reduce tasks.]
[Figure 6: Matrix-vector Multiplication before optimization (above), and after optimization (below). Per-node Swimlanes of task durations vs. time (s) for the Matrix-Vec Multiply job with an inefficient vs. efficient number of reducers; JT_Map and JT_Reduce tasks shown for each node.]
4 Examples of Mochi’s Value
We demonstrate the use of Mochi's visualizations (using mainly Swimlanes due to space constraints). All of the data is derived from log traces from the Yahoo! M45 [11] production cluster. The examples in § 4.1, § 4.2 involve 5-node clusters (4-slave, 1-master), and the example in § 4.3 is from a 25-node cluster. Mochi's analysis and visualizations have run on real-world data from 300-node Hadoop production clusters, but we omit these results for lack of space; furthermore, at that scale, Mochi's interactive visualization (zooming in/out and targeted inspection) is of more benefit, rather than a static one.
4.1 Understanding Hadoop Job Structure
Figure 5 shows the Swimlanes plots from the Sort and RandomWriter benchmark workloads (part of the Hadoop distribution), respectively. RandomWriter writes random key/value pairs to HDFS and has only Maps, while Sort reads key/value pairs in Maps, and aggregates, sorts, and outputs them in Reduces. From these visualizations, we see that RandomWriter has only Maps, while the Reduces in Sort take significantly longer than the Maps, showing most of the work occurs in the Reduces. The REP plot in Figure 4 shows that a significant fraction (≈ 2/3) of the time along the critical paths (Cluster 5) is spent waiting for Map outputs to be shuffled to the Reduces, suggesting this is a bottleneck.
4.2 Finding Opportunities for Optimization
Figure 6 shows the Swimlanes from the Matrix-Vector Multiplication job of the HADI [12] graph-mining application for Hadoop. This workload contains two MR programs, as seen from the two batches of Maps and Reduces. Before optimization, the second node and first node do not run any Reduce in the first and second jobs respectively. The number of Reduces was then increased to twice the number of slave nodes,
Mochi (CMU)
Parallax (UW)
Usability
Performance
Need for benchmarks (besides GraySort)
Low-hanging fruit!
Higher-Level Languages
Hive (a lot like SQL) (Facebook/Apache)
Pig Latin (Yahoo!/Apache)
DryadLINQ (Microsoft)
Sawzall (Google)
SCOPE (Microsoft)
JAQL (IBM)
Optimizations
For a single query... For a single workflow... Across workflows...
Bring out last century’s DB research! (joins) And file system research too! (RAID)
HadoopDB (Yale)
Data Formats (yes, in ’09)
New Datastore Models
File System
Bigtable, Dynamo, Cassandra, ...
Database
New Computation Models
MPI
M/R
Online M/R
Dryad
Pregel for Graphs
Iterative ML Algorithms
Hardware
Data Center Design (Hamilton, Barroso, Hölzle)
Energy-Efficiency
Network Topology and Hardware
What does flash mean in this context?
What about multi-core?
Larger-Scale Computing
Synchronization, Coordination, and Consistency
Chubby, ZooKeeper, Paxos, ...
Eventual Consistency
Applied Research (research using M/R)
"Unreasonable Effectiveness of Data"
WebTables (Cafarella)
Translation
ML...
Conferences... (some in exotic locales)
SIGMOD
VLDB
ICDE
CIDR
HPTS
SOSP
LADIS
OSDI
SIGCOMM
HotCloud
NSDI
SC/ISC
SoCC
Others (ask a prof!)
Parting Thoughts
The Wheel
Don’t Re-invent
Focus on your data/problem
What about...
Reliability, Durability, Stability, Tooling
[Trader Joe's Fearless Flyer page: product blurbs for Sesame Seaweed Rice Balls, Baby Swiss Cheese, "The Original" Honey Roasted Peanuts, and Baker Josef's Flour.]
Uh-oh. Looks like Joe’s been reinventing the wheel again.
“Look, there are lots of different types of wheels!” – Todd Lipcon
Re-invent!
Lots of new possibilities!
New Models! New implementations! Better optimizations!
Conclusion
It’s a great time to be in Distributed Systems.
Participate! Build!
Collaborate!