MapReduce with Hadoop at MyLife
June 6, 2013
Speaker: Jeff Meister
Topics of Talk
• What are MapReduce and Hadoop?
• When would you want to use them?
• How do they work?
• What does Hadoop do for you?
• How do you write MapReduce programs to take advantage of that?
• What do we use them for at MyLife?
What are MapReduce and Hadoop?
• MapReduce is a programming model for parallel processing of large datasets
• An idea for how to write programs under certain constraints
• Hadoop is an open-source implementation of MapReduce
• Designed for clusters of commodity machines
Motivation: Why would you use MapReduce?
Background: Disk vs. Memory
• Memory
• Where the computer keeps data it’s currently working on
• Fast response time, random access supported
• Expensive: typical size in tens of GB
• Hard disk
• More permanent storage of data for future tasks
• Slow response time; far better at sequential access than random access
• Cheap: typical size in hundreds or thousands of GB
Example Task on Small Datasets
Public records (8 MB):

  ID      Public record
  R1      Steve Jones, 36, 12 Main St, 10001
  R2      John Brown, 72, 625 8th Ave, 90210
  R3      James Davis, 23, 10 Broadway, 20202
  R4      Tom Lewis, 45, 95 Park Pl, 90024
  R5      Tim Harris, 33, PO Box 256, 33514
  ...     ...
  R2000   Adam Parker, 59, 82 F St, 45454

Phone records (3.5 MB):

  ID      Phone number
  P1      Robert White, 45121, (654) 321-4702
  P2      David Johnson, 07470, (973) 602-2519
  P3      Scott Lee, 23910, (602) 412-2255
  P4      Steve Jones, 10001, (212) 347-3380
  P5      John Wayne, 13284, (312) 446-8878
  ...     ...
  P1000   Tom Lewis, 90024, (650) 945-2319
Real World: Large Datasets
• 290 million public records = 380 GB
• 228 million phone records = 252 GB
• We could improve the previous algorithm, but...
• The machine doesn’t have enough memory
• Would spend lots of time moving pieces of data between disk and memory
• Disk is so slow, the task is now impractical
• What to do? Use Hadoop MapReduce!
• Divide into smaller tasks, run them in parallel
Hadoop: What does it do?
How do you work with it?
Components of the Hadoop System
• Hadoop Distributed File System (HDFS)
• Splits up files into blocks, stores them on multiple computers
• Knows which blocks are on each machine
• Transfers blocks between machines over the network
• Replicates blocks, designed to tolerate frequent machine failures
• MapReduce engine
• Supports distributed computation
• Programmer writes Map and Reduce functions
• Engine takes care of parallelization, so you can focus on your work
The Map and Reduce Functions
• map : (K1, V1) → List(K2, V2)
• Take an input record and produce (emit) a list of intermediate (key, value) pairs
• reduce : (K2, List(V2)) → List(K3, V3)
• Examine the values for each intermediate key, produce a list of output records
• Critical observation: output type of map ≠ input type of reduce!
• What’s going on in between?
The “Magic”: A Fast Parallel Sort
• The core of Hadoop MapReduce is a distributed parallel sorting algorithm
• Hadoop guarantees that the input to each reducer is sorted by key (K2)
• All the (K2, V2) pairs from the mappers are grouped by key
• The reducer gets a list of values corresponding to each key
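The sort-and-group step between map and reduce can be sketched in a few lines of plain Python. This is a single-process illustration of the idea, not Hadoop's actual implementation, and the function name `shuffle` is just a label here:

```python
from itertools import groupby
from operator import itemgetter

def shuffle(mapper_output):
    """Group (key, value) pairs by key, as Hadoop's sort phase does.

    mapper_output: (K2, V2) pairs emitted by the mappers, in any order.
    Returns (K2, [V2, ...]) pairs, sorted by key.
    """
    pairs = sorted(mapper_output, key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=itemgetter(0))]

# Pairs from several mappers, arriving in arbitrary order:
emitted = [("jones", "R1"), ("lewis", "R4"), ("jones", "P4"), ("lewis", "P1000")]
print(shuffle(emitted))
# → [('jones', ['R1', 'P4']), ('lewis', ['R4', 'P1000'])]
# Each reducer now sees one key together with all of its values.
```

Hadoop does the same grouping, but the pairs are partitioned across machines by key and each partition is sorted and merged in parallel.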
Why Is It Fast?
• Imagine how you might sort a deck of cards
• The most intuitive procedure for humans is very inefficient for computers
• Turns out the best algorithm, merge sort, is less straightforward
• Split the data up into smaller pieces, sort the pieces individually, then merge them
• Hadoop is using HDFS to do a giant parallel merge sort over its cluster
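The split-sort-merge idea can be shown in one process (a sketch only; on a real cluster the pieces would be sorted on different machines, and all names here are illustrative):

```python
from heapq import merge

def distributed_sort(data, num_pieces=4):
    """Split data into pieces, sort each piece independently, then merge.

    This mirrors the merge-sort structure Hadoop parallelizes: independent
    sorts of small pieces, followed by a k-way merge of the sorted runs.
    """
    size = max(1, len(data) // num_pieces)
    pieces = [sorted(data[i:i + size]) for i in range(0, len(data), size)]
    return list(merge(*pieces))  # k-way merge of the sorted runs

print(distributed_sort([9, 3, 7, 1, 8, 2, 6, 4]))  # → [1, 2, 3, 4, 6, 7, 8, 9]
```

The independent piece-sorts are the part that parallelizes perfectly; only the final merges need coordination.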
Example Task with MapReduce
• map : (source_id, record) → List(match_key, source_id)
• For each input record, select the fields to match by, make a key out of them
• Use the record’s unique identifier as the value
• reduce : (match_key, List(source_id)) → List(public_record_id, phone_id)
• For each match key, look through the list of unique IDs
• If we find both a public record ID and a phone ID in the same list, match!
• The profiles with these IDs share all fields in the key
• Generate the output pair of matched IDs
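A minimal single-process sketch of this matching job follows. The record format and the choice of match fields (name plus ZIP code) are illustrative assumptions, and the driver loop stands in for Hadoop's shuffle:

```python
from collections import defaultdict

def map_fn(source_id, record):
    """Emit (match_key, source_id). The match key here is name + ZIP;
    a real job might match on different or more fields."""
    yield ((record["name"].lower(), record["zip"]), source_id)

def reduce_fn(match_key, source_ids):
    """If a public record ID (R*) and a phone ID (P*) share a key, match."""
    r_ids = [s for s in source_ids if s.startswith("R")]
    p_ids = [s for s in source_ids if s.startswith("P")]
    for r in r_ids:
        for p in p_ids:
            yield (r, p)

# Tiny driver simulating map -> shuffle -> reduce on sample records:
records = {
    "R1": {"name": "Steve Jones", "zip": "10001"},
    "R4": {"name": "Tom Lewis", "zip": "90024"},
    "P4": {"name": "Steve Jones", "zip": "10001"},
    "P1000": {"name": "Tom Lewis", "zip": "90024"},
}
groups = defaultdict(list)
for sid, rec in records.items():
    for key, value in map_fn(sid, rec):
        groups[key].append(value)
matches = [pair for key, ids in groups.items() for pair in reduce_fn(key, ids)]
print(sorted(matches))  # → [('R1', 'P4'), ('R4', 'P1000')]
```

Because the grouping puts every record with the same match key in front of one reducer call, each reducer only ever examines a handful of candidate IDs, no matter how large the full datasets are.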
When is MapReduce Appropriate?
• To benefit from using Hadoop:
• The data must be decomposable into many (key, value) pairs
• Each mapper runs the same operation, independently of other mappers
• Map output keys should divide the values into groups of roughly similar size (a few very hot keys make their reducers stragglers)
• Straightforward sequential algorithms may need significant redesign to fit the MapReduce model
Common Applications of MapReduce
• Many common distributed tasks are easily expressible with MapReduce. A few examples:
• Term frequency counting
• Pattern searching
• Of course, sorting
• Graph algorithms, such as reversal (Web links)
• Inverted index generation
• Data mining (clustering, statistics)
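The first item above, term frequency counting, is the classic MapReduce demo. A minimal single-process sketch (the `run_job` driver stands in for Hadoop's shuffle; all names are illustrative):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Emit (word, 1) for every word in the document."""
    for word in text.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Sum the counts for one word."""
    yield (word, sum(counts))

def run_job(docs):
    """Minimal single-process simulation of map -> shuffle -> reduce."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for k, v in map_fn(doc_id, text):
            groups[k].append(v)
    return dict(kv for k, vs in sorted(groups.items()) for kv in reduce_fn(k, vs))

print(run_job({"d1": "to be or not to be", "d2": "to do"}))
# → {'be': 2, 'do': 1, 'not': 1, 'or': 1, 'to': 3}
```

The other applications listed follow the same shape: pick a key that brings related items together (a search term, a graph destination node, a cluster candidate), then let the reducer summarize each group.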
MapReduce at MyLife
Applications of MapReduce at MyLife
• We regularly run computations over large sets of people data
• Who’s Searching For You
• Content-based aggregation pipeline (1.5 TB)
• Deltas of licensed data updates (300 GB)
• Generating search indexes for old platform
• Various ad hoc jobs involving matching, searching, extraction, counting, de-duplication, and more
Hadoop Cluster Specifications
• Currently 63 machines, each configured to run 4 or 6 map or reduce tasks at once (total capacity 296)
• CPU:
• Each machine: 2x quad-core Opteron @ 2.2 GHz
• Memory:
• Each machine: 32 GB
• Cluster total: 2 TB
• Hard disk:
• Each machine: between 3 and 9 TB
• Total HDFS capacity: 345 TB
Other Companies Using Hadoop
• Yahoo! - Index calculations for Web search
• Facebook - Analytics and machine learning
• World’s largest Hadoop cluster!
• Amazon - Supports Hadoop on EC2/S3 cloud services
• LinkedIn - People You May Know, Viewers of This Profile Also Viewed
• Apple - Used in iAds platform
• Twitter - Data warehousing and analytics
• Lots more... http://wiki.apache.org/hadoop/PoweredBy
Further Reading
• Google research papers
• Google File System, SOSP 2003
• MapReduce, OSDI 2004
• BigTable, OSDI 2006
• Hadoop manual: http://hadoop.apache.org/
• Other Hadoop-related projects from Apache: Cassandra, HBase, Hive, Pig