Introduction to MapReduce and Hadoop
IT 332 Distributed Systems
What is MapReduce?
Data-parallel programming model for clusters of commodity machines.
Pioneered by Google, which uses it to process 20 PB of data per day.
Popularized by the open-source Hadoop project; used by Yahoo!, Facebook, Amazon, …
What is MapReduce used for?
At Google: index building for Google Search; article clustering for Google News; statistical machine translation.
At Yahoo!: index building for Yahoo! Search; spam detection for Yahoo! Mail.
At Facebook: data mining; ad optimization; spam detection.
What is MapReduce used for?
In research: analyzing Wikipedia conflicts (PARC); natural language processing (CMU); bioinformatics (Maryland); astronomical image analysis (Washington); ocean climate simulation (Washington); <your application here>.
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive
MapReduce Design Goals
1. Scalability to large data volumes: scanning 100 TB on 1 node at 50 MB/s takes about 23 days; scanning on a 1000-node cluster takes about 33 minutes (quick check below).
2. Cost-efficiency: commodity nodes (cheap, but unreliable); commodity network; automatic fault-tolerance (fewer admins); easy to use (fewer programmers).
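The scan-time figures above are back-of-envelope arithmetic; here is a quick sketch verifying them, assuming decimal units and perfect linear speedup:

data = 100 * 10**12              # 100 TB in bytes
rate = 50 * 10**6                # 50 MB/s sequential scan per node
print(data / rate / 86400)       # ~23.1 days on one node
print(data / rate / 1000 / 60)   # ~33.3 minutes spread over 1000 nodes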
Typical Hadoop Cluster
Aggregation switch
Rack switch
40 nodes/rack, 1000-4000 nodes per cluster
1 Gbps bandwidth within a rack, 8 Gbps out of a rack
Node specs (Yahoo! terasort): 8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB)
Typical Hadoop Cluster
Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf
Challenges
Cheap nodes fail, especially when you have many: mean time between failures for 1 node = 3 years; MTBF for 1000 nodes = 1 day. Solution: build fault-tolerance into the system.
Commodity network = low bandwidth. Solution: push computation to the data.
Programming distributed systems is hard. Solution: a data-parallel programming model; users write "map" and "reduce" functions, and the system handles work distribution and fault tolerance.
Hadoop Components
Distributed file system (HDFS): single namespace for the entire cluster; replicates data 3x for fault-tolerance.
MapReduce implementation: executes user jobs specified as "map" and "reduce" functions; manages work distribution & fault-tolerance.
Hadoop Distributed File System
Files split into 128 MB blocks
Blocks replicated across several datanodes (usually 3)
Single namenode stores metadata (file names, block locations, etc.)
Optimized for large files, sequential reads
Files are append-only
[Diagram: a namenode holding the metadata for File1, whose four blocks are each replicated on three of the datanodes]
MapReduce Programming Model
Data type: key-value records
Map function: (K_in, V_in) -> list(K_inter, V_inter)
Reduce function: (K_inter, list(V_inter)) -> list(K_out, V_out)
Example: Word Count

def mapper(line):
    foreach word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
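The pseudocode above maps directly onto a tiny in-memory simulation; here is a minimal runnable sketch (plain Python, no Hadoop) that makes the map, shuffle-and-sort, and reduce phases explicit:

from collections import defaultdict

def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    return (key, sum(values))

def run_job(lines):
    # Map phase: every input line yields intermediate (word, 1) pairs
    intermediate = [pair for line in lines for pair in mapper(line)]
    # Shuffle & sort: group all intermediate values by key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reducer call per distinct key
    return [reducer(key, values) for key, values in sorted(groups.items())]

print(run_job(["the quick brown fox", "the fox ate the mouse", "how now brown cow"]))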
Word Count Execution
[Diagram: Input -> Map -> Shuffle & Sort -> Reduce -> Output. Three map tasks process the splits "the quick brown fox", "the fox ate the mouse", and "how now brown cow", each emitting (word, 1) pairs. After shuffle and sort, one reducer receives brown, fox, how, now, the and outputs brown 2, fox 2, how 1, now 1, the 3; the other receives ate, cow, mouse, quick and outputs ate 1, cow 1, mouse 1, quick 1.]
MapReduce Execution Details
A single master controls job execution on multiple slaves, as well as user scheduling.
Mappers are preferentially placed on the same node or same rack as their input block, pushing computation to the data and minimizing network use.
Mappers save their outputs to local disk rather than pushing directly to reducers; this allows having more reducers than nodes, and allows recovery if a reducer crashes.
An Optimization: The Combiner
A combiner is a local aggregation function for repeated keys produced by the same map task.
Works for associative operations like sum, count, max.
Decreases the size of the intermediate data.
Example: local counting for Word Count:

def combiner(key, values):
    output(key, sum(values))
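To see the effect, here is a small sketch (same in-memory model as the earlier word-count simulation) of one mapper's output with local combining applied before the shuffle:

from collections import Counter

def map_with_combiner(line):
    # Combine repeated keys locally before anything is shuffled
    return sorted(Counter(line.split()).items())

# The second split now sends ('the', 2) across the network instead of
# two separate ('the', 1) pairs:
print(map_with_combiner("the fox ate the mouse"))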
Word Count with Combiner
[Diagram: Input -> Map & Combine -> Shuffle & Sort -> Reduce -> Output. The same three splits as before, but each mapper combines its own repeated keys first, so the second mapper emits (the, 2) instead of two separate (the, 1) pairs; the final reducer outputs are unchanged.]
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive
Fault Tolerance in MapReduce
1. If a task crashes: retry it on another node.
This is okay for a map because it has no dependencies, and okay for a reduce because map outputs are on disk.
If the same task repeatedly fails, fail the job or ignore that input block (user-controlled).
Note: for this and the other fault-tolerance features to work, your map and reduce tasks must be side-effect-free.
Fault Tolerance in MapReduce
2. If a node crashes: relaunch its current tasks on other nodes, and relaunch any maps the node previously ran.
Relaunching completed maps is necessary because their output files were lost along with the crashed node.
Fault Tolerance in MapReduce
3. If a task is going slowly (a straggler): launch a second copy of the task on another node, take the output of whichever copy finishes first, and kill the other one.
This is critical for performance in large clusters: stragglers occur frequently due to failing hardware, bugs, misconfiguration, etc.
Takeaways
By providing a data-parallel programming model, MapReduce can control job execution in useful ways: automatic division of a job into tasks; automatic placement of computation near data; automatic load balancing; recovery from failures & stragglers.
The user focuses on the application, not on the complexities of distributed computing.
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive
1. Search
Input: (lineNumber, line) records
Output: lines matching a given pattern
Map: if line matches the pattern, output(line)
Reduce: identity function. Alternative: no reducer at all (a map-only job; see the sketch below).
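As a concrete (hypothetical) version of this job, a grep-style mapper for Hadoop Streaming could look like the following; run as a map-only job, it simply passes matching lines through:

#!/usr/bin/env python
# Hypothetical grep-style streaming mapper: emit only lines that match
# the pattern given as the script's first argument.
import re
import sys

pattern = re.compile(sys.argv[1])
for line in sys.stdin:
    if pattern.search(line):
        sys.stdout.write(line)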
2. Sort
Input: (key, value) records
Output: the same records, sorted by key
Map: identity function
Reduce: identity function
Trick: pick a partitioning function h such that k1 < k2 => h(k1) < h(k2) (a sketch of such an h follows the diagram).
[Diagram: three map tasks pass records such as "pig, sheep, yak, zebra" and "aardvark, ant, bee, cow, elephant" straight through; the partitioner sends keys in [A-M] to one reducer and keys in [N-Z] to the other, so the concatenated reducer outputs (aardvark, ant, bee, cow, elephant, pig, sheep, yak, zebra) are globally sorted.]
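A minimal sketch of such a partitioning function, assuming two reducers and alphabetic string keys as in the diagram:

def partition(key, num_reducers=2):
    # Range partitioner: keys starting A-M go to reducer 0, N-Z to
    # reducer 1, so k1 < k2 implies partition(k1) <= partition(k2).
    return 0 if key[0].upper() <= 'M' else 1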
3. Inverted Index
Input: (filename, text) records
Output: list of files containing each word
Map: foreach word in text.split(): output(word, filename)
Combine: uniquify filenames for each word (eliminate duplicates)
Reduce:
def reduce(word, filenames):
    output(word, sort(filenames))
Inverted Index Example
[Diagram: hamlet.txt contains "to be or not to be" and 12th.txt contains "be not afraid of greatness". The map phase emits (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt), and (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt). The reduce phase produces: afraid (12th.txt); be (12th.txt, hamlet.txt); greatness (12th.txt); not (12th.txt, hamlet.txt); of (12th.txt); or (hamlet.txt); to (hamlet.txt).]
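A compact in-memory sketch of the whole job, with a set playing the role of the combine step:

from collections import defaultdict

def inverted_index(files):
    # files: dict mapping filename -> text
    index = defaultdict(set)
    for filename, text in files.items():      # map: (word, filename) pairs
        for word in text.split():
            index[word].add(filename)          # set membership de-duplicates
    return {word: sorted(names) for word, names in index.items()}  # reduce

print(inverted_index({"hamlet.txt": "to be or not to be",
                      "12th.txt": "be not afraid of greatness"}))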
4. Most Popular Words
Input: (filename, text) records
Output: the 100 words occurring in the most files
Two-stage solution:
Job 1: create an inverted index, giving (word, list(file)) records.
Job 2: map each (word, list(file)) to (count, word), then sort these records by count as in the sort job.
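Locally, the second job amounts to the following sketch (a heap stands in for the sort-and-take-100 step):

import heapq

def most_popular(index, n=100):
    # index: dict mapping word -> list of files containing it (job 1 output)
    pairs = ((len(files), word) for word, files in index.items())
    return heapq.nlargest(n, pairs)   # top n (count, word) records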
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive
Cloudera Virtual Machine
The Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop.
Requirements:
A 64-bit host OS.
Virtualization software: VMware Player, KVM, or VirtualBox. The virtualization software requires a machine that supports virtualization; if you are unsure, one way to check is to look in your BIOS and see whether virtualization is enabled.
4 GB of total RAM. The total system memory required varies depending on the size of your data set and on the other processes that are running.
Installation
Step 1: Download & run VMware.
Step 2: Download the Cloudera VM.
Step 3: Extract it to the Cloudera folder.
Step 4: Open cloudera-quickstart-vm-4.4.0-1-vmware.
Once you have the software installed, fire up the VirtualBox image of the Cloudera QuickStart VM and you should see the VM's initial screen.
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps:
1. Set up the Hadoop environment
2. Upload files into HDFS
3. Execute the Java MapReduce functions in Hadoop
Tutorial: https://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1.html
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
How it Works
Purpose: simple serialization for keys, values, and other data.
Interface Writable: read and write in a binary format; convert to String for text formats.
Classes
OutputCollector: collects data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job.
Method: collect(K key, V value)
Interface
Text: stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level.
Method: set(byte[] utf8)
Demo
Exercises
Get WordCount running
Extend WordCount to count bigrams: bigrams are simply sequences of two consecutive words.
For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc. (One possible mapper is sketched below.)
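One possible shape for the exercise, as a hypothetical streaming-style mapper (the WordCount reducer can be reused unchanged):

import sys

# Emit each pair of consecutive words with a count of 1.
for line in sys.stdin:
    words = line.split()
    for first, second in zip(words, words[1:]):
        print("%s %s\t1" % (first, second))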
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive (reading)
Motivation
MapReduce is great, as many algorithms can be expressed as a series of MR jobs.
But it's low-level: you must think about keys, values, partitioning, etc.
Can we capture common "job patterns"?
Pig
Started at Yahoo! Research
Now runs about 30% of Yahoo!'s jobs
Features: expresses sequences of MapReduce jobs; data model of nested "bags" of items; provides relational (SQL-style) operators (JOIN, GROUP BY, etc.); easy to plug in Java functions; Pig Pen, a dev environment for Eclipse.
An Example Problem
Suppose you have user data in one file and website data in another, and you need to find the top 5 most-visited pages by users aged 18-25.
Load Users, Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }
    public static class LoadAndFilterUsers extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }
    public static class Join extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and
            // store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }
    public static class LoadJoined extends MapReduceBase
        implements Mapper<Text, Text, Text, LongWritable> {
        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }
    public static class ReduceUrls extends MapReduceBase
        implements Reducer<Text, LongWritable, WritableComparable, Writable> {
        public void reduce(Text key, Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }
    public static class LoadClicks extends MapReduceBase
        implements Mapper<WritableComparable, Writable, LongWritable, Text> {
        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable)val, (Text)key);
        }
    }
    public static class LimitClicks extends MapReduceBase
        implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;
        public void reduce(LongWritable key, Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }
    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
In Pig Latin

Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;
store Top5 into 'top5sites';
Ease of Translation
Notice how naturally the components of the job translate into Pig Latin:

Load Users, Load Pages -> Users = load …; Pages = load …
Filter by age          -> Fltrd = filter …
Join on name           -> Joined = join …
Group on url           -> Grouped = group …
Count clicks           -> Summed = … count()…
Order by clicks        -> Sorted = order …
Take top 5             -> Top5 = limit …
Ease of Translation
Notice also how the operators group into MapReduce jobs:

Job 1: Load Users, Load Pages; Filter by age; Join on name
       (Users = load …; Pages = load …; Fltrd = filter …; Joined = join …)
Job 2: Group on url; Count clicks
       (Grouped = group …; Summed = … count()…)
Job 3: Order by clicks; Take top 5
       (Sorted = order …; Top5 = limit …)
Hive
Developed at Facebook
Used for the majority of Facebook jobs
A "relational database" built on Hadoop: maintains a list of table schemas; SQL-like query language (HQL); can call Hadoop Streaming scripts from HQL; supports table partitioning, clustering, complex data types, and some optimizations.
Creating a Hive Table
CREATE TABLE page_views(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitioning breaks the table into separate files for each (dt, country) pair.
Ex: /hive/page_view/dt=2008-06-08/country=US
    /hive/page_view/dt=2008-06-08/country=CA
Simple Query
Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

Hive only reads the relevant date partitions instead of scanning the entire table.
Aggregation and Joins
Count the users who visited each page, broken down by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

Sample output:
page_url   gender  count(userid)
home.php   MALE    12,141,412
home.php   FEMALE  15,431,579
photo.php  MALE    23,941,451
photo.php  FEMALE  21,231,314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;
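The referenced map_script.py is not shown in the slides; a minimal hypothetical sketch of what such a script could look like, assuming Hive streams tab-separated (userid, date) rows on stdin and expects (dt, uid) rows back:

#!/usr/bin/env python
import sys

for row in sys.stdin:
    userid, date = row.rstrip('\n').split('\t')
    # Swap the column order: emit (dt, uid) as the query's AS clause expects
    print('%s\t%s' % (date, userid))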
Conclusions
MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.
Principal philosophies: make it scale, so you can throw hardware at problems; make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance).
Hive and Pig further simplify programming.
MapReduce is not suitable for all problems, but when it works it may save you a lot of time.
Resources
Hadoop: http://hadoop.apache.org/core/
Hadoop docs: http://hadoop.apache.org/core/docs/current/
Pig: http://hadoop.apache.org/pig
Hive: http://hadoop.apache.org/hive
Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
What is MapReduce
Data-parallel programming model for clusters of commodity machines
Pioneered by Google Processes 20 PB of data per day
Popularized by open-source Hadoop project Used by Yahoo Facebook Amazon hellip
What is MapReduce used for
bull At Googlendash Index building for Google Searchndash Article clustering for Google Newsndash Statistical machine translation
bull At Yahoondash Index building for Yahoo Searchndash Spam detection for Yahoo Mail
bull At Facebookndash Data miningndash Ad optimizationndash Spam detection
What is MapReduce used for
In research Analyzing Wikipedia conflicts (PARC) Natural language processing (CMU) Bioinformatics (Maryland) Astronomical image analysis (Washington) Ocean climate simulation (Washington) ltYour application heregt
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
MapReduce Design Goals
1 Scalability to large data volumes Scan 100 TB on 1 node 50 MBs = 23 days Scan on 1000-node cluster = 33 minutes
2 Cost-efficiency Commodity nodes (cheap but unreliable) Commodity network Automatic fault-tolerance (fewer admins) Easy to use (fewer programmers)
Typical Hadoop Cluster
Aggregation switch
Rack switch
40 nodesrack 1000-4000 nodes in cluster
1 GBps bandwidth in rack 8 GBps out of rack
Node specs (Yahoo terasort)8 x 20 GHz cores 8 GB RAM 4 disks (= 4 TB)
Typical Hadoop Cluster
Image from httpwikiapacheorghadoop-dataattachmentsHadoopPresentationsattachmentsaw-apachecon-eu-2009pdf
Challenges
Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system
Commodity network = low bandwidth Solution Push computation to the data
Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and
ldquoreducerdquo functions system handles work distribution and fault tolerance
Hadoop Components
Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance
MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance
Hadoop Distributed File System Files split into 128MB blocks
Blocks replicated across several datanodes (usually 3)
Single namenode stores metadata (file names block locations etc)
Optimized for large files sequential reads
Files are append-only
Namenode
Datanodes
1234
124
213
143
324
File1
MapReduce Programming Model Data type key-value records
Map function
(Kin Vin) list(Kinter Vinter)
Reduce function
(Kinter list(Vinter)) list(Kout Vout)
Example Word Count
def mapper(line) foreach word in linesplit)(
output(word 1)
def reducer(key values) output(key sum(values))
Word Count Execution
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 1fox 1the 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Input Map Shuffle amp Sort Reduce Output
MapReduce Execution Details
Single master controls job execution on multiple slaves as well as user scheduling
Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use
Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key values) output(key sum(values))
Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 2fox 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
What is MapReduce used for
bull At Googlendash Index building for Google Searchndash Article clustering for Google Newsndash Statistical machine translation
bull At Yahoondash Index building for Yahoo Searchndash Spam detection for Yahoo Mail
bull At Facebookndash Data miningndash Ad optimizationndash Spam detection
What is MapReduce used for
In research Analyzing Wikipedia conflicts (PARC) Natural language processing (CMU) Bioinformatics (Maryland) Astronomical image analysis (Washington) Ocean climate simulation (Washington) ltYour application heregt
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
MapReduce Design Goals
1 Scalability to large data volumes Scan 100 TB on 1 node 50 MBs = 23 days Scan on 1000-node cluster = 33 minutes
2 Cost-efficiency Commodity nodes (cheap but unreliable) Commodity network Automatic fault-tolerance (fewer admins) Easy to use (fewer programmers)
Typical Hadoop Cluster
Aggregation switch
Rack switch
40 nodesrack 1000-4000 nodes in cluster
1 GBps bandwidth in rack 8 GBps out of rack
Node specs (Yahoo terasort)8 x 20 GHz cores 8 GB RAM 4 disks (= 4 TB)
Typical Hadoop Cluster
Image from httpwikiapacheorghadoop-dataattachmentsHadoopPresentationsattachmentsaw-apachecon-eu-2009pdf
Challenges
Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system
Commodity network = low bandwidth Solution Push computation to the data
Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and
ldquoreducerdquo functions system handles work distribution and fault tolerance
Hadoop Components
Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance
MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance
Hadoop Distributed File System Files split into 128MB blocks
Blocks replicated across several datanodes (usually 3)
Single namenode stores metadata (file names block locations etc)
Optimized for large files sequential reads
Files are append-only
Namenode
Datanodes
1234
124
213
143
324
File1
MapReduce Programming Model Data type key-value records
Map function
(Kin Vin) list(Kinter Vinter)
Reduce function
(Kinter list(Vinter)) list(Kout Vout)
Example Word Count
def mapper(line) foreach word in linesplit)(
output(word 1)
def reducer(key values) output(key sum(values))
Word Count Execution
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 1fox 1the 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Input Map Shuffle amp Sort Reduce Output
MapReduce Execution Details
Single master controls job execution on multiple slaves as well as user scheduling
Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use
Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key values) output(key sum(values))
Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 2fox 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose: simple serialization for keys, values, and other data
Interface Writable: read and write binary format; convert to String for text formats

Classes
OutputCollector: collects data output by either the Mapper or the Reducer, i.e., intermediate outputs or the output of the job.
Method: collect(K key, V value)

Interface
Text: stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level.
Method: set(byte[] utf8)
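To make the Writable contract concrete, here is a minimal custom Writable: a hypothetical PairWritable that is not part of the slides and assumes only the standard org.apache.hadoop.io.Writable interface.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical example: a pair of ints serialized in binary form
public class PairWritable implements Writable {
  private int left, right;
  public void write(DataOutput out) throws IOException {
    out.writeInt(left);    // write binary format
    out.writeInt(right);
  }
  public void readFields(DataInput in) throws IOException {
    left = in.readInt();   // read back in the same order as written
    right = in.readInt();
  }
  public String toString() {
    return left + "\t" + right;   // String form, used by text output formats
  }
}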
Demo
Exercises
Get WordCount running.
Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words. For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc.
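One possible starting point for the bigram exercise, assuming the same skeleton and imports as the WordCount example above (only the map logic changes; the reducer can stay as-is):

// Sketch of a bigram-counting mapper (illustrative, not a provided solution)
public static class BigramMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text bigram = new Text();
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String[] words = value.toString().split("\\s+");
    // Emit each pair of consecutive words as a single key
    for (int i = 0; i + 1 < words.length; i++) {
      bigram.set(words[i] + " " + words[i + 1]);
      output.collect(bigram, one);
    }
  }
}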
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive (reading)
Motivation
MapReduce is great, as many algorithms can be expressed as a series of MR jobs.
But it's low-level: you must think about keys, values, partitioning, etc.
Can we capture common "job patterns"?
Pig
Started at Yahoo Research
Now runs about 30% of Yahoo's jobs
Features:
Expresses sequences of MapReduce jobs
Data model: nested "bags" of items
Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
Easy to plug in Java functions
Pig Pen: development environment for Eclipse
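To illustrate the "easy to plug in Java functions" point, a minimal Pig user-defined function might look like the sketch below. The ToUpper function is invented for illustration; it uses Pig's standard EvalFunc API.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: upper-case the first field of the input tuple
public class ToUpper extends EvalFunc<String> {
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0) return null;
    return ((String) input.get(0)).toUpperCase();
  }
}

In a script, such a function would be registered with a jar (e.g., register myudfs.jar;) and then called like any built-in function.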
An Example Problem
Suppose you have user data in one file and website data in another, and you need to find the top 5 most visited pages by users aged 18-25.
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
  public static class LoadPages extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text val,
        OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
      // Pull the key out
      String line = val.toString();
      int firstComma = line.indexOf(',');
      String key = line.substring(0, firstComma);
      String value = line.substring(firstComma + 1);
      Text outKey = new Text(key);
      // Prepend an index to the value so we know which file it came from.
      Text outVal = new Text("1" + value);
      oc.collect(outKey, outVal);
    }
  }
  public static class LoadAndFilterUsers extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text val,
        OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
      // Pull the key out
      String line = val.toString();
      int firstComma = line.indexOf(',');
      String value = line.substring(firstComma + 1);
      int age = Integer.parseInt(value);
      if (age < 18 || age > 25) return;
      String key = line.substring(0, firstComma);
      Text outKey = new Text(key);
      // Prepend an index to the value so we know which file it came from.
      Text outVal = new Text("2" + value);
      oc.collect(outKey, outVal);
    }
  }
  public static class Join extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> iter,
        OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
      // For each value, figure out which file it's from and store it accordingly.
      List<String> first = new ArrayList<String>();
      List<String> second = new ArrayList<String>();
      while (iter.hasNext()) {
        Text t = iter.next();
        String value = t.toString();
        if (value.charAt(0) == '1') first.add(value.substring(1));
        else second.add(value.substring(1));
        reporter.setStatus("OK");
      }
      // Do the cross product and collect the values
      for (String s1 : first) {
        for (String s2 : second) {
          String outval = key + "," + s1 + "," + s2;
          oc.collect(null, new Text(outval));
          reporter.setStatus("OK");
        }
      }
    }
  }
  public static class LoadJoined extends MapReduceBase
      implements Mapper<Text, Text, Text, LongWritable> {
    public void map(Text k, Text val,
        OutputCollector<Text, LongWritable> oc, Reporter reporter) throws IOException {
      // Find the url
      String line = val.toString();
      int firstComma = line.indexOf(',');
      int secondComma = line.indexOf(',', firstComma);
      String key = line.substring(firstComma, secondComma);
      // Drop the rest of the record, I don't need it anymore;
      // just pass a 1 for the combiner/reducer to sum instead.
      Text outKey = new Text(key);
      oc.collect(outKey, new LongWritable(1L));
    }
  }
  public static class ReduceUrls extends MapReduceBase
      implements Reducer<Text, LongWritable, WritableComparable, Writable> {
    public void reduce(Text key, Iterator<LongWritable> iter,
        OutputCollector<WritableComparable, Writable> oc, Reporter reporter) throws IOException {
      // Add up all the values we see
      long sum = 0;
      while (iter.hasNext()) {
        sum += iter.next().get();
        reporter.setStatus("OK");
      }
      oc.collect(key, new LongWritable(sum));
    }
  }
  public static class LoadClicks extends MapReduceBase
      implements Mapper<WritableComparable, Writable, LongWritable, Text> {
    public void map(WritableComparable key, Writable val,
        OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
      oc.collect((LongWritable) val, (Text) key);
    }
  }
  public static class LimitClicks extends MapReduceBase
      implements Reducer<LongWritable, Text, LongWritable, Text> {
    int count = 0;
    public void reduce(LongWritable key, Iterator<Text> iter,
        OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
      // Only output the first 100 records
      while (count < 100 && iter.hasNext()) {
        oc.collect(key, iter.next());
        count++;
      }
    }
  }
  public static void main(String[] args) throws IOException {
    JobConf lp = new JobConf(MRExample.class);
    lp.setJobName("Load Pages");
    lp.setInputFormat(TextInputFormat.class);
    lp.setOutputKeyClass(Text.class);
    lp.setOutputValueClass(Text.class);
    lp.setMapperClass(LoadPages.class);
    FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
    FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
    lp.setNumReduceTasks(0);
    Job loadPages = new Job(lp);

    JobConf lfu = new JobConf(MRExample.class);
    lfu.setJobName("Load and Filter Users");
    lfu.setInputFormat(TextInputFormat.class);
    lfu.setOutputKeyClass(Text.class);
    lfu.setOutputValueClass(Text.class);
    lfu.setMapperClass(LoadAndFilterUsers.class);
    FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
    FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
    lfu.setNumReduceTasks(0);
    Job loadUsers = new Job(lfu);

    JobConf join = new JobConf(MRExample.class);
    join.setJobName("Join Users and Pages");
    join.setInputFormat(KeyValueTextInputFormat.class);
    join.setOutputKeyClass(Text.class);
    join.setOutputValueClass(Text.class);
    join.setMapperClass(IdentityMapper.class);
    join.setReducerClass(Join.class);
    FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
    FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
    FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
    join.setNumReduceTasks(50);
    Job joinJob = new Job(join);
    joinJob.addDependingJob(loadPages);
    joinJob.addDependingJob(loadUsers);

    JobConf group = new JobConf(MRExample.class);
    group.setJobName("Group URLs");
    group.setInputFormat(KeyValueTextInputFormat.class);
    group.setOutputKeyClass(Text.class);
    group.setOutputValueClass(LongWritable.class);
    group.setOutputFormat(SequenceFileOutputFormat.class);
    group.setMapperClass(LoadJoined.class);
    group.setCombinerClass(ReduceUrls.class);
    group.setReducerClass(ReduceUrls.class);
    FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
    FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
    group.setNumReduceTasks(50);
    Job groupJob = new Job(group);
    groupJob.addDependingJob(joinJob);

    JobConf top100 = new JobConf(MRExample.class);
    top100.setJobName("Top 100 sites");
    top100.setInputFormat(SequenceFileInputFormat.class);
    top100.setOutputKeyClass(LongWritable.class);
    top100.setOutputValueClass(Text.class);
    top100.setOutputFormat(SequenceFileOutputFormat.class);
    top100.setMapperClass(LoadClicks.class);
    top100.setCombinerClass(LimitClicks.class);
    top100.setReducerClass(LimitClicks.class);
    FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
    FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
    top100.setNumReduceTasks(1);
    Job limit = new Job(top100);
    limit.addDependingJob(groupJob);

    JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
    jc.addJob(loadPages);
    jc.addJob(loadUsers);
    jc.addJob(joinJob);
    jc.addJob(groupJob);
    jc.addJob(limit);
    jc.run();
  }
}
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation
Notice how naturally the components of the job translate into Pig Latin.
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation (2)
Notice how naturally the components of the job translate into Pig Latin.
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Job 1, Job 2, Job 3: on the original slide these labels group the statements into the three MapReduce jobs Pig compiles the script into (roughly: load/filter/join; group/count; order/limit).
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Hive
Developed at Facebook
Used for the majority of Facebook jobs
"Relational database" built on Hadoop:
Maintains a list of table schemas
SQL-like query language (HQL)
Can call Hadoop Streaming scripts from HQL
Supports table partitioning, clustering, complex data types, and some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitioning breaks the table into separate files for each (dt, country) pair.
Ex: /hive/page_view/dt=2008-06-08/country=US
    /hive/page_view/dt=2008-06-08/country=CA
Simple Query
Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

Hive reads only the matching dt partitions instead of scanning the entire table.
Aggregation and Joins
Count the users who visited each page, broken down by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

Sample output:
page_url    gender  count(userid)
home.php    MALE    12141412
home.php    FEMALE  15431579
photo.php   MALE    23941451
photo.php   FEMALE  21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;
Conclusions
MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.

Principal philosophies:
Make it scale, so you can throw hardware at problems.
Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance).

Hive and Pig further simplify programming.

MapReduce is not suitable for all problems, but when it works it may save you a lot of time.
Resources
Hadoop: http://hadoop.apache.org/core/
Hadoop docs: http://hadoop.apache.org/core/docs/current/
Pig: http://hadoop.apache.org/pig
Hive: http://hadoop.apache.org/hive
Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
What is MapReduce used for
In research Analyzing Wikipedia conflicts (PARC) Natural language processing (CMU) Bioinformatics (Maryland) Astronomical image analysis (Washington) Ocean climate simulation (Washington) ltYour application heregt
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
MapReduce Design Goals
1 Scalability to large data volumes Scan 100 TB on 1 node 50 MBs = 23 days Scan on 1000-node cluster = 33 minutes
2 Cost-efficiency Commodity nodes (cheap but unreliable) Commodity network Automatic fault-tolerance (fewer admins) Easy to use (fewer programmers)
Typical Hadoop Cluster
Aggregation switch
Rack switch
40 nodesrack 1000-4000 nodes in cluster
1 GBps bandwidth in rack 8 GBps out of rack
Node specs (Yahoo terasort)8 x 20 GHz cores 8 GB RAM 4 disks (= 4 TB)
Typical Hadoop Cluster
Image from httpwikiapacheorghadoop-dataattachmentsHadoopPresentationsattachmentsaw-apachecon-eu-2009pdf
Challenges
Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system
Commodity network = low bandwidth Solution Push computation to the data
Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and
ldquoreducerdquo functions system handles work distribution and fault tolerance
Hadoop Components
Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance
MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance
Hadoop Distributed File System Files split into 128MB blocks
Blocks replicated across several datanodes (usually 3)
Single namenode stores metadata (file names block locations etc)
Optimized for large files sequential reads
Files are append-only
Namenode
Datanodes
1234
124
213
143
324
File1
MapReduce Programming Model Data type key-value records
Map function
(Kin Vin) list(Kinter Vinter)
Reduce function
(Kinter list(Vinter)) list(Kout Vout)
Example Word Count
def mapper(line) foreach word in linesplit)(
output(word 1)
def reducer(key values) output(key sum(values))
Word Count Execution
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 1fox 1the 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Input Map Shuffle amp Sort Reduce Output
MapReduce Execution Details
Single master controls job execution on multiple slaves as well as user scheduling
Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use
Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key values) output(key sum(values))
Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 2fox 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
MapReduce Design Goals
1 Scalability to large data volumes Scan 100 TB on 1 node 50 MBs = 23 days Scan on 1000-node cluster = 33 minutes
2 Cost-efficiency Commodity nodes (cheap but unreliable) Commodity network Automatic fault-tolerance (fewer admins) Easy to use (fewer programmers)
Typical Hadoop Cluster
Aggregation switch
Rack switch
40 nodesrack 1000-4000 nodes in cluster
1 GBps bandwidth in rack 8 GBps out of rack
Node specs (Yahoo terasort)8 x 20 GHz cores 8 GB RAM 4 disks (= 4 TB)
Typical Hadoop Cluster
Image from httpwikiapacheorghadoop-dataattachmentsHadoopPresentationsattachmentsaw-apachecon-eu-2009pdf
Challenges
Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system
Commodity network = low bandwidth Solution Push computation to the data
Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and
ldquoreducerdquo functions system handles work distribution and fault tolerance
Hadoop Components
Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance
MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance
Hadoop Distributed File System Files split into 128MB blocks
Blocks replicated across several datanodes (usually 3)
Single namenode stores metadata (file names block locations etc)
Optimized for large files sequential reads
Files are append-only
Namenode
Datanodes
1234
124
213
143
324
File1
MapReduce Programming Model Data type key-value records
Map function
(Kin Vin) list(Kinter Vinter)
Reduce function
(Kinter list(Vinter)) list(Kout Vout)
Example Word Count
def mapper(line) foreach word in linesplit)(
output(word 1)
def reducer(key values) output(key sum(values))
Word Count Execution
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 1fox 1the 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Input Map Shuffle amp Sort Reduce Output
MapReduce Execution Details
Single master controls job execution on multiple slaves as well as user scheduling
Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use
Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key values) output(key sum(values))
Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 2fox 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins

• Count users who visited each page, by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Sample output:

page_url    gender    count(userid)
home.php    MALE      12141412
home.php    FEMALE    15431579
photo.php   MALE      23941451
photo.php   FEMALE    21231314
Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;
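The script itself is not shown in the slides. As a hedged sketch of what such a mapper could look like: Hive Streaming feeds the selected columns to the script as tab-separated lines on stdin and reads tab-separated lines back on stdout, so a hypothetical map_script.py matching "AS dt, uid" might simply swap the column order:

#!/usr/bin/env python
# Hypothetical map_script.py (an illustration, not from the original slides).
# Input rows arrive as "userid<TAB>date"; emit "dt<TAB>uid" to match AS dt, uid.
import sys

for line in sys.stdin:
    userid, date = line.rstrip("\n").split("\t")
    print("%s\t%s" % (date, userid))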
Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance

Principal philosophies:
  Make it scale, so you can throw hardware at problems
  Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems, but when it works, it may save you a lot of time
Resources

Hadoop: http://hadoop.apache.org/core/
Hadoop docs: http://hadoop.apache.org/core/docs/current/
Pig: http://hadoop.apache.org/pig
Hive: http://hadoop.apache.org/hive
Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
MapReduce Design Goals
1 Scalability to large data volumes Scan 100 TB on 1 node 50 MBs = 23 days Scan on 1000-node cluster = 33 minutes
2 Cost-efficiency Commodity nodes (cheap but unreliable) Commodity network Automatic fault-tolerance (fewer admins) Easy to use (fewer programmers)
Typical Hadoop Cluster
Aggregation switch
Rack switch
40 nodesrack 1000-4000 nodes in cluster
1 GBps bandwidth in rack 8 GBps out of rack
Node specs (Yahoo terasort)8 x 20 GHz cores 8 GB RAM 4 disks (= 4 TB)
Typical Hadoop Cluster
Image from httpwikiapacheorghadoop-dataattachmentsHadoopPresentationsattachmentsaw-apachecon-eu-2009pdf
Challenges
Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system
Commodity network = low bandwidth Solution Push computation to the data
Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and
ldquoreducerdquo functions system handles work distribution and fault tolerance
Hadoop Components
Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance
MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance
Hadoop Distributed File System Files split into 128MB blocks
Blocks replicated across several datanodes (usually 3)
Single namenode stores metadata (file names block locations etc)
Optimized for large files sequential reads
Files are append-only
Namenode
Datanodes
1234
124
213
143
324
File1
MapReduce Programming Model Data type key-value records
Map function
(Kin Vin) list(Kinter Vinter)
Reduce function
(Kinter list(Vinter)) list(Kout Vout)
Example Word Count
def mapper(line) foreach word in linesplit)(
output(word 1)
def reducer(key values) output(key sum(values))
Word Count Execution
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 1fox 1the 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Input Map Shuffle amp Sort Reduce Output
MapReduce Execution Details
Single master controls job execution on multiple slaves as well as user scheduling
Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use
Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key values) output(key sum(values))
Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 2fox 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Typical Hadoop Cluster
Aggregation switch
Rack switch
40 nodesrack 1000-4000 nodes in cluster
1 GBps bandwidth in rack 8 GBps out of rack
Node specs (Yahoo terasort)8 x 20 GHz cores 8 GB RAM 4 disks (= 4 TB)
Typical Hadoop Cluster
Image from httpwikiapacheorghadoop-dataattachmentsHadoopPresentationsattachmentsaw-apachecon-eu-2009pdf
Challenges
Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system
Commodity network = low bandwidth Solution Push computation to the data
Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and
ldquoreducerdquo functions system handles work distribution and fault tolerance
Hadoop Components
Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance
MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance
Hadoop Distributed File System Files split into 128MB blocks
Blocks replicated across several datanodes (usually 3)
Single namenode stores metadata (file names block locations etc)
Optimized for large files sequential reads
Files are append-only
Namenode
Datanodes
1234
124
213
143
324
File1
MapReduce Programming Model Data type key-value records
Map function
(Kin Vin) list(Kinter Vinter)
Reduce function
(Kinter list(Vinter)) list(Kout Vout)
Example Word Count
def mapper(line) foreach word in linesplit)(
output(word 1)
def reducer(key values) output(key sum(values))
Word Count Execution
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 1fox 1the 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Input Map Shuffle amp Sort Reduce Output
MapReduce Execution Details
Single master controls job execution on multiple slaves as well as user scheduling
Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use
Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key values) output(key sum(values))
Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 2fox 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Typical Hadoop Cluster
Image from httpwikiapacheorghadoop-dataattachmentsHadoopPresentationsattachmentsaw-apachecon-eu-2009pdf
Challenges
Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system
Commodity network = low bandwidth Solution Push computation to the data
Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and
ldquoreducerdquo functions system handles work distribution and fault tolerance
Hadoop Components
Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance
MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance
Hadoop Distributed File System Files split into 128MB blocks
Blocks replicated across several datanodes (usually 3)
Single namenode stores metadata (file names block locations etc)
Optimized for large files sequential reads
Files are append-only
Namenode
Datanodes
1234
124
213
143
324
File1
MapReduce Programming Model Data type key-value records
Map function
(Kin Vin) list(Kinter Vinter)
Reduce function
(Kinter list(Vinter)) list(Kout Vout)
Example Word Count
def mapper(line) foreach word in linesplit)(
output(word 1)
def reducer(key values) output(key sum(values))
Word Count Execution
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 1fox 1the 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Input Map Shuffle amp Sort Reduce Output
MapReduce Execution Details
Single master controls job execution on multiple slaves as well as user scheduling
Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use
Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key values) output(key sum(values))
Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 2fox 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Challenges
Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system
Commodity network = low bandwidth Solution Push computation to the data
Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and
ldquoreducerdquo functions system handles work distribution and fault tolerance
Hadoop Components
Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance
MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance
Hadoop Distributed File System Files split into 128MB blocks
Blocks replicated across several datanodes (usually 3)
Single namenode stores metadata (file names block locations etc)
Optimized for large files sequential reads
Files are append-only
Namenode
Datanodes
1234
124
213
143
324
File1
MapReduce Programming Model Data type key-value records
Map function
(Kin Vin) list(Kinter Vinter)
Reduce function
(Kinter list(Vinter)) list(Kout Vout)
Example Word Count
def mapper(line): foreach word in line.split(): output(word, 1)
def reducer(key, values): output(key, sum(values))
Word Count Execution
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 1fox 1the 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Input Map Shuffle amp Sort Reduce Output
MapReduce Execution Details
Single master controls job execution on multiple slaves as well as user scheduling
Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use
Mappers save outputs to local disk rather than pushing directly to reducers. This allows having more reducers than nodes, and allows recovery if a reducer crashes.
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key, values): output(key, sum(values))
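Wiring a combiner into a Hadoop job is a single line of configuration. A minimal sketch, assuming the WordCount Map and Reduce classes shown later in this deck; because summing is associative, the same Reduce class can double as the combiner:

import org.apache.hadoop.mapred.JobConf;

public class CombinerWiring {
    public static void configure(JobConf conf) {
        conf.setMapperClass(WordCount.Map.class);
        conf.setCombinerClass(WordCount.Reduce.class); // local aggregation on each mapper's output
        conf.setReducerClass(WordCount.Reduce.class);  // final aggregation after the shuffle
    }
}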
Word Count with Combiner

Input -> Map & Combine -> Shuffle & Sort -> Reduce -> Output

"the quick brown fox"   -> the 1, brown 1, fox 1, quick 1
"the fox ate the mouse" -> the 2, fox 1, ate 1, mouse 1 (the combiner merged the two "the 1" pairs)
"how now brown cow"     -> how 1, now 1, brown 1, cow 1

After the shuffle, reducer 1 outputs brown 2, fox 2, how 1, now 1, the 3 and reducer 2 outputs ate 1, cow 1, mouse 1, quick 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1. If a task crashes: retry it on another node
Okay for a map because it has no dependencies; okay for a reduce because map outputs are on disk
If the same task repeatedly fails, fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2. If a node crashes: relaunch its current tasks on other nodes, and relaunch any maps the node previously ran
Necessary because the map output files were lost along with the crashed node
Fault Tolerance in MapReduce
3. If a task is going slowly (a straggler): launch a second copy of the task on another node, take the output of whichever copy finishes first, and kill the other one
Critical for performance in large clusters: stragglers occur frequently due to failing hardware, bugs, misconfiguration, etc.
Takeaways
By providing a data-parallel programming model, MapReduce can control job execution in useful ways: automatic division of the job into tasks, automatic placement of computation near data, automatic load balancing, and recovery from failures & stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map: if (line matches pattern) output(line)
Reduce: identity function. Alternative: no reducer (map-only job)
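As a concrete illustration, a map-only grep in Hadoop's old (mapred) Java API might look like the sketch below; the hard-coded pattern and class name are illustrative, not part of the slides:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class GrepMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {
    private static final String PATTERN = "error"; // hypothetical pattern

    public void map(LongWritable lineNumber, Text line,
            OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
        if (line.toString().contains(PATTERN)) {
            // Emit the matching line; with zero reduce tasks this is the final output
            output.collect(line, NullWritable.get());
        }
    }
}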
2 Sort

Input (key, value) records
Output same records, sorted by key

Map identity function
Reduce identity function

Trick Pick a partitioning function h such that k1 < k2 => h(k1) <= h(k2), so each reducer receives a contiguous key range and concatenating the reducers' sorted outputs gives a globally sorted result

Example three mappers emit the keys ant, bee, zebra, aardvark, elephant, cow, pig, sheep, yak; keys in [A-M] go to one reducer (output: aardvark, ant, bee, cow, elephant) and keys in [N-Z] to the other (output: pig, sheep, yak, zebra)
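A minimal sketch of that trick in Hadoop's Java API: an order-preserving partitioner for exactly two reducers (single-word keys assumed; the class name is illustrative). Hadoop's TeraSort applies the same idea with sampled split points:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class RangePartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {} // no configuration needed

    public int getPartition(Text key, Text value, int numPartitions) {
        char first = Character.toUpperCase(key.toString().charAt(0));
        // Keys A-M go to reducer 0, N-Z to reducer 1, so k1 < k2 => partition(k1) <= partition(k2)
        return (first <= 'M') ? 0 : 1;
    }
}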
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map: foreach word in text.split(): output(word, filename)

Combine: uniquify filenames for each word (eliminate duplicates)

Reduce: def reduce(word, filenames): output(word, sort(filenames))
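A sketch of the reduce side in Java, assuming the map step emits (word, filename) Text pairs; a TreeSet both uniquifies and sorts the filenames:

import java.io.IOException;
import java.util.Iterator;
import java.util.SortedSet;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class IndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text word, Iterator<Text> filenames,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        SortedSet<String> files = new TreeSet<String>(); // dedupe + sort
        while (filenames.hasNext()) {
            files.add(filenames.next().toString());
        }
        output.collect(word, new Text(files.toString()));
    }
}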
Inverted Index Example

Input hamlet.txt containing "to be or not to be" and 12th.txt containing "be not afraid of greatness"

Map output: to hamlet.txt, be hamlet.txt, or hamlet.txt, not hamlet.txt; be 12th.txt, not 12th.txt, afraid 12th.txt, of 12th.txt, greatness 12th.txt

Reduce output: afraid (12th.txt); be (12th.txt, hamlet.txt); greatness (12th.txt); not (12th.txt, hamlet.txt); of (12th.txt); or (hamlet.txt); to (hamlet.txt)
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in the most files

Two-stage solution:
Job 1: create an inverted index, giving (word, list(file)) records
Job 2: map each (word, list(file)) to (count, word), then sort these records by count, as in the sort job
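A sketch of Job 2's map step, assuming Job 1 wrote each word's file list as a comma-separated Text value (the class name and encoding are illustrative); emitting (count, word) lets the framework's shuffle sort the words by how many files they appear in:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CountSwapMapper extends MapReduceBase
        implements Mapper<Text, Text, IntWritable, Text> {
    public void map(Text word, Text fileList,
            OutputCollector<IntWritable, Text> output, Reporter reporter)
            throws IOException {
        int count = fileList.toString().split(",").length; // assumes comma-joined list
        output.collect(new IntWritable(count), word);      // swap to (count, word)
    }
}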
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements:
A 64-bit host OS
Virtualization software: VMware Player, KVM, or VirtualBox. The software requires a machine that supports virtualization; if you are unsure, check your BIOS to see whether virtualization is enabled
At least 4 GB of total RAM; the total system memory required varies depending on the size of your data set and on the other processes that are running
Installation
Step 1: Download and run VMware Player
Step 2: Download the Cloudera VM
Step 3: Extract it to the Cloudera folder
Step 4: Open the cloudera-quickstart-vm-4.4.0-1-vmware image

Once you have the software installed, fire up the Cloudera QuickStart VM image; you should see its initial screen.
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps:
1. Set up the Hadoop environment
2. Upload files into HDFS
3. Execute the Java MapReduce functions in Hadoop

Tutorial: https://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1.html
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the line
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sum the counts for each word
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
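To try it, one typical flow on the VM (paths are illustrative): compile the class against the Hadoop jars, package it as wordcount.jar, copy a text file into HDFS, then submit the job with hadoop jar wordcount.jar WordCount /user/cloudera/input /user/cloudera/output and inspect the part-* files in the output directory.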
How it Works
Purpose: simple serialization for keys, values, and other data
Interface Writable: read and write binary format; convert to String for text formats
Classes
OutputCollector: collects data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job
Methods: collect(K key, V value)
Interface
Text: stores text using standard UTF8 encoding; provides methods to serialize, deserialize, and compare texts at the byte level
Methods: set(byte[] utf8)
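A tiny illustration of these types (the values are illustrative):

import java.io.UnsupportedEncodingException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        Text word = new Text();
        word.set("hadoop".getBytes("UTF-8"));    // set(byte[] utf8), as listed above
        IntWritable count = new IntWritable(42); // binary-serializable int
        System.out.println(word.toString() + "\t" + count.get()); // String conversion for text output
    }
}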
Demo
Exercises
Get WordCount running
Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words; one possible mapper is sketched below.
For example, the previous sentence contains the bigrams "Bigrams are", "are simply", "simply sequences", "sequences of", etc.
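One possible shape for the bigram mapper, as a sketch rather than the official solution; the WordCount reducer can be reused unchanged to sum the counts:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class BigramMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        String prev = null;
        while (tokenizer.hasMoreTokens()) {
            String cur = tokenizer.nextToken();
            if (prev != null) {
                // Emit the pair of consecutive words as a single key
                output.collect(new Text(prev + " " + cur), ONE);
            }
            prev = cur;
        }
    }
}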
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great, as many algorithms can be expressed by a series of MR jobs
But it's low-level: you must think about keys, values, partitioning, etc.
Can we capture common "job patterns"?
Pig
Started at Yahoo Research
Now runs about 30% of Yahoo's jobs

Features:
Expresses sequences of MapReduce jobs
Data model: nested "bags" of items
Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
Easy to plug in Java functions (see the UDF sketch below)
Pig Pen: a development environment for Eclipse
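For the "plug in Java functions" bullet, a minimal sketch of a Pig UDF (user-defined eval function) that lower-cases its argument; the class name is illustrative, and it would be registered in a script with REGISTER myudfs.jar and then called like any built-in function:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ToLower extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return null;
        return ((String) input.get(0)).toLowerCase(); // one output value per input tuple
    }
}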
An Example Problem
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {

    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file it came from
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file it came from
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and store it accordingly
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();

            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
            }
            reporter.setStatus("OK");

            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {

        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {

        public void reduce(Text key, Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {

        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable) val, (Text) key);
        }
    }

    public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {

        int count = 0;

        public void reduce(LongWritable key, Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation

Notice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation

Notice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Job 1: load, filter, and join; Job 2: group and count; Job 3: order and take the top 5
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
"Relational database" built on Hadoop:
Maintains a list of table schemas
SQL-like query language (HQL)
Can call Hadoop Streaming scripts from HQL
Supports table partitioning, clustering, complex data types, and some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitioning breaks the table into separate files for each (dt, country) pair
Ex: /hive/page_view/dt=2008-06-08/country=US and /hive/page_view/dt=2008-06-08/country=CA
Simple Query
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

Find all page views coming from xyz.com during March 2008

Hive only reads the March 2008 partitions instead of scanning the entire table
Aggregation and Joins
SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

Count the users who visited each page, broken down by gender

Sample output:
page_url    gender   count(userid)
home.php    MALE     12141412
home.php    FEMALE   15431579
photo.php   MALE     23941451
photo.php   FEMALE   21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;
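Here map_script.py is the user's own executable (the name and output columns are the slide's example, not a built-in): Hive pipes the selected columns to its stdin as tab-separated lines, the script writes tab-separated (dt, uid) records to stdout, and CLUSTER BY dt then routes all records with the same dt to the same reducer.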
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies:
Make it scale, so you can throw hardware at problems
Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop: http://hadoop.apache.org/core/
Hadoop docs: http://hadoop.apache.org/core/docs/current/
Pig: http://hadoop.apache.org/pig
Hive: http://hadoop.apache.org/hive
Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Hadoop Components
Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance
MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance
Hadoop Distributed File System Files split into 128MB blocks
Blocks replicated across several datanodes (usually 3)
Single namenode stores metadata (file names block locations etc)
Optimized for large files sequential reads
Files are append-only
Namenode
Datanodes
1234
124
213
143
324
File1
MapReduce Programming Model Data type key-value records
Map function
(Kin Vin) list(Kinter Vinter)
Reduce function
(Kinter list(Vinter)) list(Kout Vout)
Example Word Count
def mapper(line) foreach word in linesplit)(
output(word 1)
def reducer(key values) output(key sum(values))
Word Count Execution
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 1fox 1the 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Input Map Shuffle amp Sort Reduce Output
MapReduce Execution Details
Single master controls job execution on multiple slaves as well as user scheduling
Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use
Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key values) output(key sum(values))
Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 2fox 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Hadoop Distributed File System Files split into 128MB blocks
Blocks replicated across several datanodes (usually 3)
Single namenode stores metadata (file names block locations etc)
Optimized for large files sequential reads
Files are append-only
Namenode
Datanodes
1234
124
213
143
324
File1
MapReduce Programming Model Data type key-value records
Map function
(Kin Vin) list(Kinter Vinter)
Reduce function
(Kinter list(Vinter)) list(Kout Vout)
Example Word Count
def mapper(line) foreach word in linesplit)(
output(word 1)
def reducer(key values) output(key sum(values))
Word Count Execution
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 1fox 1the 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Input Map Shuffle amp Sort Reduce Output
MapReduce Execution Details
Single master controls job execution on multiple slaves as well as user scheduling
Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use
Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key values) output(key sum(values))
Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 2fox 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
Input: hamlet.txt = "to be or not to be"; 12th.txt = "be not afraid of greatness"
Map outputs: (to, hamlet.txt) (be, hamlet.txt) (or, hamlet.txt) (not, hamlet.txt); (be, 12th.txt) (not, 12th.txt) (afraid, 12th.txt) (of, 12th.txt) (greatness, 12th.txt)
Reduce output: afraid (12th.txt); be (12th.txt, hamlet.txt); greatness (12th.txt); not (12th.txt, hamlet.txt); of (12th.txt); or (hamlet.txt); to (hamlet.txt)
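The same example as a small local simulation in Python; this is a sketch that assumes whitespace tokenization, with a set comprehension standing in for the combine step's de-duplication:

files = {
    "hamlet.txt": "to be or not to be",
    "12th.txt": "be not afraid of greatness",
}

# Map + Combine: one (word, filename) pair per distinct word per file.
pairs = {(word, name) for name, text in files.items() for word in text.split()}

# Shuffle & Sort groups filenames by word; Reduce sorts each file list.
index = {}
for word, name in pairs:
    index.setdefault(word, []).append(name)
for word in sorted(index):
    print(word, sorted(index[word]))  # e.g. be ['12th.txt', 'hamlet.txt']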
4. Most Popular Words
Input: (filename, text) records
Output: the 100 words occurring in the most files
Two-stage solution:
Job 1: create the inverted index, giving (word, list(file)) records
Job 2: map each (word, list(file)) to (count, word), then sort these records by count as in the sort job
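A hedged local sketch of the two-stage plan, reusing the inverted-index inputs above (top 2 instead of top 100 for brevity):

files = {
    "hamlet.txt": "to be or not to be",
    "12th.txt": "be not afraid of greatness",
}

# Job 1: inverted index, word -> set of files containing it.
index = {}
for name, text in files.items():
    for word in set(text.split()):
        index.setdefault(word, set()).add(name)

# Job 2: map each (word, files) to (count, word), sort by count descending.
by_count = sorted(((len(fs), w) for w, fs in index.items()), reverse=True)
print(by_count[:2])  # [(2, 'not'), (2, 'be')]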
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive
Cloudera Virtual Machine
The Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop.
Requirements:
A 64-bit host OS
Virtualization software: VMware Player, KVM, or VirtualBox. The virtualization software requires a machine that supports virtualization; if you are unsure, one way to check is to look in your BIOS and see whether virtualization is enabled.
4 GB of total RAM. The total system memory required varies depending on the size of your data set and on the other processes that are running.
Installation
Step 1: Download and run VMware Player
Step 2: Download the Cloudera VM
Step 3: Extract it to the Cloudera folder
Step 4: Open cloudera-quickstart-vm-4.4.0-1-vmware
Once you have the software installed, fire up the Cloudera QuickStart VM image and you should see an initial screen similar to the one below.
WordCount Example
This example computes the occurrence frequency of each word in a text file.
Steps:
1. Set up the Hadoop environment
2. Upload files into HDFS
3. Execute the Java MapReduce job in Hadoop
Tutorial: https://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1.html
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}
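For comparison, the same job can be expressed as a pair of Hadoop Streaming scripts in Python. This is a sketch, not part of the Cloudera tutorial, and it assumes the usual Streaming contract: tab-separated key/value lines on stdin and stdout, with the reducer's input sorted by key.

# mapper.py: emit (word, 1) for every word on stdin.
import sys
for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

# reducer.py: input arrives sorted by key, so sum each run of equal words.
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))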
How it Works
Purpose: simple serialization for keys, values, and other data.
Interface Writable: read and write in a binary format; convert to String for text formats.
Classes
OutputCollector: collects the data output by either the Mapper or the Reducer, i.e., intermediate outputs or the output of the job.
Methods: collect(K key, V value)
Interface
Text: stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level.
Methods: set(byte[] utf8)
Demo
Exercises
Get WordCount running.
Extend WordCount to count bigrams: bigrams are simply sequences of two consecutive words.
For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc.
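One possible starting point for the bigram exercise, written as a Streaming-style Python mapper (a sketch; the WordCount reducer can be reused unchanged, since it just sums the values for each key):

import sys
for line in sys.stdin:
    words = line.split()
    # Emit each pair of consecutive words as a single composite key.
    for first, second in zip(words, words[1:]):
        print("%s %s\t%d" % (first, second, 1))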
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive (reading)
Motivation
MapReduce is great: many algorithms can be expressed as a series of MR jobs.
But it's low-level: you must think about keys, values, partitioning, etc.
Can we capture common "job patterns"?
Pig
Started at Yahoo Research
Now runs about 30% of Yahoo's jobs
Features:
Expresses sequences of MapReduce jobs
Data model: nested "bags" of items
Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
Easy to plug in Java functions
Pig Pen development environment for Eclipse
An Example Problem
Suppose you have user data in one file and website data in another, and you need to find the top 5 most-visited pages by users aged 18-25.
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In MapReduce
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }
    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }
    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }
    public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {
        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc, Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // Drop the rest of the record, I don't need it anymore;
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }
    public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {
        public void reduce(Text key, Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc, Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }
    public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {
        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
            oc.collect((LongWritable)val, (Text)key);
        }
    }
    public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;
        public void reduce(LongWritable key, Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }
    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In Pig Latin
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation
Notice how naturally the components of the job translate into Pig Latin:
Load Users -> Users = load ...
Load Pages -> Pages = load ...
Filter by age -> Fltrd = filter ...
Join on name -> Joined = join ...
Group on url -> Grouped = group ...
Count clicks -> Summed = ... COUNT()...
Order by clicks -> Sorted = order ...
Take top 5 -> Top5 = limit ...
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation (2)
Notice how naturally the components of the job translate into Pig Latin, and how the same pipeline compiles into three MapReduce jobs:
Job 1: Load Users, Load Pages, Filter by age, Join on name
Job 2: Group on url, Count clicks
Job 3: Order by clicks, Take top 5
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Hive
Developed at Facebook
Used for the majority of Facebook jobs
A "relational database" built on Hadoop:
Maintains a list of table schemas
SQL-like query language (HQL)
Can call Hadoop Streaming scripts from HQL
Supports table partitioning, clustering, complex data types, and some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
Partitioning breaks the table into separate files for each (dt, country) pair. Ex: /hive/page_view/dt=2008-06-08,country=US and /hive/page_view/dt=2008-06-08,country=CA
Simple Query
Find all page views coming from xyz.com during March 2008:
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';
Hive only reads the matching date partitions, instead of scanning the entire table.
Aggregation and Joins
Count the users who visited each page, by gender:
SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;
Sample output:
page_url   gender  count(userid)
home.php   MALE    12141412
home.php   FEMALE  15431579
photo.php  MALE    23941451
photo.php  FEMALE  21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_views.userid,
                 page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;
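The slides do not show map_script.py itself; below is a minimal sketch of what such a transform script could look like, assuming Hive's Streaming contract of tab-separated columns on stdin and stdout:

import sys
for line in sys.stdin:
    # Hive passes the selected columns (userid, date) tab-separated.
    userid, date = line.rstrip("\n").split("\t")
    # Emit (dt, uid) in the order named by the AS clause.
    print("%s\t%s" % (date, userid))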
Conclusions
MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.
Principal philosophies:
Make it scale, so you can throw hardware at problems
Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems, but when it works it may save you a lot of time.
Resources
Hadoop: http://hadoop.apache.org/core/
Hadoop docs: http://hadoop.apache.org/core/docs/current/
Pig: http://hadoop.apache.org/pig
Hive: http://hadoop.apache.org/hive
Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
MapReduce Programming Model Data type key-value records
Map function
(Kin Vin) list(Kinter Vinter)
Reduce function
(Kinter list(Vinter)) list(Kout Vout)
Example Word Count
def mapper(line) foreach word in linesplit)(
output(word 1)
def reducer(key values) output(key sum(values))
Word Count Execution
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 1fox 1the 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Input Map Shuffle amp Sort Reduce Output
MapReduce Execution Details
Single master controls job execution on multiple slaves as well as user scheduling
Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use
Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key values) output(key sum(values))
Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 2fox 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Example Word Count
def mapper(line) foreach word in linesplit)(
output(word 1)
def reducer(key values) output(key sum(values))
Word Count Execution
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 1fox 1the 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Input Map Shuffle amp Sort Reduce Output
MapReduce Execution Details
Single master controls job execution on multiple slaves as well as user scheduling
Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use
Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key values) output(key sum(values))
Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 2fox 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance
Principal philosophies:
  Make it scale, so you can throw hardware at problems
  Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems, but when it works, it may save you a lot of time
Resources
Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig

Hive: http://hadoop.apache.org/hive

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Word Count Execution
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 1fox 1the 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Input Map Shuffle amp Sort Reduce Output
MapReduce Execution Details
Single master controls job execution on multiple slaves as well as user scheduling
Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use
Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key values) output(key sum(values))
Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 2fox 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
MapReduce Execution Details
Single master controls job execution on multiple slaves as well as user scheduling
Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use
Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key values) output(key sum(values))
Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 2fox 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
An Optimization The Combiner
A combiner is a local aggregation function for repeated keys produced by same map
For associative ops like sum count max
Decreases size of intermediate data
Example local counting for Word Count
def combiner(key values) output(key sum(values))
Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output
the quickbrown fox
the fox atethe mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown 2fox 2how 1now 1the 3
ate 1cow 1
mouse 1quick 1
the 1brown 1
fox 1
quick 1
the 2fox 1
how 1now 1
brown 1ate 1
mouse 1
cow 1
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In Pig Latin:

Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;
store Top5 into 'top5sites';

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
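Pig compiles these nine lines into the same multi-stage MapReduce plan as the hand-written Java above. As a quick sanity check (a sketch, assuming the script has already been entered at Pig's Grunt shell), the EXPLAIN command prints the plan Pig generates:

grunt> explain Top5;

The output lists the logical, physical, and MapReduce plans; the MapReduce plan shows the job boundaries illustrated in the Ease of Translation slides that follow.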
Ease of Translation

Notice how naturally the components of the job translate into Pig Latin:
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users   = load …
Fltrd   = filter …
Pages   = load …
Joined  = join …
Grouped = group …
Summed  = … count()…
Sorted  = order …
Top5    = limit …
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation

Notice how naturally the components of the job translate into Pig Latin, and how the statements group into MapReduce jobs:
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users   = load …
Fltrd   = filter …
Pages   = load …
Joined  = join …         } Job 1
Grouped = group …
Summed  = … count()…     } Job 2
Sorted  = order …
Top5    = limit …        } Job 3
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
"Relational database" built on Hadoop: maintains a list of table schemas; SQL-like query language (HQL); can call Hadoop Streaming scripts from HQL; supports table partitioning, clustering, complex data types, and some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitioning breaks the table into separate files for each (dt, country) pair. Ex:
/hive/page_view/dt=2008-06-08/country=US
/hive/page_view/dt=2008-06-08/country=CA
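To make the partition layout concrete, here is a sketch of loading one day's data into a single (dt, country) partition; the HDFS input path is hypothetical:

LOAD DATA INPATH '/logs/2008-06-08/US'
INTO TABLE page_views
PARTITION (dt='2008-06-08', country='US');

Hive files those rows under the matching /hive/page_view/dt=2008-06-08/country=US directory, which is what lets later queries that filter on dt and country prune down to just those files.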
Simple Query
SELECT page_views.*
FROM page_views
WHERE page_views.dt >= '2008-03-01'
  AND page_views.dt <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

Find all page views coming from xyz.com during March 2008.

Because the table is partitioned on dt, Hive reads only the partitions for March 2008 instead of scanning the entire table.
Aggregation and Joins
SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.dt = '2008-03-03'
GROUP BY pv.page_url, u.gender;

Count the users who visited each page, broken down by gender.

Sample output:

page_url    gender   count(userid)
home.php    MALE     12141412
home.php    FEMALE   15431579
photo.php   MALE     23941451
photo.php   FEMALE   21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_views.userid, page_views.dt)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;
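The deck references map_script.py without showing it. Below is a minimal sketch of what such a script could look like, assuming Hive's default streaming convention of tab-separated columns on stdin and stdout; the transformation itself is invented for illustration:

#!/usr/bin/env python
# Hypothetical map_script.py for the TRANSFORM query above.
# Hive streams each input row (userid, dt) to this script as a
# tab-separated line on stdin, and parses each tab-separated
# line on stdout back into the declared output columns (dt, uid).
import sys

for line in sys.stdin:
    userid, dt = line.rstrip("\n").split("\t")
    # Emit the date first so the output matches "AS dt, uid"
    print("%s\t%s" % (dt, userid))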
Conclusions
MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance

Principal philosophies:
Make it scale, so you can throw hardware at problems
Make it cheap, saving hardware, programmer and administration costs (but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig

Hive: http://hadoop.apache.org/hive

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Fault Tolerance in MapReduce
1 If a task crashes Retry on another node
Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk
If the same task repeatedly fails fail the job or ignore that input block (user-controlled)
Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran
Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the
other one
Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
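A practical note on the driver above: in the old mapred API, JobControl's run() loop keeps running until stop() is called rather than returning when the jobs finish, so it is usually started on its own thread and polled. A minimal sketch, reusing the jc object from the example above (the polling interval is arbitrary):

// Assumes 'jc' is the JobControl built in MRExample.main() above.
Thread driver = new Thread(jc, "pipeline-driver");
driver.start();
while (!jc.allFinished()) {
    System.out.println("Running jobs: " + jc.getRunningJobs().size());
    try {
        Thread.sleep(5000);   // poll every 5 seconds
    } catch (InterruptedException e) {
        break;
    }
}
jc.stop();   // ends the JobControl thread once the whole DAG is done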
In Pig Latin:

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
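The same pipeline can also be driven from Java through Pig's PigServer API, which queues each registered statement and runs the generated MapReduce jobs when store() is called. A minimal sketch, assuming Pig's jars are on the classpath and the users and pages files are in HDFS (Top5Sites is a hypothetical wrapper class):

import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class Top5Sites {
    public static void main(String[] args) throws IOException {
        // MAPREDUCE runs on the cluster; ExecType.LOCAL would run locally.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("Users = load 'users' as (name, age);");
        pig.registerQuery("Filtered = filter Users by age >= 18 and age <= 25;");
        pig.registerQuery("Pages = load 'pages' as (user, url);");
        pig.registerQuery("Joined = join Filtered by name, Pages by user;");
        pig.registerQuery("Grouped = group Joined by url;");
        pig.registerQuery("Summed = foreach Grouped generate group, COUNT(Joined) as clicks;");
        pig.registerQuery("Sorted = order Summed by clicks desc;");
        pig.registerQuery("Top5 = limit Sorted 5;");
        // store() triggers planning and execution of the whole script.
        pig.store("Top5", "top5sites");
    }
}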
Ease of Translation

Notice how naturally the components of the job translate into Pig Latin:
Load Users / Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …
Ease of Translation (2)

Notice how the same components group into MapReduce jobs: Pig compiles this script into three.
Load Users / Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Job 1: Users = load …
       Fltrd = filter …
       Pages = load …
       Joined = join …
Job 2: Grouped = group …
       Summed = … count()…
Job 3: Sorted = order …
       Top5 = limit …
Hive
Developed at Facebook
Used for majority of Facebook jobs
"Relational database" built on Hadoop:
  Maintains list of table schemas
  SQL-like query language (HQL)
  Can call Hadoop Streaming scripts from HQL
  Supports table partitioning, clustering, complex data types, some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
• Partitioning breaks the table into separate files for each (dt, country) pair.
  Ex: /hive/page_view/dt=2008-06-08,country=US
      /hive/page_view/dt=2008-06-08,country=CA
Simple Query

• Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

• Hive only reads the partitions for March 2008 instead of scanning the entire table.
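HQL queries like this can also be issued from Java. A minimal sketch using the HiveServer1-era JDBC driver; the host, port, and empty credentials are assumptions, and the Hive JDBC jar must be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver and connect to a running Hive server.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT page_views.* FROM page_views " +
            "WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'");
        while (rs.next()) {
            System.out.println(rs.getString(1));   // first column of each row
        }
        rs.close();
        stmt.close();
        con.close();
    }
}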
Aggregation and Joins

• Count users who visited each page, by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Sample output:

page_url    gender   count(userid)
home.php    MALE     12141412
home.php    FEMALE   15431579
photo.php   MALE     23941451
photo.php   FEMALE   21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_views.userid,
                 page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;
Conclusions
MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.

Principal philosophies:
  Make it scale, so you can throw hardware at problems
  Make it cheap, saving hardware, programmer and administration costs (but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems, but when it works, it may save you a lot of time.
Resources
Hadoop: http://hadoop.apache.org/core/
Hadoop docs: http://hadoop.apache.org/core/docs/current/
Pig: http://hadoop.apache.org/pig
Hive: http://hadoop.apache.org/hive
Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
Takeaways
By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers
User focuses on application not on complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of Translation (2)

Notice how naturally the components of the job translate into Pig Latin, and how Pig packs them into MapReduce jobs:

Load Users / Load Pages → Filter by age → Join on name → Group on url → Count clicks → Order by clicks → Take top 5

Users   = load …        |
Fltrd   = filter …      |  Job 1
Pages   = load …        |
Joined  = join …        |
Grouped = group …       |  Job 2
Summed  = … count()…    |
Sorted  = order …       |  Job 3
Top5    = limit …       |

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
"Relational database" built on Hadoop: maintains a list of table schemas; SQL-like query language (HQL); can call Hadoop Streaming scripts from HQL; supports table partitioning, clustering, complex data types, and some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair.
  Ex: /hive/page_view/dt=2008-06-08/country=US
      /hive/page_view/dt=2008-06-08/country=CA
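For concreteness, here is a hedged sketch of how one such partition might be filled, assuming a day's logs already sit in HDFS at a hypothetical path (LOAD DATA and PARTITION are standard HiveQL):

LOAD DATA INPATH '/logs/2008-06-08/US'        -- hypothetical source path
INTO TABLE page_views
PARTITION (dt='2008-06-08', country='US');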
Simple Query
• Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.dt >= '2008-03-01'
  AND page_views.dt <= '2008-03-31'
  AND page_views.referrer_url LIKE '%xyz.com';

• Because dt is a partition column, Hive reads only the partitions for those dates instead of scanning the entire table.
Aggregation and Joins
• Count users who visited each page, by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.dt = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Sample output:

page_url    gender   count(userid)
home.php    MALE     12141412
home.php    FEMALE   15431579
photo.php   MALE     23941451
photo.php   FEMALE   21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;
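The script itself is not shown in the slides; a minimal sketch of what such a map_script.py could look like, assuming Hive's default tab-separated streaming format:

#!/usr/bin/env python
# Hypothetical map_script.py: Hive streams each input row to stdin as
# tab-separated fields (userid, date) and parses each stdout line as
# tab-separated output columns, in the order named by the AS clause.
import sys

for line in sys.stdin:
    userid, date = line.rstrip('\n').split('\t')
    # Swap the columns to match AS dt, uid.
    print(date + '\t' + userid)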
Conclusions
MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance

Principal philosophies:
  Make it scale, so you can throw hardware at problems
  Make it cheap: save hardware, programmer, and administration costs (but require fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems, but when it works, it may save you a lot of time
Resources
Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig

Hive: http://hadoop.apache.org/hive

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
1 Search
Input (lineNumber line) records
Output lines matching a given pattern
Map if(line matches pattern) output(line)
Reduce identify function Alternative no reducer (map-only job)
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
pigsheepyakzebra
aardvarkantbeecowelephant
2 Sort
Input (key value) records
Output same records sorted by key
Map identity function
Reduce identify function
Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)
Map
Map
Map
Reduce
Reduce
ant bee
zebra
aardvarkelephant
cow
pig
sheep yak
[A-M]
[N-Z]
3 Inverted Index
Input (filename text) records
Output list of files containing each word
Map foreach word in textsplit() output(word filename)
Combine uniquify filenames for each word (eliminate duplicates)
Reducedef reduce(word filenames) output(word sort(filenames))
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Inverted Index Example
to be or not to be afraid (12thtxt)
be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)
hamlettxt
be not afraid of greatness
12thtxt
to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt
be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
4 Most Popular Words
Input (filename text) records
Output the 100 words occurring in most files
Two-stage solution Job 1
Create inverted index giving (word list(file)) records Job 2
Map each (word list(file)) to (count word) Sort these records by count as in sort job
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive
Cloudera Virtual Manager
Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop
Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox
Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled
A 4 GB of total RAM The total system memory required varies depending on the size
of your data set and on the other processes that are running
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Cloudera Virtual Machine

The Cloudera VM contains a single-node Apache Hadoop cluster, along with everything you need to get started with Hadoop.

Requirements:
A 64-bit host OS.
Virtualization software: VMware Player, KVM, or VirtualBox. Virtualization software requires a machine that supports virtualization; if you are unsure, one way to check is to look in your BIOS and see whether virtualization is enabled.
4 GB of total RAM. The total system memory required varies depending on the size of your data set and on the other processes that are running.
Installation

Step 1: Download and run VMware Player.
Step 2: Download the Cloudera VM.
Step 3: Extract it to the Cloudera folder.
Step 4: Open cloudera-quickstart-vm-4.4.0-1-vmware.

Once you have the software installed, fire up the VirtualBox image of the Cloudera QuickStart VM and you should see an initial screen similar to the one below.
WordCount Example

This example computes the occurrence frequency of each word in a text file.

Steps:
1. Set up the Hadoop environment
2. Upload files into HDFS (see the sketch after the tutorial link)
3. Execute the Java MapReduce functions in Hadoop

Tutorial: https://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1.html
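Step 2 can be done with the hadoop fs shell or programmatically. Below is a minimal Java sketch using the HDFS FileSystem API; the local and HDFS paths are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);      // the default file system (HDFS on the VM)
        // Copy a local input file into an HDFS directory for the job to read.
        fs.copyFromLocalFile(new Path("/home/cloudera/input.txt"),
                             new Path("/user/cloudera/wordcount/input/input.txt"));
        fs.close();
    }
}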
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    // Mapper: emits (word, 1) for every token in the input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
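Note that Reduce also serves as the combiner (conf.setCombinerClass(Reduce.class)): because per-word summation is associative and commutative, partial sums computed on the map side produce the same final counts while cutting the amount of data shuffled to the reducers.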
How it Works

Purpose: simple serialization for keys, values, and other data.

Interface Writable: reads and writes the binary format; convert to String for text formats. A minimal example follows.
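As an illustration, here is a sketch of a custom Writable; this PairWritable type is hypothetical, not part of Hadoop.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value type: a pair of ints in Hadoop's binary serialization.
public class PairWritable implements Writable {
    private int first;
    private int second;

    public PairWritable() {}   // Hadoop instantiates Writables reflectively

    public void write(DataOutput out) throws IOException {
        out.writeInt(first);   // binary format: two 4-byte ints
        out.writeInt(second);
    }

    public void readFields(DataInput in) throws IOException {
        first = in.readInt();  // must read fields in the same order write() wrote them
        second = in.readInt();
    }

    @Override
    public String toString() { // String form, e.g. for text output formats
        return first + "\t" + second;
    }
}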
Classes

OutputCollector: collects the data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job.
Methods: collect(K key, V value)
Interface

Text: stores text using standard UTF-8 encoding, and provides methods to serialize, deserialize, and compare texts at the byte level.
Methods: set(byte[] utf8)
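A small usage sketch of the Text API described above (the strings are made up):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.io.Text;

public class TextDemo {
    public static void main(String[] args) {
        Text greeting = new Text("hello");                    // stored internally as UTF-8 bytes
        Text other = new Text();
        other.set("world".getBytes(StandardCharsets.UTF_8)); // set(byte[] utf8), as on the slide
        // Comparison happens at the byte level, which is also how keys
        // are ordered during the shuffle and sort phase.
        System.out.println(greeting.compareTo(other));        // negative: "hello" sorts before "world"
        System.out.println(other);                            // toString() decodes the bytes back to a String
    }
}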
Demo
Exercises

Get WordCount running.

Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words. For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc.
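One possible starting point for the bigram exercise, assuming the same old-API WordCount skeleton as above (only the mapper changes; the summing Reduce class can be reused as-is):

// Sketch of a bigram mapper: emits ("w1 w2", 1) for each adjacent word pair in a line.
public static class BigramMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text bigram = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        String[] words = value.toString().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            bigram.set(words[i] + " " + words[i + 1]);  // the word pair becomes the key
            output.collect(bigram, one);
        }
    }
}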
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive (reading)
Motivation

MapReduce is great, as many algorithms can be expressed as a series of MR jobs.

But it's low-level: you must think about keys, values, partitioning, etc.

Can we capture common "job patterns"?
Pig

Started at Yahoo! Research.

Now runs about 30% of Yahoo!'s jobs.

Features:
Expresses sequences of MapReduce jobs.
Data model: nested "bags" of items.
Provides relational (SQL) operators (JOIN, GROUP BY, etc.).
Easy to plug in Java functions (see the UDF sketch below).
Pig Pen development environment for Eclipse.
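As a sketch of plugging in a Java function, a Pig eval UDF extends EvalFunc<T> and overrides exec(); this hypothetical UPPER function upper-cases a string field.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical user-defined function: upper-cases its first argument.
// In a Pig script: REGISTER myudfs.jar; B = foreach A generate myudfs.UPPER(name);
public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;                      // propagate null for empty input
        }
        String str = (String) input.get(0);   // first field of the input tuple
        return str == null ? null : str.toUpperCase();
    }
}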
An Example Problem

Suppose you have user data in one file and website data in another, and you need to find the top 5 most-visited pages by users aged 18-25.

Load Users    Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {

    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            // Do the cross product and collect the values.
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {
        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // Drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {
        public void reduce(Text key, Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {
        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable) val, (Text) key);
        }
    }

    public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;
        public void reduce(LongWritable key, Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In Pig Latin

Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;
store Top5 into 'top5sites';

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation

Notice how naturally the components of the job translate into Pig Latin:

Load Users / Load Pages  →  Users = load … / Pages = load …
Filter by age            →  Fltrd = filter …
Join on name             →  Joined = join …
Group on url             →  Grouped = group …
Count clicks             →  Summed = … count()…
Order by clicks          →  Sorted = order …
Take top 5               →  Top5 = limit …

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation (2)

Notice also how the pipeline stages group into MapReduce jobs:

Job 1: Load Users / Load Pages, Filter by age, Join on name
       (Users = load …; Fltrd = filter …; Pages = load …; Joined = join …)
Job 2: Group on url, Count clicks
       (Grouped = group …; Summed = … count()…)
Job 3: Order by clicks, Take top 5
       (Sorted = order …; Top5 = limit …)

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Hive

Developed at Facebook.

Used for the majority of Facebook jobs.

A "relational database" built on Hadoop:
Maintains a list of table schemas.
SQL-like query language (HQL).
Can call Hadoop Streaming scripts from HQL.
Supports table partitioning, clustering, complex data types, and some optimizations.
Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitioning breaks the table into separate files for each (dt, country) pair.
Ex: /hive/page_view/dt=2008-06-08/country=US
    /hive/page_view/dt=2008-06-08/country=CA
Simple Query

Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

Hive only reads the matching date partitions instead of scanning the entire table.
Aggregation and Joins

Count users who visited each page, broken down by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

Sample output:

page_url    gender    count(userid)
home.php    MALE      12,141,412
home.php    FEMALE    15,431,579
photo.php   MALE      23,941,451
photo.php   FEMALE    21,231,314
Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;
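With TRANSFORM, Hive pipes the selected columns to the user script as tab-separated lines on stdin and parses the script's tab-separated stdout back into the named columns (dt and uid here); map_script.py is an external user script that the slides do not show.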
Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.

Principal philosophies:
Make it scale, so you can throw hardware at problems.
Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance).

Hive and Pig further simplify programming.

MapReduce is not suitable for all problems, but when it works it may save you a lot of time.
Resources

Hadoop: http://hadoop.apache.org/core/
Hadoop docs: http://hadoop.apache.org/core/docs/current/
Pig: http://hadoop.apache.org/pig/
Hive: http://hadoop.apache.org/hive/
Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Installation
Step1 Download amp Run Vmware
Step2 Download Cloudera VM
Step3 Extract to the Cloudera folder
Step4 Open the cloudera-quickstart-vm-440-1-vmware
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below
WordCount Example
This example computes the occurrence frequency of each word in a text file
Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop
bull Tutorial
httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
WordCount Example

This example computes the occurrence frequency of each word in a text file.

Steps:
1. Set up the Hadoop environment
2. Upload the input files into HDFS
3. Execute the Java MapReduce job in Hadoop

• Tutorial: https://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1.html
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);   // emit (word, 1) for each token
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));   // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);   // combiner reuses the reducer
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
How it Works

Purpose: simple serialization for keys, values, and other data.

Interface: Writable
Reads and writes the binary format; convert to String for text formats.

Class: OutputCollector
Collects data output by either the Mapper or the Reducer, i.e., intermediate outputs or the output of the job.
Methods: collect(K key, V value)

Interface: Text
Stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level.
Methods: set(byte[] utf8)
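For concreteness, here is a minimal custom Writable, a hypothetical PairWritable of our own (not from the slides), assuming only the standard org.apache.hadoop.io.Writable interface: write() serializes the fields in binary, and readFields() must read them back in exactly the same order.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical example type: a Writable carrying two ints.
public class PairWritable implements Writable {
    private int first;
    private int second;

    public PairWritable() {}                 // Hadoop needs a no-arg constructor

    public PairWritable(int first, int second) {
        this.first = first;
        this.second = second;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(first);                 // binary serialization
        out.writeInt(second);
    }

    public void readFields(DataInput in) throws IOException {
        first = in.readInt();                // must mirror write() exactly
        second = in.readInt();
    }

    public String toString() {               // used when converting to text formats
        return first + "\t" + second;
    }
}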
Demo
Exercises

1. Get WordCount running.
2. Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words. For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc. (A possible mapper is sketched below.)
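One way to start the bigram exercise, offered as a sketch of ours rather than a provided solution: keep the previous token while scanning the line, and emit each pair as a single space-separated key so the existing Reduce class can sum the counts unchanged (imports as in the WordCount class above).

public static class BigramMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text bigram = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        String prev = null;
        while (tokenizer.hasMoreTokens()) {
            String cur = tokenizer.nextToken();
            if (prev != null) {
                bigram.set(prev + " " + cur);   // e.g. "Bigrams are"
                output.collect(bigram, one);    // same (key, 1) shape as WordCount
            }
            prev = cur;
        }
    }
}

Note that because map() sees one line at a time, bigrams spanning line boundaries are missed; handling those would need a custom InputFormat or extra bookkeeping.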
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive (reading)
Motivation

MapReduce is great, as many algorithms can be expressed as a series of MR jobs.

But it's low-level: you must think about keys, values, partitioning, etc.

Can we capture common "job patterns"?
Pig

Started at Yahoo! Research

Now runs about 30% of Yahoo!'s jobs

Features:
Expresses sequences of MapReduce jobs
Data model: nested "bags" of items
Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
Easy to plug in Java functions (a small UDF sketch follows)
Pig Pen development environment for Eclipse
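To make "easy to plug in Java functions" concrete: a Pig user-defined function is a Java class extending EvalFunc. The sketch below is a hypothetical ToUpper UDF of our own, not from the slides, assuming the standard org.apache.pig.EvalFunc API.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: upper-cases its first argument.
public class ToUpper extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;                     // Pig treats null as "no result"
        return ((String) input.get(0)).toUpperCase();
    }
}

After packaging the class into a jar, a script would REGISTER the jar and then call ToUpper(name) like any built-in function.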
An Example Problem
Suppose you have user data in one file and website click data in another, and you need to find the top 5 most-visited pages by users aged 18-25. The processing steps:
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In MapReduce:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {

    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file it came from
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file it came from
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and store it accordingly
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {
        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // Drop the rest of the record; just pass a 1 for the combiner/reducer to sum
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {
        public void reduce(Text key, Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {
        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable) val, (Text) key);
        }
    }

    public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;
        public void reduce(LongWritable key, Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In Pig Latin:

Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;
store Top5 into 'top5sites';
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation

Notice how naturally the components of the job translate into Pig Latin:

Load Users      → Users = load …
Load Pages      → Pages = load …
Filter by age   → Fltrd = filter …
Join on name    → Joined = join …
Group on url    → Grouped = group …
Count clicks    → Summed = … count()…
Order by clicks → Sorted = order …
Take top 5      → Top5 = limit …
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation (2)

Notice also how the same components map onto MapReduce job boundaries: Pig compiles the script into a chain of three jobs.

Job 1 (load, filter by age, join on name):  Users = load …  Fltrd = filter …  Pages = load …  Joined = join …
Job 2 (group on url, count clicks):         Grouped = group …  Summed = … count()…
Job 3 (order by clicks, take top 5):        Sorted = order …  Top5 = limit …
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Hive

Developed at Facebook

Used for the majority of Facebook jobs

A "relational database" built on Hadoop:
Maintains a list of table schemas
SQL-like query language (HQL)
Can call Hadoop Streaming scripts from HQL
Supports table partitioning, clustering, complex data types, and some optimizations
Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair.
Ex: /hive/page_view/dt=2008-06-08/country=US
    /hive/page_view/dt=2008-06-08/country=CA
Simple Query

• Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

• Hive only reads the March 2008 partitions instead of scanning the entire table.
Aggregation and Joins

• Count the users who visited each page, broken down by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Sample output:

page_url    gender    count(userid)
home.php    MALE      12141412
home.php    FEMALE    15431579
photo.php   MALE      23941451
photo.php   FEMALE    21231314
Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid,
                 page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;
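To see what the transform script's contract is, here is a minimal sketch, written in Java for consistency with the rest of this handout (the slides do not show map_script.py itself, so its behavior here is an assumption): Hadoop Streaming passes tab-separated input fields on stdin and expects tab-separated output fields on stdout, so this program reads (userid, date) rows and emits (dt, uid) rows as the AS clause expects.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Sketch of a streaming transform: reads "userid<TAB>date" lines from
// stdin and writes "date<TAB>userid" lines to stdout.
public class MapScript {
    public static void main(String[] args) throws IOException {
        BufferedReader in =
                new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split("\t");
            if (fields.length < 2) continue;   // skip malformed rows
            System.out.println(fields[1] + "\t" + fields[0]);
        }
    }
}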
Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.

Principal philosophies:
Make it scale, so you can throw hardware at problems.
Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance).

Hive and Pig further simplify programming.

MapReduce is not suitable for all problems, but when it works, it may save you a lot of time.
Resources

Hadoop: http://hadoop.apache.org/core/
Hadoop docs: http://hadoop.apache.org/core/docs/current/
Pig: http://hadoop.apache.org/pig
Hive: http://hadoop.apache.org/hive
Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt
private final static IntWritable one = new IntWritable(1) private Text word = new Text)(
public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
String line = valuetoString)(
StringTokenizer tokenizer = new StringTokenizer(line)
while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())
outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt
public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException
int sum = 0
while (valueshasNext()) sum += valuesnext()get)(
outputcollect(key new IntWritable(sum))
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)
confsetJobName(wordcount)
confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)
confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)
confsetReducerClass(Reduceclass)
confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)
FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))
JobClientrunJob(conf)
How it Works
Purpose Simple serialization for keys values and other data
Interface Writable Read and write binary format Convert to String for text formats
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
How it Works
Purpose: simple serialization for keys, values, and other data.
Interface Writable: read and write the binary format; convert to String for text formats.
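To make the contract concrete, here is a minimal sketch of a custom Writable (a hypothetical PageCount type, not part of the lecture's code), assuming the classic org.apache.hadoop.io API used throughout this deck:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class PageCount implements Writable {
        private String url;
        private long clicks;

        public PageCount() {}  // the framework needs a no-arg constructor

        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);      // write the binary format
            out.writeLong(clicks);
        }

        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();     // read back in the same field order
            clicks = in.readLong();
        }

        public String toString() {  // used when converting to text formats
            return url + "\t" + clicks;
        }
    }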
Classes
OutputCollector: collects the data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job.
Methods: collect(K key, V value)
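As an illustrative sketch (not from the lecture), this is how a mapper hands records to the framework through OutputCollector, using the old org.apache.hadoop.mapred API that the rest of this deck uses:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TokenMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        public void map(LongWritable offset, Text line,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Each collect() call emits one intermediate (key, value) record
            for (String word : line.toString().split("\\s+")) {
                oc.collect(new Text(word), new LongWritable(1L));
            }
        }
    }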
Interface
Text: stores text using standard UTF8 encoding. Provides methods to serialize, deserialize, and compare texts at the byte level.
Methods: set(byte[] utf8)
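A small usage sketch (illustrative only) of that byte-level API:

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.io.Text;

    public class TextDemo {
        public static void main(String[] args) {
            Text a = new Text();
            a.set("héllo".getBytes(StandardCharsets.UTF_8)); // set(byte[] utf8)
            Text b = new Text("héllo");
            System.out.println(a.compareTo(b)); // 0: Texts compare at the byte level
            System.out.println(a.getLength());  // length in UTF-8 bytes, not characters
        }
    }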
Demo
Exercises
Get WordCount running.
Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words.
For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc. One possible starting point is sketched below.
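A sketch of a mapper for the bigram exercise (not the official solution); WordCount's summing reducer can be reused unchanged:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class BigramMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            String[] words = val.toString().split("\\s+");
            // Emit each pair of consecutive words as a single key
            for (int i = 0; i + 1 < words.length; i++) {
                oc.collect(new Text(words[i] + " " + words[i + 1]),
                           new LongWritable(1L));
            }
        }
    }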
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive (reading)
Motivation
MapReduce is great, as many algorithms can be expressed as a series of MR jobs.
But it's low-level: you must think about keys, values, partitioning, etc.
Can we capture common "job patterns"?
Pig
Started at Yahoo! Research.
Now runs about 30% of Yahoo!'s jobs.
Features:
    Expresses sequences of MapReduce jobs
    Data model: nested "bags" of items
    Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
    Easy to plug in Java functions (see the UDF sketch below)
    Pig Pen: a development environment for Eclipse
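To illustrate the "easy to plug in Java functions" point, here is a minimal sketch of a Pig UDF (a hypothetical ToUpper function, assuming Pig's EvalFunc API); a script would use it after REGISTER myudfs.jar as myudfs.ToUpper(name):

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class ToUpper extends EvalFunc<String> {
        public String exec(Tuple input) throws IOException {
            // Pig calls exec() once per record; return null for empty input
            if (input == null || input.size() == 0) return null;
            return ((String) input.get(0)).toUpperCase();
        }
    }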
An Example Problem
Suppose you have user data in one file and website data in another, and you need to find the top 5 most visited pages by users aged 18-25.

    Load Users / Load Pages
    Filter by age
    Join on name
    Group on url
    Count clicks
    Order by clicks
    Take top 5

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In MapReduce:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {

    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and
            // store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();

            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }

            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {
        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // Drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {
        public void reduce(Text key, Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {
        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable)val, (Text)key);
        }
    }

    public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;
        public void reduce(LongWritable key, Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp,
            new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu,
            new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join,
            new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join,
            new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join,
            new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group,
            new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group,
            new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100,
            new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100,
            new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl(
            "Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In Pig Latin:

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation
Notice how naturally the components of the job translate into Pig Latin:

    Load Users / Load Pages  ->  Users = load ... / Pages = load ...
    Filter by age            ->  Fltrd = filter ...
    Join on name             ->  Joined = join ...
    Group on url             ->  Grouped = group ...
    Count clicks             ->  Summed = ... count() ...
    Order by clicks          ->  Sorted = order ...
    Take top 5               ->  Top5 = limit ...

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation (2)
The same pipeline, showing how the steps collapse into three MapReduce jobs:

    Job 1:  Load Users / Load Pages  ->  Users = load ... / Pages = load ...
            Filter by age            ->  Fltrd = filter ...
            Join on name             ->  Joined = join ...
    Job 2:  Group on url             ->  Grouped = group ...
            Count clicks             ->  Summed = ... count() ...
    Job 3:  Order by clicks          ->  Sorted = order ...
            Take top 5               ->  Top5 = limit ...

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Hive
Developed at Facebook.
Used for the majority of Facebook jobs.
"Relational database" built on Hadoop:
    Maintains a list of table schemas
    SQL-like query language (HQL)
    Can call Hadoop Streaming scripts from HQL
    Supports table partitioning, clustering, complex data types, and some optimizations
Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitioning breaks the table into separate files for each (dt, country) pair, e.g.:
    /hive/page_view/dt=2008-06-08/country=US
    /hive/page_view/dt=2008-06-08/country=CA
Simple Query

Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

Hive only reads the dt partitions for March 2008, instead of scanning the entire table.
Aggregation and Joins

Count the users who visited each page, broken down by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

Sample output:

page_url    gender   count(userid)
home.php    MALE     12,141,412
home.php    FEMALE   15,431,579
photo.php   MALE     23,941,451
photo.php   FEMALE   21,231,314
Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;
Conclusions
MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.
Principal philosophies:
    Make it scale, so you can throw hardware at problems.
    Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance).
Hive and Pig further simplify programming.
MapReduce is not suitable for all problems, but when it works, it may save you a lot of time.
Resources
Hadoop: http://hadoop.apache.org/core/
Hadoop docs: http://hadoop.apache.org/core/docs/current/
Pig: http://hadoop.apache.org/pig
Hive: http://hadoop.apache.org/hive
Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Classes
OutputCollector Collects data output by either the Mapper or the Reducer ie
intermediate outputs or the output of the job Methods
collect(K key V value)
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Interface
Text Stores text using standard UTF8 encoding It provides methods to
serialize deserialize and compare texts at byte level Methods
set(byte[] utf8)
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Demo
40
Exercises
Get WordCount running
Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words
For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
Exercises

Get WordCount running.

Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words; for example, the previous sentence contains the bigrams "Bigrams are", "are simply", "simply sequences", "sequences of", and so on.
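A sketch of the bigram mapper in the same pseudocode style as the WordCount example (the WordCount reducer, which sums the counts per key, works unchanged):

def mapper(line):
    words = line.split()
    # Each adjacent pair of words is one bigram; emit it with count 1.
    for i in range(len(words) - 1):
        output((words[i], words[i+1]), 1)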
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop Pig and Hive (reading)
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Motivation
MapReduce is great as many algorithms can be expressed by a series of MR jobs
But itrsquos low-level must think about keys values partitioning etc
Can we capture common ldquojob patternsrdquo
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
Pig
Started at Yahoo Research
Now runs about 30 of Yahoorsquos jobs
Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators
(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))
reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)
lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5
store Top5 into lsquotop5sitesrsquo
In Pig Latin
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Ease of TranslationNotice how naturally the components of the job translate into Pig Latin
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip
Job 1
Job 2
Job 3
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
Hive
Developed at Facebook
Used for majority of Facebook jobs
ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex
data types some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT userid BIGINT
page_url STRING referrer_url STRING
ip STRING COMMENT User IP address)
COMMENT This is the page view table
PARTITIONED BY(dt STRING country STRING)
STORED AS SEQUENCEFILE
bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA
Simple Query
SELECT page_views
FROM page_views
WHERE page_viewsdate gt= 2008-03-01
AND page_viewsdate lt= 2008-03-31
AND page_viewsreferrer_url like xyzcom
bullHive only reads partition 2008-03-01 instead of scanning entire table
bullFind all page views coming from xyzcom on March 31st
Aggregation and Joins
SELECT pvpage_url ugender COUNT(DISTINCT uid)
FROM page_views pv JOIN user u ON (pvuserid = uid)
GROUP BY pvpage_url ugender
WHERE pvdate = 2008-03-03
bullCount users who visited each page by gender
bullSample outputpage_url gender count(userid)
homephp MALE 12141412
homephp FEMALE 15431579
photophp MALE 23941451
photophp FEMALE 21231314
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_viewsuserid
page_viewsdate)
USING map_scriptpy
AS dt uid CLUSTER BY dt
FROM page_views
Conclusions
MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance
Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs
(but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems but when it works it may save you a lot of time
Resources
Hadoop httphadoopapacheorgcore
Hadoop docs
httphadoopapacheorgcoredocscurrent
Pig httphadoopapacheorgpig
Hive httphadoopapacheorghive
Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training
- Introduction to MapReduce and Hadoop
- What is MapReduce
- What is MapReduce used for
- What is MapReduce used for (2)
- Outline
- MapReduce Design Goals
- Typical Hadoop Cluster
- Typical Hadoop Cluster (2)
- Challenges
- Hadoop Components
- Hadoop Distributed File System
- MapReduce Programming Model
- Example Word Count
- Word Count Execution
- MapReduce Execution Details
- An Optimization The Combiner
- Word Count with Combiner
- Outline (2)
- Fault Tolerance in MapReduce
- Fault Tolerance in MapReduce (2)
- Fault Tolerance in MapReduce (3)
- Takeaways
- Outline (3)
- 1 Search
- 2 Sort
- 3 Inverted Index
- Inverted Index Example
- 4 Most Popular Words
- Outline (4)
- Cloudera Virtual Manager
- Installation
- Once you got the software installed fire up the VirtualBox ima
- WordCount Example
- Slide 34
- Slide 35
- Slide 36
- How it Works
- Classes
- Interface
- Demo
- Exercises
- Outline (5)
- Motivation
- Pig
- An Example Problem
- In MapReduce
- In Pig Latin
- Ease of Translation
- Ease of Translation (2)
- Hive
- Creating a Hive Table
- Simple Query
- Aggregation and Joins
- Using a Hadoop Streaming Mapper Script
- Conclusions
- Resources
-
An Example Problem
Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {

    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and
            // store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();

            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }

            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {

        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {

        public void reduce(Text key, Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {

        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable) val, (Text) key);
        }
    }

    public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {

        int count = 0;

        public void reduce(LongWritable key, Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp,
            new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu,
            new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join,
            new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join,
            new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100,
            new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
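A job chain like this is typically compiled against the Hadoop jars and launched from the command line; a minimal sketch, assuming a hadoop-core jar on the classpath (exact file name varies by version) and a hypothetical output jar mrexample.jar:

# compile against the old mapred API and package the classes (paths are illustrative)
javac -classpath hadoop-core.jar MRExample.java
jar cf mrexample.jar MRExample*.class
# submit the job chain to the cluster
hadoop jar mrexample.jar MRExample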
In Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
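To run the script, one would save it to a file and hand it to the pig client; a minimal sketch, assuming a hypothetical file name top5.pig and Pig's local execution mode:

# run locally against the local filesystem; drop "-x local" to run on the cluster
pig -x local top5.pig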
Ease of Translation

Notice how naturally the components of the job translate into Pig Latin:

Load Users / Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation (2)

Notice how naturally the components of the job translate into Pig Latin, and how Pig groups the statements into MapReduce jobs:

Job 1: Load Users / Load Pages, Filter by age, Join on name
Job 2: Group on url, Count clicks
Job 3: Order by clicks, Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
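Pig can display this breakdown itself: the EXPLAIN operator prints the logical, physical, and MapReduce plans for an alias. A minimal sketch, assuming the script above has been entered in the Grunt shell (output format varies by Pig version):

grunt> explain Top5;
-- prints the plans for Top5, including how the
-- statements are grouped into MapReduce jobs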
Hive

Developed at Facebook
Used for majority of Facebook jobs
"Relational database" built on Hadoop
  Maintains list of table schemas
  SQL-like query language (HQL)
  Can call Hadoop Streaming scripts from HQL
  Supports table partitioning, clustering, complex data types, some optimizations
Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair
  Ex: /hive/page_view/dt=2008-06-08,country=US
      /hive/page_view/dt=2008-06-08,country=CA
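Partitions are populated explicitly at load time; a minimal sketch, assuming a hypothetical file page_views_us.seq already in the table's SequenceFile format:

LOAD DATA LOCAL INPATH 'page_views_us.seq'
INTO TABLE page_views
PARTITION(dt='2008-06-08', country='US');
-- the data lands in that partition's own directory, as in the example paths above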
Simple Query

• Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

• Hive only reads the matching partitions (e.g., 2008-03-01) instead of scanning the entire table
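The pruning works because the filter touches the partition column: with page_views partitioned by dt as defined above, Hive maps the date range onto partition directories and skips the rest. A minimal sketch of the same query written against the partition column directly (assuming dt carries the view date):

SELECT page_views.*
FROM page_views
WHERE page_views.dt >= '2008-03-01'
  AND page_views.dt <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';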
Aggregation and Joins

• Count users who visited each page, by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Sample output:

page_url    gender    count(userid)
home.php    MALE      12,141,412
home.php    FEMALE    15,431,579
photo.php   MALE      23,941,451
photo.php   FEMALE    21,231,314
Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid,
                 page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;
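The mapper script itself is not shown on the slide; a minimal sketch of what map_script.py might look like, assuming Hive streams tab-separated (userid, date) rows on stdin and reads tab-separated (dt, uid) rows back from stdout:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    # one input row per line, columns separated by tabs
    userid, date = line.rstrip('\n').split('\t')
    # emit columns in the (dt, uid) order declared by the AS clause
    print('\t'.join([date, userid]))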
Conclusions

• MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance
• Principal philosophies:
  - Make it scale, so you can throw hardware at problems
  - Make it cheap, saving hardware, programmer and administration costs (but requiring fault tolerance)
• Hive and Pig further simplify programming
• MapReduce is not suitable for all problems, but when it works, it may save you a lot of time
Resources

Hadoop: http://hadoop.apache.org/core/
Hadoop docs: http://hadoop.apache.org/core/docs/current/
Pig: http://hadoop.apache.org/pig
Hive: http://hadoop.apache.org/hive
Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training