Mastering Hadoop Map Reduce - Custom Types and Other Optimizations


Transcript of Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Page 1: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Mastering Map Reduce

Scott Crespo

Page 2: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Path to Success

Map Reduce Refresher

Optimization Strategies

Custom Type Example

Applications

Page 3: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

What’s Hadoop?

A framework that facilitates data flow through a cluster of servers

Page 4: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

What’s Map Reduce?

A paradigm for analyzing distributed data sets

Raw Data → (K, [V1..Vn]) → (K, V)

Page 5: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

What About Hive And Pig?

Use them whenever possible!

Page 6: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Data States in Map Reduce (Letter Count)

Input: Hello World

Split: Hello | World

Map: (H,1) (E,1) (L,1) (L,1) (O,1) | (W,1) (O,1) (R,1) (L,1) (D,1)

Partition/Shuffle: (H,[1]) (E,[1]) (L,[1,1,1]) (O,[1,1]) | (W,[1]) (R,[1]) (D,[1])

Reduce: (H,1) (E,1) (L,3) (O,2) (W,1) (R,1) (D,1)
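The four stages above can be walked through in plain Python. This is only an in-memory sketch of the data states for the letter count example, not how Hadoop itself runs a job; the function names are made up for illustration.

```python
from collections import defaultdict


def split_input(text):
    # Split: one record per whitespace-separated word
    return text.split()


def map_letters(word):
    # Map: emit a (letter, 1) pair for every letter in the record
    return [(ch.upper(), 1) for ch in word]


def shuffle(pairs):
    # Partition/Shuffle: collect all values that share a key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(grouped):
    # Reduce: sum the list of values for each key
    return {key: sum(values) for key, values in grouped.items()}


def letter_count(text):
    records = split_input(text)
    mapped = [pair for record in records for pair in map_letters(record)]
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    print(letter_count("Hello World"))
```

Each function corresponds to one row of the slide, so printing the intermediate values reproduces the data states shown there.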

Page 7: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Basic Map Reduce Program Structure

MyMapReduceProgram {

    MyMapperClass extends Mapper {
        map() {
            // map code
        }
    }

    MyReducerClass extends Reducer {
        reduce() {
            // reduce code
        }
    }

    main() {
        // driver code
    }
}

Page 8: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Advanced Optimizations

Drivers

Custom Types

Setup Methods

Partitioning

Combiners

Chaining

Fault Tolerance

Page 9: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Generating N-Grams

N-Gram: A contiguous sequence of n elements taken from a longer sequence (here, n consecutive words in a line of text).

Trigram: “The quick brown fox jumps over the lazy dog”

(the quick brown), (quick brown fox), (brown fox jumps),

(fox jumps over), (jumps over the), (over the lazy), (the lazy dog)
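A sliding window over the word list is enough to generate these. The helper below is a minimal sketch; the name `ngrams` is an illustrative choice, not something from the talk.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in order of appearance."""
    # range() is empty when there are fewer than n tokens, so short inputs yield []
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    for gram in ngrams(words, 3):
        print(" ".join(gram))
```

A sentence of 9 words yields 9 - 3 + 1 = 7 trigrams, which matches the list on the slide.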

Page 10: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Solution Design

NGramCounter {

    NGramMapper {
        map() {
            // Tokenize and Sanitize Inputs
            // Create NGram
            // Output (NGram ngram, Int count)
        }
    }

    NGramCombiner {
        combine() {
            // Sum local NGram counts that are of the same key
            // Output (NGram ngram, Int count)
        }
    }

    NGramReducer {
        reduce() {
            // Sum NGram counts of the same key
            // Output (NGram ngram, Int count)
        }
    }

}

Custom Type!
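To see how the three pieces of this design fit together, here is a small in-memory sketch. It assumes each input line stands in for one mapper's share of the data, and it uses plain tuples where the real job would use the custom NGram type; all function names are hypothetical.

```python
import re
from collections import Counter


def ngram_map(line, n=3):
    # Tokenize and sanitize the input, then emit (ngram, 1) pairs
    words = re.sub(r"[^a-z\s]", "", line.lower()).split()
    return [(tuple(words[i:i + n]), 1) for i in range(len(words) - n + 1)]


def ngram_combine(pairs):
    # Sum counts that share a key within a single mapper's output
    local = Counter()
    for key, count in pairs:
        local[key] += count
    return list(local.items())


def ngram_reduce(all_pairs):
    # Sum counts per key across every mapper's (already combined) output
    totals = Counter()
    for key, count in all_pairs:
        totals[key] += count
    return dict(totals)


def run(lines, n=3):
    combined = []
    for line in lines:  # each line plays the role of one mapper's input
        combined.extend(ngram_combine(ngram_map(line, n)))
    return ngram_reduce(combined)


if __name__ == "__main__":
    print(run(["the quick brown fox", "The quick brown!"]))
```

Because addition is associative and commutative, summing locally in the combiner and again in the reducer gives the same totals as summing once in the reducer, with fewer pairs crossing the network.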

Page 11: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Work Flow

Prototype (Python)

Custom Type (Trigram)

Unit Tests

Mapper

Reducer

Page 12: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Prototype

Quick and Dirty Python

Page 13: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Prototype

import sys

def test_mapper():
    lines = ["the quick brown fox jumped over the lazy dog", "the quick brown"]
    for line in lines:
        words = line.split()
        length = len(words)
        sys.stdout.write("\nLength of %d \n-------------------\n" % length)
        i = 0
        # slide a three-word window across the line
        while (i+2 < length):
            first = words[i]
            second = words[i+1]
            third = words[i+2]
            trigram = "%s %s %s \n" % (first, second, third)
            sys.stdout.write(trigram)
            i += 1

Page 14: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Output

Length of 9
-------------------
the quick brown
quick brown fox
brown fox jumped
fox jumped over
jumped over the
over the lazy
the lazy dog

Length of 3
-------------------
the quick brown

Page 15: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Custom Data Types

Page 16: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Custom Key Types

Must implement Hadoop's WritableComparable interface

Writable: The key can be serialized and transmitted across a network

Comparable: The key can be compared to other keys, so that keys can be sorted and grouped for the reduce phase

write() readFields() compareTo() hashCode()

toString() equals()
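The comparison, hashing, and equality methods have to agree with each other: keys that compare as equal must also hash the same. The Python class below is a rough analog of that contract, not Hadoop code. `TrigramKey` is a hypothetical name, and the serialization methods (`write()` and `readFields()`) are left out here.

```python
from functools import total_ordering


@total_ordering
class TrigramKey:
    """Python stand-in for a custom key: orderable, hashable, printable."""

    def __init__(self, first, second, third):
        self.words = (first, second, third)

    def __eq__(self, other):  # analog of equals()
        if not isinstance(other, TrigramKey):
            return NotImplemented
        return self.words == other.words

    def __lt__(self, other):  # analog of compareTo(); tuples compare word by word
        if not isinstance(other, TrigramKey):
            return NotImplemented
        return self.words < other.words

    def __hash__(self):  # analog of hashCode(); built from the same fields as __eq__
        return hash(self.words)

    def __str__(self):  # analog of toString()
        return " ".join(self.words)


if __name__ == "__main__":
    keys = [TrigramKey("quick", "brown", "fox"), TrigramKey("the", "quick", "brown")]
    print([str(k) for k in sorted(keys)])
```

Deriving all three methods from the same tuple of fields is one simple way to keep them consistent.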

Page 17: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Trigram.java

public class Trigram implements WritableComparable<Trigram> {
    …

    // Order trigrams word by word: first, then second, then third
    public int compareTo(Trigram other) {
        int compared = first.compareTo(other.first);
        if (compared != 0) {
            return compared;
        }
        compared = second.compareTo(other.second);
        if (compared != 0) {
            return compared;
        }
        return third.compareTo(other.third);
    }

    // Equal trigrams must return equal hash codes
    public int hashCode() {
        return first.hashCode()*163 + second.hashCode() + third.hashCode();
    }

}
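The hash code matters because Hadoop's default HashPartitioner picks a reducer with `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The sketch below reproduces that arithmetic in Python. It assumes, purely for illustration, that each field hashes like a Java `String`; Hadoop's `Text` uses a different byte-based hash, so the actual numbers in a real job would differ.

```python
def to_int32(x):
    # Wrap an arbitrary Python int to a signed 32-bit value, as Java int arithmetic does
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def java_string_hash(s):
    # Mirrors java.lang.String.hashCode for BMP characters: h = 31*h + ch
    h = 0
    for ch in s:
        h = to_int32(31 * h + ord(ch))
    return h


def trigram_hash(first, second, third):
    # Same formula as the hashCode() on the slide
    return to_int32(
        java_string_hash(first) * 163
        + java_string_hash(second)
        + java_string_hash(third)
    )


def partition(hash_code, num_reducers):
    # Same arithmetic as the default HashPartitioner: clear the sign bit, then modulo
    return (hash_code & 0x7FFFFFFF) % num_reducers


if __name__ == "__main__":
    h = trigram_hash("the", "quick", "brown")
    print(h, partition(h, 4))
```

One thing this makes visible: because the second and third hashes are simply added, swapping those two words gives the same hash. That is still correct, since `compareTo()` tells such keys apart, but it does mean those keys always land on the same reducer.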

Page 18: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Map Reduce Program

Page 19: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

TrigramMapper

public static class TrigramMapper extends Mapper<Object, Text, Trigram, IntWritable> {
    …
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

        String line = value.toString().toLowerCase(); // create string and lower case

        line = line.replaceAll("[^a-z\\s]", ""); // remove everything except letters and whitespace

        String[] words = line.trim().split("\\s+"); // split on runs of whitespace to avoid empty tokens

        int len = words.length; // need the length for our loop condition

        // lines with fewer than three words emit nothing: the loop condition fails immediately
        for (int i = 0; i+2 < len; i++) {

            first.set(words[i]); second.set(words[i+1]); third.set(words[i+2]);

            trigram.set(first, second, third);

            context.write(trigram, one);
        }
    }
}
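The mapper's tokenization is sensitive to how whitespace is split. Splitting on a single whitespace character leaves empty tokens wherever stripped punctuation or double spaces produce consecutive whitespace, and those empty tokens would end up inside trigrams. The Python sketch below shows the difference. It is only an approximate analog of the Java calls (for instance, Java's `String.split` drops trailing empty strings while Python's `re.split` keeps them), and the function names are hypothetical.

```python
import re


def sanitize(line):
    # Lowercase, then drop everything except a-z and whitespace
    return re.sub(r"[^a-z\s]", "", line.lower())


def split_single(line):
    # Split on each individual whitespace character
    return re.split(r"\s", sanitize(line))


def split_runs(line):
    # Treat any run of whitespace as one separator; ignore leading/trailing space
    return sanitize(line).split()


if __name__ == "__main__":
    text = "The  quick, brown"
    print(split_single(text))
    print(split_runs(text))
```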

Page 20: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

TrigramReducer

public static class TrigramReducer extends Reducer<Trigram, IntWritable, Trigram, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Trigram key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // add up every count emitted for this trigram
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
        …
    }
}

Page 21: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Driver

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "Trigram Count");
    job.setJarByClass(TrigramCount.class);
    job.setMapperClass(TrigramMapper.class);
    job.setMapOutputKeyClass(Trigram.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setReducerClass(TrigramReducer.class);
    job.setCombinerClass(TrigramReducer.class);
    job.setOutputKeyClass(Trigram.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Page 22: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Applications

Page 23: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Speech Recognition

(Trigram1, 90), (Trigram2, 76), (Trigram3, 8), (Trigram4, 1)
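The slide only shows trigram counts, so how they feed a recognizer is an assumption here. One plausible reading is that counts rank candidate continuations: given two recognized words, the most frequent third word is the best guess. The sketch below illustrates that idea. The words are invented for the example, and the counts simply reuse the numbers from the slide.

```python
def best_next_word(trigram_counts, first, second):
    """Return the most frequent third word following (first, second), or None."""
    candidates = {
        trigram[2]: count
        for trigram, count in trigram_counts.items()
        if trigram[0] == first and trigram[1] == second
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical trigrams paired with the counts shown on the slide
    counts = {
        ("recognize", "speech", "today"): 90,
        ("recognize", "speech", "now"): 76,
        ("recognize", "speech", "later"): 8,
        ("wreck", "a", "nice"): 1,
    }
    print(best_next_word(counts, "recognize", "speech"))
```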

Page 24: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Other Applications

Blog Posts

Stocks

GIS Coordinates

Any object with multiple attributes!

Page 25: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Stock

Attributes

Text timeStamp;

Text ticker;

Float price;
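A type like this needs a serialization round trip: whatever `write()` emits, `readFields()` must read back in the same order. Here is a minimal Python sketch of that idea for the three attributes above. The byte layout (length-prefixed UTF-8 strings, then a 4-byte float) is an assumption made for illustration and is not Hadoop's actual wire format.

```python
import io
import struct
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class Stock:
    # order=True compares field by field; frozen=True makes instances hashable
    time_stamp: str
    ticker: str
    price: float

    def write(self, out):
        # Analog of write(): serialize fields in a fixed order
        for text in (self.time_stamp, self.ticker):
            data = text.encode("utf-8")
            out.write(struct.pack(">I", len(data)))
            out.write(data)
        out.write(struct.pack(">f", self.price))

    @classmethod
    def read_fields(cls, inp):
        # Analog of readFields(): read fields back in the same order
        texts = []
        for _ in range(2):
            (length,) = struct.unpack(">I", inp.read(4))
            texts.append(inp.read(length).decode("utf-8"))
        (price,) = struct.unpack(">f", inp.read(4))
        return cls(texts[0], texts[1], price)


if __name__ == "__main__":
    buffer = io.BytesIO()
    Stock("2015-01-02T09:30:00", "ABC", 12.5).write(buffer)
    buffer.seek(0)
    print(Stock.read_fields(buffer))
```

The ticker and timestamp in the usage lines are placeholder values. A price such as 12.5 survives the round trip exactly, while values that are not representable in 32-bit floating point would come back slightly rounded.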

Page 26: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Conclusion

Custom Data Types Can:

Improve Runtime Performance

Result in Reusable Code

Provide a Consistent Interface

Page 27: Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Thank You!

Scott Crespo