Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Map Reduce
Scott Crespo
Path to Success
Map Reduce Refresher
Optimization Strategies
Custom Type Example
Applications
What’s Hadoop?
A framework that facilitates data flow through a cluster of servers
What’s Map Reduce?
A paradigm for analyzing distributed data sets
Raw Data → (K, [V1..Vn]) → (K, V)
What About Hive And Pig?
Use them whenever possible!
Data States in Map Reduce (Letter Count)
Input: Hello World
Split: Hello | World
Map: H,1 E,1 L,1 L,1 O,1 | W,1 O,1 R,1 L,1 D,1
Partition/Shuffle: H,[1] E,[1] L,[1,1,1] O,[1,1] | W,[1] R,[1] D,[1]
Reduce: H,1 E,1 L,3 O,2 W,1 R,1 D,1
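The four data states can be walked through in plain Python. This is only a single-process sketch of the idea, not Hadoop code; the function names (split, map_letters, shuffle, reduce_counts) are made up for illustration.

```python
from collections import defaultdict

def split(text):
    # Split stage: break the input into independent records (here, words).
    return text.split()

def map_letters(word):
    # Map stage: emit one (letter, 1) pair per character.
    return [(ch.upper(), 1) for ch in word]

def shuffle(pairs):
    # Partition/shuffle stage: group all values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_counts(grouped):
    # Reduce stage: collapse each key's list of values into one total.
    return {key: sum(values) for key, values in grouped.items()}

def letter_count(text):
    pairs = [pair for word in split(text) for pair in map_letters(word)]
    return reduce_counts(shuffle(pairs))

if __name__ == "__main__":
    print(letter_count("Hello World"))
```

On a real cluster each stage runs on different machines and the shuffle moves data over the network; the sketch only shows how the shape of the data changes.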
Basic Map Reduce Program Structure
MyMapReduceProgram {
    MyMapperClass extends Mapper {
        map() {
            // map code
        }
    }
    MyReducerClass extends Reducer {
        reduce() {
            // reduce code
        }
    }
    main() {
        // driver code
    }
}
Advanced Optimizations
Drivers
Custom Types
Setup Methods
Partitioning
Combiners
Chaining
Fault Tolerance
Generating N-Grams
N-Gram: A sequence of n consecutive elements taken from a larger sequence.
Trigram example: “The quick brown fox jumps over the lazy dog”
(the quick brown), (quick brown fox), (brown fox jumps),
(fox jumps over), (jumps over the), (over the lazy), (the lazy dog)
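A sliding window of size n is one simple way to produce n-grams. The helper below is a sketch; the name ngrams and the choice to lower-case and split on whitespace are assumptions made for the example.

```python
def ngrams(text, n):
    # Lower-case and split on whitespace, then slide a window of n words.
    words = text.lower().split()
    # When there are fewer than n words the range is empty and no n-grams are produced.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

if __name__ == "__main__":
    for gram in ngrams("The quick brown fox jumps over the lazy dog", 3):
        print(gram)
```

Setting n to 2 gives bigrams with the same code, which is why the solution design talks about NGrams in general rather than trigrams only.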
Solution Design
NGramCounter {
    NGramMapper {
        map() {
            // Tokenize and sanitize inputs
            // Create NGram
            // Output (NGram ngram, Int count)
        }
    }
    NGramCombiner {
        combine() {
            // Sum local NGram counts that share the same key
            // Output (NGram ngram, Int count)
        }
    }
    NGramReducer {
        reduce() {
            // Sum NGram counts that share the same key
            // Output (NGram ngram, Int count)
        }
    }
}
Custom Type!
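To see what the combiner buys, the design can be simulated in Python with two pretend mappers. This is a sketch under assumptions: the input lines and the function names are invented, and plain tuples stand in for the custom NGram type.

```python
from collections import Counter

def map_trigrams(line):
    # Mapper: emit (trigram, 1) for every window of three consecutive words.
    words = line.lower().split()
    return [((words[i], words[i + 1], words[i + 2]), 1)
            for i in range(len(words) - 2)]

def combine(pairs):
    # Combiner: sum counts for identical keys within ONE mapper's output.
    local = Counter()
    for key, count in pairs:
        local[key] += count
    return list(local.items())

def reduce_all(pair_lists):
    # Reducer: sum counts for identical keys across ALL mappers.
    total = Counter()
    for pairs in pair_lists:
        for key, count in pairs:
            total[key] += count
    return dict(total)

# Hypothetical input: each inner list is the data seen by one mapper.
splits = [
    ["the quick brown fox", "the quick brown dog"],
    ["the quick brown cat"],
]

mapped = [[p for line in s for p in map_trigrams(line)] for s in splits]
combined = [combine(pairs) for pairs in mapped]

if __name__ == "__main__":
    print("records shuffled without combiner:", sum(len(p) for p in mapped))
    print("records shuffled with combiner:", sum(len(p) for p in combined))
    print(reduce_all(combined))
```

The final counts are the same with or without the combiner because addition is associative and commutative; only the number of records sent to the reducers changes.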
Work Flow
Prototype (Python)
Custom Type (Trigram)
Unit Tests
Mapper
Reducer
Prototype: Quick and Dirty Python
Prototype
import sys

def test_mapper():
    lines = ["the quick brown fox jumped over the lazy dog", "the quick brown"]
    for line in lines:
        words = line.split()
        length = len(words)
        sys.stdout.write("\nLength of %d \n-------------------\n" % length)
        i = 0
        # Slide a three-word window across the line
        while (i+2 < length):
            first = words[i]
            second = words[i+1]
            third = words[i+2]
            trigram = "%s %s %s \n" % (first, second, third)
            sys.stdout.write(trigram)
            i += 1
Output
Length of 9
-------------------
the quick brown
quick brown fox
brown fox jumped
fox jumped over
jumped over the
over the lazy
the lazy dog

Length of 3
-------------------
the quick brown
Custom Data Types
Custom Key Types
Must implement Hadoop's WritableComparable interface
Writable: The key can be serialized and transmitted across a network
Comparable: The key can be compared to other keys & combined/sorted for the reduce phase
write() readFields() compareTo() hashCode()
toString() equals()
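The same contract can be mimicked in Python to show what each method is responsible for. This is an analogy only, not Hadoop's API: the method names write and read_fields mirror the Java ones, and the byte layout (2-byte length prefix plus UTF-8 bytes) is an assumption made for the sketch, not Hadoop's wire format.

```python
import io
import struct
from functools import total_ordering

@total_ordering
class Trigram:
    def __init__(self, first="", second="", third=""):
        self.first, self.second, self.third = first, second, third

    def _fields(self):
        return (self.first, self.second, self.third)

    def write(self, out):
        # "Writable": serialize each word as a length prefix followed by its bytes.
        for word in self._fields():
            data = word.encode("utf-8")
            out.write(struct.pack(">H", len(data)))
            out.write(data)

    def read_fields(self, inp):
        # Rebuild the object's state from the same byte layout.
        words = []
        for _ in range(3):
            (size,) = struct.unpack(">H", inp.read(2))
            words.append(inp.read(size).decode("utf-8"))
        self.first, self.second, self.third = words

    def __eq__(self, other):
        # equals(): two keys are the same when all three words match.
        return isinstance(other, Trigram) and self._fields() == other._fields()

    def __lt__(self, other):
        # "Comparable": order by first word, then second, then third.
        if not isinstance(other, Trigram):
            return NotImplemented
        return self._fields() < other._fields()

    def __hash__(self):
        # hashCode(): equal keys must produce equal hashes.
        return hash(self._fields())

    def __str__(self):
        # toString(): the form that ends up in text output.
        return " ".join(self._fields())

if __name__ == "__main__":
    original = Trigram("the", "quick", "brown")
    buffer = io.BytesIO()
    original.write(buffer)
    buffer.seek(0)
    copy = Trigram()
    copy.read_fields(buffer)
    print(copy, copy == original)
```

A round trip through write and read_fields should give back an equal key with an equal hash; that property makes a good first unit test for any custom type.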
Trigram.java
public class Trigram implements WritableComparable<Trigram> {
    …
    public int compareTo(Trigram other) {
        int compared = first.compareTo(other.first);
        if (compared != 0) {
            return compared;
        }
        compared = second.compareTo(other.second);
        if (compared != 0) {
            return compared;
        }
        return third.compareTo(other.third);
    }

    public int hashCode() {
        return first.hashCode()*163 + second.hashCode() + third.hashCode();
    }
}
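The hash code matters because it typically decides which reducer receives a key. The Python sketch below mimics that step under assumptions: word_hash is a hypothetical stand-in for whatever hashCode() the field type provides, to_int32 imitates Java's 32-bit int overflow, and the mask-then-modulo step follows what Hadoop's default hash partitioner is understood to do.

```python
def to_int32(x):
    # Wrap an integer to a signed 32-bit value, as Java int arithmetic does.
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

def word_hash(word):
    # Hypothetical stand-in for the field's hashCode(): a base-31 polynomial hash.
    h = 0
    for ch in word:
        h = to_int32(31 * h + ord(ch))
    return h

def trigram_hash(first, second, third):
    # Same combination as the slide: first*163 + second + third.
    return to_int32(word_hash(first) * 163 + word_hash(second) + word_hash(third))

def partition(first, second, third, num_reducers):
    # Clear the sign bit, then take the remainder, so the result is in [0, num_reducers).
    return (trigram_hash(first, second, third) & 0x7FFFFFFF) % num_reducers

if __name__ == "__main__":
    print(partition("the", "quick", "brown", 4))
    # Swapping the second and third words gives the same hash with this formula.
    print(trigram_hash("the", "lazy", "dog") == trigram_hash("the", "dog", "lazy"))
```

Because the second and third words are added without a multiplier, trigrams that differ only in the order of those two words collide. Collisions do not affect correctness, only how evenly keys spread across reducers; giving each field its own multiplier is one common way to reduce them.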
Map Reduce Program
TrigramMapper
public static class TrigramMapper extends Mapper<Object, Text, Trigram, IntWritable> {
    …
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().toLowerCase();   // create string and lower-case it
        line = line.replaceAll("[^a-z\\s]", "");        // remove everything except letters and whitespace
        String[] words = line.trim().split("\\s+");     // split on runs of whitespace to avoid empty tokens
        int len = words.length;                         // need the length for our loop condition
        // Lines with fewer than three words never enter the loop, so short lines are skipped
        for (int i = 0; i + 2 < len; i++) {
            first.set(words[i]);
            second.set(words[i + 1]);
            third.set(words[i + 2]);
            trigram.set(first, second, third);
            context.write(trigram, one);
        }
    }
}
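The mapper's sanitizing steps can be tried in isolation before running a job. Below is a Python rendering of the same idea (lower-case, strip everything except a-z and whitespace, split into words); the function name sanitize and the sample sentences are assumptions for the example.

```python
import re

def sanitize(line):
    # Lower-case first so the a-z character class covers every letter we keep.
    line = line.lower()
    # Drop anything that is not a lowercase letter or whitespace.
    line = re.sub(r"[^a-z\s]", "", line)
    # split() with no arguments splits on runs of whitespace and skips empty tokens.
    return line.split()

if __name__ == "__main__":
    print(sanitize("The quick,  brown fox!"))
    print(sanitize("Don't stop 42 times"))
    # Splitting on a single whitespace character leaves an empty token behind:
    print(re.split(r"\s", "the quick  brown"))
```

Two things worth noticing: digits and apostrophes vanish entirely (so "Don't" becomes "dont"), and splitting on a single whitespace character yields empty strings wherever the input has consecutive spaces, which would otherwise end up inside trigrams.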
TrigramReducer
public static class TrigramReducer extends Reducer<Trigram, IntWritable, Trigram, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Trigram key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Add up every count emitted for this trigram
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
        …
Driver
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "Trigram Count");
    job.setJarByClass(TrigramCount.class);
    job.setMapperClass(TrigramMapper.class);
    job.setMapOutputKeyClass(Trigram.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setReducerClass(TrigramReducer.class);
    job.setCombinerClass(TrigramReducer.class);
    job.setOutputKeyClass(Trigram.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Applications
Speech Recognition
(Trigram1, 90) (Trigram2, 76) (Trigram3, 8) (Trigram4, 1)
Other Applications
Blog Posts
Stocks
GIS Coordinates
Any object with multiple attributes!
Stock
Attributes
Text timeStamp;
Text ticker;
Float price;
Conclusion
Custom Data Types Can:
Improve Runtime Performance
Result in Reusable Code
Provide a Consistent Interface
Thank You!
Scott Crespo