Processing Big Data (Chapter 3, SC 11 Tutorial)
Page 1
An Introduction to Data Intensive Computing
Chapter 3: Processing Big Data
Robert Grossman, University of Chicago & Open Data Group
Collin Bennett, Open Data Group
November 14, 2011
Page 2
1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed file systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple virtual machines & message queues
   b. MapReduce
   c. Streams over distributed file systems
4. Lab using Amazon's Elastic MapReduce (1100-1200)
Page 3
Section 3.1: Processing Big Data Using Utility and Data Clouds
A Google production rack of servers from about 1999.
Page 4
• How do you do analytics over commodity disks and processors?
• How do you improve the efficiency of programmers?
Page 5
Serial & SMP Algorithms
Diagram: a serial algorithm runs a single task against local disk and memory; a symmetric multiprocessing (SMP) algorithm runs several tasks sharing local disk and memory.
Page 6
Pleasantly (= Embarrassingly) Parallel
• Need to partition data, start tasks, and collect results.
• Often the tasks are organized into a DAG.
Diagram: several nodes, each running tasks against its own local disk, coordinated with MPI.
Page 7
How Do You Program a Data Center?
Page 8
The Google Data Stack
• The Google File System (2003)
• MapReduce: Simplified Data Processing… (2004)
• BigTable: A Distributed Storage System… (2006)
Page 9
Google's Large Data Cloud (Google's early data stack, circa 2000)
• Applications
• Data Services: Google's BigTable
• Compute Services: Google's MapReduce
• Storage Services: Google File System (GFS)
Page 10
Hadoop's Large Data Cloud (Open Source)
Hadoop's stack:
• Applications
• Data Services: NoSQL, e.g. HBase
• Compute Services: Hadoop's MapReduce
• Storage Services: Hadoop Distributed File System (HDFS)
Page 11
A very nice recent book by Barroso and Hölzle.
Page 12
The Amazon Data Stack
"Amazon uses a highly decentralized, loosely coupled, service-oriented architecture consisting of hundreds of services. In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados." (SOSP '07)
Page 13
Amazon-Style Data Cloud
Diagram: a load balancer in front of pools of EC2 instances, backed by S3 storage services, the Simple Queue Service, and SimpleDB (SDB).
Page 14
Open Source Versions
• Eucalyptus – ability to launch VMs; S3-like storage
• OpenStack – ability to launch VMs; S3-like storage (Swift)
• Cassandra – key-value store like S3; columns like BigTable
• Many other open source Amazon-style services are available.
Page 15
Some Programming Models for Data Centers
• Operations over a data center of disks
  – MapReduce ("string-based" scans of data)
  – User-Defined Functions (UDFs) over the data center
  – Launch VMs that all have access to highly scalable and available disk-based data
  – SQL and NoSQL over the data center
• Operations over a data center of memory
  – Grep over distributed memory
  – UDFs over distributed memory
  – Launch VMs that all have access to highly scalable and available memory-based data
  – SQL and NoSQL over distributed memory
Page 16
Section 3.2: Processing Data by Scaling Out Virtual Machines
Page 17
Processing Big Data Pattern 1: Launch Independent Virtual Machines and Task Them with a Messaging Service
Page 18
Task with a Messaging Service, Using S3 (Variant 1)
Diagram: a control VM launches and tasks worker VMs through a messaging service (e.g. AWS SQS or an AMQP service); each worker VM runs a task that reads and writes S3.
Page 19
Task with a Messaging Service, Using a NoSQL DB (Variant 2)
Diagram: same as Variant 1, but each worker VM's task reads and writes AWS SimpleDB instead of S3.
Page 20
Task with a Messaging Service, Using a Clustered FS (Variant 3)
Diagram: same as Variant 1, but each worker VM's task reads and writes a clustered file system such as GlusterFS.
Page 21
Section 3.3: MapReduce
Google 2004 technical report.
Page 22
Core Concepts
• Data are (key, value) pairs, and that's it.
• Partition data over commodity nodes filling racks in a data center.
• Software handles failures, restarts, etc. This is the hard part.
• Basic examples:
  – Word count
  – Inverted index
Page 23
Processing Big Data Pattern 2: MapReduce
Page 24
Diagram: map tasks run under task trackers on the nodes holding the HDFS input blocks and write intermediate output to local disk; after the shuffle & sort phase, reduce tasks read the intermediate data from those local disks and write their results back to HDFS.
Page 25
Example: Word Count & Inverted Index
• How do you count the words in a million books? e.g. (best, 7)
• Inverted index: (best; page 1, page 82, …), (worst; page 1, page 12, …)
Cover of the serial, Vol. V, 1859, London.
Page 26
• Assume you have a cluster of 50 computers, each with an attached local disk that is half full of web pages.
• What is a simple parallel programming framework that would support the computation of word counts and inverted indices?
Page 27
Basic Pattern: Strings
1. Extract words from web pages in parallel.
2. Hash and sort words.
3. Count (or construct an inverted index) in parallel.
Page 28
What about data records?
For strings:
1. Extract words from web pages in parallel.
2. Hash and sort words.
3. Count (or construct an inverted index) in parallel.
For data records:
1. Extract binned field values from data records in parallel.
2. Hash and sort binned field values.
3. Count (or construct an inverted index) in parallel.
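The three-step pattern can be sketched in miniature. This is a toy sketch, not the tutorial's code: the pages, the bin count, and all names are made up for illustration, and the "parallel" steps run sequentially here.

```python
from collections import defaultdict

# Hypothetical corpus; in practice each partition would hold a shard
# of web pages on its own local disk.
pages = {
    "page1": "it was the best of times it was the worst of times",
    "page2": "the best pages cite the best pages",
}

NUM_BINS = 4  # stand-in for the number of parallel counting tasks

# Step 1: extract words from each page (would run in parallel per page).
extracted = [(w, page) for page, text in pages.items() for w in text.split()]

# Step 2: hash each word into a bin, so all copies of a word land together.
bins = defaultdict(list)
for word, page in extracted:
    bins[hash(word) % NUM_BINS].append((word, page))

# Step 3: within each bin, count words and build an inverted index
# (each bin would be processed by its own task in parallel).
counts = defaultdict(int)
index = defaultdict(set)
for pairs in bins.values():
    for word, page in pairs:
        counts[word] += 1
        index[word].add(page)

print(counts["best"])         # 3
print(sorted(index["best"]))  # ['page1', 'page2']
```

The hashing in step 2 is what makes step 3 embarrassingly parallel: every occurrence of a given word lands in the same bin, so each bin can be counted independently.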
Page 29
MapReduce Example
• Input is files with one document per record.
• User specifies a map function: key = document URL, value = document contents.
Input of map: ("doc cdickens two cities", "it was the best of times")
Output of map: ("it", 1), ("was", 1), ("the", 1), ("best", 1), …
Page 30
Example (cont'd)
• The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase).
• The user-defined reduce function combines all the values associated with the same key.
Input of reduce: key = "it", values = 1, 1; key = "was", values = 1, 1; key = "best", values = 1; key = "worst", values = 1
Output of reduce: ("it", 2), ("was", 2), ("best", 1), ("worst", 1)
Page 31
Why Is Word Count Important?
• It is one of the most important examples of the type of text processing often done with MapReduce.
• There is an important mapping (inversion):
  document <-----> data record
  words <-----> (field, value)
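To make the mapping concrete, here is a hedged sketch of word-count machinery applied to data records: each record's (field, binned value) pairs play the role of words. The records, fields, and binning rule are hypothetical.

```python
from collections import defaultdict

# Hypothetical data records; fields play the role of words.
records = [
    {"age": 34, "state": "IL"},
    {"age": 37, "state": "IL"},
    {"age": 52, "state": "CA"},
]

def bin_age(a):
    # Hypothetical binning rule: decade buckets, e.g. 34 -> "30s".
    return f"{10 * (a // 10)}s"

# "Map": emit one ((field, value), 1) pair per field, exactly like
# emitting (word, 1) per word in word count.
pairs = []
for r in records:
    pairs.append((("age", bin_age(r["age"])), 1))
    pairs.append((("state", r["state"]), 1))

# "Reduce": sum the counts per key.
counts = defaultdict(int)
for key, n in pairs:
    counts[key] += n

print(counts[("age", "30s")])   # 2
print(counts[("state", "IL")])  # 2
```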
Page 32

|  | Pleasantly Parallel | MapReduce |
| --- | --- | --- |
| Data structure | Arbitrary | (key, value) pairs |
| Functions | Arbitrary | Map & Reduce |
| Middleware | MPI (message passing) | Hadoop |
| Ease of use | Difficult | Medium |
| Scope | Wide | Narrow |
| Challenge | Getting something working | Moving to MapReduce |
Page 33
Common MapReduce Design Patterns
• Word count
• Inversion – inverted index
• Computing simple statistics
• Computing windowed statistics
• Sparse matrices (document-term, data record-FieldBinValue, …)
• Site-entity statistics
• PageRank
• Partitioned and ensemble models
• EM
Page 34
Section 3.4: User Defined Functions over DFS
sector.sf.net
Page 35
Processing Big Data Pattern 3: User Defined Functions over Distributed File Systems
Page 36
Sector/Sphere
• Sector/Sphere is a platform for data intensive computing.
Page 37
Idea 1: Apply User Defined Functions (UDFs) to Files in a Distributed File System
Diagram: UDFs play the roles of map/shuffle and reduce.
This generalizes Hadoop's implementation of MapReduce over the Hadoop Distributed File System.
Page 38
Idea 2: Add Security From the Start
• A security server maintains information about users and slaves.
• User access control: password and client IP address.
• File-level access control.
• Messages are encrypted over SSL. Certificates are used for authentication.
• Sector is a good basis for HIPAA-compliant applications.
Diagram: a client talks to the master and the security server over SSL; the security server provides AAA; data flows between the client and the slaves.
Page 39
Idea 3: Extend the Stack to Include Network Transport Services
Diagram: Google and Hadoop stack Data Services, Compute Services, and Storage Services; Sector adds a Routing & Transport Services layer beneath its Storage Services.
Page 40
Section 3.5: Computing With Streams: Warming Up With Means and Variances
Page 41
Warm Up: Partitioned Means
• Means and variances cannot be computed naively when the data is in distributed partitions.
Step 1. Compute local (Σ xᵢ, Σ xᵢ², nᵢ) in parallel for each partition.
Step 2. Compute the global mean and variance from these tuples.
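A minimal sketch of the two steps, assuming the partitions are plain in-memory lists (in practice each tuple in step 1 would be computed on its own node):

```python
def local_stats(xs):
    # Step 1: each partition reports (sum, sum of squares, count).
    return (sum(xs), sum(x * x for x in xs), len(xs))

def combine(stats):
    # Step 2: merge the per-partition tuples into a global mean/variance.
    s = sum(t[0] for t in stats)
    s2 = sum(t[1] for t in stats)
    n = sum(t[2] for t in stats)
    mean = s / n
    var = s2 / n - mean * mean  # population variance
    return mean, var

partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, var = combine([local_stats(p) for p in partitions])
print(mean)  # 3.5
print(var)   # ~2.917
```

Note that each partition ships only a triple of numbers, not its data, which is the whole point of the pattern.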
Page 42
Trivial Observation 1
• If sᵢ = Σ x over partition i is the i-th local sum and nᵢ the local count, then the global mean = Σ sᵢ / Σ nᵢ.
• If only the local means for each partition are passed (without the corresponding counts), there is not enough information to compute the global mean.
• The same trick works for variance, but the triples (Σ xᵢ, Σ xᵢ², nᵢ) must be passed.
Page 43
Trivial Observation 2
• To reduce the data passed over the network, combine appropriate statistics as early as possible.
• Consider the average. Recall that with MapReduce there are four steps (Map, Shuffle, Sort, and Reduce), and Reduce pulls data from the local disk of the node that performed the Map.
• A combine step in MapReduce combines local data before it is pulled for the reduce step.
• There are built-in combiners for counts, means, etc.
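A toy combiner for the average might look like the following (plain Python standing in for Hadoop's combiner hook; the keys and values are illustrative):

```python
from collections import defaultdict

# Map output on one node: (key, value) pairs before the shuffle.
map_output = [("temp", 10.0), ("temp", 20.0), ("temp", 30.0)]

def combiner(pairs):
    # Combine locally: emit a partial (sum, count) per key so the reducer
    # can still form an exact global mean. Emitting a local mean alone
    # would lose the counts and give a wrong answer.
    acc = defaultdict(lambda: [0.0, 0])
    for k, v in pairs:
        acc[k][0] += v
        acc[k][1] += 1
    return [(k, tuple(sc)) for k, sc in acc.items()]

def reducer(key, partials):
    s = sum(p[0] for p in partials)
    n = sum(p[1] for p in partials)
    return s / n

# Two nodes' combined outputs arrive at the reducer:
node1 = combiner(map_output)        # [("temp", (60.0, 3))]
node2 = combiner([("temp", 40.0)])  # [("temp", (40.0, 1))]
partials = [v for k, v in node1 + node2 if k == "temp"]
print(reducer("temp", partials))    # 25.0
```

Each node ships one small tuple per key instead of every raw value, which is exactly the network saving the slide describes.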
Page 44
Section 3.6: Hadoop Streams
Page 45
Processing Big Data Pattern 4: Streams over Distributed File Systems
Page 46
Hadoop Streams
• In addition to the Java API, Hadoop offers:
  – a streaming interface for any language that supports reading from and writing to standard in and standard out;
  – Pipes for C++.
• Why use something besides Java? Because Hadoop Streams provide direct access (without JNI/NIO) to:
  – C++ libraries such as Boost and the GNU Scientific Library (GSL);
  – R modules.
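A typical invocation of the streaming interface looks like the following sketch; the streaming jar's location varies by Hadoop version, and the HDFS paths and script names here are placeholders:

```shell
# Run a Python mapper/reducer pair as a streaming job.
# Paths are illustrative; adjust for your Hadoop installation.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input  /user/demo/books \
    -output /user/demo/wordcount-out \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py
```

The `-file` options ship the scripts to the task nodes, where they read map input on stdin and write (key, tab, value) lines on stdout.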
Page 47
Pros and Cons
• Java
  + Best documented
  + Largest community
  – More LOC per MR job
• Python
  + Efficient memory handling
  + Programmers can be very efficient
  – Limited logging / debugging
• R
  + Vast collection of statistical algorithms
  – Poor error handling and memory handling
  – Less familiar to developers
Page 48
Word Count Python Mapper

```python
import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print '%s%s%d' % (word, separator, 1)
```
Page 49
Word Count Python Reducer

```python
import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)
```
Page 50
MalStone Benchmark

|  | MalStone A | MalStone B |
| --- | --- | --- |
| Hadoop MapReduce | 455m 13s | 840m 50s |
| Hadoop Streams (Python) | 87m 29s | 142m 32s |
| C++ implemented UDFs | 33m 40s | 43m 44s |

Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. The data consisted of 20 nodes with 500 million 100-byte records per node.
Page 51
Word Count R Mapper

```r
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# split a line into words on whitespace
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)
```
Page 52
Word Count R Reducer

```r
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    # (continued on the next page)
```
Page 53
Word Count R Reducer (cont'd)

```r
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")
```
Page 54
Word Count Java Mapper

```java
public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
```
Page 55
Word Count Java Reducer

```java
public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```
Page 56
Code Comparison – Word Count Mapper

Python:
```python
import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print '%s%s%d' % (word, separator, 1)
```

R:
```r
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)
```

Java:
```java
public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
```
Page 57
Code Comparison – Word Count Reducer

Python:
```python
import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)
```

R:
```r
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")
```

Java:
```java
public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```
Page 58
Questions?
For the most current version of these notes, see rgrossman.com