An Introduction to Data Intensive Computing
Chapter 3: Processing Big Data
Robert Grossman
University of Chicago & Open Data Group

Collin Bennett
Open Data Group

November 14, 2011
1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed File Systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple Virtual Machines & Message Queues
   b. MapReduce
   c. Streams over distributed file systems
4. Lab using Amazon's Elastic MapReduce (1100-1200)
Section 3.1 Processing Big Data Using Utility and Data Clouds
[Photo: a Google production rack of servers from about 1999.]

• How do you do analytics over commodity disks and processors?
• How do you improve the efficiency of programmers?
Serial & SMP Algorithms
[Diagram: a serial algorithm runs one task against its local disk and memory; a symmetric multiprocessing (SMP) algorithm runs several tasks that share local disk and memory.]
Pleasantly (= Embarrassingly) Parallel
• Need to partition data, start tasks, collect results.
• Often the tasks are organized into a DAG.
[Diagram: several nodes, each with a local disk and a group of tasks; the tasks are coordinated with MPI.]
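A minimal sketch of the pleasantly parallel pattern in Python, using only the standard library; the partitions and the task body are placeholder examples.

    from multiprocessing import Pool

    def task(partition):
        # Each task processes its own partition independently
        # (no communication between tasks).
        return sum(partition)

    if __name__ == "__main__":
        # Partition the data, start the tasks, collect the results.
        partitions = [range(0, 100), range(100, 200), range(200, 300)]
        pool = Pool()
        results = pool.map(task, partitions)
        pool.close()
        pool.join()
        print(sum(results))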
How Do You Program A Data Center?
The Google Data Stack
• The Google File System (2003)
• MapReduce: Simplified Data Processing… (2004)
• BigTable: A Distributed Storage System… (2006)
Google’s Large Data Cloud
Google's Early Data Stack circa 2000:

  Applications
  Data Services: Google's BigTable
  Compute Services: Google's MapReduce
  Storage Services: Google File System (GFS)
Hadoop’s Large Data Cloud (Open Source)
Hadoop's Stack:

  Applications
  Data Services: NoSQL, e.g. HBase
  Compute Services: Hadoop's MapReduce
  Storage Services: Hadoop Distributed File System (HDFS)
[Image: cover of a very nice recent book by Barroso and Hölzle.]
The Amazon Data Stack
Amazon uses a highly decentralized, loosely coupled, service-oriented architecture consisting of hundreds of services. In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados.
— from Amazon's Dynamo paper, SOSP'07
Amazon Style Data Cloud
[Diagram: a load balancer routes requests to pools of EC2 instances, which use the S3 storage services, the Simple Queue Service (SQS), and SimpleDB (SDB).]
Open Source Versions
• Eucalyptus
  – Ability to launch VMs
  – S3-like storage
• OpenStack
  – Ability to launch VMs
  – S3-like storage (Swift)
• Cassandra
  – Key-value store like S3
  – Columns like BigTable
• Many other open source Amazon-style services available.
Some Programming Models for Data Centers
• Operations over a data center of disks
  – MapReduce ("string-based" scans of data)
  – User-Defined Functions (UDFs) over the data center
  – Launch VMs that all have access to highly scalable and available disk-based data
  – SQL and NoSQL over the data center
• Operations over a data center of memory
  – Grep over distributed memory
  – UDFs over distributed memory
  – Launch VMs that all have access to highly scalable and available memory-based data
  – SQL and NoSQL over distributed memory
Section 3.2 Processing Data by Scaling Out Virtual Machines
Processing Big Data Pattern 1: Launch Independent Virtual Machines and Task Them with a Messaging Service
Task With Messaging Service & Use S3 (Variant 1)
[Diagram: a control VM launches worker VMs and tasks them through a messaging service (e.g. AWS SQS or an AMQP service); the worker VMs read and write shared data in S3.]
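A minimal sketch of a worker VM in this pattern, written against the modern boto3 SDK rather than the 2011-era boto; the queue name, bucket name, and process_object function are illustrative placeholders.

    import boto3

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")

    # Hypothetical queue created by the control VM.
    queue_url = sqs.get_queue_url(QueueName="work-queue")["QueueUrl"]

    def process_object(body):
        # Placeholder for the real task: here, just count bytes.
        return len(body)

    while True:
        # Pull one task message; each message names an S3 object to process.
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=10)
        for msg in resp.get("Messages", []):
            key = msg["Body"]
            obj = s3.get_object(Bucket="work-bucket", Key=key)  # hypothetical bucket
            result = process_object(obj["Body"].read())
            s3.put_object(Bucket="work-bucket", Key=key + ".out",
                          Body=str(result).encode())
            # Acknowledge the task so it is not redelivered.
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])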
Task With Messaging Service & Use NoSQL DB (Variant 2)
[Diagram: as in Variant 1, but the worker VMs read and write shared state in AWS SimpleDB instead of S3.]
Task With Messaging Service & Use Clustered FS (Variant 3)
[Diagram: as in Variant 1, but the worker VMs share data through a clustered file system such as GlusterFS.]
Section 3.3 MapReduce
Google 2004 Technical Report
Core Concepts
• Data are (key, value) pairs, and that's it.
• Partition the data over commodity nodes filling racks in a data center.
• Software handles failures, restarts, etc. This is the hard part.
• Basic examples:
  – Word count
  – Inverted index
Processing Big Data Pattern 2: MapReduce
[Diagram: map tasks, managed by task trackers, read input splits from HDFS and write intermediate (key, value) pairs to local disk; a shuffle & sort phase moves the pairs to reduce tasks, which write their output back to HDFS.]
Example: Word Count & Inverted Index
• How do you count the words in a million books?
  – (best, 7)
• Inverted index:
  – (best; page 1, page 82, …)
  – (worst; page 1, page 12, …)

[Image: cover of the serial, Vol. V, 1859, London.]

• Assume you have a cluster of 50 computers, each with an attached local disk that is half full of web pages.
• What is a simple parallel programming framework that would support the computation of word counts and inverted indices?
Basic Pattern: Strings

1. Extract words from web pages in parallel.
2. Hash and sort the words.
3. Count (or construct an inverted index) in parallel.

What about data records? The same pattern applies:

1. Extract binned field values from data records in parallel.
2. Hash and sort the binned field values.
3. Count (or construct an inverted index) in parallel.
MapReduce Example

• Input is files with one document per record.
• User specifies the map function:
  – key = document URL
  – value = document contents

Input of map:
  "doc cdickens two cities", "it was the best of times"

Output of map:
  "it", 1   "was", 1   "the", 1   "best", 1
Example (cont'd)

• The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase).
• The user-defined reduce function combines all the values associated with the same key.

Input of reduce:
  key = "it",    values = 1, 1
  key = "was",   values = 1, 1
  key = "best",  values = 1
  key = "worst", values = 1

Output of reduce:
  "it", 2   "was", 2   "best", 1   "worst", 1
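The same pipeline produces an inverted index if the mapper emits (word, page) pairs instead of (word, 1) and the reducer collects the pages for each word. A minimal Hadoop-Streaming-style sketch in Python, assuming each input line is a page id, a tab, and the page text (the script names and input format are illustrative):

    # inverted_index_mapper.py -- emit (word, page) for every word on the page
    import sys

    for line in sys.stdin:
        page, _, text = line.rstrip("\n").partition("\t")
        for word in text.split():
            print("%s\t%s" % (word, page))

    # inverted_index_reducer.py -- collect the sorted page list for each word
    import sys
    from itertools import groupby
    from operator import itemgetter

    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, itemgetter(0)):
        pages = sorted(set(page for _, page in group))
        print("%s\t%s" % (word, ", ".join(pages)))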
Why Is Word Count Important?
• It is one of the most important examples of the type of text processing often done with MapReduce.
• There is an important mapping:

    document  <––––>  data record
    words     <––––>  (field, value)

  This mapping is sometimes called inversion.
                    Pleasantly Parallel          MapReduce
  Data structure    Arbitrary                    (key, value) pairs
  Functions         Arbitrary                    Map & Reduce
  Middleware        MPI (message passing)        Hadoop
  Ease of use       Difficult                    Medium
  Scope             Wide                         Narrow
  Challenge         Getting something working    Moving to MapReduce
Common MapReduce Design Patterns

• Word count
• Inversion – inverted index
• Computing simple statistics
• Computing windowed statistics
• Sparse matrices (document-term, data record-FieldBinValue, …)
• Site-entity statistics
• PageRank
• Partitioned and ensemble models
• EM
Section 3.4 User Defined Functions over DFS
sector.sf.net
Processing Big Data Pattern 3: User Defined Functions over Distributed File Systems
Sector/Sphere
• Sector/Sphere is a platform for data intensive computing.
Idea 1: Apply User Defined Functions (UDFs) to Files in a Distributed File System
[Diagram: Hadoop's map/shuffle and reduce phases drawn as two successive UDFs applied to files.]
This generalizes Hadoop's implementation of MapReduce over the Hadoop Distributed File System.
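A toy illustration in ordinary Python (not the Sector/Sphere API): word count written as two UDFs over file segments, where UDF 1 plays the role of map/shuffle and UDF 2 the role of reduce.

    from itertools import groupby
    from operator import itemgetter

    def udf_map(segment):
        # UDF 1 (map/shuffle): emit (word, 1) pairs from one file segment.
        return [(word, 1) for word in segment.split()]

    def udf_reduce(word, counts):
        # UDF 2 (reduce): combine the values for one key.
        return (word, sum(counts))

    segments = ["it was the best of times", "it was the worst of times"]
    pairs = sorted(pair for seg in segments for pair in udf_map(seg))
    result = [udf_reduce(w, [c for _, c in grp])
              for w, grp in groupby(pairs, itemgetter(0))]
    print(result)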
Idea 2: Add Security From the Start
• A security server maintains information about users and slaves.
• User access control: password and client IP address.
• File-level access control.
• Messages are encrypted over SSL. A certificate is used for authentication.
• Sector is a good basis for HIPAA-compliant applications.
[Diagram: the client connects to the master over SSL, and the master consults the security server (AAA) over SSL; data flows between the client and the slaves.]
Idea 3: Extend the Stack to Include Network Transport Services
[Diagram: the Google and Hadoop stacks layer Data Services, Compute Services, and Storage Services; Sector adds a Routing & Transport Services layer beneath Storage Services.]
Section 3.5 Computing with Streams: Warming Up with Means and Variances
Warm Up: Partitioned Means
• Means and variances cannot be computed naively when the data is in distributed partitions.

Step 1. Compute local (Σ xi, Σ xi², ni) in parallel for each partition.
Step 2. Compute the global mean and variance from these tuples.
Trivial Observation 1
• If si = Σ xi is the local sum and ni the count for the i-th partition, then the global mean = Σ si / Σ ni.
• If only the local means for each partition are passed (without the corresponding counts), there is not enough information to compute the global mean.
• The same trick works for variance, but you need to pass the triples (Σ xi, Σ xi², ni).
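A minimal sketch in Python of Step 2, combining per-partition tuples (Σ x, Σ x², n) into a global mean and (population) variance:

    def combine(stats):
        # stats: list of per-partition tuples (sum_x, sum_x2, n)
        S1 = sum(s for s, _, _ in stats)
        S2 = sum(q for _, q, _ in stats)
        N = sum(n for _, _, n in stats)
        mean = S1 / float(N)
        variance = S2 / float(N) - mean * mean   # E[x^2] - E[x]^2
        return mean, variance

    # Example: partitions {1,2,3} and {4,5} give mean 3.0 and variance 2.0.
    print(combine([(6.0, 14.0, 3), (9.0, 41.0, 2)]))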
Trivial Observation 2
• To reduce the data passed over the network, combine appropriate statistics as early as possible.
• Consider the average. Recall that with MapReduce there are four steps (Map, Shuffle, Sort and Reduce), and Reduce pulls data from the local disk of the node that performed the Map.
• A Combine step in MapReduce combines local data before it is pulled for the Reduce step.
• There are built-in combiners for counts, means, etc.
Section 3.6 Hadoop Streams
Processing Big Data Pattern 4: Streams over Distributed File Systems
Hadoop Streams
• In addition to the Java API, Hadoop offers:
  – a streaming interface for any language that supports reading and writing standard in and out
  – Pipes for C++
• Why would I want to use something besides Java? Because Hadoop Streams provide direct access (without JNI/NIO) to:
  – C++ libraries like Boost and the GNU Scientific Library (GSL)
  – R modules
Pros and Cons

• Java
  + Best documented
  + Largest community
  – More LOC per MR job
• Python
  + Efficient memory handling
  + Programmers can be very efficient
  – Limited logging / debugging
• R
  + Vast collection of statistical algorithms
  – Poor error handling and memory handling
  – Less familiar to developers
Word Count Python Mapper

    import sys

    def read_input(file):
        for line in file:
            yield line.split()

    def main(separator='\t'):
        data = read_input(sys.stdin)
        for words in data:
            for word in words:
                # Emit (word, 1) pairs, one per line.
                print '%s%s%d' % (word, separator, 1)

    if __name__ == "__main__":
        main()
Word Count Python Reducer

    import sys
    from itertools import groupby
    from operator import itemgetter

    def read_mapper_output(file, separator='\t'):
        for line in file:
            yield line.rstrip().split(separator, 1)

    def main(sep='\t'):
        data = read_mapper_output(sys.stdin, separator=sep)
        # The shuffle/sort phase delivers the pairs sorted by key,
        # so groupby sees each word's values consecutively.
        for word, group in groupby(data, itemgetter(0)):
            total_count = sum(int(count) for word, count in group)
            print "%s%s%d" % (word, sep, total_count)

    if __name__ == "__main__":
        main()
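The two scripts above might be wired into a streaming job along these lines; the jar path and HDFS directories are illustrative, and the scripts need a #!/usr/bin/env python shebang and execute permission:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -input books/ -output counts/ \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py

Newer streaming releases also accept a script for -combiner; because word counts are sums, the reducer can double as the combine step from Section 3.5.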
MalStone Benchmark
                              MalStone A    MalStone B
  Hadoop MapReduce            455m 13s      840m 50s
  Hadoop Streams (Python)      87m 29s      142m 32s
  C++ implemented UDFs         33m 40s       43m 44s
Sector/Sphere 1.20, Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. The data consisted of 500 million 100-byte records per node on 20 nodes.
Word Count R Mapper

    trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
    # Helper used but not shown on the slide: split a line on whitespace.
    splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
        line <- trimWhiteSpace(line)
        words <- splitIntoWords(line)
        cat(paste(words, "\t1\n", sep = ""), sep = "")
    }
    close(con)
Word Count R Reducer

    trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
    splitLine <- function(line) {
        val <- unlist(strsplit(line, "\t"))
        list(word = val[1], count = as.integer(val[2]))
    }

    env <- new.env(hash = TRUE)
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
        line <- trimWhiteSpace(line)
        split <- splitLine(line)
        word <- split$word
        count <- split$count
Word Count R Reducer (cont'd)

        if (exists(word, envir = env, inherits = FALSE)) {
            oldcount <- get(word, envir = env)
            assign(word, oldcount + count, envir = env)
        } else
            assign(word, count, envir = env)
    }
    close(con)

    for (w in ls(env, all = TRUE))
        cat(w, "\t", get(w, envir = env), "\n", sep = "")
Word Count Java Mapper

    public static class Map
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }
Word Count Java Reducer

    public static class Reduce
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
Code Comparison – Word Count Mapper and Reducer
(The Python, R, and Java listings above, repeated side by side for comparison.)
Questions?
For the most current version of these notes, see rgrossman.com