Data-Intensive Computing for Text Analysis CS395T / INF385T / LIN386M
University of Texas at Austin, Fall 2011
Lecture 2 September 1, 2011
Matt Lease
School of Information
University of Texas at Austin
ml at ischool dot utexas dot edu
Jason Baldridge
Department of Linguistics
University of Texas at Austin
Jasonbaldridge at gmail dot com
Acknowledgments
Course design and slides derived from Jimmy Lin’s cloud computing courses at the University of Maryland, College Park
Some figures courtesy of
• Chuck Lam’s Hadoop In Action (2011)
• Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)
Roots in Functional Programming
[Figure: Map applies a function f independently to each input element; Fold aggregates the per-element results with a function g]
Divide and Conquer
[Figure: the "Work" is partitioned into pieces w1, w2, w3 handled by "workers"; their partial results r1, r2, r3 are combined into the final "Result"]
MapReduce "Big Ideas"
Scale “out”, not “up”
Limits of SMP and large shared-memory machines
Move processing to the data
Clusters have limited bandwidth
Process data sequentially, avoid random access
Seeks are expensive, disk throughput is reasonable
Seamless scalability
From the mythical man-month to the tradable machine-hour
Typical Large-Data Problem
Iterate over a large number of records
Compute something of interest from each
Shuffle and sort intermediate results
Aggregate intermediate results
Generate final output
Key idea: provide a functional abstraction for two of these operations: computing something of interest from each record (Map) and aggregating intermediate results (Reduce)
(Dean and Ghemawat, OSDI 2004)
MapReduce Data Flow
Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 45, 52
MapReduce “Runtime”
Handles scheduling
Assigns workers to map and reduce tasks
Handles “data distribution”
Moves processes to data
Handles synchronization
Gathers, sorts, and shuffles intermediate data
Handles errors and faults
Detects worker failures and restarts
Built on a distributed file system
MapReduce
Programmers specify two functions
map ( K1, V1 ) → list ( K2, V2 )
reduce ( K2, list(V2) ) → list ( K3, V3)
Note the correspondence of types: map output → reduce input
Data Flow
Input → “input splits”: each a sequence of logical (K1,V1) “records”
Map
• Each split is processed by a single map task
• map invoked iteratively: once per record in the split
• For each record processed, map may emit 0-N (K2,V2) pairs
Reduce
• reduce invoked iteratively for each ( K2, list(V2) ) intermediate value
• For each key processed, reduce may emit 0-N (K3,V3) pairs
Each reducer’s output written to a persistent file in HDFS
[Figure: an InputFormat divides Input Files into InputSplits; each InputSplit is read by a RecordReader, which feeds records to a Mapper that produces Intermediates]
Source: redrawn from a slide by Cloudera, cc-licensed
Data Flow
Input → “input splits”: each a sequence of logical (K1,V1) “records”
For each split, for each record, do map(K1,V1) (multiple calls)
Each map call may emit any number of (K2,V2) pairs (0-N)
Run-time
Groups all values with the same key into ( K2, list(V2) )
Determines which reducer will process each key
Copies data across network as needed for reducer
Ensures intra-node sort of keys processed by each reducer
• No guarantee by default of inter-node total sort across reducers
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 30
“Hello World”: Word Count
Map(String docid, String text):
for each word w in text:
Emit(w, 1);
Reduce(String term, Iterator<Int> values):
int sum = 0;
for each v in values:
sum += v;
Emit(term, sum);
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )
reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )
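As a concreteness check, the pseudocode above maps almost directly onto Hadoop. Below is a minimal sketch using the new API (org.apache.hadoop.mapreduce); the class names (WordCountMapper, WordCountReducer) are ours, tokenization is naive whitespace splitting, and with the default TextInputFormat the input key K1 is a byte offset (LongWritable) rather than a docid.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map ( K1=LongWritable offset, V1=Text line ) → list ( K2=Text word, V2=IntWritable 1 )
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {  // naive tokenization
      if (token.isEmpty()) continue;
      word.set(token);
      context.write(word, ONE);                            // Emit(w, 1)
    }
  }
}

// reduce ( K2=Text word, list(V2=IntWritable) ) → list ( K3=Text word, V3=IntWritable count )
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text term, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {  // new API: values are an Iterable, so for-each works
      sum += v.get();
    }
    context.write(term, new IntWritable(sum));             // Emit(term, sum)
  }
}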
[Figure: word count data flow: mappers emit (key, value) pairs such as (b 1), (a 2), (c 3), (c 6); Shuffle and Sort aggregates values by key, e.g., a → (1, 5), b → (2, 7), c → (2, 3, 6, 8); reducers then produce the final results r1/s1, r2/s2, r3/s3]
Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 45, 52
Partition
Given: map ( K1, V1 ) → list ( K2, V2 )
reduce ( K2, list(V2) ) → list ( K3, V3)
partition (K2, N) → Rj maps K2 to some reducer Rj in [1..N]
Each distinct key (with associated values) sent to a single reducer
• Same reduce node may process multiple keys in separate reduce() calls
Balances workload across reducers: roughly equal number of keys to each
• Default: simple hash of the key, e.g., hash(k’) mod N (# reducers); see the sketch after this list
Customizable
• Some keys require more computation than others
• e.g. value skew, or key-specific computation performed
• For skew, sampling can dynamically estimate distribution & set partition
• Secondary/Tertiary sorting (e.g. bigrams or arbitrary n-grams)?
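For reference, the default behavior is just a hash of the key modulo the number of reducers. Below is a minimal sketch in the style of Hadoop's HashPartitioner (new API); the class name is ours.

import org.apache.hadoop.mapreduce.Partitioner;

// Hash partitioner sketch: every occurrence of the same key K2 is sent to the
// same reducer index in [0, numReduceTasks).
public class SimpleHashPartitioner<K2, V2> extends Partitioner<K2, V2> {
  @Override
  public int getPartition(K2 key, V2 value, int numReduceTasks) {
    // Mask the sign bit so the modulo result is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}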
Secondary Sorting (Lin 57, White 241)
How do we output sorted bigrams (sorted by 1st word, then by the list of 2nd words)?
What if we use word1 as the key, word2 as the value?
What if we use <first>--<second> as the key?
Pattern
Create a composite key of (first, second)
Define a Key Comparator based on both words
• This will produce the sort order we want (aa ab ac ba bb bc ca cb…)
Define a partition function based only on first word
• All bigrams with the same first word go to same reducer
• How do you know when the first word changes across invocations?
Preserve state in the reducer across invocations
• Will be called separately for each bigram, but we want to remember
the current first word across bigrams seen
Hadoop also provides a Group Comparator (controls which keys are grouped together into a single reduce() call)
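A minimal sketch of the partitioning piece of this pattern. BigramWritable is a hypothetical composite key (first word, second word) with a getFirst() accessor; the Key Comparator and Group Comparator would be written analogously, comparing both words and only the first word, respectively.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the first word only, so all bigrams sharing a first word reach
// the same reducer; the key comparator still sorts on (first, second).
// BigramWritable is a hypothetical composite key, not a Hadoop built-in.
public class FirstWordPartitioner extends Partitioner<BigramWritable, IntWritable> {
  @Override
  public int getPartition(BigramWritable key, IntWritable value, int numReduceTasks) {
    return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}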
Combine
Given: map ( K1, V1 ) → list ( K2, V2 )
reduce ( K2, list(V2) ) → list ( K3, V3)
combine ( K2, list(V2) ) → list ( K2, V2 )
Optional optimization
Local aggregation to reduce network traffic
No guarantee it will be used, nor how many times it will be called
Semantics of program cannot depend on its use
Signature: same input as reduce, same output as map
Combine may be run repeatedly on its own output
Lin: if reduce is associative & commutative, the combiner can be the reducer itself
• See next slide
Functional Properties
Associative: f( a, f(b,c) ) = f( f(a,b), c )
Grouping of operations doesn’t matter
YES: Addition, multiplication, concatenation
NO: division, subtraction, NAND
NAND(1, NAND(1,0)) = 0 != 1 = NAND( NAND(1,0), 0 )
Commutative: f(a,b) = f(b,a)
Ordering of arguments doesn’t matter
YES: addition, multiplication, NAND
NO: division, subtraction, concatenation
Concatenate("a", "b") != Concatenate("b", "a")
Distributive
White (p. 32) and Lam (p. 84) mention distributivity with regard to combiners
But really, go with associative + commutative in Lin (pp. 20, 27)
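Because integer addition is both associative and commutative, word count can reuse its reducer as the combiner. A minimal sketch of the driver-side wiring, using the hypothetical classes from the word count sketch above:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerWiring {
  // Sketch only: the reducer doubles as the combiner. This is safe because
  // word count’s reduce (integer addition) is associative and commutative,
  // so running it 0..N times over map output cannot change the final sums.
  static Job configure() throws IOException {
    Job job = new Job(new Configuration(), "word count with combiner");
    job.setMapperClass(WordCountMapper.class);     // hypothetical, from the sketch above
    job.setCombinerClass(WordCountReducer.class);  // combiner = reducer
    job.setReducerClass(WordCountReducer.class);
    return job;
  }
}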
[Figure: word count data flow with combiners and partitioners: each mapper's output is locally combined (e.g., (c 3) and (c 6) become (c 9)) and partitioned before Shuffle and Sort aggregates values by key; reducers then produce the final results r1/s1, r2/s2, r3/s3]
[Figure: MapReduce execution overview: (1) the User Program submits the job to the Master; (2) the Master schedules map and reduce tasks on workers; (3) map workers read their assigned input splits (split 0..split 4); (4) map output is written to intermediate files on local disk; (5) reduce workers remotely read the intermediate data; (6) reducers write the output files (output file 0, output file 1)]
Adapted from (Dean and Ghemawat, OSDI 2004)
Shuffle and 2 Sorts
As map emits values, local sorting runs in tandem (1st sort)
Combine is optionally called 0..N times for local aggregation on sorted (K2, list(V2)) tuples (more sorting of output)
Partition determines which (logical) reducer Rj each key will go to
Node’s TaskTracker tells JobTracker it has keys for Rj
JobTracker determines node to run Rj based on data locality
When the local map/combine/sort finishes, the node sends its data to Rj’s node
Rj’s node iteratively merges inputs from map nodes as they arrive (2nd sort)
For each (K, list(V)) tuple in merged output, call reduce(…)
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178
Distributed File System
Don’t move data… move computation to the data!
Store data on the local disks of nodes in the cluster
Start up the workers on the node that has the data local
Why?
Not enough RAM to hold all the data in memory
Disk access is slow, but disk throughput is reasonable
A distributed file system is the answer
GFS (Google File System) for Google’s MapReduce
HDFS (Hadoop Distributed File System) for Hadoop
GFS: Assumptions
Commodity hardware over “exotic” hardware
Scale “out”, not “up”
High component failure rates
Inexpensive commodity components fail all the time
“Modest” number of huge files
Multi-gigabyte files are common, if not encouraged
Files are write-once, mostly appended to
Perhaps concurrently
Large streaming reads over random access
High sustained throughput over low latency
GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
GFS: Design Decisions
Files stored as chunks
Fixed size (64MB)
Reliability through replication
Each chunk replicated across 3+ chunkservers
Single master to coordinate access, keep metadata
Simple centralized management
No data caching
Little benefit due to large datasets, streaming reads
Simplify the API
Push some of the issues onto the client (e.g., data layout)
HDFS = GFS clone (same basic ideas)
Basic Cluster Components
1 “Manager” node (can be split onto 2 nodes)
Namenode (NN)
Jobtracker (JT)
1-N “Worker” nodes
Tasktracker (TT)
Datanode (DN)
Optional Secondary Namenode
Periodic backups of Namenode in case of failure
Hadoop Architecture
Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 24-25
Namenode Responsibilities
Managing the file system namespace:
Holds file/directory structure, metadata, file-to-block mapping,
access permissions, etc.
Coordinating file operations:
Directs clients to datanodes for reads and writes
No data is moved through the namenode
Maintaining overall health:
Periodic communication with the datanodes
Block re-replication and rebalancing
Garbage collection
Putting everything together…
[Figure: cluster architecture: a namenode runs the namenode daemon, a job submission node runs the jobtracker, and each slave node runs a tasktracker and a datanode daemon on top of the Linux file system]
Anatomy of a Job
MapReduce program in Hadoop = Hadoop job
Jobs are divided into map and reduce tasks (+ more!)
An instance of running a task is called a task attempt
Multiple jobs can be composed into a workflow
Job submission process
Client (i.e., driver program) creates a job, configures it, and
submits it to the JobTracker
JobClient computes input splits (on client end)
Job data (jar, configuration XML) are sent to JobTracker
JobTracker puts job data in shared location, enqueues tasks
TaskTrackers poll for tasks
Off to the races…
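As a concrete illustration of the client-side half of this process, here is a minimal driver sketch (new API). WordCountMapper and WordCountReducer are the hypothetical classes from the earlier sketch; input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");            // client-side job configuration
    job.setJarByClass(WordCountDriver.class);         // jar to ship with the job
    job.setMapperClass(WordCountMapper.class);        // hypothetical classes from the
    job.setReducerClass(WordCountReducer.class);      //   word count sketch above
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input; splits computed client-side
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // each reducer writes one file here
    System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit, then poll until done
  }
}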
Why have 1 API when you can have 2?
White pp. 25-27, Lam pp. 77-80
Hadoop 0.19 and earlier had “old API”
Hadoop 0.21 and forward has “new API”
Hadoop 0.20 has both!
Old API most stable, but deprecated
Current books use old API predominantly, but discuss changes
• Example code using new API available online from publisher
Some old API classes/methods not yet ported to new API
Cloud9 uses both, and you can too
Old API
Mapper (interface)
void map(K1 key, V1 value, OutputCollector<K2, V2> output,
Reporter reporter)
void configure(JobConf job)
void close() throws IOException
Reducer/Combiner
void reduce(K2 key, Iterator<V2> values,
OutputCollector<K3,V3> output, Reporter reporter)
void configure(JobConf job)
void close() throws IOException
Partitioner
int getPartition(K2 key, V2 value, int numPartitions)
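For comparison, a minimal old-API word count mapper might look like the following sketch (the class name is ours):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old API: Mapper is an interface; MapReduceBase supplies no-op configure()/close().
public class OldApiWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      output.collect(word, ONE);  // old API: OutputCollector.collect(), not Context.write()
    }
  }
}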
New API
org.apache.hadoop.mapred now deprecated; instead use
org.apache.hadoop.mapreduce &
org.apache.hadoop.mapreduce.lib
Mapper, Reducer now abstract classes, not interfaces
Use Context instead of OutputCollector and Reporter
Context.write(), not OutputCollector.collect()
Reduce takes value list as Iterable, not Iterator
Can use Java’s for-each syntax for iterating
Can throw InterruptedException as well as IOException
JobConf & JobClient replaced by Configuration & Job