Hands On: Multimedia Methods for Large Scale Video Analysis...

$: Hands On: Multimedia Methods for Large Scale Video Analysis …fractor/fall2012/cs294-18-2012.pdf · 2012-11-15 · Hands On: Multimedia Methods for Large Scale Video Analysis (Lecture)$
Hands On: Multimedia Methods for Large Scale Video Analysis (Lecture)

Dr. Gerald Friedland, [email protected]

1

Thursday, November 15, 12

mailto:[email protected]

mailto:[email protected]

Today

2

•Comments on the Mid-Term•MapReduce

•Intro•Implementation•Algorithm Design

Slides excerpts by - Malcolm Slanley, Microsoft Research

- Jimmy Lin, Chris Dyer, UMDThursday, November 15, 12

Midterm Results

3

•Max Score: 40 points = 100%•Mean Result: 33.2 points = 83%•Variance: 18.4 points, StdDev: 4.28


Comments on the Mid-Term

4

Memory thrashing: If a process does not have enough pages, thrashing is a high paging activity, and the page-fault rate is high.

=> low CPU utilization, high I/O.


Unix Command: top

5


MapReduce

6

•Idea originated in functional programming

•First large use for parallelization at Google for accessing BigTable.

•Killer app: Text indexing!Thursday, November 15, 12

7

Basic Parallelization Pattern


“Work”

7



“Work”

w1 w2 w3

7



“Work”

w1 w2 w3

“worker” “worker” “worker”

7



“Work”

w1 w2 w3

r1 r2 r3


7



“Work”

w1 w2 w3

r1 r2 r3

“Result”


7



“Work”

w1 w2 w3

r1 r2 r3

“Result”


Partition

7



“Work”

w1 w2 w3

r1 r2 r3

“Result”


Partition

Combine

7



Reality


Reality

Fundamental issuesscheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, …


Reality


Architectural issuesFlynn’s taxonomy (SIMD, MIMD, etc.),network typology, bisection bandwidthUMA vs. NUMA, cache coherence


Reality

Message Passing

P1 P2 P3 P4 P5

Shared Memory

P1 P2 P3 P4 P5

Mem

ory

Different programming modelsFundamental issuesscheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, …



Reality

Message Passing

P1 P2 P3 P4 P5

Shared Memory

P1 P2 P3 P4 P5

Mem

ory

Different programming modelsFundamental issuesscheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, …

Common problemslivelock, deadlock, data starvation, priority inversion…dining philosophers, sleeping barbers, cigarette smokers, …



Reality

Message Passing

P1 P2 P3 P4 P5

Shared Memory

P1 P2 P3 P4 P5

Mem

ory

Different programming models

Different programming constructsmutexes, conditional variables, barriers, …masters/slaves, producers/consumers, work queues, …





Reality

Message Passing

P1 P2 P3 P4 P5

Shared Memory

P1 P2 P3 P4 P5

Mem

ory

Different programming models

Different programming constructsmutexes, conditional variables, barriers, …masters/slaves, producers/consumers, work queues, …




The reality: programmer shoulders the burden of managing concurrency… Solutions: See PARLAB


Common Workflow

• Iterate over a large number of records

• Extract something of interest from each

• Shuffle and sort intermediate results

• Aggregate intermediate results• Generate final output

(Dean and Ghemawat, OSDI 2004)9


Common Workflow





Map



Common Workflow





Map

Reduce



Common Workflow





Key idea: provide a functional abstraction for these two operations

Map

Reduce



MapReduce as a Diagram

10


f f f f f


10


f f f f fMap


10


g

f f f f fMap


10


g g

f f f f fMap


10


g g g

f f f f fMap


10


g g g g

f f f f fMap


10


g g g g g

f f f f fMap


10


g g g g g

f f f f fMap

Fold


10


g g g g g

f f f f fMap

Fold

Map

Reduce


10


MapReduce in Haskell

11

Very parallelizable!

Google: Part 2

� We already know that Google have a massive amount ofcomputational power:

� This is largely based on a cluster of workstation type computers.� The conglomerate result is a MIMD type parallel computer.

� We also know that to use this, we need some efficient parallelalgorithms:

� The PageRank web-page ranking system is one example of this.� To efficiently execute PageRank we need good graph algorithms, good

data structures and so on ...

� The question is, how do we bridge the gap between the two and

implement software on the Google cluster ?

� We look at MapReduce, a system that performs this task in an

interesting way.

Dan Page

COMS21102 : Software Engineering Slide 1

Introduction (1)

� One option might be to use a system like Condor:� Condor is basically a system for harnessing idle processing time.� Jobs are submitted to a queue and distributed to nodes in cluster in a

managed way.� When node has no user load, job is run but stopped again when user

load returns.

� Negatives:� Doesn’t really support parallelism, only batch processing.� No support for inter-process communication.

� Positives:� Due to lack of parallelism, the programming model is easy to understand.� Can automatically check-point processes and recover from faults.� Fairly extensible in that new processors can be easily added.

Dan Page


Introduction (2)

� One option might be to use a system like MPI:� MPI is basically a system for writing message-passing programs.� Jobs are forcibly executed on nodes in cluster.

� Negatives:� Need to explicitly define parallelism in programs, programming model is

only quite easy.� Doesn’t automatically deal with issues of fault tolerance.� There is no real management or load-balancing.

� Positives:� Can be used as an API to C, a well known and portable language.� Some issues of communication are automatically optimised for platform.

Dan Page


Introduction (3)

� Neither of these options is ideal (obviously there are some others):� We might have to manually deal with parallelism.� We might have to manually deal with scalability.� We might have to manually deal with fault tolerance.

� Google decided that in order to get the best out of their platform, they

would change the way people wrote parallel programs ...

Dan Page


A Recap on Haskell (1)

� The concept of higher-order functions is central to programming inHaskell:

� Functions can take other functions as arguments.� Functions can return other functions as results.

� Very roughly, this concept is like using function pointers in C orinterfaces in Java:

� In Haskell we can create functions on-the-fly, in C and Java they are

static; we have to define them in the source code.� In Haskell the concept is natural, in C and Java it (arguably) isn’t.

Dan Page


A Recap on Haskell (2) – map

� As an example, consider the map function:

map :: (A->B) -> [A] -> [B]

map f [] = []

map f (x:xs) = f x : map f xs

� The inputs to map are:� A unary function f : A⇥ B.� A list ⌅v0, v1, . . . vn�1⇧ with vi ⇤ A.

� The output of map is the list ⇥v ⇥0, v ⇥1, . . . v ⇥n�1⇤ with v ⇥i � B such thatv ⇥i = f (vi).

� The key point is the fact we can pass a function into map for it to use.

� So, for example, we can execute map (+1) [1,2,3] and get the

result [2,3,4].

Dan Page


A Recap on Haskell (3) – reduce

� Another example is the reduce function:

reduce :: (A->B->B) -> B -> [A] -> B

reduce f y [] = y

reduce f y (x:xs) = f x (reduce f y xs)

� The inputs to reduce are:� A binary function f : A� B ⇥ B.� A value y ⇤ B.� A list ⌅v0, v1, . . . vn�1⇧ with vi ⇤ A.

� The output of reduce is the value f (v0, f (v1, . . . f (vn�1, y))) � B.

� Again, the key point is the fact we can pass a function into reduce for

it to use.

� So, for example, we can execute reduce (+) 0 [1,2,3] and get

the result 6.

Dan Page



� In Haskell programming reduce is often called foldr or fold-right.

� We can also define foldl, or fold-left which applies the function f inthe opposite order:

foldl :: (A->B->A) -> A -> [B] -> A

foldl f y [] = y

foldl f y (x:xs) = f (foldl f y xs) x

� So, for example, we can still write foldr (+) 0 [1,2,3] and get

the result 6 since addition is associative.

� But division is not associative so:� Executing foldr (/) 64 [4,2,4] gives 0.125.� Executing foldl (/) 64 [4,2,4] gives 2.000.

Dan Page



� It turns out that many useful and quite powerful functions can beexpressed using just reduce.

� All these example use the concept of creating a function on-the-fly.� You can even define map using reduce if you want !

length :: [A] -> Int

length xs = reduce (\y n -> 1 + n) 0 xs

reverse :: [A] -> [A]

reverse xs = reduce (\y ys -> ys ++ [y]) [] xs

map :: (A->B) -> [A] -> [B]

map f xs = reduce (\y ys -> f y : ys) [] xs

filter :: (A->Bool) [A] -> [A]

filter p xs = reduce (\y ys -> if p y then y:ys else ys) [] xs

� It might be a good idea to write Haskell library code like this ...� Imagine if you are writing a Haskell compiler such as GHC.� To ensure your generated programs are efficient, you just need to

optimise the reduce implementation in the library.

Dan Page


Google: Part 2

� We already know that Google have a massive amount ofcomputational power:

� This is largely based on a cluster of workstation type computers.� The conglomerate result is a MIMD type parallel computer.

� We also know that to use this, we need some efficient parallelalgorithms:

� The PageRank web-page ranking system is one example of this.� To efficiently execute PageRank we need good graph algorithms, good

data structures and so on ...

� The question is, how do we bridge the gap between the two and

implement software on the Google cluster ?

� We look at MapReduce, a system that performs this task in an

interesting way.

Dan Page


Introduction (1)

� One option might be to use a system like Condor:� Condor is basically a system for harnessing idle processing time.� Jobs are submitted to a queue and distributed to nodes in cluster in a

managed way.� When node has no user load, job is run but stopped again when user

load returns.

� Negatives:� Doesn’t really support parallelism, only batch processing.� No support for inter-process communication.

� Positives:� Due to lack of parallelism, the programming model is easy to understand.� Can automatically check-point processes and recover from faults.� Fairly extensible in that new processors can be easily added.

Dan Page


Introduction (2)

� One option might be to use a system like MPI:� MPI is basically a system for writing message-passing programs.� Jobs are forcibly executed on nodes in cluster.

� Negatives:� Need to explicitly define parallelism in programs, programming model is

only quite easy.� Doesn’t automatically deal with issues of fault tolerance.� There is no real management or load-balancing.

� Positives:� Can be used as an API to C, a well known and portable language.� Some issues of communication are automatically optimised for platform.

Dan Page


Introduction (3)

� Neither of these options is ideal (obviously there are some others):� We might have to manually deal with parallelism.� We might have to manually deal with scalability.� We might have to manually deal with fault tolerance.

� Google decided that in order to get the best out of their platform, they

would change the way people wrote parallel programs ...

Dan Page


A Recap on Haskell (1)

� The concept of higher-order functions is central to programming inHaskell:

� Functions can take other functions as arguments.� Functions can return other functions as results.

� Very roughly, this concept is like using function pointers in C orinterfaces in Java:

� In Haskell we can create functions on-the-fly, in C and Java they are

static; we have to define them in the source code.� In Haskell the concept is natural, in C and Java it (arguably) isn’t.

Dan Page


A Recap on Haskell (2) – map

� As an example, consider the map function:

map :: (A->B) -> [A] -> [B]

map f [] = []

map f (x:xs) = f x : map f xs

� The inputs to map are:� A unary function f : A⇥ B.� A list ⌅v0, v1, . . . vn�1⇧ with vi ⇤ A.

� The output of map is the list ⇥v ⇥0, v ⇥1, . . . v ⇥n�1⇤ with v ⇥i � B such thatv ⇥i = f (vi).

� The key point is the fact we can pass a function into map for it to use.

� So, for example, we can execute map (+1) [1,2,3] and get the

result [2,3,4].

Dan Page



� Another example is the reduce function:

reduce :: (A->B->B) -> B -> [A] -> B

reduce f y [] = y

reduce f y (x:xs) = f x (reduce f y xs)

� The inputs to reduce are:� A binary function f : A� B ⇥ B.� A value y ⇤ B.� A list ⌅v0, v1, . . . vn�1⇧ with vi ⇤ A.

� The output of reduce is the value f (v0, f (v1, . . . f (vn�1, y))) � B.

� Again, the key point is the fact we can pass a function into reduce for

it to use.

� So, for example, we can execute reduce (+) 0 [1,2,3] and get

the result 6.

Dan Page



� In Haskell programming reduce is often called foldr or fold-right.

� We can also define foldl, or fold-left which applies the function f inthe opposite order:

foldl :: (A->B->A) -> A -> [B] -> A

foldl f y [] = y

foldl f y (x:xs) = f (foldl f y xs) x

� So, for example, we can still write foldr (+) 0 [1,2,3] and get

the result 6 since addition is associative.

� But division is not associative so:� Executing foldr (/) 64 [4,2,4] gives 0.125.� Executing foldl (/) 64 [4,2,4] gives 2.000.

Dan Page



� It turns out that many useful and quite powerful functions can beexpressed using just reduce.

� All these example use the concept of creating a function on-the-fly.� You can even define map using reduce if you want !

length :: [A] -> Int

length xs = reduce (\y n -> 1 + n) 0 xs

reverse :: [A] -> [A]

reverse xs = reduce (\y ys -> ys ++ [y]) [] xs

map :: (A->B) -> [A] -> [B]

map f xs = reduce (\y ys -> f y : ys) [] xs

filter :: (A->Bool) [A] -> [A]

filter p xs = reduce (\y ys -> if p y then y:ys else ys) [] xs

� It might be a good idea to write Haskell library code like this ...� Imagine if you are writing a Haskell compiler such as GHC.� To ensure your generated programs are efficient, you just need to

optimise the reduce implementation in the library.

Dan Page



MapReduce in Practice• Programmers specify two

functions:map (k, v) → <k’, v’>*reduce (k’, v’) → <k’, v’>*– All values with the same key are

reduced together• Usually, programmers also

specify:partition (k’, number of partitions) → partition for k’

– Often a simple hash of the key, e.g. hash(k’) mod n

12


MapReduce

http://people.apache.org/~rdonkin/hadoop-talk/diagrams/map-reduce.pngThursday, November 15, 12

A MapReduce Engine Typically• Handles scheduling

– Assigns workers to map and reduce tasks

• Handles “data distribution”– Moves the process to the data

• Handles synchronization– Gathers, sorts, and shuffles

intermediate data• Handles faults

– Detects worker failures and restarts 14


What to do with I/O?• Don’t move data to workers…

Move workers to the data!– Store data on the local disks for

nodes in the cluster– Start up the workers on the node

that has the data local• Why?

– Not enough RAM to hold all the data in memory

– Disk access is slow, need to be serialized!

– Disk becomes bottleneck! 15


Distributed File System

• A distributed file system is the answer– GFS (Google File System)– HDFS for Hadoop (= GFS clone)– Amazon S3

16


GFS: Design Decisions• Files stored as chunks

– Fixed size (64MB)• Reliability through replication

– Each chunk replicated across 3+ chunkservers

• Single master to coordinate access, keep metadata– Simple centralized management

• No data caching– Little benefit due to large data sets,

streaming reads• Simplify the API

– Push some of the issues onto the client 17


MapReduce Implementation

18


Hadoop


Hadoop Word Count

• #!/usr/bin/env python • import sys • # input comes from STDIN (standard input) • for line in sys.stdin:

# remove leading and trailing whitespace line = line.strip() # split the line into words words = line.split() # increase counters for word in words: # write the results to STDOUT (standard output); # what we output here will be the input for the # Reduce step, i.e. the input for reducer.py # # tab-delimited; the trivial word count is 1 print '%s\t%s' % (word, 1)

http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python


Hadoop Word Count • #!/usr/bin/env python • import sys • word2count = {} • # input comes from STDIN • for line in sys.stdin:

# remove leading and trailing whitespace line = line.strip() # parse the input we got from mapper.pyword, count = line.split('\t', 1) # convert count (currently a string) to inttry: count = int(count) word2count[word] = word2count.get(word, 0) + count except ValueError: # count was not a number, so silently # ignore/discard this line pass


Word Count Command

• bin/hadoopjar contrib/streaming/

hadoop-0.19.1streaming.jar-mapper /home/hadoop/

mapper.py-reducer /home/hadoop/

reducer.py-input gutenberg/* -output gutenberg-output


HadoopMapper• def ReadAndDispatch():• while True:• theLine = sys.stdin.readline()• if theLine == '' or theLine == None:• break• if theLine[0] == '#':• continue• sys.stderr.write("Working on " + theLine + "\n")• args = theLine.split()• if len(args) < 1:• continue• if args[0] == 'expected':• GetOneResult(int(args[1]), int(args[2]), \• int(args[3]), [int(args[4])], \• [float(args[5])])• elif args[0] == 'hyperexpected':• GetOneResult(int(args[1]), int(args[2]), \• int(args[3]), [int(args[4])], \• [float(args[5])], hyper=float(args[6]))• else:• print "Unknown command:", args[0]


Run Hadoop Example• $HADOOP_HOME/bin/hadoopfs -rmrmyOutputDir• EXP=run.tst8

• $HADOOP_HOME/bin/hadoopfs -rm $EXP • $HADOOP_HOME/bin/hadoopfs -put $EXP .

• $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \• -Dmapred.job.queue.name=keystone \• -Dmapred.map.tasks=50 \• -Dmapred.task.timeout=9999000 \• -Dmapred.reduce.tasks=0 \• -archives hdfs://axoniteblue-nn1.blue.ygrid.yahoo.com:8020/user/malcolm/

numpy4python2.5.tgz \• -cmdenv PYTHONPATH=./numpy4python2.5.tgz/ \• -cmdenv LD_LIBRARY_PATH=./numpy4python2.5.tgz/numpy \• -input $EXP \• -mapper "${HOD_PYTHON_HOME} TestRecall.py -hadoop" \• -output myOutputDir \• -reducer /bin/cat \• -file lsh.py \• -file TestRecall.py

• $HADOOP_HOME/bin/hadoopfs -cat myOutputDir/part* > $EXP.out


Hadoop Status


Pricing (Amazon)


MapReduce Algorithm Design

Adapted from work reported in (Lin, EMNLP 2008)27


Managing Dependencies

28



• Remember: Mappers run in isolation– You have no idea in what order the mappers run– You have no idea on what node the mappers run– You have no idea when each mapper finishes

28



• Remember: Mappers run in isolation– You have no idea in what order the mappers run– You have no idea on what node the mappers run– You have no idea when each mapper finishes

• Tools for synchronization:– Ability to hold state in reducer across multiple

key-value pairs– Sorting function for keys– Partitioner– Cleverly-constructed data structures

28


Motivating Example

29


Motivating Example

• Term co-occurrence matrix for a text collection– M = N x N matrix (N = vocabulary size)– Mij: number of times i and j co-occur in some

context (for concreteness, let’s say context = sentence)

29


Motivating Example

• Term co-occurrence matrix for a text collection– M = N x N matrix (N = vocabulary size)– Mij: number of times i and j co-occur in some

context (for concreteness, let’s say context = sentence)

• Why?– Distributional profiles as a way of measuring

semantic distance– Semantic distance useful for many language

processing tasks29


MapReduce: Large

• Term co-occurrence matrix for a text collection= specific instance of a large counting problem– A large event space (number of terms)– A large number of observations (the collection itself)– Goal: keep track of interesting statistics about the

events• Basic approach

– Mappers generate partial counts– Reducers aggregate partial counts

30


MapReduce: Large

• Term co-occurrence matrix for a text collection= specific instance of a large counting problem– A large event space (number of terms)– A large number of observations (the collection itself)– Goal: keep track of interesting statistics about the

events• Basic approach

– Mappers generate partial counts– Reducers aggregate partial counts

How do we aggregate partial counts efficiently?30


First Try: “Pairs”

• Each mapper takes a sentence:– Generate all co-occurring term

pairs– For all pairs, emit (a, b) → count

• Reducers sums up counts associated with these pairs

• Use combiners!

31


“Pairs” Analysis

• Advantages– Easy to implement, easy to understand

• Disadvantages– Lots of pairs to sort and shuffle around

(upper bound?)

32


Another Try: “Stripes”

• Idea: group together pairs into an associative array

• Each mapper takes a sentence:– Generate all co-occurring term pairs– For each term, emit a → { b: countb, c: countc, d: countd … }

• Reducers perform element-wise sum of associative arrays

(a, b) → 1 (a, c) → 2 (a, d) → 5 (a, e) → 3 (a, f) → 2

a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

a → { b: 1, d: 5, e: 3 }a → { b: 1, c: 2, d: 2, f: 2 }a → { b: 2, c: 2, d: 7, e: 3, f: 2 }

+33


“Stripes” Analysis

• Advantages– Far less sorting and shuffling of key-

value pairs– Can make better use of combiners

• Disadvantages– More difficult to implement– Underlying object is more heavyweight– Fundamental limitation in terms of size

of event space34


Cluster size: 38 coresData Source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3), which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed) 35


Conditional ProbabilitiesApproach• How do we estimate

conditional probabilities from counts?

• Why do we want to do this?• How do we do this with

MapReduce?36


P(B|A): “Stripes”

• Easy!– One pass to compute (a, *)– Another pass to directly compute

P(B|A)

a → {b1:3, b2 :12, b3 :7, b4 :1, … }

37


P(B|A): “Pairs”

• For this to work:– Must emit extra (a, *) for every bn in mapper– Must make sure all a’s get sent to same reducer (use

partitioner)– Must make sure (a, *) comes first (define sort order)– Must hold state in reducer across different key-value pairs

(a, b1) → 3 (a, b2) → 12 (a, b3) → 7(a, b4) → 1 …

(a, *) → 32 (a, b1) → 3 / 32 (a, b2) → 12 / 32(a, b3) → 7 / 32(a, b4) → 1 / 32…

Reducer holds this value in memory

38


Synchronization in Hadoop

• Approach 1: turn synchronization into an ordering problem– Sort keys into correct order of computation– Partition key space so that each reducer

gets the appropriate set of partial results– Hold state in reducer across multiple key-

value pairs to perform computation– Illustrated by the “pairs” approach

39


Synchronization in Hadoop

• Approach 2: construct data structures that “bring the pieces together”– Each reducer receives all the data it needs

to complete the computation– Illustrated by the “stripes” approach

40


Issues and Tradeoffs• Number of key-value pairs

– Object creation overhead– Time for sorting and shuffling pairs

across the network• Size of each key-value pair

– De/serialization overhead• Combiners make a big difference!

– RAM vs. disk and network– Arrange data to maximize

opportunities to aggregate partial results 41


In general: Issues

42


In general: Issues

• The optimally-parallelized version doesn’t exist!

42


In general: Issues


• It’s all about the right level of abstraction

42


In general: Issues


• It’s all about the right level of abstraction

• Hadoop has Overhead!

42


Next Week (Project Meeting)

•Stephanie Pancoast (Stanford)


Next Week (Lecture)

44

Mehmet Emre Sargin (Google Research/YouTube)


Hands On: Multimedia Methods for Large Scale Video Analysis...

Documents

Transcript of Hands On: Multimedia Methods for Large Scale Video Analysis...