Map-Reduce and Parallel Computing for Large-Scale Media Processing
Youjie Zhou


Page 1: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Map-Reduce and Parallel Computing for Large-Scale Media Processing

Youjie Zhou

Page 2: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Outline

►Motivations
►Map-Reduce Framework
►Large-scale Multimedia Processing Parallelization
►Machine Learning Algorithm Transformation
►Map-Reduce Drawbacks and Variants
►Conclusions

Page 3: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Motivations

►Why do we need parallelization? "Time is money"
 ►work on many pieces simultaneously
 ►divide-and-conquer
►Data is too huge to handle
 ►1 trillion (10^12) unique URLs in 2008
 ►CPU speed limitation

Page 4: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Motivations

►Why do we need parallelization? Increasing data
 ►social networks
 ►scalability!
►"Brute force"
 ►no approximations
 ►cheap clusters vs. expensive computers

Page 5: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Motivations

►Why do we choose Map-Reduce? It is popular
 ►a parallelization framework Google proposed and uses every day
 ►Yahoo and Amazon are also involved
►Popular = good?
 ►"hides" parallelization details from users
 ►provides high-level operations that suit the majority of algorithms
►A good start for deeper parallelization research

Page 6: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Map-Reduce Framework

►A simple idea inspired by functional languages (like LISP)
 map
 ►a type of iteration in which a function is successively applied to each element of one sequence
 reduce
 ►a function that combines all the elements of a sequence using a binary operation
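
The two primitives have direct analogues in Python's standard library; a minimal sketch (mine, not from the deck) of the functional-language analogy:

    from functools import reduce

    words = ["hello", "world", "hello"]

    # map: successively apply a function to each element of a sequence
    pairs = list(map(lambda w: (w, 1), words))  # [('hello', 1), ('world', 1), ('hello', 1)]

    # reduce: combine all elements of a sequence with a binary operation
    total = reduce(lambda a, b: a + b, (count for _, count in pairs))  # 3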

Page 7: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Map-Reduce Framework

►Data representation: <key,value>
 map generates <key,value> pairs
 reduce combines <key,value> pairs according to the same key
►"Hello, world!" example

Page 8: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Map-Reduce Framework

[Dataflow diagram: the input data is partitioned into splits (split0, split1, split2); each split feeds a map task; the map outputs are shuffled to reduce tasks; the reduce outputs are merged into the final output.]

Page 9: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Map-Reduce Framework

►Count the appearances of each different word in a set of documents

void map(Document doc)
    // emit a count of 1 for every word occurrence
    for each word in doc
        generate <word, 1>

void reduce(Word word, CountList counts)
    // sum the per-occurrence counts for this word
    int count = 0
    for each number in counts
        count += number
    generate <word, count>

Page 10: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Map-Reduce Framework

►Different implementations: distributed computing
 ►each computer acts as a computing node
 ►focusing on reliability over distributed computer networks
 ►Google's clusters
  closed source
  GFS: distributed file system
 ►Hadoop
  open source
  HDFS: Hadoop Distributed File System

Page 11: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Map-Reduce Framework

►Different implementations: multi-core computing
 ►each core acts as a computing node
 ►focusing on high-speed computing using large shared memories
 ►Phoenix++
  a two-dimensional <key,value> table stored in memory, where map and reduce read and write pairs
  open source, created at Stanford
 ►GPU
  10x higher memory bandwidth than a CPU
  5x to 32x speedups on SVM training

Page 12: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Large-scale Multimedia Processing Parallelization

►Clustering
 k-means
 Spectral Clustering
►Classifier training
 SVM
►Feature extraction and indexing
 Bag-of-Features
 Text Inverted Indexing

Page 13: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Clustering

►k-means
 Basic and fundamental
 Original algorithm:
 1. Pick k initial center points
 2. Iterate until convergence:
    1. Assign each point to the nearest center
    2. Calculate new centers
 Easy to parallelize!

Page 14: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Clustering

►k-means (a sketch follows below)
 a shared file contains the center points
 map
 1. for each point, find the nearest center
 2. generate a <key,value> pair
    key: center id
    value: the current point's coordinates
 reduce
 1. collect all points belonging to the same cluster (they have the same key value)
 2. calculate the average: the new center
 iterate
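
One such iteration can be simulated on a single machine; this Python sketch uses hypothetical helper names and ignores the shared center file and the cluster runtime:

    from collections import defaultdict
    import math

    def nearest_center(point, centers):
        # index of the closest center by Euclidean distance
        return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

    def kmeans_map(points, centers):
        # map: emit a <center_id, point> pair for each point
        for p in points:
            yield nearest_center(p, centers), p

    def kmeans_reduce(pairs):
        # reduce: group points by center id and average them into new centers
        groups = defaultdict(list)
        for cid, p in pairs:
            groups[cid].append(p)
        return {cid: tuple(sum(d) / len(pts) for d in zip(*pts))
                for cid, pts in groups.items()}

    points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
    centers = [(0.0, 0.0), (5.0, 5.0)]
    new_centers = kmeans_reduce(kmeans_map(points, centers))  # one iteration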

Page 15: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Clustering

►Spectral Clustering
 S is huge: for 10^6 points, storing S as doubles needs 8 TB
 Sparsify it!
►Retain only S_ij where j is among the t nearest neighbors of i
►Locality Sensitive Hashing? It's an approximation
►We can calculate it directly, in parallel

 $L = I - D^{-1/2} S D^{-1/2}$

Page 16: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Clustering

►Spectral Clustering: calculate the distance matrix (a sketch follows below)
 ►map
  creates <key,value> pairs so that every n/p points share the same key
  (p is the number of nodes in the computer cluster)
 ►reduce
  collects points with the same key, so that the data is split into p parts and each part is stored on one node
►For each point in the whole data set, on each node, find the t nearest neighbors
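
A toy version of that partitioning step (hypothetical names; I assume "every n/p points" means consecutive blocks, and the shuffle is simulated in plain Python):

    from collections import defaultdict

    def partition_map(points, p):
        # map: key every block of n/p consecutive points identically
        block = max(1, len(points) // p)
        for idx, point in enumerate(points):
            yield min(idx // block, p - 1), point

    def partition_reduce(pairs):
        # reduce: gather points that share a key; each group goes to one node
        parts = defaultdict(list)
        for key, point in pairs:
            parts[key].append(point)
        return parts

    parts = partition_reduce(partition_map(list(range(10)), p=3))
    # parts[0] == [0, 1, 2], parts[1] == [3, 4, 5], parts[2] == [6, 7, 8, 9]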

Page 17: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Clustering

►Spectral Clustering: symmetry (a sketch follows below)
 ►x_j being in the t-nearest-neighbor set of x_i does not imply x_i is in the t-nearest-neighbor set of x_j
 ►map
  for each nonzero element, generates two <key,value> pairs
  first: key is the row ID; value is the column ID and the distance
  second: key is the column ID; value is the row ID and the distance
 ►reduce
  uses key as the row ID and fills the columns specified by the column IDs in value
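
A small single-machine sketch of that symmetrization (hypothetical names):

    from collections import defaultdict

    def symmetry_map(entries):
        # entries: (row, col, distance) for each retained nonzero S_ij
        for i, j, d in entries:
            yield i, (j, d)  # first pair: keyed by row ID
            yield j, (i, d)  # second pair: keyed by column ID

    def symmetry_reduce(pairs):
        # reduce: rebuild each (now symmetric) sparse row from both emissions
        rows = defaultdict(dict)
        for row, (col, d) in pairs:
            rows[row][col] = d
        return rows

    S = symmetry_reduce(symmetry_map([(0, 1, 0.5), (1, 2, 0.3)]))
    # S[1] == {0: 0.5, 2: 0.3}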

Page 18: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Classification

►SVM

 Primal: $\hat{w} = \arg\min_w \|w\|^2 / 2$ subject to $y_i (w^T x_i + b) \ge 1$

 Dual: $\hat{\alpha} = \arg\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$

Page 19: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Classification

►SVM: SMO
 instead of solving for all alphas together: coordinate ascent
 ►pick one alpha, fix the others
 ►optimize alpha_i

Page 20: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Classification

►SVM: SMO
 But we cannot optimize only one alpha for SVM; the equality constraint couples them, so we need to optimize two alphas each iteration:

 $\arg\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
 subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$
 (so any update must keep $\alpha_1 y_1 + \alpha_2 y_2$ constant)

Page 21: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Classification

►SVM (a sketch follows below)
 repeat until convergence:
 ►map
  given the two chosen alphas, update the optimization information
 ►reduce
  find the two maximally violating alphas

Page 22: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Feature Extraction and Indexing

►Bag-of-Features: features → feature clusters → histogram
 feature extraction
 ►map takes images in and outputs features directly
 feature clustering
 ►clustering algorithms, like k-means

Page 23: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Feature Extraction and Indexing

►Bag-of-Features: feature quantization into a histogram (a sketch follows below)
 ►map
  for each feature of one image, find the nearest feature cluster
  generates <imageID,clusterID>
 ►reduce
  <imageID,cluster0,cluster1…>
  for each feature cluster, update the histogram
  generates <imageID,histogram>
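
A compact simulation of the quantization and histogram steps (hypothetical names; real features would be high-dimensional descriptors rather than 2-D points):

    from collections import Counter, defaultdict
    import math

    def quantize_map(image_features, clusters):
        # map: emit <imageID, clusterID> for every local feature
        for image_id, features in image_features.items():
            for f in features:
                cid = min(range(len(clusters)), key=lambda i: math.dist(f, clusters[i]))
                yield image_id, cid

    def histogram_reduce(pairs):
        # reduce: count cluster hits per image -> <imageID, histogram>
        hists = defaultdict(Counter)
        for image_id, cid in pairs:
            hists[image_id][cid] += 1
        return hists

    clusters = [(0.0, 0.0), (1.0, 1.0)]
    images = {"img0": [(0.1, 0.0), (0.9, 1.1)], "img1": [(1.0, 0.9)]}
    hists = histogram_reduce(quantize_map(images, clusters))
    # hists["img0"] == Counter({0: 1, 1: 1})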

Page 24: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Feature Extraction and Indexing

►Text Inverted Indexing (a sketch follows below)
 Inverted index of a term:
 ►a document list containing the term
 ►each item in the document list stores statistical information
  frequency, position, field information
 map
 ►for each term in one document, generates <term,docID>
 reduce
 ►<term,doc0,doc1,doc2…>
 ►for each document, updates the statistical information for that term
 ►generates <term,list>

Page 25: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Machine Learning Algorithm Transformation

►How can we know whether an algorithm can be transformed into a Map-Reduce fashion? And if so, how?
►Statistical queries and the summation form
 All we want is to estimate or infer
 ►cluster ids, labels…
 from sufficient statistics
 ►distances between points
 ►point positions
 and such statistic computation can be divided

Page 26: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Machine Learning Algorithm Transformation

►Linear Regression (a sketch follows below)

 $\hat{b} = \arg\min_b (Y - Xb)^T (Y - Xb)$
 $\hat{b} = (X^T X)^{-1} X^T Y$

 Summation form: both products decompose into per-point sums, each computed by map tasks and combined by reduce:

 $X^T X = \sum_i x_i x_i^T \qquad X^T Y = \sum_i x_i y_i$
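
A numerical sketch of the summation form (assuming NumPy; the partitions stand in for map tasks):

    import numpy as np

    def regression_map(partition):
        # map: partial sums of x_i x_i^T and x_i y_i over one data split
        X, y = partition
        return X.T @ X, X.T @ y

    def regression_reduce(partials):
        # reduce: add the partial sums, then solve the normal equations
        XtX = sum(p[0] for p in partials)
        Xty = sum(p[1] for p in partials)
        return np.linalg.solve(XtX, Xty)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)
    splits = [(X[i:i + 25], y[i:i + 25]) for i in range(0, 100, 25)]
    b = regression_reduce([regression_map(s) for s in splits])  # ~ [1, -2, 0.5]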

Page 27: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Machine Learning Algorithm Transformation

►Naïve Bayesian (a sketch follows below)

 $v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, \ldots, a_n)$
 $\qquad = \arg\max_{v_j \in V} [P(a_1, \ldots, a_n \mid v_j) P(v_j) / P(a_1, \ldots, a_n)]$
 $\qquad = \arg\max_{v_j \in V} [P(a_1, \ldots, a_n \mid v_j) P(v_j)]$
 $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$

 (map computes the counts behind $P(v_j)$ and $P(a_i \mid v_j)$; reduce gathers them)
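
The counting behind those probabilities fits the same summation form; a minimal sketch (my names, toy data):

    from collections import Counter

    def nb_map(examples):
        # map: emit one count per class and per (class, attribute, value)
        for label, attrs in examples:
            yield ("class", label), 1
            for i, a in enumerate(attrs):
                yield ("attr", label, i, a), 1

    def nb_reduce(pairs):
        # reduce: sum the counts; these are the sufficient statistics
        # behind P(v_j) and P(a_i | v_j)
        counts = Counter()
        for key, c in pairs:
            counts[key] += c
        return counts

    data = [("spam", ["buy", "now"]), ("ham", ["see", "you"])]
    stats = nb_reduce(nb_map(data))
    # stats[("class", "spam")] == 1; stats[("attr", "spam", 0, "buy")] == 1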

Page 28: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Machine Learning Algorithm Transformation

►Solution
 Find the statistics-calculation part
 Distribute the calculation over the data using map
 Gather and refine all statistics in reduce

Page 29: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Map-Reduce Systems Drawbacks

►Batch-based system
 "pull" model
 ►reduce must wait for unfinished map tasks
 ►reduce "pulls" data from map
 no direct support for iteration
►Focuses too much on distributed systems and failure tolerance
 a local computing cluster may not need them

Page 30: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Map-Reduce Systems Drawbacks

►Focuses too much on distributed systems and failure tolerance

Page 31: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Map-Reduce Variants

►Map-Reduce Online
 "push" model
 ►map "pushes" data to reduce
 ►reduce can also "push" results to the map of the next job
 building a pipeline
►Iterative Map-Reduce
 higher-level schedulers
 schedule the whole iteration process

Page 32: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Map-Reduce Variants

►Series Map-Reduce?

[Diagram: several multi-core Map-Reduce stages chained in series; the glue between stages is left open: Map-Reduce? MPI? Condor?]

Page 33: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Conclusions

►A good parallelization framework
 Schedules jobs automatically
 Failure tolerance
 Distributed computing supported
 High-level abstraction
 ►easy to port algorithms onto it
►Too "industry"
 why do we need a large distributed system?
 why do we need so much data safety?

Page 34: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

References

[1] Map-Reduce for Machine Learning on Multicore
[2] A Map Reduce Framework for Programming Graphics Processors
[3] Mapreduce Distributed Computing for Machine Learning
[4] Evaluating mapreduce for multi-core and multiprocessor systems
[5] Phoenix Rebirth: Scalable MapReduce on a Large-Scale Shared-Memory System
[6] Phoenix++: Modular MapReduce for Shared-Memory Systems
[7] Web-scale computer vision using MapReduce for multimedia data mining
[8] MapReduce indexing strategies: Studying scalability and efficiency
[9] Batch Text Similarity Search with MapReduce
[10] Twister: A Runtime for Iterative MapReduce
[11] MapReduce Online
[12] Fast Training of Support Vector Machines Using Sequential Minimal Optimization
[13] Social Content Matching in MapReduce
[14] Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce
[15] Parallel Spectral Clustering in Distributed Systems

Page 35: Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Thanks

Q & A