Page 1:

Ahsanul Haque*, Swarup Chandra*, Latifur Khan* and Michael Baron+

* Department of Computer Science, University of Texas at Dallas
+ Department of Mathematical Sciences, University of Texas at Dallas

MapReduce Guided Approximate Inference Over Graphical Models

This material is based upon work supported by the University of Texas at Dallas.

Page 2:


Agenda
- Brief overview of inference techniques
- Problem
- Proposed approaches
- Experiments
- Discussion

Page 3:


Agenda
- Brief overview of inference techniques
- Problem
- Proposed approaches
- Experiments
- Discussion

Page 4:


Graphical Models

A probabilistic graphical model G is a collection of functions over a set of random variables.

Generally represented as a network of nodes:
- Each node denotes a random variable (e.g., a data feature).
- Each edge denotes a relationship between two random variables.

Two types of representations (see the sketch below):
- A Bayesian network is represented by a directed graph.
- A Markov network is represented by an undirected graph.
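As a concrete illustration (a minimal sketch with made-up numbers, not taken from the slides), the two representations can be stored directly as data structures:

# A minimal sketch (assumed toy numbers, illustration only) of the two
# representations over binary variables A and B.

# Bayesian network: directed edges; each node carries a conditional
# probability table (CPT) giving P(node = 1 | parents).
bayes_net = {
    "A": {"parents": [],    "cpt": {(): 0.3}},               # P(A=1)
    "B": {"parents": ["A"], "cpt": {(0,): 0.9, (1,): 0.2}},  # P(B=1 | A)
}

# Markov network: undirected edges; each edge carries a non-negative
# potential table (no normalization required).
markov_net = {
    frozenset({"A", "B"}): {(0, 0): 5.0, (0, 1): 1.0,
                            (1, 0): 1.0, (1, 1): 5.0},
}

print(bayes_net["B"]["cpt"][(1,)])   # P(B=1 | A=1) -> 0.2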

Page 5:


Example Graphical Model

Inference is needed to evaluate Probability of Evidence, Prior and Posterior Marginal, Most Probable Explanation (MPE) and Maximum a Posteriori (MAP) queries.

Probability of Evidence needs to be evaluated in classification problems.

Sample factor φ(A, C):

A  C  φ(A,C)
0  0    5
0  1  100
1  0   15
1  1   20

[Figure: an example Markov network over six variables A, B, C, D, E, F, with factors φ(A,B), φ(A,C), φ(B,D), φ(C,D), φ(C,E), φ(D,F), φ(E,F).]
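For a network this small, the probability of evidence can be computed by brute-force enumeration, which makes the query concrete. The sketch below is assumed code, not the authors'; only φ(A,C) comes from the slide, and the other factor tables are uniform placeholders:

# Brute-force probability of evidence on the example Markov network.
# Assumed sketch: only phi(A, C) is from the slide; the remaining
# factor tables are placeholders set to 1 everywhere.

from itertools import product

VARS = ["A", "B", "C", "D", "E", "F"]            # all binary

def placeholder():
    return {vals: 1.0 for vals in product([0, 1], repeat=2)}

factors = {
    ("A", "C"): {(0, 0): 5, (0, 1): 100, (1, 0): 15, (1, 1): 20},
    ("C", "E"): placeholder(), ("D", "F"): placeholder(),
    ("B", "D"): placeholder(), ("C", "D"): placeholder(),
    ("A", "B"): placeholder(), ("E", "F"): placeholder(),
}

def prob_evidence(evidence):
    """P(E = e): mass of assignments consistent with e, divided by Z."""
    num = den = 0.0
    for vals in product([0, 1], repeat=len(VARS)):
        x = dict(zip(VARS, vals))
        w = 1.0
        for scope, table in factors.items():
            w *= table[tuple(x[v] for v in scope)]
        den += w                                  # contributes to Z
        if all(x[v] == val for v, val in evidence.items()):
            num += w
    return num / den

print(prob_evidence({"F": 1}))                    # e.g., evidence F = 1

Exact inference (next slide) computes the same quantity without enumerating all 2^6 assignments.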

Page 6:


Exact Inference

Exact inference algorithms, e.g., Variable Elimination (sketched below), provide accurate results for probability of evidence.

Challenges:
- Exponential time and space complexity.
- Computationally intractable on large graphs.

Approximate inference algorithms are widely used in practice to evaluate queries within a resource limit:
- Sampling based, e.g., Gibbs Sampling, Importance Sampling.
- Propagation based, e.g., Iterative Join Graph Propagation.
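For orientation, here is a hedged sketch (assumed code, not the paper's; the φ(C,E) numbers are placeholders) of the core Variable Elimination step: multiply the factors that mention a variable, then sum that variable out.

# Variable Elimination core operations on table-based factors.
# A factor is (scope, table): an ordered tuple of binary variables
# plus a dict mapping each assignment tuple to a value.

from itertools import product

def multiply(f1, f2):
    """Pointwise product of two factors over the union of their scopes."""
    s1, t1 = f1
    s2, t2 = f2
    scope = s1 + tuple(v for v in s2 if v not in s1)
    table = {}
    for vals in product([0, 1], repeat=len(scope)):
        a = dict(zip(scope, vals))
        table[vals] = (t1[tuple(a[v] for v in s1)] *
                       t2[tuple(a[v] for v in s2)])
    return scope, table

def sum_out(var, factor):
    """Eliminate `var` from a factor by summing over its two values."""
    scope, table = factor
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    new_table = {}
    for vals, val in table.items():
        key = vals[:i] + vals[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + val
    return new_scope, new_table

phi_AC = (("A", "C"), {(0, 0): 5, (0, 1): 100, (1, 0): 15, (1, 1): 20})
phi_CE = (("C", "E"), {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4})

# Eliminating C: multiply all factors mentioning C, then sum C out.
print(sum_out("C", multiply(phi_AC, phi_CE)))    # factor over (A, E)

Repeating this per variable yields exact answers, but the intermediate tables grow exponentially with the graph's treewidth, which is exactly the intractability noted above.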

Page 7:


Adaptive Importance Sampling (AIS)

Adaptive Importance Sampling (AIS) is an approximate inference algorithm where:
- Samples are generated from a known distribution Q, called the proposal distribution.
- Q is updated periodically based on the sample weights.
- Probability of evidence is evaluated from the samples drawn from the proposal distribution, by calculating the following expected value with respect to Q:

P(E = e) = E_Q[ P(R = r, E = e) / Q(R = r) ], where R = X \ E

Weighting each sample reduces the variance of the estimate caused by the occurrence of rare events.
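The estimator can be written down directly. Below is a hedged sketch of plain (non-adaptive) importance sampling with assumed toy distributions; the adaptive update of Q is shown later in the RB-AIS loop:

# Importance-sampling estimate of P(E = e).
# Assumed toy model: one binary non-evidence variable R, so that
# P(E = e) = sum_r P(r, e) can be checked by hand.

import random

def p_joint(r, e):
    """Stand-in for the model's P(R = r, E = e)."""
    return 0.9 if r == e else 0.1

def q_sample():
    """Draw r from the proposal Q (uniform here)."""
    return 1 if random.random() < 0.5 else 0

def q_prob(r):
    return 0.5

def estimate_evidence(e, n=100_000):
    total = 0.0
    for _ in range(n):
        r = q_sample()
        total += p_joint(r, e) / q_prob(r)   # importance weight of sample
    return total / n                         # Monte Carlo E_Q[weight]

print(estimate_evidence(e=1))                # -> about 0.9 + 0.1 = 1.0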

Page 8:


RB-AIS

We focus on a special type of AIS in this paper, called Rao-Blackwellized Adaptive Importance Sampling (RB-AIS).

In RB-AIS, only a set of variables Xw ⊂ X \ Xe (called w-cutset variables) is sampled.

Xw is chosen in such a way that exact inference over X \ (Xw ∪ Xe) is tractable:
- A large |Xw| results in quicker evaluation of the query, but a more erroneous result.
- A small |Xw| results in a more accurate result, but takes more time.
- Trade-off! (A cutset-selection sketch follows the reference below.)

V. Gogate and R. Dechter, "Approximate inference algorithms for hybrid Bayesian networks with discrete constraints," in UAI, AUAI Press, 2005, pp. 209–216.
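As an aside, here is one simple way such a cutset could be chosen; this greedy heuristic is an assumption for illustration, not the selection rule used in the paper:

# Greedy w-cutset selection sketch (assumed heuristic): move the
# highest-degree variable into the cutset until a crude width bound
# on the remaining graph is at most w.

def min_degree_width(adj):
    """Upper-bound the induced width via a min-degree elimination order."""
    adj = {v: set(ns) for v, ns in adj.items()}
    width = 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))
        width = max(width, len(adj[v]))
        ns = adj.pop(v)
        for a in ns:                       # connect v's neighbors
            adj[a] |= ns - {a}
            adj[a].discard(v)
    return width

def choose_cutset(adj, w):
    adj = {v: set(ns) for v, ns in adj.items()}
    cutset = []
    while min_degree_width(adj) > w:
        v = max(adj, key=lambda u: len(adj[u]))    # highest degree
        for a in adj.pop(v):
            adj[a].discard(v)
        cutset.append(v)
    return cutset

# The example network from slide 5:
adj = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D", "E"},
       "D": {"B", "C", "F"}, "E": {"C", "F"}, "F": {"D", "E"}}
print(choose_cutset(adj, w=1))             # -> ['C']: sampling C leaves a tree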

Page 9:


RB-AIS: Steps

1. Start: initialize the proposal Q on Xw.
2. Generate samples from Q.
3. Calculate sample weights: Ψ(x) = VE(G | E = e, Xw = x) / Q(Xw = x), i.e., exact inference (variable elimination) over the remaining variables, divided by the proposal probability of the sampled cutset assignment.
4. Update Q and Z.
5. Converged? If no, go back to step 2; if yes, end.
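In code, the loop might look like the following sketch (assumed structure; exact_inference is a hypothetical stand-in for the VE call over X \ (Xw ∪ Xe), and the proposal family and iteration count are simplifying assumptions):

# RB-AIS main loop sketch (assumed code, not the authors').

import random

def rb_ais(cutset_vars, exact_inference, n_samples=1000, n_iters=20):
    Q = {v: 0.5 for v in cutset_vars}   # initial proposal on Xw
    Z = 0.0
    for _ in range(n_iters):
        samples, weights = [], []
        for _ in range(n_samples):
            # Generate one cutset sample x from Q.
            x = {v: (1 if random.random() < Q[v] else 0)
                 for v in cutset_vars}
            q_x = 1.0
            for v in cutset_vars:
                q_x *= Q[v] if x[v] == 1 else 1.0 - Q[v]
            # Sample weight Psi(x) = VE(G | E=e, Xw=x) / Q(Xw=x).
            w = exact_inference(x) / q_x
            samples.append(x)
            weights.append(w)
        Z = sum(weights) / n_samples    # running estimate of P(E = e)
        # Adapt Q toward the weighted marginals of the samples.
        total = sum(weights) or 1.0
        for v in cutset_vars:
            Q[v] = sum(w for x, w in zip(samples, weights)
                       if x[v] == 1) / total
    return Z

A real implementation would test convergence of Q or Z rather than run a fixed number of iterations.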

Page 10:


Agenda
- Brief overview of inference techniques
- Problem
- Proposed approaches
- Experiments
- Discussion

Page 11:


Problem

Real-world applications require good-quality results within a time constraint.

Typically, real-world networks are large and complex (i.e., have large treewidth). For instance, a graphical model of Facebook users would have billions of nodes!

Even RB-AIS may run out of time to provide a quality estimate within the time limit. For instance, RB-AIS takes more than 6 hours to compute a single probability of evidence on a network with only 67 nodes and 271 factors.

Page 12:


Agenda
- Brief overview of inference techniques
- Problem
- Proposed approaches
- Experiments
- Discussion

Page 13:


Challenges

To design a parallel and distributed approach for RB-AIS, the following challenges need to be addressed (see the driver-loop sketch below):
- RB-AIS updates Q periodically. Since the values of Q and Z at iteration i depend on their values at iteration i-1, a proper synchronization mechanism is needed.
- The task of sample generation on Xw must be distributed over the worker nodes.
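One natural way to satisfy the synchronization requirement (an assumed sketch, not necessarily the authors' driver) is a driver loop that launches one MapReduce job unit per iteration and blocks until it completes, so iteration i+1 never reads a stale Q:

# Iteration-level synchronization sketch (assumed): the driver runs
# MapReduce job units (MJUs) strictly one after another; each MJU
# consumes Q_i and produces (Q_{i+1}, Z), the next MJU's input.

def driver(Q0, launch_mju, converged, max_iters=50):
    """launch_mju(Q) is a stand-in for submitting one MJU to the
    cluster and blocking until it finishes; it returns (Q_next, Z)."""
    Q, Z = Q0, None
    for _ in range(max_iters):
        Q_next, Z = launch_mju(Q)       # barrier: wait for this MJU
        if converged(Q, Q_next):
            break
        Q = Q_next                      # only now may the next MJU start
    return Q, Z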

Page 14:


Proposed Approaches

We design and implement two MapReduce-based approaches for distributed and parallel computation of inference queries using RB-AIS.

Distributed Sampling in Mappers (DSM):
- Parallel sampling.
- Sequential weight calculation.
- Each MapReduce Job Unit (MJU) contains only one MapReduce job.

Distributed Weight Calculation in Reducers (DWCR):
- Parallel sampling.
- Parallel weight calculation.
- Each MJU contains two MapReduce jobs.

Page 15:


Distributed Sampling in Mappers (DSM)

[Diagram: DSM data flow within the i-th MJU.]

Input to the i-th MJU: Xw and Qi, stored as per-variable records (Xj, Qi[Xj]) plus a special record (-1, Z) carrying the current normalization constant.
- Map j (one mapper per cutset variable Xj, j = 1, ..., m): draws n samples xj1, ..., xjn of Xj and emits (s, (Xj, xjs, Qi[Xj])) for each sample index s = 1, ..., n.
- Shuffle and sort: aggregates values by sample index, so all partial samples x1s, x2s, ..., xms for sample s arrive at the same key.
- Single Reducer: combines x1s, x2s, ..., xms into a full sample xs for each s ∈ {1, 2, ..., n}, calculates the weights Ψ1, Ψ2, ..., Ψn, and updates Z and Qi to Qi+1.

Output: records (Xj, Qi+1[Xj]) for j = 1, ..., m, plus (-1, Z).
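A pure-Python simulation of one DSM job unit (an assumed sketch; real Hadoop mappers and the reducer would exchange exactly these key-value pairs):

# One DSM MJU simulated in plain Python (assumed sketch): mappers
# sample each cutset variable in parallel; the single reducer combines
# partial samples and computes the weights sequentially.

import random
from collections import defaultdict

def dsm_map(var, q_var, n):
    """Mapper for variable `var`: emit (sample_index, partial sample)."""
    return [(s, (var, 1 if random.random() < q_var else 0, q_var))
            for s in range(n)]

def dsm_reduce(grouped, cutset_vars, psi):
    """Single reducer: build full samples, weigh them, update Q and Z."""
    samples, weights = [], []
    for s in sorted(grouped):
        x = {var: val for var, val, _ in grouped[s]}
        q_x = 1.0
        for _, val, q_var in grouped[s]:
            q_x *= q_var if val == 1 else 1.0 - q_var
        samples.append(x)
        weights.append(psi(x, q_x))     # sequential weight calculation
    Z = sum(weights) / len(weights)
    total = sum(weights) or 1.0
    Q_next = {v: sum(w for x, w in zip(samples, weights)
                     if x[v] == 1) / total for v in cutset_vars}
    return Q_next, Z

Q_i = {"C": 0.5, "D": 0.5}                       # proposal on Xw
grouped = defaultdict(list)
for var, q_var in Q_i.items():                   # "parallel" mappers
    for key, value in dsm_map(var, q_var, n=4):
        grouped[key].append(value)               # shuffle and sort
psi = lambda x, q_x: 1.0 / q_x                   # stand-in for VE(...)/Q
print(dsm_reduce(grouped, list(Q_i), psi))       # -> (Q_{i+1}, Z)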

Page 16:


Distributed Weight Calculation in Reducers (DWCR)

[Diagram: DWCR data flow within the i-th MJU, which contains two MapReduce jobs.]

Input to the i-th MJU: Xw and Qi.

First MapReduce job:
- Map j (one mapper per partition Xj ⊂ Xw, j = 1, ..., m): outputs partial samples xj ∈ Xj.
- Reducers 1, ..., r (in parallel): each combines partial samples (xi → x, i ∈ {1, ..., m}) into full cutset samples and calculates the weights Ψx.

Second MapReduce job:
- Maps 1, ..., j: output the weights Ψx.
- Single Reducer: updates Z, and Qi to Qi+1.
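A matching simulation of one DWCR job unit (assumed sketch); the essential difference from DSM is that the first job's reducers compute the weights Ψx in parallel across sample groups:

# One DWCR MJU simulated in plain Python (assumed sketch): two chained
# jobs, with weight calculation parallelized over job 1's reducers.

import random
from collections import defaultdict

def job1_map(part_vars, Q, n):
    """Mapper for one partition of Xw: emit (sample_index, partial sample)."""
    return [(s, {v: (1 if random.random() < Q[v] else 0) for v in part_vars})
            for s in range(n)]

def job1_reduce(partials, Q, psi):
    """One of r parallel reducers: combine partials, compute Psi(x)."""
    x = {}
    for xj in partials:
        x.update(xj)                    # combine x_i -> x, i in {1..m}
    q_x = 1.0
    for v, val in x.items():
        q_x *= Q[v] if val == 1 else 1.0 - Q[v]
    return x, psi(x, q_x)

def job2_reduce(weighted, cutset_vars):
    """Single reducer of job 2: update Z, and Q_i to Q_{i+1}."""
    weights = [w for _, w in weighted]
    Z = sum(weights) / len(weights)
    total = sum(weights) or 1.0
    Q_next = {v: sum(w for x, w in weighted if x[v] == 1) / total
              for v in cutset_vars}
    return Q_next, Z

Q_i = {"C": 0.5, "D": 0.5}
partitions = [["C"], ["D"]]                      # X1, X2 subsets of Xw
grouped = defaultdict(list)
for part in partitions:                          # parallel mappers, job 1
    for s, xj in job1_map(part, Q_i, n=4):
        grouped[s].append(xj)                    # shuffle by sample index
psi = lambda x, q_x: 1.0 / q_x                   # stand-in for VE(...)/Q
weighted = [job1_reduce(p, Q_i, psi) for p in grouped.values()]
print(job2_reduce(weighted, list(Q_i)))          # -> (Q_{i+1}, Z)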

Page 17:


Agenda
- Brief overview of inference techniques
- Problem
- Proposed approaches
- Experiments
- Discussion

Page 18:


Setup

Performance metrics:

Speedup = Tsq / Td
- Tsq = execution time of the sequential approach.
- Td = execution time of the distributed approach.

Scaleup = Ts / Tp
- Ts = execution time using a single machine.
- Tp = execution time using multiple machines.

Hadoop version 1.2.1; 8 data nodes, 1 name node. Each machine has a 2.2 GHz processor and 4 GB of RAM.

Network        Number of Nodes   Number of Factors
54.wcsp [1]          67                 271
29.wcsp [1]          82                 462
404.wcsp [1]        100                 710

[1] "The probabilistic inference challenge (PIC2011)," http://www.cs.huji.ac.il/project/PASCAL/showNet.php, 2011, last updated 10/23/2014.

Page 19:


Speedup

[Figure: speedup of the distributed approaches over the sequential execution.]

Page 20:


Scaleup

[Figure: scaleup of the distributed approaches as machines are added.]

Page 21:


Discussion

- Both approaches achieve substantial speedup and scaleup compared with the sequential execution.
- DWCR has better speedup and scalability than DSM:
  - Weight calculation is computationally more expensive than sample generation.
  - DWCR parallelizes both weight calculation and sampling, so it outperforms DSM.
- Both approaches asymptotically show accuracy similar to that of the sequential execution.

Page 22:


Questions?