
A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables

Jing Xiang & Seyoung Kim

Bayesian Network Structure Learning

[Figure: observed data matrix with rows Sample 1, …, Sample n and columns X1, …, X5 — we observe n samples of the variables.]

• A Bayesian network for continuous variables is defined over a DAG G with node set V = {X1, …, X|V|}. The probability model factorizes as sketched below.
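The factorization itself did not survive extraction; the following is a hedged reconstruction in the poster's notation, assuming the standard linear Gaussian parameterization for continuous variables (β_jk are regression weights, Pa(X_j) the parent set of X_j):

```latex
p(X_1, \dots, X_{|V|}) \;=\; \prod_{j=1}^{|V|} p\big(X_j \mid \mathrm{Pa}(X_j)\big),
\qquad
X_j \;=\; \sum_{X_k \in \mathrm{Pa}(X_j)} \beta_{jk}\, X_k \;+\; \epsilon_j,
\quad \epsilon_j \sim \mathcal{N}(0, \sigma_j^2).
```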

Recovery of V-structures

Recovery of Skeleton

Prediction Error for Benchmark Networks

Prediction Error for S&P Stock Price Data

Dynamic Programming (DP) with Lasso

• Learning a Bayes net under the DAG constraint = learning an optimal ordering of the variables.

• Given an ordering, the candidate parents Pa(Xj) of each Xj are the variables that precede it in the ordering.

DP must visit 2^|V| states!

Example of A* Search with an Admissible and Consistent Heuristic

• DP is not practical for >20 nodes.

• Need to prune the search space: use A* search!

[Figure: search-space lattice over subsets of {X1, X2, X3}]

States: S0 = {}, S1 = {X1}, S2 = {X2}, S3 = {X3}, S4 = {X1,X2}, S5 = {X1,X3}, S6 = {X2,X3}, S7 = {X1,X2,X3} (goal).

Edge costs (the LassoScore of adding one variable given the current set) are shown on the figure edges; their values can be read from the queue trace below.

Heuristic values: h(S1) = 4, h(S2) = 5, h(S3) = 10, h(S4) = 9, h(S5) = 5, h(S6) = 6

Expand S0. Queue: {S0,S1}: f = 1+4 = 5; {S0,S2}: f = 2+5 = 7; {S0,S3}: f = 3+10 = 13

Expand S1. Queue: {S0,S2}: f = 2+5 = 7; {S0,S1,S5}: f = (1+4)+5 = 10; {S0,S3}: f = 3+10 = 13; {S0,S1,S4}: f = (1+5)+9 = 15

Expand S2. Queue: {S0,S1,S5}: f = (1+4)+5 = 10; {S0,S3}: f = 3+10 = 13; {S0,S2,S6}: f = (2+5)+6 = 13; {S0,S1,S4}: f = (1+5)+9 = 15; {S0,S2,S4}: f = (2+6)+9 = 17

Expand S5. Queue: {S0,S1,S5,S7}: f = (1+4)+7 = 12; {S0,S3}: f = 3+10 = 13; {S0,S2,S6}: f = (2+5)+6 = 13; {S0,S1,S4}: f = (1+5)+9 = 15

[Figure: the lattice S0–S7 is redrawn after each expansion, highlighting the current frontier]

Goal Reached!
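To make the traversal concrete, here is a minimal Python sketch of A* over the subset lattice above. The edge costs and heuristic values are the toy numbers from the figure (where legible), not real LassoScores; in A* lasso each edge cost would come from a lasso regression of the added variable on the variables already in the state.

```python
import heapq
import itertools

# Subset lattice from the example above.
V = ("X1", "X2", "X3")
GOAL = frozenset(V)

# Cost of appending `var` to the ordering when the variables in `state` already
# precede it.  In A* lasso this would be LassoScore(var | state); here the
# numbers are read off the toy figure.
edge_cost = {
    (frozenset(), "X1"): 1, (frozenset(), "X2"): 2, (frozenset(), "X3"): 3,
    (frozenset({"X1"}), "X2"): 5, (frozenset({"X1"}), "X3"): 4,
    (frozenset({"X2"}), "X1"): 6, (frozenset({"X2"}), "X3"): 5,
    (frozenset({"X3"}), "X1"): 8, (frozenset({"X3"}), "X2"): 9,
    (frozenset({"X1", "X2"}), "X3"): 8,
    (frozenset({"X1", "X3"}), "X2"): 7,
    (frozenset({"X2", "X3"}), "X1"): 11,
}

# Heuristic values from the figure; h is 0 at the goal.  h(S0) is not shown on
# the poster, but its value never matters because S0 is expanded first anyway.
h = {
    frozenset(): 0,
    frozenset({"X1"}): 4, frozenset({"X2"}): 5, frozenset({"X3"}): 10,
    frozenset({"X1", "X2"}): 9, frozenset({"X1", "X3"}): 5,
    frozenset({"X2", "X3"}): 6,
    GOAL: 0,
}

def a_star_ordering():
    """Return (cost, ordering) of the cheapest path from {} to the full set."""
    start = frozenset()
    counter = itertools.count()  # tie-breaker so the heap never compares states
    # Priority queue of (f = g + h, g, tie, state, ordering-so-far).
    queue = [(h[start], 0, next(counter), start, [])]
    closed = set()  # with a consistent heuristic, the first pop of a state is optimal
    while queue:
        f, g, _, state, order = heapq.heappop(queue)
        if state == GOAL:
            return g, order
        if state in closed:
            continue
        closed.add(state)
        for var in V:
            if var in state:
                continue
            succ = state | {var}
            g_succ = g + edge_cost[(state, var)]
            heapq.heappush(queue, (g_succ + h[succ], g_succ, next(counter),
                                   succ, order + [var]))
    return None

print(a_star_ordering())  # (12, ['X1', 'X3', 'X2']) with the toy costs above
```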

A* Lasso for Pruning the Search Space

• Construct ordering by decomposing the problem with DP.

Comparing Computation Time of Different Methods

Consistency!

Improving Scalability

• We do NOT naively limit the queue: that would reduce the quality of solutions dramatically!

• The best intermediate results occupy the shallow part of the search space, so we distribute the results to be discarded across different depths.

• To discard k results, given that the search space has depth |V|, we discard k/|V| of the worst intermediate results at each depth (a minimal sketch follows below).
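A minimal sketch of this depth-distributed pruning, assuming the queue holds (f, depth, payload) tuples with depth = number of variables already placed in the ordering; the function and data layout are illustrative, not the authors' implementation.

```python
import heapq
from collections import defaultdict

def prune_queue(queue, k, num_vars):
    """Discard about k of the worst-scoring entries, spread evenly over depths.

    Naively dropping the k entries with the worst f would mostly empty the
    shallow depths, where the best intermediate results live.  Instead we drop
    roughly k / num_vars of the worst entries at each depth.
    """
    per_depth = max(1, k // num_vars)
    by_depth = defaultdict(list)
    for entry in queue:                     # entry = (f, depth, payload)
        by_depth[entry[1]].append(entry)

    kept = []
    for entries in by_depth.values():
        entries.sort(key=lambda e: e[0])                 # best (lowest f) first
        keep = max(1, len(entries) - per_depth)          # keep at least one per depth
        kept.extend(entries[:keep])

    heapq.heapify(kept)                     # restore heap order for further expansion
    return kept
```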

• Daily stock price data for 125 S&P companies over 1,500 time points (1/3/2007 to 12/17/2012).

• Estimated the Bayes net using the first 1,000 time points, then computed prediction errors on the remaining 500 time points.

1. Huang et al. A sparse structure learning algorithm for Gaussian Bayesian network identification from high-dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6), 2013.

2. Schmidt et al. Learning graphical model structure using L1-regularization paths. In Proceedings of AAAI, volume 22, 2007.

3. Singh and Moore. Finding optimal Bayesian networks by dynamic programming. Technical Report 05-106, School of Computer Science, Carnegie Mellon University, 2005.

4. Yuan et al. Learning optimal Bayesian networks using A* search. In Proceedings of AAAI, 2011.

References

Conclusions

[Figure: two approaches to structure learning, illustrated on a 5-node network X1–X5]

• Two-stage: Stage 1: parent selection; Stage 2: search for a DAG (e.g. L1MB, DP + A* for discrete variables [2,3,4]).

• Single stage: combined parent selection + DAG search (e.g. SBN [1]).

Method            | 1-Stage | Optimal | Allows Sparse Parent Set | Computational Time
DP [3]            | No      | Yes     | No                       | Exp.
A* [4]            | No      | Yes     | No                       | ≤ Exp.
L1MB [2]          | No      | No      | Yes                      | Fast
SBN [1]           | Yes     | No      | Yes                      | Fast
DP Lasso          | Yes     | Yes     | Yes                      | Exp.
A* Lasso          | Yes     | Yes     | Yes                      | ≤ Exp.
A* Lasso + Qlimit | Yes     | No      | Yes                      | Fast

Linear Regression Model:

Bayesian Network Model

Optimization Problem for Learning
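The equations shown under these three headings on the poster did not survive extraction. A hedged reconstruction consistent with the surrounding text (x_j is the n-vector of observations of X_j, X_Pa(Xj) the data columns of its parents, β_j the regression weights, λ the lasso penalty):

```latex
% Linear regression model (one regression per node):
\mathbf{x}_j \;=\; \mathbf{X}_{\mathrm{Pa}(X_j)}\,\boldsymbol{\beta}_j + \boldsymbol{\epsilon}_j,
\qquad \boldsymbol{\epsilon}_j \sim \mathcal{N}(\mathbf{0},\, \sigma_j^2 \mathbf{I})

% Per-node lasso score:
\mathrm{LassoScore}\big(X_j \mid \mathrm{Pa}(X_j)\big) \;=\;
\tfrac{1}{2}\,\big\|\mathbf{x}_j - \mathbf{X}_{\mathrm{Pa}(X_j)}\,\boldsymbol{\beta}_j\big\|_2^2
\;+\; \lambda\,\big\|\boldsymbol{\beta}_j\big\|_1

% Optimization problem for learning: minimize the total score over DAGs
\min_{G,\;\{\boldsymbol{\beta}_j\}} \;\sum_{j=1}^{|V|}
\mathrm{LassoScore}\big(X_j \mid \mathrm{Pa}_G(X_j)\big)
\qquad \text{s.t. } G \text{ is a DAG.}
```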

We address the problem of learning a sparse Bayes net structure for continuous variables in high-dimensional settings.

1. We present the single-stage methods A* lasso and Dynamic Programming (DP) lasso.

2. Both A* lasso and DP lasso guarantee optimality of the learned structure for continuous variables.

3. A* lasso gives a huge speed-up over DP lasso: it improves on the exponential time required by DP lasso and by previous optimal methods for discrete variables.

Contributions

• Finding the optimal ordering = finding the shortest path from the start state to the goal state.

• DP must consider ALL possible paths in the search space.

• The recursion splits the total score into the optimal score for the first remaining node Xj and the optimal score for the nodes excluding Xj (a sketch follows below).
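A sketch of the recursion these two labels refer to, written in the search-space notation used above (S is the set of variables already placed in the ordering); this is a reconstruction consistent with the search formulation, not copied from the poster:

```latex
\mathrm{OptScore}(S) \;=\;
\min_{X_j \in V \setminus S}
\Big[\,
\underbrace{\mathrm{LassoScore}(X_j \mid S)}_{\text{score for the first remaining node } X_j}
\;+\;
\underbrace{\mathrm{OptScore}\big(S \cup \{X_j\}\big)}_{\text{optimal score for the nodes excluding } X_j}
\,\Big],
\qquad \mathrm{OptScore}(V) = 0.
```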

g(Sk) = cost incurred so far:

• The LassoScore accumulated from the start state to Sk.

• Using g(Sk) only = greedy search: fast but suboptimal.

Heuristic + Admissible + Consistent = Efficient + Optimal!

• Admissible: h(Sk) is always an underestimate of the true cost to the goal, so A* is guaranteed to find the optimal solution.

• Consistent: h(Sk) always satisfies h(Sk) ≤ cost(Sk, Sl) + h(Sl) for every successor Sl, so the first path found to a state is guaranteed to be the shortest and other paths to that state can be pruned.

• Proposed A* lasso for Bayes net structure learning with continuous variables; it guarantees optimality and reduces computation time compared to the previous optimal algorithm, DP.

• Also presented a heuristic scheme that further improves speed without significantly sacrificing solution quality.

h(Sk) = estimate of future cost:

• Heuristic estimate of the cost to reach the goal from Sk.

• Estimate of the future LassoScore from Sk to the goal state (ignores the DAG constraint).
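Putting these pieces together, a hedged sketch of the score used to order the priority queue; the form of h below is one reading of "ignores the DAG constraint" (every remaining variable may use all other variables as candidate parents), which makes it an underestimate and hence admissible:

```latex
f(S_k) \;=\; g(S_k) + h(S_k),
\qquad
g(S_k) \;=\; \sum_{(S \to S \cup \{X_j\}) \,\in\, \mathrm{path}(S_0,\, S_k)} \mathrm{LassoScore}(X_j \mid S),
\qquad
h(S_k) \;=\; \sum_{X_j \in V \setminus S_k} \mathrm{LassoScore}\big(X_j \mid V \setminus \{X_j\}\big).
```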