Fast Probabilistic Modeling for Combinatorial Optimization
Scott Davies
School of Computer Science
Carnegie Mellon University (and Justsystem Pittsburgh Research Center)
Joint work with Shumeet Baluja
Combinatorial Optimization
Maximize an “evaluation function” f(x)
Input: a fixed-length bitstring x = (x1, x2, …, xn)
Output: a real value
x might represent: job-shop schedules, TSP tours, discretized numeric values, etc.
Our focus: “black box” optimization, with no domain-dependent heuristics
Commonly Used Approaches
Hill-climbing, simulated annealing: generate candidate solutions “neighboring” a single current working solution (e.g. differing by one bit); typically make no attempt to model how particular bits affect solution quality.
Genetic algorithms: attempt to implicitly capture the dependency of solution quality on bit values by maintaining a population of candidate solutions; use crossover and mutation operators on population members to generate new candidate solutions.
Using Explicit Probabilistic Models
Maintain an explicit probability distribution P from which we generate new candidate solutions:
Initialize P to the uniform distribution.
Until termination criteria are met:
  Stochastically generate K candidate solutions from P. Evaluate them.
  Update P to make it more likely to generate solutions “similar” to the “good” solutions.
Return the best bitstring ever evaluated.
There are several different choices for what sort of P to use and how to update it after candidate solutions are evaluated; a generic sketch of the loop follows.
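A minimal sketch of this generic loop in Python, with hypothetical `sample` and `update` hooks standing in for whichever model P is chosen; K, M, and the iteration count are illustrative, not values from the talk:

```python
def model_based_optimize(f, sample, update, K=100, M=10, iters=50):
    """Generic model-based optimization loop (maximization).

    f:      evaluation function mapping a bitstring (list of 0/1) to a float
    sample: draws one bitstring from the current distribution P
    update: shifts P toward one "good" bitstring
    """
    best, best_val = None, float("-inf")
    for _ in range(iters):
        pop = [sample() for _ in range(K)]   # K candidates from P
        pop.sort(key=f, reverse=True)        # evaluate and rank them
        if f(pop[0]) > best_val:
            best, best_val = pop[0], f(pop[0])
        for x in pop[:M]:                    # reinforce the best M
            update(x)
    return best
```

PBIL, described next, is the simplest instantiation: `sample` draws each bit independently and `update` nudges the per-bit probabilities.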
Population-Based Incremental Learning
Population-Based Incremental Learning (PBIL) [Baluja, 1995] maintains a vector of probabilities: one independent probability P(xi = 1) for each bit xi, each initialized to 0.5.
Until termination criteria are met:
  Generate a population of K bitstrings from P. Evaluate them.
  For each of the best M bitstrings of these K, adjust P to make that bitstring more likely:
    P'(xi = 1) = P(xi = 1) + α(1 − P(xi = 1))   if xi is set to 1, or
    P'(xi = 1) = (1 − α) P(xi = 1)              if xi is set to 0
Also sometimes uses “mutation” and/or pushes P away from bad solutions.
Often works very well (compared to hillclimbing and GAs) despite its independence assumption.
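As a sketch, the sampling and update steps in Python; α = 0.1 is an illustrative learning rate, not a value from the talk:

```python
import random

def pbil_sample(P):
    """Draw one bitstring from the independent per-bit probabilities P."""
    return [1 if random.random() < p else 0 for p in P]

def pbil_update(P, x, alpha=0.1):
    """Apply the update rule above: move each P(xi=1) toward bit xi of a
    good solution x."""
    return [p + alpha * (1 - p) if xi == 1 else (1 - alpha) * p
            for p, xi in zip(P, x)]
```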
Modeling Inter-Bit Dependencies
MIMIC [De Bonet et al., 1997]: sort the bits into a Markov chain in which each bit’s distribution is conditioned on the value of the previous bit in the chain, e.g.:
  P(x1, x2, …, xn) = P(x3) · P(x6|x3) · P(x2|x6) · …
Generate the chain from the best N% of bitstrings evaluated so far; generate new bitstrings from the chain; repeat.
More general framework: Bayesian networks
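A sketch of sampling from such a chain; the representation (`order`, `p_first`, `p_cond`) is an assumed data layout, not from the paper:

```python
import random

def sample_from_chain(order, p_first, p_cond):
    """Draw one bitstring from a MIMIC-style chain.

    order:   permutation of bit indices, e.g. [3, 6, 2, ...]
    p_first: P(first bit in the chain = 1)
    p_cond:  p_cond[k][v] = P(k-th chain bit = 1 | previous chain bit = v)
    """
    x = [0] * len(order)
    prev = 1 if random.random() < p_first else 0
    x[order[0]] = prev
    for k in range(1, len(order)):
        prev = 1 if random.random() < p_cond[k][prev] else 0
        x[order[k]] = prev
    return x
```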
Bayesian Networks
P(x1, …, xn) = P(x3) · P(x12) · P(x2|x3,x12) · P(x6|x3) · …
[Figure: example network over nodes x3, x12, x6, x2, x9, x17, with arcs matching this factorization]
Specified by:
  Network structure (must be a DAG)
  A probability distribution for each variable (bit) given each possible combination of its parents’ values
Optimization with Bayesian Networks
Two operations we need to perform during optimization:
  Given a network, generate bitstrings from the probability distribution represented by that network. Trivial with Bayesian networks.
  Given a set of “good” bitstrings, find the network most likely to have generated it. Given a fixed network structure, it is trivial to fill in the probability tables optimally; but what structure should we use?
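Generation is trivial because the DAG lets us sample each bit after its parents (ancestral sampling). A sketch, with an assumed `parents`/`cpt` representation:

```python
import random

def ancestral_sample(parents, cpt, n):
    """Draw one bitstring from a Bayesian network over n bits.

    parents: parents[i] = tuple of parent indices of bit i
    cpt:     cpt[i][parent_values] = P(bit i = 1 | parents = parent_values)
    """
    x = [None] * n

    def sample_bit(i):
        if x[i] is None:                       # sample parents first
            vals = tuple(sample_bit(j) for j in parents[i])
            x[i] = 1 if random.random() < cpt[i][vals] else 0
        return x[i]

    for i in range(n):
        sample_bit(i)
    return x
```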
Learning Network Structures from Data
Problem statement: given a dataset D and a set of allowable Bayesian networks, find the network B with the maximum posterior probability:
  B* = argmax_B P(B|D) = argmax_B P(B) P(D|B)
     = argmax_B [log P(B) + log P(D|B)]
Learning Network Structures from Data
Maximizing log P(D|B) reduces to maximizing
  −∑_{i=1}^{n} H'(xi | Πi)
where Πi is the set of xi’s parents in B, and H'(· | ·) denotes conditional entropy under the empirical distribution exhibited in D.
Single-Parent Tree-Shaped Networks
Let’s allow each bit to be conditioned on at most one other bit.
[Figure: example tree-shaped network over x17, x5, x22, x2, x6, x19, x13, x14, x8]
Adding an arc from xj to xi increases the network score by
  H(xi) − H(xi|xj)   (xj’s “information gain” with xi)
  = H(xj) − H(xj|xi)   (not necessarily obvious, but true)
  = I(xi; xj), the mutual information between xi and xj
Optimal Single-Parent Tree-Shaped Networks
To find the optimal single-parent tree-shaped network, just find the maximum spanning tree using I(xi; xj) as the weight for the edge between xi and xj [Chow and Liu, 1968].
Can be done in O(n²) time (assuming D has already been reduced to sufficient statistics).
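A sketch of the construction, using Prim's algorithm over mutual-information edge weights. For clarity it recomputes I(xi; xj) from raw data each time; precomputing the pairwise counts (the sufficient statistics) first is what gives the O(n²) bound:

```python
import math

def mutual_information(D, i, j):
    """Empirical mutual information I(xi; xj) over a dataset D of bitstrings."""
    N = len(D)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for x in D if x[i] == a and x[j] == b) / N
            p_a = sum(1 for x in D if x[i] == a) / N
            p_b = sum(1 for x in D if x[j] == b) / N
            if p_ab > 0:
                mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(D, n):
    """Maximum spanning tree over I(xi; xj) edge weights via Prim's algorithm.
    Returns parent[i] = xi's single parent (None for the arbitrary root, bit 0)."""
    in_tree, parent = {0}, [None] * n
    while len(in_tree) < n:
        i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: mutual_information(D, e[0], e[1]))
        parent[j] = i
        in_tree.add(j)
    return parent
```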
Optimal Dependency Trees for Combinatorial Optimization
[Baluja & Davies, 1997]
Start with a dataset D initialized from the uniform distribution.
Until termination criteria are met:
  Build the optimal dependency tree T with which to model D.
  Generate K bitstrings from the probability distribution represented by T. Evaluate them.
  Add the best M bitstrings to D after decaying the weight of all datapoints already in D by a factor between 0 and 1.
Return the best bitstring ever evaluated.
Running time: O(K·n + M·n²) per iteration.
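Putting the pieces together, a sketch of this loop; `build_tree` and `sample_tree` stand in for the Chow-Liu construction and tree sampling sketched earlier, and K, M, gamma, and iters are illustrative parameters:

```python
import random

def tree_based_optimize(f, n, build_tree, sample_tree,
                        K=100, M=10, gamma=0.9, iters=50):
    """Dependency-tree optimization loop (maximization).

    build_tree(D, weights) fits a tree to the weighted dataset;
    sample_tree(T) draws one bitstring; gamma is the decay factor
    between 0 and 1 mentioned above.
    """
    D = [[random.randint(0, 1) for _ in range(n)] for _ in range(K)]
    weights = [1.0] * len(D)
    best = max(D, key=f)
    for _ in range(iters):
        T = build_tree(D, weights)
        samples = sorted((sample_tree(T) for _ in range(K)), key=f, reverse=True)
        best = max(best, samples[0], key=f)
        weights = [gamma * w for w in weights]   # decay existing datapoints
        D += samples[:M]                         # add the best M new ones
        weights += [1.0] * M
    return best
```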
Graph-Coloring Example
Noisy graph-coloring example: for each edge connecting vertices of different colors, add +1 to the evaluation function with probability 0.5.
[Figure: pogp.fig (graph-coloring example)]
Optimal Dependency Tree Results
Tested the following optimization algorithms on a variety of small problems, up to 256 bits long:
  Optimization with optimal dependency trees
  Optimization with chain-shaped networks (à la MIMIC)
  PBIL
  Hillclimbing
  Genetic algorithms
General trend: hillclimbing and GAs did relatively poorly. Among the probabilistic methods, PBIL < chains < trees: more accurate models are better.
Modeling Higher-Order Dependencies
The maximum spanning tree algorithm gives us the optimal Bayesian network in which each node has at most one parent.
What about finding the best network in which each node has at most K parents, for K > 1? NP-complete problem! [Chickering et al., 1995]
However, we can use search heuristics such as hillclimbing to look for “good” network structures (e.g., [Heckerman et al., 1995]).
Bayesian Network-Based Combinatorial Optimization
Initialize D with C bitstrings from the uniform distribution, and Bayesian network B to the empty network containing no edges.
Until termination criteria are met:
  Perform steepest-ascent hillclimbing from B to find a locally optimal network B'. Repeat until no change increases the score:
    Evaluate how each possible edge addition, removal, or reversal would affect the penalized log-likelihood score.
    Perform the change that maximizes the increase in score.
  Set B to B'.
  Generate and evaluate K bitstrings from B.
  Decay the weight of datapoints in D by a factor between 0 and 1.
  Add the best M of the K recently generated datapoints to D.
Return the best bitstring ever evaluated.
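The talk does not spell out the penalized log-likelihood score, so as an assumption the sketch below uses BIC, one standard choice; the hillclimber would score each candidate edge change with it and greedily apply the best:

```python
import math

def bic_score(D, parents):
    """BIC-style penalized log-likelihood of a structure on dataset D.

    parents: parents[i] = tuple of parent indices of bit i.
    """
    N, score, n_params = len(D), 0.0, 0
    for i in range(len(parents)):
        counts = {}                              # per parent configuration
        for x in D:
            key = tuple(x[j] for j in parents[i])
            counts.setdefault(key, [0, 0])[x[i]] += 1
        for c0, c1 in counts.values():           # log-likelihood contribution
            for c in (c0, c1):
                if c:
                    score += c * math.log(c / (c0 + c1))
        n_params += len(counts)                  # one free parameter each
    return score - 0.5 * n_params * math.log(N)
```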
Bayesian Networks Results
Does better than the tree-based optimization algorithm on some toy problems, and roughly the same as the tree-based algorithm on others.
Requires significantly more computation than the tree-based algorithm, however, despite efficiency hacks.
Why not much better results?
  Too much emphasis on exploitation rather than exploration?
  Steepest-ascent hillclimbing over network structures not good enough, particularly when starting from old networks?
Using Probabilistic Models for Intelligent Restarts
The tree-based algorithm’s O(n²) execution time per iteration is very expensive for large problems, and even more so for the algorithm based on more complicated Bayesian networks.
One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing or PBIL.
COMIT
Combining Optimizers with Mutual Information Trees [Baluja and Davies, 1998]
Initialize dataset D with bitstrings drawn from the uniform distribution.
Until termination criteria are met:
  Build the optimal dependency tree T with which to model D.
  Use T to stochastically generate K bitstrings. Evaluate them.
  Execute a fast-search procedure, initialized with the best solutions generated from T.
  Replace up to M of the worst bitstrings in D with the best bitstrings found during the fast-search run just executed.
Return the best bitstring ever evaluated.
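A sketch of the loop (maximization), with hypothetical `build_tree`/`sample_tree` helpers analogous to the earlier sketches; `fast_search(x0)` stands for any fast baseline (hillclimbing, PBIL) that returns the bitstrings it found:

```python
import random

def comit(f, n, build_tree, sample_tree, fast_search,
          K=1000, M=100, iters=100):
    """COMIT loop: tree-model restarts wrapped around a fast searcher."""
    D = [[random.randint(0, 1) for _ in range(n)] for _ in range(K)]
    best = max(D, key=f)
    for _ in range(iters):
        T = build_tree(D)
        samples = sorted((sample_tree(T) for _ in range(K)), key=f, reverse=True)
        found = fast_search(samples[0])          # fast search from best sample
        best = max([best] + found, key=f)
        D.sort(key=f)                            # worst bitstrings first...
        D[:M] = sorted(found, key=f, reverse=True)[:M]   # ...swapped for best found
    return best
```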
COMIT with Hill-climbing
Use the single best bitstring generated by the tree as the starting point of a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before giving up.
|D| kept at 1000; M set to 100.
We compare COMIT against:
  Hillclimbing with restarts from bitstrings chosen randomly from the uniform distribution
  Hillclimbing with restarts from the best bitstring out of K chosen randomly from the uniform distribution
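A sketch of that hillclimber under one reading of PATIENCE (consecutive sideways moves); it returns the improving points it visited so COMIT can fold them back into D:

```python
import random

def stochastic_hillclimb(f, x0, patience=1000):
    """One-bit-flip stochastic hillclimber (maximization).  Accepts improving
    moves, allows at most `patience` consecutive moves to equal-valued points
    before giving up, and undoes worsening flips."""
    x, val, sideways = list(x0), f(x0), 0
    visited = [list(x0)]
    while sideways < patience:
        i = random.randrange(len(x))
        x[i] ^= 1                                # flip one random bit
        new_val = f(x)
        if new_val > val:
            val, sideways = new_val, 0
            visited.append(list(x))
        elif new_val == val:
            sideways += 1                        # sideways move, kept
        else:
            x[i] ^= 1                            # undo the worsening flip
    return visited
```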
COMIT w/Hillclimbing: Example of Behavior
TSP domain: 100 cities, 700 bits.
[Figure: tour length (×10³) vs. number of evaluations (×10³) for the hillclimber and for COMIT]
COMIT w/Hillclimbing: Results
Each number is an average over at least 25 runs.
Each algorithm was given 200,000 evaluation-function calls.
Highlighted: better than each non-COMIT hillclimber with confidence > 95%.
AHCxxx: pick the best of xxx randomly generated starting points before hillclimbing.
COMITxxx: pick the best of xxx starting points generated by the tree before hillclimbing.
Problem            Min/Max   # bits   HC      AHC100   AHC1000   COMIT100   COMIT1000
Knapsack 1         Max       512      3238    3377     3335      6684       6259
Knapsack 2         Max       900      3403    3418     3488      7733       7182
Knapsack 3         Max       1200     5226    5270     5280      13052      12829
Jobshop 1          Min       500      998     988      982       978        970
Jobshop 2          Min       700      965     961      957       954        953
Jobshop 3          Min       700      1207    1200     1199      1196       1195
Bin-packing 1      Min       504      170     158      162       156        145
Bin-packing 2      Min       800      154     150      138       111        124
TSP 1 (100 city)   Min       700      1629    1599     1573      1335       1336
TSP 2 (200 city)   Min       1600     15119   15286    15100     15189      15117
TSP 3 (150 city)   Min       1200     11451   11247    11290     9812       9077
Sum. Canc.         Min       675      64      61       59        54         52
COMIT with PBIL
Generate K samples from T; evaluate them.
Initialize PBIL’s P vector according to the unconditional distributions contained in the best C% of these examples.
Each PBIL run is terminated after 5000 evaluations without improvement.
|D| kept at 1000; M set to 100 (as before).
PBIL parameters: α = 0.15; update P with the single best example out of 50 each iteration; some “mutation” of P used as well.
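A sketch of the initialization step; the helper name and C = 10 are illustrative:

```python
def init_pbil_vector(samples, f, C=10):
    """Initialize PBIL's P vector from the unconditional bit frequencies of
    the best C percent of the tree-generated samples."""
    k = max(1, len(samples) * C // 100)
    top = sorted(samples, key=f, reverse=True)[:k]
    n_bits = len(top[0])
    return [sum(x[i] for x in top) / len(top) for i in range(n_bits)]
```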
COMIT w/PBIL: Results
Problem            Min/Max   # bits   PBIL    PBIL+R   COMIT+PBIL
Knapsack 1         Max       512      7823    7765     7872
Knapsack 2         Max       900      9601    9402     10134
Knapsack 3         Max       1200     14668   13768    18014
Jobshop 1          Min       500      982     963      961
Jobshop 2          Min       700      959     943      940
Jobshop 3          Min       700      1188    1176     1170
Bin-packing 1      Min       504      140     280      280
Bin-packing 2      Min       800      830     960      1000
TSP 1 (100 city)   Min       700      1331    1518     1021
TSP 2 (200 city)   Min       1600     17148   21858    14870
TSP 3 (150 city)   Min       1200     11035   13698    9078
Sum. Canc.         Min       675      25      33       33
Each number is an average over at least 50 runs.
Each algorithm was given 600,000 evaluation-function calls.
Highlighted: better than the opposing algorithm(s) with confidence > 95%.
PBIL+R: PBIL with restart after 5000 evaluations without improvement (P reinitialized to 0.5).
Conclusions & Future Work
COMIT makes probabilistic modeling applicable to much larger problems.
COMIT led to significant improvements over the baseline search algorithm on most problems tested.
Future work:
  Applying COMIT to more interesting problems
  Using COMIT to combine the results of multiple search algorithms
  Incorporating domain knowledge
Future Work
Making the algorithm based on complex Bayesian networks more practical: combine with simpler search algorithms, à la COMIT?
Applying COMIT to more interesting problems and other baseline search algorithms: WALKSAT?
Using more sophisticated probabilistic models.
Using COMIT to combine the results of multiple search algorithms.