Optimization Issues for Huge Datasets and Long Computation


Page 1: Optimization Issues for Huge Datasets and Long Computation

Optimization Issues for Huge Datasets and Long Computation

Michael Ferris
University of Wisconsin, Computer Sciences
[email protected]

Qun Chen, Jin-Ho Lim, Jeff Linderoth, Miron Livny, Todd Munson, Mary Vernon, Meta Voelker

Page 2: Optimization Issues for Huge Datasets and Long Computation

Update on Gamma Knife

• In use at U. Maryland Hospitals
• Covered by Business Week (Apr 2001)
• Better models, faster solution
• Requires less user input
• Skeletonization is the key improvement

Page 3: Optimization Issues for Huge Datasets and Long Computation

[Figure: Skeleton starting points]
a. Target area
b. A single line skeleton of an image
c. 8 initial shots are identified (1-4mm, 2-8mm, 5-14mm)
d. An optimal solution: 8 shots

Page 4: Optimization Issues for Huge Datasets and Long Computation

Run Time Comparison

Average run time by size of tumor:

                     Small              Medium               Large
Random               2 min 33 sec       17 min 20 sec        373 min 2 sec
  (Std. Dev)         (40 sec)           (3 min 48 sec)       (90 min 8 sec)
SLSD                 1 min 2 sec        15 min 57 sec        23 min 54 sec
  (Std. Dev)         (17 sec)           (3 min 12 sec)       (4 min 54 sec)

Page 5: Optimization Issues for Huge Datasets and Long Computation

Data Mining & Optimization

[Diagram: the data mining and optimization stack]
• Data mining application (Databases): Prediction, Categorization, Separation
• Models (Statistical/AI): Equations, LP, QP, MIP, NLP
• Algorithms (Optimization): GAMS, Matlab, so/dll
• Computation (Processor/Memory): Serial, Parallel, Condor

Page 6: Optimization Issues for Huge Datasets and Long Computation

Optimization

• Global
• Exact
• Constrained
• Stochastic
• Large scale
• Fast convergence
• CPU + Memory + Smarts

versus

• Local
• Approximate
• Unconstrained
• Deterministic
• Small scale
• Termination

Page 7: Optimization Issues for Huge Datasets and Long Computation

MIP formulation

minimize    c^T x
subject to  Ax ≥ b
            l ≤ x ≤ u
            and some x_j integer

Problems are specified in an application-convenient format: GAMS, AMPL, or MPS.
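To make the form above concrete, here is a minimal sketch written with the PuLP modeling library in Python; the objective coefficients, constraint rows, and bounds are hypothetical placeholders for data that would normally arrive via a GAMS, AMPL, or MPS file.

```python
# A minimal sketch of the MIP form above (min c^T x, Ax >= b, l <= x <= u,
# some x_j integer) using the PuLP modeling library.  All data are hypothetical.
import pulp

prob = pulp.LpProblem("small_mip", pulp.LpMinimize)

# l <= x <= u, with x1 and x2 required to be integer, x3 continuous
x1 = pulp.LpVariable("x1", lowBound=0, upBound=10, cat="Integer")
x2 = pulp.LpVariable("x2", lowBound=0, upBound=10, cat="Integer")
x3 = pulp.LpVariable("x3", lowBound=0, upBound=5, cat="Continuous")

# objective: c^T x
prob += 3 * x1 + 2 * x2 + 4 * x3

# constraints: Ax >= b
prob += x1 + x2 + x3 >= 6
prob += 2 * x1 + x3 >= 3

prob.solve()
print(pulp.LpStatus[prob.status], x1.varValue, x2.varValue, x3.varValue)
```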

Page 8: Optimization Issues for Huge Datasets and Long Computation

Data delivery: pay-per-view

• Optimization model for regional caches:

minimize:  C_remote + P · C_regional

over all possible cached objects/segments

subject to:
– C_regional ≤ N_channels
– regional storage ≤ N_segments
– regional server stores 0, k, or K segments of each object

• MIP (large number of objects/segments); a toy sketch of this structure follows
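The sketch below sets up a toy instance of this structure in PuLP: each object may have 0, k, or K segments cached regionally, total cached segments respect the storage budget, and the objective trades a remote-delivery cost against P times a regional cost. The cost coefficients, the way partial caching discounts remote cost, and the omission of the channel constraint are all simplifying assumptions for illustration, not the model from the talk.

```python
# Toy sketch of the regional-cache MIP: cache 0, k, or K segments of each
# object under a storage budget, trading remote cost against P * regional
# cost.  All coefficients are hypothetical; the channel constraint is omitted.
import pulp

objects = range(5)
k, K = 2, 8                                   # partial / full cache sizes (assumed)
Nsegments = 20                                # regional storage budget (assumed)
P = 0.3                                       # weight on regional cost (assumed)
remote_cost = {i: 10 + i for i in objects}    # cost of serving object i remotely
regional_cost = {i: 2 + 0.5 * i for i in objects}

prob = pulp.LpProblem("regional_cache", pulp.LpMinimize)
yk = pulp.LpVariable.dicts("cache_k", objects, cat="Binary")   # store k segments
yK = pulp.LpVariable.dicts("cache_K", objects, cat="Binary")   # store K segments

for i in objects:
    prob += yk[i] + yK[i] <= 1                # each object cached at one level only

# storage: total cached segments within the regional budget
prob += pulp.lpSum(k * yk[i] + K * yK[i] for i in objects) <= Nsegments

# remote cost for the uncached fraction of each object, plus weighted regional cost
prob += pulp.lpSum(remote_cost[i] * (1 - (k / K) * yk[i] - yK[i])
                   + P * regional_cost[i] * (yk[i] + yK[i])
                   for i in objects)

prob.solve()
print([(i, int(yk[i].varValue), int(yK[i].varValue)) for i in objects])
```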

Page 9: Optimization Issues for Huge Datasets and Long Computation

The “Seymour Problem”

• Set covering problem used in proof of four color theorem

• CPLEX 6.0 and Condor (2 option files)
• Running since June 23, 1999
• Currently >590 days CPU time per job
• (13 million nodes; 2.4 million nodes)

Page 10: Optimization Issues for Huge Datasets and Long Computation

FAT COP

• FAT - large # of processors
  – opportunistic environment (Condor)

• COP - Master-Worker control
  – fault tolerant: task exit, host suspend
  – portable parallel programming

• Mixed Integer Program solver
  – Branch and Bound: LP relaxations
  – MPS file, AMPL or GAMS input
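To make the "Branch and Bound: LP relaxations" bullet concrete, here is a stripped-down, serial sketch using scipy's linprog, where each stack entry plays the role of a subtree task described by its variable bounds. It illustrates the idea only and is not FATCOP's implementation; the instance at the bottom is made up.

```python
# A stripped-down, serial branch and bound over LP relaxations; each entry
# on the stack corresponds to a "subtree" described by its variable bounds.
import math
from scipy.optimize import linprog

def branch_and_bound(c, A_ub, b_ub, bounds):
    best_val, best_x = math.inf, None
    stack = [list(bounds)]                     # root node: original bounds
    while stack:
        bnds = stack.pop()
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bnds)
        if not res.success or res.fun >= best_val:
            continue                           # infeasible, or pruned by bound
        frac = [(i, v) for i, v in enumerate(res.x) if abs(v - round(v)) > 1e-6]
        if not frac:                           # integer feasible: new incumbent
            best_val, best_x = res.fun, res.x
            continue
        i, v = frac[0]                         # branch on first fractional variable
        lo, hi = bnds[i]
        down, up = list(bnds), list(bnds)
        down[i] = (lo, math.floor(v))          # x_i <= floor(v)
        up[i] = (math.ceil(v), hi)             # x_i >= ceil(v)
        stack += [down, up]
    return best_val, best_x

# tiny hypothetical instance: min -x1 - x2  s.t.  2x1 + 3x2 <= 12, 0 <= x <= 5
print(branch_and_bound([-1, -1], [[2, 3]], [12], [(0, 5), (0, 5)]))
```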

Page 11: Optimization Issues for Huge Datasets and Long Computation

[Diagram: FATCOP architecture]
• Application problem, expressed in GAMS, AMPL, or MPS
• FATCOP
• LP solver interface: CPLEX, OSL, SOPLEX, MINOS, ...
• MW (Master-Worker)
• Condor-PVM
• PVM
• Internet Protocol

Page 12: Optimization Issues for Huge Datasets and Long Computation

MIP Technology

• Each task is a subtree, with a time limit
  – Diving heuristic
  – Cutting planes (global)
  – Pseudocosts
  – Preprocessing

• Master checkpoint (a minimal sketch follows)

• Worker has state; how to share that info?
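One concrete (and hypothetical) way to realize the "Master checkpoint" bullet: periodically serialize the incumbent and the list of outstanding subtree tasks, so the master can restart from the last checkpoint after a crash instead of repeating the whole search. FATCOP's own checkpoint format is not reproduced here; the file name and task representation below are placeholders.

```python
# A minimal sketch of master-side checkpointing: periodically pickle the
# incumbent and the open subtree tasks so the search can resume after a
# failure.  File name and task representation are hypothetical.
import os
import pickle

CHECKPOINT = "master.ckpt"

def save_checkpoint(incumbent, open_tasks):
    # write atomically: dump to a temp file, then rename over the old one
    with open(CHECKPOINT + ".tmp", "wb") as f:
        pickle.dump({"incumbent": incumbent, "open_tasks": open_tasks}, f)
    os.replace(CHECKPOINT + ".tmp", CHECKPOINT)

def load_checkpoint():
    if not os.path.exists(CHECKPOINT):
        return None, []              # fresh start
    with open(CHECKPOINT, "rb") as f:
        state = pickle.load(f)
    return state["incumbent"], state["open_tasks"]

# usage: restore on startup, then checkpoint every few hundred completed tasks
incumbent, open_tasks = load_checkpoint()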

Page 13: Optimization Issues for Huge Datasets and Long Computation

FATCOP Daily Log

Note machine reboot at approx 3:00 am (night)

Page 14: Optimization Issues for Huge Datasets and Long Computation

Back to Seymour

• Schmieta, Pataki, Linderoth and MCF
  – explored to depth 8 in the tree (see the partitioning sketch below)
  – applied cuts at each of these 256 nodes
  – solved in parallel, using whatever resources were available (CPLEX, FATCOP, ...)

• Problem solved with over 1 year of CPU time
  – over 10 million nodes, 11,000 hours
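Exploring to depth 8 yields 2^8 = 256 independent nodes; one way to picture the resulting work list is to fix 8 binary branching variables to every 0/1 combination and hand each fixing to whatever machine is free. The variable indices below are placeholders, since the actual branching choices are not stated in the talk.

```python
# Enumerate the 2**8 = 256 depth-8 nodes as independent subproblems by
# fixing 8 binary branching variables to every 0/1 combination.  The
# variable indices are placeholders, not the branching choices actually used.
from itertools import product

branch_vars = [3, 17, 42, 58, 90, 121, 200, 311]   # hypothetical indices

tasks = [dict(zip(branch_vars, values))            # e.g. {3: 0, 17: 1, ...}
         for values in product((0, 1), repeat=len(branch_vars))]

print(len(tasks))   # 256 subproblems, each solvable on whatever machine is free
```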

Page 15: Optimization Issues for Huge Datasets and Long Computation

Seymour Node 319

• FATCOP
  – 47.0 hrs with 2,887,808 nodes
  – average number of machines used was 108

• CPLEX
  – 12 days, 10 hrs with 356,600 nodes
  – single machine; clique cuts useful

Page 16: Optimization Issues for Huge Datasets and Long Computation

Large datasets

• Enormous computational resources can sometimes facilitate solution

• X-validation, slice modeling

• What about the data?
• In particular, what if the problem does not fit in core?

Page 22: Optimization Issues for Huge Datasets and Long Computation

NCP functions

Definition:   φ(a, b) = 0   ⟺   0 ≤ a ⊥ b ≥ 0

Example:   φ_min(a, b) := min{a, b}

Example:   φ_FB(a, b) := √(a² + b²) - a - b

Componentwise definition:   Φ_i(x) := φ(x_i, F_i(x)),   so that   Φ(x) = 0   ⟺   0 ≤ x ⊥ F(x) ≥ 0
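A quick numerical illustration of these definitions: both the min and the Fischer-Burmeister functions vanish exactly when a and b are nonnegative and complementary, and Φ applies the chosen φ componentwise to (x_i, F_i(x)). The map F below is made up purely for the demonstration.

```python
# Numerical illustration of the NCP functions above: phi_min and phi_fb
# vanish exactly when 0 <= a, 0 <= b, and a*b = 0; Phi applies a chosen
# phi componentwise to (x_i, F_i(x)).  F is a made-up affine map.
import numpy as np

def phi_min(a, b):
    return np.minimum(a, b)

def phi_fb(a, b):
    return np.sqrt(a**2 + b**2) - a - b        # Fischer-Burmeister

def Phi(x, F, phi=phi_fb):
    return phi(x, F(x))                        # componentwise residual

F = lambda x: np.array([2.0, 1.0]) * x - np.array([2.0, -1.0])   # made-up F

x = np.array([1.0, 0.0])               # F(x) = [0, 1]: complementary with x
print(Phi(x, F))                       # ~[0, 0] -> x solves the NCP for this F
print(Phi(np.array([0.5, 0.2]), F))    # nonzero residual away from a solution
```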

Page 25: Optimization Issues for Huge Datasets and Long Computation

Semismooth results

Page 26: Optimization Issues for Huge Datasets and Long Computation

How can you use these?

• Specialized codes
  – Asynchronous I/O (a minimal overlapped-I/O sketch follows)

• Specialized platforms
  – Condor (executable per architecture)

• Specific input formats
  – GAMS, Matlab

• Handholding operation
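As a minimal illustration of the idea behind the "Asynchronous I/O" item (overlapping reading with computation when the data does not fit in core), the sketch below uses a reader thread to prefetch the next block of a large file while the main thread processes the current one. The file name, block size, and per-block work are placeholders, not the specialized codes from the talk.

```python
# A minimal double-buffering sketch of overlapped (asynchronous) I/O:
# a reader thread prefetches the next block of a large file while the
# main thread works on the current one.
import threading, queue

BLOCK = 1 << 20                      # 1 MB blocks (assumption)

def reader(path, q):
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK)
            if not block:
                break
            q.put(block)             # blocks if the consumer falls behind
    q.put(None)                      # sentinel: end of file

def process(block):
    return sum(block)                # stand-in for real per-block work

def scan(path):
    q = queue.Queue(maxsize=2)       # at most two blocks held in memory
    t = threading.Thread(target=reader, args=(path, q), daemon=True)
    t.start()
    total = 0
    while (block := q.get()) is not None:
        total += process(block)      # overlaps with the reader's next read
    t.join()
    return total

if __name__ == "__main__":
    print(scan("huge_dataset.bin"))  # hypothetical file name
```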

Page 27: Optimization Issues for Huge Datasets and Long Computation

Model centric toolbox

[Diagram: model-centric toolbox]
• GAMS optimization model
• Solvers: LP, QP, MIP, NLP, MINLP
• Other model formats (gms2xx)
• Matlab programming environment
• Model/data exchange
• Condor resource manager
• Data warehouse
• Specialized input