Operations Research & Data Mining

Siggi Olafsson
Associate Professor
Department of Industrial Engineering, Iowa State University

20th European Conference on Operational Research, Rhodes, Greece, July 4-7, 2004
Purpose of Talk
- Give a definition and an overview of data mining as it relates to operations research
- Present some examples to give a flavor for the type of work that is possible
- Offer my views on the future of OR and data mining
- Aim for the talk to be accessible without prior knowledge of data mining

"Should I be here?"
Overview
- Background
- Intersection of OR and data mining
  - Optimization algorithms used for data mining: data visualization, attribute selection, classification, unsupervised learning
  - Data mining used in OR applications: production scheduling
  - Optimization methods applied to the output of standard data mining algorithms: selecting and improving decision trees
- Open research areas
Background
- Rapidly growing interest in data mining among operations research academics and practitioners
- Evidenced, for example, by the increased data mining presence in professional organizations:
  - New INFORMS Section on Data Mining
  - Large number of data mining sessions at INFORMS and IIE research conferences
  - Special issues of Computers & Operations Research, IIE Transactions, Discrete Applied Mathematics, etc.
  - Numerous presentations/sessions at this conference
What is Data Mining?
What is Data Mining, Really?
- Extracting meaningful, previously unknown patterns or knowledge from large databases
- The knowledge discovery process:
  1. Define objective: business/scientific objective, data mining objective
  2. Prepare data: data cleaning, data selection, attribute selection
  3. Mine knowledge: visualization, classification, association rule discovery, clustering
  4. Interpret results: predictive models, structural insights
Interdisciplinary Field

[Figure: data mining at the intersection of statistics, databases, optimization, and machine learning.]
Input Engineering
- Preparing the data may take as much as 70% of the entire effort
- Numerous steps, including: combining data sources, transforming attributes, data cleaning, data selection, attribute selection, and data visualization
- Many of these have connections with operations research, and with optimization in particular
Overview
- Background
- Intersection of OR and data mining
  - Optimization algorithms used for data mining: data visualization, attribute selection, classification, unsupervised learning
  - Data mining used in OR applications: production scheduling
  - Optimization methods applied to the output of standard data mining algorithms: selecting and improving decision trees
- Open research areas
Data Visualization
- Visualizing the data is important in any data mining project
- Generally difficult because the data is typically high-dimensional, i.e., hundreds or thousands of attributes (variables)
- How can we best visualize such data in 2 or 3 dimensions?
- Traditional techniques include multidimensional scaling, which uses nonlinear optimization
Optimization Formulation
- Recent combinatorial optimization formulation by Abbiw-Jackson, Golden, Raghavan, and Wasil (2004)
- Map a set M of m points from R^r to R^q, q = 2, 3
- Approximate the q-dimensional space by a lattice N, with binary variables x_ik = 1 if point i is assigned to lattice point k:

$$
\begin{aligned}
\min \quad & \sum_{i \in M} \sum_{j \in M} \sum_{k \in N} \sum_{l \in N} F\big(d_{original}(i,j),\, d_{new}(k,l)\big)\, x_{ik} x_{jl} \\
\text{s.t.} \quad & \sum_{k \in N} x_{ik} = 1, \quad i \in M, \\
& x_{ik} \in \{0, 1\},
\end{aligned}
$$

where d_original(i,j) is a distance measure in R^r, d_new(k,l) is a distance measure in R^q, and F is a function such as least squares, the Sammon map, etc.
Solution Methods
- The formulation is a Quadratic Assignment Problem (QAP)
- Not possible to solve exactly for large-scale problems; a local search procedure is proposed
- Key to the formulation is the selection of the objective function F, e.g., the Sammon map:

$$
\min \; \frac{1}{\sum_{i \in M} \sum_{j \in M,\, j \neq i} d_{original}(i,j)} \sum_{i \in M} \sum_{j \in M,\, j \neq i} \sum_{k \in N} \sum_{l \in N} \frac{\big(d_{original}(i,j) - d_{new}(k,l)\big)^2}{d_{original}(i,j)}\, x_{ik} x_{jl}
$$
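To make the Sammon objective concrete, the sketch below evaluates the stress of a candidate low-dimensional placement in pure Python; the four-point data set is hypothetical and only illustrates the formula.

```python
import math

def sammon_stress(points, coords):
    """Sammon stress between original high-dimensional points and their
    low-dimensional coordinates (e.g., assigned lattice cells).
    points: list of r-dimensional tuples; coords: list of 2-D tuples."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    m = len(points)
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    d_orig = {p: dist(points[p[0]], points[p[1]]) for p in pairs}
    total = sum(d_orig.values())
    # Sammon map: weight each squared error by 1 / d_original(i, j)
    return sum((d_orig[p] - dist(coords[p[0]], coords[p[1]])) ** 2 / d_orig[p]
               for p in pairs) / total

# Toy data: four 4-dimensional points mapped to 2-D coordinates
points = [(0, 0, 0, 0), (1, 1, 1, 1), (5, 5, 5, 5), (6, 6, 6, 6)]
perfect = [(0, 0), (2, 0), (10, 0), (12, 0)]   # preserves all distances
print(sammon_stress(points, perfect))          # → 0.0
```

A local search over lattice assignments would use this value (or the least-squares variant) as the move-acceptance criterion.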
Overview
- Background
- Intersection of OR and data mining
  - Optimization algorithms used for data mining: data visualization, attribute selection, classification, unsupervised learning
  - Data mining used in OR applications: production scheduling
  - Optimization methods applied to the output of standard data mining algorithms: selecting and improving decision trees
- Open research areas
Attribute Selection
- Usually a large number of attributes
- Some attributes are redundant or irrelevant and should be removed
- Benefits:
  - Faster subsequent induction
  - Simpler models (important in data mining)
  - Better (predictive) performance of models
  - Discovering which attributes are important (descriptive or structural knowledge)
Optimization Formulation
- Define the decision variable

$$
x_j = \begin{cases} 1, & \text{if attribute } j \text{ is selected,} \\ 0, & \text{otherwise.} \end{cases}
$$

- Combinatorial optimization problem:

$$
\max \; f(x_1, x_2, \ldots, x_n) \quad \text{s.t.} \quad x_j \in \{0, 1\}
$$

- The number of candidate solutions is 2^n - 1
- How should the objective function be defined?
Solution Methods
- Non-linear objective function (defining a good objective is a major issue)
- Mathematical programming approach (Bradley, Mangasarian, and Street, 1998)
- Metaheuristics have been applied extensively: genetic algorithms, simulated annealing
- Nested partitions method (Olafsson and Yang, 2004):
  - Intelligent partitioning: take advantage of what is known in data mining about evaluating attributes
  - Random instance sampling: in each step the algorithm uses a sample of instances, which improves scalability
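Metaheuristic attribute selection can be sketched with a simple bit-flip hill climber over subsets. The scoring function below is a hypothetical stand-in for whatever subset-quality measure is used; this is not the nested partitions method itself.

```python
import random

def local_search_attributes(n, score, iters=200, seed=0):
    """Bit-flip hill climbing over attribute subsets (a minimal sketch;
    GAs, simulated annealing, and nested partitions search the same space).
    score maps a frozenset of attribute indices to a number (higher = better)."""
    rng = random.Random(seed)
    best = frozenset(j for j in range(n) if rng.random() < 0.5)
    best_val = score(best)
    for _ in range(iters):
        j = rng.randrange(n)                      # flip one attribute in/out
        cand = best - {j} if j in best else best | {j}
        val = score(cand)
        if val > best_val:
            best, best_val = cand, val
    return best, best_val

# Hypothetical objective: attributes 0 and 2 are relevant, extras are penalized
score = lambda s: len(s & {0, 2}) - 0.1 * len(s - {0, 2})
subset, value = local_search_attributes(8, score)
print(sorted(subset), value)
```

In practice the score would be an attribute-evaluation measure from data mining, possibly estimated on a random sample of instances for scalability.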
Learning from Data
- Each data point (instance) represents an example from which we can learn
- The instances are either:
  - Labeled (supervised learning): one attribute is of special interest (called the class or target) and each instance is labeled by its class value
  - Unlabeled (unsupervised learning)
- Instances are assumed to be independent (however, spatial and temporal data mining are active areas of research)
Learning Tasks in Data Mining
- Classification (supervised learning): learn how to classify data into one of a given number of categories or classes
- Clustering (unsupervised learning): learn natural groupings (clusters) of data
- Association rule discovery: learn correlations (associations) among the data instances; also called market basket analysis
Overview
- Background
- Intersection of OR and data mining
  - Optimization algorithms used for data mining: data visualization, attribute selection, classification, unsupervised learning
  - Data mining used in OR applications: production scheduling
  - Optimization methods applied to the output of standard data mining algorithms: selecting and improving decision trees
- Open research areas
Classification
- Classification is the most common learning task in data mining
- Many methods have been proposed: decision trees, neural networks, support vector machines, Bayesian networks, etc.
- The algorithm is trained on part of the data and its accuracy tested on independent data (or using cross-validation)
- Optimization is relevant to many classification methods
Optimization Formulation
- Suppose we have n attributes and each instance has been labeled as belonging to one of two classes
- Represent the classes by two matrices A and B
- Need to learn what separates the points in the two sets (if they can be separated)
- In a 1965 Operations Research article, Olvi Mangasarian studied the case where the two sets can be separated with a hyperplane:

$$
Aw \ge e, \qquad Bw \le -e,
$$

where e is a vector of ones and the separating hyperplane is x^T w = 0.
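Mangasarian's formulation is solved as a linear program. Purely to illustrate the separation conditions Aw >= e, Bw <= -e, the sketch below finds such a w with the perceptron (a different, simpler method) on a small hypothetical separable data set.

```python
def separating_w(A, B, max_epochs=1000):
    """Find w with a.w >= 1 for all rows a of A and b.w <= -1 for all rows b
    of B, assuming separability by a hyperplane through the origin.
    (Perceptron illustration; Mangasarian (1965) solves a linear program.)"""
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    w = [0.0] * len(A[0])
    for _ in range(max_epochs):
        updated = False
        for a in A:
            if dot(a, w) <= 0:                       # misclassified A point
                w = [wi + ai for wi, ai in zip(w, a)]; updated = True
        for b in B:
            if dot(b, w) >= 0:                       # misclassified B point
                w = [wi - bi for wi, bi in zip(w, b)]; updated = True
        if not updated:
            break
    # rescale so that Aw >= e and Bw <= -e (e = vector of ones)
    margin = min(min(dot(a, w) for a in A), min(-dot(b, w) for b in B))
    return [wi / margin for wi in w]

A = [(2.0, 1.0), (1.0, 2.0)]      # hypothetical class A points
B = [(-1.0, -2.0), (-2.0, -1.0)]  # hypothetical class B points
w = separating_w(A, B)
print(w)  # → [0.5, 0.25]
```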
Separating Hyperplane

[Figure: classes A and B in the (x1, x2) plane, the separating hyperplane, and the closest points c and d in the two convex hulls.]
Finding the Closest Points

Formulate as a QP, writing the closest points as convex combinations of the instances in each class:

$$
\begin{aligned}
\min_{c,\, d} \quad & \tfrac{1}{2} \lVert c - d \rVert^2 \\
\text{s.t.} \quad & c = \sum_{i:\, \text{Class A}} \alpha_i x_i, \qquad \sum_{i:\, \text{Class A}} \alpha_i = 1, \\
& d = \sum_{i:\, \text{Class B}} \beta_i x_i, \qquad \sum_{i:\, \text{Class B}} \beta_i = 1, \\
& \alpha, \beta \ge 0.
\end{aligned}
$$
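The QP above can be handed to any quadratic-programming solver. As a dependency-free sketch, Frank-Wolfe with exact line search also finds the closest hull points, since the feasible region is a product of simplices whose linear subproblems are solved at hull vertices. The two square point sets below are hypothetical.

```python
def closest_hull_points(A, B, iters=100):
    """Closest points between the convex hulls of point sets A and B via
    Frank-Wolfe with exact line search (a sketch; the QP above can equally
    be solved by a quadratic-programming solver)."""
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    sub = lambda u, v: tuple(x - y for x, y in zip(u, v))
    c, d = A[0], B[0]                       # feasible start: any hull vertices
    for _ in range(iters):
        u = sub(c, d)                       # gradient of 0.5||c - d||^2 wrt c
        # linear minimization oracle over each simplex: a hull vertex
        s = min(A, key=lambda a: dot(u, a))
        t = max(B, key=lambda b: dot(u, b))
        g = sub(sub(s, t), u)               # direction of (c - d) change
        denom = dot(g, g)
        if denom < 1e-15:
            break                           # no improving direction left
        gamma = max(0.0, min(1.0, -dot(u, g) / denom))  # exact line search
        c = tuple(ci + gamma * (si - ci) for ci, si in zip(c, s))
        d = tuple(di + gamma * (ti - di) for di, ti in zip(d, t))
    return c, d

A = [(0, 0), (0, 1), (1, 0), (1, 1)]        # unit square
B = [(3, 0), (3, 1), (4, 0), (4, 1)]        # square shifted right by 3
c, d = closest_hull_points(A, B)
gap = sum((x - y) ** 2 for x, y in zip(c, d)) ** 0.5
print(round(gap, 6))  # → 2.0
```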
Support Vector Machines

[Figure: classes A and B in the (x1, x2) plane with the separating hyperplane; the support vectors are the instances closest to the hyperplane.]
Limitations
- The points (instances) may not be separable by a hyperplane: add error terms to minimize
- A linear separation is quite limited

[Figure: two classes A and B in the (x1, x2) plane that no straight line can separate.]

- The solution is to map the data to a higher dimensional space
Wolfe Dual Problem
- First formulate the Wolfe dual:

$$
\begin{aligned}
\max_{\alpha} \quad & \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \\
\text{subject to} \quad & \sum_i \alpha_i y_i = 0, \\
& 0 \le \alpha_i \le C,
\end{aligned}
$$

with w = \sum_i \alpha_i y_i x_i.
- Now the data only appears in the dot product in the objective function
Kernel Functions
- Use kernel functions K : R^n x R^n -> H to map the data and replace the dot product with

$$
K(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})
$$

- For example,

$$
K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^p, \qquad
K(\mathbf{x}, \mathbf{y}) = e^{-\lVert \mathbf{x} - \mathbf{y} \rVert^2 / 2\sigma^2}, \qquad
K(\mathbf{x}, \mathbf{y}) = \tanh(\kappa\, \mathbf{x} \cdot \mathbf{y} - \delta)
$$
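The three example kernels translate directly into code; the parameter names (p, sigma, kappa, delta) follow common convention:

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

# Common kernels replacing the plain dot product x . y
def poly_kernel(x, y, p=2):
    return (dot(x, y) + 1) ** p

def rbf_kernel(x, y, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    return math.tanh(kappa * dot(x, y) - delta)

x, y = (1.0, 2.0), (3.0, 0.5)
print(poly_kernel(x, y))   # → 25.0, since (4 + 1)^2
print(rbf_kernel(x, x))    # → 1.0
```

Substituting any of these for x_i . x_j in the dual objective yields a nonlinear separator in the original space.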
Other Classification Work
- Extensive publications on SVM and mathematical programming for classification
- Several other approaches are also relevant, e.g.:
  - Logical Analysis of Data (LAD) learns logical expressions to classify the target attribute (series of papers by Hammer, Boros, et al.)
  - A related approach is the logic data miner Lsquare (e.g., talk by Felici, Truemper, and Paola last Monday)
  - Bayesian networks are often used, and finding the best structure of such networks is a combinatorial optimization problem (further discussed in the next talk)
Overview
- Background
- Intersection of OR and data mining
  - Optimization algorithms used for data mining: data visualization, attribute selection, classification, unsupervised learning
  - Data mining used in OR applications: production scheduling
  - Optimization methods applied to the output of standard data mining algorithms: selecting and improving decision trees
- Open research areas
Data Clustering
- Now we do not have labeled data to train on (unsupervised learning)
- Want to identify natural clusters or groupings of data instances
- Many possible sets of clusters exist: what makes a set of clusters good?
Optimization Formulation
- Given a set A of m points, find the centers C_j of k clusters that minimize the 1-norm:

$$
\begin{aligned}
\min_{C,\, D} \quad & \sum_{i=1}^{m} \min_{j = 1, \ldots, k} e^T D_{ij} \\
\text{s.t.} \quad & -D_{ij} \le A_i^T - C_j \le D_{ij}, \quad i = 1, \ldots, m;\; j = 1, \ldots, k
\end{aligned}
$$

- This formulation is due to Bradley, Mangasarian, and Street (1997)
- Much more work is needed in this area
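A simple alternating heuristic for this 1-norm objective assigns each point to the nearest center and resets each center to the coordinate-wise median, which minimizes the within-cluster 1-norm sum. This is a sketch of the k-median idea, not the concave-minimization algorithm of the paper; the data set is hypothetical.

```python
import statistics

def k_median(points, k, iters=20):
    """Alternating k-median heuristic for the 1-norm clustering objective:
    assign each point to its nearest center in the 1-norm, then move each
    center to the coordinate-wise median of its cluster."""
    centers = [list(p) for p in points[:k]]        # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum(abs(a - c) for a, c in zip(p, centers[j])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):
            if cl:  # the coordinate-wise median minimizes the 1-norm sum
                centers[j] = [statistics.median(col) for col in zip(*cl)]
    return centers

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(k_median(pts, 2))  # → [[0, 0], [10, 10]]
```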
Association Rule Discovery
- Find strong associations among instances (e.g., high support and confidence)
- Originally used in market basket analysis, e.g., what products are candidates for cross-sell, up-sell, etc.
- Define an item as an attribute-value pair
- Algorithmic approach (Agrawal et al., 1992, Apriori and related methods):
  - Generate frequent item sets with high support
  - Generate rules from these sets with high confidence
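The frequent-itemset phase can be sketched in a few lines: candidates are pruned whenever any subset is infrequent (the Apriori property). The confidence of a rule such as bread -> milk is then support({bread, milk}) / support({bread}). The baskets below are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Frequent-itemset generation (the first phase of Apriori): keep every
    itemset contained in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    support = lambda s: sum(1 for t in transactions if s <= t)
    frequent, level = {}, [frozenset([i]) for i in items]
    k = 1
    while level:
        level = [s for s in level if support(s) >= min_support]
        for s in level:
            frequent[s] = support(s)
        k += 1
        # candidate k-itemsets whose every (k-1)-subset was frequent
        level = [frozenset(c) for c in combinations(items, k)
                 if all(frozenset(sub) in frequent
                        for sub in combinations(c, k - 1))]
    return frequent

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
freq = apriori(baskets, min_support=2)
print(freq[frozenset({"bread", "milk"})])  # → 2
```

Here confidence(bread -> milk) = 2/3, since bread appears in three baskets.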
Objectives for Association Rules
- Want high support and high confidence
  - Maximizing support would lead to discovering only a few trivial rules (those that occur very frequently)
  - Maximizing confidence leads to obvious rules (those that are 100% accurate)
- Support and confidence are usually treated as constraints (user-specified minimums)
- Still need measures for good rules (i.e., rules that add insights and are hence interesting)
- Significant opportunities for optimizing the rules that are obtained (not much work, yet)
Overview
- Background
- Intersection of OR and data mining
  - Optimization algorithms used for data mining: data visualization, attribute selection, classification, unsupervised learning
  - Data mining used in OR applications: production scheduling
  - Optimization methods applied to the output of standard data mining algorithms: selecting and improving decision trees
- Open research areas
Data Mining for OR Applications
- Data mining can be used to complement traditional OR methods in many areas
- Example application areas:
  - E-commerce
  - Supply chain management (e.g., to enable customer-value management in the chain)
  - Production scheduling
Data Mining for Scheduling
- Production scheduling is often ad hoc in practice, relying on the experience and intuition of human schedulers
- Li and Olafsson (2004) propose a method to learn directly from production data
- Benefits:
  - Make scheduling practices explicit
  - Incorporate them in an automatic scheduling system
  - Gain insights into operations
  - Improve schedules
Background
- Scheduling task: given a finite set of jobs, sequence the jobs in order of priority
- Many simple dispatching rules are available
- Machine learning in scheduling:
  - Considerable work over two decades: expert systems, inductive learning
  - Select dispatching rules from simulated data
  - Has not been applied directly to scheduling data (which would be data mining)
Simple Example: Dispatching List

Job ID   Release Time   Start Time   Processing Time   Completion Time
J5       0              0            17                17
J1       10             17           15                32
J3       18             32           20                52
J4       0              52           7                 59
J2       30             59           5                 64

How were these five jobs scheduled?
Longest processing time first (LPT)
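A small simulation of LPT among released jobs reproduces the dispatching list above:

```python
def lpt_schedule(jobs):
    """Sequence jobs by longest-processing-time-first among released jobs.
    jobs: dict id -> (release_time, processing_time)."""
    t, order, pending = 0, [], dict(jobs)
    while pending:
        released = {j: rp for j, rp in pending.items() if rp[0] <= t}
        if not released:                      # machine idles until next release
            t = min(rp[0] for rp in pending.values())
            continue
        j = max(released, key=lambda j: released[j][1])  # longest job first
        t += released[j][1]                   # start now, run to completion
        order.append(j)
        del pending[j]
    return order

# (release_time, processing_time) from the dispatching list above
jobs = {"J1": (10, 15), "J2": (30, 5), "J3": (18, 20),
        "J4": (0, 7), "J5": (0, 17)}
print(lpt_schedule(jobs))  # → ['J5', 'J1', 'J3', 'J4', 'J2']
```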
Data Mining Formulation
- Determine the target concept: dispatching rules are a pair-wise comparison
- Learning task: given two jobs, which job should be dispatched first?
- Data preparation: construct a flat file in which each line (instance/data object) is an example of the target concept
Prepared Data File

Job1   ProcessingTime1   Release1   Job2   ProcessingTime2   Release2   Job1ScheduledFirst
J1     15                10         J2     5                 30         Yes
J1     15                10         J3     20                18         Yes
J1     15                10         J4     7                 0          Yes
J1     15                10         J5     17                0          No
J2     5                 30         J1     15                10         No
J2     5                 30         J3     20                18         No
J2     5                 30         J4     7                 0          No
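Constructing this flat file from a dispatching list is mechanical: emit one instance per ordered pair of distinct jobs, labeled by which was actually scheduled first. The sketch below regenerates such rows from the earlier dispatching list:

```python
def pairwise_instances(schedule):
    """Turn a dispatching list into pairwise training examples: for every
    ordered pair of distinct jobs, record both jobs' attributes and whether
    the first was actually scheduled first.
    schedule: list of (job_id, processing_time, release_time) in dispatch order."""
    position = {job[0]: k for k, job in enumerate(schedule)}
    rows = []
    for j1, p1, r1 in schedule:
        for j2, p2, r2 in schedule:
            if j1 != j2:
                label = "Yes" if position[j1] < position[j2] else "No"
                rows.append((j1, p1, r1, j2, p2, r2, label))
    return rows

# dispatch order and attributes from the earlier dispatching list
dispatch = [("J5", 17, 0), ("J1", 15, 10), ("J3", 20, 18),
            ("J4", 7, 0), ("J2", 5, 30)]
rows = pairwise_instances(dispatch)
print(rows[0])    # → ('J5', 17, 0, 'J1', 15, 10, 'Yes')
print(len(rows))  # → 20 (5 x 4 ordered pairs)
```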
Input Engineering
- Attribute creation (i.e., composite attributes) and attribute selection are an important part of data mining
- Add attributes: ProcessingTimeDifference, ReleaseDifference, Job1Longer, Job1ReleasedFirst
- Select the best subset of attributes
- Apply the C4.5 decision tree algorithm
Decision Tree

[Figure: induced decision tree. The root tests "Job 1 Longer?"; each branch then tests "Job 1 Released First?". When Job 1 is released first, the leaves implement LPT for released jobs via a split on Processing Time Difference at -8. When Job 1 is not yet released, a split on Processing Time Difference at 5 decides whether to wait: do not wait for Job 1 if it is not much longer than Job 2; wait for Job 1 to be released if it is much longer than Job 2.]
Structural Knowledge
- The dispatching rule is LPT: mine data that use this rule together with the processing time and release time data
- The induced model takes into account the possible range of processing times and the largest delay caused by a not-yet-released job
- New structural patterns, not explicitly known by the dispatcher, were discovered
- The next step is to improve schedules: instance selection (learn from best practices) and optimizing the decision tree
Overview
- Background
- Intersection of OR and data mining
  - Optimization algorithms used for data mining: data visualization, attribute selection, classification, unsupervised learning
  - Data mining used in OR applications: production scheduling
  - Optimization methods applied to the output of standard data mining algorithms: selecting and improving decision trees
- Open research areas
Optimizing Decision Trees
- Decision tree induction is often unstable
- Genetic algorithms have been used to select the best tree from a set of trees
  - Kennedy et al. (1997) encode decision trees and define crossover and mutation operators; the accuracy of the tree is the fitness function
  - A series of papers by Fu, Golden, et al. (2003; 2004a; 2004b) builds further on this approach
- Other optimization methods could also apply, and other outputs can be optimized
Overview
- Background
- Intersection of OR and data mining
  - Optimization algorithms used for data mining: data visualization, attribute selection, classification, unsupervised learning
  - Data mining used in OR applications: production scheduling
  - Optimization methods applied to the output of standard data mining algorithms: selecting and improving decision trees
- Open research areas
Conclusions
- Although optimization work related to data mining dates back to the 1960s, most problems are still open or need more research
- Need to be aware of the key concerns of data mining: extracting meaningful, previously unknown patterns or knowledge from large databases
  - Algorithms should handle massive data sets, that is, be scalable with respect to both time and memory use
  - Results often focus on simple-to-interpret, meaningful patterns that provide structural insights
  - "Previously unknown" implies few modeling assumptions that restrict what can be discovered
Open Problems
- Many data mining problems can be formulated as optimization problems
  - We have seen numerous examples, e.g., classification and attribute selection (most existing work addresses these problems)
  - Many areas have not been addressed or need more work (in particular, clustering and association rule mining)
- Optimizing model outputs is very promising
- The use of data mining in OR applications has received very little investigation: supply chain management, logistics and transportation, planning and scheduling
Questions?

For more information after today:
- Email me at [email protected]
- Visit my homepage at http://www.public.iastate.edu/~olafsson
- Consult Dilbert
Select References

The following surveys on optimization and data mining are available:
1. Padmanabhan, B. and A. Tuzhilin (2003). "On the Use of Optimization for Data Mining: Theoretical Interactions and eCRM Opportunities," Management Science 49: 1327-1343.
2. Bradley, P.S., U.M. Fayyad, and O.L. Mangasarian (1999). "Mathematical Programming for Data Mining: Formulations and Challenges," INFORMS Journal on Computing 11: 217-238.

Work mentioned in the presentation:
3. Abbiw-Jackson, B. Golden, S. Raghavan, and E. Wasil (2004). "A Divide-and-Conquer Local Search Heuristic for Data Visualization," Working Paper, University of Maryland.
4. Boros, E., P.L. Hammer, T. Ibaraki, and A. Kogan (1997). "Logical Analysis of Numerical Data," Mathematical Programming 79: 163-190.
5. Bradley, P.S., O.L. Mangasarian, and W.N. Street (1997). "Clustering via Concave Minimization," in M.C. Mozer, M.I. Jordan, and T. Petsche (eds.), Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA.
6. Bradley, P.S., O.L. Mangasarian, and W.N. Street (1998). "Feature Selection via Mathematical Programming," INFORMS Journal on Computing 10: 209-217.
7. Fu, Z., B. Golden, S. Lele, S. Raghavan, and E. Wasil (2003). "A Genetic Algorithm-Based Approach for Building Accurate Decision Trees," INFORMS Journal on Computing 15: 3-22.
8. Kennedy, H., C. Chinniah, P. Bradbeer, and L. Morss (1997). "The Construction and Evaluation of Decision Trees: A Comparison of Evolutionary and Concept Learning Methods," in D. Corne and J.L. Shapiro (eds.), Evolutionary Computing, Lecture Notes in Computer Science, Springer-Verlag, 147-161.
9. Li, X. and S. Olafsson (2004). "Discovering Dispatching Rules using Data Mining," Journal of Scheduling, to appear.
10. Mangasarian, O.L. (1965). "Linear and Nonlinear Separation of Patterns by Linear Programming," Operations Research 13: 455-461.
11. Olafsson, S. and J. Yang (2004). "Intelligent Partitioning for Feature Selection," INFORMS Journal on Computing, to appear.