
European Journal of Operational Research xxx (2006) xxx–xxx

www.elsevier.com/locate/ejor

Operations research and data mining

Sigurdur Olafsson *, Xiaonan Li, Shuning Wu

Department of Industrial and Manufacturing Systems Engineering, Iowa State University, 2019 Black Engineering, Ames, IA 50011, USA

Abstract

With the rapid growth of databases in many modern enterprises data mining has become an increasingly important approach for data analysis. The operations research community has contributed significantly to this field, especially through the formulation and solution of numerous data mining problems as optimization problems, and several operations research applications can also be addressed using data mining methods. This paper provides a survey of the intersection of operations research and data mining. The primary goals of the paper are to illustrate the range of interactions between the two fields, present some detailed examples of important research work, and provide comprehensive references to other important work in the area. The paper thus looks at both the different optimization methods that can be used for data mining, as well as the data mining process itself and how operations research methods can be used in almost every step of this process. Promising directions for future research are also identified throughout the paper. Finally, the paper looks at some applications related to the area of management of electronic services, namely customer relationship management and personalization.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Data mining; Optimization; Classification; Clustering; Mathematical programming; Heuristics

1. Introduction

In recent years, the field of data mining has seen an explosion of interest from both academia and industry. Driving this interest is the fact that data collection and storage has become easier and less expensive, so databases in modern enterprises are now often massive. This is particularly true in web-based systems and it is therefore not surprising that data mining has been found particularly useful in areas related to electronic services.

0377-2217/$ - see front matter © 2006 Elsevier B.V. All rights reserved.

doi:10.1016/j.ejor.2006.09.023

* Corresponding author. Tel.: +1 515 294 8908; fax: +1 515 294 3524.

E-mail address: [email protected] (S. Olafsson).


These massive databases often contain a wealth of important data that traditional methods of analysis fail to transform into relevant knowledge. Specifically, meaningful knowledge is often hidden and unexpected, and hypothesis driven methods, such as on-line analytical processing (OLAP) and most statistical methods, will generally fail to uncover such knowledge. Inductive methods, which learn directly from the data without an a priori hypothesis, must therefore be used to uncover hidden patterns and knowledge.

We use the term data mining to refer to all aspects of an automated or semi-automated process for extracting previously unknown and potentially useful knowledge and patterns from large databases.


This process consists of numerous steps such as integration of data from numerous databases, preprocessing of the data, and induction of a model with a learning algorithm. The model is then used to identify and implement actions to take within the enterprise. Data mining traditionally draws heavily on both statistics and machine learning but numerous problems in data mining can also be formulated as optimization problems (Freed and Glover, 1986; Mangasarian, 1997; Bradley et al., 1999; Padmanabhan and Tuzhilin, 2003).

All data mining starts with a set of data called the training set that consists of instances describing the observed values of certain variables or attributes. These instances are then used to learn a given target concept or pattern and, depending upon the nature of this concept, different inductive learning algorithms are applied. The most common concepts learned in data mining are classification, data clustering, and association rule discovery, all of which will be discussed in detail in Section 3. In classification the training data is labeled, that is, each instance is identified as belonging to one of two or more classes, and an inductive learning algorithm is used to create a model that discriminates between those class values. The model can then be used to classify any new instances according to this class attribute. The primary objective is usually for the classification to be as accurate as possible, but accurate models are not necessarily useful or interesting and other measures such as simplicity and novelty are also important. In both data clustering and association rule discovery there is no class attribute and the data is thus unlabelled. For those two approaches patterns are learned along one of the two dimensions of the database, that is, the attribute dimension and the instance dimension. Specifically, data clustering involves identifying which data instances belong together in natural groups or clusters, whereas association rule discovery learns relationships among the attributes.

The operations research community has made significant contributions to the field of data mining and in particular to the design and analysis of data mining algorithms. Early contributions include the use of mathematical programming for both classification (Mangasarian, 1965), and clustering (Vinod, 1969; Rao, 1971), and the growing popularity of data mining has motivated a relatively recent increase of interest in this area (Bradley et al., 1999; Padmanabhan and Tuzhilin, 2003). Mathematical programming formulations now exist for a range of data mining problems, including attribute selection, classification, and data clustering.


Metaheuristics have also been introduced to solve data mining problems. For example, attribute selection has been done using simulated annealing (Debuse and Rayward-Smith, 1997), genetic algorithms (Yang and Honavar, 1998) and the nested partitions method (Olafsson and Yang, 2004). However, the intersection of OR and data mining is not limited to algorithm design and data mining can play an important role in many OR applications. Vast amounts of data are generated in both traditional application areas such as production scheduling (Li and Olafsson, 2005), as well as newer areas such as customer relationship management (Padmanabhan and Tuzhilin, 2003) and personalization (Murthi and Sarkar, 2003), and both data mining and traditional OR tools can be used to better address such problems.

In this paper, we present a survey of operations research and data mining, focusing on both of the abovementioned intersections. The discussion of the use of operations research techniques in data mining focuses on how numerous data mining problems can be formulated and solved as optimization problems. We do this using a range of optimization methodology, including both metaheuristics and mathematical programming. The application part of this survey focuses on a particular type of applications, namely two areas related to electronic services: customer relationship management and personalization. The intention of the paper is not to be a comprehensive survey, since the breadth of the topics would dictate a far lengthier paper. Furthermore, many excellent surveys already exist on specific data mining topics such as attribute selection, clustering, and support vector machines. The primary goals of this paper, on the other hand, are to illustrate the range of intersections of the two fields of OR and data mining, give some detailed examples of research that we believe illustrates the synergy well, provide references to other important work in the area, and finally suggest some directions for future research in the field.

2. Optimization methods for data mining

A key intersection of data mining and operations research is in the use of optimization algorithms, either directly applied as data mining algorithms, or used to tune parameters of other algorithms. The literature in this area goes back to the seminal work of Mangasarian (1965) where the problem of separating two classes of points was formulated as a linear program.


This has continued to be an active research area ever since this time and the interest has grown rapidly over the past few years with the increased popularity of data mining (see e.g., Glover, 1990; Mangasarian, 1994; Bennett and Bredensteiner, 1999; Boros et al., 2000; Felici and Truemper, 2002; Street, 2005). In this section, we will briefly review different types of optimization methods that have been commonly used for data mining, including the use of mathematical programming for formulating support vector machines, and metaheuristics such as genetic algorithms.

2.1. Mathematical programming and support vector machines

One of the well-known intersections of optimization and data mining is the formulation of support vector machines (SVM) as optimization problems. Support vector machines trace their origins to the seminal work of Vapnik and Lerner (1963) but have only recently received the attention of much of the data mining and machine learning communities.

In what appears to be the earliest mathematical programming work related to this area, Mangasarian (1965) shows how to use linear programming to obtain both linear and non-linear discrimination models between separable data points or instances and several authors have since built on this work. The idea of obtaining a linear discrimination is illustrated in Fig. 1. The problem here is to determine a best model for separating the two classes.

Fig. 1. A separating hyperplane to discriminate two classes.


If the data can be separated by a hyperplane H as in Fig. 1, the problem can be solved fairly easily. To formulate it mathematically, we assume that the class attribute $y_i$ takes two values $-1$ or $+1$. We assume that all attributes other than the class attribute are real valued and denote the training data, consisting of $n$ instances, as $\{(a_j, y_j)\}$, where $j = 1, 2, \ldots, n$, $y_j \in \{-1, +1\}$ and $a_j \in \mathbb{R}^m$. If a separating hyperplane exists then there are in general many such planes and we define the optimal separating hyperplane as the one that maximizes the sum of the distances from the plane to the closest positive example and the closest negative example. To learn this optimal plane we first form the convex hulls of each of the two data sets (the positive and the negative examples), find the closest points $c$ and $d$ in the convex hulls, and then let the optimal hyperplane be the plane that bisects the straight line between $c$ and $d$. This can be formulated as the following quadratic programming (QP) problem:

$$
\begin{aligned}
\min_{\alpha} \quad & \tfrac{1}{2}\,\|c - d\|^2 \\
\text{s.t.} \quad & c = \sum_{i:\, y_i = +1} \alpha_i a_i, \\
& d = \sum_{i:\, y_i = -1} \alpha_i a_i, \\
& \sum_{i:\, y_i = +1} \alpha_i = 1, \\
& \sum_{i:\, y_i = -1} \alpha_i = 1, \\
& \alpha_i \geq 0.
\end{aligned}
\tag{1}
$$

Note that the hyperplane H can also be defined in terms of its unit normal $w$ and its distance $b$ from the origin (see Fig. 1). In other words, $H = \{x \in \mathbb{R}^m : x \cdot w + b = 0\}$, where $x \cdot w$ is the dot product between those two vectors. For an intuitive idea of support vectors and support vector machines we can imagine that two hyperplanes, parallel to the original plane H and thus having the same normal, are pushed in either direction until the convex hull of the sets of all instances with each classification is encountered. This will occur at certain instances, or vectors, that are hence called the support vectors (see Fig. 2). This intuitive procedure is captured mathematically by requiring the following constraints to be satisfied:

$$
\begin{aligned}
a_i \cdot w + b &\geq +1, \quad \forall i : y_i = +1, \\
a_i \cdot w + b &\leq -1, \quad \forall i : y_i = -1.
\end{aligned}
\tag{2}
$$


Fig. 2. Illustration of a support vector machine (SVM).


With this formulation the distance between the two planes, called the margin, is readily seen to be $2/\|w\|$ and the optimal plane can thus be found by solving the following mathematical optimization problem that maximizes the margin:

$$
\begin{aligned}
\max_{w,\, b} \quad & \frac{2}{\|w\|} \\
\text{subject to} \quad & a_i \cdot w + b \geq +1, \quad \forall i : y_i = +1, \\
& a_i \cdot w + b \leq -1, \quad \forall i : y_i = -1.
\end{aligned}
\tag{3}
$$

When the data is non-separable this problem will have no feasible solution and the constraints for the existence of the two hyperplanes must be relaxed. One way of accomplishing this is by introducing error variables $e_j$ for each instance $a_j$, $j = 1, 2, \ldots, n$. Essentially these variables measure the violation of each instance and using these variables the following modified constraints are obtained for problem (3):

$$
\begin{aligned}
a_i \cdot w + b &\geq +1 - e_i, \quad \forall i : y_i = +1, \\
a_i \cdot w + b &\leq -1 + e_i, \quad \forall i : y_i = -1, \\
e_i &\geq 0, \quad \forall i.
\end{aligned}
\tag{4}
$$

As the variables $e_j$ represent the training error, the objective could be taken to minimize $\|w\|/2 + C \sum_j e_j$, where $C$ is a constant that measures how much penalty is given. However, rather than formulating this problem directly it turns out to be convenient to formulate the following dual program:


$$
\begin{aligned}
\max_{\alpha} \quad & \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, a_i \cdot a_j \\
\text{subject to} \quad & 0 \leq \alpha_i \leq C, \\
& \sum_i \alpha_i y_i = 0.
\end{aligned}
\tag{5}
$$

The solution to this problem is the vector of dual variables $\alpha$, and to obtain the primal solution, that is, the model classifying instances defined by the normal of the hyperplane, we calculate

$$
w = \sum_{i:\; a_i \text{ support vector}} \alpha_i y_i a_i.
\tag{6}
$$

The benefit of using the dual is that the constraints are much simpler and easier to handle and that the training data only enters in (5) through the dot product $a_i \cdot a_j$. This latter point is important for extending the approach to non-linear models.

Requiring a hyperplane or a linear discrimination of points is clearly too restrictive for most problems. Fortunately, the SVM approach can be extended to non-linear models in a very straightforward manner using what are called kernel functions $K(x, y) = \phi(x) \cdot \phi(y)$, where $\phi : \mathbb{R}^m \to H$ is a mapping from the m-dimensional Euclidean space to some Hilbert space H. This approach was introduced to the SVM literature by Cortes and Vapnik (1995) and it works because the data $a_j$ only enters the dual via the dot product $a_i \cdot a_j$, which can thus be replaced with $K(a_i, a_j)$. The choice of kernel determines the model. For example, to fit a p degree polynomial the kernel can be chosen as $K(x, y) = (x \cdot y + 1)^p$. Many other choices have been considered in the literature but we will not explore this further here. Detailed expositions of SVM can be found in the book by Vapnik (1995) and in the survey papers of Burges (1998) and Bennett and Campbell (2000).
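To make the dual (5) and the kernel trick concrete, the sketch below solves the dual for a small data set with a general-purpose optimizer and the polynomial kernel mentioned above. It is a minimal illustration that assumes numpy and scipy are available; the function and variable names (poly_kernel, train_dual_svm) and the toy data are ours rather than anything from the paper, and a dedicated SVM solver would normally be used instead.

```python
# Minimal sketch: solve the soft-margin SVM dual (5) with a kernel.
import numpy as np
from scipy.optimize import minimize


def poly_kernel(x, z, p=2):
    """Polynomial kernel K(x, z) = (x . z + 1)^p, as mentioned in the text."""
    return (np.dot(x, z) + 1.0) ** p


def train_dual_svm(A, y, C=1.0, kernel=poly_kernel):
    """Maximize sum_i alpha_i - 0.5 sum_ij alpha_i alpha_j y_i y_j K(a_i, a_j)
    subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0."""
    n = len(y)
    K = np.array([[kernel(A[i], A[j]) for j in range(n)] for i in range(n)])
    Q = (y[:, None] * y[None, :]) * K                 # Q_ij = y_i y_j K(a_i, a_j)

    objective = lambda a: 0.5 * a @ Q @ a - a.sum()   # negated dual, so we minimize
    gradient = lambda a: Q @ a - np.ones(n)
    constraint = {"type": "eq", "fun": lambda a: a @ y}

    res = minimize(objective, np.zeros(n), jac=gradient, method="SLSQP",
                   bounds=[(0.0, C)] * n, constraints=[constraint])
    alpha = res.x

    # Offset b from a support vector strictly inside the box (0 < alpha_i < C).
    sv = int(np.argmax(np.where(alpha < C - 1e-6, alpha, 0.0)))
    b = y[sv] - sum(alpha[j] * y[j] * K[j, sv] for j in range(n))

    def classify(x):
        score = sum(alpha[j] * y[j] * kernel(A[j], x) for j in range(n)) + b
        return 1 if score >= 0 else -1

    return classify


# Toy usage: two tiny groups of points labeled +1 and -1.
A = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.0], [-0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
classify = train_dual_svm(A, y, C=10.0)
print([classify(a) for a in A])
```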

2.2. Metaheuristics for combinatorial optimization

Many optimization problems that arise in data mining are discrete rather than continuous and numerous combinatorial optimization formulations have been suggested for such problems. This includes for example attribute selection, that is, the problem of determining the best set of attributes to be used by the learning algorithm (see Section 3.1.1), determining the optimal structure of a Bayesian network in classification (see Section 3.2.2), and finding the optimal clustering of data instances (see Section 3.3).


In particular, many metaheuristic approaches have been proposed to address such data mining problems.

Metaheuristics are the preferred method over other optimization methods primarily when there is a need to find good heuristic solutions to complex optimization problems with many local optima and little inherent structure to guide the search (Glover and Kochenberger, 2003). Many such problems arise in the data mining context. The metaheuristic approach to solving such problems is to start by obtaining an initial solution or an initial set of solutions and then initiating an improving search guided by certain principles. The structure of the search has many common elements across various methods. In each step of the search algorithm, there is always a solution $x_k$ (or a set of solutions) that represents the current state of the algorithm. Many metaheuristics are solution-to-solution search methods, that is, $x_k$ is a single solution or point $x_k \in X$ in some solution space $X$, corresponding to the feasible region. Others are set-based, that is, in each step $x_k$ represents a set of solutions $x_k \subseteq X$. However, the basic structure of the search remains the same regardless of whether the metaheuristic is solution-to-solution or set-based.

The reason for the meta-prefix is that metaheuristics do not specify all the details of the search, which can thus be adapted by a local heuristic to a specific data mining application. Instead, they specify general strategies to guide specific aspects of the search. For example, tabu search uses a list of solutions or moves called the tabu list, which ensures the search does not revisit recent solutions or become trapped in local optima. The tabu list can thus be thought of as a restriction of the neighborhood. On the other hand, methods such as genetic algorithms specify the neighborhood as all solutions that can be obtained by combining the current solutions through certain operators. Other methods, such as simulated annealing, do not specify the neighborhood in any way but rather specify an approach to accepting or rejecting solutions that allows the method to escape local optima. Finally, the nested partitions method is an example of a set-based method that selects candidate solutions from the neighborhood with a probability distribution that adapts as the search progresses so that better solutions are selected with higher probability.

All metaheuristics can be thought of as sharing the elements of selecting candidate solution(s) from a neighborhood of the current solution(s) and then either accepting or rejecting the candidate(s).


With this perspective, each metaheuristic is thus defined by specifying one or more of these elements but allowing others to be adapted to the particular application. This may be viewed as both a strength and a liability. It implies that we can take advantage of special structure for each application but it also means that the user must specify those aspects, which can be complicated. For the remainder of this section we briefly discuss a few of the most common metaheuristics and discuss how they fit within this framework.

One of the earliest metaheuristics is simulated annealing (Kirkpatrick et al., 1983), which is motivated by the physical annealing process, but within the framework here simply specifies a method for determining if a solution should be accepted. As a solution-to-solution search method, in each step it selects a candidate $x_c$ from the neighborhood $N(x_k)$ of the current solution $x_k \in X$. The definition of the neighborhood is determined by the user. If the candidate is better than the current solution it is accepted. If it is worse it is not automatically rejected but rather accepted with probability $P[\text{Accept } x_c] = e^{(f(x_k) - f(x_c))/T_k}$, where $f : X \to \mathbb{R}$ is a real valued objective function to be minimized and $T_k$ is a parameter called the temperature. Clearly, the probability of acceptance is high if the performance difference is small and $T_k$ is large. The key to simulated annealing is to specify a cooling schedule $\{T_k\}_{k=1}^{\infty}$ by which the temperature is reduced so that initially inferior solutions are selected with a high enough probability that local optima are escaped, but eventually it becomes small enough that the algorithm converges. Simulated annealing has for example been used to solve the attribute selection problem in data mining (Debuse and Rayward-Smith, 1997, 1999).
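To illustrate, the following is a bare-bones sketch of this acceptance rule in code: a worse candidate is accepted with probability exp((f(x_k) - f(x_c))/T_k). The geometric cooling schedule, the toy objective, and the perturbation neighborhood are illustrative choices on our part, not recommendations from the paper.

```python
# Generic simulated annealing skeleton (minimization) with the acceptance
# rule described in the text and a simple geometric cooling schedule.
import math
import random


def simulated_annealing(f, x0, neighbor, t0=1.0, cooling=0.99, steps=2000):
    x, fx = x0, f(x0)
    best, f_best = x, fx
    t = t0
    for _ in range(steps):
        xc = neighbor(x)                         # candidate from N(x_k)
        fc = f(xc)
        # Always accept an improvement; accept a worse candidate with
        # probability exp((f(x_k) - f(x_c)) / T_k).
        if fc <= fx or random.random() < math.exp((fx - fc) / t):
            x, fx = xc, fc
            if fx < f_best:
                best, f_best = x, fx
        t *= cooling                             # reduce the temperature
    return best, f_best


# Toy usage: a one-dimensional objective with several local optima.
objective = lambda x: (x - 3.0) ** 2 + math.sin(5.0 * x)
perturb = lambda x: x + random.uniform(-0.5, 0.5)
print(simulated_annealing(objective, x0=0.0, neighbor=perturb))
```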

Other popular solution-to-solution metaheuristics include tabu search, the greedy randomized adaptive search procedure (GRASP) and the variable neighborhood search (VNS). The defining characteristic of tabu search is in how solutions are selected from the neighborhood. In each step of the algorithm, there is a list $L_k$ of solutions that were recently visited and are therefore tabu. The algorithm looks through all of the solutions of the neighborhood that are not tabu and selects the best one. The defining property of GRASP is its multi-start approach that initializes several local search procedures from different starting points. The advantage of this is that the search becomes more global, but on the other hand each search cannot use what the other searches have learned, which introduces some inefficiency.


The VNS is interesting in that it uses an adaptive neighborhood structure that changes based on the performance of the solutions that are evaluated. More information on tabu search can be found in Glover and Laguna (1997), GRASP is discussed in Resende and Ribeiro (2003), and for an introduction to the VNS approach we refer the reader to Hansen and Mladenovic (1997).

Several metaheuristics are set-based or population based rather than solution-to-solution. This includes genetic algorithms and other evolutionary approaches, as well as scatter search and the nested partitions method. The most popular metaheuristic used in data mining is in fact the genetic algorithm and its variants. As an approach to global optimization genetic algorithms (GA) have been found to be applicable to optimization problems that are intractable for exact solutions by conventional methods (Holland, 1975; Goldberg, 1989). A GA is a set-based search algorithm that at each iteration simultaneously generates a number of solutions. In each step, a subset of the current set of solutions is selected based on their performance and these solutions are combined into new solutions. The operators used to create the new solutions are survival, where a solution is carried to the next iteration without change, crossover, where the properties of two solutions are combined into one, and mutation, where a solution is modified slightly. The same process is then repeated with the new set of solutions. The crossover and mutation operators depend on the representation of the solution but not on the evaluation of its performance. The selection of solutions, however, does depend on the performance. The general principle is that high performing solutions (which in genetic algorithms are referred to as fit individuals) should have a better chance of both surviving and being allowed to create new solutions through crossover. For genetic algorithms and other evolutionary methods the defining element is the innovative manner in which the crossover and mutation operators define a neighborhood of the current solution. This allows the search to quickly and intelligently traverse large parts of the solution space. In data mining genetic and evolutionary algorithms have been used to solve a host of problems, including attribute selection (Yang and Honavar, 1998; Kim et al., 2000) and classification (Fu et al., 2003a,b; Larranaga et al., 1996; Sharpe and Glover, 1999).


Scatter search is another metaheuristic related to the concept of evolutionary search. In each step a scatter search algorithm considers a set of solutions called the reference set. Similar to the genetic algorithm approach these solutions are then combined into a new set. However, as opposed to the genetic operators, in scatter search the solutions are combined using linear combinations, which thus define the neighborhood. For references on scatter search we refer the reader to Glover et al. (2003).

Introduced by Shi and Olafsson (2000), the nested partitions method (NP) is another metaheuristic for combinatorial optimization. The key idea behind this method lies in systematically partitioning the feasible region into subregions, evaluating the potential of each region, and then focusing the computational effort to the most promising region. This process is carried out iteratively with each partition nested within the last. The computational effectiveness of the NP method relies heavily on the partitioning, which if carried out in a manner such that good solutions are close together can reach a near optimal solution very quickly. In data mining, the NP algorithm has been used for attribute selection (Olafsson and Yang, 2004; Yang and Olafsson, 2006), and clustering (Kim and Olafsson, 2004).

3. The data mining process

As described in the introduction, data mining involves using an inductive algorithm to learn previously unknown patterns from a large database. But before the learning algorithm can be applied a great deal of data preprocessing must usually be performed. Some authors distinguish this from the inductive learning by referring to the whole process as knowledge discovery and reserve the term data mining for only the inductive learning part of the process. As stated earlier, however, we refer to the whole process as data mining. Typical steps in the process include the following:

• As for other data analysis projects, data mining starts by defining the business or scientific objectives and formulating this as a data mining problem.

• Given the problem to be addressed, the appropriate data sources need to be identified and the data integrated and preprocessed to make it appropriate for data mining (see Section 3.1).


• Once the data has been prepared the next step is to generate previously unknown patterns from the data using inductive learning. The most common types of patterns are classification models (see Section 3.2), natural clusters of instances (see Section 3.3), and association rules describing relationships between attributes (see Section 3.4).

• The final steps are to validate and then implement the patterns obtained from the inductive learning.

In the following sections, we describe several of the most important parts of this process and focus specifically on how optimization methods can be used for the various parts of the process.

3.1. Data preprocessing and exploratory data mining

In any data mining application the database to be mined may contain noisy or inconsistent data, some data may be missing and in almost all cases the database is large. Data preprocessing addresses each of those issues and includes such preliminary tasks as data cleaning, data integration, data transformation, and data reduction. Also applied in the early stages of the process, exploratory data mining involves discovering patterns in the data using summary statistics and visualization. Optimization and other OR tools are relevant to both data preprocessing tasks and exploratory data mining and in this section we illustrate this through one particular task from each category, namely attribute selection and data visualization.

3.1.1. Attribute selection

Attribute selection is an important problem in data mining. This involves a process for determining which attributes are relevant in that they predict or explain the data, and conversely which attributes are redundant or provide little information (Liu and Motoda, 1998). Doing attribute selection before a learning algorithm is applied has numerous benefits. By eliminating many of the attributes it becomes easier to train other learning methods, that is, computational time of the induction is reduced. Also, the resulting model may be simpler, which often makes it easier to interpret and thus more useful in practice. It is also often the case that simple models are found to generalize better when applied for prediction.


Thus, a model employing fewer attributes is likely to score higher on many interestingness measures and may even score higher in accuracy. Finally, discovering which attributes should be kept, that is, identifying attributes that are relevant to the decision making, often provides valuable structural information and is therefore important in its own right.

The literature on attribute selection is extensive and many attribute selection methods are based on applying an optimization approach. As a recent example, Olafsson and Yang (2004) formulate the attribute selection problem as a simple combinatorial optimization problem with the following decision variables:

$$
x_i = \begin{cases} 1 & \text{if the $i$th feature is included,} \\ 0 & \text{otherwise,} \end{cases}
\tag{7}
$$

for $i = 1, 2, \ldots, m$. The optimization problem is then

$$
\begin{aligned}
\min \quad & f(x) \\
\text{s.t.} \quad & K_{\min} \leq \sum_{i=1}^{m} x_i \leq K_{\max}, \\
& x_i \in \{0, 1\},
\end{aligned}
\tag{8}
$$

where $x = (x_1, x_2, \ldots, x_m)$ and $K_{\min}$ and $K_{\max}$ are some minimum and maximum number of attributes to be selected. A key issue is the selection of the objective function and there is no single method for evaluating the quality of attributes that works best for all data mining problems. Some methods evaluate the quality of each attribute individually, that is,

$$
f(x) = \sum_{i=1}^{m} f_i(x_i),
\tag{9}
$$

whereas others evaluate the quality of the entire subset together, that is, $f(x) = f(X)$, where $X = \{i : x_i = 1\}$ is the subset of selected attributes. In Olafsson and Yang (2004) the authors use the nested partitions method of Shi and Olafsson (2000) to solve this problem using multiple objective functions of both types described above and show that such an optimization approach is very effective. In Yang and Olafsson (2006) the authors improve these results by developing an adaptive version of the algorithm, which in each step uses a small random subset of all the instances. This is important because data mining usually deals with a very large number of instances and scalability with respect to the number of instances is therefore a critical issue.


Other optimization-based methods that have been applied to this problem include genetic algorithms (Yang and Honavar, 1998), evolutionary search (Kim et al., 2000), simulated annealing (Debuse and Rayward-Smith, 1997), branch-and-bound (Narendra and Fukunaga, 1977), logical analysis of data (Boros et al., 2000), and mathematical programming (Bradley et al., 1998).
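As a concrete illustration of formulation (7)-(8), the sketch below searches over binary attribute-indicator vectors with a simple genetic algorithm. The objective is a placeholder of the decomposable form (9), namely the (negated) sum of absolute correlations between each selected attribute and the class; the synthetic data, the GA settings, and the function names are ours and only indicate how such a formulation might be coded.

```python
# Sketch: attribute selection as the combinatorial problem (7)-(8), searched
# with a simple genetic algorithm over binary vectors x in {0,1}^m.
import numpy as np

rng = np.random.default_rng(0)


def make_objective(data, labels):
    # Per-attribute score f_i = |corr(attribute i, class)|; decomposable
    # objective f(x) = -sum_i f_i x_i, negated so lower values are better.
    scores = np.array([abs(np.corrcoef(data[:, i], labels)[0, 1])
                       for i in range(data.shape[1])])
    return lambda x: -float(np.dot(scores, x))


def ga_select(f, m, k_min, k_max, pop_size=30, generations=50, p_mut=0.1):
    def feasible(x):
        return k_min <= x.sum() <= k_max

    def random_solution():
        x = np.zeros(m, dtype=int)
        x[rng.choice(m, size=rng.integers(k_min, k_max + 1), replace=False)] = 1
        return x

    population = [random_solution() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=f)                     # lower objective = fitter
        survivors = population[: pop_size // 2]    # survival of the best half
        children = []
        while len(children) < pop_size - len(survivors):
            i, j = rng.choice(len(survivors), size=2, replace=False)
            cut = rng.integers(1, m)               # one-point crossover
            child = np.concatenate([survivors[i][:cut], survivors[j][cut:]])
            flip = rng.random(m) < p_mut           # mutation: flip a few bits
            child = np.where(flip, 1 - child, child)
            if feasible(child):                    # enforce the cardinality constraint
                children.append(child)
        population = survivors + children
    return min(population, key=f)


# Toy usage: 6 attributes, only the first two actually predict the class.
data = rng.normal(size=(200, 6))
labels = (data[:, 0] + data[:, 1] > 0).astype(float)
selected = ga_select(make_objective(data, labels), m=6, k_min=1, k_max=3)
print(selected)
```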

3.1.2. Data visualization

As for most other data analysis, the visualization of data plays an important role in data mining. This is a difficult problem since the data is usually high dimensional, that is, the number m of attributes is large, whereas the data can only be visualized in two or three dimensions. While it is possible to visualize two or three attributes at a time a better alternative is often to map the data to two or three dimensions in a way that preserves the structure of the relationships (that is, distances) between instances. This problem has traditionally been formulated as a non-linear mathematical programming problem (Borg and Groenen, 1997).

As an alternative to the traditional formulation, Abbiw-Jackson et al. (2006) recently provide the following quadratic assignment problem (QAP) formulation. Given n instances in $\mathbb{R}^m$ and a matrix $D^{\text{old}} \in \mathbb{R}^{n \times n}$ measuring the distance between those instances, find the optimal assignment of those instances to a lattice N in $\mathbb{R}^q$, $q = 2, 3$. The decision variables are given by

$$
x_{ik} = \begin{cases} 1, & \text{if the $i$th instance is assigned to lattice point } k \in N, \\ 0, & \text{otherwise.} \end{cases}
\tag{10}
$$

With this assignment, there is a new distance matrix $D^{\text{new}}$ measuring the distances between the lattice points in the q = 2 or 3 dimensional space, and the mathematical program can be written as follows:

$$
\begin{aligned}
\min \quad & \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k \in N} \sum_{l \in N} F\bigl(D^{\text{old}}_{i,j}, D^{\text{new}}_{k,l}\bigr)\, x_{ik} x_{jl} \\
\text{subject to} \quad & \sum_{k \in N} x_{ik} = 1, \quad \forall i, \\
& x_{ik} \in \{0, 1\},
\end{aligned}
\tag{11}
$$

where F is a function of the deviation between the distances between the instances in the original space and in the new q-dimensional space. Any solution method for the quadratic assignment problem can be used, but Abbiw-Jackson et al. (2006) propose a local search heuristic that takes the specific objective function into account and compare the results to the traditional non-linear mathematical programming formulations.


They conclude that the QAP formulation provides similar results and, importantly, tends to perform better for large problems.
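For illustration, the sketch below implements the QAP view in (10)-(11) directly: each instance is assigned to a distinct point of a small 2-D lattice and a pairwise-swap local search reduces the total squared deviation between original and lattice distances (one possible choice of F). This is a plain local search written by us for illustration, not the heuristic of Abbiw-Jackson et al. (2006).

```python
# Sketch: data visualization as a QAP, solved with a pairwise-swap local search.
import itertools
import numpy as np


def visualize_qap(data, grid_size, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    d_old = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    lattice = np.array(list(itertools.product(range(grid_size), repeat=2)), float)
    d_lat = np.linalg.norm(lattice[:, None, :] - lattice[None, :, :], axis=2)
    assign = list(rng.permutation(len(lattice))[:n])   # instance i -> lattice point

    def cost(a):
        d_new = d_lat[np.ix_(a, a)]                    # distances after the assignment
        return float(((d_old - d_new) ** 2).sum())     # F = squared deviation

    improved = True
    while improved:                                    # swap two assignments while it helps
        improved = False
        for i in range(n):
            for j in range(i + 1, n):
                candidate = assign.copy()
                candidate[i], candidate[j] = candidate[j], candidate[i]
                if cost(candidate) < cost(assign):
                    assign, improved = candidate, True
    return lattice[assign]                             # 2-D coordinates for each instance


# Toy usage: map 8 five-dimensional instances onto a 4 x 4 lattice.
coords = visualize_qap(np.random.rand(8, 5), grid_size=4)
print(coords)
```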

3.2. Classification

Once the data has been preprocessed a learning algorithm is applied, and one of the most common learning tasks in data mining is classification. Here there is a specific attribute called the class attribute that can take a given number of values and the goal is to induce a model that can be used to discriminate new data into classes according to those values. The induction is based on a labeled training set where each instance is labeled according to the value of the class attribute. The objective of the classification is to first analyze the training data and develop an accurate description or a model for each class using the attributes available in the data. Such class descriptions are then used to classify future independent test data or to develop a better description for each class. Many methods have been studied for classification, including decision tree induction, support vector machines, neural networks, and Bayesian networks (Fayyad et al., 1996; Weiss and Kulikowski, 1991).

Optimization is relevant to many classification methods and support vector machines have already been discussed in Section 2.1 above. In this section, we focus on three additional popular classification approaches, namely decision tree induction, Bayesian networks, and neural networks.

3.2.1. Decision trees

One of the most popular techniques for classification is the top-down induction of decision trees. One of the main reasons behind their popularity appears to be their transparency, and hence relative advantage in terms of interpretability. Another advantage is the ready availability of powerful implementations such as CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993). Most decision tree induction algorithms construct a tree in a top-down manner by selecting attributes one at a time and splitting the data according to the values of those attributes. The most important attribute is selected as the top split node, and so forth. For example, in C4.5 attributes are chosen to maximize the information gain ratio in the split (Quinlan, 1993).


This is an entropy measure designed to increase the average class purity of the resulting subsets. Algorithms such as C4.5 and CART are computationally efficient and have proven very successful in practice. However, the fact that they are limited to constructing axis-parallel separating planes limits their effectiveness in applications where some combination of attributes is highly predictive of the class (Lee and Olafsson, 2006).
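For concreteness, the following small sketch computes the information gain ratio that C4.5 maximizes when choosing a nominal split attribute; the helper names and the toy data are ours.

```python
# Sketch: the information gain ratio used by C4.5 to choose split attributes.
import math
from collections import Counter


def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(attribute_values, labels):
    n = len(labels)
    subsets = {}
    for v, y in zip(attribute_values, labels):
        subsets.setdefault(v, []).append(y)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder              # information gain of the split
    split_info = entropy(attribute_values)          # penalizes many-valued attributes
    return gain / split_info if split_info > 0 else 0.0


# Toy usage: an attribute whose values separate the two classes fairly well.
values = ["a", "a", "b", "b", "b", "c"]
labels = ["+", "+", "-", "-", "+", "-"]
print(round(gain_ratio(values, labels), 3))
```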

Mathematical optimization techniques have been applied directly in the optimal construction of decision boundaries in decision tree induction. In particular, Bennett (1992) introduced an extension of linear programming techniques to decision tree construction, although this formulation is limited to two-class problems. In recent work, Street (2005) presents a new algorithm for multi-category decision tree induction based on non-linear programming. The algorithm, termed Oblique Category SEParation (OC-SEP), shows improved generalization performance on several real-world data sets.

One of the limitations of most decision tree algorithms is that they are known to be unstable. This is especially true when dealing with a large data set where it can be impractical to access all the data at once and construct a single decision tree (Fu et al., 2003a). To increase interpretability, it is necessary to reduce the tree sizes and this can make the process even less stable. Finding the optimal decision tree can be treated as a combinatorial optimization problem but this is known to be an NP-complete problem and heuristics such as those discussed in Section 2.2 above must be applied. Kennedy et al. (1997) first developed a genetic algorithm for optimizing decision trees. In their approach, a binary tree is represented by a number of unit subtrees, each having a root node and two branches. In more recent work, Fu et al. (2003a,b, 2006) also use genetic algorithms for this task. Their method uses C4.5 to generate K trees as the initial population, and then exchanges the subtrees between trees (crossover) or within the same tree (mutation). At the end of a generation, logic checks and pruning are carried out to improve the decision tree. They show that the resulting tree performs better than C4.5 and the computation time only increases linearly as the size of the training and scoring combination increases. Furthermore, creating each tree only requires a small percentage of the data to generate high-quality decision trees. All of the above approaches use some function of the tree accuracy for the genetic algorithm fitness function.


In particular, Fu et al. (2003a) use the average classification accuracy directly, whereas Fu et al. (2003b) use a distribution for the accuracy that enables them to account for the user's risk tolerance. This is further extended in Fu et al. (2006) where the classification is modeled using a loss function that then becomes the fitness function of the genetic algorithm. Finally, in other related work Dhar et al. (2000) use an adaptive resampling method where instead of using a complete decision tree as the chromosomal unit, a chromosome is simply a rule, that is, any complete path from the root node of the tree to a leaf node.

When using a genetic algorithm to optimize the tree, there is ordinarily no method for adequately controlling the growth of the tree, because the genetic algorithm does not evaluate the size of the tree. Therefore, during the search process the tree may become overly deep and complex or may settle on too simple a tree. To address this, Niimi and Tazaki (2000) combine genetic programming with an association rule algorithm for decision tree construction. In this approach rules generated by the Apriori association rule discovery algorithm (Agrawal et al., 1993) are taken as the initial individual decision trees for a subsequent genetic programming algorithm.

Another approach to improving the optimization of the decision tree is to improve the fitness function used by the genetic algorithm. Traditional fitness functions use the mean accuracy as the performance measure. Fu et al. (2003b) investigate the use of various percentiles of the distribution of classification accuracy in place of the mean and develop a genetic algorithm that simultaneously considers two fitness criteria. Tanigawa and Zhao (2000) include the tree size in the fitness function in order to control the tree's growth. Also, the utilization of a fitness function based on the J-Measure, which determines the information content of a tree, can give a preference criterion to find the decision tree that classifies a set of instances in the best way (Folino et al., 2001).

3.2.2. Bayesian networks

The popular naïve Bayes method is another simple yet effective classifier. This method learns the conditional probability of each attribute given the class label from the training data. Classification is then done by applying Bayes rule to compute the probability of a class value given the particular instance and predicting the class value with the highest probability.


In general this would require estimating the marginal probabilities of every attribute combination, which is not feasible, especially when the number of attributes is large and there may be few or no observations (instances) for some of the attribute combinations. Thus, a strong independence assumption is made, that is, all the attributes are assumed conditionally independent given the value of the class attribute. Given this assumption, only the marginal probabilities of each attribute given the class need to be calculated. However, this assumption is clearly unrealistic and Bayesian networks relax it by explicitly modeling dependencies between attributes.
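The naïve Bayes computation just described is short enough to sketch directly. The code below estimates the class priors and the per-attribute conditional probabilities from categorical training data and predicts the class with the highest (log) posterior; the add-one (Laplace) smoothing and all names are our own illustrative choices, not details from the paper.

```python
# Sketch: naive Bayes for categorical attributes under the conditional
# independence assumption described above.
import math
from collections import Counter, defaultdict


def train_naive_bayes(instances, labels):
    class_counts = Counter(labels)
    n_attrs = len(instances[0])
    domains = [set(x[i] for x in instances) for i in range(n_attrs)]
    value_counts = defaultdict(Counter)     # (class, attribute index) -> value counts
    for x, y in zip(instances, labels):
        for i, v in enumerate(x):
            value_counts[(y, i)][v] += 1

    def classify(x):
        best, best_score = None, float("-inf")
        for c, n_c in class_counts.items():
            score = math.log(n_c / len(labels))                  # log prior P(c)
            for i, v in enumerate(x):
                count = value_counts[(c, i)][v]
                # Add-one smoothed estimate of P(attribute i = v | class c).
                score += math.log((count + 1) / (n_c + len(domains[i])))
            if score > best_score:
                best, best_score = c, score
        return best

    return classify


# Toy usage: two categorical attributes (outlook, temperature).
instances = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
classify = train_naive_bayes(instances, labels)
print(classify(("rain", "hot")))
```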

A Bayesian network is a directed acyclic graph G that models probabilistic relationships among a set of random variables $U = \{X_1, \ldots, X_m\}$, where each variable in U has specific states or values (Jensen, 1996). As before, m denotes the number of attributes. Each node in the graph represents a random variable, while the edges capture the direct dependencies between the variables. The network encodes the conditional independence relationships that each node is independent of its non-descendants given its parents (Castillo et al., 1997; Pernkopf, 2005). There are two key optimization-related issues when using Bayesian networks. First, when some of the nodes in the network are not observable, that is, there is no data for the values of the attributes corresponding to those nodes, finding the most likely values of the conditional probabilities can be formulated as a non-linear mathematical program. In practice this is usually solved using a simple steepest descent approach. The second optimization problem occurs when the structure of the network is unknown, which can be formulated as a combinatorial optimization problem.

The problem of learning the structure of a Bayesian network can be informally stated as follows. Given a training set $A = \{u_1, u_2, \ldots, u_n\}$ of n instances of U, find a network that best matches A. The common approach to this problem is to introduce an objective function that evaluates each network with respect to the training data and then to search for the best network according to this function (Friedman et al., 1997). The key optimization challenges are choosing the objective function and determining how to search for the best network. The two main objective functions commonly used to learn Bayesian networks are the Bayesian scoring function (Cooper and Herskovits, 1992; Heckerman et al., 1995), and a function based on the minimal description length (MDL) principle


(Lam and Bacchus, 1994; Suzuki, 1993). Any metaheuristic can be applied to solve the problem. For example, Larranaga et al. (1996) have done work on using genetic algorithms for learning Bayesian networks.

3.2.3. Neural networks

Another popular approach for classification is neural networks. Neural networks have been extensively studied in the literature and an excellent review of the use of feed-forward neural networks for classification is given by Zhang (2000). The inductive learning of neural networks from data is referred to as training this network, and the most popular method of training is back-propagation (Rumelhart and McClelland, 1986). It is well known that back-propagation can be viewed as an optimization process and since this has been studied in detail elsewhere we only briefly review the primary connection with optimization here.

A neural network consists of at least three layers of nodes. The input layer consists of one node for each of the independent attributes. The output layer consists of node(s) for the class attribute(s), and connecting these layers is one or more intermediate layers of nodes that transform the input into an output. When connected, these layers of nodes make up the network we refer to as a neural net. The training of the neural network involves determining the parameters for this network. Specifically, each arc connecting the nodes in this network has a certain associated weight and the values of those weights determine how the input is transformed into an output. Most neural network training methods, including back-propagation, are inherently optimization processes. As before, the training data consists of values for some input attributes (input layer) along with the class attribute (output layer), which is usually referred to as the target value of the network. The optimization process seeks to determine the arc weights in order to reduce some measure of error (normally, minimizing squared error) between the actual and target outputs (Ripley, 1996).

Since the weights in the network are continuous variables and the relationship between the input and the output is highly non-linear, this is a non-linear continuous optimization problem or a non-linear programming problem (NLP). Any appropriate NLP algorithm could therefore be applied to train a neural network, but in practice a simple steepest-descent approach is most often applied (Ripley, 1996).


This does not assure that the global optimal solution has been found, but rather terminates at the first local optimum that is encountered. However, due to the size of the problems involved the speed of the optimization algorithm is usually imperative.
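As a small illustration of this optimization view, the sketch below trains a single-hidden-layer network by steepest descent on the squared error, that is, plain batch back-propagation. The layer sizes, learning rate, sigmoid activation, and XOR toy data are our own illustrative choices.

```python
# Sketch: training a one-hidden-layer neural network by steepest descent on
# the squared error (batch back-propagation).
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_network(X, y, hidden=8, rate=1.0, epochs=10000, seed=0):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])            # append a bias input
    W1 = rng.normal(scale=0.5, size=(Xb.shape[1], hidden))
    W2 = rng.normal(scale=0.5, size=(hidden + 1, 1))
    for _ in range(epochs):
        H = sigmoid(Xb @ W1)                             # forward pass
        Hb = np.hstack([H, np.ones((len(H), 1))])        # bias for the output layer
        out = sigmoid(Hb @ W2)
        d_out = (out - y) * out * (1 - out)              # gradient at the output layer
        d_hid = (d_out @ W2[:-1].T) * H * (1 - H)        # back-propagate (skip bias row)
        W2 -= rate * Hb.T @ d_out                        # steepest-descent weight updates
        W1 -= rate * Xb.T @ d_hid

    def predict(X_new):
        Xb_new = np.hstack([X_new, np.ones((len(X_new), 1))])
        Hb_new = np.hstack([sigmoid(Xb_new @ W1), np.ones((len(X_new), 1))])
        return sigmoid(Hb_new @ W2)

    return predict


# Toy usage: the XOR target, which no linear model can fit.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([[0], [1], [1], [0]], float)
predict = train_network(X, y)
print(np.round(predict(X), 2))
```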

This section illustrates how optimization plays a significant role in classification. As shown in Section 2.1, the classification problem itself can be formulated as a mathematical programming problem, and as demonstrated in this section it also plays an important role in conjunction with other classification methods. This can both be to optimize the output of other classification algorithms, such as the optimization of decision trees, and to optimize parameters utilized by other classification algorithms, such as finding the optimal structure of a Bayesian network. Although considerable work has been done already in this area there are still many unsolved problems for the operations research community to address.

3.3. Clustering

When the data is unlabelled and each instance does not have a given class label the learning task is called unsupervised. If we still want to identify which instances belong together, that is, form natural clusters of instances, a clustering algorithm can be applied (Jain et al., 1999; Kaufman and Rousseeuw, 1990). Such algorithms can be divided into two categories: hierarchical clustering and partitional clustering. In hierarchical clustering all of the instances are organized into a hierarchy that describes the degree of similarity between those instances. Such representation may provide a great deal of information and many algorithms have been proposed. Partitional clustering, on the other hand, simply creates one partition of the data where each instance falls into one cluster. Thus, less information is obtained but the ability to deal with a large number of instances is improved.

As appears to have been first pointed out by Vinod (1969), the partitional clustering problem can be formulated as an optimization problem. The key issues are how to define the decision variables and how to define the objective functions, neither of which has a universally applicable answer. In clustering, the most common objectives are to minimize the difference of the instances in each cluster (compactness), maximize the difference between instances in different clusters (separation), or some combination of the two measures.


However, other measures may also be of interest and adequately assessing cluster quality is a major unresolved issue (Estevill-Castro, 2002; Grabmeier and Rudolph, 2002; Osei-Bryson, 2005). A detailed discussion of this issue is outside the scope of the paper and most of the work that applies optimization to clustering focuses on some variant of the compactness measure.

In addition to the issue of selecting an appropriate measure of cluster quality as the objective function, there is no generally agreed upon manner in which a clustering should be defined. One popular way to define a data clustering is to let each cluster be defined by its center point $c_j \in \mathbb{R}^m$ and then assign every instance to the closest center. Thus, the clustering is defined by an $m \times k$ matrix $C = (c_1, c_2, \ldots, c_k)$. This is for example done by the classic and still popular k-means algorithm (MacQueen, 1967). K-means is a simple iterative algorithm that proceeds as follows. Starting with randomly selected instances as centers, each instance is first assigned to the closest center. Given those assignments, the cluster centers are recalculated and each instance again assigned to the closest center. This is repeated until no instance changes clusters after the centers are recalculated, that is, the algorithm converges to a local optimum.
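The procedure just described translates almost line for line into code. The following is a minimal k-means sketch (initialization by randomly chosen instances, assignment to the closest center, recomputation of the centers, and termination when no assignment changes); the toy data and names are illustrative.

```python
# Sketch: the k-means iteration described in the text.
import numpy as np


def k_means(data, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]  # random instances as centers
    assignment = None
    while True:
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)          # closest center for each instance
        if np.array_equal(new_assignment, assignment):
            return centers, assignment                 # converged to a local optimum
        assignment = new_assignment
        for j in range(k):                             # recompute centers as cluster means
            members = data[assignment == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)


# Toy usage: two well separated groups of points in the plane.
data = np.vstack([np.random.rand(20, 2), np.random.rand(20, 2) + 5.0])
centers, labels = k_means(data, k=2)
print(np.round(centers, 2))
```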

Much of the work on optimization formulations uses the idea of defining a clustering by a fixed number of centers. This is true of the early work of Vinod (1969) where the author provides two integer programming formulations of the clustering problem. For example, in the first formulation the decision variable is defined as an indicator of the cluster to which each instance is assigned:

$$
x_{ij} = \begin{cases} 1 & \text{if the $i$th instance is assigned to the $j$th cluster,} \\ 0 & \text{otherwise,} \end{cases}
\tag{12}
$$

and the objective is to minimize the total cost of the assignment, where $w_{ij}$ is some cost of assigning the ith instance to the jth cluster:

$$
\begin{aligned}
\min \quad & \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij} x_{ij} \\
\text{s.t.} \quad & \sum_{j=1}^{k} x_{ij} = 1, \quad i = 1, 2, \ldots, n, \\
& \sum_{i=1}^{n} x_{ij} \geq 1, \quad j = 1, 2, \ldots, k.
\end{aligned}
\tag{13}
$$

Note that the constraints assure that each instance is assigned to exactly one cluster and that each cluster has at least one instance (all the clusters are used).


Also note that if the assignments (12) are known then the cluster centers can be calculated by averaging the assigned instances. Using the same definition of the decision variables, Rao (1971) provides additional insights into the clustering problem and suggests improved integer programming formulations, with the objective taken as both minimizing the within cluster sum of squares and minimizing the maximum distance within clusters. Both of those objective functions can be viewed as measures of cluster compactness.

More recently, Bradley et al. (1996) and Bradley and Mangasarian (2000) also formulated the problem of identifying cluster centers as a mathematical programming problem. As before, a set A of n training instances $A_i \in \mathbb{R}^m$ is given, and we assume a fixed number of k clusters. Given this scenario, Bradley et al. (1996) use dummy vectors $d_{ij} \in \mathbb{R}^m$ to formulate the following linear program to find the k cluster centers $c_j$ that minimize the 1-norm from each instance to the nearest center:

$$
\begin{aligned}
\min_{c,\,d} \quad & \sum_{i=1}^{n} \min_{j=1,\ldots,k} \{ e^{T} d_{ij} \} \\
\text{s.t.} \quad & -d_{ij} \le A_i^{T} - c_j \le d_{ij}, \quad i = 1, 2, \ldots, n, \; j = 1, 2, \ldots, k.
\end{aligned} \qquad (14)
$$

By using the 1-norm instead of the more usual 2-norm, this is a linear program, which can be solved efficiently even for very large problems. As for the other formulations discussed above, a limitation of this program is that it focuses exclusively on optimizing cluster compactness, which in this case means finding the cluster centers such that each instance in the cluster is as close as possible to its center. The solution may therefore be a cluster set where the clusters are not well separated.

In Bradley and Mangasarian (2000), the authors take a different definition of a clustering and instead of finding the best centers identify the best cluster planes:

$$
P_j = \{ x \in \mathbb{R}^m \mid x^{T} w_j = c_j \}, \quad j = 1, 2, \ldots, k. \qquad (15)
$$

They propose an iterative algorithm similar to the k-means algorithm that repeatedly assigns instances to the closest cluster plane and then, given the new assignment, finds the plane that minimizes the sum of squares of the distances from each assigned instance to that plane. In other words, given the matrix $A^{(j)}$ of instances assigned to cluster $j$, find $w$ and $c$ that solve:


$$
\begin{aligned}
\min_{w,\,c} \quad & \| A w - e c \|_2^2 \\
\text{s.t.} \quad & w^{T} w = 1.
\end{aligned} \qquad (16)
$$

Indeed, it is not necessary to solve (16) using traditional methods. The authors show that a solution $(w_j^*, c_j^*)$ can be found by letting $w_j^*$ be the eigenvector corresponding to the smallest eigenvalue of $(A^{(j)})^{T}(I - e e^{T}/n_j) A^{(j)}$, where $n_j = |A^{(j)}|$ is the number of instances assigned to the cluster, and then calculating $c_j^* = e^{T} A^{(j)} w_j^* / n_j$.
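This closed-form computation is easy to carry out numerically. The following NumPy sketch assumes the rows of the matrix hold the instances currently assigned to one cluster; the example data is made up.

```python
import numpy as np

def best_cluster_plane(A):
    """Given the instances A (n_j x m) assigned to one cluster, return the plane (w, c)
    minimizing ||A w - e c||_2^2 subject to w'w = 1, as in (16): w is the eigenvector of
    (A - mean)'(A - mean) with the smallest eigenvalue, and c = e'A w / n_j."""
    n_j = A.shape[0]
    A_centered = A - A.mean(axis=0)        # equivalent to (I - e e'/n_j) A
    M = A_centered.T @ A_centered
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues returned in ascending order
    w = eigvecs[:, 0]                      # eigenvector of the smallest eigenvalue (unit norm)
    c = (A @ w).sum() / n_j                # c = e' A w / n_j
    return w, c

# Example usage: points scattered around the plane x1 + x2 = 1.
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
A = np.hstack([t, 1 - t]) + 0.01 * rng.normal(size=(100, 2))
w, c = best_cluster_plane(A)
```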

We note from the formulations above that the clustering problem can be formulated as both an integer program, as in (13), and a continuous program, as in (14). This implies that a wide array of optimization techniques is applicable to the problem. For example, Kroese et al. (2004) recently used the cross-entropy method to solve both discrete and continuous versions of the problem. They show that although the cross-entropy method is more time consuming than traditional heuristics such as k-means, the quality of the results is significantly better.

As noted by both the early work in this area (Vinod, 1969; Rao, 1971) and by more recent authors (e.g., Shmoys, 1999), when a clustering is defined by the cluster centers the clustering problem is closely related to well-known optimization problems related to set covering. In particular, the problem of locating the best clusters mirrors problems in facility location (Shmoys et al., 1997), and specifically the k-center and k-median problems. In many cases, results obtained for these problems could be directly applied to clustering in data mining. For example, Hochbaum and Shmoys (1985) proposed the following approximation algorithm for the k-center problem. Starting with any point, find the point furthest from it, then the point furthest from the first two, and so forth, until k points have been selected. The authors use duality theory to show that the performance of this heuristic is no worse than twice the performance of the optimal solution. Interpreting the points as centers, Dasgupta (2002) notes that this approach can be used directly for partitional clustering. The author also develops an extension to hierarchical clustering and derives a similar performance bound for the clustering performance of every level of the hierarchy.
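The farthest-point heuristic just described is simple to state in code. The following sketch picks the k centers greedily; the Euclidean distance, the starting point, and the final assignment step are illustrative choices.

```python
import numpy as np

def farthest_point_centers(X, k, start=0):
    """Greedy 2-approximation for the k-center problem: repeatedly pick the point
    farthest from the centers chosen so far (Hochbaum-Shmoys style heuristic)."""
    centers = [start]
    # Distance from every point to its nearest chosen center so far.
    d = np.linalg.norm(X - X[start], axis=1)
    for _ in range(k - 1):
        nxt = int(d.argmax())               # point farthest from the current centers
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    # Interpret the selected points as centers and assign every point to its closest one.
    labels = np.array([np.argmin([np.linalg.norm(x - X[c]) for c in centers]) for x in X])
    return centers, labels

# Example usage.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
centers, labels = farthest_point_centers(X, k=4)
```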

It should be noted that this section has focused on one particular type of clustering, namely partitional clustering. For many of the optimization formulations it is further assumed that each cluster is defined by its center, although (13) does not make this assumption and other formulations from, for example, Rao (1971) and Kroese et al. (2004) are more flexible. Several other approaches exist for clustering, and much research on clustering for data mining has focused on the scalability of clustering algorithms (see e.g., Bradley et al., 1998; Ng and Han, 1994; Zhang et al., 1996). As another example of a partitional clustering method, the well-known EM algorithm assumes instances are drawn from one of k distributions and estimates the parameters of these distributions as well as the cluster assignments (Lauritzen, 1995). There have also been numerous hierarchical clustering algorithms proposed, which create a hierarchy, such as a dendrogram, that shows the relationship between all the instances (Kaufman and Rousseeuw, 1990). Such clustering may be either agglomerative, where instances are initially assigned to singleton clusters and are then merged based on some measure (such as the well-known single-link and complete-link algorithms), or divisive, where the instances are all assigned to one cluster that is then split iteratively. For hierarchical clustering there is no single clustering that is optimal, although one may ask whether it is possible to find a hierarchy that is optimal at every level and extend known partitional clustering results (Dasgupta, 2002).

Another clustering problem where optimization methods have been successfully applied is sequential clustering (Hwang, 1981). This problem is equivalent to the partitional clustering problem described above, except that the instances are ordered and this order must be maintained in the clustering. Hwang (1981) shows how to find optimal clusters for this problem, and in more recent work Joseph and Bryson (1997) show how to find so-called w-efficient solutions to sequential clustering using linear programming. Such solutions have a satisfactory rate of improvement in cluster compactness as the number of clusters increases.

Although it is well known that the clustering problem can be formulated as an optimization problem, the increased popularity of data mining does not appear to have resulted in a resurgence of interest in using optimization to cluster data comparable to that seen for classification. For example, a great deal of work has been done on closely related problems such as the set covering, k-center, and k-median problems, but relatively little has been done to investigate the potential impact of this work on the data mining task of clustering. Since the clustering problem can be formulated as both a continuous mathematical programming problem and a combinatorial optimization problem, a host of OR tools is applicable to this problem, including both mathematical programming (see Section 2.1) and metaheuristics (see Section 2.2), and most of this potential has not yet been addressed. Other fundamental issues, such as how to define a set of clusters and measure cluster quality, are also still unresolved in the larger clustering community, and it is our belief that the OR community could continue to make very significant contributions by examining these issues.

3.4. Association rule mining

Clustering finds previously unknown patterns along the instance dimension. Another unsupervised learning approach is association rule discovery, which aims to discover interesting correlations or other relationships among the attributes (Agrawal et al., 1993). Association rule mining was originally used for market basket analysis, where items are articles in the customer's shopping cart and the supermarket manager is looking for associations among these purchases. Basket data stores the items purchased on a per-transaction basis. The questions addressed by market basket analysis include how to boost the sales of a given product, which other products are affected by discontinuing a product, and which products should be shelved together. Being derived from market basket analysis, association rule discovery uses the terminology of an item, which is simply an attribute–value pair, and an item set, which simply refers to a set of such items.

With this terminology the process of association rule mining can be described as follows. Let $I = \{1, 2, \ldots, q\}$ be the set of all items and let $T = \{1, 2, \ldots, n\}$ be the set of transactions or instances in a database. An association rule R is an expression $A \Rightarrow B$, where the antecedent (A) and consequent (B) are both item sets, that is, $A, B \subseteq I$. Each rule has an associated support and confidence. The support $\sup(R)$ of the rule is the number or percentage of instances in T containing both A and B, and this is also referred to as the coverage of the rule. The confidence of the rule R is given by

$$
\text{conf}(R) = \frac{\sup(A \cup B)}{\sup(A)}, \qquad (17)
$$

which is the conditional probability that an instance contains item set B given that it contains item set A.
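To make these definitions concrete, the following toy computation (with made-up transactions) evaluates the support and the confidence (17) of the rule {bread} ⇒ {butter}:

```python
# Toy market-basket data; items are simple strings.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "milk", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the item set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

A, B = {"bread"}, {"butter"}
sup_rule = support(A | B, transactions)          # sup(A ∪ B)
conf_rule = sup_rule / support(A, transactions)  # conf(R) = sup(A ∪ B) / sup(A), eq. (17)
print(sup_rule, conf_rule)                       # 0.6 and 0.75
```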


Confidence is also called the accuracy of the rule. Support and confidence are typically modeled as constraints for association rule mining, where users specify a minimum support $\sup_{\min}$ and a minimum confidence $\text{conf}_{\min}$ according to their preferences. An item set is called a frequent item set if its support is greater than this minimum support threshold.

Apriori is the best-known and original algorithm for association rule discovery (Agrawal and Srikant, 1994). The idea behind the apriori algorithm is that if an item set is not frequent in the database then no superset of this item set is frequent in the same database. There are two phases to the inductive learning: (a) first find all frequent item sets, and (b) then generate high confidence rules from those sets. The apriori algorithm generates all frequent item sets by making multiple passes over the data. In the first pass it determines whether the 1-item sets are frequent or not according to their support. In each subsequent pass it starts with those item sets found to be frequent in the previous pass. It uses these frequent item sets as seed sets to generate super item sets, called candidate item sets, by adding only one more item. If a candidate item set meets the minimum support then it is actually frequent. After the frequent item sets have been generated, the algorithm checks all single-consequent rules for each final frequent item set. Only those single-consequent rules that meet the minimum confidence level are used to build up two-consequent candidate rules, and if those two-consequent rules meet the minimum confidence level they are in turn used to build up three-consequent rules, and so on.
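The level-wise frequent item set generation phase can be sketched compactly; rule generation is omitted here. The transaction format and the minimum support value are illustrative, not taken from the cited work.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise (apriori-style) generation of all frequent item sets.
    transactions: list of sets of items; min_support: minimum fraction of transactions."""
    n = len(transactions)
    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Pass 1: frequent 1-item sets.
    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_support]
    all_frequent = list(frequent)

    size = 2
    while frequent:
        # Candidate generation: unions of frequent item sets that give size-item sets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == size}
        # Prune candidates with any infrequent subset, then check support on the data.
        frequent = [c for c in candidates
                    if all(frozenset(s) in all_frequent for s in combinations(c, size - 1))
                    and sup(c) >= min_support]
        all_frequent.extend(frequent)
        size += 1
    return all_frequent

# Example usage on a small made-up transaction database.
transactions = [{"bread", "butter", "milk"}, {"bread", "butter"},
                {"bread", "jam"}, {"milk", "jam"}, {"bread", "milk", "butter", "jam"}]
print(apriori_frequent_itemsets(transactions, min_support=0.4))
```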

Even after limiting the possible rules to those that meet minimum support and confidence, there is usually a very large number of rules generated, most of which are not valuable. Determining how to select the most important set of association rules is therefore an important issue in association rule discovery, and here optimization can be applied. This issue has received moderate consideration in recent years. Most of this research focuses on using the two metrics of support and confidence of association rules. The idea of optimized association rules was first introduced by Fukuda et al. (1996). In this work, an association rule R is of the form $(A \in [v_1, v_2]) \wedge C_1 \Rightarrow C_2$, where A is a numeric attribute, $v_1$ and $v_2$ define a range of attribute A, and $C_1$ and $C_2$ are two normal attributes. The optimized association rules problem can then be divided into two sub-problems.

On the one hand, the support of the antecedent can be maximized subject to the confidence of the rule R meeting the minimum confidence. Thus, the optimized support rule can be formulated as follows:

$$
\begin{aligned}
\max_{R} \quad & \sup\{ (A \in [v_1, v_2]) \wedge C_1 \} \\
\text{s.t.} \quad & \text{conf}(R) \ge \text{conf}_{\min}.
\end{aligned} \qquad (18)
$$

Alternatively, the confidence of the rule R can be maximized subject to the support of the antecedent being greater than the minimum support. Thus, the optimized confidence rule can be formulated as follows:

$$
\begin{aligned}
\max_{R} \quad & \text{conf}\{ (A \in [v_1, v_2]) \wedge C_1 \} \\
\text{s.t.} \quad & \sup(R) \ge \sup_{\min}.
\end{aligned} \qquad (19)
$$
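For small data sets, the optimized support rule (18) can be found by simple enumeration over candidate intervals, which makes the problem concrete even though it does not scale. In the sketch below, the numeric attribute, the conditions C1 and C2, and the confidence threshold are all invented for illustration; the efficient algorithms of the cited papers are not reproduced here.

```python
import numpy as np

def optimized_support_rule(a, c1, c2, conf_min):
    """Brute-force version of (18): find the interval [v1, v2] of numeric attribute a
    maximizing sup{(a in [v1, v2]) and C1} subject to conf(R) >= conf_min,
    where the rule is (a in [v1, v2]) and C1 => C2."""
    values = np.unique(a)
    best = None
    for i, v1 in enumerate(values):
        for v2 in values[i:]:
            antecedent = (a >= v1) & (a <= v2) & c1
            sup_ant = antecedent.mean()
            if sup_ant == 0:
                continue
            conf = (antecedent & c2).mean() / sup_ant   # conf(R) per eq. (17)
            if conf >= conf_min and (best is None or sup_ant > best[0]):
                best = (sup_ant, conf, (v1, v2))
    return best

# Toy example: a = customer age, C1 = "owns a credit card", C2 = "responded to offer".
rng = np.random.default_rng(0)
age = rng.integers(18, 70, size=200)
c1 = rng.random(200) < 0.6
c2 = c1 & (age > 30) & (age < 55) & (rng.random(200) < 0.8)
print(optimized_support_rule(age, c1, c2, conf_min=0.7))
```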

Rastogi and Shim (1998) presented an optimized association rules problem that allows an arbitrary number of uninstantiated categorical and numeric attributes. Moreover, the authors proposed using branch-and-bound and graph-search pruning techniques to reduce the search space. In later work, Rastogi and Shim (1999) extended this work to the optimized support association rule problem for numeric attributes by allowing rules to contain disjunctions over uninstantiated numeric attributes. Dynamic programming is used to generate the optimized support rules, and a bucketing technique and divide-and-conquer strategy are employed to improve the algorithm's efficiency.

By combining concerns with both support and confidence, Brin et al. (2000) proposed optimizing the gain of the rule, where gain is defined as

$$
\text{gain}(R) = \sup\{ (A \in [v_1, v_2]) \wedge C_1 \} - \text{conf}_{\min} \cdot \sup(C_1) = \sup(R) \cdot (\text{conf}(R) - \text{conf}_{\min}). \qquad (20)
$$

The optimization problem maximizes the gain subject to the minimum support and confidence constraints:

$$
\begin{aligned}
\max \quad & \text{gain}(R) \\
\text{s.t.} \quad & \sup(R) \ge \sup_{\min}, \\
& \text{conf}(R) \ge \text{conf}_{\min}.
\end{aligned} \qquad (21)
$$

Although there exists considerable research on the issue of generating good association rules, support and confidence are not always sufficient measures for identifying insightful and interesting rules. Maximizing support may lead to finding many trivial rules that are not valuable for decision making. On the other hand, maximizing confidence may lead to finding overly specific rules that are also not useful for decision making. Thus, relevant measures and factors for identifying good association rules need more exploration. In addition, how to optimize the rules that have been obtained is an interesting and challenging issue that has not received much attention. Both of these areas are issues where the OR community can contribute in a significant way to the field of association rule discovery.

4. Applications in the management of electronic services

Many of the most important applications of data mining come in areas related to the management of electronic services. In such applications, automatic data collection and storage is often cheap and straightforward, which generates large databases to which data mining can be applied. In this section we consider two such application areas, namely customer relationship management and personalization. We choose these applications because of their importance and the usefulness of data mining for their solution, as well as for the relatively unexplored potential for incorporating optimization technology to enable and explain the data mining results.

4.1. Customer relationship management

Relationship marketing and customer relationship management (CRM) in general have become central business issues. With more intense competition in many mature markets, companies have realized that developing relationships with more profitable customers is a critical factor in staying in the market. Thus, CRM techniques have been developed that afford new opportunities for businesses to act well in a relationship market. The focus of CRM is on the customer and the potential for increasing revenue, and in doing so it enhances the ability of a firm to compete and to retain key customers.

The relationship between a business and its customers can be described as follows. A customer purchases products and services, while the business markets, sells, provides, and services those customers. Generally, there are three ways for a business to increase the value of customers:

• increase customers' usage of (or purchases of) the products or services they already have;

• sell customers more, or more profitable, products;

• keep customers for a longer time.

A valuable customer is usually not static, and the relationship evolves and changes over time. Thus, understanding this relationship is a crucial part of CRM. This can be achieved by analyzing the customer life-cycle, or customer lifetime, which refers to the various stages of the relationship between customer and business. A typical customer life-cycle is shown in Fig. 3.

Fig. 3. Illustration of a customer life-cycle. This figure is adapted from Berry and Linoff (2000).

First, acquisition campaigns are marketing campaigns that are directed to the target market and seek to interest prospects in a company's product or service. If prospects respond to the company's inquiry then they become respondents. Responders become established customers when the relationship between them and the company has been established; for example, they have made an initial purchase or their application for a certain credit card has been approved. At this point, the company gains revenue from customer usage. Furthermore, customers' value can be increased not only by cross-selling, which encourages customers to buy more products or services, but also by up-selling, which encourages customers to upgrade existing products and services. On the other hand, at some point established customers stop being customers (churn). There are two different types of churn. The first is voluntary churn, which means that established customers choose to stop being customers. The other type is forced churn, which refers to those established customers who are no longer good customers and whose relationship the company cancels. The main purpose of CRM is to maximize customers' value throughout the life-cycle.

With the large volumes of data generated in CRM, data mining plays a leading role in the overall CRM process (Rud and Brohman, 2003; Shaw et al., 2001). In acquisition campaigns, data mining can be used to profile people who have responded to previous similar campaigns, and these data mining profiles are helpful for finding the best customer segments that the company should target (Adomavicius and Tuzhilin, 2003). Another application is to look for prospects who have behavior patterns similar to those of today's established customers. In responding campaigns, data mining can be applied to determine which prospects will become responders and which responders will become established customers. Established customers are also a significant area for data mining. Identifying customer behavior patterns from customer usage data and predicting which customers are likely to respond to cross-sell and up-sell campaigns is very important to the business (Chiang and Lin, 2000). Regarding former customers, data mining can be used to analyze the reasons for churn and to predict churn (Chiang et al., 2003).

Optimization also plays an important role in CRM, in particular in determining how to develop a proactive customer interaction strategy to maximize customer lifetime value. A customer is profitable if the revenue from this customer exceeds the company's cost to attract, sell to, and service this customer. This excess is called the customer lifetime value (LTV). In other words, LTV is the total value to be gained while the customer is still active, and it is one of the most important metrics in CRM. Much research has been done on modeling LTV using OR techniques (Schmittlein et al., 1987; Blattberg and Deighton, 1991; Dreze and Bonfrer, 2002; Ching et al., 2004).

Even when the LTV model can be formulated, it is difficult to find the optimal solution in the presence of great volumes of data. Some researchers have addressed this by using data mining to find the optimal parameters for the LTV model. For example, Rosset et al. (2002) formulate the following LTV model:

$$
\text{LTV} = \int_0^{\infty} S(t)\, v(t)\, D(t)\, \mathrm{d}t, \qquad (22)
$$

where v(t) describes the customer's value over time, S(t) describes the probability that the customer is still active at time t, and D(t) is a discounting factor. Data mining is then employed to estimate the customer's future revenue value and the customer's churn probability over time from current data. This problem is difficult in practice, however, due to the large volumes of data involved. Padmanabhan and Tuzhilin (2003) presented two directions for reducing the complexity of the LTV optimization problem. One direction is to find good heuristics to improve LTV values, and the other is to optimize some simpler performance measures that are related to LTV. Regarding the latter direction, the authors pointed out that data mining and optimization can be integrated to build customer profiles, which is critical in many CRM applications. Data mining is first used to discover customer usage patterns and rules, and optimization is then employed to select a small number of the best patterns from the previously discovered rules. Finally, according to the customer profiles, the company can target and spend money on those customers who are likely to respond, within its budget.
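As a purely illustrative numerical check of (22), suppose retention follows an exponential law $S(t) = e^{-\lambda t}$, the value rate is constant, $v(t) = v$, and discounting is continuous, $D(t) = e^{-rt}$; these functional forms are assumptions for the example, not taken from the cited papers. Under these assumptions the integral has the closed form $v/(\lambda + r)$, which the sketch below verifies numerically.

```python
import numpy as np
from scipy.integrate import quad

# Illustrative assumptions (not from the cited papers):
lam = 0.3    # churn (hazard) rate, so S(t) = exp(-lam * t)
v   = 100.0  # constant customer value per unit time, v(t) = v
r   = 0.1    # continuous discount rate, D(t) = exp(-r * t)

# Numerical evaluation of (22): LTV = integral of S(t) v(t) D(t) dt from 0 to infinity.
ltv_numeric, _ = quad(lambda t: np.exp(-lam * t) * v * np.exp(-r * t), 0, np.inf)
ltv_closed_form = v / (lam + r)        # exact value under these particular assumptions
print(ltv_numeric, ltv_closed_form)    # both equal 250.0
```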

Campaign optimization is another problem where a combination of data mining and operations research can be applied. In the campaign optimization process a company needs to determine which kinds of offers should go to which segments of customers or prospects through which communication channels. Vercellis (2002) presents a two-stage campaign optimization model that combines data mining technology with an optimization strategy. In the first stage, optimization sub-problems are solved for each campaign and customers are segmented by their scores. In the second stage, a mixed integer optimization model is formulated to solve the overall campaign optimization problem based on the customer segmentation and the limited available resources.
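The following is not Vercellis's actual model, but a minimal sketch in the same spirit: binary variables decide which campaign, if any, each customer segment receives, maximizing a total response score (which in practice would come from data mining models) subject to a budget. The scores, costs, budget, and SciPy MILP solver (SciPy 1.9+) are illustrative choices.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Illustrative data: score[s, c] = expected response of segment s to campaign c,
# cost[s, c] = cost of targeting segment s with campaign c, and an overall budget.
rng = np.random.default_rng(0)
n_seg, n_camp = 5, 3
score = rng.random((n_seg, n_camp))
cost = rng.uniform(1, 4, (n_seg, n_camp))
budget = 8.0

c = -score.ravel()   # milp minimizes, so negate the scores; x[s, c] flattened row-major
# Each segment receives at most one campaign.
A_one = np.kron(np.eye(n_seg), np.ones((1, n_camp)))
# Total cost stays within the budget.
A_budget = cost.ravel()[None, :]

res = milp(c,
           constraints=[LinearConstraint(A_one, 0, 1),
                        LinearConstraint(A_budget, 0, budget)],
           integrality=np.ones_like(c),
           bounds=Bounds(0, 1))
plan = res.x.reshape(n_seg, n_camp).round().astype(int)
print(plan)   # 0/1 matrix: which campaign (if any) targets each segment
```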

As described above, numerous CRM problems can be formulated as optimization problems, but the very large volumes of data involved may make these problems difficult to solve. Combining the optimization problem with data mining is valuable in this context. For example, data mining can be used to identify insightful patterns from customer data, and these patterns can then be used to identify more relevant constraints for the optimization models. In addition, data mining can be applied to reduce the search space and improve computing time. Thus, investigating how to combine optimization and data mining to address CRM problems is a promising research area for the operations research community.

4.2. Personalization

Personalization is the ability to provide content and services tailored to individuals on the basis of knowledge about their preferences and behavior. Data mining research related to personalization has focused mostly on recommender systems and related issues such as collaborative filtering, and recommender systems have been investigated intensively in the data mining community (Breese et al., 1998; Geyer-Schulz and Hahsler, 2002; Lieberman, 1997; Lin et al., 2000). Such systems can be categorized into three groups: content-based systems, social data mining, and collaborative filtering. Content-based systems use exclusively the preferences of the user receiving the recommendation (Hill et al., 1995). These preferences are learned through implicit or explicit user feedback and are typically represented as a profile for the user. Recommenders based on social data mining consider data sources created by groups of people as part of their daily activities and mine this data for potentially useful information. However, the recommendations of social data mining systems are usually not personalized but rather broadcast to the entire user community. Such personalization is, on the other hand, achieved by collaborative filtering (Resnick et al., 1994; Shardanan and Maes, 1995; Good et al., 1999), which matches users with similar interests and uses the preferences of these users to make recommendations.

As argued in Adomavicius and Tuzhilin (2003), the recommendation problem can be formulated as an optimization problem that selects the best items to recommend to a user. Specifically, given a set U of users and a set V of items, a ratings function $f : U \times V \to \mathbb{R}$ can be defined to specify how much each user $u \in U$ likes each item $v \in V$. The recommendation problem can then be formulated as the following optimization problem:

$$
\begin{aligned}
\max \quad & f(u, v) \\
\text{subject to} \quad & u \in U, \\
& v \in V.
\end{aligned} \qquad (23)
$$

The challenge in this problem is that the rating function can usually only be partially specified, that is, not all entries in the matrix $\{f(u, v)\}_{u \in U, v \in V}$ have known ratings. Therefore, it is necessary to specify how unknown ratings should be estimated from the set of previously specified ratings (Padmanabhan and Tuzhilin, 2003). Numerous methods have been developed for estimating these ratings, and Pazzani (1999) and Adomavicius and Tuzhilin (2003) describe some of them. Once the optimization problem is defined, data mining can contribute to its solution, for example by learning additional constraints from the data. For more on OR in personalization we refer the reader to Murthi and Sarkar (2003), but simply note that this area has a wealth of opportunities for the OR community to contribute.
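A minimal user-based collaborative filtering sketch illustrates one common way to estimate the unknown ratings and then pick the best item for a user, in the spirit of (23). The tiny rating matrix, the cosine similarity over co-rated items, and the weighted-average predictor are all illustrative choices rather than any specific published method.

```python
import numpy as np

def recommend(R, user):
    """R: (users x items) rating matrix with np.nan for unknown ratings.
    Estimate the target user's unknown ratings as a similarity-weighted average of the
    other users' ratings, then return the best unrated item (cf. (23))."""
    known = ~np.isnan(R)
    filled = np.where(known, R, 0.0)
    # Cosine similarity between the target user and every other user (on co-rated items).
    sims = np.zeros(R.shape[0])
    for u in range(R.shape[0]):
        common = known[user] & known[u]
        if u != user and common.any():
            a, b = R[user, common], R[u, common]
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            sims[u] = a @ b / denom if denom > 0 else 0.0
    # Predicted rating for each item = similarity-weighted average of the known ratings.
    weights = sims[:, None] * known
    pred = (sims[:, None] * filled).sum(axis=0) / np.maximum(weights.sum(axis=0), 1e-12)
    pred[known[user]] = -np.inf        # only recommend items the user has not yet rated
    return int(np.argmax(pred)), pred

# Example usage with a tiny made-up rating matrix (rows: users, columns: items).
R = np.array([[5, 4, np.nan, 1],
              [4, np.nan, 4, 1],
              [1, 2, np.nan, 5],
              [np.nan, 5, 4, np.nan]], dtype=float)
item, scores = recommend(R, user=0)
```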

5. Conclusions

As illustrated in this paper, the OR community has over the past several years made highly significant contributions to the growing field of data mining. The existing contributions of optimization methods in data mining touch on almost every part of the data mining process, from data visualization and preprocessing, to inductive learning, to selecting the best model after learning. Furthermore, data mining can be helpful in many OR application areas and can be used in a complementary way to optimization methods to identify constraints and reduce the search space.

Although a large volume of work already exists covering the intersection of OR and data mining, we feel that the current work is only the beginning. Interest in data mining continues to grow in both academia and industry, and most data mining issues where there is the potential to use optimization methods still require significantly more research. This is clearly being addressed at the present time, as interest in data mining within the OR community is growing, and we hope that this survey helps to further motivate more researchers to contribute to this exciting field.

References

Abbiw-Jackson, R., Golden, B., Raghavan, S., Wasil, E., 2006. A divide-and-conquer local search heuristic for data visualization. Computers and Operations Research 33, 3070–3087.

Adomavicius, G., Tuzhilin, A., 2003. Recommendation technologies: Survey of current methods and possible extensions. Working paper, Stern School of Business, New York University, New York.

Agrawal, R., Imielinski, T., Swami, A., 1993. Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 207–216.

Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Proceedings of the 1994 International Conference on Very Large Data Bases (VLDB'94), pp. 487–499.

Bennett, K.P., 1992. Decision tree construction via linear programming. In: Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society Conference, Utica, IL, pp. 97–101.

Bennett, K.P., Bredensteiner, E., 1999. Multicategory classification by support vector machines. Computational Optimization and Applications 12, 53–79.

Bennett, K.P., Campbell, C., 2000. Support vector machines: Hype or hallelujah? SIGKDD Explorations 2 (2), 1–13.

Berry, M., Linoff, G., 2000. Mastering Data Mining. Wiley, New York.

Blattberg, R., Deighton, J., 1991. Interactive marketing: Exploiting the age of addressability. Sloan Management Review, 5–14.

Borg, I., Groenen, P., 1997. Modern Multidimensional Scaling: Theory and Applications. Springer, New York.

Boros, E., Hammer, P.L., Ibaraki, T., Kogan, A., Mayoraz, E., Muchnik, I., 2000. Implementation of logical analysis of data. IEEE Transactions on Knowledge and Data Engineering 12 (2), 292–306.

Bradley, P.S., Fayyad, U.M., Mangasarian, O.L., 1999. Mathematical programming for data mining: Formulations and challenges. INFORMS Journal on Computing 11, 217–238.

Bradley, P.S., Fayyad, U.M., Reina, C., 1998a. Scaling clustering algorithms to large databases. In: Proceedings of the ACM Conference on Knowledge Discovery in Databases, pp. 9–15.

Bradley, P.S., Mangasarian, O.L., 2000. k-Plane clustering. Journal of Global Optimization 16 (1), 23–32.

Bradley, P.S., Mangasarian, O.L., Street, N., 1996. Clustering via concave minimization. In: Advances in Neural Information Processing Systems, vol. 9. MIT Press, Cambridge, MA, pp. 368–374.

Bradley, P.S., Mangasarian, O.L., Street, W.N., 1998b. Feature selection via mathematical programming. INFORMS Journal on Computing 10 (2), 209–217.

Breese, J., Heckerman, D., Kadie, C., 1998. Empirical analysis of predictive algorithms for collaborative filtering. In: Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence.


Breiman, L., Friedman, J., Olshen, R., Stone, C., 1984. Classification and Regression Trees. Wadsworth International Group, Monterey, CA.

Brin, S., Rastogi, R., Shim, K., 2000. Mining optimized gain rules for numeric attributes. IEEE Transactions on Knowledge and Data Engineering 15 (2), 324–338.

Burges, C.J.C., 1998. A tutorial on support vector machines for pattern recognition. Knowledge Discovery and Data Mining 2 (2), 121–167.

Castillo, E., Gutierrez, J.M., Hadi, A.S., 1997. Expert Systems and Probabilistic Network Models. Springer, Berlin.

Chiang, I., Lin, T., 2000. Using rough sets to build up web-based one-to-one customer services. IEEE Transactions.

Chiang, D., Lin, C., Lee, S., 2003. Customer relationship management for network banking churn analysis. In: Proceedings of the International Conference on Information and Knowledge Engineering, Las Vegas, NV, pp. 135–141.

Ching, W., Wong, K., Altman, E., 2004. Customer lifetime value: Stochastic optimization approach. Journal of the Operational Research Society 55 (8), 860–868.

Cooper, G.F., Herskovits, E., 1992. A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9, 309–347.

Cortes, C., Vapnik, V., 1995. Support vector networks. Machine Learning 20, 273–297.

Dasgupta, S., 2002. Performance guarantees for hierarchical clustering. In: Proceedings of the 15th Annual Conference on Computational Learning Theory, pp. 351–363.

Debuse, J.C., Rayward-Smith, V.J., 1997. Feature subset selection within a simulated annealing data mining algorithm. Journal of Intelligent Information Systems 9, 57–81.

Debuse, J.C., Rayward-Smith, V.J., 1999. Discretisation of continuous commercial database features for a simulated annealing data mining algorithm. Applied Intelligence 11, 285–295.

Dhar, V., Chou, D., Provost, F., 2000. Discovering interesting patterns for investment decision making with GLOWER – a genetic learner overlaid with entropy reduction. Data Mining and Knowledge Discovery 4, 251–280.

Dreze, X., Bonfrer, A., 2002. To pester or leave alone: Lifetime value maximization through optimal communication timing. Working paper, Marketing Department, University of California, Los Angeles, CA.

Estevill-Castro, V., 2002. Why so many clustering algorithms – a position paper. SIGKDD Explorations 4 (1), 65–75.

Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., 1996. Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge, MA.

Felici, G., Truemper, K., 2002. A MINSAT approach for learning in logic domains. INFORMS Journal on Computing 14, 20–36.

Folino, G., Pizzuti, C., Spezzano, G., 2001. Parallel genetic programming for decision tree induction. In: Proceedings of the 13th International Conference on Tools with Artificial Intelligence, pp. 129–135.

Freed, N., Glover, F., 1986. Evaluating alternative linear programming models to solve the two-group discriminant problem. Decision Sciences 17, 151–162.

Friedman, N., Geiger, D., Goldszmidt, M., 1997. Bayesian network classifiers. Machine Learning 29, 131–163.

Fu, Z., Golden, B., Lele, S., Raghavan, S., Wasil, E., 2003a. A genetic algorithm-based approach for building accurate decision trees. INFORMS Journal on Computing 15, 3–22.


Fu, Z., Golden, B., Lele, S., Raghavan, S., Wasil, E., 2003b. Genetically engineered decision trees: Population diversity produces smarter trees. Operations Research 51 (6), 894–907.

Fu, Z., Golden, B., Lele, S., Raghavan, S., Wasil, E., 2006. Diversification for smarter trees. Computers and Operations Research 33, 3185–3202.

Fukuda, T., Morimoto, Y., Morishita, S., Tokuyama, T., 1996. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. In: Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 13–23.

Geyer-Schulz, A., Hahsler, M., 2002. Evaluation of recommender algorithms for an internet information broker based on simple association rules and the repeat-buying theory. In: Proceedings of WEBKDD'02, pp. 100–114.

Glover, F., 1990. Improved linear programming models for discriminant analysis. Decision Sciences 21 (4), 771–785.

Glover, F., Laguna, M., 1997. Tabu Search. Kluwer Academic, Boston.

Glover, F., Laguna, M., Marti, R., 2003. Scatter search. In: Tsutsui, Ghosh (Eds.), Theory and Applications of Evolutionary Computation: Recent Trends. Springer, Berlin, pp. 519–528.

Glover, F., Kochenberger, G.A., 2003. Handbook of Metaheuristics. Kluwer Academic Publishers, Boston, MA.

Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.

Good, N., Schafer, J.B., Konstan, J.A., Borchers, A., Sarwar, B., 1999. Combining collaborative filtering with personal agents for better recommendations. In: Proceedings of the National Conference on Artificial Intelligence.

Grabmeier, J., Rudolph, A., 2002. Techniques of cluster algorithms in data mining. Data Mining and Knowledge Discovery 6, 303–360.

Hansen, P., Mladenovic, N., 1997. An introduction to variable neighborhood search. In: Voss, S., et al. (Eds.), Proceedings of the MIC 97 Conference.

Heckerman, D., Geiger, D., Chickering, D.M., 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20 (3), 197–243.

Hill, W.C., Stead, L., Rosenstein, M., Furnas, G., 1995. Recommending and evaluating choices in a virtual community of use. In: Proceedings of the CHI'95 Conference on Human Factors in Computing Systems, pp. 194–201.

Hochbaum, D., Shmoys, D., 1985. A best possible heuristic for the k-center problem. Mathematics of Operations Research 10, 180–184.

Holland, J.H., 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press.

Hwang, F., 1981. Optimal partitions. Journal of Optimization Theory and Applications 34, 1–10.

Jain, A.K., Murty, M.N., Flynn, P.J., 1999. Data clustering: A review. ACM Computing Surveys 31, 264–323.

Jensen, F.V., 1996. An Introduction to Bayesian Networks. UCL Press Limited, London.

Joseph, A., Bryson, N., 1997. W-efficient partitions and the solution of the sequential clustering problem. Annals of Operations Research: Nontraditional Approaches to Statistical Classification 74, 305–319.

Kaufman, L., Rousseeuw, P.J., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.


Kennedy, H., Chinniah, C., Bradbeer, P., Morss, L., 1997. The construction and evaluation of decision trees: A comparison of evolutionary and concept learning methods. In: Come, D., Shapiro, J. (Eds.), Evolutionary Computing, Lecture Notes in Computer Science. Springer, Berlin, pp. 147–161.

Kim, J., Olafsson, S., 2004. Optimization-based data clustering using the nested partitions method. Working paper, Department of Industrial and Manufacturing Systems Engineering, Iowa State University.

Kim, Y., Street, W.N., Menczer, F., 2000. Feature selection in unsupervised learning via evolutionary search. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-00), pp. 365–369.

Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P., 1983. Optimization by simulated annealing. Science 220, 671–680.

Kroese, D.P., Rubinstein, R.Y., Taimre, T., 2004. Application of the cross-entropy method to clustering and vector quantization. Working paper.

Lam, W., Bacchus, F., 1994. Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence 10, 269–293.

Larranaga, P., Poza, M., Yurramendi, Y., Murga, R., Kuijpers, C., 1996. Structure learning of Bayesian networks by genetic algorithms: A performance analysis of control parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (9), 912–926.

Lauritzen, S.L., 1995. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis 19, 191–201.

Lee, J.-Y., Olafsson, S., 2006. Multiattribute decision trees and decision rules. In: Triantaphyllou, Felici (Eds.), Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques, pp. 327–358.

Li, X., Olafsson, S., 2005. Discovering dispatching rules using data mining. Journal of Scheduling 8 (6), 515–527.

Lieberman, H., 1997. Autonomous interface agents. In: Proceedings of the CHI'97 Conference on Human Factors in Computing Systems, pp. 67–74.

Lin, W., Alvarez, S.A., Ruiz, C., 2000. Collaborative recommendation via adaptive association rule mining. In: Proceedings of ACM WEBKDD 2000.

Liu, H., Motoda, H., 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer, Boston.

MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297.

Mangasarian, O.L., 1965. Linear and nonlinear separation of patterns by linear programming. Operations Research 13, 444–452.

Mangasarian, O.L., 1994. Misclassification minimization. Journal of Global Optimization 5 (4), 309–323.

Mangasarian, O.L., 1997. Mathematical programming in data mining. Data Mining and Knowledge Discovery 1 (2), 183–201.

Murthi, B.S., Sarkar, S., 2003. The role of the management sciences in research on personalization. Management Science 49 (1), 1344–1362.

Narendra, P.M., Fukunaga, K., 1977. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers 26 (9), 917–922.


Ng, R., Han, J., 1994. Efficient and effective clustering methods for spatial data mining. In: Proceedings of the 1994 International Conference on Very Large Data Bases, pp. 144–155.

Niimi, A., Tazaki, E., 2000. Genetic programming combined with association rule algorithm for decision tree construction. In: Fourth International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, Brighton, UK, pp. 746–749.

Olafsson, S., Yang, J., 2004. Intelligent partitioning for feature selection. INFORMS Journal on Computing 17 (3), 339–355.

Osei-Bryson, K.-M., 2005. Assessing cluster quality using multiple measures – a decision tree based approach. In: Golden, B., Raghavan, S., Wasil, E. (Eds.), The Next Wave in Computing, Optimization, and Decision Technologies. Kluwer.

Padmanabhan, B., Tuzhilin, A., 2003. On the use of optimization for data mining: Theoretical interactions and eCRM opportunities. Management Science 49 (10), 1327–1343.

Pazzani, M., 1999. A framework for collaborative, content-based and demographic filtering. Artificial Intelligence Review, 393–408.

Pernkopf, F., 2005. Bayesian network classifiers versus selective k-NN classifier. Pattern Recognition 38 (1), 1–10.

Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Rao, M.R., 1971. Cluster analysis and mathematical programming. Journal of the American Statistical Association 66, 622–626.

Rastogi, R., Shim, K., 1998. Mining optimized association rules for categorical and numeric attributes. In: Proceedings of the International Conference on Data Engineering.

Rastogi, R., Shim, K., 1999. Mining optimized association support rules for numeric attributes. In: Proceedings of the International Conference on Data Engineering.

Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J., 1994. GroupLens: An open architecture for collaborative filtering of netnews. In: Proceedings of the ACM CSCW'94 Conference on Computer-Supported Cooperative Work, pp. 175–186.

Resende, M.G.C., Ribeiro, C.C., 2003. Greedy randomized adaptive search procedures. In: Kochenberger, Glover (Eds.), Handbook of Metaheuristics. Kluwer, Boston, MA.

Ripley, B.D., 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, UK.

Rosset, S., Neumann, E., Vatnik, Y., 2002. Customer lifetime value modeling and its use for customer retention planning. In: Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining.

Rumelhart, D.E., McClelland, J.L., 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition 1. MIT Press, Cambridge, MA.

Schmittlein, D., Morrison, D., Colombo, R., 1987. Counting your customers: Who are they and what will they do next? Management Science 33 (1), 1–24.

Shardanan, U., Maes, P., 1995. Social information filtering: Algorithms for automating 'word of mouth'. In: Proceedings of the ACM CHI'95 Conference on Human Factors in Computing Systems, pp. 210–217.

Sharpe, P.K., Glover, R.P., 1999. Efficient GA based technique for classification. Applied Intelligence 11, 277–284.

Shaw, M., Subramaniam, C., Tan, G., Welge, M., 2001. Knowledge management and data mining for marketing. Decision Support Systems 31, 127–137.

Shi, L., Olafsson, S., 2000. Nested partitions method for global optimization. Operations Research 48, 390–407.

Shmoys, D.B., 1999. Approximation algorithms for clustering problems. In: Proceedings of the 12th Annual Conference on Computational Learning Theory, pp. 100–101.

Shmoys, D.B., Tardos, E., Aardal, K., 1997. Approximation algorithms for facility location problems. In: Proceedings of the Twenty-Ninth Annual ACM Symposium on the Theory of Computing, pp. 265–274.

Street, W.N., 2005. Multicategory decision trees using nonlinear programming. INFORMS Journal on Computing 17, 25–31.

Suzuki, J., 1993. A construction of Bayesian networks from databases based on an MDL scheme. In: Heckerman, D., Mamdani, A. (Eds.), Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco, CA, pp. 266–273.

Tanigawa, T., Zhao, Q.F., 2000. A study on efficient generation of decision trees using genetic programming. In: Proceedings of the 2000 Genetic and Evolutionary Computation Conference (GECCO'2000), Las Vegas, pp. 1047–1052.

Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer, Berlin.

Vapnik, V., Lerner, A., 1963. Pattern recognition using generalized portrait method. Automation and Remote Control 24, 774–780.

Vercellis, C., 2002. Combining data mining and optimization for campaign management. In: Management Information Systems: Data Mining III, vol. 6. WIT Press, pp. 61–71.

Vinod, H.D., 1969. Integer programming and the theory of grouping. Journal of the American Statistical Association 64, 506–519.

Weiss, S.M., Kulikowski, C.A., 1991. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann.

Yang, J., Honavar, V., 1998. Feature subset selection using a genetic algorithm. In: Motoda, H., Liu, H. (Eds.), Feature Selection, Construction, and Subset Selection: A Data Mining Perspective. Kluwer, New York.

Yang, J., Olafsson, S., 2006. Optimization-based feature selection with adaptive instance sampling. Computers and Operations Research 33, 3088–3106.

Zhang, G.P., 2000. Neural networks for classification: A survey. IEEE Transactions on Systems, Man and Cybernetics Part C – Applications and Reviews 30 (4), 451–461.

Zhang, T., Ramakrishnan, R., Livny, M., 1996. BIRCH: An efficient data clustering method for very large databases. In: SIGMOD Conference, pp. 103–114.
