
Gerstner Laboratory for Intelligent Decision Making and Control

Czech Technical University in Prague

Series of Research Reports

Report No:

GL 157/02

Machine Learning and Data Mining

Jirı [email protected]

http://cyber.felk.cvut.cz/gerstner/reports/GL157.pdf

Gerstner Laboratory, Department of Cybernetics
Faculty of Electrical Engineering, Czech Technical University

Technická 2, 166 27 Prague 6, Czech Republic
tel. (+420-2) 2435 7421, fax: (+420-2) 2492 3677

http://gerstner.felk.cvut.cz

Prague, 2002

ISSN 1213-3000


Contents

1 Introduction

2 Machine Learning
  2.1 Main Machine Learning Methods
    2.1.1 Decision Trees
    2.1.2 Neural Networks
    2.1.3 Bayesian Methods
    2.1.4 Reinforcement Learning
    2.1.5 Inductive Logic Programming
    2.1.6 Case-Based Reasoning
    2.1.7 Support Vector Machines
    2.1.8 Genetic Algorithms
  2.2 Machine Learning and Data Mining
  2.3 Data Preprocessing
  2.4 Learning
  2.5 Testing
  2.6 Results Evaluation & Model Exchange
    2.6.1 Area under ROC Curve
    2.6.2 Predictive Model Markup Language

3 iBARET – Instance-BAsed REasoning Tool
  3.1 iBARET structure
  3.2 CQL Server
  3.3 Consultation
  3.4 Testing Set Evaluation
    3.4.1 Classification Task
    3.4.2 Regression Task
  3.5 IBR Model Tuning
    3.5.1 Sequential Algorithm
    3.5.2 Genetic Algorithm
  3.6 Future Work

4 Experiments

5 Future Research

6 Conclusion

References


List of Figures

1 ROC curve for the example in Table 1 (ROC area 0.8)
2 The iBARET block structure
3 Example of predicted values of the sine function
4 PMML utilization
5 Procedure 6 – iBARET's training error
6 Procedure 6 – iBARET's performance on testing data
7 Procedure 6 – iBARET's performance with a reduced number of patient groups

List of Tables

1 Example – calculating the ROC curve
2 Four-fold table
3 Example of a symbol distance table
4 Example of the second type of symbol distance table
5 Example – RMSE of different LWR methods


1 Introduction

Artificial Intelligence (AI) is the area of computer science focusing on creating machines that can engage in behaviors that humans consider intelligent. The ability to create intelligent machines has attracted humans since ancient times, and today, with the huge expansion of computers and 50 years of research into AI programming techniques, the dream of intelligent machines is becoming a reality.

Machine Learning (ML) is the area of Artificial Intelligence that focuses on developing principles and techniques for automating the acquisition of knowledge. Some machine learning methods can dramatically reduce the cost of developing knowledge-based software by extracting knowledge directly from existing databases. Other machine learning methods enable software systems to improve their performance over time with minimal human intervention. These approaches are expected to enable the development of effective software for autonomous systems that can operate in poorly understood environments.

The aim of this work is to give a short overview of the most frequently used Machine Learning methods and to introduce our research focus. This report can be divided into two main parts.

The first part concentrates mainly on the Machine Learning methods in use. Each important ML method is briefly described and its significance appraised. Then we show the relation to a neighboring branch of artificial intelligence – Data Mining. After that we focus on the problems arising in each phase of the Machine Learning process. Some preprocessing options are shown and a powerful preprocessing tool, SumatraTT, is introduced. Then we discuss problems in the general learning phase and the testing phase, and show possibilities for results evaluation. At the end of the first part, a popular language for predictive model exchange – PMML – is mentioned.

The second part describes our solution for classification and prediction – iBARET. All techniques covered by iBARET are explained, from the consultation process to model tuning by a genetic algorithm. Then we try to sketch the future development of the tool. At the end we show the most recent experiment, performed with iBARET on SPA data.

The final part of this work includes a reflection on possible directions of further research. There we try to find some interesting topics we would like to work on in the frame of a PhD thesis.


2 Machine Learning

The first part of this section contains an overview of machine learning methods. We focus on the basic, frequently used methods and several of the currently most popular ones. After this introduction to Machine Learning (ML) we try to describe the relationship between ML and Data Mining (DM). Then we follow the whole ML/DM process from data preprocessing, through general learning, to testing and results evaluation. For each part of the learning process we try to sketch the problems that can be encountered there.

2.1 Main Machine Learning Methods

This section contains only a very short description of ML methods and their brief characteristics. Most of them are described in more detail in [41, 19].

In recent years, attention has been paid to generating and combining different but still homogeneous classifiers with techniques called bagging, boosting or bootstrapping [11, 17]. They are based on repeated generation of the same type of model over evolving training data. These methods enable a reduction of model variance. The authors prove that they cannot be applied to the instance-based learning cycle due to the method's stability with respect to perturbations of the data.

2.1.1 Decision Trees

Decision tree learning [43] is a method for approximating a discrete function by a decision tree. The nodes of the tree test attributes and the leaves hold values of the discrete function. A decision tree can be rewritten as a set of if-then rules. Tree learning methods are popular inductive inference algorithms, mostly used for a variety of classification tasks (for example for diagnosing medical cases). For tree generation, entropy is often used as the information gain measure of an attribute. The best known methods are ID3, C4.5, etc.

2.1.2 Neural Networks

Neural network learning methods [9] provide a robust approach to approximating real-valued, discrete-valued and vector-valued functions. The well-known algorithm – Backpropagation – uses gradient descent to tune network parameters to best fit a training set of input-output pairs. This method is inspired by neurobiology. It imitates the function of the brain, where many neurons are interconnected. The instances are represented by many input-output pairs. NN learning is robust to errors in training data and has been successfully applied to problems such as speech recognition, face recognition, etc.

2.1.3 Bayesian Methods

Bayesian reasoning [8] provides a probabilistic approach to inference. It provides the basis for learning algorithms that directly manipulate probabilities, as well as a framework for analyzing the operation of other algorithms. Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naive Bayes, are among the most practical approaches to certain types of learning problems. The Bayes classifier is competitive with other ML algorithms in many cases. For example, for learning to classify text documents, the naive Bayes classifier is one of the most effective classifiers.


2.1.4 Reinforcement Learning

Reinforcement learning [28] solves the task of how an agent (that can sense and act in an environment) can learn to choose optimal actions to reach its goal. Each time the agent performs an action in its environment, a trainer may provide a reward or penalty to indicate the desirability of the resulting state. For example, when an agent is trained to play a game, the trainer might provide a positive reward when the game is won, a negative reward when it is lost, and zero reward in other states. The task of the agent is to learn from this delayed reward, choosing sequences of actions that produce the greatest cumulative reward. An algorithm that can acquire optimal control strategies from delayed reward is called Q-learning. This method can solve problems like learning to control a mobile robot, learning to optimize operations in factories, learning to plan therapeutic procedures, etc.

2.1.5 Inductive Logic Programming

Inductive logic programming [18] has its roots in concept learning from examples, a relatively straightforward form of induction. The aim of concept learning is to discover, from a given set of pre-classified examples, a set of classification rules with high predictive power. The theory of ILP is based on proof theory and model theory for the first-order predicate calculus. Inductive hypothesis formation is characterized by techniques including inverse resolution, relative least general generalisations, inverse implication, and inverse entailment.

This method can be used for creating logic programs from a training data set. The final program should be able to generate that data back. Creating logic programs is very dependent on task complexity. In many cases this method is not usable without many restrictions imposed on the final program. ILP is mostly used with success in Data Mining, for finding rules in huge databases.

2.1.6 Case-Based Reasoning

Case-Based Reasoning (CBR) [1, 34] is a lazy learning algorithm that classifies a new query instance by analyzing similar instances while ignoring instances that are very different from the query. This method holds all previous instances in a case memory. The instances/cases can be represented by values, symbols, trees, various hierarchical structures or other structures. It is a non-generalizing approach. CBR works in the cycle: case retrieval – reuse – solution testing – learning. This method is inspired by biology, concretely by human reasoning using knowledge from old similar situations. This learning method is also known as Learning by Analogy.

The CBR paradigm covers a range of different methods. Widely used is the Instance-Based Reasoning (IBR) algorithm [2, 49], which differs from general CBR mainly in the representation of instances. The representation of instances is simple, usually a vector of numeric or symbolic values. Instance-based learning includes the k-Nearest Neighbors (k-NN) and Locally Weighted Regression (LWR) methods.

2.1.7 Support Vector Machines

Support Vector Machines (SVM) have recently become a very popular method for classification and optimization. SVM were introduced by Vapnik et al. in 1992 [10]. This method combines two main ideas. The first one is the concept of an optimum linear margin classifier, which constructs a separating hyperplane that maximizes the distances to the training points. The second one is the concept of a kernel. In its simplest form, the kernel is a function which calculates the dot product of two training vectors. Kernels calculate these dot products in feature space, often without explicitly calculating the feature vectors, operating directly on the input vectors instead. When we use a feature transformation, which reformulates the input vector into new features, the dot product is calculated in feature space even if the new feature space has higher dimensionality. So the linear classifier is unaffected.

Margin maximization provides a useful trade-off with classification accuracy, which can easily lead to overfitting of the training data. SVM are well applicable to learning tasks where the number of attributes is large with respect to the number of training examples.

2.1.8 Genetic Algorithms

Genetic algorithms [40] provide a learning method motivated by an analogy to biological evolution. The search for an appropriate hypothesis begins with a population of initial hypotheses. Members of the current population give rise to the next-generation population by operations such as selection, crossover and mutation. At each step, a collection of hypotheses called the current population is updated by replacing some fraction of the population by offspring of the fittest current hypotheses. Genetic algorithms have been applied successfully to a variety of learning tasks and optimization problems. For example, genetic algorithms can be used in other ML methods, such as Neural Networks or Instance-Based Reasoning (see section 3.5.2), for optimal parameter setting.
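To illustrate the operations named above, the following minimal Python sketch evolves a population of real-valued parameter vectors. It is a generic example under our own naming (genetic_search, fitness), not iBARET's implementation:

```python
import random

def genetic_search(fitness, dim, pop_size=20, generations=50, mutation_rate=0.1):
    """Minimal GA: evolve real-valued vectors (dim >= 2) to maximize `fitness`."""
    pop = [[random.random() for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]            # selection: keep the fittest half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, dim)          # one-point crossover
            child = a[:cut] + b[cut:]
            for k in range(dim):                    # mutation: random resetting
                if random.random() < mutation_rate:
                    child[k] = random.random()
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```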

2.2 Machine Learning and Data Mining

Data mining – DM (also known as Knowledge Discovery in Databases – KDD) has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [20]. It uses machine learning, statistical and visualization techniques to discover and present knowledge in a form which is easily comprehensible to humans.

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential. For example, it can help companies and institutions to focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. DM tools can answer business questions that were traditionally too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

DM technology can generate new business opportunities by providing these capabilities:

• Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data, quickly.

• Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together.

The most commonly used techniques in data mining are:

• Artificial neural networks

• Decision trees – Classification and Regression Trees (CART), Chi Square Automatic Interaction Detection (CHAID).

• Genetic algorithms


• Nearest neighbor method

• Rule induction – the extraction of useful if-then rules from data based on statistical significance.

Here we can see that DM makes rich use of ML methods. Going through a huge database and extracting some advanced knowledge can be managed only by an "intelligent" method that is able to learn. The capabilities of DM are now evolving to integrate directly with industry-standard data warehouse and OLAP (On-Line Analytical Processing) platforms.

A Gartner Group Advanced Technology Research Note [22] listed data mining and artificial intelligence at the top of the five key technology areas that will clearly have a major impact across a wide range of industries within the next 3 to 5 years.

2.3 Data Preprocessing

Almost always when we solve a task, we get raw data that have to be prepared for the concrete learning process. According to the problem domain, learning method and software tool, it is necessary to determine an appropriate data structure and then a data preprocessing method. A very important issue is to define how to deal with undefined or unknown data and with uncertainty. There are several possibilities for transforming input data:

• simple mapping – the simplest method, mapping one field to another (cannot merge fields),

• expressions – mapping one line from one source to one line in the destination,

• interpreted language – a programming language that can operate with more sources; very powerful but more complex,

• compiler – fast running, but highly dependent on the platform and requiring a complex implementation.

One of the most powerful preprocessing tools is SumatraTT [3, 4], developed at the Department of Cybernetics at CTU. SumatraTT uses the SumatraScript language for defining data transformations and is platform independent. The Sumatra language is inspired by C++ and Java, which are well known to programmers, so it is easy to learn.

Preprocessing, as a phase of ML or DM, determines the success of the whole learning/mining process. So-called metadata are frequently used for automatic preprocessing. Metadata are information about data, which can often be helpful.

ML theory suggests many different approaches for dealing with the available data when a model is generated. They can differ particularly in relation to the type of learning, which determines constant model characteristics, and the size of the data set itself. In all cases, they follow two fundamental intentions: to enable generation of a model with predictive power as high as possible, and to give a chance to independently estimate its performance on future data and guarantee the model's validity over these unseen data. In order to meet these two goals, not all the data can be used within model generation (adjusting model settings). The hold-out method divides data into two distinct sets: the training set is used in learning and the testing set is used in evaluation.
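A minimal sketch of the hold-out method in Python; the split fraction and seed are arbitrary illustration values:

```python
import random

def hold_out_split(cases, test_fraction=0.3, seed=42):
    """Split a list of cases into a training set and a testing set."""
    rng = random.Random(seed)
    shuffled = cases[:]                # copy, so the original order is kept
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training, testing)
```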

We can encounter several unpleasant problems in data. Typically, in a classification task it can easily happen that the data are not equally distributed among the final classes. With most learning methods, this fact leads to a simple classifier whose classification has high accuracy in predicting the best represented class, while it is not able to classify the less represented class. This situation often occurs in medical applications, for example when predicting the results of Coronary Artery Bypass Graft surgery [30]. The most interesting cases for us are the patients that will probably die, but those patients make up only 1% of the data. This can be addressed by an appropriate evaluation function, for example the ROC curve, or improved by data preprocessing. Such preprocessing should adjust the distribution of data into classes, for example by randomly removing cases from the best represented class; see the sketch below.
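The random-removal preprocessing mentioned above can be sketched as follows (a hypothetical undersample helper; class_of extracts the class label of a case):

```python
import random
from collections import defaultdict

def undersample(cases, class_of, seed=42):
    """Randomly drop cases from over-represented classes so that every
    class keeps only as many cases as the rarest class has."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for c in cases:
        by_class[class_of(c)].append(c)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    return balanced
```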

Another problem appears if we have many attributes and not much data; see section 2.4. In this case, Probably Approximately Correct (PAC) analysis [16] can help us determine how good our solution can be. The question is with what probability we are able to predict results with a given accuracy. This data problem can also be partially eliminated by data preprocessing, mainly by detecting and removing irrelevant features or by selecting the most relevant ones [25, 26]. This approach is called Feature Reduction. There are many methods for selecting relevant attributes, for example methods using PAC [23] or correlation [24]. An example of such data can be found in section 4, where we try to predict the capacity required for therapeutic utilities on SPA data.

2.4 Learning

The term "learning" usually corresponds to fitting the designed model. Through the process of learning, we improve model prediction accuracy as quickly and as much as possible. After learning we expect the model that best fits the input data. Methods that are not very robust to noise usually give very good results on training data, but rather poor results on testing or real data. When this occurs, we speak about over-fitting. The rate of over-fitting is also very dependent on the input data. It mostly appears when we do not have much data compared with the number of attributes, or when the data are noisy. Noise can be brought into the data by, for example, subjective data reporting, uncertainty, acquisition errors, etc. If we have too many attributes and not much data, then the state space for finding the optimal model is too wide and we can easily lose the right way and finish in a local optimum. The problem of over-fitting can be partially eliminated by suitable preprocessing and by using an adequate learning method.

Here we would like to mention a very powerful tool for machine learning – WEKA [51, 47]. Weka collects plenty of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. The algorithms can either be applied directly to a dataset or called from one's own Java code. Weka is also well suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License. It incorporates about ten different methods for classification (Bayes, LWR, SVM, IBR, Perceptron, ...), another six methods for numeric prediction (linear regression, LWR, IBR, multi-layer perceptron, ...) and several so-called "meta-schemes" (bagging, stacking, boosting, ...). Also included are clustering methods and an association rule learner. Apart from the actual learning schemes, Weka also contains a large variety of tools that can be used for preprocessing datasets.

2.5 Testing

After learning, before using our prediction model in practice, we have to check its classification or prediction accuracy. The most frequently used method is dividing the input data into training and testing data sets.

The N-fold cross-validation method divides the data into N partitions. The learning process runs in N steps; in each step i, all the partitions except the i-th are used in learning and the i-th partition is used for testing. The leave-one-out method is a special case of cross-validation: each partition consists of just one case, so the number of learning steps is equal to the number of data records. This method is very easy to implement with instance-based techniques, as it is trivial to ignore a single case when searching for the most similar cases.
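A sketch of leave-one-out with a 1-NN classifier, assuming a distance function and a class_of accessor are supplied; it shows how trivially the single query case is ignored:

```python
def loocv_accuracy(cases, distance, class_of):
    """Leave-one-out: classify each case by its nearest neighbor among
    all the remaining cases and count the correct classifications."""
    correct = 0
    for i, query in enumerate(cases):
        others = cases[:i] + cases[i + 1:]        # ignore the query case itself
        nearest = min(others, key=lambda c: distance(query, c))
        if class_of(nearest) == class_of(query):
            correct += 1
    return correct / len(cases)
```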

2.6 Results Evaluation & Model Exchange

The evaluation of results is a key problem of the learning process. The evaluation gives feedback to learning, and thus can distinctly affect the progress of model improvement. For some tasks the fitness function is given by their principle, but for most it is not. For numeric prediction tasks the measure of success is often some type of error rate – RMSE, MAPE, or MAE (section 3.4.2). For classification tasks a probabilistic approach is often used, and especially in medical applications ROC curves are used.

2.6.1 Area under ROC Curve

Receiver operating characteristic (ROC) is defined as a curve of false positive and false negative results at various threshold values, indicating the quality of a decision method. This quality is expressed by the area under the ROC curve [27]. This area can be between 0 and 1. Zero corresponds to perfectly inverted classification, 0.5 corresponds to no apparent accuracy of the tested method, and 1.0 means perfect classification accuracy. In other words, a ROC curve is a graphical representation of the trade-off between the false negative (FN) and false positive (FP) rates for every possible cut-off. By tradition, the plot shows FP on the X-axis and 1 − FN (which is the true positive (TP) rate) on the Y-axis.

Let us consider a simple example: a decision method predicts the probability that testing examples belong among the positive cases (class 1) as Table 1 shows.

Prediction   0    0.05   0.2   0.22   0.3   0.31   0.4   0.42   0.7   0.8
Real class   0    0      1     0      1     0      0     1      1     1

Table 1: Example – calculating the ROC curve

The area under the ROC curve can be calculated by recording the false positive and false negative rates for each possible decision threshold. First, the threshold is set to 0, i.e. all the examples are considered to be positive. Then, the threshold is set to 0.025, so that the leftmost testing example is considered negative while the others remain positive. The third threshold is set to 0.125, and the two leftmost examples are classified as negative. All in all, ten different thresholds can be derived. For each threshold, a four-fold table is constructed and the TP and FP rates are calculated.

Class \ Classification   0    1
0                        a    b     FP = b/(a + b)
1                        c    d     TP = d/(c + d)

Table 2: Four-fold table

The resulting ROC curve can be seen in Figure 1. The area under the curve is 0.8; it can be interpreted as an apparent relation between the decision method's prediction and the real classification.
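The threshold sweep described above can be written down directly; the following sketch (our own naming, not from the report's tooling) computes the area by the trapezoidal rule and reproduces the 0.8 area for the data of Table 1. It assumes both classes are present in the data:

```python
def roc_area(predictions, classes):
    """Area under the ROC curve: sweep a threshold over the sorted
    predictions and collect (FP rate, TP rate) pairs."""
    pairs = sorted(zip(predictions, classes))
    pos = sum(classes)
    neg = len(classes) - pos
    fp, tp = neg, pos                  # threshold below everything: all positive
    points = [(fp / neg, tp / pos)]
    for _, cls in pairs:               # raise the threshold past one case at a time
        if cls == 1:
            tp -= 1
        else:
            fp -= 1
        points.append((fp / neg, tp / pos))
    points.sort()
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

preds = [0, 0.05, 0.2, 0.22, 0.3, 0.31, 0.4, 0.42, 0.7, 0.8]
reals = [0, 0, 1, 0, 1, 0, 0, 1, 1, 1]
print(roc_area(preds, reals))          # 0.8, as in Figure 1
```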


Figure 1: ROC curve for the example in Table 1 (ROC area 0.8); FP (1 − specificity) on the X-axis, TP (sensitivity) on the Y-axis

2.6.2 Predictive Model Markup Language

After creating and fitting a predictive model, we would sometimes like to be able to bring that model to another application or tool, for example for final use, visualization, some kind of post-processing, etc. For that purpose, a special language for holding different predictive models has recently been developed.

Predictive Model Markup Language (PMML) is a markup language based on XML which can be used for holding predictive models and their interchange between compliant vendors' applications. PMML is defined by the Data Mining Group (DMG) and the specification of PMML is published on the DMG web page [39]. PMML is a platform and vendor independent, forward-looking data mining standard. It allows users to develop models within one vendor's application, and use other vendors' applications to visualise, analyse, evaluate or otherwise use the models. PMML uses XML [14] to represent mining models. The structure of a PMML document is described in a Document Type Definition (DTD). In PMML, a single document can contain more than one mining model. If an application supports mining model selection, the user can specify the model for current use; otherwise the first one is used as the default.

In addition, PMML can hold data statistics, directives for feature normalization, results, and other meta-data. It is possible to include comments in PMML. This advantage can be used by some applications to improve their functionality.

This language is easy to understand and manipulate. It is able to hold nearly all of the most frequently used data-mining models. In the future this interchange format should be supported by nearly all Data Mining applications.


3 iBARET – Instance-BAsed REasoning Tool

Development of the iBARET system follows the work done in the Master thesis [42]. That work was focused on the development of a system for result prediction of cardiac operations using the CBR method [1]. Since that time, we have been developing a more general tool for use in many domains [32].

In contrast to learning methods that construct a general, explicit description of the target function when training examples are provided, instance-based learning (IBL) methods simply store the training examples. Generalizing beyond these examples is postponed until a new instance must be classified. Each time a new query instance is encountered, its relationship to the previously stored examples is examined in order to assign a target function value for the new instance. Instance-based learning includes the k-nearest neighbor (kNN) and locally weighted regression (LWR) methods. These methods assume that instances can be represented as points in a Euclidean or other space. It also includes case-based reasoning (CBR) methods that use more complex, symbolic representations for instances. Instance-based methods are sometimes referred to as "lazy" learning methods because they delay processing until a new instance must be classified. A key advantage of this kind of delayed, or lazy, learning is that instead of estimating the target function once for the entire instance space, these methods can estimate it locally and differently for each new instance to be classified [41].

kNN performance is highly sensitive to the definition of its distance function. In order to reduce this sensitivity, it is advisable to parametrize the distance function with feature weights. The authors in [48] argue that methods which use performance feedback to assign weight settings demonstrate three advantages over other methods: they require less pre-processing, perform better in the presence of interacting features, and generally require less training data to learn good settings. iBARET represents a batch performance-feedback optimizer. It utilizes genetic algorithms (GAs) and a sequential algorithm to effectively search the space of possible weight settings. The idea of utilizing GAs to find an optimal weight setting is not new [29, 44, 36]. iBARET concentrates on the utilization of proper genetic operators with respect to the time needed for processing and the possibility to set its parameters through the user interface, i.e. the universality of its application. Universality is the other overall system accomplishment. iBARET can be used to solve classification as well as regression tasks. For classification tasks it offers two different methodologies to calculate the fitness function of a weight setting. The first one is a probabilistic method, which derives the value of the fitness function of the current weight setting from the average prediction accuracy reached for the different classes. The second one makes use of receiver operating characteristic (ROC) curves [27], taken from radiology and medicine. Both of them give a chance to effectively process tasks with a non-uniform class distribution.

More details about iBARET and the methods used can be found in [31].

Data preprocessing represents the crucial issue of successful iBARET application. In particular, symbolic features must be properly transformed into numeric features, or a proper distance table has to be defined and used within CQL Server. iBARET does not offer any data preprocessing facilities, therefore all the preprocessing has to take place outside of iBARET prior to its application. A user should always be aware whether his data can be used directly or only after preprocessing.

3.1 iBARET structure

The iBARET system consists of two main units, CQL Server and IBR Interface (see Figure 2). CQL Server receives queries in Case Query Language (CQL) format. Every single query corresponds to a single case. At the same time, it contains feature weights and the number of requested neighbors. CQL Server finds the nearest neighbors to the case contained in the query and sends a CQL response to IBR Interface. CQL Server has its origin in CBR Works 3, which was used as the server application at first. The CBR Works 3 system is a commercial product of TECINNO GmbH created in the frame of the INRECA project. Later on, we found that this system reacts too slowly to CQL queries, therefore we have developed our own server application – CQL Server. In order to keep compatibility with IBR Interface, we use the same TELNET communication protocol and the same CQL. Nevertheless, CQL is not fully supported in our server; it implements only commands for domain model representation, a command for sending a query and a command for generating answers. In fact, CQL Server is a special database engine that can identify the most similar cases (the nearest neighborhood). It works with a domain model that can be loaded from a text file in the same way as the tested cases.

Figure 2: The iBARET block structure. IBR Interface (CQL communicator with encoder and decoder, automatic and manual querying; evaluation unit with case evaluation by probability or ROC curve and test evaluation; attribute weights tuning by sequential or genetic algorithm; experiment settings) communicates over CQL with CQL Server (CQL Server engine, case memory, domain model), fed by the training or testing data set

IBR Interface is used for finding the optimal parameters of the k-nearest neighbor method, namely the feature weights and the parameter k. The parameter k is set by the user through the GUI (Graphical User Interface). IBR Interface generates a set of weights and applies them to queries sent to CQL Server. It uses these weights for all the samples included in the testing set. It processes the responses from CQL Server, classifies the testing examples according to the reported neighborhood, and finally evaluates the classification or regression accuracy reached with the given set of method parameters.

IBR Interface consists of four main units. The CQL communicator automatically generates queries and decodes answers from CQL Server for further evaluation. It also enables sending a query manually. The evaluation unit derives a solution of a single query (classification or regression) from the received neighborhood, and after evaluating all the cases it appoints the overall classification or regression accuracy. The final output of this unit is a single value that indicates the quality of the model and its ability to classify or predict. The probabilistic method or the ROC curve can be used for this purpose. The value we get from the evaluation unit is used in the next unit for tuning the attribute weights. This unit implements two methods of feature weight optimization. The first algorithm we have implemented is a sequential algorithm; a better and more useful algorithm is the genetic algorithm. The weights of attributes and other settings are stored in the Experiment settings unit.


3.2 CQL Server

The CQL Server [46] is an application that is able to service requests in CQL syntax, sent over any TCP/IP network, for specific database (case-base) information. It runs on Windows 9x/NT/XP systems. On receiving a valid CQL request, it starts a sequential search for the nearest neighbors of the case that was transmitted with the request. According to the parameters passed over by the client, the neighbors are found and the compiled results sent back over the net. CQL Server is actually a replacement for the CBR Works 3 database server, which does not deliver satisfactory speed and power. When starting to work on a new task (i.e. a new case-base), a CQL Model has to be created and loaded first. The CQL Model represents a description of a case-base in terms of CQL. It defines the data types (integer, real, symbol, custom, ...) and slots (names and data types of case-base features) contained in the case-base file. At the same time it defines whether the given case-base feature (slot) is used when distance is calculated or not (determined by a "not discriminant" statement). A distance weight can be adjoined to each slot as well. It is not explicitly specified within the model which feature is the class. This decision is postponed to IBR Interface. In case the class feature is constant within the processed domain, the "not discriminant" statement can be added to its slot definition for safety reasons. According to this data, the system generates a custom binary format, which is stored in memory. This format saves memory space and also makes searching easier and faster.

After the CQL Model definition, a case-base can be loaded into the server. The case-base is a simple text file consisting of cases; features are separated by commas, cases by line breaks. The first feature on each line corresponds to the first slot in the CQL Model, and so on. The number of features and their data types must correspond to the values predefined in the CQL Model. No symbolic description (feature names, types) is included; the first line directly represents the first case. CQL Server uses a Euclidean distance metric to identify the nearest neighbors. The distance between two cases is calculated by adding up all the distances based on the values of the respective slots. For integers/reals, their distance on the real line is simply taken; for symbolic variables, supplied table metrics are used (from external files). When a query is initiated, each slot is assigned a weight by the client. The respective distances are multiplied by this weight and the resulting value is then divided by the maximum distance between values of that slot in the entire database. The similarity value is obtained by dividing the sum of distances by the sum of all used weights and subtracting from 1. The similarity is thus dependent on the context (the other subjects) in the case-base.

$$\mathrm{sim}(i, j) \;=\; 1 - \frac{\sum_{k=1}^{m} w_k^{\mathrm{impl}} \cdot w_k^{q} \cdot \mathrm{dis}(x_{ik}, x_{jk})}{\sum_{k=1}^{m} w_k^{\mathrm{impl}} \cdot w_k^{q}} \qquad (1)$$

Where: sim(i, j) – similarity between the i-th and j-th case,
m – number of discriminative features,
w_k^impl – implicit weight of the k-th feature, taken from the CQL Model,
w_k^q – user weight of the k-th feature, taken from the actual query,
dis – a distance function,
x_ik – value of the k-th feature in the i-th case,
x_jk – value of the k-th feature in the j-th case.

For numeric data types (integer, real), the distance function is defined as follows:


$$\mathrm{dis}(x_{ik}, x_{jk}) \;=\; \frac{|x_{ik} - x_{jk}|}{\max_{\forall l} x_{lk} - \min_{\forall l} x_{lk}} \qquad (2)$$
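For numeric features only, equations (1) and (2) translate into a few lines of Python. This is an illustrative sketch under our own naming, not CQL Server's actual code; symbolic features would look the distance up in a symbol distance table instead, and every feature is assumed to have a non-degenerate value range:

```python
def numeric_distance(xi, xj, col_min, col_max):
    """Distance of two numeric feature values, normalized by the value
    range of that feature over the whole case-base -- equation (2)."""
    return abs(xi - xj) / (col_max - col_min)

def similarity(case_i, case_j, impl_weights, query_weights, ranges):
    """Weighted similarity of two cases -- equation (1).
    `ranges` holds the (min, max) of every feature over the case-base."""
    num = den = 0.0
    for k, (lo, hi) in enumerate(ranges):
        w = impl_weights[k] * query_weights[k]
        num += w * numeric_distance(case_i[k], case_j[k], lo, hi)
        den += w
    return 1.0 - num / den
```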

For symbolic unordered data types, the distance function must be defined with the aid of a symbol distance table. The symbol table is specified by a lower triangular matrix rather than an upper triangular one (this saves typing tabs), which implies that the values are symmetrical. The tables have a very simple and intuitive format: they are stored as text, tabs separate fields and line breaks separate table rows. The problematics of symbolic feature weighting is discussed in [15].

Let us assume a symbolic type with the values {Renault, Buick, Rolls-Royce, Skoda}; the table then gives the similarities between these types as Table 3 shows.

              Renault   Buick   Rolls-Royce   Skoda
Renault       1
Buick         0.8       1
Rolls-Royce   0.7       0.9     1
Skoda         0.3       0.2     0.0           1

Table 3: Example of a symbol distance table

The main diagonal is made of 1.0 only, which means that a Buick is 100% similar to a Buick, and so on. The other similarity values express that e.g. Skoda is the most distant car type from the other types. There is one more type of symbol distance table implemented in the system. This table enables one to shortly define a metric that returns a certain value if the feature values are the same and another one if they are different. This metric is symbolically specified in a table of another format; see Table 4.

+                       -
similarity if same      similarity if different

Table 4: Example of the second type of symbol distance table

This measure might be used when the difference cannot be determined for individual values. Let us take up the car example again: in the symbol set {Blue, Red, Silver} it is more logical to set the distance to some value if the colors match and to some other value if they do not. It might be a problem to quantify the distances Red–Blue and Red–Silver. Which is more different? In such cases the last metric may apply.

If the tables are in the same directory as the model file and they have names in the form "feature name.tbl", they can be auto-loaded. This means that the system automatically looks up these files, loads them into memory and binds them to the symbols of that particular type. Some types are implicitly considered non-discriminative (not taken into account in the total similarity measure). These are especially types which have the parent type CQL SYMBOL directly. These types are implicitly shown in red, which indicates that they are non-discriminative. On clicking such a type, a dialog will open requesting the filename of the table to be loaded for this type. The type is then discriminative. Range checking is done on numeric variables. Out-of-range variables are reported in the system log. Symbols are checked against a supplied symbol list. Unrecognized symbols are reported in the system log. Simple statistics are provided to evaluate the efficiency of the server and of the connection.

3.3 Consultation

CQL is a structured language used for communication between the Interface and the Server. This section explains the way queries are constructed and answers are understood. The syntax of CQL is completely described in the documentation of the CBR-Works system [45].

IBR Interface generates queries itself, so the user does not need to have knowledge of it. Let us overview all the base data necessary for the automatic generation of queries. First of all, the names of attributes (slots), their values and weights are needed. The names of slots are read from the attribute file. The values of the attributes are taken from another input text file. The attribute values should be divided by commas or semi-colons on each row.

There are two more data entries necessary for query generation – the threshold and the number of cases to retrieve. The number of cases to retrieve defines the maximum number of returned neighbors. CQL Server can return fewer cases in case fewer cases exceed the threshold value of similarity with the queried case.

Server Answers

After sending a query and its processing by the Server, the Interface gets an answer in CQL format. The answer has to be decoded, and the data necessary for classification of the queried case have to be acquired. The first value that the Interface reads is a return code. If this value is not zero, communication between Interface and Server has failed and no case is appended. The most frequent fault code is -2; it means that the query format was wrong. When the return code is 0, the most similar cases with their similarity values and full descriptions (all the attribute values) follow after the keyword cases. For further processing, only the case similarity and the value of the attribute that represents the result (class or predicted value) are considered. Just for easier control of the process, the Interface also extracts the case names from the answer.

Sometimes, the same cases can be used in the Server's database and in the testing set for the Interface. Of course, when classifying the queried case, the nearest case has to be removed from the corresponding Server answer, as it is certainly the queried case itself. Otherwise the error estimation of the method would be optimistically biased. This technique of error estimation is referred to as the Leave-One-Out Cross-Validation (LOOCV) method. When the similarity of the nearest neighbor equals 1, the Interface ignores the nearest case from the answer.

Consultation with the server can be executed manually or automatically. Manual consultation means that the user can create and send a custom-made query directly in CQL, and the result is shown entirely in CQL again (without processing the answer). This approach presumes knowledge of CQL; it suits mainly testing purposes. Mostly, automatic consultation is used, which constructs queries from the input cases and sends them step by step to the server. It can be used for model training, model testing or final consultation. When the domain model is trained, the consultation with the whole training data set is executed repeatedly for many different experiment settings. In the frame of automatic query sending, simple processing of answers for further evaluation is performed.

3.4 Testing Set Evaluation

After the answer is processed by the CQL communicator unit, the queried case can be classified (or predicted, when the desired value is numeric) by means of the Locally Weighted Regression (LWR) method [13]. For this purpose, the CQL communicator unit outputs the names of the nearest neighbors together with their similarity to the queried case and their classification. At first, the Interface determines a so-called modified similarity for each of the nearest neighbors. This modified similarity weights the neighbor when classifying the queried case. The simplest way of weighting is the direct utilization of the returned similarity. The second possibility is to apply scaling. Scaling can be utilized namely when many neighbors show very conformable similarity values and similarity has only little effect upon the classification of the queried case. In the present version of the Interface, the user can choose the linear rank scaling method (LRank), which calculates the modified similarity from the order of cases by similarity. The first nearest neighbor's modified similarity equals the number of returned neighbors; for each next nearest neighbor the Interface decreases the modified similarity step by step by one.

$$mod\_sim_i \;=\; \begin{cases} \mathrm{sim}(i, q) & \text{no scaling} \\ n - i + 1 & \text{LRank scaling} \end{cases} \qquad (3)$$

Where: mod_sim_i – modified similarity of the i-th nearest neighbor,
q – queried case,
i – index of the i-th neighbor of q,
n – total number of neighbors returned in the answer.

For the evaluation we have to know which task type is executed (classification or regression). For each task type the LWR and evaluation methods are different.

For a classification task, we calculate for each classification class the probability that the new case belongs to that class. The class with the highest probability is the predicted result. The training/testing set can then be evaluated probabilistically or with a ROC curve.

For a regression task, the result is determined by a simple locally weighted regression method based on weighted average calculation. The evaluation of the training/testing set is then calculated as the mean absolute or relative prediction error. The Interface also implements the possibility to work with relative values of attributes, normalized by a chosen attribute.

3.4.1 Classification Task

Provided that the task is a classification one, the Interface calculates for each classification class the probability that the queried case belongs to this class. This probability is calculated as follows:

$$P_{q,j} \;=\; \frac{\sum_{i=1}^{n} mod\_sim_i \cdot r_{ij}}{\sum_{i=1}^{n} mod\_sim_i} \cdot 100\ [\%] \qquad (4)$$

Where: P_q,j – probability that the queried case q belongs to the j-th class,
n – total number of neighbors returned in the answer,
mod_sim_i – modified similarity of the i-th nearest neighbor,
r_ij = 1 if the classification of the i-th nearest neighbor is j, and 0 otherwise.

From these probability values, the Interface determines the class of the queried case: the class with the maximum probability. The index of the class predicted for the queried case is:

$$r \;=\; \arg\max_{\forall j} \,(P_{q,j}) \qquad (5)$$

Where: r – index of the classification class with the maximum probability,
P_q,j – probability that the queried case q belongs to the j-th class.
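Equations (3)-(5) combine into the following sketch (our naming, not iBARET's source); it assumes the neighbors arrive sorted by decreasing similarity and returns both the predicted class and the per-class probabilities:

```python
def modified_similarities(similarities, scaling="none"):
    """Equation (3): either the raw similarities, or LRank scaling where
    the i-th nearest of n neighbors gets weight n - i + 1."""
    if scaling == "lrank":
        n = len(similarities)
        return [n - i for i in range(n)]     # n, n-1, ..., 1
    return list(similarities)

def classify(neighbor_classes, similarities, all_classes, scaling="none"):
    """Equations (4) and (5): class membership probabilities from the
    weighted votes of the neighbors, then the argmax class."""
    mod_sim = modified_similarities(similarities, scaling)
    total = sum(mod_sim)
    prob = {j: 100.0 * sum(ms for ms, c in zip(mod_sim, neighbor_classes)
                           if c == j) / total
            for j in all_classes}
    return max(prob, key=prob.get), prob
```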

That is the way the Interface evaluates a single query. For the purpose of feature weight optimization, the classification accuracy for the whole testing/training set has to be calculated. The Interface offers two basic methodologies to do this – the probabilistic method and the ROC curve method.

Probabilistic Method

In the frame of this method, the user can choose how to deal with a possible difference between the predicted class and the right class of the queried case. For each queried case, the Interface first evaluates a result R_j. The first possibility is to set the result directly to the probability P_q,j, where j is the index of the right class. Otherwise, the result can be set to 100% if the classification is correct (j = r, where j is the index of the right class) and to zero if the classification is not correct.

Then, the individual probability results R_j of each case have to be transformed into a uniform evaluation of the testing set. The simplest way of evaluating the overall classification accuracy is calculating a simple average over all the results R_j. This method can be used only when working with equally distributed classes.

For most tasks, the method using a weighted average should be used. This method calculates the probability of correct classification for each class separately and weights them according to the user's setting. First, the probability of successful classification of the k-th class is computed. It is a simple average of the partial classification successes:

$$P_k \;=\; \frac{\sum_{j=1}^{N} R_j \cdot r_{jk}}{\sum_{j=1}^{N} r_{jk}} \qquad (6)$$

Where: P_k – success probability of the k-th classification class,
N – number of queried cases (number of cases in the testing set),
R_j – probability of correct classification of the j-th case,
r_jk = 1 if case j is classified in the k-th class, and 0 otherwise.

Then, the Interface calculates a weighted average on the basis of these probabilities. The weights of the classification classes can be set by the user. The final probabilistic value P evaluates the classification accuracy reached for the given setting on the entire testing set.

$$P \;=\; \frac{\sum_{k=1}^{M} P_k \cdot w_k}{\sum_{k=1}^{M} w_k} \qquad (7)$$

Where: P – probability estimate of the overall classification accuracy,
P_k – probability of successful classification for the k-th class,
w_k – weight of the k-th classification class,
M – number of classification classes.
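A sketch of equations (6) and (7), assuming results holds R_j for each tested case, case_classes its right class, class_weights the user-set weight w_k per class, and every class occurring at least once in the testing set:

```python
def overall_accuracy(results, case_classes, class_weights):
    """Equations (6) and (7): per-class average of the case results R_j,
    then a weighted average over the classes."""
    per_class = {}
    for k in class_weights:
        in_k = [r for r, c in zip(results, case_classes) if c == k]
        per_class[k] = sum(in_k) / len(in_k)           # equation (6)
    num = sum(per_class[k] * w for k, w in class_weights.items())
    return num / sum(class_weights.values())           # equation (7)
```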


Obviously, utilization of the weighted average of the classification accuracies reached for the individual classes decreases the influence of an unequal distribution of training/testing cases among the classification groups. A typical example of such a distribution is a mortality prediction task. Just a few percent of patients actually die, but we want to be very exact in these cases. That is why it is very beneficial to remarkably increase the weight of successful classification into the class "dies, will die".

Area under Receiver Operating Characteristic (ROC) Curve

The method of the area under the ROC curve was introduced in section 2.6.1. This measure comes into use mainly when it is not desirable to make the model predictions distinct, although the final classes are. The area under the ROC curve gives a good chance to convert a complex and balanced comparison of all the predictions and real classifications into a single number. It can be shown that the area represents the probability that a randomly chosen positive subject is correctly rated or ranked with greater suspicion than a randomly chosen negative subject. In medical imaging studies, the ROC curve is usually constructed as follows: images from diseased and non-diseased patients are thoroughly mixed, then presented in this random order to a decision system which is asked to rate each on a scale ranging from definitely normal to definitely abnormal. The scale can be either continuous or discrete ordinal. The points required to produce the ROC curve are obtained by successively considering broader and broader categories of abnormal in terms of the decision system scale and comparing the proportions of diseased and non-diseased patients.

The ROC curve in the example in section 2.6.1 is not smooth because of the low number of testing examples and the consequent low number of possible thresholds. In real applications we mostly have thousands of cases, so we would get a ROC curve made of thousands of points. This number of points is too high, so we do not calculate a point for each possible decision threshold. We decrease the number of points by merging cases into groups. The Interface simply uses the user's value defining the number of points on the ROC curve. There are two possibilities to divide the testing examples among the defined number of intervals.

The first method, Group by percent, merges cases into equally sized groups, i.e. each interval contains the same number of testing examples. For example, if we have 1000 cases and we want to construct the ROC curve from 50 points, we make 50 intervals of 20 cases each. The dividing boundaries are related to an equal number of examples in each interval rather than to an equal range of the classification accuracy probabilities P_q,j of the individual examples.

The other method, Group by value, merges cases by their values. The Interface simply divides the interval of case values (the classification accuracy probabilities of the individual examples) into equally sized sub-intervals and merges the cases from the same sub-interval. With this method it can happen that the examples are spread very unequally among the individual intervals.

Moreover, the method Group by percent can be further modified by using a correction. When it is used, the cases having the same P_q,j must belong to the same interval. When the Interface assigns the last case to an interval, it also checks the next case. If the value P_q,j of that case is equal to the value of the last case, the Interface adds that next case to the current interval, and then continues with the next case. This option enables a more appropriate and stable estimate of the area under the ROC curve for tasks where many cases show the same value of P_q,j.

The best results can usually be acquired with the Group by percent method with the correction selected. Similarly to the probabilistic method, the area under the ROC curve is used as a fitness function for the automatic tuning of the attribute weights.
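A sketch of the Group by percent interval construction with the correction, assuming the cases are already sorted by their probability values P_q,j (our own reading of the procedure, not iBARET's source):

```python
def group_boundaries(sorted_probs, n_groups, correction=True):
    """Split cases (sorted by P_qj) into `n_groups` equally sized intervals;
    with the correction, cases sharing the same probability always fall
    into the same interval."""
    n = len(sorted_probs)
    size = n // n_groups
    boundaries, i = [], 0
    while i + size <= n:
        i += size
        if correction:             # pull equal-valued cases into this group
            while i < n and sorted_probs[i] == sorted_probs[i - 1]:
                i += 1
        boundaries.append(i)
    return boundaries              # indices where new intervals start
```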


3.4.2 Regression Task

Dealing with a regression task, the Interface outputs a numeric value as the prediction for the queried case. One of the simplest locally weighted regression methods is used – a simple weighted average of the output attribute values of the nearest neighbors found for the given query.

$$R \;=\; \frac{\sum_{i=1}^{n} mod\_sim_i \cdot r_i}{\sum_{i=1}^{n} mod\_sim_i} \qquad (8)$$

Where: R – numeric prediction for the queried case,
n – total number of neighbors returned in the answer,
mod_sim_i – modified similarity of the i-th nearest neighbor,
r_i – output attribute value of the i-th nearest neighbor of q.

The modified similarities of the neighbors are used as the weights. This method is quite simple but gives good results. It is noise resistant and more accurate than a simple average.
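Equation (8) in Python; the mod_sims argument would come from equation (3), e.g. the LRank weights n, n−1, ..., 1:

```python
def predict_regression(neighbor_values, mod_sims):
    """Equation (8): locally weighted regression as the weighted average
    of the neighbors' output values, weighted by modified similarity."""
    return (sum(ms * r for ms, r in zip(mod_sims, neighbor_values))
            / sum(mod_sims))

# Example: 4 nearest neighbors with LRank-scaled weights 4, 3, 2, 1
print(predict_regression([0.91, 0.88, 0.95, 0.70], [4, 3, 2, 1]))  # 0.888
```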

The following example shows several methods for calculating the result of the regression method. The task is to predict values of the sine function. We know only 13 values of the function and we would like to predict the value of the sine between the known points. We have the sine function with and without noise. Real data often contain noise and uncertainty, and the predictor should be able to handle such data. At first we tested the 2-NN method, which gets good results on clean (not noisy) data. But for most tasks the parameter k = 2 is too low and, as we can see, not noise resistant enough. For our example the best solution is the 4-NN method with weighted averaging. For clean data it has a greater error (but not greater than a simple average), but for noisy data it is the best of all the methods in this example. The predictions of all three methods on the sine function and their prediction errors are shown in Figure 3. The calculated Root Mean Square Error is in Table 5.

                Values   2-NN     4-NN    4-avg
without noise   0        0.0241   0.121   0.125
with noise      0.260    0.154    0.085   0.090

Table 5: Example – RMSE of different LWR methods

The evaluation of the current case has to be further converted into an evaluation of the entire testing set. For the evaluation it is necessary to know the right results; we get them from the input file with the training cases.

It is not possible to use the probabilistic evaluation for a regression task. The best solution is to calculate the error of the prediction. For this purpose we implemented methods that calculate the Mean Absolute Error (MAE) or the Mean Absolute Percentage Error (MAPE):

$$\mathrm{MAE} \;=\; \frac{\sum_{i=1}^{N} |R_i - R_{0i}|}{N}, \qquad \mathrm{MAPE} \;=\; \frac{1}{N} \sum_{i=1}^{N} \frac{|R_i - R_{0i}|}{R_{0i}}$$

where R_i is the prediction for the i-th case, R_0i its right value, and N the number of tested cases. This value can be directly used as a fitness function for tuning the attribute weights.
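Both error measures as a direct sketch (MAPE assumes no right value R_0i is zero):

```python
def mae(predictions, actuals):
    """Mean Absolute Error over the testing set."""
    return sum(abs(r - r0) for r, r0 in zip(predictions, actuals)) / len(actuals)

def mape(predictions, actuals):
    """Mean Absolute Percentage Error over the testing set."""
    return sum(abs(r - r0) / r0 for r, r0 in zip(predictions, actuals)) / len(actuals)
```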


Figure 3: Example of predicted values of the sine function: (a) prediction of the sine, (b) errors of prediction, (c) prediction of the noisy sine, (d) errors of prediction; each panel compares the known values with the 2-NN, 4-NN and 4-avg methods

3.5 IBR Model Tuning

Once the training set is evaluated, the Interface can use this value to adapt the learning parameters; this learning process in effect tunes the domain model. The adaptation consists mainly in tuning the feature weights, and the whole process is as follows. An initial set of weights is generated and the training set is evaluated. Using the knowledge of this evaluation, the Interface generates another set of feature weights and evaluates it again, continuing iteratively until an appropriate weight setting is reached. Two algorithms for tuning the weights have been implemented: the first is a sequential algorithm and the second is a genetic algorithm.

3.5.1 Sequential Algorithm

This algorithm works in cycles. In each cycle it tries to change the weight of every attribute to a value that brings a better evaluation. It starts with predefined attribute weights that are loaded from a text file.


First we evaluate the training cases with the initial weight setting. Then we take the original weight of the first attribute, decrease it by some value (depending on the setting and the number of cycles), and evaluate the training set; a weight may be decreased only if its final value remains greater than zero. Next we take the original weight of that attribute, increase it by the same value, and evaluate again. We then compare the evaluations obtained with the original, decreased, and increased weight of the first attribute, and the weight with the greatest evaluation becomes the new original value. We keep this greatest evaluation, because it serves as the reference evaluation for the original weight of the second attribute. Again we decrease the original weight, evaluate, increase the original weight, evaluate, and compare all three evaluations; the weight with the greatest evaluation is again kept as the original weight of the second attribute. These steps are repeated for all attributes. The method is a modification of the well-known hill-climbing algorithm.

After the first cycle is done, the value by which the attribute weights are decreased and increased (we call it the difference) needs to be changed. In most cases we want to reduce this value, which can be done either by subtracting a step from it or by dividing it. After selecting the method, we enter either the step for decreasing or the value by which the difference is divided. If we choose to decrease the difference, the evolution of the difference in subsequent cycles can be written as an arithmetic progression:

d_i = d_0 − (i − 1) · d_s    (9)

Where: d_i – difference in the ith cycle,
d_0 – initial value of the difference,
d_s – step by which the difference is decreased.

If we choose the second method – dividing the difference – the evolution of the difference can be written as a geometric progression:

d_i = d_0 · d_s^(i−1)    (10)

Where: d_i – difference in the ith cycle,
d_0 – initial value of the difference,
d_s – factor applied to the difference in each cycle.

In each cycle we compute the difference and try to change the weights of the attributes from the first to the last. The algorithm runs until at least one of the termination conditions is reached.
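The whole sequential procedure can be sketched as follows. This is an illustration under our naming assumptions: `evaluate` stands for the training-set evaluation of Section 3.4, and the default parameter values are invented:

def tune_weights(weights, evaluate, d0=0.5, ds=0.9, cycles=20, divide=True):
    # Hill-climbing tuning of attribute weights. `evaluate(weights)`
    # returns the training-set evaluation of Section 3.4; the default
    # parameter values here are made up.
    best = evaluate(weights)
    for i in range(cycles):                       # fixed number of cycles
        # difference for cycle i+1: geometric (10) or arithmetic (9)
        d = d0 * ds ** i if divide else max(d0 - i * ds, 0.0)
        for a in range(len(weights)):
            original = weights[a]
            for candidate in (original - d, original + d):
                if candidate <= 0:                # weights must stay positive
                    continue
                weights[a] = candidate
                score = evaluate(weights)
                if score > best:                  # keep the best of the three
                    best, original = score, candidate
            weights[a] = original                 # winning setting is kept
    return weights, best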

3.5.2 Genetic Algorithm

Another way to obtain optimal attribute weights is to use a genetic algorithm [40, 6, 7], a stochastic optimization method [21] inspired by nature. The method works with individuals and a population composed of them. Each individual represents one solution of the problem being solved; in our case an individual represents the weights of all attributes. All weights have to be encoded in one string called a chromosome. We use a simple bit string, so the encoding is very easy, and the length of the chromosome can be controlled by setting the number of bits per attribute: with k bits per attribute, the total length of the chromosome (in bits) is L = k · n, where n is the number of attributes. At the beginning of the algorithm the population must be initialized. We initialize it pseudo-randomly: the bits of the chromosomes are set randomly, but the generated population has to satisfy the following condition:


f_1(i) − f_0(i) ≤ 1,    for i = 1, …, L,

f_1(i) = Σ_{k=1}^{Popsize} bit(k, i),    f_0(i) = Popsize − f_1(i).    (11)

Where: i – index of a bit in the chromosome,
f_1(i) – number of individuals whose ith bit is set to '1',
f_0(i) – number of individuals whose ith bit is set to '0',
k – index of an individual in the population,
bit(k, i) – value of the ith bit of the kth individual,
Popsize – size of the population.

At the beginning of the GA cycle the individuals are evaluated by a numerical value called fitness. It is the number obtained from the training/testing set evaluation – see Section 3.4. If desired, the fitness values can be scaled. We have implemented only one simple scaling function – Linear Ranking (LRank), where a scaled fitness is assigned to each individual according to its order in the population:

f'_i = PopSize − i + 1

Where i denotes the order of the individual in the population after sorting by the old fitness values. The scaled fitness of the best individual is thus PopSize, and the worst one is assigned the value 1.

The next step of the GA is selection [5]. Several selection methods have been implemented, from the simplest to more capable ones. The first method, Roulette wheel, is included only for historical reasons and is of little practical use.

Another implemented method is Remainder Stochastic Sampling with/without Replacement (RSS with/without R). For this method we have to compute so-called expected values of the individuals. The expected value is a real number indicating how many copies of the individual should be placed in the inter-population. It is usually defined by the expression:

EV_i = f_i / f_avg

Where f_i is the fitness of the ith individual and f_avg is the average fitness over the whole population.

In the first phase of the algorithm, each individual is selected as many times as the integer part of its expected value. The remaining individuals are selected by roulette wheel, where the fractional parts of the expected values play the role of fitness. With RSS without R, each individual can be selected by roulette wheel only once; this is ensured simply by setting the fitness of a selected individual to zero, so it cannot be selected again. With RSS with R, an individual can be selected more than once.
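A sketch of RSS selection under these definitions (names are ours, not iBARET's):

import random

def rss_select(population, fitness, with_replacement=False):
    # Remainder Stochastic Sampling: first the deterministic copies
    # given by the integer parts of EV_i = f_i / f_avg, then a roulette
    # wheel over the fractional parts for the remaining slots.
    f_avg = sum(fitness) / len(fitness)
    ev = [f / f_avg for f in fitness]
    selected = []
    for ind, e in zip(population, ev):
        selected.extend([ind] * int(e))           # deterministic part
    frac = [e - int(e) for e in ev]
    while len(selected) < len(population):
        i = random.choices(range(len(population)), weights=frac)[0]
        selected.append(population[i])
        if not with_replacement:
            frac[i] = 0.0                         # cannot be drawn again
    return selected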

The last implemented selection method is Tournament. This method has a parameter N that can be different for each parent. The parameter determines how many individuals take part in the tournament: N individuals are chosen randomly and the best of them is selected as a parent. This sequence is repeated PopSize times (PopSize is the size of the population).

Once the individuals have been selected, the recombination operators – crossover and mutation – can be applied, and the new population is then assembled from the recombined individuals.


It is possible to apply the 1-point or 2-point crossover method. In addition, the 2-point crossover method has a 2-pp option: when 2-pp is selected, the possible crossing points lie only between attributes (they cannot fall inside the binary representation of an attribute weight). Crossover and mutation proceed with probabilities set by the user.

After crossover we apply the next recombination operator – mutation. We take all offspring individuals and the other selected individuals that have not been used for crossover and apply mutation to them: each bit of an individual's chromosome is negated with probability P_mut.

Finally, the old population should be replaced with the new one. We use a simple method: the new population is composed of all offspring individuals and the individuals that have not been used for crossover (i.e., all individuals after mutation). A slightly different method is used when Elitism is checked: the inter-population is selected with size (PopSize − 1) instead of PopSize, and the best individual of the old population is inserted into the new population as the last individual (without crossover and mutation).

New populations are generated in cycles until some termination condition is fulfilled. The run can be terminated after a certain number of generated populations or when a certain success level is reached.
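The generational cycle described above can be sketched as follows. The encoding, operators, and parameters follow this section's description, but all names and default values are ours, and the balanced pseudo-random initialization of condition (11) is omitted for brevity:

import random

def ga_tune(evaluate, n_attrs, k_bits=8, pop_size=30, p_cross=0.8,
            p_mut=0.01, generations=50, elitism=True):
    # Bit-string GA for attribute weights: tournament selection,
    # 1-point crossover, bit-flip mutation, optional elitism.
    L = k_bits * n_attrs

    def decode(chrom):          # k_bits per attribute -> weight in [0, 1]
        return [int(chrom[a * k_bits:(a + 1) * k_bits], 2) / (2 ** k_bits - 1)
                for a in range(n_attrs)]

    def tournament(pop, fit, n=3):
        return max(random.sample(list(zip(pop, fit)), n), key=lambda t: t[1])[0]

    pop = [''.join(random.choice('01') for _ in range(L)) for _ in range(pop_size)]
    for _ in range(generations):
        fit = [evaluate(decode(c)) for c in pop]
        new = [pop[fit.index(max(fit))]] if elitism else []
        while len(new) < pop_size:
            a, b = tournament(pop, fit), tournament(pop, fit)
            if random.random() < p_cross:                # 1-point crossover
                x = random.randrange(1, L)
                a = a[:x] + b[x:]
            a = ''.join(bit if random.random() > p_mut   # bit-flip mutation
                        else '10'[int(bit)] for bit in a)
            new.append(a)
        pop = new
    fit = [evaluate(decode(c)) for c in pop]
    return decode(pop[fit.index(max(fit))])              # best weight vector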

The newest GA we have implemented is the Genetic Algorithm with Limited Convergence (GALCO) [35]. This algorithm differs somewhat from the GA described above in that it does not replace the whole population at once. Only two parents are selected, using tournament selection. Crossover (P_cross = 100%) is applied to these parents; mutation is skipped. The offspring are evaluated, and if one of them is better than the better of the parents, the parents are replaced with the offspring. Otherwise some bad individual is chosen – the worst one, or by a tournament in which the worse individual wins. The bad individual is then progressively replaced with the first offspring, but the following condition has to be fulfilled:

f_1(i) − f_0(i) ≤ B,    for i = 1, …, L    (12)

Where i, f_1(i), f_0(i), and L have the same meaning as in (11), and B is the balance deviation parameter.

The new individual is evaluated, and if it is better than the old bad one, the replacement is confirmed; otherwise the old individual is restored. The same procedure is applied to the second offspring. After both offspring have been processed, new parents are selected and the cycle continues.

This algorithm has one big advantage – it does not converge to a single local extreme, and the final population holds diverse results, all of which represent very good solutions. It is able to find very good solutions in a shorter time than the other implemented methods.

3.6 Future Work

We would like to extend this tool with several more powerful methods for learning and reasoning. Implementing support for PMML [39] would be very helpful. This language could store the whole IBR model including the cases. The big advantage would lie in the ability of other applications that support PMML to cooperate easily with iBARET – see Figure 4. On iBARET's input, some PMML-compliant application could, for example, preprocess data, build clusters, etc. If the model contains cluster representatives (instead of cases), iBARET should be able to work with them as with classical data. If iBARET is able to produce output in PMML, the model could be used in another PMML-compliant application, for example for visualization or consultation.


[Figure 4: PMML utilization – block diagram: clustering software turns unclassified data into a PMML clustering model for iBARET (the IBR classifier), which works with classified data and exchanges models with other PMML-compliant applications.]

The advantages of PMML could also be exploited in another group of methods we plan to implement – case reduction techniques. A great number of reduction techniques have been suggested; an overview can be found in [50]. We would like to implement two case memory reduction algorithms.

The first of them was suggested in [12]; its underlying idea is very similar to that of hierarchical clustering. Applying this algorithm effectively amounts to loading an external PMML model generated by a hierarchical clustering algorithm. The algorithm deals with prototypes. Initially, each instance in the case memory represents a single prototype. Then the two nearest prototypes that have the same class are merged into a single prototype using a weighted averaging scheme; the new prototype is located somewhere between the two original prototypes. The representation of a prototype is straightforward for linear features. If symbolic features are present in the instance descriptive vector, these features have to be dichotomized, and the prototype is represented by its affinity to the individual symbolic feature values. For example, the symbolic feature Gender (with values male, female) is replaced by two new features, Male and Female. When merging prototype Pa(1) with Male = 0, Female = 1 (Pa represents a single original instance) and prototype Pb(2) with Male = 1, Female = 0 (Pb represents two original instances), the new prototype Pc(3) has values Male = 0.66 and Female = 0.33. An advantage of the algorithm is its applicability to regression tasks; in that case the condition of identical classification of the merged prototypes is replaced by a requirement on the similarity of their numerical outputs.
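The weighted merging step can be illustrated as follows (a sketch; the prototype representation and all names are our assumptions):

def merge_prototypes(pa, pb):
    # Weighted averaging of two same-class prototypes, following [12].
    # A prototype is a (vector, count) pair; symbolic features are
    # assumed to be dichotomized into 0/1 indicator features beforehand.
    (va, na), (vb, nb) = pa, pb
    merged = [(a * na + b * nb) / (na + nb) for a, b in zip(va, vb)]
    return merged, na + nb

# The Gender example from the text, with features (Male, Female):
# merge_prototypes(([0, 1], 1), ([1, 0], 2)) -> ([0.66..., 0.33...], 3)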

The second algorithm has its origin in the TIBL system suggested in [52]. This algorithm attempts to keep the typical instances representing the individual classes, which enables a huge instance reduction and smooth decision boundaries. The resulting PMML model is compact and robust in the presence of noise. A disadvantage of this algorithm is its overgeneralization on problems with complex decision surfaces. The typicality of an instance is defined as the ratio of its average similarity to instances of the same class to its average similarity to instances of other classes. Similarity is defined in terms of the distance function as (1 − distance(i1, i2)). The reduction algorithm proceeds iteratively as follows. Start with an empty reduced instance set S. Pick the most typical instance x of the original set T which is not in S and is not correctly classified by the instances in S. Find the most typical instance y in T − S which causes x to be correctly classified and add it to S. Repeat this process until all instances in T are classified correctly.
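A sketch of the planned reduction loop (illustrative only; `distance`, `same_class`, and `correctly_classified` – e.g., a 1-NN check against S – are assumed helper functions):

def typicality(x, T, same_class):
    # Ratio of the average similarity to same-class instances to the
    # average similarity to other-class instances; similarity is
    # 1 - distance(i1, i2), with `distance` assumed defined elsewhere.
    same = [1 - distance(x, t) for t in T if same_class(x, t) and t is not x]
    other = [1 - distance(x, t) for t in T if not same_class(x, t)]
    return (sum(same) / len(same)) / (sum(other) / len(other))

def reduce_tibl(T, correctly_classified, same_class):
    # Iteratively build the reduced set S from the most typical instances.
    S = []
    by_typ = sorted(T, key=lambda t: typicality(t, T, same_class), reverse=True)
    while True:
        wrong = [x for x in by_typ
                 if x not in S and not correctly_classified(x, S)]
        if not wrong:
            return S                 # S classifies all of T correctly
        x = wrong[0]                 # most typical misclassified instance
        for y in by_typ:             # most typical instance that fixes x
            if y not in S and correctly_classified(x, S + [y]):
                S.append(y)
                break
        else:
            S.append(x)              # fallback (not in [52]): guarantees
                                     # termination when no y helps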

To improve the evaluation of regression tasks, we would like to implement more evaluation functions, at least the Mean Square Error (MSE) and the Root Mean Square Error (RMSE). These measures should help us compare our results with the results of other methods (applications).

With the help of PMML, the user interface could also be more intuitive in loading inputs. Currently, before starting experiments, two files have to be loaded into the server and two files into the interface. With PMML we could load only one file into the interface and send the data and the model to the server through the CQL connection. We would also like to make some other improvements in the iBARET controls to make it more user friendly.

The basic idea behind the iBARET implementation is the effort to develop a general IBR tool that can be easily applied to a wide variety of classification and regression tasks. Our further plan is to run this system on benchmark data to determine its performance and compare it with the performance of other learning methods.

4 Experiments

The first real-life application in which we tried an instance-based modelling tool was the prediction of the result of Coronary Artery Bypass Graft (CABG) surgery. This task was solved with the commercial CBR-Works 3.0 Professional system [45] extended with an original module providing an automated interface; at that time we did not yet have the CQL server. The results can be found in [42, 30].

The most recent task we are solving is a challenge in the SPA domain. The object of research is a timely prediction of the capacity requirements of the therapeutic utilities. In other words, it is interesting to know (predict) how many individual procedures are going to be prescribed for the following time period, on the assumption that we know the basic patient group characteristics for the regarded period (the overall number of patients and their structure). The presented work is based on the CTU weeks dataset (produced by SumatraTT [3]). This dataset is focused on a weekly representation of the SPA problem: each record represents a single week. It consists of attributes expressing the number of patients belonging to each of the 128 pre-defined distinct patient groups in the given week (Gr1, Gr2, ...). These attributes are followed by the numbers of the individual procedure applications in the given week; there are 38 different procedures we would like to predict (Proc1, Proc2, ...). Consequently, the prediction model generated from this dataset considers the patient structure (patients are divided among 128 distinct groups) and outputs weekly predictions of the individual procedures (38 procedures in total).

So far, iBARET has been applied to predict the sixth procedure (Proc6) only. The feature weights are learned on the training data first; afterwards, predictions for all the available weeks are constructed (LOOCV is used for this purpose). The results can be seen in Figure 5. The given model delivers results with the following characteristics:

MAE = 31.3 (i.e., a mean absolute error of about 30 procedures per week),

MAPE = 9.8% (i.e., the model makes about a 10% error in an average week).

The question is how good and reliable this output is. Let us try to apply a simple regression model that does not consider any (structural) patient information and deals purely with the overall number of patients. The model looks as follows:

Procj_num_i_pred = a_1 · Gr_all_i + a_0,    (a_1 = 0.185, a_0 = −66)

Where Procj_num_i_pred is the predicted number of applications of procedure number j in week number i, and Gr_all_i is the overall number of patients staying in the spa in week number i.

The given model delivers results with the following characteristics:

MAE = 76.4 (i.e., a mean absolute error of about 76 procedures per week),

MAPE = 21.4% (i.e., the model makes about a 20% error in an average week).


[Figure 5: Procedure 6 – iBARET's training error. Plot "Proc6 – predictions vs. real value (original data)": number of Proc6 applications per week (weeks 1–125) for Proc6 - real, Proc6_pred1, and Proc6_pred2.]

In Figure 5, the predictive result of the simple regression model is shown for comparison. A mutual comparison of the predictive quality of the two models leads to a logical explanation: the overall number of patients is a very significant attribute, but the result quality can still be improved by considering the structural information.

The developed models mentioned above were tested on unseen data that were not available during training. These data represent another 21-week time period, i.e., the testing set consists of 21 additional records. The following graph (Figure 6) shows the performance of different modifications of the iBARET predictor.

[Figure 6: Procedure 6 – iBARET's performance on testing data. Plot "Proc6 – predictions vs. real value (testing data)": number of Proc6 applications per week for Proc6 - real, Proc6-iBARET, Proc6-iBARET-win, and Proc6-iBARET-equalweights.]

The red line represents the setting (further referred to as setting 1) in which the set of feature weights learned on the training data was applied; only the training data were included in the case memory when predicting the testing instances. The dark blue line uses the same set of weights, but with a sort of windowing – the case memory always contains all the records that precede the currently predicted testing record (setting 2). Finally, the green line represents a setting in which uniform weights are used (∀i: w_i = 1) and no windowing is applied (setting 3).

The presented results lead to the following conclusions. The performance of setting 1 (MAE = 100 procedures/week, MAPE = 23%), compared with the performance of setting 3 (MAE = 91 procedures/week, MAPE = 21%), suggests that the feature weights learned on the training set are definitely over-trained (their utilization brings no gain compared with the uniform weights). At the same time, the prediction error estimated above on the basis of the training error was overly optimistic for the given type of predictor.

[Figure 7: Procedure 6 – iBARET's performance with a reduced number of patient groups. Plot "Proc6 – predictions vs. real value (testing data)": number of Proc6 applications per week for Proc6 - real, Proc6-iBARET-win, and Proc6-iBARET-reduced.]

Windowing can bring some gain: setting 2 shows MAE = 81 procedures/week and MAPE = 18%. This suggests that the recent history (namely the last week) is important when predicting the forthcoming week.

Obviously, learning 128 feature weights from 125 examples leaves a big space for over-training. That is why we have tried to identify the patient groups that are principal for the given procedure and thus reduce the number of features. It was decided to use only 10 features – the groups with the largest feature weights. This approach (without windowing) brought a substantial improvement (Figure 7): MAE = 52 procedures/week, MAPE = 16% on testing data. It follows that this approach, combined with windowing, represents the most promising setting of iBARET.

For the sake of simplicity, this section focuses on procedure 6 only. The other procedures were predicted as well (Proc8, Proc13, Proc16, Proc19, Proc22, and Proc39), and the results are in accord with the results reached on procedure 6. Generally, it can be concluded that we have constructed predictive models that are likely to predict with a MAPE lower than 20%.


5 Future Research

We now try to summarize how our further research could be oriented and what we would like to work on.

As written in Section 3.6, we would like to continue developing iBARET. It should include further AI techniques to become more flexible and powerful. First we would like to implement several case reduction techniques and run benchmark experiments; this will require a deeper intervention in both parts of our client–server application.

On the theoretical level there are several possibilities to focus on. Besides case reduction there is another very interesting research area – feature reduction. Both reduction techniques are very important for achieving relevant results, so we would like to study them in more detail and try to include them in the ordinary learning process. Case reduction and feature reduction could work together with respect to PAC analysis.

Another very interesting area of k-NN-like methods is Locally Weighted Regression (LWR). Here we would like to focus on the problem of selecting a convenient LWR method in accordance with the task and the input data. It would be interesting to find out whether a suitable LWR method can be suggested from the number of features, the amount of data, and some additional knowledge about the data (meta-data). As shown by the experiments in Section 3.4.2, noise in the data influences the results as well. The question is whether the optimal LWR method can be determined from knowledge of the data and of the noise in the data (its distribution function). As a first step, we could try to suggest the k parameter of the k-NN method from the data and a noise estimate.

It could also be helpful to learn more about techniques mostly used in computer vision, such as Support Vector Machines, and try to utilize their advantages in machine learning. For example, we could investigate the use of kernel functions (as employed in SVMs) in other ML methods.

6 Conclusion

This work includes a short state of the art of Machine Learning and summarizes two years of ongoing research. The first part is mostly theoretical: it describes Machine Learning and Data Mining techniques and their common subtasks, such as data preprocessing, testing, and results evaluation.

The main part of this work describes the iBARET system. The system has been extensively expanded over the last two years and now covers many methods for LWR, case set evaluation, feature weight tuning, etc. Almost all of these methods are explained here, and the section closes with a vision of iBARET's future development. The functionality of iBARET is demonstrated on the most recent experiment – on the SPA data.

We have tried to show some of the problems linked with Machine Learning and Data Mining. We would like to focus on some of them and thus help to push back the boundaries of Artificial Intelligence.


References

[1] A. Aamodt and E. Plaza. Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches. AI Communications, 7(1):39–59, 1994.

[2] D.W. Aha, D. Kibler, and M.K. Albert. Instance-based learning algorithms. Machine Learning, 6:37–66, 1991.

[3] P. Aubrecht. Sumatra Basics. Technical report GL–121/00 1, Czech Technical University, Department of Cybernetics, December 2000.

[4] P. Aubrecht. Tutorial of Sumatra Embedding. Technical report GL–101/00 1, Czech Technical University, Department of Cybernetics, October 2000.

[5] J.E. Baker. Reducing Bias and Inefficiency in the Selection Algorithm. In Proceedings of the Second International Conference on Genetic Algorithms, pages 14–21. Lawrence Erlbaum Associates, Hillsdale, NJ, 1987.

[6] D. Beasley, D.R. Bull, and R.R. Martin. An Overview of Genetic Algorithms: Part 1, Fundamentals. Technical report, University Computing, 1993.

[7] D. Beasley, D.R. Bull, and R.R. Martin. An Overview of Genetic Algorithms: Part 2, Research Topics. Technical report, University Computing, 1993.

[8] D.A. Berry. Statistics: A Bayesian Perspective. Duxbury Press, Belmont, California, 1996.

[9] Ch.M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.

[10] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144–152. ACM, 1992.

[11] L. Breiman. Bagging Predictors. Machine Learning, 24(2):123–140, 1996.

[12] C.L. Chang. Finding prototypes for nearest neighbor classifiers. IEEE Transactions on Computers, 23:1179–1184, 1974.

[13] W. Cleveland and S. Devlin. Locally-weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83:596–610, 1988.

[14] The World Wide Web Consortium (W3C). http://www.w3c.org. Web page.

[15] S. Cost and S. Salzberg. A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features. Machine Learning, 10:57–78, 1993.

[16] A. Dhagat and L. Hellerstein. PAC learning with irrelevant attributes. In Proc. of the 35th Annual Symposium on Foundations of Computer Science, pages 64–74. IEEE Computer Society Press, Los Alamitos, CA, 1994.

[17] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, London, UK, 1993.


[18] P.A. Flach. The logic of learning: a brief introduction to Inductive Logic Programming. In Proceedings of the CompulogNet Area Meeting on Computational Logic and Machine Learning, pages 1–17. University of Manchester, 1998.

[19] P.A. Flach. On the state of the art in machine learning: A personal review. Artificial Intelligence, 131(1/2):199–222, September 2001.

[20] W.J. Frawley, G. Piatetsky-Shapiro, and C.J. Matheus. Knowledge discovery in databases – an overview. AI Magazine, 13:57–70, 1992.

[21] D.E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA, 1989.

[22] Gartner Group. http://www.gartner.com. Web page.

[23] D. Guijarro, J. Tarui, and T. Tsukiji. Finding Relevant Variables in PAC Model with Membership Queries. In Proc. 10th International Conference on Algorithmic Learning Theory – ALT '99, volume 1720, pages 313–322. Springer-Verlag, 1999.

[24] M. Hall. Correlation-based Feature Selection for Machine Learning. PhD thesis, Waikato University, Department of Computer Science, Hamilton, NZ, 1998.

[25] M. Hall and L. Smith. Practical feature subset selection for Machine Learning. In Proceedings of the Australian Computer Science Conference. University of Western Australia, February 1996.

[26] M.A. Hall. Feature selection for discrete and numeric class machine learning.

[27] J.A. Hanley and B.J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29–36, 1982.

[28] L.P. Kaelbling, M.L. Littman, and A.P. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

[29] J.D. Kelly and L. Davis. A Hybrid Genetic Algorithm for Classification. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, IJCAI-91, volume 2, 1991.

[30] J. Klema, L. Lhotska, O. Stepankova, and J. Palous. Instance-Based Modelling in Medical Systems. In R. Trappl, editor, Cybernetics and Systems 2000, 2, pages 365–370, Vienna, Austria, April 2000. Austrian Society for Cybernetics Studies. ISBN 3-85206-151-2.

[31] J. Klema and J. Palous. iBARET – Instance-Based Reasoning Tool. Research report GL 113/00, CTU FEE, Department of Cybernetics, The Gerstner Laboratory, 2000.

[32] J. Klema and J. Palous. iBARET – Instance-Based Reasoning Tool. In ELITE Foundation, editor, European Symposium on Intelligent Technologies, Hybrid Systems and Their Implementation on Smart Adaptive Systems, 1, page 55, Susterfelderstrasse 83, Aachen, December 2001. Verlag Mainz, Wissenschaftsverlag.

[33] J. Klema and J. Palous. Případové usuzování a rozhodování [Case-based reasoning and decision making]. In Sborník Znalosti 2001, volume 1, pages 231–238. Vysoká škola ekonomická, Praha, June 2001. ISBN 80-245-0190-2. In Czech.


[34] J. Kolodner. Case-Based Reasoning. Morgan Kaufmann, San Mateo, California, 1993.

[35] J. Kubalik, L.J.M. Rothkrantz, and J. Lazansky. Genetic Algorithms with Limited Convergence. To be published in Proceedings of the Fourth International Workshop on Frontiers in Evolutionary Algorithms (FEA 2002), Research Triangle Park, North Carolina, USA, March 8–13, 2002.

[36] D. Lowe. Similarity metric learning for a variable-kernel classifier. Neural Computation, 7(1):72–85, 1995.

[37] V. Marik, O. Stepankova, J. Lazansky, et al. Umělá inteligence (1) [Artificial Intelligence (1)]. Academia, Praha, 1993. In Czech.

[38] Z. Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag, Berlin Heidelberg, second edition, 1994.

[39] Data Mining Group. http://www.dmg.org. Web page.

[40] M. Mitchell. An Introduction to Genetic Algorithms. The MIT Press, Cambridge, Massachusetts, 1998.

[41] T. Mitchell. Machine Learning. McGraw-Hill Co., 1997.

[42] J. Palous. Využití případového usuzování pro lékařské aplikace [Use of case-based reasoning for medical applications]. Master's thesis, CTU FEE, Prague, 2000. In Czech.

[43] J.R. Quinlan. Induction of Decision Trees. Machine Learning, 1:81–106, 1986.

[44] D. Skalak. Prototype and Feature Selection by Sampling and Random Mutation Hill Climbing Algorithms. In The 11th International Conference on Machine Learning, pages 293–301, 1994.

[45] TECINNO GmbH, Kaiserslautern, Germany. CBR Works 3.0 documentation, 1998.

[46] M. Vejmelka. CQL server download and description at http://phobos.spaceports.com/~vejmelka/cql/main.html. Web page.

[47] Weka 3 – machine learning software in Java. http://www.cs.waikato.ac.nz/~ml/weka/. Web page.

[48] D. Wettschereck, D. Aha, and T. Mohri. A review and empirical evaluation of feature-weighting methods for a class of lazy learning algorithms. Artificial Intelligence Review, 11:273–314, April 1997.

[49] D. Wilson. Advances in instance-based learning algorithms. PhD thesis, Brigham Young University, Provo, UT, 1997.

[50] D.R. Wilson and T.R. Martinez. Reduction Techniques for Instance-Based Learning Algorithms. Machine Learning, 38(3):257–286, 2000.

[51] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, October 1999.

[52] J. Zhang. Selecting Typical Instances in Instance-Based Learning. In Proceedings of the Ninth International Machine Learning Workshop, Aberdeen, Scotland, pages 470–479. Morgan Kaufmann, San Mateo, CA, 1992.
