REAL TIME MOTIF CLASSIFICATION FROM DATABASE USING INTELLIGENT ALGORITHMS


    International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.6, November 2012

DOI: 10.5121/ijaia.2012.3610


Hiteshkumar Nimbark1, Pratik Chavda2, Chetan Undhad2 and Paresh Kotak3

1Department of Computer Science and Engineering, B H Gardi Engineering College, Rajkot, Gujarat, India, [email protected]

2Noble Engineering College, Junagadh, Gujarat, India, [email protected], [email protected]

3A V Parekh Polytechnic, Rajkot, Gujarat, India, [email protected]

    ABSTRACT

The amount of raw data being accumulated in databases is increasing at an inconceivable rate. However, these data-rich databases are poor at providing substantial information. This is where data mining comes into the picture. Specifically, data mining is "the process of extracting or mining information from large amounts of data". Motif classification has been an active area of research in data mining. It consists of assigning a data instance to one of the predefined classes/groups based upon the knowledge gained from previously seen (classified) data.

Real-world problems demand classifiers that are accurate as well as easy to interpret. Traditional comprehensible classifiers like decision trees are very accurate at classifying new instances, but with an increase in the size of the dataset, the number of classes, or the number of attributes, the trees induced become very large and hence difficult to interpret. On the other hand, evolutionary algorithms like Genetic Programming, when applied to classification problems, give trees that are smaller in size but not very accurate.

Thus the work proposes an algorithm, MotifClass, that employs Genetic Programming as an optimization technique to yield efficient and effective classification. The goal is to evolve a classifier that performs a trade-off between accuracy and comprehensibility in order to produce an optimal decision tree classifier (for n-class problems, where n >= 2) using GP.

When evaluated on some benchmark datasets, the proposed algorithm, obtained by merging Genetic Programming and decision trees, outperforms the traditional classification techniques in terms of a combination of accuracy and comprehensibility.

    KEYWORDS

    Accuracy, Motif Classification Trees, Comprehensibility, Genetic Programming, Multi-class classification.

    1. INTRODUCTION

The digital revolution has brought an explosion of stored data to our world, and it is increasing at an ever faster rate. This collection of data is quickly outpacing the ability of humans to understand or make use of it. Scientists across the world have joined the hunt to discover valuable knowledge from this digital information [1].


Classification algorithms generally yield results in either symbolic or non-symbolic form. Algorithms that provide non-symbolic results only assign class labels to the data instances and do not convey the reason why the algorithm led to that classification. Examples include Bayesian classifiers, k Nearest Neighbor (kNN) classifiers, Neural Networks, etc. However, in real-world applications it is also required to know the reasons behind classifying any data into a particular class. Algorithms that provide such reasons fall in the category of symbolic classifiers. Decision trees are symbolic classifiers, i.e. they have high comprehensibility and are accurate too. But with an increase in the number of classes, or when the training data becomes very large, the size of the tree increases considerably. This leads to a loss of "comprehensibility", which was considered one of the best properties of decision trees.

Evolutionary algorithms like Genetic Algorithms (GA) and Genetic Programming (GP) have been found successful in solving numerous classification problems. Also, tree structures are an encoding scheme for GP individuals. The classification trees generated using GP have a smaller number of nodes and hence are more comprehensible.

Hence, the need to develop an optimal multiclass classifier that performs a trade-off between accuracy and comprehensibility has been the motivation for this work.

    2. LITERATURE SURVEY

Motif classification has been an active area of research in data mining. It is a form of data analysis that can be used to extract models describing important data classes and can help provide us with a better understanding of the data at large. Classification predicts categorical (discrete, unordered) classes for any unseen data.

Intelligent Evolutionary Algorithms (EAs) are stochastic optimization techniques that imitate some aspects of natural evolution. They try to obtain good solutions by creating a population of data structures that represent them, and they evolve that population by changing those solutions, combining them to create new ones, and, on average, making the better-suited survive while discarding the worst-suited individuals. EAs are a set of modern metaheuristics used successfully in many applications with very high complexity.
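A minimal sketch of this evolutionary loop, written here in Python purely for illustration (it is not the implementation used in this work, and the fitness, crossover and mutate arguments are assumed placeholders supplied by the caller), is given below.

import random

def evolve(initial_population, fitness, crossover, mutate,
           generations=50, crossover_rate=0.9):
    # Generic evolutionary-algorithm skeleton: evaluate, select, vary, replace.
    population = list(initial_population)
    for _ in range(generations):
        # Rank the current population; higher fitness is assumed to be better.
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: len(ranked) // 2]          # the better-suited survive
        offspring = []
        while len(survivors) + len(offspring) < len(population):
            if random.random() < crossover_rate:
                a, b = random.sample(survivors, 2)      # combine two solutions
                offspring.append(crossover(a, b))
            else:
                offspring.append(mutate(random.choice(survivors)))
        population = survivors + offspring              # the worst-suited are discarded
    return max(population, key=fitness)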

    2.1. Motif Classification

Motif classification is the task of assigning a data instance to one of the predefined classes/groups based upon the knowledge gained from previously seen (classified) data.

Basically, classification is a two-step process. First, a training set consisting of records whose class labels are known must be provided. The training set is used to induce a classification model, which is subsequently applied to the test set consisting of records with unknown class labels [1]. Classification plays a major role in numerous practical situations, such as classifying e-mails as spam or not spam, classifying loan applications as either safe or risky for a bank, diagnosing medical conditions, categorizing news stories as finance, weather, business, entertainment, sports, etc. There are numerous methods used to build classifiers, some of which are listed below; a minimal sketch of the two-step workflow follows the list.

    Decision Tree (CART) [1]

    Naive Bayes Classifier [7]

    k Nearest Neighbors Classifier [2,11,12] etc.
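As a concrete illustration of the two-step process described above, the sketch below induces a decision tree from labelled training records and then uses it to classify unseen test records. It uses scikit-learn and the Iris dataset purely for demonstration; these are not the tools or data used in this work, and the 66% training split merely mirrors the experimental setup described later.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # records with known class labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.66, random_state=0)        # 66% for training, the rest for testing

model = DecisionTreeClassifier(random_state=0)    # step 1: induce the classification model
model.fit(X_train, y_train)

predictions = model.predict(X_test)               # step 2: classify unseen records
print("Test accuracy:", accuracy_score(y_test, predictions))
print("Tree size (nodes):", model.tree_.node_count)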


    2.2. Intelligent Algorithms

Real-world applications are likely to encounter problems that are hard to solve. Some examples of these kinds of problems are the travelling salesman problem, the knapsack problem, etc. To approach such hard problems, a number of concepts inspired by nature were introduced in the past decades. Some of the more popular and successful examples are Genetic Algorithms [3], Neural Networks (NN), Fuzzy Methods (FM), Evolutionary Algorithms (EA, also known as Evolutionary Computation), Ant Colony Optimization (ACO), etc.

EAs [23] are slow, but the comprehensibility of EA-based classifiers is better than that of classifiers built with non-evolutionary approaches. Hence, EAs are the best choice when comprehensibility is a selection criterion, and decision trees are the first choice when training time is a selection criterion.

Different ways of establishing classification techniques using GP have been described, and the issues that need to be analyzed for GP in each method are mentioned in [22].

The methods and research works that can be used to create decision trees and evaluate them using GP are as follows.

The paper [24] presents an approach for designing classifiers for a multiclass problem using Genetic Programming.

An MOGP system [25] designed to optimize two goals is described, but these goals are not fixed. In this multi-objective genetic programming approach [25], decision trees are generated through evolution of the training data, whereas DTs are usually designed from fixed data.

The authors of Adaptive genetic programming for dynamic classification problems [26] investigate the feasibility of using Genetic Programming in dynamically changing environments to evolve decision trees for classification problems and propose a new version of Genetic Programming called Adaptive Genetic Programming. To cope with the changing environment, a new control parameter known as culling is introduced.

In Novel feature selection method for genetic programming [27], GP is employed to achieve a selection of features. Each individual is a tree encoding a classifier represented as a discriminant function, and classification accuracy is used as the fitness function.

The authors capitalize on the implicit feature selection ability of GP classifiers by applying GP for classification in a two-stage scheme. First, a certain number of GP runs are carried out, each one resulting in a best-of-run classifier. Then, in the second stage, GP is run again using only the features most frequently used in the best classifiers obtained in the previous stage. The goal of the first stage is just to discover the most interesting features, and the classifiers obtained are discarded. The final classifier is induced in the second stage using the features selected in the first stage.
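The two-stage scheme can be summarized in the following sketch. It is only an outline under assumed interfaces: run_gp is a hypothetical helper that performs one GP run restricted to a given feature set and returns a best-of-run classifier exposing the features it actually used; it is not part of any existing library.

from collections import Counter

def two_stage_gp_feature_selection(data, labels, all_features, run_gp,
                                   n_runs=10, top_k=5):
    # Stage 1: several independent GP runs; only the feature-usage statistics are kept.
    usage = Counter()
    for _ in range(n_runs):
        best_of_run = run_gp(data, labels, all_features)
        usage.update(best_of_run.used_features)   # the classifiers themselves are discarded

    # Stage 2: GP is run again using only the most frequently used features.
    selected = [feature for feature, _ in usage.most_common(top_k)]
    final_classifier = run_gp(data, labels, selected)
    return selected, final_classifier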

Further, Fitness landscapes and inductive genetic programming [28] presents an analytical study of the influence of the fitness function and the mutation operator on the quality of the decision trees induced, taking into account both classification accuracy and tree size.

In Searching the forest: Using decision trees as building blocks for evolutionary search in classification databases [29], a new mutation technique called Mutation on RHS has been proposed.


Given a condition, the method consists of changing its right-hand side into a randomly chosen (allowed) new one, which can either be a value of the attribute or another attribute. Also, the paper uses pruning as a genetic operator.

The potential of GP to include any kind of operation in individuals is used in Application of genetic programming to induction of linear classification trees [30] in order to evolve oblique decision trees. The paper shows that in GP, individuals tend to become larger over time. This phenomenon is known as bloat.

The above paper provides an up-to-date overview that is fully focused on evolutionary algorithms and decision trees and does not concentrate on any specific evolutionary approach. Also, it provides a taxonomy that addresses works that evolve decision trees as well as works that design decision-tree components by the use of evolutionary algorithms. Further, the paper addresses some more important issues and open questions that can be the subject of future research concerning the combination of decision trees and evolutionary algorithms.

In the proposed algorithm, the strategy of using decision trees to produce the initial population, instead of initializing the population randomly, is adopted. Further, emphasis is laid on the fact that mutation is destructive in nature, and a solution to this is suggested in the proposed algorithm [24].

    2. PROPOSED ALGORITHM

A new algorithm named MotifClass is proposed that performs a trade-off between accuracy and comprehensibility. As the name suggests, the proposed algorithm is a hybrid of GP and the classification tree (decision tree).

The algorithm is especially designed for large datasets and for problems involving multi-class classification, because an increase in dataset size or in the number of classes directly affects the size of the trees, resulting in trees that are not comprehensible enough. As a solution to this issue, the proposed approach gives equal importance to the size of the tree and its accuracy by designing the fitness function accordingly. Undoubtedly, the approach also works well for binary classification problems and small datasets.

The proposed algorithm uses some lessons learnt from the literature survey and combines them to produce a new classification technique. The basic approach is to form a population of decision trees and evolve them using GP, with accuracy and number of nodes as fitness parameters.

Algorithm 1: MotifClass
Input: Data partition Dt, which is a set of training instances and their associated class labels
Output: BEST_SO_FAR, the optimized classification tree

1. for i = 1 to POP_SIZE do // Initial population
2.     Create a decision tree from a random subset of Dt
3. end for
4. Evaluate the fitness of each individual in the initial population using the fitness measure of Equation 1 and Dt
5. while GENERATION < MAX_GENERATIONS or termination criterion not satisfied do
6.     repeat
7.         Perform genetic operations on parent trees to generate offspring trees for the new population
8.         if (Size of Individual < c1 * Size of BEST_SO_FAR) then
9.             Evaluate its fitness and increment NEW_POP_SIZE
10.        else if (Accuracy of Individual > c2 * Accuracy of BEST_SO_FAR) then
11.            Evaluate its fitness and increment NEW_POP_SIZE
12.        else
13.            Discard the individual
14.        end if
15.    until POP_SIZE individuals are produced
16.    GENERATION ← GENERATION + 1
17.    POP_SIZE ← NEW_POP_SIZE
18. end while
19. Save the population of the last generation for incremental learning
20. Designate the best individual found so far as BEST_SO_FAR

The input to the algorithm is the set of training instances along with their associated class labels. As output, the algorithm gives a classification tree, BEST_SO_FAR, which is an optimal classifier in terms of accuracy and comprehensibility. Initially, POP_SIZE is set equal to the desired size of the initial population, and, as the name suggests, the parameter MAX_GENERATIONS is set equal to the maximum number of generations the algorithm is desired to run. Because of scarcity of space, a detailed explanation of the algorithm is not included.

Equation 1 defines the fitness measure, in which a control parameter represents the trade-off between accuracy and size. Its value can range from 0.05 to 0.1.
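To make the flow of Algorithm 1 concrete, the following Python sketch outlines one plausible reading of it. It is not the MATLAB implementation used in this work: the helpers build_tree, accuracy, size and genetic_op are assumed placeholders, and the fitness shown (accuracy penalized by tree size through the trade-off factor) is only an assumed form consistent with the description of Equation 1, not the equation itself.

import random

def fitness(tree, data, w=0.05):
    # Assumed penalised-accuracy fitness; w is the accuracy/size trade-off factor.
    return accuracy(tree, data) - w * size(tree)

def motifclass(training_data, pop_size=100, max_generations=20, c1=1.0, c2=1.0):
    # Initial population: decision trees built from random subsets of the training data.
    population = [build_tree(random.sample(training_data, len(training_data) // 2))
                  for _ in range(pop_size)]
    best_so_far = max(population, key=lambda t: fitness(t, training_data))

    for _ in range(max_generations):
        new_population = []
        for _ in range(pop_size):                   # produce POP_SIZE candidate offspring
            child = genetic_op(population)          # crossover / mutation / reproduction
            # Validation step (lines 8-14): keep a child only if it is small enough
            # or accurate enough relative to the best tree found so far.
            if (size(child) < c1 * size(best_so_far)
                    or accuracy(child, training_data) > c2 * accuracy(best_so_far, training_data)):
                new_population.append(child)        # otherwise it is discarded
        population = new_population or [best_so_far]
        pop_size = len(population)                  # variable-size population
        best_so_far = max(population + [best_so_far],
                          key=lambda t: fitness(t, training_data))

    # The last population is kept for incremental learning (line 19).
    return best_so_far, population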

    3. IMPLEMENTATION

This section details the tools used for implementation, followed by a block diagram depicting the modular flow of the project and other relevant details.

    3.1 Tools Used and Modular flow

CART and the proposed algorithm MotifClass have been implemented in MATLAB 7.8.0 (R2009a), whereas GP for classification is implemented in HeuristicLab 3.5.33, an open-source tool under the GNU General Public License. All the experiments are performed on the Windows platform.

The reasons behind choosing MATLAB with the GPLab plug-in [38] for implementation of the proposed algorithm are its interactive interface, debugging facilities, and high-quality graphics and visualization facilities.

The block diagram of Figure 1 shows the modular flow of the project. In the PreRequisiteCheck stage, the dataset is imported and checked for noisy data. The algorithm also facilitates selecting a few attributes out of many.

In the second stage, the initial population is generated and evaluated for efficiency using the training data itself. The algorithm keeps the best individuals at the top of the array.


This way, their comparison with newly developed individuals can be done easily. Hence, a sort function is used that sorts individuals based on fitness.

The third stage performs the GP runs until the termination criterion is reached and presents the best solutions at the top of the array. The GPruns module performs reproduction with the rank selection method implemented in it. Further, it includes the crossover and mutation functions with the stated sub-functions such as swap, etc.
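A brief sketch of rank selection, as it might be used to pick parents for reproduction, is shown below; the exact selection details of the GPruns module are not reproduced in the paper, so this is only an assumed illustration.

import random

def rank_select(population, fitness, k=2):
    # Individuals are sorted by fitness and sampled with probability proportional
    # to their rank, so fitter trees are more likely to become parents without a
    # single very fit tree dominating the selection.
    ranked = sorted(population, key=fitness)            # worst first, best last
    ranks = list(range(1, len(ranked) + 1))             # rank 1 .. N
    return random.choices(ranked, weights=ranks, k=k)   # k parents for the genetic operators

Reversing the weights would bias selection toward the weaker individuals, which is the idea behind the Reverse Rank Selection used for mutation, as mentioned in the conclusions.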

The results of the last generation are saved for incremental learning, which occurs after the algorithm has produced its best classifier and some new training data arrives. In order to perform incremental learning, one more GP run is conducted on the saved population. This is why the GPruns module is called again from the Incremental-Learning module.
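The incremental-learning step can be pictured as in the sketch below, under assumed interfaces: gp_run is a hypothetical helper standing in for the GPruns module, returning the evolved population together with its best individual.

import pickle

def save_population(population, path="last_generation.pkl"):
    # Persist the final population of the last generation (line 19 of Algorithm 1).
    with open(path, "wb") as f:
        pickle.dump(population, f)

def incremental_learning(new_data, old_data, gp_run, path="last_generation.pkl"):
    # When new training data arrives, perform one more GP run on the saved
    # population instead of rebuilding the classifier from scratch.
    with open(path, "rb") as f:
        saved_population = pickle.load(f)
    population, best = gp_run(saved_population, old_data + new_data)
    save_population(population, path)                   # keep it for the next update
    return best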

    Figure 1. Block diagram of modules implemented

    3.2 Methodology

    3.2.1 Pre Requisite Checking

Figure 2 shows a GUI that enables selection of the attributes the user wants to allow to participate in classification. This feature is provided because the data might be noisy or some attributes may be irrelevant for classification. Such attributes can be eliminated here.


    Figure 2. GUI enabling selection of attributes for classification

    3.2.2 Initial Population

Figure 3 shows a snapshot of some of the trees of the initial generation. A few of the best individuals formed in the initial population on the Robot Navigation dataset described earlier are listed in Table 1.

    Figure 3: Sample of initial population generated (Decision Trees)


Table 1: A list of a few best individuals

Individual Number | Fitness | Accuracy | No. of Nodes
16 | 65.710079 | 74.701472 | 13
3 | 63.069364 | 75.118023 | 33
71 | 61.863498 | 73.451819 | 31
42 | 59.618168 | 70.785893 | 31
65 | 58.222897 | 71.008053 | 53
84 | 57.769857 | 70.952513 | 61

The results show that using decision trees to create the initial population gives good trees from the very beginning, which matches the theoretical assumption [32].

    3.2.3 Generic Implementation

Once the initial trees have been developed, the next step is to perform crossover, mutation and reproduction upon them based on the results. Figures 4 to 6 show how crossover and mutation are implemented.

Figure 6 shows an offspring obtained by performing a genetic operation on parents. In the case of crossover, two parents undergo the above-stated process, whereas for mutation, one parent is mutated with a randomly created new tree.

Whenever a leaf node is to be swapped with a leaf node, the task is very simple and only one node swap is required. However, if a node that has branches becomes the point of crossover or mutation, it has to be pruned first and then swapped with the crossover or mutation node of the other tree.

Figure 4(a): Step 1 - Randomly selecting the crossover/mutation point

Figure 4(b): Step 2 - Pruning the node undergoing crossover/mutation


Figure 5(a): Step 3 - Substituting node 2 of Tree 2 at node 5 of Tree 1

Figure 5(b): Step 4 - Addition of new nodes and changing properties of all following nodes

Figure 6: Step 5 - Addition of new nodes and changing properties of all following nodes

If the node that is added has branches, crossover or mutation takes place by finding the next node with a branch and assigning the number of its children to the children of the swapped node. All the properties of the nodes that follow have to be changed.

Building the crossover and mutation functions has been the heart of the implementation, as it involves not only simple swapping of nodes but also swapping all the properties of the nodes, such as their parents, their children, etc. Also, crossover and mutation are responsible for consuming most of the time the algorithm runs. Further, the proposed algorithm works well with both numeric and text data.
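The node-swapping mechanics described above can be illustrated with a minimal sketch, assuming a simple linked-node tree representation (this is not the MATLAB data structure used in the implementation). The key point is that after grafting a subtree, the parent link of the grafted node and the child list of the node above the crossover point must be repaired.

import copy
import random

class Node:
    # Minimal decision-tree node used only to illustrate subtree swapping.
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []
        self.parent = None
        for child in self.children:
            child.parent = self

    def nodes(self):
        yield self
        for child in self.children:
            yield from child.nodes()

def subtree_crossover(tree1, tree2):
    # Swap a randomly chosen subtree of tree2 into tree1 (assumes tree1 has at
    # least one non-root node), repairing the parent/child links that follow.
    offspring = copy.deepcopy(tree1)
    point1 = random.choice(list(offspring.nodes())[1:])    # crossover point, root excluded
    graft = copy.deepcopy(random.choice(list(tree2.nodes())))
    slot = point1.parent.children.index(point1)            # where the pruned node was
    point1.parent.children[slot] = graft                   # substitute the new subtree
    graft.parent = point1.parent                           # update the grafted node's properties
    return offspring

Mutation follows the same mechanics, with the grafted subtree taken from a randomly created new tree rather than from a second parent.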

    3.2.4 Output Classification Tree

An example of the final classification tree developed on the Robot Navigation dataset is shown in Figure 7. The view function of MATLAB is used to display the tree. The final tree is comprehensible enough to analyze on the given dataset.


In the experiments, 66% of each of these datasets has been used for training the classifier and the rest for testing it. The next section describes in detail the results of experiments on five datasets, each representing a multi-class problem. The algorithm has also been tested on many other datasets representing multi-class as well as binary classification, and the test results are shown in Table 4. Although the algorithm has been designed especially for multi-class classification problems, it works equally well on problems with binary classes. Hence, experiments have also been performed on binary classification problems in order to show the generalized behavior of the proposed algorithm on classification problems with any number of classes.

    4.2 Experimental Results and Findings

Some initial runs were performed to compare settings for the tournament size, the trade-off factor and the validation factors. The fitness, accuracy and number of nodes of individuals for different values of the trade-off factor are shown in Table 3. The results show that 0.05 is the optimal value of the trade-off factor on the above datasets. Hence the algorithm uses a trade-off factor of 0.05 throughout the experiments.

Table 3: Optimal value of the trade-off factor

Dataset | Trade-off factor | Fitness | Accuracy | No. of Nodes
Cardiotocography | 0.01 | 92.31 | 96.56 | 123
Cardiotocography | 0.05 | 75.24 | 94.07 | 87
Cardiotocography | 0.1 | 61.35 | 88.5 | 39
Robot Navigation | 0.01 | 95.34 | 99.45 | 68
Robot Navigation | 0.05 | 82.86 | 99.25 | 37
Robot Navigation | 0.1 | 56.69 | 76.86 | 21
Nursery | 0.01 | 78.86 | 83.24 | 223
Nursery | 0.05 | 62.89 | 79.65 | 113
Nursery | 0.1 | 50.48 | 72.44 | 37

The heart of the algorithm is the crossover and mutation operations, which lead to an increase in accuracy from a small value to significantly higher values. The average results of how the crossover and mutation operations perform on all the stated datasets are shown in Figures 8(a) and 8(b) respectively.

The results show that, on average, mutation raises the accuracy of weak individuals to a higher value, whereas crossover increases their accuracy to the highest values.

Figure 9 shows how the accuracy of the individuals increases with generations. The results show that after a small number of generations, the termination criterion is reached and the algorithm terminates. In such cases, the graph becomes a straight line, showing that the accuracy is no longer increasing.


Figure 8(a): Effect of Mutation

Figure 8(b): Effect of Crossover

For the Nursery dataset, the fitness becomes stable after 17 generations, whereas on Robot Navigation it takes 13 generations for the fitness to stabilize. On the other hand, MotifClass runs for 19 generations on the Cardiotocography (CTG) dataset before terminating.

Figure 9: Increase in Fitness of Individuals with Generations

Figure 10: Accuracies of GP, CART and MotifClass on Training instances

Figures 10 and 11 respectively show the results of accuracy (in %) on training and test instances, whereas Figure 12 shows the comprehensibility (number of nodes) of the classifier on the above-stated datasets for the three algorithms: CART, GP for classification and MotifClass. With an increase in the number of classes and the size of the dataset, CART (i.e. the decision tree) performs well in terms of accuracy, but the comprehensibility of the classifier is lower, whereas GP produces classification trees with fewer nodes and hence is highly comprehensible. This can also be concluded from the results above. Hence the goal of our proposed approach was to perform a trade-off between accuracy and number of nodes, and the results shown in Figures 11 and 12 verify that we have been successful in performing the trade-off. Not only a trade-off: in most cases, the accuracy has also increased in comparison to CART.

On training instances, the classifier performs as well as CART in terms of accuracy, whereas the number of nodes has decreased significantly. On the test instances, MotifClass outperforms GP and CART in most of the cases.


For example, on the CTG and Nursery datasets, the accuracy obtained by applying MotifClass is remarkably higher than that of CART. Moreover, as MotifClass is the result of merging GP with CART, the number of nodes has also been reduced compared to CART. On the other hand, the accuracy of MotifClass on the Robot Navigation dataset is almost equivalent to CART, while the number of nodes in the classification tree is convincingly lower than for CART, resulting in a successful trade-off.

The only database on which MotifClass has a notably lower accuracy is the Forest cover-type database. But the number of nodes has decreased by approximately three times, and hence the reduction in accuracy is quite acceptable.

Further, the largest dataset used, the Poker-hand dataset, is classified best by MotifClass as compared to CART and GP. Not only is the accuracy of MotifClass fairly high, but the number of nodes has also been decreased very significantly (almost 6 times). This shows that MotifClass performs at its best when mining interpretable results from very large datasets.

    Figure 11: Accuracies of GP, CART and MotifClass on Test instances

    Figure 12: Number of nodes in trees generated by GP, CART and MotifClass

    4.3 Comments on Classification Time

As far as time complexity is concerned, it is quite obvious that Genetic Programming takes more time than CART, as it continues for a number of generations. MotifClass, being a combination of CART and GP, clearly takes more time in constructing the classifier.


However, the construction of the classifier is a one-time cost; once it is generated, what matters is the amount of time taken to classify new instances using that classifier.

As the results show, the tree size of CART is considerably large, and hence it takes considerably more time to classify the data. MotifClass, on the other hand, may take time in building the classifier, but once trained, new instances can be classified quickly, as the size of the tree is small (at least smaller than the tree developed by CART or any other decision tree algorithm).

Undoubtedly, pruning [39] the trees developed by a decision tree algorithm reduces their size, but only by compromising accuracy to a greater extent. MotifClass, while performing the trade-off, ensures that accuracy remains convincingly high.

MotifClass has been verified on several other datasets representing binary as well as multiclass problems. The results on some of them are shown in Table 4. As the results of CART and MotifClass are more significant, only they are compared. Further, the accuracy tabulated is for the test instances, as the actual performance of the classifier is measured by how efficiently it classifies unknown instances.

Table 4: Experimental results on different datasets

Dataset | No. of Attributes | No. of Instances | No. of Classes | Optimization Parameter | CART | MotifClass
Iris | 5 | 150 | 3 | % Accuracy | 96.07 | 98
Iris | | | | No. of Nodes | 9 | 7
Ecoli | 8 | 336 | 8 | % Accuracy | 74.34 | 81.42
Ecoli | | | | No. of Nodes | 31 | 23
Diabetes | 9 | 786 | 2 | % Accuracy | 74.23 | 81.54
Diabetes | | | | No. of Nodes | 103 | 27
Vehicle | 18 | 946 | 4 | % Accuracy | 66.43 | 69.58
Vehicle | | | | No. of Nodes | 212 | 99
CMC | 9 | 1473 | 3 | % Accuracy | 48.8 | 49.6
CMC | | | | No. of Nodes | 249 | 186
Segment | 20 | 1500 | 7 | % Accuracy | 96.17 | 96.66
Segment | | | | No. of Nodes | 77 | 53
Car Evaluation | 7 | 1728 | 4 | % Accuracy | 95.22 | 94.2
Car Evaluation | | | | No. of Nodes | 65 | 39
Spambase | 58 | 4601 | 2 | % Accuracy | 80.36 | 91.59
Spambase | | | | No. of Nodes | 239 | 133

    The previous results and the results from this table show that in almost all the cases, MotifClass

    outperforms CART in terms of accuracy and comprehensibility.

Further, the results presented are the average of experiments conducted three times using the same algorithm parameters. The reason for conducting the same experiment thrice is that GP, and hence the algorithm MotifClass, has its roots in randomization. The parts of the training data selected for generating the initial population, the selection of trees for crossover, mutation, etc., are all done randomly. Hence, to assure the consistency of the output, the average of three experimental results is presented.


    5. CONCLUSIONS

The work proposes a new algorithm, MotifClass, developed with the aim of performing a trade-off between accuracy and comprehensibility of the classifier. The work focuses on large datasets and multi-class classification. The implementation results show that the proposed algorithm outperforms the traditional decision tree classifier CART, as well as GP. Not only has a trade-off been performed, but the accuracy obtained by the MotifClass classifier is higher than that of the decision tree on most of the datasets. The number of nodes has also decreased noticeably, giving us a comprehensible classifier.

Using CART to produce the initial population has played a major role in increasing the accuracy, as we have been getting good trees from the very beginning. Secondly, the proposed fitness measure succeeds in placing more emphasis on trees with high accuracy as well as fewer nodes, which in turn increases the efficiency of the algorithm.

Further, the algorithm uses a variable-size population by performing an extra step over simple GP, in which it validates some of the individuals and rejects a few. This extra step saves the time required to perform operations on individuals that can be rejected at an early stage. Also, it saves a lot of the computation involved in calculating the fitness of a few so-called "useless" trees. Clearly, with the decrease in population size, the overall running time of the algorithm also decreases. Lastly, using Reverse Rank Selection to select the trees that undergo mutation results in a significant increase in the accuracy of the weak individuals.

In summary, the proposed algorithm has been successful in producing an optimal classifier, that is, a classifier with high accuracy and a smaller number of nodes.

    REFERENCES

    [1] P. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2005.

[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Mateo: The Morgan Kaufmann Series in Data Management Systems, 2 ed., 2006.

[3] W. R. Smart, Genetic Programming for Multiclass Object Classification. PhD thesis, School of Mathematical, Statistical and Computing Sciences, Victoria University of Wellington, 2005.

    [4] T. Mitchell, Machine Learning. New York: McGraw-Hill, 2 ed., 1997.

    [5] M. Vijaya, K. Jamuna, and S. Karpagavalli, "Password strength prediction using supervised machine

    learning techniques," in International Conference on Advances in Computing, Control, and

    Telecommunication Technologies, pp. 401-405, 2009.

    [6] L. Rokach and O. Maimon, "Top-down induction of decision trees classifiers-a survey," IEEE transactions

    on Systems, Man, and Cybernetics, Part C, November 2005.

    [7] R. Kotecha, V. Ukani, and S. Garg, "An empirical analysis of multiclass classification techniques in data

    mining," in Nirma University International Conference on Engineering (NUiCONE), (Ahmedabad),

    December 2011.

    [8] M. Aly, "Survey on multiclass classification methods," Citeseer, Nov 2005.

    [9] I. Rish, "An empirical study of the naive bayes classifier," in IJCAI Workshop on Empirical Methods in

    Artificial Intelligence, (Italy), pp. 41-46, 2011.

[10] A. alias Balamurugan, R. Rajaram, S. Pramala, S. Rajalakshmi, C. Jeyendran, and J. D. S. Prakash, "NB+: An improved naive Bayesian algorithm," Knowledge-Based Systems, vol. 24, pp. 563-569, October 2011.

    [11] T. Phyu, "Survey of classification techniques in data mining," Proceedings of the International

    MultiConference of Engineers and Computer Scientists, 2009.

    [12] N. Suguna and K. Thanushkodi, "An improved k-nearest neighbor classification using genetic algorithm,"

    International Journal of Computer Science Issues, vol. 7, July 2010.

    [13] P. G. Espejo, S. Ventura, and F. Herrera, "A survey on the application of genetic programming to

    classification," IEEE transactions on systems, man, and cybernetics-part C: applications and Reviews, vol.

    40, pp. 121-144, March 2010.


    [14] J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT

    Press, 1992.

    [15] T. Loveard, Genetic Programming for Classification Learning Problems. PhD thesis, Royal Melbourne

    Institute of Technology, 2003.

[16] R. Poli, W. B. Langdon, and N. F. McPhee, A Field Guide to Genetic Programming. Lulu.com, 2008.

[17] R. Feldt and P. Nordin, "Using factorial experiments to evaluate the effect of genetic programming parameters," in Proceedings of the European Conference on Genetic Programming, vol. 1802 of LNCS, pp. 271-282, 2000.

    [18] H. Jabeen and A. R. Baig, "Review of classification using genetic programming," in International Journal

    of Engineering Science and Technology, vol. 2, pp. 94-103, 2010.

    [19] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone, Genetic Programming: An Introduction; On the

    Automatic Evolution of Computer Programs and its Applications. San Mateo, CA/Heidelberg, (Germany):

    Morgan Kaufmann, 1998.

    [20] S. Oka and Q. Zhao, "Design of decision trees through integration of c4.5 and gp," in Proceedings of 4th

    Japan-Australia Joint Workshop Intelligent and Evolutionary Systems, pp. 128-135, 2000.

    [21] T. Enodu and Q. Zhao, "Generation of comprehensible decision trees through evolution of training data,"

    Proceedings of Congress on Evolutionary Computation, vol. 2, pp. 1221-1225, 2002.

    [22] M. H. Saraee and R. S. Sadjady, "Optimizing classification techniques using genetic programming

    approach," in Proceedings of the 12th IEEE International Multitopic Conference, (Advanced Databases,

    Data Mining and Bioinformatics Research Laboratory), pp. 345-348, December 2008.

[23] P. Kumar and Saroj, "Classification models: Non evolutionary vs evolutionary approach," in International Conference on Advances in Computing, Control, and Telecommunication Technologies, pp. 341-343, 2009.

    [24] N. S. Chaudhari, A. Purohit, and A. Tiwari, "A multiclass classifier using genetic programming," in 10th

    International Conference on Control, Automation, Robotics and Vision, (Hanoi, Vietnam), pp. 1884-1887,

    December 2008.

    [25] H. Zhao, "A multi-objective genetic programming approach to developing pareto optimal decision trees,"

    Decision Support Systems, vol. 43, pp. 809-826, April 2007.

    [26] M. Riekert, K. M. Malan, and A. Engelbrect, "Adaptive genetic programming for dynamic classification

    problems," Congress on Evolutionary Computation, pp. 674-681, 2009.

    [27] R. A. Davis, A. J. Charlton, S. Oehlschlager, and J. C. Wilson, "Novel feature selection method for

    genetic programming using metabolomic h nmr data," Chemometrics Intelligent Laboratory Systems, vol.

    81, pp. 50-59, March 2006.

    [28] V. Slavov and N. I. Nokolaev, "Fitness landscapes and inductive genetic programming," in Proceedings

    of 3rd International Conference Artificial Neural Networks and Genetic Algorithms, (Berlin, Germany), pp.

    414-418, Springer-Verlag, 1997.

[29] S. Rouwhorst and A. Engelbrecht, "Searching the forest: Using decision trees as building blocks for evolutionary search in classification databases," in Proceedings of the Congress on Evolutionary Computation, pp. 633-638, 2000.

    [30] M. C. J. Bot and W. B. Langdon, "Application of genetic programming to induction of linear

    classification trees," in Proceedings of the Third European Conference on Genetic Programming, vol.

    1802, (Berlin, Germany), pp. 247-258, Springer-Verlag, April 2000. Lecture Notes in Computer Science

    Series.

    [32] C. S. Kuo, T. P. Hong, and C. L. Chen, "Applying genetic programming technique in classification trees,"

    Soft Computing - A Fusion of Foundations, Methodologies and Applications, vol. 11, pp. 1165-1172,

    October 2007.

    Authors

Prof. Hiteshkumar Nimbark is working as Associate Professor and Head at B H Gardi College of Engineering and Technology, Rajkot, Gujarat, India. He has more than 12 years of experience in academia after completing his M.Tech degree at Nirma University, India. His areas of interest are Artificial Intelligence, Computer Networks and Operating Systems. He is pursuing a PhD in the field of Artificial Intelligence. He has published 9 international papers, of which 2 are in IEEE Xplore and 3 are available in Springer.