Efficient Mining of Frequent Item Set using Recursive Algorithm


  • International Journal of Advanced Engineering Research and Technology (IJAERT) Volume 2 Issue 2, May 2014, ISSN No.: 2348 8190

    61

    www.ijaert.org

    Efficient Mining of Frequent Item Set using Recursive

    Algorithm

    Bhumika H. Patel

    Department of Computer Science and Engineering, PIET, Limda

    Gujarat Technological University

    Vadodara, India

    Abstract - Frequent pattern (or item set) mining is the extraction of interesting collections of items from a dataset. Frequent item sets are used to obtain the collections of items that meet a user's requirements. Researchers have proposed various algorithms, such as Apriori, Eclat, RElim and SaM. When one item is selected as the prefix for mining frequent item sets, the order in which items are considered becomes a problem, and this affects performance. The RElim algorithm was introduced for frequent item set mining. In this paper, two approaches to this ordering problem are considered. The results of the two approaches are compared with the execution of RElim.

    Index Terms - Data mining, Frequent item set mining, RElim, Support.

    I. INTRODUCTION

    In today's business world, data mining is gaining popularity because of its contribution to organizational profits. The core idea of data mining is to extract useful and previously unknown information or patterns from large datasets. Data mining is currently used in a wide range of profiling practices, such as scientific discovery, marketing, fraud detection and surveillance [11]. Frequent item set mining finds the item sets that occur frequently, and together, in a set of transactions. Various algorithms such as Apriori [1,2], FP-Growth [4], Eclat [5], RElim [6] and SaM [7] have been proposed since Agrawal et al. first introduced the problem of deriving categorical association rules from transactional databases [2]. Frequent item set mining is widely studied in data mining because of its broad applications in mining association rules, correlations, graph patterns, constraint-based frequent patterns and many other data mining tasks.

    Let I = {i1, ..., in} be a set of n distinct items and let DB be a database consisting of m transactions {t1, ..., tm} such that each transaction ti is a subset of I. An itemset or pattern x is a subset of I; if |x| = k, it is called a k-itemset. The support count of x, Sup(x), is the number of transactions in DB that contain x. If Sup(x) is no less than a user-specified threshold, called Minsup, x is called a frequent pattern. The aim of frequent pattern mining is to find all patterns in a given database DB that satisfy Minsup. As the minimum threshold decreases, the number of resulting frequent patterns grows. Eliminating infrequent patterns effectively during mining is therefore one of the main issues in frequent pattern mining. Our work addresses one aspect of this issue: how to select the less frequent item for mining frequent item sets when two or more items have the same frequency.
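    The definitions of Sup(x) and Minsup can be illustrated with a minimal brute-force sketch. The function names and the toy database are assumptions made for illustration; this is not the mining procedure studied in the paper, only a direct reading of the definitions.

```python
from itertools import combinations

def support(itemset, db):
    """Sup(x): number of transactions in db that contain every item of x."""
    x = set(itemset)
    return sum(1 for t in db if x <= set(t))

def frequent_patterns(db, minsup):
    """Enumerate every itemset x with Sup(x) >= minsup by brute force."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for x in combinations(items, k):
            s = support(x, db)
            if s >= minsup:  # "no less than" the threshold
                result[frozenset(x)] = s
    return result

# Hypothetical toy database (not the one shown in Figure 1).
DB = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
patterns_2 = frequent_patterns(DB, minsup=2)
patterns_3 = frequent_patterns(DB, minsup=3)
```

    On this toy database, lowering Minsup from 3 to 2 enlarges the result from two patterns to five, which is the growth effect described above.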

    The rest of this paper is organized as follows. Section 2 describes related work on frequent item set mining. Section 3 presents the limitation of an existing algorithm. Section 4 describes the proposed work, Section 5 compares the algorithms, and Section 6 concludes.

    II. RELATED WORK

    In this section, we describe a few existing frequent item set mining algorithms, namely: (i) Can-tree, (ii) CP-tree, (iii) RElim.

    A. Can-tree:

    In [10] a tree structure called Can-tree is proposed. The Can-tree algorithm requires only a single scan of the database. Items are arranged according to a canonical order (e.g. alphabetical) chosen by the user. A change in item frequencies therefore does not affect the order of items in the Can-tree, and new transactions can be inserted into the tree without swapping any tree nodes.
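    The effect of a canonical order can be sketched with a plain prefix tree. This is a simplified illustration, not the full structure of [10]: the class and function names are hypothetical, and alphabetical order is assumed as the canonical order.

```python
class Node:
    """One prefix-tree node: a count and the children keyed by item."""
    def __init__(self):
        self.count = 0
        self.children = {}

def insert(root, transaction):
    """Insert a transaction with its items in canonical (alphabetical) order."""
    node = root
    for item in sorted(transaction):
        node = node.children.setdefault(item, Node())
        node.count += 1

# Hypothetical transactions, inserted one by one in a single pass.
root = Node()
for t in [{"b", "a"}, {"a", "c"}, {"c", "b", "a"}]:
    insert(root, t)
```

    Because the item order never depends on the counts, each insertion only extends or increments an existing path; no node has to be moved when frequencies change.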

    B. CP-tree:

    In [9] a dynamic tree structure called CP-tree is proposed. All transactions are inserted in accordance with a predefined item order, which is maintained in a list called the I-list. After some of the transactions have been inserted, if the item order of the I-list differs from the current frequency-descending item order by a predefined degree, the CP-tree is restructured by a method called branch sorting. The item order in the I-list is then updated to the current order.

    C. RElim:

    In [6] the RElim algorithm is proposed, which uses an array-list structure to find frequent item sets. Figure 1 shows the preprocessing steps that RElim requires. Step 1 shows the original database. In step 2, the database is scanned to determine the frequency of each item. In step 3, the items in each transaction are sorted in ascending order of frequency. In step 4, the transactions are sorted lexicographically with respect to their items.
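    Steps 2 to 4 can be sketched as follows. The function name and the toy database are hypothetical, and two details are assumptions not fixed by the text: ties between equally frequent items are broken alphabetically, and the transactions are sorted in ascending lexicographic order.

```python
from collections import Counter

def preprocess(db):
    """Steps 2-4: count items, sort items per transaction, sort transactions."""
    # Step 2: frequency of each item (an item counts once per transaction).
    freq = Counter(item for t in db for item in set(t))
    # Rank items by ascending frequency; the alphabetical tie-break is an assumption.
    order = sorted(freq, key=lambda item: (freq[item], item))
    rank = {item: r for r, item in enumerate(order)}
    # Step 3: sort the items of each transaction by ascending frequency.
    by_item = [sorted(set(t), key=rank.__getitem__) for t in db]
    # Step 4: sort the transactions lexicographically under the same item ranking.
    by_transaction = sorted(by_item, key=lambda t: [rank[i] for i in t])
    return freq, order, by_transaction

# Hypothetical toy database (not the one shown in Figure 1).
DB = [["a", "d"], ["b", "c", "d"], ["a", "b", "d"], ["c", "d", "e"]]
freq, order, sorted_db = preprocess(DB)
```

    After step 3 every transaction starts with its least frequent item, which is the item RElim will use to index it.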


    Fig. 1: (1) Database in original form, (2) item frequencies, (3) transactions with sorted items, (4) lexicographically sorted transactions

    In step 5, the data structure used by RElim is created. It is an array of lists, one list per item, arranged in descending order of item frequency. Each array element holds a counter, which gives the number of transactions that start with the corresponding leading item, and a pointer to the head of its list. Each list element holds a successor pointer and a pointer to the transaction.
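    A minimal sketch of this structure is given below. It is an illustration under stated assumptions: Python lists stand in for the linked lists and pointers, the names `ItemList` and `build_list_array` are hypothetical, and the input is a hypothetical database that has already been preprocessed as in steps 2 to 4.

```python
from dataclasses import dataclass, field

@dataclass
class ItemList:
    item: str
    count: int = 0                                # transactions starting with `item`
    suffixes: list = field(default_factory=list)  # each transaction minus its leading item

def build_list_array(sorted_db, item_order):
    """Build one list per item; item_order runs from most to least frequent."""
    lists = {item: ItemList(item) for item in item_order}
    for t in sorted_db:
        if t:
            lead, rest = t[0], tuple(t[1:])
            lists[lead].count += 1
            lists[lead].suffixes.append(rest)
    return [lists[item] for item in item_order]

# Hypothetical preprocessed transactions (least frequent item first).
sorted_db = [["e", "c", "d"], ["a", "b", "d"], ["a", "d"], ["b", "c", "d"]]
array = build_list_array(sorted_db, ["d", "c", "b", "a", "e"])
```

    The rightmost element of `array` belongs to the least frequent item, so it is the first list to be processed.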

    Fig. 2: (5) Data structure used by RElim

    The basic operations of the RElim algorithm are illustrated in Figure 3. RElim starts by eliminating the least frequent item from the list array; the corresponding list elements are transferred to the conditional database for that item. The item to be processed is the one associated with the last (rightmost) list (in the example this is item e).

    Fig. 3: Basic operations of RElim

    If the counter associated with the list, which states the support of the item, is no less than the minimum support, the item set consisting of this item and the prefix of the conditional database is reported as frequent. In addition, the list is traversed and its elements are copied to construct a new list array, which represents the conditional database of transactions containing the item. In this operation the leading item of each transaction (suffix) is used as an index into the list array to find the list it has to be added to, and the leading item is then removed (see Figure 3 on the right). The resulting conditional database is processed recursively to find all frequent item sets containing the list item.
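    The recursive elimination described above can be sketched compactly. This is a simplified reading of the description, not the implementation of [6]: Python lists replace the linked lists, identical suffixes are not merged, ties are broken alphabetically, and the names `relim` and `mine` are hypothetical.

```python
from collections import Counter

def relim(lists, order, prefix, minsup, out):
    """lists[item] holds the suffixes of all transactions that start with item."""
    for idx, item in enumerate(order):           # least frequent item first
        suffixes = lists[item]
        rest = order[idx + 1:]
        if len(suffixes) >= minsup:              # counter = support of prefix + item
            out[prefix + (item,)] = len(suffixes)
            # Conditional database: index each suffix by its leading item.
            cond = {i: [] for i in rest}
            for s in suffixes:
                if s:
                    cond[s[0]].append(s[1:])
            relim(cond, rest, prefix + (item,), minsup, out)
        # Eliminate the item: reassign its suffixes to the remaining lists.
        for s in suffixes:
            if s:
                lists[s[0]].append(s[1:])
        lists[item] = []

def mine(db, minsup):
    freq = Counter(i for t in db for i in set(t))
    order = sorted(freq, key=lambda i: (freq[i], i))   # tie-break is an assumption
    rank = {i: r for r, i in enumerate(order)}
    lists = {i: [] for i in order}
    for t in db:
        s = tuple(sorted(set(t), key=rank.__getitem__))
        if s:
            lists[s[0]].append(s[1:])
    out = {}
    relim(lists, order, (), minsup, out)
    return {frozenset(k): v for k, v in out.items()}

# Hypothetical toy database (not the one shown in the figures).
result = mine([{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}], minsup=2)
```

    When an item is reached, all earlier items have already been stripped from the transactions that contained them, so the length of its list equals its support in the current (conditional) database.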

    III. LIMITATION OF EXISTING ALGORITHM

    The limitation of the RElim algorithm is that its performance degrades when the dataset has a large number of attributes. With more attributes, each transaction also contains more items, so it becomes difficult to select the prefix among items with the same frequency.

    IV. PROPOSED WORK

    The frequent item set mining problem can be solved using many approaches, one of which is the RElim algorithm. As discussed in the previous section, this algorithm has a limitation. To overcome it, we propose two different approaches. Because RElim uses an array-list data structure, it belongs to the array-based FI mining algorithms, which take less running time than tree-based algorithms. RElim is based on three processing steps: deleting items, recursive processing, and reassigning transactions. Here we consider the item-deletion step, where there is scope for improvement in terms of time. All the preprocessing steps are therefore the same as in RElim; the difference lies in the order in which an item is chosen for pruning when items have the same frequency.

    The proposed method uses two approaches for choosing an item for elimination when items have the same frequency: (i) alphabetical order, and (ii) the order of occurrence of the items in the database.
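    The two tie-breaking rules can be sketched as follows. This is one possible reading of the approaches, stated as an assumption: an item-order list is built either alphabetically or by first occurrence, and among the least frequent items the one that comes first in that list is chosen as the prefix. The function names and the database are hypothetical; the database is constructed only so that items a and e tie, and it is not the database of Figure 1.

```python
from collections import Counter

def item_order_list(db, method):
    """Tie-break list: alphabetical order or order of first occurrence."""
    seen = list(dict.fromkeys(item for t in db for item in t))
    if method == "alphabetical":
        return sorted(seen)
    if method == "occurrence":
        return seen
    raise ValueError(f"unknown method: {method}")

def choose_prefix(db, method):
    """Pick the least frequent item; ties go to the earlier item in the list."""
    freq = Counter(item for t in db for item in set(t))
    pos = {item: p for p, item in enumerate(item_order_list(db, method))}
    return min(freq, key=lambda i: (freq[i], pos[i]))

# Hypothetical database in which a and e both occur twice.
DB = [["e", "d"], ["b", "d"], ["a", "b"],
      ["a", "c", "d"], ["e", "c", "b"], ["c", "b", "d"]]
prefix_alpha = choose_prefix(DB, "alphabetical")
prefix_occ = choose_prefix(DB, "occurrence")
```

    With this data the alphabetical rule selects a and the order-of-occurrence rule selects e, so the two rules lead to different conditional databases from the first step onward.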

    Suppose the database, the preprocessing steps 1 to 4 shown in Figure 1 and the data structure shown in Figure 2 are unchanged. To perform step 6, we must consult the list to select an item as the prefix for pruning. In Figure 2, item e and item a have the same frequency. It is therefore unclear whether to select item a or item e as the prefix, yet all the recursive processing and reassignment of transactions depends heavily on this choice.

    With the first approach, alphabetical order, the item-order list is {a,b,c,d,e}, as shown in Figure 3, and item a is selected as the prefix. Its array elements are transferred to the conditional database, and the leading item of each transaction is used as an index into the list array to find the list it has to be added to (in our case only c). The support count of that list is incremented by the number of transactions added to it (in our case c is incremented by 2).

    Fig. 3: operations of Modified RElim using Method-I

    With the second approach, order of occurrence, the item-order list is {e,d,b,a,c}, as shown in Figure 4, and item e is selected as the prefix. Its array elements are transferred to the conditional database. The leading items b and c are used as indices into the list array to find the lists the transactions have to be added to, and the support counts of both lists are incremented by 1, as one transaction bd is added to the list of b and one transaction cbd is added to the list of c.

    Fig. 4: operations of Modified RElim using Method-II

    V. COMPARING ALGORITHMS

    Tree-based algorithms require more time to find frequent itemsets, while RElim requires less execution time because it uses only three steps: deleting items, recursive processing, and reassigning transactions. With the modified approach, the algorithm is expected to take less execution time, and each item receives its due importance when frequent itemsets are generated.

    VI. CONCLUSION

    This paper gives a brief introduction to the algorithms used in the area of frequent item set mining. The RElim algorithm is based on an array-list structure and is easy to implement. Modified RElim extends the existing RElim by maintaining an item list for items with the same frequency. Building on the comparison of such items, as future work I am going to analyse the behavior of various interestingness measures in mining frequent itemsets.

    ACKNOWLEDGMENT

    My most sincere thanks go to my advisor, Asst. Prof. Neha Pandya. I thank her for giving me the opportunity to work in the area of FI mining, and for her guidance, encouragement and support during the initial development of this project. I would also like to thank her for the time she spared for me from her extremely busy schedule.

    REFERENCES

    [1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB'94, pp. 487-499.

    [2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data, Washington, D.C., May 1993, pp. 207-216.

    [3] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Mining and Knowledge Discovery, 2003.

    [4] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In: Proc. Conf. on the Management of Data (SIGMOD'00, Dallas, TX). ACM Press, New York, NY, USA 2000.

    [5] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD'97, Newport Beach, CA), 283-296. AAAI Press, Menlo Park, CA, USA 1997.

    [6] C. Borgelt. Keeping things simple: finding frequent item sets by recursive elimination. Proc. Workshop Open Software for Data Mining (OSDM'05 at KDD'05, Chicago, IL), 66-70. ACM Press, New York, NY, USA 2005.

    [7] C. Borgelt. Simple algorithms for frequent item set mining. Springer-Verlag, Berlin, Germany 2010.

    [8] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.


    [9] S.K. Tanbeer, C.F. Ahmed, B.-S. Jeong, and Y.-K. Lee. Efficient single-pass frequent pattern mining using a prefix-tree. Information Sciences 179 (2009) 559-583.

    [10] C.K.-S. Leung, Q.I. Khan, Z. Li, and T. Hoque. CanTree: A canonical-order tree for incremental frequent-pattern mining. KAIS, 11(3), pp. 287-311, Apr. 2007.

    [11] R. Somkumar. A study on various data mining approaches of association rules. Int. J. Comput. Sci. Eng. Vol. 2, pp. 141-144.

    [12] C.L. Blake and C.J. Merz. UCI Repository of Machine Learning

    Databases. Dept. of Information and Computer Science,

    University of California at Irvine, CA, USA 1998.

    http://www.ics.uci.edu/mlearn/MLRepository.html

    [13] R. Kohavi, C.E. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 Organizers' Report: Peeling the Onion. SIGKDD Explorations 2(2):86-93. ACM Press, New York, NY, USA 2000.

    [14] Synthetic Data Generation Code for Associations and Sequential

    Patterns. Intelligent Information Systems, IBM Almaden

    Research Center.

    http://www.almaden.ibm.com/software/quest/Resources/index.shtml