Efficient Mining of Frequent Item Set using Recursive Algorithm

International Journal of Advanced Engineering Research and Technology (IJAERT), Volume 2, Issue 2, May 2014, ISSN No.: 2348-8190


www.ijaert.org


Bhumika H. Patel

Department of Computer Science and Engineering, PIET, Limda

Gujarat Technological University

Vadodara, India

Abstract - Frequent pattern (or item set) mining is the extraction of
interesting collections of items from a dataset. Frequent item sets are
used to obtain collections of items that match a user's requirements.
Researchers have proposed various algorithms such as Apriori, Eclat, RElim
and SaM. When one item is selected as the prefix for mining frequent item
sets, the order in which items are considered becomes a problem, and this
problem affects performance. RElim is one of the algorithms introduced for
frequent item set mining. In this paper, two approaches are considered for
solving this ordering problem, and their results are compared with the
execution of RElim.

Index Terms - Data mining, frequent item set mining, RElim, support.

I. INTRODUCTION

In today's business world, data mining is gaining popularity because of
its contribution to organizational profits. The core idea of data mining
is to extract useful and previously unknown information or patterns from
large datasets. Data mining is currently used in a wide range of profiling
practices, such as scientific discovery, marketing, fraud detection and
surveillance [11]. Frequent item set mining works on the principle of
finding the item sets that occur frequently, and together, in the
transaction set. Various algorithms such as Apriori [1,2], FP-Growth [4],
Eclat [5], RElim [6] and SaM [7] have been proposed since Agrawal et al.
first introduced the problem of deriving categorical association rules
from transactional databases [2]. Frequent itemset mining is widely
studied in data mining because of its broad applications in mining
association rules, correlations, graph patterns, constraint-based frequent
patterns and many other data mining tasks.

Let I = {i1, ..., in} be a set of n distinct items and let DB be a
database consisting of m transactions {t1, ..., tm} such that each
transaction ti is a subset of I. An itemset or pattern x is a subset of I;
if |x| = k, it is called a k-itemset. One of the properties of x is its
support count, Sup(x), which is the number of transactions in DB that
contain the itemset x. If Sup(x) is no less than a user-specified
threshold, called Minsup, x is called a frequent pattern. The aim of
frequent pattern mining is to find all patterns in a given database DB
that satisfy Minsup. As the minimum support threshold decreases, the
number of frequent patterns increases. Eliminating infrequent patterns
effectively during mining is therefore one of the main issues in frequent
pattern mining. Our work addresses one aspect of this issue: how to select
the less frequent item for mining frequent item sets when two or more
items have the same frequency.
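The definitions of Sup(x) and Minsup above can be illustrated with a minimal Python sketch. It is a brute-force enumeration meant only to make the definitions concrete, not one of the algorithms discussed in this paper. The function names `support` and `frequent_itemsets` and the toy database `db` are assumptions introduced for illustration.

```python
from itertools import combinations

def support(itemset, db):
    """Sup(x): number of transactions in db that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db if items <= set(t))

def frequent_itemsets(db, minsup):
    """Brute-force search for all itemsets x with Sup(x) >= minsup."""
    universe = sorted(set().union(*map(set, db)))
    result = {}
    for k in range(1, len(universe) + 1):
        found = False
        for cand in combinations(universe, k):
            s = support(cand, db)
            if s >= minsup:
                result[cand] = s
                found = True
        if not found:
            # No frequent k-itemset exists, so no larger one can exist either.
            break
    return result

# Hypothetical toy database with four transactions.
db = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
frequent = frequent_itemsets(db, minsup=2)
```

Lowering `minsup` in this sketch can only add entries to the result, which matches the observation that a smaller threshold yields more frequent patterns.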

The rest of this paper is organized as follows. Section 2 describes
related work on frequent item set mining. Section 3 presents the
limitation of the existing algorithm. Section 4 describes the proposed
work, Section 5 compares the algorithms, and the conclusion is given in
Section 6.

II. RELATED WORK

In this section, we describe a few existing frequent item set mining
algorithms, namely: (i) Can-tree, (ii) CP-tree, (iii) RElim.

A. Can-tree:

In [10], a tree structure called Can-tree is proposed. The Can-tree
algorithm requires only a single scan of the database. In this algorithm,
items are ordered according to a canonical standard (e.g. alphabetical)
chosen by the user. Consequently, a change in item frequencies does not
affect the order of items in the Can-tree, and new transactions are
inserted into the tree without swapping any tree nodes.
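The insertion behaviour described for the Can-tree can be sketched as a simple prefix tree in Python. This is a simplified illustration under the assumption that the canonical order is plain alphabetical order; the names `Node` and `insert` are hypothetical and do not come from [10].

```python
class Node:
    """One prefix-tree node: an item, a count, and children keyed by item."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def insert(root, transaction, key=None):
    """Insert a transaction along the path given by the canonical item order.

    The order depends only on the items themselves (alphabetical by default),
    never on their frequencies, so existing nodes are never swapped.
    """
    node = root
    for item in sorted(set(transaction), key=key):
        node = node.children.setdefault(item, Node(item))
        node.count += 1

# Hypothetical transactions, inserted in a single pass.
root = Node()
for t in [{"c", "a"}, {"a", "b"}, {"b"}]:
    insert(root, t)
```

Because the path of a transaction is fixed by the canonical order, inserting further transactions later only increments counts or appends new nodes.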

B. CP-tree:

In [9], a new dynamic tree structure called CP-tree is proposed. This
structure allows all the transactions to be inserted in accordance with a
predefined item order. This item order is maintained in a list called the
I-list. After some of the transactions have been inserted, if the item
order of the I-list differs from the current frequency-descending item
order to a predefined degree, the CP-tree is restructured through a method
called branch sorting. The item order is then updated with the current
list.

C. RElim:

In [6], the RElim algorithm is proposed, which uses an array-list
structure to find frequent item sets. Figure 1 shows the preprocessing
steps that RElim requires. Step 1 shows the original database. In step 2,
the frequency of each item is determined by scanning the database. In
step 3, the items in each transaction are sorted in ascending order of
frequency. In step 4, the transactions are sorted lexicographically by
their items.
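Steps 2 to 4 can be sketched in Python as follows. This is a minimal illustration, not the implementation from [6]. The function name `preprocess`, the toy database, and the alphabetical tie-break between items of equal frequency are assumptions made for this example.

```python
from collections import Counter

def preprocess(db):
    """Return (freq, order, transactions) for RElim-style preprocessing."""
    # Step 2: item frequencies from one scan of the database.
    freq = Counter(item for t in db for item in set(t))
    # Item order: ascending frequency; ties broken alphabetically (assumption).
    order = sorted(freq, key=lambda i: (freq[i], i))
    rank = {item: r for r, item in enumerate(order)}
    # Step 3: sort the items inside each transaction by ascending frequency.
    transactions = [sorted(set(t), key=rank.__getitem__) for t in db]
    # Step 4: sort the transactions lexicographically w.r.t. that item order.
    transactions.sort(key=lambda t: [rank[i] for i in t])
    return freq, order, transactions

# Hypothetical toy database (not the one shown in Fig. 1).
db = [["a", "d"], ["b", "d"], ["a", "b", "d"], ["c", "b"]]
freq, order, transactions = preprocess(db)
```

After this step the leading item of every transaction is its least frequent item, which is what the later list-based processing relies on.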



Fig. 1: (1) Database in original form, (2) item frequencies, (3) transactions with sorted items, (4) lexicographically sorted transactions

In step 5, the data structure used by RElim is created. This data
structure is a list with one entry per item, sorted in descending order of
item frequency. Each entry contains a counter giving the number of
transactions that start with the corresponding item as their leading item,
and a pointer to the head of that item's transaction list. The list
elements themselves contain a successor pointer and a pointer to the
transaction.
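A minimal Python sketch of this structure is given below. It replaces the pointer-based lists by Python lists and stores each transaction without its leading item, which is an assumption made for brevity. The name `build_lists` and the sample input are hypothetical.

```python
def build_lists(order, transactions):
    """Map each item to [counter, suffixes].

    counter  : number of transactions whose leading item is this item
    suffixes : those transactions with the leading item removed
    """
    lists = {item: [0, []] for item in order}
    for t in transactions:
        if not t:
            continue  # nothing to index for an empty transaction
        head, tail = t[0], tuple(t[1:])
        lists[head][0] += 1
        lists[head][1].append(tail)
    return lists

# Hypothetical input: items already sorted by ascending frequency
# inside each transaction, as produced by preprocessing.
order = ["c", "a", "b", "d"]
transactions = [["c", "b"], ["a", "b", "d"], ["a", "d"], ["b", "d"]]
lists = build_lists(order, transactions)
```

Using the leading item as the key mirrors the description above, where the leading item acts as an index into the list array.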

Fig. 2: (5) Data structure used by RElim

The basic operations of the RElim algorithm are illustrated in Figure 3.
RElim starts by eliminating the least frequent item from the list; the
corresponding array elements are transferred to the conditional database
for that item. The item to be processed is the one associated with the
last (rightmost) list (in the example this is item e).

Fig. 3: Basic operations of RElim

If the counter associated with the list, which states the

support of the item, exceeds the minimum support, the item set

consisting of this item and the prefix of the conditional

database is reported as frequent. In addition, the list is

traversed and its elements are copied to construct a new list

array, which represents the conditional database of

transactions containing the item. In this operation the leading

item of each transaction (suffix) is used as an index into the list

array to find the list it has to be added to. In addition, the

leading item is removed (see Figure 3 on the right). The

resulting conditional database is then processed recursively to

find all frequent item sets containing the list item.
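The recursive elimination described above can be sketched in Python as follows. This is a simplified illustration and not the implementation from [6]: the counter is taken as the length of each list, an item is reported when its support is at least the minimum support, and ties between equally frequent items are broken alphabetically. The names `relim` and `mine` and the toy database are assumptions.

```python
from collections import Counter

def relim(lists, order, minsup, prefix=(), out=None):
    """lists maps each item to the suffixes of transactions led by that item."""
    if out is None:
        out = {}
    for pos, item in enumerate(order):       # least frequent item first
        suffixes = lists[item]
        if len(suffixes) >= minsup:           # counter = support of prefix + item
            itemset = prefix + (item,)
            out[itemset] = len(suffixes)
            # Conditional database: copy suffixes, indexed by their leading item.
            rest = order[pos + 1:]
            cond = {i: [] for i in rest}
            for s in suffixes:
                if s:
                    cond[s[0]].append(s[1:])
            relim(cond, rest, minsup, itemset, out)
        # Reassign: drop the eliminated item and move each suffix to the
        # list of its new leading item.
        for s in suffixes:
            if s:
                lists[s[0]].append(s[1:])
        lists[item] = []
    return out

def mine(db, minsup):
    """Preprocess db and run the recursive elimination sketch."""
    freq = Counter(i for t in db for i in set(t))
    order = sorted(freq, key=lambda i: (freq[i], i))  # alphabetical tie-break
    rank = {i: r for r, i in enumerate(order)}
    lists = {i: [] for i in order}
    for t in db:
        items = sorted(set(t), key=rank.__getitem__)
        if items:
            lists[items[0]].append(tuple(items[1:]))
    return relim(lists, order, minsup)

# Hypothetical toy database.
db = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
result = mine(db, minsup=2)
```

An item below the minimum support is still reassigned but not explored recursively, so the search skips every item set that would contain it.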

III. LIMITATION OF EXISTING ALGORITHM

The limitation of the RElim algorithm is that its performance decreases
when the dataset has a large number of attributes. With more attributes,
each transaction also contains more items, so it becomes difficult to
select the prefix among items with the same frequency.

IV. PROPOSED WORK

The frequent item set mining problem can be solved using many approaches,
one of which is the RElim algorithm. As discussed in the previous section,
this algorithm has a limitation. To overcome it, we propose two different
approaches. RElim uses an array list as its data structure, and
array-based FI mining algorithms take less running time than tree-based
algorithms. RElim's operation consists of three processing steps: deleting
items, recursive processing, and reassigning transactions. Here we
consider the item-deletion step, where there is scope for improvement in
running time. All the preprocessing steps are therefore the same as in
RElim; the difference lies in the order in which an item is chosen for
pruning when several items have the same frequency.

The proposed method considers two approaches for choosing the item to
eliminate when items have the same frequency: alphabetical order, and the
order of occurrence of the items in the database.
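The two tie-breaking approaches can be sketched in Python as two sort keys. This is one possible reading of the method, in which items are first sorted by ascending frequency and the chosen rule is applied only among items of equal frequency. The function name `item_order`, its `tie_break` parameter, and the toy database are assumptions made for illustration.

```python
from collections import Counter

def item_order(db, tie_break="alphabetical"):
    """Order items by ascending frequency, resolving ties by the chosen rule.

    Transactions must be sequences so that the order of occurrence is defined.
    """
    freq = Counter()
    first_seen = {}
    for t in db:
        for item in t:
            # Position at which the item first occurs in the database.
            first_seen.setdefault(item, len(first_seen))
        freq.update(set(t))
    if tie_break == "alphabetical":
        key = lambda i: (freq[i], i)
    elif tie_break == "occurrence":
        key = lambda i: (freq[i], first_seen[i])
    else:
        raise ValueError("tie_break must be 'alphabetical' or 'occurrence'")
    return sorted(freq, key=key)

# Hypothetical database in which a, b, d and e all have frequency 2.
db = [["e", "d"], ["b", "a"], ["a", "e"], ["c", "d", "b"]]
by_alphabet = item_order(db, "alphabetical")
by_occurrence = item_order(db, "occurrence")
```

With this toy database the two rules agree on the least frequent item c but disagree on which of the tied items is eliminated next, which is exactly the choice the two approaches are meant to settle.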

Suppose the database and preprocessing steps 1 to 4 shown in Figure 1 are
unchanged, and the data structure shown in Figure 2 is also the same. To
begin step 6, we need to consult the list to select an item as the prefix
for pruning. In Figure 2, item e and item a have the same frequency. It is
therefore unclear whether item a or item e should be selected as the
prefix, and all the recursive processing and reassigning of transactions
depends heavily on this prefix.

If we consider the first approach, i.e. alphabetical order, the item-order
list is {a,b,c,d,e}, as shown in Figure 3 (Method-I), and item a is
selected as the prefix. Its array elements are transferred to the
conditional database, and the leading item of each transaction is used as
an index into the list array to find the list it has to be added to (in
our case only c). The support count of that list is incremented by the
number of transactions added to it (in our case c is incremented by 2).


Fig. 3: Operations of Modified RElim using Method-I

If we instead consider the second approach, i.e. order of occurrence, the
item-order list is {e,d,b,a,c}, as shown in Figure 4, and item e is
selected as the prefix. Its array elements are transferred to the
conditional database. The leading items b and c are used as indices into
the list array to find the lists the transactions have to be added to, and
the support counts of both lists are incremented by 1, as one transaction
bd is added to the list of b and one transaction cbd is added to the list
of c.

Fig. 4: Operations of Modified RElim using Method-II

V. COMPARING ALGORITHMS

The tree-based algorithms require more time to find frequent itemsets,
while RElim requires less execution time because it uses only three steps:
deleting items, recursive processing, and reassigning transactions. In the
modified approach, the algorithm is expected to take less time to execute,
and each item is given its due importance while generating frequent
itemsets.


VI. CONCLUSION

This paper gives a brief introduction to the algorithms used in the area
of frequent item set mining. The RElim algorithm is based on an array-list
structure and is easy to implement. Modified RElim extends the existing
RElim by maintaining an item list for items with the same frequency.
Following the comparison of such items, as part of future work, I am going
to analyse the behaviour of various interestingness measures on mining
frequent itemsets.

ACKNOWLEDGMENT

My most sincere thanks go to my advisor, Asst. Prof. Neha Pandya. I thank
her for giving me the opportunity to work in the area of FI mining, and
for her guidance, encouragement and support during the initial development
of this project. I would also like to thank her for the time she spared
for me from her extremely busy schedule.

REFERENCES

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB'94, pp. 487-499.

[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data, Washington, D.C., May 1993, pp. 207-216.

[3] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Mining and Knowledge Discovery, 2003.

[4] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In: Proc. Conf. on the Management of Data (SIGMOD'00, Dallas, TX). ACM Press, New York, NY, USA 2000.

[5] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD'97, Newport Beach, CA), 283-296. AAAI Press, Menlo Park, CA, USA 1997.

[6] C. Borgelt. Keeping things simple: finding frequent item sets by recursive elimination. Proc. Workshop Open Software for Data Mining (OSDM'05 at KDD'05, Chicago, IL), 66-70. ACM Press, New York, NY, USA 2005.

[7] C. Borgelt. Simple algorithms for frequent item set mining. Springer-Verlag, Berlin, Germany 2010.

[8] J. Han and M. Kamber, 2000. Data Mining: Concepts and Techniques. Morgan Kaufmann.



[9] S.K. Tanbeer, C.F. Ahmed, B.-S. Jeong, and Y.-K. Lee. Efficient single-pass frequent pattern mining using a prefix-tree. Information Sciences 179 (2009) 559-583.

[10] C.K.-S. Leung, Q.I. Khan, Z. Li, and T. Hoque. CanTree: A canonical-order tree for incremental frequent-pattern mining. KAIS, 11 (3), pp. 287-311, Apr. 2007.

[11] R. Somkumar. A study on various data mining approaches of association rules. Int. J. Comput. Sci. Eng. Vol. 2, pp. 141-144.

[12] C.L. Blake and C.J. Merz. UCI Repository of Machine Learning Databases. Dept. of Information and Computer Science, University of California at Irvine, CA, USA 1998. http://www.ics.uci.edu/mlearn/MLRepository.html

[13] R. Kohavi, C.E. Bradley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 Organizers' Report: Peeling the Onion. SIGKDD Exploration 2(2):86-93. ACM Press, New York, NY, USA 2000.

[14] Synthetic Data Generation Code for Associations and Sequential Patterns. Intelligent Information Systems, IBM Almaden Research Center. http://www.almaden.ibm.com/software/quest/Resources/index.shtml
