An Enhancing the Performance of High Utility Itemset Mining using Utility Information Record

V.G. Vijilesh¹, Dr. S. Hari Ganesh² and J. James Manoharan³

¹Senior Lecturer, International School of Business & Technology, Kampala, Uganda. [email protected]

²,³Asst. Professor, Dept. of Computer Applications, Bishop Heber College (Autonomous), Tiruchirappalli, Tamilnadu, India. [email protected], james [email protected]

January 4, 2018

International Journal of Pure and Applied Mathematics, Volume 118, No. 17, 2018, 257-272. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu. Special Issue.

Abstract

Discovering itemsets with high utility, such as profit, from a database is known as high utility itemset mining. In many real-time applications, such as retail marketing and Web services, high utility itemset mining is useful in the decision-making process. Efficient mining of high utility itemsets plays a very important role in many real-time applications and is a vital research issue in data mining. Existing high utility mining algorithms suffer degraded performance because they take much time to generate a large number of candidate itemsets and to compute the utility value of every candidate. In this research article, the time and space complexity of the UP-Growth and UP-Growth+ algorithms is reduced by storing the utility-related information of items in a data structure. The proposed algorithm avoids generating candidate itemsets when finding the set of high utility itemsets.

Key Words: Utility; Utility Information Record; Candidate Itemset; Effective High Utility Itemset Mining.

1 Introduction

Frequent itemset mining is a fundamental research topic in data mining applications. Rapid developments in database techniques facilitate the storage and use of data from large databases, as well as mining them. Finding valuable information in databases has become an essential task, which has given rise to many research topics [1]. Extensive studies [1, 5] have been proposed for mining frequent itemsets from databases and have been adopted effectively in various application areas. In market analysis, mining frequent itemsets from a transaction database refers to finding the itemsets that frequently appear together in transactions [2, 8]. However, the unit profits and purchased quantities of items are not considered in the framework of frequent itemset mining. Hence, it cannot satisfy the requirement of a user who is interested in discovering the itemsets with high sales profits. In view of this, utility mining [9, 10] has emerged as an important topic in data mining for discovering itemsets with high utility, such as profit.

Frequent itemset mining considers only the presence or absence of items in transactions; other information related to the items is not considered. To overcome this problem, weighted association rule mining was introduced. In weighted association rule mining, weights of items, such as their unit profits in transaction databases, are considered. With this concept, even items that appear infrequently may still be found if they have high weights [11]. However, the quantities of items are still not considered. This gave rise to the research area of finding high utility itemsets in databases. Utility is an important feature of an itemset in a transaction: it specifies the utility or profit of the itemset, taking its frequency into account.

Mining high utility itemsets from databases refers to finding the itemsets with high utilities [3]. The fundamental meaning of utility is the interestingness, importance, or profitability of items to users. The utility of items in a transaction database consists of two aspects: (1) the significance of distinct items, which is called external utility, and (2) the importance of the items in the transaction, which is called internal utility. The utility of an itemset is defined as the external utility multiplied by the internal utility. An itemset is called a high utility itemset if its utility is not less than a user-specified threshold; otherwise, it is called a low utility itemset. Mining high utility itemsets from databases is an important task that is essential to a wide range of applications, such as website clickstream analysis, cross-marketing in retail stores, business promotion in chain hypermarkets, and even biomedical applications.
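As a minimal sketch of these definitions, the following Python example computes the utility of an itemset as the sum, over the transactions that contain it, of unit profit (external utility) times purchased quantity (internal utility). The item names, profits, quantities, and function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List

# Hypothetical external utilities: unit profit of each item.
UNIT_PROFIT: Dict[str, int] = {"a": 1, "b": 2, "c": 3}

# Hypothetical transactions: item -> purchased quantity (internal utility).
TRANSACTIONS: List[Dict[str, int]] = [
    {"a": 2, "b": 1},          # T1
    {"a": 1, "c": 2},          # T2
    {"a": 3, "b": 2, "c": 1},  # T3
]


def item_utility(item: str, transaction: Dict[str, int]) -> int:
    """External utility multiplied by internal utility for one item."""
    return UNIT_PROFIT[item] * transaction[item]


def itemset_utility(itemset: Iterable[str],
                    transactions: List[Dict[str, int]]) -> int:
    """Sum the itemset's utility over every transaction containing all its items."""
    items = list(itemset)
    total = 0
    for transaction in transactions:
        if all(item in transaction for item in items):
            total += sum(item_utility(item, transaction) for item in items)
    return total


def is_high_utility(itemset: Iterable[str],
                    transactions: List[Dict[str, int]],
                    minutil: int) -> bool:
    """High utility means the utility is not less than the threshold."""
    return itemset_utility(itemset, transactions) >= minutil


if __name__ == "__main__":
    for candidate in (["a"], ["a", "b"], ["a", "c"], ["a", "b", "c"]):
        print(candidate, itemset_utility(candidate, TRANSACTIONS))
```

With a threshold of 10 in this toy data, {a, b} and {a, c} are high utility although the single item {a} is not, which shows that utility, unlike support, is not monotone in the size of the itemset.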

However, mining high utility itemsets from databases is not an easy task. Existing studies [2, 4, 6, 7] applied overestimation methods to improve the performance of utility mining. In these methods, potential high utility itemsets are found first, and then an additional database scan is performed to identify their utilities. However, these methods often generate a huge set of potential high utility itemsets, and the mining performance degrades as a consequence. The situation may become worse when the database contains many long transactions or when a low threshold is set. The huge number of potential high utility itemsets poses a challenging problem for mining performance, since the processing cost grows as more potential high utility itemsets are generated [8]. To address this issue, we propose in this paper a novel algorithm with a compact data structure for efficiently discovering high utility itemsets from transactional databases.

When the number of candidates is so large that they cannot be stored in memory, the algorithms fail or their performance degrades due to thrashing. To overcome this problem, we propose an enhanced algorithm for high utility itemset mining.

The contributions of the paper are as follows:

1. A new structure, called the utility information record, is developed. A utility information record stores the utility information about an itemset along with heuristic information about whether the itemset should be pruned.

2. An efficient algorithm, called the Efficient High Utility Itemset Mining (EHUIM) Algorithm, is proposed. The EHUIM Algorithm does not produce candidate high utility itemsets. After constructing the initial utility information records from the mined database, the EHUIM Algorithm can mine high utility itemsets from these records. We use various standard and real datasets [4].

2 Related Work

Several algorithms have been developed for high utility itemset mining, such as IHUP [7], UP-Growth [5], and Two-Phase [6]. The Two-Phase algorithm [6], proposed by Liu et al., consists of two phases. In phase I, it uses an Apriori-based technique to enumerate high transaction-weighted utility itemsets (HTWUIs). It creates the next set of candidate itemsets from the previous set and prunes candidates using the transaction-weighted downward closure (TWDC) property. In each pass, the HTWUIs and their estimated utility values are calculated by scanning the database. After this, the complete set of HTWUIs is collected. In phase II, the original database is scanned to discover the high utility itemsets and their utilities.
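The two phases can be illustrated with a small Python sketch. It is not the implementation of Liu et al.; it is a simplified level-wise search over a hypothetical three-item database, and the names `twu`, `phase_one`, and `phase_two` are assumptions made for this example. Phase I keeps every itemset whose transaction-weighted utility (an overestimate of its utility) reaches the threshold, and phase II computes exact utilities for those candidates.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical data: unit profits and per-transaction quantities.
UNIT_PROFIT: Dict[str, int] = {"a": 1, "b": 2, "c": 3}
TRANSACTIONS: List[Dict[str, int]] = [
    {"a": 2, "b": 1},
    {"a": 1, "c": 2},
    {"a": 3, "b": 2, "c": 1},
]

Itemset = Tuple[str, ...]


def transaction_utility(t: Dict[str, int]) -> int:
    return sum(UNIT_PROFIT[i] * q for i, q in t.items())


def twu(itemset: Itemset) -> int:
    """Sum of the utilities of all transactions containing the itemset."""
    return sum(transaction_utility(t) for t in TRANSACTIONS
               if all(i in t for i in itemset))


def utility(itemset: Itemset) -> int:
    """Exact utility of the itemset in the database."""
    return sum(UNIT_PROFIT[i] * t[i] for t in TRANSACTIONS
               if all(j in t for j in itemset) for i in itemset)


def phase_one(minutil: int) -> List[Itemset]:
    """Level-wise enumeration of HTWUIs, pruned by downward closure of TWU."""
    level = [(i,) for i in sorted(UNIT_PROFIT) if twu((i,)) >= minutil]
    htwuis: List[Itemset] = []
    while level:
        htwuis.extend(level)
        size = len(level[0]) + 1
        kept = set(level)
        candidates = {tuple(sorted(set(x) | set(y)))
                      for x in level for y in level
                      if len(set(x) | set(y)) == size}
        # A candidate survives only if all its subsets are HTWUIs
        # and its own TWU reaches the threshold.
        level = sorted(c for c in candidates
                       if all(s in kept for s in combinations(c, size - 1))
                       and twu(c) >= minutil)
    return htwuis


def phase_two(candidates: List[Itemset], minutil: int) -> Dict[Itemset, int]:
    """Rescan the database to keep only the true high utility itemsets."""
    return {c: utility(c) for c in candidates if utility(c) >= minutil}


if __name__ == "__main__":
    found = phase_one(10)
    print(len(found), "candidates ->", phase_two(found, 10))
```

In this toy database with a threshold of 10, phase I retains all seven non-empty itemsets while only three of them are actually high utility, which is the kind of candidate overhead the later algorithms try to reduce.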

Although the Two-Phase algorithm efficiently reduces the search space and discovers the complete set of high utility itemsets, it still creates too many candidates for HTWUIs and requires multiple database scans. To overcome these issues, Ahmed et al. [7] developed a tree-based algorithm called IHUP. The algorithm uses an IHUP-Tree to retain the information about high utility itemsets and transactions. Every node in the IHUP-Tree contains an item name, a support count, and a transaction-weighted utility (TWU) value. The algorithm works in three steps. In the first step, the items in each transaction are rearranged in a fixed order, such as lexicographic order, and the IHUP-Tree is then created from the rearranged transactions. In the second step, HTWUIs are generated from the IHUP-Tree. In the third step, high utility itemsets and their utilities are identified from the set of HTWUIs by scanning the original database.

Even though IHUP discovers HTWUIs without creating any candidates for HTWUIs and attains better performance than Two-Phase, it still produces too many HTWUIs in phase I. To overcome this problem, Vincent S. Tseng, Cheng-Wei Wu, Bai-En Shie, and Philip S. Yu [5] developed the UP-Growth algorithm. It introduces a compact tree structure, called the utility pattern tree (UP-Tree), for discovering high utility itemsets and maintaining the important information related to utility patterns within databases. High utility itemsets can be generated from the UP-Tree efficiently with only two scans of the original database. Four strategies are proposed, namely DGU, DGN, DLU, and DLN. The first two strategies are applied to the UP-Tree to globally remove unpromising items from the potential high utility itemsets [9]. The other two strategies, DLU and DLN, are applied by UP-Growth on the UP-Tree to remove local unpromising items. The actual high utility itemsets are then identified from the set of potential high utility itemsets.

All these algorithms first create candidate itemsets, which requires more time and space. In the proposed algorithm, the search space of the UP-Growth algorithm [5] is reduced, and a utility information record structure is used instead of the UP-Tree.

3 Proposed Method

The enhanced framework consists of the following three steps: 1) scan the database to create the utility information records; 2) apply the EHUI mining algorithm; 3) generate the high utility itemsets.

3.1 Utility Information Record Structure

In this section, we propose a utility information record structure to maintain the utility information about a database.

3.1.1 Initial Utility information record

The initial utility information records, which store the utility information about the mined database, can be constructed with two scans of the database. First, the transaction-weighted utilities of all items are collected in one database scan. If the transaction-weighted utility of an item is less than a given minutil, the item is no longer considered. The remaining items, whose transaction-weighted utilities are no less than minutil, are sorted in ascending order of transaction-weighted utility.
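The two scans can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not spell out the record layout, so the sketch assumes each record is a list of (tid, iutil, rutil) entries, where iutil is the item's utility in the transaction and rutil is the utility of the items that follow it in the sorted order. The data and the function name `build_initial_records` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[int, int, int]  # (tid, iutil, rutil); layout is an assumption

# Hypothetical data: unit profits and per-transaction quantities.
UNIT_PROFIT: Dict[str, int] = {"a": 1, "b": 2, "c": 3, "d": 1}
TRANSACTIONS: List[Dict[str, int]] = [
    {"a": 2, "b": 1},           # tid 0
    {"a": 1, "c": 2, "d": 1},   # tid 1
    {"a": 3, "b": 2, "c": 1},   # tid 2
]


def build_initial_records(transactions: List[Dict[str, int]],
                          unit_profit: Dict[str, int],
                          minutil: int):
    # Scan 1: accumulate the transaction-weighted utility (TWU) of each item.
    twu: Dict[str, int] = defaultdict(int)
    for t in transactions:
        tu = sum(unit_profit[i] * q for i, q in t.items())
        for i in t:
            twu[i] += tu

    # Drop items whose TWU is below minutil; sort the rest by ascending TWU
    # (ties broken by item name so the order is deterministic).
    order = sorted((i for i in twu if twu[i] >= minutil),
                   key=lambda i: (twu[i], i))
    rank = {item: pos for pos, item in enumerate(order)}

    # Scan 2: emit one (tid, iutil, rutil) entry per kept item per transaction.
    records: Dict[str, List[Entry]] = {i: [] for i in order}
    for tid, t in enumerate(transactions):
        kept = sorted((i for i in t if i in rank), key=rank.__getitem__)
        remaining = sum(unit_profit[i] * t[i] for i in kept)
        for i in kept:
            iutil = unit_profit[i] * t[i]
            remaining -= iutil          # utility of the items after i
            records[i].append((tid, iutil, remaining))
    return order, records


if __name__ == "__main__":
    order, records = build_initial_records(TRANSACTIONS, UNIT_PROFIT, 10)
    print(order)
    for item in order:
        print(item, records[item])
```

Because transactions are visited in tid order, every record is already sorted by tid, which is what later makes a linear-time intersection possible.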


3.1.2 Utility information record of 2-Itemsets

There is no need to scan the database to construct the utility information record of a 2-itemset. The utility information record of 2-itemset {xy} can be constructed by intersecting the utility information record of {x} with that of {y}. The algorithm identifies the common transactions by comparing the tids in the two utility information records. Suppose the lengths of the utility information records are m and n, respectively; then at most (m + n) comparisons are enough to identify the common transactions, because all tids in a utility information record are ordered. The identification process is in effect a 2-way comparison.
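The 2-way comparison can be sketched as a two-pointer merge over the tid-sorted records. The (tid, iutil, rutil) entry layout, the choice to carry over the rutil of {y}, and the name `intersect_records` are assumptions made for this illustration; the sample records are hypothetical.

```python
from typing import List, Tuple

Entry = Tuple[int, int, int]  # (tid, iutil, rutil); layout is an assumption


def intersect_records(rec_x: List[Entry], rec_y: List[Entry]) -> List[Entry]:
    """Join the records of {x} and {y}, where x precedes y in the item order.

    Both inputs must be sorted by tid. Each step advances at least one
    pointer, so at most len(rec_x) + len(rec_y) comparisons are made.
    """
    joined: List[Entry] = []
    i = j = 0
    while i < len(rec_x) and j < len(rec_y):
        tid_x, iutil_x, _ = rec_x[i]
        tid_y, iutil_y, rutil_y = rec_y[j]
        if tid_x == tid_y:
            # Common transaction: utilities add up; the remaining utility
            # is the one recorded for the later item y (assumed convention).
            joined.append((tid_x, iutil_x + iutil_y, rutil_y))
            i += 1
            j += 1
        elif tid_x < tid_y:
            i += 1
        else:
            j += 1
    return joined


if __name__ == "__main__":
    # Hypothetical records of single items b, c and a.
    rec_b = [(0, 2, 2), (2, 4, 6)]
    rec_c = [(1, 6, 1), (2, 3, 3)]
    rec_a = [(0, 2, 0), (1, 1, 0), (2, 3, 0)]
    print(intersect_records(rec_b, rec_c))
    print(intersect_records(rec_b, rec_a))
```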

3.1.3 Utility information record of k-Itemsets (k≥3)

To construct the utility information record of a k-itemset $\{i_1 \ldots i_{k-1} i_k\}$ ($k \ge 3$), we can directly intersect the utility information record of $\{i_1 \ldots i_{k-2} i_{k-1}\}$ with that of $\{i_1 \ldots i_{k-2} i_k\}$, as we do when constructing the utility information record of a 2-itemset.
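One detail the description leaves implicit is that both input records already contain the utility of the shared prefix, so a plain sum would count it twice. The sketch below assumes, following the usual utility-list convention rather than anything stated in the text, that the prefix utility is subtracted once per common transaction. The entry layout, the sample records, and the name `join_with_prefix` are hypothetical.

```python
from typing import Dict, List, Tuple

Entry = Tuple[int, int, int]  # (tid, iutil, rutil); layout is an assumption


def join_with_prefix(rec_p: List[Entry],
                     rec_px: List[Entry],
                     rec_py: List[Entry]) -> List[Entry]:
    """Build the record of Pxy from the records of P, Px and Py.

    Assumption: iutil(Pxy) = iutil(Px) + iutil(Py) - iutil(P) in every
    common transaction, and the rutil is taken from Py.
    """
    prefix_iutil: Dict[int, int] = {tid: iutil for tid, iutil, _ in rec_p}
    py_by_tid = {tid: (iutil, rutil) for tid, iutil, rutil in rec_py}
    joined: List[Entry] = []
    for tid, iutil_px, _ in rec_px:
        if tid in py_by_tid:
            iutil_py, rutil_py = py_by_tid[tid]
            joined.append(
                (tid, iutil_px + iutil_py - prefix_iutil.get(tid, 0), rutil_py)
            )
    return joined


if __name__ == "__main__":
    # Hypothetical records: prefix P = {b}, Px = {b, c}, Py = {b, a}.
    rec_b = [(0, 2, 2), (2, 4, 6)]
    rec_bc = [(2, 7, 3)]
    rec_ba = [(0, 4, 0), (2, 7, 0)]
    print(join_with_prefix(rec_b, rec_bc, rec_ba))
```

This version uses a dictionary lookup instead of a two-pointer merge purely for brevity; since the records are tid-sorted, the same join can also be done with the linear merge used for 2-itemsets.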

3.2 EHUIM Algorithm

After the utility information records are constructed, the EHUIM Algorithm can mine all high utility itemsets from the database.


3.2.1 Domain Space

The domain space of the high utility itemset mining problem can be represented as a combination tree. Given a set of items $I = \{i_1, i_2, i_3, \ldots, i_n\}$ and a total order on all items (suppose $i_1 < i_2 < \ldots < i_n$), a combination tree representing all itemsets can be constructed as follows.

First, the root of the tree is created. Second, the $n$ child nodes of the root, representing the $n$ 1-itemsets, are created. Third, for a node representing itemset $\{i_s \ldots i_e\}$ ($1 \le s \le e < n$), the $(n - e)$ child nodes representing itemsets $\{i_s \ldots i_e i_{e+1}\}$, $\{i_s \ldots i_e i_{e+2}\}$, ..., $\{i_s \ldots i_e i_n\}$ are created. The third step is repeated until all leaf nodes are created. For example, given $I = \{e, c, b, a, d\}$ and $e < c < b < a < d$, a combination tree representing all itemsets of $I$ is depicted in Fig. 1.
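The construction can be sketched as a depth-first generator in Python: each node extends its itemset only with items that come later in the total order, so every non-empty itemset appears exactly once. The function name `combination_tree` is an assumption for this example.

```python
from typing import Iterator, List, Sequence, Tuple


def combination_tree(items: Sequence[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every node of the combination tree in depth-first order.

    `items` must already be listed in the chosen total order.
    """
    def expand(prefix: List[str], start: int) -> Iterator[Tuple[str, ...]]:
        # A node ending at position e has one child per later item.
        for pos in range(start, len(items)):
            node = prefix + [items[pos]]
            yield tuple(node)
            yield from expand(node, pos + 1)

    yield from expand([], 0)


if __name__ == "__main__":
    nodes = list(combination_tree(["e", "c", "b", "a", "d"]))
    print(len(nodes))   # number of non-empty itemsets
    print(nodes[:5])    # leftmost branch of the tree
```

For the five items of the example, the generator visits 31 nodes, that is, all $2^5 - 1$ non-empty itemsets.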

Fig. 1: Combination Tree


3.2.2 Pruning Strategy

For a database with $n$ items, an exhaustive search has to check $2^n$ itemsets. To reduce the search space, we can exploit the iutils and rutils in the utility information record of an itemset. The sum of all the iutils in the utility information record of an itemset is the utility of the itemset according to Definition 5, and thus the itemset is high utility if the sum is no less than a given minutil. The sum of all the iutils and rutils in the utility information record provides the EHUIM Algorithm with the key information about whether the itemset should be pruned.
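The two sums can be sketched as a small Python helper. It assumes a record is a list of (tid, iutil, rutil) entries and uses a "not less than" comparison, in line with the definition of a high utility itemset in the Introduction; the name `classify` and the sample records are hypothetical.

```python
from typing import List, Tuple

Entry = Tuple[int, int, int]  # (tid, iutil, rutil); layout is an assumption


def classify(record: List[Entry], minutil: int) -> Tuple[bool, bool]:
    """Return (is_high_utility, should_extend) for one record.

    The sum of iutils is the itemset's utility; the sum of iutils and
    rutils bounds the utility of any extension of the itemset.
    """
    sum_iutil = sum(iutil for _, iutil, _ in record)
    sum_bound = sum(iutil + rutil for _, iutil, rutil in record)
    return sum_iutil >= minutil, sum_bound >= minutil


if __name__ == "__main__":
    rec_b = [(0, 2, 2), (2, 4, 6)]             # utility 6, bound 14
    rec_a = [(0, 2, 0), (1, 1, 0), (2, 3, 0)]  # utility 6, bound 6
    print(classify(rec_b, 10))  # not high utility, but worth extending
    print(classify(rec_a, 10))  # not high utility, and can be pruned
```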

3.2.3 EHUI Mining Algorithm

Algorithm 2 shows the pseudo-code of the EHUIM Algorithm. For each utility information record X in UIRs (the second parameter), if the sum of all the iutils in X is no less than minutil, then the extension associated with X is high utility and is output. According to Lemma 1, X should be processed further only when the sum of all the iutils and rutils in X is no less than minutil. The initial utility information records are constructed from the database, and they are sorted and processed in ascending order of transaction-weighted utility. Therefore, all the utility information records in UIRs are ordered in the same way as the initial utility information records. To explore the search space, the algorithm intersects X with each utility information record Y after X in UIRs. Suppose X is the utility information record of itemset Px and Y that of itemset Py; then Build(P.UIR, X, Y) in line 8 constructs the utility information record of itemset Pxy, as stated in Algorithm 1. Finally, the set of utility information records of all the 1-extensions of itemset Px is processed recursively. Given a database and a minutil, after the initial utility information records (IUIRs) are constructed, EHUIM(∅, IUIRs, minutil) can mine all high utility itemsets.
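The recursion described above can be sketched in Python. This is a reading of the prose, not a transcription of Algorithms 1 and 2, whose pseudo-code is not reproduced in this text. The (tid, iutil, rutil) entry layout, the prefix subtraction inside `build`, the "not less than" comparisons, and the hard-coded initial records are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Entry = Tuple[int, int, int]            # (tid, iutil, rutil); assumed layout
Record = List[Entry]
Named = Tuple[str, Record]              # (last item of the itemset, record)


def build(rec_p: Record, rec_px: Record, rec_py: Record) -> Record:
    """Record of Pxy from the records of P, Px and Py (assumed join rule)."""
    prefix = {tid: iutil for tid, iutil, _ in rec_p}
    py_by_tid = {tid: (iutil, rutil) for tid, iutil, rutil in rec_py}
    out: Record = []
    for tid, iutil_x, _ in rec_px:
        if tid in py_by_tid:
            iutil_y, rutil_y = py_by_tid[tid]
            out.append((tid, iutil_x + iutil_y - prefix.get(tid, 0), rutil_y))
    return out


def ehuim(prefix_items: Tuple[str, ...], prefix_rec: Record,
          records: List[Named], minutil: int,
          found: Dict[Tuple[str, ...], int]) -> Dict[Tuple[str, ...], int]:
    """Depth-first mining over records kept in TWU-ascending order."""
    for idx, (item_x, rec_x) in enumerate(records):
        itemset = prefix_items + (item_x,)
        sum_iutil = sum(e[1] for e in rec_x)
        sum_rutil = sum(e[2] for e in rec_x)
        if sum_iutil >= minutil:
            found[itemset] = sum_iutil           # output a high utility itemset
        if sum_iutil + sum_rutil >= minutil:     # otherwise prune this branch
            extensions = [(item_y, build(prefix_rec, rec_x, rec_y))
                          for item_y, rec_y in records[idx + 1:]]
            ehuim(itemset, rec_x, extensions, minutil, found)
    return found


if __name__ == "__main__":
    # Hypothetical initial records for items b, c, a in TWU-ascending order.
    initial: List[Named] = [
        ("b", [(0, 2, 2), (2, 4, 6)]),
        ("c", [(1, 6, 1), (2, 3, 3)]),
        ("a", [(0, 2, 0), (1, 1, 0), (2, 3, 0)]),
    ]
    print(ehuim((), [], initial, 10, {}))
```

On this toy input the sketch visits five itemsets and never rescans the database: every utility is read directly from a record, and the branch rooted at item a is pruned by its iutil plus rutil bound.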


4 Experimental Evaluation

The performance of the proposed algorithm is evaluated in this section. The experiments were performed on a 2.20 GHz Core 2 Duo processor with 2 GB of memory, running Linux Fedora 14. The algorithms were implemented in Java. Both real and standard datasets were used in the experiments. The standard datasets were obtained from the FIMI Repository. The real datasets were generated from actual values: an educational dataset for evaluating the feedback reports of faculty members is used as the real dataset. Parameter descriptions and default values of the datasets are shown in Table 1.

Table 1: Statistics about Databases

4.1 Performance comparison on different datasets

Running Time: When measuring running time, we varied the minutil for each database. The lower the minutil, the larger the number of high utility itemsets, and thus the longer the running time. For example, for the Chess database in Fig. 4, when the minutils are 80% and 90%, the running times of EHUI are 1400 ms and 800 ms, respectively.

For almost all databases and minutils, EHUI performs the best. In Fig. , EHUI is slower than UP-Growth and UP-Growth+ for low minutils, and we found that in this case UP-Growth+ requires less time. However, for high minutils, EHUI is as much as an order of magnitude faster than UP-Growth and UP-Growth+. For the Educational Feedback Dataset, when the minutils are 50%, 60%, and 70%, the running time required by EHUI is 40 ms.

Fig. 2: Time for Educational Dataset (Separate File)

Fig. 3: Time for Educational Dataset (Combine File)


Fig. 4: Time for Chess Dataset

Memory Consumption

Fig. 5: Memory Space for Educational Dataset (Separate File)


Fig. 6: Memory Space for Educational Dataset (Combine File)

Generally, the memory consumption of the algorithms is proportional to the number of candidate itemsets they generate. For example, for the Chess database, UP-Growth generates 623 candidate itemsets, UP-Growth+ generates 551, and IHUP generates 30, and they consume 17.60 MB, 21.81 MB, and 16.02 MB of memory, respectively. The Educational Feedback Dataset shows a similar pattern. EHUI requires less space than UP-Growth and, in some cases, less than UP-Growth+.

Fig. 7: Memory Space for Chess Dataset

Itemsets Found


Fig. 8: Itemset for Educational Dataset (Separate File)

Fig. 9: Itemset for Educational Dataset (Combine File)

Fig. 10: Itemset for Chess Dataset


5 Conclusion

In this paper, we have proposed a novel data structure, the utility information record, and developed an efficient algorithm, EHUI, for high utility itemset mining. The utility information record provides not only utility information about itemsets but also important pruning information for EHUI. We have used a real educational dataset and standard datasets. Previous algorithms have to process a very large number of candidate itemsets during mining. However, most candidate itemsets are not high utility and are eventually discarded. The EHUI Algorithm mines high utility itemsets without candidate generation, which avoids the costly generation and utility computation of candidates; as a result, it requires less time and space than UP-Growth and UP-Growth+. In future work, the complexity could be reduced further by lowering the cost of joining utility information records.

References

[1] Jyothi Pillai, O.P. Vyas. Overview of Itemset Utility Mining and its Applications. IJCA (0975-8887), Volume 5, No. 11, August 2010.

[2] J. Han, H. Cheng, D. Xin, and X. Yan. Frequent pattern mining: Current status and future directions. Data Mining and Knowledge Discovery, 15(1):55-86, 2007.

[3] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and Y.-K. Lee. Efficient tree structures for high utility pattern mining in incremental databases. IEEE Transactions on Knowledge and Data Engineering, 21(12):1708-1721, 2009.

[4] Frequent Itemset Mining Implementations Repository, http://fimi.cs.helsinki.fi/, 2013.

[5] Vincent S. Tseng, Bai-En Shie, Cheng-Wei Wu, and Philip S. Yu. Efficient Algorithms for Mining High Utility Itemsets from Transactional Databases. IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 8, August 2013.


[6] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets mining algorithm. In Proc. of the Utility-Based Data Mining Workshop, 2005.

[7] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and Y.-K. Lee. Efficient tree structures for high utility pattern mining in incremental databases. IEEE Transactions on Knowledge and Data Engineering, Vol. 21, Issue 12, pp. 1708-1721, 2009.

[8] Frequent itemset mining implementations repository, http://fimi.cs.helsinki.fi/

[9] J. Pillai, O.P. Vyas. Overview of itemset utility mining and its applications. International Journal of Computer Applications (0975-8887), Volume 5, No. 11, August 2010.

[10] B.-E. Shie, V. S. Tseng, and P. S. Yu. Online mining of temporal maximal utility itemsets from data streams. In Proc. of the 25th Annual ACM Symposium on Applied Computing, Switzerland, Mar. 2010.

[11] L. Szathmary, A. Napoli, P. Valtchev. Towards rare itemset mining. In Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence, 2007, Volume 1, pp. 305-312.
