

2009 IEEE-RIVF International Conference on Computing and Communication Technologies, Danang City, Viet Nam (2009.07.13-2009.07.17)

Mining High Utility Itemsets from Vertical Distributed Databases

Bay Vo
Faculty of Information Technology, Ho Chi Minh City University of Technology, Ho Chi Minh, Vietnam

Huy Nguyen
Faculty of Information Technology, Saigon University, Ho Chi Minh, Vietnam

Bac Le
Faculty of Information Technology, University of Science, Ho Chi Minh, Vietnam

Abstract—Utility-based itemset mining has been widely discussed in recent years. Many algorithms mine high utility itemsets (HUIs) by pruning candidates based on estimated utility values or on transaction-weighted utilization values; both approaches aim to reduce the search space. In this paper, we propose a method for mining HUIs from vertical distributed databases. The method does not integrate the local databases of the SlaverSites into the MasterSite, and it scans each local database only once. Experiments show that the run-time of this method is lower than that of mining the concentration (centralized) database.

Keywords-Concentration database, utility itemset, utility constraint, vertical distributed databases, WIT-tree.

I. INTRODUCTION

Mining high utility itemsets (HUIs) is a generalization of mining frequent itemsets (FIs) [1]. It aims to find the itemsets that have high utility in a database. However, unlike FI mining, HUI mining does not satisfy the Apriori property: a subset of an HUI is not necessarily an HUI. Therefore, the algorithms for FI mining cannot be applied directly to HUI mining.

In 2004, H. Yao, H. J. Hamilton, and C. J. Butz [14] proposed a model for mining HUIs. The UMining algorithm and UMining_H (UMining with a heuristic) were then proposed to find HUIs [15].

Recently, several algorithms based on Transaction-Weighted Utilization (TWU) have been developed [4, 5, 6, 7]. The Two-Phase algorithm was first proposed by Y. Liu, W. Liao, and A. Choudhary [7]. After that, more efficient algorithms were proposed [4, 5], based on methods that do not generate candidates when mining HUIs. In [6], the authors proposed the WIT-tree, a new data structure, and an efficient algorithm for mining HUIs.

Although there are many algorithms for mining HUIs, none of them addresses distributed databases.

In this paper, we propose a method for mining HUIs from vertical distributed databases. The main contributions of this paper are as follows:

• We propose a general model for mining HUIs from vertical distributed databases.

• We propose a method for mining HUIs that scans each local database only once and does not need to integrate the databases of the sites. This method aims to reduce mining time and memory requirements.

The rest of this paper is organized as follows: Section II presents the background and some current methods for solving the problem of HUI mining. The model for mining HUIs in vertical distributed databases is presented in Section III, where we also discuss how the MasterSite and the SlaverSites work and how they exchange information with each other. Section IV provides experimental results and evaluates the performance of the proposed strategy. Finally, we present the conclusion and future work in Section V.

II. RELATED WORKS

In recent years, many HUI mining algorithms have been proposed [4, 5, 6, 7, 14]. The usefulness of an itemset is characterized as a utility constraint: an itemset is interesting to the user only if its utility satisfies a given utility constraint (minutil). The usefulness of an itemset is computed from the objective and subjective values of its items. The objective value of an item, denoted $x_{pq}$, is the value of an attribute associated with item $i_p$ in transaction $t_q$. The subjective value of an item, denoted $y_p$, is a real number assigned by the user such that, for any two items $i_p$ and $i_q$, $y_p$ is greater than $y_q$ if the user prefers item $i_p$ to item $i_q$. The utility-based itemset mining problem is to discover the set $H$ of all HUIs, i.e., $H = \{S \mid S \subseteq I,\ u(S) \ge minutil\}$.

978-1-4244-4568-4/09/$25.00 ©2009 IEEE


The utility of an itemset $S$ is defined as

$$u(S) = \sum_{i_p \in S} \sum_{t_q \in T_S} f(x_{pq}, y_p) \tag{1}$$

where $f(x_{pq}, y_p) = x_{pq} \cdot y_p$, and $T_S$ is the set of transactions that contain itemset $S$.
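As a minimal sketch of equation (1), the following Python snippet computes u(S) on the example database of Tables I and II. The dictionary layout and the function name `utility` are illustrative assumptions, not part of the original method.

```python
# Example database of Table I: transaction -> {item: objective value x_pq}.
# Items with value 0 are simply omitted from a transaction.
DB = {
    "T1": {"C": 16, "E": 1},
    "T2": {"B": 12, "D": 2, "E": 1},
    "T3": {"A": 2, "C": 1, "E": 1},
    "T4": {"A": 1, "D": 2, "E": 1},
    "T5": {"C": 4, "E": 2},
    "T6": {"A": 1, "B": 2},
    "T7": {"B": 20, "D": 2, "E": 1},
    "T8": {"A": 3, "C": 25, "D": 6, "E": 1},
    "T9": {"A": 1, "B": 2},
    "T10": {"B": 12, "C": 2, "E": 2},
}
# Table II: item -> subjective value y_p (benefit).
BENEFIT = {"A": 3, "B": 5, "C": 1, "D": 3, "E": 5}


def utility(itemset, db=DB, benefit=BENEFIT):
    """u(S): sum of x_pq * y_p over the items of S, restricted to T_S."""
    total = 0
    for items in db.values():
        # A transaction belongs to T_S only if it contains every item of S.
        if all(i in items for i in itemset):
            total += sum(items[i] * benefit[i] for i in itemset)
    return total


if __name__ == "__main__":
    print(utility({"B"}))       # 240
    print(utility({"B", "D"}))  # 172
```

With minutil = 130, both {B} and {B, D} are HUIs in this example, while {A} (utility 24) is not.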

A. Estimated utility value method

H. Yao et al. [14] reduced the search space by pruning candidates based on an estimated utility value. The utility of an itemset $S^k$ is always less than or equal to its utility upper bound $b(S^k)$; based on this bound, H. Yao and H. J. Hamilton proposed the UMining algorithm [15] for mining all HUIs.

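B. Transaction-weighted utilization value method

Y. Liu et al. [7] reduced the search space by pruning candidates based on the transaction-weighted utilization (twu) value. The twu of an itemset $S$ is the sum of the transaction utilities $tu(t_q)$ of all transactions that contain $S$:

$$twu(S) = tu(T_S) = \sum_{t_q \in T_S} tu(t_q) = \sum_{t_q \in T_S} \sum_{i_p \in t_q} f(x_{pq}, y_p) \tag{2}$$

The utility of an itemset $S$ is always less than or equal to the twu value of $S$:

$$u(S) = \sum_{t_q \in T_S} \sum_{i_p \in S} f(x_{pq}, y_p) \le \sum_{t_q \in T_S} \sum_{i_p \in t_q} f(x_{pq}, y_p) = twu(S) \tag{3}$$

Moreover, if a $(k-1)$-itemset $S^{k-1}$ is contained in a $k$-itemset $S^k$, every transaction that contains $S^k$ also contains $S^{k-1}$, so

$$twu(S^{k-1}) = \sum_{t_q \in T_{S^{k-1}}} tu(t_q) \ge \sum_{t_q \in T_{S^k}} tu(t_q) = twu(S^k) \tag{4}$$

that is, twu satisfies the Apriori property.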
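The bounds (3) and (4) can be checked numerically on the example database of Tables I and II. The snippet below is a small sketch; the names `tu`, `twu`, and `utility` and the dictionary layout are assumptions made for illustration.

```python
# Example database of Table I (zero entries omitted) and benefits of Table II.
DB = {
    "T1": {"C": 16, "E": 1},
    "T2": {"B": 12, "D": 2, "E": 1},
    "T3": {"A": 2, "C": 1, "E": 1},
    "T4": {"A": 1, "D": 2, "E": 1},
    "T5": {"C": 4, "E": 2},
    "T6": {"A": 1, "B": 2},
    "T7": {"B": 20, "D": 2, "E": 1},
    "T8": {"A": 3, "C": 25, "D": 6, "E": 1},
    "T9": {"A": 1, "B": 2},
    "T10": {"B": 12, "C": 2, "E": 2},
}
BENEFIT = {"A": 3, "B": 5, "C": 1, "D": 3, "E": 5}


def tu(items, benefit=BENEFIT):
    """Transaction utility: sum of x_pq * y_p over all items of the transaction."""
    return sum(qty * benefit[i] for i, qty in items.items())


def twu(itemset, db=DB, benefit=BENEFIT):
    """Equation (2): sum of tu(t_q) over the transactions containing the itemset."""
    return sum(tu(items, benefit) for items in db.values()
               if all(i in items for i in itemset))


def utility(itemset, db=DB, benefit=BENEFIT):
    """Equation (1): only the items of the itemset contribute."""
    return sum(sum(items[i] * benefit[i] for i in itemset)
               for items in db.values()
               if all(i in items for i in itemset))


if __name__ == "__main__":
    print(utility({"B", "D"}), twu({"B", "D"}))  # 172 182, as in inequality (3)
    print(twu({"B"}), twu({"B", "D"}))           # 280 182, as in inequality (4)
    print(twu({"A"}))                            # 109
```

Because twu({A}) = 109 is below minutil = 130, item A and every itemset containing it can be pruned without computing their actual utilities.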

A. Erwin et al. [4, 5] proposed efficient algorithms using the pattern growth approach. They developed a compact data representation named the Compressed Utility Pattern tree (CUP-tree), which extends the CFP-tree (Y. G. Sucahyo & R. P. Gopalan, 2004) for HUI mining, and a new algorithm named CTU-PRO. CTU-PRO uses the concept of TWU to prune the search space, but it must re-scan the database to determine the actual utility of high-twu itemsets. The algorithm creates a CUP-tree named GlobalCUP-tree from the transaction database after first identifying the individual high-TWU items. For each high-TWU item, a smaller projection tree called LocalCUP-tree is extracted from the GlobalCUP-tree for mining all HUIs that begin with that item as prefix.

B. Le et al. [6] proposed the WIT-tree data structure and an algorithm for mining HUIs (the TWU-Mining algorithm). We find it suitable for mining HUIs in vertical distributed databases.

1) WIT-tree data structure

a) Vertex: Includes 3 fields: Itemset X; Tidset, the set of transactions that contain X; and twu, the sum of the transaction-weighted utilization of X. A vertex is denoted $X \times \mathrm{Tidset}$, annotated with $twu(X)$.

The value of twu(X) is computed by summing the twu values of the transactions whose tids are contained in Tidset. Thus, twu(X) and u(X) can be computed quickly by using Tidset.

b) Arc: Connects a vertex at the kth level (called X) with a vertex at the (k+1)th level (called Y) such that $X \equiv_{\theta_k} Y$.
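A vertex of this kind can be sketched in Python as follows. This is an illustrative assumption about how the three fields might be stored, not the implementation of [6]; the names `Vertex`, `make_vertex`, and `join` are hypothetical. The transaction utilities in `TU` are the TWU column of Table I, and a child vertex is formed by intersecting the tidsets of two vertices that share a prefix.

```python
from dataclasses import dataclass

# Transaction utilities of the example database (TWU column of Table I).
TU = {1: 21, 2: 71, 3: 12, 4: 14, 5: 14, 6: 13, 7: 111, 8: 57, 9: 13, 10: 72}


@dataclass(frozen=True)
class Vertex:
    itemset: tuple       # X, kept as an ordered tuple of items
    tidset: frozenset    # ids of the transactions that contain X
    twu: int             # sum of tu(t) over the tidset


def make_vertex(itemset, tidset, tu=TU):
    """Build a vertex; twu(X) is obtained from the tidset alone."""
    tidset = frozenset(tidset)
    return Vertex(tuple(itemset), tidset, sum(tu[t] for t in tidset))


def join(x, y, tu=TU):
    """Child of x obtained from sibling y (same prefix, different last item)."""
    assert x.itemset[:-1] == y.itemset[:-1], "vertices must share a prefix"
    return make_vertex(x.itemset + y.itemset[-1:], x.tidset & y.tidset, tu)


if __name__ == "__main__":
    b = make_vertex(("B",), {2, 6, 7, 9, 10})
    d = make_vertex(("D",), {2, 4, 7, 8})
    bd = join(b, d)
    print(b.twu, d.twu)                             # 280 253
    print(bd.itemset, sorted(bd.tidset), bd.twu)    # ('B', 'D') [2, 7] 182
```

Storing the tidset in each vertex means that the twu of a longer itemset needs only a set intersection and a sum, without another pass over the database.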

Example: Consider the following database.

TABLE I. OBJECTIVE VALUE TABLE

TID   A    B    C    D    E    TWU
T1    0    0    16   0    1    21
T2    0    12   0    2    1    71
T3    2    0    1    0    1    12
T4    1    0    0    2    1    14
T5    0    0    4    0    2    14
T6    1    2    0    0    0    13
T7    0    20   0    2    1    111
T8    3    0    25   6    1    57
T9    1    2    0    0    0    13
T10   0    12   2    0    2    72

TABLE II. SUBJECTIVE VALUE TABLE

Item  Benefit
A     3
B     5
C     1
D     3
E     5

We have the WIT-tree shown in Figure 1.

Figure 1. WIT-tree for TWU-Mining (the value under the “/” symbol is the utility value of the corresponding itemset) with minutil = 130

2) TWU-Mining algorithm

The TWU-Mining algorithm mines HUIs based on the WIT-tree. For more details, see [6].

C. Distributed data mining

There are many methods for distributed data mining, such as association rule mining [2, 3, 10, 12, 13, 17] and classification [8, 9]. In [11], M. Serazi et al. proposed an API for transparent distributed vertical data mining. However, in our opinion, this API cannot be used for mining HUIs. To the best of our knowledge, there is still no method for mining HUIs in distributed databases.


III. MODEL FOR MINING HIGH UTILITY ITEMSETS FROM VERTICAL DISTRIBUTED DATABASES

A. Problem

A supermarket sells n items $I = \{i_1, i_2, \ldots, i_n\}$. Because of the need for specialization, the supermarket stores customer information on k computers (k sites); that is, each site stores information about a subset of the items (a set of products). We can formalize this as follows:

Database $D$ is divided into $k$ sites $\{D_1, D_2, \ldots, D_k\}$, where $D_j$ contains the set of items $I_j = \{i_{j_1}, i_{j_2}, \ldots, i_{j_v}\}$ ($v$ is the number of items in site $D_j$), and the transactions in $D_j$ contain only items of $I_j$. Assume that $I_i \cap I_j = \emptyset$ for all $i \ne j$ and $\bigcup_{j=1}^{k} I_j = I$.

When a transaction is created, the new transaction ID, the items bought, and the quantity of each item are updated in the corresponding sites. Therefore, the database is not centralized, which makes it easier for the supermarket to manage and avoids overload when the amount of data is huge.

The problem is how to mine HUIs from the databases of many sites without integrating them (the database would be very large if all sites were integrated).

Example: Consider the database given in Table I, and assume that it is distributed over 2 sites as follows:

TABLE III. OBJECTIVE & SUBJECTIVE VALUE TABLES OF SITE 1

TID   A    B    C
T1    0    0    16
T2    0    12   0
T3    2    0    1
T4    1    0    0
T5    0    0    4
T6    1    2    0
T7    0    20   0
T8    3    0    25
T9    1    2    0
T10   0    12   2

Item  Benefit
A     3
B     5
C     1

TABLE IV. OBJECTIVE & SUBJECTIVE VALUE TABLES OF SITE 2

TID   D    E
T1    0    1
T2    2    1
T3    0    1
T4    2    1
T5    0    2
T7    2    1
T8    6    1
T10   0    2

Item  Benefit
D     3
E     5

B. General model for mining high utility itemsets

MasterSite: First, the MasterSite broadcasts a request (names of the databases, minutil) to all SlaverSites and waits for their information. Once it has received the information from all SlaverSites, it mines HUIs by calling the TWU-Mining algorithm.

SlaverSite: When a SlaverSite connects to the MasterSite, it receives the necessary information from the MasterSite. After receiving the database, the SlaverSite computes the necessary information and sends it to the MasterSite. The steps are as follows:

• Receive the database and get the information for each single item: the list of transactions that contain the item, and the utility value of the item in each of those transactions.

• Compute the transaction-weighted utilization value for each transaction.

• Send the information to the MasterSite.

Tables V and VI and Figure 3 illustrate the results computed at the SlaverSites and the information collected by the MasterSite for the mining process.

Figure 2. General model for mining HUIs from vertical distributed database: a) MasterSite, b) SlaverSite

TABLE V. TIDSET, BENEFIT AND LOCAL TWU OF ITEMS IN SITE 1

Item  Tidset        Benefit
A     3 4 6 8 9     6 3 3 9 3
B     2 6 7 9 10    60 10 100 10 60
C     1 3 5 8 10    16 1 4 25 2

TID        T1   T2   T3   T4   T5   T6   T7   T8   T9   T10
Local TWU  16   60   7    3    4    13   100  34   13   62

TABLE VI. TIDSET, BENEFIT AND LOCAL TWU OF ITEMS IN SITE 2

Item  Tidset              Benefit
D     2 4 7 8             6 6 6 18
E     1 2 3 4 5 7 8 10    5 5 5 5 10 5 5 10

TID        T1   T2   T3   T4   T5   T7   T8   T10
Local TWU  5    11   5    11   10   11   23   10
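The SlaverSite steps can be sketched as a single pass over the local database. The snippet below is a hedged illustration using the Site 2 data of Table IV; the function name `summarize_site` and the dictionary layout are assumptions, and the network exchange with the MasterSite is left out.

```python
# Site 2 of the example (Table IV): tid -> {item: objective value}.
SITE2_DB = {
    1: {"D": 0, "E": 1},
    2: {"D": 2, "E": 1},
    3: {"D": 0, "E": 1},
    4: {"D": 2, "E": 1},
    5: {"D": 0, "E": 2},
    7: {"D": 2, "E": 1},
    8: {"D": 6, "E": 1},
    10: {"D": 0, "E": 2},
}
SITE2_BENEFIT = {"D": 3, "E": 5}


def summarize_site(local_db, benefit):
    """Scan the local database once and build what is sent to the MasterSite.

    Returns (item_utils, local_twu):
      item_utils[item][tid] = utility of the item in transaction tid
                              (the keys of item_utils[item] form the tidset),
      local_twu[tid]        = utility of the local part of transaction tid.
    """
    item_utils = {}
    local_twu = {}
    for tid, items in local_db.items():
        total = 0
        for item, qty in items.items():
            if qty == 0:
                continue  # the item does not occur in this transaction
            value = qty * benefit[item]
            item_utils.setdefault(item, {})[tid] = value
            total += value
        local_twu[tid] = total
    return item_utils, local_twu


if __name__ == "__main__":
    item_utils, local_twu = summarize_site(SITE2_DB, SITE2_BENEFIT)
    print(item_utils["D"])  # {2: 6, 4: 6, 7: 6, 8: 18}
    print(local_twu)        # {1: 5, 2: 11, 3: 5, 4: 11, 5: 10, 7: 11, 8: 23, 10: 10}
```

The output reproduces the D row and the Local TWU row of Table VI, and only these summaries, not the raw transactions, would need to leave the site.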


After synthesizing all results and removing the items whose twu does not satisfy minutil, we obtain level 1 of the WIT-tree as follows:

Figure 3. WIT-tree in level 1 after synthesising results

Applying the TWU-Mining algorithm to level 1 as mentioned above, we obtain the result shown in Figure 1.
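The synthesis step at the MasterSite can be sketched as follows, assuming each SlaverSite reports the tidset of every local item and its local TWU per transaction. The function name `build_level1` and the report format are hypothetical; the tidset of item A is derived from Table III.

```python
from collections import defaultdict

# Reports from the two example sites: (item -> tidset, tid -> local TWU).
SITE1 = (
    {"A": {3, 4, 6, 8, 9}, "B": {2, 6, 7, 9, 10}, "C": {1, 3, 5, 8, 10}},
    {1: 16, 2: 60, 3: 7, 4: 3, 5: 4, 6: 13, 7: 100, 8: 34, 9: 13, 10: 62},
)
SITE2 = (
    {"D": {2, 4, 7, 8}, "E": {1, 2, 3, 4, 5, 7, 8, 10}},
    {1: 5, 2: 11, 3: 5, 4: 11, 5: 10, 7: 11, 8: 23, 10: 10},
)


def build_level1(site_reports, minutil):
    """Merge the site reports and keep the items whose twu reaches minutil."""
    # The utility of a transaction is the sum of its local parts.
    global_tu = defaultdict(int)
    for _, local_twu in site_reports:
        for tid, value in local_twu.items():
            global_tu[tid] += value
    # twu of each item = sum of the global transaction utilities over its tidset.
    level1 = {}
    for tidsets, _ in site_reports:
        for item, tidset in tidsets.items():
            item_twu = sum(global_tu[t] for t in tidset)
            if item_twu >= minutil:
                level1[item] = (frozenset(tidset), item_twu)
    return level1


if __name__ == "__main__":
    level1 = build_level1([SITE1, SITE2], minutil=130)
    for item, (tidset, item_twu) in sorted(level1.items()):
        print(item, sorted(tidset), item_twu)
```

In this example the merged transaction utilities equal the TWU column of Table I, item A (twu = 109) is removed, and B, C, D, and E remain with twu values 280, 176, 253, and 372.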

IV. EXPERIMENTS

All algorithms were coded in C# 2005. PC configuration: Intel 2.0 GHz CPU, 1 GB RAM, Windows XP. The experimental databases have the following features:

TABLE VII. EXPERIMENTAL DATABASES

Database  #Trans   #Items  Remark
BMS-POS   515597   1656    Modified
Retails   88162    16469   Modified

We modified the databases by adding one more value column (a random value in the range 1 to 10) for each item in each transaction, and by creating one more table to store the benefit values of the items (values in the range 1 to 10). Each database is distributed over 5 sites.

Because TWU-Mining [6] is often faster than the algorithms based on the utility upper bound [15] and Two-Phase [7], we compare the proposed algorithm only with TWU-Mining.

TABLE VIII. EXPERIMENT RESULTS (MODIFIED)

Database  minutil (%)  TWU-Mining (s)  TWU-Mining Distribute (s)  #HUIs
BMS-POS   4            39.05           33.14                      6
BMS-POS   3            55.67           50.88                      7
BMS-POS   2            95.56           86.52                      22
BMS-POS   1            7.46            5.33                       20
Retails   0.8          11.31           8.16                       29
Retails   0.6          23.23           13.94                      45
Retails   0.4          57.69           33.14                      64
Retails   0.2          178.19          97.97                      239

Table VIII shows that the execution time of the TWU-Mining algorithm on the distributed vertical database is lower than on the concentration database. Because the computation at the sites is done in a distributed manner before the data are collected, mining at the MasterSite costs less time.
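To make the comparison concrete, the relative time reduction can be computed from the figures reported in Table VIII. The short sketch below only restates those reported times; the helper name `reduction` is an illustrative assumption.

```python
# Rows of Table VIII: (database, minutil %, centralized time s, distributed time s).
RESULTS = [
    ("BMS-POS", 4, 39.05, 33.14),
    ("BMS-POS", 3, 55.67, 50.88),
    ("BMS-POS", 2, 95.56, 86.52),
    ("BMS-POS", 1, 7.46, 5.33),
    ("Retails", 0.8, 11.31, 8.16),
    ("Retails", 0.6, 23.23, 13.94),
    ("Retails", 0.4, 57.69, 33.14),
    ("Retails", 0.2, 178.19, 97.97),
]


def reduction(centralized, distributed):
    """Fraction of the centralized run-time saved by the distributed variant."""
    return (centralized - distributed) / centralized


if __name__ == "__main__":
    for name, minutil, central, dist in RESULTS:
        print(f"{name:8s} minutil={minutil:>4}%  saved {reduction(central, dist):.1%}")
```

On these figures the saving is about 15% for BMS-POS at minutil = 4% and about 45% for Retails at minutil = 0.2%.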

V. CONCLUSION AND FUTURE WORKS

This paper has presented a method for mining HUIs from vertical distributed databases, together with an efficient algorithm based on it. Using the WIT-tree technique, the algorithm scans each local database only once. Therefore, little time is spent on communication between the MasterSite and the SlaverSites.

In this paper, we only mine HUIs from vertical distributed databases. An efficient algorithm for mining HUIs in horizontal distributed databases will be studied in future work. In addition, parallel computing at each SlaverSite will be investigated to reduce the run-time and memory usage at the MasterSite.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases”, Proceedings of the 1993 ACM SIGMOD Conference, Washington DC, USA, May 1993, pp. 207 – 216, 1993.

[2] R. Agrawal, J. Shafer, “Parallel Mining of Association Rules”, IEEE Trans. Knowledge and Data Eng., Vol. 8, No. 6, Dec. 1996, pp. 962–969, 1996.

[3] D. Cheung et al, “A Fast Distributed Algorithm for Mining Association Rules”, Proc.4th Int’l Conf. Parallel and Distributed Information Systems, IEEE Computer Soc. Press, Los Alamitos, California, pp. 31– 42, 1996.

[4] A. Erwin, R. P. Gopalan, N. R. Achuthan, “CTU-Mine: An efficient High Utility Itemset Mining Algorithm Using the Pattern Growth Approach”, In: IEEE 7th International Conferences on Computer and Information Technology, Aizu Wakamatsu, Japan, pp. 71 – 76, 2007.

[5] A. Erwin, R. P. Gopalan, N. R. Achuthan, “A Bottom-Up Projection Based Algorithm for Mining High Utility Itemsets”, Proceedings of the 2nd International Workshop on Integrating Artificial Intelligence and Data Mining - Volume 84, Gold Coast, Australia, pp. 3 – 11, 2007.

[6] B. Le, H. Nguyen, T. A. Cao, B. Vo, “A Novel Algorithm for Mining High Utility Itemsets”, In: Proceedings of 1st Asian Conference on Intelligent Information and Database Systems, Quang Binh, Vietnam (IEEE press), pp. 13 – 17, 2009.

[7] Y. Liu, W. Liao, A. Choudhary, “A Fast High Utility Itemsets Mining Algorithm”, UBDM '05 , August 21, 2005, Chicago, Illinois, USA, pp. 90 – 99, 2005.

[8] P. Luo, H. Xuong, K. Lu, Z. Shi, “Distributed Classification in Peer-to-Peer Networks”, KDD’07, August 12–15, 2007, San Jose, California, USA, 2007.

[9] D. J. Miller, Y. Zhang, G. Kesidis, “Decision Aggregation in Distributed Classification by a Transductive Extension of Maximum Entropy/Improved Iterative Scaling”, EURASIP Journal on Advances in Signal Processing, Volume 2008 (doi:10.1155/2008/674974), 2008.

[10] A. Schuster, R. Wolff, “Communication-Efficient Distributed Mining of Association Rules”, in Proc. of the 2001 ACM SIGMOD Int'l. Conference on Management of Data, Santa Barbara, California, pp. 473-484, 2001.

[11] M. Serazi, A. Perera, T. Abidin, G. Hamer, W. Perrizo, “An API for Transparent Distributed Vertical Data Mining”, Proceedings of the ISCA 14th International Conference on Intelligent and Adaptive Systems and Software Engineering, Toronto, Canada. ISCA 2005, pp. 151-156, 2005.

[12] P. Tang, M. Turkia, “Parallelizing Frequent Itemset Mining with FP-trees”, Technical Report, Department of Computer Science, University of Arkansas at Little Rock, 2005.

[13] R.Wolff and A.Schuster, "Association Rule Mining in Peer-to-Peer Systems," IEEE Trans. Systems, Man and Cybernetics, Part B, vol.34, no.6, 2004, pp. 2426 – 2438, 2004.

[14] H. Yao, H. J. Hamilton, C. J. Butz, “A Foundational Approach to Mining Itemset Utilities from Databases”, Proceedings 2004 SIAM International Conference on Data Mining, 2004, pp. 482 – 486, 2004.

[15] H. Yao, H. J. Hamilton, “Mining Itemsets Utilities from Transaction Databases”, Data and Knowledge Engineering, Volume 59, pp. 603 – 626, 2005.

[16] M. J. Zaki, C.J. Hsiao, “Efficient Algorithms for Mining Closed Itemsets and Their Lattice Structure”, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No 4, April 2005, pp. 462 – 478, 2005.

[17] M. J. Zaki, “Parallel and distributed association mining: A survey”, IEEE Concurrency, Special Issue on Parallel Mechanisms for Data Mining, Dec. 1999, pp. 14 – 25, 1999.
